Why Markov, and why now
Last-click attribution is finally dead in our practice. Three things killed it for our Bangkok clients in 2025-2026: GA4's data-driven model is now opaque-by-default for accounts under ~600 monthly conversions, Meta and TikTok's reported numbers double-count somewhere between 22% and 40% depending on category, and the average path-to-purchase for our SaaS clients is now 11.4 touchpoints over 28 days. Crediting the last one is malpractice.
We've ended up at Markov-chain attribution as the default for any client doing more than ฿2M/month in revenue. It's transparent (the SQL is auditable line-by-line), it's stable across small data shifts, and the removal-effect math has a clean intuition you can defend in a board meeting. This article is the working version we use — not a textbook explanation, but the SQL and Python that runs in production.
The model in one paragraph
Treat each user's journey as a sequence of channel visits ending in either conversion or null. Compute the transition probability matrix P(channel_j | channel_i) across all users — converters and non-converters alike. The probability of conversion under the full graph is your baseline. To attribute, remove one channel at a time (zero its outgoing edges, redistribute its in-edges to null) and recompute conversion probability. The drop is that channel's removal effect. Normalize the removal effects to sum to 1 and you have your data-driven attribution weights.
Step 1: Build the path table from GA4 export
If you've enabled BigQuery export on GA4, your data lands in analytics_XXXXXX.events_*. We standardize channel groupings up front using the traffic_source fields. Don't trust GA4's default channel grouping for attribution work — it conflates direct with not set, and bins paid social weirdly.
-- 01_paths.sql
-- Build ordered touchpoint sequences per user_pseudo_id, 28-day window
CREATE OR REPLACE TABLE `proj.attrib.paths_28d` AS
WITH events AS (
SELECT
user_pseudo_id,
TIMESTAMP_MICROS(event_timestamp) AS ts,
event_name,
-- Custom channel grouping: don't trust the default
CASE
WHEN traffic_source.source = '(direct)' THEN 'direct'
WHEN traffic_source.medium = 'cpc'
AND REGEXP_CONTAINS(traffic_source.source, r'google') THEN 'paid_search'
WHEN traffic_source.medium = 'cpc'
AND REGEXP_CONTAINS(traffic_source.source, r'facebook|instagram|meta') THEN 'paid_social_meta'
WHEN traffic_source.medium = 'cpc'
AND REGEXP_CONTAINS(traffic_source.source, r'tiktok') THEN 'paid_social_tiktok'
WHEN traffic_source.medium IN ('organic') THEN 'organic_search'
WHEN traffic_source.medium = 'email' THEN 'email'
WHEN traffic_source.medium IN ('referral') THEN 'referral'
WHEN traffic_source.source = 'line' THEN 'line_oa'
ELSE 'other'
END AS channel
FROM `proj.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
AND event_name IN ('session_start','purchase')
),
sessions AS (
-- One row per session_start, marked with whether a purchase followed within 28d
SELECT
e.user_pseudo_id,
e.ts AS session_ts,
e.channel,
EXISTS (
SELECT 1 FROM events p
WHERE p.user_pseudo_id = e.user_pseudo_id
AND p.event_name = 'purchase'
AND p.ts BETWEEN e.ts AND TIMESTAMP_ADD(e.ts, INTERVAL 28 DAY)
) AS converted
FROM events e
WHERE e.event_name = 'session_start'
)
SELECT
user_pseudo_id,
ARRAY_AGG(channel ORDER BY session_ts) AS path,
LOGICAL_OR(converted) AS converted
FROM sessions
GROUP BY user_pseudo_id;
Two things we always check on a fresh client at this point. First: is your direct share over 35% of sessions? If yes, you've got broken UTMs and your attribution model will assign 35% of credit to a channel that's mostly mis-tagged paid social. Fix tagging before fitting the model. Second: are your purchase events firing once per order, not per item? We've seen agencies dump $50K into a Markov rebuild because of double-counting at the purchase event level.
Step 2: Build the transition matrix
We add a virtual start node before each path and a virtual conversion or null node after. This gives us a closed graph where conversion probability is well-defined.
-- 02_transitions.sql
CREATE OR REPLACE TABLE `proj.attrib.transitions` AS
WITH expanded AS (
SELECT
user_pseudo_id,
ARRAY_CONCAT(
['start'],
path,
[IF(converted, 'conversion', 'null')]
) AS full_path
FROM `proj.attrib.paths_28d`
),
pairs AS (
SELECT
full_path[OFFSET(i)] AS from_channel,
full_path[OFFSET(i+1)] AS to_channel
FROM expanded, UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(full_path)-2)) AS i
),
counts AS (
SELECT from_channel, to_channel, COUNT(*) AS n
FROM pairs
GROUP BY 1, 2
)
SELECT
from_channel,
to_channel,
n,
SAFE_DIVIDE(n, SUM(n) OVER (PARTITION BY from_channel)) AS p
FROM counts;
You should now have a matrix where, for every state, the outgoing probabilities sum to 1.0. Spot-check this with a HAVING ABS(SUM(p) - 1) > 0.001 query before continuing. If you see channels with non-summing rows, you have a duplicate-event problem upstream.
Step 3: Compute the baseline conversion probability
For an absorbing Markov chain, the probability of reaching conversion from start is computed by iterating the transition matrix until convergence. BigQuery doesn't love iterative recursion, so we use a fixed depth (we've found 30 iterations sufficient for paths up to 28 days).
-- 03_baseline.sql — runs in ~14 sec on 4M sessions
DECLARE max_steps INT64 DEFAULT 30;
DECLARE step INT64 DEFAULT 0;
CREATE OR REPLACE TABLE `proj.attrib.state_prob` AS
SELECT 'start' AS state, 1.0 AS prob
UNION ALL SELECT DISTINCT to_channel, 0.0 FROM `proj.attrib.transitions` WHERE to_channel != 'start';
LOOP
IF step >= max_steps THEN LEAVE; END IF;
CREATE OR REPLACE TABLE `proj.attrib.state_prob` AS
WITH next_step AS (
SELECT
t.to_channel AS state,
SUM(s.prob * t.p) AS prob
FROM `proj.attrib.state_prob` s
JOIN `proj.attrib.transitions` t ON s.state = t.from_channel
-- Absorb at conversion and null
WHERE s.state NOT IN ('conversion','null')
GROUP BY 1
)
SELECT state, prob FROM next_step
UNION ALL
-- Carry absorbed mass forward
SELECT state, prob FROM `proj.attrib.state_prob`
WHERE state IN ('conversion','null');
SET step = step + 1;
END LOOP;
Your conversion row's prob after 30 iterations is the baseline conversion probability of the full graph. On a typical e-commerce client we see numbers in the 0.018 to 0.044 range, depending on traffic mix.
Step 4: Removal effect — the actual attribution
For each channel c, recompute the conversion probability with c's outgoing transitions zeroed and reallocated to null. The drop in probability is the channel's removal effect. We wrap this in a stored procedure so we can call it per-channel.
-- 04_removal_effect.sql
CREATE OR REPLACE PROCEDURE `proj.attrib.remove_channel`(remove STRING, OUT removed_prob FLOAT64)
BEGIN
DECLARE step INT64 DEFAULT 0;
CREATE OR REPLACE TEMP TABLE m AS
SELECT
from_channel,
CASE WHEN from_channel = remove THEN 'null' ELSE to_channel END AS to_channel,
CASE WHEN from_channel = remove THEN 1.0 ELSE p END AS p
FROM `proj.attrib.transitions`;
CREATE OR REPLACE TEMP TABLE sp AS
SELECT 'start' AS state, 1.0 AS prob
UNION ALL SELECT DISTINCT to_channel, 0.0 FROM m WHERE to_channel != 'start';
LOOP
IF step >= 30 THEN LEAVE; END IF;
CREATE OR REPLACE TEMP TABLE sp AS
SELECT t.to_channel AS state, SUM(s.prob * t.p) AS prob
FROM sp s JOIN m t ON s.state = t.from_channel
WHERE s.state NOT IN ('conversion','null')
GROUP BY 1
UNION ALL
SELECT state, prob FROM sp WHERE state IN ('conversion','null');
SET step = step + 1;
END LOOP;
SET removed_prob = (SELECT prob FROM sp WHERE state = 'conversion');
END;
Call this for every channel and store the deltas:
-- 05_attribute.sql — driver script
DECLARE channels ARRAY<STRING> DEFAULT (
SELECT ARRAY_AGG(DISTINCT from_channel)
FROM `proj.attrib.transitions`
WHERE from_channel NOT IN ('start','conversion','null')
);
DECLARE baseline FLOAT64;
DECLARE i INT64 DEFAULT 0;
DECLARE removed FLOAT64;
SET baseline = (SELECT prob FROM `proj.attrib.state_prob` WHERE state='conversion');
CREATE OR REPLACE TABLE `proj.attrib.removal_effects`(channel STRING, removal_effect FLOAT64);
LOOP
IF i >= ARRAY_LENGTH(channels) THEN LEAVE; END IF;
CALL `proj.attrib.remove_channel`(channels[OFFSET(i)], removed);
INSERT INTO `proj.attrib.removal_effects`
VALUES (channels[OFFSET(i)], baseline - removed);
SET i = i + 1;
END LOOP;
-- Final attribution: normalize removal effects to sum to 1
CREATE OR REPLACE TABLE `proj.attrib.weights` AS
SELECT
channel,
removal_effect,
SAFE_DIVIDE(removal_effect, SUM(removal_effect) OVER ()) AS weight
FROM `proj.attrib.removal_effects`
ORDER BY weight DESC;
A real worked example (anonymized)
Here are the numbers from a Bangkok D2C client with ~฿4.2M monthly revenue, after running the pipeline above on Q1 2026 data:
| Channel | Last-click % | Markov % | Delta |
|---|---|---|---|
| paid_search | 38.4% | 22.1% | −16.3 |
| organic_search | 11.7% | 21.4% | +9.7 |
| paid_social_meta | 21.0% | 16.8% | −4.2 |
| paid_social_tiktok | 4.2% | 9.6% | +5.4 |
| line_oa | 6.1% | 11.2% | +5.1 |
| 9.8% | 8.7% | −1.1 | |
| direct | 8.8% | 10.2% | +1.4 |
Three findings the client had not seen with last-click: organic search was massively under-credited (it's almost always an early touchpoint, never the closer), TikTok was a 2.3x better top-funnel channel than its last-click numbers suggested, and LINE OA was doing real assist work that justified increasing the broadcast budget. Within a quarter we re-allocated ฿380K/month away from paid search into TikTok and LINE based on this output.
Quarterly retraining
Channel mix and seasonality drift. A model fit on Q4 holiday data will systematically over-credit paid social if you use it through Q1. We schedule retraining as a Cloud Composer DAG that runs on the first Monday of each quarter, materializes the new weights to a versioned table, and posts a diff to a Slack channel for analyst review before promoting to production.
# dags/markov_retrain.py — abridged
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from datetime import datetime
with DAG('markov_retrain', schedule='0 6 1 */3 *',
start_date=datetime(2026,1,1), catchup=False) as dag:
paths = BigQueryInsertJobOperator(task_id='paths',
configuration={'query':{'query':open('sql/01_paths.sql').read(),
'useLegacySql':False}})
transitions = BigQueryInsertJobOperator(task_id='transitions',
configuration={'query':{'query':open('sql/02_transitions.sql').read(),
'useLegacySql':False}})
baseline = BigQueryInsertJobOperator(task_id='baseline',
configuration={'query':{'query':open('sql/03_baseline.sql').read(),
'useLegacySql':False}})
attribute = BigQueryInsertJobOperator(task_id='attribute',
configuration={'query':{'query':open('sql/05_attribute.sql').read(),
'useLegacySql':False}})
paths >> transitions >> baseline >> attribute
Caveats and gotchas
Walled gardens. If a click is driven by paid social but tagged as direct because of in-app browser stripping, your model will under-credit the channel. We push for server-side click ID forwarding (we wrote about GA4 server-side on Cloudflare Workers covering exactly this).
Cross-device. Markov on user_pseudo_id alone misses cross-device journeys. If you have an authenticated user identifier, run the model twice — once on raw user_pseudo_id, once on stitched user_id — and reconcile.
Statistical noise. Below ~5,000 conversions per period, removal-effect estimates are unstable. We bootstrap (resample paths with replacement, refit 200 times, take the median weight) for any client below 8,000 monthly conversions. The sequential testing piece has more on confidence intervals at small n.
Compliance. Path-level data is personal under PDPA. We hash user_pseudo_id at ingestion and don't retain raw paths beyond 90 days. See our PDPA for analytics guide for the consent and retention setup.
What this unlocks downstream
The output of this pipeline isn't just a different way to slice last-click numbers. It feeds three downstream systems for our clients: budget optimization (we run a constrained reallocation against the Markov weights weekly), experimentation prioritization (channels with high removal effect but low spend are first targets for tests), and checkout-flow segmentation (we tag converted sessions with their dominant assist channel and feed it to the on-site personalization layer — see our Thai checkout patterns piece). All three are part of the standard CRO retainer.
For full-stack data engineering needs around this — instrumentation, warehouse modeling, dashboarding — we partner with Bluewich for the build and SEO Agency Bangkok for the upstream traffic mix, while SitPlay Media handles the content layer that drives organic share. The Markov model is most useful in a network where someone is paying attention to all four levers.
Get the working repo
If you want the full SQL + Airflow scripts in a single repo with a synthetic test dataset, email us and we'll send the GitHub invite. We don't open-source it because we keep evolving the channel-grouping logic per industry, but every client we work with gets the current version.
attribution bigquery markov ga4 analytics-engineering