Markov Attribution in BigQuery: A Working Example

Q: Can I trust GA4's default channel grouping for attribution?

No. If you've enabled BigQuery export on GA4, your data lands in analytics_XXXXXX.events_*, and we standardize channel groupings up front using the traffic_source fields. Don't trust GA4's default channel grouping for attribution work — it conflates direct with not set, and bins paid social weirdly.

Q: How do you build the transition matrix so conversion probability is well-defined?

We add a virtual start node before each path and a virtual conversion or null node after. This gives us a closed graph where conversion probability is well-defined.

Q: How many iterations do you need to compute the baseline conversion probability in BigQuery?

For an absorbing Markov chain, the probability of reaching conversion from start is computed by iterating the transition matrix until convergence. BigQuery doesn't love iterative recursion, so we use a fixed depth — we've found 30 iterations sufficient for paths up to 28 days.

Why Markov, and why now

Last-click attribution is finally dead in our practice. Three things killed it for our Bangkok clients in 2025-2026: GA4's data-driven model is now opaque-by-default for accounts under ~600 monthly conversions, Meta and TikTok's reported numbers double-count somewhere between 22% and 40% depending on category, and the average path-to-purchase for our SaaS clients is now 11.4 touchpoints over 28 days. Crediting the last one is malpractice.

We've ended up at Markov-chain attribution as the default for any client doing more than ฿2M/month in revenue. It's transparent (the SQL is auditable line-by-line), it's stable across small data shifts, and the removal-effect math has a clean intuition you can defend in a board meeting. This article is the working version we use — not a textbook explanation, but the SQL and Python that runs in production.

The model in one paragraph

Treat each user's journey as a sequence of channel visits ending in either conversion or null. Compute the transition probability matrix P(channel_j | channel_i) across all users — converters and non-converters alike. The probability of conversion under the full graph is your baseline. To attribute, remove one channel at a time (zero its outgoing edges, redistribute its in-edges to null) and recompute conversion probability. The drop is that channel's removal effect. Normalize the removal effects to sum to 1 and you have your data-driven attribution weights.

Step 1: Build the path table from GA4 export

If you've enabled BigQuery export on GA4, your data lands in analytics_XXXXXX.events_*. We standardize channel groupings up front using the traffic_source fields. Don't trust GA4's default channel grouping for attribution work — it conflates direct with not set, and bins paid social weirdly.

-- 01_paths.sql
-- Build ordered touchpoint sequences per user_pseudo_id, 28-day window
CREATE OR REPLACE TABLE `proj.attrib.paths_28d` AS
WITH events AS (
  SELECT
    user_pseudo_id,
    TIMESTAMP_MICROS(event_timestamp) AS ts,
    event_name,
    -- Custom channel grouping: don't trust the default
    CASE
      WHEN traffic_source.source = '(direct)' THEN 'direct'
      WHEN traffic_source.medium = 'cpc'
        AND REGEXP_CONTAINS(traffic_source.source, r'google') THEN 'paid_search'
      WHEN traffic_source.medium = 'cpc'
        AND REGEXP_CONTAINS(traffic_source.source, r'facebook|instagram|meta') THEN 'paid_social_meta'
      WHEN traffic_source.medium = 'cpc'
        AND REGEXP_CONTAINS(traffic_source.source, r'tiktok') THEN 'paid_social_tiktok'
      WHEN traffic_source.medium IN ('organic') THEN 'organic_search'
      WHEN traffic_source.medium = 'email' THEN 'email'
      WHEN traffic_source.medium IN ('referral') THEN 'referral'
      WHEN traffic_source.source = 'line' THEN 'line_oa'
      ELSE 'other'
    END AS channel
  FROM `proj.analytics_123456.events_*`
  WHERE _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
    AND event_name IN ('session_start','purchase')
),
sessions AS (
  -- One row per session_start, marked with whether a purchase followed within 28d
  SELECT
    e.user_pseudo_id,
    e.ts AS session_ts,
    e.channel,
    EXISTS (
      SELECT 1 FROM events p
      WHERE p.user_pseudo_id = e.user_pseudo_id
        AND p.event_name = 'purchase'
        AND p.ts BETWEEN e.ts AND TIMESTAMP_ADD(e.ts, INTERVAL 28 DAY)
    ) AS converted
  FROM events e
  WHERE e.event_name = 'session_start'
)
SELECT
  user_pseudo_id,
  ARRAY_AGG(channel ORDER BY session_ts) AS path,
  LOGICAL_OR(converted) AS converted
FROM sessions
GROUP BY user_pseudo_id;

Two things we always check on a fresh client at this point. First: is your direct share over 35% of sessions? If yes, you've got broken UTMs and your attribution model will assign 35% of credit to a channel that's mostly mis-tagged paid social. Fix tagging before fitting the model. Second: are your purchase events firing once per order, not per item? We've seen agencies dump $50K into a Markov rebuild because of double-counting at the purchase event level.

Step 2: Build the transition matrix

We add a virtual start node before each path and a virtual conversion or null node after. This gives us a closed graph where conversion probability is well-defined.

-- 02_transitions.sql
CREATE OR REPLACE TABLE `proj.attrib.transitions` AS
WITH expanded AS (
  SELECT
    user_pseudo_id,
    ARRAY_CONCAT(
      ['start'],
      path,
      [IF(converted, 'conversion', 'null')]
    ) AS full_path
  FROM `proj.attrib.paths_28d`
),
pairs AS (
  SELECT
    full_path[OFFSET(i)] AS from_channel,
    full_path[OFFSET(i+1)] AS to_channel
  FROM expanded, UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(full_path)-2)) AS i
),
counts AS (
  SELECT from_channel, to_channel, COUNT(*) AS n
  FROM pairs
  GROUP BY 1, 2
)
SELECT
  from_channel,
  to_channel,
  n,
  SAFE_DIVIDE(n, SUM(n) OVER (PARTITION BY from_channel)) AS p
FROM counts;

You should now have a matrix where, for every state, the outgoing probabilities sum to 1.0. Spot-check this with a HAVING ABS(SUM(p) - 1) > 0.001 query before continuing. If you see channels with non-summing rows, you have a duplicate-event problem upstream.

Step 3: Compute the baseline conversion probability

For an absorbing Markov chain, the probability of reaching conversion from start is computed by iterating the transition matrix until convergence. BigQuery doesn't love iterative recursion, so we use a fixed depth (we've found 30 iterations sufficient for paths up to 28 days).

-- 03_baseline.sql — runs in ~14 sec on 4M sessions
DECLARE max_steps INT64 DEFAULT 30;
DECLARE step INT64 DEFAULT 0;

CREATE OR REPLACE TABLE `proj.attrib.state_prob` AS
SELECT 'start' AS state, 1.0 AS prob
UNION ALL SELECT DISTINCT to_channel, 0.0 FROM `proj.attrib.transitions` WHERE to_channel != 'start';

LOOP
  IF step >= max_steps THEN LEAVE; END IF;
  CREATE OR REPLACE TABLE `proj.attrib.state_prob` AS
  WITH next_step AS (
    SELECT
      t.to_channel AS state,
      SUM(s.prob * t.p) AS prob
    FROM `proj.attrib.state_prob` s
    JOIN `proj.attrib.transitions` t ON s.state = t.from_channel
    -- Absorb at conversion and null
    WHERE s.state NOT IN ('conversion','null')
    GROUP BY 1
  )
  SELECT state, prob FROM next_step
  UNION ALL
  -- Carry absorbed mass forward
  SELECT state, prob FROM `proj.attrib.state_prob`
  WHERE state IN ('conversion','null');
  SET step = step + 1;
END LOOP;

Your conversion row's prob after 30 iterations is the baseline conversion probability of the full graph. On a typical e-commerce client we see numbers in the 0.018 to 0.044 range, depending on traffic mix.

Step 4: Removal effect — the actual attribution

For each channel c, recompute the conversion probability with c's outgoing transitions zeroed and reallocated to null. The drop in probability is the channel's removal effect. We wrap this in a stored procedure so we can call it per-channel.

-- 04_removal_effect.sql
CREATE OR REPLACE PROCEDURE `proj.attrib.remove_channel`(remove STRING, OUT removed_prob FLOAT64)
BEGIN
  DECLARE step INT64 DEFAULT 0;
  CREATE OR REPLACE TEMP TABLE m AS
  SELECT
    from_channel,
    CASE WHEN from_channel = remove THEN 'null' ELSE to_channel END AS to_channel,
    CASE WHEN from_channel = remove THEN 1.0 ELSE p END AS p
  FROM `proj.attrib.transitions`;

  CREATE OR REPLACE TEMP TABLE sp AS
  SELECT 'start' AS state, 1.0 AS prob
  UNION ALL SELECT DISTINCT to_channel, 0.0 FROM m WHERE to_channel != 'start';

  LOOP
    IF step >= 30 THEN LEAVE; END IF;
    CREATE OR REPLACE TEMP TABLE sp AS
    SELECT t.to_channel AS state, SUM(s.prob * t.p) AS prob
    FROM sp s JOIN m t ON s.state = t.from_channel
    WHERE s.state NOT IN ('conversion','null')
    GROUP BY 1
    UNION ALL
    SELECT state, prob FROM sp WHERE state IN ('conversion','null');
    SET step = step + 1;
  END LOOP;
  SET removed_prob = (SELECT prob FROM sp WHERE state = 'conversion');
END;

Call this for every channel and store the deltas:

-- 05_attribute.sql — driver script
DECLARE channels ARRAY<STRING> DEFAULT (
  SELECT ARRAY_AGG(DISTINCT from_channel)
  FROM `proj.attrib.transitions`
  WHERE from_channel NOT IN ('start','conversion','null')
);
DECLARE baseline FLOAT64;
DECLARE i INT64 DEFAULT 0;
DECLARE removed FLOAT64;

SET baseline = (SELECT prob FROM `proj.attrib.state_prob` WHERE state='conversion');
CREATE OR REPLACE TABLE `proj.attrib.removal_effects`(channel STRING, removal_effect FLOAT64);

LOOP
  IF i >= ARRAY_LENGTH(channels) THEN LEAVE; END IF;
  CALL `proj.attrib.remove_channel`(channels[OFFSET(i)], removed);
  INSERT INTO `proj.attrib.removal_effects`
  VALUES (channels[OFFSET(i)], baseline - removed);
  SET i = i + 1;
END LOOP;

-- Final attribution: normalize removal effects to sum to 1
CREATE OR REPLACE TABLE `proj.attrib.weights` AS
SELECT
  channel,
  removal_effect,
  SAFE_DIVIDE(removal_effect, SUM(removal_effect) OVER ()) AS weight
FROM `proj.attrib.removal_effects`
ORDER BY weight DESC;

A real worked example (anonymized)

Here are the numbers from a Bangkok D2C client with ~฿4.2M monthly revenue, after running the pipeline above on Q1 2026 data:

Channel	Last-click %	Markov %	Delta
paid_search	38.4%	22.1%	−16.3
organic_search	11.7%	21.4%	+9.7
paid_social_meta	21.0%	16.8%	−4.2
paid_social_tiktok	4.2%	9.6%	+5.4
line_oa	6.1%	11.2%	+5.1
email	9.8%	8.7%	−1.1
direct	8.8%	10.2%	+1.4

Three findings the client had not seen with last-click: organic search was massively under-credited (it's almost always an early touchpoint, never the closer), TikTok was a 2.3x better top-funnel channel than its last-click numbers suggested, and LINE OA was doing real assist work that justified increasing the broadcast budget. Within a quarter we re-allocated ฿380K/month away from paid search into TikTok and LINE based on this output.

Quarterly retraining

Channel mix and seasonality drift. A model fit on Q4 holiday data will systematically over-credit paid social if you use it through Q1. We schedule retraining as a Cloud Composer DAG that runs on the first Monday of each quarter, materializes the new weights to a versioned table, and posts a diff to a Slack channel for analyst review before promoting to production.

# dags/markov_retrain.py — abridged
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from datetime import datetime

with DAG('markov_retrain', schedule='0 6 1 */3 *',
         start_date=datetime(2026,1,1), catchup=False) as dag:
    paths = BigQueryInsertJobOperator(task_id='paths',
        configuration={'query':{'query':open('sql/01_paths.sql').read(),
                                'useLegacySql':False}})
    transitions = BigQueryInsertJobOperator(task_id='transitions',
        configuration={'query':{'query':open('sql/02_transitions.sql').read(),
                                'useLegacySql':False}})
    baseline = BigQueryInsertJobOperator(task_id='baseline',
        configuration={'query':{'query':open('sql/03_baseline.sql').read(),
                                'useLegacySql':False}})
    attribute = BigQueryInsertJobOperator(task_id='attribute',
        configuration={'query':{'query':open('sql/05_attribute.sql').read(),
                                'useLegacySql':False}})
    paths >> transitions >> baseline >> attribute

Caveats and gotchas

Walled gardens. If a click is driven by paid social but tagged as direct because of in-app browser stripping, your model will under-credit the channel. We push for server-side click ID forwarding (we wrote about GA4 server-side on Cloudflare Workers covering exactly this).

Cross-device. Markov on user_pseudo_id alone misses cross-device journeys. If you have an authenticated user identifier, run the model twice — once on raw user_pseudo_id, once on stitched user_id — and reconcile.

Statistical noise. Below ~5,000 conversions per period, removal-effect estimates are unstable. We bootstrap (resample paths with replacement, refit 200 times, take the median weight) for any client below 8,000 monthly conversions. The sequential testing piece has more on confidence intervals at small n.

Compliance. Path-level data is personal under PDPA. We hash user_pseudo_id at ingestion and don't retain raw paths beyond 90 days. See our PDPA for analytics guide for the consent and retention setup.

What this unlocks downstream

The output of this pipeline isn't just a different way to slice last-click numbers. It feeds three downstream systems for our clients: budget optimization (we run a constrained reallocation against the Markov weights weekly), experimentation prioritization (channels with high removal effect but low spend are first targets for tests), and checkout-flow segmentation (we tag converted sessions with their dominant assist channel and feed it to the on-site personalization layer — see our Thai checkout patterns piece). All three are part of the standard CRO retainer.

For full-stack data engineering needs around this — instrumentation, warehouse modeling, dashboarding — we partner with Bluewich for the build and SEO Agency Bangkok for the upstream traffic mix, while SitPlay Media handles the content layer that drives organic share. The Markov model is most useful in a network where someone is paying attention to all four levers.

Get the working repo

If you want the full SQL + Airflow scripts in a single repo with a synthetic test dataset, get in touch and we'll send the GitHub invite. We don't open-source it because we keep evolving the channel-grouping logic per industry, but every client we work with gets the current version.

Tags: attribution bigquery markov ga4 analytics-engineering

Markov Attribution in BigQuery: A Working Example

Why Markov, and why now

The model in one paragraph

Step 1: Build the path table from GA4 export

Step 2: Build the transition matrix

Step 3: Compute the baseline conversion probability

Step 4: Removal effect — the actual attribution

A real worked example (anonymized)

Quarterly retraining

Caveats and gotchas

What this unlocks downstream

Get the working repo

Frequently asked questions

Why use Markov-chain attribution instead of last-click in 2026?

How does Markov-chain removal-effect attribution actually work?

Can I trust GA4's default channel grouping for attribution?

How do you build the transition matrix so conversion probability is well-defined?

How many iterations do you need to compute the baseline conversion probability in BigQuery?

Stop Peeking: Sequential Testing for Real Experimentation

GA4 Server-Side on Cloudflare Workers

Thai Checkout Patterns: PromptPay · LINE Pay · TrueMoney

PDPA Thailand for Analytics: What You Actually Need

Want this pipeline running on your data?

Markov Attribution in BigQuery: A Working Example

Why Markov, and why now

The model in one paragraph

Step 1: Build the path table from GA4 export

Step 2: Build the transition matrix

Step 3: Compute the baseline conversion probability

Step 4: Removal effect — the actual attribution

A real worked example (anonymized)

Quarterly retraining

Caveats and gotchas

What this unlocks downstream

Get the working repo

Frequently asked questions

Why use Markov-chain attribution instead of last-click in 2026?

How does Markov-chain removal-effect attribution actually work?

Can I trust GA4's default channel grouping for attribution?

How do you build the transition matrix so conversion probability is well-defined?

How many iterations do you need to compute the baseline conversion probability in BigQuery?

Stop Peeking: Sequential Testing for Real Experimentation

GA4 Server-Side on Cloudflare Workers

Thai Checkout Patterns: PromptPay · LINE Pay · TrueMoney

PDPA Thailand for Analytics: What You Actually Need

Want this pipeline running on your data?

More from our scraper.

PDPA Thailand for Analytics: What You Actually Need

Stop Peeking: Sequential Testing for Real Experimentation

GA4 Server-Side on Cloudflare Workers: Cost & Latency Notes