Bangkok.Digital
// EXPERIMENTATION · 2026-03-25 · 14 min read

Stop Peeking: Sequential Testing for Real Experimentation

Why fixed-horizon A/B tests inflate false-positive rates by 5-15x in real agency practice — and what to use instead. With production-ready Python for mSPRT and Always Valid Inference.

By Yunmin Shin · Published 2026-03-25 · Bangkok

The peeking problem, quantified

Every CRO agency in Bangkok runs A/B tests. Almost none of them run A/B tests correctly. The reason is simple: classical fixed-horizon tests are designed for one decision at one pre-specified sample size. The moment you peek at the dashboard halfway through and consider stopping early, the math falls apart. The nominal 5% false-positive rate becomes a real 25-50% false-positive rate, depending on how many times you peek and how aggressively you stop.

We ran a simulation against the experiment logs of 137 tests shipped by Bangkok agencies (from clients we onboarded, whose previous testing setups we inherited). Of the 53 declared "winners" with reported p < 0.05 from a fixed-horizon two-sample z-test, where stopping was triggered by an early dashboard alert, only 11 still had p < 0.05 at the originally planned sample size. The other 42 did not hold up, which puts the empirical false-discovery rate at 42/53 ≈ 79%. The nominal false-positive rate was supposed to be 5%.

This is the difference between calling a test a "winner" because it crossed an arbitrary line at an arbitrary moment, and calling a test a winner because it actually represents an effect that will replicate. The whole point of running tests is to learn things. If your tests are wrong four out of five times, you are not learning, you are gambling with the client's roadmap.

What's actually happening: a simulation

Here's a clean simulation in 30 lines. Two arms, no effect (null is true), peek every 1,000 visitors, stop early at p < 0.05. Run it 10,000 times.

import numpy as np
from scipy.stats import norm

def fixed_horizon_with_peeking(n_max=20000, peek_every=1000,
                                p_a=0.10, p_b=0.10, alpha=0.05):
    a = np.random.binomial(1, p_a, n_max)
    b = np.random.binomial(1, p_b, n_max)
    for n in range(peek_every, n_max + 1, peek_every):
        # two-proportion z test
        pa, pb = a[:n].mean(), b[:n].mean()
        p_pool = (a[:n].sum() + b[:n].sum()) / (2*n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        if se == 0: continue
        z = (pb - pa) / se
        p = 2 * (1 - norm.cdf(abs(z)))
        if p < alpha:
            return True  # falsely "significant"
    return False

trials = 10_000
fpr = sum(fixed_horizon_with_peeking() for _ in range(trials)) / trials
print(f"Empirical FPR with peeking: {fpr:.3f}")
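# typically ~0.25, i.e. roughly 25% false positives at a nominal 5%

Run that locally and you'll see roughly 25%, about five times the nominal rate. Peek every 500 visitors instead of 1,000 and it climbs to roughly 30%. Peek continuously with no cap on sample size and it converges, in the limit, to 100%: by the law of the iterated logarithm, the z-statistic of a driftless random walk eventually crosses any fixed threshold.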
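
To compare peek frequencies without waiting on a Python loop, the same null simulation can be vectorized by drawing conversions per peek interval. This is a sketch under assumptions: the function name `peeking_fpr`, the seed, and the reduced trial count are illustrative choices, not part of the setup above.

```python
import numpy as np
from scipy.stats import norm


def peeking_fpr(n_max=20_000, peek_every=1_000, p=0.10,
                alpha=0.05, trials=2_000, seed=0):
    """Share of null (A/A) experiments that cross p < alpha at any peek."""
    rng = np.random.default_rng(seed)
    looks = np.arange(peek_every, n_max + 1, peek_every)
    z_crit = norm.ppf(1 - alpha / 2)
    # Visitors added between consecutive peeks, then cumulative conversions
    # per arm with shape (trials, n_looks).
    step = np.diff(looks, prepend=0)
    conv_a = rng.binomial(step, p, size=(trials, looks.size)).cumsum(axis=1)
    conv_b = rng.binomial(step, p, size=(trials, looks.size)).cumsum(axis=1)
    # Pooled two-proportion z statistic at every peek.
    p_pool = (conv_a + conv_b) / (2 * looks)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / looks)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (conv_b - conv_a) / looks / se
    # |z| > z_crit is the same event as p < alpha for a two-sided test.
    crossed = np.abs(np.nan_to_num(z)) > z_crit
    return crossed.any(axis=1).mean()


for every in (20_000, 1_000, 500):
    print(f"peek every {every:>6}: FPR = {peeking_fpr(peek_every=every):.3f}")
```

With a single look at `n_max` the rate should sit near the nominal 5%; each extra look adds another chance to cross the line.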

Why "just don't peek" doesn't work

Every textbook says don't peek. Every PM and every founder peeks. The dashboard refreshes hourly, decisions need to be made on Tuesday, and the test that "looked promising" on Monday gets shipped. Telling humans not to look at a number that's right in front of them is not an engineering solution — it's wishful thinking.

The right engineering move is to use a statistical procedure that is valid under continuous monitoring. Two practical options that we use in production:

  1. mSPRT: Mixture Sequential Probability Ratio Test. The basis of Optimizely's Stats Engine. Easy to implement.
  2. AVI: Always Valid Inference, here in the form of confidence sequences (Howard et al., 2021). Often tighter intervals than mSPRT in practice. We default to this for high-volume tests.

mSPRT in 60 lines of Python

The idea: instead of a fixed cutoff, you compute a likelihood ratio at every observation comparing the null (no effect) against a mixture of alternatives weighted by a prior. Under the null this mixture likelihood ratio is a nonnegative martingale that starts at 1, so by Ville's inequality the probability that it ever reaches 1/α is at most α. You can therefore stop the first time it crosses 1/α and still control type-I error at level α.
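
For a normal prior the mixture integral has a closed form. As a sketch, assume the observed difference \hat\delta_n is approximately normal with variance V_n (a large-sample approximation, with V_n estimated from the data, so the guarantee is approximate rather than exact) and the prior on the true effect is N(0, \tau^2). Writing \phi(x; m, v) for the normal density with mean m and variance v:

```latex
\Lambda_n
  = \int \frac{\phi\!\left(\hat\delta_n;\ \theta,\ V_n\right)}
              {\phi\!\left(\hat\delta_n;\ 0,\ V_n\right)}
    \,\phi\!\left(\theta;\ 0,\ \tau^2\right) d\theta
  = \sqrt{\frac{V_n}{V_n + \tau^2}}\;
    \exp\!\left( \frac{\tau^2\,\hat\delta_n^{2}}{2\,V_n\,(V_n + \tau^2)} \right),
\qquad
\Pr_{H_0}\!\left( \exists\, n:\ \Lambda_n \ge 1/\alpha \right) \le \alpha .
```

The second equality follows because the mixture makes \hat\delta_n marginally N(0, V_n + \tau^2); the ratio of that density to the null density N(0, V_n) gives the square-root factor and the exponent.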

import numpy as np

class mSPRT:
    """Mixture SPRT for a two-proportion test with a N(0, tau^2) prior on the effect."""
    def __init__(self, tau=0.05, alpha=0.05):
        self.tau = tau                    # prior std. dev. of the effect size
        self.alpha = alpha
        self.threshold = 1.0 / alpha      # Ville's inequality: reject once LR > 1/alpha
        self.n_a = self.n_b = 0           # visitors per arm
        self.s_a = self.s_b = 0           # conversions per arm

    def update(self, x_a, x_b):
        """Add one visitor per arm; return the mixture LR, or None if undefined."""
        self.n_a += 1
        self.n_b += 1
        self.s_a += x_a
        self.s_b += x_b
        if self.n_a < 50:                 # burn-in: plug-in variance is unstable early
            return None
        p_a = self.s_a / self.n_a
        p_b = self.s_b / self.n_b
        delta = p_b - p_a
        var = (p_a*(1-p_a)/self.n_a) + (p_b*(1-p_b)/self.n_b)
        if var == 0:
            return None
        # Closed-form mixture LR under a N(0, tau^2) prior on the effect
        denom = var + self.tau**2
        log_lr = 0.5 * np.log(var / denom) + (delta**2 * self.tau**2) / (2*var*denom)
        # Cap the exponent so a very large LR does not overflow to inf
        return float(np.exp(min(log_lr, 700.0)))

    def decide(self, x_a, x_b):
        lr = self.update(x_a, x_b)
        if lr is not None and lr > self.threshold:
            return 'reject_null'
        return 'continue'

The tau parameter is the standard deviation of the prior over effect size. Set it to the smallest effect you would care about — for most CRO work that's about 1-2 percentage points on a 5-15% baseline conversion rate, so tau ≈ 0.01-0.02 in absolute terms.
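
One way to sanity-check a tau choice is to ask how large the observed difference must be before the test can reject at a given sample size. The sketch below inverts the normal-mixture likelihood ratio for that difference. The helper name `msprt_min_detectable_diff` is hypothetical, and it assumes both arms sit near the same baseline rate so the variance can be approximated up front.

```python
import math


def msprt_min_detectable_diff(n_per_arm, baseline, tau, alpha=0.05):
    """Smallest |p_b - p_a| at which the normal-mixture LR reaches 1/alpha.

    Assumes both arms sit near `baseline`, so the variance of the observed
    difference is roughly 2 * baseline * (1 - baseline) / n_per_arm.
    """
    var = 2 * baseline * (1 - baseline) / n_per_arm
    denom = var + tau ** 2
    # Solve 0.5*log(var/denom) + d^2 * tau^2 / (2*var*denom) = log(1/alpha) for d.
    log_term = math.log(1 / alpha) + 0.5 * math.log(denom / var)
    return math.sqrt(2 * var * denom * log_term / tau ** 2)


for tau in (0.005, 0.01, 0.02, 0.05):
    row = [msprt_min_detectable_diff(n, 0.10, tau) for n in (2_000, 10_000, 50_000)]
    print(f"tau={tau}: " + ", ".join(f"{d:.4f}" for d in row))
```

Comparing rows shows the trade-off: a tau far from the effects you actually see makes the test slower to reject at the sample sizes you can afford.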

Validation against fixed-horizon

Same simulation as before, but with mSPRT:

def msprt_fpr(n_max=20000, p_a=0.10, p_b=0.10, alpha=0.05):
    sprt = mSPRT(tau=0.02, alpha=alpha)
    a = np.random.binomial(1, p_a, n_max)
    b = np.random.binomial(1, p_b, n_max)
    for i in range(n_max):
        if sprt.decide(a[i], b[i]) == 'reject_null':
            return True
    return False

fpr = sum(msprt_fpr() for _ in range(10_000)) / 10_000
print(f"mSPRT FPR with continuous peeking: {fpr:.3f}")
# typically ~0.045 — controlled at nominal alpha

You get the FPR you advertised, while still being free to stop the test the moment evidence is conclusive. That's the entire promise of sequential testing in one paragraph.
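
Looping a Python object over 10,000 trials of 20,000 updates each is slow. A vectorized equivalent computes the whole likelihood-ratio path from cumulative sums and returns the first crossing. This is a sketch, not the production code: the function name `msprt_first_rejection` and the seed are illustrative, and it assumes equal-length arms with one visitor per arm per step, mirroring the class above.

```python
import numpy as np


def msprt_first_rejection(a, b, tau=0.02, alpha=0.05, burn_in=50):
    """Per-arm sample size (1-based) at the first mSPRT rejection, or None.

    `a` and `b` are equal-length 0/1 arrays of per-visitor conversions.
    """
    n = np.arange(1, len(a) + 1)
    p_a = np.cumsum(a) / n
    p_b = np.cumsum(b) / n
    delta = p_b - p_a
    var = p_a * (1 - p_a) / n + p_b * (1 - p_b) / n
    denom = var + tau ** 2
    with np.errstate(divide="ignore", invalid="ignore"):
        log_lr = 0.5 * np.log(var / denom) + delta ** 2 * tau ** 2 / (2 * var * denom)
        # Compare in log space: LR > 1/alpha  <=>  log LR > log(1/alpha).
        above = log_lr > np.log(1 / alpha)
    # Skip the burn-in and any step where the plug-in variance is zero.
    valid = (n >= burn_in) & (var > 0)
    hits = np.flatnonzero(valid & above)
    return int(hits[0]) + 1 if hits.size else None


rng = np.random.default_rng(1)
control = rng.binomial(1, 0.10, 20_000)
variant = rng.binomial(1, 0.12, 20_000)
print("stopped at n per arm:", msprt_first_rejection(control, variant))
```

Running it over many simulated pairs with a real effect gives the stopping-time distribution, which is the number to compare against the fixed-horizon sample size.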

Always Valid Inference (AVI)

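AVI generalizes mSPRT and produces confidence sequences: confidence intervals that are valid simultaneously at every sample size. We use the gamma-exponential mixture from Howard et al. (2021), which gives tighter intervals in the regime we care about (medium effect, medium n). The exact gamma-exponential boundary has no closed form and needs numerical root-finding, so the listing here uses a closed-form, iterated-logarithm-style radius as a stand-in. Treat it as an approximation and check its coverage by simulation before relying on it.

import numpy as np

def gamma_exp_radius(s2, n, alpha=0.05, rho=1.4):
    """Closed-form stand-in for the Howard et al. (2021) gamma-exponential radius."""
    # s2 = running estimate of variance; n = sample size
    # s2*n + rho plays the role of the accumulated variance; the log-log term
    # is the iterated-logarithm price of being valid at every n.
    a = np.log(2 / alpha) + 1.5 * np.log(np.log(np.e * (s2*n + rho)))
    b = (s2*n + rho) / n**2
    return np.sqrt(2 * a * b)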

class AlwaysValidCI:
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.n = 0
        self.sum = 0.0
        self.sum_sq = 0.0
    def update(self, x):
        self.n += 1
        self.sum += x
        self.sum_sq += x*x
    def interval(self):
        if self.n < 30: return None
        mean = self.sum / self.n
        var = (self.sum_sq - self.sum**2/self.n) / max(self.n-1, 1)
        r = gamma_exp_radius(var, self.n, self.alpha)
        return (mean - r, mean + r)

For a difference-in-means test you maintain two of these and compute the interval on diff = mean_b - mean_a (or on the per-user paired difference if your randomization unit allows it). Reject the null when 0 is outside the interval.
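
A minimal way to combine two per-arm intervals is a union bound: build each arm's sequence at level alpha/2 (for example `AlwaysValidCI(alpha=0.025)`), then subtract endpoints. This is a conservative sketch rather than the tightest option; the function names `diff_interval` and `decide` are hypothetical, and an interval built directly on the paired difference would usually be narrower.

```python
def diff_interval(ci_a, ci_b):
    """Interval for mean_b - mean_a from two per-arm (lo, hi) intervals.

    If each per-arm sequence has level alpha/2, a union bound gives the
    difference interval level alpha.
    """
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    return (lo_b - hi_a, hi_b - lo_a)


def decide(ci_a, ci_b):
    """Return 'reject_null' once 0 falls outside the difference interval."""
    if ci_a is None or ci_b is None:   # at least one arm is still in burn-in
        return "continue"
    lo, hi = diff_interval(ci_a, ci_b)
    return "reject_null" if lo > 0 or hi < 0 else "continue"


print(decide((0.09, 0.11), (0.12, 0.14)))  # arms separated
print(decide((0.09, 0.12), (0.11, 0.14)))  # arms still overlap
```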

How we deploy this in production

Our standard stack for a CRO retainer client looks like this:

Pre-registration template

This is the YAML we drop into the experiment folder before any test goes live:

experiment_id: chk-2026-q2-promptpay-default
hypothesis: |
  Setting PromptPay as the default checkout method (vs LINE Pay)
  increases conversion-to-paid by ≥ 0.8pp on mobile sessions.
randomization_unit: user_pseudo_id
arm_split: [0.5, 0.5]
primary_metric: conversion_rate
analysis_method: AVI_gamma_exp
alpha: 0.05
prior_tau: 0.02
min_sample_size: 8000
futility_radius: 0.005
guardrails:
  - bounce_rate_lift_max: 0.02
  - aov_drop_max: 0.05
pre_registered_at: 2026-04-01T10:00:00+07:00
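
The fields in a template like this can drive the stopping decision directly. The sketch below is one possible reading, not the actual pipeline: it assumes `min_sample_size` is a per-arm minimum before any call is made, and that `futility_radius` means "stop when the whole interval on absolute lift lies inside plus or minus this value". Both interpretations, and the names `Plan` and `decide`, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    # Field names mirror the YAML keys; their semantics here are assumed.
    alpha: float = 0.05
    min_sample_size: int = 8000      # assumed: per-arm minimum before any call
    futility_radius: float = 0.005   # assumed: absolute lift treated as negligible


def decide(plan, n_per_arm, ci_diff):
    """Map a confidence interval on (variant - control) to an action."""
    if ci_diff is None or n_per_arm < plan.min_sample_size:
        return "no decision yet"
    lo, hi = ci_diff
    if lo > 0:
        return "ship variant"
    if hi < 0:
        return "keep control"
    # Interval still contains 0: stop only if every plausible lift is negligible.
    if -plan.futility_radius < lo and hi < plan.futility_radius:
        return "stop for futility"
    return "no decision yet"


plan = Plan()
print(decide(plan, 14_200, (-0.010, 0.020)))
print(decide(plan, 14_200, (0.008, 0.041)))
```

Keeping the rule as a pure function of the registered plan makes it easy to show that the decision logic was fixed before the data arrived.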

Common objections and our answers

"This is overkill for our 200-visitor-a-day site." Below ~5K visitors per arm per week, you don't have the power to run informative tests of any kind. Sequential testing doesn't fix small-n problems — but it also doesn't make them worse. If your traffic is too low, run fewer, longer tests at fixed horizon and don't peek (or do CRO via UX research instead, which we do for low-traffic clients).

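The traffic argument in the low-traffic objection can be checked with the standard fixed-horizon sample-size formula for a two-proportion z-test. This is a sketch using only the standard library; the function name `n_per_arm` and the 80% power target are assumptions, not figures from our client work.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p_control, p_variant, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided, fixed-horizon two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_variant) / 2
    # Standard deviations of the difference under the null and the alternative.
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    return ceil(((z_alpha * null_sd + z_beta * alt_sd) / (p_variant - p_control)) ** 2)


print(n_per_arm(0.10, 0.11))   # 1pp lift on a 10% baseline
print(n_per_arm(0.10, 0.12))   # 2pp lift on a 10% baseline
```

A 1pp lift on a 10% baseline needs on the order of 15,000 visitors per arm, which at 200 visitors a day split evenly is several months of traffic.
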
"My team won't understand confidence sequences." They don't have to. The Slack alert says either "ship variant B (lift 2.4%, CI [+0.8, +4.1])" or "no decision yet, n=14,200." That's the entire UX. The math lives in the BigQuery view.

"Optimizely / VWO already does this." Partly. Optimizely's Stats Engine is mSPRT-based, so tests run there are already covered. VWO's SmartStats is Bayesian, which is a different approach (see the next objection). If you're rolling your own (Cloudflare Worker + BigQuery + dbt is increasingly common; we documented the SSGTM stack separately), you need to implement the math.
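
"Bayesian methods don't have this problem." Mostly true, with a caveat. The stopping rule does not change a Beta-Binomial posterior, so credible intervals keep their Bayesian interpretation under monitoring. What they do not guarantee is a frequentist false-positive rate: stopping the first time P(B > A) passes 95% can still inflate type-I error. We use Bayesian methods for binary metrics where we have informative priors (e.g., from our scraper data on similar Thai checkout flows; see checkout patterns). For continuous metrics or when priors are weak, AVI is what we reach for.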

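For the Beta-Binomial case, a posterior comparison of two arms takes a few lines. This sketch assumes independent Beta priors per arm and defaults to a uniform Beta(1, 1) prior purely for illustration; an informative prior would be passed through the hypothetical `prior` argument. The function name and example counts are made up.

```python
import numpy as np


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0),
                   draws=200_000, seed=0):
    """Monte Carlo estimate of P(p_b > p_a | data) and a 95% credible
    interval for p_b - p_a, using independent Beta priors on each arm."""
    rng = np.random.default_rng(seed)
    alpha0, beta0 = prior
    # Beta prior + binomial data gives a Beta posterior for each arm.
    post_a = rng.beta(alpha0 + conv_a, beta0 + n_a - conv_a, draws)
    post_b = rng.beta(alpha0 + conv_b, beta0 + n_b - conv_b, draws)
    diff = post_b - post_a
    lo, hi = np.quantile(diff, [0.025, 0.975])
    return float((diff > 0).mean()), (float(lo), float(hi))


prob, interval = prob_b_beats_a(conv_a=500, n_a=5_000, conv_b=570, n_b=5_000)
print(f"P(B > A) = {prob:.3f}, 95% credible interval for lift = {interval}")
```
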
Pairing with Markov attribution and SSGTM

Sequential testing is one of three legs of our analytics stool. The other two: Markov attribution tells us which channels deserve testing budget; server-side GA4 gives us the low-latency event stream that makes continuous analysis feasible. None of the three works as well alone. We covered the full picture in case studies — every CRO retainer ships all three by month two.

For Bangkok teams without the in-house statistical chops, we also partner with Bluewich for the engineering side and SitPlay Media for the upstream content that drives enough traffic to make tests powered. SEO Agency Bangkok handles organic, which is the cheapest source of test traffic that matches buying intent.

What to read next

If you want to go deeper on the math: Howard, Ramdas, McAuliffe, Sekhon (2021), "Time-uniform, nonparametric, nonasymptotic confidence sequences." Read it twice. If you want the practitioner version: Johari, Koomen, Pekelis, Walsh, "Peeking at A/B Tests" — the original Optimizely Stats Engine paper. We hand both to every analyst we hire.

If you have an active test running with peeking, retroactively re-analyze it with AVI before shipping. We've saved clients from rolling out worse-than-control variants three times in 2026 alone using exactly this audit.

Tags: statistics ab-testing sprt experimentation python