The peeking problem, quantified
Every CRO agency in Bangkok runs A/B tests. Almost none of them run A/B tests correctly. The reason is simple: classical fixed-horizon tests are designed for one decision at one pre-specified sample size. The moment you peek at the dashboard halfway through and consider stopping early, the math falls apart. The nominal 5% false-positive rate becomes a real 25-50% false-positive rate, depending on how many times you peek and how aggressively you stop.
We re-analyzed the experiment logs of 137 tests shipped by Bangkok agencies (clients we onboarded whose previous testing setup we inherited). Of the 53 declared "winners", each reported at p < 0.05 from a fixed-horizon two-sample z-test that was stopped on an early dashboard alert, only 11 still had p < 0.05 at the originally planned sample size. The other 42 did not hold up, an empirical false-discovery rate of 79%. The tests were run at a nominal 5% false-positive rate.
This is the difference between calling a test a "winner" because it crossed an arbitrary line at an arbitrary moment, and calling a test a winner because it actually represents an effect that will replicate. The whole point of running tests is to learn things. If your tests are wrong four out of five times, you are not learning, you are gambling with the client's roadmap.
What's actually happening: a simulation
Here's a clean simulation in about 20 lines. Two arms, no effect (null is true), peek every 1,000 visitors per arm up to 20,000 per arm (20 looks), stop early at p < 0.05. Run it 10,000 times.
import numpy as np
from scipy.stats import norm

def fixed_horizon_with_peeking(n_max=20000, peek_every=1000,
                               p_a=0.10, p_b=0.10, alpha=0.05):
    a = np.random.binomial(1, p_a, n_max)
    b = np.random.binomial(1, p_b, n_max)
    for n in range(peek_every, n_max + 1, peek_every):
        # two-proportion z test
        pa, pb = a[:n].mean(), b[:n].mean()
        p_pool = (a[:n].sum() + b[:n].sum()) / (2*n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        if se == 0: continue
        z = (pb - pa) / se
        p = 2 * (1 - norm.cdf(abs(z)))
        if p < alpha:
            return True  # falsely "significant"
    return False

trials = 10_000
fpr = sum(fixed_horizon_with_peeking() for _ in range(trials)) / trials
print(f"Empirical FPR with peeking: {fpr:.3f}")
# typically ~0.25, i.e., roughly 25% false positives at a nominal 5%
Run that locally and you'll see roughly 25%. Peek every 500 visitors instead of 1,000 and it climbs to roughly 30%. Peek continuously with no cap on the sample size and it converges, in the limit, to 100%: by the law of the iterated logarithm, the z-statistic of a random walk eventually crosses any fixed threshold.
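How the false-positive rate grows with the number of looks can also be checked without simulating individual visitors. The sketch below assumes the z-statistic at each equally spaced look behaves like a normalized Gaussian random walk, which is a large-sample approximation and not the exact binomial model above. The function name peeking_fpr and its parameters are illustrative, not part of the original code.

```python
import math
import random
from statistics import NormalDist

def peeking_fpr(n_looks, alpha=0.05, trials=20_000, seed=0):
    """Share of null experiments that cross the fixed z cutoff at any look.

    Assumption: equally spaced looks and a Gaussian random-walk model for
    the test statistic (large-sample approximation).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff, ~1.96
    false_positives = 0
    for _ in range(trials):
        running_sum = 0.0
        for k in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)       # one block of new data
            if abs(running_sum) / math.sqrt(k) > z_crit:
                false_positives += 1                 # stopped early on noise
                break
    return false_positives / trials

if __name__ == "__main__":
    for looks in (1, 5, 20, 40):
        print(looks, round(peeking_fpr(looks), 3))
```

With a single look the rate should sit near the nominal 5%, and it rises steadily as looks are added.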
Why "just don't peek" doesn't work
Every textbook says don't peek. Every PM and every founder peeks. The dashboard refreshes hourly, decisions need to be made on Tuesday, and the test that "looked promising" on Monday gets shipped. Telling humans not to look at a number that's right in front of them is not an engineering solution — it's wishful thinking.
The right engineering move is to use a statistical procedure that is valid under continuous monitoring. Two practical options that we use in production:
- mSPRT: Mixture Sequential Probability Ratio Test. Used by Optimizely's Stats Engine. Easy to implement.
- AVI: Always Valid Inference via confidence sequences (Howard et al., 2021). Tighter intervals than mSPRT in practice. We default to this for high-volume tests.
mSPRT in about 30 lines of Python
The idea: instead of a fixed cutoff, you compute a likelihood ratio at every observation comparing the null (no effect) against a mixture of alternatives weighted by a prior. The likelihood ratio behaves as a martingale under the null, which means by Ville's inequality you can stop the first time it crosses 1/α and still control type-I error at level α.
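For reference, here is the mixture likelihood ratio written out under a normal approximation. The notation is ours: \hat\delta_n is the estimated difference in conversion rates after n observations per arm, V_n is its estimated variance, and the prior on the true effect is N(0, \tau^2). Because V_n is a plug-in estimate, the guarantee should be read as approximate for binary data.

```latex
% Marginal of \hat\delta_n is N(0, V_n + \tau^2) under the mixture
% and N(0, V_n) under the null; their density ratio is
\Lambda_n
  = \sqrt{\frac{V_n}{V_n + \tau^2}}\;
    \exp\!\left( \frac{\tau^2\,\hat\delta_n^{\,2}}{2\,V_n\,(V_n + \tau^2)} \right)

% Ville's inequality for a nonnegative martingale with \mathbb{E}[\Lambda_0] = 1:
\Pr_{H_0}\!\left( \exists\, n \ge 1 : \Lambda_n \ge \tfrac{1}{\alpha} \right) \le \alpha
```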
import numpy as np

class mSPRT:
    """Mixture SPRT for two-proportion test, mixture variance tau^2."""
    def __init__(self, tau=0.05, alpha=0.05):
        self.tau = tau
        self.alpha = alpha
        self.threshold = 1.0 / alpha
        self.n_a = self.n_b = 0
        self.s_a = self.s_b = 0

    def update(self, x_a, x_b):
        # One observation per arm per call (balanced allocation)
        self.n_a += 1; self.n_b += 1
        self.s_a += x_a; self.s_b += x_b
        if self.n_a < 50: return None  # burn-in: plug-in variance is unstable early
        p_a = self.s_a / self.n_a
        p_b = self.s_b / self.n_b
        delta = p_b - p_a
        # Plug-in variance of the estimated difference
        var = (p_a*(1-p_a)/self.n_a) + (p_b*(1-p_b)/self.n_b)
        if var == 0: return None
        # Mixture LR under N(0, tau^2) prior on the effect
        denom = var + self.tau**2
        log_lr = 0.5 * np.log(var / denom) + (delta**2 * self.tau**2) / (2*var*denom)
        return np.exp(log_lr)

    def decide(self, x_a, x_b):
        lr = self.update(x_a, x_b)
        if lr is None: return 'continue'
        if lr > self.threshold: return 'reject_null'
        return 'continue'
The tau parameter is the standard deviation of the prior over effect size. Set it to roughly the smallest effect you would care about: for most CRO work that's about 1-2 percentage points on a 5-15% baseline conversion rate, so tau ≈ 0.01-0.02 in absolute terms. The class default of 0.05 is wider than that, so pass tau explicitly.
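To get a feel for what tau does, you can invert the rejection rule and ask how large a z-statistic is needed to cross 1/α at a given sample size. The sketch below is derived from the likelihood-ratio expression in the class above, assuming both arms sit at the same baseline rate. The helper name msprt_critical_z and the 10% baseline are our illustrative choices.

```python
import math

def msprt_critical_z(n_per_arm, baseline=0.10, tau=0.02, alpha=0.05):
    """|z| needed for the mixture LR to exceed 1/alpha at this sample size.

    Assumption: both arms convert at `baseline`, so the variance of the
    estimated difference is 2 * p * (1 - p) / n.
    """
    var = 2 * baseline * (1 - baseline) / n_per_arm
    denom = var + tau ** 2
    # Solve 0.5*log(var/denom) + z^2 * tau^2 / (2*denom) = log(1/alpha) for z^2
    z_sq = 2 * denom / tau ** 2 * (math.log(1 / alpha) + 0.5 * math.log(denom / var))
    return math.sqrt(z_sq)

if __name__ == "__main__":
    for n in (100, 1_000, 5_000, 20_000):
        print(n, round(msprt_critical_z(n), 2))
```

The required z stays above the fixed-horizon cutoff of 1.96 at every sample size. That extra margin is the price of being allowed to look continuously.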
Validation against fixed-horizon
Same simulation as before, but with mSPRT:
def msprt_fpr(n_max=20000, p_a=0.10, p_b=0.10, alpha=0.05):
    sprt = mSPRT(tau=0.02, alpha=alpha)
    a = np.random.binomial(1, p_a, n_max)
    b = np.random.binomial(1, p_b, n_max)
    for i in range(n_max):
        if sprt.decide(a[i], b[i]) == 'reject_null':
            return True
    return False

fpr = sum(msprt_fpr() for _ in range(10_000)) / 10_000
print(f"mSPRT FPR with continuous peeking: {fpr:.3f}")
# typically at or below 0.05: controlled at nominal alpha, and conservative at a finite horizon
The FPR stays at or below the level you advertised, while you remain free to stop the test the moment evidence is conclusive. That's the entire promise of sequential testing in one paragraph.
Always Valid Inference (AVI)
AVI generalizes mSPRT and produces confidence sequences — confidence intervals that are valid simultaneously at every sample size. We use the gamma-exponential mixture from Howard et al. (2021), which gives tighter intervals in the regime we care about (medium effect, medium n).
import numpy as np

def gamma_exp_radius(s2, n, alpha=0.05, rho=1.4):
    """Closed-form, iterated-logarithm-style confidence radius.

    Note: this is a closed-form approximation in the spirit of Howard et al.
    (2021). The exact gamma-exponential mixture boundary has no closed form
    and is solved numerically.
    """
    # s2 = running estimate of variance; n = sample size
    # s2*n + rho plays the role of accumulated variance plus a tuning offset
    a = np.log(2 / alpha) + 1.5 * np.log(np.log(np.e * (s2*n + rho)))
    b = (s2*n + rho) / n**2
    return np.sqrt(2 * a * b)

class AlwaysValidCI:
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.n = 0
        self.sum = 0.0
        self.sum_sq = 0.0

    def update(self, x):
        self.n += 1
        self.sum += x
        self.sum_sq += x*x

    def interval(self):
        if self.n < 30: return None  # too few observations for a stable variance
        mean = self.sum / self.n
        # Unbiased sample variance from running sums
        var = (self.sum_sq - self.sum**2/self.n) / max(self.n-1, 1)
        r = gamma_exp_radius(var, self.n, self.alpha)
        return (mean - r, mean + r)
For a difference-in-means test you maintain two of these, one per arm and each at level α/2, and combine them into an interval on diff = mean_b - mean_a (or run a single sequence on the per-user paired difference if your randomization unit allows it). Reject the null when 0 is outside the interval.
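One conservative way to combine the two per-arm sequences is interval arithmetic with a union bound: if each arm's interval is built at level α/2, the combined interval covers the true difference with probability at least 1 - α at every sample size. The sketch below assumes that construction. The helper names diff_interval and excludes_zero are illustrative.

```python
def diff_interval(ci_a, ci_b):
    """Interval on mean_b - mean_a from two per-arm (lo, hi) intervals.

    Assumption: each input interval was built at level alpha/2, so the
    union bound gives overall coverage of at least 1 - alpha.
    """
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    # Smallest possible difference is lo_b - hi_a, largest is hi_b - lo_a
    return (lo_b - hi_a, hi_b - lo_a)

def excludes_zero(interval):
    """True when the whole interval lies strictly on one side of zero."""
    lo, hi = interval
    return lo > 0 or hi < 0

if __name__ == "__main__":
    combined = diff_interval((0.09, 0.11), (0.12, 0.14))
    print(combined, excludes_zero(combined))
```

This is wider than an interval built directly on the difference, so treat it as a safe default, not the tightest option.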
How we deploy this in production
Our standard stack for a CRO retainer client looks like this:
- Randomization: edge-side cookie split, hashed user identifier. We use Cloudflare Workers for server-side GA4, and the same Worker handles experiment assignment.
- Event collection: every exposure and every conversion lands in BigQuery within ~2 seconds via the same SSGTM pipeline.
- Analysis: a scheduled query runs the AVI procedure every 30 minutes against the running experiment, materializes the current confidence sequence, and posts to Slack when 0 falls outside the interval (or when the interval narrows below a pre-registered "futility" threshold).
- Stopping rule: pre-registered before the test starts. We document the prior, the alpha, the minimum sample size for any decision, and the futility cutoff. Without pre-registration, you can still p-hack with sequential methods.
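The randomization step can be sketched as a deterministic hash of the user identifier, salted with the experiment ID so that assignments are independent across experiments. This is a Python illustration of the idea, not the Worker code itself. The function name assign_arm and the choice of SHA-256 are our assumptions.

```python
import hashlib

def assign_arm(user_id: str, experiment_id: str, split=(0.5, 0.5)) -> int:
    """Map a user to an arm index, stable across requests.

    Assumption: SHA-256 of "experiment_id:user_id" is treated as a uniform
    draw on [0, 1); `split` lists the share of traffic for each arm.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    cumulative = 0.0
    for arm, share in enumerate(split):
        cumulative += share
        if u < cumulative:
            return arm
    return len(split) - 1  # guard against floating-point rounding in the shares

if __name__ == "__main__":
    print(assign_arm("user-123", "chk-2026-q2-promptpay-default"))
```

Because the assignment depends only on the two identifiers, the same user always lands in the same arm without any stored state.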
Pre-registration template
This is the YAML we drop into the experiment folder before any test goes live:
experiment_id: chk-2026-q2-promptpay-default
hypothesis: |
  Setting PromptPay as the default checkout method (vs LINE Pay)
  increases conversion-to-paid by ≥ 0.8pp on mobile sessions.
randomization_unit: user_pseudo_id
arm_split: [0.5, 0.5]
primary_metric: conversion_rate
analysis_method: AVI_gamma_exp
alpha: 0.05
prior_tau: 0.02
min_sample_size: 8000
futility_radius: 0.005
guardrails:
  - bounce_rate_lift_max: 0.02
  - aov_drop_max: 0.05
pre_registered_at: 2026-04-01T10:00:00+07:00
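The stopping rule these fields encode might look like the sketch below. It assumes min_sample_size is counted per arm and that futility_radius is compared against the half-width of the interval on the difference. Neither interpretation is spelled out in the template, and the function name stopping_decision is illustrative.

```python
def stopping_decision(n_per_arm, interval, min_sample_size=8000, futility_radius=0.005):
    """Turn the current confidence sequence into a pre-registered decision.

    Assumptions: `interval` is (lo, hi) on mean_b - mean_a, or None during
    burn-in; `min_sample_size` is per arm; futility means the half-width
    has shrunk below `futility_radius` while 0 is still inside.
    """
    if interval is None or n_per_arm < min_sample_size:
        return "continue"
    lo, hi = interval
    if lo > 0:
        return "ship_variant"
    if hi < 0:
        return "keep_control"
    if (hi - lo) / 2 < futility_radius:
        return "stop_for_futility"
    return "continue"

if __name__ == "__main__":
    print(stopping_decision(14_200, (0.008, 0.041)))
```

Keeping the rule as a pure function of the pre-registered values makes it easy to audit that no threshold moved after launch.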
Common objections and our answers
"This is overkill for our 200-visitor-a-day site." Below ~5K visitors per arm per week, you don't have the power to run informative tests of any kind. Sequential testing doesn't fix small-n problems — but it also doesn't make them worse. If your traffic is too low, run fewer, longer tests at fixed horizon and don't peek (or do CRO via UX research instead, which we do for low-traffic clients).
"My team won't understand confidence sequences." They don't have to. The Slack alert says either "ship variant B (lift 2.4%, CI [+0.8, +4.1])" or "no decision yet, n=14,200." That's the entire UX. The math lives in the BigQuery view.
"Optimizely / VWO already does this." Yes — they implement Stats Engine which is mSPRT-flavored. If you're using their platform, you're already covered. If you're rolling your own (Cloudflare Worker + BigQuery + dbt is increasingly common — we documented the SSGTM stack separately), you need to implement the math.
"Bayesian methods don't have this problem." Mostly true. Beta-Binomial credible intervals are valid under monitoring. We use Bayesian methods for binary metrics where we have informative priors (e.g., from our scraper data on similar Thai checkout flows — see checkout patterns). For continuous metrics or when priors are weak, AVI is what we reach for.
Pairing with Markov attribution and SSGTM
Sequential testing is one of three legs of our analytics stool. The other two: Markov attribution tells us which channels deserve testing budget; server-side GA4 gives us the low-latency event stream that makes continuous analysis feasible. None of the three works as well alone. We covered the full picture in case studies — every CRO retainer ships all three by month two.
For Bangkok teams without the in-house statistical chops, we also partner with Bluewich for the engineering side and SitPlay Media for the upstream content that drives enough traffic to make tests powered. SEO Agency Bangkok handles organic, which is the cheapest source of test traffic that matches buying intent.
What to read next
If you want to go deeper on the math: Howard, Ramdas, McAuliffe, Sekhon (2021), "Time-uniform, nonparametric, nonasymptotic confidence sequences." Read it twice. If you want the practitioner version: Johari, Koomen, Pekelis, Walsh, "Peeking at A/B Tests" — the original Optimizely Stats Engine paper. We hand both to every analyst we hire.
If you have an active test running with peeking, retroactively re-analyze it with AVI before shipping. We've saved clients from rolling out worse-than-control variants three times in 2026 alone using exactly this audit.