Causal Inference & Experimentation

Did the new feature really move the needle?

Your team ships a new onboarding flow and the test shows +12% signups on day three. Do you roll it out? Most product A/B tests go wrong in one of four ways: they are misread, stopped too early, under-powered, or shipped on a "win" too small to matter. Here's the discipline that turns a test into a decision you can defend, worked through on a flow with a 12% baseline signup rate.

1. Design it first: how long must it run?

You can't decide duration after you peek; you commit before. The smaller the effect you want to catch, the more traffic you need: the required sample grows with the inverse square of the effect, so halving the detectable lift roughly quadruples the traffic. Pick the minimum lift worth shipping and your daily traffic; the power calculation then gives you the sample size and run time (80% power, 5% false-positive rate).

Calculator inputs: smallest lift worth detecting; visitors per day (both arms).

Calculator outputs: sample per arm; total visitors; run time.

Chart: sample per arm needed vs. effect size (● = your current target).
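
The sizing step can be sketched in a few lines. This is a minimal sketch, assuming a two-sided two-proportion z-test with the normal approximation; the function names and the 4,000 visitors per day are illustrative assumptions, not the calculator behind this page.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, rel_lift, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect a relative lift over the baseline rate.

    Assumes a two-sided two-proportion z-test (normal approximation).
    """
    p1 = baseline
    p2 = baseline * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 5% false-positive rate
    z_beta = NormalDist().inv_cdf(power)           # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def run_time_days(n_per_arm, visitors_per_day):
    """Days to fill both arms when daily traffic is split across them."""
    return ceil(2 * n_per_arm / visitors_per_day)


if __name__ == "__main__":
    # 12% baseline from the text; +5% relative lift; hypothetical daily traffic.
    n = sample_size_per_arm(baseline=0.12, rel_lift=0.05)
    print(n, "visitors per arm,", run_time_days(n, visitors_per_day=4000), "days")
```

With the 12% baseline and a +5% relative target, this comes out near 47,000 visitors per arm. Asking for half the lift needs about four times as many.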

2. Don't peek: early "significance" is mostly noise

Here's the trap that ruins more product tests than any other. These are A/A tests: both arms are identical, so the true effect is zero. Yet if you check every day and ship the moment p < 0.05, you'll declare a fake winner far more often than the nominal 5%. Each line is one A/A test's running p-value; the more times you look, the more chances you have to get unlucky.

Control: times you check the results.

Chart: gray lines = running p-value of A/A tests (true effect = 0); reference line = p = 0.05 "significance" line.
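
The peeking effect is easy to reproduce. The sketch below simulates A/A tests under a simplifying assumption: each look adds an equal batch of traffic, so under the null the running z-statistic is a sum of independent standard normal increments divided by the square root of the number of looks. The function name and defaults are hypothetical, and this is not the simulation behind the chart.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, n_tests=2000, alpha=0.05, seed=0):
    """Share of A/A tests that cross p < alpha at any of n_looks checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fake_wins = 0
    for _ in range(n_tests):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one more batch, true effect = 0
            z = running_sum / sqrt(look)        # z-statistic on all data so far
            if abs(z) > z_crit:                 # "significant": stop and ship
                fake_wins += 1
                break
    return fake_wins / n_tests


if __name__ == "__main__":
    for looks in (1, 5, 10, 20):
        print(looks, "looks:", false_positive_rate(looks))
```

A single look at the planned end stays close to 5%; checking repeatedly and stopping at the first crossing pushes the rate well above that.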

3. Read it right: "significant" isn't "worth shipping"

A test answers two different questions. Is the effect real (does the interval clear zero)? And is it big enough to matter (does it clear your ship bar)? A tiny, certain win and a huge, uncertain one call for different decisions. Move the true effect and the sample size and watch the verdict change.

Controls: true lift; sample per arm.

Chart: measured lift ± 95% CI, with reference lines at no effect and at the ship bar (+5% relative).
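
The two questions can be read straight off a confidence interval. This is a minimal sketch, assuming a Wald interval for the difference in signup rates, converted to a relative lift by dividing by the control rate (which treats the control rate as fixed). The function name and verdict labels are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def read_test(conv_a, n_a, conv_b, n_b, ship_bar_rel=0.05, confidence=0.95):
    """Relative lift of B over A with a CI, plus a rough verdict."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Approximation: scale the absolute interval by the control rate.
    rel_lift = diff / p_a
    rel_lo = (diff - z * se) / p_a
    rel_hi = (diff + z * se) / p_a
    if rel_lo > ship_bar_rel:
        verdict = "real and above the ship bar"
    elif rel_lo > 0:
        verdict = "real, but not clearly above the ship bar"
    else:
        verdict = "inconclusive"
    return {"lift": rel_lift, "ci": (rel_lo, rel_hi), "verdict": verdict}


if __name__ == "__main__":
    # Made-up counts around the 12% baseline, for illustration only.
    print(read_test(1200, 10_000, 1400, 10_000))
    print(read_test(12_000, 100_000, 12_350, 100_000))
    print(read_test(120, 1_000, 130, 1_000))
```

The second call is the tiny, certain win: the interval clears zero but not the bar. The third is the uncertain one: the point estimate looks large, but the interval still includes zero.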

4. From result to decision: size the bet

The last mile is business, not statistics: translate the lift (and its uncertainty) into the outcome leadership cares about, then make the call. Same test result as above; set how much traffic hits this flow in a year.

Control: annual visitors to this flow.

Output: projected extra signups per year (95% CI).
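
The translation into business terms is plain arithmetic: extra signups = annual visitors × baseline rate × relative lift, applied to both ends of the interval as well as the point estimate. A minimal sketch, with a hypothetical function name and made-up inputs, assuming the measured lift holds steady across the year:

```python
def extra_signups_per_year(annual_visitors, baseline, rel_lift_ci):
    """Project (low, point, high) extra signups from a relative-lift interval.

    rel_lift_ci is (lower, point estimate, upper), each a relative lift.
    """
    return tuple(round(annual_visitors * baseline * lift) for lift in rel_lift_ci)


if __name__ == "__main__":
    # Hypothetical: 1,000,000 visitors a year, 12% baseline, lift of +5% (CI +1% to +9%).
    low, point, high = extra_signups_per_year(1_000_000, 0.12, (0.01, 0.05, 0.09))
    print(f"{point} extra signups per year (95% CI {low} to {high})")
```

Reporting the range alongside the point estimate keeps the uncertainty visible in the number leadership sees.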

Work together on experimentation

What a typical engagement looks like, end to end.

What you get

A pre-registered experiment, geo-test, or quasi-experiment design
Power analysis so the test is sized to actually answer the question
An analysis plan agreed before the data comes in
A results readout with effect sizes, confidence intervals, and a ship / iterate call

How it works

Free 30-minute call to frame the decision
Design and power analysis, pre-registered
Run the test, with monitoring and no peeking
Analysis and a clear decision memo
