A/B testing isn’t rocket science. But it is merciless with anyone who bends the statistical rules. After 10 years analyzing tracking data, here’s what I’ve learned about running tests that actually produce reliable results.
1. Calculate Your Sample Size BEFORE Starting
Never launch a test without knowing how many visitors you need. This is the most common mistake I see.
Why it matters: If your sample is too small, you won’t detect real differences. If you stop too early because you see “significance,” you’re likely seeing noise, not signal.
How to calculate it:
The formula is: n = 2 × [σ² × (Z₁₋α/₂ + Z₁₋β)²] / δ²
Where:
- σ² = p₀(1-p₀) where p₀ is your baseline conversion rate
- Z₁₋α/₂ = 1.96 for 95% confidence
- Z₁₋β = 0.84 for 80% power (or 1.28 for 90%)
- δ = minimum detectable effect (MDE)
Practical example:
- Baseline: 5% conversion rate
- MDE: you want to detect +1 percentage point (i.e., 5% → 6%)
- Confidence: 95%, Power: 80%
You need approximately 7,400 visitors per group.
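If you prefer to script this rather than use a calculator, here’s a minimal Python sketch of the formula above (the function name and rounding are mine, not from any standard library). It reproduces the ~7,400 figure; dedicated calculators use a slightly more exact variance term and will typically land a bit higher, around 8,000.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Per-group sample size: n = 2 * sigma^2 * (z_{1-a/2} + z_{1-b})^2 / mde^2,
    with sigma^2 = p0 * (1 - p0) as in the formula above."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    sigma_sq = baseline * (1 - baseline)
    return ceil(2 * sigma_sq * (z_alpha + z_beta) ** 2 / mde ** 2)

# Baseline 5%, detect +1 percentage point, 95% confidence, 80% power
print(sample_size_per_group(0.05, 0.01))  # ~7,457 visitors per group
```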
Tools to use:
- Evan Miller’s Sample Size Calculator (evanmiller.org/ab-testing/sample-size.html)
- Optimizely’s Sample Size Calculator
Critical point: The lower your expected improvement, the larger your sample size needs to be. If you’re testing subtle changes, you’ll need massive traffic.
2. Test for a Minimum of 2 Weeks
Even if you hit statistical significance after 3 days, keep going.
Why 2 weeks minimum?
- Business cycles: Users behave differently on weekdays vs weekends. You need to capture at least 2 complete weekly cycles to smooth out these patterns.
- Ghost conversions: A visitor might start their buying journey before your test begins, return during the test, and convert. This conversion gets counted as a “success” but it might have happened without your variation. The 14-day rule helps filter these false positives.
- Traffic source variations: Different traffic sources (organic, paid, email) arrive at different times and behave differently.
Maximum duration: 6-8 weeks
Beyond 6-8 weeks, data starts getting muddy:
- Cookies get deleted
- User patterns shift
- External factors accumulate
- You introduce new variables
Exception for high-traffic sites: If you have massive traffic and hit your sample size in 2-3 days, you still shouldn’t stop immediately. Even with sufficient sample size, you need to validate that the pattern holds across different days and user segments.
Sources confirm this:
- AB Tasty research shows that stopping tests early is one of the main reasons for false positives
- Kameleoon recommends 2-4 weeks as standard practice
- The consensus across the industry is clear: patience pays off
3. Verify Your Sample Ratio Mismatch (SRM)
This is a critical check that most people skip.
What is SRM?
Sample Ratio Mismatch occurs when your actual split doesn’t match your intended split. For example:
- You configure 50/50
- You observe: Group A = 5,000 visitors, Group B = 2,100 visitors
- This is a major SRM
Why it matters:
Microsoft Research found that approximately 6% of A/B tests have SRM issues. Even a small imbalance can:
- Distort conversion rates
- Make you detect false uplifts
- Mask real changes
- Completely reverse test outcomes
How to detect SRM:
Use a Chi-squared (χ²) goodness-of-fit test:
χ² = Σ [(Observed – Expected)² / Expected]
Threshold: If p-value < 0.01, you have a statistically significant SRM.
Example calculation:
- Expected: 50/50 split with 10,000 total visitors = 5,000 each
- Observed: Group A = 5,200, Group B = 4,800
- χ² = [(5200-5000)²/5000] + [(4800-5000)²/5000] = 16
- With 1 degree of freedom, the critical value at the p < 0.01 threshold is 6.63
- In this case: 16 > 6.63 (p ≈ 0.00006) → SRM detected
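To automate the check, here’s a minimal sketch using SciPy’s chi-squared goodness-of-fit test on the numbers from this example, with the 0.01 threshold recommended above:

```python
from scipy.stats import chisquare

observed = [5_200, 4_800]           # visitors actually bucketed into A and B
expected = [sum(observed) / 2] * 2  # intended 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.0f}, p = {p_value:.5f}")  # chi2 = 16, p = 0.00006
if p_value < 0.01:
    print("SRM detected -- do not trust the results of this test")
```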
Common causes of SRM:
- Broken randomization: Users aren’t being assigned correctly to groups
- Timing issues: Variations don’t start at the same time
- Bot filtering: Automatic systems incorrectly removing real users
- Redirect problems: If variation B uses a redirect that crashes or is slow
- Cookie/tracking issues: Users blocking or deleting cookies
Tools:
- Lukas Vermeer’s SRM Checker: lukasvermeer.nl/srm/
- Chrome extension for automatic SRM detection
- Most A/B testing platforms now include built-in SRM alerts
What to do if SRM is detected:
Don’t trust your results. The standard approach is:
- Identify the root cause
- Fix it
- Rerun the test
Your data is compromised and any conclusions you draw are potentially invalid.
4. Never Stop a Test Just Because It’s “Significant”
This is called p-hacking (or data peeking) and it’s one of the worst mistakes in A/B testing.
The problem:
If you continuously monitor your test and stop as soon as you see p < 0.05, you’re dramatically increasing your false positive rate. You might see “significance” due to random fluctuations, not real effects.
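To see how bad the inflation gets, here is a toy simulation of A/A tests, where both groups have an identical 5% conversion rate, so any “winner” is by definition a false positive. Every parameter (peek frequency, traffic, baseline) is invented for illustration:

```python
import numpy as np

def peeking_false_positive_rate(n_sims=1_000, n_per_group=5_000, peek_every=250,
                                p=0.05, z_crit=1.96, seed=0):
    """Simulate A/A tests (identical conversion rate in both groups) and stop
    at the first peek where |z| > z_crit. Returns the share of false 'winners'."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_sims):
        a = np.cumsum(rng.random(n_per_group) < p)  # running conversions, group A
        b = np.cumsum(rng.random(n_per_group) < p)  # running conversions, group B
        for i in range(peek_every, n_per_group + 1, peek_every):
            pooled = (a[i - 1] + b[i - 1]) / (2 * i)
            se = np.sqrt(pooled * (1 - pooled) * 2 / i)
            if se > 0 and abs(a[i - 1] - b[i - 1]) / i > z_crit * se:
                false_positives += 1
                break
    return false_positives / n_sims

print(peeking_false_positive_rate())  # roughly 0.25 instead of the nominal 0.05
```

With 20 peeks per test, roughly a quarter of these no-difference tests get declared “significant” at some point.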
Real-world example from ConversionXL:
They showed a test where Variation 1 appeared to be losing badly at the start. If they had stopped early, they would have declared it a loser. By waiting for the predetermined sample size, Variation 1 ended up winning by over 25%.
The rule:
Wait for BOTH:
- Your predetermined sample size
- Your minimum test duration (2 weeks)
Why this is hard:
Stakeholders get impatient. Business pressure builds. You see early “wins” and want to ship. Resist. The cost of implementing a change that doesn’t actually work (or worse, hurts conversion) is far higher than waiting a few more days.
Sequential testing alternative:
If you absolutely need to check results early, use proper sequential testing methods (like Sequential Probability Ratio Test) that are designed for continuous monitoring. But don’t just “peek” at standard A/B tests.
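For a sense of what “proper sequential testing” looks like, here is a sketch of a classical one-sample Wald SPRT that monitors a variant’s conversions against a known baseline rate. Production platforms use two-sample or mixture variants (e.g., mSPRT), so treat this function and its parameters as purely illustrative:

```python
from math import log

def sprt_decision(outcomes, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Classical Wald SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 outcomes.
    alpha / beta are the target Type I / Type II error rates."""
    upper = log((1 - beta) / alpha)   # crossing above -> accept H1 (uplift is real)
    lower = log(beta / (1 - alpha))   # crossing below -> accept H0 (no uplift)
    llr = 0.0
    for x in outcomes:                # x = 1 for a conversion, 0 otherwise
        llr += x * log(p1 / p0) + (1 - x) * log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1"
        if llr <= lower:
            return "accept H0"
    return "continue"                 # keep collecting data
```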
5. Aim for 95% Confidence and 80% Power Minimum
These are two different but equally important metrics.
Confidence Level (typically 95%):
This controls how often random noise alone will be flagged as a real difference.
- 95% confidence means: if there’s truly no difference, you’ll incorrectly detect one only 5% of the time (Type I error)
- 99% confidence is more conservative but requires larger samples
- 90% is sometimes acceptable for low-risk tests
Statistical Power (typically 80%):
This is the probability of detecting a real effect of at least your MDE, if one exists.
- 80% power means: if there IS a real difference, you’ll detect it 80% of the time
- Low power = high risk of false negatives (Type II error)
- 90% power is better but requires larger samples
The calculation:
For a conversion rate test with:
- Baseline: 5% conversion
- MDE: +1 percentage point
- Confidence: 95%
- Power: 80%
You need approximately 7,400 visitors per variant.
Common mistake:
Many people only check confidence and ignore power. This leads to “no difference” conclusions when there actually is a difference—you just didn’t have enough data to detect it.
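A good habit when a test comes back “not significant” is to ask how much power it actually had. Here is a minimal sketch using the same σ² = p₀(1-p₀) shortcut as the sample-size formula in section 1; the 2,000-visitor scenario and the function name are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(baseline, mde, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test, using the same
    sigma^2 = p0 * (1 - p0) shortcut as the sample-size formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 * baseline * (1 - baseline) / n_per_group)
    return 1 - NormalDist().cdf(z_alpha - mde / se)

# Stopping a 5% -> 6% test at only 2,000 visitors per group:
print(f"{achieved_power(0.05, 0.01, 2000):.0%}")  # ~31%: most real effects get missed
print(f"{achieved_power(0.05, 0.01, 7400):.0%}")  # ~80%: the level you planned for
```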
6. Perform Sanity Checks on Invariant Metrics
Before trusting your results, verify that metrics that shouldn’t change haven’t changed.
What are invariant metrics?
Metrics that should theoretically be unaffected by your test variation.
Examples:
- If you’re testing a button color on a product page, your total pageviews shouldn’t change
- If you’re testing checkout flow, your add-to-cart rate (before checkout) shouldn’t change
- Number of sessions per user should remain stable
Why this matters:
If invariant metrics change, it’s a red flag that something is wrong with your test implementation:
- Technical bugs
- Tracking errors
- Targeting issues
- Cookie problems
What to check:
- No external shocks: Were there any campaigns, bugs, traffic spikes, or events during the test?
- Stable traffic sources: Did the mix of traffic sources remain consistent?
- No changes to other parts of the site: Nothing else was deployed that could affect results?
- Consistent user experience: Both variations load at the same speed? No errors?
Example of catching implementation bugs:
At one company, they noticed that pageviews increased in the test variation. This made no sense for a button color test. Investigation revealed that the variation was causing an extra page reload due to a JavaScript error. Without this sanity check, they would have trusted corrupted data.
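One way to make this check routine is to run a goodness-of-fit test on an invariant count and treat a small p-value as an implementation alarm rather than a win. A rough sketch with invented pageview numbers (it ignores per-user correlation, so use it as a first-pass screen, not a verdict):

```python
from scipy.stats import chisquare

# Hypothetical numbers: traffic split and an invariant metric (total pageviews)
visitors  = {"A": 5_050, "B": 4_950}
pageviews = {"A": 20_400, "B": 21_900}

total_pv, total_v = sum(pageviews.values()), sum(visitors.values())
observed = [pageviews["A"], pageviews["B"]]
expected = [total_pv * visitors["A"] / total_v, total_pv * visitors["B"] / total_v]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.01:
    print(f"Invariant metric moved (chi2 = {stat:.1f}, p = {p_value:.2g}) -- check the implementation")
```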
7. Understand Your Test Statistics (But Keep It Simple)
You don’t need to be a statistician, but you should understand the basics.
For conversion rate tests:
The z-statistic is: z = (p₁ – p₂) / SE
Where SE (Standard Error) is: SE = √[p(1-p) × (1/n₁ + 1/n₂)]
- p = pooled conversion rate
- p₁, p₂ = conversion rates for each group
- n₁, n₂ = sample sizes
If |z| > 1.96 → significant at 95% confidence level
Chi-squared test alternative:
χ² = Σ[(Observed – Expected)² / Expected]
Both methods work. Chi-squared is often easier for contingency tables.
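For reference, here is a minimal Python implementation of the z-test above; the traffic and conversion counts are invented. For a 2×2 table, the chi-squared statistic is simply z², so both routes give the same answer:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for conversion rates with a pooled standard error."""
    p1, p2 = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented numbers: control converts 375/7,500 (5.0%), variant 458/7,500 (6.1%)
z, p = two_proportion_z_test(375, 7_500, 458, 7_500)
print(f"z = {z:.2f}, p = {p:.4f}")  # z ~ -2.96, p ~ 0.003 -> significant at 95%
```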
Most important principle:
Statistical significance ≠ business impact
A test can be statistically significant (p < 0.05) but have such a small effect that it’s not worth implementing. Always consider:
- Is the lift meaningful for the business?
- What’s the implementation cost?
- What are the risks?
Additional Best Practices
Avoid multiple testing problems:
If you test multiple variations or multiple metrics, adjust your significance threshold. The Bonferroni correction is conservative but simple: divide your alpha by the number of comparisons (e.g., three variants against one control at α = 0.05 means testing each at 0.05 / 3 ≈ 0.017).
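If you already track p-values in Python, statsmodels can apply the correction for you; the three p-values below are invented for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Invented p-values from three variants tested against the same control
p_values = [0.040, 0.012, 0.200]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p_raw, p_adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p_raw:.3f}  adjusted p = {p_adj:.3f}  significant = {bool(significant)}")
# Only the 0.012 result survives the correction (0.012 * 3 = 0.036 < 0.05)
```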
One change at a time:
Test one variable at a time when possible. If you change both the button color AND the copy, you won’t know which drove the change.
Document everything:
- Hypothesis
- Sample size calculation
- Test duration
- Implementation details
- Results and decision
Pre-register your analysis plan:
Decide ahead of time:
- What metrics you’ll measure
- What sample size you need
- How long you’ll run the test
- What will constitute success
This prevents post-hoc rationalization and HARKing (Hypothesizing After Results are Known).
Common Pitfalls to Avoid
- Testing too many things at once: Multiple concurrent tests can interact
- Changing the test mid-flight: Don’t modify variations or targeting during a test
- Trusting small samples: Even if “significant,” small samples are unreliable
- Ignoring seasonality: Don’t run tests that span major holidays or events
- Not documenting learnings: Failed tests teach you as much as winners
The Bottom Line
A/B testing isn’t complicated, but it’s unforgiving of those who cut corners on statistical rules.
The four most critical rules:
- Calculate sample size before starting
- Test for minimum 2 weeks
- Check for Sample Ratio Mismatch
- Never stop early just because it’s “significant”
Get these right, and your tests will be reliable. Skip them, and you’re just generating noise.
I’m not a statistician—I used AI to verify my sources and be as precise as possible. But these principles come from 10 years of practical experience in tracking and analytics across companies like SFR, TheFork, Expedia Group, and Amex GBT.
Sources
- AB Tasty – How Long Should You Run an A/B Test? https://www.abtasty.com/blog/how-long-run-ab-test/
- AB Tasty – The Truth Behind the 14-Day A/B Test Period https://www.abtasty.com/blog/truth-behind-the-14-day-ab-test-period/
- AB Tasty – Sample Ratio Mismatch https://www.abtasty.com/blog/sample-ratio-mismatch/
- Kameleoon – Are You Stopping Your A/B Tests Too Early? https://www.kameleoon.com/blog/stopping-ab-tests-too-early
- Microsoft Research – Diagnosing Sample Ratio Mismatch in A/B Testing https://www.microsoft.com/en-us/research/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/
- Convert.com – How to Prevent Sample Ratio Mismatch in Your A/B Tests https://www.convert.com/blog/a-b-testing/sample-ratio-mismatch-srm-guide
- GuessTheTest – The Ultimate Guide to Correctly Calculating A/B Testing Sample Size and Test Duration https://guessthetest.com/calculating-sample-size-in-a-b-testing-everything-you-need-to-know/
- VWO – Sample Ratio Mismatch Glossary https://vwo.com/glossary/sample-ratio-mismatch/
- Analytics Toolkit – What is Sample Ratio Mismatch (SRM)? https://www.analytics-toolkit.com/glossary/sample-ratio-mismatch/
- Evan Miller – Sample Size Calculator https://www.evanmiller.org/ab-testing/sample-size.html
- Wikipedia – Sample Ratio Mismatch https://en.wikipedia.org/wiki/Sample_ratio_mismatch