A/B testing isn’t rocket science. But it is merciless with anyone who bends the statistical rules. After 10 years of analyzing tracking data, here’s what I’ve learned about running tests that actually produce reliable results.

1. Calculate Your Sample Size BEFORE Starting

Never launch a test without knowing how many visitors you need. This is the most common mistake I see.

Why it matters: If your sample is too small, you won’t detect real differences. If you stop too early because you see “significance,” you’re likely seeing noise, not signal.

How to calculate it:

The formula is: n = 2 × [σ² × (Z₁₋α/₂ + Z₁₋β)²] / δ²

Where:

  σ²: the variance of your metric (for a conversion rate, p(1-p))
  Z₁₋α/₂: the z-score for your confidence level (1.96 at 95%)
  Z₁₋β: the z-score for your statistical power (0.84 at 80%)
  δ: the minimum detectable effect (the smallest absolute difference you want to detect)

Practical example:

You need approximately 7,400 visitors per group.
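
If you’d rather not work the formula by hand, here’s a minimal Python sketch of the same calculation. The baseline rate and minimum detectable effect below are illustrative assumptions (they don’t reproduce the 7,400 figure above), so swap in your own numbers.

```python
from scipy.stats import norm

def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed per group for a two-sided test of proportions.

    baseline_rate: current conversion rate (e.g. 0.05 for 5%)
    mde_abs:       minimum detectable effect, in absolute terms (e.g. 0.01)
    """
    z_alpha = norm.ppf(1 - alpha / 2)    # 1.96 for 95% confidence
    z_beta = norm.ppf(power)             # 0.84 for 80% power
    p_bar = baseline_rate + mde_abs / 2  # pooled rate under the alternative
    variance = p_bar * (1 - p_bar)       # σ² for a conversion rate
    return 2 * variance * (z_alpha + z_beta) ** 2 / mde_abs ** 2

# Illustrative assumption: 5% baseline, and we want to detect +1 point (5% -> 6%).
print(round(sample_size_per_group(0.05, 0.01)))  # ≈ 8,159 visitors per group here
```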

Tools to use:

Critical point: The lower your expected improvement, the larger your sample size needs to be. If you’re testing subtle changes, you’ll need massive traffic.

2. Test for a Minimum of 2 Weeks

Even if you hit statistical significance after 3 days, keep going.

Why 2 weeks minimum?

  1. Business cycles: Users behave differently on weekdays vs weekends. You need to capture at least 2 complete weekly cycles to smooth out these patterns.
  2. Ghost conversions: A visitor might start their buying journey before your test begins, return during the test, and convert. This conversion gets counted as a “success” but it might have happened without your variation. The 14-day rule helps filter these false positives.
  3. Traffic source variations: Different traffic sources (organic, paid, email) arrive at different times and behave differently.

Maximum duration: 6-8 weeks

Beyond 6-8 weeks, data starts getting muddy: cookies expire or get deleted, so returning visitors can be re-assigned and pollute both groups, and external factors (seasonality, campaigns, other product changes) increasingly blur the effect you’re trying to measure.

Exception for high-traffic sites: If you have massive traffic and hit your sample size in 2-3 days, you still shouldn’t stop immediately. Even with sufficient sample size, you need to validate that the pattern holds across different days and user segments.

Industry sources back up this recommendation (see the Sources section at the end).

3. Verify Your Sample Ratio Mismatch (SRM)

This is a critical check that most people skip.

What is SRM?

Sample Ratio Mismatch occurs when your actual split doesn’t match your intended split. For example: you configured a 50/50 split, but 52% of visitors ended up in control and only 48% in the variation.

Why it matters:

Microsoft Research found that approximately 6% of A/B tests have SRM issues. Even a small imbalance can bias your results and invalidate the comparison, because whatever broke the split has usually also removed a non-random slice of users from one group.

How to detect SRM:

Use a Chi-squared (χ²) goodness-of-fit test:

χ² = Σ [(Observed – Expected)² / Expected]

Threshold: If p-value < 0.01, you have a statistically significant SRM.

Example calculation:
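
As a minimal sketch of what that looks like in practice, here’s the chi-squared check in Python with purely illustrative counts (a planned 50/50 split over 100,000 visitors):

```python
from scipy.stats import chisquare

# Illustrative counts: we planned a 50/50 split over 100,000 visitors,
# but observed 50,600 in control and 49,400 in the variation.
observed = [50_600, 49_400]
expected = [50_000, 50_000]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.1f}, p-value = {p_value:.5f}")
# chi-squared = 14.4, p-value ≈ 0.00015 -> well below 0.01, so this split has an SRM.
```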

Common causes of SRM:

  1. Broken randomization: Users aren’t being assigned correctly to groups
  2. Timing issues: Variations don’t start at the same time
  3. Bot filtering: Automatic systems incorrectly removing real users
  4. Redirect problems: if variation B uses a redirect that is slow or fails, some of its visitors never get tracked
  5. Cookie/tracking issues: Users blocking or deleting cookies

Tools:

What to do if SRM is detected:

Don’t trust your results. The standard approach is:

  1. Identify the root cause
  2. Fix it
  3. Rerun the test

Your data is compromised and any conclusions you draw are potentially invalid.

4. Never Stop a Test Just Because It’s “Significant”

This is called p-hacking (or data peeking) and it’s one of the worst mistakes in A/B testing.

The problem:

If you continuously monitor your test and stop as soon as you see p < 0.05, you’re dramatically increasing your false positive rate. You might see “significance” due to random fluctuations, not real effects.
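
To see how much peeking inflates the error rate, here’s a small simulation sketch: repeated A/A tests (no real difference) checked every 1,000 visitors and stopped at the first “significant” result. The traffic volumes and check frequency are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_tests=2_000, n_per_group=20_000,
                                check_every=1_000, base_rate=0.05):
    """Simulate A/A tests (identical true rates) checked repeatedly before the end."""
    false_positives = 0
    for _ in range(n_tests):
        # Cumulative conversion counts for two groups with the SAME true rate.
        conv_a = np.cumsum(rng.random(n_per_group) < base_rate)
        conv_b = np.cumsum(rng.random(n_per_group) < base_rate)
        for n in range(check_every, n_per_group + 1, check_every):
            p1, p2 = conv_a[n - 1] / n, conv_b[n - 1] / n
            pooled = (conv_a[n - 1] + conv_b[n - 1]) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * (2 / n))
            if se == 0:
                continue
            if abs(p1 - p2) / se > 1.96:   # looks "significant" at this peek
                false_positives += 1
                break
    return false_positives / n_tests

# A single look at the end would give ~5% false positives by construction.
# With 20 interim looks per test, the simulated rate comes out several times higher.
print(peeking_false_positive_rate())
```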

Real-world example from ConversionXL:

They showed a test where Variation 1 appeared to be losing badly at the start. If they had stopped early, they would have declared it a loser. By waiting for the predetermined sample size, Variation 1 ended up winning by over 25%.

The rule:

Wait for BOTH:

  1. Your predetermined sample size
  2. Your minimum test duration (2 weeks)

Why this is hard:

Stakeholders get impatient. Business pressure builds. You see early “wins” and want to ship. Resist. The cost of implementing a change that doesn’t actually work (or worse, hurts conversion) is far higher than waiting a few more days.

Sequential testing alternative:

If you absolutely need to check results early, use proper sequential testing methods (like Sequential Probability Ratio Test) that are designed for continuous monitoring. But don’t just “peek” at standard A/B tests.

5. Aim for 95% Confidence and 80% Power Minimum

These are two different but equally important metrics.

Confidence Level (typically 95%):

In plain terms: if there is truly no difference between the variations, a 95% confidence level means you will wrongly declare a winner only about 5% of the time.

Statistical Power (typically 80%):

This is the probability of detecting a real effect, assuming an effect of at least the size you planned for actually exists.

The calculation:

For a conversion rate test, plug the corresponding z-scores into the sample-size formula from point 1: Z₁₋α/₂ = 1.96 for 95% confidence and Z₁₋β = 0.84 for 80% power. With the same example inputs as point 1, that works out to approximately 7,400 visitors per variant.

Common mistake:

Many people only check confidence and ignore power. This leads to “no difference” conclusions when there actually is a difference—you just didn’t have enough data to detect it.

6. Perform Sanity Checks on Invariant Metrics

Before trusting your results, verify that metrics that shouldn’t change haven’t changed.

What are invariant metrics?

Metrics that should theoretically be unaffected by your test variation.

Examples: the number of users assigned to each group, page load time, the mix of traffic sources and devices, and engagement with parts of the site your change doesn’t touch.

Why this matters:

If invariant metrics change, it’s a red flag that something is wrong with your test implementation: a randomization or tracking bug, or a side effect of the variation you didn’t intend (like the extra page reload in the example below).

What to check:

  1. No external shocks: Were there any campaigns, bugs, traffic spikes, or events during the test?
  2. Stable traffic sources: Did the mix of traffic sources remain consistent?
  3. No changes to other parts of the site: Nothing else was deployed that could affect results?
  4. Consistent user experience: Both variations load at the same speed? No errors?

Example of catching implementation bugs:

At one company, they noticed that pageviews increased in the test variation. This made no sense for a button color test. Investigation revealed that the variation was causing an extra page reload due to a JavaScript error. Without this sanity check, they would have trusted corrupted data.
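
If you want to automate that kind of sanity check, here’s a minimal sketch comparing an invariant metric (pageviews per user) across groups with a Welch t-test; the simulated arrays stand in for whatever per-user export your analytics tool gives you.

```python
import numpy as np
from scipy.stats import ttest_ind

# Stand-in data: pageviews per user exported for each group.
rng = np.random.default_rng(7)
pageviews_control = rng.poisson(lam=4.0, size=50_000)
pageviews_variant = rng.poisson(lam=4.0, size=50_000)

stat, p_value = ttest_ind(pageviews_control, pageviews_variant, equal_var=False)
print(f"t = {stat:.2f}, p-value = {p_value:.3f}")

# For a button-color test, pageviews per user shouldn't differ between groups.
# A very small p-value here suggests an implementation problem
# (e.g. an extra reload in the variation), not a real user-behavior change.
```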

7. Understand Your Test Statistics (But Keep It Simple)

You don’t need to be a statistician, but you should understand the basics.

For conversion rate tests:

The z-statistic is: z = (p₁ – p₂) / SE

Where SE (Standard Error) uses the pooled conversion rate p (total conversions ÷ total visitors across both groups): SE = √[p(1-p) × (1/n₁ + 1/n₂)]

If |z| > 1.96 → significant at 95% confidence level

Chi-squared test alternative:

χ² = Σ[(Observed – Expected)² / Expected]

Both methods work. Chi-squared is often easier for contingency tables.
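
Here’s a short Python sketch of both calculations on the same made-up 2×2 results (conversion counts and visitor totals are illustrative), mainly to show that for a simple A/B comparison they agree: without a continuity correction, the chi-squared statistic is exactly z².

```python
import numpy as np
from scipy.stats import norm, chi2_contingency

# Made-up results: (conversions, visitors) for each variation.
conv_a, n_a = 480, 10_000
conv_b, n_b = 560, 10_000

# Two-proportion z-test with the pooled conversion rate.
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pooled = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value_z = 2 * norm.sf(abs(z))

# Chi-squared test on the 2x2 contingency table (no continuity correction,
# so it matches the z-test exactly: chi2 == z**2).
table = [[conv_a, n_a - conv_a],
         [conv_b, n_b - conv_b]]
chi2, p_value_chi2, _, _ = chi2_contingency(table, correction=False)

print(f"z = {z:.2f}, p = {p_value_z:.3f}")
print(f"chi2 = {chi2:.2f} (= z^2 = {z**2:.2f}), p = {p_value_chi2:.3f}")
```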

Most important principle:

Statistical significance ≠ business impact

A test can be statistically significant (p < 0.05) but have such a small effect that it’s not worth implementing. Always consider the size of the lift, the cost and complexity of rolling out the change, and whether the impact is meaningful for revenue or user experience.

Additional Best Practices

Avoid multiple testing problems:

If you test multiple variations or multiple metrics, adjust your significance threshold. The Bonferroni correction is conservative but simple: divide your alpha by the number of comparisons.
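
As a quick illustration of that adjustment (the p-values below are made up), a minimal sketch:

```python
# Bonferroni: with 3 comparisons against control, test each at alpha / 3.
alpha = 0.05
p_values = [0.030, 0.012, 0.049]          # made-up p-values, one per comparison
adjusted_alpha = alpha / len(p_values)    # 0.05 / 3 ≈ 0.0167

for p in p_values:
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"p = {p:.3f} -> {verdict} at Bonferroni-adjusted alpha {adjusted_alpha:.4f}")
```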

One change at a time:

Test one variable at a time when possible. If you change both the button color AND the copy, you won’t know which drove the change.

Document everything: the hypothesis, the variations, start and end dates, sample sizes, segments, results, and the decision you made.

Pre-register your analysis plan:

Decide ahead of time: your primary metric, your minimum sample size, your test duration, and which segments (if any) you will break results down by.

This prevents post-hoc rationalization and HARKing (Hypothesizing After Results are Known).

Common Pitfalls to Avoid

  1. Testing too many things at once: Multiple concurrent tests can interact
  2. Changing the test mid-flight: Don’t modify variations or targeting during a test
  3. Trusting small samples: Even if “significant,” small samples are unreliable
  4. Ignoring seasonality: Don’t run tests that span major holidays or events
  5. Not documenting learnings: Failed tests teach you as much as winners

The Bottom Line

A/B testing isn’t complicated, but it’s unforgiving with those who cut corners on statistical rules.

The four most critical rules:

  1. Calculate sample size before starting
  2. Test for minimum 2 weeks
  3. Check for Sample Ratio Mismatch
  4. Never stop early just because it’s “significant”

Get these right, and your tests will be reliable. Skip them, and you’re just generating noise.

I’m not a statistician—I used AI to verify my sources and be as precise as possible. But these principles come from 10 years of practical experience in tracking and analytics across companies like SFR, TheFork, Expedia Group, and Amex GBT.

Sources