How to Run A/B Tests Without Screwing Up: A Complete Guide

A/B testing isn’t rocket science. But it is merciless with anyone who bends the statistical rules. After 10 years analyzing tracking data, here’s what I’ve learned about running tests that actually produce reliable results.

1. Calculate Your Sample Size BEFORE Starting

Never launch a test without knowing how many visitors you need. This is the most common mistake I see.

Why it matters: If your sample is too small, you won’t detect real differences. If you stop too early because you see “significance,” you’re likely seeing noise, not signal.

How to calculate it:

The formula is: n = 2 × [σ² × (Z₁₋α/₂ + Z₁₋β)²] / δ²

Where:

  • σ² = p₀(1-p₀) where p₀ is your baseline conversion rate
  • Z₁₋α/₂ = 1.96 for 95% confidence
  • Z₁₋β = 0.84 for 80% power (or 1.28 for 90%)
  • δ = minimum detectable effect (MDE), expressed as an absolute difference in conversion rate

Practical example:

  • Baseline: 5% conversion rate
  • MDE: you want to detect +1 percentage point (so 5% → 6%)
  • Confidence: 95%, Power: 80%

You need approximately 7,400 visitors per group.
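
If you prefer to script this rather than use a calculator, here is a minimal Python sketch of the formula above. It assumes SciPy for the normal quantiles, and the function name is mine:

```python
from scipy.stats import norm

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Approximate visitors per group: n = 2 * sigma^2 * (z_{1-a/2} + z_{1-b})^2 / mde^2."""
    sigma_sq = baseline * (1 - baseline)   # variance at the baseline conversion rate
    z_alpha = norm.ppf(1 - alpha / 2)      # 1.96 for 95% confidence
    z_beta = norm.ppf(power)               # 0.84 for 80% power
    return 2 * sigma_sq * (z_alpha + z_beta) ** 2 / mde ** 2

print(round(sample_size_per_group(0.05, 0.01)))    # ≈ 7,450 per group (the ~7,400 above)
print(round(sample_size_per_group(0.05, 0.005)))   # halve the MDE and you need ≈ 4x the sample
```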

Tools to use:

  • Evan Miller’s Sample Size Calculator (evanmiller.org/ab-testing/sample-size.html)
  • Optimizely’s Sample Size Calculator

Critical point: The smaller the effect you want to detect, the larger your sample size needs to be. If you’re testing subtle changes, you’ll need massive traffic.

2. Test for a Minimum of 2 Weeks

Even if you hit statistical significance after 3 days, keep going.

Why 2 weeks minimum?

  1. Business cycles: Users behave differently on weekdays vs weekends. You need to capture at least 2 complete weekly cycles to smooth out these patterns.
  2. Ghost conversions: A visitor might start their buying journey before your test begins, return during the test, and convert. This conversion gets counted as a “success” but it might have happened without your variation. The 14-day rule helps filter these false positives.
  3. Traffic source variations: Different traffic sources (organic, paid, email) arrive at different times and behave differently.

Maximum duration: 6-8 weeks

Beyond 6-8 weeks, data starts getting muddy:

  • Cookies get deleted
  • User patterns shift
  • External factors accumulate
  • You introduce new variables

Exception for high-traffic sites: If you have massive traffic and hit your sample size in 2-3 days, you still shouldn’t stop immediately. Even with sufficient sample size, you need to validate that the pattern holds across different days and user segments.

Sources confirm this:

  • AB Tasty research shows that stopping tests early is one of the main reasons for false positives
  • Kameleoon recommends 2-4 weeks as standard practice
  • The consensus across the industry is clear: patience pays off

3. Check for Sample Ratio Mismatch (SRM)

This is a critical check that most people skip.

What is SRM?

Sample Ratio Mismatch occurs when your actual split doesn’t match your intended split. For example:

  • You configure 50/50
  • You observe: Group A = 5,000 visitors, Group B = 2,100 visitors
  • This is a major SRM

Why it matters:

Microsoft Research found that approximately 6% of A/B tests have SRM issues. Even a small imbalance can:

  • Distort conversion rates
  • Make you detect false uplifts
  • Mask real changes
  • Completely reverse test outcomes

How to detect SRM:

Use a Chi-squared (χ²) goodness-of-fit test:

χ² = Σ [(Observed – Expected)² / Expected]

Threshold: If p-value < 0.01, you have a statistically significant SRM.

Example calculation:

  • Expected: 50/50 split with 10,000 total visitors = 5,000 each
  • Observed: Group A = 5,200, Group B = 4,800
  • χ² = [(5200-5000)²/5000] + [(4800-5000)²/5000] = 16
  • With 1 degree of freedom, the p < 0.01 threshold corresponds to χ² > 6.63 (χ² > 3.84 is only the looser p < 0.05 cutoff)
  • In this case: 16 > 6.63 → SRM detected
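
The same check takes a couple of lines with SciPy. This sketch (the function name is mine) reproduces the example above:

```python
from scipy.stats import chisquare

def srm_check(visitors_a, visitors_b, expected_ratio=0.5, alpha=0.01):
    """Chi-squared goodness-of-fit test for Sample Ratio Mismatch on a two-arm test."""
    total = visitors_a + visitors_b
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    stat, p_value = chisquare([visitors_a, visitors_b], f_exp=expected)
    return stat, p_value, p_value < alpha   # True means SRM at the p < 0.01 threshold

stat, p, srm = srm_check(5200, 4800)
print(f"chi2 = {stat:.1f}, p = {p:.5f}, SRM detected: {srm}")
# chi2 = 16.0, p = 0.00006, SRM detected: True
```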

Common causes of SRM:

  1. Broken randomization: Users aren’t being assigned correctly to groups
  2. Timing issues: Variations don’t start at the same time
  3. Bot filtering: Automatic systems incorrectly removing real users
  4. Redirect problems: If variation B uses a redirect that crashes or is slow
  5. Cookie/tracking issues: Users blocking or deleting cookies

Tools:

  • Lukas Vermeer’s SRM Checker: lukasvermeer.nl/srm/
  • Chrome extension for automatic SRM detection
  • Most A/B testing platforms now include built-in SRM alerts

What to do if SRM is detected:

Don’t trust your results. The standard approach is:

  1. Identify the root cause
  2. Fix it
  3. Rerun the test

Your data is compromised and any conclusions you draw are potentially invalid.

4. Never Stop a Test Just Because It’s “Significant”

This is known as data peeking, a form of p-hacking, and it’s one of the worst mistakes in A/B testing.

The problem:

If you continuously monitor your test and stop as soon as you see p < 0.05, you’re dramatically increasing your false positive rate. You might see “significance” due to random fluctuations, not real effects.

Real-world example from ConversionXL:

They showed a test where Variation 1 appeared to be losing badly at the start. If they had stopped early, they would have declared it a loser. By waiting for the predetermined sample size, Variation 1 ended up winning by over 25%.

The rule:

Wait for BOTH:

  1. Your predetermined sample size
  2. Your minimum test duration (2 weeks)

Why this is hard:

Stakeholders get impatient. Business pressure builds. You see early “wins” and want to ship. Resist. The cost of implementing a change that doesn’t actually work (or worse, hurts conversion) is far higher than waiting a few more days.

Sequential testing alternative:

If you absolutely need to check results early, use a proper sequential testing method, such as the Sequential Probability Ratio Test (SPRT), that is designed for continuous monitoring; a rough sketch follows below. But don’t just “peek” at standard A/B tests.
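
For illustration only, here is a bare-bones sketch of Wald’s SPRT on a single stream of conversions, testing the baseline rate against baseline + MDE. The thresholds are the textbook Wald boundaries, not any vendor’s implementation, and real platforms use more elaborate sequential procedures:

```python
from math import log

def sprt_decision(conversions, visitors, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Wald's SPRT for a Bernoulli stream. H0: rate = p0, H1: rate = p1."""
    # Log-likelihood ratio of the data under H1 vs H0
    llr = (conversions * log(p1 / p0)
           + (visitors - conversions) * log((1 - p1) / (1 - p0)))
    upper = log((1 - beta) / alpha)   # crossing it -> accept H1
    lower = log(beta / (1 - alpha))   # crossing it -> accept H0
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

print(sprt_decision(conversions=130, visitors=2000))  # 6.5% observed -> "accept H1"
```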

5. Aim for 95% Confidence and 80% Power Minimum

These are two different but equally important metrics.

Confidence Level (typically 95%):

This caps how often your test will flag a difference when there is in fact none.

  • 95% confidence means: if there’s truly no difference, you’ll incorrectly detect one only 5% of the time (Type I error)
  • 99% confidence is more conservative but requires larger samples
  • 90% is sometimes acceptable for low-risk tests

Statistical Power (typically 80%):

This is the probability of detecting a real effect (of at least your MDE) if one exists.

  • 80% power means: if there IS a real difference, you’ll detect it 80% of the time
  • Low power = high risk of false negatives (Type II error)
  • 90% power is better but requires larger samples

The calculation:

For a conversion rate test with:

  • Baseline: 5% conversion
  • MDE: +1 percentage point
  • Confidence: 95%
  • Power: 80%

You need approximately 7,400 visitors per variant.

Common mistake:

Many people only check confidence and ignore power. This leads to “no difference” conclusions when there actually is a difference—you just didn’t have enough data to detect it.
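
A quick simulation makes the cost of low power concrete. This sketch (NumPy assumed, numbers made up) runs many fake tests where the variant truly converts at 6% against a 5% control, and counts how often a two-proportion z-test at |z| > 1.96 detects the lift:

```python
import numpy as np

rng = np.random.default_rng(42)

def detection_rate(n_per_group, p_control=0.05, p_variant=0.06, runs=2000):
    """Empirical power: share of simulated tests where a real lift is declared significant."""
    detected = 0
    for _ in range(runs):
        conv_a = rng.binomial(n_per_group, p_control)
        conv_b = rng.binomial(n_per_group, p_variant)
        p_pool = (conv_a + conv_b) / (2 * n_per_group)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
        z = (conv_b / n_per_group - conv_a / n_per_group) / se
        detected += abs(z) > 1.96
    return detected / runs

print(detection_rate(1000))   # underpowered: the real lift is detected well under half the time
print(detection_rate(7400))   # near the planned sample size: close to the intended 80%
```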

6. Perform Sanity Checks on Invariant Metrics

Before trusting your results, verify that metrics that shouldn’t change haven’t changed.

What are invariant metrics?

Metrics that should theoretically be unaffected by your test variation.

Examples:

  • If you’re testing a button color on a product page, your total pageviews shouldn’t change
  • If you’re testing checkout flow, your add-to-cart rate (before checkout) shouldn’t change
  • Number of sessions per user should remain stable

Why this matters:

If invariant metrics change, it’s a red flag that something is wrong with your test implementation:

  • Technical bugs
  • Tracking errors
  • Targeting issues
  • Cookie problems

What to check:

  1. No external shocks: Were there any campaigns, bugs, traffic spikes, or events during the test?
  2. Stable traffic sources: Did the mix of traffic sources remain consistent?
  3. No changes to other parts of the site: Nothing else was deployed that could affect results?
  4. Consistent user experience: Both variations load at the same speed? No errors?

Example of catching implementation bugs:

At one company, they noticed that pageviews increased in the test variation. This made no sense for a button color test. Investigation revealed that the variation was causing an extra page reload due to a JavaScript error. Without this sanity check, they would have trusted corrupted data.
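
One practical way to script such a sanity check is a chi-squared test on the traffic-source mix per arm. The counts below are made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical visitors per traffic source (organic, paid, email) in each arm
observed = np.array([
    [3100, 1400, 500],   # group A
    [3050, 1420, 530],   # group B
])
stat, p_value, dof, _ = chi2_contingency(observed)
print(f"chi2 = {stat:.2f}, p = {p_value:.2f}")   # chi2 ≈ 1.42, p ≈ 0.49: no sign of a skewed mix
# A very small p-value here would point to an implementation or targeting problem,
# not to a real effect of the variation.
```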

7. Understand Your Test Statistics (But Keep It Simple)

You don’t need to be a statistician, but you should understand the basics.

For conversion rate tests:

The z-statistic is: z = (p₁ – p₂) / SE

Where SE (Standard Error) is: SE = √[p(1-p) × (1/n₁ + 1/n₂)]

  • p = pooled conversion rate (total conversions divided by total visitors across both groups)
  • p₁, p₂ = conversion rates for each group
  • n₁, n₂ = sample sizes

If |z| > 1.96 → significant at 95% confidence level

Chi-squared test alternative:

χ² = Σ[(Observed – Expected)² / Expected]

Both methods work; for a simple two-variant comparison they are equivalent (χ² = z²). Chi-squared scales more naturally to larger contingency tables.
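
For reference, here is what that calculation looks like in Python; the conversion counts are hypothetical:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test, following the formula above."""
    p1, p2 = conv_a / n_a, conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))   # two-sided p-value
    return z, p_value

# 5.0% vs 5.8% conversion on 7,500 visitors per group
z, p = two_proportion_z_test(375, 7500, 435, 7500)
print(f"z = {z:.2f}, p = {p:.3f}")   # |z| ≈ 2.17 > 1.96 -> significant at 95%
```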

Most important principle:

Statistical significance ≠ business impact

A test can be statistically significant (p < 0.05) but have such a small effect that it’s not worth implementing. Always consider:

  • Is the lift meaningful for the business?
  • What’s the implementation cost?
  • What are the risks?

Additional Best Practices

Avoid multiple testing problems:

If you test multiple variations or multiple metrics, adjust your significance threshold. The Bonferroni correction is conservative but simple: divide your alpha by the number of comparisons.
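
A tiny helper makes the rule explicit (names mine):

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons

print(bonferroni_alpha(0.05, 3))  # ≈ 0.0167: each of 3 comparisons must clear p < 0.0167
```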

One change at a time:

Test one variable at a time when possible. If you change both the button color AND the copy, you won’t know which drove the change.

Document everything:

  • Hypothesis
  • Sample size calculation
  • Test duration
  • Implementation details
  • Results and decision

Pre-register your analysis plan:

Decide ahead of time:

  • What metrics you’ll measure
  • What sample size you need
  • How long you’ll run the test
  • What will constitute success

This prevents post-hoc rationalization and HARKing (Hypothesizing After Results are Known).

Common Pitfalls to Avoid

  1. Testing too many things at once: Multiple concurrent tests can interact
  2. Changing the test mid-flight: Don’t modify variations or targeting during a test
  3. Trusting small samples: Even if “significant,” small samples are unreliable
  4. Ignoring seasonality: Don’t run tests that span major holidays or events
  5. Not documenting learnings: Failed tests teach you as much as winners

The Bottom Line

A/B testing isn’t complicated, but it’s unforgiving with those who cut corners on statistical rules.

The four most critical rules:

  1. Calculate sample size before starting
  2. Test for minimum 2 weeks
  3. Check for Sample Ratio Mismatch
  4. Never stop early just because it’s “significant”

Get these right, and your tests will be reliable. Skip them, and you’re just generating noise.

I’m not a statistician—I used AI to verify my sources and be as precise as possible. But these principles come from 10 years of practical experience in tracking and analytics across companies like SFR, TheFork, Expedia Group, and Amex GBT.
