A/B testing isn’t rocket science. But it is merciless with anyone who bends the statistical rules. After 10 years of analyzing tracking data, here’s what I’ve learned about running tests that actually produce reliable results.

1. Calculate Your Sample Size BEFORE Starting

Never launch a test without knowing how many visitors you need. This is the most common mistake I see.

Why it matters: If your sample is too small, you won’t detect real differences. If you stop too early because you see “significance,” you’re likely seeing noise, not signal.

How to calculate it:

The formula is: n = 2 × [σ² × (Z₁₋α/₂ + Z₁₋β)²] / δ²

Where:

  σ²: the variance of your metric (for a conversion rate, p(1-p))
  Z₁₋α/₂: the z-score for your confidence level (1.96 at 95%)
  Z₁₋β: the z-score for your statistical power (0.84 at 80%)
  δ: the minimum detectable effect (the smallest absolute difference you want to detect)

Practical example:

You need approximately 7,400 visitors per group.
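
If you’d rather not work the formula by hand, here’s a minimal Python sketch of the same calculation. The baseline rate and minimum detectable effect below are illustrative assumptions (they don’t reproduce the 7,400 figure above), so swap in your own numbers.

```python
from scipy.stats import norm

def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed per group for a two-sided test of proportions.

    baseline_rate: current conversion rate (e.g. 0.05 for 5%)
    mde_abs:       minimum detectable effect, in absolute terms (e.g. 0.01)
    """
    z_alpha = norm.ppf(1 - alpha / 2)    # 1.96 for 95% confidence
    z_beta = norm.ppf(power)             # 0.84 for 80% power
    p_bar = baseline_rate + mde_abs / 2  # pooled rate under the alternative
    variance = p_bar * (1 - p_bar)       # σ² for a conversion rate
    return 2 * variance * (z_alpha + z_beta) ** 2 / mde_abs ** 2

# Illustrative assumption: 5% baseline, and we want to detect +1 point (5% -> 6%).
print(round(sample_size_per_group(0.05, 0.01)))  # ≈ 8,159 visitors per group here
```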

Tools to use:

Critical point: The lower your expected improvement, the larger your sample size needs to be. If you’re testing subtle changes, you’ll need massive traffic.

2. Test for a Minimum of 2 Weeks

Even if you hit statistical significance after 3 days, keep going.

Why 2 weeks minimum?

  1. Business cycles: Users behave differently on weekdays vs weekends. You need to capture at least 2 complete weekly cycles to smooth out these patterns.
  2. Ghost conversions: A visitor might start their buying journey before your test begins, return during the test, and convert. This conversion gets counted as a “success” but it might have happened without your variation. The 14-day rule helps filter these false positives.
  3. Traffic source variations: Different traffic sources (organic, paid, email) arrive at different times and behave differently.

Maximum duration: 6-8 weeks

Beyond 6-8 weeks, data starts getting muddy: cookies expire or get deleted, so returning visitors can be re-assigned and pollute both groups, and external factors (seasonality, campaigns, other product changes) increasingly blur the effect you’re trying to measure.

Exception for high-traffic sites: If you have massive traffic and hit your sample size in 2-3 days, you still shouldn’t stop immediately. Even with sufficient sample size, you need to validate that the pattern holds across different days and user segments.

Industry sources back up this recommendation (see the Sources section at the end).

3. Verify Your Sample Ratio Mismatch (SRM)

This is a critical check that most people skip.

What is SRM?

Sample Ratio Mismatch occurs when your actual split doesn’t match your intended split. For example: you configured a 50/50 split, but 52% of visitors ended up in control and only 48% in the variation.

Why it matters:

Microsoft Research found that approximately 6% of A/B tests have SRM issues. Even a small imbalance can bias your results and invalidate the comparison, because whatever broke the split has usually also removed a non-random slice of users from one group.

How to detect SRM:

Use a Chi-squared (χ²) goodness-of-fit test:

χ² = Σ [(Observed – Expected)² / Expected]

Threshold: If p-value < 0.01, you have a statistically significant SRM.

Example calculation:
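
As a minimal sketch of what that looks like in practice, here’s the chi-squared check in Python with purely illustrative counts (a planned 50/50 split over 100,000 visitors):

```python
from scipy.stats import chisquare

# Illustrative counts: we planned a 50/50 split over 100,000 visitors,
# but observed 50,600 in control and 49,400 in the variation.
observed = [50_600, 49_400]
expected = [50_000, 50_000]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.1f}, p-value = {p_value:.5f}")
# chi-squared = 14.4, p-value ≈ 0.00015 -> well below 0.01, so this split has an SRM.
```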

Common causes of SRM:

  1. Broken randomization: Users aren’t being assigned correctly to groups
  2. Timing issues: Variations don’t start at the same time
  3. Bot filtering: Automatic systems incorrectly removing real users
  4. Redirect problems: if variation B uses a redirect that is slow or fails, some of its visitors never get tracked
  5. Cookie/tracking issues: Users blocking or deleting cookies

Tools:

What to do if SRM is detected:

Don’t trust your results. The standard approach is:

  1. Identify the root cause
  2. Fix it
  3. Rerun the test

Your data is compromised and any conclusions you draw are potentially invalid.

4. Never Stop a Test Just Because It’s “Significant”

This is called p-hacking (or data peeking) and it’s one of the worst mistakes in A/B testing.

The problem:

If you continuously monitor your test and stop as soon as you see p < 0.05, you’re dramatically increasing your false positive rate. You might see “significance” due to random fluctuations, not real effects.
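
To see how much peeking inflates the error rate, here’s a small simulation sketch: repeated A/A tests (no real difference) checked every 1,000 visitors and stopped at the first “significant” result. The traffic volumes and check frequency are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_tests=2_000, n_per_group=20_000,
                                check_every=1_000, base_rate=0.05):
    """Simulate A/A tests (identical true rates) checked repeatedly before the end."""
    false_positives = 0
    for _ in range(n_tests):
        # Cumulative conversion counts for two groups with the SAME true rate.
        conv_a = np.cumsum(rng.random(n_per_group) < base_rate)
        conv_b = np.cumsum(rng.random(n_per_group) < base_rate)
        for n in range(check_every, n_per_group + 1, check_every):
            p1, p2 = conv_a[n - 1] / n, conv_b[n - 1] / n
            pooled = (conv_a[n - 1] + conv_b[n - 1]) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * (2 / n))
            if se == 0:
                continue
            if abs(p1 - p2) / se > 1.96:   # looks "significant" at this peek
                false_positives += 1
                break
    return false_positives / n_tests

# A single look at the end would give ~5% false positives by construction.
# With 20 interim looks per test, the simulated rate comes out several times higher.
print(peeking_false_positive_rate())
```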

Real-world example from ConversionXL:

They showed a test where Variation 1 appeared to be losing badly at the start. If they had stopped early, they would have declared it a loser. By waiting for the predetermined sample size, Variation 1 ended up winning by over 25%.

The rule:

Wait for BOTH:

  1. Your predetermined sample size
  2. Your minimum test duration (2 weeks)

Why this is hard:

Stakeholders get impatient. Business pressure builds. You see early “wins” and want to ship. Resist. The cost of implementing a change that doesn’t actually work (or worse, hurts conversion) is far higher than waiting a few more days.

Sequential testing alternative:

If you absolutely need to check results early, use proper sequential testing methods (like Sequential Probability Ratio Test) that are designed for continuous monitoring. But don’t just “peek” at standard A/B tests.

5. Aim for 95% Confidence and 80% Power Minimum

These are two different but equally important metrics.

Confidence Level (typically 95%):

In plain terms: if there is truly no difference between the variations, a 95% confidence level means you will wrongly declare a winner only about 5% of the time.

Statistical Power (typically 80%):

This is the probability of detecting a real effect, assuming an effect of at least the size you planned for actually exists.

The calculation:

For a conversion rate test, plug the corresponding z-scores into the sample-size formula from point 1: Z₁₋α/₂ = 1.96 for 95% confidence and Z₁₋β = 0.84 for 80% power. With the same example inputs as point 1, that works out to approximately 7,400 visitors per variant.

Common mistake:

Many people only check confidence and ignore power. This leads to “no difference” conclusions when there actually is a difference—you just didn’t have enough data to detect it.

6. Perform Sanity Checks on Invariant Metrics

Before trusting your results, verify that metrics that shouldn’t change haven’t changed.

What are invariant metrics?

Metrics that should theoretically be unaffected by your test variation.

Examples: the number of users assigned to each group, page load time, the mix of traffic sources and devices, and engagement with parts of the site your change doesn’t touch.

Why this matters:

If invariant metrics change, it’s a red flag that something is wrong with your test implementation: a randomization or tracking bug, or a side effect of the variation you didn’t intend (like the extra page reload in the example below).

What to check:

  1. No external shocks: Were there any campaigns, bugs, traffic spikes, or events during the test?
  2. Stable traffic sources: Did the mix of traffic sources remain consistent?
  3. No changes to other parts of the site: Nothing else was deployed that could affect results?
  4. Consistent user experience: Both variations load at the same speed? No errors?

Example of catching implementation bugs:

At one company, they noticed that pageviews increased in the test variation. This made no sense for a button color test. Investigation revealed that the variation was causing an extra page reload due to a JavaScript error. Without this sanity check, they would have trusted corrupted data.
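
If you want to automate that kind of sanity check, here’s a minimal sketch comparing an invariant metric (pageviews per user) across groups with a Welch t-test; the simulated arrays stand in for whatever per-user export your analytics tool gives you.

```python
import numpy as np
from scipy.stats import ttest_ind

# Stand-in data: pageviews per user exported for each group.
rng = np.random.default_rng(7)
pageviews_control = rng.poisson(lam=4.0, size=50_000)
pageviews_variant = rng.poisson(lam=4.0, size=50_000)

stat, p_value = ttest_ind(pageviews_control, pageviews_variant, equal_var=False)
print(f"t = {stat:.2f}, p-value = {p_value:.3f}")

# For a button-color test, pageviews per user shouldn't differ between groups.
# A very small p-value here suggests an implementation problem
# (e.g. an extra reload in the variation), not a real user-behavior change.
```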

7. Understand Your Test Statistics (But Keep It Simple)

You don’t need to be a statistician, but you should understand the basics.

For conversion rate tests:

The z-statistic is: z = (p₁ – p₂) / SE

Where SE (Standard Error) uses the pooled conversion rate p (total conversions ÷ total visitors across both groups): SE = √[p(1-p) × (1/n₁ + 1/n₂)]

If |z| > 1.96 → significant at 95% confidence level

Chi-squared test alternative:

χ² = Σ[(Observed – Expected)² / Expected]

Both methods work. Chi-squared is often easier for contingency tables.
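
Here’s a short Python sketch of both calculations on the same made-up 2×2 results (conversion counts and visitor totals are illustrative), mainly to show that for a simple A/B comparison they agree: without a continuity correction, the chi-squared statistic is exactly z².

```python
import numpy as np
from scipy.stats import norm, chi2_contingency

# Made-up results: (conversions, visitors) for each variation.
conv_a, n_a = 480, 10_000
conv_b, n_b = 560, 10_000

# Two-proportion z-test with the pooled conversion rate.
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pooled = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value_z = 2 * norm.sf(abs(z))

# Chi-squared test on the 2x2 contingency table (no continuity correction,
# so it matches the z-test exactly: chi2 == z**2).
table = [[conv_a, n_a - conv_a],
         [conv_b, n_b - conv_b]]
chi2, p_value_chi2, _, _ = chi2_contingency(table, correction=False)

print(f"z = {z:.2f}, p = {p_value_z:.3f}")
print(f"chi2 = {chi2:.2f} (= z^2 = {z**2:.2f}), p = {p_value_chi2:.3f}")
```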

Most important principle:

Statistical significance ≠ business impact

A test can be statistically significant (p < 0.05) but have such a small effect that it’s not worth implementing. Always consider the size of the lift, the cost and complexity of rolling out the change, and whether the impact is meaningful for revenue or user experience.

Additional Best Practices

Avoid multiple testing problems:

If you test multiple variations or multiple metrics, adjust your significance threshold. The Bonferroni correction is conservative but simple: divide your alpha by the number of comparisons.
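
As a quick illustration of that adjustment (the p-values below are made up), a minimal sketch:

```python
# Bonferroni: with 3 comparisons against control, test each at alpha / 3.
alpha = 0.05
p_values = [0.030, 0.012, 0.049]          # made-up p-values, one per comparison
adjusted_alpha = alpha / len(p_values)    # 0.05 / 3 ≈ 0.0167

for p in p_values:
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"p = {p:.3f} -> {verdict} at Bonferroni-adjusted alpha {adjusted_alpha:.4f}")
```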

One change at a time:

Test one variable at a time when possible. If you change both the button color AND the copy, you won’t know which drove the change.

Document everything: the hypothesis, the variations, start and end dates, sample sizes, segments, results, and the decision you made.

Pre-register your analysis plan:

Decide ahead of time: your primary metric, your minimum sample size, your test duration, and which segments (if any) you will break results down by.

This prevents post-hoc rationalization and HARKing (Hypothesizing After Results are Known).

Common Pitfalls to Avoid

  1. Testing too many things at once: Multiple concurrent tests can interact
  2. Changing the test mid-flight: Don’t modify variations or targeting during a test
  3. Trusting small samples: Even if “significant,” small samples are unreliable
  4. Ignoring seasonality: Don’t run tests that span major holidays or events
  5. Not documenting learnings: Failed tests teach you as much as winners

The Bottom Line

A/B testing isn’t complicated, but it’s unforgiving with those who cut corners on statistical rules.

The four most critical rules:

  1. Calculate sample size before starting
  2. Test for minimum 2 weeks
  3. Check for Sample Ratio Mismatch
  4. Never stop early just because it’s “significant”

Get these right, and your tests will be reliable. Skip them, and you’re just generating noise.

I’m not a statistician—I used AI to verify my sources and be as precise as possible. But these principles come from 10 years of practical experience in tracking and analytics across companies like SFR, TheFork, Expedia Group, and Amex GBT.

Sources