A/B Test Sample Size Calculator

Calculate how many visitors and how long you need to run an A/B test for statistically significant results.

Test Parameters

Baseline conversion rate (control group): 5%
Relative improvement you want to detect (MDE): 10%

Results

The calculator reports the sample size per variation, the total visitors needed, and an estimated test duration.

Conversion Rate Comparison

Control: 5.00%
Variation (target): 5.50%
Absolute difference: +0.50pp

Tip: A 10% relative improvement from 5% to 5.5% requires a meaningful sample to detect.
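Under the hood, calculators like this typically use the standard normal-approximation formula for comparing two proportions. Here is a minimal Python sketch (the function name is illustrative; it assumes a two-sided test at 95% significance and 80% power by default):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variation sample size for a two-proportion z-test
    (normal approximation):
        n = (z_{a/2} + z_b)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # 5% baseline, 10% MDE -> 5.5% target
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_variation(0.05, 0.10)
print(n, "per variation,", 2 * n, "total visitors")
```

Note how the required sample grows as the MDE shrinks: halving the MDE roughly quadruples the sample size, because n scales with 1 / (p2 - p1)².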


A/B Testing Best Practices

1. Decide sample size before starting. Never peek at results and stop early when you see significance; this inflates false positive rates.
2. Run for full weeks. User behavior varies by day of week, so always run tests in full-week increments to avoid day-of-week bias.
3. Test one change at a time. Changing multiple elements at once makes it impossible to attribute the result to any single change.
4. Use feature flags for safe rollouts. Feature flags let you gradually roll out winning variants and instantly roll back if metrics degrade.

Run A/B tests with feature flags

FlagBit gives you percentage-based rollouts, targeting rules, and instant rollbacks. Perfect for controlled A/B testing.

Try FlagBit — Free Tier

Frequently Asked Questions

What does statistical significance mean?

Statistical significance means the observed difference between control and variation is unlikely to be due to random chance alone. At a 95% significance level, there is only a 5% chance of seeing a difference this large when no real difference exists (a false positive). Higher significance levels require larger sample sizes but give you more confidence in the result.

What is minimum detectable effect (MDE)?

MDE is the smallest relative improvement you want to reliably detect. A 10% MDE on a 5% baseline means you want to detect a change from 5.0% to 5.5%. Smaller MDEs require much larger sample sizes. Choose an MDE that would be meaningful for your business — a 1% improvement on a high-traffic page might be worth millions.

What is statistical power?

Power is the probability of detecting a real effect when one exists. 80% power means there's a 20% chance of a false negative — concluding there is no effect when there actually is one. Most tests use 80% power as the standard. Increase it to 90% if the cost of missing a real improvement is high.
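The same normal approximation can be run in the other direction, to see how power depends on sample size. A hedged sketch (illustrative function name, two-sided two-proportion z-test):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_variation, p1, p2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# Detecting 5.0% -> 5.5%: power rises steadily with sample size.
for n in (10_000, 31_000, 62_000):
    print(n, f"{approx_power(n, 0.05, 0.055):.0%}")
```

Running a test at half the required sample size does not halve your power; it can drop it far below 50%, which is why committing to the full sample matters.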

How do feature flags work for A/B testing?

Feature flags let you split traffic between control and variation at the code level. With FlagBit's percentage rollouts, you assign a consistent share of users to each variant using deterministic hashing. If the variant wins, increase the rollout to 100%; if it loses, roll back instantly — no deploy needed.
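Deterministic hash-based assignment can be sketched as follows. This is an illustrative implementation, not FlagBit's actual algorithm; the key property is that a user's bucket is stable, so raising the rollout percentage only adds users and never reshuffles existing ones:

```python
import hashlib

def bucket(user_id: str, flag_key: str) -> float:
    """Hash user + flag to a stable point in [0, 100).

    Illustrative sketch (not FlagBit's actual algorithm). The same user
    always gets the same bucket for a given flag, so increasing the
    rollout percentage only adds users; it never reshuffles them.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100  # first 4 bytes -> [0, 100)

def variant(user_id: str, flag_key: str, rollout_pct: float) -> str:
    """Assign the user to the variation if their bucket falls under the rollout."""
    return "variation" if bucket(user_id, flag_key) < rollout_pct else "control"

print(variant("user-42", "new-checkout", 50))
```

Because assignment is a pure function of user ID and flag key, it needs no stored state and stays consistent across servers and sessions.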

Why can't I stop my test as soon as it looks significant?

Peeking at results and stopping when you first see significance dramatically inflates the false positive rate — from a nominal 5% to over 30% in many cases. This is called the "peeking problem." Calculate your sample size upfront, commit to it, and only analyze results after the test is complete. If you must monitor mid-test, use sequential testing methods designed for repeated looks.
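The inflation is easy to demonstrate by simulating A/A tests (no real difference between arms) and checking significance repeatedly. An illustrative Monte Carlo sketch; the batch sizes, number of looks, and trial counts are simulation parameters, not outputs of the calculator:

```python
import random
from math import sqrt
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # two-sided threshold at the 5% level

def z_stat(ca, na, cb, nb):
    """Pooled two-proportion z statistic from raw conversion counts."""
    p = (ca + cb) / (na + nb)
    se = sqrt(p * (1 - p) * (1 / na + 1 / nb))
    return abs(ca / na - cb / nb) / se if se > 0 else 0.0

def false_positive(rng, rate=0.05, batch=500, checks=10, peek=True):
    """One simulated A/A test. Returns True if it declares significance:
    at any interim look when peek=True, only at the end when peek=False."""
    ca = cb = na = nb = 0
    for _ in range(checks):
        ca += sum(rng.random() < rate for _ in range(batch))
        cb += sum(rng.random() < rate for _ in range(batch))
        na += batch
        nb += batch
        if peek and z_stat(ca, na, cb, nb) > Z_95:
            return True  # stopped early at the first "significant" look
    return z_stat(ca, na, cb, nb) > Z_95

trials = 300
rng = random.Random(0)
peeking = sum(false_positive(rng, peek=True) for _ in range(trials)) / trials
rng = random.Random(0)
fixed = sum(false_positive(rng, peek=False) for _ in range(trials)) / trials
# Both arms convert at exactly 5%, yet checking ten times and stopping at
# the first "significant" look triggers far more often than the nominal 5%.
print(f"peeking: {peeking:.0%}  fixed-horizon: {fixed:.0%}")
```

Each interim look is another chance for noise to cross the threshold, so ten looks at a 5% threshold behave nothing like a single 5% test.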