A/B Test Sample Size Calculator

Calculate how many visitors and how long you need to run an A/B test for statistically significant results.

Test Parameters

Baseline conversion rate (control group): 5%
Relative improvement you want to detect (MDE): 10%

Results

The calculator reports the sample size per variation, the total visitors needed, and an estimated test duration.

Conversion Rate Comparison

Control: 5.00%
Variation (target): 5.50%
Absolute difference: +0.50pp

Tip: A 10% relative improvement from 5% to 5.5% requires a meaningful sample to detect.
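Under the hood, calculators like this typically use the standard normal-approximation formula for comparing two proportions. Here is a minimal Python sketch (the function name is illustrative; it assumes a two-sided test at 95% significance and 80% power by default):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variation sample size for a two-proportion z-test
    (normal approximation):
        n = (z_{a/2} + z_b)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # 5% baseline, 10% MDE -> 5.5% target
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_variation(0.05, 0.10)
print(n, "per variation,", 2 * n, "total visitors")
```

Note how the required sample grows as the MDE shrinks: halving the MDE roughly quadruples the sample size, because n scales with 1 / (p2 - p1)².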


A/B Testing Best Practices

1. Decide sample size before starting. Never peek at results and stop early when you see significance; this inflates false positive rates.
2. Run for full weeks. User behavior varies by day of week, so always run tests in full-week increments to avoid day-of-week bias.
3. Test one change at a time. Changing multiple elements at once makes it impossible to attribute the result to any single change.
4. Use feature flags for safe rollouts. Feature flags let you gradually roll out winning variants and instantly roll back if metrics degrade.

Run A/B tests with feature flags

FlagBit gives you percentage-based rollouts, targeting rules, and instant rollbacks. Perfect for controlled A/B testing.

Try FlagBit — Free Tier

Frequently Asked Questions

What does statistical significance mean?

Statistical significance means the observed difference between control and variation is unlikely to be due to random chance alone. At a 95% significance level, there is only a 5% chance of seeing a difference this large when no real difference exists (a false positive). Higher significance levels require larger sample sizes but give you more confidence in the result.

What is minimum detectable effect (MDE)?

MDE is the smallest relative improvement you want to reliably detect. A 10% MDE on a 5% baseline means you want to detect a change from 5.0% to 5.5%. Smaller MDEs require much larger sample sizes. Choose an MDE that would be meaningful for your business — a 1% improvement on a high-traffic page might be worth millions.

What is statistical power?

Power is the probability of detecting a real effect when one exists. 80% power means there's a 20% chance of a false negative — concluding there is no effect when there actually is one. Most tests use 80% power as the standard. Increase it to 90% if the cost of missing a real improvement is high.
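The same normal approximation can be run in the other direction, to see how power depends on sample size. A hedged sketch (illustrative function name, two-sided two-proportion z-test):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_variation, p1, p2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# Detecting 5.0% -> 5.5%: power rises steadily with sample size.
for n in (10_000, 31_000, 62_000):
    print(n, f"{approx_power(n, 0.05, 0.055):.0%}")
```

Running a test at half the required sample size does not halve your power; it can drop it far below 50%, which is why committing to the full sample matters.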

How do feature flags work for A/B testing?

Feature flags let you split traffic between control and variation at the code level. With FlagBit's percentage rollouts, you assign a consistent share of users to each variant using deterministic hashing. If the variant wins, increase the rollout to 100%; if it loses, roll back instantly — no deploy needed.
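Deterministic hash-based assignment can be sketched as follows. This is an illustrative implementation, not FlagBit's actual algorithm; the key property is that a user's bucket is stable, so raising the rollout percentage only adds users and never reshuffles existing ones:

```python
import hashlib

def bucket(user_id: str, flag_key: str) -> float:
    """Hash user + flag to a stable point in [0, 100).

    Illustrative sketch (not FlagBit's actual algorithm). The same user
    always gets the same bucket for a given flag, so increasing the
    rollout percentage only adds users; it never reshuffles them.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100  # first 4 bytes -> [0, 100)

def variant(user_id: str, flag_key: str, rollout_pct: float) -> str:
    """Assign the user to the variation if their bucket falls under the rollout."""
    return "variation" if bucket(user_id, flag_key) < rollout_pct else "control"

print(variant("user-42", "new-checkout", 50))
```

Because assignment is a pure function of user ID and flag key, it needs no stored state and stays consistent across servers and sessions.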

Why can't I stop my test as soon as it looks significant?

Peeking at results and stopping when you first see significance dramatically inflates the false positive rate — from a nominal 5% to over 30% in many cases. This is called the "peeking problem." Calculate your sample size upfront, commit to it, and only analyze results after the test is complete. If you must monitor mid-test, use sequential testing methods designed for repeated looks.
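The inflation is easy to demonstrate by simulating A/A tests (no real difference between arms) and checking significance repeatedly. An illustrative Monte Carlo sketch; the batch sizes, number of looks, and trial counts are simulation parameters, not outputs of the calculator:

```python
import random
from math import sqrt
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # two-sided threshold at the 5% level

def z_stat(ca, na, cb, nb):
    """Pooled two-proportion z statistic from raw conversion counts."""
    p = (ca + cb) / (na + nb)
    se = sqrt(p * (1 - p) * (1 / na + 1 / nb))
    return abs(ca / na - cb / nb) / se if se > 0 else 0.0

def false_positive(rng, rate=0.05, batch=500, checks=10, peek=True):
    """One simulated A/A test. Returns True if it declares significance:
    at any interim look when peek=True, only at the end when peek=False."""
    ca = cb = na = nb = 0
    for _ in range(checks):
        ca += sum(rng.random() < rate for _ in range(batch))
        cb += sum(rng.random() < rate for _ in range(batch))
        na += batch
        nb += batch
        if peek and z_stat(ca, na, cb, nb) > Z_95:
            return True  # stopped early at the first "significant" look
    return z_stat(ca, na, cb, nb) > Z_95

trials = 300
rng = random.Random(0)
peeking = sum(false_positive(rng, peek=True) for _ in range(trials)) / trials
rng = random.Random(0)
fixed = sum(false_positive(rng, peek=False) for _ in range(trials)) / trials
# Both arms convert at exactly 5%, yet checking ten times and stopping at
# the first "significant" look triggers far more often than the nominal 5%.
print(f"peeking: {peeking:.0%}  fixed-horizon: {fixed:.0%}")
```

Each interim look is another chance for noise to cross the threshold, so ten looks at a 5% threshold behave nothing like a single 5% test.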