The runtime of an experiment is probably one of the more difficult things to get right. If you run it too short, you risk having insufficient significance or power; if you run it too long, you waste resources unnecessarily, among other problems.
How long to run?
With the older frequentist testing approach, the rule was that you should always estimate the runtime of an experiment upfront. Using a tool such as an A/B test duration calculator, you could see how long your test should run. These tools take into account parameters such as your current conversion rate and the number of visitors taking the desired action.
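To make the frequentist estimate concrete, here is a minimal sketch of the standard sample-size formula for comparing two conversion rates, using only the Python standard library. The baseline rate, target rate, and daily traffic below are hypothetical numbers, not figures from the article:

```python
from statistics import NormalDist

def sample_size_per_variation(p1, p2, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation for a two-sided z-test
    detecting a change from conversion rate p1 to p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Hypothetical example: 3% baseline, hoping to detect a lift to 4%,
# with 1,000 visitors per day split evenly across two variations.
n = sample_size_per_variation(0.03, 0.04)  # roughly 5,300 per variation
days = 2 * n / 1000                        # roughly 11 days of traffic
```

This is the same kind of calculation a duration calculator performs: plug in your current conversion rate, the lift you hope to detect, and your traffic, and out comes a projected runtime.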
Whenever possible, run your experiments for a minimum of 7+1 days: a full week, plus an extra day just to be sure. This rules out any effects that might only happen on certain weekdays (or weekend days). If you want to be even safer, use 14+1 days to account for specific events during the first week, and to collect a higher number of conversions per variation.
Something to keep in mind is that a test can also run too long. Not only could this waste valuable resources, it might also render your testing results useless. As outlined by Ton Wesseling, about 10% of your visitors will delete their cookies during an experiment with a two-week runtime. So if an experiment is projected to take about 6 weeks to reach significance, you will have lost a significant share of your initial visitors along the way.
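As a rough back-of-the-envelope check, if the ~10% loss per two weeks compounds (an assumption on my part, not something the article states), the cumulative loss over a 6-week test looks like this:

```python
# Sketch: cumulative cookie loss over a 6-week test, assuming the ~10%
# deletion rate per two weeks (Ton Wesseling's estimate) compounds.
loss_per_fortnight = 0.10          # figure cited in the article
fortnights = 6 / 2                 # a 6-week test spans three fortnights
retained = (1 - loss_per_fortnight) ** fortnights
lost = 1 - retained                # ~27% of the initial visitors are gone
```

Over a quarter of the visitors who entered the experiment in week one would no longer be tracked by its end, which is why long runtimes erode result quality.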
Most A/B testing tools have now implemented Bayesian statistical models to evaluate the reliability of the results they show. This newer statistical approach largely eliminates the need to guess a correct testing duration before you run a test. However, it can still help to check upfront whether you have enough conversions per variation to run a test within a certain timeframe. After all, other departments might rely on a test starting or finishing on a given date.
Experiments are often stopped early because a testing tool claims it has already reached significance or a high enough reliability. As outlined by Evan Miller, this can cause false positives (also called Type I errors). With the newer Bayesian statistical models, the best way to avoid such an error is to collect at least 100 conversions per variation (preferably 250 or more). With that number of conversions, the chance of running into low-sample-size problems is sufficiently reduced.
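Checking whether that conversion threshold is reachable within your timeframe is simple arithmetic. A small sketch, with hypothetical traffic numbers:

```python
import math

def days_to_reach(target_conversions, daily_visitors, conversion_rate,
                  variations=2):
    """Estimate how many days until each variation collects the target
    number of conversions, assuming traffic is split evenly."""
    daily_per_variation = daily_visitors / variations * conversion_rate
    return math.ceil(target_conversions / daily_per_variation)

# Hypothetical site: 800 visitors/day, 2.5% conversion rate, 2 variations.
days_to_reach(250, 800, 0.025)  # -> 25 days
```

If the answer lands well beyond the runtime your team can afford, the test may need more traffic, fewer variations, or a conversion goal higher up the funnel.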
Dutch translation: Hoe Lang Moet Een A/B Test Draaien?
Theo is the founder of ConversionReview. He has been building and optimizing websites for 15+ years, and doing so with great success.
On top of his digital skills, Theo is also a trained psychologist and frequent speaker at events around the world.