A/B Test Results Reporting Checklist

Results

So, you’ve set up an A/B test, let it run for a while, and noticed a significant difference in conversion rates between two of your variations. Now what do you do?

Managing expectations

At this point it’s most important that you manage the expectations of the recipient of your results well. As outlined in the Conversion XL blog post “5 uncomfortable testing questions”, many clients are more than a bit curious about the results of your conversion experiments. However, failing to provide them with the conditions under which the experiment was run (and the implications this has for the results) can lead to great losses of time and money.

Poorly conducted tests

For example, last year Qubit estimated that poorly conducted A/B tests could be costing online retailers up to $13.6 billion per year. Brooks Bell even went as far as to include their last busted test as a KPI for their organization.

A/B test reporting checklist

Before you contemplate reporting the results of your latest successful (or failed) A/B test to your client or boss, run it past this checklist first. By verifying that you can check all the boxes (or at least the vast majority of them), you greatly decrease the chances of reporting a winner that actually wouldn’t be a winner if the experiment had been run properly.

Runtime

All weekdays

Make sure a test has run on every day of the week (Monday through Sunday) at least once. This ensures that any special effects particular days might have on your experiment (for example, higher conversions during the weekend) become visible. When your test doesn’t run on every day of the week, your results might become skewed, which could lead to incorrect conclusions.

If you want to be extra safe, keep a test running for at least 14 days to flatten out any anomalies on a given day of the week in one of the weekly runs.
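If you track test start and end dates, this is easy to sanity-check with a small script. Below is a minimal Python sketch (the dates and the 14-day threshold are illustrative, not part of any testing tool’s API) that verifies a test window spans at least two weeks and covers every day of the week:

```python
from datetime import date, timedelta

def runtime_covers_full_weeks(start: date, end: date, min_days: int = 14) -> bool:
    """Check that a test window spans at least `min_days` and includes
    every day of the week (Monday through Sunday) at least once."""
    total_days = (end - start).days + 1
    days_seen = {(start + timedelta(days=i)).weekday() for i in range(total_days)}
    return total_days >= min_days and len(days_seen) == 7

# Example: a test that ran from 1 to 16 June passes both checks.
print(runtime_covers_full_weeks(date(2015, 6, 1), date(2015, 6, 16)))  # True
```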

Complete buying cycle

Your visitors go through a process before making a purchase on your website. This process is called a buying cycle. While there are multiple ways of defining the steps included in a buying cycle, it often includes: awareness, consideration, intent and purchase.

By making sure your test runs for at least one full buying cycle, you’ll be measuring the effects of your changes fully. If, on the other hand, you stop your test before letting the visitors who are currently at the ‘intent’ step proceed, you won’t give them a chance to convert, and you can’t see how their behavior differs based on the changes you’ve made to the variations.

Not too long

Don’t just wait until your test results become significant. As estimated by Ton Wesseling, about 10% of your visitors will have deleted their testing cookies after two weeks of test runtime. This means that these visitors will re-enter your test and potentially skew the results.

Another benefit of not running tests too long is that it will likely lead to fewer simultaneous tests. Because there is always a chance that tests might influence each other (experts adopt an ‘it depends’ standpoint on this), it’s advised to keep the number of simultaneous tests on the same page to a minimum. Keep in mind that the impact of a so-called interaction effect is larger when you have a small number of visitors in each test.

Traffic

Excluded segments

It’s very important to make sure the segments (or audiences) that you’re including in your test closely mirror those of all your visitors. For instance, when you exclude mobile visitors from your test and find a big uplift, this uplift might well be negated fully by problems on mobile devices once the change goes live. The same logic applies to excluding visitors using certain browsers (such as Internet Explorer) from your experiment. Be sure to have a sample of users that is as representative as possible of your entire population of visitors.

Traffic source/medium mix

Another aspect of your traffic to keep in mind is your mix of traffic sources and mediums. When you run your experiment during a period with a skewed traffic mix (compared to the mix you normally have), this can skew your results as well. For instance, if you’re running AdWords campaigns, getting a lot of affiliate traffic or having a viral content hit, the experiment will show results based on those types of visitors.
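One rough way to check for a skewed mix is to compare the source/medium distribution during the test with your normal distribution. The sketch below uses a chi-square test; the visitor counts are made up purely for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical visitor counts per source/medium: normal period vs. test period.
baseline = [12000, 8000, 3000, 1000]   # organic, paid, affiliate, social
test_period = [9000, 6500, 6000, 900]  # affiliate spike during the test

chi2, p_value, _, _ = chi2_contingency([baseline, test_period])
if p_value < 0.05:
    print("Traffic mix during the test differs noticeably from the baseline "
          f"(p = {p_value:.4f}); interpret the results with care.")
```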

Internal IPs

Internal traffic is often irrelevant for testing and can greatly skew your results. Therefore you should take care to exclude as many known internal IP addresses as possible. This includes the IP addresses of your offices (including satellite offices) as well as your customer service centers and those of remote workers. Keep in mind that these IP addresses should be excluded in both your testing tool and your web analytics software in order to keep the data synchronized.

Setting up the filters

For most testing tools this process involves changing a setting in the configuration, which is pretty straightforward. Do keep in mind to label the IP addresses, so that you still know which IP address belongs to whom. For Google Analytics this process involves setting up a custom filter. The steps to do this are outlined in this Google Help article.
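In practice the exclusion lives in your testing tool and analytics configuration, but if you process visitor data yourself, a labelled list of internal ranges keeps things transparent. The ranges below are placeholders (documentation-only address blocks), not real office addresses:

```python
import ipaddress

# Hypothetical internal ranges: office, satellite office, customer service VPN.
INTERNAL_NETWORKS = {
    "HQ office": ipaddress.ip_network("203.0.113.0/28"),
    "Satellite office": ipaddress.ip_network("198.51.100.32/29"),
    "Support VPN": ipaddress.ip_network("192.0.2.0/27"),
}

def is_internal(visitor_ip: str) -> bool:
    """Return True if the visitor IP belongs to a labelled internal range."""
    ip = ipaddress.ip_address(visitor_ip)
    return any(ip in network for network in INTERNAL_NETWORKS.values())

print(is_internal("203.0.113.5"))    # True: HQ office
print(is_internal("93.184.216.34"))  # False: regular visitor
```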

Sales or holiday season

When you run A/B tests during holidays, visitors often behave differently. Therefore, whatever results you get from these tests can only be generalized to a similar group of visitors. Because visitor behavior often differs so radically from the baseline, it unfortunately isn’t advisable to generalize the findings to the rest of the year. In order to stay up-to-date with any holidays or sales activities, stay in close contact with the marketing department when planning your A/B tests.

Setup

Compatibility

In order to prevent skewing your results, it’s important that your control and your variations work equally well on all relevant devices and browsers. Keep in mind that this doesn’t necessarily mean the variations should be compatible with every device and browser, but they should work on the ones your visitors are actually using. You can find out which browsers and devices these are by going to the Google Analytics reports ‘Audience > Technology > Browser & OS’ and ‘Audience > Technology > Mobile > Overview’.

Goal URLs

Setting up the correct goal URLs for an A/B test can be challenging for some websites. However, when done incorrectly, the results of your experiments will be either useless or at the very least skewed. In order to make sure that your goals are set up correctly, you might want to pay a visit to both your developer and your web analyst. They have very likely dealt with goal completion URLs in the past and will be able to supply you with the correct regex values to set up your goals.
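As an illustration of what such a regex might look like, here is a small sketch; the confirmation path and pattern below are made up, so verify them against your own goal completion URLs together with your developer or analyst:

```python
import re

# Hypothetical goal pattern: any order confirmation page, with or without
# query parameters, e.g. /checkout/thank-you?order=1234
GOAL_PATTERN = re.compile(r"^/checkout/thank-you(\?.*)?$")

for url in ["/checkout/thank-you", "/checkout/thank-you?order=1234", "/checkout/cart"]:
    print(url, "->", bool(GOAL_PATTERN.match(url)))
```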

Loading time

Sometimes additional scripts or styling for a variation can slow down its loading. Considering that performance is an important factor in conversion optimization, you should make every effort to have the control and each of the variations load at the same speed. If the difference in loading time between variations is noticeable, it’s advised to either speed up the slow variation or slow down the others so that all of them load equally fast. You can test the load performance of a page using tools such as Pingdom or WebPageTest.
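If you just want a rough comparison before reaching for Pingdom or WebPageTest, you can time the HTML responses yourself. The sketch below only measures the server response (not full page rendering), and the variation URLs are hypothetical:

```python
import time
import requests

def rough_load_time(url: str, runs: int = 5) -> float:
    """Very rough average response time in seconds. This only measures the
    HTML response, not full rendering; use Pingdom or WebPageTest for
    realistic numbers."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Hypothetical control and variation URLs.
print(rough_load_time("https://example.com/landing?variant=control"))
print(rough_load_time("https://example.com/landing?variant=b"))
```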

Interactions

Be sure that you’re testing only one change at a time. For instance, if you’re running a test that changes both the color and the shape of a CTA button, this test should be set up as an MVT (read more about A/B or MVT). By running separate changes in multiple A/B tests, you know which change is causing what effect, instead of potentially looking at interactions between multiple changes.

Statistics

Enough significance

Statistical significance determines the probability that the difference between two variations happened by chance. A significance of 95% therefore means that there is a 5% chance that the difference in conversion rate between two variations has occurred by chance, rather than by a change in behavior of your visitors. Keep in mind that when you’re trying to find out whether a variation is a losing variation, you should use a two-sided (or two-tailed) significance test.
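As an illustration, here is a two-sided test of the difference between two conversion rates (a standard two-proportion z-test; the visitor and conversion counts below are made up):

```python
from math import sqrt
from scipy.stats import norm

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Hypothetical data: 100 conversions out of 5,000 visitors (control)
# versus 130 out of 5,000 (variation).
p = two_sided_p_value(100, 5000, 130, 5000)
print(f"p-value: {p:.4f}")  # about 0.045, significant at the 95% level
```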

Changes

Website

If one or more elements on the website have changed, this might influence the results of your A/B test. These changes include code releases, promotions or changes in the user interface. Be sure to keep yourself up-to-date regarding any of these changes in order to prevent unwelcome surprises when presenting the results of your A/B tests to your boss or client.

Experiment

Just a little change to the copy text used in the variation? A slight modification in color perhaps to match the box to your corporate identity? Fixing a bug that causes the experiment not to work in a particular browser? Each of these changes to running experiments can influence the results and should therefore be avoided.

Traffic allocation

Changing the traffic allocation for variations is an often-requested modification by clients and bosses alike. While changing the overall amount of traffic distributed to all variations is a powerful way to decrease the risk in A/B testing, changing the traffic allocation during runtime can really mess up your results. If your traffic mix or visitor behavior changes after the allocation is modified, a different mix of visitors will have flowed through each variation, which skews the numbers. If you want to change the traffic allocation of variations: stop the experiment, clone it, and set the new experiment to the traffic allocation you want.

Pausing or downtime

You don’t want to pause your A/B tests or have them stop running for any other reason. While the risk of pausing is smaller than the risk involved in changing the traffic allocation, you can still mess up your results. The risk is smaller because traffic is still distributed evenly across the variations (or, during the pause, not flowing at all). But in this case there are days’ worth of visitor data missing, during which visitors might have behaved differently.

Volume

Number of visitors and conversions

A general rule of thumb for the minimum number of visitors in each variation is 5,000, while the minimum number of conversions per variation is 100.

These numbers are large in order to prevent big differences based on a small number of visitors from having an overly large influence. With the advent of Bayesian analysis techniques such as Smart Stats (VWO) and Stats Engine (Optimizely), the exact sample size isn’t that relevant anymore.
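If you still want an order-of-magnitude estimate before starting, the classical frequentist sample-size formula gives a useful ballpark (note this is not how the Bayesian engines mentioned above work; the baseline rate and uplift below are made up):

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variation(baseline_rate, min_detectable_uplift,
                              alpha=0.05, power=0.80):
    """Classical two-sided sample-size estimate per variation for comparing
    two conversion rates."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_uplift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * pooled * (1 - pooled))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical: 2% baseline conversion rate, detecting a 30% relative uplift
# requires roughly 9,800 visitors per variation.
print(sample_size_per_variation(0.02, 0.30))
```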

Verified

Replicated

Have the results of this A/B test been replicated by another A/B test? Perhaps one or more of the factors outlined above weren’t set up perfectly, which caused your win to be a fluke. If you’re able to replicate the results (meaning a different timeframe, different experiments running next to it, different visitors, etc.), this makes the results far more robust. Especially for tests that would cost a lot of resources to implement, or would otherwise ‘shake things up’, be careful about reporting results that haven’t been replicated first.

Hypothesis

Are you sure the results you’re about to report are in accordance with the hypothesis you created before you started the A/B test? Be cautious when you’re about to report findings that only just showed up in the data and that you hadn’t anticipated.

Checked in Google Analytics

To rule out any problems with your testing tool, you might want to verify your finding in Google Analytics as well. Peep Laja has written an excellent article that teaches you how to analyze your results there.

Conclusion

Setting up A/B tests properly is difficult, and requires you to keep in mind a lot of different factors. Many of these same factors play a critical role in whether a winning variation is in fact a winning variation or a fluke. By using the checklist above, you can rule out the majority of causes that might lead you to report a winner that actually isn’t a winner, or perhaps even is a loser.


Dutch translation: A/B Test Resulaten Rapporteren Checklist


Author: Theo van der Zee

He is the founder of ConversionReview. He has been building and optimizing websites for 15+ years now, and doing so with great success.

On top of his digital skills, Theo is also a trained psychologist and frequent speaker at events around the world.