A/B Tests and the Pitfalls of Bad Data Quality

by Georgi Georgiev

Many business professionals think of statistics as somewhat of a magic box: no matter what they feed it, statistics can process it and spew out exactly what they need. Namely, it would give them the probability that they would be wrong or right in making a certain decision. Moreover, they assume that statistical magic can figure out what is wrong with their data and compensate for it automatically.

Neither of those assumptions is true. And that's a good thing, or our artificial-intelligence overlords would have already taken over!

Indeed, the quality of a statistical analysis is highly dependent on the quality of the data that goes into it. “Garbage in -- garbage out.” In this article, I’d like to expand on this adage in the context of A/B testing.

But first, what is an A/B test? A quick primer:

A/B testing, a.k.a. split testing, is the practice of presenting a different experience to a portion of your users (e.g. 50%) over a period of time in order to compare how those users perform on key metrics versus a control, non-exposed group. Users are usually assigned to the groups at random, which allows rigorous statistical methods to be applied to the data. The result is a set of error probabilities and an estimate of the true effect of the tested change, which helps ensure that it is indeed our change that moves the metric and not some external influence.
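To make the "error probabilities and estimates" part concrete, here is a minimal sketch in Python of a two-proportion z-test on conversion rates. It is just an illustration with invented counts, not the specific methodology behind any particular tool or test:

```python
# A minimal sketch (illustration only): a two-proportion z-test comparing
# conversion rates between control (A) and variant (B). Counts are invented.
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return the estimated lift (B minus A) and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided
    return p_b - p_a, p_value

lift, p = two_proportion_ztest(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"Estimated lift: {lift:.4f}, two-sided p-value: {p:.3f}")
```

In a real test you would also report a confidence interval for the lift and fix the sample size in advance, but the point here is simply that the statistics can only answer questions about the data it is given.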

Businesses often perform A/B tests in order to manage business risk when important decisions are at stake, to estimate the effect of certain actions and thus inform future steps, or to achieve both of the above. A/B tests are also often deployed as part of a conversion rate optimization effort.

The importance of data quality in A/B testing

Now, I’d like to review the role data quality plays in A/B testing by focusing on the so-called Duhem-Quine problems, or Duhemian problems. Named after Pierre Duhem and Willard Van Orman Quine, who made them prominent, a Duhemian problem arises because one cannot test a claim (hypothesis) in isolation: when testing one hypothesis we necessarily also test a number of auxiliary claims, a.k.a. background assumptions.

Duhemian problems are inherent to A/B testing. When we want to ask the question “does changing the user flow on this page improve the average revenue per user?” through an A/B test, we inherently get the answer to a question more like “does data X accord well with statistical model M?” If we denote the claim “the change results in improvement to average revenue per user” by H1 and unpack it, we see a whole host of assumptions that we are also testing: H1A, H1B, H1C and so on, many of which are related to the data gathering process.

Some examples of auxiliary hypotheses are: the data is gathered in a reasonably accurate manner; there is no bias in the mechanism producing the data towards A or B; there is no missing data; the assumptions of the statistical model hold, and so on, ad infinitum.

The issue, then, is this: after observing a certain outcome of an A/B test and assuming that our change did indeed improve the metric, how do we attribute the improvement to the main claim rather than to any of the auxiliary claims, each of which could also have produced the observed result?

In practice, the issue is even more evident when the A/B test outcome points to the change not working, especially if there was great hope that it would, and a correspondingly significant investment in preparing and executing the test. Everyone then starts asking: but can we trust the data, can we trust the statistics, what if this or that went wrong? These are all questions related to the auxiliary assumptions we made, usually implicitly, along the way. Making these explicit early on is a good way to prevent such questions from muddying business meetings and to avoid heated arguments.

How bad can bad data be?

The short answer is: as bad as you can imagine, and then some.

The reason is that once an A/B test has been conducted, the change it tested is the last suspect when things go south after implementation: well, we tested it and it was OK, right? Unless the issue is severe or really obvious, it can go undetected for a long time.

Let me give you an example from a test I helped design and analyze, involving lead acquisition. The goal was to improve the conversion rate of a website whose incoming leads could either go through a trial period and then convert to a paid subscription automatically unless they cancelled, or start on a paid subscription immediately. Some of the proposed variants were designed to push more people into the trial experience, the argument being that letting users experience the service before committing would significantly improve paid subscription rates.

The test was conducted, the data analyzed, conclusions drawn, and the winning variant implemented. Some time later, however, an issue was detected that had caused trial-to-paid conversions not to register at all. The problem originated at some point during the test, and it skewed the outcome despite the randomization, because it biased the results against the variants that pushed more people through the trial.

The data was re-analyzed using external information to correct the outcomes. The winning variant changed based on the corrected data, and the winner we had initially missed due to bad data actually led to a vast improvement in average revenue per user (ARPU).
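To illustrate the mechanism, here is a small hypothetical simulation (the rates and sample sizes are made up, not the actual test data) of how a bug that silently drops trial-to-paid conversions can flip the apparent winner when one variant pushes more users into the trial:

```python
# Hypothetical simulation (made-up rates, not the actual test data): a tracking
# bug that silently drops trial-to-paid conversions biases the measured paid
# conversion rate against the variant that pushes more users into the trial.
import random

random.seed(1)
N = 20_000  # users per variant

def measured_conversion_rate(trial_share, bug_drops_trial_conversions):
    """Return the conversion rate as the (possibly broken) tracking reports it."""
    conversions = 0
    for _ in range(N):
        if random.random() < trial_share:            # user enters the trial
            converted = random.random() < 0.12       # assumed trial-to-paid rate
            if converted and not bug_drops_trial_conversions:
                conversions += 1                     # counted only if tracking works
        else:
            conversions += random.random() < 0.06    # assumed direct-to-paid rate
    return conversions / N

for bug in (False, True):
    a = measured_conversion_rate(trial_share=0.2, bug_drops_trial_conversions=bug)
    b = measured_conversion_rate(trial_share=0.6, bug_drops_trial_conversions=bug)
    print(f"tracking bug: {bug}  A: {a:.2%}  B: {b:.2%}  apparent winner: {'B' if b > a else 'A'}")
```

With working tracking, the trial-heavy variant wins; with the bug in place, the exact same user behavior makes it look like the loser.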

Obviously, the initial decision was quite bad for the business. Instead of managing risk and providing reliable estimates, the A/B test, combined with a bad set of data, could have resulted in sunk costs and lost opportunities.

If you think this cannot happen to you because you have the best developers and the best processes in the world: maybe you are right, but the odds are stacked against you. If you are anything like the average customer I’ve interacted with over my career, I’m willing to bet you are wrong. I can count on the fingers of one hand the cases in which I was called to audit a tracking setup and found no issues, and I’ve done hundreds of such audits, many of them routine (i.e. not prompted by a particular concern over the data).

If you are running many A/B tests over a long period of time, the question is not whether you will experience a data quality issue, but how often you will encounter such issues and to what extent you can reduce their frequency.

What can we do to prevent and combat bad data?

While certain statistical methods (mis-specification tests) can help resolve many of the questions related to statistical assumptions, making sure the data is accurate is generally a separate task. Some methods, like anomaly detection, can help detect a data stream going bad during a test, but only if you are constantly monitoring the quality of that stream. Even then, such an approach is of no use if the stream was corrupted all along, hence the importance of initial quality assurance.
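As a rough illustration of what such monitoring can look like, here is a minimal sketch that assumes you can export a daily series of conversion counts per data stream. It is a simple trailing-window z-score check, not a production-grade anomaly detector, but it is enough to catch the typical "tracking suddenly broke" signature:

```python
# A minimal monitoring sketch, assuming a daily series of conversion counts per
# data stream. Flags days that deviate sharply from the trailing baseline.
from statistics import mean, stdev

def flag_anomalies(daily_counts, window=14, threshold=4.0):
    """Yield (day_index, count) for days far outside the trailing window."""
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue
        if abs(daily_counts[i] - mu) / sigma > threshold:
            yield i, daily_counts[i]

# Tracking silently breaks on day 16: counts collapse to near zero.
counts = [120, 131, 118, 127, 125, 130, 122, 119, 128, 133, 121, 126, 129, 124,
          123, 127, 2, 0, 1]
print(list(flag_anomalies(counts)))   # -> [(16, 2)]
```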

Given the above, you should not rely on your A/B testing vendor to detect any cases like the above example. To my knowledge, no vendor currently offers mis-specification tests or anomaly detection as part of their service.

My advice to marketers, user experience specialists and data analysts on ensuring data quality is three-fold:

  1. Rigorously test your tracking setup on deployment, both with manual and automated tests and by cross-checking the data against as many external references as possible (e.g. server logs, other tools that track the same or closely related metrics, etc.).
  2. Deploy continuous monitoring for tracking integrity. Anomaly detection can be your friend if you can afford to deploy it on at least the most important data streams.
  3. If you do statistical analysis on your data, such as in an A/B test, use mis-specification tests to ensure statistical validity. On some occasions such tests may also uncover other issues; one simple, related check is sketched below.
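As one concrete example of a check related to point 3, and to the "no bias towards A or B" assumption listed earlier, here is a sample ratio mismatch (SRM) test in Python using scipy; the counts are hypothetical:

```python
# One simple, hypothetical example of such a check: a sample ratio mismatch
# (SRM) test. If traffic was split 50/50 by design, the observed user counts
# per variant should not deviate from 50/50 beyond what chance allows.
from scipy.stats import chisquare

observed = [50_412, 49_120]                  # users actually recorded in A and B (made up)
expected = [sum(observed) / 2] * 2           # expected counts under a 50/50 split
stat, p_value = chisquare(observed, f_exp=expected)

if p_value < 0.001:
    print(f"Possible sample ratio mismatch (p = {p_value:.1e}): "
          "investigate randomization and tracking before trusting the test results.")
else:
    print(f"No evidence of a broken split (p = {p_value:.3f}).")
```

A failed check like this does not tell you what broke, only that one of the auxiliary assumptions probably did, and that is exactly the kind of signal you want before reading anything into the headline result.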

Whether you do this in-house, use a third-party service, or a combination of both will depend on your particular organization in terms of size, needs, and how data-dependent you are.

In conclusion: data can be great when you can trust it, and it can be truly bad when you shouldn’t trust it... but don’t know it. Know your data.

Georgi Georgiev is the founder of Analytics-Toolkit.com, a SaaS that enables users to automate Google Analytics-related tasks, as well as perform statistical analysis on A/B testing data. Follow his work in statistics, A/B testing, and all things data at blog.analytics-toolkit.com.

