How to do Frequentist A/B Testing

Akhil Prakash
19 min read · Nov 20, 2021

This article will cover the finer points of frequentist A/B testing and some gotchas that may be polluting your A/B experiment results.

Don’t change traffic allocation in the middle of a test

For simplicity’s sake, say you run an experiment for two days (the same math works for a longer time period). The experiment has two branches, with 100 users entering on day 1 and 100 users entering on day 2. On day 1, branch A gets 10 users and branch B gets 90; on day 2, branch A gets 80 users and branch B gets 20.

You are looking at the proportion of users that click on an ad. Say that, because of some seasonality effect, the population proportion of users that click on an ad in branch A is .4 on day 1 but .5 on day 2, while for branch B it is .5 on day 1 and .4 on day 2.

Assume there is no randomness or noise, just to make the math easier, so we don’t need to do a hypothesis test; the same math holds with noise.

branch A: 10 * .4 + 80 * .5 = 44 clicks. Since there are 90 users, you get 44/90 = .489

branch B: 90 * .5 + 20 * .4 = 53 clicks. Since there are 110 users, you get 53/110 = .482

vs

If you instead keep the traffic fixed, with branch A getting 10 users and branch B getting 90 users on both days:

branch A: 10 * .4 + 10 * .5 = 9 clicks, which is 9/20 = .45

branch B: 90 * .5 + 90 * .4 = 81 clicks, which is 81/180 = .45

If we keep the traffic the same, then we get the correct result: the click-through rate averaged over both days is the same in branch A and branch B. Note that we got the correct estimate for the click-through rate even though the traffic allocation is not 50/50.
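
To make the arithmetic above easy to replay, here is a minimal Python sketch of the same calculation (the helper name pooled_ctr is just for illustration):

```python
# Reproduce the example above: the same per-day click rates, pooled
# under a changed allocation vs. a fixed allocation.

def pooled_ctr(users_per_day, rate_per_day):
    """Pooled click-through rate across days for one branch."""
    clicks = sum(u * r for u, r in zip(users_per_day, rate_per_day))
    return clicks / sum(users_per_day)

# True per-day rates (the seasonality effect): branch A is .4 then .5,
# branch B is .5 then .4. Averaged over both days, both branches are .45.
rates_a, rates_b = [0.4, 0.5], [0.5, 0.4]

# Changed allocation: A gets 10 then 80 users, B gets 90 then 20.
print(pooled_ctr([10, 80], rates_a))   # 0.488... -> A looks better
print(pooled_ctr([90, 20], rates_b))   # 0.481...

# Fixed allocation: A gets 10 users both days, B gets 90 both days.
print(pooled_ctr([10, 10], rates_a))   # 0.45
print(pooled_ctr([90, 90], rates_b))   # 0.45
```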

Why did changing the traffic allocation make branch A look better than branch B?

First, note that the estimates from both branch A and branch B are higher than the true value of .45. That is because the pooled estimate is a weighted average of the two days, and it gives more weight to the .5 day than an unweighted average would. When the traffic allocation is not changed, we effectively take an unweighted average, since the number of users on each day of the experiment is the same. (If we had 100 users on day 1 and 200 users on day 2, we would want to put twice the weight on day 2, but that is because there are more users, not because the allocation changed.) For branch A, we put weight 8/9 on day 2 and 1/9 on day 1; for branch B, we put weight 9/11 on day 1 and 2/11 on day 2. Since 8/9 > 9/11, branch A’s .5 day gets more weight than branch B’s .5 day, which pushes branch A’s estimate higher.

The root cause is the seasonality effect. If you know one branch dominates the other over time (branch B is always greater than branch A), then changing traffic in the middle of the test is not as bad, but it is still bad. This is going down the multi-armed bandit route, where you give the better branch more traffic as you learn which branch is better, so you are effectively changing the traffic every single day. But multi-armed bandits, at least in their simplest form, assume there is no seasonality. Another cause could be that you are measuring a very rare event, so just by random chance branch B may do better than branch A on one day and branch A may do better on another day, even though overall branch B is better. In that case there is no seasonality issue; you just don’t have enough data to estimate the effect every day.

If you change the traffic allocation in the middle of an experiment, you can either throw away the data from before the change or run the experiment longer than planned. A third option is to compute results on the whole experiment timeframe and hope that, by running the experiment longer, you down-weight the data from before the allocation change; you don’t completely throw that data away, you just put less emphasis on it.

If we go with running the experiment longer after the traffic allocation change, we can use the example to see how long it takes for the estimates from branch A and branch B to match. This assumes the click-through rate keeps alternating between .4 and .5 for branch A and between .5 and .4 for branch B.

After day 3:
branch A: (10 * .4 + 80 * .5 + 80 * .4)/170 = .447
branch B: (90 * .5 + 20 * .4 + 20 * .5)/130 = .485

After day 4:
branch A: (10 * .4 + 80 * .5 + 80 * .4 + 80 * .5)/250 = .464
branch B: (90 * .5 + 20 * .4 + 20 * .5 + 20 * .4)/150 = .473

After day 5:
branch A: (10 * .4 + 80 * .5 + 80 * .4 + 80 * .5 + 80 * .4)/330 = .448
branch B: (90 * .5 + 20 * .4 + 20 * .5 + 20 * .4 + 20 * .5)/170 = .476

After day 6:
branch A: (10 * .4 + 80 * .5 + 80 * .4 + 80 * .5 + 80 * .4 + 80 * .5)/410 = .459
branch B: (90 * .5 + 20 * .4 + 20 * .5 + 20 * .4 + 20 * .5 + 20 * .4)/190 = .468

After 300 days, the estimate for branch A is 0.4501463 and the estimate for branch B is 0.4505766. After 100 days, the estimates are around .450 and .452. In other words, we need to wait a really long time for the estimates to converge to the truth.
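
If you want to check these numbers or extend them further, here is a quick Python sketch of the same running calculation (the function name is just for illustration):

```python
# After the allocation change, branch A keeps 80 users/day and branch B
# keeps 20 users/day, and the per-day rates keep alternating between .4
# and .5 as described above.

def running_estimates(n_days):
    clicks_a, users_a = 10 * 0.4, 10           # day 1: A has 10 users at .4
    clicks_b, users_b = 90 * 0.5, 90           # day 1: B has 90 users at .5
    for day in range(2, n_days + 1):
        rate_a = 0.5 if day % 2 == 0 else 0.4  # A: .5 on even days, .4 on odd
        rate_b = 0.4 if day % 2 == 0 else 0.5  # B: the mirror image
        clicks_a += 80 * rate_a
        users_a += 80
        clicks_b += 20 * rate_b
        users_b += 20
    return clicks_a / users_a, clicks_b / users_b

print(running_estimates(100))  # (~0.450, ~0.452)
print(running_estimates(300))  # (~0.4501, ~0.4506)
```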

Obviously, this is a very contrived example, and if the traffic allocation change were not as large we would not see such large differences. One case where you may see large traffic allocation changes is a rollout. Here we will consider two kinds of rollouts: 1. you want to do a slow rollout to production and reach 100% traffic; 2. you want to roll out the treatment branch of an experiment from, say, 0% to 5%. For the first, you have either already done the A/B test and want to launch the winning branch to more traffic, or you have to roll the feature out to production for some other business reason without running an A/B test (security compliance, legal compliance, etc.). In this case you are not comparing two branches, so it’s okay; you are mostly monitoring adoption rates or making sure there are no bugs. For the second, you believe there may be bugs in the experiment, or you are unsure of the user impact, and you want to be careful and try it out on a very small set of users. I would recommend throwing away the data from the period while you are increasing the traffic. Since the traffic was very small, you are not throwing away much data. You should try to make the rollout as fast as possible to minimize the data that is thrown away.

Don’t add or delete branches of an experiment while the experiment is running

Let’s say you are running an experiment with a holdout branch, or some branch that may not be revenue positive. The business person wants you to remove that branch and move its traffic to another branch because the business needs more revenue. This is a special case of changing traffic allocation, discussed above. Instead of changing the traffic allocation, I would change the treatment on the branch and then ignore that branch in the analysis. This meets the business need without messing up the experiment analysis.

If you add a branch, you need to be very careful. Say the treatment takes time to deliver in full (for example, you want to give a user two discounts with one week between them). Then a user might receive the first discount, change branches, and receive the second discount in a way they should not have. How do you deal with this? I would throw away the data from users who did not get the full treatment of the branch. You could use survival analysis methods and treat a user who leaves the branch as censored data, but that may get more complicated than it needs to be. Also, to add a branch you need to take traffic from other branches, so everything described in the previous section applies as well.

Statistical significance vs. practical significance

When you compare the control to the treatment on a metric and do a hypothesis test, you determine whether the difference is statistically significant. However, you also need to think about practical significance. If you run an experiment for long enough, most metrics you measure will become statistically significant, because as the sample size increases the standard error of your metric goes down and p-values become small. Another way of saying this is that power increases with sample size, and with very high power you can detect any minor difference. This is why we need practical significance. Say the click-through rate is .3567 in the control and .357 in the treatment: does that .0003 difference mean a lot? That is more a science and business decision than a statistics decision. Statistical significance only tells us whether we would expect to see this difference under pure randomness; almost any change we make will move the click-through rate by some amount, so the question is whether the movement is large enough to matter.
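
To make that concrete, here is a hedged sketch using a standard two-proportion z-test: the same .3567 vs. .357 rates are nowhere near significant at a modest sample size but become highly significant at a huge one. The sample sizes are made up for illustration.

```python
# A tiny difference in rates becomes statistically significant once the
# sample is large enough, even if it is not practically significant.
from math import sqrt
from scipy.stats import norm

def two_prop_z_pvalue(p1, p2, n1, n2):
    """Two-sided p-value for H0: the two proportions are equal."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * norm.sf(abs(z))

# Same observed rates, .3567 vs .357, at two different sample sizes.
print(two_prop_z_pvalue(0.3567, 0.357, 10_000, 10_000))            # ~0.96, not significant
print(two_prop_z_pvalue(0.3567, 0.357, 100_000_000, 100_000_000))  # far below .05
```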

There are four cases, depending on whether the result is statistically significant and whether it is practically significant:

  1. The result is statistically significant and practically significant — you are good. You can launch this experiment to production. The lift you saw was not due to randomness and was big enough that you are happy with it from a scientific and business perspective.
  2. The result is statistically significant but not practically significant — most likely you had a larger sample size than necessary, so the minimum detectable difference was smaller than your practical significance threshold. You probably would not call this a winning experiment unless there are other good side effects.
  3. The result is not statistically significant but is practically significant — you most likely did not run your experiment long enough; you set your minimum detectable difference larger than your practical significance threshold. You should run the test longer, but then you need to worry about running your test until significance. Since you have peeked at the data, you need to correct for that with a multiple hypothesis testing correction: if you had gotten a statistically significant result you would have stopped the test, but since you did not, you ran it longer, so you had two chances for the result to come out statistically significant.
  4. The result is not statistically significant and not practically significant — you don’t want to push this to production. You did not see a change in the metric large enough that it could not have come from randomness alone.

Define metrics (OEC, guardrail metrics, counter metrics) and a decision boundary

There are generally three types of metrics: 1. overall evaluation criteria (OEC), 2. guardrail metrics, and 3. counter metrics. The overall evaluation criteria are the main metrics you are trying to move; examples are click-through rate, revenue, and user happiness (however you want to measure that). Guardrail metrics capture bad user experiences, like latency, unsubscription rates, and crashes. Counter metrics let you trade off short-term and long-term goals: if you are trying to increase subscribers by giving everyone free subscriptions, revenue is a natural counter metric, because you want to make sure it does not decrease too much.

Before you run an experiment, you should decide what the minimum detectable difference is for each OEC metric. Based on that, define what sample size you need.
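
As a sketch of that step, here is the standard sample-size formula for comparing two proportions; the baseline rate, minimum detectable difference, alpha, and power shown are illustrative values, not numbers from this article.

```python
# Sample size per branch for detecting a lift of `mde` over a baseline
# proportion, using the usual normal-approximation formula.
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_branch(p_baseline, mde, alpha=0.05, power=0.8):
    """Users needed in each branch to detect a lift of `mde` over `p_baseline`."""
    p1, p2 = p_baseline, p_baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / mde ** 2)

# e.g. baseline CTR of 35%, smallest lift we care about is 1 percentage point
print(sample_size_per_branch(0.35, 0.01))  # roughly 36,000 users per branch
```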

When you do the experiment analysis, some metrics will go up and some will go down; if all of your metrics move in the same direction you are very lucky, and the analysis is easy. To handle the mixed case, decide before the experiment runs what the decision should be if metric A increases by x and metric B decreases by y. In other words, define a decision boundary, for example “launch the experiment if x + y > 3.” Making this decision before launch gives you an unbiased decision, versus what is normally done: look at the data and then argue “this metric went down by 3, this metric went up by 10, so we launch.” It also pins down what a successful experiment looks like, instead of just saying a successful experiment is one where click-through rate increases, and it gives everyone (the product manager, the engineer, the data scientist) an objective launch criterion they have agreed to. A common problem is that an engineer builds a new feature, the feature gets A/B tested, the data scientist does the analysis, and the engineer is upset if the feature does not go to production because they believe the data scientist is making subjective judgments, especially if a coworker’s experiment launched even though some metric went down. Defining the decision boundary up front prevents this, because everyone has already agreed on what a successful experiment looks like.
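
As a purely hypothetical illustration of what a pre-registered decision rule might look like in code (the metric names and the threshold just mirror the x + y > 3 example above):

```python
# A pre-registered launch rule, written down and agreed to by the PM,
# engineer, and data scientist before the experiment starts.

def launch_decision(ctr_lift_pct, revenue_lift_pct):
    """Ship only if the combined lift clears the agreed threshold."""
    combined = ctr_lift_pct + revenue_lift_pct  # x + y in the text
    return combined > 3                         # launch iff x + y > 3

print(launch_decision(ctr_lift_pct=10, revenue_lift_pct=-3))  # True: launch
print(launch_decision(ctr_lift_pct=1, revenue_lift_pct=1))    # False: don't launch
```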

Now, there are a couple of questions: 1. how do you estimate the expected lift before the experiment has been run? 2. what do you do if the experiment results land very close to the decision boundary? For the first, if you are testing a machine learning model or some heuristic, you have data you can backtest on to get a rough estimate of the lift. You may also have a user study about how much users want a certain feature, or previous experiments in this domain that tell you the rough lift to expect.

For the second, this happens all the time when you take a continuous metric and make a binary decision; you have to define a threshold somewhere. The same question comes up in intro stats classes: what happens if the p-value is .04999 versus .05001? One answer is to report the actual p-value rather than just whether it was above or below the alpha level. Another is that you may know you have another feature coming that will increase the metric this experiment decreased, in which case you may be okay with launching. Essentially, there is no purely scientific way to do this; you present the positives and negatives of the experiment and make an argument about which outweighs the other.

Don’t run the test until statistical significance

If you run your test until significance, you will have a lot of false positives, because you have effectively done multiple hypothesis tests without correcting for them. You may also have to wait a very long time for some tests to reach significance, and the result will probably not be practically significant anyway.
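
Here is a small simulation, not from the article, that shows the problem: both branches have the same true click-through rate, but we peek after every batch of users and stop the moment p < .05. The fraction of A/A tests that “find a winner” ends up well above the nominal 5%.

```python
# Simulate an A/A test with repeated peeking and early stopping.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeks_until_significant(p_true=0.3, batch=1000, n_peeks=20, alpha=0.05):
    """Both branches share the same true rate; peek after every batch.
    Returns True if any peek looked 'statistically significant'."""
    clicks_a = clicks_b = n = 0
    for _ in range(n_peeks):
        clicks_a += rng.binomial(batch, p_true)
        clicks_b += rng.binomial(batch, p_true)
        n += batch
        p1, p2 = clicks_a / n, clicks_b / n
        pooled = (clicks_a + clicks_b) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if 2 * norm.sf(abs(p1 - p2) / se) < alpha:
            return True  # we would have stopped and declared a winner
    return False

false_positive_rate = np.mean([peeks_until_significant() for _ in range(2000)])
print(false_positive_rate)  # well above the nominal 0.05
```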

Another thing you may see is that your test is doing well but has not reached statistical significance. Your manager wants to roll the test out to production quickly, so they want the test to reach statistical significance faster and suggest increasing the traffic to the winning variant. This is exactly the situation covered in the section on not changing traffic allocation.

Sometimes you may actually want to run a test for longer than the sample size calculation says you need for statistical significance. One reason is that you were too optimistic about the lift and set the minimum detectable difference too high. The more interesting reason is seasonality: say you have enough traffic to reach statistical significance after 2 days, but there is a weekly pattern in how people use your app. Then you want to run your experiment for at least one week to cover all the different types of users you may see. If you only ran the test from Monday to Wednesday, you could have a biased sample; for example, if older users mostly use the app on Sundays, you would not include enough older users to estimate the true effect of the experiment.

Don’t do tests in specific countries and expect results to generalize

You do not want to test in specific countries and then assume the results will generalize, because every country is different. Instead, test with some percentage of your total global traffic. It is very difficult to pick a group of countries that is representative of the global population, so don’t try. Just because you see positive results in the US does not mean you will see positive results in Mexico.

Also, don’t do rollouts by country and then compute experiment metrics in the standard way. Your data will be biased because more of it comes from the first country you rolled out to; in other words, the distribution of countries in your data does not match the true population distribution of countries.

Don’t deploy changes that will affect the experiment while it is running, unless you are fixing bugs in the experiment

Basically, this is saying don’t change the experiment while it is running. If you fix a bug in an experiment, you probably want to throw out all the data from before the bug fix, unless the bug affected a very small population. You also want to understand your ecosystem and how a change in one place can affect the experiment elsewhere. For example, suppose your experiment gives a user a discount on a subscription on day 10, and there is another campaign in production at 100% traffic that gives users a discount on day 3. If you change the targeting of that day-3 campaign, then 7 days later a different set of users will enter your experiment, which will mess up your results, so be careful. Similarly, do not run two concurrent experiments that conflict with each other, for example a test that changes the price of a subscription and a discounting experiment on the same product. Ideally, users are randomly assigned in both experiments and the random assignment uses a different seed or salt in each, so that assignment in one experiment is independent of assignment in the other and every combination of branches gets a proportional share of users. If you are running an experiment on the same part of the product as someone else, you can run your experiments independently, report the results, and then do a joint experiment that combines both features and tests them together. In the joint experiment you will generally see a lift that is smaller than the sum of the two individual lifts, because there is some overlap.
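
One common way to get that independence (a sketch, with made-up salt strings) is to hash the user id together with a per-experiment salt:

```python
# Salted hashing for branch assignment: which branch a user lands in for
# experiment 1 tells you nothing about their branch in experiment 2.
import hashlib

def assign_branch(user_id: str, experiment_salt: str, n_branches: int = 2) -> int:
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_branches

user = "user_12345"
print(assign_branch(user, "pricing_test_2021"))   # branch in experiment 1
print(assign_branch(user, "discount_test_2021"))  # branch in experiment 2, independent
```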

If you have to remove countries in the middle of a test, remove all data from those countries

Say you are running an experiment and a business person tells you that you need to remove some countries from the test, either because you cannot run an ML model in those countries for legal reasons or because someone needs to run a very important test in a country where you are already testing. One way to avoid the second issue is to plan your experiments and their traffic in advance and get agreement from all partner teams. This situation falls under both “don’t do tests in specific countries” and “don’t change traffic allocation.” There will be some users in the removed countries who have not gotten the full treatment, or whom you have not observed long enough to calculate the metrics you want. One option is to treat these users as censored data and use survival analysis methods. Depending on the situation, if you just need to wait some number of days (say you are calculating clicks in the next 7 days) but have only waited 3 days for a user, you can move the user to the other experiment, wait 4 more days, and still calculate their clicks over the full 7-day window; this is only okay if you believe that being in the new experiment for the remaining 4 days does not affect the number of clicks the user makes. Because of the biases discussed above, I would advise doing two separate analyses: 1. one on the countries you are excluding, and 2. one on the countries that were included for the whole test, using all of their data from the beginning of the experiment. Doing this lets you understand what happened at a global level while throwing away the minimal amount of data.
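
As a sketch of the survival-analysis option, assuming the lifelines package is available and using made-up data, you could fit a Kaplan-Meier curve that treats users who were pulled out early as right-censored:

```python
# Users removed before their 7-day window ended are right-censored: we
# know they had not converted up to the day they left, but not beyond.
from lifelines import KaplanMeierFitter

durations = [7, 7, 3, 5, 7, 2, 7, 6]       # days each user was observed (illustrative)
event_observed = [1, 0, 0, 1, 1, 0, 0, 0]  # 1 = clicked/converted, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)
print(kmf.survival_function_)  # survival curve that accounts for the censoring
```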

Correct for multiple hypothesis testing

Often you have a control branch and multiple variants tested against it, for example multiple heuristics, or a machine learning model at different thresholds (you test multiple thresholds because there is some tradeoff between two metrics you want to understand, or you want to trace out the PR or ROC curve). For simplicity’s sake, assume you have 3 branches: 1 control and 2 heuristics, and you will be happy if either heuristic beats the control. The following reasoning is from the frequentist hypothesis testing viewpoint. If the two heuristics were exactly the same as the control and we did not do a statistical test, then by symmetry we would pick one of the heuristics as the winner 2/3 of the time. With 2 branches we would have picked the heuristic 1/2 of the time; with n+1 branches, we would pick a heuristic n/(n+1) of the time. One might conclude that you should run an experiment with as many branches as possible because that gives you many chances to beat the control and push the experiment into production. This is fundamentally wrong, and it is the whole premise of multiple hypothesis testing.

When you do a hypothesis test for an experiment, you normally say you are okay with a 5% false-positive rate and check whether the treatment differs from the control by an amount too large to be random noise. Here you have 2 hypotheses: 1. control vs. treatment 1, and 2. control vs. treatment 2. Under the null hypothesis the p-value is uniformly distributed, so if the two tests are independent, the chance of rejecting at least one null when both nulls are true is 1 − .95² = .0975 > .05. This makes us very sad: we thought our false-positive rate was .05, but it is actually .0975. Another way of putting it: if you run 100 hypothesis tests at a false-positive rate of .05 when nothing is really different, on average 5 of them will come out statistically significant just by chance, and you will draw the wrong conclusion. There are several ways to correct for multiple hypothesis testing, but we won’t get into those methods here. The general point is to define beforehand which branches you want to compare (maybe you don’t want every pairwise comparison) and to adjust the required sample size for the number of tests you are doing. If you do not, you will select a winning variant, roll it out to production, and then see worse results in production for the exact same treatment than you saw in the experiment. You will be confused and suspect a difference between the experiment and production code, or some seasonality effect, when really it was random noise in the experiment that led you to the wrong decision.
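
Here is the arithmetic above in a few lines, along with the simplest correction (Bonferroni), which just tests each comparison at alpha divided by the number of tests:

```python
# Family-wise false-positive rate for m independent tests, before and
# after a Bonferroni correction.
alpha, m = 0.05, 2

fwer_uncorrected = 1 - (1 - alpha) ** m
print(fwer_uncorrected)          # 0.0975, not the .05 we wanted

bonferroni_alpha = alpha / m     # test each comparison at .025 instead
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** m
print(fwer_bonferroni)           # ~0.049, back under .05
```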

How to end an experiment when the treatment takes time to deliver in full

Say you have run your experiment until you reached the sample size you decided on before the test started. Now what do you do? How do you end the experiment? In the case where you give the treatment all at once (for example, you show a green button or a blue button and want to know whether users click it), you can simply give all the traffic to the winning branch. The other case is where the treatment takes time to deliver in full (for example, a branch gives out two discounts with 7 days between them). In this case, when you reach the sample size threshold (counting only users who got both discounts), there are still some users who have not received the full treatment: they got the first discount and are waiting for the second. What do you do with these users? You can run the analysis on the users who received the full treatment, and if the winning variant is the two-discount branch, you roll it out and everything carries on as usual. If the winning variant is not the two-discount branch, you have a decision to make: either keep giving those in-flight users their second discount while routing all new users to the winning variant, or move the users in the two-discount branch to the winning variant. If the winning variant and the two-discount branch are not compatible (say the winning variant only gives out one discount), then moving those users does not really make sense, and doing so may make the winning variant look worse than expected while these “polluted” users are in it.

Basically, you want to be clear with the PM and business stakeholders that while an experiment is running you should not change things unless you really know what you are doing and what assumptions you are making. Otherwise, you may end up with a negative result and not know whether the feature was unhelpful for users or the mid-experiment change caused the issue. Put science before business here; otherwise you will end up unable to answer the questions you care about, or forced into even more complicated methods and assumptions.
