How to Measure A/B Test Performance in Display Advertising Campaigns

The purpose of our machine learning systems is to target the most appropriate audiences for our advertisers, and as we improve these algorithms we typically run into the issues discussed in the sections below: measuring the effectiveness of ad targeting, separating true improvement from random luck, and identifying campaigns where the new algorithm performs worse.

Measuring the Effectiveness of Ad Targeting

Let me illustrate these concerns with an example. Suppose we have developed a new prediction algorithm that estimates the probability that an Internet user will buy a product after seeing an ad for a campaign, and we want to test whether it improves the relevance of our ad targeting. We refer to this new algorithm as Algorithm A and to our current production algorithm as Algorithm B.

The cost per acquisition (CPA) of a display advertising campaign is a commonly used indicator of its effectiveness.
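In its simplest form, consistent with the definition below, it is just total spend divided by the number of conversions:

\[ \text{CPA} = \frac{\text{total campaign spend}}{\text{number of conversions}} \]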

Here, an acquisition (or “conversion”) is an online purchase of one of the advertiser’s products, and CPA measures how much we need to pay to obtain a certain number of conversions. As our algorithms improve, we should be able to get more conversions for the same cost, and our CPA should decrease.

We can run an A/B split test on a specific campaign to see which algorithm performs better. For this purpose, one metric worth looking at is the CPA lift.
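Given the interpretation that follows, a natural way to define the lift is as the ratio of the production algorithm’s CPA to the new algorithm’s CPA:

\[ \text{CPA lift} = \frac{\text{CPA}_{B}}{\text{CPA}_{A}} \]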

If the lift is equal to one, then the two algorithms have acquired the same number of customers for the campaign, assuming the budgets are identical. If the lift is greater than one, then Algorithm A has gained more customers and vice versa.

We also do not have to limit ourselves to running our A/B test with a single campaign.

Using lift measurement, we can determine how many campaigns performed better when using Algorithm A, how many performed worse, and how many were neutral. In addition, we can calculate the average or median lift for all our campaigns.
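As a minimal sketch of this aggregation, assuming we already have one measured lift value per campaign (the numbers and the 5% “neutral” band below are purely illustrative):

```python
import numpy as np

# Hypothetical per-campaign lift values (CPA_B / CPA_A), one per campaign.
lifts = np.array([1.10, 0.95, 1.40, 1.02, 0.80, 1.25, 1.00, 1.15])

# Treat lifts within +/-5% of 1.0 as "neutral"; the threshold is an arbitrary choice.
neutral_band = 0.05
better = np.sum(lifts > 1 + neutral_band)   # Algorithm A clearly better
worse = np.sum(lifts < 1 - neutral_band)    # Algorithm A clearly worse
neutral = len(lifts) - better - worse

print(f"better: {better}, worse: {worse}, neutral: {neutral}")
print(f"average lift: {lifts.mean():.2f}, median lift: {np.median(lifts):.2f}")
```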

True Improvement vs. Random Luck

Assume we conducted the aforementioned A/B test across numerous campaigns and found that our average lift is 1.3, meaning that our new Algorithm A outperforms our old Algorithm B by 30% on average. However, is this improvement genuine, or is it the result of random fluctuations in real-time bidding markets? Would we see the same 30% improvement if we repeated the experiment under identical conditions?

A typical approach to this question is hypothesis testing. Our null hypothesis is that Algorithms A and B have the same average performance, and our two-tailed alternative hypothesis is that their average performance differs.

If we assume that the campaign lifts are independent and identically distributed random variables, we can use a simple t-test to produce confidence intervals for a given significance level.
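A minimal sketch of such a test, assuming one lift value per campaign and using made-up numbers: under the null hypothesis the mean lift equals 1, so a one-sample t-test against 1 applies.

```python
import numpy as np
from scipy import stats

# Hypothetical per-campaign lifts; under the null hypothesis their mean is 1.
lifts = np.array([1.10, 0.95, 1.40, 1.02, 0.80, 1.25, 1.00, 1.15])

# Two-tailed one-sample t-test: is the mean lift different from 1?
t_stat, p_value = stats.ttest_1samp(lifts, popmean=1.0)

# 95% confidence interval for the mean lift under the t-distribution.
mean = lifts.mean()
sem = stats.sem(lifts)
ci_low, ci_high = stats.t.interval(0.95, df=len(lifts) - 1, loc=mean, scale=sem)

print(f"mean lift = {mean:.3f}, t = {t_stat:.3f}, p = {p_value:.3f}")
print(f"95% CI for mean lift: ({ci_low:.3f}, {ci_high:.3f})")
```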

The “identically distributed” assumption, however, is highly unlikely to hold. Distinct advertising campaigns have different budgets, impression limits, and behaviors in general. Consider, for example, two very different advertisers: one selling pizza and one selling cars, where one campaign’s budget buys roughly a hundred times more impressions than the other’s.

The spending behavior of the two campaigns will obviously differ, since one will be shown about a hundred times more often than the other. Furthermore, because pizzas are inexpensive and customers may buy several times per month, the pizza campaign will almost certainly have a higher conversion rate than the automobile campaign.

One way to drop the identically-distributed assumption is to estimate the actual lift distribution of each campaign numerically. This not only lets us estimate confidence intervals for each campaign’s lift, but also lets us estimate the distribution of the average lift across campaigns.

One possible set of steps to accomplish this is sketched below.
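As a minimal sketch, assuming per-day spend and conversion counts are available for each campaign under both algorithms (all names and numbers below are made up), a bootstrap over days yields an empirical lift distribution for each campaign and, by averaging across campaigns within each bootstrap iteration, a distribution of the average lift:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily data per campaign: spend and conversions for each arm.
# In practice these would come from the campaign logs.
campaigns = {
    "campaign_1": {
        "spend_a": np.array([120.0, 90.0, 150.0, 110.0, 95.0]),
        "conv_a":  np.array([6, 4, 8, 5, 4]),
        "spend_b": np.array([115.0, 100.0, 140.0, 105.0, 90.0]),
        "conv_b":  np.array([5, 4, 6, 4, 3]),
    },
    "campaign_2": {
        "spend_a": np.array([40.0, 55.0, 60.0, 45.0, 50.0]),
        "conv_a":  np.array([2, 3, 3, 1, 2]),
        "spend_b": np.array([42.0, 50.0, 65.0, 48.0, 52.0]),
        "conv_b":  np.array([2, 2, 3, 2, 2]),
    },
}

def bootstrap_lifts(data, n_boot=10_000):
    """Resample days with replacement and recompute lift = CPA_B / CPA_A."""
    n_days = len(data["spend_a"])
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n_days, size=n_days)
        cpa_a = data["spend_a"][idx].sum() / max(data["conv_a"][idx].sum(), 1)
        cpa_b = data["spend_b"][idx].sum() / max(data["conv_b"][idx].sum(), 1)
        lifts[i] = cpa_b / cpa_a
    return lifts

# Per-campaign lift distributions and confidence intervals.
per_campaign = {name: bootstrap_lifts(d) for name, d in campaigns.items()}
for name, lifts in per_campaign.items():
    lo, hi = np.percentile(lifts, [2.5, 97.5])
    print(f"{name}: median lift {np.median(lifts):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")

# Distribution of the average lift across campaigns (averaged per bootstrap iteration).
avg_lift = np.mean(np.column_stack(list(per_campaign.values())), axis=1)
lo, hi = np.percentile(avg_lift, [2.5, 97.5])
print(f"average lift: {avg_lift.mean():.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```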

Once we have the distribution of the average lift, we can compute the mean as well as confidence intervals and use these to determine whether our average lift is statistically significant.

New Algorithm, New Issues

However, a statistically significant improvement in average lift over the current production algorithm is not, by itself, enough to promote the new algorithm to production.

We also want to identify campaigns that performed much worse than expected when using the new algorithm. Such campaigns may require a thorough examination to see what went wrong and how the new model may be improved. 

A critical point to remember here is that investigating those campaigns is “expensive” in terms of resources (time, human intervention, etc.). Thus we want to keep the number of “false positives” as low as possible while still surfacing enough of the genuinely underperforming campaigns.

As the number of campaigns increases, the previous strategy of using confidence intervals and hypothesis testing for each campaign independently breaks down.

Under the null hypothesis (the experiment performs the same as production), we still expect a certain fraction of campaigns to appear to significantly outperform or underperform purely by chance.

Furthermore, because this approach relies solely on individual confidence intervals, that fraction stays roughly constant as the number of campaigns grows, which means the absolute number of “false positives”, campaigns that appear to be underperforming when they are not, keeps growing as well. As our firm grows and we work with more and more clients, this effect becomes more and more visible. So how can we correct for it?

Alternative Techniques

Here’s an alternative method that uses a multiple hypothesis testing framework. First, for each campaign, we compare the lift measured for the experimental model against that campaign’s lift distribution under the null hypothesis (the experiment is identical to production).

We can then compute a p-value for each campaign: the probability of observing a lift at least as extreme as the one measured, assuming the null hypothesis is true.
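As a minimal sketch, assuming we already have an empirical null distribution of lift for a campaign (for example, from a bootstrap run under the null), a one-sided p-value for underperformance can be computed as the fraction of null samples at least as extreme as the observed lift. The function name and the data below are purely illustrative:

```python
import numpy as np

def underperformance_p_value(observed_lift, null_lifts):
    """One-sided p-value: how often does the null produce a lift this low or lower?

    observed_lift -- lift measured for the campaign (CPA_B / CPA_A)
    null_lifts    -- samples of lift simulated under the null hypothesis
    """
    null_lifts = np.asarray(null_lifts)
    # Add 1 to numerator and denominator so the p-value is never exactly zero.
    return (np.sum(null_lifts <= observed_lift) + 1) / (len(null_lifts) + 1)

# Example: a campaign with measured lift 0.72 and a simulated null distribution.
rng = np.random.default_rng(1)
null_samples = rng.normal(loc=1.0, scale=0.15, size=10_000)  # illustrative only
print(f"p-value: {underperformance_p_value(0.72, null_samples):.4f}")
```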

We might take numerous approaches to identifying a group of “major underperformers.” The most basic would be to use the Bonferroni correction, which classifies as “significant” all underperforming campaigns with p-values less than α/N, where α is the required significance level if we were only looking at a single campaign and N is the number of campaigns.

This ensures that the Family-Wise Error Rate (FWER, the probability of at least one false positive overall) is less than α. In practice, this is extremely conservative and would produce a large number of false negatives (i.e., it would flag too few underperformers). There are a few other ways to control the FWER, but the end result is still too conservative for our purposes.

Rather than requiring near-certainty that there are no false positives, a less stringent approach would be to require that not too many of the flagged campaigns are false positives.

To put it another way, we can choose an acceptable false discovery rate (FDR, the expected proportion of “false positives” among the campaigns flagged). For example, with FDR = 0.1 we expect about 90% of the flagged campaigns to be genuine underperformers. The Benjamini–Hochberg (or Benjamini–Yekutieli) procedure can then be used to identify such campaigns for further investigation.
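As a minimal sketch, assuming we have one one-sided p-value per campaign (the numbers below are made up), here is how a Bonferroni selection compares with a Benjamini–Hochberg selection; in practice a library routine such as multipletests from statsmodels can do the same:

```python
import numpy as np

# Hypothetical one-sided p-values, one per campaign (smaller = stronger
# evidence of underperformance under the new algorithm).
p_values = np.array([0.001, 0.004, 0.019, 0.030, 0.041, 0.20, 0.37, 0.55, 0.71, 0.93])
alpha = 0.1  # desired significance level / FDR
n = len(p_values)

# Bonferroni: flag campaigns with p < alpha / N. Controls the FWER but is
# very conservative as N grows.
bonferroni_flagged = np.where(p_values < alpha / n)[0]

# Benjamini-Hochberg: sort p-values, find the largest k such that
# p_(k) <= (k / N) * alpha, and flag the k smallest p-values. Controls the FDR.
order = np.argsort(p_values)
sorted_p = p_values[order]
thresholds = alpha * np.arange(1, n + 1) / n
below = np.nonzero(sorted_p <= thresholds)[0]
bh_flagged = order[: below[-1] + 1] if below.size else np.array([], dtype=int)

print("Bonferroni flags campaigns:", bonferroni_flagged)
print("Benjamini-Hochberg flags campaigns:", bh_flagged)
```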

Conclusion

The methods detailed above are only one approach to advancing an improved model to production status. The final result is a binary choice: should we continue with the present production model or transition to the new model? 

A viable alternative is to make a “continuous” choice instead, for example by employing a multi-armed bandit strategy: we run both models concurrently on separate slices of traffic whose sizes are adjusted according to each model’s performance.
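As a minimal sketch of that idea, and only one of several possible bandit strategies, a Thompson-sampling allocation could split traffic in proportion to how often each model looks best given its conversion counts so far; the model names and totals below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative running totals per model: conversions and impressions served.
conversions = {"model_a": 130, "model_b": 100}
impressions = {"model_a": 50_000, "model_b": 50_000}

def traffic_share(n_draws=100_000):
    """Thompson sampling: share of traffic each model would receive next."""
    # Sample plausible conversion rates from a Beta posterior for each model.
    samples = {
        name: rng.beta(1 + conversions[name],
                       1 + impressions[name] - conversions[name],
                       size=n_draws)
        for name in conversions
    }
    a_wins = float(np.mean(samples["model_a"] > samples["model_b"]))
    return {"model_a": a_wins, "model_b": 1 - a_wins}

print(traffic_share())
```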
