A/B tests are a crucial tool for evaluating experiments in the marketing domain. A test is performed by exposing different population samples to variations of an offer. The results are observed at the end of the experiment to select the best-performing variant. Marketing teams need to spend a lot of time crafting these experiments. It also takes time and money to wait for results before iterating and improving. There is a need for methods that help prioritise which experiments should be A/B tested. In this blog, we discuss the challenges we faced while evaluating our offer optimisation system (MARS). We also look at how Counterfactual Policy Evaluation can be used to evaluate models without running A/B tests.


How does MARS work?
Before we dive deeper into this post, let’s start with the basics of how MARS works. MARS formulates offer optimisation as a sequential decision-making problem. At each time step, our model observes contextual state information about the user and the platform. Using this observation, it tries to recommend the right offer for the user. The model then receives a feedback signal that it uses to improve its recommendations. This feedback is governed by a goal set by the marketer. The loop continues in a finite-horizon setting and ends when the user exits the campaign under certain preset conditions.
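
As a rough illustration of this loop, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the offer names, the state features, the reward rule, the exit condition and the function names are hypothetical and are not taken from MARS itself.

```python
import random

# Hypothetical offer catalogue; the real MARS action space is not described here.
OFFERS = ["no_offer", "discount_10", "cashback_5"]

def random_policy(state, rng):
    """Placeholder policy: picks an offer uniformly at random."""
    return rng.choice(OFFERS)

def run_episode(policy, horizon=5, seed=0):
    """Run one finite-horizon campaign for a single simulated user."""
    rng = random.Random(seed)
    trajectory = []
    for step in range(horizon):
        # Contextual state about the user and the platform (made-up features).
        state = {"step": step, "engagement": rng.random()}
        offer = policy(state, rng)
        # Feedback signal tied to the marketer's goal (made-up rule).
        reward = 1.0 if offer != "no_offer" and state["engagement"] > 0.5 else 0.0
        trajectory.append((state, offer, reward))
        if reward > 0:  # preset exit condition: the user reached the goal
            break
    return trajectory

episode = run_episode(random_policy)
```

Each element of the returned trajectory is one pass through the observe, recommend and feedback cycle, and the episode stops either at the horizon or when the exit condition fires.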

Ideally, a trained model would recommend the optimal offer at each step to help users reach the goal. Before we deploy a model in the real world, we want to measure how good its recommended offers are. With every trained model, we wanted to ask two questions. How good is it compared to other trained models? How well would it perform if deployed in the real world?


The Problem
Answering these questions is easier in supervised learning. You take a bunch of logged data, train your models on one part of it, and test them on the held-out rest. This approximates, more or less, how the model would perform in the real world. In our case, doing this was difficult. The logged data only records the outcomes of the offers that were actually made, so we could not observe the consequences of a different action recommended by a new model. Nor can we know how users will respond to the offers a model recommends before actually recommending them.

If only we had a time turner

To train a good model we had to run a lot of experiments. It was not possible to run all of them at once. The experiments would also take a long time to produce the metrics we wanted to observe. Another issue was evaluating the performance of the model (policy) over a sequence of offers. Traditional evaluation methods are better suited to offers recommended at a single step.


We went looking for solutions:

You are in a simulation! Inside a simulation!


One solution was to simulate the business scenario in a sandbox and use it for evaluation. We realised that this was not easy because of the complexity of the environment, which includes a lot of latent variables that are external to the platform. Designing the sandbox would take a long time and would be error-prone.

The other alternative was to run A/B tests, a popular and reliable method used by similar experiments. In an A/B test, you split the samples into two sets and use one as treatment and the other as control. At the end of the experiment, you test whether the difference in the observed outcomes is significant. A/B tests are quite powerful, but we realised that it was not possible to test every experiment this way. Firstly, the tests were too risky, as they had a direct monetary impact on the business. Secondly, they were time-consuming: we would need to train our models and then wait for the tests to run over a long period. Only after this could we observe the results and retrain. This made it very slow to iterate towards a model with decent performance.
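
To make the significance step concrete, here is a minimal sketch of one common choice, a two-sided two-proportion z-test on conversion counts. It is an illustration under assumptions: the function name and the numbers are made up, and the test you would actually use depends on the metric being compared.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: conversions in control and treatment.
    n_a, n_b: number of users in control and treatment.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 100/1000 conversions in control, 150/1000 in treatment.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
significant = p_value < 0.05
```

The test itself is cheap. The cost described above comes from collecting enough live traffic per variant before it can be run at all.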


Counterfactual Policy Evaluation:

Our search for a better method brought us to this paper. It discusses techniques for evaluating Reinforcement Learning models. Reinforcement Learning algorithms are usually tested in simulated environments. Counterfactual Policy Evaluation (CPE) techniques can instead estimate the performance of a trained policy without deploying it in the real environment or simulating that environment.

We have access to a lot of logged data from past interactions between users and the platform. CPE methods exploit this data to estimate how well a policy would perform. We have implemented Regression Estimators, Inverse Propensity Estimators and Doubly Robust Estimators. The Doubly Robust (DR) Estimator combines the other two, aiming for low bias and low variance, and its estimates come very close to A/B testing outcomes.
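
The three estimators can be sketched for the simplest, single-step case as follows. This is a textbook-style illustration, not our production code. It assumes each log entry is a tuple of (state, action, reward, logging propensity), that `target_policy(state, action)` returns the probability the new policy assigns to an action, and that `reward_model(state, action)` is a fitted reward predictor. All of these names are hypothetical.

```python
def regression_estimate(logs, target_policy, reward_model, actions):
    """Regression estimator: trust the reward model, ignore observed rewards."""
    total = 0.0
    for state, _action, _reward, _propensity in logs:
        total += sum(target_policy(state, b) * reward_model(state, b) for b in actions)
    return total / len(logs)

def inverse_propensity_estimate(logs, target_policy):
    """Inverse propensity estimator: reweight observed rewards by pi_new / pi_logging."""
    total = 0.0
    for state, action, reward, propensity in logs:
        total += target_policy(state, action) / propensity * reward
    return total / len(logs)

def doubly_robust_estimate(logs, target_policy, reward_model, actions):
    """Doubly robust estimator: model prediction plus a reweighted residual."""
    total = 0.0
    for state, action, reward, propensity in logs:
        baseline = sum(target_policy(state, b) * reward_model(state, b) for b in actions)
        weight = target_policy(state, action) / propensity
        total += baseline + weight * (reward - reward_model(state, action))
    return total / len(logs)

# Toy data: a uniform logging policy over two offers; offer "a" always pays 1.
actions = ["a", "b"]
logs = [(None, "a", 1.0, 0.5), (None, "b", 0.0, 0.5)]
always_a = lambda state, action: 1.0 if action == "a" else 0.0
bad_model = lambda state, action: 0.5  # deliberately wrong reward model

reg = regression_estimate(logs, always_a, bad_model, actions)
ips = inverse_propensity_estimate(logs, always_a)
dr = doubly_robust_estimate(logs, always_a, bad_model, actions)
```

In this toy setup the true value of always recommending "a" is 1. The regression estimate inherits the error of the wrong reward model, while the residual term in the DR estimate corrects for it because the logged propensities are accurate.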

Our hyper-parameter tuning infrastructure uses these scores to optimise the search space. This enables us to train many models simultaneously without waiting to run an A/B test. Implementing methods like sequential DR requires us to think about data collection in a different way. We have built infrastructure to support these methods, and we keep improving it. Once we have a few competing models with good CPE scores, we put them in production for real A/B tests. Such tests also help us validate the effectiveness of using CPE as a primer for A/B testing.
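
To show why sequential DR changes what has to be logged, here is a minimal sketch of the standard recursive form of the estimator for a single logged trajectory. It is an illustration under assumptions, not our implementation: each step is assumed to be stored as (state, action, reward, logging propensity), and `target_policy`, `q_model` and the toy values are hypothetical.

```python
def sequential_dr(trajectory, target_policy, q_model, actions, gamma=1.0):
    """Recursive doubly robust value estimate for one logged trajectory.

    trajectory: list of (state, action, reward, logging_propensity) per step.
    target_policy(state, action): probability under the policy being evaluated.
    q_model(state, action): estimated return from taking `action` in `state`.
    """
    v_dr = 0.0
    # Work backwards from the last step to the first.
    for state, action, reward, propensity in reversed(trajectory):
        v_hat = sum(target_policy(state, b) * q_model(state, b) for b in actions)
        rho = target_policy(state, action) / propensity
        v_dr = v_hat + rho * (reward + gamma * v_dr - q_model(state, action))
    return v_dr

# Toy two-step trajectory logged under a uniform policy over two offers.
actions = ["a", "b"]
trajectory = [("s0", "a", 0.0, 0.5), ("s1", "a", 1.0, 0.5)]
always_a = lambda state, action: 1.0 if action == "a" else 0.0
zero_q = lambda state, action: 0.0

estimate = sequential_dr(trajectory, always_a, zero_q, actions)
```

The policy-level score is the average of this quantity over many logged trajectories. Note that the recursion needs the probability with which the logging policy chose every action at every step, so those propensities have to be recorded at collection time.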


Where can you use it?
This set of Counterfactual Policy Evaluation techniques can be used in the following types of cases:

  • You have a huge amount of logged feedback data.
  • You can’t run A/B tests, or running them is ethically too risky.
  • You have a sequence of policy recommendations and you need to evaluate the sequence of treatments as a whole.


We’ll talk about these methods in detail in a series of blog posts to follow. We will also discuss the other challenges we faced while building this system in-house, and ways you could use it in your domain.