Stop A/B Testing And Make Out Like A Bandit

btilly · on April 20, 2012

This is an interesting suggestion, but I have significant questions that I'e like to see answered.

1. Most multi-armed bandit algorithms assume that the potential reward for each lever is the same each time you pull it. Unfortunately web traffic does not look like this - there are daily, weekly, and monthly cycles in conversion characteristics, with large random fluctuations on top. An A/B test can ignore this - who got the better traffic is just another random factor that comes out in the statistics. How much would this impact a multi-armed bandit approach?

2. My understanding is that multi-armed bandit algorithms assume that feedback is instantaneous - pull the lever and get the answer. But this is often not true. You send a whole batch of emails before getting feedback on the first. Depending on the business, incoming users can take time to convert to paying customer. I've seen places where the average time to do so was weeks. What decision should be made in that time period of uncertainty? Worse yet, what if one version speeds up conversions relative to the other? I know how to tweak an A/B test to handle this issue (just wait then look at cohorts that should have converted under either), I don't know how to modify a multi-armed bandit algorithm to do so.

3. As a practical matter, companies don't want to keep tests going indefinitely. There is a real technical cost to maintaining a complicated mix of possible pages that can shown. You want losers to be removed from your code base. A/B testing is well-suited to doing that. A multi-armed bandit approach does not.

4. Most companies don't even do A/B testing correctly. I fear that pushing a more complex scheme makes them less likely to try it, and increases the odds of mistakes.

noelwelsh · on April 20, 2012

Good questions. I'll do my best to answer them.

1. We assume that the distribution of rewards for each lever is fixed (over the short term). This allows the reward to vary randomly so long as the average reward (over the short term -- days to weeks) is constant. There are more complex schemes, which allow for greater variation in reward, but the initial Myna offering is intended to be directly comparable to A/B testing.

2. It's not necessary to assume that feedback is instantaneous. Basically you can continue to make suggestions (pull levers) in proportion to your best estimate of their expected return and the maths holds. Very long conversion cycles will cause problems for any system, I think, as you'll spend a long time in a random walk. In these cases we recommend using a proxy measure which is correlated with conversion, if one is available. As for one option speeding up conversion, I don't think that will matter as you'll simply refine your estimate of one lever faster. I haven't thought too much about this particular issue; it would be worth doing some simulations to see.

3. Just turn off the bandit when you've satisfied with the results. That is, you can use Myna like A/B testing, and the Myna UI displays confidence bounds for this very reason, while still getting the benefits of optimising as data arrives.

4. I actually like bandit algs as there is less for the user to mess up. You don't have to worry about how much data to collect, what p value to use, and so on. Just set it running and it optimises automatically.

btilly · on April 20, 2012

Further questions.

1. My experience is that this assumption can be significantly mistaken. I have seen significant daily and weekly variations in conversion rates. (They average out after a bit, but they fluctuate.) Still a small tweak could help.

2. Please do look into this one, the difference is not small. I first encountered the differential timing issue with a test of the benefits of adding a phone touch point on top of an email cycle. That extra contact moved conversions up by weeks. So after the first month it looked fantastic, but then slowly degraded over time. (There was improvement, but not enough to justify the expense.) A careful cohort analysis showed that it was not worthwhile at a point where a naive A/B test was still showing the extra touchpoint winning by a very significant margin.

3. That's reasonable.

4. In practice during A/B testing people don't worry about those things either. They just start the test, and later declare a winner. (People are much more sloppy about it than they theoretically should be...)

storborg · on April 20, 2012

Re: 3, how do you calculate confidence bounds for a bandit test? I was under the impression that traditional AB testing methodologies (e.g. z-tests, p-values, etc) broke down here, because the sample size is variable.

noelwelsh · on April 21, 2012

It's actually simpler in the bandit case, because you're only trying to estimate the average return. You could use a Chernoff-Hoeffding bound for example, but most bandit algorithms are of the upper confidence bound variety and actually calculate a bound as part of the algorithm.

wpietri · on April 20, 2012

I'm probably just missing something. But for us the purpose of A/B tests isn't really optimization; it's learning. We have a hypothesis about how to improve something and we try it out. The most valuable tests for us are the ones that don't work, because they force us to go back and think things through again.

A magic multivariable optimizer seems fine for the kinds of things a human won't be thinking about (e.g., most interesting tweets this hour and the best ads to show next to them). But from this article I'm not seeing an advantage in using a similar mechanism for testing product hypotheses.

nkh · on April 20, 2012

This article needs a patio11 bat signal.

I would be very interested to get his thoughts on this, and he didn't comment on it last time it was posted.

patio11 · on April 20, 2012

Looks nice. Happy to see more optimization in the world. Have not dug into the math of it enough to appreciate that aspect yea or nay. Think claims of superiority over A/B testing are moot unless it successfully fixes the biggest problem with A/B testing, which is that people don't A/B test. Don't feel burning need to implement for myself.

How's that?

judofyr · on April 20, 2012

8 month ago: http://news.ycombinator.com/item?id=2831455 (65 comments)

noelwelsh · on April 20, 2012

Well, maybe we can add to that discussion.

I say that as author of the blog post, and the founder of Myna, which is an implementation of the ideas described therein. And yes, I'm totally hoping this post stays on the front-page so we get more hits. (Myna is: http://mynaweb.com/)

Now that Myna is out (though still in beta) I'm super interested in discussing it with anyone who is interested.

swah · on April 20, 2012

Did you choose at the bird image in your front page using Mina? (I'm asking because normally we see successfull people in that place and supposedly it converts well).

noelwelsh · on April 20, 2012

No we haven't. In a normal day we don't get enough traffic to make optimisation worthwhile. Once we've refined the offering (we have a good number of beta testers) we'll start focusing on growth and optimising the page.

josscrowcroft · on April 20, 2012

I dunno why, but I do feel a Content Optimisation tool's homepage could be a little more ... content optimised.

Having said that, loved the article, fascinating stuff!

noelwelsh · on April 20, 2012

I agree. It is kinda embarrassing, but we have limited time and have to focus where we think we'll get the most bang-for-buck.

aresant · on April 20, 2012

Noel - Very interested in learning more, my email address is in my profile, drop me a line would like to connect you w/somebody on my team to assess further.

hartror · on April 20, 2012

And as coffeemug points out on that post as you (generally) cannot guarantee the independence of the variables you are changing to influence user behaviour. This makes the multiarmed bandit unsuited to the task the article espouses.

pgroves · on April 20, 2012

This is true in any practical machine learning problem and it doesn't cause as much of a problem as theoreticians would have you believe. Having independent variables, or detecting and accounting for all dependencies, will typically lead to a better result. But when there are dependencies it merely degrades accuracy. It does not destroy the accuracy altogether.

If you were to try to address every possible independence assumption in practice, you'd never get through a research project, let alone get a real product out the door as this company has done.

[Edit: syntax]

noelwelsh · on April 20, 2012

Then it also makes A/B testing unsuited to the task...

Independence is a useful assumption, as it allows more tractable algorithms. It is a correct assumption? No, obviously not, but it close enough to correct that the results you get from it are good. People have made plenty of money using A/B testing, which also makes an independence assumption, and people will make plenty of money using bandit algorithms. The only difference is they'll make more money using bandit algs.

cschmidt · on April 20, 2012

I know about the Gittins index to solve Multi-armed bandit problems, from 1979.

http://en.wikipedia.org/wiki/Gittins_index

I thought that was supposed to be optimal in some sense. How does the paper cited in this post improve on that?

noelwelsh · on April 20, 2012

The kind of optimality we're talking about is the up-to-constant factors / Big-O kind. The Gittins index has better constant factors than UCB-1. However, UCB-1 can be computed easily whereas Gittins indices are very expensive to compute.

wccrawford · on April 20, 2012

He'd lose that bet about the supermarket. Produce is in the far back corner, the bakery is in the front corner near the entrance, and the dairy is on the other side of the store in the back. 1 out of 3. I win.

But I doubt that it's laid out that way to make people walk all the way across the store. It's that way because there's only so many walls, and those departments need space that the customers can't enter. Meat and seafood are along the rest of the wall at the back for the same reason.

zachallia · on April 20, 2012

did you a/b test that default styled sign up button?

noelwelsh · on April 20, 2012

Zing! :)

No, we haven't. We're not in a growth stage yet. On a normal day (this is not a normal day) Myna doesn't get very much traffic. We've already validated the concept with earlier beta testers and are now refining the offering. Once we've done that we'll by trying to drum up more traffic and start optimising.

zachallia · on April 21, 2012

haha sweet, i figured. was just being an ass.