Thanks for the reply! My audience is someone early in their undergrad degree, studying math or computer science. Which parts do you think are not communicated clearly? I can make an edit later today :)
Oh cool, I've also done a presentation on bandit methods for early undergrads (see: https://gtagency.github.io/2016/experimentation-with-no-ragr..., it's missing speaker notes so it looks a bit strange, but it outlines the structure fairly well). It was also sort of an intro to the beta distribution and why you should love it, hence the focus on Thompson Sampling.
Some critiques:
- I feel like your justification/explanation for why this is useful is a bit lacking. Personally I find framing it in terms of regret-minimization better than gain-maximization, even though in practice they're the same. It frames the situation so that you go in knowing you will have to pick non-optimal things some of the time, so your job is to learn the underlying distributions as quickly as possible, instead of trying to pick the best things. Interestingly, I think gain-maximization leads you down an epsilon-greedy path, whereas regret-minimization leads you toward UCB1/Thompson Sampling (rough sketch of both selection rules after this list). Since you pivot to RL instead of just bandits, I can kind of understand it, but see my last point.
- As a general rule, I try to minimize math in undergrad-focused talks/documents. Even as someone who spends a lot of time explaining statistical concepts to people, my eyes glaze over when I see `q*(a) = E[R_t | A_t = a]`. Obviously you need some, and this is just a personal thing. For the most part I actually think you do a decent job of explaining the equations you use. At least until the gradient bandit part :P Then it just feels like a textbook proof excerpt.
- Nit: You don't fully explain that epsilon-greedy is greedy, except epsilon of the time. That caught me up for a second.
- The last thing is that I feel like the motivation for, and difference between, stationary and nonstationary reward distributions isn't well explained. Nonstationary rewards don't really "fit" the mental model behind k-armed bandits a lot of the time. I'm actually curious for a better motivation there, as I can't articulate one myself.
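Roughly what I mean by the two selection rules, as a sketch I'd hand to undergrads rather than anything from your post (`q_estimates` are the current value estimates, `counts` the per-arm pull counts, `t` the total number of pulls so far; all names are mine):

```python
import math
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Be greedy, except epsilon of the time pick an arm uniformly at random."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])

def ucb1(q_estimates, counts, t):
    """Pick the arm with the highest upper confidence bound (UCB1)."""
    # Play every arm at least once before trusting the bound.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(q_estimates)),
               key=lambda a: q_estimates[a] + math.sqrt(2 * math.log(t) / counts[a]))
```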
Hey Joshua, thanks so much for the criticism. I do see your point about epsilon-greedy vs. regret minimization. I will add a section about that before presenting UCB1. I will also add more information to the epsilon-greedy strategy section, elaborating on what the epsilon is really for. I'm not 100% sure how to reframe the nonstationary reward situation, because I feel like that adds state that depends on t to the bandit scenario, which then feels more like an MDP.
> because I feel like that adds state that depends on t to the bandit scenario, which then feels more like an MDP.
Yeah this was sort of exactly the issue I was running into. I can't justify it to myself without essentially saying "this is just an MDP in disguise", which maybe is the right way to do it. I'm pretty sure you can define a k-armed bandit as an MDP on a single state, where each action corresponds to a machine, and all actions return you to the single state.
So maybe that is the right motivation. But reversing that "an MDP is just a k-armed bandit problem where sometimes playing a machine breaks it and forces you to play other machines, which can impact how quickly the casino fixes your first machine..." feels forced.
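For concreteness, the single-state framing I have in mind looks something like this (just a sketch with made-up names, assuming Gaussian rewards):

```python
import random

class SingleStateBanditMDP:
    """A k-armed bandit phrased as an MDP with exactly one state.

    Every action (arm) is available in that state, yields a stochastic
    reward, and transitions back to the same state with probability 1.
    """

    def __init__(self, arm_means, reward_std=1.0):
        self.arm_means = arm_means      # true mean reward per arm
        self.reward_std = reward_std
        self.state = 0                  # the only state

    def step(self, action):
        reward = random.gauss(self.arm_means[action], self.reward_std)
        next_state = 0                  # always return to the single state
        done = False                    # the "episode" never ends
        return next_state, reward, done
```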
One thing that confused me is that you introduce the "preference" for an action, but don't really define it. Some exposition about what that means would have helped.
Ah yes, I didn't want the reader to fuss too much about the idea of a preference, since it's really just an analog of un-normalized pi (modulo the exponentiation part). It doesn't have a strict definition, nor is it a formal term. I will un-bold it and italicize it instead. Thank you :)
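If it helps, the relationship I have in mind is just the usual softmax over preferences (a rough sketch, not the exact code from the post; `preferences` is the H(a) vector):

```python
import math

def softmax_policy(preferences):
    """Turn numerical preferences H(a) into action probabilities pi(a).

    Shifting every preference by the same constant leaves pi unchanged,
    which is part of why the preferences themselves have no strict definition.
    """
    max_h = max(preferences)                      # for numerical stability
    exps = [math.exp(h - max_h) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]
```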
Great suggestion! The blog was based on a large portion of the book. A friend of mine asked for a version of the first chapter that was digestible for an audience at the high-school to undergraduate level. I wrote this blog with that in mind, while adding my own observations as well. I am planning to write up some python solutions for the MDP chapter as well. Thanks for reading :)
Yes, it is as you stated. Because bandits are stateless, there is no state parameter, so it is $q_*(a)$ rather than $q_*(s,a)$. From where I learned it, this could arguably be an abuse of notation to use $q_*$ in both contexts. My newer entry (which is currently WIP) uses $q_*(s,a)$ with the cumulative sum of future rewards (with discounting). Thanks for the reply guys :)
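For reference, the state-action version I'm using there is the standard discounted-return definition (assuming the usual setup, not quoting the new entry verbatim):

$$ q_*(s, a) = \max_{\pi} \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \right] $$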