Posts
-
The Blogging Gauntlet: May 4 - Exploration-Exploitation Part 1: Multi-armed Bandits
This is part of The Blogging Gauntlet of May 2016, where I try to write 500 words every day. See the May 1st post for full details.
The exploration-exploitation problem is a classical problem in AI. Its generality means it shows up in several topics, such as reinforcement learning and Monte Carlo Tree Search. Additionally, it can be very applicable to issues in industry, like online advertising, A/B testing, and decision making in general.
The cleanest setting for explaining exploration-exploitation is the multi-armed bandit problem. Don’t ask me where the name came from, because I’m not sure of that either.
In the original formulation, a gambler stands in front of a slot machine with \(N\) arms. Every arm has a different reward, and those rewards as random. More formally, the reward from each arm follows an unknown probability distribution. Whenever the gambler picks an arm, they get a sample from that distribution.
Importantly, the gambler doesn’t know which arm has high expected reward and which doesn’t. For \(T\) timesteps, the gambler pulls an arm and receives a reward sampled from that arm’s distribution. The goal is to design a strategy for the gambler which maximizes their expected reward.
If the gambler were all-powerful and omniscient, they could always pick the arm with highest expected reward. Let’s call this the optimal strategy. Now, it’s impossible to achieve the optimal strategy. The gambler has no information on the payout of each arm, so they have to try every arm at least once to make sure they aren’t missing the optimal arm. This motivates the definition of regret.
The regret of a strategy \(S\) is the expected difference in reward between following \(S\) and following the optimal strategy of always picking the best arm. Note the regret is always positive - by definition, it’s impossible to do better than the optimal strategy. As an example, suppose we have two arms.
- Arm 1 pays $2 with probability \(1/2\), and $0 otherwise.
- Arm 2 pays $4 with probability \(1/3\), and $0 otherwise.
Suppose our strategy was to pick arm 1 or 2 randomly. Over \(T\) timesteps, this achieves expected payout
\[\left(\frac{1}{2}\cdot\frac{2}{2} + \frac{1}{2}\cdot \frac{4}{3}\right)T = \frac{7}{6}T\]However, the optimal strategy has expected payout
\[\frac{4}{3}T\]So the regret of this strategy is \(T/6\).
At this point, it’s worth explaining why this problem is so difficult. Because we have no information about each arm’s reward distribution, it’s possible that most of the reward is bundled into low probability outcomes. Think of this example.
- Arm 1 always pays $1
- Arm 2 pays $1,000,000 with probability \(1/10^4\).
In the case, the arm with highest expected reward is arm 2, but in almost all attempts the gambler will receive $0 for that arm. This is the exploration-exploitation problem in a nutshell. Based on our experiences so far, we can either try new things with have a chance of being better than anything we’ve seen so far (exploration), or we can do what we’ve done before, because we know we like it (exploitation). Too much exploration leads to too little reward, because we never stick to what we like the most. On the flip side, too much exploitation means we never discover alternatives that could be orders of magnitude better than our current life.
Figuring out a good balanced between exploration and exploitation is complicated because we have no idea how to distinguish between choices that are good with very low probability and choices which are always bad. Or to be more precise, it is theoretically impossible to distinguish between the two without taking tons of samples. We have to observe at least one good outcome, and the difference between “happens one in a million” and “happens never” is small enough to be almost invisible. I’ll let Wikipedia explain how frustrating this problem can be.
Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it.
I don’t want to get into the work done on this problem. The literature is vast, deep, and I won’t pretend to understand all of it. However, there are a few intuitive properties that are good guides for real-world decision making.
- It’s best to explore choices early and exploit best known choices later. If you discover a really good option, you want to maximize your exploiting time.
- The bigger \(T\) is (the longer your time horizon), the more you should explore early on. The small loss of reward in the now is worth the knowledge of knowing you’re doing the best possible thing.
- Good strategies gradually reduce their tolerance for short-term loss, instead of flipping from “try ALL the things!” to “try ONLY ONE of the things!”
This is already much longer than I expected, so I’ll leave my real life milk tea example for tomorrow.
-
The Blogging Gauntlet: May 3 - Asynchronous Research and You
This is part of The Blogging Gauntlet of May 2016, where I try to write 500 words every day. See the May 1st post for full details.
Last Sunday, I did Berkeley Mystery Hunt. I can’t talk about the puzzles publicly because this puzzle hunt gets run twice: once during the school year for Berkeley students, and once during the summer for people in the Bay Area.
In every puzzle hunt, I run into the same scenario. A new puzzle comes out, and I take a look at it. I’ll solve it, or I’ll get stuck and move to another puzzle. Either way, I run into a problem. Either I need other people to explain what they’ve figured out, or other people need to understand my progress based on my work so far.
For smaller puzzle hunts, this is fine. Usually you’re all in the same room, so you can ask each other in person. However, when you scale up to massive teams like MIT Mystery Hunt, this becomes a huge concern. They might be sleeping, or be on the other side of the country, and you have to figure out the partial progress from only the notes they’ve written down. If those notes don’t explain the motivation behind all the data collected, they’re practically useless. How am I supposed to know why you removed all the vowels, or where this random column of numbers came from?
I didn’t know what to call this, until I read qntm’s series on antimemetics. If you haven’t read them, I highly, highly recommend checking them out. Start at We Need To Talk About Fifty-Five and follow the links from there.
If you don’t want to, here’s the quick version. Memes are ideas that want to be known, spreading from mind to mind. Antimemes are ideas that want to be forgotten. They are self-censoring. When you stop looking at an antimemetic object, your brain erases your memory and you forget it even exists.
The Antimemetics Division studies antimemes, and they have a devil of a time doing so. It’s hard to study things that everyone rediscovers on a daily basis, and it’s even harder when malicious antimemes erase the lives of coworkers so thoroughly that any evidence of their existence disappears. It’s one of the more horrifying ways to go. You can’t prove your existence to the world if the world doesn’t listen to you…
(from Crisis on Infinite Earths, discovered through TVTropes)
Nightmare fuel aside, there’s a quote about antimemetics research that resonated with me, which is applicable outside the realm of fiction.
Still lost for a logical entry point to the data, Wheeler curses all of her predecessors. Asynchronous research — whereby the research topic is forgotten entirely between iterations, and rediscovered over and over — is a perfectly standard practice in the Antimemetics Division, and her people ought to be better trained than this. There should be an obvious single document to read first which makes sense of the rest. A primer—
(qntm, CASE COLOURLESS GREEN)
The idea of asynchronous research is something I’ve thought about before, but this is the first time I’ve seen someone describe it. And I love the description, it’s perfect.
When I take notes on something, whether it’s for puzzle hunts or lecture, I emphasize writing down motivations, writing down questions people ask about the material, and writing down the answers to those questions. If I don’t follow the lecture, I write what the lecturer is saying anyways, and pepper the margins with question marks to indicate my confusion.
The reason I do this is simple: if I understand the material, I’m not going to read my old notes. In the worlds where I’m reading my notes, I don’t understand what’s going on, so I write my notes towards a more confused version of myself. If my notes spend too long explaining a topic I already get, that’s fine, I’ll just skip ahead. But if my notes don’t explain something in enough detail, I have to struggle through them, and that takes a lot longer.
Understanding something in the now doesn’t mean you’ll remember it in the future, so it’s best to leave notes for your forgetful future selves. Comment your code to explain your awful hack. Mark key ideas to give your future self something to start at, and mark when you’re confused to let your future self know where your notes are dodgy.
If you’re confused about an idea, then get it after a few minutes, write down why you were confused and why it makes sense, because there’s a good chance that if you get confused again, you’ll get confused in the same way. (Writing it down also makes it stick better and decreases the chance you get confused in the first place.)
Here are some examples. Below are my notes on backprop from neural nets. I make sure to annotate the intuition behind the gradient.
(Click for full size.)
Here’s another example from theoretical crypto. I don’t follow the intuition said in lecture, but I write it down anyways, and note my confusion appropriately.
(Click for full size.)
There are downsides to this approach. I’ve filled notebooks with notes that I’ve never read again, and it’s tricky to take detailed notes while keeping up with lecture. However, I overall think it’s worth it. This may not be the best way to take notes, but it’s what my note taking strategy evolved into. So far, it’s been working for me.
-
The Blogging Gauntlet: May 2 - Too Tired To Play
This is part of The Blogging Gauntlet of May 2016, where I try to write 500 words every day. See the May 1st post for full details.
Let’s talk about work and play.
This is a theme that people absolutely love to bring up when talking about productivity. Here’s how the argument goes. People who do lots of work do because they like their work. They find ways to make work fun. The average employee goes to work to earn a paycheck, using that money to fund their play elsewhere. They work to live, instead of living to work. The mythical employee who treats their work as play lives for their work, because their work isn’t work; it’s play! That self-satisfaction drives them to work on side project after side project.
I’ll fully admit I’m strawmanning the argument by a bit. It’s not as though it’s unfounded. People who like their job will do more at their job. It’s obvious. People with big hobbies will spend time outside of work on those hobbies. They’ll take salsa lessons, or practice playing the piano. Maybe they’ll try organizing a roleplaying campaign.
What this portrayal misses is that even if you like your work, it’s still work. Someone can both deeply enjoy their job and deeply desire a break from it. And similarly, some hobbies require lots of effort as well. Learning to play the piano requires focused practice. Running a roleplaying campaign requires crafting a plausible setting, and learning how to improvise against the insane actions of roleplayers themselves.
If there is work and there is play, then for me blogging is semi-play. It’s nice, but it requires effort, and sometimes I’m too tired for that. If I’ve just worked for 12 hours, the last thing my brain wants to do is do more. I’m too tired to work, and I’m too tired to play.
It doesn’t matter that blogging is more personally fulfilling than being a passive consumer of other people’s work. Entertainment is entertaining! It’s made that way! Zero effort for a small blip of amusement. Welcome, dear children, to the fire hose of information.
There is always a relevant xkcd
I don’t want to bash the entertainment industry that hard, because it would be hypocritical. This blog is a contributor to that fire hose, and I want people to drink from it.
I want to write more, but there’s this pressure to make the words I’m writing perfect and well-polished. To make every sentence a clean pearl, saying exactly what I mean in exactly as many words as I need. The issue is that it takes a long time to write in a polished way. The more polished I want something, the more effort that has to go into it. My standards for myself make blogging too daunting to do unless I have little else going on, and that’s not sustainable.
A few weeks ago, I had a conversation with somebody. She said it was nice to have a low effort writing outlet. That way, she could write down her passing thoughts, without worrying about making it perfect.
This challenge is a step towards making my blog take less effort. By requiring 500 words every day, I’m forced to lower my standards on what’s publishable and what isn’t. I won’t lie, that’s going to sting a bit. It doesn’t matter how many layers of self-deprecation I weave about myself, I care a lot about the quality of the things I produce. It’s natural to; in fact I’d say it’s incredibly strange not to. The end result of a well-edited most can be glorious to behold, but it’s a long journey to get there, and I find it less fun than the first step of turning my thoughts into words.
I’m treating this gauntlet as one small piece of my long term plan to get better at working towards long term plans. In the short term, my writing is going to be straight garbage. In the long term, I’ll be more articulate, and I’ll be better at understanding the power of taking small concrete steps towards lofty uncertain goals.