Posts

  • The Blogging Gauntlet: May 5 - Exploration-Exploitation Part 2: Milk Tea

    This is part of The Blogging Gauntlet of May 2016, where I try to write 500 words every day. See the May 1st post for full details.

    Yesterday I introduced multi-armed bandits. Today I’ll give a delicious real world example: deciding on the best place to get milk tea in Berkeley.

    Before I get into the details, let’s set up some ground rules.

    • Pearl milk tea is a tea drink made of tea, milk, and tapioca balls. The tapioca balls are also known as pearls.
    • Many, many people call this drink “boba” or “bubble tea”. These people are wrong. Yes, I understand there are regional dialects that each have a different word for milk tea, but the correct one is PMT. “Boba” and “bubble” refer to only the tapioca inside the tea. Thus, both “boba” and “bubble tea” are incorrect, because they do not signify that the milk tea drink has milk in it.
    • “Boba milk tea” or “bubble milk tea” are acceptable terms that I can tolerate.

    Anyways, this post isn’t about milk tea linguistics, it’s about multi-armed bandits. In this setting, each arm starts out as a milk tea restaurant. An arm’s payout is random, depending on what flavor of tea I get and how I request it be made. Over time, each arm further divides into more choices, because I also need to choose which flavor of tea to get within a given milk tea cafe.

    Let’s pretend I’m a freshman again, and have no context about the milk tea scene. As a freshman, I know of three milk tea options: Quickly’s, Sweetheart’s, and Lotus House. I try them for a while, and conclude that Sweetheart’s has the best pearls, and Lotus House has the best tea. This continues to hold up as I try different flavors. Eventually, I converge to alternating between Sweetheart’s and Lotus House based on mood.

    A few months later, someone mentions PurpleKow to me. This milk tea place is further from the dorms, but I check it out, and my god, it’s so much better. Here, I paid the price for not exploring enough. If I had been serious about milk tea back then (as I am now), I would have searched for nearby milk tea places, and I would have found PurpleKow significantly sooner.

    Now, fast forward to next year. Near the south side of campus, two stores open.

    • Sheng Kee Bakery, a chain of Asian bakeries that also sell milk tea.
    • ShareTea.

    Sheng Kee’s new, so it deserves a few tries. I quickly decide it’s okay, but not good enough. ShareTea’s also new, so I try it, and

    oh

    my

    god.

    ShareTea quickly becomes my go-to spot. At this point, I don’t bother going to PurpleKow or Sweetheart’s. My past observations give me high confidence in the quality of their milk tea, and it’s worse than ShareTea’s.

    At this point, I have some regret. I could have gotten tea from PurpleKow significantly more than I did. However, it’s outweighed by my decision to explore ShareTea early. I missed PurpleKow for a few months, but I got to exploit ShareTea for years.

    Finally, a few months after going to ShareTea several times a week, someone tells me I should really, really try Asha Tea House. I try it, and again I’m surprised. It’s very good, better than ShareTea, but it’s expensive enough to make it a trade-off between quality and quantity. I choose between the two based on which I favor that day.

    In the present, it’s worth asking myself how I’ve done. Well, overall my milk tea regret is pretty low. There were a few months where I didn’t explore my options thoroughly enough, but I checked out ShareTea within days of its opening, and got to exploit it for years. There’s some regret from not visiting Asha earlier, but it’s close enough in quality that I don’t feel so bad about it.

    The biggest takeaway I want to get across is that we usually don’t do enough exploration. People tend to regret what they didn’t do more than what they did, and although there may be some cognitive bias there, the potential reward from trying something amazing means it’s worth exploring more than you think you should. Spending a few months trying different things is cheap compared to years of doing the right thing.

  • The Blogging Gauntlet: May 4 - Exploration-Exploitation Part 1: Multi-armed Bandits

    This is part of The Blogging Gauntlet of May 2016, where I try to write 500 words every day. See the May 1st post for full details.

    The exploration-exploitation problem is a classical problem in AI. Its generality means it shows up in several topics, such as reinforcement learning and Monte Carlo Tree Search. Additionally, it can be very applicable to issues in industry, like online advertising, A/B testing, and decision making in general.

    The cleanest setting for explaining exploration-exploitation is the multi-armed bandit problem. Don’t ask me where the name came from, because I’m not sure of that either.

    In the original formulation, a gambler stands in front of a slot machine with N arms. Every arm has a different reward, and those rewards are random. More formally, the reward from each arm follows an unknown probability distribution. Whenever the gambler picks an arm, they get a sample from that distribution.

    Importantly, the gambler doesn’t know which arm has high expected reward and which doesn’t. For T timesteps, the gambler pulls an arm and receives a reward sampled from that arm’s distribution. The goal is to design a strategy for the gambler which maximizes their expected reward.
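
    To make the setup concrete, here is a minimal Python sketch of the protocol, under the simplifying assumption that each arm pays a fixed amount with some probability. The arm values and function names are my own illustration (the payouts happen to match the two-arm example below), not part of the formal problem statement.

    ```python
    import random

    # Illustrative arms: each pays a fixed amount with some probability, $0 otherwise.
    # The gambler never sees these numbers, only the sampled rewards.
    ARMS = [
        {"payout": 2.0, "prob": 1 / 2},
        {"payout": 4.0, "prob": 1 / 3},
    ]

    def pull(arm_index):
        """Sample a reward from the chosen arm's unknown distribution."""
        arm = ARMS[arm_index]
        return arm["payout"] if random.random() < arm["prob"] else 0.0

    def play(strategy, T=1000):
        """Run a strategy for T timesteps and return the total reward.

        A strategy is any function that takes the history of (arm, reward)
        pairs seen so far and returns the index of the next arm to pull.
        """
        history, total = [], 0.0
        for _ in range(T):
            arm = strategy(history)
            reward = pull(arm)
            history.append((arm, reward))
            total += reward
        return total
    ```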

    If the gambler were all-powerful and omniscient, they could always pick the arm with highest expected reward. Let’s call this the optimal strategy. Now, it’s impossible to achieve the optimal strategy. The gambler has no information on the payout of each arm, so they have to try every arm at least once to make sure they aren’t missing the optimal arm. This motivates the definition of regret.

    The regret of a strategy S is the expected difference in reward between following S and following the optimal strategy of always picking the best arm. Note the regret is always non-negative: by definition, it’s impossible to do better than the optimal strategy. As an example, suppose we have two arms.

    • Arm 1 pays $2 with probability 1/2, and $0 otherwise.
    • Arm 2 pays $4 with probability 1/3, and $0 otherwise.

    Suppose our strategy is to pick arm 1 or arm 2 uniformly at random each timestep. Over T timesteps, this achieves expected payout

    $$\left(\frac{1}{2}\cdot\frac{2}{2}+\frac{1}{2}\cdot\frac{4}{3}\right)T=\frac{7}{6}T$$

    However, the optimal strategy has expected payout

    $$\frac{4}{3}T$$

    So the regret of this strategy is T/6.
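
    If you would rather check that arithmetic empirically than trust the fractions, here is a quick Monte Carlo sketch of the two arms above. The code is my own illustration; the per-step averages should land near 7/6, 4/3, and 1/6 respectively.

    ```python
    import random

    T = 100_000

    def pull_arm1():
        # $2 with probability 1/2, $0 otherwise
        return 2.0 if random.random() < 1 / 2 else 0.0

    def pull_arm2():
        # $4 with probability 1/3, $0 otherwise
        return 4.0 if random.random() < 1 / 3 else 0.0

    # Random strategy: pick arm 1 or arm 2 uniformly at random each timestep.
    random_total = sum(random.choice([pull_arm1, pull_arm2])() for _ in range(T))

    # Optimal strategy: always pull arm 2, the arm with the higher expected payout.
    optimal_total = sum(pull_arm2() for _ in range(T))

    print(f"random strategy per step:  {random_total / T:.3f}  (expected 7/6)")
    print(f"optimal strategy per step: {optimal_total / T:.3f}  (expected 4/3)")
    print(f"regret estimate per step:  {(optimal_total - random_total) / T:.3f}  (expected 1/6)")
    ```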

    At this point, it’s worth explaining why this problem is so difficult. Because we have no information about each arm’s reward distribution, it’s possible that most of the reward is bundled into low probability outcomes. Think of this example.

    • Arm 1 always pays $1.
    • Arm 2 pays $1,000,000 with probability $1/10^4$.

    In this case, the arm with the highest expected reward is arm 2, but in almost all attempts the gambler will receive $0 from that arm. This is the exploration-exploitation problem in a nutshell. Based on our experiences so far, we can either try new things which have a chance of being better than anything we’ve seen so far (exploration), or we can do what we’ve done before, because we know we like it (exploitation). Too much exploration leads to too little reward, because we never stick to what we like the most. On the flip side, too much exploitation means we never discover alternatives that could be orders of magnitude better than our current life.
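
    To put a rough number on “almost all attempts” (a back-of-the-envelope calculation of mine, not from the original example): if the gambler pulls arm 2 a total of $n$ times, the probability of never seeing the big payout is

    $$\left(1 - \frac{1}{10^4}\right)^n$$

    so even after $10^4$ pulls, the gambler misses it entirely with probability about $e^{-1} \approx 0.37$, and until that first payout arm 2 looks strictly worse than arm 1.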

    Figuring out a good balance between exploration and exploitation is complicated because we have no idea how to distinguish between choices that are good with very low probability and choices which are always bad. Or to be more precise, it is theoretically impossible to distinguish between the two without taking tons of samples. We have to observe at least one good outcome, and the difference between “happens one in a million” and “happens never” is small enough to be almost invisible. I’ll let Wikipedia explain how frustrating this problem can be.

    Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it.

    I don’t want to get into the work done on this problem. The literature is vast and deep, and I won’t pretend to understand all of it. However, there are a few intuitive properties that are good guides for real-world decision making.

    • It’s best to explore choices early and exploit best known choices later. If you discover a really good option, you want to maximize your exploiting time.
    • The bigger T is (the longer your time horizon), the more you should explore early on. The small loss of reward now is worth knowing you’re doing the best possible thing later.
    • Good strategies gradually reduce their tolerance for short-term loss, instead of flipping from “try ALL the things!” to “try ONLY ONE of the things!” (One such strategy is sketched right after this list.)
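
    One simple strategy with all three of these properties is epsilon-greedy with a decaying epsilon. This is a standard textbook approach rather than anything specific to this post, and the sketch below (names and decay schedule included) is my own illustration.

    ```python
    import random

    def decaying_epsilon_greedy(history, num_arms, epsilon0=1.0, decay=0.01):
        """Choose the next arm to pull.

        With probability epsilon, explore a uniformly random arm; otherwise
        exploit the arm with the best average reward so far. Epsilon shrinks
        as the history grows, so early pulls mostly explore and later pulls
        mostly exploit.

        history: list of (arm_index, reward) pairs observed so far
        """
        t = len(history)
        epsilon = epsilon0 / (1.0 + decay * t)
        if not history or random.random() < epsilon:
            return random.randrange(num_arms)

        # Average observed reward per arm; arms never pulled default to 0.
        totals = [0.0] * num_arms
        counts = [0] * num_arms
        for arm, reward in history:
            totals[arm] += reward
            counts[arm] += 1
        averages = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(num_arms)]
        return max(range(num_arms), key=lambda i: averages[i])
    ```

    Early on epsilon is near 1, so nearly every pull is exploration; as the history grows, epsilon shrinks and nearly every pull exploits the best known arm, which is exactly the “gradually reduce your tolerance for short-term loss” behavior from the last bullet.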

    This is already much longer than I expected, so I’ll leave my real life milk tea example for tomorrow.

  • The Blogging Gauntlet: May 3 - Asynchronous Research and You

    This is part of The Blogging Gauntlet of May 2016, where I try to write 500 words every day. See the May 1st post for full details.

    Last Sunday, I did Berkeley Mystery Hunt. I can’t talk about the puzzles publicly because this puzzle hunt gets run twice: once during the school year for Berkeley students, and once during the summer for people in the Bay Area.

    In every puzzle hunt, I run into the same scenario. A new puzzle comes out, and I take a look at it. I’ll solve it, or I’ll get stuck and move to another puzzle. Either way, I run into a problem. Either I need other people to explain what they’ve figured out, or other people need to understand my progress based on my work so far.

    For smaller puzzle hunts, this is fine. Usually you’re all in the same room, so you can ask each other in person. However, when you scale up to massive teams like MIT Mystery Hunt, this becomes a huge concern. Your teammates might be sleeping, or on the other side of the country, and you have to figure out their partial progress from only the notes they’ve written down. If those notes don’t explain the motivation behind all the data collected, they’re practically useless. How am I supposed to know why you removed all the vowels, or where this random column of numbers came from?

    I didn’t know what to call this, until I read qntm’s series on antimemetics. If you haven’t read them, I highly, highly recommend checking them out. Start at We Need To Talk About Fifty-Five and follow the links from there.

    If you don’t want to, here’s the quick version. Memes are ideas that want to be known, spreading from mind to mind. Antimemes are ideas that want to be forgotten. They are self-censoring. When you stop looking at an antimemetic object, your brain erases your memory and you forget it even exists.

    The Antimemetics Division studies antimemes, and they have a devil of a time doing so. It’s hard to study things that everyone rediscovers on a daily basis, and it’s even harder when malicious antimemes erase the lives of coworkers so thoroughly that any evidence of their existence disappears. It’s one of the more horrifying ways to go. You can’t prove your existence to the world if the world doesn’t listen to you…

    (Image: “I don’t exist”, from Crisis on Infinite Earths, discovered through TVTropes)

    Nightmare fuel aside, there’s a quote about antimemetics research that resonated with me, which is applicable outside the realm of fiction.

    Still lost for a logical entry point to the data, Wheeler curses all of her predecessors. Asynchronous research — whereby the research topic is forgotten entirely between iterations, and rediscovered over and over — is a perfectly standard practice in the Antimemetics Division, and her people ought to be better trained than this. There should be an obvious single document to read first which makes sense of the rest. A primer—

    (qntm, CASE COLOURLESS GREEN)

    The idea of asynchronous research is something I’ve thought about before, but this is the first time I’ve seen someone describe it. And I love the description, it’s perfect.

    When I take notes on something, whether it’s for puzzle hunts or lecture, I emphasize writing down motivations, writing down questions people ask about the material, and writing down the answers to those questions. If I don’t follow the lecture, I write what the lecturer is saying anyways, and pepper the margins with question marks to indicate my confusion.

    The reason I do this is simple: if I understand the material, I’m not going to read my old notes. In the worlds where I’m reading my notes, I don’t understand what’s going on, so I write my notes towards a more confused version of myself. If my notes spend too long explaining a topic I already get, that’s fine, I’ll just skip ahead. But if my notes don’t explain something in enough detail, I have to struggle through them, and that takes a lot longer.

    Understanding something in the now doesn’t mean you’ll remember it in the future, so it’s best to leave notes for your forgetful future selves. Comment your code to explain your awful hack. Mark key ideas to give your future self something to start at, and mark when you’re confused to let your future self know where your notes are dodgy.

    If you’re confused about an idea and then figure it out after a few minutes, write down why you were confused and why it makes sense, because there’s a good chance that if you get confused again, you’ll get confused in the same way. (Writing it down also makes it stick better and decreases the chance you get confused in the first place.)

    Here are some examples. Below are my notes on backprop from neural nets. I make sure to annotate the intuition behind the gradient.

    (Image: lecture notes on backprop; click for full size.)

    Here’s another example from theoretical crypto. I don’t follow the intuition said in lecture, but I write it down anyways, and note my confusion appropriately.

    (Image: lecture notes from theoretical crypto; click for full size.)

    There are downsides to this approach. I’ve filled notebooks with notes that I’ve never read again, and it’s tricky to take detailed notes while keeping up with lecture. However, overall I think it’s worth it. This may not be the best way to take notes, but it’s what my note-taking strategy evolved into. So far, it’s been working for me.
