-
Thoughts on ICLR 2019
ICLR did terrible things for my ego. I didn’t have any papers at ICLR. I only went to check out the conference. Despite this, people I haven’t met before are telling me that they know who I am from my blog, and friends are joking that I should change my affiliation from “Google” to “Famous Blogger”.
Look, take it from me - having a famous blog post is great for name recognition, but it isn’t good for much else. Based on Google Analytics funnels, it doesn’t translate to people reading other posts, or reading any more of my research. I’ve said this before, but I blog because it’s fun and I get something out of doing it, even if no one actually reads it. I mean, of course I want people to read my blog, but I really go out of my way to not care about viewership. Sorta Insightful is never going to pay my bills, and my worry is that if I care about viewership too much, I won’t write the posts that I want to write. Instead, I’ll write the posts that I think other people want me to write. Those two aren’t the same, and caring about viewers too much seems like the first step towards a blog that’s less fun for me.
Favorite Papers
Every conference, when I’m doing small-talk with other people, I get the same question: “What papers did you like?”. And every conference I give the same answer: “I don’t remember any of them.”
I try to use conferences to get a feel for what's hot and what's not in machine learning, and to catch up on subfields that I'm following. So generally, I have a hard time remembering any particular poster or presentation. Instead, my eyes glaze over at a few keywords, I type some notes I know I'll never read, I jot down the occasional interesting idea, and if I'm lucky, I'll even remember what paper it came from. In practice, I usually don't. For what it's worth, I think this conference strategy is totally fine.
People still like deep learning. They still like reinforcement learning. GANs and meta-learning are both still pretty healthy. I get the feeling that for GANs and meta-learning, the honeymoon period of deriving slight variants of existing algorithms has worn off - many more of the papers I saw had shifted towards figuring out places where GANs or meta-learning could be applied to other areas of research.
Image-to-image learning is getting pretty good, and this opens up a bunch of interesting problems around predictive learning and image-based models and so forth. And of course, generating cool samples is one of the ways your paper gets more press and engagement, so I expect to see more of this in the next few years.
It’s sad that I’m only half-joking about the engagement part. The nature of research is that some of it looks cool and some of it doesn’t, and the coolness factor is tied more towards your problem rather than the quality of your work. One of these days I should clean up my thoughts on the engagement vs quality gap. It’s a topic where I’ll gladly preach to the choir about it, if it means that it gets through to the person in the choir who’s only pretending they know what the choir’s talking about.
Can We Talk About Adversarial Perturbations?
Speaking of cool factor, wow there were a lot of papers about adversarial perturbations. This might be biased by the poster session where all the GAN and adversarial perturbation papers were bunched together, making it seem like more of the conference than it really was (more about this later), but let me rant for a bit.
As a refresher, the adversarial perturbation literature is based around the discovery that if you take images, and add small amounts of noise imperceptible to the human eye, then you get images that are completely misclassified.
Source: (Goodfellow et al, ICLR 2015)
These results really captured people’s attention, and they caught mine too. There’s been a lot of work on learning defenses to improve robustness to these imperceptible noise attacks. I used to think this was cool, but I’m more lukewarm about it now.
Most of the reason I’m not so interested now is tied to the Adversarial Spheres paper (Gilmer et al, 2018). Now big caveat: I haven’t fully read the paper. Feel free to correct me, but here’s my high-level understanding of the result.
Suppose your data lies in some high-dimensional space. Let \(S\) be the set of points your classifier correctly classifies. The volume of this set should match the accuracy of your classifier. For example, if the classifier has \(95\%\) accuracy, then the volume of \(S\) will be \(95\%\) of the total volume of the data space.
When constructing adversarial perturbations, we add some \(\epsilon\) noise to the input point. Given some correctly classified point \(x\), we can find an adversarial example for \(x\) if the \(\epsilon\)-ball centered at \(x\) contains a point that is incorrectly classified.
To reason about the average distance to an adversarial example over the dataset, we can consider the union of the \(\epsilon\)-balls for every correctly classified \(x\). This is equivalent to the set of points within distance \(\epsilon\) of any \(x \in S\).
Let’s consider the volume of \(S\), once it’s expanded by \(\epsilon\) out in all directions. The larger \(\epsilon\) is, the more the volume of \(S\) will grow. If we pick an \(\epsilon\) that grows the volume of \(S\) from \(95\%\) of the space to \(100\%\) of the space, then we’re guaranteed to find an adversarial example, no matter how misclassifications are distributed in space, because every possible misclassification is within \(\epsilon\) of \(S\). The adversarial spheres paper proves bounds on the \(\epsilon\) needed to grow the volume until it covers the entire space, specifically for the case where your data lies on a sphere. Carrying out the math gives an \(\epsilon\) below the threshold for human-perceptible noise. This is then combined with some arguments that the sphere is the best case you can hope for (by relating volume increase to surface area and applying the isoperimetric inequality), and some loose evidence that real data follows similar bounds.
Importantly, the final result depends only on test error and dimensionality of your dataset, and makes no assumptions about how robust your classifier is, or what adversarial defenses you’ve added on top. As long as you have some test error, it doesn’t matter what defenses you try to add to the classifier. The weird geometry of high-dimensional space is sufficient to provide an existence proof for adversarial perturbations. It isn’t that our classifiers aren’t robust, it’s that avoiding adversarial examples is really hard in a 100,000-dimensional space.
(High-dimensional geometry is super-weird, by the way. For example, if you sample two random vectors where each coordinate comes from \(U[-1,1]\), they’ll almost always be almost-orthogonal, by the law of large numbers. A more relevant fun fact is that adding \(\epsilon\) noise to each coordinate of an \(n\)-dimensional point produces a new point that’s \(O(\epsilon \sqrt{n})\) away from your original one. If you consider the dimensionality \(n\) of image data, it should be more intuitive why small perturbations in every dimension give you tons of flexibility towards finding misclassified points.)
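Neither of those facts is hard to check numerically. Here's a quick sketch of my own (numpy only, not taken from any of the papers above) that samples random vectors and measures both effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # dimensionality, roughly the scale of a large image

# Fact 1: random vectors with coordinates from U[-1, 1] are nearly orthogonal.
u = rng.uniform(-1, 1, size=n)
v = rng.uniform(-1, 1, size=n)
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"cosine similarity: {cosine:.4f}")  # very close to 0

# Fact 2: adding eps noise to every coordinate moves the point O(eps * sqrt(n)) away.
eps = 0.01
x = rng.uniform(-1, 1, size=n)
x_perturbed = x + rng.uniform(-eps, eps, size=n)
print(f"distance moved: {np.linalg.norm(x_perturbed - x):.2f}")
print(f"eps * sqrt(n):  {eps * np.sqrt(n):.2f}")  # same order of magnitude
```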
The fact that adversarial examples exist doesn’t mean that it’s easy to discover them. One possible outcome is that adversarial defense research finds a way to hide adversarial examples such that no efficient algorithm can discover them. But this feels unlikely to me. I think defenses for adversarial attacks will be most useful as a form of adversarial data augmentation, rather than as useful stepping stones to other forms of ML security. I’m just not too interested in that kind of work.
A Quick Tangent on Zero-Knowledge Proofs
I don’t want this post to get hijacked by an adversarial examples train, so I’ll keep this brief.
Ian Goodfellow gave a talk at the SafeML ICLR workshop. I’d encourage listening to the full talk; I’d say I agree with most of it.
In that talk, he said that he thinks people are over-focusing on adversarial perturbations. He also proposed dynamic defenses for adversarial examples. In a dynamic defense, a classifier’s output distribution \(p(class|input)\) may change on every input processed, even if the same input is replayed multiple times. This both breaks a ton of assumptions, and gives you more flexible and expressive defense models.
This may be a completely wild connection, but on hearing this I was reminded of zero-knowledge proofs. A lot of zero-knowledge proof schemes integrate randomness into the proof protocol, in a way that lets the prover prove something is true while protecting details of their internal processing. With some twisting of logic, it sounds like maybe there’s a way to make a classifier useful without leaking unnecessary knowledge about how it works, by changing \(p(class|input)\) in the right way for each input. I feel like there might be something here, but there’s a reasonable chance it’s all junk.
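To make the dynamic defense idea a bit more concrete, here's a toy sketch of the kind of thing I imagine - my own illustration, not anything proposed in the talk. The wrapper re-randomizes which ensemble member it uses and adds fresh input noise on every query, so replaying the same input doesn't give an attacker a stable \(p(class|input)\) to optimize against.

```python
import numpy as np

class DynamicDefenseClassifier:
    """Wraps an ensemble so that p(class | input) can change on every query."""

    def __init__(self, ensemble, noise_scale=0.05, seed=None):
        self.ensemble = ensemble          # list of callables: x -> class probabilities
        self.noise_scale = noise_scale
        self.rng = np.random.default_rng(seed)

    def predict_proba(self, x):
        # Re-randomize on every call: pick a random ensemble member and add fresh
        # input noise, so the same x replayed twice need not get the same answer.
        member = self.ensemble[self.rng.integers(len(self.ensemble))]
        noisy_x = x + self.rng.normal(0.0, self.noise_scale, size=np.shape(x))
        return member(noisy_x)
```

Whether this buys any real security is exactly the open question, but it shows why the usual threat model - a fixed, deterministic classifier the attacker can query at will - no longer applies.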
Poster Arrangements
Hey, do you remember that comment I made, about how all the adversarial example and GAN papers were bunched up into one poster session? At this year’s ICLR, posters were grouped by topic. I think the theory was that you could plan which poster sessions you were attending and which ones you weren’t by checking the schedule in advance. Then, you can schedule all your networking catch-ups during the sessions you don’t want to visit.
I wing all my conferences, getting by on a loose level of planning, so that’s not how it played out for me. Instead, I would go between sessions where I didn’t care about any of the posters, and sessions where I wanted to see all the posters. This felt horribly inefficient, because I had to skip posters I knew I’d be interested in reading, due to the time crunch of trying to see everything, and then spend the next session doing nothing.
A friend of mine pointed out another flaw: the posters they most wanted to see were in the same time slot as their poster presentation. That forced a trade-off between presenting their work to other people, and seeing posters for related work in their subfield.
My feeling is that ICLR should cater to the people presenting the posters, and the experience of other attendees should be secondary. Let’s quickly solve the optimization problem. Say a subfield has \(N\) posters, and there are \(k\) different poster sessions. As an approximation, every poster is presented by \(1\) person, and that person can’t see any posters in the same session they’re presenting in. We want to allocate the posters such that we maximize the average number of posters each presenter sees.
I’ll leave the formal proof as an exercise (you’ll want your Lagrange multipliers), but the solution you get is that the \(N\) posters should be divided evenly between the \(k\) poster sessions. Now, in practice, posters can overlap between subfields, and it can be hard to even define what is and isn’t a subfield. Distributing exactly evenly is a challenge, but if we assign posters randomly to each poster session, then every subfield should work out to be approximately even.
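If you don't feel like pulling out the Lagrange multipliers, a brute-force check makes the same point. A presenter whose session holds \(n_i\) of the subfield's posters can see the other \(N - n_i\), so the average over presenters is \(\frac{1}{N}\sum_i n_i (N - n_i)\), which the even split maximizes. Here's a quick sketch with made-up numbers:

```python
from itertools import product

N, k = 12, 3  # say, 12 posters in a subfield spread over 3 poster sessions

def avg_posters_seen(sizes):
    # A presenter in a session holding n of the subfield's posters can only
    # see the other N - n; average over all N presenters.
    return sum(n * (N - n) for n in sizes) / N

allocations = [s for s in product(range(N + 1), repeat=k) if sum(s) == N]
best = max(allocations, key=avg_posters_seen)
print(best, avg_posters_seen(best))   # (4, 4, 4) 8.0 -- the even split wins
print(avg_posters_seen((N, 0, 0)))    # 0.0 -- everything clustered in one session
```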
To me, it felt like the ICLR organizers spent a bunch of time clustering papers, when randomness would have been better. To quote CGP Grey, “Man, it’s always frustrating to know that to literally have done nothing would be faster than the something that is done”. I’m open to explanations why randomness would be bad though!
The Structure and Priors Debate
This year, ICLR tried out a debate during the main conference. The topic was about what should be given to machine learning models as a given structure or prior about the world, and what should be learned from data. I got the impression that the organizers wanted it to be a constructive, fiery, and passionate debate. To be blunt, it wasn’t.
I’m in a slightly unique position to comment on this, because I actually took part in the ICML 2018 Debates workshop. I’d rather people not know I did this, because I was really, really winging it, armed with a position paper I wrote in a day. I’m not even sure I agree with my position paper now. Meanwhile, the other side of the debate was represented by Katherine and Zack, who had considerably more coherent position papers. It was like walking into what I thought was a knife fight, armed with a small paring knife, and realizing it was an “anything goes” fight, where they have defensive turrets surrounding a fortified bunker.
But then the debate started, and it all turned out fine, because we spent 90% of our time agreeing about every question, and none of us had any reason to pull out particularly heavy linguistic weaponry. It stayed very civil, and the most fiery comments came from the audience, not from us.
When talking to the organizers of the ICML debates workshop after the fact, they said the mistake was assuming that if they took people with opposing views, and had them talk about the subject they disagreed on, it would naturally evolve into an interesting debate. I don’t think it works that way. To get things to play out that way, I believe you have to continually prod the participants towards the crux of their disagreements - and this crux is sometimes not very obvious. Without this constant force, it’s easy to endlessly orbit the disagreement without ever visiting it.
Below is a diagram for a similar phenomenon, where grad students want to work on a thesis right up until they actually sit down and try to do it. I feel a similar model is a good approximation for machine learning debates.
Source: PhD Comics
Look, I’m not going to mince words. Machine learning researchers tend to be introverted, tend to agree more than they disagree, and are usually quite tolerant of differing opinions over research hypotheses. And it’s really easy to unintentionally (or intentionally) steer the conversation towards the region of carefully qualified, agreeable conversation, where no one remembers it by tomorrow. This is especially true if you’re debating a nebulous term like “deep learning” or “structure” or “rigor”, where you can easily burn lots of time saying things like, “Yes, but what does deep learning mean?”, at which point every debater presents their own definition and you’ve wasted five minutes saying very little. The context of “we’re debating” pushes towards the center. The instinct of “we’re trying to be nice” pushes way, way harder away from the center.
I think ML debates are cool in theory, and I’d like to see a few more shots at making them happen, but if it does happen again, I’d advise the debate moderators to go in with the mindset that ML debates need a lot of design to end in a satisfying way, with repeated guidance towards the crux of the debaters’ disagreements.
Conclusion
ICLR was pretty good this year. New Orleans is a nice convention city - lots of hotels near the convention center, and lots of culture within walking distance. I had a good time, and as someone who’s lived in California for most of their life, I appreciated getting to experience a city in the South for a change. It was great, aside from the weather. On that front, California just wins.
-
OpenAI Finals
OpenAI just beat OG, champions of The International 8, in a 2-0 series. They also announced that in private, they had won three other pro series: 2-0 over Team Lithium, 2-0 over SG e-sports, and 2-0 over Alliance. Pretty cool! I don’t have a lot to add this time, but here are my thoughts.
OG Isn’t the Top Team and That Doesn’t Matter
After pulling off an incredible Cinderella story and winning TI8, OG went through some troubles. My understanding is that they’ve started to recover, but are no longer the consensus best team.
To show this, we can check the GosuGamers DotA 2 rankings. This assigns an Elo rating to the top DotA 2 teams, based on their match history in tournaments. At the time of this post, OG is estimated as the 11th best team.
I don’t think this really matters, because as we’ve seen with the 1v1 bot, the previous OpenAI Five match, and AlphaStar, once you’re at the level of semi-pro, reaching pro is more a matter of training time and steady incremental training improvements than anything else. Going into the match, I thought the only way OG would have a chance was if the restrictions were radically different from the ones used at TI8. They weren’t. Given that OpenAI Five beat a few other pro teams, I believe this match wasn’t a fluke and there’s no reason they couldn’t beat Secret or VP or VG with enough training time.
Reaction Times Looked More Believable
I’m not sure if OpenAI added extra delay or not, but the bot play we saw felt more fair and looked more like a player with really good mechanics, rather than superhuman mechanics. There were definitely some crazy outplays but it didn’t look impossible for a human to do it - it just looked very, very difficult.
If I had to guess, it would be that the agent still processes input at the same speed, but has some fixed built-in delay between deciding an action and actually executing it. That would let you get more believable reactions without compromising your ability to observe environment changes that are only visible for fractions of a second.
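As a toy illustration of that guess - and it is only a guess about how such a delay could work, not a description of OpenAI's code - the mechanism could be as simple as a fixed-length queue between the policy and the game: the agent still sees every frame, but the action it chooses now only executes a few frames later.

```python
from collections import deque

class DelayedActionAgent:
    """Wraps a policy so that each chosen action executes `delay_frames` later."""

    def __init__(self, policy, delay_frames, noop_action=None):
        self.policy = policy
        # Pre-fill with no-ops so there is always an action ready to execute.
        self.pending = deque([noop_action] * delay_frames)

    def step(self, observation):
        # The agent still processes every observation the moment it arrives...
        self.pending.append(self.policy(observation))
        # ...but the action executed now was decided `delay_frames` steps ago.
        return self.pending.popleft()
```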
Limited Hero Pool is a Bit Disappointing
I think it’s pretty awesome that OpenAI Five won, but one thing I’m interested by is the potential for AI to explore the hero pool and identify strategies that pro players have overlooked. We saw this in Go with the 3-3 invasion followup. We saw this in AlphaStar, with the strength of well microed Stalkers, although the micro requirements seem very high. With OpenAI Five, we saw that perhaps early buybacks have value, although again, it’s questionable whether this makes sense or whether the bot is just playing weird. (And the bot does play weird, even if it does win anyways.)
When you have a limited hero pool, you can’t learn about unimplemented heroes, and therefore the learned strategies may not generalize to full DotA 2, which limits the insights humans can take away from the bot’s play. And that’s a real shame.
It seems unlikely that we’ll see an expansion of the hero pool, given that this is the last planned public event. It’s a lot more compute for what is already a compute heavy project. It would also require learning how to draft, assuming draft works the same as the TI8 version. In the TI8 version, the win rate of every possible combination of heroes is evaluated, and the draft is done by picking the least exploitable next hero. Given a pool of 17 heroes, there are \(\binom{17}{5}\binom{12}{5}\cdot 2 = 9801792\) different hero combinations, which is small enough to be brute-forced. A full hero pool breaks this quick hack and requires using a learned approach instead. I’m sure it’s doable (there’s existing work for this), but it’s another hurdle that makes it look even more unlikely.
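For what it's worth, that count checks out (this is just re-deriving the number above, not a peek at any actual drafting code):

```python
from math import comb

# One team picks 5 heroes out of the 17-hero pool, the other picks 5 of the
# remaining 12, with a final factor of 2 as in the formula above.
print(comb(17, 5) * comb(12, 5) * 2)  # 9801792
```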
A Million 3k MMR Teams at Five Million Keyboards Have to Win Eventually
At the end of the match, OpenAI announced that they were opening sign-ups to allow everyone to play against or with OpenAI Five. It’s only going to be up for a few days, but it’s still exciting nonetheless. I have no idea how much the inference will cost in cloud credit (which is presumably why it’s only running for a few days).
I fully expect somebody to figure out a cheese strategy that the bot has trouble handling. I also expect every pro team to try beating it for kicks, because if they can beat it consistently, can you imagine how much free PR they’d get? If they don’t beat it, they don’t have to say anything, so it seems like a win-win.
There is a chance that the bot is genuinely too good, in an “AlphaGo Master goes 60-0 against pros” kind of way, but that was 60 games, and way more than 60 people are going to try to beat OpenAI Five. They’re not all going to be pros, but scale is going to matter more than skill here.
When OpenAI let TI attendees play their 1v1 bot that beat several pros, people were able to find all sorts of cheese strategies that worked. That was an older version of the bot, so the precedent may not carry over, but I’m going to guess somebody will figure out something sufficiently out-of-distribution.
We Still Take Pride in Few Shot Learning
In the interview with Purge after the match, OG N0tail had an interesting comment:
Purge: If you guys got to play 5 matches right now against them, do you think you could take at least 1 win?
N0tail: Yeah, for sure. For sure 1 win. If we played 10, we’d start winning more, and if we could play 50 games against them, I believe we’d start winning very very reliably.
He later elaborated that he felt the bot had exploitable flaws in how it played around vision, but I think the more important note is that we take pride in our ability to actively try new things based on very few examples. The debate over how to do this is endless, but it makes me think that if somebody manages to demo impressive few-shot learning, we’ll start running out of excuses about AI.
-
An Overdue Post on AlphaStar, Part 2
This is part 2 of my post about AlphaStar, which is focused on the machine learning implications. Click here for part 1.
A Quick Overview of AlphaStar’s Training Setup
It’s impossible to talk about AlphaStar without briefly covering how it works. Most of the details are vague right now, but more have been promised in an upcoming journal article. This summary is based off of what’s public so far.
AlphaStar is made of 3 sequence models that likely share some weights. Each sequence model receives the same observations: the raw game state. There are then three sets of outputs: where to click, what to build/train, and a reward predictor.
This model is trained in a two stage process. First, it is trained using imitation learning on human games provided by Blizzard. My notes from the match say it takes 3 days to train the imitation learning baseline.
The models are then further trained using IMPALA and population-based training, plus some other tricks I’ll get to later. This is called the AlphaStar League. Within the population, each agent is given a slightly different reward function, some of which include rewards for exploiting other specific agents in the league. Each agent in the population is trained with 16 TPUv3s, which are estimated to be equivalent to about 50 GPUs each. The population-based training was run for 14 days.
(Diagram of the training setup, from the original DeepMind blog post)
I couldn’t find any references for the population size, or how many agents are trained simultaneously. I would guess “big” and “a lot”, respectively. Now multiply that by 16 TPUs each and you get a sense of the scale involved.
After 14 days, they computed the Nash equilibrium of the population, and for the showmatch, selected the top 5 least exploitable agents, using a different one in every game.
All agents were trained in Protoss vs Protoss mirrors on a fixed map, Catalyst LE.
Takeaways
1. Imitation Learning Did Better Than I Thought
I have always assumed that when comparing imitation learning to reinforcement learning, imitation learning performs better when given fewer samples, but reinforcement learning wins in the long run. We saw that play out here.
One of the problems with imitation learning is the way errors can compound over time. I’m not sure if there’s a formal name for this. I’ve always called it the DAgger problem, because that’s the paper that everyone cites when talking about this problem (Ross et al, AISTATS 2011).
Intuitively, the argument goes like this: suppose you train an agent by doing supervised learning on the actions a human does. This is called behavioral cloning, and is a common baseline in the literature. At \(t=0\), your model acts with small error \(\epsilon_0\). That’s fine. This carries it to a state that’s modelled less well, since the expert visited it less often. Now at \(t=1\), it acts with slightly larger error \(\epsilon_1\). This is more troubling, as we’re in a state with even less expert supervision. At \(t=2\), we get a larger error \(\epsilon_2\), at \(t=3\) an even larger \(\epsilon_3\), and so on. As the errors compound over time, the agent goes through states very far from expert behavior, and because we don’t have labels for these states, the agent is soon doing nonsense.
This problem means mistakes in imitation learning often aren’t recoverable, and the temporal nature of the problem means that the longer your episode is, the more likely it is that you enter this negative feedback loop, and the worse you’ll be if you do. You can prove that if the expected loss each timestep is \(\epsilon\), then the worst-case bound over the episode is \(O(T^2\epsilon)\), and for certain loss functions this worst-case bound is tight.
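Here’s a rough version of where the quadratic comes from - a back-of-the-envelope sketch in the spirit of that analysis, not the exact proof from the paper. Suppose the cloned policy makes a mistake with probability at most \(\epsilon\) on states the expert would visit, and that once it drifts off the expert’s distribution it can do arbitrarily badly, paying cost up to \(1\) per step for the rest of the episode. Then

\[\text{extra cost} \;\le\; \sum_{t=1}^{T} \Pr[\text{first mistake at step } t]\,(T - t) \;\le\; \sum_{t=1}^{T} \epsilon\,(T - t) \;=\; \epsilon\,\frac{T(T-1)}{2} \;=\; O(T^2 \epsilon).\]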
Because the bound grows quadratically in \(T\), we expect long-horizon tasks to be harder for imitation learning. A StarCraft game is long enough that I didn’t expect imitation learning to work at all. And yet, imitation learning was good enough to reach the level of a Gold player.
The first version of AlphaGo was bootstrapped by doing behavioral cloning on human games, and that version was competitive against top open-source Go engines of the time. But Go is a game with at most 200-250 moves, whereas StarCraft has thousands of decision points. I assumed that you would need a massive dataset of human games to get past this, more than Blizzard could provide. I’m surprised this wasn’t the case.
My guess is that this is tied into another trend: despite the problems with behavioral cloning, it’s actually a pretty strong baseline. I don’t do imitation learning myself, but that’s what I’ve been hearing. I suspect that’s because many of behavioral cloning’s problems can be covered up with better data collection. Here’s how DAgger resolves the DAgger problem, first in words and then as a rough sketch of the loop.
Given expert policy \(\pi^*\) and current policy \(\hat{\pi}_i\), we iteratively build a dataset \(\mathcal{D}\) by collecting data from a mixture of the expert \(\pi^*\) and the current policy \(\hat{\pi}_i\). We alternate between training policies and collecting data, and because we always collect with a mixture of expert and on-policy behavior, the dataset ends up containing both expert states and states close to ones our current policy would visit.
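In code, the loop looks roughly like this. This is a paraphrase of the algorithm box from the paper, with `env`, `expert_policy`, `train_classifier`, and the episode interface all left as placeholders - the point is just where the data comes from and where the supervised learning step sits.

```python
import random

def dagger(env, expert_policy, initial_policy, train_classifier,
           num_iterations, horizon, beta_schedule):
    """Sketch of the DAgger training loop (Ross et al., AISTATS 2011)."""
    dataset = []                      # aggregated (state, expert_action) pairs
    policy = initial_policy
    for i in range(num_iterations):
        beta = beta_schedule(i)       # probability of letting the expert act this round
        state = env.reset()
        for _ in range(horizon):
            # Collect states by acting with a mixture of expert and current policy...
            action = expert_policy(state) if random.random() < beta else policy(state)
            # ...but always label the visited state with what the expert would have done.
            dataset.append((state, expert_policy(state)))
            state, done = env.step(action)
            if done:
                break
        # The core optimization step: supervised learning on the aggregated dataset
        # (the "train classifier" line). Only the data collection changes per iteration.
        policy = train_classifier(dataset)
    return policy
```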
But importantly, the core optimization loop (the “train classifier” line) is still based on maximizing the likelihood of actions in your dataset. The only change is on how the data is generated. If you have a very large dataset, from a wide variety of experts of varying skill levels (like, say, a corpus of StarCraft games from anyone who’s ever played the game), then it’s possible that your data already has enough variation to let your agent learn how to recover from several of the incorrect decisions it could make.
This is something I’ve anecdotally noticed in my own work. When collecting robot grasping data, we found that datasets collected with small amounts of exploration noise led to significantly stronger policies than datasets without it.
The fact that imitation learning gives a good baseline seems important for bootstrapping learning. It’s true that AlphaZero was able to avoid this, but AlphaGo with imitation learning bootstrapping was developed first. There usually aren’t reasons to discard warm-starting from a good base policy, unless you’re deliberately doing it as a research challenge.
2. Population Based Training is Worth Watching
StarCraft II is inherently a game based around strategies and counter-strategies. My feeling is that in DotA 2, a heavy portion of your strategy is decided in the drafting phase. Certain hero compositions only work for certain styles of play. Because of this, once the draft is done, each team has an idea of what to expect.
However, a StarCraft II game starts with almost nothing revealed about your opponent’s plan. Builds can range from heavy early aggression to greedy expansions for long-term payoff. It seems more likely that StarCraft could devolve into unstable equilibria if you try to represent the space of strategies within a single agent.
Population-based training does a lot to avoid this problem. A simple self-play agent “gets stuck”, but a population-based approach reaches Grandmaster level. One of the intuitive traps in self-play is that if you only play against the most recent version of yourself, then you could endlessly walk around a rock-paper-scissors loop, instead of discovering the trick that beats rock, paper, and scissors.
I haven’t tried population based training myself, but from what I heard, it tends to give more gains in unstable learning settings, and it seems likely that StarCraft is one of those games with several viable strategies. If you expect the game’s Nash equilibrium to turn into an ensemble of strategies, it seems way easier to maintain an ensemble of agents, since you get a free inductive prior.
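Here’s a minimal sketch of that difference, assuming nothing about AlphaStar’s actual league code. Naive self-play always trains against the latest copy of yourself, which is what lets you cycle endlessly around rock-paper-scissors; a population or league approach keeps old and specialized agents around and samples opponents from the whole pool, so a strategy only survives if it holds up against everything discovered so far.

```python
import random

class League:
    """Toy opponent pool in the spirit of population-based / league training."""

    def __init__(self, initial_agents):
        self.population = list(initial_agents)

    def sample_opponent(self, latest_only=False):
        if latest_only:
            # Naive self-play: always fight the newest agent. Prone to cycling
            # through rock -> paper -> scissors -> rock -> ...
            return self.population[-1]
        # League-style: sample from the whole population (past snapshots,
        # exploiter agents with special reward functions, and so on).
        return random.choice(self.population)

    def add_snapshot(self, agent):
        # Periodically freeze a copy of the current agent into the pool.
        self.population.append(agent)
```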
3. Once RL Does Okay, Making It Great Is Easier
In general, big RL projects seem to fall into two buckets.
- They don’t work at all.
- They work and become very good with sufficient compute, which may be very large due to diminishing returns.
I haven’t seen many in-betweens where things start to work, and then hit a disappointingly low plateau.
One model that would explain this is that algorithmic and training tricks are all about adding constant multipliers to how quickly your RL agent can learn new strategies. Early in a project, everything fails, because the learning signal is so weak that nothing happens. With enough tuning, the multipliers become large enough for agents to show signs of life. From there, it’s not like the agent ever forgets how to learn. It’s always capable of learning. It’s just a question of whether the things needed for the next level of play are hard to learn or not.
Humans tend to pick up games easily, then spend forever mastering them. RL agents seem to have the opposite problem - they pick up games slowly, but then master them with relative ease. This means the gap between blank-slate and pretty-good is actually much larger than the gap between pretty-good and pro-level. The first requires finding what makes learning work. The second just needs more data and training time.
The agent that beat TLO on his offrace was trained for about 7 days. Giving it another 7 days was enough to beat MaNa on his main race. Sure, double the compute is a lot of compute, but the first success took almost three years of research time and the second success took seven days. Similarly, although OpenAI’s DotA 2 agent lost against a pro team, they were able to beat their old agent 80% of the time with 10 days of training. Wonder where it’s at now…
4. We Should Be Tossing Techniques Together More Often
One thing I found surprising about the AlphaStar architecture is how much stuff goes into it. Here’s a list of papers referenced for the model architecture. I’ve added links to everything that’s non-standard.
- A transformer is used to do self-attention (Vaswani et al, NeurIPS 2017).
- This self-attention is used to add inductive biases to the architecture for learning relations between objects (Zambaldi et al, to be presented at ICLR 2019).
- The self-attention layers are combined with an LSTM.
- The policy is auto-regressive, meaning it predicts each action dimension conditioned on the action dimensions predicted before it.
- This also uses a pointer network (Vinyals et al, NeurIPS 2015), which more easily supports variable length outputs for variable length inputs.
Diagram of PointerNet from original paper. A conventional RNN-based seq2seq model conditionally predicts output from the latent code. A PointerNet outputs attention vectors over its inputs.
My guess for why a pointer net helps is that StarCraft involves controlling many units in concert, and the number of units changes over the game. Given that you need to output an action for each unit, a pointer network is a natural choice.
- The model then uses a centralized value baseline, linking a counterfactual policy gradient algorithm for multi-agent learning (Foerster et al, AAAI 2018).
Diagram of counterfactual multi-agent (COMA) architecture, from original paper. Instead of having a separate actor-critic pair for each agent, all agents share the same critic and get per-agent advantage estimates by marginalizing over the appropriate action.
This is just for the model architecture. There are a few more references for the training itself.
- It’s trained with IMPALA (Espeholt et al, 2018).
- It also uses experience replay.
- And self-imitation learning (Oh et al, ICML 2018).
- And policy distillation in some way (Rusu et al, ICLR 2016).
- Which is trained with population-based training (Jaderberg et al, 2018).
- And a reference to Game-Theoretic Approaches to Multiagent RL (Lanctot et al, NeurIPS 2017). I’m not sure where this is used. Possibly for adding new agents to the AlphaStar league that are tuned to learn the best response to existing agents?
Many of these techniques were developed just in the last year. Based on the number of self-DeepMind citations, and how often those papers report results on the StarCraft II Learning Environment, it’s possible much of this was developed specifically for the StarCraft project.
When developing ML research for a paper, there are heavy incentives to change as little as possible, and concentrate all the risk on your proposed improvement. There are many good reasons for this. It’s good science to change just one variable at a time. By sticking closer to existing work, it’s easier to find previously run baselines. It’s also easier for others to validate your work. However, this means there are good reasons not to incorporate prior state-of-the-art techniques into your research project. The added risk makes the cost-benefit analysis unfavorable.
This is a shame, because ML is a very prolific field, and yet there isn’t a lot of cross-paper pollination. I’ve always liked the Rainbow DQN paper (Hessel et al, AAAI 2018), just because it asked what would happen if you tossed everything together. AlphaStar feels like something similar: several promising ideas that combine into a significantly stronger state-of-the-art.
These sorts of papers are really useful for verifying which techniques are worth using and which ones aren’t, because distributed evaluation across tasks and settings is really the only way we get confidence that a paper is actually useful. But if the incentives discourage adding more risk to research projects, then very few techniques get this distributed evaluation. Where does that leave us? It’s almost certain that the existing pieces of machine learning can do something we think they can’t, and the only blocker is that no one’s tried the right combination of techniques.
I wonder if the endgame is that research will turn into a two-class structure. One class of research will be bottom-up, studying well-known baselines, without a lot of crossover with other papers, proposing many ideas of which 90% will be useless. The other class will be top-down, done for the sake of achieving something new on an unsolved problem, finding the 10% of useful ideas with trial-and-error and using scale to punch through any barriers that only need scale to solve.
Maybe we’re already in that endgame. If so, I don’t know how I feel about that.
Predictions
In 2016, shortly after the AlphaGo vs Lee Sedol match, I got into a conversation with someone about AGI timelines. (Because, of course, whenever ML does something new, some people will immediately debate what it means for AGI.) They thought AGI was happening very soon. I thought it wasn’t, and as an exercise they asked what would change my mind.
I told them that given that DeepMind was working on StarCraft II, if they beat a pro player within a year, I’d have to seriously revise my assumptions on the pace of ML progress. I thought it would take five to ten years.
The first win in the AlphaGo vs Lee Sedol match was on March 9, 2016, and the MaNa match was announced January 24, 2019. It took DeepMind just shy of three years to do it.
The last time I took an AI predictions questionnaire, it only asked about moonshot AI projects. Accordingly, almost all of my guesses were at least 10 years in the future. None of what they asked has happened yet, so it’s unclear to me if I’m poorly calibrated on moonshots or not - I won’t be able to know for sure until 10 years have passed!
This is probably why people don’t like debating with futurists who only make long-term predictions. Luckily, I don’t deal with people like that very often.
To try to avoid this problem with AlphaStar, let me make some one-year predictions.
If no restrictions are added besides no global camera, I think within a year AlphaStar will be able to beat any pro, in any matchup, on any set of maps in a best-of-five series.
So far, AlphaStar has only played Protoss vs Protoss on a single map. I see no reason why a similar technique wouldn’t generalize to other maps or races if given more time. The Reddit AMA says that AlphaStar already does okay on maps it wasn’t trained on, suggesting the model can learn core, generalizable features of StarCraft.
Other races are also definitively on DeepMind’s roadmap. I’ve read some theories claiming DeepMind started with Terran, but moved to Protoss because their agents kept lifting up their buildings early in training. That could be tricky, but doesn’t sound impossible.
The final showmatch against MaNa did expose a weakness where AlphaStar didn’t know how to deal with drops, wasting time moving its army back and forth across the map. Now that this problem is a known quantity, I expect it to get resolved by the next showmatch.
If restrictions are added to make AlphaStar’s gameplay look more human, I’m less certain what will happen. It would depend on what the added restrictions were. The most likely restriction is one on burst APM. Let’s say a cap of 700 burst APM, as that seems roughly in line with MaNa’s numbers. With a 700 burst APM restriction, it’s less likely AlphaStar will be that good in a year, but I’d still put the odds above 50%. My suspicion is that existing strategies will falter with tighter APM limits. I also suspect that with enough time, population-based training will find strategies that are still effective.
One thing a few friends have mentioned is that they’d like to see extended games of a single AlphaStar agent against a pro, rather than picking a different agent every game. This would test whether a pro can quickly learn to exploit that agent, and whether the agent adapts its strategy based on its opponent’s strategy. I’d like to see this too, but it seems like a strictly harder problem than using a different agent from the ensemble for each game, and I don’t see reasons for DeepMind to switch off the ensemble. I predict we won’t see any changes on this front.
Overall, nothing I saw made me believe we’ve seen the limit of what AlphaStar can do.
Thanks to the following people for reading drafts of this post: Tom Brown, Jared Quincy Davis, Peter Gao, David Krueger, Bai Li, Sherjil Ozair, Rohin Shah, Mimee Xu, and everyone else who preferred staying anonymous.