How AlphaGo Works

Today Google DeepMind announced a neural-network AI for the game Go, AlphaGo, that rivals the strength of human professional players. The paper by David Silver et al. describes AlphaGo in detail. Their technique is surprisingly simple yet very powerful. Here is my attempt to explain how the system works, for readers not familiar with the technical jargon used in the paper.

Chinese translation (credit: 数据分析网): http://www.afenxi.com/post/8713


Deep Learning

“Deep learning” refers to multi-layered artificial neural networks and the methods used to train them. A neural network layer is just a large matrix of numbers that takes a set of numbers as input, weights them, passes the result through a nonlinear activation function, and produces another set of numbers as output. This is intended to mimic the function of neurons in a biological brain. With the right numbers in that matrix, and several of these layers hooked together in a chain, a neural network “brain” can perform accurate and sophisticated processing, like recognizing patterns or labeling photos.
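To make that concrete, here is a minimal sketch in Python (my own toy illustration, not DeepMind's code) of what a couple of chained layers compute: multiply the inputs by a matrix of weights, add a bias, and squash the result through a nonlinear activation.

```python
import numpy as np

def layer(x, W, b):
    """One neural-network layer: weight the inputs, add a bias,
    then pass the result through a nonlinear activation (ReLU here)."""
    return np.maximum(0.0, W @ x + b)

# A tiny two-layer "brain": 4 inputs -> 8 hidden units -> 3 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

x = np.array([0.2, -1.0, 0.5, 0.7])   # a set of numbers as input
hidden = layer(x, W1, b1)
output = layer(hidden, W2, b2)        # another set of numbers as output
print(output)
```

Finding the “right numbers” for W1, b1, W2, and b2 is exactly the training problem described next.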

Neural networks were invented decades ago, and mostly lay in obscurity until just recently. This is because they require extensive “training” to discover good values for the numbers in those matrices. The minimum amount of training needed to get good results exceeded the computer power and data sizes available to early neural network researchers. But in the last few years, teams with access to large-scale resources have rediscovered neural networks, which are now possible to train effectively using “Big Data” techniques.

Two Brains

AlphaGo is built using two different neural-network “brains” that cooperate to choose its moves. These brains are multi-layer neural networks almost identical in structure to the ones used for classifying pictures for image search engines like Google Image Search. They start with several hierarchical layers of 2D filters that process a Go board position just like an image-classifying network processes an image. Roughly speaking, these first filters identify patterns and shapes. After this filtering, 13 fully-connected neural network layers produce judgments about the position they see. Roughly, these layers perform classification or logical reasoning.

Figure from Silver et al.
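To give a flavor of what those 2D filters do, here is a toy sketch of my own (not the architecture from the paper): a small grid of weights is slid across the board, producing a map of how strongly a particular local pattern appears at each point.

```python
import numpy as np

def apply_filter(board, kernel):
    """Slide a small 2D filter across the board and record, at each point,
    how strongly the local patch under the filter matches the pattern."""
    k = kernel.shape[0]
    out = np.zeros((board.shape[0] - k + 1, board.shape[1] - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(board[r:r + k, c:c + k] * kernel)
    return out

board = np.zeros((19, 19))
board[5, 5:8] = 1.0                     # three stones in a row
kernel = np.array([[1.0, 1.0, 1.0],     # a filter that responds to that shape
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
response = apply_filter(board, kernel)
print(response.max())                   # largest where the pattern sits
```

Real networks learn the numbers in many such filters rather than hand-picking them, and stack the resulting pattern maps layer upon layer.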

The networks are trained by repeatedly checking their results and feeding back corrections that adjust the numbers to make the network perform better. This process has a large element of randomness, so it’s impossible to know exactly how the network does its “thinking”, only that it tends to improve after more training.
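As a rough illustration of “feeding back corrections” (my own toy, far simpler than the training actually used for deep networks): make a prediction, measure the error, and nudge each weight a little in the direction that would have reduced that error, many times over.

```python
import numpy as np

# Toy: discover the numbers in a matrix by repeatedly correcting small errors.
rng = np.random.default_rng(1)
W_true = rng.normal(size=(2, 3))      # the "right" numbers we hope to discover
W = np.zeros((2, 3))                  # start with arbitrary weights
learning_rate = 0.1

for step in range(1000):
    x = rng.normal(size=3)            # a random training input
    target = W_true @ x               # the answer we check the result against
    error = W @ x - target
    W -= learning_rate * np.outer(error, x)   # feed back a small correction

print(np.round(W - W_true, 3))        # nearly zero: training found the numbers
```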

Brain #1: The Move Picker

AlphaGo’s first neural-network brain, which the paper refers to as the “supervised-learning (SL) policy network”, looks at a board position and attempts to decide the best next move to make. More precisely, it estimates a probability for each legal next move, and its top guess is the move with the highest probability. You can think of this as a “move-picking” brain.

How the move picker sees the board. Numbers indicate the relative likelihood of a strong human player putting their next stone on that spot. Figure from Silver et al.
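In code terms, the move picker's output is just a probability for every point on the board, and picking a move means taking the legal point with the highest probability. A minimal sketch of my own (the scores below are random stand-ins for the real network's output):

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def pick_move(board_scores, legal_mask):
    """board_scores: 19x19 raw scores (here a stand-in for the policy network).
    legal_mask: 19x19 booleans, True where a move is legal."""
    masked = np.where(legal_mask, board_scores, -np.inf)
    probs = softmax(masked.ravel())
    best = int(probs.argmax())          # the top guess
    return divmod(best, 19), probs.reshape(19, 19)

rng = np.random.default_rng(2)
scores = rng.normal(size=(19, 19))      # pretend network output
legal = np.ones((19, 19), dtype=bool)
(row, col), move_probs = pick_move(scores, legal)
print(f"top move: row {row}, col {col}, p = {move_probs[row, col]:.3f}")
```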

Silver’s team trained this brain on millions of examples of moves made by strong human players on KGS. It’s the most human-like part of AlphaGo, in that its goal is only to replicate the move choices of strong human players. It expressly does not care about winning games, only about picking the same next move that a strong human player would pick. AlphaGo’s move picker correctly matches strong humans about 57% of the time. (Mismatches aren’t necessarily a mistake; it might be the human who made the wrong move!)

The Move Picker, Stronger

The full AlphaGo system actually needs two additional versions of the basic move-picking brain. One version, the “reinforcement-learning (RL) policy network”, is a refined version that is trained more intensively using millions of additional simulated games. Think of this as the “stronger” move picker. In contrast to the basic training described above, which only teaches the network to mimic a single human move, the advanced training plays out each simulated game to the end in order to teach the network which next moves are likely to lead to winning the game. Silver’s team synthesized millions of training games by playing the stronger move picker against previous editions of itself from earlier training iterations.
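The paper has the real details; the flavor of this “play to the end and learn from the result” training, in a heavily simplified sketch of my own (a one-move toy game, not Go), is a policy-gradient update: raise the probability of moves that appeared in won games and lower it for moves that appeared in lost games.

```python
import numpy as np

rng = np.random.default_rng(3)
win_prob = np.array([0.8, 0.5, 0.2])   # hidden odds of winning after each move
logits = np.zeros(3)                   # the policy's adjustable numbers
learning_rate = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for game in range(5000):
    probs = softmax(logits)
    move = rng.choice(3, p=probs)          # the policy picks a move
    won = rng.random() < win_prob[move]    # play the "game" out to the end
    reward = 1.0 if won else -1.0
    grad = -probs
    grad[move] += 1.0                      # gradient of log p(move) w.r.t. logits
    logits += learning_rate * reward * grad    # reinforce winning choices

print(np.round(softmax(logits), 2))    # ends up strongly preferring move 0
```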

The stronger move picker alone is already a formidable Go player, somewhere in the amateur low-dan range, on par with the strongest pre-existing Go AIs. It’s important to note that this move picker does no “reading” at all. It simply scrutinizes a single board position and comes up with one move based on its analysis of that position. It does not attempt to simulate any future moves. This shows the surprising power of simple deep neural-network techniques.

The Move Picker, Faster

The AlphaGo team did not stop here, of course. In a moment I will describe how they added reading ability to the AI. For reading to work, though, they needed a faster version of the move-picking brain. The stronger version takes too long to produce an answer: it’s fast enough when you only need one good move, but reading calculations need to check thousands of possible moves before making a decision.

Silver’s team built a simplified move picker to create a “fast reading” version, which they call the “rollout network”. The simplified version doesn’t look at the entire 19×19 board, but instead only looks at a smaller window around the opponent’s previous move and the new move it’s considering. Removing part of the move-picker’s brain in this way lowers its strength, but the leaner version computes about 1,000 times faster, making it suitable for reading calculations.
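Here is a sketch of the “look only at a small window” idea (my own illustration; the real rollout network uses small hand-crafted local features rather than exactly this):

```python
import numpy as np

def local_window(board, move, size=3):
    """Extract a (2*size+1) x (2*size+1) patch centered on a move.
    Points off the edge are padded with -1. A patch this small is the kind
    of cheap input a fast rollout policy can score almost instantly."""
    r, c = move
    padded = np.pad(board, size, constant_values=-1)
    return padded[r:r + 2 * size + 1, c:c + 2 * size + 1]

board = np.zeros((19, 19), dtype=int)   # 0 empty, 1 black, 2 white
board[3, 3], board[3, 4] = 1, 2
patch = local_window(board, (3, 4))
print(patch.shape)                      # (7, 7) instead of the full 19x19
```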

Brain #2: The Position Evaluator

AlphaGo’s second brain answers a different question than the move picker. Instead of guessing a specific next move, it estimates the chance of each player winning the game, given a board position. This “position evaluator”, which the paper calls the “value network”, complements the move picker by providing an overall positional judgment. This judgment is only approximate, but it’s useful for speeding up reading. By classifying potential future positions as “good” or “bad”, AlphaGo can decide whether or not to read more deeply along a particular variation. If the position evaluator says a particular variation looks bad, then the AI can skip reading any more moves along that line of play.

How the position evaluator sees the board. Darker blue represents places where the next stone leads to a more likely win for the player. Figure from Silver et al.
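In code, using the position evaluator to prune the search might look roughly like this (a toy sketch; the value_network below is a random stand-in, not the real thing):

```python
import numpy as np

rng = np.random.default_rng(4)

def value_network(position):
    """Stand-in for the real value network: the estimated probability that
    the player to move wins from this position. Random here so the sketch runs."""
    return rng.random()

def worth_reading_deeper(position, threshold=0.15):
    """Skip variations the evaluator already judges as clearly lost."""
    return value_network(position) > threshold

candidate_positions = [f"variation-{i}" for i in range(8)]
to_explore = [p for p in candidate_positions if worth_reading_deeper(p)]
print(f"reading {len(to_explore)} of {len(candidate_positions)} variations deeper")
```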

The position evaluator is trained on millions of example board positions. Silver’s team created these positions by carefully selecting random samples from simulated games between two copies of AlphaGo’s strong move-picker. It’s important to note that the AI move-picker was invaluable in constructing a sufficiently large data set to train the position evaluator. The move-picker allowed the team to simulate many possible forward continuations from any given board position to guess the approximate winning chances of each player. There probably aren’t enough human game records available out there to accomplish this kind of training.
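A sketch of that data-generation loop (mine; the game itself is faked here so the example runs, whereas AlphaGo's samples come from real simulated games between two copies of the strong move picker):

```python
import random

def self_play_game(move_picker, length=200):
    """Stand-in for one simulated game: returns the positions it passed
    through and who eventually won."""
    positions = [f"position-after-move-{i}" for i in range(length)]
    winner = random.choice(["black", "white"])
    return positions, winner

def build_value_training_set(move_picker, num_games=1000):
    """Take one position per game, chosen at random, labeled with the
    eventual winner; millions of these train the position evaluator."""
    dataset = []
    for _ in range(num_games):
        positions, winner = self_play_game(move_picker)
        dataset.append((random.choice(positions), winner))
    return dataset

data = build_value_training_set(move_picker=None)
print(len(data), data[0])
```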

Adding Reading

With all three versions of the move-picking brain, plus the position-evaluation brain, AlphaGo is now equipped to read sequences of future moves effectively. Reading is accomplished with the same Monte Carlo Tree Search (MCTS) algorithm used by most state-of-the-art Go AIs. But AlphaGo outsmarts other AIs by making more intelligent guesses about which variations to explore, and how deeply to explore them.

Monte Carlo Tree Search algorithm. Figure from Silver et al.

With infinite computer power, MCTS could theoretically compute optimal moves by exploring every possible continuation of a game. However, because the search space of future moves of a Go game is so large (greater than the number of particles in the known universe), no practical AI can hope to explore every possible variation. Getting MCTS to work in practice depends on how good the AI is at identifying promising variations, so that it can skip exploring bad ones.

Silver’s team equipped AlphaGo with a modular MCTS system. This framework allows the designers to “plug in” different functions for evaluating a variation. The full-power AlphaGo system uses all the “brains” in the following way:

1. From the current board position, choose a few possible next moves. For this they use the “basic” move-picking brain. (They tried using the “stronger” version in this step, but it actually made AlphaGo weaker, because it didn’t offer the MCTS system a broad enough choice of variations to explore. It focused too much on one “obviously best” move instead of reading several alternatives that might reveal themselves to be better later in the game.)

2. For each possible next move, evaluate the quality of that move in one of two ways: either use the position evaluator on the board state after that move, or run a deeper Monte Carlo simulation (called a “rollout”) that reads into the future after that move, using the fast-reading move picker to speed up the search. These two methods produce independent guesses about the quality of the move. AlphaGo uses a single parameter, a “mixing coefficient”, to weigh these guesses against each other. Full-power AlphaGo uses a 50/50 mix, judging the quality equally using the position evaluator and the simulated rollouts.
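The blend in step 2 boils down to one line of arithmetic. A sketch of my own (the two estimator functions are random stand-ins for the real position evaluator and the rollout simulations):

```python
import random

def value_network_estimate(position):
    """Stand-in: the position evaluator's guess at the winning chance."""
    return random.random()

def rollout_estimate(position):
    """Stand-in: winning chance measured by fast simulated games played
    to the end with the fast-reading move picker."""
    return random.random()

def evaluate_move(position, mixing=0.5):
    """Blend the two independent guesses; full-power AlphaGo uses 50/50."""
    return (1 - mixing) * value_network_estimate(position) + mixing * rollout_estimate(position)

print(round(evaluate_move("position after a candidate move"), 3))
```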

The paper includes a neat graph that shows how AlphaGo’s strength varies as they enable or disable the use of the plugged-in brains and simulations in the above steps. Using only one individual brain, AlphaGo is about as strong as the best computer Go AIs, but with all of them combined, it has possibly achieved the level of a professional human player.

How AlphaGo’s strength varies as the MCTS “plug-ins” are turned on or off. Figure from Silver et al.

The paper goes into detail about some fancy engineering the team did to speed up MCTS by distributing the calculations across a network of computers, but this doesn’t change the fundamental algorithm. There is also an appendix with mathematical justifications for each step. Some parts of the algorithm are exact (in the asymptotic limit of infinite computer power) while some are only approximations. In practical terms, AlphaGo will certainly get stronger with access to greater computer power, but the rate of improvement per additional unit of computing time will slow down as it gets stronger.

Strengths and Weaknesses

Note: I’m only a weak amateur Go player so take my analysis with some skepticism.

I believe AlphaGo will be tremendously strong at small-scale tactics. It knows how to find the best human moves for a vast number of positions and patterns, so it likely won’t make any obvious blunders in a given tactical situation.

However, AlphaGo might have a weakness in “whole-board” judgment. It looks at board positions using a pyramid of 5×5 filters, which suggests that it might have trouble integrating tactical “parts” into a strategic “whole”, in the same way that image-classifying neural networks sometimes get confused by images that contain parts of one thing and parts of another. An example of this might be a game where a joseki in one corner creates a wall or a ladder-breaker that significantly changes the value of the position in a different corner.

Like other MCTS-based AIs, AlphaGo could have trouble judging positions that need very deep reading to resolve a big risk, like long ko sequences where the life and death of a large group is at stake.

AlphaGo could also be confused by deliberate attempts to deviate from a normal-looking game, like a tengen opening or other rare joseki, since much of its training is based on human game records.

I look forward to seeing the result of the upcoming match between AlphaGo and top human player Lee Sedol 9p! My prediction: if Lee plays “straight”, as if playing against another human professional, he will probably lose, but he might win if he can force AlphaGo into an unfamiliar strategic situation.

On Virtual Goods

Here is what I don’t get about companies that sell virtual goods: My “virtual credit balance” is just a number in your database. You want me to pay real money for you to change that number. It’s just a tiny piece of data that has no effect outside the virtual world. Why would I want to do that?

“But wait,” you say, “what about a bank account balance? Isn’t that just a tiny piece of data too? And people work very hard to change it!”

True, but that number represents the full faith and credit of the bank, from which I can withdraw real cash to exchange for real goods and services.

OK, decorating my virtual fishtank with a different background image is a kind of real service, in that it affects the pixels on my monitor (and yours, when you come visit on account of some made-up fishtank crisis). But I personally do not attach a very high value to this service.

Game analysis: A Tale In The Desert

I’ve been meaning to try A Tale In The Desert for a while now, since it’s one of the few fairly successful MMORPGs with a truly deep crafting system (the other one being Star Wars: Galaxies, which won’t install for me anymore). I just downloaded and played the starting-island portion over about three hours. Unfortunately the UI is really annoying, so I won’t put more time into it.

Strengths

  • The crafting system is indeed deep. Basically the entire game is about making stuff. You gather resources (modeled nicely and placed throughout the world) and build tools (also nicely modeled) to transform them into ever more complex products. The starting-island tutorial is basic but I read some fascinating things about advanced crafting on the wiki. For instance, you can grow grapes and make wine, but first you can try cross-breeding grape strains to get seeds with better growing properties, and during the growing season you take a series of actions to “tend” the field, which affect the final quality of the wine produced. (“Product quality” seems to be a good way to add depth to crafting – SW:G uses this too – but obviously it complicates the server since it can’t treat all instances of an item type as fungible).
  • No combat, although there is PvP in the form of competing for achievements. ATITD proves that there is actually a good middle ground between purely social “chat rooms” and purely combative PvE/PvP games.

Weaknesses

  • It’s a grind.

The resource-gathering system is cumbersome. For example, early in the demo you have to gather slate to make stone blades. Slate is found dotted along shorelines, but there is no indication of exactly where a slate node is located, except for an icon that appears only when you are standing right on top of one. Gathering slate is a process of walking around randomly until the icon flashes up, then backtracking to find exactly where to gather it. There should be a visual display of some kind to show where the slate nodes are.

Another resource you need early on is grass. You can pick grass anywhere the ground is green, but each time you pick up a handful you have to wait a few seconds for the character to animate, then move aside and click again. The gathering animations should be interruptible. Or, if the goal is to limit the rate of gathering, then you should be able to “batch-gather” a large amount of the resource with a single mouse click, like WoW’s “make all” button. (edit: apparently there is a way to automate grass-gathering during times you are offline, but you first have to gather 2500 units by hand!)

Along the same lines, there is no automation for “processing” steps like cutting wood into boards. You sit there clicking, waiting for an animation, clicking, waiting for an animation, etc. Worse, the blade breaks every so often (for no apparent reason), sending you on another slate-finding quest. I can understand exposing players to these mechanics temporarily, but not on a long-term basis. You should reach a skill level that allows some automation or batch-processing, or at least invent a little point-and-click game (with decreasing difficulty as your skill increases) to control the frequency of success and failure.

  • ATITD uses its own widgets and controls for the GUI, but implements them poorly.

Here’s my take on implementing your own GUI widgets, as opposed to using the native widgets in Windows or OSX: it’s like asking for a bout with the heavyweight title-holder. The rewards are good if you succeed, but you’d better have the skills to back up your challenge.

If the native widgets aren’t ideal for your intended use, then go ahead and write your own custom widgets, but be careful and design them well (good examples are the custom GUIs in certain 3D apps like Lightwave and Houdini).

Just keep in mind it’s incredibly tough to correctly handle all of the corner cases and behaviors that users expect from modern widgets. Sure, your homemade button works when I click it, but what about keyboard focus? How does it move when I scale the containing window? Does it work well on a tablet PC with a touch screen? A PDA with a stylus? Does it handle Unicode text rendering? Right-to-left languages?

Widgets that get this stuff wrong feel unpolished. Like, for example, ATITD’s windows, which you can only resize by dragging a pixel-wide line somewhere in the interior, or the super-annoying pop-up notifications, which appear directly under the mouse cursor, preventing you from clicking nearby until you dismiss the pop-up.

It’s a shame the UI is so annoying; I was really looking forward to seeing some of the more advanced features. For instance, check out the ATITD wiki page about plant genomics!

Game analysis: Galactic Civilizations II

Galactic Civilizations II was developed by a small team at independent studio Stardock. It’s a sci-fi 4X game, essentially “Civilization in space,” or an evolution of the Master of Orion series.

Strengths

  • The beginning is always peaceful. This is a great feature! One of the most annoying things about Civ games is the tremendously unbalancing effect of early battles, like having your settler taken out by barbarians or conquering an enemy city before their defensive unit moves into place. In GalCiv, you have to get pretty far into the technology tree before you can build units with any offensive capability whatsoever, and the techs that enable you to invade other planets are even deeper. This ensures that all players have some time to expand freely without worrying about military threats.
  • The AI doesn’t cheat (at all but the highest difficulty levels). Stardock put enough effort into the AI to make it competitive without giving it unfair advantages in economy/tech or the ability to peek at your map.
  • Turn-based action is addictive. There is something about quantized jumps of progress that makes the action more addictive than a real-time game. This is the same principle behind the prevalence of numerical levels in RPGs (compared to continuous skill improvement). I think it has to do with the “rush” you get from large spikes in activity, which you wouldn’t get from continuous evolution.
  • The first couple of turns are the most exciting. It’s fun to uncover nearby planets and imagine the possibilities for developing them (and how to stop the AI from getting there first!). I think this is because I’m an “explorer”-type player.
  • The technology tree is very well-designed. Aside from the late-starting offensive techs, I also like the rock-paper-scissors aspect of the military techs: there are three kinds of weaponry and three kinds of defenses, each specialized against one of the weapon types. The techs are sufficiently resource-intensive that you can usually pursue only one line of development. This mimics real-life arms races, where new technologies are developed specifically to counter the opponent. Also, in an expansion pack, Stardock added unique tech trees for each race. This makes game-balancing a lot more difficult, but if you can pull it off (e.g. Starcraft), the variety contributes a lot to replayability.

Weaknesses

  • The map view is cluttered. When zoomed in, it’s difficult to see smaller ships and distinguish their nationality. The schematic zoomed-out view shows ships much more clearly, but is cluttered by the large number of trade-route freighters all over the place.
  • They still haven’t solved the micromanagement problem. I always bog down and lose interest in the later stages of 4X games. The ratio of management effort to turn duration grows beyond what I’m willing to tolerate. I make strategic mistakes like not searching the map for sneak attacks while clicking through turns waiting for a tech to complete. The game needs more powerful automation of planet development, and some ways to direct your attention to important spots on the map (like a “proximity alert” when an enemy unit pops up well behind the front lines).