How was the strongest AlphaGo Zero made?

Just now, DeepMind held an online question-and-answer session (AMA) in Reddit's Machine Learning section. David Silver, head of DeepMind's reinforcement learning group, and his colleagues enthusiastically answered questions from netizens. Since DeepMind had published the paper "Mastering the Game of Go without Human Knowledge" only the day before the AMA, the related questions and discussion were especially lively.

What is AMA?

AMA (Ask Me Anything) is a regular feature on Reddit; you can think of it as an online "truth or dare". An AMA is usually announced on Reddit a few days in advance to collect questions, and the respondents then answer them all at once.

The respondents in this AMA were:

David Silver: head of DeepMind's reinforcement learning group and lead researcher on AlphaGo. David Silver graduated from Cambridge University in 1997, winning the Addison-Wesley award. He began his Ph.D. in computer science at the University of Alberta in 2004 and joined DeepMind in 2013. He is the main technical lead of the AlphaGo project.

Julian Schrittwieser: software engineer at DeepMind.

Previously, many leading researchers and companies in machine learning have held AMAs in Reddit's Machine Learning section, including the Google Brain team, the OpenAI research team, Andrew Ng and Adam Coates, Juergen Schmidhuber, Geoffrey Hinton, Michael Jordan, Yann LeCun, Yoshua Bengio, and others.

We have selected some representative questions from today's DeepMind AMA, arranged as follows:

About papers and technical details

Q: Why is AlphaGo Zero's training so stable? Deep reinforcement learning is notoriously unstable and prone to forgetting, and so is self-play. Without a good initialization from imitation learning and historical checkpoints, the combination of the two should be a disaster... yet you start from scratch, and I didn't see this addressed in the paper. How did you do it?

David Silver: AlphaGo Zero works quite differently from typical model-free deep reinforcement learning algorithms (such as policy gradient or Q-learning). By using AlphaGo's search, we greatly improve the policy and the self-play outcomes, and we then train the next policy and value network with simple gradient-based updates towards those search results. This is much more stable than improving the policy by raw gradient steps alone.
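To make the loop Silver describes concrete — search-improved policy targets plus game outcomes driving a simple gradient update — here is a minimal PyTorch sketch of the combined training loss. The shapes and names are illustrative assumptions, not DeepMind's code; the loss form (z - v)^2 - pi^T log p (plus weight decay) is the one given in the paper.

```python
import torch
import torch.nn.functional as F

def alphago_zero_style_loss(policy_logits, value, pi_target, z_target):
    """Train towards the MCTS search probabilities (pi) and the self-play
    outcome (z): loss = (z - v)^2 - pi^T log p (weight decay left to the optimizer)."""
    value_loss = F.mse_loss(value.squeeze(-1), z_target)
    policy_loss = -(pi_target * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    return value_loss + policy_loss

# Toy batch: 4 positions on a 19x19 board (361 points + pass = 362 moves).
policy_logits = torch.randn(4, 362, requires_grad=True)    # policy head output
value = torch.tanh(torch.randn(4, 1, requires_grad=True))  # value head output
pi_target = torch.softmax(torch.randn(4, 362), dim=-1)     # from MCTS visit counts
z_target = torch.tensor([1.0, -1.0, 1.0, -1.0])            # self-play game results
loss = alphago_zero_style_loss(policy_logits, value, pi_target, z_target)
loss.backward()
print(float(loss))
```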

Q: I noticed that the Elo rating curve only goes up to day 40 of training. Is that because of the paper's deadline, or because AlphaGo was no longer improving significantly after that?

David Silver: AlphaGo has retired! That means we are moving our people and hardware resources on to other AI problems, where we still have a long way to go.

Q: Two questions about the paper:

Q1: Can you explain why the input to AlphaGo's residual blocks is 19×19×17? I don't understand why each player needs eight stacked binary feature planes; I would have thought one or two planes were enough. Although I don't understand the rules of Go 100%, eight planes seems like overkill.

Q2: Since the whole pipeline is built on self-play against the latest/best model, do you see any risk of overfitting to the particular trajectory SGD takes through parameter space?

David Silver: Speaking of which, a better representation might well beat stacking eight history planes by now! But there are three reasons we stack historical positions: 1) it is consistent with common inputs in other domains; 2) we need some history to represent ko; 3) some history lets the network infer where the opponent has recently played, which can act as a kind of attention mechanism (note: in Go this is the idea that "the opponent's key point is my key point"). The 17th plane marks whether it is black or white to play, which matters because of komi.
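To make the 19×19×17 input concrete, here is a small sketch of how such a tensor could be assembled: eight binary planes of the current player's stones over the last eight positions, eight for the opponent, and one constant plane for the colour to play. The exact plane ordering and function names here are assumptions for illustration only.

```python
import numpy as np

def encode_position(own_history, opp_history, black_to_play, size=19):
    """own_history / opp_history: up to 8 binary (size, size) arrays of stone
    positions, most recent first; shorter histories are zero-padded."""
    planes = []
    for history in (own_history, opp_history):
        padded = list(history)[:8] + [np.zeros((size, size))] * max(0, 8 - len(history))
        planes.extend(padded[:8])
    # 17th plane: constant 1 if black is to play, 0 otherwise (needed because of komi).
    planes.append(np.full((size, size), 1.0 if black_to_play else 0.0))
    return np.stack(planes, axis=-1)

x = encode_position([], [], black_to_play=True)
print(x.shape)  # (19, 19, 17)
```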

Q: With a strong enough engine we can rate players; for example, Elo ratings for Go players are derived by analysing their games. Could AlphaGo be used to analyse games and rate the strength of players in this way? This could provide a platform for studying human cognition.

Julian Schrittwieser: Thanks for sharing. That's a good idea!

I think this could definitely be done for Go. Perhaps one could use the difference between the best response and the move actually played, or the probability the policy network gives to each move, to evaluate every move of a game? I'll give it a try when I have time.
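A quick sketch of the rating idea Schrittwieser floats: score each human move either by the probability an engine's policy network assigns to it, or by how much value is lost relative to the engine's preferred move. The "networks" below are random toy stand-ins, used purely to make the two scores concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MOVES = 362  # 19x19 points plus pass

def toy_policy_net(position):
    """Stand-in for a policy network: a probability per candidate move."""
    return rng.dirichlet(np.ones(N_MOVES))

def toy_value_after(position, move):
    """Stand-in for 'estimated win probability after playing this move'."""
    return float(rng.uniform())

def rate_move(position, played_move, policy_net, value_after):
    probs = policy_net(position)
    best_move = int(np.argmax(probs))
    return {
        "policy_prob": float(probs[played_move]),  # how 'expected' the human move was
        "value_loss": value_after(position, best_move) - value_after(position, played_move),
    }

# Averaging these scores over many games would give a crude strength estimate.
print(rate_move(position=None, played_move=72,
                policy_net=toy_policy_net, value_after=toy_value_after))
```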

Q: Now that AlphaGo is retired, is there any plan to open-source it? That would have a great impact on Go and on machine learning research. Also, when will the Go tool that Hassabis announced in Wuzhen be released?

David Silver: The tool is being prepared right now. You will hear news about it soon.

Q: What was the biggest obstacle on the systems side during AlphaGo's development?

David Silver: One of the biggest challenges we faced was the match against Lee Sedol. At the time we realised that AlphaGo occasionally suffered from what we called "delusions": the program could misread the current board position and keep playing many moves in the wrong direction. We tried many remedies, including introducing more Go knowledge or human meta-knowledge. In the end we succeeded by solving the problem within AlphaGo itself, relying more heavily on the power of reinforcement learning, which gave a higher-quality solution.

Questions from Go fans

Q: In 1846, in a game between Honinbo Shusaku and Inoue Gennan Inseki, Shusaku's 127th move made Gennan Inseki's ears flush red in an instant; it became known as the "ear-reddening move". If it were AlphaGo, would it have played the same move?

Julian Schrittwieser: I asked Fan Hui, and his answer was this:

There was no komi in games back then, whereas AlphaGo plays under rules with a 7.5-point komi. Different komi conditions lead to differences between ancient and modern games. If AlphaGo had been the one to play that move, it would most likely have played elsewhere.

Q: From the published AlphaGo self-play games, white seems to win more often, so many people speculate that the 7.5-point komi is too high (note: the komi used in modern Go has kept changing; for example, thirty years ago a komi of 5.5 points for white was common).

If you analysed a larger dataset, could you draw interesting conclusions about the rules of Go themselves? (For example, which side has the advantage, black or white? Should the komi be higher or lower?)

Julian Schrittwieser: From my experience and our runs, a 7.5-point komi is balanced for both sides, with black's winning rate slightly higher (about 55%).

Q: Can you talk about the choice of the first move? Would AlphaGo open in ways we have never seen before? For example, would its first move be on tengen (the centre point) or mokuhazushi (the 3-5 point), or somewhere even stranger? If not, is that just "habit", or does AlphaGo hold a strong "belief" that the star point (4-4), komoku (3-4) and san-san (3-3) are simply better choices?

David Silver: During training we saw AlphaGo try all kinds of openings; at the very start of training it even played its first move on the 1-1 point!

Even in the later stages of training we could still see it open on super-high points such as the 4-6 point, but it would quickly return to normal openings such as komoku (3-4).

Q: As a huge AlphaGo fan, one question has always been on my mind: how many handicap stones can AlphaGo give a human player? We know from the paper that AlphaGo can play handicap games, and I also suspect AlphaGo couldn't give Ke Jie much of a handicap, but you must have been curious — did you run internal tests?

David Silver: We haven't played handicap games against human players. Of course, we did play handicap games when testing different versions against each other: across AlphaGo Master > AlphaGo Lee > AlphaGo Fan, each later version could give three handicap stones to the previous one and win. However, because AlphaGo is trained by self-play, it is especially good at beating its own weaker versions, so we don't think these internal results can be extrapolated to handicap games against human players.

Q: Have you thought about using generative adversarial networks (GANs)?

David Silver: In a sense, self-play is already an adversarial process. Each iteration is effectively trying to find the best counter-strategy to the previous version.

Rumor buster

Q: I heard that in the early stages of development AlphaGo's training was steered in specific directions to fix weaknesses exposed in its games. Now that its ability has surpassed human beings, does it need another mechanism to make further breakthroughs? What kind of work have you done here?

David Silver: Actually, we never steered AlphaGo towards fixing specific weaknesses. We always focused on the underlying machine learning algorithms and let AlphaGo learn to fix its own weaknesses.

Of course you can never be 100% perfect, so weaknesses will always remain. In practice you need the right methods to make sure training doesn't get stuck in local optima, but we never used any hand-crafted boosting.

About DeepMind

Q: A few questions: What is it like to work at DeepMind? Who are the members of the AlphaGo team? Can you tell us how work is divided within the AlphaGo team? What is the next big challenge?

David Silver: Working at DeepMind is great :) — this isn't a job advertisement, but I feel lucky to do what I love here every day. There are lots of cool projects to get involved in (more than we can keep up with!).

We are lucky to have many excellent people working on AlphaGo. You can get more detail by looking at the author list of the paper.

Q: Do you think undergraduates can succeed in the field of artificial intelligence?

Julian Schrittwieser: Of course. I only have a bachelor's degree in computer science, and this field changes quickly, so I think you can learn by reading the latest papers and running experiments. It also helps to intern at companies that work on machine learning projects.

On extending the algorithm, and other projects

Q: Hassabis said in a talk in Cambridge in March this year that one future goal of the AlphaGo project is to interpret its neural networks. My question: how far has AlphaGo got with interpreting the structure of its neural networks, or are they still a mysterious black box?

David Silver: Not just AlphaGo; interpretability is a very interesting topic across all our projects. Many teams at DeepMind are probing our systems in different ways. Recently one team published work that used cognitive-psychology techniques to try to decipher what happens inside matching networks, and it worked very well!

Q: I was very happy to see the results of AlphaGo Zero. One of our NIPS papers also describes a similar combination of deep learning and tree search, so I am particularly interested in behaviour over longer training runs.

During AlphaGo's training, how do the relative strengths of greedy play from Monte Carlo tree search, greedy play from the policy network, and greedy play from the value function change over the course of training? And can this kind of self-play training be applied to the recently released StarCraft II API?

David Silver: Thank you for pointing out your paper! I can hardly believe it was already out when we submitted ours on April 7th. It is indeed very similar to the policy component of our learning algorithm (although we also have a value component); see the methods and discussion in our paper. It is great to see similar approaches being used in other games.
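For readers unfamiliar with the three "greedy" players the questioner compares, here is a compact sketch of what each selection rule looks like. The networks and search outputs are random toy stand-ins (a probability per move, a value per resulting position, and MCTS visit counts), used only to make the three rules concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
N_MOVES = 362  # 19x19 points plus pass

policy_probs = rng.dirichlet(np.ones(N_MOVES))    # stand-in policy network output
values_after = rng.uniform(-1, 1, size=N_MOVES)   # stand-in value of each resulting position
mcts_visits = rng.integers(0, 800, size=N_MOVES)  # stand-in MCTS visit counts

greedy_policy_move = int(np.argmax(policy_probs))  # play the policy net's favourite move
greedy_value_move = int(np.argmax(values_after))   # one-ply lookahead on the value net
greedy_mcts_move = int(np.argmax(mcts_visits))     # play the most-visited move after search

print(greedy_policy_move, greedy_value_move, greedy_mcts_move)
```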

Q: Why didn't earlier versions of AlphaGo try pure self-play? Or did you try self-play before and it just didn't work well?

I'm curious about the development and progress in this area. Compared with today, where was the bottleneck in building a self-trained AlphaGo two years ago? What kind of iterative process did it take to arrive at the "machine-learning intuition" we see today?

David Silver: Building a system that can learn entirely by itself has always been an open problem in reinforcement learning. Our early attempts, like many similar algorithms you can find in the literature, were quite unstable. We made many attempts, and in the end the AlphaGo Zero algorithm was the most effective; it seems to have cracked this particular problem.

Q: When do you think robots will be able to effectively solve real-world problems involving height and size (for example, learning on their own how to grasp rubbish of arbitrary shape, size and position)? Is the policy-gradient method the key to achieving this?

Julian Schrittwieser: This is mainly due to improvements in both the value and policy networks, including better training and better architectures. See the comparison of different network architectures in Figure 4.

Q: It is said that the AlphaGo Master that defeated Ke Jie consumed only one tenth of the power of the AlphaGo Lee that defeated Lee Sedol. What optimisations made that possible?

Julian Schrittwieser: This is mainly due to improvements in both the value and policy networks, including better training and better architectures. See the comparison of different network architectures in Figure 4. (Are you sure this isn't the answer to the previous question?)

Q: Using or simulating an agent's long-term memory seems to be a big obstacle in reinforcement learning. Looking ahead, do you think we can solve this with a new way of thinking, or do we need to wait for the technology to deliver some kind of super network?

Julian Schrittwieser: Yes, long-term memory could be an important factor. In StarCraft, for example, you may have made thousands of moves, but you still need to remember the scouts you sent out.

I think there are already exciting components out there (Neural Turing Machines!), but I think we still have a lot of room for improvement here.

Q: David, I have seen the video of your lecture in which you mention that reinforcement learning could be used in financial trading. Are there any real-world examples? How would you handle black-swan events (things that have never happened before)?

David Silver: There are very few published papers on real-world reinforcement learning for finance, but some classic papers are well worth reading, for example Nevmyvaka and Kearns (2006) and Moody and Saffell (2001).

Q: You and Facebook started studying this problem at almost the same time. What gave you the edge in reaching master-level performance faster?

For fields where you cannot get as much training data as AlphaGo had, how should machine learning or reinforcement learning proceed?

David Silver: Facebook focused more on supervised learning, whereas we chose to focus more on reinforcement learning, because we believed AlphaGo would eventually have to go beyond human knowledge. Our recent results actually show that a purely supervised approach can perform surprisingly well, but reinforcement learning is definitely the key to going far beyond human level.