Paper Interpretation: Policy Distillation and Value Matching in Multi-agent Reinforcement Learning
Theme: policy distillation and value matching in multi-agent reinforcement learning.

Paper link: https://arxiv.org/abs/1903.06592

Research object: multi-agent cooperative system

Research motivation: Existing work on multi-agent reinforcement learning (MARL) mainly shares information through a centralized critic or through communication between agents in order to improve learning. However, these works usually do not study how sharing information between agents can address the curse of dimensionality.

The paper assumes that a multi-agent problem can be decomposed into a multi-task problem, so that each agent only needs to search a subset of the whole state space rather than the entire space. The advantage of this assumption is that it greatly reduces the system's search space and therefore speeds up learning.

Main work:

Based on the above assumption, a new multi-agent actor-critic algorithm is proposed that integrates the knowledge of homogeneous agents through policy distillation and value matching.

So-called homogeneous multi-agents are agents with the same state space and action space. For example, two UAVs are homogeneous agents, while a UAV and an unmanned ground vehicle are heterogeneous agents.

Problem background:

In a multi-agent system, agents act independently in the same environment but also affect each other's decisions. Therefore, if single-agent reinforcement learning is applied directly to the multi-agent setting, i.e. so-called independent learning, the learning process becomes unstable: traditional single-agent reinforcement learning assumes a stationary environment, whereas in a multi-agent system a change in any agent's policy affects the learning of all the others. Researchers therefore usually adopt the framework of centralized training with decentralized execution to realize MARL. But this raises a problem: as the number of agents increases, the state and action spaces of the whole system grow exponentially, and each agent's search space becomes extremely large.

For each agent, some states contribute nothing to finding the optimal policy, so there is no need to search the whole state space. An efficient search can shorten training time, but existing methods offer no general way to achieve this, which is one reason why current MARL research is limited to small numbers of agents.

Solution:

During training, agents share information such as experience data, and a new policy is learned from it through the idea of policy distillation.

Policy distillation was originally proposed to solve multi-task reinforcement learning (MTRL); paper link: https://arxiv.org/pdf/1511.06295. This paper therefore first treats the single-task MARL problem as a single-agent MTRL problem, so that policy distillation can be used to find the optimal policy.

How should this assumption be understood? For example, suppose the task of three agents A, B and C is to reach designated positions L1, L2 and L3 in the shortest time; this is the single-task MARL problem. Now view the problem as a single agent A that must be able to reach L1, L2 and L3 in the shortest time; this is the single-agent MTRL problem.

Algorithm details:

Because policy distillation is used, this paper adopts stochastic policies. For continuous action problems, the authors extend the soft actor-critic (SAC) algorithm from the single-agent to the multi-agent setting. At the same time, the actor's output is converted into the form of a probability distribution via a softmax function, so that the distillation loss below can be computed.

Policy distillation:

The loss function for the distilled policy is

$$\mathcal{L}_{\text{distill}}(\theta_0) = \sum_{i=1}^{N} \mathbb{E}_{s \sim \mathcal{D}_i}\Big[ D_{\mathrm{KL}}\big(\pi_i(\cdot \mid s)\,\|\,\pi_0(\cdot \mid s; \theta_0)\big) \Big] \qquad (1)$$

where $\pi_i$ is agent $i$'s policy, $\pi_0$ is the distilled policy with parameters $\theta_0$, and $\mathcal{D}_i$ is agent $i$'s replay buffer.

Note that what enters formula (1) is each policy's probability distribution over actions, not the sampled actions: only states are drawn from the replay buffer, not actions. This is because the actions stored in the replay buffer may not be optimal, and computing the KL divergence over the whole action distribution makes it easier to recover the optimal action than relying on actions taken directly from the buffer. After each distillation step, every agent's policy is overwritten with the distilled policy $\pi_0$ (a hard update). In this way, each agent shares information with the others.
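To make Eq. (1) concrete, here is a minimal PyTorch-style sketch of one distillation step, assuming each actor's output has been converted to a categorical (softmax) distribution as described above. The names `actor_nets`, `distilled_actor`, `replay_buffers` and `sample_states` are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_step(actor_nets, distilled_actor, replay_buffers, optimizer):
    """One policy-distillation update: minimize Eq. (1), the KL between each
    agent's action distribution and the distilled policy's, over states
    sampled from the agents' replay buffers (states only, no actions)."""
    optimizer.zero_grad()
    loss = 0.0
    for actor, buffer in zip(actor_nets, replay_buffers):
        states = buffer.sample_states(batch_size=256)          # states only
        with torch.no_grad():
            teacher_probs = F.softmax(actor(states), dim=-1)   # pi_i(.|s)
        student_log_probs = F.log_softmax(distilled_actor(states), dim=-1)
        # KL(pi_i || pi_0), averaged over the sampled states
        loss = loss + F.kl_div(student_log_probs, teacher_probs,
                               reduction="batchmean")
    loss.backward()
    optimizer.step()

    # Hard update: every agent now executes the distilled policy pi_0
    for actor in actor_nets:
        actor.load_state_dict(distilled_actor.state_dict())
```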

The advantage of policy distillation is that even if an agent never visits a certain state, as long as another agent has sampled it, the information about that state can be transferred indirectly to the first agent through distillation of the other agents' policies.

Value matching:

Updating the policy by distillation alone is not enough. If distillation is applied to policy learning while the value function is still learned in the traditional way, the two will clearly diverge. The value function therefore has to be adjusted as well.

It should be pointed out that for homogeneous multi-agents performing a cooperative task, the optimal policies are identical, because their state spaces and action spaces are the same and they share one reward function. Based on this premise, the authors propose value matching to narrow the search space.

In traditional centralized training, the input to the value function is the observations and actions of all agents, e.g. $(s_1, \dots, s_N)$ and $(a_1, \dots, a_N)$. The order of these inputs is normally fixed, so for a state value function, $V(s_1, s_2)$ and $V(s_2, s_1)$ are in general different quantities. However, under the setting of this paper, i.e. the agents are homogeneous and share one reward function, the order of the value function's inputs should not affect its output.

For example, in the two-agent case illustrated in the paper: suppose the value of state A $= (s_1, s_2)$ has been learned as $V(s_1, s_2)$, and state B $= (s_2, s_1)$ is the symmetric (permuted) form of state A. Under the assumption of a homogeneous cooperative task, the values of these two states should be equal, i.e. $V(s_1, s_2) = V(s_2, s_1)$. Extending this to $N$ agents, the value function should satisfy

$$V_0(s_1, \dots, s_N) = V_0\big(s_{\sigma(1)}, \dots, s_{\sigma(N)}\big), \quad \forall \sigma \in \Sigma_N \qquad (2)$$

where $\Sigma_N$ denotes the set of all permutations of the $N$ agents' inputs. In this way, once the value of a state has been learned, it can be used as supervision to train the value-matching critic network, which then also assigns the correct value to every symmetric (permuted) combination of that state. Policy distillation for the actors and value matching for the critics together constitute DVM.

To train this new value function (the distilled value function), the paper uses a mean squared error (MSE) loss:

$$\mathcal{L}_{\text{VM}}(\phi_0) = \mathbb{E}_{s \sim \mathcal{D}}\Big[ \sum_{\sigma \in \Sigma_N} \big( V_{\phi_0}(s_{\sigma(1)}, \dots, s_{\sigma(N)}) - V(s_1, \dots, s_N) \big)^2 \Big] \qquad (3)$$

where $\phi_0$ denotes the parameters of the matched (distilled) value function.

Similar to the distilled policy, the matched value function can represent knowledge about the state space without traversing all states. The paper also points out that many MARL methods use a Q-value function as the critic; the method above applies there as well, as long as the states and actions are treated consistently.
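As a rough illustration of Eqs. (2) and (3), the sketch below trains a permutation-invariant distilled critic by regressing it, on every permutation of the per-agent observations, onto the value already learned by the centralized critic. The helper names (`distilled_critic`, `learned_critic`, `sample_joint_states`) are assumptions for illustration only, not the paper's implementation.

```python
import itertools
import torch

def value_matching_step(distilled_critic, learned_critic, buffer, optimizer,
                        n_agents, batch_size=256):
    """One value-matching update (Eq. (3)): the distilled critic is trained so
    that every permutation of the per-agent inputs maps to the value that the
    learned critic assigns to the original ordering (Eq. (2))."""
    # joint_states: (batch, n_agents, obs_dim) -- per-agent observations
    joint_states = buffer.sample_joint_states(batch_size)
    with torch.no_grad():
        targets = learned_critic(joint_states.flatten(1))     # supervision V(s)

    optimizer.zero_grad()
    loss = 0.0
    # The number of permutations grows factorially, so this loop is only
    # practical for small numbers of agents.
    for perm in itertools.permutations(range(n_agents)):
        permuted = joint_states[:, list(perm), :].flatten(1)  # reorder agents
        pred = distilled_critic(permuted)
        loss = loss + torch.mean((pred - targets) ** 2)       # MSE per Eq. (3)
    loss.backward()
    optimizer.step()
```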

Multi-agent soft actor-critic (MA-SAC):

Actor-critic (AC) is familiar to everyone, but what is soft actor-critic?

SAC first appeared at ICML 2018; paper link: http://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b.pdf.

The optimization objective of SAC is not only to maximize the expected cumulative return but also to maximize policy entropy, which helps balance exploitation and exploration: the task can still be learned even while action selection remains highly random. SAC's actor outputs a stochastic policy, which is why this paper chooses the SAC framework. The authors then extend SAC to the multi-agent setting, propose MA-SAC, and add the DVM procedure described above.
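For reference, the entropy-regularized objective maximized by standard (single-agent) SAC, as given in the Haarnoja et al. paper linked above, is

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big], \qquad \mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\big[\log \pi(a \mid s_t)\big]$$

where the temperature $\alpha$ controls the trade-off between return and entropy.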

It is pointed out that the policy network is trained by policy distillation, so the actor network must output a probability distribution. For algorithms with deterministic policies, such as MADDPG, the policy network outputs continuous action values directly, so the KL loss cannot be computed.

For continuous action control, the policy function outputs a Gaussian distribution with a certain mean and variance, and continuous action values are obtained by sampling from this Gaussian.
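A minimal sketch of such a Gaussian policy head is shown below, using the common tanh-squashed reparameterized sampling from the SAC literature; the layer sizes and class name are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Actor that outputs a Gaussian over continuous actions: the network
    predicts a mean and log-std per action dimension, and actions are drawn
    by reparameterized sampling, then squashed to a bounded range."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        mean = self.mean(h)
        log_std = torch.clamp(self.log_std(h), -20, 2)  # keep std in a sane range
        return mean, log_std.exp()

    def sample(self, obs):
        mean, std = self(obs)
        dist = torch.distributions.Normal(mean, std)
        raw_action = dist.rsample()           # reparameterization trick
        action = torch.tanh(raw_action)       # squash to (-1, 1)
        # log-prob with the change-of-variables correction for tanh
        log_prob = dist.log_prob(raw_action) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True)
```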

The full algorithm flow is given as pseudocode in the original paper.

Experimental environment:

Summary:

The DVM method proposed in this paper is mainly used for information sharing and transfer among homogeneous, cooperative multi-agents. By learning a distilled policy and a distilled value function, the multi-agent single-task problem is treated as a single-agent multi-task problem. The authors argue that this effectively reduces each agent's state search space and thus speeds up learning: even if an agent never encounters certain states, it suffices that other agents do, because the distilled policy integrates the knowledge learned by all agents into one policy, realizing knowledge sharing among them. For the problem that the KL loss cannot be computed with continuous actions, the authors use the MA-SAC framework to realize MARL so that DVM can still be applied.