DDPG and TD3
Fast facts:

DDPG concurrently learns a Q-function and a policy.

DDPG is closely related to Q-learning: both aim to learn the optimal action-value function and then use it to pick the best action in a given state.

In a discrete action space this is straightforward: compute the Q-value of every action and take the maximum. In a continuous action space, however, it is impossible to evaluate every action, and solving the maximization exactly would require a costly optimization at every step.
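As a small illustration (the Q-values below are made-up placeholders), picking the greedy action in a discrete space is a single argmax, while a continuous space offers no finite set to enumerate:

```python
import numpy as np

# Made-up Q-values for a state with 4 discrete actions.
q_values = np.array([1.2, 0.7, 2.5, -0.3])

# Discrete case: the greedy action is just the argmax over the finite action set.
greedy_action = int(np.argmax(q_values))  # -> 2

# Continuous case: actions live in, e.g., [-1, 1]^d, so there is no finite list
# to enumerate; max_a Q(s, a) becomes an optimization problem in its own right.
```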

In a continuous action space, the Q-function is assumed to be differentiable with respect to the action. This allows us to set up a gradient-based learning rule for a deterministic policy $\mu(s)$ that outputs the action, and to approximate $\max_a Q(s,a)$ with $Q(s, \mu(s))$.

Bellman equation:

$$Q^*(s,a) = \mathbb{E}_{s' \sim P}\!\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \right]$$

where $s' \sim P(\cdot \mid s,a)$ indicates that the next state is sampled from the environment's transition distribution.

Suppose we approximate the Q-function with a neural network $Q_{\phi}(s,a)$, where $\phi$ denotes its parameters, and that we have collected a dataset $\mathcal{D}$ of transitions. How well $Q_{\phi}$ satisfies the Bellman equation is measured by the mean-squared Bellman error (MSBE).
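A standard way to write this loss, with a done signal $d$ marking terminal states and the expectation taken over transitions drawn from $\mathcal{D}$, is:

$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\!\left[ \Big( Q_{\phi}(s,a) - \big( r + \gamma (1-d) \max_{a'} Q_{\phi}(s',a') \big) \Big)^{2} \right]$$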

Q-learning algorithms with function approximators, such as DQN and its many variants, are largely based on minimizing this MSBE loss. Two common stabilization techniques are the experience replay buffer and the target network.
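A minimal sketch of these two ingredients, assuming PyTorch; the capacity, batch size, and Polyak coefficient are placeholder values, and the target-network update shown is the slowly-moving Polyak-averaging form used by DDPG:

```python
import random
from collections import deque

import torch

# --- Replay buffer: store transitions, sample random minibatches for updates ---
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, s2, d):
        self.storage.append((s, a, r, s2, d))

    def sample(self, batch_size=128):
        batch = random.sample(list(self.storage), batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.float32),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s2, dtype=torch.float32),
                torch.as_tensor(d, dtype=torch.float32))

# --- Target network: a slowly-moving copy of the main network (Polyak averaging) ---
def polyak_update(net, target_net, tau=0.005):
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1 - tau)
            p_targ.add_(tau * p)
```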

In addition, the max over actions in the MSBE target is hard to compute in a continuous action space. DDPG handles it with a target policy network: the action used in the target is computed by the target policy, which approximately maximizes the target Q-function. Putting it together, the DDPG target can be written as:

$$r + \gamma (1 - d)\, Q_{\phi_{\text{targ}}}\!\big(s', \mu_{\theta_{\text{targ}}}(s')\big)$$

where $\mu_{\theta_{\text{targ}}}$ is the target policy.
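As a sketch of how this target and the resulting Q-loss might be computed, here is a self-contained PyTorch fragment; the network architectures, sizes, and the fake minibatch are all illustrative placeholders standing in for real replay-buffer samples:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99   # illustrative sizes only

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

q, q_targ = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)   # Q-networks
pi, pi_targ = mlp(obs_dim, act_dim), mlp(obs_dim, act_dim)         # policy networks
q_targ.load_state_dict(q.state_dict())
pi_targ.load_state_dict(pi.state_dict())

# A fake minibatch standing in for samples from the replay buffer.
s, a = torch.randn(128, obs_dim), torch.randn(128, act_dim)
r, s2, d = torch.randn(128, 1), torch.randn(128, obs_dim), torch.zeros(128, 1)

with torch.no_grad():                                   # targets are held fixed
    a2 = pi_targ(s2)                                    # action from the target policy
    y = r + gamma * (1 - d) * q_targ(torch.cat([s2, a2], dim=-1))  # DDPG target

q_loss = ((q(torch.cat([s, a], dim=-1)) - y) ** 2).mean()          # MSBE on the batch
```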

Policy learning: we want to learn a deterministic policy $\mu_{\theta}(s)$ whose actions maximize $Q_{\phi}(s,a)$. Because the action space is continuous and the Q-function is assumed to be differentiable with respect to the action, we only need to perform gradient ascent (with respect to the policy parameters $\theta$) on $\mathbb{E}_{s \sim \mathcal{D}}\big[ Q_{\phi}(s, \mu_{\theta}(s)) \big]$.

The Q-function parameters $\phi$ are treated as constants here.
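A minimal sketch of this policy update in PyTorch, with illustrative network sizes, a made-up batch of states, and a hypothetical Adam learning rate:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                         # illustrative sizes only

q = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
pi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
pi_optimizer = torch.optim.Adam(pi.parameters(), lr=1e-3)

s = torch.randn(128, obs_dim)                   # a fake batch of states

# Gradient ascent on E[Q(s, pi(s))]: minimize the negative, with Q's parameters frozen.
for p in q.parameters():
    p.requires_grad_(False)                     # Q is treated as a constant here
pi_loss = -q(torch.cat([s, pi(s)], dim=-1)).mean()
pi_optimizer.zero_grad()
pi_loss.backward()
pi_optimizer.step()
for p in q.parameters():
    p.requires_grad_(True)                      # restore for the next Q update
```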

DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, an agent exploring purely on-policy would probably not, early on, try a wide enough range of actions to find useful learning signals. To make DDPG policies explore better, noise is added to their actions at training time.

The original DDPG paper recommends time-correlated OU noise, but more recent results suggest that uncorrelated, mean-zero Gaussian noise works just as well and is simpler to implement. To obtain higher-quality training data, the scale of the noise can be reduced over the course of training.

At test time, to see how well the policy exploits what it has learned, no noise is added to the actions.
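A small sketch of this action-selection scheme; the noise scale, action bounds, and the placeholder policy are illustrative stand-ins for the learned deterministic policy:

```python
import numpy as np

act_dim, act_limit = 2, 1.0                      # illustrative values
noise_scale = 0.1                                # std of the mean-zero Gaussian noise

def placeholder_policy(obs):
    # Stand-in for the learned deterministic policy mu(s); returns an in-range action.
    return np.tanh(np.ones(act_dim) * 0.5)

def get_exploration_action(obs):
    a = placeholder_policy(obs)
    a = a + noise_scale * np.random.randn(act_dim)  # add exploration noise at training time
    return np.clip(a, -act_limit, act_limit)        # keep the action within valid bounds

def get_test_action(obs):
    return placeholder_policy(obs)                  # no noise at test time
```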

A common shortcoming of DDPG is that it is often brittle with respect to hyperparameters and other kinds of tuning. A major cause of failure is that the learned Q-function starts to overestimate Q-values; the policy then exploits these errors and can collapse. TD3 introduces the following techniques to address this problem:

Target policy smoothing

Clipped double-Q learning

Both Q-functions use a single target value, computed with whichever of the two Q-functions gives the smaller estimate.

Both Q-functions are then learned by regressing to this shared target.
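A sketch of how these two tricks might be combined when computing the target in PyTorch; the network architectures, sizes, smoothing-noise parameters, and the fake minibatch are illustrative placeholders:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, act_limit, gamma = 8, 2, 1.0, 0.99    # illustrative values
target_noise, noise_clip = 0.2, 0.5                     # smoothing-noise parameters

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

q1, q2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)       # two Q-functions
q1_targ, q2_targ = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
pi_targ = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())           # target policy

# Fake minibatch standing in for replay-buffer samples.
s, a = torch.randn(128, obs_dim), torch.randn(128, act_dim)
r, s2, d = torch.randn(128, 1), torch.randn(128, obs_dim), torch.zeros(128, 1)

with torch.no_grad():
    # Target policy smoothing: add clipped noise to the target action.
    eps = torch.clamp(target_noise * torch.randn_like(a), -noise_clip, noise_clip)
    a2 = torch.clamp(pi_targ(s2) * act_limit + eps, -act_limit, act_limit)
    # Clipped double-Q: both Q-functions share one target, built from the smaller value.
    sa2 = torch.cat([s2, a2], dim=-1)
    y = r + gamma * (1 - d) * torch.min(q1_targ(sa2), q2_targ(sa2))

# Both Q-functions regress to the same target y.
sa = torch.cat([s, a], dim=-1)
q_loss = ((q1(sa) - y) ** 2).mean() + ((q2(sa) - y) ** 2).mean()
```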

Compared with DDPG, the policy update formula is unchanged, except that only the first Q-function is used:

$$\max_{\theta}\; \mathbb{E}_{s \sim \mathcal{D}}\!\big[ Q_{\phi_1}\big(s, \mu_{\theta}(s)\big) \big]$$