This article is about a practical question: how can the community applying reinforcement learning move from collections of scripts and individual cases towards an API for reinforcement learning, a tf-learn or scikit-learn for RL? Before discussing the TensorForce framework, we will talk about the observations and ideas that inspired this project. If you only want to learn about the API, you can skip this part. We want to emphasize that this article does not contain an introduction to deep reinforcement learning itself, nor does it propose any new models or discuss the latest and best algorithms, so for pure researchers it may not be that interesting.
Motivation
Suppose you are a researcher in computer systems, natural language processing, or another application field. You have some basic understanding of reinforcement learning, and you are interested in using deep reinforcement learning (deep RL) to control some aspects of your system.
There are already many introductory articles on deep reinforcement learning, DQN, vanilla policy gradients and A3C, for example Karpathy's article, and there are many implementations of specific algorithms, such as OpenAI baselines (/openai/baselines), rllab (/openai/rllab), and many more on GitHub.
However, we find that there is still a large gap between these research frameworks for reinforcement learning and practical applications. In practical applications, we may face the following problems:
Tight coupling between reinforcement learning logic and simulation handles: simulation environment APIs are very convenient; for example, they allow us to create an environment object and use it in a for-loop that also manages its internal update logic (for example, by collecting output features). This is reasonable if our goal is to evaluate a reinforcement learning idea, but it is much harder to separate the reinforcement learning code from the simulation environment. It also involves the question of control flow: does the reinforcement learning code call the environment when it is ready, or does the environment call the reinforcement learning agent when it needs a decision? For applied reinforcement learning libraries implemented in many fields, we often need the latter.
Fixed network architectures: most implementations include a hard-coded neural network architecture. This is usually not a big problem, since we can directly add or remove network layers as needed. Nevertheless, it would be much better if a reinforcement learning library offered a declarative interface without requiring modifications to the library code. In addition, in some cases modifying the architecture is (surprisingly) much harder, for example when internal states need to be managed (see below).
Incompatible state/action interfaces: much popular early open source code uses the OpenAI Gym environments, with a simple interface of a flat state input and a single discrete or continuous action output. DeepMind Lab, however, uses a dictionary format with, in general, multiple states and actions, and OpenAI Universe uses named key events. Ideally, we want a reinforcement learning agent to be able to handle any number of states and actions of potentially different types and shapes. For example, one of the TensorForce authors is using reinforcement learning in NLP and wants to handle multimodal input, where a state conceptually contains two inputs, an image and a corresponding caption.
Opaque execution settings and performance issues: when writing TensorFlow code, we naturally focus on the logic first. This can lead to a lot of repeated/unnecessary operations or the materialization of unnecessary intermediate values. Furthermore, the goals of distributed/asynchronous/parallel reinforcement learning are somewhat unclear, and distributed TensorFlow requires a certain amount of manual tuning for a given hardware setup. Again, it would be nice if an execution configuration ultimately only had to declare the available devices or machines, and everything else were handled internally, for example the difference between two machines with different IPs running asynchronous VPG.
To be clear, these problems are not meant to criticize code written by researchers, since that code was never intended to be used as an API or for other applications. Here we present the perspective of researchers who want to apply reinforcement learning to different fields.
The TensorForce API
TensorForce provides a declarative interface to robust implementations of deep reinforcement learning algorithms. It can be used as a library in applications that want to use deep reinforcement learning, allowing users to experiment with different configurations and network architectures without worrying about all the underlying design. We fully understand that current deep reinforcement learning methods tend to be brittle and require a lot of fine-tuning, but this does not mean that we cannot build a general-purpose software infrastructure for reinforcement learning solutions.
TensorForce is not a collection of original implementations, because it is not a research simulation, and applying original implementations to a real environment requires a lot of work. Any such framework inevitably contains some structural decisions that make non-standard things more annoying (leaky abstractions), which is why core reinforcement learning researchers may prefer to build their models from scratch. With TensorForce, our goal is to capture the best overall direction of current research, including emerging insights and standards.
Next, we will go into the basic aspects of the TensorForce API and discuss our design choices.
Create and configure agents
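As a minimal sketch of how an agent might be created and configured (the names Configuration, DQNAgent and layered_network_builder are assumptions about the library's API at the time of writing, not verbatim code from this post):

from tensorforce import Configuration
from tensorforce.agents import DQNAgent
from tensorforce.core.networks import layered_network_builder

# A simple flat state, a discrete action, and a two-layer network
config = Configuration(
    states=dict(shape=(10,), type='float'),
    actions=dict(continuous=False, num_actions=2),
    network=layered_network_builder([
        dict(type='dense', size=50),
        dict(type='dense', size=50)
    ])
)
agent = DQNAgent(config=config)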
The states and actions in this example are short forms for more general states/actions. For example, a multimodal input consisting of an image and a caption is defined as follows; multi-output actions can be defined similarly. Note that throughout the code, the short form for a single state/action must be used consistently when communicating with the agent.
states = dict(
    image=dict(shape=(64, 64, 3), type='float'),
    caption=dict(shape=(20,), type='int')
)
The configuration parameters depend on the underlying agent and model used. A complete list of parameters for each agent can be found in the sample configurations: /reinforceio/tensorforce/tree/master/examples/configs.
TensorForce currently provides the following reinforcement learning algorithms:
Random baseline agent (RandomAgent)
Vanilla policy gradients with generalized advantage estimation (VPGAgent)
Trust region policy optimization (TRPOAgent)
Deep Q-learning / double deep Q-learning (DQNAgent)
Normalized advantage functions (NAFAgent)
Deep Q-learning from demonstrations (DQFDAgent)
Asynchronous advantage actor-critic (A3C) (can be used implicitly via the distributed flag)
The last item means that there is no A3CAgent as such, because A3C actually describes an asynchronous update mechanism rather than a specific agent. Hence the asynchronous update mechanism using distributed TensorFlow is part of the generic Model base class from which all agents derive. As described in the paper "Asynchronous Methods for Deep Reinforcement Learning", A3C is obtained implicitly by setting the distributed flag for the VPGAgent. It should be pointed out that A3C is not the optimal distributed update strategy for every model (and for some models it is even meaningless); at the end of this article we will discuss how to implement other approaches (such as PAAC). The important point is to conceptually separate the questions of agent and update semantics from the question of execution semantics.
We also want to highlight the difference between model and agent. The Agent class defines the reinforcement learning interface as an API and manages things such as incoming observation data, preprocessing and exploration. Its two key methods are agent.act(state) and agent.observe(reward, terminal). agent.act(state) returns an action, while agent.observe(reward, terminal) updates the model according to the agent's mechanism, for example an off-policy memory agent or an on-policy batch agent. Note that these functions must be called in alternation for the agent's internal mechanics to work correctly. The Model class implements the core reinforcement learning algorithm and, via the get_action and update methods, provides the necessary interface that the agent calls internally at the relevant points. For example, DQNAgent is a MemoryAgent with a DQNModel and one extra line (for the target network update).
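As a minimal sketch of this act/observe alternation, the loop below drives an agent against a generic environment object; the environment.execute call and its return order are assumptions made for illustration, not part of the interface described above.

state = environment.reset()

while True:
    action = agent.act(state=state)  # ask the agent for an action
    state, reward, terminal = environment.execute(action)  # hypothetical environment call
    agent.observe(reward=reward, terminal=terminal)  # report the outcome of that action
    if terminal:
        break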
Neural network configuration
A key problem in reinforcement learning is designing an effective value function. Conceptually, we regard the model as a description of the update mechanism, which is distinct from the thing actually being updated: in the case of deep reinforcement learning, one (or more) neural networks. Therefore, there is no hard-coded network in the model; instead, it is instantiated differently according to the configuration.
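A minimal sketch of such a network configuration is shown below; the layered_network_builder utility name and module path are assumptions about the library version described here.

from tensorforce.core.networks import layered_network_builder  # assumed module location

# A network configuration as a list of dictionaries, one per layer
network = layered_network_builder([
    dict(type='dense', size=64),
    dict(type='dense', size=64)
])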
In the example above, we created a network configuration programmatically as a list of dictionaries describing each layer. Such a configuration can also be passed in as JSON and then turned into a network builder with a utility function.
The default activation is relu, but other activation functions are available (currently elu, selu, softmax, tanh and sigmoid). In addition, other properties of a layer can be modified.
We chose not to use existing layer implementations (such as those from tf.layers), so that we have explicit control over the internal operations and can ensure they integrate correctly with the rest of TensorForce. We also want to avoid dynamic wrapper libraries and therefore rely only on lower-level TensorFlow operations.
Our layer library currently provides only a few basic layer types, but it will be extended in the future.
So far we have shown TensorForce's ability to create layered networks, that is, networks with a single input state tensor and a sequence of layers that produce an output tensor. In some cases, however, it may be necessary or more appropriate to deviate from this stack-of-layers structure. Most obviously, this is required when multiple input states have to be processed, which cannot be accomplished naturally with a single sequence of processing layers.
We currently do not provide a higher-level configuration interface that automatically creates the corresponding network builder. Hence, in such cases, you have to define the network builder function programmatically and add it to the agent configuration as before.
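As an illustration, the sketch below defines a custom network builder for the image-plus-caption state defined earlier. The builder signature (a function from the dict of input state tensors to an output tensor) follows the description above, but the exact argument conventions, the vocabulary size and the layer sizes are assumptions made for this example.

import tensorflow as tf

def custom_network_builder(inputs):
    image = inputs['image']      # shape (batch, 64, 64, 3)
    caption = inputs['caption']  # shape (batch, 20), integer token indices

    # Image branch: a small convolution followed by flattening
    conv = tf.layers.conv2d(image, filters=32, kernel_size=3, activation=tf.nn.relu)
    image_features = tf.layers.flatten(conv)

    # Caption branch: embed the token indices and average over the sequence
    embeddings = tf.get_variable('embeddings', shape=(1000, 32))  # placeholder vocabulary size
    caption_features = tf.reduce_mean(tf.nn.embedding_lookup(embeddings, caption), axis=1)

    # Merge both branches into a single output tensor
    combined = tf.concat([image_features, caption_features], axis=1)
    return tf.layers.dense(combined, units=64, activation=tf.nn.relu)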
Internal states and episode handling
Unlike the classical supervised learning setting (where instances and neural network calls are considered independent), reinforcement learning takes place in episodes: the time steps within an episode depend on previous actions and also influence subsequent states. It is therefore conceivable that, besides its state input and action output at each time step, the neural network has internal states whose inputs/outputs correspond to each time step of an episode. The following figure shows how such a network operates over time:
The management of these internal states (that is, propagating them forward between time steps and resetting them when a new episode starts) can be handled entirely by TensorForce's Agent and Model classes. Note that this covers all relevant use cases (one episode within a batch, multiple episodes within a batch, and episodes without a terminal within a batch).
In this example architecture, the output of a dense layer is fed into an LSTM cell, which then produces the final output for that time step. When the LSTM is advanced one step, its internal state is updated, giving the internal state output here. For the next time step, the network receives the new state input together with this internal state, advances the LSTM one step further, outputs the actual output and the new internal LSTM state, and so on.
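A network with such an LSTM cell might be configured as in the sketch below, assuming an 'lstm' layer type is available in the layer library (an assumption, not confirmed by this post).

from tensorforce.core.networks import layered_network_builder  # assumed module location

# A dense layer followed by an LSTM cell that carries internal state across time steps
network = layered_network_builder([
    dict(type='dense', size=32),
    dict(type='lstm', size=32)
])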
For a user-defined implementation of a layer with internal state, the layer function must return not only the layer output, but also a list of internal state input placeholders, the corresponding internal state output tensors, and a list of internal state initialization tensors (all of the same length, in that order).
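A minimal sketch of such a layer function is shown below; it follows the return convention just described, but the exact contract (argument names, and how the agent feeds the placeholders) is an assumption about the library.

import tensorflow as tf

def lstm_layer(x, size=64):
    # x: input tensor of shape (batch, features)
    cell = tf.nn.rnn_cell.LSTMCell(num_units=size)
    state_in = tf.placeholder(dtype=tf.float32, shape=(None, 2 * size))  # packed (c, h) state
    c, h = tf.split(state_in, num_or_size_splits=2, axis=1)
    output, state = cell(x, tf.nn.rnn_cell.LSTMStateTuple(c, h))
    state_out = tf.concat([state.c, state.h], axis=1)
    state_init = tf.zeros(shape=(1, 2 * size))
    # layer output, internal state inputs, internal state outputs, internal state initializations
    return output, [state_in], [state_out], [state_init]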
Preprocessing states
We can define preprocessing steps to be applied to these states (or to multiple states, if a dict of lists is specified).
Each preprocessor in this stack has a type, and optionally a list of args and/or a dict of kwargs. For example, the sequence preprocessor takes the last four states (i.e. frames) and stacks them to simulate the Markov property. As an aside: this is obviously not necessary when using, say, an LSTM layer as described above, since LSTM layers can model and communicate temporal dependencies through their internal state. A sketch of such a preprocessing stack follows; the 'grayscale' and 'sequence' type names and the args format follow this description but are assumptions about the exact library spelling.
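# Preprocessing stack applied to image states before they reach the network
preprocessing = [
    dict(type='grayscale'),          # convert RGB frames to grayscale
    dict(type='sequence', args=[4])  # stack the last four frames to approximate the Markov property
]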
explore
Exploration can be defined in the Configuration object and is applied by the agent to the actions decided by its model (to handle multiple actions, again, a dict of specifications can be given). For example, to use Ornstein-Uhlenbeck exploration for a continuous action output, the following specification would be added to the configuration.
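A sketch of such a specification is given below; sigma, mu and theta are the standard Ornstein-Uhlenbeck parameters, but the type string and the exact keys expected by the library are assumptions.

# Ornstein-Uhlenbeck exploration noise for a continuous action output
exploration = dict(
    type='OrnsteinUhlenbeckProcess',
    sigma=0.2,   # noise scale
    mu=0.0,      # long-run mean the process reverts to
    theta=0.15   # speed of mean reversion
)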
Using agents with the Runner utility
Let us put an agent to use. The code below runs an agent on our test environment (/reinforceio/tensorforce/blob/master/tensorforce/environments/minimal_test.py), which we use for continuous integration: a minimal environment to verify the action, observation and update mechanics of a given agent/model. Note that all of our environment implementations (OpenAI Gym, OpenAI Universe, DeepMind Lab) use the same interface, so you can run the tests directly against another environment.
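A minimal sketch of such a run, assuming the names MinimalTest and Runner and their constructor/run arguments (these are assumptions, not verbatim code from this post):

from tensorforce.environments.minimal_test import MinimalTest  # assumed module/class names
from tensorforce.execution import Runner

environment = MinimalTest(continuous=False)
runner = Runner(agent=agent, environment=environment)

def episode_finished(runner):
    # Report the average reward every 100 episodes
    if runner.episode % 100 == 0:
        print('Episode {}: average reward over last 100 episodes: {}'.format(
            runner.episode, sum(runner.episode_rewards[-100:]) / 100.0))
    return True  # returning False would stop execution early

runner.run(episodes=1000, max_timesteps=200, episode_finished=episode_finished)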
The Runner utility facilitates running an agent in an environment. Given any agent and environment instance, it manages the number of episodes, the maximum length of each episode, termination conditions, and so on. Runner also accepts a cluster_spec argument, in which case it manages distributed execution (TensorFlow supervisors/sessions/etc.). Through the optional episode_finished argument, you can also report results periodically, and you get a handle for stopping execution before the maximum number of episodes is reached.
As mentioned in the introduction, how the Runner class is used in a given application scenario depends on the control flow. If it is reasonable for TensorForce to query state information (for example, through a queue or a network service) and return actions (to another queue or service), it can be used to implement the environment interface, so that you can use (or extend) the Runner utility.
A more common situation may be to use TensorForce as a library driven by an external application, in which case no environment handle can be provided. For researchers this may seem insignificant, but in computer systems and other fields it is a typical deployment problem, and it is the fundamental reason why most research scripts can only be used for simulation rather than for practical applications.
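To make this concrete, the sketch below shows an externally driven control flow in which the application owns the loop and calls into the agent only when a decision is needed; build_state() is a hypothetical application-specific helper, and the callback structure is an assumption made for illustration.

last_action = None

def on_decision_request(observation, reward, terminal):
    # Called by the surrounding application whenever it needs a decision
    global last_action
    if last_action is not None:
        # Report the outcome of the previous action first, keeping act/observe in alternation
        agent.observe(reward=reward, terminal=terminal)
    last_action = agent.act(state=build_state(observation))
    return last_action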
It is also worth mentioning that the declarative central configuration object allows us to directly configure interfaces for all components of the reinforcement learning model, in particular the network architecture.
Further thoughts
We hope you find TensorForce useful. So far, our focus has been on getting the architecture in place first. We believe this will allow us to implement different reinforcement learning concepts and new approaches more consistently, and to avoid the inconvenience of exploring deep reinforcement learning use cases in new fields.
In such a rapidly developing field, it is hard to decide which features to include in an actual library. There are many algorithms and concepts now, and on a subset of the Arcade Learning Environment (ALE) environments, new ideas seem to achieve better results every week. But there is also a problem: many ideas only work well in environments that are easy to parallelize or that have a particular episode structure; we do not yet have a precise notion of environment properties and their relation to different methods. However, we can see some clear trends:
Hybrids of policy gradient and Q-learning methods improve sample efficiency (PGQ, Q-Prop, etc.): this is a logical development. Although we do not know which hybrid strategy will prevail, we think one of them will become the next "standard method". We are very interested in the practicality of these methods in different application domains (rich/sparse data). One very subjective view of ours is that most applied researchers tend to use variants of vanilla policy gradients, because they are easy to understand and implement and, more importantly, more robust than newer algorithms, which may require a lot of fine-tuning to deal with potential numerical instabilities. A different view is that non-reinforcement-learning researchers may simply not know about the newer methods, or may be unwilling to invest the effort to implement them. This is a motivation for the development of TensorForce. Finally, it is worth considering that, in application domains, the update mechanism is often less important than modeling states, actions and rewards, and the network architecture.
Better use of GPUs and other devices (PAAC, GA3C, etc.): one problem with methods in this area is their implicit assumptions about the time it takes to collect data versus to perform updates. In non-simulated domains, these assumptions may not hold, and more research is needed to understand how environment properties affect device execution semantics. We are still using feed_dicts, but we are considering improving the performance of input processing.
Exploration modes (for example, count-based exploration, parameter-space noise, etc.)
Decomposition of large discrete action spaces, hierarchical models and sub-goals, for example the paper by Dulac-Arnold et al., "Deep Reinforcement Learning in Large Discrete Action Spaces". Complex discrete spaces (such as many state-dependent sub-options) are highly relevant in application domains, but are still difficult to use via an API. We expect many results in this area over the next few years.
Internal modules for state prediction and entirely model-based approaches: for example, the paper "The Predictron: End-To-End Learning and Planning".
Bayesian Deep Reinforcement Learning and Uncertainty Reasoning
Generally speaking, we are tracking these developments and will include existing techniques that we missed earlier (there should be many); once we believe a new idea has the potential to become a robust standard method, we will include it as well. In this sense, we are not explicitly competing with research frameworks, but rather aiming for a higher level of coverage.
Finally, we have an internal version in which we implement these ideas to see how the latest advanced methods can be turned into useful library functions. Once we are satisfied with a piece of work, we will consider open-sourcing it. So if the GitHub repository has not been updated for a while, it is probably because we are still working on internal development (or our PhD students are too busy), not because we have given up on the project. Please contact us if you are interested in developing interesting application cases.