Recently, Google open-sourced the Deep Planning Network (PlaNet), a reinforcement learning agent, on its official blog. PlaNet successfully solves a variety of image-based control tasks and is up to 5,000% more data-efficient than advanced model-free agents. Lei Feng AI Technology Review compiled the announcement as follows.
Reinforcement learning is the most widely used approach for enabling artificial agents to improve their decision-making over time. In a typical setup, the agent observes a stream of perceptual inputs (such as camera images) while selecting actions (such as motor commands), and occasionally receives a reward for achieving a specified goal. Model-free reinforcement learning predicts actions directly from perceptual observations; this is what enabled DeepMind's DQN to play Atari games and other agents to control robots. However, such black-box methods often require weeks of simulated interaction and a great deal of trial and error, which limits their usefulness in the real world.
In contrast, model-based reinforcement learning attempts to let agents learn how the world behaves in general. Rather than mapping observations directly to actions, this approach allows the agent to explicitly plan ahead and act more deliberately by "imagining" long-term outcomes. Model-based methods have achieved substantial success, most notably AlphaGo, which plans moves on a game board whose rules are known in advance. To extend the approach to unknown environments (such as controlling a robot from pixel inputs alone), the agent must learn the rules from experience. Only with such a dynamics model does planning, and in principle more efficient and natural multi-task learning, become possible. Creating models accurate enough for planning has been a long-standing goal of reinforcement learning.
To work toward this goal, we teamed up with DeepMind to introduce the Deep Planning Network (PlaNet) agent, which learns a world model from image inputs and uses it to plan effectively. PlaNet solves a variety of image-based control tasks while being on average up to 5,000% more data-efficient than advanced model-free agents. We have open-sourced the code for the community:
Open source URL: https://github.com/google-research/planet
How PlaNet works
In short, PlaNet learns a dynamics model from image inputs and uses it to efficiently absorb new experience. In contrast to past image-based planning methods, we rely on a compact sequence of hidden or latent states. It is called a latent dynamics model because it no longer predicts directly from one image to the next; instead, it predicts the latent state forward in time, and then generates the image and reward for each step from the corresponding latent state. By compressing images this way, the agent automatically learns more abstract representations, such as the positions and velocities of objects, and can predict future states without having to generate images along the way.
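To make this concrete, here is a minimal numpy sketch of the idea: an encoder compresses an image into a small latent vector, a transition function steps the latent forward given an action, and a reward head reads the reward off the latent, so a rollout never has to decode images. The linear "networks" and all dimensions are hypothetical stand-ins, not PlaNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not PlaNet's actual sizes).
IMG_DIM, LATENT_DIM, ACTION_DIM = 64, 8, 2

# Hypothetical linear "networks": encoder, latent transition, reward head.
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, IMG_DIM))                # image -> latent
W_trans = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))
w_rew = rng.normal(scale=0.1, size=LATENT_DIM)                           # latent -> reward

def encode(image):
    """Compress an image observation into a compact latent state."""
    return np.tanh(W_enc @ image)

def transition(latent, action):
    """Predict the next latent state from the current latent and an action."""
    return np.tanh(W_trans @ np.concatenate([latent, action]))

def rollout(image, actions):
    """Predict future rewards purely in latent space; no images are decoded."""
    latent = encode(image)
    rewards = []
    for action in actions:
        latent = transition(latent, action)
        rewards.append(float(w_rew @ latent))
    return rewards

obs = rng.normal(size=IMG_DIM)
plan = [rng.normal(size=ACTION_DIM) for _ in range(5)]
predicted_rewards = rollout(obs, plan)  # one predicted reward per planned action
```

The key point is structural: once the image is encoded, the entire rollout happens in the low-dimensional latent space.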
Latent dynamics model: the information in the input images is integrated into the hidden states (green) through an encoder network (gray trapezoids). The hidden states are then projected forward in time to predict future images (blue trapezoids) and rewards (blue rectangles).
To help readers grasp the latent dynamics model, we highlight two of its key ingredients:
A Recurrent State-Space Model: a latent dynamics model with both deterministic and stochastic components, which can predict the variety of possible futures needed for robust planning while remembering information over many steps. Our experiments show that both components are critical for high planning performance.
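One step of such a model can be sketched as follows: a deterministic recurrent path carries memory forward, while a stochastic path samples from a Gaussian belief over the latent state, covering multiple possible futures. The weights and dimensions below are illustrative assumptions, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

DET, STOCH, ACTION = 16, 4, 2  # toy sizes for the two state components

# Hypothetical weights for one recurrent state-space model (RSSM) step.
W_det = rng.normal(scale=0.1, size=(DET, DET + STOCH + ACTION))
W_mean = rng.normal(scale=0.1, size=(STOCH, DET))
W_std = rng.normal(scale=0.1, size=(STOCH, DET))

def rssm_step(h, z, action):
    """One latent step: deterministic memory h, stochastic sample z."""
    # Deterministic path: remembers information across many steps.
    h_next = np.tanh(W_det @ np.concatenate([h, z, action]))
    # Stochastic path: a Gaussian belief captures several possible futures.
    mean = W_mean @ h_next
    std = 0.1 * np.exp(W_std @ h_next)  # ensure positive standard deviation
    z_next = mean + std * rng.normal(size=STOCH)
    return h_next, z_next

h, z = np.zeros(DET), np.zeros(STOCH)
for _ in range(3):
    h, z = rssm_step(h, z, rng.normal(size=ACTION))
```

Running the same action sequence twice yields different stochastic samples `z` but the deterministic path keeps the rollout coherent; this combination is what the article identifies as critical.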
A Latent Overshooting Objective: by enforcing consistency between one-step and multi-step predictions in latent space, we generalize the standard training objective of latent dynamics models to train multi-step prediction. This yields a fast and effective objective that improves long-term prediction and is compatible with any latent sequence model.
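A simplified sketch of the overshooting idea: roll the one-step transition forward several steps from each time point and penalize disagreement with the latents inferred from observations, averaged over all start times and horizons. This uses a squared-error stand-in where the real objective uses KL divergences between distributions; everything here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 6, 4  # sequence length and latent size (toy values)

# Hypothetical posterior latents inferred from the observations at each step.
posterior = rng.normal(size=(T, D))

def prior_step(z):
    """Stand-in for the learned one-step latent transition."""
    return 0.9 * z

def overshooting_loss(posterior, max_horizon=3):
    """Average disagreement between multi-step prior predictions and
    posterior latents, over all start times and horizons up to max_horizon."""
    terms = []
    for t in range(len(posterior) - 1):
        z = posterior[t]
        for d in range(1, max_horizon + 1):
            if t + d >= len(posterior):
                break
            z = prior_step(z)  # predict d steps ahead from time t
            terms.append(np.mean((z - posterior[t + d]) ** 2))
    return float(np.mean(terms))

loss = overshooting_loss(posterior)
```

Because every term reuses the same one-step transition, training on this loss directly improves multi-step prediction without needing a separate multi-step model.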
Although predicting future images is what lets us train the model, encoding and decoding images (the trapezoids in the figure above) requires substantial computation, which would slow down planning. Planning in the compact latent space, however, remains fast, because we only need to predict future rewards, not images, to evaluate an action sequence. For example, the agent can imagine how the position of a ball and its distance to the target will change under certain actions, without ever visualizing the scene. This allows the agent to compare a large batch of nearly 10,000 imagined action sequences each time it selects an action. It then executes the first action of the best sequence found and replans at the next step.
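The search over imagined action sequences can be sketched with the cross-entropy method: sample a batch of candidate sequences, score each by its predicted return under the model, refit the sampling distribution to the best candidates, repeat, and finally execute only the first action. The scoring function below is a toy stand-in for an actual latent-model rollout, and the batch sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

HORIZON, ACTION_DIM = 12, 2  # toy planning horizon and action size

def predicted_return(action_seq):
    """Stand-in for rolling out the latent model and summing predicted rewards."""
    target = np.linspace(0.0, 1.0, ACTION_DIM)
    return -float(np.sum((action_seq - target) ** 2))

def cem_plan(n_candidates=1000, n_elite=100, iters=5):
    """Cross-entropy method: sample sequences, keep the best, refit, repeat."""
    mean = np.zeros((HORIZON, ACTION_DIM))
    std = np.ones((HORIZON, ACTION_DIM))
    for _ in range(iters):
        # Sample a batch of candidate action sequences.
        cands = mean + std * rng.normal(size=(n_candidates, HORIZON, ACTION_DIM))
        returns = np.array([predicted_return(c) for c in cands])
        # Refit the sampling distribution to the highest-return candidates.
        elite = cands[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0)
    return mean[0]  # execute only the first action, then replan next step

first_action = cem_plan()
```

Because only predicted rewards are needed to rank candidates, no images are decoded anywhere in this loop, which is what makes evaluating thousands of sequences per decision affordable.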
Planning in latent space: for planning, we encode past images (gray trapezoids) into the current hidden state (green). From there, we efficiently predict future rewards for many action sequences. Note how the costly image decoder (blue trapezoids) from the figure above is gone. The first action of the best sequence found is then executed (red box).
Compared to previous work on world models, PlaNet operates without a policy network: it chooses actions purely by planning, so it benefits immediately from improvements to the model. For technical details, see:
PlaNet vs. model-free methods
We evaluated the performance of PlaNet on a series of control tasks in which the agents receive only image observations and rewards. These tasks pose a variety of different challenges:
The cartpole swing-up task uses a fixed camera, so the cart can move out of sight. The agent must absorb and remember information across multiple frames.
The finger spin task requires predicting two separate objects and the interactions between them.
The cheetah running task involves ground contacts that are hard to predict precisely, calling for a model that can consider multiple possible futures.
The cup catch task provides a sparse reward signal only once the ball is caught, demanding a model that predicts far enough into the future to plan a precise sequence of actions.
The walker task starts with the simulated robot lying on the ground, and it must learn to stand up and walk.
PlaNet agents trained on a variety of image-based control tasks. These tasks pose different challenges: partial observability, contacts with the ground, sparse rewards for catching a ball, and control of a challenging bipedal robot.
We are the first to use learned models to plan on image-based tasks and outperform model-free methods. The table below compares PlaNet with the well-known A3C and D4PG agents, which together represent the state of the art in model-free reinforcement learning. The baseline numbers are taken from the DeepMind Control Suite. The results show that PlaNet clearly outperforms A3C on all tasks and approaches the final performance of D4PG while interacting with the environment on average 5,000% less.
One Agent for all tasks
In addition, we trained a single PlaNet agent on all six tasks. The agent is placed into different environments at random without being told the task, so it must infer the task from its image observations alone. Without any change to the hyperparameters, this multi-task agent achieves the same average performance as agents trained individually. While it learns more slowly on the cartpole swing-up task, it learns substantially faster and reaches higher final performance on the more challenging walker task.
Video predictions of the PlaNet agent trained on multiple tasks. The top rows show episodes collected by the trained agent; the bottom rows show its open-loop predictions. The agent observes the first 5 frames as context to infer the task and state, then accurately predicts 50 steps ahead given a sequence of actions.
Our results demonstrate the promise of learned dynamics models for building autonomous reinforcement learning agents. We suggest that future research focus on learning more accurate dynamics models for harder tasks, such as robotics tasks in 3D environments and the real world. One factor that may enable further breakthroughs here is TPU processing power. We are excited about the possibilities that model-based reinforcement learning opens up now that the code is open source, including multi-task learning, hierarchical planning, and active exploration guided by uncertainty estimates.