Fixes (GAIL, A2C and BC) + Add Pretraining (openai#206)
* Remove GAIL results

* Begin GAIL cleanup

* Fixes for GAIL + add tests

* Fix GAIL saving/loading

* Add notes

* Bug fixes + update changelog for GAIL

* Remove unused file

* Fix A2C with continuous actions
+ update logging

* Fix ACKTR runner

* Fixed behavior cloning + add test

* Add pretrain method

* Update coverage rc

* Rename dataset file

* Refactor expert dataset generation

* Clean up code for codacy

* Remove unused import

* Fix close method for DummyVecEnv

* Add support for pretraining with discrete actions

* Style fix

* Kill env processes to avoid memory error during tests

* Call close instead of killing processes

* Start rewriting expert dataset
In order to support image dataset

* Style fixes

* Add image dataset recorder

* Add documentation for pretraining + GAIL

* Remove pretrain with image buggy test for now

* Fix forkserver hanging when using atari env

* Reduce number of cpu for atari envs

* Change default start method (test CI)

* Test sequential mode

* Add sequential processing param

* Switch back to fork start method + add warning

* Add discrete actions support for GAIL + fix deprecations

* Update documentation

* Clean up + format code

* Replace num_iters by n_epochs + update doc

* Do not display NaN reward when not using Monitor (for SAC)

* Document Gumbel-max trick

* Simplify dataloader

* Expert model can be a callable
araffin authored Mar 28, 2019
1 parent 76e6d2f commit aa62f02
Showing 89 changed files with 1,311 additions and 1,379 deletions.
4 changes: 3 additions & 1 deletion .coveragerc
@@ -6,7 +6,9 @@ omit =
stable_baselines/ppo1/run_humanoid.py
stable_baselines/ppo1/run_robotics.py
# HER requires mpi and Mujoco
stable_baselines/her/experiment/
stable_baselines/her/experiment/*
tests/*
setup.py

[report]
exclude_lines =
1 change: 1 addition & 0 deletions .gitignore
@@ -10,6 +10,7 @@
.coverage.*
__pycache__/
_build/
*.npz

# Setuptools distribution and build folders.
/dist/
2 changes: 1 addition & 1 deletion README.md
@@ -148,7 +148,7 @@ All the following examples can be executed online using Google colab notebooks:
| ACKTR | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| DQN | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: |:heavy_check_mark:| :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| HER <sup>(3)</sup> | :x: <sup>(5)</sup> | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| PPO1 | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
6 changes: 3 additions & 3 deletions docs/guide/algos.rst
@@ -23,12 +23,12 @@ A2C ✔️ ✔️ ✔️ ✔
ACER ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
ACKTR ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
DDPG ✔️ ❌ ✔️ ❌ ❌
DQN ✔️ ❌ ❌ ✔️
DQN ✔️ ❌ ❌ ✔️
GAIL [#f2]_ ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO1 ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO1 ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO2 ✔️ ✔️ ✔️ ✔️ ✔️
SAC ✔️ ❌ ✔️ ❌ ❌
TRPO ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
TRPO ✔️ ✔️ ✔️ ✔️ [#f4]_
============ ======================== ========= =========== ============ ================

.. [#f1] Whether or not the algorithm has be refactored to fit the ``BaseRLModel`` class.
2 changes: 1 addition & 1 deletion docs/guide/install.rst
@@ -39,7 +39,7 @@ We recommend using `Anaconda <https://conda.io/docs/user-guide/install/windows.h

1. Install `MPI for Windows <https://www.microsoft.com/en-us/download/details.aspx?id=57467>`_ (you need to download and install ``msmpisetup.exe``)

2. Clone Stable-Baselines Github repo and replace this line ``gym[atari,classic_control]>=0.10.9`` by this one ``gym[classic_control]>=0.10.9``
2. Clone Stable-Baselines Github repo and replace the line ``gym[atari,classic_control]>=0.10.9`` in ``setup.py`` by this one: ``gym[classic_control]>=0.10.9``

3. Install Stable-Baselines from source, inside the folder, run ``pip install -e .``

152 changes: 152 additions & 0 deletions docs/guide/pretrain.rst
@@ -0,0 +1,152 @@
.. _pretrain:

.. automodule:: stable_baselines.gail


Pre-Training (Behavior Cloning)
===============================

With the ``.pretrain()`` method, you can pre-train RL policies using trajectories from an expert, and therefore accelerate training.

Behavior Cloning (BC) treats the problem of imitation learning, i.e., using expert demonstrations, as a supervised learning problem.
That is to say, given expert trajectories (observation-action pairs), the policy network is trained to reproduce the expert behavior:
for a given observation, the action taken by the policy must be the one taken by the expert.
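
For a ``Discrete`` action space, this amounts to a classification problem over the expert actions. As a rough illustration (a conceptual sketch, not the exact stable-baselines implementation), the objective minimized during pre-training can be written as a negative log-likelihood loss:

.. code-block:: python

  import tensorflow as tf

  def behavior_cloning_loss(policy_logits, expert_actions):
      """
      Conceptual sketch of the behavior cloning objective for a Discrete action space:
      negative log-likelihood of the expert actions under the current policy.

      :param policy_logits: (tf.Tensor) unnormalized action scores output by the policy
      :param expert_actions: (tf.Tensor) integer ids of the actions taken by the expert
      :return: (tf.Tensor) scalar loss to minimize
      """
      neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
          labels=expert_actions, logits=policy_logits)
      return tf.reduce_mean(neg_log_prob)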

Expert trajectories can be human demonstrations, trajectories from another controller (e.g. a PID controller)
or trajectories from a trained RL agent.


.. note::

Only ``Box`` and ``Discrete`` spaces are supported for now for pre-training a model.


.. note::

    Image datasets are treated a bit differently from other datasets to avoid memory issues.
The images from the expert demonstrations must be located in a folder, not in the expert numpy archive.



Generate Expert Trajectories
----------------------------

Here, we are going to train an RL model and then generate expert trajectories
using this agent.

Note that in practice, generating expert trajectories usually does not require training an RL agent.

The following example is only meant to demonstrate the ``pretrain()`` feature.

However, we recommend that users take a look at the code of the ``generate_expert_traj()`` function (located in the ``gail/dataset/`` folder)
to learn about the data structure of the expert dataset (see below for an overview) and how to record trajectories.


.. code-block:: python

  from stable_baselines import DQN
  from stable_baselines.gail import generate_expert_traj

  model = DQN('MlpPolicy', 'CartPole-v1', verbose=1)
  # Train a DQN agent for 1e5 timesteps and generate 10 trajectories
  # data will be saved in a numpy archive named `expert_cartpole.npz`
  generate_expert_traj(model, 'expert_cartpole', n_timesteps=int(1e5), n_episodes=10)

Here is an additional example where the expert controller is a callable
that is passed to the function instead of an RL model.
This callable could be, for instance, a PID controller or a function that asks a human player for input.


.. code-block:: python

  import gym

  from stable_baselines.gail import generate_expert_traj

  env = gym.make("CartPole-v1")

  # Here the expert is a random agent
  # but it can be any python function, e.g. a PID controller
  def dummy_expert(_obs):
      """
      Random agent. It samples actions randomly
      from the action space of the environment.

      :param _obs: (np.ndarray) Current observation
      :return: (np.ndarray) action taken by the expert
      """
      return env.action_space.sample()

  # Data will be saved in a numpy archive named `dummy_expert_cartpole.npz`
  # When using something different than an RL expert,
  # you must pass the environment object explicitly
  generate_expert_traj(dummy_expert, 'dummy_expert_cartpole', env, n_episodes=10)

Pre-Train a Model using Behavior Cloning
----------------------------------------

The following example uses the ``expert_cartpole.npz`` dataset generated with the previous script.

.. code-block:: python

  from stable_baselines import PPO2
  from stable_baselines.gail import ExpertDataset

  # Using only one expert trajectory
  # you can specify `traj_limitation=-1` for using the whole dataset
  dataset = ExpertDataset(expert_path='expert_cartpole.npz',
                          traj_limitation=1, batch_size=128)

  model = PPO2('MlpPolicy', 'CartPole-v1', verbose=1)
  # Pretrain the PPO2 model
  model.pretrain(dataset, n_epochs=1000)

  # As an option, you can train the RL agent
  # model.learn(int(1e5))

  # Test the pre-trained model
  env = model.get_env()
  obs = env.reset()

  reward_sum = 0.0
  for _ in range(1000):
      action, _ = model.predict(obs)
      obs, reward, done, _ = env.step(action)
      reward_sum += reward
      env.render()
      if done:
          print(reward_sum)
          reward_sum = 0.0
          obs = env.reset()

  env.close()

Data Structure of the Expert Dataset
------------------------------------

The expert dataset is a ``.npz`` archive. The data is saved in python dictionary format with keys: ``actions``, ``episode_returns``, ``rewards``, ``obs``,
``episode_starts``.

In the case of images, ``obs`` contains the relative paths to the image files.

obs, actions: shape (N * L, ) + S

where N = # episodes, L = episode length
and S is the environment observation/action space.

S = (1, ) for discrete space
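
As a quick sanity check, you can inspect such an archive with numpy (a minimal sketch, assuming the ``expert_cartpole.npz`` file generated in the examples above):

.. code-block:: python

  import numpy as np

  # Load the archive produced by `generate_expert_traj()`
  data = np.load('expert_cartpole.npz')

  print(data.files)                # list of stored keys
  print(data['obs'].shape)         # (N * L,) + observation space shape
  print(data['actions'].shape)     # (N * L,) + action space shape, i.e. (N * L, 1) for Discrete
  print(data['episode_returns'])   # one total return per recorded episode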


.. autoclass:: ExpertDataset
:members:
:inherited-members:


.. autoclass:: DataLoader
:members:
:inherited-members:


.. autofunction:: generate_expert_traj
7 changes: 4 additions & 3 deletions docs/guide/vec_envs.rst
@@ -29,9 +29,10 @@ SubprocVecEnv ✔️ ✔️ ✔️ ✔️ ✔️

.. warning::

When using ``SubprocVecEnv``, users must wrap the code in an ``if __name__ == "__main__":``
if using the ``forkserver`` or ``spawn`` start method (the default). For more information, see Python's
`multiprocessing guidelines <https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods>`_.
When using ``SubprocVecEnv``, users must wrap the code in an ``if __name__ == "__main__":`` if using the ``forkserver`` or ``spawn`` start method (default on Windows).
On Linux, the default start method is ``fork`` which is not thread safe and can create deadlocks.

For more information, see Python's `multiprocessing guidelines <https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods>`_.
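
For example, here is a minimal sketch of a properly guarded multiprocessed setup (the ``start_method`` keyword refers to the new optional argument mentioned in this release's changelog; omit it to keep the platform default):

.. code-block:: python

  import gym

  from stable_baselines.common.vec_env import SubprocVecEnv

  def make_env():
      return gym.make('CartPole-v1')

  if __name__ == '__main__':
      # The __main__ guard is required by the 'spawn' and 'forkserver'
      # start methods, which re-import the current module in the worker processes.
      env = SubprocVecEnv([make_env for _ in range(4)], start_method='spawn')
      obs = env.reset()
      env.close()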


DummyVecEnv
1 change: 1 addition & 0 deletions docs/index.rst
@@ -46,6 +46,7 @@ This toolset is a fork of OpenAI Baselines, with a major structural refactoring,
guide/custom_policy
guide/tensorboard
guide/rl_zoo
guide/pretrain


.. toctree::
23 changes: 19 additions & 4 deletions docs/misc/changelog.rst
@@ -5,14 +5,29 @@ Changelog

For download links, please look at `Github release page <https://github.com/hill-a/stable-baselines/releases>`_.

Pre-release 2.4.2a (WIP)
------------------------
Pre-Release 2.5.0a0 (WIP)
--------------------------

**Working GAIL and hotfix for A2C with continuous actions**

- fixed various bugs in GAIL
- added scripts to generate dataset for gail
- added tests for GAIL + data for Pendulum-v0
- removed unused ``utils`` file in DQN folder
- fixed a bug in A2C where actions were cast to ``int32`` even in the continuous case
- added additional logging to A2C when Monitor wrapper is used
- changed logging for PPO2: do not display NaN when reward info is not present
- changed the default value of the A2C lr schedule
- removed behavior cloning script
- added a ``pretrain`` method to the base class, so that behavior cloning can be used with all models
- fixed ``close()`` method for DummyVecEnv.
- added support for Dict spaces in DummyVecEnv and SubprocVecEnv. (@AdamGleave)
- made SubprocVecEnv thread-safe by default; support arbitrary multiprocessing start methods. (@AdamGleave)
- added support for arbitrary multiprocessing start methods and added a warning about SubprocVecEnv not being thread-safe by default. (@AdamGleave)
- added support for Discrete actions for GAIL
- fixed deprecation warning for tf: replaces ``tf.to_float()`` by ``tf.cast()``
- fixed bug in saving and loading ddpg model when using normalization of obs or returns (@tperol)
- changed DDPG default buffer size from 100 to 50000.
- fixed a bug in ``ddpg.py`` in ``combined_stats`` for eval. Computed mean on ``eval_episode_rewards`` and ``eval_qs``
- fixed a bug in ``ddpg.py`` in ``combined_stats`` for eval. Computed mean on ``eval_episode_rewards`` and ``eval_qs`` (@keshaviyengar)
- fixed a bug in ``setup.py`` that would error on non-GPU systems without TensorFlow installed

