Fixes (GAIL, A2C and BC) + Add Pretraining (openai#206)
* Remove GAIL results

* Begin GAIL cleanup

* Fixes for GAIL + add tests

* Fix GAIL saving/loading

* Add notes

* Bug fixes + update changelog for GAIL

* Remove unused file

* Fix A2C with continuous actions
+ update logging

* Fix ACKTR runner

* Fixed behavior cloning + add test

* Add pretrain method

* Update coverage rc

* Rename dataset file

* Refactor expert dataset generation

* Clean up code for codacy

* Remove unused import

* Fix close method for DummyVecEnv

* Add support for pretraining with discrete actions

* Style fix

* Kill env processes to avoid memory error during tests

* Call close instead of killing processes

* Start rewriting expert dataset
In order to support image dataset

* Style fixes

* Add image dataset recorder

* Add documentation for pretraining + GAIL

* Remove pretrain with image buggy test for now

* Fix forkserver hanging when using atari env

* Reduce number of cpu for atari envs

* Change default start method (test CI)

* Test sequential mode

* Add sequential processing param

* Switch back to fork start method + add warning

* Add discrete actions support for GAIL + fix deprecations

* Update documentation

* Clean up + format code

* Replace num_iters by n_epochs + update doc

* Do not display NaN reward when not using Monitor (for SAC)

* Document Gumbel-max trick

* Simplify dataloader

* Expert model can be a callable
araffin authored Mar 28, 2019
1 parent 76e6d2f commit aa62f02
Showing 89 changed files with 1,311 additions and 1,379 deletions.
4 changes: 3 additions & 1 deletion .coveragerc
@@ -6,7 +6,9 @@ omit =
stable_baselines/ppo1/run_humanoid.py
stable_baselines/ppo1/run_robotics.py
# HER requires mpi and Mujoco
stable_baselines/her/experiment/
stable_baselines/her/experiment/*
tests/*
setup.py

[report]
exclude_lines =
1 change: 1 addition & 0 deletions .gitignore
@@ -10,6 +10,7 @@
.coverage.*
__pycache__/
_build/
*.npz

# Setuptools distribution and build folders.
/dist/
2 changes: 1 addition & 1 deletion README.md
@@ -148,7 +148,7 @@ All the following examples can be executed online using Google colab notebooks:
| ACKTR | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| DQN | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :x: | :heavy_check_mark: |:heavy_check_mark:| :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| HER <sup>(3)</sup> | :x: <sup>(5)</sup> | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| PPO1 | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
6 changes: 3 additions & 3 deletions docs/guide/algos.rst
@@ -23,12 +23,12 @@ A2C ✔️ ✔️ ✔️ ✔
ACER ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
ACKTR ✔️ ✔️ ❌ [#f5]_ ✔️ ✔️
DDPG ✔️ ❌ ✔️ ❌ ❌
DQN ✔️ ❌ ❌ ✔️
DQN ✔️ ❌ ❌ ✔️
GAIL [#f2]_ ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO1 ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO1 ✔️ ✔️ ✔️ ✔️ [#f4]_
PPO2 ✔️ ✔️ ✔️ ✔️ ✔️
SAC ✔️ ❌ ✔️ ❌ ❌
TRPO ✔️ ✔️ ✔️ ✔️ ✔️ [#f4]_
TRPO ✔️ ✔️ ✔️ ✔️ [#f4]_
============ ======================== ========= =========== ============ ================

.. [#f1] Whether or not the algorithm has be refactored to fit the ``BaseRLModel`` class.
2 changes: 1 addition & 1 deletion docs/guide/install.rst
@@ -39,7 +39,7 @@ We recommend using `Anaconda <https://conda.io/docs/user-guide/install/windows.h

1. Install `MPI for Windows <https://www.microsoft.com/en-us/download/details.aspx?id=57467>`_ (you need to download and install ``msmpisetup.exe``)

2. Clone Stable-Baselines Github repo and replace this line ``gym[atari,classic_control]>=0.10.9`` by this one ``gym[classic_control]>=0.10.9``
2. Clone Stable-Baselines Github repo and replace the line ``gym[atari,classic_control]>=0.10.9`` in ``setup.py`` by this one: ``gym[classic_control]>=0.10.9``

3. Install Stable-Baselines from source, inside the folder, run ``pip install -e .``

152 changes: 152 additions & 0 deletions docs/guide/pretrain.rst
@@ -0,0 +1,152 @@
.. _pretrain:

.. automodule:: stable_baselines.gail


Pre-Training (Behavior Cloning)
===============================

With the ``.pretrain()`` method, you can pre-train RL policies using trajectories from an expert, and therefore accelerate training.

Behavior Cloning (BC) treats the problem of imitation learning, i.e., using expert demonstrations, as a supervised learning problem.
That is to say, given expert trajectories (observation-action pairs), the policy network is trained to reproduce the expert behavior:
for a given observation, the action taken by the policy must be the one taken by the expert.
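
For a ``Discrete`` action space, this amounts to a classification problem over the expert actions. As a rough illustration (a conceptual sketch, not the exact stable-baselines implementation), the objective minimized during pre-training can be written as a negative log-likelihood loss:

.. code-block:: python

  import tensorflow as tf

  def behavior_cloning_loss(policy_logits, expert_actions):
      """
      Conceptual sketch of the behavior cloning objective for a Discrete action space:
      negative log-likelihood of the expert actions under the current policy.

      :param policy_logits: (tf.Tensor) unnormalized action scores output by the policy
      :param expert_actions: (tf.Tensor) integer ids of the actions taken by the expert
      :return: (tf.Tensor) scalar loss to minimize
      """
      neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
          labels=expert_actions, logits=policy_logits)
      return tf.reduce_mean(neg_log_prob)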

Expert trajectories can be human demonstrations, trajectories from another controller (e.g. a PID controller)
or trajectories from a trained RL agent.


.. note::

Only ``Box`` and ``Discrete`` spaces are supported for now for pre-training a model.


.. note::

    Image datasets are treated a bit differently from other datasets to avoid memory issues.
The images from the expert demonstrations must be located in a folder, not in the expert numpy archive.



Generate Expert Trajectories
----------------------------

Here, we are going to train an RL model and then generate expert trajectories
using this agent.

Note that in practice, generating expert trajectories usually does not require training an RL agent.

The following example is only meant to demonstrate the ``pretrain()`` feature.

However, we recommend that users take a look at the code of the ``generate_expert_traj()`` function (located in the ``gail/dataset/`` folder)
to learn about the data structure of the expert dataset (see below for an overview) and how to record trajectories.


.. code-block:: python

  from stable_baselines import DQN
  from stable_baselines.gail import generate_expert_traj

  model = DQN('MlpPolicy', 'CartPole-v1', verbose=1)
  # Train a DQN agent for 1e5 timesteps and generate 10 trajectories
  # data will be saved in a numpy archive named `expert_cartpole.npz`
  generate_expert_traj(model, 'expert_cartpole', n_timesteps=int(1e5), n_episodes=10)

Here is an additional example where the expert controller is a callable
that is passed to the function instead of an RL model.
This callable could be, for instance, a PID controller or a function that asks a human player for input.


.. code-block:: python

  import gym

  from stable_baselines.gail import generate_expert_traj

  env = gym.make("CartPole-v1")

  # Here the expert is a random agent
  # but it can be any python function, e.g. a PID controller
  def dummy_expert(_obs):
      """
      Random agent. It samples actions randomly
      from the action space of the environment.

      :param _obs: (np.ndarray) Current observation
      :return: (np.ndarray) action taken by the expert
      """
      return env.action_space.sample()

  # Data will be saved in a numpy archive named `dummy_expert_cartpole.npz`
  # When using something different than an RL expert,
  # you must pass the environment object explicitly
  generate_expert_traj(dummy_expert, 'dummy_expert_cartpole', env, n_episodes=10)

Pre-Train a Model using Behavior Cloning
----------------------------------------

The following example uses the ``expert_cartpole.npz`` dataset generated with the previous script.

.. code-block:: python

  from stable_baselines import PPO2
  from stable_baselines.gail import ExpertDataset

  # Using only one expert trajectory
  # you can specify `traj_limitation=-1` for using the whole dataset
  dataset = ExpertDataset(expert_path='expert_cartpole.npz',
                          traj_limitation=1, batch_size=128)

  model = PPO2('MlpPolicy', 'CartPole-v1', verbose=1)
  # Pretrain the PPO2 model
  model.pretrain(dataset, n_epochs=1000)

  # As an option, you can train the RL agent
  # model.learn(int(1e5))

  # Test the pre-trained model
  env = model.get_env()
  obs = env.reset()

  reward_sum = 0.0
  for _ in range(1000):
      action, _ = model.predict(obs)
      obs, reward, done, _ = env.step(action)
      reward_sum += reward
      env.render()
      if done:
          print(reward_sum)
          reward_sum = 0.0
          obs = env.reset()

  env.close()

Data Structure of the Expert Dataset
------------------------------------

The expert dataset is a ``.npz`` archive. The data is saved in python dictionary format with keys: ``actions``, ``episode_returns``, ``rewards``, ``obs``,
``episode_starts``.

In the case of images, ``obs`` contains the relative paths to the image files.

obs, actions: shape (N * L, ) + S

where N = # episodes, L = episode length
and S is the environment observation/action space.

S = (1, ) for discrete space
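
As a quick sanity check, you can inspect such an archive with numpy (a minimal sketch, assuming the ``expert_cartpole.npz`` file generated in the examples above):

.. code-block:: python

  import numpy as np

  # Load the archive produced by `generate_expert_traj()`
  data = np.load('expert_cartpole.npz')

  print(data.files)                # list of stored keys
  print(data['obs'].shape)         # (N * L,) + observation space shape
  print(data['actions'].shape)     # (N * L,) + action space shape, i.e. (N * L, 1) for Discrete
  print(data['episode_returns'])   # one total return per recorded episode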


.. autoclass:: ExpertDataset
:members:
:inherited-members:


.. autoclass:: DataLoader
:members:
:inherited-members:


.. autofunction:: generate_expert_traj
7 changes: 4 additions & 3 deletions docs/guide/vec_envs.rst
@@ -29,9 +29,10 @@ SubprocVecEnv ✔️ ✔️ ✔️ ✔️ ✔️

.. warning::

When using ``SubprocVecEnv``, users must wrap the code in an ``if __name__ == "__main__":``
if using the ``forkserver`` or ``spawn`` start method (the default). For more information, see Python's
`multiprocessing guidelines <https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods>`_.
When using ``SubprocVecEnv``, users must wrap the code in an ``if __name__ == "__main__":`` if using the ``forkserver`` or ``spawn`` start method (default on Windows).
On Linux, the default start method is ``fork`` which is not thread safe and can create deadlocks.

For more information, see Python's `multiprocessing guidelines <https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods>`_.
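
For example, here is a minimal sketch of a properly guarded multiprocessed setup (the ``start_method`` keyword refers to the new optional argument mentioned in this release's changelog; omit it to keep the platform default):

.. code-block:: python

  import gym

  from stable_baselines.common.vec_env import SubprocVecEnv

  def make_env():
      return gym.make('CartPole-v1')

  if __name__ == '__main__':
      # The __main__ guard is required by the 'spawn' and 'forkserver'
      # start methods, which re-import the current module in the worker processes.
      env = SubprocVecEnv([make_env for _ in range(4)], start_method='spawn')
      obs = env.reset()
      env.close()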


DummyVecEnv
1 change: 1 addition & 0 deletions docs/index.rst
@@ -46,6 +46,7 @@ This toolset is a fork of OpenAI Baselines, with a major structural refactoring,
guide/custom_policy
guide/tensorboard
guide/rl_zoo
guide/pretrain


.. toctree::
23 changes: 19 additions & 4 deletions docs/misc/changelog.rst
@@ -5,14 +5,29 @@ Changelog

For download links, please look at `Github release page <https://github.com/hill-a/stable-baselines/releases>`_.

Pre-release 2.4.2a (WIP)
------------------------
Pre-Release 2.5.0a0 (WIP)
--------------------------

**Working GAIL and hotfix for A2C with continuous actions**

- fixed various bugs in GAIL
- added scripts to generate dataset for gail
- added tests for GAIL + data for Pendulum-v0
- removed unused ``utils`` file in DQN folder
- fixed a bug in A2C where actions were cast to ``int32`` even in the continuous case
- added additional logging to A2C when Monitor wrapper is used
- changed logging for PPO2: do not display NaN when reward info is not present
- changed the default value of the A2C lr schedule
- removed behavior cloning script
- added a ``pretrain`` method to the base class, so that behavior cloning can be used with all models
- fixed ``close()`` method for DummyVecEnv.
- added support for Dict spaces in DummyVecEnv and SubprocVecEnv. (@AdamGleave)
- made SubprocVecEnv thread-safe by default; support arbitrary multiprocessing start methods. (@AdamGleave)
- added support for arbitrary multiprocessing start methods and added a warning about SubprocVecEnv not being thread-safe by default. (@AdamGleave)
- added support for Discrete actions for GAIL
- fixed deprecation warning for tf: replaces ``tf.to_float()`` by ``tf.cast()``
- fixed bug in saving and loading ddpg model when using normalization of obs or returns (@tperol)
- changed DDPG default buffer size from 100 to 50000.
- fixed a bug in ``ddpg.py`` in ``combined_stats`` for eval. Computed mean on ``eval_episode_rewards`` and ``eval_qs``
- fixed a bug in ``ddpg.py`` in ``combined_stats`` for eval. Computed mean on ``eval_episode_rewards`` and ``eval_qs`` (@keshaviyengar)
- fixed a bug in ``setup.py`` that would error on non-GPU systems without TensorFlow installed

