Merge pull request #69 from araffin/feat/callbacks
Add callback support
araffin authored Mar 22, 2020
2 parents 7cb227c + 89c54a2 commit 79a6fd1
Showing 16 changed files with 248 additions and 187 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -8,3 +8,4 @@ cluster_sbatch.sh
cluster_sbatch_mpi.sh
trained_agents/
.git/
.pytype/
2 changes: 1 addition & 1 deletion Makefile
@@ -4,7 +4,7 @@ pytest:

# Type check
type:
pytype
pytype .

docker: docker-cpu docker-gpu

72 changes: 44 additions & 28 deletions README.md
@@ -34,6 +34,11 @@ If you have trained an agent yourself, you need to do:
python enjoy.py --algo algo_name --env env_id -f logs/ --exp-id 0
```

To load the best model (when training used an evaluation environment):
```
python enjoy.py --algo algo_name --env env_id -f logs/ --exp-id 1 --load-best
```
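
When `--load-best` is passed, `enjoy.py` asks the `find_saved_model` helper for the best model instead of the last one. A minimal sketch of that kind of lookup, assuming the evaluation callback saved the best agent as `best_model.zip` next to the final model (the actual helper lives in `utils/` and may differ):
```
import os


def find_saved_model(algo, log_path, env_id, load_best=False):
    """Return the path to the saved agent, preferring the best model if requested."""
    # Best model written by the evaluation callback during training (assumed filename)
    best_path = os.path.join(log_path, 'best_model.zip')
    # Model saved at the end of training
    last_path = os.path.join(log_path, '{}.pkl'.format(env_id))

    if load_best and os.path.isfile(best_path):
        return best_path
    if os.path.isfile(last_path):
        return last_path
    raise ValueError("No model found for {} on {} in {}".format(algo, env_id, log_path))
```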

## Train an Agent

The hyperparameters for each environment are defined in `hyperparameters/algo_name.yml`.
@@ -48,6 +53,17 @@ For example (with tensorboard support):
python train.py --algo ppo2 --env CartPole-v1 --tensorboard-log /tmp/stable-baselines/
```

Evaluate the agent every 10000 steps, using 10 episodes per evaluation:
```
python train.py --algo sac --env HalfCheetahBulletEnv-v0 --eval-freq 10000 --eval-episodes 10
```

Save a checkpoint of the agent every 100000 steps:
```
python train.py --algo td3 --env HalfCheetahBulletEnv-v0 --save-freq 100000
```
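
Under the hood, these options rely on the callback support added in this PR. A rough sketch of the equivalent setup using the stable-baselines callbacks directly (the save folder is illustrative; `train.py` wires this up for you and may use different paths and defaults):
```
import gym
import pybullet_envs  # noqa: F401  (registers the Bullet environments)

from stable_baselines import SAC
from stable_baselines.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

# Illustrative save folder, mimicking the zoo's logs/<algo>/<env_id>_<exp_id> layout
save_path = 'logs/sac/HalfCheetahBulletEnv-v0_1/'

# Separate environment used only for periodic evaluation
eval_env = gym.make('HalfCheetahBulletEnv-v0')

# Evaluate every 10000 steps on 10 episodes and keep the best model seen so far
eval_callback = EvalCallback(eval_env, best_model_save_path=save_path, log_path=save_path,
                             eval_freq=10000, n_eval_episodes=10)
# Save a checkpoint of the agent every 100000 steps
checkpoint_callback = CheckpointCallback(save_freq=100000, save_path=save_path)

model = SAC('MlpPolicy', 'HalfCheetahBulletEnv-v0', verbose=1)
model.learn(int(1e6), callback=CallbackList([eval_callback, checkpoint_callback]))
```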


Continue training (here, load pretrained agent for Breakout and continue training for 5000 steps):
```
python train.py --algo a2c --env BreakoutNoFrameskip-v4 -i trained_agents/a2c/BreakoutNoFrameskip-v4.pkl -n 5000
@@ -74,6 +90,34 @@ python train.py --algo ppo2 --env MountainCar-v0 -n 50000 -optimize --n-trials 1
```


## Env Wrappers

You can specify one or more wrappers to apply to the environment in the hyperparameter config:

For a single wrapper:
```
env_wrapper: gym_minigrid.wrappers.FlatObsWrapper
```

For multiple wrappers, specify a list:

```
env_wrapper:
- utils.wrappers.DoneOnSuccessWrapper:
reward_offset: 1.0
- utils.wrappers.TimeFeatureWrapper
```

Note that you can also pass parameters to the wrappers, as in the example above.
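
For reference, a wrapper configured this way is just a standard Gym wrapper whose constructor accepts the extra parameters. A minimal sketch of what something like `DoneOnSuccessWrapper` could look like (the actual implementation lives in `utils/wrappers.py` and may differ):
```
import gym


class DoneOnSuccessWrapper(gym.Wrapper):
    """Terminate the episode on success and add a reward bonus (sketch)."""

    def __init__(self, env, reward_offset=1.0):
        super(DoneOnSuccessWrapper, self).__init__(env)
        self.reward_offset = reward_offset

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Goal-based envs (e.g. HER tasks) expose 'is_success' in the info dict
        done = done or info.get('is_success', False)
        reward += self.reward_offset * float(info.get('is_success', False))
        return obs, reward, done, info
```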

## Overwrite hyperparameters

You can easily overwrite hyperparameters from the command line using ``--hyperparams``:

```
python train.py --algo a2c --env MountainCarContinuous-v0 --hyperparams learning_rate:0.001 policy_kwargs:"dict(net_arch=[64, 64])"
```
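
Each override is a `key:value` pair where the value is a Python expression (a float, a dict, ...). A sketch of how such pairs could be turned into a hyperparameter dict, not necessarily how `train.py` actually parses them:
```
def parse_hyperparams(pairs):
    """Turn ['learning_rate:0.001', 'policy_kwargs:dict(net_arch=[64, 64])'] into a dict."""
    hyperparams = {}
    for pair in pairs:
        key, value = pair.split(':', 1)
        # eval() accepts arbitrary Python expressions: only use with trusted input
        hyperparams[key] = eval(value)
    return hyperparams


print(parse_hyperparams(['learning_rate:0.001']))  # {'learning_rate': 0.001}
```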

## Record a Video of a Trained Agent

Record 1000 steps:
@@ -204,34 +248,6 @@ MiniGrid-DoorKey-5x5-v0:
env_wrapper: gym_minigrid.wrappers.FlatObsWrapper
```


## Colab Notebook: Try it Online!

25 changes: 13 additions & 12 deletions enjoy.py
@@ -10,15 +10,8 @@
warnings.filterwarnings("ignore", category=UserWarning, module='gym')

import gym
try:
import pybullet_envs
except ImportError:
pybullet_envs = None
import utils.import_envs # pytype: disable=import-error
import numpy as np
try:
import highway_env
except ImportError:
highway_env = None
import stable_baselines
from stable_baselines.common import set_global_seeds
from stable_baselines.common.vec_env import VecNormalize, VecFrameStack, VecEnv
@@ -50,6 +43,8 @@ def main():
help='Use deterministic actions')
parser.add_argument('--stochastic', action='store_true', default=False,
help='Use stochastic actions (for DDPG/DQN/SAC)')
parser.add_argument('--load-best', action='store_true', default=False,
help='Load best model instead of last model if available')
parser.add_argument('--norm-reward', action='store_true', default=False,
help='Normalize reward if applicable (trained with VecNormalize)')
parser.add_argument('--seed', help='Random generator seed', type=int, default=0)
@@ -78,7 +73,7 @@ def main():

assert os.path.isdir(log_path), "The {} folder was not found".format(log_path)

model_path = find_saved_model(algo, log_path, env_id)
model_path = find_saved_model(algo, log_path, env_id, load_best=args.load_best)

if algo in ['dqn', 'ddpg', 'sac', 'td3']:
args.n_envs = 1
@@ -108,12 +103,13 @@ def main():
deterministic = args.deterministic or algo in ['dqn', 'ddpg', 'sac', 'her', 'td3'] and not args.stochastic

episode_reward = 0.0
episode_rewards = []
episode_rewards, episode_lengths = [], []
ep_len = 0
# For HER, monitor success rate
successes = []
state = None
for _ in range(args.n_timesteps):
action, _ = model.predict(obs, deterministic=deterministic)
action, state = model.predict(obs, state=state, deterministic=deterministic)
# Random Agent
# action = [env.action_space.sample()]
# Clip Action to avoid out of bound errors
@@ -140,7 +136,9 @@
# is a normalized reward when `--norm_reward` flag is passed
print("Episode Reward: {:.2f}".format(episode_reward))
print("Episode Length", ep_len)
state = None
episode_rewards.append(episode_reward)
episode_lengths.append(ep_len)
episode_reward = 0.0
ep_len = 0

@@ -159,7 +157,10 @@
print("Success rate: {:.2f}%".format(100 * np.mean(successes)))

if args.verbose > 0 and len(episode_rewards) > 0:
print("Mean reward: {:.2f}".format(np.mean(episode_rewards)))
print("Mean reward: {:.2f} +/- {:.2f}".format(np.mean(episode_rewards), np.std(episode_rewards)))

if args.verbose > 0 and len(episode_lengths) > 0:
print("Mean episode length: {:.2f} +/- {:.2f}".format(np.mean(episode_lengths), np.std(episode_lengths)))

# Workaround for https://github.com/openai/gym/issues/893
if not args.no_render:
2 changes: 1 addition & 1 deletion hyperparams/a2c.yml
@@ -57,7 +57,7 @@ MountainCarContinuous-v0:
policy: 'MlpPolicy'
ent_coef: 0.0

BipedalWalker-v2:
BipedalWalker-v3:
normalize: true
n_envs: 16
n_timesteps: !!float 5e6
2 changes: 1 addition & 1 deletion hyperparams/acktr.yml
@@ -123,7 +123,7 @@ BipedalWalkerHardcore-v3:
vf_coef: 0.51

# Tuned
BipedalWalker-v2:
BipedalWalker-v3:
normalize: true
n_envs: 8
n_timesteps: !!float 5e6
2 changes: 1 addition & 1 deletion hyperparams/ddpg.yml
@@ -20,7 +20,7 @@ Pendulum-v0:
memory_limit: 50000

# Tuned
BipedalWalker-v2:
BipedalWalker-v3:
n_timesteps: !!float 1e6
policy: 'MlpPolicy'
noise_type: 'adaptive-param'
2 changes: 1 addition & 1 deletion hyperparams/ppo2.yml
@@ -101,7 +101,7 @@ Acrobot-v1:
noptepochs: 4
ent_coef: 0.0

BipedalWalker-v2:
BipedalWalker-v3:
normalize: true
n_envs: 16
n_timesteps: !!float 5e6
2 changes: 1 addition & 1 deletion hyperparams/sac.yml
@@ -23,7 +23,7 @@ LunarLanderContinuous-v2:
batch_size: 256
learning_starts: 1000

BipedalWalker-v2:
BipedalWalker-v3:
n_timesteps: !!float 1e6
policy: 'CustomSACPolicy'
learning_rate: lin_3e-4
2 changes: 1 addition & 1 deletion hyperparams/td3.yml
@@ -49,7 +49,7 @@ HalfCheetahBulletEnv-v0:
gradient_steps: 1000
policy_kwargs: "dict(layers=[400, 300])"

BipedalWalker-v2:
BipedalWalker-v3:
n_timesteps: !!float 2e6
policy: 'MlpPolicy'
gamma: 0.99
2 changes: 1 addition & 1 deletion hyperparams/trpo.yml
@@ -145,7 +145,7 @@ HopperBulletEnv-v0:
vf_stepsize: !!float 1e-3

# Tuned
BipedalWalker-v2:
BipedalWalker-v3:
env_wrapper: utils.wrappers.TimeFeatureWrapper
n_timesteps: !!float 5e6
policy: 'MlpPolicy'