[Classic Control Problem] Training baselines branch TF2-PPO2 to solve Pendulum-v0 is extremely slow and unstable #1103

MrForExample · 2020-05-05T09:42:55Z

Command:
python -m baselines.run --alg=ppo2 --env=Pendulum-v0 --num_timesteps=1e6 --reward_scale=0.001 --save_path=./models/Pendulm_ppo2
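The command rescales rewards with --reward_scale=0.001. To get a rough sense of the magnitudes involved, the sketch below computes Pendulum-v0's most negative per-step reward and its scaled value. It assumes the standard Pendulum-v0 cost (squared wrapped angle, plus 0.1 times squared angular velocity, plus 0.001 times squared torque, with speed capped at 8 and torque at 2) and that the flag simply multiplies each per-step reward by the factor. `pendulum_reward` is a hypothetical helper written for illustration, not a baselines or gym function.

```python
import math

# Assumed Pendulum-v0 limits on angular speed and torque.
MAX_SPEED = 8.0
MAX_TORQUE = 2.0
# Value passed on the command line; assumed to multiply each per-step reward.
REWARD_SCALE = 0.001


def pendulum_reward(theta, theta_dot, u):
    """Hypothetical re-implementation of Pendulum-v0's per-step reward."""
    # Wrap the angle into [-pi, pi) before squaring (0 = upright).
    theta = ((theta + math.pi) % (2 * math.pi)) - math.pi
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * u ** 2)


# Lower bound: hanging straight down, spinning at max speed, max torque.
worst = pendulum_reward(math.pi, MAX_SPEED, MAX_TORQUE)
print(f"raw per-step reward lower bound:    {worst:.4f}")
print(f"scaled per-step reward lower bound: {worst * REWARD_SCALE:.7f}")
```

Under these assumptions, every per-step reward the learner sees lies roughly in [-0.0163, 0] instead of [-16.27, 0].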

The hyperparameters used in baselines/ppo2/defaults.py:

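def classic_control():
    return dict(
        nsteps=128,              # rollout length per environment per update
        nminibatches=4,          # minibatches per optimization epoch
        lam=0.95,                # GAE lambda
        gamma=0.99,              # discount factor
        noptepochs=4,            # optimization epochs per update
        log_interval=1,          # log after every update
        ent_coef=.01,            # entropy bonus coefficient
        lr=lambda f: f * 2e-4,   # f = fraction of training remaining (1 -> 0)
        cliprange=0.1,           # PPO probability-ratio clip range
        num_layers=2,            # MLP hidden layers
        num_hidden=64,           # units per hidden layer
        activation='relu'        # MLP activation
    )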
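With these defaults, the counters in the logs below can be reproduced by hand. This sketch assumes a single environment (the logs show identical misc/serial_timesteps and misc/total_timesteps, which fits one env) and the baselines ppo2 convention that `lr` is called with the fraction of training remaining, `1 - (update - 1) / nupdates`. `lr_at` is a hypothetical helper for illustration.

```python
# Values from classic_control() and the command line.
NSTEPS = 128                # rollout length per environment per update
NMINIBATCHES = 4            # minibatches per optimization epoch
NENVS = 1                   # assumption: a single Pendulum-v0 instance
TOTAL_TIMESTEPS = int(1e6)  # --num_timesteps=1e6


def lr(f):
    # Same schedule as the defaults: linear decay from 2e-4 towards 0.
    return f * 2e-4


nbatch = NENVS * NSTEPS                # samples collected per update
nbatch_train = nbatch // NMINIBATCHES  # samples per gradient step
nupdates = TOTAL_TIMESTEPS // nbatch   # number of PPO updates


def lr_at(update):
    """Hypothetical helper: learning rate at a 1-indexed update number."""
    frac = 1.0 - (update - 1.0) / nupdates
    return lr(frac)


print(nbatch, nbatch_train, nupdates, nupdates * nbatch)
print(lr_at(1), lr_at(nupdates))
```

This gives 7812 updates of 128 samples each (999936 timesteps, matching the final log), minibatches of 32 samples, and a learning rate that decays linearly from 2e-4 to almost 0 by the last update.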

Example result:

Beginning of training
-------------------------------------------
| eplenmean               | 200           |
| eprewmean               | -1.43e+03     |
| fps                     | 836           |
| loss/approxkl           | 1.0275808e-13 |
| loss/clipfrac           | 0.0           |
| loss/policy_entropy     | 1.4534587     |
| loss/policy_loss        | 1.0652002e-07 |
| loss/value_loss         | 0.006119203   |
| misc/explained_variance | -1.88         |
| misc/nupdates           | 11            |
| misc/serial_timesteps   | 1408          |
| misc/time_elapsed       | 2.96          |
| misc/total_timesteps    | 1408          |
-------------------------------------------
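Even this early table shows loss/approxkl near 1e-13 and loss/clipfrac at 0.0. To make those columns easier to read, here is a plain-Python sketch of per-minibatch PPO statistics. It is modeled on the formulas baselines' ppo2 uses for these log keys (clipped surrogate loss, half the mean squared log-probability difference as approxkl, fraction of ratios outside the clip range as clipfrac). It is not taken from the TF2 branch, and `ppo_minibatch_stats` is a hypothetical name.

```python
import math


def ppo_minibatch_stats(new_logp, old_logp, advantages, cliprange=0.1):
    """Return (policy_loss, approxkl, clipfrac) for one minibatch.

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current policy and the data-collecting policy; advantages: one per sample.
    """
    n = len(advantages)
    policy_loss = approxkl = clipfrac = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - cliprange), 1.0 + cliprange)
        # Pessimistic clipped surrogate, negated so it can be minimized.
        policy_loss += max(-adv * ratio, -adv * clipped)
        # Quadratic KL approximation reported for logging.
        approxkl += 0.5 * (lp_new - lp_old) ** 2
        # Count samples whose ratio left [1 - cliprange, 1 + cliprange].
        clipfrac += 1.0 if abs(ratio - 1.0) > cliprange else 0.0
    return policy_loss / n, approxkl / n, clipfrac / n


# If the policy has not moved, every ratio is 1.
same = [-1.2, -0.7, -2.0]
print(ppo_minibatch_stats(same, same, [0.5, -0.5, 0.0]))
```

When new and old log-probabilities coincide, approxkl and clipfrac are exactly 0 and the policy loss reduces to minus the mean advantage. Assuming advantages are normalized per minibatch (as baselines' ppo2 does), that mean is near 0 as well. So values like those in the table are what one would expect if the policy barely changed between collecting a rollout and updating on it.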
After 1M steps:
-------------------------------------------
| eplenmean               | 200           |
| eprewmean               | -1.43e+03     |
| fps                     | 799           |
| loss/approxkl           | 0.0           |
| loss/clipfrac           | 0.0           |
| loss/policy_entropy     | 13.628587     |
| loss/policy_loss        | -5.378388e-08 |
| loss/value_loss         | 0.0008752448  |
| misc/explained_variance | 0.266         |
| misc/nupdates           | 7812          |
| misc/serial_timesteps   | 999936        |
| misc/time_elapsed       | 1.24e+03      |
| misc/total_timesteps    | 999936        |
-------------------------------------------
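The loss/policy_entropy column rises from about 1.45 to about 13.6 over training. Assume the TF2 branch, like baselines' DiagGaussianPd, models Pendulum's single torque dimension as a Gaussian with a learned, state-independent log standard deviation, so that entropy = logstd + 0.5 * ln(2 * pi * e). Under that assumption, the logged entropy can be inverted to the implied standard deviation. `implied_std` is a hypothetical helper.

```python
import math

# Entropy of a 1-D Gaussian is logstd + 0.5 * ln(2 * pi * e).
GAUSSIAN_CONST = 0.5 * math.log(2.0 * math.pi * math.e)  # about 1.4189


def implied_std(entropy, action_dim=1):
    """Invert diagonal-Gaussian entropy, assuming one shared per-dim std."""
    logstd = entropy / action_dim - GAUSSIAN_CONST
    return math.exp(logstd)


start_std = implied_std(1.4534587)   # loss/policy_entropy near the start
end_std = implied_std(13.628587)     # loss/policy_entropy after 1M steps

print(f"std near start:     {start_std:.3f}")
print(f"std after 1M steps: {end_std:.3e}")
```

If that reading is right, the standard deviation grows from about 1 to about 2e5. That is far wider than the [-2, 2] range that Pendulum-v0 clips torques to, so nearly every sampled action would end up at one of the bounds. This would be consistent with eprewmean staying flat.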

The baselines and PPO source code was downloaded with git and left untouched. I spent quite some time adjusting the hyperparameters, and it doesn't seem to have much effect on the result, if any.
Does anyone have any idea what could be going wrong?


DanielTakeshi · 2020-05-06T15:22:43Z

I would recommend using stable-baselines. This repository is not currently being maintained.

FWIW I have gotten PPO2 to work on MuJoCo (well, last time I tried many months ago).
#938

MrForExample · 2020-05-10T05:46:44Z

@DanielTakeshi Thanks for the information. I was looking for a PPO implementation that uses TF2, and stable-baselines uses TF1.
But never mind: I spent 2 days writing my own PPO implementation in TF2, and it now seems to work fine!

MrForExample changed the title ~~[Classic Control Promble] Training baselines TF2-PPO2 to solve Pendulum-v0 extremely slow and unstable~~ [Classic Control Promble] Training baselines branch TF2-PPO2 to solve Pendulum-v0 extremely slow and unstable May 5, 2020

MrForExample closed this as completed May 10, 2020
