[rllib] Provide atari results across all algorithms (as applicable) #2663

Closed
3 of 7 tasks
ericl opened this issue Aug 15, 2018 · 20 comments

ericl commented Aug 15, 2018

Describe the problem

We should publish results for at least a few of the standard Atari games on all applicable algorithms, and fix any discrepancies, e.g. #2654

Results uploaded to this repo: https://github.com/ray-project/rl-experiments

  • IMPALA
  • IMPALA-LSTM
  • A3C
  • A2C
  • DQN
  • APEX
  • PPO

Envs to run: PongNoFrameskip-v4, BreakoutNoFrameskip-v4, BeamRiderNoFrameskip-v4, QbertNoFrameskip-v4, SpaceInvadersNoFrameskip-v4

(Chosen so that all but Pong can run concurrently on a single g3.16xl machine.)

Some references:
https://github.com/btaba/yarlp
openai/baselines#176

@richardliaw

Also relevant reference: https://github.com/hill-a/stable-baselines

ericl commented Aug 16, 2018

Just ran a "30% full speed" IMPALA across a couple of environments. The results are pretty reasonable at 40M frames, with Qbert / SpaceInvaders roughly in line with the results from the A3C paper, and Breakout / BeamRider a bit below. Note that the episode max reward for Breakout and BeamRider is pretty good, but the mean is not quite up there.

I'm guessing we can improve on this with some tuning.

# Runs on a single g3.16xl node
atari-impala:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4 
    run: IMPALA
    config:
        sample_batch_size: 250  # 50 * num_envs_per_worker
        train_batch_size: 500
        num_workers: 12
        num_envs_per_worker: 5

[learning curves: atari-impala]
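For anyone trying to reproduce this, here is a minimal launcher sketch (not from the thread) for the spec above, assuming it is saved as atari-impala.yaml. It roughly mirrors what RLlib's train.py script does by moving the env grid_search into the agent config before handing the experiments to Tune; the filename and the env-moving step are assumptions for illustration.

# Hypothetical launcher sketch; assumes the YAML above is saved as atari-impala.yaml.
import yaml
import ray
from ray import tune

with open("atari-impala.yaml") as f:
    experiments = yaml.safe_load(f)

# RLlib reads the environment from the agent config, so move the
# grid_search over envs from the experiment level into "config".
for spec in experiments.values():
    spec["config"]["env"] = spec.pop("env")

ray.init()
tune.run_experiments(experiments)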

@robertnishihara

In what format does it make sense to publish the results? E.g., a collection of full learning curves (e.g., as CSV)? Or actual visualizations like you have above? Or something else?

ericl commented Aug 16, 2018

If we have a public ray perf dashboard, that would be a good place to put these.

Otherwise, I think posting some summary visualizations on github or the docs would do (for example, just having the tuned example yamls with pointers to this issue). The full learning curve data probably isn't that interesting, but we could also upload that to S3 pretty easily.

@luochao1024

Do you have any results for A3C or A3C-LSTM?

ericl commented Aug 19, 2018 via email

@luochao1024

A3C is very sensitive to the learning rate, as the staleness of the gradients increases with the learning rate.

ericl commented Aug 19, 2018

For reference, here are the run and params (with the default lr=0.0001 and grad_clip=40.0). Note that the gradient magnitude scales with lr * batch size (batch size = 20 here).

This is also on this branch: #2679

# Runs on a single m4.16xl node
atari-a3c:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4 
    run: A3C
    config:
        num_workers: 11
        sample_batch_size: 20
        optimizer:
            grads_per_step: 1000

[learning curves: a3c]

ericl commented Aug 19, 2018

That PR also adds A2C. Since A2C is synchronous and therefore deterministic, it should be easy to copy hyperparameters from another A2C implementation and compare results (I'm doing some runs right now, but it might take a while).
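To make "copy hyperparameters from another A2C implementation" concrete, here is a rough sketch of the openai/baselines A2C defaults (lr=7e-4, 5-step rollouts, 16 parallel envs, ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5) written as a config dict using the key names that appear in the configs in this thread. The one-to-one mapping is an assumption for illustration, not a verified equivalence; note the negative entropy_coeff sign convention visible in the A3C config later in this thread.

# Assumed mapping of openai/baselines A2C defaults onto RLlib-style keys (illustrative only).
a2c_atari_config = {
    "sample_batch_size": 5,    # baselines nsteps=5 per environment
    "num_workers": 4,
    "num_envs_per_worker": 4,  # 4 workers x 4 envs = 16 parallel envs, as in baselines
    "lr": 0.0007,              # baselines lr=7e-4
    "entropy_coeff": -0.01,    # baselines ent_coef=0.01, negated per the sign convention in this thread
    "vf_loss_coeff": 0.5,      # baselines vf_coef=0.5
    "grad_clip": 0.5,          # baselines max_grad_norm=0.5
    "gamma": 0.99,
}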

@luochao1024

You are using 11 workers for the experiment; I would recommend 16 workers.

ericl commented Aug 20, 2018

One discovery: we're handling EpisodicLifeEnv resets incorrectly. For example, in BeamRider you get three lives, which we are treating as three separate episodes, but they are supposed to count as one.

This largely explains why BeamRider's starting score is about 3x too low.
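To illustrate the distinction, here is a sketch (hypothetical class name, not RLlib's actual EpisodicLifeEnv) of the intended behavior: end a training episode on every lost life, but only reset the game and report the accumulated score at the real game over. With that bookkeeping, BeamRider's three lives contribute to one reported episode instead of three.

import gym

class LifeAsEpisodeWrapper(gym.Wrapper):
    """Illustrative only: terminates the training episode on each lost life,
    but tracks the true (full-game) return for reporting."""

    def __init__(self, env):
        super().__init__(env)
        self.lives = 0
        self.was_real_done = True
        self.true_return = 0.0

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.true_return += reward
        self.was_real_done = done
        lives = self.env.unwrapped.ale.lives()
        if 0 < lives < self.lives:
            done = True  # lost a life: end the *training* episode early
        self.lives = lives
        if self.was_real_done:
            # Only now has a real episode finished; this is the score to log,
            # not the per-life fragments.
            info["true_episode_return"] = self.true_return
        return obs, reward, done, info

    def reset(self, **kwargs):
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
            self.true_return = 0.0
        else:
            # Life lost but game not over: keep playing from the current state.
            obs, _, _, _ = self.env.step(0)  # no-op action
        self.lives = self.env.unwrapped.ale.lives()
        return obs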

ericl commented Aug 21, 2018

@luochao1024 this PR reproduces standard Atari results for IMPALA and A2C: #2700

I'm still having trouble finding the right hyperparams for A3C (vf_explained_var tends to dive to <0 with A3C whereas it is always close to 1 with A2C / IMPALA), but since it works in A2C it's probably just a matter of tweaking the lr / batch size / grad clipping.
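For readers unfamiliar with the metric: vf_explained_var is the explained variance of the value function's predictions against the observed returns, 1 - Var(return - prediction) / Var(return). Values near 1 mean the critic fits the returns well; values at or below 0 mean it does no better than predicting a constant. A small numpy sketch (function and variable names are illustrative):

import numpy as np

def explained_variance(returns, value_preds):
    # 1 - Var(returns - value_preds) / Var(returns)
    var_returns = np.var(returns)
    if var_returns == 0:
        return float("nan")
    return 1.0 - np.var(returns - value_preds) / var_returns

returns = np.array([1.0, 5.0, 3.0, 7.0])
bad_preds = np.array([7.0, 1.0, 6.0, 0.0])
print(explained_variance(returns, bad_preds))  # negative -> the critic is badly off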

@luochao1024

Do you have hyperparams that work well for A3C now?

ericl commented Aug 25, 2018 via email

@luochao1024

@ericl Can you give it a try on BreakoutNoFrameskip-v4? I tried a grid search over the lr, but I still get some really bad results. Here is the config I use:

atari-a3c:
    env: BreakoutNoFrameskip-v4
    run: A3C
    config:
        num_workers: 8
        sample_batch_size: 20
        use_pytorch: false
        vf_loss_coeff: 0.5
        entropy_coeff: -0.01
        gamma: 0.99
        grad_clip: 40.0
        lambda: 1.0
        lr:
            grid_search:
                - 0.000005
                - 0.00001
                - 0.00005
                - 0.0001
        observation_filter: NoFilter
        preprocessor_pref: rllib
        num_envs_per_worker: 5
        optimizer:
            grads_per_step: 1000

ericl commented Aug 29, 2018 via email

@luochao1024

Now I am running A3C with the following config:

atari-a3c:
    env:
        BreakoutNoFrameskip-v4
    run: A3C
    config:
        num_workers: 5
        sample_batch_size: 20
        preprocessor_pref: deepmind
        lr:
           grid_search:
               - 0.000005
               - 0.00001
               - 0.00005
               - 0.0001
               - 0.0005
               - 0.001
        num_envs_per_worker: 5
        optimizer:
            grads_per_step: 1000

Do you think the config is reasonable now? I am also running BeamRiderNoFrameskip-v4, QbertNoFrameskip-v4, and SpaceInvadersNoFrameskip-v4 at the same time. I will report back when the training finishes.

ericl commented Aug 29, 2018 via email

luochao1024 commented Aug 29, 2018

The results seem normal now with num_workers=5:

BreakoutNoFrameskip-v4: [learning curve]

SpaceInvadersNoFrameskip-v4: [learning curve]

QbertNoFrameskip-v4: [learning curve]

I will set num_envs_per_worker=1 later.

ericl commented Sep 15, 2018

Closing this in favor of individual tickets. Main TODOs are the DQN family.

ericl closed this as completed Sep 15, 2018