
Default hyperparameters for run_atari.py (using P-DDDQN) fail with Pong and Breakout (log files attached) #431

Open
DanielTakeshi opened this issue Jun 8, 2018 · 34 comments

@DanielTakeshi

DanielTakeshi commented Jun 8, 2018

The default hyperparameters of baselines/baselines/deepq/experiments/run_atari.py, which is presumably the script we should be using for DQN-based models, fail to gain any noticeable reward on both Breakout and Pong. I've attached log files and the steps to reproduce below; the main reason I'm filing this is that the provided scripts should probably ship with working default hyperparameters. Or, alternatively, perhaps list the ones that work somewhere? Upon reading run_atari.py it seems like the number of steps is a bit low and the replay buffer should be 10x larger, but I don't think that alone will fix the issue, since Pong should be able to learn quickly with this kind of setup.

I know this is probably not the top priority now but in theory this is easy to fix (just run it with the correct hyperparameters), and it would be great for users since running even 10 million steps (the default value right now) can take over 10 hours on a decent personal workstation. If you're in the process of refactoring this code, is there any chance you can take this feedback into account? Thank you!

Steps to reproduce:

  • Use a machine with Ubuntu 16.04.
  • I doubt this matters, but I'm also using an NVIDIA Titan X GPU with Pascal.
  • Install baselines as of commit 36ee5d1
  • I used a Python 3.5 virtual environment, with TensorFlow 1.8.0.
  • Enter the experiments directory: cd baselines/baselines/deepq/experiments/
  • Finally, run python run_atari.py with either PongNoFrameskip-v4 or BreakoutNoFrameskip-v4 as the --env argument. I kept all other parameters at their default values, so this was prioritized dueling double DQN.

By default the logger in baselines will create log.txt, progress.csv, and monitor.csv files that contain information about training runs. Here are the Breakout and Pong log files:

breakout_log.txt
pong_log.txt

Since GitHub doesn't accept .csv file uploads, here are the monitor.csv files for Breakout and then Pong:

https://www.dropbox.com/s/ibl8lvub2igr9kw/breakout_monitor.csv?dl=0
https://www.dropbox.com/s/yuf3din6yjb2swl/pong_monitor.csv?dl=0

Finally, here are the progress.csv files for Breakout and then for Pong:

https://www.dropbox.com/s/79emijmnsdcjm37/breakout_progress.csv?dl=0
https://www.dropbox.com/s/b817wnlyyyriti9/pong_progress.csv?dl=0

@vpj

vpj commented Jun 12, 2018

I too got similar results for Breakout, with default parameters.

@DanielTakeshi
Author

Thanks, @vpj.

Not sure if anyone on the team has been able to check this. Hopefully this will be updated in their code refactor, which I think they are doing behind the scenes.

In the meantime I'm using an older version of the code to get the DQN-based algorithms to match the results in the published literature.

@uotter

uotter commented Jul 10, 2018

Hi @DanielTakeshi, I ran into the same problem. Can you tell me which version of the code you are using now?

@DanielTakeshi
Author

@uotter It's a bit unfortunate; I am actually using this old commit:

4993286

because the commit right after that is the one that changed a bunch of image processing stuff.

@uotter

uotter commented Jul 10, 2018

Thanks @DanielTakeshi, does this version work well and reach the scores reported in the published papers?

@DanielTakeshi
Author

Yes, that version works well. I've reproduced scores in line with the published literature on all 20 games I tried.

@andytwigg
Contributor

@DanielTakeshi Seeing that commit 4993286 makes me nervous about VecFrameStack used in PPO2, which I've also had trouble with. step_wait contains the following:

self.stackedobs = np.roll(self.stackedobs, shift=-1, axis=-1)
...
self.stackedobs[..., -obs.shape[-1]:] = obs

so I'm wondering if this should be updated to match the changes in that commit?
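
For reference, here is a minimal, self-contained sketch (not the baselines implementation itself) of what that roll-and-overwrite pattern does for frame stacking, assuming 84x84 single-channel observations from a couple of vectorized environments:

import numpy as np

# Toy stand-in for VecFrameStack's buffer: (num_envs, height, width, num_stacked_frames).
nenv, h, w, nstack = 2, 84, 84, 4
stackedobs = np.zeros((nenv, h, w, nstack), dtype=np.uint8)

def stack_step(stackedobs, obs):
    # Shift existing frames one slot toward the front of the channel axis;
    # the oldest frame wraps around to the last slot but is overwritten below.
    stackedobs = np.roll(stackedobs, shift=-1, axis=-1)
    # Write the newest observation into the last channel slot(s).
    stackedobs[..., -obs.shape[-1]:] = obs
    return stackedobs

new_obs = np.random.randint(0, 256, size=(nenv, h, w, 1), dtype=np.uint8)
stackedobs = stack_step(stackedobs, new_obs)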

@meet-cjli

@DanielTakeshi OK, I closed the issue. How did you solve this problem? The old version of the code does not seem to have this problem. Do you know what the problem is?

@DanielTakeshi
Author

@Sudsakorn123 Unfortunately I do not know. I went through the current version of the code (the one I can't get training to work with) line by line, and also checked the preprocessing of images, but I didn't find anything unusual.

@DanielTakeshi
Author

Oh, I just saw an older issue

#176

where users were reporting similar issues. Unfortunately, it seems like nothing got resolved there.

@skpenn

skpenn commented Jul 27, 2018

I hit the same issue and finally solved it with this pull request: Fix dtype for wrapper observation spaces. I'm wondering why that pull request has not been merged into the main branch.

@DanielTakeshi
Author

@skpenn Good news, looks like the pull request you linked is 'effectively merged'!

@Michalos88

I believe the issue has not been solved yet. I tried training the deepq model on Breakout and Pong with the default hyperparameters, and even after 40M time steps the average episode return wouldn't exceed 0.4. I tried tuning the hyperparameters, but it didn't really help.
I am using a similar setup to @DanielTakeshi's:

  • Ubuntu 16.04 LTS
  • NVIDIA Titan X GPU with Pascal
  • Python 3.5.2
  • baselines as of commit 3900f2a4473ce6b26a8129372ca8d5e02c766c9c
  • In the main baselines folder, running: python -m baselines.run --alg=deepq --env=BreakoutNoFrameskip-v4 --num_timesteps=1e7

requirements.txt
monitor.txt

@DanielTakeshi
Author

@Michalos88 really? That's unfortunate.

For hyperparameters I strongly suggest sticking with the defaults here (or with what the DeepMind paper did), since it's too expensive for us to keep tweaking them. The repository here will eventually, I think, get standardized results for Atari and the DQN-based models.

I'll run a few trials on my end as well (maybe next week) to see if default DQN parameters can make progress.

@Michalos88

Thanks, @DanielTakeshi.

Yeah, let us know next week!

@DanielTakeshi
Author

@Michalos88 @skpenn @vpj @uotter Unfortunately it looks like the refactored code still runs into the same issue. The refactoring is helpful to make the interface uniform but I am guessing there are still some issues with the core DQN algorithm here. I'll split this into three parts.

First Attempt

Using commit 4402b8e of baselines and the same machine as described in my earlier message here, I ran this command for PDD-DQN:

python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4

because that is what they tell us to run in the README:
https://github.com/openai/baselines/tree/master/baselines/deepq

Unfortunately I get -20.7. The logs:

log.txt
https://www.dropbox.com/s/8klj70brmhfp4i5/monitor.csv?dl=0
https://www.dropbox.com/s/2fn05ze4op2z0mn/progress.csv?dl=0

Granted these are with the hyperparameters:

Logging to /tmp/openai-2018-09-25-10-59-49-863956
env_type: atari
Training deepq on atari:PongNoFrameskip-v4 with arguments 
{'target_network_update_freq': 1000, 'gamma': 0.99, 'lr': 0.0001, 'dueling': True, 'prioritized_replay_alpha': 0.6, 'checkpoint_freq': 10000, 'learning_starts': 10000, 'train_freq': 4, 'checkpoint_path': None, 'exploration_final_eps': 0.01, 'prioritized_replay': True, 'network': 'conv_only', 'buffer_size': 10000, 'exploration_fraction': 0.1}

and with just 1M time steps by default. I think one needs around 10M steps, and a larger buffer size.
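
(For example, something along the lines of the following, reusing the --num_timesteps and --buffer_size flags that appear in the second attempt below; the exact values here are just my guess at a more Nature-style setup.)

python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e7 --buffer_size=100000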

Second Attempt

I then tried hyperparameters similar to those I used with an older baselines commit (roughly a year ago), with which PDD-DQN easily gets at least +20 on Pong.

This is what I next ran with different hyperparameters:

(py3-baselines-sep2018) daniel@takeshi:~/baselines-sandbox$ python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=5e7 --buffer_size=50000 --lr=5e-4
Logging to /tmp/openai-2018-09-25-13-50-13-182778
Logging to /tmp/openai-2018-09-25-13-50-13-223205
env_type: atari
Training deepq on atari:PongNoFrameskip-v4 with arguments 
{'exploration_fraction': 0.1, 'learning_starts': 10000, 'checkpoint_path': None, 'lr': 0.0005, 'target_network_update_freq': 1000, 'dueling': True, 'exploration_final_eps': 0.01, 'train_freq': 4, 'prioritized_replay': True, 'buffer_size': 50000, 'prioritized_replay_alpha': 0.6, 'checkpoint_freq': 10000, 'gamma': 0.99, 'network': 'conv_only'}

The 5e7 time steps and 50k buffer size put it more in line with what I think the older baselines code used (and what the Nature paper may have used).

The following morning (after running for about 12 hours) I noticed that, after about 15M steps, the scores were still stuck at -21. PDD-DQN still doesn't seem to learn anything. I killed the script to avoid having to run 35M more steps. Here are the logs I have:

log.txt
https://www.dropbox.com/s/qi1f9de0lhnhw7a/monitor.csv?dl=0
https://www.dropbox.com/s/1odyl2reda7ncuy/progress.csv?dl=0

Note that the learning seems to collapse. Early on we get plenty of -20s and -19s, which I'd expect, and then later it's almost always -21.

Observing the Benchmarks

Note that the Atari benchmarks they publish:

http://htmlpreview.github.io/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm

show that DQN gets a score of minus seven on Pong, which is really bad but still better than what I am getting here. (They also show Breakout with a score of just one...) I am not sure what command line arguments they are using for this, but maybe it's hidden somewhere in the code that generates the benchmarks?

@pzhokhov Since this is a fairly critical issue, is there any chance the README can be adjusted with a message like:

The DQN-based algorithms currently do not get high scores on the Atari games [see GitHub issues XX, YY, etc]. We are currently investigating this and recommend users to instead use [insert working algorithm here, e.g., PPO2].

I think this might help save some time for those who are hoping to use the DQN-based algorithms. In the meantime I can help try and figure out what the issue is, and I will also keep using my older version of baselines (from a year ago) which has the working DQN algorithms.

@pzhokhov
Collaborator

added a note to README

@DanielTakeshi
Author

DanielTakeshi commented Sep 28, 2018

Thanks @pzhokhov

In case it helps, you can see in an earlier message the commit I used that has DQN working well (4993286). More precisely, with that commit I trained 24 Atari games using PDD-DQN and got good scores on all of them with (I think) 5e7 time steps. The commit after it seemed to be when things changed, and it involved adjusting some processing of the game frames, so that could be one area to check. I went through the source code, and the core DQN seemed to be implemented correctly (at least as of June 2018, and I don't think it has changed since then); I couldn't find any obvious errors (e.g., not scaling the frame pixels).

Do you have any other suspicions about what could be happening? I have some spare cycles that I could spend on testing. For efficiency, I just want to make sure I don't duplicate tests that others are already running.

I should also add, I ran some A2C tests on Pong as of today's commit, and got good scores (20+) in less than 2e7 time steps for num envs 2, 4, 8, and 16. So that removes one source of uncertainty.

@andytwigg
Contributor

@DanielTakeshi just to be sure, does PPO2 on current master get good scores?

@pzhokhov
Collaborator

Thanks @DanielTakeshi! My strongest suspect would be hyperparameters, but your investigation shows that's not the case... Another possible area of failure is wrong datatype casts, i.e., if we accidentally convert to int after dividing by 255 somewhere. I'll look into the diff between commits shortly (today/tomorrow); if nothing jumps out, then we'll have to go through the exceedingly painstaking exercise of starting with the same weights and ensuring the updates are the same. It is really not fun, so hopefully it does not come to that :)
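
To illustrate the kind of cast bug being suspected here (a hypothetical snippet, not a pointer to actual baselines code): dividing a uint8 frame by 255 and then casting back to an integer type wipes out almost all of the signal.

import numpy as np

# A random 84x84 grayscale frame, like the Atari wrappers produce.
frame = np.random.randint(0, 256, size=(84, 84, 1), dtype=np.uint8)

correct = frame.astype(np.float32) / 255.0  # values spread across [0, 1]
broken = (frame / 255.0).astype(np.uint8)   # accidental int cast: truncates everything to 0,
                                            # except pixels that were exactly 255

print(correct.max(), broken.max())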

@pzhokhov
Collaborator

pzhokhov commented Oct 2, 2018

So, nothing in the commit changes jumped out at me as an obvious source of error; however, I narrowed down the commits between which the breaking changes happened: 2b0283b is still working, and 24fe3d6 is not working anymore. The bad news is that all of those commits are mine, so you know whom to blame :( The good news is that hopefully I'll find the bug soon.

@pzhokhov
Collaborator

pzhokhov commented Oct 3, 2018

I think it is the scale=True option passed to wrap_deepmind, which leads to dividing inputs by (255*255) instead of 255... running tests now.
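
For context, a rough sketch of the double-scaling failure mode described above (an illustration based on this comment, not the exact baselines code path): if the wrapper already returns frames scaled to [0, 1] and the model divides by 255 again, the network effectively sees inputs in [0, 1/255].

import numpy as np

frame = np.random.randint(0, 256, size=(84, 84, 1), dtype=np.uint8)

# What a scale=True style wrapper would hand the agent: floats in [0, 1].
wrapper_scaled = frame.astype(np.float32) / 255.0
# If the model then divides by 255 again, inputs collapse into [0, ~0.004].
double_scaled = wrapper_scaled / 255.0

print(wrapper_scaled.max(), double_scaled.max())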

@pzhokhov
Collaborator

pzhokhov commented Oct 3, 2018

Confirmed. Here's a fixing PR: #632; I'll update benchmark results shortly

@DanielTakeshi
Author

DanielTakeshi commented Oct 3, 2018

Whoa, this seems like great news! @pzhokhov Thanks for this; I am eager to see the benchmarks and to try it out myself.

@andytwigg I haven't confirmed PPO2; have you run it yourself? If PPO2 is implemented in a similar manner to A2C, then it takes a few hours on a decent workstation, and A2C is getting reasonable results for me. (I get scores similar to those in the appendix of Self-Imitation Learning, ICML 2018.)

@Michalos88

Michalos88 commented Oct 3, 2018

@andytwigg I can confirm that PPO2 produces expected results. @DanielTakeshi @pzhokhov Thanks for handling this issue! :)

@DanielTakeshi
Author

And by the way, if things are looking good, this part can be removed:

NOTE: The DQN-based algorithms currently do not get high scores on the Atari games (see GitHub issue 431) We are currently investigating this and recommend users to instead use PPO2.

@JulianoLagana
Contributor

Was this solved? I'm still not able to obtain good scores in Pong using P-DDDQN with default hyperparameters.

@DanielTakeshi
Author

@JulianoLagana it was solved

@JulianoLagana
Contributor

Thanks for the quick reply, @DanielTakeshi. Three days ago I ran the file train_pong.py (multiple times, with different seeds), and only one out of 5 runs actually managed to get a score (not average score) higher than zero. In my free time I'll investigate further and try to post a minimal example here.

@cbjtu

cbjtu commented Apr 7, 2019

Unfortunately, with 6d1c6c7 I still can't reproduce the benchmark result on Breakout, using the command:
python -m baselines.run --alg=deepq --env=BreakoutNoFrameskip-v4 --num_timesteps=2e6
Three seeds got an average score of about 15; shouldn't it smoothly reach about 200 points as in the benchmark?

@cbjtu

cbjtu commented Apr 7, 2019

I'm trying to deploy the same version as the benchmark.

@cbjtu

cbjtu commented Apr 8, 2019

Enduro-v0 with 6d1c6c7 is good.

@cbjtu

cbjtu commented Apr 8, 2019

Enduro-v0 with the latest master version is good, too. Is the problem with the Breakout env?

@DongChen06

Enduro-v0 with the latest master version is good, too. Is the problem with the Breakout env?

Have you resolved the problem with Breakout? I cannot reproduce the result; I only get a score of around 16.
