Cannot reproduce Breakout benchmark using Double DQN #176
Same here, I get similar average rewards. I also ran the deepq/experiments/run_atari.py example without any modifications and it still just converges to about 11 in around 5 million steps. Any help or suggestions would be appreciated.
I observe the same problem when training with the "learn" function in "simple.py", which is what "run_atari.py" uses. When training with "deepq/experiments/atari/train.py" instead, it works fine.
File "train.py", line 244, in |
I have been running train.py in |
@gbg141 Part of your issue might be that the rewards from the environment wrapped with |
@btaba Did you try this, and did it work for you?
@kdu4108 That actually didn't work for me. I also tried a
Edit: trained on commit
@btaba Okay, thanks for the response. I tried training the default Pong using that version and successfully reproduced their results. Out of curiosity, have you tried to reproduce results on any other environments using that commit? Or have you tested any later commits that might contain fixes for the Breakout reward difference?
@kdu4108 I only tried that commit on Breakout and BeamRider, and was not able to reproduce the results.
I'm facing the same issue.
@ashishm-io You can try. Don't forget to log the actual episode rewards and not the clipped ones. I find that this DQN implementation actually works. From there, it's probably easier to add double Q-learning and dueling networks.
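For anyone checking their own runs, here is a minimal sketch of how the unclipped episode returns could be logged with a small Gym wrapper. The wrapper name is made up for illustration, and it has to be applied before any reward-clipping wrapper (for example before Baselines' reward clipping in its Atari wrappers) so that it sees the raw Atari scores:

```python
import gym


class RawEpisodeReturnLogger(gym.Wrapper):
    """Records the unclipped return of each episode. Apply this wrapper
    *before* any reward-clipping wrapper so it sees the raw Atari scores.
    (Illustrative sketch, not part of Baselines.)"""

    def __init__(self, env):
        super(RawEpisodeReturnLogger, self).__init__(env)
        self._return = 0.0
        self.episode_returns = []

    def reset(self, **kwargs):
        self._return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._return += reward
        if done:
            self.episode_returns.append(self._return)
            info['raw_episode_return'] = self._return
        return obs, reward, done, info
```

With that in place, the numbers reported during training can be compared directly against the per-episode game scores other papers and implementations report.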
Shouldn't Baselines log both clipped and episode rewards by default? Isn't that an essential feature for comparing results with other implementations?
@ashishm-io Another difference is the size of the replay buffer. You might try bumping that to 1e6, because by default it's only 1e4. Note that in

@kdu4108 Yea, but Pong is the simplest of the Atari games as far as I know. In my implementation I achieve an average of over 20 in about 3 million frames. Breakout is significantly harder.

@btaba When you achieved the 250 average, that's the actual score, right? As opposed to the clipped score? And also, is that with or without episodic life? In other words, is that an average of 250 over one life, or over 5 lives?

OpenAI team: How do we reproduce what's reported in the baselines-results repository (https://github.com/openai/baselines-results/blob/master/dqn_results.ipynb)? It shows average scores of 400+; however, it references files that no longer exist, like
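For reference, a hedged sketch of what bumping the replay buffer size looks like with the deepq.learn API of roughly that era. The exact wrapper functions, defaults, and argument names varied between Baselines commits, so treat everything below as an assumption rather than the exact signature; the only point being made is buffer_size=1e6:

```python
from baselines import deepq
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

# Wrapper and hyperparameter names are from memory of the Baselines code of
# that era and may differ between commits.
env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'))

model = deepq.models.cnn_to_mlp(
    convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
    hiddens=[256],
    dueling=True,
)

act = deepq.learn(
    env,
    q_func=model,
    lr=1e-4,
    max_timesteps=int(1e7),
    buffer_size=int(1e6),        # the run_atari.py default discussed above is 1e4
    exploration_fraction=0.1,
    exploration_final_eps=0.01,
    train_freq=4,
    learning_starts=10000,
    target_network_update_freq=1000,
    gamma=0.99,
    prioritized_replay=True,
)
```

A 1e6 buffer at 84x84x4 uint8 frames takes tens of gigabytes of RAM unless frames are stored once and stacked lazily, which is worth keeping in mind when raising it.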
@benbotto The implementation I open-sourced two years ago (https://github.com/ppwwyyxx/tensorpack/tree/master/examples/DeepQNetwork) can reproduce a 400+ average score on Breakout within 10 hours on one GTX 1080 Ti.
Thank you @ppwwyyxx, I'll definitely run your implementation and compare the results against my own. I'm able to reproduce the 400 score as well in my code with vanilla DQN, but I'm running into trouble with Prioritized Experience Replay. This is the only implementation I know of that uses PER and takes the importance-sampling weights into account: most forgo that last part. I've found this implementation, which does not correctly normalize the weights. There's also this one, which ignores the IS weights altogether. The Baselines implementation looks right to me, aside from a minor off-by-one bug that's awaiting a pull request. That said, it would be nice to be able to reliably reproduce the numbers reported in the baselines-results repository!
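Since the disagreement between implementations is mostly about the importance-sampling weights, here is a small self-contained sketch of the weighting described in the prioritized experience replay paper, w_i = (N * P(i))^(-beta) normalized by the maximum weight. The function name and the flat priority array are illustrative, not any particular library's API:

```python
import numpy as np


def per_is_weights(priorities, sampled_idx, alpha=0.6, beta=0.4):
    """Importance-sampling weights for prioritized replay (Schaul et al., 2016):
    w_i = (N * P(i))^(-beta), divided by the maximum weight over the whole
    buffer so that gradient updates are only ever scaled down."""
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities ** alpha
    probs /= probs.sum()                          # sampling distribution P(i)
    n = len(priorities)
    weights = (n * probs[sampled_idx]) ** (-beta)
    max_weight = (n * probs.min()) ** (-beta)     # largest weight = rarest transition
    return weights / max_weight
```

The resulting weights multiply the per-sample TD errors in the loss, and beta is typically annealed from its initial value toward 1 over the course of training.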
I haven't been able to reproduce the results of the Breakout benchmark with Double DQN when using hyperparameter values similar to the ones presented in the original paper. After more than 20M observed frames (~100,000 episodes), the mean 100-episode reward still remains around 10, having reached a maximum value of 12.
In case I'm missing or getting something important wrong, here are the neural network configuration and the hyperparameter values I'm using:
Does anyone have an idea of what is going wrong? The analogous results presented in a Jupyter notebook in openai/baselines-results indicate that I should be able to get much better scores. Thanks in advance.
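For readers comparing implementations against this report, the target being discussed is the standard Double DQN update from van Hasselt et al. (2016), written here as a small NumPy sketch; the function and argument names are illustrative and not taken from Baselines:

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: y = r + gamma * (1 - done) *
    Q_target(s', argmax_a Q_online(s', a)). q_online_next and q_target_next
    are [batch, n_actions] arrays of Q-values for the next states under the
    online and target networks."""
    best_actions = np.argmax(q_online_next, axis=1)      # select actions with the online net
    batch_idx = np.arange(len(rewards))
    q_eval = q_target_next[batch_idx, best_actions]      # evaluate them with the target net
    dones = np.asarray(dones, dtype=np.float64)
    return rewards + gamma * (1.0 - dones) * q_eval
```

The only difference from vanilla DQN is that the action is chosen by the online network but evaluated by the target network, which reduces the overestimation bias.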