Cannot reproduce Breakout benchmark using Double DQN #176

Open
gbg141 opened this issue Oct 20, 2017 · 15 comments

gbg141 commented Oct 20, 2017

I haven't been able to reproduce the results of the Breakout benchmark with Double DQN when using hyperparameter values similar to the ones presented in the original paper. After more than 20M observed frames (~100,000 episodes), the mean 100-episode reward still remains around 10, with a maximum of 12.

Below are the network configuration and hyperparameter values I'm using, in case I'm missing or getting something important wrong:

env = gym.make("BreakoutNoFrameskip-v4")
env = ScaledFloatFrame(wrap_dqn(env))
model = deepq.models.cnn_to_mlp(
        convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
        hiddens=[512],
        dueling=False
)
act = deepq.learn(
        env,
        q_func=model,
        lr=25e-5,
        max_timesteps=200000000,
        buffer_size=100000, # cannot store 1M frames as the paper suggests
        exploration_fraction=1000000/float(200000000), # so that exploration finishes annealing after 1M steps
        exploration_final_eps=0.1,
        train_freq=4,
        batch_size=32,
        learning_starts=50000,
        target_network_update_freq=10000,
        gamma=0.99,
        prioritized_replay=False
)
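For reference, a quick sanity check of what those exploration settings imply, assuming a linear anneal from 1.0 down to exploration_final_eps over exploration_fraction * max_timesteps steps (which is how baselines' LinearSchedule is used in deepq.learn):

# Quick check of the epsilon schedule implied by the settings above.
max_timesteps = 200000000
exploration_fraction = 1000000 / float(max_timesteps)
exploration_final_eps = 0.1

anneal_steps = int(exploration_fraction * max_timesteps)   # 1,000,000 steps

def epsilon(t):
    frac = min(float(t) / anneal_steps, 1.0)
    return 1.0 + frac * (exploration_final_eps - 1.0)

print(epsilon(0))         # 1.0
print(epsilon(500000))    # 0.55
print(epsilon(2000000))   # 0.1 (stays at the final value afterwards)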

Does anyone have any idea of what is going wrong? The analogous results shown in a Jupyter notebook in openai/baselines-results indicate that I should be able to get much better scores.

Thanks in advance.

@asimmunawar

Same here, I get similar average rewards. I also ran the deepq/experiments/run_atari.py example without any modifications, and it still just converges to ~11 in around 5 million steps. Any help or suggestions would be appreciated.

candytalking commented Nov 24, 2017

I observe the same problem when training with the "learn" function in "simple.py", which is what "run_atari.py" uses. When training with "deepq/experiments/atari/train.py" instead, it works fine.

BNSneha commented Nov 25, 2017

File "train.py", line 244, in
start_time, start_steps = time.time(), info['steps']
KeyError: 'steps'
How to get rid of this error when trying to run atari/train.py?
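One possible workaround (an assumption about the cause, not a verified fix): the info dict may simply lack a 'steps' entry on a fresh run, so reading it with a default avoids the KeyError.

# Possible workaround, assuming `info` only contains 'steps' after resuming
# from a saved state. Fall back to 0 instead of indexing directly:
start_time, start_steps = time.time(), info.get('steps', 0)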

btaba commented Jan 15, 2018

I have been running train.py in baselines/deepq/experiments/atari/train.py with the command python train.py --env BeamRider --save-dir 'savedir-dueling' --dueling --prioritized, and I also cannot reproduce the results for BeamRider compared to the Jupyter notebook (although it seems that the train.py script was used to create those benchmarks). I had to make minor corrections to run the script due to the issues referenced in the comment directly above and in this ticket. I'm effectively running this lightly modified version.

btaba commented Jan 25, 2018

@gbg141 Part of your issue might be that rewards from the environment wrapped with wrap_deepmind are by default clipped to [-1, 1] using np.sign, so the rewards reported in deepq/experiments/atari/train.py are clipped. If you turn reward clipping off and explicitly save the clipped reward in the replay buffer for training, that might work for you.
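For illustration, a minimal sketch of that idea, clipping only what goes into the replay buffer and logging the raw episode return separately. env, replay_buffer and select_action are stand-ins for whatever the training loop already uses, not baselines API.

import numpy as np

# Sketch: train on clipped rewards but log the raw (unclipped) episode return.
def run_episode(env, replay_buffer, select_action):
    obs, done, raw_return = env.reset(), False, 0.0
    while not done:
        action = select_action(obs)
        new_obs, reward, done, _ = env.step(action)
        raw_return += reward                  # unclipped, for logging/benchmarking
        clipped = float(np.sign(reward))      # what the DQN update actually sees
        replay_buffer.add(obs, action, clipped, new_obs, float(done))
        obs = new_obs
    return raw_return                         # compare this against published scores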

kdu4108 commented Feb 3, 2018

@btaba Did you try this/did it work for you?

btaba commented Feb 3, 2018

@kdu4108 that actually didn't work for me. I also tried a git reset --hard 1f3c3e33e7891cb3 and wasn't able to reproduce the results in this notebook for Breakout.

Edit: I trained on commit 1f3c3e33e7891cb3 using python train.py --env Breakout --target-update-freq 10000 --learning-freq 4 --prioritized --dueling for 50M frames, and I am only able to reach a reward of ~250 as opposed to ~400.

kdu4108 commented Feb 9, 2018

@btaba Okay thanks for the response. I tried training the default Pong using that version and successfully reproduced their results. Out of curiosity, have you tried to reproduce results on any other environments using that commit? Or, have you tested any later commits that might hold fixes for the Breakout reward difference?

btaba commented Feb 12, 2018

@kdu4108 I only tried that commit on Breakout and BeamRider, and was not able to reproduce the results.

@AshishMehtaIO

I'm facing the same issue.
The only major difference between the DQN paper and the baselines implementation is the optimizer (RMSProp vs. Adam). Is there a major difference when using one or the other?
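For concreteness, a sketch of the two setups being compared (TF1-style, for illustration only; the RMSProp hyperparameters are the ones reported in the Nature DQN paper, and the Adam settings are typical baselines-style values, both listed here as assumptions):

import tensorflow as tf

# Nature DQN (Mnih et al., 2015): RMSProp with lr 2.5e-4, momentum 0.95,
# squared-gradient decay 0.95 and a comparatively large epsilon of 0.01.
rmsprop = tf.train.RMSPropOptimizer(
    learning_rate=2.5e-4, decay=0.95, momentum=0.95, epsilon=0.01)

# baselines deepq: Adam, typically with a learning rate around 1e-4 to 5e-4.
adam = tf.train.AdamOptimizer(learning_rate=1e-4, epsilon=1e-4)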

btaba commented Feb 20, 2018

@ashishm-io You can try. Don't forget to log the actual episode rewards and not the clipped ones.

I find that this DQN implementation actually works. It's probably easier to add double-Q and dueling networks from there.
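For reference, the double-Q part is essentially a one-line change to the standard DQN target: pick the next action with the online network, evaluate it with the target network. A sketch with illustrative names, not baselines code:

import numpy as np

# Double DQN target (van Hasselt et al., 2016).
def double_q_targets(rewards, dones, gamma, q_online_next, q_target_next):
    # q_online_next, q_target_next: arrays of shape [batch, num_actions]
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values

# Vanilla DQN would instead use q_target_next.max(axis=1) for next_values.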

@AshishMehtaIO

Shouldn't baselines log both the clipped and the raw episode rewards by default? Isn't that essential for comparing results with other implementations?

@benbotto

@ashishm-io Another difference is the size of the replay buffer. You might try bumping that to 1e6, because by default it's only 1e4. Note that run_atari.py uses the ScaledFloatFrame wrapper, so observations are stored as 32-bit floats rather than 8-bit ints. In other words, you'll need a ton of memory (see the rough estimate after this comment)!

@kdu4108 Yea, but Pong is the simplest of the Atari games as far as I know. In my implementation I achieve an average of over 20 in about 3 million frames. Breakout is significantly harder.

@btaba When you achieved the 250 average, that's the actual score, right? As opposed to the clipped score? And also, is that with or without episodic life? In other words, is that an average of 250 in one life, or in 5 lives?

OpenAI team: How do we reproduce what's reported in the baselines-results repository (https://github.com/openai/baselines-results/blob/master/dqn_results.ipynb)? It shows average scores of 400+; however, it references files that no longer exist, like wang2015_eval.py. I'm using the run_atari.py script, with dueling off but otherwise default, and getting an average of just over 18 after 10M frames (the default). I'm trying to implement DQN, but most of the code I find online has subtle bugs. It's important to have something out there to reference that has reproducible results!
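Regarding the memory point above, a rough back-of-the-envelope estimate, assuming the usual DeepMind preprocessing of 84x84x4 stacked grayscale frames per observation:

# Rough replay-buffer memory estimate for 84x84x4 stacked frames.
frame_bytes_uint8 = 84 * 84 * 4               # ~28 KB per observation as uint8
frame_bytes_float32 = frame_bytes_uint8 * 4   # ~113 KB per observation as float32

for size in (int(1e4), int(1e6)):
    print(size,
          "uint8: %.1f GB" % (size * frame_bytes_uint8 / 1e9),
          "float32: %.1f GB" % (size * frame_bytes_float32 / 1e9))

# With ScaledFloatFrame (float32), a 1e6-transition buffer needs on the order of
# 100+ GB just for the observations, which is why storing uint8 frames and
# scaling inside the model is the usual workaround.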

ppwwyyxx (Contributor) commented May 14, 2018

@benbotto The implementation I open sourced two years ago (https://github.com/ppwwyyxx/tensorpack/tree/master/examples/DeepQNetwork) can reproduce a 400+ average score on Breakout within 10 hours on one GTX 1080 Ti.

@benbotto

Thank you @ppwwyyxx, I'll definitely run your implementation and compare the results against my own. I'm able to reproduce the 400 score in my own vanilla DQN code as well, but I'm running into trouble with Prioritized Experience Replay. Yours is the only implementation I know of that uses PER and takes the importance-sampling weights into account: most forgo that last part. I've found this implementation, which does not correctly normalize the weights. There's also this one, which ignores the IS weights altogether. The baselines implementation looks right to me, aside from a minor off-by-one bug that's awaiting a pull. That said, it would be nice to be able to reliably reproduce the reported numbers in the baselines-results repository!
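For reference, a minimal sketch of the prioritized sampling and importance-sampling correction from Schaul et al. (2015); names and the normalization choice here are illustrative, not the baselines API:

import numpy as np

# priorities: 1-D numpy array of positive per-transition priorities.
def sample_with_is_weights(priorities, batch_size, alpha=0.6, beta=0.4):
    # Sampling distribution: P(i) = p_i^alpha / sum_k p_k^alpha
    probs = priorities ** alpha
    probs /= probs.sum()

    n = len(priorities)
    idx = np.random.choice(n, size=batch_size, p=probs)

    # IS weights w_i = (N * P(i))^(-beta), normalized by the maximum possible
    # weight (the lowest-probability transition) so they only scale updates down.
    # Skipping or botching this normalization is the kind of subtle bug above.
    weights = (n * probs[idx]) ** (-beta)
    max_weight = (n * probs.min()) ** (-beta)
    weights /= max_weight
    return idx, weights

# The weights are then applied per sample to the TD-error loss, e.g.
# loss = (weights * huber(td_errors)).mean()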
