Cannot reproduce Breakout benchmark using Double DQN #176
Same here, I get similar average rewards. I also ran the deepq/experiments/run_atari.py example without any modifications and it still just converges to about 11 in around 5 million steps. Any help or suggestions would be appreciated.
I observe the same problem when training with the "learn" function in "simple.py", which is what "run_atari.py" uses. When training with "deepq/experiments/atari/train.py" instead, it works fine.
File "train.py", line 244, in |
I have been running train.py in |
@gbg141 Part of your issue might be that the rewards from the environment wrapped with |
@btaba Did you try this, and did it work for you?
@kdu4108 That actually didn't work for me. I also tried a
Edit: trained on commit
@btaba Okay, thanks for the response. I tried training the default Pong using that version and successfully reproduced their results. Out of curiosity, have you tried to reproduce results on any other environments using that commit? Or have you tested any later commits that might contain fixes for the Breakout reward difference?
@kdu4108 I only tried that commit on Breakout and BeamRider, and was not able to reproduce the results.
I'm facing the same issue.
@ashishm-io You can try. Don't forget to log the actual episode rewards and not the clipped ones. I find that this DQN implementation actually works. From there, it's probably easier to add double Q-learning and dueling networks.
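For anyone checking their own runs, here is a minimal sketch of how the unclipped episode returns could be logged with a small Gym wrapper. The wrapper name is made up for illustration, and it has to be applied before any reward-clipping wrapper (for example before Baselines' reward clipping in its Atari wrappers) so that it sees the raw Atari scores:

```python
import gym


class RawEpisodeReturnLogger(gym.Wrapper):
    """Records the unclipped return of each episode. Apply this wrapper
    *before* any reward-clipping wrapper so it sees the raw Atari scores.
    (Illustrative sketch, not part of Baselines.)"""

    def __init__(self, env):
        super(RawEpisodeReturnLogger, self).__init__(env)
        self._return = 0.0
        self.episode_returns = []

    def reset(self, **kwargs):
        self._return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._return += reward
        if done:
            self.episode_returns.append(self._return)
            info['raw_episode_return'] = self._return
        return obs, reward, done, info
```

With that in place, the numbers reported during training can be compared directly against the per-episode game scores other papers and implementations report.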
Shouldn't Baselines log both clipped and episode rewards by default? Isn't that an essential feature for comparing results with other implementations?
@ashishm-io Another difference is the size of the replay buffer. You might try bumping that to 1e6, because by default it's only 1e4. Note that in

@kdu4108 Yea, but Pong is the simplest of the Atari games as far as I know. In my implementation I achieve an average of over 20 in about 3 million frames. Breakout is significantly harder.

@btaba When you achieved the 250 average, that's the actual score, right? As opposed to the clipped score? And also, is that with or without episodic life? In other words, is that an average of 250 over one life, or over 5 lives?

OpenAI team: How do we reproduce what's reported in the baselines-results repository (https://github.com/openai/baselines-results/blob/master/dqn_results.ipynb)? It shows average scores of 400+; however, it references files that no longer exist, like
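For reference, a hedged sketch of what bumping the replay buffer size looks like with the deepq.learn API of roughly that era. The exact wrapper functions, defaults, and argument names varied between Baselines commits, so treat everything below as an assumption rather than the exact signature; the only point being made is buffer_size=1e6:

```python
from baselines import deepq
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

# Wrapper and hyperparameter names are from memory of the Baselines code of
# that era and may differ between commits.
env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'))

model = deepq.models.cnn_to_mlp(
    convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
    hiddens=[256],
    dueling=True,
)

act = deepq.learn(
    env,
    q_func=model,
    lr=1e-4,
    max_timesteps=int(1e7),
    buffer_size=int(1e6),        # the run_atari.py default discussed above is 1e4
    exploration_fraction=0.1,
    exploration_final_eps=0.01,
    train_freq=4,
    learning_starts=10000,
    target_network_update_freq=1000,
    gamma=0.99,
    prioritized_replay=True,
)
```

A 1e6 buffer at 84x84x4 uint8 frames takes tens of gigabytes of RAM unless frames are stored once and stacked lazily, which is worth keeping in mind when raising it.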
@benbotto The implementation I open-sourced two years ago (https://github.com/ppwwyyxx/tensorpack/tree/master/examples/DeepQNetwork) can reproduce a 400+ average score on Breakout within 10 hours on one GTX 1080 Ti.
Thank you @ppwwyyxx, I'll definitely run your implementation and compare the results against my own. I'm able to reproduce the 400 score as well in my code with vanilla DQN, but I'm running into trouble with Prioritized Experience Replay. This is the only implementation I know of that uses PER and takes the importance-sampling weights into account: most forgo that last part. I've found this implementation, which does not correctly normalize the weights. There's also this one, which ignores the IS weights altogether. The Baselines implementation looks right to me, aside from a minor off-by-one bug that's awaiting a pull request. That said, it would be nice to be able to reliably reproduce the numbers reported in the baselines-results repository!
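Since the disagreement between implementations is mostly about the importance-sampling weights, here is a small self-contained sketch of the weighting described in the prioritized experience replay paper, w_i = (N * P(i))^(-beta) normalized by the maximum weight. The function name and the flat priority array are illustrative, not any particular library's API:

```python
import numpy as np


def per_is_weights(priorities, sampled_idx, alpha=0.6, beta=0.4):
    """Importance-sampling weights for prioritized replay (Schaul et al., 2016):
    w_i = (N * P(i))^(-beta), divided by the maximum weight over the whole
    buffer so that gradient updates are only ever scaled down."""
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities ** alpha
    probs /= probs.sum()                          # sampling distribution P(i)
    n = len(priorities)
    weights = (n * probs[sampled_idx]) ** (-beta)
    max_weight = (n * probs.min()) ** (-beta)     # largest weight = rarest transition
    return weights / max_weight
```

The resulting weights multiply the per-sample TD errors in the loss, and beta is typically annealed from its initial value toward 1 over the course of training.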
I haven't been able to reproduce the results of the Breakout benchmark with Double DQN when using hyperparameter values similar to the ones presented in the original paper. After more than 20M observed frames (~100,000 episodes), the mean 100-episode reward still remains around 10, having reached a maximum value of 12.
In case I'm missing or getting something important wrong, here are the neural network configuration and the hyperparameter values I'm using:
Does anyone have an idea of what is going wrong? The analogous results presented in a Jupyter notebook in openai/baselines-results indicate that I should be able to get much better scores. Thanks in advance.
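For readers comparing implementations against this report, the target being discussed is the standard Double DQN update from van Hasselt et al. (2016), written here as a small NumPy sketch; the function and argument names are illustrative and not taken from Baselines:

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: y = r + gamma * (1 - done) *
    Q_target(s', argmax_a Q_online(s', a)). q_online_next and q_target_next
    are [batch, n_actions] arrays of Q-values for the next states under the
    online and target networks."""
    best_actions = np.argmax(q_online_next, axis=1)      # select actions with the online net
    batch_idx = np.arange(len(rewards))
    q_eval = q_target_next[batch_idx, best_actions]      # evaluate them with the target net
    dones = np.asarray(dones, dtype=np.float64)
    return rewards + gamma * (1.0 - dones) * q_eval
```

The only difference from vanilla DQN is that the action is chosen by the online network but evaluated by the target network, which reduces the overestimation bias.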