Difficulty Reproducing HalfCheetah-v2 SAC Results #128

Open
xanderdunn opened this issue Feb 21, 2021 · 6 comments

xanderdunn commented Feb 21, 2021

Huge thanks for providing this implementation, it's very high quality.

I'm having difficulty reproducing the results of the original SAC paper using the provided examples/sac.py script.

The paper reports a mean return of 15,000 in 3M steps (blue and orange lines are SAC):
[Screenshot: learning curves from the SAC paper]

My runs on the unmodified examples/sac.py script appear to be considerably less sample efficient:
[Screenshot: my training curves from the unmodified examples/sac.py]

My runs pretty consistently reach an average return of 13,000 after 10M steps. They might eventually reach 15,000 if left to run for millions of steps more, but they are taking over 3x the paper's step count to reach a lower return (13k vs. the paper's 15k).

I have found that results can vary greatly from run to run. Notice the pink line in the chart above, which does poorly. Is the paper doing many runs and reporting the best? I didn't see this mentioned in the Experiments section of the paper.

It appears to me that the hyperparameters shown in the paper are the same as those in the script, which I have not modified:
[Screenshot: hyperparameter table from the paper]

Am I interpreting the "num total steps" and "Returns Mean" correctly? Do you know what might cause this difference in sample efficiency and final return?

vitchyr (Collaborator) commented Feb 22, 2021

Hi, thanks for pointing this out. One possible cause for this difference is that this implementation alternates between sampling entire trajectories and taking gradient steps, whereas the original SAC paper alternates between one environment step and one gradient step. It's hard to compare the two exactly, but I'm guessing that something small like increasing num_trains_per_train_loop would compensate for this difference.
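
For readers trying this, that knob lives in the variant dict of examples/sac.py. A minimal sketch of the change follows; the key names and defaults are taken from the rlkit repo at the time and may differ in your copy:

# Minimal sketch of the relevant block in examples/sac.py (key names
# assumed from the rlkit repo; check your local copy before relying on them).
variant = dict(
    algorithm="SAC",
    layer_size=256,
    replay_buffer_size=int(1e6),
    algorithm_kwargs=dict(
        num_epochs=3000,
        num_eval_steps_per_epoch=5000,
        num_expl_steps_per_train_loop=1000,  # env steps collected per train loop
        num_trains_per_train_loop=1000,      # default: one gradient step per env step
        min_num_steps_before_training=1000,
        max_path_length=1000,
        batch_size=256,
    ),
)
# e.g. take two gradient steps per collected environment step:
variant["algorithm_kwargs"]["num_trains_per_train_loop"] = 2000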

Other possible causes are differences in network initialization, or very minor differences in the Adam optimizer implementation (I've seen people discuss the latter, though I don't particularly suspect it).

xanderdunn (Author)

@vitchyr Thanks very much, I will try increasing num_trains_per_train_loop.

I don't see any mention in the SAC paper of how the networks' weights were initialized. I might look at the official implementation to see if it differs.
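
In case it helps anyone comparing the two: a hedged sketch of forcing Xavier/Glorot initialization on one of rlkit's Q-networks. The kwarg names (hidden_init, b_init_value) and the stated defaults are assumptions read off rlkit's Mlp code and may differ by version:

import torch.nn.init as init
from rlkit.torch.networks import ConcatMlp  # FlattenMlp in older rlkit versions

obs_dim, action_dim = 17, 6  # HalfCheetah-v2 observation/action dimensions

def xavier_init(tensor):
    # Mirror TF's glorot_uniform default, which softlearning presumably
    # inherits from its Dense layers (an assumption, not verified here).
    init.xavier_uniform_(tensor)

qf1 = ConcatMlp(
    input_size=obs_dim + action_dim,
    output_size=1,
    hidden_sizes=[256, 256],
    hidden_init=xavier_init,  # overrides rlkit's fan-in default (assumption)
    b_init_value=0.0,         # rlkit's default bias constant is 0.1 (assumption)
)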

xanderdunn (Author)

What values of num_trains_per_train_loop would you recommend trying? With values 1000-3000 I'm not seeing a large difference in sample efficiency:

[Screenshot: rlkit training curves with num_trains_per_train_loop from 1000 to 3000]

Light blue is the default 1000; the others are 2000 or 3000. The best I'm seeing by step 3M is a mean return of 10.2k, vs. the paper's 15k.

vitchyr (Collaborator) commented Feb 22, 2021

Thanks for trying that. My main suspicion, then, is that the difference between batch data collection and interleaved data collection is causing the gap. If you want to investigate this, replace the exploration path collector with a step collector and replace the batch RL algorithm with an online RL algorithm. It might take a few more edits to get it to run, but these components should be fairly plug-and-play.
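
For anyone attempting this, a rough sketch of the swap edited against examples/sac.py. The class names and import paths are assumptions taken from the rlkit repo and may differ by version:

# MdpStepCollector and TorchOnlineRLAlgorithm are assumed from the rlkit repo.
# expl_env, eval_env, policy, trainer, eval_path_collector, replay_buffer,
# and variant are all defined earlier in examples/sac.py.
from rlkit.samplers.data_collector.step_collector import MdpStepCollector
from rlkit.torch.torch_rl_algorithm import TorchOnlineRLAlgorithm

# Exploration now collects one environment step at a time instead of whole
# trajectories; evaluation can keep the existing path collector.
expl_step_collector = MdpStepCollector(expl_env, policy)

algorithm = TorchOnlineRLAlgorithm(
    trainer=trainer,
    exploration_env=expl_env,
    evaluation_env=eval_env,
    exploration_data_collector=expl_step_collector,
    evaluation_data_collector=eval_path_collector,
    replay_buffer=replay_buffer,
    **variant["algorithm_kwargs"],
)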

xanderdunn (Author)

Thanks again for your help @vitchyr.

It looks like this issue in the softlearning repo is related: rail-berkeley/softlearning#75

However, I managed to get the same experiment running in softlearning and found that the results matched those in the paper. Running this:

softlearning run_example_local examples.development \
    --algorithm SAC \
    --universe gym \
    --domain HalfCheetah \
    --task v2 \
    --exp-name my-sac-experiment-1 \
    --checkpoint-frequency 1000

I got these results on four different seeds:
[Screenshot: softlearning training curves on four seeds]

These results match the paper's, reaching ~15,000 mean return within the first 3M timesteps. The evaluation mean return was >15k for all runs. Note that each of these runs took 10.7 hours.

Compare to rlkit runs with four different values of num_trains_per_train_loop:
[Screenshot: rlkit training curves with four values of num_trains_per_train_loop]

Mean return on the first 3M timesteps ranges from 6,200 to 11,000. Because of the high values of num_trains_per_train_loop, these runs also took longer to compute. The best-performing one, with num_trains_per_train_loop == 5000, took 14 hours on the same hardware.

rlkit has more RL algorithms implemented and is better maintained, but for now I will continue with the TensorFlow implementation, since the baseline is immediately reproducible there. Sample and computational efficiency are important for our work.

ZhenhuiTang

Hi, where can I see the results when I run "python3 examples/ddpg.py"? I could not find the 'output' file.
