Difficulty Reproducing HalfCheetah-v2 SAC Results #128

Open
xanderdunn opened this issue Feb 21, 2021 · 6 comments

xanderdunn commented Feb 21, 2021

Huge thanks for providing this implementation, it's very high quality.

I'm having difficulty reproducing the results of the original SAC paper using the provided examples/sac.py script.

The paper reports a mean return of 15,000 in 3M steps (blue and orange lines are SAC):
[Screenshot: learning curves from the SAC paper]

My runs on the unmodified examples/sac.py script appear to be considerably less sample efficient:
[Screenshot: my training curves from the unmodified examples/sac.py]

My runs pretty consistently reach an average return of 13,000 after 10M steps. They might eventually reach 15,000 if left to run for millions of steps more, but they are taking over 3x the paper's step count to reach a lower return (13k vs. the paper's 15k).

I have found that results can vary greatly from run to run. Notice the pink line in the chart above, which does poorly. Is the paper doing many runs and reporting the best? I didn't see this mentioned in the Experiments section of the paper.

It appears to me that the hyperparameters shown in the paper are the same as those in the script, which I have not modified:
[Screenshot: hyperparameter table from the paper]

Am I interpreting the "num total steps" and "Returns Mean" correctly? Do you know what might cause this difference in sample efficiency and final return?

vitchyr (Collaborator) commented Feb 22, 2021

Hi, thanks for pointing this out. One possible cause for this difference is that this implementation alternates between sampling entire trajectories and taking gradient steps, whereas the original SAC paper alternates between one environment step and one gradient step. It's hard to compare the two exactly, but I'm guessing that something small like increasing num_trains_per_train_loop would compensate for this difference.
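
For readers trying this, that knob lives in the variant dict of examples/sac.py. A minimal sketch of the change follows; the key names and defaults are taken from the rlkit repo at the time and may differ in your copy:

# Minimal sketch of the relevant block in examples/sac.py (key names
# assumed from the rlkit repo; check your local copy before relying on them).
variant = dict(
    algorithm="SAC",
    layer_size=256,
    replay_buffer_size=int(1e6),
    algorithm_kwargs=dict(
        num_epochs=3000,
        num_eval_steps_per_epoch=5000,
        num_expl_steps_per_train_loop=1000,  # env steps collected per train loop
        num_trains_per_train_loop=1000,      # default: one gradient step per env step
        min_num_steps_before_training=1000,
        max_path_length=1000,
        batch_size=256,
    ),
)
# e.g. take two gradient steps per collected environment step:
variant["algorithm_kwargs"]["num_trains_per_train_loop"] = 2000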

Other possible causes are differences in network initialization, or very minor differences in the Adam optimizer implementation (I've seen people discuss the latter, though I don't particularly suspect it).

xanderdunn (Author)

@vitchyr Thanks very much, I will try increasing num_trains_per_train_loop.

I don't see any mention in the SAC paper of how the networks' weights were initialized. I might look at the official implementation to see if it differs.
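
In case it helps anyone comparing the two: a hedged sketch of forcing Xavier/Glorot initialization on one of rlkit's Q-networks. The kwarg names (hidden_init, b_init_value) and the stated defaults are assumptions read off rlkit's Mlp code and may differ by version:

import torch.nn.init as init
from rlkit.torch.networks import ConcatMlp  # FlattenMlp in older rlkit versions

obs_dim, action_dim = 17, 6  # HalfCheetah-v2 observation/action dimensions

def xavier_init(tensor):
    # Mirror TF's glorot_uniform default, which softlearning presumably
    # inherits from its Dense layers (an assumption, not verified here).
    init.xavier_uniform_(tensor)

qf1 = ConcatMlp(
    input_size=obs_dim + action_dim,
    output_size=1,
    hidden_sizes=[256, 256],
    hidden_init=xavier_init,  # overrides rlkit's fan-in default (assumption)
    b_init_value=0.0,         # rlkit's default bias constant is 0.1 (assumption)
)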

xanderdunn (Author)

What values of num_trains_per_train_loop would you recommend trying? With values 1000-3000 I'm not seeing a large difference in sample efficiency:

[Screenshot: rlkit training curves with num_trains_per_train_loop from 1000 to 3000]

Light blue is the default 1000; the others are 2000 or 3000. The best I'm seeing by step 3M is a mean return of 10.2k, vs. the paper's 15k.

vitchyr (Collaborator) commented Feb 22, 2021

Thanks for trying that. My main suspicion, then, is that the difference between batch data collection and interleaved data collection is causing the gap. If you want to investigate this, replace the exploration path collector with a step collector and replace the batch RL algorithm with an online RL algorithm. It might take a few more edits to get it to run, but these components should be fairly plug-and-play.
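
For anyone attempting this, a rough sketch of the swap edited against examples/sac.py. The class names and import paths are assumptions taken from the rlkit repo and may differ by version:

# MdpStepCollector and TorchOnlineRLAlgorithm are assumed from the rlkit repo.
# expl_env, eval_env, policy, trainer, eval_path_collector, replay_buffer,
# and variant are all defined earlier in examples/sac.py.
from rlkit.samplers.data_collector.step_collector import MdpStepCollector
from rlkit.torch.torch_rl_algorithm import TorchOnlineRLAlgorithm

# Exploration now collects one environment step at a time instead of whole
# trajectories; evaluation can keep the existing path collector.
expl_step_collector = MdpStepCollector(expl_env, policy)

algorithm = TorchOnlineRLAlgorithm(
    trainer=trainer,
    exploration_env=expl_env,
    evaluation_env=eval_env,
    exploration_data_collector=expl_step_collector,
    evaluation_data_collector=eval_path_collector,
    replay_buffer=replay_buffer,
    **variant["algorithm_kwargs"],
)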

xanderdunn (Author)

Thanks again for your help @vitchyr.

It looks like this issue in the softlearning repo is related: rail-berkeley/softlearning#75

However, I managed to get the same experiment running in softlearning and found that the results matched those in the paper. Running this:

softlearning run_example_local examples.development \
    --algorithm SAC \
    --universe gym \
    --domain HalfCheetah \
    --task v2 \
    --exp-name my-sac-experiment-1 \
    --checkpoint-frequency 1000

I got these results on four different seeds:
[Screenshot: softlearning training curves on four seeds]

These results match the paper's, reaching ~15,000 mean return within the first 3M timesteps. The evaluation mean return was >15k for all runs. Note that each of these runs took 10.7 hours.

Compare to rlkit runs with four different values of num_trains_per_train_loop:
[Screenshot: rlkit training curves with four values of num_trains_per_train_loop]

Mean return on the first 3M timesteps ranges from 6,200 to 11,000. Because of the high values of num_trains_per_train_loop, these runs also took longer to compute. The best-performing one, with num_trains_per_train_loop == 5000, took 14 hours on the same hardware.

rlkit has more RL algorithms implemented and is better maintained, but for now I will continue with the TensorFlow implementation, since the baseline is immediately reproducible there. Sample and computational efficiency are important for our work.

ZhenhuiTang

Hi, where can I see the results when I run "python3 examples/ddpg.py"? I could not find the 'output' file.
