Some benchmarks on six MuJoCo-v2 environments for DDPG and TD3 #63
Comments
Thanks for this! I'll work on incorporating this into the documentation
later.
The NormalizedBoxEnv is so that the env expects actions in [-1, 1]. I think
this already happens by default for the gym envs, but if the action input
range is actually [-2, 2], then this will rescale the actions accordingly.
Like you said, another use case is clipping the noise. Frankly, I
bet you could remove it without affecting performance too much, but I haven't
tried.
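The rescaling described above can be sketched as follows. This is a minimal illustration of the idea, not rlkit's actual NormalizedBoxEnv code: the function name `rescale_action` and the example bounds of [-2, 2] are assumptions made for the example.

```python
def rescale_action(action, low, high):
    """Map each action component from [-1, 1] to [low_i, high_i].

    Hypothetical helper: components are clipped to [-1, 1] first, then
    mapped affinely onto the environment's own bounds.
    """
    scaled = []
    for a, lo, hi in zip(action, low, high):
        a = max(-1.0, min(1.0, a))  # clip to the normalized range
        scaled.append(lo + (a + 1.0) * 0.5 * (hi - lo))
    return scaled


# An env whose true action range is [-2, 2] in every component.
low = [-2.0, -2.0, -2.0]
high = [2.0, 2.0, 2.0]
print(rescale_action([0.5, -1.0, 3.0], low, high))  # [1.0, -2.0, 2.0]
```

When the environment's bounds are already [-1, 1], the mapping reduces to plain clipping, which matches the point that the wrapper then changes little for tanh policies.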
On Mon, Jun 24, 2019, 10:17 AM Daniel Seita ***@***.***> wrote:
One more thing: the examples script has code like this:
https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L22-L24
and we are using Tanh policies:
https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L35-L39
Just wondering, is the NormalizedBoxEnv needed in this case? Perhaps it
was just added to let us know what we *could* do with it later? By
default it seems like we are not normalizing observations or returns. Thus,
NormalizedBoxEnv would only serve to clip actions in [-1,1] for each
component. But the tanh will naturally force it in that range anyway.
The only other possibility I can think of for the NormalizedBoxEnv is if
the extra noise injected into the exploration policy causes the actions to
exceed the [-1,1] range in some components. But inserting some print
and assertion checks in the NormalizedBoxEnv step method and running python
examples/ddpg.py shows that no actions are outside the range, so
presumably the action+noise for exploration is clipped somewhere before
that.
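A rough sketch of the situation described above: a tanh output stays inside [-1, 1], additive exploration noise can push it outside, and a clip brings it back before the env sees it. The function `exploration_action`, the Gaussian noise, and `sigma=0.3` are assumptions for illustration; this is not rlkit's exploration-strategy code.

```python
import math
import random


def exploration_action(pre_activation, sigma, rng):
    """Tanh-squashed action plus Gaussian noise, clipped back into [-1, 1]."""
    noisy = []
    for z in pre_activation:
        a = math.tanh(z)                     # within [-1, 1] by construction
        a_noisy = a + rng.gauss(0.0, sigma)  # noise may leave [-1, 1]
        noisy.append(max(-1.0, min(1.0, a_noisy)))  # clip before stepping the env
    return noisy


rng = random.Random(0)
for _ in range(1000):
    action = exploration_action([3.0, -3.0, 0.0], sigma=0.3, rng=rng)
    # Same kind of check as the assertion placed in the env's step method.
    assert all(-1.0 <= a <= 1.0 for a in action)
```

If a clip like this is applied on the policy side, an assertion inside the wrapper would never fire, which is consistent with the observation above.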
Hi, I was wondering what the difference is between the exploration policy and the evaluation policy. Which one is commonly used in RL papers? That is, is the training curve in the SAC paper based on the exploration policy, which corresponds to 'expl/Average Returns'? And why do returns from the evaluation policy tend to be better than those from the exploration policy? I really look forward to your reply!
Hi @vitchyr
Thanks for the great code base. I recently ran some benchmarks here while searching for DDPG/TD3 implementations, after failing to get baselines working. I thought I'd share some results in case they would be useful to you or others.
For installation, I actually didn't entirely follow the installation instructions, but here's what I did:
I took the master branch from 5565dd5 and then adjusted examples/td3.py and examples/ddpg.py so that they also imported other MuJoCo environments. In addition, for TD3 only, I adjusted the hyperparameters in the "algorithm_kwargs" in the main method so that they matched DDPG. To be clear, DDPG uses this:
https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L71-L79
And TD3 uses this:
https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/td3.py#L104-L111
I simply modified the td3.py script so that all hyperparameters above match DDPG, so in particular I changed: number of epochs to 1000, eval steps to 1000, min steps before training to 10k, and batch size to 128.
If I am not mistaken, this should mean that both the exploration and evaluation policies will experience 1 million total steps over the course of training. Though, because evaluation by default will discard incomplete trajectories, sometimes the actual number of steps reported by the debugger will be less than 1 million.
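The step counts above work out as in the sketch below. It assumes 1000 steps per epoch for both policies (implied by the 1 million total) and models "discard incomplete trajectories" in the simplest way: a trajectory that would run past the per-epoch budget is dropped. The helper `steps_kept` and the episode lengths are made up for illustration and are not rlkit's path collector.

```python
def steps_kept(episode_lengths, budget):
    """Count steps from trajectories that finish within the step budget."""
    kept = 0
    for length in episode_lengths:
        if kept + length > budget:
            break  # this trajectory would be cut off, so it is discarded
        kept += length
    return kept


num_epochs = 1000
steps_per_epoch = 1000  # assumed for both exploration and evaluation
print(num_epochs * steps_per_epoch)       # 1000000
print(steps_kept([400, 400, 400], 1000))  # 800
```

In the second call the third episode cannot finish inside the 1000-step budget, so only 800 steps are counted for that epoch, which is why the reported total can fall below 1 million.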
I ran DDPG and TD3 on six MuJoCo-v2 environments, for four random seeds each. I adjusted the code so my directory structure looks like this:
For this I used the following plotting script, which I call like
python [script].py Ant-v2
and similarly for the other environments.
Here are the curves. Left is the exploration policy, and right is the evaluation policy.
The TL;DR is that TD3 wins on four of the environments, and DDPG wins on the other two. One of the environments TD3 doesn't win is InvertedPendulum, but it should be easy to reach 1000 there if the hyperparameters are tuned. Also, to reiterate the code comments, I do not report standard deviations since that would make the plots quite hard to read.
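For readers reproducing curves like these, one way to get a single line per algorithm is to average the per-epoch returns over the four seeds. The sketch below assumes that is what is wanted; `mean_curve` is a hypothetical helper and the numbers are invented, not taken from the runs reported here.

```python
def mean_curve(curves):
    """Average per-epoch returns across seeds, truncating to the shortest run."""
    n = min(len(c) for c in curves)
    return [sum(c[i] for c in curves) / len(curves) for i in range(n)]


# Made-up returns for four seeds, for illustration only.
seeds = [
    [0.0, 10.0, 20.0],
    [2.0, 12.0, 22.0],
    [4.0, 14.0, 24.0, 30.0],
    [6.0, 16.0, 26.0],
]
print(mean_curve(seeds))  # [3.0, 13.0, 23.0]
```

Truncating to the shortest run keeps the average well defined when one seed logged more epochs than the others.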
I thought this might be useful, if you want to point people towards some baselines. (I didn't see any upon a quick glance, but maybe you have them somewhere else?) Anyway, I hope this is useful or at least remotely interesting!