Actions normalization. How to implement it? #678
Comments
As you pointed out, this information is in the documentation. Just below the line you mentioned there is an example with different action spaces. No need to touch the network here, just map your continuous actions from/to [-1, 1].
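A minimal sketch of that mapping (the function names and the `low`/`high` arrays are illustrative, not from the thread):

```python
import numpy as np

def scale_action(action, low, high):
    """Map an action from the environment's bounds [low, high] to [-1, 1]."""
    return 2.0 * (action - low) / (high - low) - 1.0

def unscale_action(scaled_action, low, high):
    """Map an action from [-1, 1] back to the environment's bounds [low, high]."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)
```

Typically `unscale_action` is applied inside the environment's `step()`, so the agent only ever sees a [-1, 1] action space.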
@Miffyli Thanks so much for the reply. If I understand correctly, I have to add code like the one you suggested inside the environment?
From the documentation, and also related: #112. So, as a summary:
This means that it is better to (avoiding tanh as the final layer):
Is that correct?
Not really... You are creating the environment, so you know in advance the limits of the actions.
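One common pattern is to wrap the rescaling above into a `gym.ActionWrapper` so the agent only sees [-1, 1]; a sketch (the class name is illustrative, and recent gym versions ship a similar `RescaleAction` wrapper if it is available in your version):

```python
import gym
import numpy as np

class RescaleActionWrapper(gym.ActionWrapper):
    """Expose a [-1, 1] action space to the agent and rescale to the real bounds."""

    def __init__(self, env):
        super(RescaleActionWrapper, self).__init__(env)
        # Remember the true bounds of the wrapped environment.
        self.low = env.action_space.low
        self.high = env.action_space.high
        # Advertise a normalized, symmetric action space to the agent.
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0,
                                           shape=env.action_space.shape,
                                           dtype=np.float32)

    def action(self, action):
        # Clip in case the policy output slightly leaves [-1, 1], then rescale.
        action = np.clip(action, -1.0, 1.0)
        return self.low + 0.5 * (action + 1.0) * (self.high - self.low)
```

The wrapped environment (e.g. `RescaleActionWrapper(MyEnv())`) is then what you pass to the algorithm.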
@araffin Okay, I'll give it a try, but in all the numerous attempts I made, I couldn't even get close to the sample efficiency of DeepMimic's PPO implementation (they don't use Adam as the optimizer but rather SGD with momentum, though I don't think that alone can make all this difference). In my opinion, their PPO implementation contains tricks that would be worth analyzing (especially regarding normalization), because they may lead to an improvement of the current PPO2 algorithm.
@araffin @Miffyli Okay, I tried both solutions, and adding normalization directly in the network (through a Tanh layer) gave me (by far) the best results. Telling the agent that the actions are in [-1, 1] without using Tanh simply does not work, because the actions saturate (at -1 or 1) from the very first evaluation step, which makes the actor immediately unstable (and therefore learning proceeds very slowly, if it proceeds at all).
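A toy illustration of that saturation effect (made-up raw outputs, not taken from any of the runs discussed here):

```python
import numpy as np

# Pretend these are raw, unbounded policy outputs early in training:
# values well outside [-1, 1] are common before the policy has adapted.
rng = np.random.RandomState(0)
raw_outputs = rng.normal(loc=0.0, scale=3.0, size=10000)

clipped = np.clip(raw_outputs, -1.0, 1.0)   # what happens without tanh
squashed = np.tanh(raw_outputs)             # smooth squashing with tanh

# With plain clipping, a large fraction of actions sits exactly at the bounds.
print("fraction saturated (clip):", np.mean(np.abs(clipped) == 1.0))
# tanh approaches the bounds only asymptotically, so essentially none do.
print("fraction saturated (tanh):", np.mean(np.abs(squashed) == 1.0))
```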
Just curious: what to do if the action limits are infinite?
In real life, infinity does not exist, so you usually have an upper bound that has some meaning (e.g. torque limits for a robot).
Beyond that, this is research; I think there is an issue about that: #461
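To make the torque-limit example above concrete (the joint count and numbers are hypothetical):

```python
import numpy as np
from gym import spaces

# Hypothetical torque limits (N*m) for a 3-joint arm; purely illustrative.
max_torque = np.array([10.0, 5.0, 2.0], dtype=np.float32)
action_space = spaces.Box(low=-max_torque, high=max_torque, dtype=np.float32)
```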
Hi, you said "adding normalization directly in the network (through a Tanh layer)". Do you mean using tanh as the activation function? I don't know how to do it, because I scaled the action to [-1, 1] and hit the same problem you described: "without using Tanh, it simply does not work, because the actions saturate (at -1 or 1) from the very first evaluation step, which makes the actor immediately unstable".
Hi,
I'm struggling to replace the PPO implementation in DeepMimic with the PPO2 from stable-baselines. The actor barely learns to walk for two or three steps and then falls. I now believe I have a problem with action normalization (which is fully supported in the DeepMimic PPO implementation). From the documentation of stable-baselines, I read:
How to do this is not clear at all. DeepMimic has continuous observation and action spaces (197 reals for the observation and 36 reals for the action). I know the bounds of the actions, but the problem is that the output of the MlpPolicy can take values far larger than 1 (in absolute value).
We have 2 hidden layers of size 1024 and 512. In theory I have to normalize the output of the network, but AFAIK it is not possible to add other layers after the output, am I right? And I think that the normalization must be done in the network (not in the environment) for backpropagation to work correctly.
So, how can I normalize the actions to the range [-1, 1]?
In addition, is there a way to check whether the action space is sampled uniformly (considering that it is a 36-dimensional space)?
Thanks in advance.
P.S.: Any example of how to handle this with OpenAI Baselines? I could not find any on the internet.
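Regarding the question above about whether the 36-dimensional action space is sampled uniformly, one simple check is to histogram each action dimension separately; a sketch, assuming you have collected the sampled actions into an `(n_samples, 36)` array (the function and variable names are illustrative):

```python
import numpy as np

def summarize_action_coverage(actions, low=-1.0, high=1.0, n_bins=20):
    """Print a per-dimension histogram summary of sampled actions.

    `actions` is an (n_samples, n_dims) array collected while running the
    policy; `low`/`high` are the normalized action bounds.
    """
    actions = np.asarray(actions)
    for dim in range(actions.shape[1]):
        counts, _ = np.histogram(actions[:, dim], bins=n_bins, range=(low, high))
        frac_at_bounds = np.mean((actions[:, dim] <= low) | (actions[:, dim] >= high))
        print("dim %2d | min %+.2f max %+.2f | saturated %.1f%% | hist %s"
              % (dim, actions[:, dim].min(), actions[:, dim].max(),
                 100 * frac_at_bounds, counts.tolist()))
```

Actions gathered by repeatedly calling `model.predict(obs)` during rollouts can be passed in; dimensions whose samples pile up at the bounds are the ones saturating.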