[PPO] Trajectory action out of range #564
Comments
I'm getting the exact same problem with my environment. Were you able to find a solution?
@ZakSingh Not yet. The weird thing is that I don't run into the same problem with the DQN.
@summer-yue PTAL?
@egonina current rotation; PTAL?
Perhaps
I think the problem is in the construction of the Discrete output distribution here. Can you link to a gist with the full traceback of the error?
I don't think this will be helpful - Stack Trace. Also, the action was expected to be in range [-1, 1], but the policy produced (-1.2548504, 0.55205715). As mentioned here, PPO does not handle action bound clipping.
Have you tried the workaround in #216? @kuanghuei can you PTAL as well since you're more familiar with PPO and have context on previous issues. Thanks!
In my case, I just clipped the action values in my env.
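A minimal sketch of that workaround, assuming a generic environment wrapper; the class and its use of the action spec's bounds are illustrative, not the commenter's actual code:

```python
import numpy as np

class ActionClippingWrapper:
  """Hypothetical wrapper that clips incoming actions to the action spec."""

  def __init__(self, env):
    self._env = env

  def step(self, action):
    spec = self._env.action_spec()
    # Clip to the spec's bounds before the wrapped env ever sees the action.
    clipped = np.clip(action, spec.minimum, spec.maximum)
    return self._env.step(clipped)
```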
You can alternatively pass a custom `discrete_projection_net` factory to the `ActorDistributionNetwork`. For example, if you are using discrete actions, the default is:
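A sketch of that default, assuming the stock TF-Agents factory around `CategoricalProjectionNetwork` (keyword names may vary across versions):

```python
from tf_agents.networks import categorical_projection_network

def _categorical_projection_net(action_spec, logits_init_output_factor=0.1):
  # Default: project the network's output into a categorical distribution
  # over the discrete actions, with small initial logit weights.
  return categorical_projection_network.CategoricalProjectionNetwork(
      action_spec, logits_init_output_factor=logits_init_output_factor)
```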
But instead you can use something like:
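For instance, a hedged sketch of a custom factory (`my_projection_net` and its initialization tweak are hypothetical; `observation_spec` and `action_spec` are assumed to be defined elsewhere):

```python
from tf_agents.networks import (actor_distribution_network,
                                categorical_projection_network)

def my_projection_net(action_spec):
  # Hypothetical: same projection network, different logits initialization.
  return categorical_projection_network.CategoricalProjectionNetwork(
      action_spec, logits_init_output_factor=1.0)

actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec,  # assumed to be defined elsewhere
    action_spec,
    discrete_projection_net=my_projection_net)
```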
For a continuous network you could instead emit a distribution that respects the action bounds.
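One way to do that in TF-Agents is a projection factory built on `NormalProjectionNetwork` with its `scale_distribution` flag, which squashes samples to the spec's bounds; this sketch is an assumption about what the commenter had in mind:

```python
from tf_agents.networks import normal_projection_network

def bounded_normal_projection_net(action_spec):
  return normal_projection_network.NormalProjectionNetwork(
      action_spec,
      # Squash the distribution to the action spec's [min, max] bounds.
      scale_distribution=True)
```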
You could also just build a complete custom network yourself.
Hello,
I'm trying to implement the PPO agent using a custom environment with a Discrete space bounded to [0, 4), but the agent's policy is choosing actions out of that range.
action_space = spaces.Discrete(4)
I created two new networks (an actor network and a value network) and verified that the bounds of the action_spec from the resulting agent are [0, 4).
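A sketch of what that setup might look like (the original code was not included in the scrape; the layer sizes, learning rate, and `train_env` are assumptions):

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.networks import actor_distribution_network, value_network

# train_env: assumed to be a TFPyEnvironment wrapping the custom env.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(100,))

value_net = value_network.ValueNetwork(
    train_env.observation_spec(),
    fc_layer_params=(100,))

agent = ppo_agent.PPOAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    actor_net=actor_net,
    value_net=value_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
```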
However, during my training loop:
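A hedged sketch of such a loop (the original was not shown; `num_iterations`, `collect_driver`, and `replay_buffer` are assumed helpers):

```python
for _ in range(num_iterations):
  # Collect a few episodes with the agent's collect policy.
  collect_driver.run()
  # Train on everything collected so far, then reset the buffer
  # since PPO is an on-policy algorithm.
  trajectories = replay_buffer.gather_all()
  train_loss = agent.train(experience=trajectories)
  replay_buffer.clear()
```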
The loop would run anywhere between 0 and 400 iterations with no problems, but eventually I end up getting an InvalidArgumentError:
InvalidArgumentError: Received a label value of 4 which is outside the valid range of [0, 4). Label values: 4 4...[Op:SparseSoftmaxCrossEntropyWithLogits]
Upon further inspection of the trajectories, it seems that the agent's policy is outputting values outside the action bounds:
action=<tf.Tensor: shape=(1, 60), dtype=int64, numpy= array([[4, 4, ...], dtype=int64)>
I initially thought it was a problem with the networks' activation functions, so I created a bounded ReLU function to limit the output, but I still hit the same problem.
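Presumably something like the following (a hypothetical reconstruction; the cap of 3.0 is an assumption matching the intended [0, 4) range):

```python
import tensorflow as tf

def bounded_relu(x, max_value=3.0):
  # ReLU clipped from above so activations stay within [0, max_value].
  return tf.minimum(tf.nn.relu(x), max_value)
```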
Would this be an issue with my environment or with the networks I set up?