Beta distribution as policy for environments with bounded continuous action spaces [feature request] #112
Hello,
This is not planned, but we are open to a PR. Also, as for the Huber loss (see #95), we would need to run several benchmarks to assess the utility of such a feature before merging it. |
Hello, I'm working on continuous control problems with asymmetric, bounded continuous action spaces. While Gaussian policies offer decent performance, training often takes a long time and the action distribution often does not fully match the problem space. Some real-world continuous control problems would benefit from this, mainly mechanical engine parts control or industrial machine optimization (e.g. calibration). |
I'm testing the following (draft) implementation:

import tensorflow as tf

from stable_baselines.common.distributions import ProbabilityDistribution


class BetaProbabilityDistribution(ProbabilityDistribution):
    def __init__(self, flat):
        self.flat = flat
        # as per http://proceedings.mlr.press/v70/chou17a/chou17a.pdf
        # softplus + 1 keeps both concentration parameters above 1 (unimodal Beta)
        alpha = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        beta = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        self.dist = tf.distributions.Beta(concentration1=alpha, concentration0=beta,
                                          validate_args=True, allow_nan_stats=False)

    def flatparam(self):
        return self.flat

    def mode(self):
        return self.dist.mode()

    def neglogp(self, x):
        return tf.reduce_sum(-self.dist.log_prob(x), axis=-1)

    def kl(self, other):
        assert isinstance(other, BetaProbabilityDistribution)
        return self.dist.kl_divergence(other.dist)

    def entropy(self):
        return self.dist.entropy()

    def sample(self):
        return self.dist.sample()

    @classmethod
    def fromflat(cls, flat):
        return cls(flat)

For now I've been able to run it with a custom policy like:

pdtype = BetaProbabilityDistributionType(ac_space)
...
obs = self.processed_x
with tf.variable_scope("model"):
    x = obs
    ...
    x = tf.nn.relu(tf.layers.dense(x, 128, name='pi_fc' + str(i),
                                   kernel_initializer=U.normc_initializer(1.0)))
    self.policy = tf.layers.dense(x, ac_space.shape[0], name='pi')
    self.proba_distribution = pdtype.proba_distribution_from_flat(x)

    x = obs
    ...
    x = tf.nn.relu(tf.layers.dense(x, 128, name='vf_fc' + str(i),
                                   kernel_initializer=U.normc_initializer(1.0)))
    value_fn = tf.layers.dense(x, 1, name='vf')
    self.q_value = tf.layers.dense(value_fn, 1, name='q')
...

I'm now running it with PPO1 & PPO2 against my benchmark environment (asymmetric, bounded, continuous action space) to see how it compares with the Gaussian policy. I'm running into trouble with TRPO, but I didn't have time to investigate further. Note: it still requires rescaling actions from [0, 1] to the environment's action space. This can be done manually, or a custom post-processing mechanism for the action could be added. |
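For reference, here is a minimal sketch of what the BetaProbabilityDistributionType used above could look like. It is an assumption modeled on stable-baselines' DiagGaussianProbabilityDistributionType, not code from the branch; the method names follow the ProbabilityDistributionType interface in stable_baselines/common/distributions.py, and it assumes the BetaProbabilityDistribution class above is in scope.

```python
import tensorflow as tf

from stable_baselines.common.distributions import ProbabilityDistributionType


class BetaProbabilityDistributionType(ProbabilityDistributionType):
    """Hypothetical pdtype wiring BetaProbabilityDistribution into the usual API."""

    def __init__(self, ac_space):
        # one Beta distribution per action dimension
        self.size = ac_space.shape[0]

    def probability_distribution_class(self):
        return BetaProbabilityDistribution

    def proba_distribution_from_flat(self, flat):
        # the Beta distribution builds its own alpha/beta layers from the latent vector
        return self.probability_distribution_class()(flat)

    def param_shape(self):
        # assumes the flat vector holds an alpha and a beta per action dimension
        return [2 * self.size]

    def sample_shape(self):
        return [self.size]

    def sample_dtype(self):
        return tf.float32
```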
Well, after testing a bit, it doesn't seem to improve overall performance, at least on my environment (I didn't test on classic control tasks). It does seem to converge, but it's slower and the average reward is lower than with the Gaussian policy. |
I would say you need some hyperparameter tuning... The parameters present in the current implementation were tuned for Gaussian policies, so it is not completely fair to compare them without tuning. |
@araffin I'll try to spend some time on that. Any idea which hyperparameters would be best to tune first? |
The best practice would be to use hyperband or hyperopt to do it automatically (see https://github.com/araffin/robotics-rl-srl#hyperparameter-search). Otherwise, with PPO, the hyperparameters that matter most in my experience are n_steps (together with nminibatches), ent_coef (entropy coefficient) and lam (GAE lambda coefficient). Additionally, you can also tune noptepochs, cliprange and the learning rate. |
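To make that concrete, here is a rough sketch (not from this thread) of tuning those PPO2 hyperparameters with hyperopt; the environment id, search ranges, training budget and evaluation scheme are placeholders to adapt to your own task.

```python
import gym
import numpy as np
from hyperopt import fmin, hp, tpe
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

ENV_ID = "Pendulum-v0"  # placeholder: any bounded continuous-control env


def evaluate(model, n_episodes=10):
    """Mean undiscounted return over a few episodes."""
    env = gym.make(ENV_ID)
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))


def objective(params):
    env = DummyVecEnv([lambda: gym.make(ENV_ID)])
    model = PPO2("MlpPolicy", env,
                 n_steps=int(params["n_steps"]),
                 nminibatches=int(params["nminibatches"]),
                 ent_coef=params["ent_coef"],
                 lam=params["lam"],
                 learning_rate=params["learning_rate"],
                 verbose=0)
    model.learn(total_timesteps=50000)
    return -evaluate(model)  # hyperopt minimizes, so negate the mean reward


search_space = {
    "n_steps": hp.choice("n_steps", [128, 256, 512, 1024]),
    "nminibatches": hp.choice("nminibatches", [4, 8, 16]),
    "ent_coef": hp.loguniform("ent_coef", np.log(1e-4), np.log(1e-1)),
    "lam": hp.uniform("lam", 0.9, 1.0),
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-5), np.log(1e-3)),
}

best = fmin(objective, search_space, algo=tpe.suggest, max_evals=20)
print(best)
```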
@antoine-galataud can you share your implementation of the beta distribution? |
@HareshMiriyala sure, I'll PR that soon. |
I don't think it's ready for a PR, so here is the branch link: https://github.com/antoine-galataud/stable-baselines/tree/beta-pd
This is based on the TensorFlow Beta implementation and the paper Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.
Usage: I didn't work on configuring which distribution to use in a generic manner; you have to use it in a custom policy. Ideally there should be a way to choose between Gaussian and Beta where the probability distribution type is selected.
You can refer to the example above about creating a custom policy that uses it. |
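As a usage sketch, plugging such a custom policy into PPO2 would look roughly like the following; BetaPolicy is a hypothetical name for an ActorCriticPolicy subclass built around the fragment shown earlier in the thread.

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# BetaPolicy: hypothetical custom policy class using BetaProbabilityDistributionType,
# as in the custom-policy fragment earlier in this thread.
env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])  # placeholder env id
model = PPO2(BetaPolicy, env, verbose=1)
model.learn(total_timesteps=100000)
```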
@antoine-galataud before submitting a PR, please look at the contribution guide #148 (that would save time ;)) |
@antoine-galataud Thanks a bunch! |
@antoine-galataud How are you handling scaling the sample from the Beta distribution (0, 1) to the action space bounds? |
@HareshMiriyala I do it like this:

def step(self, action):
    # map the policy output in [0, 1] to [low, high]
    action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
    ... |
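The same rescaling can also be kept outside the env itself with a small gym wrapper, so the original env stays untouched. This is just a sketch of that alternative; the wrapper name is made up.

```python
import gym
import numpy as np


class RescaleActionWrapper(gym.Wrapper):
    """Expose a [0, 1] action space to the agent and map samples to the env's real bounds."""

    def __init__(self, env):
        super(RescaleActionWrapper, self).__init__(env)
        self._low = env.action_space.low
        self._high = env.action_space.high
        self.action_space = gym.spaces.Box(low=0.0, high=1.0,
                                           shape=env.action_space.shape,
                                           dtype=np.float32)

    def step(self, action):
        # same formula as in the snippet above: [0, 1] -> [low, high]
        rescaled = self._low + np.asarray(action) * (self._high - self._low)
        return self.env.step(rescaled)
```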
Thanks, where do you make this change? I'm not so familiar with the code; can you guide me to which file and class you make this change in? |
@araffin: Could you give an estimate of how long it will take until the beta distribution is merged into master? Thanks in advance! |
@skervim well, I don't know, as I'm not in charge of implementing it or testing it. However, that does not mean you cannot test it before it is merged (cf. installing from source in the docs). |
@araffin: Sorry!! I misunderstood your message:
@antoine-galataud: I don't know if it helps you, but there is also a beta distribution implemented in |
@HareshMiriyala the step() function is one that you implement when you write a custom gym env. You can also modify an existing env to see how it goes. @skervim I couldn't dedicate time to testing (apart from a quick run with a custom env). I also have to write unit tests. If you have access to continuous control environments (MuJoCo, ...) to give it a try, that would definitely help. Apart from that, I'd like to provide better integration with action value scaling and distribution type selection based on configuration. Maybe later, if we see any benefit with this implementation. That doesn't prevent you from testing it as is, anyway. |
@skervim if you want to test on continuous envs for free (no MuJoCo licence required), I recommend the PyBullet envs (see the RL Baselines Zoo) |
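For reference, the Bullet envs can serve as drop-in replacements for MuJoCo tasks; a quick sketch, assuming pybullet is installed (pip install pybullet) and using one example env id:

```python
import gym
import pybullet_envs  # noqa: F401 -- importing registers the Bullet envs with gym

env = gym.make("HalfCheetahBulletEnv-v0")
obs = env.reset()
print(env.action_space)  # continuous, bounded action space
```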
@antoine-galataud Is it legit/better to perform this operation in the step function of the environment? Or is it better to put it in the network (updating the |
@HareshMiriyala I've seen rescaling performed in various parts of the code, depending on the env, the framework or the project. In my opinion, it shouldn't impact overall performance, as long as rescaling consistently gives the same output for a given input. |
There is an issue at OpenAI Baselines (here) about the advantages of a beta distribution over a diagonal Gaussian distribution + clipping.
The relevant paper: Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.
Is it possible to add a beta distribution to the repository?