Saving and restoring DDPG agent #162
I have the same issue.
Same here :(
Hey, you can use tf.train.Saver, as described here: https://www.tensorflow.org/programmers_guide/saved_model
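For reference, the basic tf.train.Saver workflow from that guide looks roughly like this (a minimal TF1 sketch, not DDPG-specific; the variable, graph, and paths are placeholders):

```python
# Minimal TF1 save/restore sketch: build a graph, save its variables, restore them later.
import os
import tensorflow as tf

w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
saver = tf.train.Saver()
os.makedirs("./models", exist_ok=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... train ...
    saver.save(sess, "./models/my_model.ckpt")

# Later, with the same graph definition, restore instead of re-initializing.
with tf.Session() as sess:
    saver.restore(sess, "./models/my_model.ckpt")
    # ... run the trained policy ...
```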
To achieve this I had to modify the code to actually use the provided `saver`:

```diff
diff --git a/baselines/ddpg/training.py b/baselines/ddpg/training.py
index 74a9b8f..103010d 100644
--- a/baselines/ddpg/training.py
+++ b/baselines/ddpg/training.py
@@ -182,6 +182,10 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
             logger.dump_tabular()
             logger.info('')
             logdir = logger.get_dir()
+
+            if saver is not None:
+                saver.save(sess, os.path.join(logdir, 'checkpoint', '{}_reach.ckpt'.format(epoch_episodes)))
+
             if rank == 0 and logdir:
                 if hasattr(env, 'get_state'):
                     with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
```

This saves the current tf graph in a checkpoint subdirectory of the log directory.
I've managed to save the model and load it in a later session. However, upon load the returns are not jumping back to where they were when the model was trained/saved. For example, untrained the returns were 40; after training they were 100. I can load the saved model, and even inspect it to verify it's the newly saved model, yet the returns are back at 40. Thoughts? Here is the code change:

```diff
diff --git a/baselines/ddpg/training.py b/baselines/ddpg/training.py
index 74a9b8f..39ec84d 100644
--- a/baselines/ddpg/training.py
+++ b/baselines/ddpg/training.py
@@ -31,7 +32,7 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
     # Set up logging stuff only for a single worker.
     if rank == 0:
-        saver = tf.train.Saver()
+        saver = tf.train.Saver(max_to_keep=100)
     else:
         saver = None
@@ -41,9 +42,20 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
     episode_rewards_history = deque(maxlen=100)
     with U.single_threaded_session() as sess:
         # Prepare everything.
+
+        if restore == True:
+            logger.info("Restoring from saved model")
+            saver.restore(sess, tf.train.latest_checkpoint('./models/'))
+        else:
+            logger.info("Starting from scratch!")
+            sess.run(tf.global_variables_initializer())  # this should happen here and not in the agent, right?
+
         agent.initialize(sess)
         sess.graph.finalize()
         agent.reset()
         obs = env.reset()
         if eval_env is not None:
@@ -182,6 +194,11 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
             logger.dump_tabular()
             logger.info('')
             logdir = logger.get_dir()
+
+            logger.info('saving model...')
+            saver.save(sess, './models/my_model', global_step=epoch, write_meta_graph=False)
+            logger.info('done saving model!')
+
             if rank == 0 and logdir:
                 if hasattr(env, 'get_state'):
```

```diff
diff --git a/baselines/ddpg/ddpg.py b/baselines/ddpg/ddpg.py
index e2d4950..9e8a2ad 100644
--- a/baselines/ddpg/ddpg.py
+++ b/baselines/ddpg/ddpg.py
@@ -323,7 +323,7 @@ class DDPG(object):
     def initialize(self, sess):
         self.sess = sess
-        self.sess.run(tf.global_variables_initializer())
+        # self.sess.run(tf.global_variables_initializer())  # why does this happen here and not in the trainer?
         self.actor_optimizer.sync()
         self.critic_optimizer.sync()
         self.sess.run(self.target_init_updates)
```

If it's not clear: I moved the global variable initialization into the trainer, since we only want to run it on a fresh model.
agent.reset() initializes all parameters randomly.
@freeze888 Not sure I follow. Also, it looks like this method adjusts action/parameter noise in the DDPG class, which should not affect a restore?
If you select parameter noise as the noise type, then there is a variable in the agent that holds the current parameter noise standard deviation. To restore your agent, you need to either add this parameter to the saved session or manually initialize your parameter noise std to its latest value.
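One way to do that (a sketch, not code from this thread; it assumes the agent's adaptive noise object exposes a `current_stddev` attribute, as baselines' AdaptiveParamNoiseSpec does) is to pickle the stddev next to the TF checkpoint:

```python
# Sketch: persist and restore the adaptive parameter-noise stddev alongside the
# TF checkpoint. Assumes agent.param_noise has a `current_stddev` attribute;
# adjust the attribute name to match your version of the code.
import os
import pickle

def save_noise_state(agent, logdir, epoch):
    with open(os.path.join(logdir, 'param_noise_{}.pkl'.format(epoch)), 'wb') as f:
        pickle.dump(agent.param_noise.current_stddev, f)

def restore_noise_state(agent, path):
    with open(path, 'rb') as f:
        agent.param_noise.current_stddev = pickle.load(f)
```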
I am still confused: is editing the baselines code directly really the way to do this? In my opinion, that would be a flaw in the general design. Shouldn't there at least be something like a log method that gets called every n steps, so that one could just add saving there?
@keithmgould I tried your solution, but it somehow didn't really work as you described. Did you (or someone else) manage to correctly save & restore the DDPG agent? After all, training an RL algorithm without being able to load the trained model afterwards is not really useful in my opinion.
@joellutz I agree. I believe it should be like the HER implementation, which has support for training (saving the best policy, the latest policy, etc.) and for playing a policy. However, HER has its own DDPG implementation. It would probably be a good idea to standardize the contributed algorithms in this repo so that they share a similar structure and support basic features like saving and restoring policies.
Hey all, I have not solved the problem, so as a workaround I used a different (non-baselines) implementation of DDPG. If that's an option for you, check out Patrick Emami's implementation here. He also wrote a nice intro to the algorithm here. I've found it easy to save and restore, and it trains just fine. Not sure what's going on with the baselines version.
@keithmgould Yeah, I've already tried Emami's implementation. The code is much cleaner & shorter, but the techniques differ (e.g. no parameter noise, no observation & layer normalization, no critic L2 regularization). The fact that the baselines implementation worked quite quickly for my environment & has far better action-space exploration (possibly due to parameter noise) are strong arguments for baselines. But saving & restoring the model just doesn't want to work with baselines... I tried extending Emami's implementation with the missing techniques, but I couldn't add the adaptive param noise & observation normalization there, as I'm relatively new to tensorflow, tflearn & RL in general. Does anyone know how to do param noise & observation normalization in tflearn (in Emami's implementation)? To be honest, I don't really know what's going on in the tensorflow code of the baselines implementation.
@chow0214 I saved the noise param in a pkl file, and the neural net weights with TensorFlow's Saver. However, it still does not restore exactly where it left off.
Not having the ability to save/restore is a really big issue, I agree. Is there any way we can get an indication from the OpenAI people about whether this issue is something they intend to fix? It's also a big deal to switch away from baselines/DDPG, given the other things they have implemented, and I don't want to do so if they are going to fix the problem.
I did some test runs with my environment, trying to save & restore the model. I captured the last state right before training & saving the model, as well as the action selected right after that (in the new epoch cycle), so I had both the action and the state that action was based on. Then I terminated the training and started it again, this time restoring the model from the previously saved file. At the env.reset() (right at the beginning of the episode) I injected the previously captured state and captured the action selected right at the beginning, which should be exactly the same as the previously captured action, because the state each of them is based on is exactly the same.

The two actions were exactly the same when using no parameter or action noise at all, so I think the saving & restoring might have worked. With parameter noise, the two actions weren't exactly the same, but this is maybe due to the randomness of the parameter noise, which leads to a slightly different action. (Even though everything is seeded, the random generators are in a different state right at the beginning vs. right after the first epoch cycle.)

This is how I save & restore the model (all in baselines/ddpg/training.py, slightly modified compared to @keithmgould's solution).
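A minimal sketch of that kind of save/restore logic (not the exact code from this comment; it assumes a restore flag and a ./models/ checkpoint directory, similar to the earlier diff) could look like:

```python
# Sketch: save/restore helpers around a tf.train.Saver, assuming the DDPG graph
# has already been built and that checkpoints live under './models/'.
import os
import tensorflow as tf

CHECKPOINT_DIR = './models/'

def maybe_restore(saver, sess, restore):
    """Restore the latest checkpoint if requested, otherwise initialize from scratch."""
    latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
    if restore and latest is not None:
        saver.restore(sess, latest)
    else:
        sess.run(tf.global_variables_initializer())

def save_checkpoint(saver, sess, epoch):
    """Write a checkpoint at the end of an epoch."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    saver.save(sess, os.path.join(CHECKPOINT_DIR, 'ddpg_model'), global_step=epoch)
```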
The logs of my test run without any param or action noise confirmed the matching actions.
@joellutz Thanks. It seems like you included writing the meta graph in the save, compared to @keithmgould's solution. Did it make a difference when restoring from the meta graph? It says here that it's not necessary: https://stackoverflow.com/questions/36195454/what-is-the-tensorflow-checkpoint-meta-file
@jramak I don't know whether writing the meta graph is really necessary or not; I followed this tutorial, where they include it. I haven't tried whether it still works without it, as the process of capturing the state etc. is quite tedious for my environment.
@joellutz Thanks, your method is very effective. If you need to use the model in a service, I suggest using tf.InteractiveSession() instead of U.single_threaded_session().
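The restore step for serving might then look roughly like this (a sketch, assuming the DDPG graph has been rebuilt in the current process and a checkpoint exists under ./models/):

```python
# Sketch: restore a trained graph into an InteractiveSession for serving.
# Assumes the variables of the DDPG graph have already been constructed.
import tensorflow as tf

sess = tf.InteractiveSession()
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('./models/'))
# The InteractiveSession installs itself as the default session, so the restored
# policy can be evaluated repeatedly, e.g. from the request handlers of a service.
```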
@LaTinta This method will not save the observation-normalization statistics, though. For my own use cases, I've handled that part separately.
EDIT: Just saw this at the bottom of the README. Looks like they've added a way to save and load models.
EDIT 2: Looks like DDPG and PPO1 don't use it.
Seems like stable-baselines supports saving and loading the DDPG agent; see the example here: https://stable-baselines.readthedocs.io/en/master/modules/ddpg.html#example
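For reference, the save/load API shown in that stable-baselines example looks roughly like this (a sketch based on the linked docs for stable-baselines 2.x; the environment and hyperparameters are placeholders):

```python
# Sketch of stable-baselines (v2.x, TF1) DDPG save/load, per the linked example.
import gym
from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy

env = gym.make('Pendulum-v0')
model = DDPG(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
model.save("ddpg_pendulum")

# Later, restore the trained agent and query it.
model = DDPG.load("ddpg_pendulum")
obs = env.reset()
action, _states = model.predict(obs)
```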
FYI, as of commit 858afa8 these approaches no longer work; ddpg/training.py was removed. Looking at adapting them for the refactored codebase.
I have it working (I think) in the latest codebase, except that my model performs poorly after loading from a checkpoint. I used a similar approach to the one described above. I can't tell if that's due to an error in my code or just a bad training result. Does anyone have any ideas on how to verify that the model was loaded correctly?

The code changes are in this commit (I have a copy of openai baselines embedded in my repo for now). For my testing, I'm using the approach of loading a 0-length training session followed by a --play: https://github.com/jrjbertram/jsbsim_rl/blob/master/replay.sh

I verified that the code to reload the model is being executed. Where I'm confused/suspicious is that during training my rollout return and return_history curves look pretty good (a nice logarithmic shape), but when I replay, my actions seem fairly random and I don't seem to be acting in a way that would collect any reward.
@jrjbertram Are you normalizing your environment? If so, there is a known issue; see: https://github.com/openai/baselines#saving-loading-and-visualizing-models
@r7vme The stable-baselines library doesn't seem to solve the RunningMeanStd issue. It changes the normalize_observations default value to False, so the RunningMeanStd is not used. When I enable normalize_observations, the model does not restore correctly.
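Until that is fixed upstream, one workaround (a sketch, assuming the normalizer exposes mean, var, and count attributes like baselines' RunningMeanStd) is to persist the normalization statistics separately and load them back after restoring the checkpoint:

```python
# Sketch: persist observation-normalization statistics alongside the checkpoint.
# Assumes a numpy-based RunningMeanStd-like object with mean, var, and count.
import pickle

def save_obs_rms(obs_rms, path):
    with open(path, 'wb') as f:
        pickle.dump({'mean': obs_rms.mean, 'var': obs_rms.var, 'count': obs_rms.count}, f)

def load_obs_rms(obs_rms, path):
    with open(path, 'rb') as f:
        stats = pickle.load(f)
    obs_rms.mean, obs_rms.var, obs_rms.count = stats['mean'], stats['var'], stats['count']
```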
Has there been an update on how to properly save DDPG models?
Can someone please tell me how to save and load a model in the DDPG implementation?