Saving and restoring DDPG agent #162

Open

Sumsamkhan opened this issue Oct 9, 2017 · 26 comments

@Sumsamkhan

Can someone please tell me how to save and load a model in the DDPG implementation?

@watts4speed

I have the same issue

@xmanatee

Same here :(

@hfurkanbozkurt

Hey, you can use tf.train.Saver as described here: https://www.tensorflow.org/programmers_guide/saved_model
Or you can return the agent from the train function. Keep in mind that before you return, you need to finish the episode; otherwise you cannot use the environment directly afterwards.
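
For reference, a minimal sketch of the tf.train.Saver pattern that guide describes (TF1 API; the checkpoint path is only a placeholder):

import tensorflow as tf

# Build the graph first (actor/critic networks, etc.), then create the saver.
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... train ...
    saver.save(sess, '/tmp/ddpg_model/model.ckpt')     # write a checkpoint

# Later, in a new process that has rebuilt the *same* graph:
with tf.Session() as sess:
    saver.restore(sess, '/tmp/ddpg_model/model.ckpt')  # restores all saved variables
    # ... run the policy ...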

@haudren

haudren commented May 29, 2018

To achieve this I had to modify the code to actually use the provided tf.train.Saver as follows:

diff --git a/baselines/ddpg/training.py b/baselines/ddpg/training.py
index 74a9b8f..103010d 100644
--- a/baselines/ddpg/training.py
+++ b/baselines/ddpg/training.py
@@ -182,6 +182,10 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
             logger.dump_tabular()
             logger.info('')
             logdir = logger.get_dir()
+
+            if saver is not None:
+                saver.save(sess, os.path.join(logdir, 'checkpoint', '{}_reach.ckpt'.format(epoch_episodes)))
+
             if rank == 0 and logdir:
                 if hasattr(env, 'get_state'):
                     with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:

This saves a checkpoint of the current TensorFlow session to /tmp/openai-xxx/checkpoint at every epoch.

@keithmgould

keithmgould commented Jun 14, 2018

I've managed to save the model and load it in a later session. However, upon loading, the returns do not jump back to where they were when the model was trained and saved. For example, untrained the returns were 40; after training they were 100. I can load the saved model, and even inspect it to verify it is the newly saved model, yet the returns are back at 40. Thoughts? Here is the code change:

diff --git a/baselines/ddpg/training.py b/baselines/ddpg/training.py
index 74a9b8f..39ec84d 100644
--- a/baselines/ddpg/training.py
+++ b/baselines/ddpg/training.py
@@ -31,7 +32,7 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa

     # Set up logging stuff only for a single worker.
     if rank == 0:
-        saver = tf.train.Saver()
+        saver = tf.train.Saver(max_to_keep=100)
     else:
         saver = None

@@ -41,9 +42,20 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
     episode_rewards_history = deque(maxlen=100)
     with U.single_threaded_session() as sess:
         # Prepare everything.
+
+        if restore == True:
+            logger.info("Restoring from saved model")
+            saver.restore(sess, tf.train.latest_checkpoint('./models/'))
+        else:
+            logger.info("Starting from scratch!")
+            sess.run(tf.global_variables_initializer()) # this should happen here and not in the agent right?
+
         agent.initialize(sess)

         sess.graph.finalize()

         agent.reset()
         obs = env.reset()
         if eval_env is not None:
@@ -182,6 +194,11 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
             logger.dump_tabular()
             logger.info('')
             logdir = logger.get_dir()
+
+            logger.info('saving model...')
+            saver.save(sess, './models/my_model', global_step=epoch, write_meta_graph=False)
+            logger.info('done saving model!')
+
             if rank == 0 and logdir:
                 if hasattr(env, 'get_state'):
diff --git a/baselines/ddpg/ddpg.py b/baselines/ddpg/ddpg.py
index e2d4950..9e8a2ad 100644
--- a/baselines/ddpg/ddpg.py
+++ b/baselines/ddpg/ddpg.py
@@ -323,7 +323,7 @@ class DDPG(object):

     def initialize(self, sess):
         self.sess = sess
-        self.sess.run(tf.global_variables_initializer())
+        # self.sess.run(tf.global_variables_initializer()) // why does this happen here and not in trainer?
         self.actor_optimizer.sync()
         self.critic_optimizer.sync()
         self.sess.run(self.target_init_updates)

If it's not clear, I moved the global variable initialization into the trainer since we only want to run it on a fresh model.

@freeze888

agent.reset() re-randomizes the perturbed policy parameters (when parameter noise is used).
I made a new function agent.reset_test() and use it instead of agent.reset():

# in the train() function:
saver.restore(sess, path)

agent.initialize_test(sess)

sess.graph.finalize()

agent.reset_test()
obs = env.reset()
# if eval_env is not None:
#     eval_obs = eval_env.reset()
done = False
episode_reward = 0.
episode_step = 0
episodes = 0
t = 0

# in ddpg.py:
def reset_test(self):
    # Reset internal state after an episode is complete,
    # but skip the parameter-noise perturbation that reset() applies.
    if self.action_noise is not None:
        self.action_noise.reset()
    # if self.param_noise is not None:
    #     self.sess.run(self.perturb_policy_ops, feed_dict={
    #         self.param_noise_stddev: self.param_noise.current_stddev,
    #     })

@keithmgould

@freeze888 Not sure I follow - agent.reset() is called after every episode; I don't understand how this method (or modifying it) could be the issue.

Also, it looks like this method only adjusts action/parameter noise in the DDPG class, which should not affect a restore?

@zhehuazhou

zhehuazhou commented Jul 4, 2018

If you select parameter noise as the noise type, then the agent has an attribute agent.param_noise.current_stddev, which is not a tensor, so when you save the trained agent and restore it, this value will not be recovered.

To restore your agent exactly, you need to either add this value to the saved state or manually initialize your parameter-noise stddev to its latest value.
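
One way to carry that value across runs is to dump it next to the TensorFlow checkpoint and push it back into the agent after restoring. A rough sketch, assuming agent is the DDPG instance from training.py and logdir is the checkpoint directory (the file name is just an example):

import os
import pickle

def save_param_noise_stddev(agent, logdir):
    # current_stddev is a plain Python float on AdaptiveParamNoiseSpec,
    # so tf.train.Saver will not include it in the checkpoint.
    if agent.param_noise is not None:
        with open(os.path.join(logdir, 'param_noise_stddev.pkl'), 'wb') as f:
            pickle.dump(agent.param_noise.current_stddev, f)

def restore_param_noise_stddev(agent, logdir):
    path = os.path.join(logdir, 'param_noise_stddev.pkl')
    if agent.param_noise is not None and os.path.exists(path):
        with open(path, 'rb') as f:
            agent.param_noise.current_stddev = pickle.load(f)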

@Daniel451

I am still confused: is editing the baselines code directly really the way to do this? In my opinion, that would be a flaw in the general design. Shouldn't there at least be something like a hook that gets called every n steps, so that one could just add saving there?
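
Something along these lines would probably be enough; note that save_callback is a hypothetical argument here, not part of the current train() signature:

# Hypothetical sketch: train() accepts an optional callback and calls it once per
# epoch, so checkpointing lives in user code instead of a patched training.py.
def train(nb_epochs, run_epoch, save_callback=None):
    for epoch in range(nb_epochs):
        run_epoch(epoch)              # the existing rollout/training work
        if save_callback is not None:
            save_callback(epoch)      # e.g. saver.save(sess, path, global_step=epoch)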

@joellutz

@keithmgould I tried your solution, but somehow it didn't really work, as you mentioned. Did you (or anyone else) manage to correctly save & restore the DDPG agent? After all, training an RL algorithm without being able to load the trained model afterwards is not very useful in my opinion.

@iSaran

iSaran commented Jul 19, 2018

@joellutz I agree. I believe it should be like the HER implementation, which supports training (saving the best policy, the latest policy, etc.) and playing a policy. However, HER has its own DDPG implementation. It would probably be a good idea to standardize the contributed algorithms in this repo so that they share a similar structure and support for basic features (like saving and restoring policies).

@keithmgould

Hey all,

Have not solved the problem, so as a workaround I used a different (non-baselines) implementation of DDPG. If that's an option for you, check out Patrick Emami's implementation here. He also wrote a nice intro to the algorithm, here. I've found it easy to save and restore, and it trains just fine. Not sure what's going on with the baselines version.

@joellutz

joellutz commented Jul 20, 2018

@keithmgould Yeah, I've already tried Emami's implementation. The code is much cleaner & shorter, but the techniques differ (e.g. no parameter noise, observation & layer normalization, or critic L2 regularization). The facts that the baselines implementation worked quite quickly for my environment & has by far better action-space exploration (possibly due to parameter noise) are strong arguments for baselines. But saving & restoring the model just doesn't want to work with baselines...

I tried extending Emami's implementation with the missing techniques, but I couldn't add the adaptive parameter noise & observation normalization there, as I'm relatively new to TensorFlow, TFLearn & RL in general. Does anyone know how to do parameter noise & observation normalization in TFLearn (in Emami's implementation)? To be honest, I don't really know what's going on with the TensorFlow code in the baselines implementation.

@jramak

jramak commented Jul 21, 2018

@chow0214 I saved the noise parameter in a pkl file, and the neural-net weights with TensorFlow's Saver. However, it still does not resume exactly where it left off.

@watts4speed

watts4speed commented Jul 23, 2018

Not having the ability to save/restore is a really big issue, I agree. Is there any way we can get an indication from the OpenAI people about whether this is something they intend to fix? It's also a big deal to switch away from baselines DDPG given the other stuff they have implemented, and I don't want to do so if they are going to fix the problem.

@joellutz

joellutz commented Jul 27, 2018

I did some test runs with my environment, trying to save & restore the model. I captured the last state right before training & saving the model, as well as the action selected right after that (in the new epoch cycle). So I captured an action and the state that action was based on. Then I terminated the training and started it again, this time restoring the model from the previously saved file. At the env.reset() (right at the beginning of the episode) I injected the previously captured state. Then I captured the action selected right at the beginning, which should be exactly the same as the previously captured action, because the state each of them is based on is exactly the same.

The two actions were exactly the same when using no parameter or action noise at all, so I think the saving & restoring worked. With parameter noise, the two actions weren't exactly the same, but this may be due to the randomness of the parameter noise, which leads to a slightly different action. (Even though everything is seeded, the random generators are in a different state right at the beginning vs. right after the first epoch cycle.)

This is how I save & restore the model (all in baselines/ddpg/training.py, slightly modified compared to @keithmgould's solution):

# ...
with U.single_threaded_session() as sess:
    # Prepare everything.
    
    if restore == True:
        logger.info("Restoring from saved model")
        saver = tf.train.import_meta_graph(savingModelPath + "ddpg_test_model.meta")
        saver.restore(sess, tf.train.latest_checkpoint(savingModelPath))
    else:
        logger.info("Starting from scratch!")
        sess.run(tf.global_variables_initializer()) # this should happen here and not in the agent right?


    agent.initialize(sess)
    sess.graph.finalize()

    # ...

            # in the epoch_cycles loop (after the training of the model)
            # Saving the trained model
            if saver is not None:
                logger.info("saving the trained model")
                start_time_save = time.time()
                saver.save(sess, savingModelPath + "ddpg_test_model")
                logger.info('runtime saving: {}s'.format(time.time() - start_time_save))

            logger.info('runtime epoch-cycle {0}: {1}s'.format(cycle, time.time() - start_time_cycle))

        mpi_size = MPI.COMM_WORLD.Get_size()
        # ...

The logs of my test run without any parameter or action noise:

# (last rollout step)
selected (unscaled) action: [-0.012 0.005 0.007 -0.002]
Training the Agent
saving the trained model
runtime saving: 2.47790503502s
runtime epoch-cycle 0: 722.832417965s
selected (unscaled) action: [ 0.001 -0.001 0.009 -0.035]
# (run aborted)

# (new run with restore=True and injected state)
Restoring from saved model
INFO:tensorflow:Restoring parameters from ~/Documents/saved_models_OpenAI_gym/ddpg_test_model
selected (unscaled) action: [ 0.001 -0.001 0.009 -0.035]
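
The same check expressed in code, assuming the captured observation/action pair was pickled to a file (probe.pkl is just an example name) and that agent is the DDPG instance with its pi(obs, apply_noise=..., compute_Q=...) method:

import pickle
import numpy as np

# First run, right before aborting: pickle the observation and the chosen action.
# with open('probe.pkl', 'wb') as f:
#     pickle.dump({'obs': obs, 'action': action}, f)

# Second run, right after restoring the model:
with open('probe.pkl', 'rb') as f:
    probe = pickle.load(f)

restored_action, _ = agent.pi(probe['obs'], apply_noise=False, compute_Q=False)
print('actions match:', np.allclose(restored_action, probe['action']))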

@jramak

jramak commented Jul 27, 2018

@joellutz thanks, it seems like you included writing the meta graph in the save, compared to @keithmgould's solution. Did it make a difference when restoring from the meta graph? It says here it's not necessary: https://stackoverflow.com/questions/36195454/what-is-the-tensorflow-checkpoint-meta-file

@joellutz

@jramak I don't know whether writing the meta graph is really necessary or not; I followed this tutorial, where they include it. I haven't tried whether it still works without, as the process of capturing the state etc. is quite tedious for my environment.

@LaTinta

LaTinta commented Aug 2, 2018

@joellutz Thanks~ Your method is very effective.
If someone wants to save and load a DDPG model, these are the steps:

  1. use saver.save() to save the tf session;
  2. move sess.run(tf.global_variables_initializer()) from initialize() in DDPG to wherever you want to initialize the model parameters;
  3. use saver.restore() to load the tf session, and use initialize() to load the session into the DDPG model.

If you need to use the model in a service, I suggest using tf.InteractiveSession() instead of U.single_threaded_session().
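
A rough sketch of step 3 in that service setting; build_agent() is a placeholder for whatever code rebuilds the same DDPG graph as in training, and the checkpoint path is an example:

import tensorflow as tf

sess = tf.InteractiveSession()
agent = build_agent()            # must recreate the exact actor/critic graph used in training
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('./models/'))
agent.initialize(sess)           # with the global_variables_initializer() call removed (step 2)

# action, _ = agent.pi(obs, apply_noise=False, compute_Q=False)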

@brendenpetersen

brendenpetersen commented Sep 27, 2018

@LaTinta This method will not save the information from the RunningMeanStd object, correct? So if normalize_observations=True (the default value), the agent used at evaluation will not be the same as the one in training.

For my own use cases, I've set normalize_observations=False to avoid this issue (and because my observation space was already normalized, so it ended up hurting performance anyway). But with PPO1, for example, the RunningMeanStd object is always created (there is no setting to turn it off), so I don't think offline evaluation is possible without changing the code.

EDIT: Just saw this at the bottom of the README. Looks like they've added a TfRunningMeanStd class that saves the necessary state as part of the compute graph. You still have to change their code, but it should be trivial:

NOTE: At the moment Mujoco training uses VecNormalize wrapper for the environment which is not being saved correctly; so loading the models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd by TfRunningMeanStd in baselines/common/vec_env/vec_normalize.py. This way, mean and std of environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way - hence not including it by default

EDIT 2: Looks like DDPG and PPO1 don't use VecNormalize, but rather mpi_running_mean_std.RunningMeanStd, which has no TensorFlow analog. So I still currently see no way of saving an observation-normalizing DDPG/PPO1 policy without more significant code changes.
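
For the mpi_running_mean_std case, the underlying idea of the README workaround can still be applied by hand: keep the normalization statistics in TensorFlow variables so the Saver checkpoints them. A bare-bones illustration of that idea (not the library's TfRunningMeanStd class, just the technique):

import numpy as np
import tensorflow as tf

class TfBackedRunningMeanStd(object):
    """Running mean/std kept in TF variables so tf.train.Saver checkpoints them."""

    def __init__(self, shape, epsilon=1e-4, scope='obs_rms'):
        with tf.variable_scope(scope):
            self.mean = tf.get_variable('mean', shape, initializer=tf.zeros_initializer(), trainable=False)
            self.var = tf.get_variable('var', shape, initializer=tf.ones_initializer(), trainable=False)
            self.count = tf.get_variable('count', (), initializer=tf.constant_initializer(epsilon), trainable=False)
        self._new_mean = tf.placeholder(tf.float32, shape)
        self._new_var = tf.placeholder(tf.float32, shape)
        self._new_count = tf.placeholder(tf.float32, ())
        self._assign = tf.group(tf.assign(self.mean, self._new_mean),
                                tf.assign(self.var, self._new_var),
                                tf.assign(self.count, self._new_count))

    def update(self, sess, batch):
        # Standard parallel-moments update (same math as RunningMeanStd),
        # but the result is written back into the TF variables above.
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), float(batch.shape[0])
        mean, var, count = sess.run([self.mean, self.var, self.count])
        delta = b_mean - mean
        total = count + b_count
        new_mean = mean + delta * b_count / total
        m2 = var * count + b_var * b_count + np.square(delta) * count * b_count / total
        sess.run(self._assign, {self._new_mean: new_mean,
                                self._new_var: m2 / total,
                                self._new_count: total})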

@r7vme

r7vme commented Oct 1, 2018

It seems the stable-baselines DDPG implementation provides the ability to save/load:

https://stable-baselines.readthedocs.io/en/master/modules/ddpg.html#example

@jrjbertram

FYI, as of commit 858afa8 these approaches no longer work: ddpg/training.py was removed. I'm looking at adapting them for the refactored codebase.

@jrjbertram

I have it working (I think) in the latest codebase... except that my model performs poorly after loading from a checkpoint. I used a similar approach to the ones described above. I can't tell whether that's due to an error in my code or just a bad training result. Does anyone have any ideas on how to verify that the model was loaded correctly?

Code changes in this commit (I have a copy of openai baselines embedded in my repo for now):
https://github.com/jrjbertram/jsbsim_rl/commit/6825c0c277e94d24e3ecb1450eef82dda8b5793d

For my testing I'm using the approach of running a 0-length training session followed by a --play.

https://github.com/jrjbertram/jsbsim_rl/blob/master/replay.sh

I verified that the code to reload the model is being executed.

Where I'm confused / suspicious is that during training my rollout return and return_history curves look pretty good (a nice logarithmic shape), but when I replay, my actions seem fairly random and the agent doesn't act in a way that would collect any reward.
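
One sanity check that doesn't depend on rollouts: snapshot the trainable variables right after saving, then again right after restoring, and compare them. A quick sketch (the snapshot path is just an example):

import pickle
import numpy as np
import tensorflow as tf

def dump_trainables(sess, path):
    # Snapshot every trainable variable's value so it can be compared later.
    variables = tf.trainable_variables()
    values = sess.run(variables)
    with open(path, 'wb') as f:
        pickle.dump({v.name: val for v, val in zip(variables, values)}, f)

def compare_trainables(sess, path):
    with open(path, 'rb') as f:
        snapshot = pickle.load(f)
    for v in tf.trainable_variables():
        ok = np.allclose(snapshot[v.name], sess.run(v))
        print(('ok      ' if ok else 'MISMATCH'), v.name)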

@Sohojoe

Sohojoe commented Nov 23, 2018

@jrjbertram are you normalizing your environment? If so, there is a known issue:

NOTE: At the moment Mujoco training uses VecNormalize wrapper for the environment which is not being saved correctly; so loading the models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd by TfRunningMeanStd in baselines/common/vec_env/vec_normalize.py. This way, mean and std of environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way - hence not including it by default

source = https://github.com/openai/baselines#saving-loading-and-visualizing-models

@joyce-fang

joyce-fang commented Mar 29, 2019

@r7vme The stable-baselines library doesn't seem to solve the RunningMeanStd issue. It changes the normalize_observations default value to False so that RunningMeanStd is not used. When I enable normalize_observations, the model does not restore correctly.
EDIT:
Seems like it was fixed last week: hill-a@06f5843

@DanielTakeshi

Has there been an update on how to properly save DDPG models?
