
Some benchmarks on six MuJoCo-v2 environments for DDPG and TD3 #63

DanielTakeshi opened this issue Jun 22, 2019 · 3 comments

DanielTakeshi commented Jun 22, 2019

Hi @vitchyr

Thanks for the great code base. I was recently benchmarking some results here while looking for DDPG/TD3 implementations, after failing to get baselines working. I thought I'd share some results in case they are useful to you or others.

For installation, I actually didn't entirely follow the installation instructions, but here's what I did:

  • I used a Python 3.6.7 pip virtualenv, and just manually installed the packages I saw in your installation yml file (roughly as sketched after this list). I used torch 0.4.1 as recommended.
  • I actually used MuJoCo 2.0, so I was using the -v2 instances of the environments.
  • I used gym 0.12.5 and mujoco-py 2.0.2.2.
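
In case it helps reproduce the setup, the manual install amounted to roughly this (illustrative only; the full package list came from the repo's yml file):

$ pip install torch==0.4.1 gym==0.12.5 mujoco-py==2.0.2.2 numpy pandas matplotlib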

I took the master branch from 5565dd5 and then adjusted the examples/td3.py and examples/ddpg.py so that they also imported other MuJoCo environments. In addition, for TD3 only, I adjusted the hyperparameters in the "algorithm_kwargs" so that they matched DDPG in the main method. To be clear, DDPG uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L71-L79

And TD3 uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/td3.py#L104-L111

I simply modified the td3.py script so that all the hyperparameters above match DDPG; in particular, I changed the number of epochs to 1000, eval steps per epoch to 1000, min steps before training to 10k, and batch size to 128.

If I am not mistaken, this should mean that both the exploration and evaluation policies experience 1 million total steps over the course of training (1000 epochs × 1000 steps per epoch). Though, because evaluation by default discards incomplete trajectories, the actual number of evaluation steps reported in the logs can come out slightly below 1 million.
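
For concreteness, the algorithm_kwargs block in my modified td3.py ended up looking roughly like this (a sketch rather than a verbatim diff; the key names follow the DDPG example linked above, and anything I do not list was left as in the original scripts):

algorithm_kwargs=dict(
    num_epochs=1000,                      # changed to match DDPG
    num_eval_steps_per_epoch=1000,        # changed to match DDPG
    min_num_steps_before_training=10000,  # changed to match DDPG
    batch_size=128,                       # changed to match DDPG
    # remaining keys (exploration steps / trains per loop, max path length,
    # ...) left exactly as in the example scripts
),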

I ran DDPG and TD3 on six MuJoCo-v2 environments, for four random seeds each. I adjusted the code so my directory structure looks like this:

$ ls -lh data/
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Walker2d-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Walker2d-v2
$ ls -lh data/rlkit-ddpg-Ant-v2/
drwxrwxr-x 2 daniel daniel 4.0K Jun 20 20:49 rlkit-ddpg-Ant-v2_2019_06_20_20_49_44_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_20_53_49_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_44_22_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_49_37_0000--s-0
$ 

// other env results presented in a similar manner

For this I used the following plotting script, which I call like python [script].py Ant-v2 (and similarly for the other environments):

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
import argparse
import csv
import pandas as pd
import os
import numpy as np
from os.path import join

# matplotlib
titlesize = 33
xsize = 30
ysize = 30
ticksize = 25
legendsize = 25
error_region_alpha = 0.25


def smoothed(x, w):
    """Smooth x by averaging over sliding windows of w, assuming sufficient length.
    """
    if len(x) <= w:
        return x
    smooth = []
    for i in range(1, w):
        smooth.append( np.mean(x[0:i]) )
    for i in range(w, len(x)+1):
        smooth.append( np.mean(x[i-w:i]) )
    assert len(x) == len(smooth), "lengths: {}, {}".format(len(x), len(smooth))
    return np.array(smooth)


def plot(args):
    """Load the progress csv file, and plot.

    Plot:
      'exploration/Returns Mean',
      'exploration/num steps total',
      'evaluation/Returns Mean',
      'evaluation/num steps total',
    """
    nrows, ncols = 1, 2
    fig, ax = plt.subplots(nrows, ncols, squeeze=False, sharey='row',
                           figsize=(11*ncols,6*nrows))

    algorithms = sorted([x for x in os.listdir('data/') if args.env in x])
    assert len(algorithms) == 2
    colors = ['blue', 'red']

    for idx,alg in enumerate(algorithms):
        print('Currently on algorithm: ', alg)
        alg_dir = join('data', alg)
        progfiles = sorted([
                join(alg_dir, x, 'progress.csv') for x in os.listdir(alg_dir)
        ])
        expl_returns = []
        eval_returns = []
        expl_steps = []
        eval_steps = []

        for prog in progfiles:
            df = pd.read_csv(prog, delimiter = ',')

            expl_ret = df['exploration/Returns Mean'].tolist()
            expl_returns.append(expl_ret)
            eval_ret = df['evaluation/Returns Mean'].tolist()
            eval_returns.append(eval_ret)

            expl_sp = df['exploration/num steps total'].tolist()
            expl_steps.append(expl_sp)
            eval_sp = df['evaluation/num steps total'].tolist()
            eval_steps.append(eval_sp)

        expl_returns = np.array(expl_returns)
        eval_returns = np.array(eval_returns)
        xs = expl_returns.shape[1]
        expl_ret_mean = np.mean(expl_returns, axis=0)
        eval_ret_mean = np.mean(eval_returns, axis=0)
        expl_ret_std = np.std(expl_returns, axis=0)  # std across seeds
        eval_ret_std = np.std(eval_returns, axis=0)  # (only used by the fill_between block below)

        w = 10
        label0 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(expl_ret_mean[-w:]))
        label1 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(eval_ret_mean[-w:]))
        ax[0,0].plot(np.arange(xs), smoothed(expl_ret_mean, w=w),
                     color=colors[idx], label=label0)
        ax[0,1].plot(np.arange(xs), smoothed(eval_ret_mean, w=w),
                     color=colors[idx], label=label1)

        # This can be noisy.
        if False:
            ax[0,0].fill_between(np.arange(xs),
                                 expl_ret_mean-expl_ret_std,
                                 expl_ret_mean+expl_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])
            ax[0,1].fill_between(np.arange(xs),
                                 eval_ret_mean-eval_ret_std,
                                 eval_ret_mean+eval_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])

    for i in range(2):
        ax[0,i].tick_params(axis='x', labelsize=ticksize)
        ax[0,i].tick_params(axis='y', labelsize=ticksize)
        leg = ax[0,i].legend(loc="best", ncol=1, prop={'size':legendsize})
        for legobj in leg.legendHandles:
            legobj.set_linewidth(5.0)
    ax[0,0].set_title('{} (Exploration)'.format(args.env), fontsize=ysize)
    ax[0,1].set_title('{} (Evaluation)'.format(args.env), fontsize=ysize)

    plt.tight_layout()
    figname = 'fig-{}.png'.format(args.env)
    plt.savefig(figname)
    print("\nJust saved: {}".format(figname))


if __name__ == "__main__":
    pp = argparse.ArgumentParser()
    pp.add_argument('env', type=str)
    args = pp.parse_args()
    plot(args)

Here are the curves. Left is the exploration policy, and right is the evaluation policy.

fig-Ant-v2

fig-HalfCheetah-v2

fig-Hopper-v2

fig-InvertedPendulum-v2

fig-Reacher-v2

fig-Walker2d-v2

The TL;DR is that TD3 wins on four of the environments, and DDPG wins on the other two. One of the ones TD3 doesn't win is InvertedPendulum, but that should be easy to get to 1000 if the hyperparameters are tuned. Also, to reiterate the code comments, I do not report standard deviations since they would make the plots quite hard to read.

I thought this might be useful, if you want to point people towards some baselines. (I didn't see any upon a quick glance, but maybe you have them somewhere else?) Anyway, I hope this is useful or at least remotely interesting!

DanielTakeshi (Author) commented:

One more thing: the examples script has code like this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L22-L24

and we are using Tanh policies:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L35-L39

Just wondering, is the NormalizedBoxEnv needed in this case? Perhaps it was just added to show what we could do with it later? By default it seems like we are not normalizing observations or returns, so NormalizedBoxEnv would only serve to clip each action component to [-1,1]. But the tanh will naturally keep actions in that range anyway.

The only other possibility I can think of for the NormalizedBoxEnv is if the extra noise injected into the exploration policy causes some action components to exceed the [-1,1] range. But after inserting some print and assertion checks in the NormalizedBoxEnv step method and running python examples/ddpg.py, I found that no actions fall outside the range, so presumably the action+noise for exploration is clipped somewhere earlier.
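
For reference, the check I inserted amounted to the following (shown as a standalone helper purely for illustration; checked_step is a made-up name, not something in rlkit):

import numpy as np

def checked_step(env, action):
    # Assert every action component already lies in [-1, 1] before the
    # wrapper does anything; with a tanh policy (plus exploration noise)
    # this never fired in my python examples/ddpg.py runs.
    assert np.all(action >= -1.0) and np.all(action <= 1.0), action
    return env.step(action)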

vitchyr (Collaborator) commented Jun 26, 2019 via email

ZhenhuiTang commented:

Hi, I was wondering: what is the difference between the exploration policy and the evaluation policy? Which one is commonly used in RL papers? For example, is the training curve in the SAC paper based on the exploration policy, which corresponds to 'expl/Average Returns'? And why do returns from the evaluation policy tend to be better than those from the exploration policy?

I really look forward to your reply!

