
SubprocVecEnv speedup does not scale linearly compared with DummyVecEnv #608

Open · zuoxingdong opened this issue Sep 21, 2018 · 14 comments

@zuoxingdong (Contributor) commented Sep 21, 2018

I made a toy benchmark by creating 16 environments for both SubprocVecEnv and DummyVecEnv, and collected 1000 time steps by first resetting the environments and then feeding random actions sampled from the action space within a for loop.

It turns out the speed of the simulator step is quite crucial to the total speedup. For example, HalfCheetah-v2 is roughly 1.5-2x faster, while FetchPush-v1 can be 7-9x faster. I guess it depends on the dynamics, where the cheetah's are simpler.

For classic control environments like CartPole-v1, it seems using DummyVecEnv is much better: the SubprocVecEnv "speedup" is ~0.2x, i.e. 5x slower than DummyVecEnv.

I am wondering whether it is feasible to scale the speedup further, to be approximately linear in the number of environments, or whether the main limitation comes from the computational overhead of each Process?
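
For reference, a minimal sketch of the benchmark setup described above (assuming baselines' DummyVecEnv and SubprocVecEnv; the helper name time_vec_env is illustrative):

import time

import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv

def time_vec_env(vec_env_cls, env_id='CartPole-v1', num_envs=16, num_steps=1000):
    # Build 16 sub-environments, reset once, then feed random actions in a loop.
    venv = vec_env_cls([lambda: gym.make(env_id) for _ in range(num_envs)])
    venv.reset()
    start = time.time()
    for _ in range(num_steps):
        venv.step([venv.action_space.sample() for _ in range(num_envs)])
    elapsed = time.time() - start
    venv.close()
    return elapsed

print(time_vec_env(DummyVecEnv))
print(time_vec_env(SubprocVecEnv))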

@pzhokhov (Collaborator)

I think the most important factor for speedup is the ratio of compute time (simulator step time) to communication time (the time it takes to send observations and actions through the pipes); the larger this ratio, the better the speedup one can achieve.
The first thing I'd try is the same experiment using ShmemVecEnv, which eliminates some of the communication overhead by using shared arrays... Anything further will likely require a detailed analysis of where exactly the time is being spent.
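
Swapping it in should be close to a one-line change (a sketch, assuming the ShmemVecEnv shipped under baselines.common.vec_env; the env id and count are illustrative):

import gym
from baselines.common.vec_env.shmem_vec_env import ShmemVecEnv

# Same interface as SubprocVecEnv, but observations come back through
# shared-memory arrays instead of being pickled through the pipes.
venv = ShmemVecEnv([lambda: gym.make('HalfCheetah-v2') for _ in range(16)])
obs = venv.reset()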

@pzhokhov (Collaborator)

Similar discussion here: #600

@zuoxingdong (Contributor, Author)

@pzhokhov Thanks a lot for your detailed explanation. On my local server, it seems another potential factor is that many simulators are already quite fast: for classic control environments like CartPole and Pendulum, the step itself is super fast, and even most Mujoco environments have low-dimensional raw-configuration observations. Maybe that's also one of the reasons why I didn't see a significant speedup.

By the way, I have a very dumb question: is there a reason why in VecEnv the parallelization is done with the Python multiprocessing package, while some algorithms like TRPO, PPO1, and HER use MPI instead? To my understanding, MPI can be used on distributed systems (multiple compute nodes), while multiprocessing is considered lower level, so it might potentially be faster than MPI?

@zuoxingdong (Contributor, Author)

@pzhokhov Given that some simulators' step is often not much of a bottleneck, would it be a good idea to parallelize chunks of sub-environments instead of opening one Process per sub-environment? For example, say HalfCheetah-v2 is already quite fast to step and the actions are low-dimensional, and we have 64 batched environments: rather than creating 64 Processes, how about creating 16 Processes, each handling 4 environments serially? In this case, given that the actions sent to workers are low-dimensional and observations are written to a shared array, would this be more efficient in theory?

And since different environments have different step complexity, this could be exposed as an option to choose the chunk size (default: 1).
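
To make the idea concrete, a minimal sketch of such a chunked worker (illustrative only, not the actual implementation in #620; all names are hypothetical):

from multiprocessing import Pipe, Process

def chunk_worker(remote, env_fns):
    # One process owns a chunk of sub-environments and steps them serially,
    # so a single pipe round-trip serves the whole chunk.
    envs = [fn() for fn in env_fns]
    while True:
        cmd, data = remote.recv()
        if cmd == 'step':
            # data holds one action per env in this chunk
            remote.send([env.step(a) for env, a in zip(envs, data)])
        elif cmd == 'reset':
            remote.send([env.reset() for env in envs])
        elif cmd == 'close':
            remote.close()
            break

def start_chunk_workers(env_fns, chunk_size):
    # e.g. 64 env fns with chunk_size=4 -> 16 worker processes
    parents = []
    for i in range(0, len(env_fns), chunk_size):
        parent, child = Pipe()
        Process(target=chunk_worker, args=(child, env_fns[i:i + chunk_size]),
                daemon=True).start()
        parents.append(parent)
    return parents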

@zuoxingdong (Contributor, Author) commented Sep 23, 2018

@pzhokhov Also, could it be beneficial for further speedup to use asyncio (Python 3.7+) for I/O concurrency, combined with multiprocessing?
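
One way the suggestion might look (a rough sketch only; remotes are multiprocessing.Pipe endpoints as in SubprocVecEnv, and the function name is hypothetical):

import asyncio

async def step_all(remotes, actions):
    # Send all actions first, then await the replies concurrently by
    # running the blocking recv() calls in a thread pool.
    for remote, action in zip(remotes, actions):
        remote.send(('step', action))
    loop = asyncio.get_event_loop()
    return await asyncio.gather(
        *[loop.run_in_executor(None, remote.recv) for remote in remotes])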

@pzhokhov (Collaborator)

Chunks of sub-environments per process instead of one per process is a great idea! I'd be very interested to see the results of that (how much faster the venv.step() method becomes with different types of sub-environments). asyncio is also a good one, although I'd rather keep things compatible with Python 3.6 and below. Anyways, if you feel like implementing any of these, do not let me stop you from submitting a PR :)
To the MPI vs multiprocessing question: the VecEnv configuration (master process updating the neural net, subprocesses running env.step) is especially beneficial for conv nets and Atari-like envs, because then the updates from relatively large batches can be computed on a GPU fast (much faster than if every process were to run gradient computation on its own; several processes actively interacting with a GPU is usually not a great setup). In principle, the same communication pattern can be implemented with MPI, but it is a little more involved and requires MPI to be installed. On the other hand, in Mujoco-like environments (when using non-pixel observations, i.e. positions and velocities of joints) neural nets are relatively small, so batching data to compute the update does not give much of a speedup; and with MPI you can actually run the experiment on a distributed machine. That's why, for instance, HER uses MPI over SubprocVecEnv. For TRPO and PPO1 the choice could have gone either way; in fact, PPO2 can use both. I don't know the relative latency of MPI communication versus pipes; I suspect they should be similar, but I have never measured it or seen measurements.
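
To illustrate the communication pattern described above, a sketch (venv, policy, and num_steps are placeholders, not baselines' actual training loop):

def rollout(venv, policy, num_steps):
    # The master process does one batched forward pass per step (e.g. on a
    # GPU), while the subprocesses inside venv run env.step in parallel.
    obs = venv.reset()
    for _ in range(num_steps):
        actions = policy(obs)
        obs, rewards, dones, infos = venv.step(actions)
    return obs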

@zuoxingdong (Contributor, Author) commented Sep 26, 2018

@pzhokhov Thanks a lot for your detailed comments. For the chunk-based parallel VecEnv, I quickly made an implementation and ran a benchmark on my laptop (4 cores) with CartPole-v1, shown in the attached images.

[benchmark plots: bench, bench_single]

From these preliminary benchmark results, it seems the "chunk" idea works quite reasonably. Tomorrow I'll try it out on the server (DGX-1) with more environments and different environment types.

@zuoxingdong (Contributor, Author) commented Sep 27, 2018

Here is an additional benchmark on some Mujoco environments (tested on a DGX-1):

[benchmark plots: bench3 through bench12, one per Mujoco environment]

@zuoxingdong (Contributor, Author)

Here is the benchmark code.

First, copy and paste the code from #620 into a single file named chunk_version.py, and the original subprocess VecEnv into another file named original_version.py.

# Measure construction + reset + one step, averaged over 10 runs,
# for the original SubprocVecEnv and the chunk-based version.
from functools import partial
from time import time

import gym
import numpy as np

from chunk_version import SubprocChunkVecEnv
from original_version import SubprocVecEnv

env_id = 'Ant-v2'

def make_env(seed):
    env = gym.make(env_id)
    env.seed(seed)

    return env

def get_chunk_time(make_fns, size=1):
    chunk_times = []
    for _ in range(10):
        t = time()
        env = SubprocChunkVecEnv(make_fns, size)

        env.reset()

        env.step([env.action_space.sample()]*len(make_fns))

        chunk_times.append(time() - t)

        env.close()
        del env

    return np.mean(chunk_times)


def get_ori_time(make_fns):
    ori_times = []
    for _ in range(10):
        t = time()
        env = SubprocVecEnv(make_fns)

        env.reset()

        env.step([env.action_space.sample()]*len(make_fns))

        ori_times.append(time() - t)

        env.close()
        del env

    return np.mean(ori_times)

list_ori_time = []
list_chunk_time1 = []
list_chunk_time2 = []
list_chunk_time4 = []
list_chunk_time6 = []
list_chunk_time8 = []

num_envs = [8, 20, 32, 45, 64, 80, 100, 128]
for n in num_envs:
    make_fns = [partial(make_env, seed=i) for i in range(n)]

    list_ori_time.append(get_ori_time(make_fns))
    list_chunk_time1.append(get_chunk_time(make_fns, 1))
    list_chunk_time2.append(get_chunk_time(make_fns, 2))
    list_chunk_time4.append(get_chunk_time(make_fns, 4))
    list_chunk_time6.append(get_chunk_time(make_fns, 6))
    list_chunk_time8.append(get_chunk_time(make_fns, 8))

Plotting the benchmark

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

plt.plot(num_envs, list_ori_time, label='SubprocVecEnv')
plt.plot(num_envs, list_chunk_time1, label='Chunk-1')
plt.plot(num_envs, list_chunk_time2, label='Chunk-2')
plt.plot(num_envs, list_chunk_time4, label='Chunk-4')
plt.plot(num_envs, list_chunk_time6, label='Chunk-6')
plt.plot(num_envs, list_chunk_time8, label='Chunk-8')

plt.legend(loc='upper left')
plt.xlabel('Number of sub-environments')
plt.ylabel('Mean time (10 runs): construct + reset + one `step`')
plt.title(f'Benchmark of chunk-based VecEnv in {env_id}')
plt.xticks(num_envs)
plt.show()

@pzhokhov (Collaborator) commented Oct 4, 2018

Thanks for the thorough investigation, @zuoxingdong! Looks like the Mujoco envs behave very similarly to each other. Could you also benchmark one or two Atari envs (since that's where we mostly use SubprocVecEnv at present)? We can then include the plots in the docs: something like step() time as a function of the number of sub-envs for CartPole, for Mujoco Ant, and for some Atari game. If you have extra time, it would be interesting to put DummyVecEnv and ShmemVecEnv on the same plots :) Also, let's merge the PR.

@zuoxingdong (Contributor, Author)

@pzhokhov

[benchmark plots: Pong-v4, MontezumaRevenge-v0]

Here is the benchmark on two Atari games, Pong-v4 and MontezumaRevenge-v0. It seems that for Atari games, step() takes much more computation time than the communication overhead; all chunk versions take longer than SubprocVecEnv. So the current one-process-per-env implementation of parallelized environments is the best option for Atari games.

@zuoxingdong (Contributor, Author)

@pzhokhov Here are some benchmarks adding DummyVecEnv and ShmemVecEnv:

[benchmark plots including DummyVecEnv and ShmemVecEnv]

@zuoxingdong (Contributor, Author) commented Oct 8, 2018

@pzhokhov I did some experiments; it seems a batched environment for Mujoco does not work as well as a single one. I'm just wondering whether there are underlying reasons why Atari games can benefit from batched data but Mujoco cannot.

@zuoxingdong (Contributor, Author)

[benchmark plots: Pong-v4]

In Pong-v4, DummyVecEnv suffers a lot in time consumption, and ShmemVecEnv does not yet show a significant speedup.
