
PPO2 is not using the cpu cores effectively #600

Open
akhilsanand opened this issue Sep 19, 2018 · 6 comments

Comments

@akhilsanand

Hello,
I am trying to run PPO2 on a 32-core AMD Threadripper CPU using the vectorized environment.
I am trying to run 64 threads in parallel, but it is not using all 64 cores effectively; it uses less than 50% of the CPU resources. Is this a problem with the multiprocessing implementation?

Also, I am curious why PPO2 is not parallelized using MPI like the TRPO implementation.

Thanks

@pzhokhov
Collaborator

Hi @akhilsanand! Actually, ppo2 is parallelized using MPI now (it can work with both vecenv parallelization and with MPI); to run with MPI, use mpirun -np 64 python -m baselines.run --alg=ppo2 --num_env=1 ...
To your question, though - there could be a couple of reasons. When running with vectorized environments, the only part that is really parallel is computing env.step for each environment, whereas the neural network gradient computation and updates are not parallelized (well, technically they can be via tensorflow internal magic, but I don't know how efficient that is). So if env.step takes less time than computing gradients, the vecenv subprocesses will be idling, which is likely what is going on in your case. The other possibility (which may happen for environments with high-dimensional observations) is that the communication overhead in SubprocVecEnv is large (i.e. it takes a relatively long time for the master process to receive observations through pipes).
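One quick way to tell which side is the bottleneck is to time the vectorized env.step in isolation and compare it with the frames-per-second your full training run reports. A rough sketch (venv stands for whatever SubprocVecEnv instance you already build; the helper name is made up):

import time
import numpy as np

def time_vecenv_steps(venv, n_steps=1000):
    # measure how fast the vectorized env.step alone can go, using random actions
    venv.reset()
    start = time.time()
    for _ in range(n_steps):
        actions = np.array([venv.action_space.sample() for _ in range(venv.num_envs)])
        venv.step(actions)
    elapsed = time.time() - start
    return n_steps * venv.num_envs / elapsed  # environment frames per second

# if this number is much higher than the training throughput, the workers are
# waiting on the master process rather than on env.step itself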
As for recommendations: I would try using ppo2 with MPI; if you specifically want to make it work with vectorized environments, try replacing SubprocVecEnv with ShmemVecEnv (an optimized version that sends observations to the master process via shared arrays) in baselines/common/cmd_util.py, here:

if num_env > 1: return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])

(don't forget to import it from baselines.common.vec_env.shmem_vec_env)
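For reference, the swapped version would look roughly like this (a sketch only; make_env and start_index are the existing helpers in cmd_util.py):

from baselines.common.vec_env.shmem_vec_env import ShmemVecEnv

if num_env > 1:
    # observations come back through shared-memory arrays instead of pipes
    return ShmemVecEnv([make_env(i + start_index) for i in range(num_env)])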
Hope this helps!

@akhilsanand
Author

akhilsanand commented Sep 20, 2018

Hello @pzhokhov, thanks a lot for your detailed reply.
I don't think the slowness of the vectorized implementation is due to gradient computation, because I am using a GPU for the gradients and I can see that they take very little time. Also, I have tried it on a 10-core machine and could use almost 85% of the CPU, and on a 4-core machine with more than 90% efficiency; I notice a decrease in efficiency as the number of CPU cores increases. It could be because of the communication overhead, as my environment's observation space is a bit large.

Regarding the solutions you suggested:
First, I tried using ShmemVecEnv and could not see any significant improvement.
Then I tried MPI with 32 workers and a single GPU with 12 GB of memory, and ran into GPU memory issues. So I tried MPI (32 workers) without the GPU, but with this configuration, since all the MPI workers compute gradients separately, the whole learning process is slower than the vectorized implementation. Please correct me if there is something wrong in my understanding.
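If the GPU memory issue is every MPI worker trying to claim the whole GPU (TensorFlow 1.x reserves all GPU memory by default), one thing to try, a generic TensorFlow sketch rather than anything baselines-specific, is to constrain each worker's allocation in the session config:

import tensorflow as tf

config = tf.ConfigProto()
# option 1: grow GPU memory on demand instead of reserving it all at session creation
config.gpu_options.allow_growth = True
# option 2: give each of the 32 workers a fixed slice (the fraction is only illustrative)
# config.gpu_options.per_process_gpu_memory_fraction = 1.0 / 32
sess = tf.Session(config=config)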

But I am curious why the MPI-based implementation is different from the vectorized implementation, where the workers only collect data in parallel and the whole gradient calculation is performed by a single GPU. The latter seems much more efficient, since I don't have many GPUs compared to CPU cores. I would like to know what limits a similar implementation with MPI.

Thanks

@pzhokhov
Collaborator

pzhokhov commented Sep 20, 2018

Hm... Okay, let's first consider the non-MPI version. What you are saying is that it is not gradient computation that slows things down (because it is fast, and you can see on the GPU profiler that it is not a bottleneck), and yet most of the cores are underutilized. Also, ShmemVecEnv does not speed things up. In that case, I see two possibilities.

  1. The vecenv subprocesses are actually waiting for the master process to do something (not necessarily gradient computation on the GPU; it could be something else - maybe moving data, or maybe tensorflow for some arcane reason decides to put some operations on the CPU and those are slow). If that's the case, you should see one process consuming ~100% of one core, whereas the others (the vecenv processes) are close to idling.
  2. All of the processes are waiting for some sort of data transfer / lock. This is possible even with ShmemVecEnv - a simple example is using env.render() somewhere, which ShmemVecEnv does not send through shared memory (it still goes over a pipe). Actions are also sent over pipes even in ShmemVecEnv, so if those are large, that can be slow. It is also possible (though not likely) that something is wrong in the usage of multiprocessing.Array and it locks up so that subprocesses have to wait to write into it. In this case, all processes will consume an equally low percentage of their cores.
    The principled solution to either case is to log a timeline of activity of all of the subprocesses (save the system clock ticks when data transfers start and end into a file, and then match them up); a rough sketch of such instrumentation is right after this list. Somewhat of a pain in the ass.
    If you have a reproducible example of this behavior, post it here and I can try to debug it.
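A rough sketch of that per-process timeline logging (the helper is hypothetical, not something in baselines); each process appends start/end timestamps for a named span, and the per-process files can be merged and compared afterwards:

import os
import time
from contextlib import contextmanager

@contextmanager
def log_span(logfile, tag):
    # append "<pid> <tag> <start> <end>" for each measured span to a per-process file
    start = time.time()
    yield
    end = time.time()
    with open(logfile, "a") as f:
        f.write("%d %s %.6f %.6f\n" % (os.getpid(), tag, start, end))

# example: inside a vecenv worker loop
# with log_span("/tmp/timeline_%d.log" % os.getpid(), "env_step"):
#     obs, reward, done, info = env.step(action)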

Now, regarding MPI and computing the gradients in one process vs. multiple: if you compare apples to apples with a constant total batch size (i.e. divide --num_env by N, the number of MPI workers), the multiple gradient updates should not be much slower than in the vecenv case (yes, you are doing N times more gradient computations in parallel, but each of them has a batch size N times smaller). Also, at large (multi-machine) scale, doing an allreduce on parameter updates is more communication-efficient than sending all observations to the master; and if the network is large enough that sending large parameter updates is costly, then gradient computation likely takes longer than env.step(), so you'll want parallel gradient computation anyway. That being said, the topology of parallelism in RL is not really a solved problem; to my knowledge, people experiment a lot with it.
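To illustrate the communication pattern, here is a minimal mpi4py sketch of averaging gradients with a single allreduce (a stand-alone toy, not the actual baselines code):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nworkers = comm.Get_size()

# each worker computes a gradient on its own (N-times-smaller) batch
local_grad = np.random.randn(1000)  # stand-in for a real gradient vector

# one allreduce sums the gradients across workers; dividing gives the average
summed = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed, op=MPI.SUM)
avg_grad = summed / nworkers
# every worker applies the same averaged update to its own copy of the parameters

Run with, e.g., mpirun -np 8 python allreduce_demo.py (the filename is just for illustration).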

@akhilsanand
Copy link
Author

Hello @pzhokhov, sorry for the late reply; I was off for a week.
Thanks for the insights :-)

  1. The first case does not apply, as I do not see one process consuming 100% of a core while the rest of the processes idle.

  2. I will certainly try debugging it the way you suggested, but maybe after a few days, as I cannot find any time right now.

In the meantime, I tried the chunk method with different numbers of workers [64, 128, 256 and 320], but did not find any noticeable improvement in any of the cases.

Thanks

@lihuang3

lihuang3 commented Oct 9, 2018

Hi @pzhokhov, I am a little confused about the mpirun -np 64 python -m baselines.run --alg=ppo2 --num_env=1 ... mentioned above. Does it mean using 64 workers with only 1 environment each (the same seed for all workers)? Or should I use --num_env=64 instead for different environments?

Thank you

@Antymon

Antymon commented Jun 29, 2019

I second the previous question regarding the confusion around MPI and the --num_env argument. Should I understand that --num_env greater than 1 should be used when MPI is NOT used and vice versa, since they represent two approaches to doing the same thing? Or are the semantics not mutually exclusive? When the PPO paper speaks of parallel actors, which option does that boil down to?
