PPO2 is not using the cpu cores effectively #600
Hi @akhilsanand! Actually, ppo2 is parallelized using MPI now (it can work with both vecenv parallelization and with MPI); to run with MPI, see baselines/baselines/common/cmd_util.py (line 41 in 85be745). You can also try ShmemVecEnv in place of the default vecenv (don't forget to import it from baselines.common.vec_env.shmem_vec_env). Hope this helps!
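For illustration, here is a minimal sketch of plugging ShmemVecEnv into ppo2 directly (this assumes the refactored ppo2.learn API that takes a network name and a VecEnv; the environment id and worker count are placeholders, not a recommendation):

```python
# Minimal sketch (not the actual cmd_util code) of using ShmemVecEnv with ppo2.
# Assumes the refactored baselines API; env id and worker count are placeholders.
import gym
from baselines.common.vec_env.shmem_vec_env import ShmemVecEnv
from baselines.ppo2 import ppo2

def make_env():
    return gym.make('CartPole-v1')  # placeholder env id

if __name__ == '__main__':
    # ShmemVecEnv passes observations through shared memory rather than pipes,
    # which can cut inter-process overhead when observations are large.
    venv = ShmemVecEnv([make_env for _ in range(16)])
    ppo2.learn(network='mlp', env=venv, total_timesteps=int(1e6))
```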
Hello @pzhokhov, thanks a lot for your detailed reply. Regarding the solutions you have suggested: I am curious why the MPI-based implementation is different from the vectorized implementation, where the workers only collect data in parallel and the whole gradient calculation is performed by a single GPU. That approach seems much more efficient for me, since I have far more CPU cores than GPUs. I would like to know what limits a similar implementation with MPI. Thanks
Hm... Okay, let's first consider the non-MPI version. What you are saying is that it is not gradient computation that slows things down (because it is fast / you can see on the GPU profiler that it is not a bottleneck), and yet most of the cores are underutilized. Also, ShmemVecEnv does not speed things up. In that case, I see two possibilities.
Now regarding MPI and computing gradients in one subprocess vs multiple: if you compare "apples to apples" with a constant batch size (i.e. divide --num_env by N = number of MPI workers), the multiple gradient updates should not be much slower than in the vecenv case (yes, you are doing N times more gradient computations in parallel, but all of them have a batch size N times smaller). Also, at large (multi-machine) scale, doing an allreduce on parameter updates is more communication-efficient than sending all observations to the master; and if the network is large enough that sending large parameter updates is costly, then gradient computation likely takes longer than env.step(), so you'll want parallel gradient computation anyway. That being said, the topology of parallelism in RL is not really a solved problem; to my knowledge, people experiment a lot with these.
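To make the "apples to apples" point concrete, here is an illustrative calculation (not baselines code) of how the per-worker batch shrinks while the global batch stays constant when a fixed total number of environments is split across MPI workers; nsteps=2048 is the ppo2 default rollout length:

```python
# Illustrative arithmetic only: splitting a fixed total batch across N MPI workers
# keeps the global (allreduced) batch size constant.
nsteps = 2048        # rollout length per environment per update (ppo2 default)
total_envs = 64      # total parallel environments, held constant across runs

for n_workers in (1, 4, 8):
    envs_per_worker = total_envs // n_workers       # i.e. --num_env divided by N
    per_worker_batch = envs_per_worker * nsteps     # samples each worker backprops through
    global_batch = per_worker_batch * n_workers     # what the allreduced update effectively sees
    print(f"{n_workers} MPI workers: {envs_per_worker} envs each, "
          f"per-worker batch {per_worker_batch}, global batch {global_batch}")
```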
Hello @pzhokhov, sorry for the late reply, I was off for a week.
In the meantime I tried the chunk method, but I could not find any noticeable improvement with different numbers of workers (64, 128, 256 and 320); in all cases the chunk method gave no real improvement. Thanks
Hi @pzhokhov, I was a little bit confused here regarding the --num_env / MPI point above. Thank you
I second the previous question in the sense of confusion around MPI and the --num_env argument. Should I understand that --num_env greater than 1 should be used when MPI is NOT used and vice versa, since they represent two approaches to doing the same thing? Or are the semantics not mutually exclusive? When the PPO paper speaks of parallel actors, which option does it boil down to?
Hello,
I am trying to run PPO2 on a 32-core AMD Threadripper CPU using the vectorized environment.
I am trying to run 64 workers in parallel, but it is not using all the cores effectively; it uses less than 50% of the CPU resources. Is this a problem with the multiprocessing implementation?
I am also curious why ppo2 is not parallelized using MPI like the trpo implementation.
Thanks
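For context, this is roughly the kind of setup being described (a minimal sketch, assuming the default SubprocVecEnv-based vectorization and the refactored ppo2.learn API; the environment id is a placeholder):

```python
# Minimal sketch of a vectorized ppo2 run with 64 parallel environment workers.
# Assumes the refactored baselines API; the environment id is a placeholder.
import gym
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.ppo2 import ppo2

def make_env():
    return gym.make('CartPole-v1')  # placeholder env id

if __name__ == '__main__':
    # 64 env copies, each stepped in its own subprocess; the gradient update runs
    # in the single parent process, so env workers can sit idle during the update.
    venv = SubprocVecEnv([make_env for _ in range(64)])
    ppo2.learn(network='mlp', env=venv, total_timesteps=int(1e6))
```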