
Improve Utilization of GPU #10

Open
notadamking opened this issue Jun 6, 2019 · 14 comments

@notadamking
Owner

This library achieves very high success rates, though it takes a very long time to optimize and train. This could be improved if we could figure out a way to utilize the GPU more during optimization/training, so the CPU can be less of a bottleneck. Currently, the CPU is being used for most of the intermediate environment calculations, while the GPU is used within the PPO2 algorithm during policy optimization.

I am currently optimizing/training on the following hardware:

  • AMD Threadripper 1920X 12 Core (24 Thread) CPU
  • Nvidia RTX 2080 8GB GPU
  • 16 GB 3000 MHz RAM

The bottleneck on my system is definitely the CPU, which is surprising given that this library takes advantage of the Threadripper's multi-threading; meanwhile, my GPU stays at around 1-10% utilization. I have some ideas on how this could be improved, but would like to start a conversation.

  1. Increase the size of the policy network (i.e. increase the number of hidden layers or the number of nodes in each layer); a rough sketch of this follows below.

  2. Do less work in each training loop, so the GPU loop is called more often.

I would love to hear what you guys think. Any ideas or knowledge are welcome here.
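
For illustration, idea 1 might look something like the following with stable-baselines' PPO2. This is only a rough sketch: CartPole stands in for the trading environment, and the `net_arch` sizes are arbitrary placeholders.

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Stand-in environment; the real trading environment would go here.
env = DummyVecEnv([lambda: gym.make("CartPole-v1")])

# A larger policy/value network means each optimization step pushes more
# work onto the GPU relative to the CPU-bound environment stepping.
model = PPO2(
    MlpPolicy,
    env,
    policy_kwargs=dict(net_arch=[512, 512, dict(pi=[256], vf=[256])]),
    verbose=1,
)
model.learn(total_timesteps=10000)
```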

@notadamking notadamking added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jun 6, 2019
@notadamking notadamking self-assigned this Jun 6, 2019
@notadamking notadamking changed the title GPU Underutilization Improve Utilization of GPU Jun 7, 2019
@laneshetron

Well, I believe you could swap out some of the numpy logic in your environment with tensorflow methods, which should be eagerly run on the GPU.
Also, I haven't done any profiling or anything, but I'd guess that fitting the SARIMAX model on each observation step is very slow. Perhaps it could be precomputed?

Just some ideas!
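
To make the precomputation idea concrete, here is one way it could be sketched with statsmodels: fit and forecast over the price series once up front and cache the one-step-ahead predictions, so the environment only indexes into the cache on each step (and on each reset). The column name and model order below are placeholders, not what the library actually uses.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def precompute_forecasts(prices: pd.Series, start: int = 50) -> np.ndarray:
    """Cache one-step-ahead SARIMAX forecasts for every time step, computed once."""
    forecasts = np.full(len(prices), np.nan)
    for t in range(start, len(prices)):
        fit = SARIMAX(prices.iloc[:t], order=(1, 1, 1)).fit(disp=False)
        forecasts[t] = np.asarray(fit.forecast(steps=1))[0]
    return forecasts

# Computed once; every env.reset()/step() just reads forecasts[current_step].
# forecasts = precompute_forecasts(df["Close"])
```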

@notadamking
Owner Author

  1. Great idea to replace some of the numpy logic with tensorflow. Though, I am curious to see how much of an improvement this will yield, as there aren't many large calculations done in numpy. Perhaps more impactful would be to replace the sklearn scaling methods with tensorflow methods, since we re-scale the entire data frame on each step (a sketch follows below).

  2. Pre-computing the SARIMAX is another great idea. Since it is currently calculated on each time step, every time we reset the environment we re-calculate the same SARIMAX predictions at every time step.
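
For point 1, a minimal sketch of what a TensorFlow-based replacement for the per-step sklearn scaling might look like; this assumes eager execution (the TF 2.x default) and an (n_steps, n_features) observation window, neither of which is necessarily what the library does today.

```python
import numpy as np
import tensorflow as tf  # assumes eager execution (default in TF 2.x)

def minmax_scale(window: np.ndarray) -> np.ndarray:
    """Column-wise min-max scaling of an (n_steps, n_features) observation window."""
    x = tf.convert_to_tensor(window, dtype=tf.float32)
    lo = tf.reduce_min(x, axis=0)
    hi = tf.reduce_max(x, axis=0)
    scaled = (x - lo) / (hi - lo + 1e-8)  # epsilon avoids division by zero
    return scaled.numpy()

# obs = minmax_scale(df[feature_columns].values[-window_size:])
```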

@archenroot

archenroot commented Jun 8, 2019

I have a 32-thread dual Xeon with dual 1080 Ti GPUs. On the CPU, only a single thread sits at 100%, two others are around 50%, two more are under 20%, and the rest are under 10%, so the CPU load is super low. The GPUs are even lower: one is constantly at 0% and the other at most 1%. :-)

@archenroot

I didn't do full profiling, but one can use py-spy (similar to iotop/top) to watch the behavior:
[screenshot of py-spy output]

@botemple
Contributor

Same here. How can we fully utilize multiple GPUs to train the agents?
Making it distributed-ready would be even better.

@TalhaAsmal

@archenroot have you tried increasing the parallelism? Increase n_jobs in optimize.py to a value equal to the number of cores you have, and it should increase utilization.
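
I haven't looked at optimize.py in detail, but in optuna terms the knob being discussed is the `n_jobs` argument of `study.optimize()`. A self-contained example with a toy objective looks roughly like this:

```python
import multiprocessing

import optuna

def objective(trial):
    # Toy objective; the real one would train and evaluate the agent.
    x = trial.suggest_uniform("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study()
# n_jobs controls how many trials run in parallel.
study.optimize(objective, n_trials=100, n_jobs=multiprocessing.cpu_count())
```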

@archenroot

archenroot commented Jun 10, 2019

@TalhaAsmal - well, it doesn't work, at least not beyond 10; I even tried 64 :-) As reported in another issue, there is then a problem with concurrent access to sqlite. I replaced sqlite with a Postgres engine, but optuna doesn't currently support custom connection parameters (pool_size, etc.) - there is a PR for it already waiting to be merged - so SQLAlchemy fails on the default config with Postgres as well. Once optuna has merged that PR, we can achieve this heavy parallelism, but it doesn't work as of now...

After 2 days, around 400 trials have finished with 4 threads :D (a brutal race), and a lot of them are PRUNED, marked at an early stage as unpromising... we'll see what config I end up with... I think another 2-3 days...
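
For reference, the setup I'm aiming for looks roughly like the sketch below. The connection string and study name are placeholders, and passing `engine_kwargs` through `optuna.storages.RDBStorage` is my assumption about how the pending PR exposes the custom pool settings.

```python
import optuna
from optuna.storages import RDBStorage

# Placeholder credentials; engine_kwargs carries the custom SQLAlchemy pool
# settings (pool_size, etc.) that the pending optuna PR is expected to allow.
storage = RDBStorage(
    url="postgresql://optuna:optuna@localhost:5432/rltrader",
    engine_kwargs={"pool_size": 32, "max_overflow": 0},
)

study = optuna.create_study(
    study_name="optimize_profit",  # placeholder study name
    storage=storage,
    load_if_exists=True,
)
# study.optimize(...) with a high n_jobs then becomes feasible.
```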

@archenroot

@TalhaAsmal - I will try this evening to install the custom optuna branch with the requested fix for custom driver params with Postgres.

@TalhaAsmal

> @TalhaAsmal - I will try this evening to install the custom optuna branch with the requested fix for custom driver params with Postgres.

@archenroot did you manage to try it with the custom optuna branch? I also ran into concurrency issues with sqlite, but since I have a very old CPU (2600k) I just reduced the parallelism to 2, with obvious negative consequences for speed.

@dennywangtenk

  1. Using SubprocVecEnv instead of DummyVecEnv should improve things a lot in a multi-CPU environment (a sketch follows after this list).

Here is why, according to the baselines doc on Vectorized Environments:

DummyVecEnv Creates a simple vectorized wrapper for multiple environments, calling each environment in sequence on the current Python process.
SubprocVecEnv Creates a multiprocess vectorized wrapper for multiple environments, distributing each environment to its own process, allowing significant speed up when the environment is computationally complex.

In my own experiments with Atari games, using SubprocVecEnv improved performance by 200-250%.

  2. And there is more: according to @zuoxingdon's method here, a "chunk"-based VecEnv could boost performance by an additional 900% compared to SubprocVecEnv, and an implementation can be found here.

  3. To further improve GPU utilization, it was suggested here to increase the number of envs to keep the GPU busy.
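
A minimal sketch of point 1 using the stable-baselines VecEnv API (CartPole stands in for the trading environment, and the env count is arbitrary):

```python
import gym
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env(rank):
    def _init():
        env = gym.make("CartPole-v1")  # replace with the trading environment
        env.seed(rank)
        return env
    return _init

if __name__ == "__main__":
    # Each environment steps in its own process instead of sequentially in the
    # main Python process, which is what DummyVecEnv does.
    n_envs = 8
    env = SubprocVecEnv([make_env(i) for i in range(n_envs)])
```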

@archenroot

@dennywangtenk - sure, I used SubprocVecEnv, but did you test it yourself before writing here? Some other things get broken, and sqlite is not a storage backend made for concurrent access; other databases don't work yet either, as Optuna doesn't support custom config (there is a PR, but it isn't released - you can build it yourself...).

@dennywangtenk

dennywangtenk commented Jun 23, 2019

@archenroot, I got a similar error; it seems that optuna and baselines' SubprocVecEnv both use multiprocessing, which causes some conflicts. Setting n_jobs = 1 forces optuna to run sequentially.
We may need to check the SubprocVecEnv source code to see whether it's thread safe.

@Ruben-E

Ruben-E commented Jun 27, 2019

@dennywangtenk does it actually make sense to set n_jobs to 1 and switch to SubprocVecEnv? It looks like all sub-environment processes use the same parameters for each trial when doing that. The goal of the optimize step is to find the optimal parameters and to test with them as much as possible. Am I correct?

@TheDoctorAI

TheDoctorAI commented Jul 1, 2019

I am now at "Training for: 13144 time steps" and my Titan GPU is still idling.

Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)
Projects: None yet
Development: No branches or pull requests
8 participants