
Improve Utilization of GPU #10

Open
notadamking opened this issue Jun 6, 2019 · 14 comments

@notadamking
Owner

This library achieves very high success rates, though it takes a very long time to optimize and train. This could be improved if we could figure out a way to utilize the GPU more during optimization/training, so the CPU can be less of a bottleneck. Currently, the CPU is being used for most of the intermediate environment calculations, while the GPU is used within the PPO2 algorithm during policy optimization.

I am currently optimizing/training on the following hardware:

  • AMD Threadripper 1920X 12 Core (24 Thread) CPU
  • Nvidia RTX 2080 8GB GPU
  • 16 GB 3000 MHz RAM

The bottleneck on my system is definitely the CPU, which is surprising given that this library takes advantage of the Threadripper's multi-threading; meanwhile, my GPU stays at around 1-10% utilization. I have some ideas on how this could be improved, but would like to start a conversation.

  1. Increase the size of the policy network (i.e. increase the number of hidden layers or the number of nodes in each layer); a rough sketch of this follows below.

  2. Do less work in each training loop, so the GPU loop is called more often.

I would love to hear what you guys think. Any ideas or knowledge are welcome here.
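
For illustration, idea 1 might look something like the following with stable-baselines' PPO2. This is only a rough sketch: CartPole stands in for the trading environment, and the `net_arch` sizes are arbitrary placeholders.

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Stand-in environment; the real trading environment would go here.
env = DummyVecEnv([lambda: gym.make("CartPole-v1")])

# A larger policy/value network means each optimization step pushes more
# work onto the GPU relative to the CPU-bound environment stepping.
model = PPO2(
    MlpPolicy,
    env,
    policy_kwargs=dict(net_arch=[512, 512, dict(pi=[256], vf=[256])]),
    verbose=1,
)
model.learn(total_timesteps=10000)
```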

@notadamking notadamking added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jun 6, 2019
@notadamking notadamking self-assigned this Jun 6, 2019
@notadamking notadamking changed the title GPU Underutilization Improve Utilization of GPU Jun 7, 2019
@laneshetron

Well, I believe you could swap out some of the numpy logic in your environment with tensorflow methods, which should be eagerly run on the GPU.
Also, I haven't done any profiling or anything, but I'd guess that fitting the SARIMAX model on each observation step is very slow. Perhaps it could be precomputed?

Just some ideas!
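
To make the precomputation idea concrete, here is one way it could be sketched with statsmodels: fit and forecast over the price series once up front and cache the one-step-ahead predictions, so the environment only indexes into the cache on each step (and on each reset). The column name and model order below are placeholders, not what the library actually uses.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def precompute_forecasts(prices: pd.Series, start: int = 50) -> np.ndarray:
    """Cache one-step-ahead SARIMAX forecasts for every time step, computed once."""
    forecasts = np.full(len(prices), np.nan)
    for t in range(start, len(prices)):
        fit = SARIMAX(prices.iloc[:t], order=(1, 1, 1)).fit(disp=False)
        forecasts[t] = np.asarray(fit.forecast(steps=1))[0]
    return forecasts

# Computed once; every env.reset()/step() just reads forecasts[current_step].
# forecasts = precompute_forecasts(df["Close"])
```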

@notadamking
Owner Author

  1. Great idea to replace some of the numpy logic with tensorflow. Though, I am curious to see how much of an improvement this will yield, as there aren't many large calculations done in numpy. Perhaps more impactful would be to replace the sklearn scaling methods with tensorflow methods, since we re-scale the entire data frame on each step (a sketch follows below).

  2. Pre-computing the SARIMAX is another great idea. Since it is currently calculated on each time step, every time we reset the environment we re-calculate the same SARIMAX predictions at every time step.
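
For point 1, a minimal sketch of what a TensorFlow-based replacement for the per-step sklearn scaling might look like; this assumes eager execution (the TF 2.x default) and an (n_steps, n_features) observation window, neither of which is necessarily what the library does today.

```python
import numpy as np
import tensorflow as tf  # assumes eager execution (default in TF 2.x)

def minmax_scale(window: np.ndarray) -> np.ndarray:
    """Column-wise min-max scaling of an (n_steps, n_features) observation window."""
    x = tf.convert_to_tensor(window, dtype=tf.float32)
    lo = tf.reduce_min(x, axis=0)
    hi = tf.reduce_max(x, axis=0)
    scaled = (x - lo) / (hi - lo + 1e-8)  # epsilon avoids division by zero
    return scaled.numpy()

# obs = minmax_scale(df[feature_columns].values[-window_size:])
```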

@archenroot

archenroot commented Jun 8, 2019

I have a 32-thread dual Xeon with dual 1080 Ti GPUs. On the CPU, only a single thread sits at 100%, two others are around 50%, two more are under 20%, and the rest are under 10%, so the CPU load is super low. The GPUs are even lower: one is constantly at 0% and the other at most 1%. :-)

@archenroot

I didn't do full profiling, but one can use py-spy (similar to iotop/top) to watch the behavior:
[screenshot of py-spy output]

@botemple
Contributor

Same here. How can we fully utilize multiple GPUs to train the agents?
Making it distributed-ready would be even better.

@TalhaAsmal

@archenroot have you tried increasing the parallelism? Increase n_jobs in optimize.py to a value equal to the number of cores you have, and it should increase utilization.
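
I haven't looked at optimize.py in detail, but in optuna terms the knob being discussed is the `n_jobs` argument of `study.optimize()`. A self-contained example with a toy objective looks roughly like this:

```python
import multiprocessing

import optuna

def objective(trial):
    # Toy objective; the real one would train and evaluate the agent.
    x = trial.suggest_uniform("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study()
# n_jobs controls how many trials run in parallel.
study.optimize(objective, n_trials=100, n_jobs=multiprocessing.cpu_count())
```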

@archenroot

archenroot commented Jun 10, 2019

@TalhaAsmal - well, it doesn't work, at least not beyond 10; I even tried 64 :-) As reported in another issue, there is then a problem with concurrent access to sqlite. I replaced sqlite with a Postgres engine, but optuna doesn't currently support custom connection parameters (pool_size, etc.) - there is a PR for it already waiting to be merged - so SQLAlchemy fails on the default config with Postgres as well. Once optuna has merged that PR, we can achieve this heavy parallelism, but it doesn't work as of now...

After 2 days, around 400 trials have finished with 4 threads :D (a brutal race), and a lot of them are PRUNED, marked at an early stage as unpromising... we'll see what config I end up with... I think another 2-3 days...
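
For reference, the setup I'm aiming for looks roughly like the sketch below. The connection string and study name are placeholders, and passing `engine_kwargs` through `optuna.storages.RDBStorage` is my assumption about how the pending PR exposes the custom pool settings.

```python
import optuna
from optuna.storages import RDBStorage

# Placeholder credentials; engine_kwargs carries the custom SQLAlchemy pool
# settings (pool_size, etc.) that the pending optuna PR is expected to allow.
storage = RDBStorage(
    url="postgresql://optuna:optuna@localhost:5432/rltrader",
    engine_kwargs={"pool_size": 32, "max_overflow": 0},
)

study = optuna.create_study(
    study_name="optimize_profit",  # placeholder study name
    storage=storage,
    load_if_exists=True,
)
# study.optimize(...) with a high n_jobs then becomes feasible.
```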

@archenroot

@TalhaAsmal - I will try this evening to install the custom optuna branch with the requested fix for custom driver params with Postgres.

@TalhaAsmal

> @TalhaAsmal - I will try this evening to install the custom optuna branch with the requested fix for custom driver params with Postgres.

@archenroot did you manage to try it with the custom optuna branch? I also ran into concurrency issues with sqlite, but since I have a very old CPU (2600k) I just reduced the parallelism to 2, with obvious negative consequences for speed.

@dennywangtenk

  1. Using SubprocVecEnv instead of DummyVecEnv should improve things a lot in a multi-CPU environment (a sketch follows after this list).

Here is why, according to the baselines doc on Vectorized Environments:

DummyVecEnv Creates a simple vectorized wrapper for multiple environments, calling each environment in sequence on the current Python process.
SubprocVecEnv Creates a multiprocess vectorized wrapper for multiple environments, distributing each environment to its own process, allowing significant speed up when the environment is computationally complex.

In my own experiments with Atari games, using SubprocVecEnv improved performance by 200-250%.

  2. And there is more: according to @zuoxingdon's method here, a "chunk"-based VecEnv could boost performance by an additional 900% compared to SubprocVecEnv, and an implementation can be found here.

  3. To further improve GPU utilization, it was suggested here to increase the number of envs to keep the GPU busy.
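
A minimal sketch of point 1 using the stable-baselines VecEnv API (CartPole stands in for the trading environment, and the env count is arbitrary):

```python
import gym
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env(rank):
    def _init():
        env = gym.make("CartPole-v1")  # replace with the trading environment
        env.seed(rank)
        return env
    return _init

if __name__ == "__main__":
    # Each environment steps in its own process instead of sequentially in the
    # main Python process, which is what DummyVecEnv does.
    n_envs = 8
    env = SubprocVecEnv([make_env(i) for i in range(n_envs)])
```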

@archenroot

@dennywangtenk - sure, I used SubprocVecEnv, but did you test it yourself before writing here? Some other things get broken, and sqlite is not a storage backend made for concurrent access; other databases don't work yet either, as Optuna doesn't support custom config (there is a PR, but it isn't released - you can build it yourself...).

@dennywangtenk

dennywangtenk commented Jun 23, 2019

@archenroot, I got a similar error; it seems that optuna and baselines' SubprocVecEnv both use multiprocessing, which causes some conflicts. Setting n_jobs = 1 forces optuna to run sequentially.
We may need to check the SubprocVecEnv source code to see whether it's thread safe.

@Ruben-E

Ruben-E commented Jun 27, 2019

@dennywangtenk does it actually make sense to set n_jobs to 1 and switch to SubprocVecEnv? It looks like all sub-environment processes use the same parameters for each trial when doing that. The goal of the optimize step is to find the optimal parameters and to test with them as much as possible. Am I correct?

@TheDoctorAI

TheDoctorAI commented Jul 1, 2019

I am now at "Training for: 13144 time steps" and my Titan GPU is still idling.

Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)
Projects: None yet
Development: No branches or pull requests
8 participants