
Collect timing information by default? #273

Closed
johnomotani opened this issue Oct 3, 2024 · 13 comments · Fixed by #276

@johnomotani
Collaborator

Proposal: collect and print some high-level timing data by default. TimerOutputs.jl (which we've used a little bit, but not routinely) has some nice features, like being able to annotate function definitions to gather timing data each time that function is called.
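
For concreteness, here is a minimal sketch of the TimerOutputs.jl usage being proposed, assuming a standalone TimerOutput rather than whatever global timer moment_kinetics would actually use (the function name is invented for illustration):

using TimerOutputs

const to = TimerOutput()

# Annotating a function definition records timing and allocation data for every call,
# under a section named after the function.
@timeit to function example_advection_step!(f)
    sleep(0.001)   # stand-in for real work
    return f
end

# Arbitrary blocks can also be timed with an explicit label.
@timeit to "setup" begin
    sleep(0.01)    # stand-in for setup work
end

example_advection_step!(zeros(10))
print_timer(to)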

Output from a run might look something like this:

julia> run_moment_kinetics("examples/recycling-fraction/wall-bc_recyclefraction0.5_split3_SSPRK4.toml")
Starting setup   13:04:10
beginning time advance   13:04:11
writing moments output 2     t = 0.001    nstep = 100      13:04:24    residuals:  density 1.962       density_neutral 11.14      
writing distribution functions output 2     t = 0.001    nstep = 100      13:04:24
writing moments output 3     t = 0.002    nstep = 200      13:04:37    residuals:  density 1.891       density_neutral 9.027      
writing distribution functions output 3     t = 0.002    nstep = 200      13:04:37
writing moments output 4     t = 0.003    nstep = 300      13:04:50    residuals:  density 1.81        density_neutral 9.435      
writing distribution functions output 4     t = 0.003    nstep = 300      13:04:50
writing moments output 5     t = 0.004    nstep = 400      13:05:04    residuals:  density 1.822       density_neutral 9.399      
writing distribution functions output 5     t = 0.004    nstep = 400      13:05:04
writing moments output 6     t = 0.005    nstep = 500      13:05:17    residuals:  density 1.805       density_neutral 9.449      
writing distribution functions output 6     t = 0.005    nstep = 500      13:05:17
writing moments output 7     t = 0.006    nstep = 600      13:05:30    residuals:  density 1.812       density_neutral 9.521      
writing distribution functions output 7     t = 0.006    nstep = 600      13:05:30
writing moments output 8     t = 0.007    nstep = 700      13:05:43    residuals:  density 1.789       density_neutral 9.225      
writing distribution functions output 8     t = 0.007    nstep = 700      13:05:43
writing moments output 9     t = 0.008    nstep = 800      13:05:56    residuals:  density 1.793       density_neutral 9.333      
writing distribution functions output 9     t = 0.008    nstep = 800      13:05:56
writing moments output 10    t = 0.009    nstep = 900      13:06:10    residuals:  density 1.76        density_neutral 9.758      
writing distribution functions output 10    t = 0.009    nstep = 900      13:06:10
writing moments output 11    t = 0.01     nstep = 1000     13:06:23    residuals:  density 1.749       density_neutral 8.969      
writing distribution functions output 11    t = 0.01     nstep = 1000     13:06:23
finished file io         13:06:23
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                    Time                    Allocations      
                                                           ───────────────────────   ────────────────────────
                     Tot / % measured:                           133s / 100.0%           1.99GiB / 100.0%    

 Section                                           ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────
 moment_kinetics                                        1     133s  100.0%    133s   1.99GiB  100.0%  1.99GiB
   setup_moment_kinetics                                1    335ms    0.3%   335ms   89.3MiB    4.4%  89.3MiB
   time_advance!                                        1     132s   99.7%    132s   1.90GiB   95.6%  1.90GiB
     ssp_rk!                                        1.00k     132s   99.6%   132ms   1.87GiB   93.9%  1.92MiB
       euler_time_advance!                          4.00k     104s   78.2%  25.9ms   1.66GiB   83.6%   436KiB
         vpa_advection!                             4.00k    14.1s   10.6%  3.52ms    688KiB    0.0%     176B
         z_advection!                               4.00k    16.3s   12.3%  4.07ms    688KiB    0.0%     176B
         source_terms!                              4.00k    1.30s    1.0%   325μs    688KiB    0.0%     176B
         neutral_advection_z!                       4.00k    14.3s   10.8%  3.58ms    688KiB    0.0%     176B
         neutral_advection_vz!                      4.00k    13.5s   10.2%  3.37ms    688KiB    0.0%     176B
         source_terms_neutral!                      4.00k    771ms    0.6%   193μs    688KiB    0.0%     176B
         ion_charge_exchange_collisions_1V!         4.00k    12.7s    9.6%  3.18ms    565MiB   27.7%   145KiB
         neutral_charge_exchange_collisions_1V!     4.00k    9.49s    7.2%  2.37ms    565MiB   27.7%   145KiB
         ion_ionization_collisions_1V!              4.00k    12.7s    9.5%  3.16ms    565MiB   27.7%   145KiB
         krook_collisions!                          4.00k    3.99s    3.0%   998μs    688KiB    0.0%     176B
         external_ion_source!                       4.00k    4.41s    3.3%  1.10ms    688KiB    0.0%     176B
       apply_all_bcs_constraints_update_moments!    4.00k    22.7s   17.1%  5.67ms    201MiB    9.9%  51.5KiB
     write_data_to_ascii                               10   3.35μs    0.0%   335ns     0.00B    0.0%    0.00B
     write_all_moments_data_to_binary                  10   68.3ms    0.1%  6.83ms   5.20MiB    0.3%   532KiB
     write_all_dfns_data_to_binary                     10    116ms    0.1%  11.6ms   28.2MiB    1.4%  2.82MiB
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────

To get that output I've filtered out some function calls that took a very short amount of time, to avoid cluttering up the screen.

Questions:

  • would people find this useful or just annoying?
  • should it be something that we turn on optionally, or just printed by default?
  • do we want more detailed timing, e.g. include lower level functions? If we go too far with this, the overhead of doing timing might become noticeable though...
  • Is the output too long? We could exclude everything that takes less than 1% of the total run time, for example.

With the current setup, the extra cost of timing is not noticeable. I did an @benchmark on a run that takes about 13.5s, and I couldn't see any difference with/without timing, on a sample of 18 runs of each.
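
A comparison along those lines could be set up with BenchmarkTools.jl; a minimal sketch, assuming hypothetical input files where one of them disables the timers:

using BenchmarkTools
using moment_kinetics: run_moment_kinetics

# Placeholder input files: one assumed to enable the timers, one to disable them.
b_timed   = @benchmark run_moment_kinetics("short_case_with_timers.toml") samples=18 seconds=600
b_untimed = @benchmark run_moment_kinetics("short_case_without_timers.toml") samples=18 seconds=600

# Compare the median run times of the two samples.
judge(median(b_timed), median(b_untimed))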

There is an option for 'debug timers', which use a macro that is removed by the compiler unless 'debug timing' is explicitly activated, so 'debug timers' have zero overhead for production runs. I'll add some of these to (optionally) time some MPI calls and LU solves.
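
TimerOutputs.jl itself provides this through @timeit_debug, which compiles to a no-op until debug timings are enabled for the enclosing module; the moment_kinetics wrapper may look different, so treat this as a sketch:

using TimerOutputs

const to = TimerOutput()

# Zero overhead unless debug timings are enabled for this module.
function communicate!()
    @timeit_debug to "mpi_allreduce" begin
        # ... MPI call or LU solve would go here ...
    end
end

# Off by default; switching it on forces recompilation of the annotated methods.
TimerOutputs.enable_debug_timings(@__MODULE__)
communicate!()
print_timer(to)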

If we add this as a feature, I'll probably try to write out the timing information for each MPI process into the output file in case we want to look at it later. That will require a bit of work to gather all the timing info and write it in a sensible format to HDF5.
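
One low-effort version of that (a sketch only, not the planned implementation; the per-rank file name and layout are assumptions) would be to render each rank's timer to a string and store it as an HDF5 dataset:

using MPI, HDF5, TimerOutputs

MPI.Initialized() || MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

global_timer = TimerOutput()   # stands in for the timer defined in moment_kinetics' timer_utils

# Render this rank's timing table to a string.
timer_string = sprint(print_timer, global_timer)

# One file per rank keeps the sketch simple and avoids collective-I/O details.
h5open("timings_rank$(rank).h5", "w") do f
    f["timer_output"] = timer_string
end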

The motivation for this was that, in preparation for trying to parallelise the kinetic electron solver a bit better, I wanted to be able to profile the code. Using Julia's @profile with MPI was giving me strange results - the run with profiling was a lot slower than a normal run, with an unexpectedly large amount of time spent in MPI calls. I think I've used @profile before without this problem, so I'm not sure what's going on, but in thinking about a workaround it seemed like collecting some timing data might be generically useful.
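
For context, profiling under MPI typically means dumping a separate @profile report per rank, roughly like this sketch (input and output file names are placeholders):

using MPI, Profile
using moment_kinetics: run_moment_kinetics

MPI.Initialized() || MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# Collect profile samples for the run, then write a flat report for this rank.
@profile run_moment_kinetics("some_input.toml")
open("profile_rank$(rank).txt", "w") do io
    Profile.print(io; format=:flat, sortedby=:count)
end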

@mrhardman
Collaborator

mrhardman commented Oct 3, 2024

Questions:

  • would people find this useful or just annoying?

This looks very useful for expert users. However, I don't think this information is useful for a user who isn't also developing -- unless we at some point get to problems where specific choices of processor count are faster than others because of different splits. I don't think we are there yet, because the distributed-memory MPI is only for r and z and only one set of magic numbers is permitted.

  • should it be something that we turn on optionally, or just printed by default?

I would go for printing this optionally.

  • do we want more detailed timing, e.g. include lower level functions? If we go too far with this, the overhead of doing timing might become noticeable though...

I would not do this by default, and I would also be concerned that the timer would slow down the code.

  • Is the output too long? We could exclude everything that takes less than 1% of the total run time, for example.

The present output looks acceptable. If data is removed, it is perhaps worth noting that anything <1% is hidden.

@LucasMontoya4
Collaborator

I agree with Michael. I do really like the idea of having this as an option (though I suppose the default should be that it is turned off). I sometimes worry about whether I've written an inefficient function or caused a memory leak, and this would be a very quick way of checking that compared to using @profile!

@johnomotani
Collaborator Author

There is the point of view that at the moment all users are developers, so we could print by default for now, and make it optional at some point in the future. If you are a developer and regularly changing code, the timings are a useful sanity check that you didn't do something awful!

@mrhardman
Collaborator

That depends on how difficult it is to make it optional. If you really want this as a default feature, it would be nice to have an option to turn off this output so that any future changes are limited to input option defaults.

@johnomotani
Collaborator Author

That depends on how difficult it is to make it optional. If you really want this as a default feature, it would be nice to have an option to turn off this output so that any future changes are limited to input option defaults.

That would be an easy and sensible compromise. I'd probably stick an extra flag in the [output] section.
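
Something like the following, say (the flag name is purely illustrative and not necessarily what the eventual option is called):

[output]
# illustrative name only
display_timing_info = true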

johnomotani linked a pull request Oct 8, 2024 that will close this issue
@mrhardman
Collaborator

Not sure where best to record this question, so reopening this issue:

Why does the timer present strange information about the total runtime in the two examples below? Specifically, why is the 'Tot / % measured' field seemingly off from 100% by an undetermined prefactor? Everything from the moment_kinetics row down seems consistent with the physical time of the run, but the header information does not seem clear.

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                         Time                    Allocations
                                                                ───────────────────────   ────────────────────────
                       Tot / % measured:                             17.1s / 875.4%           5.92GiB /  21.9%

Section                                                 ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
moment_kinetics                                              1     149s  100.0%    149s   1.29GiB  100.0%  1.29GiB
────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                   Time                    Allocations
                                                          ───────────────────────   ────────────────────────
                    Tot / % measured:                        5124096h /   0.0%          5.67GiB /  18.4%

Section                                           ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────────────────────────────
moment_kinetics                                        1    22.2s  100.0%   22.2s   1.04GiB  100.0%  1.04GiB

mrhardman reopened this Nov 27, 2024
@johnomotani
Collaborator Author

Don't know, don't think I've ever seen that...

@mrhardman
Collaborator

This example was made with fokker-planck-1D2V/fokker-planck-1D2V-even_nz-shorttest-nstep200.toml. It is not a problem specific to Fokker-Planck; I could get the same behaviour if I switched to numerical diffusion. It can be ignored, since some of the data looks correct, but it is very unsettling.

johnomotani added the enhancement label Nov 29, 2024
@johnomotani
Collaborator Author

I cannot reproduce this problem (on my Linux laptop) using the current master branch:

$ bin/julia --project -O3 --check-bounds=no run_moment_kinetics.jl examples/fokker-planck-1D2V/fokker-planck-1D2V-even_nz-shorttest-nstep200.toml
  Activating project at `~/physics/moment_kinetics-master-clean`
Starting setup   14:04:56
setting up GL quadrature   14:06:10
beginning (boundary) weights calculation   14:06:10
finished (boundary) weights calculation   14:06:12
begin elliptic operator assignment   14:06:12
finished elliptic operator constructor assignment   14:06:12
finished LU decomposition initialisation   14:06:12
finished YY array calculation   14:06:12
beginning time advance   14:06:53
writing moments output 2     t = 0.001    nstep = 10       14:07:14
writing distribution functions output 2     t = 0.001    nstep = 10       14:07:14
writing moments output 3     t = 0.002    nstep = 20       14:07:35
writing distribution functions output 3     t = 0.002    nstep = 20       14:07:35
writing moments output 4     t = 0.003    nstep = 30       14:07:56
writing distribution functions output 4     t = 0.003    nstep = 30       14:07:56
writing moments output 5     t = 0.004    nstep = 40       14:08:17
writing distribution functions output 5     t = 0.004    nstep = 40       14:08:17
writing moments output 6     t = 0.005    nstep = 50       14:08:39
writing distribution functions output 6     t = 0.005    nstep = 50       14:08:39
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                         Time                    Allocations      
                                                                ───────────────────────   ────────────────────────
                       Tot / % measured:                             2706s /   8.2%           16.6GiB /  35.5%    

Section                                                 ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
moment_kinetics                                              1     223s  100.0%    223s   5.89GiB  100.0%  5.89GiB
  setup_moment_kinetics                                      1    88.8s   39.9%   88.8s   4.71GiB   80.1%  4.71GiB
  time_advance! step                                        50     106s   47.6%   2.12s    230MiB    3.8%  4.59MiB
    ssp_rk!                                                 50     106s   47.4%   2.11s    209MiB    3.5%  4.19MiB
      euler_time_advance!                                  200     105s   47.2%   526ms    207MiB    3.4%  1.04MiB
        vpa_advection!                                     200    278ms    0.1%  1.39ms   34.4KiB    0.0%     176B
        z_advection!                                       200    582ms    0.3%  2.91ms   34.4KiB    0.0%     176B
        vperp_advection!                                   200    559ms    0.3%  2.80ms   34.4KiB    0.0%     176B
        explicit_fokker_planck_collisions_weak_form!       200     104s   46.5%   518ms    207MiB    3.4%  1.04MiB
          fokker_planck_collision_operator_weak_form!    13.0k     103s   46.2%  7.91ms   66.7MiB    1.1%  5.26KiB
      apply_all_bcs_constraints_update_moments!            200    286ms    0.1%  1.43ms    634KiB    0.0%  3.17KiB
    write_all_moments_data_to_binary                         5    268ms    0.1%  53.6ms   13.4MiB    0.2%  2.68MiB
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
finished file io         14:08:39

@mrhardman
Collaborator

Have you tried running the same experiments with multiple cores?

Can you explain why we output data

  Tot / % measured:                             2706s /   8.2% 

when the actual wallclock time is closer to

moment_kinetics                                              1     223s  100.0%    223s   5.89GiB  100.0%  5.89GiB

?

@johnomotani
Collaborator Author

I think I might know the cause of this now. global_timer is a const global variable created in the timer_utils module. This probably means that the timer being quoted in the top line gets 'started' when the module is precompiled, and so the times reported are junk. For unrelated reasons, I decided it would be a good idea to reset the timers at the start of every run (see cc9f967 which will be included in #297). If my guess is right, that would fix the problem.
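
For reference, the reset is just TimerOutputs' reset_timer!, which should zero anything accumulated during precompilation; a minimal sketch, with global_timer standing in for the const global in timer_utils:

using TimerOutputs

global_timer = TimerOutput()   # stands in for the const global defined in timer_utils

# Called at the start of each run, so the 'Tot / % measured' header only covers the current run.
reset_timer!(global_timer)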

@mrhardman
Collaborator

I have another question, about the memory usage or allocations reporting. Can this information be used to estimate the maximum memory required for a job? It looks like it measures something else -- if arrays are allocated and deallocated often, does this measure only the allocations, without accounting for the fact that only some arrays are persistent and remain allocated between timesteps?

@johnomotani
Collaborator Author

I have another question, about the memory usage or allocations reporting. Can this information be used to estimate the maximum memory required for a job? It looks like it measures something else -- if arrays are allocated and deallocated often, does this measure only the allocations, without accounting for the fact that only some arrays are persistent and remain allocated between timesteps?

I think it just measures the allocations, and ignores deallocations (deallocations would be difficult to measure because they would only happen during garbage collection...). So nothing to do with maximum memory usage.

Even current memory usage is surprisingly hard to measure - I tried at one point and found some options (which I've now forgotten), but as far as I remember none of them seemed very accurate/reliable.

I think a very long time ago, for a different project, I looked for command-line tools that would measure memory usage and found some reasonable options. The only notes I can find are for looking at memory usage of Python code, but they use OS utilities, so there are probably command-line tools to measure the memory used by any process.
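
For anyone who wants a rough check from within Julia, a couple of OS-level options (not tied to moment_kinetics, and only approximate):

# Peak resident set size of the current Julia process, in bytes.
println("max RSS: ", Sys.maxrss(), " bytes")

# Current resident set size via the OS `ps` utility (reported in kB on Linux).
println("current RSS: ", readchomp(`ps -o rss= -p $(getpid())`), " kB")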
