This document explains how to create performance profiles, i.e., measurements of where in the code time is spent, usually aggregated by function. There are a variety of open-source and proprietary tools available. Generally, there are three types of profilers:
- instrumenting: modifies the compiled code to include measurement code. Advantage: exact function call counts and times, no approximations. Disadvantage: the instrumentation adds runtime overhead that can distort the result or disable compiler optimizations (e.g., functions aren't inlined), so short functions usually must be excluded.
- sampling: checks the CPU's stack and instruction pointer at regular intervals to see which code is currently being executed. Advantage: usually low measurement overhead at runtime. Disadvantage: results are stochastic and imprecise, especially for short functions; longer runtimes improve the averages.
- emulating: runs the program in a virtual environment. Advantage: exact, and able to measure even small functions. Disadvantage: the program can run significantly longer, and there are no real-world timing results, though instruction and cycle counts generally translate directly.
These tools have been tried for memilio and are described below:
- gperftools: sampling profiler that gives reliable results with little setup
- Score-P: large suite that supports both sampling and instrumenting. Special support for parallel code (OpenMP and MPI), including tracing to find load imbalances in the parallelization.
- Valgrind: suite that includes the Callgrind emulating profiler; a detailed description will be added soon, until then see the official documentation and the basic invocation sketched below.
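A minimal Callgrind session looks like this (a sketch; `<exe> <args>` stand for your benchmark binary and its arguments, and `<pid>` is the process id that Callgrind appends to the output file name):

```bash
valgrind --tool=callgrind <exe> <args> # emulated run, writes callgrind.out.<pid>
callgrind_annotate callgrind.out.<pid> # per-function instruction counts on the terminal
```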
Regardless of the tool used, the general optimization loop is:
- Run a benchmark to get a baseline runtime and identify the need for performance optimization, e.g., a change introduced a performance regression.
- Create a profile to see where time is lost; perhaps compare it with a profile from before your changes.
- Optimize the code.
- Test the functionality.
- Confirm the optimization with a benchmark and profile.
- Repeat if necessary.
gperftools (formerly Google Performance Tools) is a suite of performance tools. Here the focus is on the profiler, a small sampling profiler that is quick to set up and use on most systems. It's integrated into our build system for even greater convenience.
Basic steps:
1. Install gperftools. It's available in many package managers (apt packages `google-perftools` and `libgoogle-perftools-dev`, spack or homebrew package `gperftools`). E.g., on the DLR-SC hpda cluster (insert the compiler version you are using): `module load spack-user; module load PrgEnv/gcc13-openmpi; spack install gperftools%gcc@<version>; module load gperftools`.
2. Configure memilio with the cmake variable `MEMILIO_ENABLE_PROFILING=ON`. Note that compiling with profiling enabled does not incur any runtime overhead unless profiling is also enabled at runtime (see step 4), so developers can simply leave it enabled for convenience.
3. Compile memilio.
4. Run the program with the environment variable `CPUPROFILE=profile.out` set.
5. Generate a human-readable annotated call graph: `pprof-symbolize --pdf <exe> profile.out > profile.pdf`. Check the documentation for other output formats.
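Putting the steps together, a typical session could look like this (a sketch: the build directory layout and the `abm_simulation` arguments are assumptions, adapt them to your setup):

```bash
mkdir -p build && cd build
cmake .. -DMEMILIO_ENABLE_PROFILING=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build . -j --target abm_simulation
CPUPROFILE=profile.out ./bin/abm_simulation 1 results # run with profiling enabled at runtime
pprof-symbolize --pdf ./bin/abm_simulation profile.out > profile.pdf # annotated call graph
```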
This generates a profile of the whole program. For each function, the graph contains the number of times the function was sampled and how much time is spent in the function as a percentage of the total runtime. Both sample count and percentage are displayed as exclusive numbers (i.e. the function without child functions) and inclusive numbers (the function including its child functions).
Probably not all parts of the program are interesting to profile. It's possible to filter the profile, e.g., to look at specific functions, using the `--focus` flag of `pprof-symbolize`, or by starting the profiling manually using the `MEMILIO_PROFILER_START(<filename>)` and `MEMILIO_PROFILER_STOP` macros from `memilio/utils/profiler.h`. E.g., when profiling the `abm_simulation` program, it may be a good idea to exclude pre- and postprocessing by starting profiling right before the call to `Simulation::advance`, as sketched below.
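A minimal sketch of the manual approach (only the macros come from `memilio/utils/profiler.h`; the surrounding simulation code is schematic and the output file name is arbitrary):

```cpp
#include "memilio/utils/profiler.h" // MEMILIO_PROFILER_START / MEMILIO_PROFILER_STOP

// ... set up the simulation as usual; this preprocessing is not profiled ...

MEMILIO_PROFILER_START("abm_advance.prof"); // start sampling, write to the given file
sim.advance(tmax, history);                 // only the time stepping is measured
MEMILIO_PROFILER_STOP;                      // postprocessing below is excluded again
```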
Score-P is a tool that can be used to measure the performance of parallel code and supports profiling and tracing. For information on how to get it, please have a look at the Score-P quickstart. It can be installed with the spack manager (and probably many other package managers) or loaded as a module on hpc systems. For the visualization of the performance data you need Cube (for installation information, see here). It can load both Score-P profiles and Scalasca traces; you can find details about loading and usage below or in the User Guide.
The following explains how to use Score-P with GCC; for other compilers this might look different or not work correctly. Once you have Score-P, you can build the abm with:
```bash
export SCOREP_WRAPPER=off # disable scorep during cmake configure
cmake .. -DMEMILIO_ENABLE_OPENMP=ON -DCMAKE_CXX_COMPILER=${Path to scorep}/scorep-mpic++ -DCMAKE_C_COMPILER=${Path to scorep}/scorep-mpicc -DCMAKE_BUILD_TYPE=RelWithDebInfo -DMEMILIO_ENABLE_WARNINGS_AS_ERRORS=OFF
export SCOREP_WRAPPER=on # re-enable scorep for the build
export SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--pomp --thread=omp --mpp=none --instrument-filter=${Path to filter}/scorep-filter-abm"
cmake --build . -j --target abm_simulation
```
It is useful to select the build target with `--target`, because Score-P has difficulties with some of the other targets.
The filter file is designed to include only the interesting information about the abm. It measures only the basic framework of the abm, that is, only the simulation part up to the loop over time. Measuring small and frequently called functions increases the overhead of the profiling; this mainly affects the runtime and makes the profile unreliable, since the time measured by the profiler is no longer the time the functions would take in normal operation. To prevent this, those functions are excluded by the current filter, and the profile may give a hint on which part of the abm you should investigate further. To measure additional functions, add `INCLUDE *Class::myfunction*` to the filter, as in the example below.
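For illustration, a hypothetical filter in Score-P's filter-file syntax (the region names here are placeholders, not the actual contents of `scorep-filter-abm`):

```
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE *
  INCLUDE *mio::abm::Simulation::advance*
  INCLUDE *mio::abm::World::evolve*
  INCLUDE *Class::myfunction*
SCOREP_REGION_NAMES_END
```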
To avoid the build errors about "include style" and "gcc extension", set `-DMEMILIO_ENABLE_WARNINGS_AS_ERRORS=OFF` during cmake configuration.
To run the basic analysis with profiling:
```bash
export OMP_NUM_THREADS=8 # or whatever
./bin/abm_simulation 1 <some result dir>
```
This gives a summary of how much time was spent in each function. There will be a directory named `scorep_...` containing a `.cubex` file, which can be opened in the Cube visualization tool with `cube scorep-folder/profile.cubex`. This can give a first indication of bottlenecks, and for that you don't need Scalasca.
Also, you can use `scorep-score -r scorep-folder/profile.cubex` to get a first impression. The output for a profile without any filter looks like this (shortened to a few functions):
```
Estimated aggregate size of event trace:                   267GB
Estimated requirements for largest trace buffer (max_buf): 267GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       267GB
(warning: The memory requirements cannot be satisfied by Score-P to avoid
 intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
 maximum supported memory or reduce requirements using USR regions filters.)

flt  type       max_buf[B]         visits  time[s] time[%] time/visit[us]  region
     ALL   286,150,801,259 11,005,733,830  2854.35   100.0           0.26  ALL
     USR   286,146,847,402 11,005,647,977  1626.05    57.0           0.15  USR
     OMP         3,931,924         85,010   152.08     5.3        1788.98  OMP
     COM            21,892            842  1076.21    37.7     1278156.85  COM
  SCOREP                41              1     0.01     0.0        5326.90  SCOREP

     USR   252,154,970,496  9,698,268,096  1292.41    45.3           0.13  mio::abm::Location& mio::abm::Person::get_location()
     USR     7,055,253,036    271,355,886    40.67     1.4           0.15  void Eigen::internal::ignore_unused_variable(const T&) [with T = long int]
     USR     3,551,985,216    136,614,816    17.11     0.6           0.13  mio::abm::InfectionState mio::abm::Person::get_infection_state(mio::abm::TimePoint) const
     USR     2,106,780,026     81,030,001    23.04     0.8           0.28  size_t mio::flatten_index(const MultiIndex&, const MultiIndex&) [with MultiIndex = mio::Index<mio::abm::VirusVariant, mio::AgeGroup>]
     ...
```
When changing the abm, one should check how much memory is used by Score-P and make sure it does not exceed 4 GB, which is the maximum value for `SCOREP_TOTAL_MEMORY`. Also, when tracing, `SCOREP_TOTAL_MEMORY` should be adjusted to the value proposed by `scorep-score`, as shown below.
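For example, before a trace run you would set the value from the hint in the `scorep-score` output (the filtered profile below proposes 4097kB):

```bash
export SCOREP_TOTAL_MEMORY=4097kB # value proposed by scorep-score for the filtered run
```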
As you can see, the profile contains a lot of functions that are not of interest, for example small functions like `get_location()` or internal functions like `ignore_unused_variable()`. Those functions are filtered out when running with the proposed filter:
```
Estimated aggregate size of event trace:                   369kB
Estimated requirements for largest trace buffer (max_buf): 369kB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       4097kB
(hint: When tracing set SCOREP_TOTAL_MEMORY=4097kB to avoid intermediate flushes
 or reduce requirements using USR regions filters.)

flt  type  max_buf[B]  visits  time[s] time[%] time/visit[us]  region
     ALL      377,009  14,427    65.10   100.0        4512.48  ALL
     USR      281,640  11,735     0.01     0.0           0.87  USR
     OMP       75,144   1,850     7.01    10.8        3787.55  OMP
     COM       20,184     841     0.00     0.0           4.74  COM
  SCOREP           41       1    58.08    89.2    58080341.04  SCOREP

     USR       39,240   1,635     0.00     0.0           0.28  Json::ValueType Json::Value::type() const
     ...
     COM        4,032     168     0.00     0.0           0.63  void mio::abm::Simulation::evolve_world(mio::abm::TimePoint)
     COM        4,032     168     0.00     0.0           2.30  void mio::abm::World::evolve(mio::abm::TimePoint, mio::abm::TimeSpan)
     COM        4,032     168     0.00     0.0           6.92  void mio::abm::World::begin_step(mio::abm::TimePoint, mio::abm::TimeSpan)
     COM        4,032     168     0.00     0.0           5.40  void mio::abm::World::interaction(mio::abm::TimePoint, mio::abm::TimeSpan)
     COM        4,032     168     0.00     0.0           5.10  void mio::abm::World::migration(mio::abm::TimePoint, mio::abm::TimeSpan)
```
The output is much shorter, and most functions are filtered out.
If the estimated memory requirements exceed 4 GB, you need to remove the responsible function from the included functions in the filter. If you added more than one function to the filter, you can check which function is responsible for the increase in required space using the output of `scorep-score`; the functions at the top of the table are the most likely culprits. With `scorep-score -f filter -r scorep-folder/profile.cubex` you can see the effect of the adjusted filter on the profile without re-running the experiment.
However, some functions that add no overhead but are not of interest still appear in the profile. Once you have checked that the overhead is acceptable, you can also apply the filter at runtime with `export SCOREP_FILTERING_FILE=${Path to filter}/scorep-filter-abm` to get a clearer profile:
```
Estimated aggregate size of event trace:                   94kB
Estimated requirements for largest trace buffer (max_buf): 94kB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       4097kB
(hint: When tracing set SCOREP_TOTAL_MEMORY=4097kB to avoid intermediate flushes
 or reduce requirements using USR regions filters.)

flt  type  max_buf[B]  visits  time[s] time[%] time/visit[us]  region
     ALL       95,369   2,692    64.79   100.0       24066.57  ALL
     OMP       75,144   1,850     6.95    10.7        3758.25  OMP
     COM       20,184     841     0.00     0.0           4.98  COM
  SCOREP           41       1    57.83    89.3    57830258.74  SCOREP
     OMP       14,280     168     0.00     0.0           0.52  !$omp parallel @world.cpp:68
     OMP       14,280     168     0.00     0.0           0.51  !$omp parallel @world.cpp:78
     OMP       14,280     168     0.00     0.0           0.56  !$omp parallel @world.cpp:152
     OMP        4,056     169     1.78     2.8       10549.08  !$omp for @common_abm_loggers.h:175
     OMP        4,056     169     0.00     0.0           1.30  !$omp implicit barrier @common_abm_loggers.h:180
     OMP        4,032     168     1.54     2.4        9152.87  !$omp for @world.cpp:68
     OMP        4,032     168     0.00     0.0           2.76  !$omp implicit barrier @world.cpp:73
     OMP        4,032     168     1.61     2.5        9571.18  !$omp for @world.cpp:78
     OMP        4,032     168     0.00     0.0           2.75  !$omp implicit barrier @world.cpp:122
     OMP        4,032     168     2.02     3.1       12038.27  !$omp for @world.cpp:152
     OMP        4,032     168     0.00     0.0           2.89  !$omp implicit barrier @world.cpp:156
     COM        4,032     168     0.00     0.0           0.70  void mio::abm::Simulation::evolve_world(mio::abm::TimePoint)
     COM        4,032     168     0.00     0.0           2.66  void mio::abm::World::evolve(mio::abm::TimePoint, mio::abm::TimeSpan)
     COM        4,032     168     0.00     0.0           7.73  void mio::abm::World::begin_step(mio::abm::TimePoint, mio::abm::TimeSpan)
     COM        4,032     168     0.00     0.0           5.49  void mio::abm::World::interaction(mio::abm::TimePoint, mio::abm::TimeSpan)
     COM        4,032     168     0.00     0.0           5.17  void mio::abm::World::migration(mio::abm::TimePoint, mio::abm::TimeSpan)
  SCOREP           41       1    57.83    89.3    57830258.74  abm_simulation
     COM           24       1     0.00     0.0         532.58  void mio::abm::Simulation::advance(mio::abm::TimePoint, History& ...) [with History = {mio::History<mio::abm::TimeSeriesWriter, mio::abm::LogInfectionState>}]
```
For deeper analysis, use Scalasca for tracing:
```bash
export OMP_NUM_THREADS=8 # or whatever
scalasca -analyze -q -t ./bin/abm_simulation 1 <result dir> # run the program and create trace files
scalasca -examine scorep_abm_simulation_Ox8_trace # open cube to visualize the traces, this time with better information about load imbalances etc.
```
In comparison to the profiling above, tracing gives a timeline of events, which is particularly useful for parallel programs. Unfortunately, there is no open-source tool to visualize the traces, but apparently there is a plugin for Cube available (see here), and the examined profile becomes more accurate.