Likwid Bench

likwid-bench: assembly microkernel benchmark suite

Introduction

likwid-bench is a benchmarking application together with a framework to enable rapid prototyping of multi-threaded assembly kernels. Adding a new benchmark is nothing more than creating a simple text file and recompiling. The framework takes care of threaded execution and pinning, data allocation and placement, time measurement, and result presentation.

Build

The benchmark kernels of likwid-bench use x86-64 instructions. In order to build it on a 32-bit machine, you have to set the COMPILER option in config.mk to GCCX86. To build it, call:

make likwid-bench
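
For a 32-bit build, the relevant line in config.mk would thus look like this (a minimal sketch; all other options keep their defaults):

COMPILER = GCCX86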

Limitations

likwid-bench supports up to 38 streams. Also note that at the moment only plain streams are supported. This makes it impossible to emulate the behavior of multi-dimensional data structures with their spatial locality.

Usage

likwid-bench already includes a bunch of kernels out of the box. You can use it as a basic bandwidth benchmarking tool.

You can get a help message with

$ likwid-bench -h

A list with all available benchmark kernels is available when calling:

$ likwid-bench -a

You have to specify the benchmark kernel you want to use. This kernel will operate on a number of streams. Streams are one-dimensional arrays (or vectors). Let's assume you only use one workgroup (thread group): all threads of the workgroup will divide the stream into portions, and every thread will update its fraction of the total vector.

Each assembly kernel has a number of properties. These are:

  1. The number of streams
  2. The data type (DOUBLE, SINGLE, INT)
  3. The number of flops it performs in one update
  4. The number of bytes it transfers in one update
  5. The stride of one loop iteration

To output the properties of a test kernel, call likwid-bench with the -l option:

$ likwid-bench -l copy
Name: copy
Number of streams: 2
Loop stride: 8
Flops: 0
Bytes: 16
Data Type: Double precision float

You have to specify how many threads you want to use, where these threads should be placed, and how large the total data set should be. By default the memory is allocated in the same domain the threads are running in; optionally you can place the memory in another domain. All vectors are page aligned by default.

Let's try some examples to illustrate this. Get the default list of benchmark kernels (only an excerpt):

$ likwid-bench -a
clcopy
clload
clstore
copy
copy_mem
load
store
store_mem
stream
stream_mem
triad
triad_mem

In order to specify the number of threads and where these threads should be placed, we already used the term thread domain. A thread domain is a number of threads sharing a topological entity. This can be a socket, a shared cache, or a NUMA domain.

To get a list of thread domains call:

$ likwid-bench  -p
Domain 0:
        Tag S0: 0 1 2 3 4 5
Domain 1:
        Tag S1: 6 7 8 9 10 11
Domain 2:
        Tag C0: 0 1 2 3 4 5
Domain 3:
        Tag C1: 6 7 8 9 10 11
[...]

This is a machine with two sockets and six CPUs per socket. There is a shared L3 cache which is equivalent to the socket domains. There are two socket groups S0 and S1 and two cache groups C0 and C1. Depending on whether the system is UMA or NUMA, you have either one memory domain M0 covering all CPUs, or two memory domains M0 and M1 similar to S0 and S1. Starting with Intel Haswell EP there can be more memory domains than socket domains if Cluster-on-Die is enabled, which splits up a socket into two memory domains.

The simplest form to run a benchmark is:

$ likwid-bench  -t copy -w S1:100kB
Allocate: Process running on core 6 - Vector length 6400 Offset 0
Allocate: Process running on core 6 - Vector length 6400 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy
--------------------------------------------------------------------------------
Using 1 work groups
Using 6 threads
--------------------------------------------------------------------------------
Group: 0 Thread 5 Global Thread 5 running on core 11 - Vector length 1064 Offset 5320
Group: 0 Thread 2 Global Thread 2 running on core 8 - Vector length 1064 Offset 2128
Group: 0 Thread 4 Global Thread 4 running on core 10 - Vector length 1064 Offset 4256
Group: 0 Thread 0 Global Thread 0 running on core 6 - Vector length 1064 Offset 0
Group: 0 Thread 3 Global Thread 3 running on core 9 - Vector length 1064 Offset 3192
Group: 0 Thread 1 Global Thread 1 running on core 7 - Vector length 1064 Offset 1064
--------------------------------------------------------------------------------
Cycles:			3211476582
CPU Clock:		2999807232
Time:			1.070561e+00 sec
Iterations:		15852792
Iterations per thread:	2642132
Size:			99840
Size per thread:	16640
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	263790458880
MByte/s:		246403.95
Cycles per update:	0.194790
Cycles per cacheline:	1.558316
--------------------------------------------------------------------------------

This example used the copy kernel and ran it on socket 1 with all threads available there. The working set size is set to 100 kB (the unit can be either kB, KB, MB or GB). You get an output showing where the streams are placed, how the threads are pinned, and on what part of the vector each thread operates. The number of iterations is determined automatically before running the actual benchmark, so that the benchmark runs for at least one second. Beyond the runtime, the tool reports the flop rate (if applicable) and the bandwidth in MB/s. Also reported are the cycles needed to perform one update and the cycles needed to update a full cache line.
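
The derived metrics can be recomputed from the raw numbers above. Every thread runs 2642132 iterations over its share of the vectors, so in total the full 99840-byte working set is streamed 2642132 times; the copy kernel transfers 16 bytes per scalar update, and a 64-byte cache line holds 8 updates:

Data volume          = 2642132 * 99840 Byte          = 263790458880 Byte
MByte/s              = 263790458880 / 1.070561 / 1E6 = 246403.95
Updates              = 263790458880 / 16 Byte        = 16486903680
Cycles per update    = 3211476582 / 16486903680      = 0.194790
Cycles per cacheline = 8 * 0.194790                  = 1.558316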

NOTICE: The cycles are RDTSC cycles. On modern processors with Turbo mode the RDTSC clock is invariant, which means it can differ from the actual core clock. To make the cycle metrics meaningful, you have to fix the frequency to the nominal frequency.
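
How to fix the frequency depends on your system. On Linux, one way is the cpupower tool (a sketch, assuming the userspace governor is available and a nominal frequency of 3.0 GHz as on the machine above):

$ cpupower frequency-set -f 3.0GHz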

Let's try a single-threaded example with the stream benchmark in the L1 cache:

$ likwid-bench  -t stream -w S1:20kB:1
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on core 6 - Vector length 848 Offset 0
--------------------------------------------------------------------------------
Cycles:			1849715976
CPU Clock:		2999786262
Time:			6.166159e-01 sec
Iterations:		3408209
Iterations per thread:	3408209
Size:			19968
Size per thread:	19968
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	68055117312
MByte/s:		110368.73
Cycles per update:	0.434875
Cycles per cacheline:	3.478998
--------------------------------------------------------------------------------

A workgroup is specified as <domain>:<size>:<nrThreads>. The number of threads is optional. An efficient barrier based on a spin-waiting loop is employed to keep the overhead low.
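
For example, you can use only three of the six threads in socket domain S0, or bind the threads to the first memory domain of a NUMA system (domain tags as listed by -p):

$ likwid-bench -t copy -w S0:100kB:3
$ likwid-bench -t copy -w M0:100kB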

The following example will do the same as above, but this time with two workgroups on two different sockets. Notice that the memory is also placed on each of the sockets according to the workgroups.

$ likwid-bench  -t stream -w S1:20kB:1 -w S0:20kB:1
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 0 - Vector length 853 Offset 0
Allocate: Process running on core 0 - Vector length 853 Offset 0
Allocate: Process running on core 0 - Vector length 853 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 2 work groups
Using 2 threads
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 848 Offset 0
Group: 1 Thread 0 Global Thread 1 running on core 6 - Vector length 848 Offset 0
--------------------------------------------------------------------------------
Cycles:			2348851791
CPU Clock:		2999805430
Time:			7.830014e-01 sec
Iterations:		5283062
Iterations per thread:	2641531
Size:			39936
Size per thread:	19968
Number of Flops:	8791015168
MFlops/s:		11227.33
Data volume (Byte):	105492182016
MByte/s:		134727.96
Cycles per update:	0.534376
Cycles per cacheline:	4.275004
--------------------------------------------------------------------------------

There is also the possibility to specify in more detail how the memory should be allocated and placed. By default every stream is allocated page aligned in the same domain the threads run in. You can change this with the following optional arguments:

$ likwid-bench  -t copy -w S1:1GB:2-0:S0,1:S0  -w S0:1GB:2
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy
--------------------------------------------------------------------------------
Using 2 work groups
Using 4 threads
--------------------------------------------------------------------------------
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 33554432 Offset 33554432
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 33554432 Offset 0
Group: 1 Thread 1 Global Thread 3 running on core 7 - Vector length 33554432 Offset 33554432
Group: 1 Thread 0 Global Thread 2 running on core 6 - Vector length 33554432 Offset 0
--------------------------------------------------------------------------------
Cycles:			6853695501
CPU Clock:		2999786982
Time:			2.284727e+00 sec
Iterations:		76
Iterations per thread:	19
Size:			2000000000
Size per thread:	500000000
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	38000000000
MByte/s:		16632.18
Cycles per update:	2.885767
Cycles per cacheline:	23.086132
--------------------------------------------------------------------------------

This example runs the copy kernel with two threads per socket but overrides the default setting by placing all vectors in the socket 0 domain. Notice that you either specify no stream arguments or all stream arguments: if your kernel operates on two streams, you have to specify both streams in the optional memory arguments. The syntax is <domain>:<size>:<nrThreads>-<streamId>:<domain>:<offset>,<streamId>:<domain>:<offset>,... where the offset is optional. You can offset an array by a multiple of the data type's size: if the kernel operates on doubles, you can offset the array by a multiple of sizeof(double). Notice that this offset is also checked against the stride of the loop. Offsetting can be advantageous if you suspect cache associativity problems.
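
A call with offsets might look like this (a sketch, assuming the offset is the optional third field of each stream specification as in the syntax above; here both streams are placed in the S0 domain and shifted by 8 doubles):

$ likwid-bench -t copy -w S1:1GB:2-0:S0:8,1:S0:8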

Of course this is not too smart on a NUMA machine. If you place the data correctly, use all threads, and employ a kernel with non-temporal stores, you get the peak memory bandwidth of this system:

$ likwid-bench  -t copy_mem -w S1:1GB  -w S0:1GB  
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy_mem
--------------------------------------------------------------------------------
Using 2 work groups
Using 12 threads
--------------------------------------------------------------------------------
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 11184808 Offset 11184808
Group: 1 Thread 2 Global Thread 8 running on core 8 - Vector length 11184808 Offset 22369616
Group: 1 Thread 0 Global Thread 6 running on core 6 - Vector length 11184808 Offset 0
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 11184808 Offset 0
Group: 1 Thread 4 Global Thread 10 running on core 10 - Vector length 11184808 Offset 44739232
Group: 1 Thread 1 Global Thread 7 running on core 7 - Vector length 11184808 Offset 11184808
Group: 1 Thread 5 Global Thread 11 running on core 11 - Vector length 11184808 Offset 55924040
Group: 0 Thread 5 Global Thread 5 running on core 5 - Vector length 11184808 Offset 55924040
Group: 1 Thread 3 Global Thread 9 running on core 9 - Vector length 11184808 Offset 33554424
Group: 0 Thread 2 Global Thread 2 running on core 2 - Vector length 11184808 Offset 22369616
Group: 0 Thread 4 Global Thread 4 running on core 4 - Vector length 11184808 Offset 44739232
Group: 0 Thread 3 Global Thread 3 running on core 3 - Vector length 11184808 Offset 33554424
Cycles: 15602213015 
Iterations: 100 
Size: 67108864 
Vectorlength: 11184808 
Time: 5.318840e+00 sec
MFlops/s:       0.00
MByte/s:        40375.03
Cycles per update:      13.949469
Cycles per cacheline:   111.595750
--------------------------------------------------------------------------------

Default benchmarks

likwid-bench already contains a number of basic benchmark kernels you can use out of the box.

These are:

  • copy Standard memcpy benchmark. A[i] = B[i]
  • copy_mem The same as above but with non-temporal stores.
  • load One load stream. This one does some software prefetching you can experiment with.
  • store One store stream.
  • store_mem The same as above but with non-temporal stores.
  • stream Classical STREAM triad. A[i] = B[i] + a * C[i]
  • stream_mem The same as above but with non-temporal stores.
  • triad Full vector triad. A[i] = B[i] + C[i] * D[i]
  • triad_mem The same as above but with non-temporal stores.

Apart from these standard benchmarks there are special cache line versions of the basic data operations load, store and copy. These versions only execute one operation per cache line, which reduces the runtime as far as possible to the time needed for the data transfers inside the memory hierarchy. Use these benchmarks to measure the raw bandwidth of different memory levels.

  • clcopy
  • clload
  • clstore
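
For example, to measure the raw load bandwidth of a single core's L1 cache, run clload with a working set small enough to fit into L1 (16 kB here; adjust to your cache sizes):

$ likwid-bench -t clload -w S0:16kB:1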

Using likwid-bench together with likwid-perfctr

To measure hardware performance counter events, likwid-bench can be built with instrumentation based on the LIKWID Marker API, which allows measuring additional events.

To build likwid-bench for use with likwid-perfctr, set the following switch in config.mk to true:

INSTRUMENT_BENCH = true

Call make distclean and rebuild. Now you can use both tools together. Of course you still have to specify the cores you want to measure explicitly with -c in likwid-perfctr. To indicate that it was built with instrumentation, likwid-bench outputs a message like Have you set -m for likwid-perfctr when running stand-alone. If wrapped by likwid-perfctr, there is a message like Using likwid.
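
If you want likwid-perfctr to restrict counting to the instrumented benchmark kernel, pass its -m switch in addition (a sketch; the example below runs without it):

$ likwid-perfctr -c 0,1,6,7 -m -g L3CACHE likwid-bench -t copy_mem -w S1:1GB:2 -w S0:1GB:2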

The following measurement shows a multi-socket Uncore measurement for the L3 cache with 2 threads running on each socket:

$ likwid-perfctr -c 0,1,6,7 -g L3CACHE likwid-bench -t copy_mem -w S1:1GB:2 -w S0:1GB:2
--------------------------------------------------------------------------------
CPU type:       Intel Core Westmere processor 
CPU clock:      2.93 GHz 
--------------------------------------------------------------------------------
Measuring group L3CACHE
--------------------------------------------------------------------------------
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Using likwid
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy_mem
--------------------------------------------------------------------------------
Using 2 work groups
Using 4 threads
--------------------------------------------------------------------------------
Group: 1 Thread 1 Global Thread 3 running on core 7 - Vector length 33554432 Offset 33554432
Group: 1 Thread 0 Global Thread 2 running on core 6 - Vector length 33554432 Offset 0
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 33554432 Offset 33554432
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 33554432 Offset 0
--------------------------------------------------------------------------------
Cycles:			5056904547
CPU Clock:		2999809184
Time:			1.685742e+00 sec
Iterations:		68
Iterations per thread:	17
Size:			2000000000
Size per thread:	500000000
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	34000000000
MByte/s:		20169.16
Cycles per update:	2.379720
Cycles per cacheline:	19.037758
--------------------------------------------------------------------------------
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 0    |   core 1    |   core 6    |   core 7    |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 5.40253e+09 | 5.45509e+09 | 4.1945e+09  | 4.19677e+09 |
| CPU_CLK_UNHALTED_CORE | 3.33546e+10 | 3.35159e+10 | 3.35228e+10 | 3.35342e+10 |
|    UNC_L3_HITS_ANY    | 3.82275e+07 |      0      | 3.86728e+07 |      0      |
|    UNC_L3_MISS_ANY    |  1.716e+09  |      0      | 1.71604e+09 |      0      |
|  UNC_L3_LINES_IN_ANY  | 8.47763e+08 |      0      | 8.46451e+08 |      0      |
| UNC_L3_LINES_OUT_ANY  | 8.47762e+08 |      0      | 8.46451e+08 |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-----------------+------------+--------+------------+--------+
|     Metric      |   core 0   | core 1 |   core 6   | core 7 |
+-----------------+------------+--------+------------+--------+
| L3 request rate | 0.00707585 |   0    | 0.00921989 |   0    |
|  L3 miss rate   |  0.31763   |   0    |  0.409117  |   0    |
|  L3 miss ratio  |  44.8893   |   0    |  44.3733   |   0    |
+-----------------+------------+--------+------------+--------+

For Uncore events likwid-perfctr will only measure on one core per socket.

Adding benchmarks

To add new benchmarks you have to create test files in the directory <LIKWID_SRC>/bench/<ARCH>. The file must have the ending .ptt. Let's look at a copy benchmark, bench/x86-64/copy.ptt. The benchmark will later be named after the file.

STREAMS 2
TYPE DOUBLE
FLOPS 0
BYTES 16
LOOP 8
movaps    FPR1, [STR0 + GPR1 * 8]
movaps    FPR2, [STR0 + GPR1 * 8 + 16]
movaps    FPR3, [STR0 + GPR1 * 8 + 32]
movaps    FPR4, [STR0 + GPR1 * 8 + 48]
movaps    [STR1 + GPR1 * 8], FPR1
movaps    [STR1 + GPR1 * 8 + 16], FPR2
movaps    [STR1 + GPR1 * 8 + 32], FPR3
movaps    [STR1 + GPR1 * 8 + 48], FPR4

The file consists of a header section and the actual loop kernel. The following header tags must be present (the order is arbitrary):

  • STREAMS: The number of streams the benchmark needs.
  • TYPE: Can be one of DOUBLE, SINGLE or INT.
  • FLOPS: How many flops the kernel executes for one scalar update.
  • BYTES: How many bytes need to be transferred per scalar update.

Everything else before the LOOP tag is taken as instruction code and placed before the actual loop code. Everything after LOOP is placed inside the loop kernel. The argument behind LOOP indicates the stride of the loop, i.e. how many updates are performed in one loop iteration.

The BYTES parameter defines the number of bytes needed to perform a single scalar update operation. If you look at the C code of a copy benchmark, it would look like this:

/* STR0 and STR1 point to the two streams, size is the number of scalar updates */
for (int GPR1 = 0; GPR1 < size; ++GPR1) {
    STR1[GPR1] = STR0[GPR1];
}

or, in a more low-level approach, using a floating-point register and accessing the data by dereferencing the pointers:

register double FPR1;
/* GPR1_8 mimics the scaled index GPR1 * 8 from the assembly; C pointer
   arithmetic scales by sizeof(double) automatically */
for (int GPR1_8 = 0; GPR1_8 < size; ++GPR1_8) {
    FPR1 = *(STR0 + GPR1_8);   /* load from stream 0 */
    *(STR1 + GPR1_8) = FPR1;   /* store to stream 1 */
}

In each iteration two double-precision values are handled: one is loaded and the other one is stored. With a size of 8 bytes per double-precision value, this results in 16 bytes per scalar update. In the high-level assembly language of the ptt files, one scalar update operation is:

movaps    FPR1, [STR0 + GPR1 * 8]
movaps    [STR1 + GPR1 * 8], FPR1

Don't get confused by unrolled loops in the ptt files: the BYTES as well as the FLOPS entry specify the number of bytes and flops, respectively, for a single, non-unrolled update.

The instruction code must be in Intel syntax, hence the source is the right argument and the destination the left one.

You can write plain x86-64 instruction code, but LIKWID provides some predefined labels to ease your job. The following list introduces all labels:

  • GPR1 - GPR16 : General-purpose registers
  • FPR1 - FPR16 : Floating-point registers
  • STR0 - STR10 : Registers with stream addresses
  • SCALAR : Double-precision constant
  • SSCALAR : Single-precision constant
  • ISCALAR : Integer constant

The loop counter is always placed in register GPR1!
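
To illustrate the use of these labels, a hypothetical scale kernel (A[i] = a * B[i]) could look like the following sketch. It is not one of the shipped kernels; like copy.ptt above it uses SSE registers holding two doubles each, with the constant provided through SCALAR:

STREAMS 2
TYPE DOUBLE
FLOPS 1
BYTES 16
LOOP 8
movaps    FPR1, [STR0 + GPR1 * 8]
movaps    FPR2, [STR0 + GPR1 * 8 + 16]
movaps    FPR3, [STR0 + GPR1 * 8 + 32]
movaps    FPR4, [STR0 + GPR1 * 8 + 48]
mulpd     FPR1, SCALAR
mulpd     FPR2, SCALAR
mulpd     FPR3, SCALAR
mulpd     FPR4, SCALAR
movaps    [STR1 + GPR1 * 8], FPR1
movaps    [STR1 + GPR1 * 8 + 16], FPR2
movaps    [STR1 + GPR1 * 8 + 32], FPR3
movaps    [STR1 + GPR1 * 8 + 48], FPR4

The loop processes four packed registers with two doubles each, hence LOOP 8, while FLOPS and BYTES describe a single scalar update: one multiply and 16 transferred bytes (8 loaded, 8 stored).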

Technically, the text files in the ptt format are converted into an intermediate high-level assembly format (PAS) and finally into assembly. Both intermediate files, the .pas file and the .s file, are present in the build directory (e.g. ./GCC). The intermediate assembly format makes it possible to provide different assembler backends, e.g. for masm. At the moment there is only a backend for gas.

After recompiling, the benchmark code is generated and automatically included in likwid-bench.
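
Saved as, e.g., bench/x86-64/scale.ptt (the hypothetical kernel sketched above), the new benchmark would show up under the name scale and could be run like any built-in kernel:

$ make likwid-bench
$ likwid-bench -t scale -w S0:100kB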

Community Aspect

One idea behind likwid-bench, beyond the rapid prototyping and benchmarking of loop kernels, is to provide a platform which helps to generate knowledge about which instruction code works on different platforms for certain algorithms. We want, e.g., to provide learning packages with collections of micro benchmarks showing the influence of different instructions and implementation types on performance. Users can then easily share their implementations, and results can easily be compared on different processors. The typical targets for such packages are, e.g.:

  • Stencil kernels: Jacobi, Gauss-Seidel
  • Stream and full triad (4 vectors)
  • Add operations
  • Basic data operations (load, store, copy)

Future Plans

  • Support more data access patterns apart from plain streams, e.g. multi-dimensional arrays for stencil kernels or CRS data formats for sparse matrix computations.
  • Provide more assembler backends.
  • Provide more example packages.
  • Provide a Perl script which generates a bandwidth map based on likwid-bench measurements.