OMB (OSU Micro-Benchmarks)
--------------------------
The OSU Micro-Benchmarks use the GNU build system. Therefore you can simply
use the following steps to build the MPI benchmarks.
Example:
    ./configure CC=/path/to/mpicc CXX=/path/to/mpicxx
    make
    make install
CC and CXX can also be set to other wrapper scripts to build the OpenSHMEM or
UPC++ benchmarks. Based on this setting, configure will detect whether your
library supports MPI-1, MPI-2, MPI-3, OpenSHMEM, and UPC++ and compile the
corresponding benchmarks. See http://mvapich.cse.ohio-state.edu/benchmarks/ to
download the latest version of this package.
OMB also contains CUDA and OpenACC extensions to the benchmarks. CUDA extensions
can be enabled by configuring OMB with the --enable-cuda option as shown below.
Similarly, OpenACC extensions can be enabled using the --enable-openacc option.
The MPI library used should be able to support MPI communication from buffers
in GPU device memory.
    ./configure CC=/path/to/mpicc \
                CXX=/path/to/mpicxx \
                --enable-cuda \
                --with-cuda-include=/path/to/cuda/include \
                --with-cuda-libpath=/path/to/cuda/lib
    make
    make install
More information about the CUDA extensions is given towards the end of the README.
This package also distributes UPC put, get, and collective benchmarks.
These are located in the upc subdirectory and can be compiled by the
following:
    for bench in osu_upc_memput \
                 osu_upc_memget \
                 osu_upc_all_scatter \
                 osu_upc_all_reduce \
                 osu_upc_all_gather \
                 osu_upc_all_gather_all \
                 osu_upc_all_exchange \
                 osu_upc_all_broadcast \
                 osu_upc_all_barrier
    do
        echo "Compiling $bench..."
        upcc $bench.c ../util/osu_util_pgas.c ../util/osu_util.c -o $bench
    done
The MPI Multiple Bandwidth / Message Rate (osu_mbw_mr), OpenSHMEM Put Message
Rate (osu_oshm_put_mr), and OpenSHMEM Atomics (osu_oshm_atomics) tests are
intended to be used with block-assigned ranks. This means that all processes
on the same machine are assigned ranks sequentially.
    Rank    Block   Cyclic
    ----------------------
    0       host1   host1
    1       host1   host2
    2       host1   host1
    3       host1   host2
    4       host2   host1
    5       host2   host2
    6       host2   host1
    7       host2   host2
If you're using mpirun_rsh, the ranks are assigned in the order they are seen in
the hostfile or on the command line. Please see your process manager's
documentation for information on how to control the rank-to-host mapping.
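The two placements in the table above can be reproduced with a short Python sketch. The function names and the fixed number of ranks per host are assumptions made for illustration; they are not part of OMB or of any process manager.

```python
def block_mapping(hosts, ranks_per_host):
    """Fill one host with consecutive ranks before moving to the next host."""
    return [host for host in hosts for _ in range(ranks_per_host)]


def cyclic_mapping(hosts, ranks_per_host):
    """Deal ranks out to the hosts in round-robin order."""
    total_ranks = len(hosts) * ranks_per_host
    return [hosts[rank % len(hosts)] for rank in range(total_ranks)]


# Print the same three columns as the table: rank, block host, cyclic host.
hosts = ["host1", "host2"]
block = block_mapping(hosts, 4)
cyclic = cyclic_mapping(hosts, 4)
for rank, (block_host, cyclic_host) in enumerate(zip(block, cyclic)):
    print(rank, block_host, cyclic_host)
```

With block placement, ranks 0 to 3 share host1, which is the layout the tests listed above expect.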
Point-to-Point MPI Benchmarks
-----------------------------
osu_latency - Latency Test
* The latency tests are carried out in a ping-pong fashion. The sender
* sends a message with a certain data size to the receiver and waits for a
* reply from the receiver. The receiver receives the message from the sender
* and sends back a reply with the same data size. Many iterations of this
* ping-pong test are carried out and average one-way latency numbers are
* obtained. Blocking versions of MPI functions (MPI_Send and MPI_Recv) are
* used in the tests. This test is available here.
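As a rough illustration of how a one-way number falls out of a ping-pong loop, the Python sketch below halves the average round-trip time. The function name and the microsecond unit are assumptions for this example; it is not taken from the OMB sources.

```python
def one_way_latency_us(elapsed_seconds, iterations):
    """Average one-way latency in microseconds for a ping-pong loop.

    elapsed_seconds covers `iterations` round trips, and every round trip
    carries two one-way messages of the same size.
    """
    if iterations <= 0:
        raise ValueError("iterations must be > 0")
    return elapsed_seconds * 1e6 / (2 * iterations)


# 1000 round trips in 2 ms: 2 us per round trip, so about 1 us one way.
print(one_way_latency_us(0.002, 1000))
```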
osu_latency_mt - Multi-threaded Latency Test
* The multi-threaded latency test performs a ping-pong test with a single
* sender process and multiple threads on the receiving process. In this test
* the sending process sends a message of a given data size to the receiver
* and waits for a reply from the receiver process. The receiving process has
* a variable number of receiving threads (set by default to 2), where each
* thread calls MPI_Recv and upon receiving a message sends back a response
* of equal size. Many iterations are performed and the average one-way
* latency numbers are reported. This test is available here.
* The "-t" option can be used to set the number of sender and receiver
  threads to be used in a benchmark. Examples:
    -t 4    // receiver threads = 4 and sender threads = 1
    -t 4:6  // sender threads = 4 and receiver threads = 6
    -t 2:   // not defined
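The three "-t" examples above can be read as the following parsing rule. This is a Python sketch of that rule only; the helper names are hypothetical, and rejecting "2:" with an error is one reasonable reading of "not defined", not a statement about what the benchmark does.

```python
def _positive_int(text):
    """Convert text to an int and insist that it is greater than zero."""
    number = int(text)  # raises ValueError for empty or non-numeric text
    if number <= 0:
        raise ValueError(f"thread count must be > 0, got {number}")
    return number


def parse_thread_option(value):
    """Return (sender_threads, receiver_threads) for a "-t" argument."""
    if ":" not in value:
        # "-t 4": only the receiver count is given, the sender count is 1.
        return 1, _positive_int(value)
    sender_text, receiver_text = value.split(":", 1)
    if not sender_text or not receiver_text:
        # "-t 2:" is listed as not defined, so this sketch rejects it.
        raise ValueError(f"thread option {value!r} is not defined")
    return _positive_int(sender_text), _positive_int(receiver_text)


print(parse_thread_option("4"))    # sender = 1, receiver = 4
print(parse_thread_option("4:6"))  # sender = 4, receiver = 6
```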
osu_bw - Bandwidth Test
* The bandwidth tests are carried out by having the sender send out a
* fixed number (equal to the window size) of back-to-back messages to the
* receiver and then wait for a reply from the receiver. The receiver
* sends the reply only after receiving all these messages. This process is
* repeated for several iterations and the bandwidth is calculated based on
* the elapsed time (from the time the sender sends the first message until
* the time it receives the reply back from the receiver) and the number of
* bytes sent by the sender. The objective of this bandwidth test is to
* determine the maximum sustained data rate that can be achieved at the
* network level. Thus, non-blocking versions of MPI functions (MPI_Isend
* and MPI_Irecv) are used in the test. This test is available here.
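The calculation described above (bytes sent divided by elapsed time) can be sketched in Python as follows. The function name and the choice of 10^6 bytes per MB are assumptions for this example; check the benchmark output header for the unit it actually reports.

```python
def bandwidth_mb_per_s(message_bytes, window_size, iterations, elapsed_seconds):
    """Bandwidth seen by the sender, in MB/s (1 MB = 10**6 bytes here).

    Each iteration sends `window_size` back-to-back messages of
    `message_bytes` bytes before waiting for the receiver's reply.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be > 0")
    total_bytes = message_bytes * window_size * iterations
    return total_bytes / elapsed_seconds / 1e6


# 10 iterations of a 64-message window of 1,000,000-byte messages in 1 s.
print(bandwidth_mb_per_s(1_000_000, 64, 10, 1.0))
```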
osu_bibw - Bidirectional Bandwidth Test
* The bidirectional bandwidth test is similar to the bandwidth test, except
* that both the nodes involved send out a fixed number of back-to-back
* messages and wait for the reply. This test measures the maximum
* sustainable aggregate bandwidth by two nodes. This test is available here.
osu_mbw_mr - Multiple Bandwidth / Message Rate Test
* The multi-pair bandwidth and message rate test evaluates the aggregate
* uni-directional bandwidth and message rate between multiple pairs of
* processes. Each of the sending processes sends a fixed number of messages
* (the window size) back-to-back to the paired receiving process before
* waiting for a reply from the receiver. This process is repeated for
* several iterations. The objective of this benchmark is to determine the
* achieved bandwidth and message rate from one node to another node with a
* configurable number of processes running on each node. The test is
* available here.
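One way to relate the two aggregate figures is sketched below in Python: every pair sends the same window of messages, so the message rate is the total message count over the elapsed time, and the bandwidth is that rate times the message size. The function name, the shared elapsed time for all pairs, and the 10^6 bytes per MB unit are assumptions for this example.

```python
def aggregate_rates(pairs, message_bytes, window_size, iterations, elapsed_seconds):
    """Return (bandwidth in MB/s, message rate in messages/s) over all pairs."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be > 0")
    # Every pair sends `window_size` messages per iteration.
    total_messages = pairs * window_size * iterations
    message_rate = total_messages / elapsed_seconds
    bandwidth = message_rate * message_bytes / 1e6
    return bandwidth, message_rate


bandwidth, rate = aggregate_rates(4, 1000, 64, 10, 1.0)
print(bandwidth, rate)
```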
osu_multi_lat - Multi-pair Latency Test
* This test is very similar to the latency test. However, at the same
* instant multiple pairs are performing the same test simultaneously.
* In order to perform the test across just two nodes the hostnames must
* be specified in block fashion.
Collective MPI Benchmarks
-------------------------
osu_allgather - MPI_Allgather Latency Test(*)
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_barrier - MPI_Barrier Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test(*)
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter - MPI_Scatter Latency Test(*)
osu_scatterv - MPI_Scatterv Latency Test
Collective Latency Tests
* The latest OMB version includes benchmarks for various MPI blocking
* collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce,
* MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter,
* MPI_Scatter and vector collectives). These benchmarks work in the
* following manner. When users run the osu_bcast benchmark with N
* processes, the benchmark measures the min, max and average latency of
* the MPI_Bcast collective operation across the N processes, for various
* message lengths, over a large number of iterations. By default, these
* benchmarks report the average latency for each message length.
* Additionally, the benchmarks offer the following options:
* "-f" can be used to report additional statistics of the benchmark,
such as min and max latencies and the number of iterations.
* "-m" option can be used to set the minimum and maximum message length
to be used in a benchmark. In the default version, the benchmarks
report the latencies for up to 1MB message lengths. Examples:
-m 128 // min = default, max = 128
-m 2:128 // min = 2, max = 128
-m 2: // min = 2, max = default
* "-x" can be used to set the number of warmup iterations to skip for each
message length.
* "-i" can be used to set the number of iterations to run for each message
length.
* "-M" can be used to set per process maximum memory consumption. By
default the benchmarks are limited to 512MB allocations.
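The "-m" examples above follow a simple MIN:MAX rule, sketched here in Python. The default maximum of 1 MB comes from the text; treating 1 MB as 2^20 bytes and using 1 byte as the default minimum are assumptions for this example, as are the names.

```python
DEFAULT_MIN_BYTES = 1        # assumption: the README does not state the default minimum
DEFAULT_MAX_BYTES = 1 << 20  # 1 MB default maximum, assuming 1 MB = 2**20 bytes


def parse_message_range(value, default_min=DEFAULT_MIN_BYTES,
                        default_max=DEFAULT_MAX_BYTES):
    """Return (min_bytes, max_bytes) for a "-m" argument."""
    if ":" in value:
        low_text, high_text = value.split(":", 1)
        # An empty side of the colon falls back to the default.
        low = int(low_text) if low_text else default_min
        high = int(high_text) if high_text else default_max
    else:
        # A single number only sets the maximum.
        low, high = default_min, int(value)
    if low < 0 or high < low:
        raise ValueError(f"invalid message range {value!r}")
    return low, high


for option in ("128", "2:128", "2:"):
    print(option, parse_message_range(option))
```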
Support for CUDA Managed Memory
---------------------------------
The following benchmarks have been extended to evaluate performance of MPI communication
from and to buffers allocated using CUDA Managed Memory.
* osu_bibw - Bidirectional Bandwidth Test
* osu_bw - Bandwidth Test
* osu_latency - Latency Test
* osu_mbw_mr - Multiple Bandwidth / Message Rate Test
* osu_multi_lat - Multi-pair Latency Test
* osu_allgather - MPI_Allgather Latency Test
* osu_allgatherv - MPI_Allgatherv Latency Test
* osu_allreduce - MPI_Allreduce Latency Test
* osu_alltoall - MPI_Alltoall Latency Test
* osu_alltoallv - MPI_Alltoallv Latency Test
* osu_bcast - MPI_Bcast Latency Test
* osu_gather - MPI_Gather Latency Test
* osu_gatherv - MPI_Gatherv Latency Test
* osu_reduce - MPI_Reduce Latency Test
* osu_reduce_scatter - MPI_Reduce_scatter Latency Test
* osu_scatter - MPI_Scatter Latency Test
* osu_scatterv - MPI_Scatterv Latency Test
In addition to support for communications to and from GPU memories allocated
using CUDA or OpenACC, we now provide the additional capability of performing
communications to and from buffers allocated using the CUDA Managed Memory concept.
CUDA Managed (or Unified) Memory allows applications to allocate memory that is
accessible from both the CPU and the GPU using the cudaMallocManaged() call. The
buffer is then transferred between the CPU and the GPU transparently to the user.
Currently, we offer benchmarking with CUDA Managed Memory using the tests mentioned
above.
These benchmarks have additional options:
* "M" allocates a send or receive buffer as managed for point to point communication.
* "-d managed" uses managed memory buffers to perform collective communications.
Non-Blocking Collective MPI Benchmarks
--------------------------------------
osu_iallgather - MPI_Iallgather Latency Test
osu_iallgatherv - MPI_Iallgatherv Latency Test
osu_iallreduce - MPI_Iallreduce Latency Test
osu_ialltoall - MPI_Ialltoall Latency Test
osu_ialltoallv - MPI_Ialltoallv Latency Test
osu_ialltoallw - MPI_Ialltoallw Latency Test
osu_ibarrier - MPI_Ibarrier Latency Test
osu_ibcast - MPI_Ibcast Latency Test
osu_igather - MPI_Igather Latency Test
osu_igatherv - MPI_Igatherv Latency Test
osu_ireduce - MPI_Ireduce Latency Test
osu_iscatter - MPI_Iscatter Latency Test
osu_iscatterv - MPI_Iscatterv Latency Test
Non-Blocking Collective Latency Tests
* In addition to the blocking collective latency tests, we provide several
* non-blocking collectives as mentioned above. These evaluate the same
* metrics as the blocking operations as well as the additional metric
* `overlap'. This is defined as the amount of computation that can be
* performed while the communication progresses in the background.
* These benchmarks have the additional option:
* "-t CALLS" sets the number of MPI_Test() calls during the dummy
computation; set CALLS to 100, 1000, or any number > 0.
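The text above defines overlap only in words. One plausible way to turn it into a percentage is sketched below in Python: compare the communication time left exposed once the dummy computation is subtracted against the pure communication latency. The formula, the clamping to the range 0 to 100, and all names are assumptions for illustration, not a formula confirmed by this README.

```python
def overlap_percent(pure_comm_time, compute_time, total_time):
    """Estimate communication/computation overlap as a percentage.

    pure_comm_time: latency of the collective with no computation
    compute_time:   duration of the dummy computation on its own
    total_time:     duration of the collective with computation in between
    """
    if pure_comm_time <= 0:
        raise ValueError("pure_comm_time must be > 0")
    # Communication time that the computation could not hide.
    exposed = total_time - compute_time
    overlap = 100.0 - 100.0 * exposed / pure_comm_time
    return max(0.0, min(100.0, overlap))


print(overlap_percent(10.0, 50.0, 50.0))  # communication fully hidden
print(overlap_percent(10.0, 50.0, 60.0))  # nothing hidden
```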
One-sided MPI Benchmarks
------------------------
osu_put_latency - Latency Test for Put with Active/Passive Synchronization
* The put latency benchmark includes window initialization operations
* (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
* synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
* synchronization, for example with MPI_Win_Post/Start/Complete/Wait, the
* origin process calls MPI_Put to directly place data of a certain size in
* the remote process's window and then waits on a synchronization call
* (MPI_Win_complete) for completion. The remote process participates in
* synchronization with MPI_Win_post and MPI_Win_wait calls. Several
* iterations of this test are carried out and the average put latency is
* reported. The latency also includes the synchronization time. For
* passive synchronization, for example with MPI_Win_lock/unlock, the origin
* process calls MPI_Win_lock to lock the target process's window and calls
* MPI_Put to directly place data of a certain size in the window. It then
* calls MPI_Win_unlock to ensure completion of the Put and release the lock
* on the window. This is carried out for several iterations and the average
* time for the MPI_Win_lock + MPI_Put + MPI_Win_unlock calls is measured. The
* default window initialization and synchronization operations are
* MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
* options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
* "-x" can be used to set the number of warmup iterations to
skip for each message length.
* "-i" can be used to set the number of iterations to run for
each message length.
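The "-w" and "-s" values listed above map one-to-one onto MPI window and synchronization calls. The Python sketch below records that mapping together with the stated defaults (allocate and flush). The dictionary and function names are hypothetical; this is a lookup table for readers, not the benchmark's option parser.

```python
# Values accepted by "-w" and the window creation call each one selects.
WINDOW_OPTIONS = {
    "create": "MPI_Win_create",
    "allocate": "MPI_Win_allocate",
    "dynamic": "MPI_Win_create_dynamic",
}

# Values accepted by "-s" and the synchronization calls each one selects.
SYNC_OPTIONS = {
    "lock": "MPI_Win_lock/unlock",
    "flush": "MPI_Win_flush",
    "flush_local": "MPI_Win_flush_local",
    "lock_all": "MPI_Win_lock_all/unlock_all",
    "pscw": "MPI_Win_post/start/complete/wait",
    "fence": "MPI_Win_fence",
}


def describe_one_sided_run(window="allocate", sync="flush"):
    """Return the (window call, sync call) pair selected by -w and -s."""
    if window not in WINDOW_OPTIONS:
        raise ValueError(f"unknown -w value: {window!r}")
    if sync not in SYNC_OPTIONS:
        raise ValueError(f"unknown -s value: {sync!r}")
    return WINDOW_OPTIONS[window], SYNC_OPTIONS[sync]


print(describe_one_sided_run())                  # the documented defaults
print(describe_one_sided_run("dynamic", "pscw"))
```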
osu_get_latency - Latency Test for Get with Active/Passive Synchronization
* The get latency benchmark includes window initialization operations
* (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
* synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
* synchronization, for example with MPI_Win_Post/Start/Complete/Wait, the
* origin process calls MPI_Get to directly fetch data of a certain size
* from the target process's window into a local buffer. It then waits on a
* synchronization call (MPI_Win_complete) for local completion of the Gets.
* The remote process participates in synchronization with MPI_Win_post and
* MPI_Win_wait calls. Several iterations of this test are carried out and
* the average get latency is reported. The latency also includes the
* synchronization time. For passive synchronization, for example with
* MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
* target process's window and calls MPI_Get to directly read data of a
* certain size from the window. It then calls MPI_Win_unlock to ensure
* completion of the Get and release the lock on the remote window. This is
* carried out for several iterations and the average time for the
* MPI_Win_lock + MPI_Get + MPI_Win_unlock calls is measured. The default
* window initialization and synchronization operations are MPI_Win_allocate
* and MPI_Win_flush. The benchmark offers the following options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization
* The put bandwidth benchmark includes window initialization operations
* (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
* synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
* synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
* the test is carried out by the origin process calling a fixed number of
* back-to-back MPI_Puts on remote window and then waiting on a
* synchronization call (MPI_Win_complete) for their completion. The remote
* process participates in synchronization with MPI_Win_post and
* MPI_Win_wait calls. This process is repeated for several iterations and
* the bandwidth is calculated based on the elapsed time and the number of
* bytes put by the origin process. For passive synchronization, suppose
* users run with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock
* to lock the target process's window and calls a fixed number of
* back-to-back MPI_Puts to directly place data in the window. Then it calls
* MPI_Win_unlock to ensure completion of the Puts and release lock on
* remote window. This process is repeated for several iterations and the
* bandwidth is calculated based on the elapsed time and the number of bytes
* put by the origin process. The default window initialization and
* synchronization operations are MPI_Win_allocate and MPI_Win_flush. The
* benchmark offers the following options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization
* The get bandwidth benchmark includes window initialization operations
* (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
* synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
* synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
* the test is carried out by origin process calling a fixed number of
* back-to-back MPI_Gets and then waiting on a synchronization call
* (MPI_Win_complete) for their completion. The remote process participates
* in synchronization with MPI_Win_post and MPI_Win_wait calls. This process
* is repeated for several iterations and the bandwidth is calculated based
* on the elapsed time and the number of bytes received by the origin
* process. For passive synchronization, suppose users run with
* MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
* target process's window and calls a fixed number of back-to-back MPI_Gets
* to directly get data from the window. Then it calls MPI_Win_unlock to
* ensure completion of the Gets and release lock on the window. This
* process is repeated for several iterations and the bandwidth is
* calculated based on the elapsed time and the number of bytes read by the
* origin process. The default window initialization and synchronization
* operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers
* the following options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
osu_put_bibw - Bi-directional Bandwidth Test for Put with Active
Synchronization
* The put bi-directional bandwidth benchmark includes window initialization
* operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
* and synchronization operations (MPI_Win_Post/Start/Complete/Wait and
* MPI_Win_fence). This test is similar to the bandwidth test, except that
* both the processes involved send out a fixed number of back-to-back
* MPI_Puts and wait for their completion. This test measures the maximum
* sustainable aggregate bandwidth by two processes. The default window
* initialization and synchronization operations are MPI_Win_allocate and
* MPI_Win_Post/Start/Complete/Wait. The benchmark offers the following
* options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
osu_acc_latency - Latency Test for Accumulate with Active/Passive
Synchronization
* The accumulate latency benchmark includes window initialization
* operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
* and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
* synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
* the origin process calls MPI_Accumulate to combine data from the local
* buffer with the data in the remote window and store it in the remote
* window. The combining operation used in the test is MPI_SUM. The origin
* process then waits on a synchronization call (MPI_Win_complete) for
* completion of the operations. The remote process waits on a MPI_Win_wait
* call. Several iterations of this test are carried out and the average
* accumulate latency number is obtained. The latency includes the
* synchronization time also. For passive synchronization, suppose users
* run with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to
* lock the target process's window and calls MPI_Accumulate to combine data
* from a local buffer with the data in the remote window and store it in
* the remote window. Then it calls MPI_Win_unlock to ensure completion of
* the Accumulate and release lock on the window. This is carried out for
* several iterations and the average time for the MPI_Win_lock +
* MPI_Accumulate + MPI_Win_unlock calls is measured. The default window
* initialization and synchronization operations are MPI_Win_allocate and
* MPI_Win_flush. The benchmark offers the following options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
osu_cas_latency - Latency Test for Compare and Swap with Active/Passive
Synchronization
* The Compare_and_swap latency benchmark includes window initialization
* operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
* and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
* synchronization, suppose users run with
* MPI_Win_Post/Start/Complete/Wait, the origin process calls
* MPI_Compare_and_swap to place one element from origin buffer to target
* buffer. The initial value in the target buffer is returned to the
* calling process. The origin process then waits on a synchronization call
* (MPI_Win_complete) for local completion of the operations. The remote
* process waits on a MPI_Win_wait call. Several iterations of this test are
* carried out and the average Compare_and_swap latency number is obtained.
* The latency includes the synchronization time also. For passive
* synchronization, suppose users run with MPI_Win_lock/unlock, the origin
* process calls MPI_Win_lock to lock the target process's window and calls
* MPI_Compare_and_swap to place one element from origin buffer to target
* buffer. The initial value in the target buffer is returned to the calling
* process. Then it calls MPI_Win_flush to ensure completion of the
* Compare_and_swap. In the end, it calls MPI_Win_unlock to release lock on
* the window. This is carried out for several iterations and the average
* time for MPI_Compare_and_swap + MPI_Win_flush calls is measured. The
* default window initialization and synchronization operations are
* MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
* options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
osu_fop_latency - Latency Test for Fetch and Op with Active/Passive
Synchronization
* The Fetch_and_op latency benchmark includes window initialization
* operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
* and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
* synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
* the origin process calls MPI_Fetch_and_op to increase the element in
* target buffer by 1. The initial value from the target buffer is returned
* to the calling process. The origin process waits on a synchronization
* call (MPI_Win_complete) for completion of the operations. The remote
* process waits on a MPI_Win_wait call. Several iterations of this test are
* carried out and the average Fetch_and_op latency number is obtained. The
* latency includes the synchronization time also. For passive
* synchronization, suppose users run with MPI_Win_lock/unlock, the origin
* process calls MPI_Win_lock to lock the target process's window and calls
* MPI_Fetch_and_op to increase the element in the target buffer by 1. The
* initial value in the target buffer is returned to the calling process.
* Then it calls MPI_Win_flush to ensure completion of the Fetch_and_op. In
* the end, it calls MPI_Win_unlock to release the lock on the window. This
* is carried out for several iterations and the average time for the
* MPI_Fetch_and_op + MPI_Win_flush calls is measured. The
* default window initialization and synchronization operations are
* MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
* options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive
Synchronization
* The Get_accumulate latency benchmark includes window initialization
* operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
* and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
* MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
* MPI_Win_post/start/complete/wait and MPI_Win_fence). For active
* synchronization, for example with MPI_Win_post/start/complete/wait,
* the origin process calls MPI_Get_accumulate to combine data from the
* local buffer with the data in the remote window and store it in the
* remote window. The combining operation used in the test is MPI_SUM. The
* initial value from the target buffer is returned to the calling process.
* The origin process waits on a synchronization call (MPI_Win_complete) for
* local completion of the operations. The remote process waits on an
* MPI_Win_wait call. Several iterations of this test are carried out and
* the average Get_accumulate latency is obtained. The latency also
* includes the synchronization time. For passive synchronization,
* for example with MPI_Win_lock/unlock, the origin process calls
* MPI_Win_lock to lock the target process's window and calls
* MPI_Get_accumulate to combine data from a local buffer with the data in
* the remote window and store it in the remote window. The initial value
* from the target buffer is returned to the calling process. Then it calls
* MPI_Win_unlock to ensure completion of the Get_accumulate and release
* the lock on the window. This is carried out for several iterations and
* the average time for the MPI_Win_lock + MPI_Get_accumulate +
* MPI_Win_unlock calls is measured. The default window initialization and
* synchronization operations are MPI_Win_allocate and MPI_Win_flush. The
* benchmark offers the following options:
* "-w create" use MPI_Win_create to create an MPI Window object.
* "-w allocate" use MPI_Win_allocate to create an MPI Window object.
* "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window
* object.
* "-s lock" use MPI_Win_lock/unlock synchronization calls.
* "-s flush" use MPI_Win_flush synchronization call.
* "-s flush_local" use MPI_Win_flush_local synchronization call.
* "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls.
* "-s pscw" use Post/Start/Complete/Wait synchronization calls.
* "-s fence" use MPI_Win_fence synchronization call.
Point-to-Point OpenSHMEM Benchmarks
-----------------------------------
osu_oshm_put.c - Latency Test for OpenSHMEM Put Routine
* This benchmark measures the latency of a shmem_putmem operation for
* different data sizes. The user is required to select, through a
* parameter, whether the communication buffers should be allocated in
* global memory or heap memory. The test requires exactly two PEs. PE 0
* issues shmem_putmem to write data at PE 1 and then calls shmem_quiet.
* This is repeated for a fixed number of iterations, depending on the data
* size. The average latency per iteration is reported. A few warm-up
* iterations are run without timing to exclude any start-up overheads.
* Both PEs call shmem_barrier_all after the test for each message size.
osu_oshm_put_nb.c - Latency Test for OpenSHMEM Non-blocking Put Routine
* This benchmark measures the latency of a non-blocking shmem_putmem_nbi
* operation for different data sizes. The user is required to select,
* through a parameter, whether the communication buffers should be
* allocated in global memory or heap memory. The test requires exactly
* two PEs. PE 0 issues shmem_putmem_nbi to write data at PE 1 and then
* calls shmem_quiet. This is repeated for a fixed number of iterations,
* depending on the data size. The average latency per iteration is
* reported. A few warm-up iterations are run without timing to exclude any
* start-up overheads. Both PEs call shmem_barrier_all after the test for
* each message size.
osu_oshm_get.c - Latency Test for OpenSHMEM Get Routine
* This benchmark is similar to osu_oshm_put except that PE 0 does a
* shmem_getmem operation to read data from PE 1 in each iteration. The
* average latency per iteration is reported.
osu_oshm_get_nb.c - Latency Test for OpenSHMEM Non-blocking Get Routine
* This benchmark is similar to osu_oshm_get except that PE 0 does a
* non-blocking shmem_getmem_nbi operation to read data from PE 1 in each
* iteration. The average latency per iteration is reported.
osu_oshm_put_mr.c - Message Rate Test for OpenSHMEM Put Routine
* This benchmark measures the aggregate uni-directional operation rate of
* OpenSHMEM Put between pairs of PEs, for different data sizes. The user
* should select whether the communication buffers are in global memory or
* heap memory, as with the earlier benchmarks. This test requires an even
* number of PEs. The PEs are paired with PE 0 pairing with PE n/2 and so
* on, where n is the total number of PEs. The first PE in each pair issues
* back-to-back shmem_putmem operations to its peer PE. The total time for
* the put operations is measured and the operation rate per second is
* reported. All PEs call shmem_barrier_all after the test for each message
* size.
osu_oshm_put_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Put Routine
* This benchmark measures the aggregate uni-directional operation rate of
* OpenSHMEM Non-blocking Put between pairs of PEs, for different data
* sizes. The user should select whether the communication buffers are in
* global memory or heap memory, as with the earlier benchmarks. This test
* requires an even number of PEs. The PEs are paired with PE 0 pairing
* with PE n/2 and so on, where n is the total number of PEs. The first PE
* in each pair issues back-to-back shmem_putmem_nbi operations to its peer
* PE until the window size is reached. A call to shmem_quiet is placed
* after the window loop to ensure completion of the issued operations. The
* total time for the non-blocking put operations is measured and the
* operation rate per second is reported. All PEs call shmem_barrier_all
* after the test for each message size.
osu_oshm_get_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Get Routine
* This benchmark measures the aggregate uni-directional operation rate of
* OpenSHMEM Non-blocking Get between pairs of PEs, for different data
* sizes. The user should select whether the communication buffers are in
* global memory or heap memory, as with the earlier benchmarks. This test
* requires an even number of PEs. The PEs are paired with PE 0 pairing
* with PE n/2 and so on, where n is the total number of PEs. The first PE
* in each pair issues back-to-back shmem_getmem_nbi operations to its peer
* PE until the window size is reached. A call to shmem_quiet is placed
* after the window loop to ensure completion of the issued operations. The
* total time for the non-blocking get operations is measured and the
* operation rate per second is reported. All PEs call shmem_barrier_all
* after the test for each message size.
osu_oshm_put_overlap.c - Non-blocking Message Rate Overlap Test
* This benchmark measures the aggregate uni-directional operation rate
* overlap for OpenSHMEM Put between pairs of PEs, for different data sizes.
* The user should select whether the communication buffers are in global
* memory or heap memory, as with the earlier benchmarks. This test
* requires an even number of PEs. At the end, the benchmark prints
* statistics for the different phases of communication, computation and
* overlap.
osu_oshm_atomics.c - Latency and Operation Rate Test for OpenSHMEM Atomics Routines
* This benchmark measures the performance of the atomic fetch-and-operate
* and atomic operate routines supported in OpenSHMEM for the integer and
* long datatypes. The buffers can be selected to be in heap memory or
* global memory. The PEs are paired as in the Put message rate benchmark
* (osu_oshm_put_mr), and the first PE in each pair issues back-to-back
* atomic operations of one type to its peer PE. The average latency per
* atomic operation and the aggregate operation rate are reported. This is
* repeated for each of the fadd, finc, add, inc, cswap, swap, set, and
* fetch routines.
Collective OpenSHMEM Benchmarks
-------------------------------
osu_oshm_collect - OpenSHMEM Collect Latency Test
osu_oshm_fcollect - OpenSHMEM FCollect Latency Test
osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
osu_oshm_reduce - OpenSHMEM Reduce Latency Test
osu_oshm_barrier - OpenSHMEM Barrier Latency Test
Collective Latency Tests
* The latest OMB Version includes benchmarks for various OpenSHMEM
* collective operations (shmem_collect, shmem_broadcast, shmem_reduce and
* shmem_barrier). These benchmarks work in the following manner. For
* example, when the osu_oshm_broadcast benchmark is run with N processes,
* it measures the min, max and the average latency of the
* shmem_broadcast collective operation across N processes, for various
* message lengths, over a large number of iterations. In the default
* version, these benchmarks report the average latency for each message
* length. Additionally, the benchmarks offer the following options:
* "-f" can be used to report additional statistics of the benchmark,
such as min and max latencies and the number of iterations.
* "-m" option can be used to set the maximum message length to be used in a
benchmark. In the default version, the benchmarks report the
latencies for up to 1MB message lengths.
* "-i" can be used to set the number of iterations to run for each message
length.
Point-to-Point UPC Benchmarks
-----------------------------
osu_upc_memput.c - Put Latency
* This benchmark measures the latency of the upc_memput operation between
* multiple UPC threads. In this benchmark, UPC threads with ranks less
* than (THREADS/2) issue upc_memput operations to peer UPC threads. Peer
* threads are identified as (MYTHREAD+THREADS/2). This is repeated for a
* fixed number of iterations, for varying data sizes. The average latency
* per iteration is reported. A few warm-up iterations are run without
* timing to exclude any start-up overheads. All UPC threads call
* upc_barrier after the test for each message size.
osu_upc_memget.c - Get Latency
* This benchmark is similar to the osu_upc_memput benchmark described
* above. The difference is that the shared string handling function is
* upc_memget. The average get operation latency per iteration is reported.
Collective UPC Benchmarks
-------------------------
osu_upc_all_barrier - UPC Barrier Latency Test
osu_upc_all_broadcast - UPC Broadcast Latency Test
osu_upc_all_scatter - UPC Scatter Latency Test
osu_upc_all_gather - UPC Gather Latency Test
osu_upc_all_gather_all - UPC GatherAll Latency Test
osu_upc_all_reduce - UPC Reduce Latency Test
osu_upc_all_exchange - UPC Exchange Latency Test
Collective Latency Tests
* The latest OMB Version includes benchmarks for various UPC collective
* operations (upc_all_barrier, upc_all_broadcast, upc_all_scatter,
* upc_all_gather, upc_all_gather_all, upc_all_reduce, and
* upc_all_exchange). These benchmarks work in the following manner. For
* example, when the osu_upc_all_broadcast benchmark is run with N
* processes, it measures the min, max and the average latency of the
* upc_all_broadcast collective operation across N processes, for various
* message lengths, over a large number of iterations. In the default
* version, these benchmarks report the average latency for each message
* length. Additionally, the benchmarks offer the following options:
* "-f" can be used to report additional statistics of the benchmark, such
* as min and max latencies and the number of iterations.
* "-m" option can be used to set the maximum message length to be used in a
* benchmark. In the default version, the benchmarks report the latencies
* for up to 1MB message lengths.
* "-i" can be used to set the number of iterations to run for each message
* length.
Point-to-Point UPC++ Benchmarks
-------------------------------
osu_upcxx_async_copy_put.c - Put Latency
* This benchmark measures the latency of the UPC++ async_copy operation
* between multiple UPC++ threads. In this benchmark, UPC++ threads with
* ranks less than (THREADS/2) issue UPC++ async_copy from local to remote
* memory on peer threads. Peer threads are identified as
* (MYTHREAD+THREADS/2). This is repeated for a fixed number of iterations,
* for varying data sizes. The average latency per iteration is reported. A
* few warm-up iterations are run without timing to exclude any start-up
* overheads. All UPC++ threads call barrier after the test for each message
* size.
osu_upcxx_async_copy_get.c - Get Latency
* This benchmark is similar to the osu_upcxx_async_copy_put benchmark
* described above. The difference is that the async_copy operation
* copies from remote to local memory. The average get operation latency per
* iteration is reported.
Collective UPC++ Benchmarks
---------------------------
osu_upcxx_allgather - UPC++ Allgather Latency Test
osu_upcxx_alltoall - UPC++ Alltoall Latency Test
osu_upcxx_bcast - UPC++ Broadcast Latency Test
osu_upcxx_gather - UPC++ Gather Latency Test
osu_upcxx_reduce - UPC++ Reduce Latency Test
osu_upcxx_scatter - UPC++ Scatter Latency Test
Collective Latency Tests
* The latest OMB Version includes benchmarks for various UPC++ collective
* operations (upcxx_allgather, upcxx_alltoall, upcxx_bcast, upcxx_gather,
* upcxx_reduce, and upcxx_scatter). These benchmarks work in the following
* manner. For example, when the osu_upcxx_bcast benchmark is run with N
* processes, it measures the min, max and the average latency of the
* upcxx_bcast collective operation across N processes, for various message
* lengths, over a large number of iterations. In the default version, these
* benchmarks report the average latency for each message length.
* Additionally, the benchmarks offer the following options:
* "-f" can be used to report additional statistics of the benchmark, such
* as min and max latencies and the number of iterations.
* "-m" option can be used to set the maximum message length to be used in a
* benchmark. In the default version, the benchmarks report the latencies
* for up to 1MB message lengths.
* "-i" can be used to set the number of iterations to run for each message
* length.
Startup Benchmarks
------------------
osu_init.c - This benchmark measures the minimum, maximum, and average time
* each process takes to complete MPI_Init.
osu_hello.c - This is a simple hello world program. Users can use it to
* measure the time it takes for all processes to execute MPI_Init +
* MPI_Finalize.
*
* Example:
* - time mpirun_rsh -np 2 -hostfile hostfile osu_hello
CUDA and OpenACC Extensions to OMB
----------------------------------
CUDA Extensions to OMB can be enabled by configuring the benchmark suite with
--enable-cuda option as shown below. Similarly, OpenACC Extensions can be
enabled by specifying the --enable-openacc option. The MPI library used should
be able to support MPI communication from buffers in GPU Device memory.
./configure CC=/path/to/mpicc \
            CXX=/path/to/mpicxx \
            --enable-cuda \
            --with-cuda-include=/path/to/cuda/include \
            --with-cuda-libpath=/path/to/cuda/lib
make
make install
The following benchmarks have been extended to evaluate performance of
MPI communication using buffers on NVIDIA GPU devices.
osu_bibw - Bidirectional Bandwidth Test
osu_bw - Bandwidth Test
osu_latency - Latency Test
osu_mbw_mr - Multiple Bandwidth / Message Rate Test
osu_multi_lat - Multi-pair Latency Test
osu_put_latency - Latency Test for Put
osu_get_latency - Latency Test for Get
osu_put_bw - Bandwidth Test for Put
osu_get_bw - Bandwidth Test for Get
osu_put_bibw - Bidirectional Bandwidth Test for Put
osu_acc_latency - Latency Test for Accumulate
osu_cas_latency - Latency Test for Compare and Swap
osu_fop_latency - Latency Test for Fetch and Op
osu_allgather - MPI_Allgather Latency Test
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter - MPI_Scatter Latency Test
osu_scatterv - MPI_Scatterv Latency Test
osu_iallgather - MPI_Iallgather Latency Test
osu_iallgatherv - MPI_Iallgatherv Latency Test
osu_iallreduce - MPI_Iallreduce Latency Test
osu_ialltoall - MPI_Ialltoall Latency Test
osu_ialltoallv - MPI_Ialltoallv Latency Test
osu_ialltoallw - MPI_Ialltoallw Latency Test
osu_ibcast - MPI_Ibcast Latency Test
osu_igather - MPI_Igather Latency Test
osu_igatherv - MPI_Igatherv Latency Test
osu_ireduce - MPI_Ireduce Latency Test
osu_iscatter - MPI_Iscatter Latency Test
osu_iscatterv - MPI_Iscatterv Latency Test
If both CUDA and OpenACC support are enabled, you can switch between the modes
using the -d [cuda|openacc] option to the benchmarks. Whether a process
allocates its communication buffers on the GPU device or on the host can be
controlled at run-time. Use the -h option for more help.
./osu_latency -h
Usage: osu_latency [options] [RANK0 RANK1]
RANK0 and RANK1 may be `D' or `H' which specifies whether
the buffer is allocated on the accelerator device or host
memory for each mpi rank
options:
-d TYPE accelerator device buffers can be of TYPE `cuda' or `openacc'
-h print this help message
Each of the pt2pt benchmarks takes two input parameters. The first parameter
indicates the location of the buffers at rank 0 and the second parameter
indicates the location of the buffers at rank 1. The value of each of these
parameters can be either 'H' or 'D' to indicate if the buffers are to be on the
host or on the device respectively. When no parameters are specified, the
buffers are allocated on the host. The collective benchmarks will use buffers
allocated on the device if the -d option is used; otherwise the buffers will
be allocated on the host.
Examples:
- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_latency D D
In this run, the latency test allocates buffers at both rank 0 and rank 1 on
the GPU devices.
- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_bw D H
In this run, the bandwidth test allocates buffers at rank 0 on the GPU device
and buffers at rank 1 on the host.
Setting GPU affinity
--------------------
GPU affinity for processes is set before MPI_Init is called in the benchmarks.
The process rank on a node is normally used to do this and different MPI
launchers expose this information through different environment variables. The
benchmarks use an environment variable called LOCAL_RANK to get this
information.
Starting with OMB v5.4.4, the benchmarks automatically identify the process rank
on a node for MVAPICH2 when launched with mpirun_rsh. However, a script like
the one below can be used to export this environment variable when using OMB
with other MPI launchers and libraries.
#!/bin/bash
export LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK
exec $*
A copy of this script is installed as get_local_rank alongside the benchmarks.
It can be used as follows:
mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank \
./osu_latency D D