hwloc and LIKWID Versions #682

Open
pauleonix opened this issue Jan 28, 2020 · 22 comments

@pauleonix

pauleonix commented Jan 28, 2020

It would be very cool to have minimum/maximum version numbers for these in the README. For both libraries there can be build problems if the wrong version is loaded. E.g. it doesn't work with [email protected], but it does with [email protected]. If I then add [email protected] there are build errors again. I have no other versions of likwid installed and I would like to know which one will work (newer or older than 4.3.2) before I try to build likwid.
Also it would be nice to know how DASH benefits exactly from these (and other) libraries, to know if they are worth the effort for a given project.

@pauleonix
Author

pauleonix commented Jan 28, 2020

To be more precise: with [email protected] and [email protected] I get

[  2%] Building C object dart-impl/base/CMakeFiles/dart-base.dir/src/internal/domain_locality.c.o
/home/spielix/dash/dart-impl/base/src/hwinfo.c: In function ‘dart_hwinfo’:
/home/spielix/dash/dart-impl/base/src/hwinfo.c:148:27: error: ‘dart_hwinfo_t’ {aka ‘struct <anonymous>’} has no member named ‘num_sockets’; did you mean ‘num_cores’?
       hw.num_numa    = hw.num_sockets;
                           ^~~~~~~~~~~
                           num_cores
/home/spielix/dash/dart-impl/base/src/hwinfo.c:151:53: error: ‘dart_hwinfo_t’ {aka ‘struct <anonymous>’} has no member named ‘num_sockets’; did you mean ‘num_cores’?
       hw.num_cores   = topo->numCoresPerSocket * hw.num_sockets;
                                                     ^~~~~~~~~~~
                                                     num_cores
make[2]: *** [dart-impl/base/CMakeFiles/dart-base.dir/src/hwinfo.c.o] Error 1

@pauleonix
Author

Without likwid but with [email protected] I get

[  3%] Building C object dart-impl/base/CMakeFiles/dart-base.dir/src/internal/domain_locality.c.o
/home/spielix/dash/dart-impl/base/src/hwinfo.c: In function ‘dart_hwinfo’:
/home/spielix/dash/dart-impl/base/src/hwinfo.c:313:33: error: ‘struct hwloc_obj’ has no member named ‘memory’
     hw.system_memory_bytes = obj->memory.total_memory / BYTES_PER_MB;
                                 ^~
/home/spielix/dash/dart-impl/base/src/hwinfo.c:319:33: error: ‘struct hwloc_obj’ has no member named ‘memory’
       hw.numa_memory_bytes = obj->memory.total_memory / BYTES_PER_MB;
                                 ^~
make[2]: *** [dart-impl/base/CMakeFiles/dart-base.dir/src/hwinfo.c.o] Error 1
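
For reference, this error is consistent with an API change in hwloc 2.0, which removed the `memory` member of `struct hwloc_obj`; the total is exposed as `obj->total_memory` instead. Below is a minimal sketch of a compile-time guard based on `HWLOC_API_VERSION`. The helper names `api_major` and `obj_total_memory` are hypothetical, the `__has_include` check is only there so the sketch compiles without hwloc installed, and none of this is necessarily how DASH handles it.

```c
#include <stdint.h>

/* Pull in hwloc only if the header is available, so that the sketch
 * also compiles on machines without hwloc installed. */
#if defined(__has_include)
#  if __has_include(<hwloc.h>)
#    include <hwloc.h>
#  endif
#endif

/* HWLOC_API_VERSION encodes release X.Y.Z as (X<<16)+(Y<<8)+Z.
 * Hypothetical helper: extract the major version X from such a value. */
static inline unsigned api_major(unsigned api_version) {
  return (api_version >> 16) & 0xffu;
}

#ifdef HWLOC_API_VERSION
/* Hypothetical accessor: total memory (in bytes) below `obj`. */
static inline uint64_t obj_total_memory(hwloc_obj_t obj) {
#  if HWLOC_API_VERSION >= 0x00020000
  return obj->total_memory;         /* hwloc 2.x: field lives on the object */
#  else
  return obj->memory.total_memory;  /* hwloc 1.x: nested `memory` struct */
#  endif
}
#endif
```

With a guard like this, a line such as `obj->memory.total_memory / BYTES_PER_MB` could call `obj_total_memory(obj)` instead and build against either major version.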

@devreal
Member

devreal commented Jan 28, 2020

@Spielix You're right, we should document the minimum requirements. We should also support hwloc v2 eventually. I'll see what I can do.

In the meantime, do you need support for either likwid or hwloc-2?

@pauleonix
Author

pauleonix commented Jan 28, 2020

Like I said, first of all I would like to know what difference it makes to have these. I mean, when one knows the library, one may be able to imagine what it could be used for in DASH, but not every use-case one can imagine may be implemented (yet), etc.
I would like to know the benefits and - if there are any - the drawbacks of building DASH with these 3rd party libraries enabled.

@devreal
Member

devreal commented Jan 28, 2020

@Spielix Ahh I see, sorry I didn't fully grasp your question. You should be able to safely build DASH without these two libraries. They are mainly used to query information about the machine you're running on, e.g., the number of cores and the size of memory. In most cases, none of that is crucial for using DASH though, and DASH will fall back to Linux APIs to query some of this information if neither Likwid nor hwloc is available. You're safe disabling both Likwid and hwloc entirely and building without them...
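
To illustrate what such a fallback can look like, here is a minimal sketch that queries the core count and memory size through `sysconf`. It is not DASH's actual code: the struct `basic_hwinfo_t` and the function `query_basic_hwinfo` are hypothetical names, and `_SC_NPROCESSORS_ONLN` and `_SC_PHYS_PAGES` are glibc extensions assumed to be available on the Linux system in question.

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical struct for this sketch; not DASH's dart_hwinfo_t. */
typedef struct {
  long     num_cores;            /* online logical CPUs, -1 if unknown */
  uint64_t system_memory_bytes;  /* physical memory in bytes, 0 if unknown */
} basic_hwinfo_t;

/* Query basic hardware information without hwloc or likwid. */
static inline basic_hwinfo_t query_basic_hwinfo(void) {
  basic_hwinfo_t hw = { -1, 0 };

  /* Number of logical CPUs currently online. */
  long cores = sysconf(_SC_NPROCESSORS_ONLN);
  if (cores > 0) {
    hw.num_cores = cores;
  }

  /* Physical memory = number of physical pages * page size. */
  long pages     = sysconf(_SC_PHYS_PAGES);
  long page_size = sysconf(_SC_PAGESIZE);
  if (pages > 0 && page_size > 0) {
    hw.system_memory_bytes = (uint64_t)pages * (uint64_t)page_size;
  }

  return hw;
}
```

This only covers flat counts; it says nothing about sockets, NUMA domains or cache hierarchy, which is the kind of detail hwloc adds.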

@pauleonix
Author

pauleonix commented Jan 29, 2020

When I build using [email protected] and run dash-test-mpi with e.g. -host mynode:4, the tests need a very long time (producing more than 11,123 lines of output) and ultimately seem to fail:

[    0 ERROR ] [ 1212202943.391 ] locality.c               :617  !!! DART: dart__base__locality__domain_group ! group subdomain .0.0.1.0.1.0.0.0 with invalid parent domain .0.0.1.0.1.0.0.0
[  ERROR   ] [UNIT 0] in [=  0 LOG =]               TestBase.h : 287 | -==- Test case finished at unit 0

When I leave out the :4 everything works fine, just as it does without hwloc. As my application(s) are quite communication heavy, I was thinking about having only one process per node instead of one per NUMA node either way. So it's not the end of the world, but I still would like to know if this behavior is to be expected.

@pauleonix
Author

pauleonix commented Jan 31, 2020

I would love to get some comment on this. Is this

  • b/c one shouldn't use :number_of_slots,
  • b/c of the hwloc Version,
  • b/c of some configuration issues on my side,
  • or is there a bug in dash-test-mpi or even DASH itself?

@devreal
Member

devreal commented Jan 31, 2020

It is hard to say what is going on just from the error you posted. Could you give some more information about your platform, your MPI, and which test exactly fails?

@pauleonix
Author

pauleonix commented Jan 31, 2020

Well, you not knowing where it comes from is pretty much enough information for me to just drop hwloc.

As you may want to go further:

  • The single node I ran the test on has 2x Intel Xeon Platinum 8168 @2.70GHz
  • I use [email protected], [email protected] and [email protected].
  • I also tried to run the test in a multi-node setting with more than one slot assigned to each node, but it took so long that I canceled it b/c I thought it was having the same problem and was just producing a ton of output.
  • Here you have all 11,123 lines of output:
    slurm-696394.out.txt
  • I run the test using sbatch -w mp-skl2s24c --wrap "`which mpirun` -host mp-skl2s24c:4 -x LD_LIBRARY_PATH ./dash/dash-test-mpi"

@devreal
Member

devreal commented Jan 31, 2020

It seems that the locality part of the runtime trips over something in your setup. Unfortunately, I do not know enough about that part to quickly figure things out. Here is the relevant code (https://github.com/dash-project/dash/blob/development/dart-impl/base/src/locality.c#L611):

      if (group_subdomain_tag_len <= group_parent_domain_tag_len) {
        /* Indicates invalid parameters, usually caused by multiple units
         * mapped to the same domain to be grouped.
         */
        DART_LOG_ERROR("dart__base__locality__domain_group ! "
                       "group subdomain %s with invalid parent domain %s",
                       group_subdomain_tags[sd], group_parent_domain_tag);

AFAICS, the hwloc part is only really relevant if you plan to split teams based on hardware information (grouping all units on one node into a team for example). If not it's safe to ignore hwloc...
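
To make the condition above concrete: in the error reported earlier, the subdomain tag and the parent domain tag are both `.0.0.1.0.1.0.0.0`, so their lengths are equal and the `<=` comparison triggers. The sketch below reproduces just that comparison; the helper name `subdomain_tag_is_deeper` is hypothetical, and it assumes the `*_tag_len` variables hold the string lengths of the tags.

```c
#include <string.h>

/* Hypothetical helper mirroring the length comparison quoted above:
 * returns 1 if the subdomain tag is strictly longer than the parent
 * tag (i.e. names a deeper level of the hierarchy), 0 otherwise.
 * Assumes tag lengths are plain string lengths. */
static inline int subdomain_tag_is_deeper(const char *parent_tag,
                                          const char *subdomain_tag) {
  return strlen(subdomain_tag) > strlen(parent_tag);
}
```

For the tags from the log, `subdomain_tag_is_deeper(".0.0.1.0.1.0.0.0", ".0.0.1.0.1.0.0.0")` yields 0, which corresponds to the error branch; a parent such as `.0.0.1.0.1.0.0` with the same subdomain would pass.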

@devreal
Member

devreal commented Jan 31, 2020

Maybe @fuchsto can shed some light on what is going wrong here?

@devreal
Member

devreal commented Jan 31, 2020

@Spielix You mentioned that the test run takes significantly longer if you place four units on the same node. That is surprising because most of the tests are single-threaded. Can you make sure that the processes are not bound to the same core? Can you try running with --bind-to none passed to MPI?

The amount of output is expected, that's the normal test output.

@pauleonix
Author

pauleonix commented Jan 31, 2020

The --bind-to none option doesn't seem to change anything. I tested it again (on a different node) and with the error the output is again over 10k lines, while when I don't specify the number of slots, I only get 3k lines of output (with all tests passed).

EDIT: There is roughly a factor of two in runtime.

@pauleonix
Author

This node has 2x Intel Xeon Silver 4110 @2.10GHz. With the error the run takes about 3 minutes, without it about 1.5 minutes.

@pauleonix
Author

@devreal
Member

devreal commented Jan 31, 2020

@Spielix If you don't specify :4 the test runs with a single unit only. Many of the tests require at least 2 units, some more, so naturally the output is significantly smaller. That might actually also explain the longer runtime...

@pauleonix
Author

I guess the test that is failing is one of the ones not being run with only one slot?

@devreal
Member

devreal commented Jan 31, 2020

@pauleonix
Author

It wouldn't be too surprising if this was a setup issue, as the admins are mostly working on single nodes. We had problems with the MPI setup before. Although I would have thought that this would only show when one uses more than one node...

@devreal
Member

devreal commented Jan 31, 2020

Can you try to launch one unit per node to see if the issue persists there? (if multi-node runs are part of your use-case)

@pauleonix
Author

pauleonix commented Feb 1, 2020

I can, but the only node type I have 4 of is knl, so the single cores are very slow (the network is slow too). When I tried it I got a seemingly different error. The problem is that when I try the test with the non-hwloc build, it runs for over 8 hours on the 4 nodes, and that is the maximum amount of time I can use. As it didn't take long for the hwloc version to error on the 4 nodes, I would still guess that the error does not happen w/o hwloc. You can confirm this in the output if you want:
slurm-696581_hwlocerror_4nodes.out.txt
slurm-696625_4nodes_timeout.out.txt

@pauleonix
Author

pauleonix commented Feb 2, 2020

As you can see in the new issue, I have found out what stopped the non-hwloc run from working (I thought that it couldn't be that slow/that much to test). I actually didn't use the right branch. With the development branch the code works fine on the four nodes w/o hwloc, meaning that the error appearing in "slurm-696581_hwlocerror_4nodes.out.txt" is also due to hwloc. I don't know if or how it is related to the case with several units on one node though.

devreal added a commit that referenced this issue Feb 28, 2020
Fix compilation issues #682 hwloc 2.x and likwid