Random fail of test_mpi_coarsen_2d #51

Open
fmilthaler opened this issue Jun 8, 2015 · 9 comments
@fmilthaler (Collaborator)

Following the merge of cmake-enable-mpi-option into master, the test test_mpi_coarsen_2d failed on Travis (travis build). That build was configured with ENABLE_VTK=TRUE and ENABLE_MPI left unset, so MPI support was still enabled by default.
The build from the same commit/merge, but configured with ENABLE_MPI=TRUE and ENABLE_VTK left unset (so with VTK support enabled by default), passed all the tests (travis build).
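
For reference (and assuming the usual out-of-source CMake configure step; the source path below is a placeholder), the two Travis configurations above differ only in which option is spelled out explicitly, since unset options default to TRUE:

    # failing build: VTK set explicitly, MPI enabled by default
    cmake -DENABLE_VTK=TRUE /path/to/pragmatic
    # passing build: MPI set explicitly, VTK enabled by default
    cmake -DENABLE_MPI=TRUE /path/to/pragmatic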

@ggorman (Contributor) commented Jun 8, 2015

Fixed.

ggorman closed this as completed Jun 8, 2015
fmilthaler reopened this Jun 25, 2015
@fmilthaler (Collaborator, Author)

Unfortunately, this test failed once again; here is the Travis log.
The failing build had all configure options enabled (implicitly or explicitly), that is:

  • ENABLE_VTK=TRUE
  • ENABLE_MPI=TRUE
  • ENABLE_OPENMP=TRUE

@fmilthaler (Collaborator, Author)

Unfortunately we are seeing this quite often; here is another failure of test_mpi_coarsen_2d.

Also, test_mpi_adapt_3d has fallen over (with config option ENABLE_OPENMP=FALSE), e.g. in this build.

Finally, test_coarsen_3d failed (with config option ENABLE_MPI=FALSE) in this build.

Note that none of these failures occur systematically; they appear more or less at random, since a number of builds with the same config options passed all tests.

@fmilthaler (Collaborator, Author)

Correction: test_mpi_adapt_3d seems to fail systematically if and only if ENABLE_OPENMP=FALSE, and to pass otherwise.

The other failing tests (as noted above) occur randomly.

@ggorman (Contributor) commented Jul 2, 2015

Can you confirm you have that the right way around?

Can you reproduce this by hand using a simple bash loop? Run it with -v so we have an actual error message to consider.

Gerard (mobile)

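A minimal reproduction loop of the kind suggested here, assuming the test binary is launched through mpiexec as in the log below and that two ranks suffice to trigger the failure, could look like this sketch:

    # rerun until the first failure; -v keeps an error message to inspect
    i=0
    while mpiexec -n 2 ./test_mpi_adapt_3d -v; do
        i=$((i+1))
        echo "pass $i"
    done
    echo "first failure after $i successful runs"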

@fmilthaler (Collaborator, Author)

Good call: test_mpi_adapt_3d does not fail systematically after all!
Here are the results: based on 100 runs, the test passes with a 64% success rate.
When it does fail, it crashes with the following error message:

Initial quality:
Quality mean:  0.376974
Quality min:   0.13183
/data/fmilthaler/pragmatic/include/Smooth.h, 1400 failing with tol=-nan
test_mpi_adapt_3d: /data/fmilthaler/pragmatic/include/Smooth.h:1402: bool Smooth<real_t, dim>::generate_location_3d(index_t, const real_t*, double*) const [with real_t = double; int dim = 3; index_t = int]: Assertion `tol>-10*double(2.22044604925031308085e-16L)' failed.
[tichy:00368] *** Process received signal ***
[tichy:00368] Signal: Aborted (6)
[tichy:00368] Signal code:  (-6)
[tichy:00368] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7fbe46412d40]
[tichy:00368] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fbe46412cc9]
[tichy:00368] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fbe464160d8]
[tichy:00368] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb86) [0x7fbe4640bb86]
[tichy:00368] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fc32) [0x7fbe4640bc32]
[tichy:00368] [ 5] test_mpi_adapt_3d(_ZNK6SmoothIdLi3EE20generate_location_3dEiPKdPd+0x654) [0x52e57c]
[tichy:00368] [ 6] test_mpi_adapt_3d(_ZN6SmoothIdLi3EE27optimisation_linf_3d_kernelEi+0xca2) [0x523eb8]
[tichy:00368] [ 7] test_mpi_adapt_3d(_ZN6SmoothIdLi3EE24optimisation_linf_kernelEi+0x20) [0x514f9a]
[tichy:00368] [ 8] test_mpi_adapt_3d(_ZN6SmoothIdLi3EE17optimisation_linfEid+0x744) [0x4feac0]
[tichy:00368] [ 9] test_mpi_adapt_3d(main+0x6cf) [0x4e85bd]
[tichy:00368] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fbe463fdec5]
[tichy:00368] [11] test_mpi_adapt_3d() [0x4e7da9]
[tichy:00368] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 368 on node tichy exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I'll check the other two tests tomorrow.

@grokos (Collaborator) commented Jul 3, 2015

I see it crashes inside generate_location() in Smooth.h. I bet if you compile the code with debug support you will get an assertion failure - the one related to "tol > -DBL_EPSILON" or something like that. This is a problem I faced months ago and have not been able to fix so far. It is not a multi-threading issue; the problem occurs even with one thread. In fact, if you want to make test_mpi_adapt_3d fail systematically, just run it with OMP_NUM_THREADS=1. Multi-threading changes the order of operations, making the test pass on some runs and fail on others.
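
Assuming two MPI ranks as in the log above, a deterministic reproduction along the lines grokos describes would be something like:

    # pin OpenMP to a single thread so the order of operations is fixed
    OMP_NUM_THREADS=1 mpiexec -n 2 ./test_mpi_adapt_3d -v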

@ggorman (Contributor) commented Jul 3, 2015

I may have a fix for this in my branch for boundary coarsening. I found that when highly anisotropic elements are generated, the metric can end up with a non-positive determinant due to roundoff.

Frank - can you perform the same test with my branch?

Cheers
Gerard


@fmilthaler (Collaborator, Author)

Here we go: based on 500 runs of test_mpi_adapt_3d in each configuration, the results are as follows:

  • in master, configured with ENABLE_MPI=TRUE, ENABLE_OPENMP=TRUE: 498 runs passed
  • in master, configured with ENABLE_MPI=TRUE, ENABLE_OPENMP=FALSE: 338 runs passed
  • in coarsen_surface, configured with ENABLE_MPI=TRUE, ENABLE_OPENMP=TRUE: 472 runs passed
  • in coarsen_surface, configured with ENABLE_MPI=TRUE, ENABLE_OPENMP=FALSE: the first 9 runs passed, then the 10th run hung indefinitely.

The failures are the same as above.
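
For reference, a tally like the one above can be gathered with a counting loop of this form (the binary path, rank count and 300 s timeout are assumptions; the timeout guards against the indefinite hang seen with coarsen_surface):

    passed=0
    for i in $(seq 1 500); do
        # timeout keeps a hung run from blocking the whole tally
        if timeout 300 mpiexec -n 2 ./test_mpi_adapt_3d > /dev/null 2>&1; then
            passed=$((passed+1))
        fi
    done
    echo "$passed of 500 runs passed"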
