Crashes in TensorFlow via L1TMuonEndCapTrackProducer #32894
assign core, l1
New categories assigned: core,l1 @Dr15Jones,@smuzaffar,@rekovic,@makortel,@jmduarte you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
These are likely caused by #32813 that was merged after the previous ROOT master IB (where all workflows succeeded), but why do these crash only in ROOT master IBs, and not in default or ROOT622 IBs?
Adding @riga @mialiu149 in case they have any insight.
We are seeing the same crash in the slc7_ppc64le_gcc9 IBs, which are ROOT622, so it is not quite unique to ROOT master. Edited to add: we are not seeing it in cc8_ppc64le_gcc9, but we are seeing it in the cc8_aarch64_gcc9 CMSSW_11_2_X IBs.
Since this was backported, we are also seeing it in the CMSSW_11_2_X IBs, specifically for cc8_aarch64_gcc9 and cc8_amd64_gcc9.
Thanks Dan for pointing out that the problem is more widespread; I updated the title.
unassign core
The problem appears to be specific to this L1 code, with some randomness in where it appears.
I did some testing using CMSSW_11_3_ROOT6_X_2021-02-11-2300 and the failed workflows. It appears to be some threading issue once again. When I run with a single thread there are no more crashes. However, I don't really understand why it's happening.
In all cases the failing workflows seem to be limited to "FastSim+pileup".
In the core software meeting a hypothesis was raised that the crash is caused by running out of memory (with each stream holding its own copy of the TensorFlow graph and session contributing to the usage). The contribution from this TensorFlow model is modest, though.
Hi @makortel, thanks a lot! I tested some things locally as well to see if I could prevent the crash, but it didn't help much. So, in order to have only one instance of both the graph and the session shared between the stream instances, what would be the best practice? Once again, thanks for the help in debugging this!
@riga thanks a lot! This is super useful. I'll try to implement it this way.
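For readers following the thread, here is a minimal sketch of the pattern being discussed: creating one TensorFlow graph and session per job and sharing it across all stream instances via a GlobalCache in an edm::stream::EDProducer, using the PhysicsTools/TensorFlow helpers. The class, struct, configuration parameter, and tensor names below are hypothetical and are not the actual L1TMuonEndCapTrackProducer code.

```cpp
// Sketch only: one GraphDef + Session created once per job and shared by all
// stream instances of this producer; names are hypothetical.
#include <memory>
#include <string>
#include <vector>

#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "PhysicsTools/TensorFlow/interface/TensorFlow.h"

// Data shared between all stream instances.
struct TFCache {
  std::unique_ptr<tensorflow::GraphDef> graphDef;
  tensorflow::Session* session = nullptr;
};

class SharedSessionProducer : public edm::stream::EDProducer<edm::GlobalCache<TFCache>> {
public:
  SharedSessionProducer(const edm::ParameterSet& cfg, const TFCache*) {}

  // Called once per job, before any stream instance is constructed.
  static std::unique_ptr<TFCache> initializeGlobalCache(const edm::ParameterSet& cfg) {
    auto cache = std::make_unique<TFCache>();
    cache->graphDef.reset(tensorflow::loadGraphDef(cfg.getParameter<std::string>("graphPath")));
    cache->session = tensorflow::createSession(cache->graphDef.get());
    return cache;
  }

  // Called once per job, after all stream instances are gone.
  static void globalEndJob(TFCache* cache) {
    if (cache->session != nullptr) {
      tensorflow::closeSession(cache->session);
    }
  }

  void produce(edm::Event& event, const edm::EventSetup&) override {
    // Session runs are safe to call concurrently, so all streams can share one session.
    std::vector<tensorflow::Tensor> outputs;
    tensorflow::Tensor input(tensorflow::DT_FLOAT, tensorflow::TensorShape({1, 10}));  // shape illustrative only
    tensorflow::run(globalCache()->session, {{"input", input}}, {"output"}, &outputs);
    // ... put the result into the event ...
  }
};
```

With this arrangement the per-stream objects keep only lightweight state (input/output tensors, tokens), while the heavyweight graph and session exist once per job.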
@makortel, I also ran valgrind for the failing workflow.
Thanks! I see several invalid frees (which sounds a bit scary given that jemalloc is very picky about correct frees). Then there is a further report which itself doesn't really help much, but confirms that there is a problem in the TensorFlow code.
@riga Nice! I'd like to give some specific comments; what would be your preferred way to receive them? An issue in https://github.com/cms-ml/documentation/?
@makortel Yes, feel free to open an issue there.
The invalid frees must be coming from initialization of statics in tensorflow, so I guess we're getting those every time tensorflow is loaded?
Attn @jmduarte
With #32128 there is also an option to put the graph and session into an EventSetup product, which may be easier to deal with in some cases than the approach discussed above.
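As a companion sketch, this is roughly what the consuming side of the EventSetup option could look like. The record and wrapper types below (TFSessionRecord, TFSessionWrapper) are placeholders standing in for whatever #32128 actually provides; the ESProducer that fills them and the record registration are not shown.

```cpp
// Sketch only: a stream producer borrowing a shared graph/session from the
// EventSetup instead of owning its own copy; all names are placeholders.
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/EventSetup.h"
#include "FWCore/Framework/interface/EventSetupRecordImplementation.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/ESGetToken.h"
#include "PhysicsTools/TensorFlow/interface/TensorFlow.h"

// Hypothetical EventSetup record and product holding the shared graph/session
// (registration macros and the producing ESProducer omitted in this sketch).
class TFSessionRecord : public edm::eventsetup::EventSetupRecordImplementation<TFSessionRecord> {};

struct TFSessionWrapper {
  const tensorflow::GraphDef* graphDef = nullptr;
  tensorflow::Session* session = nullptr;
};

class ESSessionConsumer : public edm::stream::EDProducer<> {
public:
  explicit ESSessionConsumer(const edm::ParameterSet& cfg)
      : sessionToken_(esConsumes<TFSessionWrapper, TFSessionRecord>()) {}

  void produce(edm::Event& event, const edm::EventSetup& setup) override {
    // Each stream reads the shared wrapper from the EventSetup.
    const TFSessionWrapper& tf = setup.getData(sessionToken_);
    // ... fill an input tensor and run inference via tf.session ...
  }

private:
  edm::ESGetToken<TFSessionWrapper, TFSessionRecord> sessionToken_;
};
```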
Thanks for all the comments once again. I tried to implement the suggested approach. I'm putting the crash report below in case it gives you more information about this; I will also try the other suggestion.
The crash is here, where a thread-local int is being accessed. That to me implies some sort of DLL-related problem.
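For context, a tiny standalone illustration of the construct being pointed at: a thread_local variable in library code, whose per-thread storage is set up lazily on the first access from each thread. This is generic C++, not the TensorFlow code in question; it only shows why a crash on such an access hints at dynamic-loading/TLS bookkeeping rather than the surrounding logic.

```cpp
// Generic illustration only (not TensorFlow/CMSSW code): a thread_local
// object in library code. Each thread gets its own copy, initialized on that
// thread's first access; for dynamically loaded libraries this lazy TLS setup
// is handled by the dynamic loader.
#include <cstdio>
#include <thread>

namespace somelib {
  thread_local int callCount = 0;  // one instance per thread

  void doWork() {
    ++callCount;  // first access in a thread triggers the per-thread TLS setup
    std::printf("per-thread call count: %d\n", callCount);
  }
}  // namespace somelib

int main() {
  std::thread t1(somelib::doWork);
  std::thread t2(somelib::doWork);
  t1.join();
  t2.join();
  return 0;
}
```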
Just wanted to let you know that I implemented the suggested change. Looking at the new ROOT master IBs, I don't see the RelVals crashing anymore. Do you know what changed to fix this issue? Is this still an issue on our side that we should fix?
Indeed the crashes were visible in CMSSW_11_3_ROOT6_X_2021-02-15-2300 but not anymore in CMSSW_11_3_ROOT6_X_2021-02-16-2300 or after that. The latter IB was a full build. The same holds for the default IBs.
Thanks! I did not really expect that to fix the problem, but it would reduce the overall memory usage of the jobs that use it.
Still popping up in the 11_2_X IBs, most recently CMSSW_11_2_X_2021-02-21-0000/lib/cc8_amd64_gcc9.
Thanks for the replies (and sorry for the delay on my side).
Ok, thanks. I can see that it's still popping up in some of the 11_2_X IBs, but I don't know what we can do about that.
Thanks! It's a good idea overall. I think we'll implement it in some way soon, but we need to do some validation before submitting a PR.
@eyigitba Did you follow up on this issue?
@eyigitba Can you please let us know about the status of fixing the issue? Are you still planning to implement the ideas mentioned in this thread?
Hi @cecilecaillol, sorry for missing the earlier message. I haven't worked on this for some time. Our understanding was that the crashes were no longer showing up in the IBs.
The per-stream memory usage of the TensorFlow graph and session would still be worth reducing, though.
There are several workflows segfaulting in CMSSW_11_3_ROOT6_X_2021-02-11-2300, e.g. 25400.{0,17,18} step 1
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_11_3_ROOT6_X_2021-02-11-2300/pyRelValMatrixLogs/run/25400.0_ZEE_13+FS_ZEE_13_UP15_PU25+HARVESTUP15FS+MINIAODMCUP15FS/step1_ZEE_13+FS_ZEE_13_UP15_PU25+HARVESTUP15FS+MINIAODMCUP15FS.log#/