malloc(): unsorted double linked list corrupted errors from DelphesPythia8_EDM4HEP #136
Comments
Tagging @juliagonski who reported the same issue. |
Hi @bistapf Could you attach, link, paste:
Thanks! |
I can reproduce locally with https://raw.githubusercontent.com/key4hep/k4SimDelphes/refs/heads/main/examples/edm4hep_output_config.tcl, but not with the debug build. |
Hi @andresailer , here's the output config: edm4hep_output_config.tcl. So you're saying with setting up, e.g., |
Thanks! And yes when adding |
Valgrind, plus some hot patching via LD_PRELOAD
This buffer PS: Does not seem to be the thing causing this, at least not by itself... |
Running with a local standalone installation of Delphes and the same inputs/config worked fine. So I don't think it's a Delphes issue? Tagging @selvaggi and @pavel-demin anyway for the buffer comment above. |
@bistapf How did you compile delphes? |
I followed the instructions from the workbook here - but I used Pythia Btw, I cannot confirm that adding the debug flag fixes it, for me it throws the |
Maybe the whole issue is so flaky that one really has to run many times to draw conclusions. |
Yes, I'm afraid that might be the case. Judging by the fraction of condor jobs that have the error, the chance for it to fail is quite high though (unacceptably high unfortunately). With the last set of 100 jobs I tested, only 17 worked and all the others had the Also do you have any idea why this suddenly popped out of (seemingly) nowhere? I ran ~500 jobs without any issues on 28.10 still, same setup, then from 29.10 this started happening. So I thought at first it could be a problem with the new release @jmcarcell kindly made available for me, but reverting back to the one from 3.10 also did not fix it. Maybe it could still be related somehow though? |
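Since the crash is intermittent, a single run says little in either direction. A minimal sketch of a rerun harness follows, assuming the only signal of interest is the glibc message quoted in this issue; the function name `count_malloc_failures` and the script name in the usage line are made up for illustration, and the command to run is whatever is passed after the run count.

```shell
#!/bin/sh
# Sketch: rerun a flaky command and count how many runs print the glibc
# heap-corruption message. Function and variable names are illustrative.

PATTERN='malloc(): unsorted double linked list corrupted'

count_malloc_failures() {
  n_runs=$1
  shift
  log_dir=$(mktemp -d)   # one log per run is kept for later inspection
  failed=0
  i=0
  while [ "$i" -lt "$n_runs" ]; do
    i=$((i + 1))
    # Capture stdout and stderr together. The exit status is ignored on
    # purpose: only runs that print the error message are counted.
    "$@" >"$log_dir/run_$i.log" 2>&1
    if grep -qF "$PATTERN" "$log_dir/run_$i.log"; then
      failed=$((failed + 1))
    fi
  done
  echo "logs kept in $log_dir" >&2
  echo "$failed/$n_runs runs failed"
}

# Usage: sh count_failures.sh N command [args...]
if [ "$#" -ge 2 ]; then
  count_malloc_failures "$@"
fi
```

Run with the same command line as in the failing job script, this gives a local failure fraction that can be compared against the batch numbers quoted above.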
What is the difference in stacks between 03.10. and 28.10.? Are they completely separate? Or are they sharing some packages? Have you tried using the debug stack for running on the batch system? Does that also have the same failure rate? |
I think the difference should only be the The only thing I remember changing after that is fixing the edm4hep output config to have the correct new name for the MCReco collection, i.e. replacing Debug stack on batch I haven't tried. Locally it failed 100% of the time for me, but I can submit 100 jobs with that and see what we get. |
Between October 19 and 23, I made some changes around the memory allocation code in some parts of the Tcl code in Delphes: https://github.com/delphes/delphes/commits/master/external/tcl If I am not mistaken, the 2024-10-03 key4hep release contains an older version of the Delphes code without these changes, and the 2024-10-28 key4hep release contains a newer version of the Delphes code with these changes. So I would say that these changes neither solve this problem nor cause it. I will try to reproduce this problem and see what I can do about it. |
Thanks @pavel-demin ! The standalone Delphes test I ran was using the latest master, so after this commit. Out of 100 test jobs with debug stack and 3.10 release 76 crashed immediately, so similar rate. I noticed that most of them (63) have the Indeed @juliagonski had reported to me by email that removing this module solved the issue for them, but I think this is strange because some of the jobs do still work even with the module. |
Just quickly checking the Delphes that is used in both cases it looks like
$ source /cvmfs/sw.hsf.org/key4hep/setup.sh
AlmaLinux/RockyLinux/RHEL 9 detected
Setting up the Key4hep software stack release latest-opt from CVMFS
Use the following command to reproduce the current environment:
source /cvmfs/sw.hsf.org/key4hep/setup.sh -r 2024-10-28
If you have any issues, comments or requests, open an issue at https://github.com/key4hep/key4hep-spack/issues
Tip: A new -d flag can be used to access debug builds, otherwise the default is the optimized build
$ which DelphesPythia8
/cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/delphes/master-ic3lyz/bin/DelphesPythia8
It also looks like we are not using a tagged release for Delphes in our current releases. Is this on purpose? |
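The same check can be repeated for each release in one go. This is only a sketch: it assumes the setup script can be sourced non-interactively from `bash` with the `-r` option shown in the transcript above, and the `KEY4HEP_SETUP` variable and the `delphes_for_release` helper are names invented here, not part of Key4hep.

```shell
#!/bin/sh
# Sketch: print which DelphesPythia8 binary each Key4hep release puts first
# on PATH. Each release is set up in a fresh bash process so that one
# release's environment cannot leak into the next.

delphes_for_release() {
  setup_script=$1
  release=$2
  bash -c '. "$1" -r "$2" >/dev/null 2>&1; command -v DelphesPythia8' \
    _ "$setup_script" "$release"
}

# KEY4HEP_SETUP is a hypothetical override; the default is the path from the
# transcript. Nothing is run if the script is not readable (no CVMFS mount).
KEY4HEP_SETUP=${KEY4HEP_SETUP:-/cvmfs/sw.hsf.org/key4hep/setup.sh}
if [ -r "$KEY4HEP_SETUP" ]; then
  for release in 2024-10-03 2024-10-28; do
    printf '%s -> %s\n' "$release" \
      "$(delphes_for_release "$KEY4HEP_SETUP" "$release")"
  done
fi
```

If both lines end in the same install path, the two releases share one Delphes build, which is what the `which` output above suggests for the 2024-10-28 release.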
I did some testing on my side and managed to reproduce the crashes with the following commands:
and
These two releases use the same Delphes version from 2024-10-03. I also tried the latest nightly build that uses a newer Delphes version:
With this nightly release, |
Thanks a lot, @pavel-demin! Indeed I can confirm that with the nightly build I was able to run 100 jobs successfully. So do we need a new release with the latest Delphes version, @jmcarcell? Thanks! |
Could we have a proper Delphes tag for that, please (@pavel-demin @selvaggi)? Or are there some developments that need to be done first? |
I have just tagged the current version of the code as a pre-release 3.5.1pre11. Is it OK for you? |
I think that should work. Thanks a lot. |
Would the release then also include the latest k4SimDelphes build, i.e. the fix from #137? |
I created a new tag (v00-07-03) and this should be picked up for the next release, via key4hep/key4hep-spack#669 |
Hi, I also got the same error even with 3.5.1pre11, but I am using just DelphesHepMC2 rather than DelphesPythia8. Could you share more details on how the problem could be fixed? Thank you! |
If you get the problem with the If instead you still get the error for a reader that this repository provides, can you share a few more details on how to reproduce the issue? (software environment, inputs, commands, ...) |
Adding to @Kenny-Jia's last comment: I tried using locally built versions of Delphes tag 3.5.1pre11 and k4SimDelphes tag v00-07-03 on top of the latest key4hep release, still got the With the nightlies I haven't seen the issue again, so maybe the problem is elsewhere after all? Edit: Using the local tagged versions on top of the nightly stack seems to work. But I'm not sure what this tells us - perhaps the way I'm trying to use the local versions is not working correctly. I have checked that Here are the commands I followed in any case: |
Just updating on this that I have run by now some thousands of jobs with the nightlies and the |
Hi @bistapf, I think it should be possible to make a release based on the current "series" that just picks up a newer version of Delphes and the latest tag of k4SimDelphes, to check that the necessary changes have landed where they need to land. |
@bistapf I'll make a new build soon. Having a look at this the version that is used in the releases @pavel-demin Could we get a released version of delphes? The last one that is not a pre (3.5.0) is more than 3 years old and this is what is being used by spack: https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/delphes/package.py#L25 |
I am planning to prepare a new Delphes release soon but I am afraid that I will not have time to do it in the coming weeks. I would also like to understand this problem better and make sure that it is fixed before making the new release. At the moment I am a bit confused with all the different comments and I do not understand if the problem is completely fixed or if it still reappears from time to time. |
No problem, this was more a general comment, for the release builds I can use the latest pre version. |
@bistapf try out the new release by sourcing the setup script
|
Any updates on this @bistapf? |
Hi @jmcarcell , all, sorry for the delay. Unfortunately I have to report that it seems the issue is not fully solved yet. Using When running the tester locally, the chance that it runs through appears to be 50/50. This is when I run the script multiple times in a row on the same interactive node. Could you also give it a try to make sure the problem isn't on my end somehow? Thanks! I attach the collection of logs for a job that failed, as well as one that worked again. The only thing I have noticed is that the |
As previously suggested by @andresailer, I have just fixed the buffer length in I have also tagged the new version of the code with this fix as pre-release 3.5.1pre12. Could you please test this new version to see if this fix resolves the issue? |
Since about 2 weeks ago, a large fraction (more than 2/3) of DelphesPythia8_EDM4HEP batch jobs submitted with EventProducer fail with the following error:
malloc(): unsorted double linked list corrupted
This happens during the initialization of the Delphes modules. Some of the jobs however still run fine with exactly the same configuration.
The jobs are submitted from lxplus, on AlmaLinux 9. The error happens with both latest key4hep releases, so -r 2024-10-03 and -r 2024-10-28.
I've attached a zip file with all the log files for a job that failed (condor_job.000000125.7228699.61.x), and for comparison a .log for one that worked (log_successful_job.log - perhaps it depends on the condor node whether the error occurs?). The job config and the script that failed are also included (job_desc_lhep8.cfg and job000000083.sh).
malloc_error_logs.zip
I will still test whether the error also occurs when running locally, or only on condor, and report back. - Edit: Confirming that locally this script fails.
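To put a number on the failed fraction from a directory of batch logs like the attached ones, a tally along these lines could be used. It is a sketch only: it assumes one log file per job in a flat directory and that a failed job's log contains the error message verbatim; `tally_failed_logs` is a name invented for this example.

```shell
#!/bin/sh
# Sketch: count how many log files in a directory contain the malloc error.
# Assumes one regular file per job; subdirectories are skipped.

tally_failed_logs() {
  log_dir=$1
  pattern='malloc(): unsorted double linked list corrupted'
  total=0
  failed=0
  for f in "$log_dir"/*; do
    [ -f "$f" ] || continue
    total=$((total + 1))
    # -F treats the pattern as a fixed string, so "()" is matched literally.
    if grep -qF "$pattern" "$f"; then
      failed=$((failed + 1))
    fi
  done
  echo "$failed of $total logs contain the malloc error"
}

# Usage: sh tally_logs.sh /path/to/log/directory
if [ "$#" -ge 1 ]; then
  tally_failed_logs "$1"
fi
```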