high throughput calculation problem #5776

Open
JTaozhang opened this issue Dec 27, 2024 · 1 comment
Labels
Questions Raise your question! We will answer it.

Comments

@JTaozhang

Describe the bug

Hi There,

Recently, I have been trying to run a batch of jobs with ABACUS 3.8.5 and have run into the following problems.
I use a loop inside a single Slurm job script to run the jobs one after another. The first few jobs run fine, but then the run starts to report MPI and segmentation-fault errors, and the program does not stop on its own. If I locate the failed job, clean up the extra core.xxx files, and resubmit it on its own (roughly the steps sketched below), that job runs normally and finishes successfully. So I suspect the cause may be memory from the previous calculation that is not released and conflicts with the currently running job. Could you help me check whether there is any shortcoming in ABACUS? For what it's worth, ABACUS works well when I run single jobs.
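
For reference, the manual recovery I do for a failed directory looks roughly like this (only a sketch; the directory number is just an example from my run):

FAILED=117                 # example; in practice this is whichever job crashed
cd ./structure/$FAILED
rm -f core.*               # remove the core dumps left behind by the crash
srun $ABACUS/abacus        # rerun the same job on its own; it then finishes normally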

Detailed description:

Error log:
==== backtrace (tid: 34710) ====
0 0x0000000000050a75 ucs_debug_print_backtrace() ???:0
1 0x0000000000052f11 ucp_ep_match_remove_ep() ???:0
2 0x0000000000058477 ucp_wireup_remote_connected() ???:0
3 0x0000000000059c0e ucp_wireup_send_request() ???:0
4 0x000000000004a3de uct_ud_ep_process_rx() ???:0
5 0x0000000000051825 uct_ud_mlx5_ep_t_delete() ???:0
6 0x000000000002cd6a ucp_worker_progress() ???:0
7 0x000000000000a7a1 mlx_ep_progress() mlx_ep.c:0
8 0x00000000000229cd ofi_cq_progress() osd.c:0
9 0x0000000000022957 ofi_cq_readfrom() osd.c:0
10 0x00000000006d0b30 fi_cq_read() /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_eq.h:385
11 0x000000000021fd19 MPIDI_Progress_test() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:145
12 0x000000000021fa56 MPID_Progress_test() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:216
13 0x000000000021fa56 MPID_Progress_wait() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:277
14 0x000000000089c8c0 MPIR_Wait_impl() /build/impi/_buildspace/release/../../src/mpi/request/wait.c:38
15 0x00000000003f21c7 MPID_Wait() /build/impi/_buildspace/release/../../src/mpid/ch4/include/mpidpost.h:191
16 0x00000000003ea9f0 MPIC_Sendrecv() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:345
17 0x00000000000f5adb MPIR_Allgather_intra_brucks() /build/impi/_buildspace/release/../../src/mpi/coll/allgather/allgather_intra_brucks.c:84
18 0x00000000000f79cb MPIR_Allgather_intra_auto() /build/impi/_buildspace/release/../../src/mpi/coll/allgather/allgather.c:129
19 0x000000000030fff1 MPIR_Comm_split_impl() /build/impi/_buildspace/release/../../src/mpi/comm/comm_split.c:161
20 0x000000000030fff1 PMPI_Comm_split() /build/impi/_buildspace/release/../../src/mpi/comm/comm_split.c:477
21 0x00000000004f5ab1 Parallel_Global::split_diag_world() /share/home/zhangtao/software/abacus-develop-3.8.5/source/module_base/parallel_global.cpp:60
22 0x0000000000721f92 Driver::reading() /share/home/zhangtao/software/abacus-develop-3.8.5/source/driver.cpp:134
23 0x0000000000721d59 Driver::init() /share/home/zhangtao/software/abacus-develop-3.8.5/source/driver.cpp:34
24 0x0000000000456c29 main() ???:0
25 0x0000000000022555 __libc_start_main() ???:0
26 0x0000000000456ad9 _start() ???:0

srun: error: node049: task 44: Segmentation fault (core dumped)
[node049:34717:0:34717] ud_ep.c:263 Fatal: UD endpoint 0x251c490 to : unhandled timeout error
==== backtrace (tid: 34717) ====

Log file:
/share/home/zhangtao/work/WTe2/train/abacus_dataset/structure/117

                          ABACUS v3.8.4

           Atomic-orbital Based Ab-initio Computation at UStc                    

                 Website: http://abacus.ustc.edu.cn/                             
           Documentation: https://abacus.deepmodeling.com/                       
              Repository: https://github.com/abacusmodeling/abacus-develop       
                          https://github.com/deepmodeling/abacus-develop         
                  Commit: unknown

Thu Dec 26 20:29:12 2024
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : CPU / Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Warning: the number of valence electrons in pseudopotential > 6 for W: [Xe] 4f14 5d4 6s2
Warning: the number of valence electrons in pseudopotential > 6 for Te: [Kr] 4d10 5s2 5p4
Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

UNIFORM GRID DIM : 240 * 128 * 360
UNIFORM GRID DIM(BIG) : 48 * 32 * 90
DONE(0.498146 SEC) : SETUP UNITCELL
DONE(0.522707 SEC) : INIT K-POINTS

Self-consistent calculations for electrons

SPIN KPOINTS PROCESSORS THREADS NBASE
4 8 112 112 6696

Use Systematically Improvable Atomic bases

ELEMENT ORBITALS NBASE NATOM XC
W 4s2p2d2f1g-8au 43 36
Te 2s2p2d1f-7au 25 72

Initial plane wave basis and FFT box

DONE(0.618356 SEC) : INIT PLANEWAVE
DONE(1.62949 SEC) : LOCAL POTENTIAL

SELF-CONSISTENT :

START CHARGE : atomic
DONE(2.98984 SEC) : INIT SCF
ITER TMAGX TMAGY TMAGZ AMAG ETOT/eV EDIFF/eV DRHO TIME/s
GE1 -1.46e-02 -5.38e-04 9.80e+00 1.10e+01 -4.37568219e+05 0.00000000e+00 1.0888e-01 48.77
GE2 6.82e-02 -6.09e-03 -3.55e+00 3.98e+00 -4.37586699e+05 -1.84800427e+01 5.8743e-02 44.35
GE3 -2.13e-02 -2.54e-02 -1.71e-01 5.24e-01 -4.37586448e+05 2.51946962e-01 1.2509e-02 44.25
GE4 8.52e-03 8.13e-03 3.27e-01 5.81e-01 -4.37585673e+05 7.74657372e-01 6.6829e-03 44.19
GE5 8.57e-03 6.24e-03 4.13e-02 2.05e-01 -4.37586335e+05 -6.61918743e-01 3.6264e-03 44.18
GE6 -2.65e-03 -2.63e-03 -8.83e-02 1.93e-01 -4.37586894e+05 -5.59086419e-01 1.7292e-03 44.16
GE7 -5.85e-03 -2.89e-03 -5.85e-03 9.95e-02 -4.37586988e+05 -9.45896138e-02 1.1845e-03 44.19
GE8 -2.42e-03 -1.25e-03 -1.43e-03 7.04e-02 -4.37587032e+05 -4.31605742e-02 4.5332e-04 44.21
GE9 -5.11e-04 -4.69e-04 -1.78e-03 1.83e-02 -4.37587034e+05 -2.04659039e-03 2.3070e-04 44.20
GE10 8.73e-06 -1.25e-04 -1.20e-03 6.58e-03 -4.37587034e+05 -7.84764105e-04 1.6803e-04 44.17
GE11 -9.33e-05 -7.78e-05 4.43e-05 3.71e-03 -4.37587035e+05 -7.94247330e-04 1.1264e-04 44.13
GE12 -5.78e-05 -2.84e-05 -2.38e-04 1.62e-03 -4.37587036e+05 -5.04367135e-04 7.7132e-05 44.40
GE13 -3.76e-06 2.84e-06 -9.98e-05 8.58e-04 -4.37587036e+05 -5.42701164e-04 5.6086e-05 44.18
GE14 1.18e-05 4.11e-06 4.62e-05 3.57e-04 -4.37587037e+05 -7.15291460e-04 4.1150e-05 44.19
GE15 -5.05e-06 -3.37e-06 -6.68e-06 2.54e-04 -4.37587038e+05 -6.69437661e-04 3.1977e-05 44.14
GE16 -6.90e-06 -3.26e-06 -2.68e-05 2.09e-04 -4.37587038e+05 -6.80805247e-04 2.3916e-05 44.10
GE17 3.37e-06 2.86e-06 -2.46e-06 2.05e-04 -4.37587039e+05 -4.32145611e-04 1.7428e-05 44.19
GE18 2.81e-06 1.16e-06 1.24e-05 6.65e-05 -4.37587039e+05 -3.02412699e-04 1.2341e-05 44.18
GE19 -1.82e-06 -1.92e-06 -3.11e-06 8.48e-05 -4.37587039e+05 -3.14875808e-04 7.4783e-06 44.12
GE20 -1.72e-06 -8.66e-07 -9.34e-06 4.56e-05 -4.37587040e+05 -1.44921833e-04 4.7970e-06 44.13
GE21 8.77e-08 1.05e-07 -1.61e-06 2.68e-05 -4.37587040e+05 -1.20670611e-04 3.4081e-06 44.18
GE22 6.72e-07 3.79e-09 3.60e-06 2.17e-05 -4.37587040e+05 -6.42332626e-05 2.1660e-06 44.08
GE23 5.45e-08 3.27e-08 4.99e-07 8.56e-06 -4.37587040e+05 -3.96082367e-05 1.5053e-06 43.98
GE24 -2.22e-07 2.03e-07 -1.52e-06 1.02e-05 -4.37587040e+05 -2.43522961e-05 9.9533e-07 44.01
GE25 8.76e-09 6.03e-08 -5.18e-08 3.16e-06 -4.37587040e+05 -9.98646435e-06 6.7719e-07 44.18
GE26 7.53e-08 -4.90e-09 5.24e-07 2.78e-06 -4.37587040e+05 -6.37074039e-06 5.0779e-07 44.27
GE27 2.65e-08 -2.79e-08 2.14e-07 1.66e-06 -4.37587040e+05 -2.97978340e-06 3.8380e-07 44.17
GE28 -1.22e-08 3.03e-08 -1.27e-07 1.16e-06 -4.37587040e+05 -2.66389201e-06 3.2717e-07 44.27
GE29 -6.58e-09 2.22e-08 -7.80e-08 6.86e-07 -4.37587040e+05 -2.39101372e-06 2.5727e-07 44.17
GE30 4.15e-09 -1.27e-08 5.75e-08 5.84e-07 -4.37587040e+05 -1.97479142e-06 2.0120e-07 44.18
GE31 4.83e-11 -7.68e-09 7.09e-09 2.29e-07 -4.37587040e+05 -1.50639903e-06 1.6062e-07 44.16
GE32 -3.96e-09 4.05e-09 -4.81e-08 4.34e-07 -4.37587040e+05 -1.35394753e-06 1.2674e-07 44.12
GE33 2.59e-09 8.59e-09 2.80e-08 1.99e-07 -4.37587040e+05 -3.02348947e-06 9.9503e-08 44.03
TIME STATISTICS

CLASS_NAME NAME TIME/s CALLS AVG/s PER/%

             total                              1643.33 11       149.39  100.00 

Driver reading 0.19 1 0.19 0.01
Input_Conv Convert 0.00 1 0.00 0.00
Driver driver_line 1643.14 1 1643.14 99.99
UnitCell check_tau 0.00 1 0.00 0.00
ESolver_KS_LCAO before_all_runners 1.35 1 1.35 0.08
PW_Basis_Sup setuptransform 0.08 1 0.08 0.00
PW_Basis_Sup distributeg 0.07 1 0.07 0.00
mymath heapsort 0.01 3 0.00 0.00
Charge_Mixing init_mixing 0.00 1 0.00 0.00
PW_Basis_K setuptransform 0.06 1 0.06 0.00
PW_Basis_K distributeg 0.06 1 0.06 0.00
PW_Basis setup_struc_factor 0.15 1 0.15 0.01
NOrbital_Lm extra_uniform 0.34 577 0.00 0.02
Mathzone_Add1 SplineD2 0.01 577 0.00 0.00
Mathzone_Add1 Cubic_Spline_Interpolation 0.17 577 0.00 0.01
ppcell_vl init_vloc 0.27 1 0.27 0.02
Ions opt_ions 1641.56 1 1641.56 99.89
ESolver_KS_LCAO runner 1641.56 1 1641.56 99.89
ESolver_KS_LCAO before_scf 1.35 1 1.35 0.08
Vdwd2 energy 0.08 1 0.08 0.00
atom_arrange search 0.00 1 0.00 0.00
atom_arrange Atom_input 0.00 1 0.00 0.00
atom_arrange grid_d.init 0.00 1 0.00 0.00
Grid Build_Hash_Table 0.00 1 0.00 0.00
Grid Construct_Adjacent_expand 0.00 1 0.00 0.00
Grid Construct_Adjacent_expand_periodic 0.00 108 0.00 0.00
Grid_Technique init 0.03 1 0.03 0.00
Grid_BigCell grid_expansion_index 0.01 1 0.01 0.00
Grid_Driver Find_atom 0.00 756 0.00 0.00
Record_adj for_2d 0.05 1 0.05 0.00
LCAO_domain grid_prepare 0.00 1 0.00 0.00
Veff initialize_HR 0.00 1 0.00 0.00
OverlapNew initialize_SR 0.00 1 0.00 0.00
EkineticNew initialize_HR 0.00 1 0.00 0.00
NonlocalNew initialize_HR 0.01 1 0.01 0.00
Charge set_rho_core 0.00 1 0.00 0.00
Charge atomic_rho 1.30 2 0.65 0.08
PW_Basis_Sup recip2real 4.44 545 0.01 0.27
PW_Basis_Sup gathers_scatterp 3.24 545 0.01 0.20
Potential init_pot 0.39 1 0.39 0.02
Potential update_from_charge 12.28 34 0.36 0.75
Potential cal_fixed_v 0.01 1 0.01 0.00
PotLocal cal_fixed_v 0.01 1 0.01 0.00
Potential cal_v_eff 12.20 34 0.36 0.74
H_Hartree_pw v_hartree 0.62 34 0.02 0.04
PW_Basis_Sup real2recip 5.56 506 0.01 0.34
PW_Basis_Sup gatherp_scatters 4.53 506 0.01 0.28
PotXC cal_v_eff 11.47 34 0.34 0.70
XC_Functional v_xc 11.39 34 0.33 0.69
Potential interpolate_vrs 0.06 34 0.00 0.00
H_Ewald_pw compute_ewald 0.00 1 0.00 0.00
HSolverLCAO solve 1446.02 33 43.82 87.99
HamiltLCAO updateHk 132.29 264 0.50 8.05
OperatorLCAO init 131.33 792 0.17 7.99
Veff contributeHR 126.26 33 3.83 7.68
Gint_interface cal_gint 0.00 165 0.00 0.00
Gint_k transfer_pvpR 126.26 33 3.83 7.68
OverlapNew calculate_SR 0.18 1 0.18 0.01
OverlapNew contributeHk 0.78 264 0.00 0.05
EkineticNew contributeHR 0.18 33 0.01 0.01
EkineticNew calculate_HR 0.18 1 0.18 0.01
NonlocalNew contributeHR 3.60 33 0.11 0.22
NonlocalNew calculate_HR 3.54 1 3.54 0.22
OperatorLCAO contributeHk 0.73 264 0.00 0.04
HSolverLCAO hamiltSolvePsiK 1118.38 264 4.24 68.06
DiagoElpa elpa_solve 1110.15 264 4.21 67.55
elecstate cal_dm 78.13 33 2.37 4.75
psiMulPsiMpi pdgemm 77.63 264 0.29 4.72
DensityMatrix cal_DMR 0.81 33 0.02 0.05
ElecStateLCAO psiToRho 115.68 33 3.51 7.04
Gint transfer_DMR 7.14 33 0.22 0.43
Charge_Mixing get_drho 0.02 33 0.00 0.00
Charge mix_rho 3.58 32 0.11 0.22
Charge Broyden_mixing 0.73 32 0.02 0.04
ESolver_KS_LCAO after_scf 177.98 1 177.98 10.83
ModuleIO write_rhog 2.22 1 2.22 0.14
ModuleIO output_HSR 74.82 1 74.82 4.55
ModuleIO save_HSR_sparse 74.69 1 74.69 4.55
cal_r_overlap_R init 54.67 1 54.67 3.33
ORB_gaunt_table init_Gaunt_CH 0.01 1 0.01 0.00
ORB_gaunt_table Calc_Gaunt_CH 0.00 3738 0.00 0.00
ORB_gaunt_table init_Gaunt 0.03 1 0.03 0.00
ORB_gaunt_table Get_Gaunt_SH 0.01 78408 0.00 0.00
Center2_Orb cal_ST_Phi12_R 51.02 1569 0.03 3.10
cal_r_overlap_R out_rR_other 45.58 1 45.58 2.77
ESolver_KS_LCAO after_all_runners 0.12 1 0.12 0.01
ModuleIO write_istate_info 0.12 1 0.12 0.01

START Time : Thu Dec 26 20:29:12 2024
FINISH Time : Thu Dec 26 20:56:35 2024
TOTAL Time : 1643
SEE INFORMATION IN : OUT.ABACUS/
/share/home/zhangtao/work/WTe2/train/abacus_dataset/structure/118

                          ABACUS v3.8.4

           Atomic-orbital Based Ab-initio Computation at UStc                    

                 Website: http://abacus.ustc.edu.cn/                             
           Documentation: https://abacus.deepmodeling.com/                       
              Repository: https://github.com/abacusmodeling/abacus-develop       
                          https://github.com/deepmodeling/abacus-develop         
                  Commit: unknown

Thu Dec 26 20:56:41 2024
MAKE THE DIR : OUT.ABACUS/

Slurm submission script:
#! /bin/bash
#SBATCH -p regular
#SBATCH -N 2
#SBATCH --ntasks-per-node=56
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH -J deeph-dataset
#SBATCH -o job-%j.log
#SBATCH -e job-%j.err

source /share/apps/intel-oneAPI-2021/setvars.sh
source /share/home/zhangtao/software/abacus-develop-3.8.5/toolchain/install/setup

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
ABACUS=/share/home/zhangtao/software/abacus-develop-3.8.5/bin

export OMP_NUM_THREADS=1

srun hostname -s |sort -n > slurm.hosts
cd ./structure/
for i in {80..126}
do
cd $i
pwd
srun $ABACUS/abacus
rm OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm OUT.ABACUS/data-rR-sparse.csr
cd ..
done
rm -rf slurm.hosts
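
For completeness, a variant of the inner loop with per-step error checking is sketched below (only an idea I am considering; failed.list is a hypothetical file name, and I have not verified that this avoids the crash):

for i in {80..126}
do
cd $i
pwd
srun $ABACUS/abacus
if [ $? -ne 0 ]; then
    echo "structure $i failed" >> ../failed.list   # record the failed directory
    rm -f core.*                                   # clean up core dumps before moving on
fi
rm OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm OUT.ABACUS/data-rR-sparse.csr
cd ..
done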

Expected behavior

The whole batch of jobs should be calculated smoothly to completion.

To Reproduce

Description of my ABACUS build: the HSE module has been included in my version.

If you want to reproduce my case, here are some tests you can try. I cannot promise they will trigger the error, because sometimes the error appears after 30 jobs and sometimes after only 2; the behaviour is random. I still hope you can reproduce the error successfully. The submission script is given in the section above.

abacus.zip

Environment

No response

Additional Context

Loop used to run the jobs:

for i in {80..126}
do
cd $i
pwd
srun $ABACUS/abacus
rm OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm OUT.ABACUS/data-rR-sparse.csr
cd ..
done
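
As an alternative I have also considered (only a sketch, assuming each structure directory is fully independent), each structure could be run as its own Slurm array task instead of looping inside one allocation, so every calculation starts from a fresh MPI launch:

#! /bin/bash
#SBATCH -p regular
#SBATCH -N 2
#SBATCH --ntasks-per-node=56
#SBATCH --cpus-per-task=1
#SBATCH -J deeph-dataset
#SBATCH -o job-%A_%a.log
#SBATCH -e job-%A_%a.err
#SBATCH --array=80-126                 # one array task per structure directory

source /share/apps/intel-oneAPI-2021/setvars.sh
source /share/home/zhangtao/software/abacus-develop-3.8.5/toolchain/install/setup
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
export OMP_NUM_THREADS=1
ABACUS=/share/home/zhangtao/software/abacus-develop-3.8.5/bin

cd ./structure/$SLURM_ARRAY_TASK_ID    # each task handles exactly one structure
srun $ABACUS/abacus
rm OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm OUT.ABACUS/data-rR-sparse.csr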

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
@mohanchen
Collaborator

Thanks for reporting the issue! As you have mentioned, it is a random behaviour. We will try to see if we can locate the error.

@mohanchen added the Questions label on Dec 28, 2024