Describe the bug
Hi There,
Recently I have been trying to run a batch of calculations with ABACUS 3.8.5 and have run into the following problem.
I use a loop inside a single Slurm job script to run the calculations one after another. The first few jobs finish normally, but then the runs start to report MPI and segmentation-fault errors, and the program does not stop on its own. If I go to the directory of the failed job, delete the extra core.xxx files, and resubmit that job on its own, it completes normally and successfully. So I suspect that memory from the previous calculation is not released and conflicts with the currently running job. Could you help me check whether there is a problem in ABACUS here? By the way, ABACUS works fine when I run single jobs.
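For clarity, the manual recovery step looks roughly like this (a minimal sketch; structure 118 is just an example index, and $ABACUS is the variable defined in my submission script below):
cd ./structure/118             # go to the directory of the structure that crashed
rm -f core.*                   # remove the core dumps left behind by the crash
srun $ABACUS/abacus            # resubmitted on its own, this structure finishes normally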
detailed description:
error log:
==== backtrace (tid: 34710) ====
0 0x0000000000050a75 ucs_debug_print_backtrace() ???:0
1 0x0000000000052f11 ucp_ep_match_remove_ep() ???:0
2 0x0000000000058477 ucp_wireup_remote_connected() ???:0
3 0x0000000000059c0e ucp_wireup_send_request() ???:0
4 0x000000000004a3de uct_ud_ep_process_rx() ???:0
5 0x0000000000051825 uct_ud_mlx5_ep_t_delete() ???:0
6 0x000000000002cd6a ucp_worker_progress() ???:0
7 0x000000000000a7a1 mlx_ep_progress() mlx_ep.c:0
8 0x00000000000229cd ofi_cq_progress() osd.c:0
9 0x0000000000022957 ofi_cq_readfrom() osd.c:0
10 0x00000000006d0b30 fi_cq_read() /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_eq.h:385
11 0x000000000021fd19 MPIDI_Progress_test() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:145
12 0x000000000021fa56 MPID_Progress_test() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:216
13 0x000000000021fa56 MPID_Progress_wait() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:277
14 0x000000000089c8c0 MPIR_Wait_impl() /build/impi/_buildspace/release/../../src/mpi/request/wait.c:38
15 0x00000000003f21c7 MPID_Wait() /build/impi/_buildspace/release/../../src/mpid/ch4/include/mpidpost.h:191
16 0x00000000003ea9f0 MPIC_Sendrecv() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:345
17 0x00000000000f5adb MPIR_Allgather_intra_brucks() /build/impi/_buildspace/release/../../src/mpi/coll/allgather/allgather_intra_brucks.c:84
18 0x00000000000f79cb MPIR_Allgather_intra_auto() /build/impi/_buildspace/release/../../src/mpi/coll/allgather/allgather.c:129
19 0x000000000030fff1 MPIR_Comm_split_impl() /build/impi/_buildspace/release/../../src/mpi/comm/comm_split.c:161
20 0x000000000030fff1 PMPI_Comm_split() /build/impi/_buildspace/release/../../src/mpi/comm/comm_split.c:477
21 0x00000000004f5ab1 Parallel_Global::split_diag_world() /share/home/zhangtao/software/abacus-develop-3.8.5/source/module_base/parallel_global.cpp:60
22 0x0000000000721f92 Driver::reading() /share/home/zhangtao/software/abacus-develop-3.8.5/source/driver.cpp:134
23 0x0000000000721d59 Driver::init() /share/home/zhangtao/software/abacus-develop-3.8.5/source/driver.cpp:34
24 0x0000000000456c29 main() ???:0
25 0x0000000000022555 __libc_start_main() ???:0
26 0x0000000000456ad9 _start() ???:0
srun: error: node049: task 44: Segmentation fault (core dumped)
[node049:34717:0:34717] ud_ep.c:263 Fatal: UD endpoint 0x251c490 to : unhandled timeout error
==== backtrace (tid: 34717) ====
log file:
/share/home/zhangtao/work/WTe2/train/abacus_dataset/structure/117
Thu Dec 26 20:29:12 2024
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : CPU / Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Warning: the number of valence electrons in pseudopotential > 6 for W: [Xe] 4f14 5d4 6s2
Warning: the number of valence electrons in pseudopotential > 6 for Te: [Kr] 4d10 5s2 5p4
Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
UNIFORM GRID DIM : 240 * 128 * 360
UNIFORM GRID DIM(BIG) : 48 * 32 * 90
DONE(0.498146 SEC) : SETUP UNITCELL
DONE(0.522707 SEC) : INIT K-POINTS
Self-consistent calculations for electrons
SPIN KPOINTS PROCESSORS THREADS NBASE
4 8 112 112 6696
Use Systematically Improvable Atomic bases
ELEMENT ORBITALS NBASE NATOM XC
W 4s2p2d2f1g-8au 43 36
Te 2s2p2d1f-7au 25 72
Initial plane wave basis and FFT box
DONE(0.618356 SEC) : INIT PLANEWAVE
DONE(1.62949 SEC) : LOCAL POTENTIAL
SELF-CONSISTENT :
START CHARGE : atomic
DONE(2.98984 SEC) : INIT SCF
ITER TMAGX TMAGY TMAGZ AMAG ETOT/eV EDIFF/eV DRHO TIME/s
GE1 -1.46e-02 -5.38e-04 9.80e+00 1.10e+01 -4.37568219e+05 0.00000000e+00 1.0888e-01 48.77
GE2 6.82e-02 -6.09e-03 -3.55e+00 3.98e+00 -4.37586699e+05 -1.84800427e+01 5.8743e-02 44.35
GE3 -2.13e-02 -2.54e-02 -1.71e-01 5.24e-01 -4.37586448e+05 2.51946962e-01 1.2509e-02 44.25
GE4 8.52e-03 8.13e-03 3.27e-01 5.81e-01 -4.37585673e+05 7.74657372e-01 6.6829e-03 44.19
GE5 8.57e-03 6.24e-03 4.13e-02 2.05e-01 -4.37586335e+05 -6.61918743e-01 3.6264e-03 44.18
GE6 -2.65e-03 -2.63e-03 -8.83e-02 1.93e-01 -4.37586894e+05 -5.59086419e-01 1.7292e-03 44.16
GE7 -5.85e-03 -2.89e-03 -5.85e-03 9.95e-02 -4.37586988e+05 -9.45896138e-02 1.1845e-03 44.19
GE8 -2.42e-03 -1.25e-03 -1.43e-03 7.04e-02 -4.37587032e+05 -4.31605742e-02 4.5332e-04 44.21
GE9 -5.11e-04 -4.69e-04 -1.78e-03 1.83e-02 -4.37587034e+05 -2.04659039e-03 2.3070e-04 44.20
GE10 8.73e-06 -1.25e-04 -1.20e-03 6.58e-03 -4.37587034e+05 -7.84764105e-04 1.6803e-04 44.17
GE11 -9.33e-05 -7.78e-05 4.43e-05 3.71e-03 -4.37587035e+05 -7.94247330e-04 1.1264e-04 44.13
GE12 -5.78e-05 -2.84e-05 -2.38e-04 1.62e-03 -4.37587036e+05 -5.04367135e-04 7.7132e-05 44.40
GE13 -3.76e-06 2.84e-06 -9.98e-05 8.58e-04 -4.37587036e+05 -5.42701164e-04 5.6086e-05 44.18
GE14 1.18e-05 4.11e-06 4.62e-05 3.57e-04 -4.37587037e+05 -7.15291460e-04 4.1150e-05 44.19
GE15 -5.05e-06 -3.37e-06 -6.68e-06 2.54e-04 -4.37587038e+05 -6.69437661e-04 3.1977e-05 44.14
GE16 -6.90e-06 -3.26e-06 -2.68e-05 2.09e-04 -4.37587038e+05 -6.80805247e-04 2.3916e-05 44.10
GE17 3.37e-06 2.86e-06 -2.46e-06 2.05e-04 -4.37587039e+05 -4.32145611e-04 1.7428e-05 44.19
GE18 2.81e-06 1.16e-06 1.24e-05 6.65e-05 -4.37587039e+05 -3.02412699e-04 1.2341e-05 44.18
GE19 -1.82e-06 -1.92e-06 -3.11e-06 8.48e-05 -4.37587039e+05 -3.14875808e-04 7.4783e-06 44.12
GE20 -1.72e-06 -8.66e-07 -9.34e-06 4.56e-05 -4.37587040e+05 -1.44921833e-04 4.7970e-06 44.13
GE21 8.77e-08 1.05e-07 -1.61e-06 2.68e-05 -4.37587040e+05 -1.20670611e-04 3.4081e-06 44.18
GE22 6.72e-07 3.79e-09 3.60e-06 2.17e-05 -4.37587040e+05 -6.42332626e-05 2.1660e-06 44.08
GE23 5.45e-08 3.27e-08 4.99e-07 8.56e-06 -4.37587040e+05 -3.96082367e-05 1.5053e-06 43.98
GE24 -2.22e-07 2.03e-07 -1.52e-06 1.02e-05 -4.37587040e+05 -2.43522961e-05 9.9533e-07 44.01
GE25 8.76e-09 6.03e-08 -5.18e-08 3.16e-06 -4.37587040e+05 -9.98646435e-06 6.7719e-07 44.18
GE26 7.53e-08 -4.90e-09 5.24e-07 2.78e-06 -4.37587040e+05 -6.37074039e-06 5.0779e-07 44.27
GE27 2.65e-08 -2.79e-08 2.14e-07 1.66e-06 -4.37587040e+05 -2.97978340e-06 3.8380e-07 44.17
GE28 -1.22e-08 3.03e-08 -1.27e-07 1.16e-06 -4.37587040e+05 -2.66389201e-06 3.2717e-07 44.27
GE29 -6.58e-09 2.22e-08 -7.80e-08 6.86e-07 -4.37587040e+05 -2.39101372e-06 2.5727e-07 44.17
GE30 4.15e-09 -1.27e-08 5.75e-08 5.84e-07 -4.37587040e+05 -1.97479142e-06 2.0120e-07 44.18
GE31 4.83e-11 -7.68e-09 7.09e-09 2.29e-07 -4.37587040e+05 -1.50639903e-06 1.6062e-07 44.16
GE32 -3.96e-09 4.05e-09 -4.81e-08 4.34e-07 -4.37587040e+05 -1.35394753e-06 1.2674e-07 44.12
GE33 2.59e-09 8.59e-09 2.80e-08 1.99e-07 -4.37587040e+05 -3.02348947e-06 9.9503e-08 44.03
TIME STATISTICS
CLASS_NAME NAME TIME/s CALLS AVG/s PER/%
Driver reading 0.19 1 0.19 0.01
Input_Conv Convert 0.00 1 0.00 0.00
Driver driver_line 1643.14 1 1643.14 99.99
UnitCell check_tau 0.00 1 0.00 0.00
ESolver_KS_LCAO before_all_runners 1.35 1 1.35 0.08
PW_Basis_Sup setuptransform 0.08 1 0.08 0.00
PW_Basis_Sup distributeg 0.07 1 0.07 0.00
mymath heapsort 0.01 3 0.00 0.00
Charge_Mixing init_mixing 0.00 1 0.00 0.00
PW_Basis_K setuptransform 0.06 1 0.06 0.00
PW_Basis_K distributeg 0.06 1 0.06 0.00
PW_Basis setup_struc_factor 0.15 1 0.15 0.01
NOrbital_Lm extra_uniform 0.34 577 0.00 0.02
Mathzone_Add1 SplineD2 0.01 577 0.00 0.00
Mathzone_Add1 Cubic_Spline_Interpolation 0.17 577 0.00 0.01
ppcell_vl init_vloc 0.27 1 0.27 0.02
Ions opt_ions 1641.56 1 1641.56 99.89
ESolver_KS_LCAO runner 1641.56 1 1641.56 99.89
ESolver_KS_LCAO before_scf 1.35 1 1.35 0.08
Vdwd2 energy 0.08 1 0.08 0.00
atom_arrange search 0.00 1 0.00 0.00
atom_arrange Atom_input 0.00 1 0.00 0.00
atom_arrange grid_d.init 0.00 1 0.00 0.00
Grid Build_Hash_Table 0.00 1 0.00 0.00
Grid Construct_Adjacent_expand 0.00 1 0.00 0.00
Grid Construct_Adjacent_expand_periodic 0.00 108 0.00 0.00
Grid_Technique init 0.03 1 0.03 0.00
Grid_BigCell grid_expansion_index 0.01 1 0.01 0.00
Grid_Driver Find_atom 0.00 756 0.00 0.00
Record_adj for_2d 0.05 1 0.05 0.00
LCAO_domain grid_prepare 0.00 1 0.00 0.00
Veff initialize_HR 0.00 1 0.00 0.00
OverlapNew initialize_SR 0.00 1 0.00 0.00
EkineticNew initialize_HR 0.00 1 0.00 0.00
NonlocalNew initialize_HR 0.01 1 0.01 0.00
Charge set_rho_core 0.00 1 0.00 0.00
Charge atomic_rho 1.30 2 0.65 0.08
PW_Basis_Sup recip2real 4.44 545 0.01 0.27
PW_Basis_Sup gathers_scatterp 3.24 545 0.01 0.20
Potential init_pot 0.39 1 0.39 0.02
Potential update_from_charge 12.28 34 0.36 0.75
Potential cal_fixed_v 0.01 1 0.01 0.00
PotLocal cal_fixed_v 0.01 1 0.01 0.00
Potential cal_v_eff 12.20 34 0.36 0.74
H_Hartree_pw v_hartree 0.62 34 0.02 0.04
PW_Basis_Sup real2recip 5.56 506 0.01 0.34
PW_Basis_Sup gatherp_scatters 4.53 506 0.01 0.28
PotXC cal_v_eff 11.47 34 0.34 0.70
XC_Functional v_xc 11.39 34 0.33 0.69
Potential interpolate_vrs 0.06 34 0.00 0.00
H_Ewald_pw compute_ewald 0.00 1 0.00 0.00
HSolverLCAO solve 1446.02 33 43.82 87.99
HamiltLCAO updateHk 132.29 264 0.50 8.05
OperatorLCAO init 131.33 792 0.17 7.99
Veff contributeHR 126.26 33 3.83 7.68
Gint_interface cal_gint 0.00 165 0.00 0.00
Gint_k transfer_pvpR 126.26 33 3.83 7.68
OverlapNew calculate_SR 0.18 1 0.18 0.01
OverlapNew contributeHk 0.78 264 0.00 0.05
EkineticNew contributeHR 0.18 33 0.01 0.01
EkineticNew calculate_HR 0.18 1 0.18 0.01
NonlocalNew contributeHR 3.60 33 0.11 0.22
NonlocalNew calculate_HR 3.54 1 3.54 0.22
OperatorLCAO contributeHk 0.73 264 0.00 0.04
HSolverLCAO hamiltSolvePsiK 1118.38 264 4.24 68.06
DiagoElpa elpa_solve 1110.15 264 4.21 67.55
elecstate cal_dm 78.13 33 2.37 4.75
psiMulPsiMpi pdgemm 77.63 264 0.29 4.72
DensityMatrix cal_DMR 0.81 33 0.02 0.05
ElecStateLCAO psiToRho 115.68 33 3.51 7.04
Gint transfer_DMR 7.14 33 0.22 0.43
Charge_Mixing get_drho 0.02 33 0.00 0.00
Charge mix_rho 3.58 32 0.11 0.22
Charge Broyden_mixing 0.73 32 0.02 0.04
ESolver_KS_LCAO after_scf 177.98 1 177.98 10.83
ModuleIO write_rhog 2.22 1 2.22 0.14
ModuleIO output_HSR 74.82 1 74.82 4.55
ModuleIO save_HSR_sparse 74.69 1 74.69 4.55
cal_r_overlap_R init 54.67 1 54.67 3.33
ORB_gaunt_table init_Gaunt_CH 0.01 1 0.01 0.00
ORB_gaunt_table Calc_Gaunt_CH 0.00 3738 0.00 0.00
ORB_gaunt_table init_Gaunt 0.03 1 0.03 0.00
ORB_gaunt_table Get_Gaunt_SH 0.01 78408 0.00 0.00
Center2_Orb cal_ST_Phi12_R 51.02 1569 0.03 3.10
cal_r_overlap_R out_rR_other 45.58 1 45.58 2.77
ESolver_KS_LCAO after_all_runners 0.12 1 0.12 0.01
ModuleIO write_istate_info 0.12 1 0.12 0.01
START Time : Thu Dec 26 20:29:12 2024
FINISH Time : Thu Dec 26 20:56:35 2024
TOTAL Time : 1643
SEE INFORMATION IN : OUT.ABACUS/
/share/home/zhangtao/work/WTe2/train/abacus_dataset/structure/118
Thu Dec 26 20:56:41 2024
MAKE THE DIR : OUT.ABACUS/
slurm submitting script:
#! /bin/bash
#SBATCH -p regular
#SBATCH -N 2
#SBATCH --ntasks-per-node=56
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH -J deeph-dataset
#SBATCH -o job-%j.log
#SBATCH -e job-%j.err
source /share/apps/intel-oneAPI-2021/setvars.sh
source /share/home/zhangtao/software/abacus-develop-3.8.5/toolchain/install/setup
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
ABACUS=/share/home/zhangtao/software/abacus-develop-3.8.5/bin
export OMP_NUM_THREADS=1
srun hostname -s |sort -n > slurm.hosts
cd ./structure/
for i in {80..126}
do
cd $i
pwd
srun $ABACUS/abacus
rm OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm OUT.ABACUS/data-rR-sparse.csr
cd ..
done
rm -rf slurm.hosts
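In case it helps with debugging, below is a variant of the loop body I could try (only a sketch on my side, not something I have verified to fix the problem): it records the exit status of each srun call and pauses briefly before moving on to the next structure, so a crashed structure is logged instead of the loop silently continuing.
for i in {80..126}
do
cd $i
pwd
srun $ABACUS/abacus
status=$?                                  # exit status of this structure's run
if [ $status -ne 0 ]; then
echo "structure $i failed with exit code $status" >> ../../failed.log
fi
rm -f OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm -f OUT.ABACUS/data-rR-sparse.csr
cd ..
sleep 10                                   # give MPI/UCX resources time to be released before the next srun
done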
Expected behavior
The whole batch of jobs runs smoothly to completion.
To Reproduce
A note on my ABACUS build: the HSE module is included in my version.
If you want to reproduce my case, here are some tests you can try (see abacus.zip below). I cannot promise they will trigger the error: sometimes it appears after 30 jobs, sometimes after only 2, so the behaviour is random. I still hope you can reproduce it. The submission script is the one shown above.
abacus.zip
Environment
No response
Additional Context
loop used to run the batch:
for i in {80..126}
do
cd $i
pwd
srun $ABACUS/abacus
rm OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm OUT.ABACUS/data-rR-sparse.csr
cd ..
done
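As an alternative I have been considering (only a sketch, assuming the same paths and environment setup as the submission script above; I have not tested it on my cluster), the loop could be replaced by a Slurm job array so that every structure runs in its own allocation and all MPI resources are torn down between structures:
#!/bin/bash
#SBATCH -p regular
#SBATCH -N 2
#SBATCH --ntasks-per-node=56
#SBATCH --cpus-per-task=1
#SBATCH --array=80-126
#SBATCH -J deeph-dataset
#SBATCH -o job-%A_%a.log
#SBATCH -e job-%A_%a.err
source /share/apps/intel-oneAPI-2021/setvars.sh
source /share/home/zhangtao/software/abacus-develop-3.8.5/toolchain/install/setup
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
ABACUS=/share/home/zhangtao/software/abacus-develop-3.8.5/bin
export OMP_NUM_THREADS=1
cd ./structure/$SLURM_ARRAY_TASK_ID        # each array task handles exactly one structure
srun $ABACUS/abacus
rm -f OUT.ABACUS/ABACUS-CHARGE-DENSITY.restart
rm -f OUT.ABACUS/data-rR-sparse.csr
Each array task would then start from a fresh MPI environment, which would also test whether the crashes really come from state carried over between consecutive srun calls within one allocation.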
Task list for Issue attackers (only for developers)