Hi,
I am trying to run the GAN tutorial on MNIST (I made some minor modifications for my system):
import argparse
import lbann
import lbann.launcher
from gan_model import build_model
from mnist_dataset import make_data_reader

mini_batch_size = 128
num_epochs = 100
job_name = "gan"

trainer = lbann.Trainer(mini_batch_size)
model = build_model(num_epochs)
data_reader = make_data_reader()
opt = lbann.Adam(learn_rate=1e-4, beta1=0., beta2=0.99, eps=1e-8)

kwargs = {
    "nodes": 1,
    "scheduler": "openmpi",
    "setup_only": True,
    "time_limit": 30,
}
lbann.run(trainer, model, data_reader, opt, job_name=job_name, **kwargs)
which gives the batch script:
export IBV_FORK_SAFE=1
echo "Started at $(date)"
mpiexec -n 1 --map-by ppr:1:node -wdir /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/project/tutorials_lbann/gan/mnist/20231117_145903_gan_n1_ppn1 /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/lbann-latest/build_newompi3/install/bin/lbann --prototext=/lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/project/tutorials_lbann/gan/mnist/20231117_145903_gan_n1_ppn1/experiment.prototext
status=$?
echo "Finished at $(date)"
exit ${status}
I get the error below (I already added export IBV_FORK_SAFE=1 to the batch.sh script produced):
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'sqg2b16', but the
default subnet GID prefix was detected on more than one of these ports.
If these ports are connected to different physical IB networks, this
configuration will fail in Open MPI. This version of Open MPI requires
that every physically separate IB subnet that is used between connected
MPI processes must have different subnet ID values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A process has executed an operation involving a call to the "fork()"
system call to create a child process. Open MPI is currently operating
in a condition that could result in memory corruption or other system
errors; your job may hang, crash, or produce silent data corruption.
The use of fork() (or system() or other calls that create child
processes) is strongly discouraged.

The process that invoked fork was:

  Local host: [[6305,1],0] (PID 56764)

If you are *absolutely sure* that your application will successfully and
correctly survive a call to fork(), you may disable this warning by
setting the mpi_warn_on_fork MCA parameter to 0.
-------------------------------------------------------------------------- **************************************************************** Caught signal 11 (SIGSEGV - invalid memory reference) on rank 0 Stack trace: 0: lbann::stack_trace::get[abi:cxx11]() 1: lbann::exception::exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) 2: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/lbann-latest/build_newompi3/install/lib64/liblbann.so.0.104.0(+0xc470a71) [0x2ad53e4f4a71] (could not find stack frame symbol) 3: /usr/lib64/libpthread.so.0(+0xf5d0) [0x2ad58bdc35d0] (could not find stack frame symbol) 4: std::_Hashtable<std::string, std::string, std::allocator<std::string>, std::__detail::_Identity, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::clear() 5: google::protobuf::DescriptorPool::FindFileByName(std::string const&) const 6: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/python3.7/site-packages/google/protobuf/pyext/_message.cpython-37m-x86_64-linux-gnu.so(+0xb8e7a) [0x2ad6193a9e7a] (could not find stack frame symbol) 7: _PyMethodDef_RawFastCallKeywords (demangling failed) 8: _PyMethodDescr_FastCallKeywords (demangling failed) 9: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6dbb5) [0x2ad587abcbb5] (could not find stack frame symbol) 10: _PyEval_EvalFrameDefault (demangling failed) 11: _PyEval_EvalCodeWithName (demangling failed) 12: PyEval_EvalCodeEx (demangling failed) 13: PyEval_EvalCode (demangling failed) 14: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol) 15: _PyMethodDef_RawFastCallDict (demangling failed) 16: _PyCFunction_FastCallDict (demangling failed) 
17: _PyEval_EvalFrameDefault (demangling failed) 18: _PyEval_EvalCodeWithName (demangling failed) 19: _PyFunction_FastCallKeywords (demangling failed) 20: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 21: _PyEval_EvalFrameDefault (demangling failed) 22: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 23: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 24: _PyEval_EvalFrameDefault (demangling failed) 25: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 26: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 27: _PyEval_EvalFrameDefault (demangling failed) 28: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 29: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 30: _PyEval_EvalFrameDefault (demangling failed) 31: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 32: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol) 33: _PyObject_CallMethodIdObjArgs (demangling failed) 34: PyImport_ImportModuleLevelObject (demangling failed) 35: _PyEval_EvalFrameDefault (demangling failed) 36: 
_PyEval_EvalCodeWithName (demangling failed) 37: PyEval_EvalCodeEx (demangling failed) 38: PyEval_EvalCode (demangling failed) 39: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol) 40: _PyMethodDef_RawFastCallDict (demangling failed) 41: _PyCFunction_FastCallDict (demangling failed) 42: _PyEval_EvalFrameDefault (demangling failed) 43: _PyEval_EvalCodeWithName (demangling failed) 44: _PyFunction_FastCallKeywords (demangling failed) 45: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 46: _PyEval_EvalFrameDefault (demangling failed) 47: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 48: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 49: _PyEval_EvalFrameDefault (demangling failed) 50: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 51: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 52: _PyEval_EvalFrameDefault (demangling failed) 53: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 54: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 55: _PyEval_EvalFrameDefault (demangling failed) 56: 
/lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 57: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol) 58: _PyObject_CallMethodIdObjArgs (demangling failed) 59: PyImport_ImportModuleLevelObject (demangling failed) 60: _PyEval_EvalFrameDefault (demangling failed) 61: _PyEval_EvalCodeWithName (demangling failed) 62: PyEval_EvalCodeEx (demangling failed) 63: PyEval_EvalCode (demangling failed) 64: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol) 65: _PyMethodDef_RawFastCallDict (demangling failed) 66: _PyCFunction_FastCallDict (demangling failed) 67: _PyEval_EvalFrameDefault (demangling failed) 68: _PyEval_EvalCodeWithName (demangling failed) 69: _PyFunction_FastCallKeywords (demangling failed) 70: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 71: _PyEval_EvalFrameDefault (demangling failed) 72: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 73: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 74: _PyEval_EvalFrameDefault (demangling failed) 75: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 76: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 77: _PyEval_EvalFrameDefault 
(demangling failed) 78: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 79: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 80: _PyEval_EvalFrameDefault (demangling failed) 81: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 82: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol) 83: _PyObject_CallMethodIdObjArgs (demangling failed) 84: PyImport_ImportModuleLevelObject (demangling failed) 85: _PyEval_EvalFrameDefault (demangling failed) 86: _PyEval_EvalCodeWithName (demangling failed) 87: PyEval_EvalCodeEx (demangling failed) 88: PyEval_EvalCode (demangling failed) 89: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol) 90: _PyMethodDef_RawFastCallDict (demangling failed) 91: _PyCFunction_FastCallDict (demangling failed) 92: _PyEval_EvalFrameDefault (demangling failed) 93: _PyEval_EvalCodeWithName (demangling failed) 94: _PyFunction_FastCallKeywords (demangling failed) 95: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 96: _PyEval_EvalFrameDefault (demangling failed) 97: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 98: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 99: 
_PyEval_EvalFrameDefault (demangling failed) 100: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 101: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 102: _PyEval_EvalFrameDefault (demangling failed) 103: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 104: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 105: _PyEval_EvalFrameDefault (demangling failed) 106: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 107: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol) 108: _PyObject_CallMethodIdObjArgs (demangling failed) 109: PyImport_ImportModuleLevelObject (demangling failed) 110: _PyEval_EvalFrameDefault (demangling failed) 111: _PyEval_EvalCodeWithName (demangling failed) 112: PyEval_EvalCodeEx (demangling failed) 113: PyEval_EvalCode (demangling failed) 114: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol) 115: _PyMethodDef_RawFastCallDict (demangling failed) 116: _PyCFunction_FastCallDict (demangling failed) 117: _PyEval_EvalFrameDefault (demangling failed) 118: _PyEval_EvalCodeWithName (demangling failed) 119: _PyFunction_FastCallKeywords (demangling failed) 120: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] 
(could not find stack frame symbol) 121: _PyEval_EvalFrameDefault (demangling failed) 122: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 123: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 124: _PyEval_EvalFrameDefault (demangling failed) 125: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol) 126: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol) 127: _PyEval_EvalFrameDefault (demangling failed) **************************************************************** -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. --------------------------------------------------------------------------
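Most of the trace above consists of CPython interpreter frames, which makes the few informative frames hard to spot. As a reading aid, here is a minimal Python sketch that splits a flattened trace into numbered frames and drops the interpreter ones. The helper names `split_frames`, `is_interpreter_frame`, and `interesting_frames` are hypothetical, and the rule for what counts as an interpreter frame (a `libpython` path or a `Py`/`_Py` symbol prefix) is an assumption based on this particular trace, not a general rule.

```python
import re
from typing import List, Tuple

# A frame starts with "<number>: " at the beginning of the text or after whitespace.
FRAME_RE = re.compile(r"(?:^|\s)(\d+):\s")


def split_frames(trace: str) -> List[Tuple[int, str]]:
    """Split a flattened stack trace into (frame number, frame text) pairs."""
    matches = list(FRAME_RE.finditer(trace))
    frames = []
    for i, match in enumerate(matches):
        # A frame's text runs until the next frame marker or the end of the trace.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(trace)
        frames.append((int(match.group(1)), trace[match.end():end].strip()))
    return frames


def is_interpreter_frame(text: str) -> bool:
    """Heuristic (assumption): CPython frames live in libpython or start with Py/_Py."""
    return "libpython" in text or text.startswith(("_Py", "Py"))


def interesting_frames(trace: str) -> List[Tuple[int, str]]:
    """Return only the frames that do not look like CPython interpreter frames."""
    return [(n, text) for n, text in split_frames(trace) if not is_interpreter_frame(text)]
```

Applied to the trace in this report, a filter like this should leave mainly frames 0 to 6, where the crash passes through `google::protobuf::DescriptorPool::FindFileByName` and the Python protobuf extension module.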
FYI, I built LBANN with CMake (using Open MPI version 3.1.6), and I am using Python 3.7. Any help resolving this error would be greatly appreciated.
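In case it helps with diagnosis: the segfault occurs inside `google::protobuf::DescriptorPool::FindFileByName`, called from the Python protobuf C++ extension (`pyext/_message`) during a Python import. One possible explanation, which is only a guess and not confirmed here, is a mismatch between the protobuf library that LBANN was built against and the one used by the Python `protobuf` package. The sketch below only collects the relevant facts about the Python side so they can be attached to the report. The function name `protobuf_report` is hypothetical; `api_implementation.Type()` and the `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` environment variable come from the protobuf Python package.

```python
import importlib
import os


def protobuf_report() -> dict:
    """Collect facts about the Python protobuf package, if it is installed."""
    report = {
        "installed": False,
        "version": None,
        "backend": None,
        # If set, this variable selects the backend (for example "python" or "cpp").
        "env_override": os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"),
    }
    try:
        protobuf = importlib.import_module("google.protobuf")
    except ImportError:
        # protobuf is not importable in this environment; report that and stop.
        return report
    report["installed"] = True
    report["version"] = getattr(protobuf, "__version__", None)
    try:
        from google.protobuf.internal import api_implementation
        # Name of the backend currently in use by the Python package.
        report["backend"] = api_implementation.Type()
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for key, value in protobuf_report().items():
        print(f"{key}: {value}")
```

This should be run with the same Python environment that the `lbann` executable picks up (here, the `py37` conda environment), since a different interpreter could report a different protobuf installation.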