Please ask your question

The network structure is extremely simple:

model1 = ppsci.arch.MLP(**cfg.MODEL1)

Running it from the command line with

python -m paddle.distributed.launch --selected_gpus='0,1' --find_unused_parameters=True crack2d_unsteady.py

fails with the following error:

File "/home/hongguobin0094/.conda/envs/ppsci_py310/lib/python3.10/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 355, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
RuntimeError: (PreconditionNotMet) Error happened, when parameter[19][linear_9.b_0] has been ready before. Please set find_unused_parameters=True to traverse backward graph in each step to prepare reduce in advance. If you have set, there may be several reasons for this error: 1) In multiple reentrant backward phase, some parameters are reused. 2) Using model parameters outside of forward function. Please make sure that model parameters are not shared in concurrent forward-backward passes.
  [Hint: Expected has_marked_unused_vars_ == false, but received has_marked_unused_vars_:1 != false:0.] (at ../paddle/fluid/distributed/collective/reducer.cc:812)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I0115 15:47:05.968951 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct
I0115 15:47:05.969121 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct
I0115 15:47:05.969133 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct
I0115 15:47:06.083209 2441630 tcp_store.cc:290] receive shutdown event and so quit from MasterDaemon run loop
[2025-01-15 15:47:11,318] [ INFO] launch_utils.py:334 - terminate all the procs
[2025-01-15 15:47:11,319] [ ERROR] launch_utils.py:648 - ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
[2025-01-15 15:47:15,323] [ INFO] launch_utils.py:334 - terminate all the procs
[2025-01-15 15:47:15,324] [ WARNING] launch.py:443 - Terminating... exit
[2025-01-15 15:47:19,328] [ INFO] launch_utils.py:334 - terminate all the procs
Thanks a lot for filing this issue. I tried an MLP-based example that supports multi-GPU training locally, viv.py, with the command

python -m paddle.distributed.launch --gpus="0,1" viv.py

(see: https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/user_guide/#221), and it seems to run without problems...

So would you mind sharing your complete training script? You could upload it to AI Studio and send me the link, or create a private GitHub repository and add me as a member.
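One pattern worth checking while preparing that script (an assumption on my part, not confirmed from this issue): the "parameter ... has been ready before" precondition typically fires when the parameters of a DataParallel-wrapped model contribute to more than one backward pass in a single optimizer step. A hypothetical reproduction, reusing dp_model, opt, and tensors x1/x2 in the style of the sketch above:

# Hypothetical pattern that can trigger "has been ready before":
# two backward passes through the same wrapped model in one step,
# so its parameters are marked ready twice by the reducer.
out1 = dp_model(x1)
out1.mean().backward()  # parameters marked ready once

out2 = dp_model(x2)
out2.mean().backward()  # marked ready again -> PreconditionNotMet

opt.step()
opt.clear_grad()

One possible restructuring is to combine the losses and call backward once per step:

loss = dp_model(x1).mean() + dp_model(x2).mean()
loss.backward()
opt.step()
opt.clear_grad()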