This repository has been archived by the owner on Apr 8, 2024. It is now read-only.

Communicate all lightgbm distributed metrics into node 0 for surfacing as aggregate #185

Draft
wants to merge 10 commits into
base: main
from


Contributor

@jfomhover jfomhover commented Dec 2, 2021

Problem

When using distributed LightGBM, each node reports its own validation metrics separately, so there is no single aggregate view of them on node 0.

Proposed Design

Discarded alternative

An initial design (https://github.com/microsoft/lightgbm-benchmark/blob/c45eacfc284aff286c46ee9349455466ca07810f/src/common/lightgbm_utils.py) called MPI naively inside each callback call, synchronously with the callback. This seems to interfere with LightGBM's internal MPI initialization, and it caused the exception below on node 0:

  File "train.py", line 265, in run
    booster = lightgbm.train(
  File "/azureml-envs/lightgbm/lib/python3.8/site-packages/lightgbm/engine.py", line 293, in train
    booster.update(fobj=fobj)
  File "/azureml-envs/lightgbm/lib/python3.8/site-packages/lightgbm/basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "/azureml-envs/lightgbm/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Please initialize the network interface first

That's why we ended up using threading instead.
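
To illustrate the idea (this is a minimal sketch, not the code in this PR): the callback only enqueues metrics, and a background thread forwards them toward node 0. The `MetricsRelay` class, its method names, and the `send` callable are all hypothetical; in practice `send` would wrap whatever transport is used to reach node 0, which is stubbed here with a plain list.

```python
import queue
import threading


class MetricsRelay:
    """Hypothetical helper: buffers metrics and forwards them on a worker thread."""

    _STOP = object()  # sentinel telling the worker thread to exit

    def __init__(self, send):
        # `send` is an assumption: any callable that ships one record to node 0.
        self._send = send
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)

    def start(self):
        self._thread.start()

    def report(self, iteration, name, value):
        # Meant to be called from the training callback: it only enqueues,
        # so the callback never performs communication itself.
        self._queue.put((iteration, name, value))

    def _drain(self):
        # Runs on the worker thread, forwarding records in FIFO order.
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._send(item)

    def close(self):
        # Flush everything queued so far, then stop the worker.
        self._queue.put(self._STOP)
        self._thread.join()


# Usage with a stub transport (a list standing in for node 0).
received = []
relay = MetricsRelay(send=received.append)
relay.start()
for i in range(3):
    relay.report(i, "valid_0.rmse", 1.0 / (i + 1))
relay.close()
```

The point of the design is that the training loop and the communication are decoupled: the callback returns immediately, and records are delivered in order once `close()` has joined the worker.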

@jfomhover jfomhover temporarily deployed to mlops December 2, 2021 00:07 Inactive
@jfomhover jfomhover temporarily deployed to mlops December 2, 2021 00:28 Inactive

github-actions bot commented Dec 2, 2021

Unit Test Results for Build

  1 files  ±0    1 suites  ±0   8s ⏱️ -1s
76 tests ±0  76 ✔️ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit ce90af5. ± Comparison against base commit 7b8a339.

@jfomhover jfomhover added enhancement New feature or request training-benchmark labels Dec 2, 2021
@jfomhover jfomhover added this to the Standardization milestone milestone Dec 2, 2021
@jfomhover jfomhover added the hold This PR/Issue should be put on hold. label Dec 3, 2021