XGBoost distributed with Spark uses only one forest? Why not use parallel forests? #11155

Closed
Anisalexvl opened this issue Jan 9, 2025 · 4 comments

Comments

@Anisalexvl

Anisalexvl commented Jan 9, 2025

I found out that using Spark XGBoost results in only a single learned forest, and I don't understand why. It seems we could train different boosted models on each partition and then average their predictions.

This seems misleading to me, because there doesn’t appear to be any real advantage to splitting a large dataset. The result would essentially be the same as if I just subsampled the entire dataset and didn’t use the Spark implementation at all.

    if context.partitionId() == 0:
        config = booster.save_config()
        yield pd.DataFrame({"data": [config]})
        booster_json = booster.save_raw("json").decode("utf-8")
        for offset in range(0, len(booster_json), _MODEL_CHUNK_SIZE):
            booster_chunk = booster_json[offset : offset + _MODEL_CHUNK_SIZE]
            yield pd.DataFrame({"data": [booster_chunk]})

@trivialfis
Member

trivialfis commented Jan 9, 2025

XGBoost is based on a collective (MPI) framework, and workers communicate with each other during training. In the end, every worker returns the same model, built from the data on all workers. By convention, the Spark interface uses the result from the first worker.
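The snippet quoted in the question looks like the collection step rather than the training step: because every worker ends up holding an identical model, only partition 0 needs to emit it, split into chunks. Below is a minimal sketch of that emit-and-reassemble pattern in plain Python. `CHUNK_SIZE`, `emit_model`, and the stand-in model string are hypothetical names for illustration; the real `_MODEL_CHUNK_SIZE` value and the driver-side reassembly code are not shown in this thread.

```python
import json

CHUNK_SIZE = 16  # hypothetical; stands in for _MODEL_CHUNK_SIZE


def emit_model(partition_id, model_json):
    """Yield the serialized model in chunks, but only from partition 0."""
    if partition_id != 0:
        return  # other partitions hold the same model, so they emit nothing
    for offset in range(0, len(model_json), CHUNK_SIZE):
        yield model_json[offset:offset + CHUNK_SIZE]


# Stand-in for the identical model every worker holds after training.
model = json.dumps({"trees": [{"split": 3, "leaf": [0.1, -0.2]}]})

# Collect the output of four simulated partitions, in partition order.
chunks = [chunk for pid in range(4) for chunk in emit_model(pid, model)]

# Joining the chunks in order recovers the full model.
reassembled = "".join(chunks)
```

Emitting from a single partition avoids sending the same model several times; it does not mean the model was trained on that partition's data alone.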

@Anisalexvl
Author

I can't work out which code is responsible for communication between Spark nodes during training. Could you please point me to this functionality?

@trivialfis
Member

The most important communication is allreduce with the gradient histogram. There are other synchronizations as well. Feel free to grep the codebase for allreduce calls.
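To illustrate why an allreduce on gradient histograms gives every worker the same view of the data, here is a toy simulation in plain Python. It is a sketch under stated assumptions, not XGBoost's implementation: `local_histogram` and `allreduce_sum` are hypothetical helpers, the "workers" are just list slices in one process, and only gradients are accumulated, whereas the real histograms carry gradient and hessian sums and the collective runs in native code.

```python
import random

N_BINS = 4  # hypothetical number of feature bins


def local_histogram(bins, grads):
    """Sum gradients per feature bin for one worker's shard."""
    hist = [0.0] * N_BINS
    for b, g in zip(bins, grads):
        hist[b] += g
    return hist


def allreduce_sum(histograms):
    """Simulated allreduce: every worker receives the element-wise sum."""
    total = [sum(column) for column in zip(*histograms)]
    return [list(total) for _ in histograms]


random.seed(0)
bins = [random.randrange(N_BINS) for _ in range(1000)]
grads = [random.uniform(-1.0, 1.0) for _ in range(1000)]

# Shard the rows across four simulated workers.
n_workers = 4
shards = [(bins[w::n_workers], grads[w::n_workers]) for w in range(n_workers)]

local = [local_histogram(b, g) for b, g in shards]
reduced = allreduce_sum(local)

# Histogram computed as if one machine held all the rows.
global_hist = local_histogram(bins, grads)
```

After the reduction, each worker's histogram matches the one computed over the full dataset (up to floating-point rounding), so every worker evaluates the same candidate splits and grows the same tree. That is the difference from training an independent model per partition and averaging predictions.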

@trivialfis
Member

Closing as the original question is solved.
