I found out that using Spark XGBoost results in only a single learned forest, and I don't understand why. It seems we could train different boosted models on each partition and then average their predictions.
This seems misleading to me, because there doesn’t appear to be any real advantage to splitting a large dataset. The result would essentially be the same as if I just subsampled the entire dataset and didn’t use the Spark implementation at all.
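For concreteness, here is a minimal sketch of the per-partition "train and average" idea described above. This is not how `xgboost.spark` actually behaves; the DataFrame `df`, its `label` column, and the training parameters are placeholders for illustration.

```python
# Hypothetical "train one booster per partition, then average" approach.
# NOT what xgboost.spark does -- shown only to make the question concrete.
import numpy as np
import pandas as pd
import xgboost as xgb

def train_local_model(rows):
    """Train an independent booster on a single partition's rows."""
    pdf = pd.DataFrame([r.asDict() for r in rows])
    if pdf.empty:
        return
    dtrain = xgb.DMatrix(pdf.drop(columns=["label"]), label=pdf["label"])
    booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=50)
    yield bytes(booster.save_raw(raw_format="json"))  # serialize so Spark can collect it

# Each partition of the (assumed) Spark DataFrame `df` yields its own booster.
local_models = df.rdd.mapPartitions(train_local_model).collect()

def predict_averaged(features: pd.DataFrame) -> np.ndarray:
    """Average the predictions of the independent per-partition boosters."""
    preds = []
    for raw in local_models:
        booster = xgb.Booster()
        booster.load_model(bytearray(raw))
        preds.append(booster.predict(xgb.DMatrix(features)))
    return np.mean(preds, axis=0)
```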
XGBoost's distributed training is built on a collective-communication (MPI-style) framework, and workers communicate with each other during training. In the end, each worker holds the same model, built from the data across all workers; the Spark interface returns the result from the first worker by convention.
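For reference, a minimal usage sketch of the PySpark estimator, assuming xgboost >= 1.7 and an active SparkSession; the column names, `num_workers` value, and the `train_df`/`test_df` DataFrames are placeholders. The point is that `fit()` returns a single model even though training is spread across workers.

```python
# Minimal sketch of distributed training with the xgboost.spark estimator.
from xgboost.spark import SparkXGBRegressor

regressor = SparkXGBRegressor(
    features_col="features",   # vector column assembled upstream (placeholder name)
    label_col="label",         # placeholder label column
    num_workers=4,             # training data is partitioned across 4 workers
)

# The 4 workers synchronize during training, so fit() produces one model
# trained on all of the data, not 4 independent models.
model = regressor.fit(train_df)
predictions = model.transform(test_df)
```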
The most important communication is the allreduce over the gradient histograms. There are other synchronizations as well; feel free to grep the codebase for allreduce calls.
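As a conceptual illustration only (plain NumPy, not XGBoost's internal code), this sketch shows why the histogram allreduce makes every worker grow the same tree: each worker builds gradient/hessian histograms from its own rows, and an elementwise sum gives all workers the identical global histogram from which splits are chosen.

```python
# Conceptual sketch of a gradient-histogram allreduce; illustration only.
import numpy as np

NUM_BINS = 8

def local_histogram(grad, hess, bin_idx):
    """Accumulate per-bin gradient/hessian sums for one worker's rows."""
    hist = np.zeros((NUM_BINS, 2))
    np.add.at(hist[:, 0], bin_idx, grad)
    np.add.at(hist[:, 1], bin_idx, hess)
    return hist

# Pretend these are three workers, each holding a disjoint shard of the data.
rng = np.random.default_rng(0)
shards = [
    (rng.normal(size=100), np.ones(100), rng.integers(0, NUM_BINS, 100))
    for _ in range(3)
]
local_hists = [local_histogram(g, h, b) for g, h, b in shards]

# The "allreduce": elementwise sum of all local histograms. Afterwards every
# worker holds the same global histogram and therefore evaluates the same splits.
global_hist = np.sum(local_hists, axis=0)
print(global_hist)
```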
Referenced code: xgboost/python-package/xgboost/spark/core.py, lines 1161 to 1168 at commit 4500941.