I found out that using Spark XGBoost results in only a single learned forest, and I don't understand why. It seems we could train different boosted models on each partition and then average their predictions.
This seems misleading to me, because there doesn’t appear to be any real advantage to splitting a large dataset. The result would essentially be the same as if I just subsampled the entire dataset and didn’t use the Spark implementation at all.
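For concreteness, here is a minimal sketch of the per-partition "train and average" idea described above. This is not how `xgboost.spark` actually behaves; the DataFrame `df`, its `label` column, and the training parameters are placeholders for illustration.

```python
# Hypothetical "train one booster per partition, then average" approach.
# NOT what xgboost.spark does -- shown only to make the question concrete.
import numpy as np
import pandas as pd
import xgboost as xgb

def train_local_model(rows):
    """Train an independent booster on a single partition's rows."""
    pdf = pd.DataFrame([r.asDict() for r in rows])
    if pdf.empty:
        return
    dtrain = xgb.DMatrix(pdf.drop(columns=["label"]), label=pdf["label"])
    booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=50)
    yield bytes(booster.save_raw(raw_format="json"))  # serialize so Spark can collect it

# Each partition of the (assumed) Spark DataFrame `df` yields its own booster.
local_models = df.rdd.mapPartitions(train_local_model).collect()

def predict_averaged(features: pd.DataFrame) -> np.ndarray:
    """Average the predictions of the independent per-partition boosters."""
    preds = []
    for raw in local_models:
        booster = xgb.Booster()
        booster.load_model(bytearray(raw))
        preds.append(booster.predict(xgb.DMatrix(features)))
    return np.mean(preds, axis=0)
```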
XGBoost's distributed training is built on a collective-communication (MPI-style) framework, and workers communicate with each other during training. In the end, each worker holds the same model, built from the data across all workers; the Spark interface returns the result from the first worker by convention.
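For reference, a minimal usage sketch of the PySpark estimator, assuming xgboost >= 1.7 and an active SparkSession; the column names, `num_workers` value, and the `train_df`/`test_df` DataFrames are placeholders. The point is that `fit()` returns a single model even though training is spread across workers.

```python
# Minimal sketch of distributed training with the xgboost.spark estimator.
from xgboost.spark import SparkXGBRegressor

regressor = SparkXGBRegressor(
    features_col="features",   # vector column assembled upstream (placeholder name)
    label_col="label",         # placeholder label column
    num_workers=4,             # training data is partitioned across 4 workers
)

# The 4 workers synchronize during training, so fit() produces one model
# trained on all of the data, not 4 independent models.
model = regressor.fit(train_df)
predictions = model.transform(test_df)
```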
The most important communication is the allreduce over the gradient histograms. There are other synchronizations as well; feel free to grep the codebase for allreduce calls.
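As a conceptual illustration only (plain NumPy, not XGBoost's internal code), this sketch shows why the histogram allreduce makes every worker grow the same tree: each worker builds gradient/hessian histograms from its own rows, and an elementwise sum gives all workers the identical global histogram from which splits are chosen.

```python
# Conceptual sketch of a gradient-histogram allreduce; illustration only.
import numpy as np

NUM_BINS = 8

def local_histogram(grad, hess, bin_idx):
    """Accumulate per-bin gradient/hessian sums for one worker's rows."""
    hist = np.zeros((NUM_BINS, 2))
    np.add.at(hist[:, 0], bin_idx, grad)
    np.add.at(hist[:, 1], bin_idx, hess)
    return hist

# Pretend these are three workers, each holding a disjoint shard of the data.
rng = np.random.default_rng(0)
shards = [
    (rng.normal(size=100), np.ones(100), rng.integers(0, NUM_BINS, 100))
    for _ in range(3)
]
local_hists = [local_histogram(g, h, b) for g, h, b in shards]

# The "allreduce": elementwise sum of all local histograms. Afterwards every
# worker holds the same global histogram and therefore evaluates the same splits.
global_hist = np.sum(local_hists, axis=0)
print(global_hist)
```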
Referenced code: xgboost/python-package/xgboost/spark/core.py, lines 1161 to 1168 at commit 4500941.