System Optimization to Accelerate Distributed Model Training

Distributed model training in Shifu/Guagua has proven to be 5-10x faster than Spark MLlib, especially in a shared multi-tenant Hadoop cluster. The key system-level optimizations behind this speedup are listed below.

Netty-Based RPC for Message Passing
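
As a rough illustration of what Netty-based RPC for message passing can look like, the sketch below sets up a master-side endpoint that receives serialized gradient arrays from workers and replies on the same channel. This is a minimal sketch under stated assumptions: the `GradientServer` class name, the `double[]` payload, and port 8080 are illustrative, not Guagua's actual RPC classes or wire format.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.serialization.ClassResolvers;
import io.netty.handler.codec.serialization.ObjectDecoder;
import io.netty.handler.codec.serialization.ObjectEncoder;

// Hypothetical master-side endpoint: receives worker gradients, replies over the same channel.
public class GradientServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     ch.pipeline().addLast(
                         new ObjectDecoder(ClassResolvers.cacheDisabled(null)),
                         new ObjectEncoder(),
                         new SimpleChannelInboundHandler<double[]>() {
                             @Override
                             protected void channelRead0(ChannelHandlerContext ctx, double[] gradients) {
                                 // Aggregate worker gradients here, then push updated weights back.
                                 ctx.writeAndFlush(gradients);
                             }
                         });
                 }
             });
            ChannelFuture f = b.bind(8080).sync();
            f.channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```

Keeping workers and the master connected through long-lived channels like this avoids launching a new MapReduce job per iteration, which is where much of the per-iteration overhead in naive Hadoop-based training comes from.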

Fault Tolerance
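
A common way to make iterative training fault tolerant on Hadoop is to checkpoint the model state to HDFS after each iteration so a restarted task resumes from the last completed iteration instead of recomputing from scratch. The sketch below illustrates that idea with the standard Hadoop `FileSystem` API; the `IterationCheckpoint` class and its file layout are hypothetical and not Guagua's actual checkpoint format.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical per-iteration checkpoint of model weights on HDFS.
public class IterationCheckpoint {

    private final FileSystem fs;
    private final Path dir;

    public IterationCheckpoint(Configuration conf, String dirName) throws IOException {
        this.fs = FileSystem.get(conf);
        this.dir = new Path(dirName);
    }

    /** Persist the weights produced by one completed iteration. */
    public void save(int iteration, double[] weights) throws IOException {
        Path file = new Path(dir, "iteration-" + iteration);
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeInt(weights.length);
            for (double w : weights) {
                out.writeDouble(w);
            }
        }
    }

    /** Reload the weights of a given iteration after a restart. */
    public double[] load(int iteration) throws IOException {
        Path file = new Path(dir, "iteration-" + iteration);
        try (FSDataInputStream in = fs.open(file)) {
            double[] weights = new double[in.readInt()];
            for (int i = 0; i < weights.length; i++) {
                weights[i] = in.readDouble();
            }
            return weights;
        }
    }
}
```

With such a checkpoint in place, a failed or preempted task only repeats work since the last saved iteration, which matters in a busy multi-tenant cluster where restarts are routine.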

Straggler Mitigation
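
In a shared multi-tenant cluster some workers inevitably run slower than others. One common mitigation is to let the master finish an iteration once a configurable fraction of workers has reported, or a timeout expires, rather than blocking on the slowest worker. The sketch below shows such a partial barrier; the `PartialBarrier` class and its parameters are illustrative assumptions, not Guagua's actual synchronization code.

```java
// Hypothetical partial barrier: the master proceeds once enough workers report
// or the iteration timeout expires, instead of waiting for every straggler.
public class PartialBarrier {

    private final int totalWorkers;
    private final double requiredRatio;
    private final long timeoutMs;
    private int reported = 0;

    public PartialBarrier(int totalWorkers, double requiredRatio, long timeoutMs) {
        this.totalWorkers = totalWorkers;
        this.requiredRatio = requiredRatio;
        this.timeoutMs = timeoutMs;
    }

    /** Called whenever a worker's result for this iteration arrives. */
    public synchronized void workerReported() {
        reported++;
        notifyAll();
    }

    /** Blocks until enough workers have reported or the timeout expires. */
    public synchronized boolean await() throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        int required = (int) Math.ceil(totalWorkers * requiredRatio);
        while (reported < required) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                break; // timed out: proceed with whatever results have arrived
            }
            wait(remaining);
        }
        return reported >= required;
    }
}
```

For example, a barrier constructed as `new PartialBarrier(100, 0.9, 60_000)` would let the master move on once 90 of 100 workers report or one minute has passed, trading a small amount of gradient information for much lower tail latency per iteration.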
