SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training

Date:

This talk presents our experience in detecting and localizing network failures in large-scale containerized large model training services at Alibaba. The work is presented in the NetMon session of ACM SIGCOMM 2025.