About me
Qian is a staff engineer at Ant Group specializing in site reliability engineering (SRE). He leads the infrastructure SRE team, applying SRE principles to manage AI infrastructure. His expertise spans heterogeneous cluster management, xPU maintenance, and the use of observability to improve the team's ability to diagnose model training and inference issues. Drawing on his experience in infrastructure management, Qian is currently exploring how the skill set required of SRE professionals is evolving in the era of large language models. His goal is to adapt and grow in this rapidly changing landscape so that SRE practices remain at the forefront of AI infrastructure management.