Anamaya Sullerey and Brian Hansen, Meta
We share how we have transformed the way Monetization at Meta approaches machine learning training infrastructure management to unleash Efficiency and unlock Innovation. As AI model sizes and deployment footprints continue to explode, inefficient resource allocation and utilization are no longer just a nuisance – they're a major roadblock to innovation.
We'll dive into the cutting-edge strategies and real-world examples of how to use governance to:
- Drive ROI: Accurately measure and attribute the cost of ML training to focus on high ROI investments.
- Unlock hidden capacity: Maximize your existing resources and reduce waste
- Accelerate time-to-market: Streamline your ML development process and get to production faster
Through a case study of a successful ML training workload governance system, we'll explore the complexities of attributing costs in ML training to projects and share hard-won lessons from bridging the gap between research and production.
https://www.usenix.org/conference/srecon25americas/presentation/sullerey