SREcon25 Americas

11:00am PDT

An SRE Approach to Monitoring ML in Production

Tuesday March 25, 2025 11:00am - 11:45am PDT

Grand Ballroom AB

Daria Barteneva, Microsoft Azure

Machine Learning (ML) is becoming a part of many aspects of SRE life. As SREs, we are (or soon will be) dealing with the challenge of serving ML models as part of a large distributed production system. Unfortunately, the domain expertise required to build ML doesn't overlap with the expertise required to run large distributed systems. The SRE community lacks standard practices and experiences that would allow us to operationalize ML and help answer a critical question: how exactly do we operate ML at scale reliably?

In this talk, we will explore the (lack of) overlap between the ML and SRE domains and discuss how we can help practitioners solve common challenges. Scoping this talk to ML observability, we will decompose a complex system into its primary components, helping engineers bridge the domain expertise gap in making ML systems more observable.

When our production system serves ML models, relying only on traditional observability practices is not enough. We will review the characteristics and requirements specific to serving ML in production and discuss mechanisms that will help us understand end-to-end system reliability and quality.

https://www.usenix.org/conference/srecon25americas/presentation/barteneva

Speakers

Daria Barteneva

Microsoft Azure

Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing...

Track 1

11:50am PDT

Transformers in SRE Land: Evolving to Manage AI Infrastructure

Tuesday March 25, 2025 11:50am - 12:35pm PDT

Grand Ballroom AB

Qian Ding, Ant Group

The rapid advancement of AI has fundamentally transformed the technological landscape. As AI models grow in complexity and scale, the challenges of managing the underlying infrastructure have intensified commensurately. This presentation explores the unique demands of AI infrastructure and how SREs can adapt to this evolving environment.

We'll delve into the specific challenges of managing GPU-accelerated clusters, including anomaly detection, node lifecycle management, and the distinctive requirements of AI workloads. By sharing real-world experiences and lessons learned, we aim to provide valuable insights into how SREs can effectively navigate this new frontier, ensuring the reliability, scalability, and performance of AI infrastructure.

https://www.usenix.org/conference/srecon25americas/presentation/ding

Speakers

Qian Ding

Ant Group

Qian is a staff engineer at Ant Group, specializing in site reliability engineering. He leads the infrastructure SRE team, applying SRE principles to manage AI infrastructure. His expertise spans heterogeneous cluster management, xPU maintenance, and leveraging observability to enhance...

Track 1

1:50pm PDT

Case Study: A Thundering Herd in the Wild

Tuesday March 25, 2025 1:50pm - 2:35pm PDT

Grand Ballroom AB

Nicolas Arroyo, Bloomberg LP

The 'thundering herd problem' occurs when multiple threads wait on the same event and are all woken up at the same time. If only one thread can handle the event, the others waste resources on no-op context switches. This problem has been largely resolved in modern kernels and through the use of notification APIs (e.g., epoll, kqueue, and/or IOCP).

We will present how we investigated and identified an unexpected variant of this problem. We will review our performance troubleshooting process, starting with aggregated sampling, followed by dynamic instrumentation and detailed sampling, and finally, kernel mode sampling. With every step, we will explain what information we gained to help us discover the problem: system calls buried inside commonly used libraries that use absolute timers, which caused threads to synchronize and led to a multitude of threads waking up at the same time.
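
As a rough illustration of why absolute timers can synchronize threads, the sketch below compares two sleep strategies in a small simulation. It is not the speakers' code: the function names, the one-second period, and the one-millisecond bucketing are all assumptions made for the example. Workers that sleep until the next absolute period boundary end up waking together even though they started at different times, while workers that sleep for a relative interval keep their original spread.

```python
import math
from collections import Counter


def absolute_wakeups(start_times, period, rounds):
    """Each worker sleeps until the next multiple of `period` (absolute deadline)."""
    wakeups = []
    for start in start_times:
        # First period boundary strictly after the worker's start time.
        first = (math.floor(start / period) + 1) * period
        wakeups.extend(first + i * period for i in range(rounds))
    return wakeups


def relative_wakeups(start_times, period, rounds):
    """Each worker sleeps `period` measured from its own start (relative timer)."""
    return [start + (i + 1) * period
            for start in start_times
            for i in range(rounds)]


def max_simultaneous(wakeups, resolution=0.001):
    """Largest number of wakeups that fall into the same `resolution`-wide bucket."""
    buckets = Counter(round(t / resolution) for t in wakeups)
    return max(buckets.values())


if __name__ == "__main__":
    # Hypothetical workload: 100 workers started 10 ms apart.
    starts = [i / 100 for i in range(100)]
    print("absolute:", max_simultaneous(absolute_wakeups(starts, 1.0, 3)))
    print("relative:", max_simultaneous(relative_wakeups(starts, 1.0, 3)))
```

In this toy setup every worker using the absolute deadline wakes in the same bucket, which is the herd; with relative timers no two workers share a bucket.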

https://www.usenix.org/conference/srecon25americas/presentation/arroyo

Speakers

Nicolas Arroyo

Bloomberg LP

Nicolas Arroyo is a seasoned developer with 20 years of experience across diverse domains, including machine learning, data science, security, performance, systems architecture, embedded systems, distributed systems, and networking. He is passionate about performance optimization...

Track 1

2:40pm PDT

Techniques Netflix Uses to Weather Significant Demand Shifts

Tuesday March 25, 2025 2:40pm - 3:25pm PDT

Grand Ballroom AB

Joseph Lynch, Netflix

Netflix runs a complex architecture supporting hundreds of different types of devices connecting from all over the world at all times. For various reasons at various times, load on these systems shifts significantly in pattern and magnitude, sometimes by multiple orders of magnitude in just a few minutes. When demand shifts, dozens of edge gateways, thousands of microservices, and tens of thousands of caches and databases have to weather the load shift while maintaining a high quality of service for our users.

In this talk, we will start by understanding how the four-region full-active architecture of Netflix's streaming control plane gives us the levers to shape and prioritize traffic. Techniques like balancing load (and, at key times, unbalancing it) or using partial or complete failover and shifting help us mitigate demand shifts.

Next, once load has entered one of our regions, we will see a combination of intelligent pre-scaling with automated service buffer management paired with reactive measures such as load shedding and rapid autoscaling to best bring available capacity supply to bear. For some types of demand shifts, we have to make hard tradeoffs between system stability and our ideal user experience, and choose to smartly degrade the service while maintaining the highest quality of experience we can. We will dive deep into these techniques with examples and tradeoffs.

Finally, we will touch on how the underlying data architecture makes all of this possible, and briefly what resilience techniques we use to keep our stateful systems available during load increases. For example, we will cover the use of data gateways with built-in resilience techniques, capacity planning, sharding, and thoughtful use of caching.

https://www.usenix.org/conference/srecon25americas/presentation/lynch

Speakers

Joseph Lynch

Netflix

Joseph Lynch is a Principal Software Engineer for Netflix who focuses on building highly reliable and high-leverage infrastructure across our stateless and stateful services. He led the shift of the Netflix data tier to abstraction, driving resilience through a Data Gateway architecture...

Track 1

3:55pm PDT

Using Statistical Techniques to Automatically Detect Game-Breaking Issues

Tuesday March 25, 2025 3:55pm - 4:15pm PDT

Grand Ballroom AB

Ian Neidel, Netflix

Content Delivery Network SREs are accustomed to metrics such as latency, bitrate, and dropped packets that measure how well we deliver content. However, as our team at Netflix expanded into ensuring good quality of experience for cloud gaming, a new challenge emerged: we must also be sure that what we deliver is itself not broken. That is, we need to be able to automatically detect broken gameplay sessions and game-breaking issues in a scalable way.

With a growing number of sessions and reams of logs per day, we turn to statistics and machine learning techniques to solve these otherwise difficult tasks at scale. In this talk, we will cover the variety of metrics we use to infer brokenness, explain accessible methods to vectorize and cluster exception messages, and provide some insight into the statistics we use to find broken sessions, identify game-breaking issues, and infer their impact with confidence.
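
The abstract does not say which vectorization or clustering methods the talk covers. As one accessible possibility, the sketch below masks numbers and hex identifiers, builds bag-of-words vectors, and groups messages greedily by cosine similarity. The function names, the `<num>` placeholder, the 0.8 threshold, and the sample messages are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def vectorize(message):
    """Bag-of-words vector with numbers and hex ids masked as <num>."""
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "<num>", message.lower())
    return Counter(re.findall(r"[a-z_<>]+", normalized))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(messages, threshold=0.8):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_messages)
    for msg in messages:
        vec = vectorize(msg)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(msg)
                break
        else:
            clusters.append((vec, [msg]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Hypothetical exception messages, not taken from the talk.
    sample = [
        "NullReferenceException in PlayerController at frame 1042",
        "Timeout waiting for asset bundle 0x1f3a",
        "NullReferenceException in PlayerController at frame 77",
        "Timeout waiting for asset bundle 0x99",
    ]
    for group in cluster(sample):
        print(len(group), group[0])
```

Masking variable parts before vectorizing is the design choice that matters most here: without it, messages that differ only in a frame number or handle would look less alike than they are.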

https://www.usenix.org/conference/srecon25americas/presentation/neidel

Speakers

Ian Neidel

Netflix

Ian Neidel is an SRE for Open Connect, Netflix’s in-house CDN. He works on Quality of Experience for Cloud Games, improving resiliency and realtime observability for Live Streaming, and automatic diagnosis and remediation of issues across Netflix’s distributed fleet of servers...

Track 1

4:20pm PDT

Mapping a Better Future with STPA

Tuesday March 25, 2025 4:20pm - 4:40pm PDT

Grand Ballroom AB

Theo Klein, Google

Want to prevent outages before they happen? Traditional SRE methods focus on component failures, but a whole class of outages stem from unexpected system interactions. We found a solution.

In our team, we use Systems Theoretic Process Analysis (STPA) to identify and fix system-level vulnerabilities before they cause outages. By applying STPA during the design phase, we've prevented major incidents and saved countless engineering hours.

This talk will show you how STPA can transform your approach to reliability. We'll share a real-world example where STPA caught critical design flaws that traditional methods missed, saving us months of costly rework.

Don't wait for outages to happen. Learn how STPA can help you build more resilient systems and become a 1000x engineer.

https://www.usenix.org/conference/srecon25americas/presentation/klein

Speakers

Theo Klein

Google

Theo Klein is a Senior Site Reliability Engineer working on Google Maps. Over the past year, he has led an effort to improve the safety and reliability of road disruptions data on Google Maps. Previously, he led efforts to remove unneeded dependencies on critical systems, which...

Track 1

4:45pm PDT

Is the S in SRE for “Security”?

Tuesday March 25, 2025 4:45pm - 5:30pm PDT

Grand Ballroom AB

John Benninghoff, Security Differently

There is significant overlap between Cybersecurity and SRE; understanding and leveraging that can improve the performance of both. Lessons from safety science tell us that security and SRE come through being successful more often, not failing less. Research in DevOps, Software Security, and elsewhere shows a strong link between different types of organizational performance, including development, operations, SRE, and security; in many cases, organizations most effectively reduce cybersecurity risk by improving general technology performance.

Many SRE capabilities overlap with Security, including the critical activities of patching & managing attack surface, along with observability, incident response, postmortems, testing, and platform engineering. SRE and Security teams can collaborate by supporting their mutual goals, sharing their perspectives dealing with incidents both frequent and rare, and by setting Security Level Objectives to inform decisions on when to divert resources to security as SRE teams do with Service Level Objectives.

https://www.usenix.org/conference/srecon25americas/presentation/benninghoff

Speakers

John Benninghoff

Security Differently

John Benninghoff is a long-time student and practitioner of managing information risk. His 25-year career in Cybersecurity and SRE includes diverse experience in financial services, retail, government, and health care. He founded Security Differently to advise organizations on how...

Track 1

11:00am PDT

Maturing Your Data Architecture in a Week: How Bluesky Survived

Wednesday March 26, 2025 11:00am - 11:45am PDT

Grand Ballroom AB

Jaz Volpert, Bluesky PBC

In November of 2024, Bluesky saw a sudden surge in activity, adding one million new users per day several days in a row, with daily active users increasing by 1,200% in a week. Through this exponential growth, Bluesky's backend team of ~6 engineers kept the site online and continued to onboard new users, despite all of our core infrastructure running on our own physical infrastructure. In this talk, I'll walk you through the 11 days of hell (16+ hours a day) in which we rapidly matured our data architecture to support over 1M hourly active users producing 1,600+ events/sec.

https://www.usenix.org/conference/srecon25americas/presentation/volpert

Speakers

Jaz Volpert

Bluesky PBC

Jaz is the Backend Go developer at Bluesky responsible for scalable data systems and physical infrastructure. From a global index of billions of records, to graph databases, to video platforms, Jaz has built a wide variety of large-scale systems used by tens of millions of users around...

Track 1

11:50am PDT

Inclusive SRE: Best Practices for Working with a Visually Impaired Incident Analyst or Responder

Wednesday March 26, 2025 11:50am - 12:35pm PDT

Grand Ballroom AB

Randall (Randy) Horwitz, IBM CIO

Fortunately, society is becoming more inclusive, enabling all of us to learn to work with people with differing abilities, like those who are visually impaired. We all want to be more inclusive, but how do we best collaborate with a visually impaired incident analyst or responder? What kinds of challenges do they have? How can they collaborate if they can’t see our dashboards?

Resolving difficult incidents always requires leveraging different perspectives, and people who think/hear/see differently can provide a game-changing perspective.

Please join Randy Horwitz, a visually impaired Senior Technical Staff Member at IBM CIO and former incident responder, for a 35-minute presentation demonstrating how to bridge these gaps. Screen reader demos will be provided.

https://www.usenix.org/conference/srecon25americas/presentation/horwitz

Speakers

Randall (Randy) Horwitz

IBM

Randall (Randy) Horwitz currently works as a Senior Technical Staff Member for the CIO Technology Platforms Transformation I&T Operations organization. Since 2016, when he worked as the support manager for the IBM Developer Experience, Mr. Horwitz has been passionate about development...

Track 1

1:50pm PDT

Optimizing Machine Learning Training Infrastructure: A Governance Approach

Wednesday March 26, 2025 1:50pm - 2:35pm PDT

Grand Ballroom AB

Anamaya Sullerey and Brian Hansen, Meta

We share how we have transformed the way Monetization at Meta approaches machine learning training infrastructure management to unleash Efficiency and unlock Innovation. As AI model sizes and deployment footprints continue to explode, inefficient resource allocation and utilization are no longer just a nuisance – they're a major roadblock to innovation.

We'll dive into the cutting-edge strategies and real-world examples of how to use governance to:

Drive ROI: Accurately measure and attribute the cost of ML training to focus on high-ROI investments.

Unlock hidden capacity: Maximize your existing resources and reduce waste.

Accelerate time-to-market: Streamline your ML development process and get to production faster.

Through a case study of a successful ML training workload governance system, we'll explore the complexities of attributing costs in ML training to projects and share hard-won lessons from bridging the gap between research and production.

https://www.usenix.org/conference/srecon25americas/presentation/sullerey

Speakers

Anamaya Sullerey

Meta

Anamaya Sullerey is a technical leader in the AdsML Production Engineering team, focused on capacity, efficiency, and reliability in the ML production environment. He has over two decades of broad experience across ML, software, compute and network systems, and silicon. Anamaya holds...

Brian Hansen

Meta

Brian leads the AdsML Production Engineering teams for Meta, focused on scaling machine learning in production environments. He has been a successful serial entrepreneur for two decades, taking multiple start-ups from early to late stage growth. Throughout his career Brian has been...

Track 1

2:40pm PDT

Beyond Sequential: A Recipe for Async Pipeline Observability and Alerting

Wednesday March 26, 2025 2:40pm - 3:25pm PDT

Grand Ballroom AB

Jash Mistry and Gabriela Medvetska, eBay Inc

Navigating the complexities of microservices observability requires more than just traditional monitoring, especially for asynchronous systems. This session provides a comprehensive “recipe” for cooking up Service Level Objectives (SLOs) for asynchronous pipelines. Learn how to identify critical metrics, instrument your app using Prometheus, design meaningful dashboards, and define actionable alerts. Whether you're a junior site reliability sous-chef or a seasoned ops chef, you'll leave with a practical cookbook of strategies to enhance your async system's observability and monitor customer experience.

https://www.usenix.org/conference/srecon25americas/presentation/mistry

Speakers

Jash Mistry

eBay Inc

Jash Mistry is a Senior Software Engineer at eBay. As a member of the Site Reliability Engineering team, he played a crucial role in the evolution of monitoring, expanding on absolute error counts and average latencies to develop a highly reliable SLO-driven observability platform...

Gabriela Medvetska

eBay Inc

Gabriela Medvetska is a Software Engineer at eBay. As a member of the Site Reliability Engineering team, she worked on a variety of projects ranging from developing UIs for internal observability tooling to implementing machine learning algorithms to improve site resiliency during...

Track 1

3:55pm PDT

Chaos Experiments - Datacenter Stress Testing

Wednesday March 26, 2025 3:55pm - 4:40pm PDT

Grand Ballroom AB

Clayton Krueger, USAA

In this session, we’ll explore how a financial services provider has developed a comprehensive, automated chaos engineering program, supported by strong leadership. While chaos testing is commonly done with individual applications, we’ve elevated the practice by applying it to an entire data center. This journey didn’t happen overnight, and we’ll take you through the key stages of our progress. We’ll discuss the major challenges we faced, specifically around fear, uncertainty, and doubt. Attendees will gain insights into the tools and strategies we used to overcome obstacles and the lessons learned along the way. Additionally, we’ll share our plans for future efforts and how we aim to further enhance the robustness of our infrastructure. This session is perfect for anyone looking to deepen their understanding of large-scale chaos engineering in a complex environment.

https://www.usenix.org/conference/srecon25americas/presentation/krueger

Speakers

Clayton Krueger

USAA

Clayton Krueger is a trailblazing leader and founding member of the SRE team at USAA, where he has played a pivotal role in shaping the company’s infrastructure resiliency strategy. Clayton has been instrumental in designing and implementing USAA’s core metrics collection and...

Track 1

4:45pm PDT

Measuring Availability the Player Focused Way: How Riot Games Changed Its Availability Culture

Wednesday March 26, 2025 4:45pm - 5:30pm PDT

Grand Ballroom AB

Maxfield Stewart, Riot Games

Riot Games started its journey to building out an SRE culture in 2020. The number one problem we had to solve first was establishing a unified language across all teams and games about what availability was. In other words, we had to define "uptime". This talk will walk through how we developed our availability measurements through simple modifications to our incident management process, and how we aligned leadership and engineers on being held accountable to availability using our most popular core value, Player Focus.

https://www.usenix.org/conference/srecon25americas/presentation/stewart

Speakers

Maxfield Stewart

Riot Games

Maxfield Stewart has been shipping software and supporting production environments for over 25 years. His experience ranges from private consulting for Fortune 500 companies like Goldman Sachs and Sprint to over a decade and a half in the game industry. For the last 12 years Max has been helping...

Track 1

9:00am PDT

Stopping Performance Regression via Changepoint Detection

Thursday March 27, 2025 9:00am - 9:20am PDT

Grand Ballroom AB

Joseph Cirella and Shanthini Velan, Bloomberg

Bloomberg's Ticker Plant infrastructure provides real-time market data to almost all internal and external clients; any increase in latency impacts much of the company's real-time products. This talk discusses how statistical changepoint detection is used to identify when our complex system's performance characteristics have significantly changed. We will discuss the challenges of deploying this, such as dealing with "expected" changepoints like market open/close and downtime, relaying the change information to engineers in an effective manner, and establishing a feedback loop.
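
The abstract does not name a specific changepoint algorithm. As a minimal sketch of the general idea, the code below finds the single split point that best explains a shift in mean by minimizing the squared error of a two-segment model. The function name `best_split`, the `min_size` parameter, and the sample latencies are assumptions for illustration, not Bloomberg's implementation.

```python
from statistics import fmean


def sum_squared_error(values):
    """Squared deviation of `values` from their own mean."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(series, min_size=5):
    """Return (index, gain) for the split that most reduces squared error.

    `index` is the first position of the second segment, or None when no
    split improves on treating the series as a single segment.
    """
    total = sum_squared_error(series)
    best_index, best_gain = None, 0.0
    for i in range(min_size, len(series) - min_size + 1):
        gain = total - (sum_squared_error(series[:i]) + sum_squared_error(series[i:]))
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


if __name__ == "__main__":
    # Hypothetical latencies (ms): a step from about 10 to about 15 at index 20.
    latencies = ([10 + (i % 3) * 0.1 for i in range(20)]
                 + [15 + (i % 3) * 0.1 for i in range(20)])
    index, gain = best_split(latencies)
    print(index, round(gain, 1))
```

A caller would compare `gain` against a threshold before alerting. The "expected" changepoints the abstract mentions, such as market open and close, would need to be suppressed separately, for example by ignoring splits that fall in known time windows.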

https://www.usenix.org/conference/srecon25americas/presentation/cirella

Speakers

Joseph Cirella

Bloomberg

Joseph Cirella is a Software Engineer at Bloomberg, where he works within the Ticker Plant SRE Capacity & Performance team.

Shanthini Velan

Bloomberg

Shanthini Velan leads the Ticker Plant SRE Capacity & Performance team at Bloomberg.

Track 1

9:25am PDT

Per Aspera ad Productum: Turning Processes into Products

Thursday March 27, 2025 9:25am - 9:45am PDT

Grand Ballroom AB

Yuri Bernstein, Medallia

How to increase your team throughput by applying product management principles to SRE tools.

https://www.usenix.org/conference/srecon25americas/presentation/bernstein

Speakers

Yuri Bernstein

Medallia

Yuri is now working as Senior Staff SRE at Medallia. He brings 16 years of industry experience, building and managing global teams. Yuri’s current focus is SRE organization workflow efficiency and scalability. Yuri is a zealot of keeping things simple, yet powerful applying multidisciplinary...

Track 1

9:50am PDT

Incident Management Metrics That Matter

Thursday March 27, 2025 9:50am - 10:35am PDT

Grand Ballroom AB

Laura de Vesine and Jamie Luck, Datadog Inc

Businesses run on metrics. They use them to judge success, identify areas for investment, and reward employees. Unfortunately, naive metrics can do more harm than good, especially in the context of low-frequency events like incidents. Management teams often reach for MTTR (mean time to recovery) or raw incident counts to judge the success of reliability and resilience programs, but these metrics generate spurious insights and perverse incentives. As SREs we can't simply tell the business not to measure them -- we need to offer alternatives. This talk explores a starting list of things to measure instead (and how to build your own list), as well as a framework to educate less technical people on what the actual value proposition of incident management is.

https://www.usenix.org/conference/srecon25americas/presentation/de-vesine

Speakers

Laura de Vesine

Datadog Inc

Laura de Vesine is a 20+ year software industry veteran. She has spent the last 9 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture, with a recent expansion into security. Laura is currently a...

Jamie Luck

Datadog Inc

Jamie is a Senior SRE working in Incident Management at Datadog. Ever since they broke their first laptop and learned about this free operating system called Linux, it was all over. They have been working in the resilience and reliability space for ten years, operating everything...

Track 1

11:05am PDT

Systems Thinking with Poisoned Systems

Thursday March 27, 2025 11:05am - 11:50am PDT

Grand Ballroom AB

Hazel Weakly, Nivenly Foundation; Sandeep Kanabar, Gen

AI is often said to be a "garbage in, garbage out" solution. So what happens when you take a carefully tuned system and try to operate it with AI?

Chaos! Bedlam! Or maybe... not?

AI assistance has some studied drawbacks: data poisoning, bias, inaccessibility, de-skilling, and more. We could very well end up in a world that is run by inaccessible and inscrutable black box AI systems. But! The situation isn't hopeless!

AI seems to be here to stay, but the drawbacks don't have to be. Join Hazel and Sandeep as we take you on a journey through our personal experiences with biased and broken systems, how we've worked around them, and strategies we have for addressing these issues as well as preventing future ones. Together, we'll discover how to transform AI into a transparent and reliable tool that helps enable innovation rather than chaos.

https://www.usenix.org/conference/srecon25americas/presentation/weakly

Speakers

Hazel Weakly

Nivenly Foundation

Hazel spends her days working on building out teams of humans as well as the infrastructure, systems, automation, and tooling to make life better for others. She’s worked at a variety of companies, across a wide range of tech, and knows that the hardest problems to solve are the...

Sandeep Kanabar

Gen

Hailing from India, Sandeep is a passionate software engineer working at Gen (formerly NortonLifeLock). A frequent meetup speaker, Sandeep enjoys sharing his lessons learned from 15+ years in the tech space with the community. He's a staunch advocate for diversity and inclusion and...

Track 1

11:55am PDT

No Time to Do It All! Approaching Overload on DevOps Teams

Thursday March 27, 2025 11:55am - 12:40pm PDT

Grand Ballroom AB

Alex Wise

There's always more work to be done. Alex will take a look at signs of overload in your organization, how to identify them, and strategies for managing it. He'll cover concepts including Overload in Joint Cognitive Systems, WIP Spirals, the Utilization Trap, and how they can be applied to your organization.

https://www.usenix.org/conference/srecon25americas/presentation/wise

Speakers

Alex Wise

Alex is a site reliability engineer who loves safety-critical systems and attacking problems that attack back. He is best known for his work with the Software Freedom School helping those new to tech understand how to use and why to choose open source software. He worked as a software...

Track 1

1:55pm PDT

One Million Builds per Year, Only One Page - Operating Internal Services Without Heroics

Thursday March 27, 2025 1:55pm - 2:15pm PDT

Grand Ballroom AB

Cail Young, Octopus Deploy

A nuts-and-bolts examination of how a small team at Octopus Deploy was able to deliver a set of internal services that enabled in excess of 1 million builds in a calendar year - with only one out-of-hours page in that time! We'll cover the technical and social aspects of what was involved, and discuss some of the downsides of having what appears to be a stable system.

https://www.usenix.org/conference/srecon25americas/presentation/young

Speakers

Cail Young

Octopus Deploy

Cail has spent the last couple of decades working at the intersection of people and technology: in the performing arts, in the motion picture industry, and now in the field of software operations. He is fascinated by learning from incidents - large and small - and will gladly trade...

Track 1

2:20pm PDT

Going Multi Cloud in a Hurry with Quality and Style

Thursday March 27, 2025 2:20pm - 2:40pm PDT

Grand Ballroom AB

Geoff Oakham, ecobee

How would you extend a Kubernetes-based platform to support a second cloud provider? What if no one on your team knew the second platform well? Join Geoff as he talks about the soft skills and techniques he tried while delivering the product on time, meeting compliance standards, and training up his co-workers.

https://www.usenix.org/conference/srecon25americas/presentation/oakham

Speakers

Geoff Oakham

ecobee

Geoff is a Staff SRE at ecobee. In his spare time he builds fun things his wife and 8yo find on social, and fixes up a century home. He was recently given a 3D printer and discovered he has enough spare time to spend fixing that too!

Track 1

2:45pm PDT

Mitigating Against Large Scale Systemic Failures in E-Trading

Thursday March 27, 2025 2:45pm - 3:05pm PDT

Grand Ballroom AB

Chris Hawley, Morgan Stanley

Electronic trading systems are inherently complex and operate within narrow, high-stakes time windows, making their availability critical. Despite employing various resiliency patterns, these systems remain vulnerable to tail risks that could lead to widespread failures with significant consequences.

This presentation will explore real-world examples to uncover the nature of these risks, examine the limitations of common resiliency strategies, and discuss alternative approaches to enhance system robustness and reliability.

https://www.usenix.org/conference/srecon25americas/presentation/hawley

Speakers

Chris Hawley

Morgan Stanley

Chris Hawley is an Executive Director at Morgan Stanley in Institutional Securities Technology. He is a technical lead in the Listed Sales & Trading department. Chris is a product owner within Site Reliability Engineering for the firm's global order management and electronic trading...

Track 1

3:10pm PDT

Hijacking Service Discovery to Simulate Dependency Degradation

Thursday March 27, 2025 3:10pm - 3:30pm PDT

Grand Ballroom AB

Abdulrahman Alhamali, Shopify

Services have dependencies, and dependencies degrade: they can slow down, limit bandwidth, or go entirely offline. Services should have mechanisms to deal with that: circuit breaking, bulkheading, and graceful degradation are some of the mechanisms developers might want to implement. But how can they confirm that these mechanisms work without waiting for an incident to happen? Simulation!
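
For readers unfamiliar with circuit breaking, one of the mechanisms named above, here is a minimal sketch of the pattern. It is a generic illustration, not Shopify's implementation: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are assumptions chosen so the example can be tested without real waiting.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry once the cool-down expires."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()   # open: skip the dependency entirely
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # success closes the circuit fully
        return result
```

A breaker like this is exactly the kind of mechanism that is hard to verify without a degraded dependency, since the open and half-open paths only run when calls actually fail.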

There are a few solutions for simulating dependency degradation, but a majority of them require traffic to be forwarded through a proxy. In this talk, we present a few ways to streamline this traffic forwarding, by hijacking service discovery.

https://www.usenix.org/conference/srecon25americas/presentation/alhamali

Speakers

Abdulrahman Alhamali

Shopify

Abdulrahman (Abed) has been a staff site reliability engineer at Shopify for three years. During this time, he has worked on a variety of resiliency solutions for the core product, and created innovative resiliency testing tools. He has also championed scale testing, resiliency education...

Track 1