Session
How to set SLOs, drive improvements, and make friends with business stakeholders
Make reliability a shared priority, not just tech-speak. This session shows you how to frame SLOs in business terms, engage stakeholders, and use clear metrics to align technical and business priorities.
The outline of the talk:
- Engineers care about reliability, and we have developed a language to talk about it
- We count the nines (9s) of availability, define error budgets, measure MTTR and MTBF
- Why, then, is it so hard to convince our business counterparts to invest in technical improvements?
- Because our language sounds intimidating and disconnected from reality.
- We fail to explain the actual value of reliability
- The key question of reliability:
- How much does it cost when your service is down for one hour?
- Don't ask how many nines of availability a service should have; ask how much cost is acceptable.
- SLO formula
- X must be true Y percentage of the time
- X is your definition of success
- Y is your threshold
- Two levels at which you can measure success:
- Technical level: A service is running, DB is working, API returns a 200 status code.
- Business level: The business process is working. 99.9% of transfers are successful, 99% of reports are generated within 30 seconds, etc.
- Aim to define SLO on the business level.
- From measuring to prioritization
- Benefits of measuring SLO on the business level:
- You know the costs of outages
- You know the cost of bad architecture
- You know the cost of slow processes
- Your data points are facts from the past
- Business plans and new features are guesswork about the future
- It's easier to talk about priorities when your numbers are solid.
- You're using the same units to compare tech improvements and features.
- Use error budgets to drive improvement
- Review how your systems perform against SLO.
- If your SLO is 99.9%, you allow yourself to fail in 0.1% of cases. This is your error budget.
- What do you do when you exceed the budget?
- Code freeze
- Prioritize immediate improvements to recover reliability.
- Conduct postmortems
- "Do better next time" is not a strategy.
- Make reliability a first-class citizen.
- Report SLOs together with business metrics.
- Remind your stakeholders that availability is your most important feature.
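
The SLO formula and the error budget from the outline can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function name `slo_report`, its parameters, and the example figures (one million transfers, 1,500 failures, a cost of 2.0 per failed transfer) are all assumptions chosen for the example.

```python
def slo_report(target_pct: float, total: int, failed: int, cost_per_failure: float) -> dict:
    """Compare observed results against an SLO of the form
    'X must be true Y percent of the time'.

    target_pct       -- Y, the threshold, e.g. 99.9
    total            -- number of business events observed (e.g. transfers)
    failed           -- number of events where X (success) was not true
    cost_per_failure -- hypothetical business cost of one failed event
    """
    if total <= 0:
        raise ValueError("total must be positive")

    # Error budget: the share of events that is allowed to fail.
    allowed_failures = total * (100.0 - target_pct) / 100.0
    achieved_pct = 100.0 * (total - failed) / total

    if allowed_failures > 0:
        budget_consumed = failed / allowed_failures
    else:
        # A 100% target leaves no budget at all.
        budget_consumed = float("inf") if failed else 0.0

    return {
        "achieved_pct": achieved_pct,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,   # 1.0 means the budget is fully spent
        "budget_exceeded": failed > allowed_failures,
        "failure_cost": failed * cost_per_failure,
    }


# Illustrative numbers only: a 99.9% SLO on one million transfers.
report = slo_report(target_pct=99.9, total=1_000_000, failed=1_500, cost_per_failure=2.0)
for key, value in report.items():
    print(f"{key}: {value}")
```

Expressing the result as both a budget ratio and a cost puts reliability in the same units as feature work, which is the comparison the outline argues for.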

Maxim Schepelin
Engineering leader at Booking.com
Amsterdam, The Netherlands