SLA is for lawyers, SLO is where the money hides

We have thousands of frontend servers in 7 data centers serving over 500k HTTP requests per second. They all expect to answer as quickly as possible to meet our SLA

Having said that, not breaking the SLA is one thing, but how to define the SLO is another. Let's say our SLA has a response time of p99 < 1000ms. This gives us a wide range where we can determine the SLO.

It may seem logical to set the SLO as low as possible. This way, we are less likely to break our SLA. But what if I tell our customer that I can return him a response on 400ms or I can return him a response on 800ms that will boost his revenue?
Should we then define a different SLO? Maybe we should embrace the risk of breaking SLA from time to time but to have bigger revenue most of the time?

In my lecture I’ll describe three systems we developed to utilize our system dynamically to gain an RPM-oriented SLO. While processing requests, we evaluate the value of each feature and determine if we have the time and resources to utilize it for revenue generation.
Those are Java infrastructures we use internally to provide the most valuable responses to our customers within the limits of our Service Level Agreement.

Gal Shelach

Team leader of a production team in the Infrastructure group - Taboola

Tel Aviv, Israel

View Speaker Profile