Speaker

Gal Shelach

Team leader of a production team in the Infrastructure group - Taboola

Tel Aviv, Israel

I am a team leader of a production team in the Infrastructure group. As such, I take part in and lead tasks that improve Taboola’s performance and stability.
I work on the core pillars of our infrastructure to support our ever-growing scale.

The fun part of my job is to look at the big picture of our system and track down suspicious bottlenecks. I do so using both in-house and open source monitoring and profiling tools.
The culprits can be Taboola’s own services or even commercial and open source third-party services.
After catching them, I do whatever is necessary to make the services great again.

When I am not in the office, you can find me surfing, sailing or camping in Israel’s wildlands ☺

Area of Expertise

  • Information & Communications Technology

Topics

  • Software Development
  • Infrastructure as Code
  • DevOps Skills

Break the building but keep the tenants happy

We have thousands of frontend servers in seven data centers serving more than 500k HTTP requests per second. Nearly 300k of those requests relate to notifications about the user's actions, clicks, visibility, etc.

In the past, a single type of service handled these requests along with other types of events. This design put us at risk of losing events (money) whenever there was an issue with any logic running on these services.

In this session, I will describe the long and complicated process of separating the event-handling flow into a dedicated pipeline. The results: a better SLA, the ability to recover lost data when production issues arise, a fully isolated event-handling pipeline, and a better developer experience.

The most important part is that we managed to do all this while handling events at all times, without any breaking changes for our developers, and without any downtime.

SLA is for lawyers, SLO is where the money hides

We have thousands of frontend servers in seven data centers serving over 500k HTTP requests per second. They are all expected to respond as quickly as possible to meet our SLA.

Having said that, not breaking the SLA is one thing, but how to define the SLO is another. Let's say our SLA requires a p99 response time below 1000ms. This leaves us a wide range in which we can set the SLO.

It may seem logical to set the SLO as low as possible. This way, we are less likely to break our SLA. But what if I tell our customer that I can return a response in 400ms, or I can return a response in 800ms that will boost their revenue?
Should we then define a different SLO? Maybe we should accept the risk of breaking the SLA from time to time in exchange for higher revenue most of the time?

In my lecture, I’ll describe three systems we developed that use our resources dynamically to achieve an RPM-oriented SLO. While processing requests, we evaluate the value of each feature and determine whether we have the time and resources to use it for revenue generation.
These are Java infrastructure components we use internally to provide the most valuable responses to our customers within the limits of our Service Level Agreement.
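
As a rough illustration of the idea of weighing each feature's value against the time left in the request, here is a minimal Java sketch. It is not Taboola's implementation: the class `FeatureBudget`, the `Feature` fields, the feature names, and the greedy value-per-millisecond rule are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pick optional features that fit the remaining latency budget.
final class FeatureBudget {

    // An optional feature with an assumed latency cost and an assumed revenue gain.
    static final class Feature {
        final String name;
        final long expectedCostMs;
        final double expectedValue;

        Feature(String name, long expectedCostMs, double expectedValue) {
            this.name = name;
            this.expectedCostMs = expectedCostMs;
            this.expectedValue = expectedValue;
        }

        // Value gained per millisecond spent; cost is clamped to 1 ms to avoid dividing by zero.
        double valuePerMs() {
            return expectedValue / Math.max(1L, expectedCostMs);
        }
    }

    // Greedy choice: best value per millisecond first, skipping anything that no longer fits.
    static List<Feature> select(List<Feature> candidates, long remainingBudgetMs) {
        List<Feature> sorted = new ArrayList<>(candidates);
        Comparator<Feature> byValuePerMs = Comparator.comparingDouble(Feature::valuePerMs);
        sorted.sort(byValuePerMs.reversed());

        List<Feature> chosen = new ArrayList<>();
        long left = remainingBudgetMs;
        for (Feature f : sorted) {
            if (f.expectedCostMs <= left) {
                chosen.add(f);
                left -= f.expectedCostMs;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        long slaMs = 1000;     // the p99 bound used as an example in the abstract
        long elapsedMs = 200;  // assumed time already spent on this request
        List<Feature> candidates = List.of(
                new Feature("extraRanking", 300, 3.0),
                new Feature("personalization", 500, 10.0),
                new Feature("enrichment", 400, 2.0));
        for (Feature f : select(candidates, slaMs - elapsedMs)) {
            System.out.println(f.name);
        }
    }
}
```

A greedy pass like this is cheap enough to run per request, but it does not always find the best combination of features; a real system would also need live estimates of cost and value rather than fixed numbers.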

Hey, developer, DIY all the way to production

The Taboola philosophy is that developers should be independent and take ownership of their features from end to end. Taboola’s development teams don't even have QA engineers, so each developer is solely responsible for delivering a feature.

Taboola has more than 350 developers creating over 40 new releases a day. The transition from QA to production means exposing a new feature to 1.4B monthly unique users and up to 500k HTTP requests per second, which is scary.

As the team accountable for both production stability and the development experience, we aim to provide Taboola's R&D developers with the tools and methodologies that help them achieve that. I will describe the technologies we use, as well as the principles and culture behind them.

In this talk, I will explain the steps every Taboolar needs to take from designing a new feature to fully implementing it. We enable developers to develop quickly and independently all the way to production by using tools like special canary tests and smart canary deployment on hundreds of servers worldwide.

With the help of the tools we developed and the methodologies we use, I myself am becoming a better developer. I am more creative and more responsible, I take greater risks, and most importantly, I enjoy my life. I will be happy to convince everyone in the audience to work as we do.
