Speaker

Gal Shelach

Team leader of a production team in the Infrastructure group - Taboola

Tel Aviv, Israel

I am a team leader of a production team in the Infrastructure group.

I boast an MSc in Electrical Engineering, though I haven't quite dabbled in eclectic engineering.
Toting an MBA as well, but you won't find me managing businesses.
I'm a certified financial planner too - and yes, I do dip my toes into that realm occasionally.

Beyond all that, inefficiency is my arch-nemesis, and I've been on a decade-long crusade to make systems run smoother.

And, just for the record, I have a soft spot for dad jokes.

Area of Expertise

  • Information & Communications Technology

Topics

  • Software Development
  • Infrastructure as Code
  • DevOps Skills

Break the building but keep the tenants happy

We have thousands of frontend servers in seven data centers serving more than 500k HTTP requests per second. Nearly 300k of those requests relate to notifications about the user's actions, clicks, visibility, etc.

In the past, a single type of service handled these requests along with other types of events. This design put us at risk of losing events (and therefore money) whenever there was an issue with any logic running on these services.

In this session, I will describe our complicated and lengthy process for separating the event handling flow into a dedicated pipeline. The result is a better SLA, the ability to recover lost data when production issues arise, a fully isolated event handling pipeline, and a better developer experience.

The most important part is that we managed to do all this while handling events at all times, without any breaking changes for our developers, and without any downtime.

SLA is for lawyers, SLO is where the money hides

We have thousands of frontend servers in seven data centers serving over 500k HTTP requests per second. They are all expected to respond as quickly as possible to meet our SLA.

Having said that, not breaking the SLA is one thing, but how to define the SLO is another. Let's say our SLA requires a p99 response time below 1000 ms. This gives us a wide range within which we can set the SLO.

It may seem logical to set the SLO as low as possible. This way, we are less likely to break our SLA. But what if I tell our customer that I can return a response in 400 ms, or I can return a response in 800 ms that will boost their revenue?
Should we then define a different SLO? Maybe we should accept the risk of breaking the SLA from time to time in exchange for higher revenue most of the time?

In my lecture I’ll describe three systems we developed that use our resources dynamically to achieve an RPM-oriented SLO. While processing requests, we evaluate the value of each feature and determine whether we have the time and resources to use it for revenue generation.
These are Java infrastructures we use internally to provide the most valuable responses to our customers within the limits of our Service Level Agreement.
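
As a rough illustration of the idea only (not Taboola's actual implementation), the per-request decision could be sketched as a greedy choice under a latency budget. Every name and number here is hypothetical: `Feature`, `expectedValue`, `estimatedCostMs`, and the 1000 ms limit simply echo the example SLA above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BudgetedFeatureSelector {

    /** Hypothetical description of an optional feature; all fields are illustrative. */
    static final class Feature {
        final String name;
        final double expectedValue;  // assumed revenue estimate, arbitrary units
        final long estimatedCostMs;  // assumed latency cost in milliseconds

        Feature(String name, double expectedValue, long estimatedCostMs) {
            this.name = name;
            this.expectedValue = expectedValue;
            this.estimatedCostMs = estimatedCostMs;
        }
    }

    /** Greedily picks features by value per millisecond while they fit the budget. */
    static List<String> select(List<Feature> candidates, long budgetMs) {
        List<Feature> sorted = new ArrayList<>(candidates);
        // Highest value-per-millisecond first; guard against a zero cost.
        sorted.sort(Comparator.comparingDouble(
                (Feature f) -> f.expectedValue / Math.max(1, f.estimatedCostMs)).reversed());

        List<String> chosen = new ArrayList<>();
        long remaining = budgetMs;
        for (Feature f : sorted) {
            if (f.estimatedCostMs <= remaining) {
                chosen.add(f.name);
                remaining -= f.estimatedCostMs;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Feature> candidates = List.of(
                new Feature("personalization", 5.0, 300),
                new Feature("extra-ranking-pass", 3.0, 400),
                new Feature("cheap-tweak", 1.0, 50));

        long slaMs = 1000;     // example SLA limit from the text
        long elapsedMs = 500;  // assumed time already spent on this request
        System.out.println(select(candidates, slaMs - elapsedMs));
        // prints [cheap-tweak, personalization]
    }
}
```

A greedy ratio rule is just one simple way to trade latency for value; a real system would also need measured costs and value estimates, which this sketch takes as given.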

Hey, developer, DIY all the way to production

The Taboola philosophy is that developers should be independent and take ownership of their features from end to end. Taboola’s development teams don't even have QA engineers, so each developer is solely responsible for delivering a feature.

Taboola has more than 350 developers creating over 40 new releases a day. The transition from QA to production means exposing a new feature to 1.4B monthly unique users and up to 500K HTTP requests/sec, which is scary.

We, as the team accountable for both production stability and development experience, aim to provide Taboola's R&D developers with tools and methodologies that will help them achieve that. I will describe the technologies we use, as well as the principles and culture behind them.

In this talk, I will explain the steps every Taboolar needs to take from designing a new feature to fully implementing it. We enable developers to develop quickly and independently all the way to production by using tools like special canary tests and smart canary deployment on hundreds of servers worldwide.

With the help of the tools we developed and the methodologies we use, I myself am becoming a better developer. I am becoming more creative and more responsible, taking greater risks, and, most importantly, enjoying my life. I will be happy to convince everyone in the audience to work as we do.
