Gal Shelach

Team leader of a production team in the Infrastructure group - Taboola

Tel Aviv, Israel

I am a team leader of a production team in the Infrastructure group.

I boast an MSc in Electrical Engineering, though I haven't quite dabbled in electrical engineering.
Toting an MBA as well, but you won't find me managing businesses.
I'm a certified financial planner too - and yes, I do dip my toes into that realm occasionally.

Beyond all that, inefficiency is my arch-nemesis, and I've been on a decade-long crusade to make systems run smoother.

And, just for the record, I have a soft spot for dad jokes.

Area of Expertise

Information & Communications Technology

Topics

Software Development
Infrastructure as Code
DevOps Skills

Sneaky Peak to The Secrets of Kafka Assignment Strategy

Something strange happened while I worked with Kafka.

While I was adding a new Kafka consumer to one of our services, all of the existing consumers in that service stopped consuming. As a team leader on a production team in Taboola's Infrastructure group, I'm supposed to remove bottlenecks, not create them.

This talk will describe how I investigated the issue, explain what I discovered, and share my insights into the whole situation.

Taboola’s recommendations appear on tens of thousands of web pages and mobile apps every second. As users engage with the content, multiple events are fired to signal that recommendations are rendered, opened, clicked, and so on. Each event triggers one or more Kafka messages, which translates into a lot of Kafka messages for every recommendation.

We add new types of events to our infrastructure all the time. We usually just add the new topic and the relevant consumers, test everything locally and with different CI procedures, and then test it on production servers running existing consumers.

This time, something strange happened when we added a consumer for a new event type.

When we added the new consumers on one server in the server pool, other consumers on that server stopped consuming from all other topics. The new consumer had a different group ID and consumed from a new topic, so this shouldn’t have happened. We were surprised that it affected other groups and other topics.

I will explain what we did and how we solved the issue.

At the end of the day we learned that:

1. There’s a relationship between the consumers on the same service and the assignment of a topic's partitions, regardless of the group ID or topic.
2. Each server increments consumer IDs unless the order is explicitly overridden.
3. All consumers on a service affect the lexicographical order of the consumers on that service.
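
To make points 2 and 3 concrete, here is a minimal Python sketch of a range-style assignment that sorts consumer IDs as strings. It is a simplified model, not Kafka's actual code: the function `range_assign`, the IDs `consumer-9`, `consumer-10`, and `consumer-12`, and the three-partition topic are all hypothetical, chosen only to show how a shifted counter can reorder consumers.

```python
def range_assign(member_ids, num_partitions):
    """Simplified range-style assignment (illustrative model only).

    Members are sorted lexicographically (as strings), then each one
    receives a contiguous range of partitions in that order.
    """
    members = sorted(member_ids)  # string sort: "consumer-12" < "consumer-9"
    base, extra = divmod(num_partitions, len(members))
    assignment = {}
    start = 0
    for index, member in enumerate(members):
        # The first `extra` members get one additional partition.
        count = base + (1 if index < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment


# Hypothetical group with two consumers and a three-partition topic.
before = range_assign(["consumer-9", "consumer-12"], 3)

# Assumption: an unrelated consumer added earlier in the same process
# pushes the auto-incremented ID from "consumer-9" to "consumer-10".
after = range_assign(["consumer-10", "consumer-12"], 3)

print(before)
print(after)
```

In this model, `consumer-12` sorts first before the change and second after it, so its share of partitions changes even though nothing about its own group or topic was touched.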

No query too heavy - We've developed a method for retrieving data that's ready for you to try

Data has become the new gold, powering many industries and even making them dependent on it. Tech companies develop products that collect and process information, using it to make smart decisions. On the other end, users expect real-time insights into how they interact with these products. They want all the relevant data in one place, instantly—without the clutter.

This creates several technical challenges for development teams, but I want to focus on one key issue: queries. When dealing with small amounts of data and simple filters, things are straightforward. But when you're working with hundreds of terabytes of data and need complex slice & dice capabilities with just a few minutes of delay, things get much trickier.

For example, advertisers using Taboola’s platform expect a single dashboard showing all their campaign data—both operational details (like name, creation date, and status) and performance metrics (such as clicks, conversions, and costs). Over time, as data volumes grew and queries became more complex, handling them efficiently became a major challenge—one that every data-driven company faces.

In this session, I'll share the solution we developed to handle large-scale queries quickly and efficiently.

How we drastically improved our throughput by rewriting our load balancer

Taboola’s recommendation engine gets over 800,000 requests per second, all handled within a strict sub-second SLA across thousands of servers in many data centers around the world. As such, the effectiveness of our load-balancing strategy had a big impact on both our latency and our hardware utilization. Over the years we went through many iterations to make our load balancing as effective as possible, using different products (HAProxy, Linkerd, OpenResty, NGINX Plus) and different load-balancing strategies (weighted round-robin, least_connection, least_time), and yet we felt there was more to be done. In this talk we'll present our approach and how we completely rewrote our load-balancing solution to drastically reduce our p99 latency and improve our utilization. We did this by making the load balancer aware of the hardware type of each server and the quality of the responses each server returned, and by taking cache locality into account in the load-balancing algorithm.
