![Daniel Gur](https://sessionize.com/image/866c-400o400o2-9giACHDZ6HhrDfY3bqg4pM.jpg)
Daniel Gur
SRE team lead at Outbrain
Tel Aviv, Israel
With over 20 years of experience in IT, Linux, and DevOps, I now lead the Observability team at Outbrain, dealing with Outbrain's huge scaling challenges.
Seeing the Unseen: The power of network observability
When we started the "Network Observability" project at Outbrain, we had one internal customer in mind: the security team and its desire to identify network anomalies.
Soon after the initial design, however, we realized that by using the power of open-source eBPF tools (with some custom code to enrich the collected data) across our 8K-server on-prem data centers and public cloud environments, we would be able to monitor almost anything within our infrastructure, easily and cheaply: from understanding expensive data flows (such as egress from the clouds) to finding out who is "hammering" our on-prem Ceph S3 solution.
We could certainly have done this in other ways as well, but we realized that we had "one solution to fit (almost) all", without the need to write multiple custom solutions.
In this lecture I'll present the technology (mostly CNCF-based), how it works in a Kubernetes-based (but not only) ecosystem, which data is collected, and some of our day-to-day use cases.
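To illustrate the kind of question enriched flow data can answer, here is a minimal Python sketch of finding the heaviest egress senders from flow records such as an eBPF collector might emit. All record fields, pod names, and the function itself are hypothetical illustrations, not Outbrain's actual schema or tooling:

```python
from collections import defaultdict

# Hypothetical enriched flow records: an eBPF exporter's raw 5-tuple flows,
# augmented with workload metadata (pod name, traffic direction).
flows = [
    {"src_pod": "recommender-1", "dst": "s3.amazonaws.com", "direction": "egress",   "bytes": 5_000_000},
    {"src_pod": "recommender-1", "dst": "ceph-s3.internal", "direction": "internal", "bytes": 9_000_000},
    {"src_pod": "ingest-7",      "dst": "s3.amazonaws.com", "direction": "egress",   "bytes": 2_000_000},
]

def top_egress_talkers(flows, n=5):
    """Sum egress bytes per source pod and return the heaviest senders."""
    totals = defaultdict(int)
    for f in flows:
        if f["direction"] == "egress":
            totals[f["src_pod"]] += f["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_egress_talkers(flows))
# [('recommender-1', 5000000), ('ingest-7', 2000000)]
```

The same aggregation, keyed by destination instead of source, would answer the "who is hammering Ceph S3" question from the abstract.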
Don't be afraid to drop data!
As engineers, we are usually "trained" not to lose any data because "we might need it in the future", but is that true in all cases?
At Outbrain, as part of our data democratization methodology, developers are free to send (almost) any kind of log or message they want or need.
Therefore, we had to impose mechanisms that can drop data based on various dimensions of the sent message (and on its sending rate and volume).
In this session I will present the idea behind the system and how it works.
Full session recording: https://youtu.be/Pw2LX1uUSlw?si=cnQJEfJ2fEV96GLb
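The idea of dropping by message dimensions and rate can be sketched as a per-dimension token bucket. This is a generic illustration of the technique, not Outbrain's actual system; the class and all names are hypothetical:

```python
class DimensionRateLimiter:
    """Token-bucket limiter keyed by message dimensions (e.g. a tuple of
    service name and log level). Messages exceeding the allowed rate for
    their dimension are dropped rather than queued."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # tokens refilled per second, per key
        self.burst = burst         # maximum bucket size, per key
        self.state = {}            # key -> (tokens, last_timestamp)

    def allow(self, key, now: float) -> bool:
        tokens, last = self.state.get(key, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0          # consume a token; otherwise drop the message
        self.state[key] = (tokens, now)
        return allowed

limiter = DimensionRateLimiter(rate_per_sec=1.0, burst=2.0)
noisy = ("checkout-service", "DEBUG")
print([limiter.allow(noisy, now=0.0) for _ in range(3)])
# [True, True, False] -- the third burst message is dropped
print(limiter.allow(noisy, now=1.0))
# True -- one token was refilled after a second
```

Because the bucket is keyed per dimension, one noisy service at DEBUG level gets throttled without affecting anyone else's logs.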
Prometheus That Scales
Running Prometheus is (fairly) easy.
Running Prometheus gets complicated when you need to scrape more than 250 million time series per minute.
Keeping them for a one-year retention term with Thanos is even more complicated. In this lecture I will focus on our journey from a bunch of physical servers running Prometheus to a full-scale, Kubernetes-operated deployment of Prometheus, Thanos, and additional components, with some tips and tricks that allowed us to reduce costs.
Full session recording: https://youtu.be/snkHA5hCT6c?si=I9Q21wErHG0wPhKM
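One common technique for scaling out scraping at this volume, possibly among those the talk covers, is sharding targets across Prometheus replicas with `hashmod` relabeling. Below is a minimal Python sketch of the assignment logic; it is an illustration of the idea, not Prometheus's exact implementation, and the function name is made up:

```python
import hashlib

def shard_for(target: str, num_shards: int) -> int:
    """Deterministically map a scrape target to one of num_shards
    Prometheus replicas, mimicking the idea behind the hashmod relabel
    action: hash the target address, then take it modulo the shard count."""
    digest = hashlib.md5(target.encode()).digest()
    return int.from_bytes(digest[8:], "big") % num_shards

# Each replica keeps only the targets whose shard number matches its own,
# so scrape load is split without any coordination between replicas.
targets = [f"node-{i}:9100" for i in range(6)]
assignment = {t: shard_for(t, num_shards=3) for t in targets}
print(assignment)
```

Because the mapping is a pure function of the target address, every replica computes the same assignment independently from the shared service-discovery output.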
hayaData 2024 Sessionize Event