![Daniel Gur](https://sessionize.com/image/866c-400o400o2-9giACHDZ6HhrDfY3bqg4pM.jpg)
Daniel Gur
SRE team lead at Outbrain
Tel Aviv, Israel
With over 20 years of experience in IT, Linux, and DevOps, I now lead the Observability team at Outbrain, dealing with Outbrain's huge scaling challenges.
Seeing the Unseen: The power of network observability
When we started the "Network Observability" project at Outbrain, we had one internal customer in mind: the security team and its desire to identify network anomalies.
Soon after the initial design, however, we realized that by using the power of open-source eBPF tools (with some custom code to enrich the collected data) across our 8K-server on-prem data centers and public cloud environments, we would be able to monitor almost anything within our infrastructure, easily and cheaply: from understanding expensive data flows (such as egress from the clouds) to finding out who is "hammering" our on-prem Ceph S3 solution.
We could certainly have done this in other ways as well, but we realized that we had "one solution to fit (almost) all", without the need to write multiple custom solutions.
In this lecture I'll present the technology (mostly CNCF-based), how it works in a Kubernetes-based (but not only) ecosystem, which data is collected, and some of our day-to-day use cases.
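To illustrate the kind of question enriched flow data can answer, here is a minimal Python sketch of finding the heaviest egress senders from flow records such as an eBPF collector might emit. All record fields, pod names, and the function itself are hypothetical illustrations, not Outbrain's actual schema or tooling:

```python
from collections import defaultdict

# Hypothetical enriched flow records: an eBPF exporter's raw 5-tuple flows,
# augmented with workload metadata (pod name, traffic direction).
flows = [
    {"src_pod": "recommender-1", "dst": "s3.amazonaws.com", "direction": "egress",   "bytes": 5_000_000},
    {"src_pod": "recommender-1", "dst": "ceph-s3.internal", "direction": "internal", "bytes": 9_000_000},
    {"src_pod": "ingest-7",      "dst": "s3.amazonaws.com", "direction": "egress",   "bytes": 2_000_000},
]

def top_egress_talkers(flows, n=5):
    """Sum egress bytes per source pod and return the heaviest senders."""
    totals = defaultdict(int)
    for f in flows:
        if f["direction"] == "egress":
            totals[f["src_pod"]] += f["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_egress_talkers(flows))
# [('recommender-1', 5000000), ('ingest-7', 2000000)]
```

The same aggregation, keyed by destination instead of source, would answer the "who is hammering Ceph S3" question from the abstract.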
Don't be afraid to drop data!
As engineers, we are usually "trained" not to lose any data because "we might need it in the future", but is that true in all cases?
At Outbrain, as part of our data democratization methodology, developers are free to send (almost) any kind of log or message they want or need.
Therefore, we had to impose mechanisms that can drop data based on various dimensions of the sent message (and on its sending rate and volume).
In this session I will present the idea behind the system and how it works.
Full session recording: https://youtu.be/Pw2LX1uUSlw?si=cnQJEfJ2fEV96GLb
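The idea of dropping by message dimensions and rate can be sketched as a per-dimension token bucket. This is a generic illustration of the technique, not Outbrain's actual system; the class and all names are hypothetical:

```python
class DimensionRateLimiter:
    """Token-bucket limiter keyed by message dimensions (e.g. a tuple of
    service name and log level). Messages exceeding the allowed rate for
    their dimension are dropped rather than queued."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # tokens refilled per second, per key
        self.burst = burst         # maximum bucket size, per key
        self.state = {}            # key -> (tokens, last_timestamp)

    def allow(self, key, now: float) -> bool:
        tokens, last = self.state.get(key, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0          # consume a token; otherwise drop the message
        self.state[key] = (tokens, now)
        return allowed

limiter = DimensionRateLimiter(rate_per_sec=1.0, burst=2.0)
noisy = ("checkout-service", "DEBUG")
print([limiter.allow(noisy, now=0.0) for _ in range(3)])
# [True, True, False] -- the third burst message is dropped
print(limiter.allow(noisy, now=1.0))
# True -- one token was refilled after a second
```

Because the bucket is keyed per dimension, one noisy service at DEBUG level gets throttled without affecting anyone else's logs.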
Prometheus That Scales
Running Prometheus is (fairly) easy.
Running Prometheus gets complicated when you need to scrape more than 250 million time series per minute.
Keeping them for a one-year retention term with Thanos is even more complicated. In this lecture I will focus on our journey from a bunch of physical servers running Prometheus to a full-scale, Kubernetes-operated deployment of Prometheus, Thanos, and additional components, with some tips and tricks that allowed us to reduce costs.
Full session recording: https://youtu.be/snkHA5hCT6c?si=I9Q21wErHG0wPhKM
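One common technique for scaling out scraping at this volume, possibly among those the talk covers, is sharding targets across Prometheus replicas with `hashmod` relabeling. Below is a minimal Python sketch of the assignment logic; it is an illustration of the idea, not Prometheus's exact implementation, and the function name is made up:

```python
import hashlib

def shard_for(target: str, num_shards: int) -> int:
    """Deterministically map a scrape target to one of num_shards
    Prometheus replicas, mimicking the idea behind the hashmod relabel
    action: hash the target address, then take it modulo the shard count."""
    digest = hashlib.md5(target.encode()).digest()
    return int.from_bytes(digest[8:], "big") % num_shards

# Each replica keeps only the targets whose shard number matches its own,
# so scrape load is split without any coordination between replicas.
targets = [f"node-{i}:9100" for i in range(6)]
assignment = {t: shard_for(t, num_shards=3) for t in targets}
print(assignment)
```

Because the mapping is a pure function of the target address, every replica computes the same assignment independently from the shared service-discovery output.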
hayaData 2024 Sessionize Event