Monitoring Business Metrics to Reliably Catch Downtime

There’s nothing worse than discovering an outage hours after it starts—especially when your customers notice first. Every minute of downtime can mean lost sales, frustrated users, and damage to your brand. But how do you detect problems that traditional monitoring tools miss?
Most tools focus on technical metrics like CPU, memory, or error rates. But what about outages that don’t cause 500s? From expected failure flows that become problematic at scale to missing instrumentation on key business logic, many real-world issues slip through the cracks.
After one such issue went undetected for four hours—directly impacting revenue—we knew we had to do better. In this talk, you'll learn how we built a real-time monitoring system using business metrics reported via Prometheus to detect what technical metrics couldn't.

Whether you're building a monitoring setup from scratch or leveling up your existing one, you'll leave with ideas to better protect uptime, revenue, and customer trust.

Key Takeaways:
How to instrument key business flows for visibility

How to set up a dashboard for detecting and diagnosing outages

Alerting strategies that catch real problems (without alert fatigue)

Lessons learned, trade-offs made, and practical tips

First presented at Software Architecture Global Summit 2025
https://geekle.us/schedule/wsas25

David Schwartz

Software Architect at Next Insurance

Bet Shemesh, Israel
