Speaker

Ramón Medrano Llamas

Senior Staff Site Reliability Engineer at Google

Zürich, Switzerland

Ramón is a Senior Staff Site Reliability Engineer at Google, where he works on the Identity team. He started back in 2011 as an intern and has since been a team Technical Lead (TL) and Engineering Manager, and recently moved into an über-TL role for the Privacy, Safety and Security teams. These teams store, manage and safeguard user accounts, from account creation to credential management, including account security measures such as hijacking and phishing protection. They run hundreds of microservices across the stack, offering a variety of protocols and APIs to customers. The services run on thousands of machines across tens of data centres around the globe and must be as reliable as possible: not only do other Google products depend on them, but so do the people and enterprises worldwide that use Google, Workspace and the Google Cloud Platform.

Prior to Google, Ramón worked at CERN as part of the Physics Department and the ATLAS Collaboration, where he developed the ROOT framework for data analysis and later the functional testing framework used to validate and ensure the reliability of the distributed computing facilities that enabled the Higgs boson discovery in 2012.

He holds an MSc and a Ph.D. in Computer Engineering. For the last decade he has been researching part-time on autonomic computing and the management of computer fleets in data centres and enterprises, with the goal of optimising and reducing their power usage.

Area of Expertise

  • Information & Communications Technology

Topics

  • SRE
  • DevOps
  • distributed systems
  • autonomic computing

Measuring Reliability in Production

Measuring Reliability in Production uses an example microservice application to describe how to define SLIs and SLOs. It includes an overview of application architecture, a how-to for developing SLOs, and suggestions for implementing SLOs in Google Cloud Operations. There's also a focus on how to identify CUJs (Critical User Journeys) and recommendations for implementing metrics to use as SLI and SLO targets.
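As a rough illustration of the kind of SLI/SLO arithmetic the talk walks through, here is a minimal sketch of a request-based availability SLI checked against an SLO target. The request counts, the 99.9% target and the 28-day window are illustrative assumptions, not figures from the talk.

```python
# Minimal sketch: request-based availability SLI and remaining error budget.
# All numbers and the 99.9% target below are illustrative assumptions.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing violated
    return good_requests / total_requests

def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent in the measurement window."""
    allowed_errors = 1.0 - slo_target      # e.g. 0.1% for a 99.9% SLO
    observed_errors = 1.0 - sli
    if allowed_errors == 0:
        return 0.0
    return 1.0 - observed_errors / allowed_errors

# Hypothetical 28-day window for one critical user journey.
sli = availability_sli(good_requests=9_995_000, total_requests=10_000_000)
print(f"SLI: {sli:.4%}")                                          # 99.9500%
print(f"Error budget left: {error_budget_remaining(sli, 0.999):.1%}")  # 50.0%
```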

Autoscaling services on all dimensions

Why do toil if the machine can do it for you? This talk covers the multitude of autoscaling mechanisms applicable to service meshes built from containers managed by systems like Borg, Kubernetes, Swarm or DC/OS: vertical and horizontal scaling, automatic turnup, load shifting and more.

When deploying containerised stateless services on a cluster managed by Kubernetes, for example, the most efficient way to run them is with the minimal number of replicas that covers the load, maximising resource utilisation. Calculating how many replicas are needed to keep the service reliable can be tricky: pod restarts, traffic imbalances, load shifts and so on all factor in.
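As a sketch of the core horizontal-scaling decision, the replica count can be derived from the ratio of observed to target utilisation, a simplified version of the ratio rule the Kubernetes Horizontal Pod Autoscaler documents. The target value and bounds below are illustrative assumptions.

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilisation: float,
                     target_utilisation: float,
                     min_replicas: int = 2,
                     max_replicas: int = 100) -> int:
    """Simplified HPA-style rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * current_utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, desired))

# Example: 10 replicas averaging 90% CPU against a 60% target -> scale to 15.
print(desired_replicas(10, 0.90, 0.60))
```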

Further, vertically scaling services is a multi-dimensional problem, and services based on virtual machines such as the JVM present specific autoscaling challenges.
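A minimal sketch of what "multi-dimensional" means in practice: each resource is sized independently from a usage percentile plus headroom, and the less elastic dimensions (such as JVM heap, which does not shrink readily once committed) get more conservative headroom. The percentiles, headroom factors and sample values below are illustrative assumptions.

```python
# Hypothetical per-dimension vertical sizing from observed usage samples.

def recommend_request(usage_samples: list[float], percentile: float, headroom: float) -> float:
    """Pick a high percentile of observed usage and add headroom on top."""
    samples = sorted(usage_samples)
    index = min(len(samples) - 1, int(percentile * len(samples)))
    return samples[index] * headroom

cpu_usage = [0.4, 0.5, 0.55, 0.7, 0.65, 0.6]     # cores
heap_usage = [900, 950, 1100, 1050, 1000, 980]   # MiB

cpu_request = recommend_request(cpu_usage, percentile=0.95, headroom=1.2)
# JVM heap is less elastic than CPU, so memory gets more headroom and should
# stay consistent with the configured -Xmx rather than raw usage alone.
mem_request = recommend_request(heap_usage, percentile=0.99, headroom=1.5)

print(f"CPU request: {cpu_request:.2f} cores, memory request: {mem_request:.0f} MiB")
```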

Configuring the autoscaler with the right utilisation levels, the right metrics and the right decaying factors is key to scaling services successfully.
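One way to picture such a decaying factor is an exponentially weighted moving average over the utilisation signal, so that short spikes do not immediately trigger scaling decisions. The decay value and samples below are illustrative assumptions, not recommendations.

```python
# Sketch: smoothing a utilisation signal before feeding it to the autoscaler.

def ewma(samples, decay: float = 0.8):
    """Exponentially weighted moving average: higher decay = slower reaction."""
    smoothed = None
    for sample in samples:
        smoothed = sample if smoothed is None else decay * smoothed + (1 - decay) * sample
        yield smoothed

spiky_cpu = [0.55, 0.60, 0.95, 0.58, 0.57, 0.61]   # one transient spike
print([round(value, 2) for value in ewma(spiky_cpu)])
# The 0.95 spike only nudges the smoothed value to ~0.64, making a premature
# scale-up much less likely than reacting to the raw signal.
```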

DevOps & Cloud Days Sessionize Event

June 2022

2019 All Day DevOps Sessionize Event

November 2019
