SRE devops distributed systems autonomic computing
Zürich, Zurich, Switzerland
Ramón Medrano Llamas is a Site Reliability Engineer (SRE) at Google in Zürich, where he leads the Identity team, which is responsible for the authentication and identity management services at Google.
He concentrates on the reliability aspects of new Google products and new features of existing products, ensuring that they meet the high bar of a Google service.
Before joining Google in 2013, he worked at CERN developing and designing distributed systems for physics data analysis.
He holds a master's degree in computer science and is currently working part-time on a PhD in autonomic system management.
At the time of writing this proposal, COVID-19 is hitting Europe hard. The stock market is down 35% and I'm confined at home, working with my team on contingency plans that assume that up to 70% of our workforce, in both California and Switzerland, will be infected by SARS-CoV-2.
We run all the authentication and hijacking services for Google, so our being unable to respond to incidents would be critical for the business continuity of one of the biggest corporations in the world and for the availability of thousands of customers in G Suite and Google Cloud Platform.
This is, of course, a worst-case scenario, but it is a useful exercise in thinking about how to cover basic operations, on-call and general business continuity under these conditions.
By the time this talk is presented on November 12, the world will be a different place. We'll know what has happened with the pandemic and how the crisis has unwound.
This talk is a live postmortem: a diary of what happened from March 18, 2020 until November 12, 2020, 8 months of crisis, confinement and, hopefully, global recovery. It covers how contingency plans were created, where we were hit, and what successes and lessons we had.
Why do toil if the machine can do it for you? This talk covers the many autoscaling mechanisms applicable to service meshes made of containers managed by systems like Borg, Kubernetes, Swarm or DC/OS: vertical and horizontal scaling, auto turnup, load shifting, and more.
When deploying containerised stateless services on clusters managed by Kubernetes, for example, the most efficient way to run them is with the minimum number of replicas needed to cover the load, maximising resource utilisation. Calculating the number of replicas needed to maintain a reliable service can be tricky because of Pod restarts, traffic imbalances, load shifts, etc.
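As a rough illustration of the replica calculation, the sketch below sizes a service from its peak load, the measured capacity of one replica, a target utilisation and a number of spare replicas kept to absorb restarts or imbalances. All names and default values here (`min_replicas`, `qps_per_replica`, `spare_replicas`, the 0.7 target) are assumptions made for the example, not figures from the talk.

```python
import math


def min_replicas(peak_qps: float,
                 qps_per_replica: float,
                 target_utilisation: float = 0.7,
                 spare_replicas: int = 1) -> int:
    """Smallest replica count that serves peak_qps at the target utilisation.

    spare_replicas is extra headroom (an assumption for this sketch) so that
    a Pod restart or a traffic imbalance does not overload the remainder.
    """
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    # Each replica is only allowed to run at target_utilisation of its capacity.
    usable_qps = qps_per_replica * target_utilisation
    needed = math.ceil(peak_qps / usable_qps)
    # Always keep at least one serving replica, plus the spares.
    return max(needed, 1) + spare_replicas


if __name__ == "__main__":
    print(min_replicas(peak_qps=1000, qps_per_replica=100))
```

Lowering `target_utilisation` or raising `spare_replicas` trades efficiency for reliability, which is the tension the abstract describes.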
Further, vertically scaling services is a multidimensional problem, and services based on virtual machines like the JVM present specific challenges for autoscaling.
Configuring the autoscaler for the right utilisation levels, using the right metrics and the right decay factors, is key to scaling services successfully.
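To make the role of the target utilisation and the decay factor concrete, here is a minimal sketch. The proportional rule follows the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance). The exponential smoothing step and the names `smooth`, `desired_replicas` and `decay` are assumptions for illustration, not how any particular autoscaler is implemented.

```python
import math
from typing import Iterable


def smooth(samples: Iterable[float], decay: float = 0.5) -> float:
    """Exponentially decayed average of utilisation samples, oldest first.

    A higher decay gives more weight to history and reacts more slowly.
    """
    it = iter(samples)
    avg = next(it)  # raises StopIteration if no samples are given
    for sample in it:
        avg = decay * avg + (1 - decay) * sample
    return avg


def desired_replicas(current_replicas: int,
                     utilisation: float,
                     target: float,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling rule with a dead band around the target."""
    ratio = utilisation / target
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    return max(1, math.ceil(current_replicas * ratio))


if __name__ == "__main__":
    recent = [0.55, 0.70, 0.90]  # hypothetical CPU utilisation samples
    print(desired_replicas(4, smooth(recent), target=0.5))
```

A noisy metric combined with a low decay factor makes the replica count flap, while a high one delays the reaction to real load shifts, so both need tuning per service.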