“Site Reliability Engineering - How Google Runs Production Systems” is an open window into Google’s experience and expertise in running some of the largest IT systems in the world.

- Software engineering is fundamental to modern ops.
- Defining reliability targets (“error budgets”) allows devs and ops to have enlightening conversations in the new-features-versus-availability debate.
- A cap on manual operational work allows the ops team to scale sub-linearly with the IT systems’ growth.
- Monitoring is a basic building block for operations and quality assurance activities.
- Handling load requires a multi-pronged approach, with load balancing and graceful overload handling at the forefront.

The book describes the principles that underpin the Site Reliability Engineering (SRE) discipline. It also details the key practices that allow Google to grow at breakneck speed without sacrificing performance or reliability.

Although SRE predates DevOps, Benjamin Treynor Sloss, Vice President at Google and Google’s SRE founder, says that SRE can be seen as a “specific implementation of DevOps with some idiosyncratic extensions”.

SRE has eight tenets: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Much of the book discusses how SRE teams align and execute their work with these core tenets.

Software Engineering in SRE

So what exactly is Site Reliability Engineering, in the words of Sloss? According to Sloss, SRE is “what happens when you ask a software engineer to design an operations team”.

It all starts with the team’s composition: 50-60% of SREs are software engineers by training, with the same skills as the software engineers who work on product development. The rest of the team has similar skills, plus operations-specific knowledge, such as networking and Unix systems internals.

It’s not feasible to grow the organization linearly with the IT systems’ growth, so Google SRE teams benefit from software engineers’ urge to automate everything and their dislike of manual work (toil). But how do SREs find the time to automate problems away? Google’s answer is a 50% cap on toil: SREs must spend at least 50% of their time on engineering activities. Google monitors the amount of operational work each SRE team is doing; if it’s consistently greater than 50%, the product teams take on the excess operational work. This approach has the nice side effect of motivating product teams to build systems that do not rely on manual operation.

Also, as a rule of thumb, on-call engineers should handle about two events per shift. Much more than two, and the engineer becomes overwhelmed, unable to handle each event thoroughly and learn from it.

The way Google handles risk is distinctive. The devs’ thirst for new features and the ops’ focus on reliability are a common source of conflict. Google facilitates this discussion with the concept of the error budget, changing the whole discussion in the process.

The cost of increasing levels of reliability grows exponentially, and, excepting some niche domains such as pacemakers, 100% reliability is the wrong target. On a practical level, it’s easy to understand that 100% availability is impossible when you factor in all the pieces that sit between the end users and your web service: their Wi-Fi, their internet provider, their laptop. So, at Google, the reliability target is a product decision, not a technical one.

Let’s say a product team sets a 99.9% reliability target for a given service. This means the service can be down 8.76 hours per year. That’s the service’s error budget, and the product team can spend it on experimentation and innovation.

The book contains lots of real-life stories that bring the concepts to life, such as error budgeting. A memorable one is about Chubby, Google’s reliable lock service. Over time, more and more services came to depend on Chubby, assuming it would never fail. Alas, everything fails eventually, so Chubby’s failures started to translate into user-visible failures. Chubby’s SRE team decided to ensure that Chubby met, or just slightly exceeded, its reliability targets: if Chubby far exceeded its targets, a controlled failure would bring the service down. The practice nudged all those dependent services to plan for Chubby’s failure.

Monitoring deserves extensive coverage, from both the principles and the practices viewpoints. Unsurprisingly, SREs consider the four golden signals of monitoring to be latency, traffic, errors, and saturation.
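The error-budget arithmetic mentioned earlier (a 99.9% target leaves 8.76 hours of allowed downtime per year) is simple enough to sketch. The helper below is illustrative, not from the book:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def error_budget_hours(availability_target: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return (1.0 - availability_target) * HOURS_PER_YEAR

# A 99.9% target gives roughly 8.76 hours of downtime budget per year;
# each extra "nine" shrinks the budget by a factor of ten.
three_nines = error_budget_hours(0.999)   # ~8.76 hours
four_nines = error_budget_hours(0.9999)   # ~0.88 hours (~53 minutes)
```

This also hints at why the cost of reliability grows so fast: each additional nine cuts the budget for changes, experiments, and mistakes by an order of magnitude.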
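The 50% toil cap described earlier works as a simple feedback rule: measure each team's operational load, and push anything over the cap back to the product team. A toy sketch of that rule (the names and numbers are mine, not Google's):

```python
TOIL_CAP = 0.50  # SREs must spend at least 50% of their time on engineering

def excess_toil(toil_fraction: float) -> float:
    """Fraction of operational work to hand back to the product team."""
    if not 0.0 <= toil_fraction <= 1.0:
        raise ValueError("toil_fraction must be between 0 and 1")
    return max(0.0, toil_fraction - TOIL_CAP)

# A team spending 65% of its time on toil hands 15% back to the
# product team; a team at 40% keeps all of its operational work.
```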
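All four golden signals can be derived from a window of request records plus an estimate of the service's capacity. The sketch below uses a made-up data model (the book prescribes no particular implementation):

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool

def golden_signals(requests: list, window_s: float, capacity_qps: float) -> dict:
    """Compute the four golden signals over one observation window."""
    latencies = sorted(r.latency_ms for r in requests)
    return {
        # Latency: 99th-percentile response time (nearest-rank, simplified)
        "latency_p99_ms": latencies[int(0.99 * (len(latencies) - 1))],
        # Traffic: request rate over the window
        "traffic_qps": len(requests) / window_s,
        # Errors: fraction of failed requests
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        # Saturation: how close traffic is to the service's capacity
        "saturation": (len(requests) / window_s) / capacity_qps,
    }
```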