Don’t forget to join the virtual meetups of contributing.today for 2 interesting shows! Today, 21 April 2021, is about Site Reliability Engineering with a great panel of SREs, and next week there is another one about Chaos Engineering with @QuintessenceAnx from PagerDuty!
Insightful article by @wiredferret for the latest issue of @incrementmag on how to change our mindset to accept failure in order to build resilient systems following risk reduction and harm mitigation patterns.
With the recent announcement of Sabayon Linux becoming Mocaccino OS, we know that Luet will be used as its package manager. This package manager sounds promising, with the ability to define your build and runtime dependencies on top of a container layer.
“assigning a primary on-call to handle pager duty, while round-robin assigning tickets across the team. This setup frequently led to undesirable outcomes, as engineers couldn’t successfully undertake project work and ticket duty simultaneously.” If that looks like your team and you’re looking for ideas to manage toil, this article from @usenix ;login: magazine, shared on the @googlesre resources page https://sre.google/resources/, could help you identify interruptions and find a strategy adapted to your team.
The best way to create and facilitate the adoption of an SRE culture in your organization is to have a training plan adapted to its size, maturity and people’s experience. Chapter 1 of this @googlesre book is a good starting point: it includes a matrix describing different use cases for organizations of any size, and in chapter 3 you’ll find case studies for small and large organizations that can inspire new ideas for your team!
This article is the first outcome of a call for participation in a study on infrastructure drift that we launched at the last Paris SRE Meetup. As part of our work on ‘driftctl’, we are writing a report on how infrastructure drift can be a challenge for DevOps teams, and how they address it. The goal is to share core problems and best practices with the community.
Here is a foretaste of this study, outlining some of the key facts we recorded.
When talking about infrastructure drift, you often get knowing glances and heated answers. Finding gaps between what you expect your infrastructure to be and what it really is, is a well-known and widespread issue bothering hundreds of teams around the globe. Facing impacts and consequences ranging from intensive toil to dangerous security threats, many teams are keenly aware of the issue and actively looking for solutions.
We decided to look more closely into how they deal with it and conducted a study that will be released in the coming weeks.
Non-Abstract Large System Design (NALSD) is a very useful and critical skill for SREs: “By breaking down software into logical components and placing these components into a production ecosystem with reliable infrastructure, we arrive at systems that provide reasonable and appropriate targets for data consistency, system availability, and resource efficiency.”
Hey, in case you missed it, tomorrow (Nov 12) starts the 2020 Fall edition of the @AllDayDevOps conference, with 24 hours of talks by 180 speakers around the world. The event is held entirely online and registration is free. Take a look at the schedule… there’s even a dedicated SRE track!
If you don’t know how to start introducing SLOs at work, this is a great example from Ioannis (@geototti21) and his journey to bring SLOs into his organization with a clear path and framework. As he said: “Explain how SLOs can be an important internal tracking target that is tougher than your SLA. Also, mention how we can use the SLOs to offer better SLAs than ‘service uptime’ if we are asked.”
If you want to know why 100% reliability is the wrong target, go to Chapter 2 of the SRE Workbook. Among other good reasons, you’ll find: “… as you go from 99% to 99.9% to 99.99% reliability, each extra nine comes at an increased cost, but the marginal utility to your customers steadily approaches zero.”
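To put numbers on those nines, here is a minimal sketch that converts an availability target into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration, not something taken from the SRE Workbook.

```python
# Assumption: a non-leap year of 365 days, measured in minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target.

    Hypothetical helper for illustration only.
    """
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Each extra nine divides the allowed downtime by ten.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Going from 99% to 99.99% shrinks the yearly budget from roughly 3.6 days to under an hour, which gives a feel for why each additional nine costs more to deliver.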
If you’re looking to adopt Production Readiness Reviews, this is an insightful post from Pavlos’s blog (@dastergon). In it, he summarizes several aspects of Production Readiness Reviews, such as the spectrum from 1-sided to fully automated reviews and examples of Production Readiness checklists from several companies, as well as what the reviews can bring to an organization: “Production Readiness Reviews are a powerful tool that enables us to provide a common language for our production standards across the organization. It increases our confidence throughout the whole lifecycle of a service and builds trust between product development and SRE teams.”
Testing software is like good practices: everyone has their own interpretation. It’s really interesting to see how other companies test their software and which tools they use, so you can define a test strategy that fits your own use cases.
This succinct article from the Honeycomb blog is a great starting point to understand the fundamentals of Observability, like metrics, logs, traces and structured events, as well as the concepts of context, dimensionality and cardinality. It also includes additional links for further reading about #o11y and distributed tracing.
This was a very interesting talk by Kolton Andrus from Gremlin, Inc., providing a great introduction to Chaos Engineering concepts, good advice with Do’s and Don’ts and a clear path to reliability (Scan, Baseline, Analyze & Plan, Harden and Report). Don’t hesitate to register for #ChaosConf: there are still 2 days full of interesting topics to come!
This is the first part of a series of articles on how to debug Go programs in binary state using eBPF, and it’s also a very good introduction to that kernel feature. By the way, I think eBPF will bring a lot of new observability tools, for example Datadog’s new Network Performance Monitoring.