Welcome to Reliability Report 🎉

This site is a collaborative effort to bring you relevant curated content and the latest news about Reliability Engineering, including SRE, DevOps and Cloud Native topics.

Subscribe to our RSS feed, follow us on Twitter @ReliabilityR or on our Telegram channel t.me/ReliabilityReport.

New integrations are coming, like a Telegram Bot and a weekly newsletter. Stay tuned!

Latest Content

Upcoming SRE and Chaos Engineering sessions at contributing.today

curated by @pabluk

Don’t forget to join the contributing.today virtual meetups for two interesting shows: one today, 21 April 2021, about Site Reliability Engineering with a great panel of SREs, and another next week about Chaos Engineering with @QuintessenceAnx from PagerDuty!

Source: https://www.contributing.today/

Everything is broken, and it’s okay

curated by @pabluk

Insightful article by @wiredferret for the latest issue of @incrementmag on how to change our mindset to accept failure in order to build resilient systems following risk reduction and harm mitigation patterns.

Source: https://increment.com/reliability/failure-is-okay/

Use containers to package your software

curated by @tormath1

With the recent announcement of Sabayon Linux becoming Mocaccino OS, we know that Luet will be used as its package manager. This package manager sounds promising, with the ability to define your build and runtime dependencies on top of a container layer.

Source: https://luet-lab.github.io/docs/about/

Interrupt Reduction Projects

curated by @pabluk

“assigning a primary on-call to handle pager duty, while round-robin assigning tickets across the team. This setup frequently led to undesirable outcomes, as engineers couldn’t successfully undertake project work and ticket duty simultaneously.” If that looks like your team and you’re looking for ideas to manage toil, this article from @usenix ;login: magazine, shared on the @googlesre resources page https://sre.google/resources/, could help you identify interruptions and find a strategy adapted to your team.

Source: https://www.usenix.org/system/files/login/articles/login_winter16_11_beyer.pdf

Identifying Your SRE Training Needs

curated by @pabluk

The best way to create and facilitate the adoption of an SRE culture in your organization is to have a training plan adapted to its size, its maturity and its people’s experience. Take a look inside chapter 1 of this @googlesre book as a good starting point to find a matrix describing different use cases for organizations of any size, and in chapter 3 you’ll find case studies for small and large organizations that can inspire new ideas for your team!

Source: https://sre.google/resources/practices-and-processes/training-site-reliability-engineers/

Why you should take care of infrastructure drift

curated by @GeraldC13

This article is the first outcome of a call for participation in a study on infrastructure drift that we launched at the last Paris SRE Meetup. As part of our work on ‘driftctl’ we are writing a report on how infrastructure drift can be a challenge for DevOps teams, and how they address it. The goal is to share core problems and best practices with the community. Here is a foretaste of this study, outlining some of the key facts we recorded.

When talking about infrastructure drift, you often get knowing glances and heated answers. Finding gaps between what you expect your infrastructure to be and what it really is has become a well-known and widespread issue bothering hundreds of teams around the globe. Facing impacts and consequences ranging from intensive toil to dangerous security threats, many teams are keenly aware of the issue and actively looking for solutions.

We decided to look more closely into how they deal with it and conducted a study that will be released in the coming weeks.
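To make the notion of drift concrete, here is a minimal sketch that compares a declared state with an observed one. It is only an illustration under simple assumptions: resources are plain dictionaries keyed by a resource ID, and the function name `find_drift` and the sample data are hypothetical. It does not describe how driftctl or any other tool works.

```python
def find_drift(expected: dict, actual: dict) -> dict:
    """Compare declared resources with observed ones, keyed by resource ID."""
    expected_ids = set(expected)
    actual_ids = set(actual)
    return {
        # Declared in code but not found in the real infrastructure.
        "missing": sorted(expected_ids - actual_ids),
        # Found in the real infrastructure but never declared.
        "unmanaged": sorted(actual_ids - expected_ids),
        # Present on both sides, but with different attributes.
        "changed": sorted(
            rid for rid in expected_ids & actual_ids
            if expected[rid] != actual[rid]
        ),
    }


# Hypothetical sample data, for illustration only.
declared = {
    "bucket-logs": {"versioning": True},
    "sg-web": {"open_ports": [443]},
}
observed = {
    "sg-web": {"open_ports": [22, 443]},  # someone opened SSH by hand
    "bucket-tmp": {"versioning": False},  # created outside of the code
}

report = find_drift(declared, observed)
print(report)
```

Each of the three categories maps to a different kind of follow-up: recreate what is missing, import or delete what is unmanaged, and reconcile what has changed.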

Source: https://driftctl.com/2020/11/24/infrastructure-drift

Introducing Non-Abstract Large System Design

curated by @pabluk

Non-Abstract Large System Design (NALSD) is a very useful and critical skill for SREs: “By breaking down software into logical components and placing these components into a production ecosystem with reliable infrastructure, we arrive at systems that provide reasonable and appropriate targets for data consistency, system availability, and resource efficiency.”

Source: https://landing.google.com/sre/workbook/chapters/non-abstract-design/

All Day DevOps 2020

curated by @pabluk

Hey, in case you missed it, the 2020 Fall edition of the @AllDayDevOps conference starts tomorrow (Nov 12), with 24 hours of talks by 180 speakers from around the world. The event is held entirely online and registration is free. Take a look at the schedule… there’s even a dedicated SRE track!

Source: https://www.alldaydevops.com/2020-fallschedule

SLO — From Nothing to… Production

curated by @pabluk

If you don’t know how to start introducing SLOs at work, this is a great example from Ioannis (@geototti21) and his journey to bring SLOs into his organization with a clear path and framework. As he said: “Explain how SLOs can be an important internal tracking target that is tougher than your SLA. Also, mention how we can use the SLOs to offer better SLAs than ‘service uptime’ if we are asked.”

Source: https://geototti21.medium.com/slo-from-nothing-to-production-91b8d4270bd5

Why is 100% reliability the wrong target?

curated by @pabluk

If you want to know why 100% reliability is the wrong target, go to Chapter 2 of the SRE Workbook. Among other good reasons, you’ll find: “… as you go from 99% to 99.9% to 99.99% reliability, each extra nine comes at an increased cost, but the marginal utility to your customers steadily approaches zero.”
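As a rough illustration of what each extra nine means in practice, the sketch below converts an availability target into the downtime it allows. The 30-day window and the function name `allowed_downtime_minutes` are assumptions made for this example, not something taken from the Workbook.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    # The error budget is the share of the window that may be "bad".
    return window_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes per 30 days")
```

Over 30 days this gives roughly 432, 43.2 and 4.32 minutes: every additional nine divides the budget by ten, which hints at why the cost of each step keeps growing.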

Source: https://landing.google.com/sre/workbook/chapters/implementing-slos/#reliability-targets-and-error-budgets

The Production Readiness Spectrum

curated by @pabluk

If you’re looking to adopt Production Readiness reviews, this is an insightful post from Pavlos’ blog (@dastergon). In it, he summarizes several aspects of Production Readiness reviews, such as the spectrum from 1-sided to fully automated reviews and examples of Production Readiness checklists from several companies, as well as what the reviews can bring to an organization: “Production Readiness Reviews are a powerful tool that enables us to provide a common language for our production standards across the organization. It increases our confidence throughout the whole lifecycle of a service and builds trust between product development and SRE teams.”

Source: https://dastergon.gr/posts/2020/09/the-production-readiness-spectrum/

How they test

curated by @tormath1

Testing software is like good practices: everyone has their own interpretation. It’s really interesting to see how other companies test their software and which tools they use, so that you can define a test strategy that fits your own use cases.

Source: https://abhivaikar.github.io/howtheytest

Observability 101: Terminology and Concepts

curated by @pabluk

This succinct article from the Honeycomb blog is a great starting point to understand the fundamentals of Observability, like metrics, logs, traces and structured events, as well as the concepts of context, dimensionality and cardinality. It also includes additional links for further reading about #o11y and distributed tracing.
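For readers new to these terms, here is a small sketch of dimensionality and cardinality over a handful of structured events. It assumes the usual definitions (dimensionality as the number of fields, cardinality as the number of distinct values a field takes); the event fields and the helper name `field_cardinality` are made up for illustration.

```python
from collections import defaultdict


def field_cardinality(events: list) -> dict:
    """Return, for each field, how many distinct values were seen."""
    seen = defaultdict(set)
    for event in events:
        for field, value in event.items():
            seen[field].add(value)  # values are assumed to be hashable
    return {field: len(values) for field, values in seen.items()}


# Hypothetical structured events, one dictionary per request.
events = [
    {"service": "checkout", "status": 200, "user_id": "u-1"},
    {"service": "checkout", "status": 500, "user_id": "u-2"},
    {"service": "search", "status": 200, "user_id": "u-3"},
]

cardinality = field_cardinality(events)
dimensionality = len(cardinality)  # number of distinct fields across events
print(dimensionality, cardinality)
```

A field like `user_id` grows with every new user, which makes it a typical high-cardinality field, while `status` stays within a small set of values.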

Source: https://www.honeycomb.io/blog/observability-101-terminology-and-concepts/

Chaos Engineering: The Path to Reliability

curated by @pabluk

This was a very interesting talk by Kolton Andrus from Gremlin, Inc., providing a great introduction to Chaos Engineering concepts, good advice with Do’s and Don’ts and a clear path to reliability (Scan, Baseline, Analyze & Plan, Harden and Report). Don’t hesitate to register for #ChaosConf: there are still 2 days full of interesting topics to come!

Source: https://www.chaosconf.io/

Under Deconstruction: The State of Shopify’s Monolith

curated by @tormath1

More feedback on the RoR application migration at Shopify, in a long article explaining the different steps and challenges (with a few hints and real-life examples).

It’s an interesting read, since the question of monolith vs. modular applications is not easy to answer and there is no generic method to solve it.

“So, we developed a new tool called Packwerk to analyze static constant references. […] We’re planning to make Packwerk open source soon. Stay tuned!”

I’m definitely looking forward to seeing how they proceed and, why not, whether it could be made language agnostic.

Source: https://engineering.shopify.com/blogs/engineering/shopify-monolith

Forbes Cloud 100 list

curated by @tormath1

Looking through this list, it’s fun to notice a few things:

  • only a few companies are not based in the USA (5)
  • the number of employees does not seem to impact the value ($) of the company
  • 2 DevOps companies in the top 20

Source: https://www.forbes.com/cloud100/#25aca7525f94