Welcome to Reliability Report 🎉

This site is a collaborative effort to bring you relevant, curated content and the latest news about Reliability Engineering, including SRE, DevOps and Cloud Native topics.

Subscribe to our RSS feed, follow us on Twitter @ReliabilityR or on our Telegram channel t.me/ReliabilityReport.

New integrations are coming, like a Telegram Bot and a weekly newsletter. Stay tuned!

Latest Content

Grafana OnCall becomes Open-Source

Tue, Jun 14, 2022 curated by @tormath1

This quite recent product from Grafana is now available as an open-source solution with a symbolic initial release v1.0.0 - congrats to them!

Source: https://grafana.com/blog/2022/06/14/introducing-grafana-oncall-oss-open-source/

Recovering corrupted RabbitMQ data by reversing its storage protocol

Fri, Jan 28, 2022 curated by @pabluk

A very well-explained article by @edealir about RabbitMQ storage protocol internals and the journey to recover corrupted data from it!

Source: https://medium.com/cybelangel-product-engineering/recovering-corrupted-rabbitmq-data-by-reversing-its-storage-protocol-part-1-bed2501d0fa9

Understanding How Facebook Disappeared from the Internet

Tue, Oct 5, 2021 curated by @pabluk

A very concise and insightful explanation of BGP and Internet infrastructure from @Cloudflare’s perspective during the FB incident.

Source: https://blog.cloudflare.com/october-2021-facebook-outage/

Mitchell's New Role at HashiCorp

Fri, Jul 23, 2021 curated by @tormath1

Mitchell Hashimoto is leaving the HashiCorp exec team to become a full-time individual contributor.

Source: https://www.hashicorp.com/blog/mitchell-s-new-role-at-hashicorp

Microsoft acquires Kinvolk

Thu, Apr 29, 2021 curated by @tormath1

It’s also personal news for me as a (former) Kinvolk software engineer. Super happy, and we look forward to seeing the great things to come :D

Source: https://azure.microsoft.com/en-us/blog/microsoft-acquires-kinvolk-to-accelerate-containeroptimized-innovation/

Upcoming SRE and Chaos Engineering sessions at contributing.today

Thu, Apr 22, 2021 curated by @pabluk

Don’t forget to join the virtual meetups of contributing.today for 2 interesting shows! One is today, 21 April 2021, about Site Reliability Engineering with a great panel of SREs, and another is next week, about Chaos Engineering with @QuintessenceAnx from PagerDuty!

Source: https://www.contributing.today/

PromCon Online 2021

Mon, Mar 29, 2021 curated by @tormath1

The @PromConIO schedule is available! The event takes place online on the 3rd of May. Which talks do you want to attend? :)

Source: https://promcon.io/2021-online/schedule/

Everything is broken, and it’s okay

Tue, Mar 9, 2021 curated by @pabluk

Insightful article by @wiredferret for the latest issue of @incrementmag on how to change our mindset to accept failure in order to build resilient systems following risk reduction and harm mitigation patterns.

Source: https://increment.com/reliability/failure-is-okay/

v5.12-rc1-dontuse

Mon, Mar 8, 2021 curated by @tormath1

Funny story about this release candidate of Linux 5.12.

TL;DR:

[…] swap files stopped working right.

Source: https://arstechnica.com/gadgets/2021/03/psa-linux-folks-stay-away-from-the-5-12-rc1-kernel/

Google Kubernetes Engine on autopilot mode

Wed, Feb 24, 2021 curated by @tormath1

Using GKE autopilot mode, you will have less to manage and more time to play!

Source: https://techcrunch.com/2021/02/24/google-cloud-puts-its-kubernetes-engine-on-autopilot

GCP and AWS: a deep comparison

Fri, Jan 29, 2021 curated by @tormath1

In this long and thorough paper, you’ll find some elements to help you choose a cloud platform during your infrastructure design process.

Source: https://kinsta.com/blog/google-cloud-vs-aws/

Use containers to package your software

Tue, Jan 19, 2021 curated by @tormath1

With the recent announcement of Sabayon Linux becoming Mocaccino OS, we know that Luet will be used as its package manager. This package manager sounds promising, with the ability to define your build and runtime dependencies on top of a container layer.

Source: https://luet-lab.github.io/docs/about/

SREcon20 Americas talks

Wed, Dec 23, 2020 curated by @pabluk

Yay! SREcon20 Americas talks are ready and available on YouTube 🎉 For more details on each talk, see the program at https://www.usenix.org/conference/srecon20americas/program and enjoy 🍿 Thanks @SREcon and @usenix!

Source: https://www.youtube.com/watch?v=2C2F5USR6N4&list=PLbRoZ5Rrl5lfLXUjFjS0mP1XzNzNZMhYN

Interrupt Reduction Projects

Tue, Dec 22, 2020 curated by @pabluk

“assigning a primary on-call to handle pager duty, while round-robin assigning tickets across the team. This setup frequently led to undesirable outcomes, as engineers couldn’t successfully undertake project work and ticket duty simultaneously.” If that sounds like your team and you’re looking for ideas to manage toil, this article from @usenix ;login: magazine, shared on the @googlesre resources page https://sre.google/resources/, could help you identify interruptions and find a strategy adapted to your team.

Source: https://www.usenix.org/system/files/login/articles/login_winter16_11_beyer.pdf

Identifying Your SRE Training Needs

Sat, Dec 5, 2020 curated by @pabluk

The best way to create and facilitate the adoption of an SRE culture in your organization is to have a training plan adapted to its size, maturity and people’s experience. Chapter 1 of this @googlesre book is a good starting point: it includes a matrix describing different use cases for organizations of any size. In chapter 3 you’ll find case studies for small and large organizations that can inspire new ideas for your team!

Source: https://sre.google/resources/practices-and-processes/training-site-reliability-engineers/

A guide to the reliability talks at AWS re:Invent

Wed, Dec 2, 2020 curated by @GeraldC13

Top picks of reliability-focused talks at AWS re:Invent (virtual) from @Ana_M_Medina, a Sr. Chaos Engineer at @GremlinInc.

Source: https://www.gremlin.com/blog/a-guide-to-the-reliability-talks-at-aws-re-invent/

Why you should take care of infrastructure drift

Fri, Nov 27, 2020 curated by @GeraldC13

This article is the first outcome of a call for participation in a study on infrastructure drift that we launched at the last Paris SRE Meetup. As part of our work on ‘driftctl’, we are writing a report on how infrastructure drift can be a challenge for DevOps teams and how they address it. The goal is to share core problems and best practices with the community. Here is a foretaste of this study, outlining some of the key facts we recorded.

When talking about infrastructure drift, you often get knowing glances and heated answers. Finding gaps in your infra between what you expected and what actually exists is a well-known and widespread issue bothering hundreds of teams around the globe. Facing impacts and consequences ranging from intensive toil to dangerous security threats, many teams are keenly aware of the issue and actively looking for solutions.

We decided to look more closely into how they deal with it and conducted a study that will be released in the coming weeks.
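To make the idea of drift concrete, here is a minimal Python sketch that compares a declared state with an observed one. It is only an illustration: the function name `detect_drift`, the resource identifiers and the dictionary layout are all hypothetical, and this is not how driftctl itself works.

```python
def detect_drift(expected: dict, actual: dict) -> dict:
    """Compare declared resources with observed ones (both hypothetical dicts).

    Keys are resource identifiers, values are their attributes.
    """
    expected_ids = set(expected)
    actual_ids = set(actual)
    return {
        # Declared in code but absent from the real infrastructure
        "missing": sorted(expected_ids - actual_ids),
        # Present in the real infrastructure but not declared anywhere
        "unmanaged": sorted(actual_ids - expected_ids),
        # Present on both sides, but with different attributes
        "changed": sorted(
            rid for rid in expected_ids & actual_ids
            if expected[rid] != actual[rid]
        ),
    }


# Hypothetical example data, for illustration only
declared = {
    "bucket/logs": {"versioning": True},
    "vm/web-1": {"size": "small"},
}
observed = {
    "bucket/logs": {"versioning": False},  # changed by hand
    "vm/debug": {"size": "large"},         # created outside of code
}

report = detect_drift(declared, observed)
for kind, resources in report.items():
    print(kind, resources)
```

Each of the three categories usually calls for a different response: recreate what is missing, import or delete what is unmanaged, and reconcile what has changed.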

Source: https://driftctl.com/2020/11/24/infrastructure-drift

Introducing Non-Abstract Large System Design

Sun, Nov 22, 2020 curated by @pabluk

Non-Abstract Large System Design (NALSD) is a very useful and critical skill for SREs: “By breaking down software into logical components and placing these components into a production ecosystem with reliable infrastructure, we arrive at systems that provide reasonable and appropriate targets for data consistency, system availability, and resource efficiency.”

Source: https://landing.google.com/sre/workbook/chapters/non-abstract-design/

All Day DevOps 2020

Wed, Nov 11, 2020 curated by @pabluk

Hey, in case you missed it, the 2020 Fall edition of the @AllDayDevOps conference starts tomorrow (Nov 12), with 24 hours of talks by 180 speakers around the world. The event is held entirely online and registration is free. Take a look at the schedule… there’s even a dedicated SRE track!

Source: https://www.alldaydevops.com/2020-fallschedule

SLO — From Nothing to… Production

Wed, Nov 4, 2020 curated by @pabluk

If you don’t know how to start introducing SLOs at work, this is a great example from Ioannis (@geototti21) and his journey to bring SLOs into his organization with a clear path and framework. As he says: “Explain how SLOs can be an important internal tracking target that is tougher than your SLA. Also, mention how we can use the SLOs to offer better SLAs than “service uptime” if we are asked.”

Source: https://geototti21.medium.com/slo-from-nothing-to-production-91b8d4270bd5

Why is 100% reliability the wrong target?

Tue, Oct 27, 2020 curated by @pabluk

If you want to know why 100% reliability is the wrong target, go to Chapter 2 of the SRE Workbook. Among other good reasons, you’ll find: “… as you go from 99% to 99.9% to 99.99% reliability, each extra nine comes at an increased cost, but the marginal utility to your customers steadily approaches zero.”
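As a rough illustration of what each extra nine means in practice, this short Python sketch converts an availability target into the downtime it allows. The 30-day window and the function name `allowed_downtime_minutes` are assumptions made for the example; they do not come from the Workbook.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    The 30-day default window is an assumption chosen for illustration.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {budget:.2f} minutes of downtime per 30 days")
```

Each additional nine divides the budget by ten, so the room left for failures, deployments and maintenance shrinks quickly while the cost of staying within it grows.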

Source: https://landing.google.com/sre/workbook/chapters/implementing-slos/#reliability-targets-and-error-budgets

The Production Readiness Spectrum

Sat, Oct 24, 2020 curated by @pabluk

If you’re looking to adopt Production Readiness reviews, this is an insightful post from Pavlos’s blog (@dastergon). In it, he summarizes several aspects of Production Readiness reviews, like the spectrum from 1-sided to fully automated reviews and examples of Production Readiness checklists from several companies, as well as what the reviews can bring to an organization: “Production Readiness Reviews are a powerful tool that enables us to provide a common language for our production standards across the organization. It increases our confidence throughout the whole lifecycle of a service and builds trust between product development and SRE teams.”

Source: https://dastergon.gr/posts/2020/09/the-production-readiness-spectrum/

How they test?

Fri, Oct 9, 2020 curated by @tormath1

Testing software is like good practices: everyone has their own interpretation. It’s really interesting to see how other companies test their software and which tools they use, in order to define a test strategy that fits your own use cases.

Source: https://abhivaikar.github.io/howtheytest

Observability 101: Terminology and Concepts

Fri, Oct 9, 2020 curated by @pabluk

This succinct article from the Honeycomb blog is a great starting point to understand the fundamentals of Observability, like metrics, logs, traces and structured events, as well as the concepts of context, dimensionality and cardinality. It also includes additional links for further reading about #o11y and distributed tracing.

Source: https://www.honeycomb.io/blog/observability-101-terminology-and-concepts/

Chaos Engineering: The Path to Reliability

Tue, Oct 6, 2020 curated by @pabluk

This was a very interesting talk by Kolton Andrus from Gremlin, Inc., providing a great introduction to Chaos Engineering concepts, good advice with Do’s and Don’ts, and a clear path to reliability (Scan, Baseline, Analyze & Plan, Harden and Report). Don’t hesitate to register for #ChaosConf: there are still 2 days full of interesting topics to come!

Source: https://www.chaosconf.io/

Under Deconstruction: The State of Shopify’s Monolith

Fri, Oct 2, 2020 curated by @tormath1

More feedback on the RoR application migration @ Shopify, in a long paper explaining the different steps and challenges (with a few hints and IRL examples).

It’s an interesting read, since the question of monolith vs. modular applications is not easy to answer and there is no generic method to solve it.

So, we developed a new tool called Packwerk to analyze static constant references. […] We’re planning to make Packwerk open source soon. Stay tuned!

I’m definitely looking forward to seeing how they proceed and, why not, whether it could be made language-agnostic.

Source: https://engineering.shopify.com/blogs/engineering/shopify-monolith

Debugging Go in prod using eBPF

Fri, Sep 18, 2020 curated by @pabluk

This is the first part of a series of articles on how to debug Go programs in their binary form using eBPF, and it’s also a very good introduction to that kernel feature. By the way, I think eBPF will bring a lot of new observability tools, such as Datadog’s new Network Performance Monitoring.

Source: https://blog.pixielabs.ai/blog/ebpf-function-tracing/post/

Forbes Cloud 100 list

Thu, Sep 17, 2020 curated by @tormath1

Looking through this list, it’s fun to notice a few things:

only a few companies are not based in the USA (5)
the number of employees does not seem to impact the value ($) of the company
2 DevOps companies in the top 20

Source: https://www.forbes.com/cloud100/#25aca7525f94

Eliminating Toil

Wed, Sep 16, 2020 curated by @pabluk

This chapter from the first Site Reliability Engineering book clearly defines toil, explains how to measure it, and provides some ideas for handling it.

Source: https://landing.google.com/sre/sre-book/chapters/eliminating-toil/

Why Is Site Reliability Engineering Important

Mon, Jul 13, 2020 curated by @pabluk

This is a quote from the SRE book:

Site reliability engineering (SRE) is one of the fastest-growing enterprise roles and set of operational practices for managing services at scale.

You’ll find a clear introduction to these principles here.

Source: https://devops.com/why-is-site-reliability-engineering-important/