reliability

Breaking Clouds

Cloud infrastructure is a necessity in our modern digital world. However, understanding and preparing for failures in cloud infrastructure is critical for reliability of our services. Failures can be viewed as learning opportunities and to improve our system design. It can inform proactive problem-solving, fostering effective incident response, and guiding future design challenges. Chaos Engineering plays a vital role in testing for resilience of our system.

Site Reliability Engineering Explained: An Exploration of DevOps, Platform Engineering, and SRE

As the software development landscape continues to evolve, the roles of Site Reliability Engineering (SRE), DevOps, and Platform Engineering often leave people puzzled about their distinctions and interrelations. In this engaging 30-minute talk, we'll clarify these concepts by delving into the world of SRE, examining its unique position at the intersection of DevOps and Platform Engineering.

Understanding Alerting - How to come up with a good enough alerting strategy

Have you ever considered that your incident from last night might actually be something very positive? No? Then you should watch this talk! I'm going to introduce you to some concepts in the domain of resiliency engineering and then have a look into how you can build an alerting strategy that doesn't page you unnecessarily at 3am.

Above-the-line / Below-the-line framework

Introduction into the "Above-the-line / Below-the-line" framework and why you look at your systems design mostly wrong.