Cloud infrastructure is a necessity in our modern digital world. However, understanding and preparing for failures in cloud infrastructure is critical to the reliability of our services. Failures can be viewed as learning opportunities and as a way to improve our system design. They can inform proactive problem-solving, foster effective incident response, and help us address future design challenges. Chaos Engineering plays a vital role in testing the resilience of our systems.
As the software development landscape continues to evolve, the roles of Site Reliability Engineering (SRE), DevOps, and Platform Engineering often leave people puzzled about their distinctions and interrelations. In this engaging 30-minute talk, we'll clarify these concepts by delving into the world of SRE, examining its unique position at the intersection of DevOps and Platform Engineering.
Have you ever considered that your incident from last night might actually be something very positive? No? Then you should watch this talk! I'm going to introduce you to some concepts from the domain of resilience engineering and then look at how you can build an alerting strategy that doesn't page you unnecessarily at 3 a.m.
An introduction to the "Above-the-line / Below-the-line" framework and why the way you look at your system design is mostly wrong.