Plan and PRACTICE for better incident response with insights from Tim Armandpour, CTO of PagerDuty. Learn the secrets to resilience from the team that mitigated the impact of a major outage—handling a 250% traffic surge while delivering on their SLA.
Listen to find out:
Timestamps:
(00:00:00) Introduction to Alphalist Podcast
(00:01:00) Meet Tim Armandpour
(00:01:56) Tim's Early Career
(00:06:22) Handling Major Incidents at PagerDuty
(00:09:21) The Importance of Preparedness
(00:13:54) Practicing Failure Scenarios
(00:18:16) Resilient Infrastructure and Architectural Patterns
(00:22:44) Standardization and Data Management
(00:25:48) Exploring Infrastructure Resilience
(00:26:20) Achieving High Availability with Lower SLA Cloud Platforms
(00:29:38) Defining Meaningful SLIs
(00:32:15) Assessing Incident Readiness
(00:35:15) The Importance of Ownership
(00:41:30) Continuous Improvement
(00:43:53) Lessons from a Yogurt Business
(00:48:18) Final Thoughts and Takeaways