This talk walks through the steps we took to solve a real SRE (Site Reliability Engineering) issue. The problem slowed down several thousand developers but caused no loss of service. We will show how good logging and tracing strategies, together with pre-emptive log post-mortems, can save a company hundreds of hours.
This talk recounts real events. Names, timestamps, file paths, and IP (Internet Protocol) addresses have been changed for privacy reasons, but the issue itself is unchanged and still visible.
The talk is broken down into the following steps:
- Get a multi-gigabyte log
- Create a parser for it
- Save the parser, and share it with a colleague
- Analyze the data
- Create a custom analysis using Trace Compass’s custom analysis parser
- Share the results
- Modify Trace Compass slightly to highlight the site issue
By the end, attendees should be able to use Trace Compass to understand problems at the site level, and potentially contribute steps toward solving them.
This talk is not addressed only to system administrators: developers, dev(sec)ops engineers, managers, and anyone interested in knowing how their systems are being used are welcome to attend.