Solving an Internal Real-World SRE Issue with Eclipse Trace Compass

This talk walks through the steps we took to solve a real SRE (Site Reliability Engineering) issue. The problem slowed down several thousand developers but caused no service loss. We will show how good logging and tracing strategies, combined with pre-emptive log post-mortems, can save a company hundreds of hours.

This talk recounts real events; the names, timestamps, file paths, and IP (Internet Protocol) addresses have been changed for privacy reasons. The issue itself, however, is unchanged and still visible.

The talk is broken down into the following steps:

Get a multi-gigabyte log
Create a parser for it
Save the parser, and share it with a colleague
Analyze the data
Create a custom analysis using Trace Compass’s custom analysis parser
Share the results
Modify Trace Compass slightly to highlight the site issue

By the end, attendees should be able to use this tool to understand problems at a site level, and potentially contribute steps toward solving them.

This talk is not addressed only to system administrators: developers, dev(sec)ops engineers, managers, and anyone interested in knowing how their systems are being used are welcome to attend.

Objective of the presentation:

This presentation is a tutorial on how to use Trace Compass, but more importantly, it is a tutorial on how to solve issues in a methodical, systematic way. It does not cover log generation; it covers how to extract post-mortem information once you have a log.

Attendee pre-requisites:

N/A

Schedule info

Time:

27 Oct 2022 - 09:00 to 27 Oct 2022 - 09:35

Room:

Silchersaal