Pay Down Production Defects

Douglas Schlenker

Engineering Manager

Google Sheets

Turning on exception monitoring in a production environment for the first time can be a daunting experience.

It happens to every team: at some point in a site's or application's lifetime, the decision is made to turn on an exception monitoring service to collect the exceptions happening in the production environment.

When it is first turned on, though, the initial reaction is often "we had no idea there were this many issues on our site!"

What do you do?

The process for reducing errors in production is simple in principle, but it can be hard to carry out in practice.

Create a learning environment

Whether you use a "post-mortem" document or a "correction of errors" document, create a template that allows individuals to:

Describe the error

Determine the root cause

Outline how to fix the error

Outline how to prevent the error from happening again

Record what was learned from the error
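
As a rough illustration, the template above could be captured as a small record type so that incomplete documents are easy to spot. This is only a sketch: the class name, the field names, and the sample error description are hypothetical, chosen to mirror the five prompts in the list.

```python
from dataclasses import dataclass, fields


@dataclass
class PostMortem:
    """One record per production error; fields mirror the template prompts."""

    description: str
    root_cause: str
    fix: str
    prevention: str
    learnings: str

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical draft: only the description has been filled in so far.
draft = PostMortem(
    description="Checkout page raises an error when the cart is empty",
    root_cause="",
    fix="",
    prevention="",
    learnings="",
)
print(draft.missing_sections())
```

A check like this can be run before a review meeting to see which documents still need work.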

Assess the situation

Ensure your exception monitor is correctly configured. Is error fingerprinting set up correctly? Are the errors you see only from the production environment, or are other environments (development, staging, test, etc.) leaking in? Are you tracking your deployments, so you can tell whether an error is new or a regression? Do you have source maps configured?
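
Two of these questions can be checked against a sample of collected events. The sketch below assumes each event has been exported as a dictionary with `environment` and `release` keys; those key names and the sample data are assumptions for illustration, not the format of any particular monitoring service.

```python
from collections import Counter


def audit_events(events, expected_environment="production"):
    """Summarise environment leakage and missing release tags in a sample of events."""
    environments = Counter(e.get("environment", "unknown") for e in events)
    # Any environment other than the expected one is leaking into the project.
    leaked = {env: n for env, n in environments.items() if env != expected_environment}
    # Events without a release cannot be tied to a deployment.
    missing_release = sum(1 for e in events if not e.get("release"))
    return {"leaked_environments": leaked, "missing_release": missing_release}


# Hypothetical sample: one staging event leaked in, one event has no release.
sample = [
    {"environment": "production", "release": "2022.11.1"},
    {"environment": "staging", "release": "2022.11.1"},
    {"environment": "production"},
]
report = audit_events(sample)
print(report)
```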

Once you know your monitoring service is correctly configured, start with an achievable goal such as "we will fix all errors that occur more than 1,000 times in an hour".

Alerting

Create alerts in your exception monitoring service that match your initial goal. When an alert is triggered, assign someone to be accountable for the event and have them start the post-mortem or correction of errors document.
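
Most monitoring services let you configure such a rule directly. To make the logic concrete, here is a minimal sketch of the "more than 1,000 times in an hour" check, assuming events are available as `(fingerprint, timestamp)` pairs. The function name, the fingerprints, and the use of fixed calendar-hour buckets rather than a sliding window are all assumptions of this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def fingerprints_over_threshold(events, threshold=1000):
    """Return fingerprints seen more than `threshold` times within any calendar hour."""
    per_hour = Counter()
    for fingerprint, ts in events:
        # Truncate the timestamp to the start of its hour to form the bucket.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        per_hour[(fingerprint, hour)] += 1
    return sorted({fp for (fp, _hour), count in per_hour.items() if count > threshold})


# Hypothetical data: one noisy error (1,001 events in an hour) and one quiet error.
start = datetime(2022, 11, 17, 9, 0)
events = [("checkout-error", start + timedelta(seconds=i)) for i in range(1001)]
events += [("login-timeout", start)] * 10
flagged = fingerprints_over_threshold(events)
print(flagged)
```

Each fingerprint returned would get an owner and a new post-mortem document.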

Learning

Set up a regular cadence within the team, department, or organization to review the documents that have been created. Allow everyone to ask questions and learn from the solutions.

Prevention

Prevention is the most important part of reducing errors in production. Without it, regressions will occur and new high-volume or severe issues will keep appearing.

The most common forms of prevention are:

Automated Testing

Manual/User Acceptance Testing

Feature Flagging

Monitoring & Alerting

With these four tools adopted, errors that have been resolved should not recur.
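
For automated testing, one common way to keep a resolved error from returning is to pin the fix with a regression test. The sketch below is hypothetical: `average_item_price` stands in for any function that once failed in production (here, by dividing by zero on an empty cart), and the test names are illustrative.

```python
import unittest


def average_item_price(prices):
    """Hypothetical fixed function: an empty cart used to cause a ZeroDivisionError."""
    if not prices:
        return 0.0
    return sum(prices) / len(prices)


class AverageItemPriceRegressionTest(unittest.TestCase):
    def test_empty_cart_does_not_raise(self):
        # Reproduces the production error; fails again if the fix is removed.
        self.assertEqual(average_item_price([]), 0.0)

    def test_non_empty_cart(self):
        self.assertEqual(average_item_price([2.0, 4.0]), 3.0)
```

Saved in a test module, this can be run with `python -m unittest` as part of the normal build.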

Improve

Once the frequency of errors has dropped, create more challenging alerts (e.g., 250 times in an hour). Repeat this cycle of alerting, learning, and improving until errors are down to a manageable level.
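
The tightening step can be sketched as a simple rule: lower the alert threshold only when recent hourly peaks no longer breach the current one. The function name and the default ladder of thresholds are assumptions for illustration; the ladder uses only the two example values from this article.

```python
def next_threshold(current, recent_hourly_peaks, ladder=(1000, 250)):
    """Return a stricter threshold once no recent hourly peak exceeds the current one."""
    if any(peak > current for peak in recent_hourly_peaks):
        return current  # Still breaching the current goal; do not tighten yet.
    stricter = [t for t in ladder if t < current]
    # Step down one rung at a time; stay put at the bottom of the ladder.
    return max(stricter) if stricter else current


print(next_threshold(1000, [400, 120, 90]))
print(next_threshold(1000, [1500, 90]))
```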

Posted Nov 17, 2022