Turning on exception monitoring in a production environment for the first time can be a daunting experience.
It happens to every team: at some point in a site's or application's lifetime, the decision is made to turn on an exception monitoring service to collect the exceptions occurring in the production environment.
When it is first turned on, though, the initial reaction is often "we had no idea there were this many issues on our site!"
What do you do?
The process for reducing errors in production is simple in principle, but it can be hard to achieve in practice.
Create a learning environment
Whether you use a "post-mortem" document or a "correction of errors" document, create a template that allows individuals to:
Describe the error
Determine the root cause
Outline how to fix the error
Outline how to prevent the error from happening again
Record what was learned from the error
Assess the situation
Ensure your exception monitor is correctly configured. Is error fingerprinting set up correctly? Do the errors reported for production come only from the production environment, or are other environments (development, staging, test, etc.) leaking in? Are you tracking your deployments (used to determine whether an error is new or a regression)? Do you have source maps configured?
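One way to check some of these questions is to export a sample of recent events and audit them. The sketch below is only an illustration: it assumes each event is a dictionary with hypothetical `environment` and `release` fields, which may be named differently (or not exist) in your monitoring service.

```python
from collections import Counter


def audit_events(events, expected_env="production"):
    """Summarise a sample of events and flag likely configuration problems.

    The field names "environment" and "release" are assumptions made for
    this sketch, not the schema of any particular monitoring service.
    """
    by_env = Counter(e.get("environment") or "<missing>" for e in events)
    # Any environment other than the expected one is leaking into the project.
    leaked = {env: n for env, n in by_env.items() if env != expected_env}
    # Events without a release cannot be classified as new or regressed.
    missing_release = sum(1 for e in events if not e.get("release"))
    return {
        "by_environment": dict(by_env),
        "leaked": leaked,
        "missing_release": missing_release,
    }


sample = [
    {"environment": "production", "release": "2024.05.1"},
    {"environment": "staging", "release": "2024.05.2"},
    {"environment": "production", "release": None},
]
report = audit_events(sample)
print(report)
```

A non-empty `leaked` entry suggests another environment is reporting into the production project, and a non-zero `missing_release` suggests deployment tracking is not fully wired up.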
Once you know your monitoring service is correctly configured, start with an achievable goal such as "we will fix all errors that occur more than 1,000 times in an hour".
Alerting
Create alerts in your exception monitoring service that match your initial goal. When an alert is triggered, assign someone to be accountable for the event and have them start their post-mortem or correction of errors document.
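Most monitoring services let you define this kind of rule in their own settings, so you would rarely write it yourself. Still, to make the rule concrete, here is a minimal sketch of "more than N occurrences of the same error within an hour". The class name and the fingerprint strings are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class HourlyThresholdAlert:
    """Track occurrences per fingerprint inside a sliding time window."""

    def __init__(self, threshold=1000, window=timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        self._seen = defaultdict(deque)  # fingerprint -> timestamps in window

    def record(self, fingerprint, when):
        """Record one occurrence; return True if the alert should fire."""
        times = self._seen[fingerprint]
        times.append(when)
        # Drop occurrences that have fallen out of the window.
        cutoff = when - self.window
        while times and times[0] <= cutoff:
            times.popleft()
        # "More than N times in an hour" means strictly greater than N.
        return len(times) > self.threshold


# A small threshold keeps the demonstration short.
alert = HourlyThresholdAlert(threshold=3)
start = datetime(2024, 1, 1, 12, 0)
fired = [
    alert.record("TypeError:checkout", start + timedelta(minutes=m))
    for m in range(5)
]
print(fired)  # prints [False, False, False, True, True]
```

The fourth occurrence within the hour is the first to exceed the threshold of three, which is the point at which someone would be assigned to the event.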
Learning
Set up a regular cadence within the team, department or organization to review the documents that have been created. Allow everyone to ask questions and learn from the solutions.
Prevention
Prevention is the most important component of reducing errors in production. Without prevention, regressions will occur and new high-volume or severe issues will keep appearing.
The most common forms of prevention are:
Automated Testing
Manual/User Acceptance Testing
Feature Flagging
Monitoring & Alerting
With these four tools adopted, errors that have been resolved should not recur.
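As an illustration of feature flagging from the list above, the sketch below routes a new code path behind a flag so that it can be switched off without a deployment if it starts producing errors. The flag name `new_pricing` and both pricing functions are hypothetical, and in practice the flags would usually come from a flag service or configuration store rather than a plain dictionary.

```python
def legacy_total(prices):
    """Existing, known-good code path."""
    return sum(prices)


def new_total(prices):
    """Hypothetical new code path: 10% off orders above 100 (whole units)."""
    subtotal = sum(prices)
    return subtotal - subtotal // 10 if subtotal > 100 else subtotal


def checkout_total(prices, flags):
    """Use the new path only when its flag is explicitly enabled."""
    if flags.get("new_pricing", False):
        return new_total(prices)
    return legacy_total(prices)


print(checkout_total([150, 50], {"new_pricing": True}))
print(checkout_total([150, 50], {}))
```

Defaulting a missing flag to the legacy path means the new behaviour has to be switched on deliberately, and switching it off again is the rollback.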
Improve
Once the frequency of errors has dropped, create more challenging alerts (for example, 250 times in an hour). Keep repeating this cycle of alerting, learning and improving until errors are down to a manageable level.
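The tightening step can be sketched as a small rule: lower the threshold only when recent hourly peaks have stayed at or below the current one. The function name and the intermediate steps of 500 and 100 are assumptions for illustration; only 1,000 and 250 come from the examples in this article.

```python
def next_threshold(current, recent_hourly_peaks, steps=(1000, 500, 250, 100)):
    """Return the next lower threshold if recent peaks stayed within `current`.

    `recent_hourly_peaks` holds the highest hourly count of any single
    error for each recent review period. The default `steps` are illustrative.
    """
    if not recent_hourly_peaks:
        return current  # no data yet, so do not tighten
    if any(peak > current for peak in recent_hourly_peaks):
        return current  # still breaching the current goal
    lower = [s for s in steps if s < current]
    return max(lower) if lower else current


print(next_threshold(1000, [400, 120, 90]))
print(next_threshold(1000, [1200, 90]))
```

With these inputs the first call moves the goal from 1,000 to 500, while the second keeps it at 1,000 because one period still exceeded it.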