How to Support Incident Management by Reducing Alarm Fatigue with Amazon DevOps Guru
By Brayan Marin, DevOps Engineer – Triumph and Nikunj Vaidya, Solutions Architect – AWS
As businesses continue to build faster and more scalable systems, the need to increase reliability is constantly at the forefront of your stakeholders’ minds.
As a result, many monitoring tools are loaded onto an environment, increasing alarm noise, often without identifying the root issue.
With Amazon DevOps Guru, you can take advantage of a fully managed service that cuts down on alarm noise. You can offload the administrative tasks tied to operational issues, because DevOps Guru applies machine learning (ML) to analyze operational data, application metrics, and events to identify anomalous behavior.
This post covers how Triumph Technology Solutions, an AWS Advanced Consulting Partner and dedicated provider of AWS-centered cloud services, got the most out of DevOps Guru by integrating it with PagerDuty, an incident management platform that provides alerts, automatic escalations, and on-call scheduling for your team.
This integration allowed Triumph to reduce alarm noise and better pinpoint the root cause of issues for its customer before those issues caused outages.
Customer Story: Greenworks Tools
Greenworks Tools, an industry leader and ecommerce retailer for battery-powered outdoor power tools, was facing issues with too many alarms and noise in their system prior to enrolling in Triumph’s managed services program.
On average, they were spending 1-2 hours getting to a root cause because of the noise from the priority alarms they were receiving. To solve that problem, Triumph’s managed services team chose Amazon DevOps Guru, which uses machine learning (ML) models to surface critical issues before they escalate into severe problems or outages.
With DevOps Guru sending only the critical alerts it identified directly to the Triumph managed services team via PagerDuty, Greenworks Tools saved an average of 1.5 hours of observability work, time that was reallocated to higher-priority tasks.
“DevOps Guru was very beneficial to our incident response. It provided useful insights we previously didn’t have, and it helped reduce our response time to find a root-cause of an issue by an hour on average,” says Eric D., Sr. IT Manager at Greenworks Tools.
Getting Started with Amazon DevOps Guru
In this section, we’ll walk through how Triumph set up DevOps Guru and integrated it with PagerDuty for the customer.
Triumph began by setting up Amazon DevOps Guru. It’s important to note that DevOps Guru is a regional service, so in the Getting Started sequence you can choose to analyze all resources in your selected region or specify them individually.
In this case, Triumph chose to go with the entire region. Once you select Enable in the Getting Started sequence, you’ll be able to see all of the resources analyzed in a dashboard.
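The post describes enabling coverage through the console. As a rough programmatic equivalent, the sketch below assumes the boto3 `devops-guru` client and its `update_resource_collection` call, with `"*"` as the stack name to cover everything in the client's Region. The function name `enable_region_coverage` is hypothetical, and this is not necessarily how Triumph did it.

```python
def enable_region_coverage(devops_guru_client):
    """Ask DevOps Guru to analyze all CloudFormation stacks in the client's Region.

    `devops_guru_client` is assumed to be created with
    boto3.client("devops-guru", region_name=...); DevOps Guru is regional,
    so the client's Region decides which resources are covered.
    """
    request = {
        "Action": "ADD",
        # A single "*" entry is used here as the wildcard for every stack.
        "ResourceCollection": {"CloudFormation": {"StackNames": ["*"]}},
    }
    devops_guru_client.update_resource_collection(**request)
    return request
```

Passing the client in as a parameter keeps the Region choice explicit and makes the function easy to exercise with a stub before pointing it at a real account.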
Figure 1 – DevOps Guru dashboard.
If you go to Insights, you can see a previous alert that was closed. Instead of digging through many small alerts regarding target response time, DevOps Guru let Triumph know directly that something within the stack was unhealthy.
Figure 2 – DevOps Guru insight summary.
This shortened the investigation: DevOps Guru pointed Triumph straight to the unhealthy stack, so the team could fix the latency issues.
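The same closed insights shown in the console can also be pulled through the API. The sketch below assumes the boto3 `devops-guru` client's `list_insights` call with a `Closed` status filter; the helper names and the seven-day window are illustrative choices, not taken from Triumph's setup, and pagination is left out for brevity.

```python
from datetime import datetime, timedelta, timezone


def closed_reactive_filter(now, days=7):
    """Build a StatusFilter for reactive insights closed in the last `days` days."""
    return {
        "Closed": {
            "Type": "REACTIVE",
            "EndTimeRange": {"FromTime": now - timedelta(days=days), "ToTime": now},
        }
    }


def list_closed_insights(devops_guru_client, now=None, days=7):
    """Return (name, severity) pairs for recently closed reactive insights.

    Only the first page of results is read; follow NextToken for more.
    """
    now = now or datetime.now(timezone.utc)
    response = devops_guru_client.list_insights(
        StatusFilter=closed_reactive_filter(now, days)
    )
    return [
        (insight["Name"], insight["Severity"])
        for insight in response.get("ReactiveInsights", [])
    ]
```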
Setting Up Fast and Reliable Incident Response
The native DevOps Guru integration with PagerDuty allowed Triumph to receive direct alerts quickly and reliably. With this integration, Triumph connected an escalation policy with DevOps Guru and made a staff member from the managed services team the first responder whenever an alert is sent out.
To set up fast and reliable incident response, go to Settings in DevOps Guru, where you can choose to be alerted via Amazon Simple Notification Service (SNS). In our example, this is the step that allowed Triumph to integrate with PagerDuty.
Setting Up an Amazon SNS Topic
In the AWS Management Console, go to Amazon SNS to create the topic that will deliver to the integration URL generated in PagerDuty. For Triumph, SNS relays notifications directly to PagerDuty, which provides various contact options to the team.
All they needed to do was create a topic, name it appropriately (TT-Managed-Service-Security-PagerDuty-Alerts), and create a subscription on that topic with the PagerDuty integration URL as its endpoint.
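For readers who prefer scripting over the console, the topic and subscription steps above can be sketched with the boto3 `sns` client's `create_topic` and `subscribe` calls. The function name is hypothetical, and the integration URL is whatever PagerDuty generated for your service; no real URL is assumed here.

```python
def create_pagerduty_topic(sns_client, topic_name, integration_url):
    """Create an SNS topic and subscribe an HTTPS endpoint to it.

    `sns_client` is assumed to be boto3.client("sns") in the same Region
    as DevOps Guru. Returns (topic_arn, subscription_arn).
    """
    if not integration_url.startswith("https://"):
        raise ValueError("expected an HTTPS integration URL")

    # create_topic returns the existing ARN if the topic already exists
    # with the same attributes, so re-running this is safe.
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]

    # HTTPS subscriptions stay pending until the endpoint confirms them;
    # ReturnSubscriptionArn=True returns the ARN even while pending.
    subscription = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="https",
        Endpoint=integration_url,
        ReturnSubscriptionArn=True,
    )
    return topic_arn, subscription["SubscriptionArn"]
```

With the topic name from this post, a call would look like `create_pagerduty_topic(sns, "TT-Managed-Service-Security-PagerDuty-Alerts", url)`, where `url` holds your own integration URL.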
Figure 3 – SNS subscription.
Back in Settings, Triumph added the SNS topic that was just created. Now, when DevOps Guru sends out an alert, Triumph can act as the first responder with a clear remediation plan laid out by DevOps Guru.
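Adding the topic in Settings corresponds to registering a notification channel. A minimal sketch, assuming the boto3 `devops-guru` client's `list_notification_channels` and `add_notification_channel` calls, is shown here; the function name and the check for an existing channel are illustrative additions, and only the first page of channels is inspected.

```python
def ensure_sns_channel(devops_guru_client, topic_arn):
    """Register `topic_arn` as a DevOps Guru notification channel.

    Returns the channel Id, reusing an existing channel when the topic
    is already registered so repeated runs do not attempt a duplicate.
    """
    existing = devops_guru_client.list_notification_channels().get("Channels", [])
    for channel in existing:
        if channel.get("Config", {}).get("Sns", {}).get("TopicArn") == topic_arn:
            return channel["Id"]

    response = devops_guru_client.add_notification_channel(
        Config={"Sns": {"TopicArn": topic_arn}}
    )
    return response["Id"]
```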
Figure 4 – DevOps Guru settings.
Incident Response Moving Forward
To demonstrate how DevOps Guru helps cut time on incident response, let’s look at another recent alert triggered by the service.
Figure 5 – DevOps Guru insight details.
On opening the insight, Triumph immediately noticed that DevOps Guru had aggregated multiple alarms regarding 5xx error counts and that the Elastic Load Balancer was returning errors.
This prompted an investigation where Triumph noticed one of the target groups registered an unhealthy host, causing the load balancer to return some errors.
By being led directly to the root cause, Triumph terminated the unhealthy host quickly. This reduces the time spent troubleshooting the application to find where such errors are coming from, and frees that time for other work.
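The target group check in this investigation can also be done from a script. The sketch below assumes the boto3 `elbv2` client's `describe_target_health` call and simply lists targets reported as unhealthy; the function name is hypothetical, and the post does not say which tool Triumph used for this step.

```python
def unhealthy_targets(elbv2_client, target_group_arn):
    """Return (target_id, reason) for each unhealthy target in a target group.

    `elbv2_client` is assumed to be boto3.client("elbv2"). `reason` is the
    reason code reported by the load balancer, or None when it is absent.
    """
    response = elbv2_client.describe_target_health(TargetGroupArn=target_group_arn)
    return [
        (desc["Target"]["Id"], desc["TargetHealth"].get("Reason"))
        for desc in response["TargetHealthDescriptions"]
        if desc["TargetHealth"]["State"] == "unhealthy"
    ]
```

Deciding what to do with an unhealthy host, such as terminating it as Triumph did, is deliberately left to the operator rather than automated here.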
Conclusion
Triumph Technology Solutions helps customers like Greenworks Tools transition to the cloud easily, and provides comprehensive monitoring solutions with managed services so their workloads run soundly without worry.
With this Amazon DevOps Guru solution in place, Triumph’s customers are no longer just responding to active incidents; they are proactively preventing them. This leads to a more reliable and fault-tolerant environment for Triumph’s customers.
To learn more, please contact Triumph for your DevOps needs.