How we replaced manual review of PagerDuty alerts with an automatic weekly summary dashboard.
I’m a software engineer at Plaid on the Data Science & Infrastructure team. We support a myriad of services for both external clients and internal employees: we enable the product, growth, and engineering teams with tools like Redshift for large analytics queries and the ELK stack for real-time investigation across the many services powering Plaid.
Supporting these analytics & monitoring services is mission-critical: if they go down dozens of workflows across the company break; it’s important that we catch any degradation of service before it becomes it an issue. Over the course of 2018 we’ve added dozens of new features to the services we support: new datasets, faster analytics capabilities, and more self-service data portals. We need to find any issues before they are realized by our users:
We use PagerDuty to send real-time alerts when there is a degradation in service performance. Our team has an on-call rotation: between four engineers we rotate being the directly responsible individual for handling any alerts of service performance generated through PagerDuty. The week you’re on call, you use the PagerDuty app to alert of any realtime
DSI-Urgent alerts, and the PagerDuty slack integration to send other messages to the
DSI-Notify slack channel.
Towards the end of the year we realized that while we had done well to add new PagerDuty alerts to support SLAs for our services, we weren’t as disciplined with triaging and fixing the underlying long-term issues. In order to step back from treating the symptoms and work towards overall service quality, we decided to add an “on-call summary & handoff” digest to our weekly team meetings. This helped facilitate discussion about why we were being paged, iterate on a service to improve reliability, and change alert structure to reduce false alarms.
This worked well as a pilot — we were able to better surface common issues, more easily identify patterns, and decouple the on-call cadence of addressing real-time issues with the need for making longer term substantive changes.
It was not, however, our long-term vision to add manual work to the on-call rotation. PagerDuty offers a few views into analytics, but we wanted finer grain control of our data and to be able to join it to other analytics data sources. We set out to use the PagerDuty V2 API to pull the data that would help inform patterns of incidents and supplant the manual tracking of our handoff document.
We use Apache Airflow for a wide-range of recurring data ETL, and wrote a one-off Airflow DAG that downloads a history of the data from PagerDuty, writes it into S3, and then copies from S3 to Redshift. You can see a sanitized version of the python script in this gist.
We focused on two tables:
log_entries, which gives us a history of which team got notified about what alert on what date. PagerDuty offers a one month history of incidents through the API. Because a user can modify an incident well after the original date of the alert, we re-harvest the entire 30-day incident history and update any modified data. By contrast, log_entries is an append-only log, so we more simply take new data at a regular cadence and copy it into redshift.
Finally, with the data in Redshift we can create an automated version of our handoff document! Here’s an example of how we query the final schema to pull insights out of the incident and notification history:
-- categorize alerts, and count by category + date
-- takes a rolling five day history
when incidents.summary like '%airflow heartbeat%'
then 'Airflow Heartbeat'
when incidents.summary like '%mysql_dump%'
when incidents.summary like '%TaskInstance: dump_%'
when incidents.summary like '%ES%not green%'
then 'ES not green'
when incidents.summary like '%queue is building%'
then 'Logstash Queue'
when incidents.summary like '%kinesis%behind%'
then 'Kinesis Behind'
when incidents.summary like '%TaskInstance%'
then 'Airflow - Other'
-- remove numbers to more easily group alerts
regexp_replace(incidents.summary, '[0-9]', '')
end as summary
abs(datediff('day', getdate(), created_at)) < 5
Which leads us to the grand conclusion: a Periscope dashboard where we can look at the number of alerts, the breakdown by category and service, and track trends over time.
Our handoff doc is a little cleaner now: rather than having to manually type out reports about individual alerts, we can summarize the week from our dashboard and jump straight to the collective and individual actions we want to take to build resilient and reliable data infrastructure.
This new process has helped us prioritize & fix noisy alerts and persistent incidents. As a bonus, it helps our users surface and escalate lingering issues that have been masked by the cadence on on-call alerts, focusing on other important issues: