Cache outage causing service disruption

Incident Report for The Dental App

Postmortem

Summary

On February 17, 2025, at around 1 pm EST, our API service experienced degraded performance, which escalated into a partial outage caused by an issue with our cache database. Affected customers saw slow response times and intermittent availability. We want to reassure our users that this incident had no impact on data security, and no data was lost during the event.

Impact

  • Customers experienced slower API response times, affecting application performance.
  • As the issue escalated, the API became intermittently unavailable.
  • The incident primarily affected customers whose services were running on the impacted cache nodes.

Root Cause

The degradation in API performance was traced to our primary cache database, which began slowing down because of a filesystem error affecting a subset of server nodes. As the error persisted, the affected nodes became unresponsive, leading to a full cache outage for certain API requests.

Resolution

To restore service, we took the following steps:

  1. Identified and isolated the affected cache nodes.
  2. Rebuilt the cache service using dedicated hardware to ensure better performance and stability.
  3. Deployed the new cache cluster in a highly available configuration with nodes distributed across multiple data centers.
  4. Enabled automatic failover mechanisms to ensure continued availability in case of future node failures.
  5. Implemented enhanced monitoring and alerting on the new cluster to detect potential issues earlier and respond proactively.
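
For readers who want a concrete picture of the automatic failover described in steps 3 and 4, the following is a minimal sketch and not our production implementation. It assumes each node holds a replica of the cached data. The names CacheNode, FailoverCache, and NodeUnavailable are hypothetical, and the nodes are in-memory stand-ins rather than real cache servers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class NodeUnavailable(Exception):
    """Raised when a node (or every node) cannot serve a request."""


@dataclass
class CacheNode:
    # Hypothetical in-memory stand-in for one cache node in one data center.
    name: str
    datacenter: str
    healthy: bool = True
    store: Dict[str, str] = field(default_factory=dict)

    def get(self, key: str) -> Optional[str]:
        if not self.healthy:
            raise NodeUnavailable(self.name)
        return self.store.get(key)


class FailoverCache:
    """Tries nodes in order and skips any node that is unavailable."""

    def __init__(self, nodes: List[CacheNode]) -> None:
        self.nodes = nodes

    def get(self, key: str) -> Optional[str]:
        for node in self.nodes:
            try:
                return node.get(key)
            except NodeUnavailable:
                continue  # fail over to the next node
        raise NodeUnavailable("all cache nodes are down")


# Usage: two replicas in different data centers; the first one fails.
node_a = CacheNode("cache-a", "dc-1", store={"session:1": "alice"})
node_b = CacheNode("cache-b", "dc-2", store={"session:1": "alice"})
cache = FailoverCache([node_a, node_b])
node_a.healthy = False
value = cache.get("session:1")  # served by cache-b
```

The point of the sketch is that a single unresponsive node no longer makes a key unreachable, as long as another replica in a different data center can answer.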

Next Steps & Preventive Measures

  • Infrastructure Improvements: The new cache service runs on dedicated hardware with built-in redundancy and failover mechanisms.
  • Proactive Monitoring: Enhanced monitoring and alerts now provide earlier detection of potential filesystem or performance issues.
  • Resilience Testing: We are implementing additional stress testing and fault injection scenarios to validate the high availability of the cache system.
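
As an illustration of how monitoring and fault injection fit together, here is a small sketch under stated assumptions. It is not our actual alerting configuration. The NodeMonitor class, the 50 ms latency limit, and the three-probe threshold are hypothetical values chosen for the example, and inject_fault simulates a node outage by replacing probe results with failures.

```python
from typing import List, Optional


class NodeMonitor:
    """Alerts after several consecutive slow or failed health probes."""

    def __init__(self, latency_limit_ms: float = 50.0, max_bad_probes: int = 3) -> None:
        self.latency_limit_ms = latency_limit_ms
        self.max_bad_probes = max_bad_probes
        self.bad_streak = 0

    def record(self, latency_ms: Optional[float]) -> bool:
        """Record one probe (None means the probe failed). Returns True when alerting."""
        bad = latency_ms is None or latency_ms > self.latency_limit_ms
        self.bad_streak = self.bad_streak + 1 if bad else 0
        return self.bad_streak >= self.max_bad_probes


def inject_fault(probes: List[Optional[float]], start: int, length: int) -> List[Optional[float]]:
    """Fault injection: turn a run of probe results into failures."""
    return [None if start <= i < start + length else p for i, p in enumerate(probes)]


def first_alert(monitor: NodeMonitor, probes: List[Optional[float]]) -> Optional[int]:
    """Return the index of the first probe that triggers an alert, if any."""
    for i, probe in enumerate(probes):
        if monitor.record(probe):
            return i
    return None


healthy = [5.0] * 10                               # ten fast probes, in milliseconds
faulty = inject_fault(healthy, start=4, length=3)  # simulate a short node outage

no_alert = first_alert(NodeMonitor(), healthy)
alert_at = first_alert(NodeMonitor(), faulty)
```

A test of this kind checks two things at once: that a healthy node never raises an alert, and that an injected failure is detected within a known number of probes.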

Current Status

All services have been fully restored, and performance is back to normal. We appreciate your patience and understanding as we worked to resolve this issue. If you have any further questions, please reach out to our support team.

Posted Feb 18, 2025 - 10:28 EST

Resolved

We’re currently experiencing service disruptions due to an issue with our cache database service. The problem has been identified and is impacting application performance and availability.
Posted Feb 18, 2025 - 10:14 EST