Service Down Runbook

This runbook outlines the steps to take when a service is reported as down.

1. Verify the Incident

  • Check monitoring dashboards (e.g., Grafana, Datadog) for alerts related to the service.
  • Attempt to access the service directly.
  • Confirm with the reporting party that the service is indeed down.
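
The direct-access check above can be scripted so that it gives the same answer every time it is run. The following is a minimal sketch, assuming the service listens on a TCP port; the function name is_port_open and the default timeout are illustrative choices, not part of any existing tooling.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: a successful connect only shows that something is
    listening, not that the service is answering requests correctly.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling is_port_open with the service's host and port returns False when the connection is refused or times out. A True result is not proof of health, so it should be read alongside the dashboards and the reporter's confirmation.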

2. Gather Information

  • Note the exact time the incident was reported.
  • Identify the affected service(s) and their dependencies.
  • Check recent deployments or configuration changes.
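
Keeping the gathered details in one structured record makes them easier to hand over later. This is one possible shape, sketched under the assumption that a plain-text summary is enough; the class IncidentRecord and all of its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Hypothetical container for the details gathered in this step."""

    service: str
    reported_at: datetime  # use a timezone-aware value, ideally UTC
    dependencies: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render the record as plain text for a ticket or chat message."""
        lines = [
            f"Service: {self.service}",
            f"Reported at: {self.reported_at.isoformat()}",
            "Dependencies: " + (", ".join(self.dependencies) or "none recorded"),
            "Recent changes: " + (", ".join(self.recent_changes) or "none recorded"),
        ]
        return "\n".join(lines)
```

A record can be created with reported_at=datetime.now(timezone.utc) at the moment the report arrives. Storing the time in UTC avoids ambiguity when responders are in different time zones.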

3. Initial Troubleshooting Steps

  • Restart the service: This is often the quickest fix. On systemd hosts, run systemctl restart <service-name>; otherwise use the equivalent for your environment.
  • Check logs: Look for error messages or unusual patterns in the service logs. Common sources are the files under /var/log/<service-name>/ and, on systemd hosts, the output of journalctl -u <service-name>.
  • Check resource utilization: Confirm that the server is not running short of CPU, memory, or disk space.
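
The log and resource checks above can be partly automated. The sketch below assumes Python is available on the host; the helper names and the keyword list in ERROR_PATTERN are illustrative assumptions and should be adapted to what the service actually logs. Memory is left out because the standard library has no portable call for it.

```python
import os
import re
import shutil

# Assumed keywords; adjust to the service's real log format.
ERROR_PATTERN = re.compile(
    r"\b(error|fatal|panic|exception|traceback)\b", re.IGNORECASE
)


def error_lines(lines):
    """Return the log lines that match ERROR_PATTERN, in their original order."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]


def disk_used_percent(path="/"):
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def load_per_cpu():
    """1-minute load average divided by CPU count, or None where unsupported."""
    try:
        one_minute, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None
    return one_minute / (os.cpu_count() or 1)
```

error_lines accepts any iterable of strings, such as an open log file or the captured output of journalctl -u <service-name>. A load_per_cpu value that stays well above 1.0 suggests CPU saturation, but an acceptable threshold depends on the workload.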

4. Escalate

  • If the issue persists after initial troubleshooting, escalate to the appropriate team or on-call engineer.
  • Provide all gathered information and troubleshooting steps already performed.

5. Post-Incident Review

  • Once the service is restored, conduct a post-incident review to understand the root cause and prevent future occurrences.
  • Update this runbook if any new steps or information are discovered.