Service Down Runbook
This runbook outlines the steps to take when a service is reported as down.
1. Verify the Incident
- Check monitoring dashboards (e.g., Grafana, Datadog) for alerts related to the service.
- Attempt to access the service directly.
- Confirm with the reporting party that the service is indeed down.
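
The direct access check above can be sketched as a small shell probe. This is a minimal sketch, assuming the service exposes an HTTP health endpoint; the URL, the function names, and the rule that 2xx/3xx means "up" are illustrative assumptions, not part of this runbook.

```shell
# Hypothetical health endpoint (assumption); replace with your service's real URL.
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"

# Map an HTTP status code to a coarse verdict.
# 2xx and 3xx count as "up"; anything else, including curl's 000
# (no HTTP response received), counts as "down".
classify_status() {
  case "$1" in
    2??|3??) echo "up" ;;
    *)       echo "down" ;;
  esac
}

# Request the endpoint and print only the HTTP status code.
# --max-time bounds the whole request so a hung service does not block the check.
probe_service() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}
```

To run the check, call classify_status "$(probe_service "$HEALTH_URL")". A "down" result from your own host should still be compared with the dashboards and the reporter's account, since a local network problem can produce the same result.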
2. Gather Information
- Note the exact time the incident was reported.
- Identify the affected service(s) and their dependencies.
- Check recent deployments or configuration changes.
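
The information-gathering steps above could be captured with a few helper functions. This is a sketch under assumptions: it presumes a systemd host, and the function names and the idea of scanning a configuration directory for recently modified files are illustrative. Deployment history itself lives in whatever deploy tooling you use and is not covered here.

```shell
# Print the current time in UTC, ISO 8601 format, for the incident record.
incident_timestamp() {
  date -u +"%Y-%m-%dT%H:%M:%SZ"
}

# List the units a service depends on (systemd hosts only).
list_dependencies() {
  systemctl list-dependencies "$1"
}

# List files under a directory modified within the last N minutes.
# $1: directory to scan (for example, the service's config directory)
# $2: look-back window in minutes
recent_changes() {
  find "$1" -type f -mmin "-$2"
}
```

For example, recent_changes /etc/<service-name> 1440 would list configuration files touched in the last 24 hours, assuming the service keeps its configuration under that path.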
3. Initial Troubleshooting Steps
- Restart the service: This is often the quickest fix. Use systemctl restart <service-name> or the equivalent for your environment.
- Check logs: Look for error messages or unusual patterns in the service logs. Common sources include the /var/log/<service-name>/ directory and the output of journalctl -u <service-name>.
- Check resource utilization: Ensure the server has sufficient CPU, memory, and disk space.
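
The three troubleshooting steps above can be sketched as shell helpers. This is a minimal sketch, assuming a Linux host with systemd; the function names, the 15-minute log window, and the 90 percent disk threshold in the usage note are illustrative assumptions rather than values prescribed by this runbook.

```shell
# Restart a unit, then report whether it came back up.
# May require root privileges (for example, via sudo) in your environment.
restart_service() {
  systemctl restart "$1" && systemctl is-active "$1"
}

# Show error-level (and more severe) journal entries for the unit
# from the last 15 minutes.
recent_errors() {
  journalctl -u "$1" --since "15 minutes ago" -p err --no-pager
}

# Read `df -P` output on stdin and print the capacity column of the
# first filesystem line as a bare number (for example, 42 for "42%").
parse_capacity() {
  awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Print disk usage, as a percentage, of the filesystem holding a path.
disk_usage_percent() {
  df -P "$1" | parse_capacity
}

# Succeed when a value is at or above a limit.
is_over_threshold() {
  [ "$1" -ge "$2" ]
}

# Print a quick snapshot of load, memory, and disk usage.
resource_snapshot() {
  uptime
  free -m
  df -h
}
```

For example, is_over_threshold "$(disk_usage_percent /var/log)" 90 succeeds when the filesystem holding /var/log is at least 90 percent full. It can be worth running recent_errors before restart_service so that pre-restart errors are captured in the incident notes.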
4. Escalate
- If the issue persists after initial troubleshooting, escalate to the appropriate team or on-call engineer.
- Provide all gathered information and troubleshooting steps already performed.
5. Post-Incident Review
- Once the service is restored, conduct a post-incident review to understand the root cause and prevent future occurrences.
- Update this runbook if any new steps or information are discovered.