Lifecycle states
Incidents move through:
- investigating
- identified
- monitoring
- resolved
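
The four lifecycle states can be modeled as an ordered enumeration. The following is a minimal sketch, not a reference implementation: the names `IncidentState`, `next_state`, and `is_forward_transition` are hypothetical, and the rule that an incident only moves forward (possibly skipping a state) is an assumption this document does not state.

```python
from enum import Enum
from typing import Optional


class IncidentState(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Enum members iterate in definition order, which matches the lifecycle order.
LIFECYCLE = list(IncidentState)


def next_state(current: IncidentState) -> Optional[IncidentState]:
    """Return the state that follows `current`, or None once resolved."""
    index = LIFECYCLE.index(current)
    return LIFECYCLE[index + 1] if index + 1 < len(LIFECYCLE) else None


def is_forward_transition(current: IncidentState, target: IncidentState) -> bool:
    """Assumption: an incident may only advance, though it may skip states."""
    return LIFECYCLE.index(target) > LIFECYCLE.index(current)
```

If reopening an incident (for example, monitoring back to investigating) needs to be allowed, `is_forward_transition` would have to be replaced with an explicit table of permitted transitions.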
Severity levels
- minor: limited scope, low customer impact
- major: broad impact or significant degradation
- critical: severe service interruption
Update cadence
- Initial update as soon as practical after detection.
- Follow-up updates at regular intervals while impact continues.
- Resolution update when service metrics recover.
- Postmortem publication for impactful incidents.
Channel model
- /status as canonical public channel
- RSS feed for status updates
- Optional email subscriptions (when enabled)
Postmortem expectations
A postmortem should include:
- incident summary and impact window
- root cause
- mitigation timeline
- prevention actions and owner follow-up
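
The expected contents can also be treated as a completeness checklist before publication. The sketch below is illustrative only: the `Postmortem` dataclass, its field names, and `missing_sections` are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Postmortem:
    # Field names are assumptions derived from the checklist above.
    summary: str = ""
    impact_window: str = ""
    root_cause: str = ""
    mitigation_timeline: list = field(default_factory=list)
    prevention_actions: list = field(default_factory=list)
    follow_up_owner: str = ""


def missing_sections(postmortem: Postmortem) -> list:
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(postmortem) if not getattr(postmortem, f.name)]


draft = Postmortem(summary="Elevated API error rate", root_cause="Expired certificate")
print(missing_sections(draft))
```

A draft would be ready for publication under this sketch once `missing_sections` returns an empty list.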
Example timeline format
- 09:12 UTC - investigating
- 09:37 UTC - identified
- 09:54 UTC - monitoring
- 10:03 UTC - resolved
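
A timeline in this format is simple to parse, which makes it possible to derive figures such as time to resolution. The sketch below is a hedged example: `parse_timeline` and `time_to_resolution` are hypothetical helpers, the one-entry-per-line layout with an optional leading dash is assumed, and all entries are assumed to fall on the same UTC day.

```python
import re
from datetime import datetime, timedelta

# Matches "09:12 UTC - investigating", with an optional leading list dash.
ENTRY = re.compile(r"^(?:-\s*)?(\d{2}:\d{2}) UTC\s*-\s*(\w+)$")

SAMPLE = """\
09:12 UTC - investigating
09:37 UTC - identified
09:54 UTC - monitoring
10:03 UTC - resolved
"""


def parse_timeline(text: str) -> list:
    """Parse timeline lines into (time, state) tuples, in input order."""
    entries = []
    for line in text.strip().splitlines():
        match = ENTRY.match(line.strip())
        if match is None:
            raise ValueError(f"unrecognized timeline entry: {line!r}")
        clock, state = match.groups()
        # Only the time of day is given, so the date part is a placeholder.
        entries.append((datetime.strptime(clock, "%H:%M"), state))
    return entries


def time_to_resolution(entries: list) -> timedelta:
    """Elapsed time from the first entry to the 'resolved' entry."""
    start = entries[0][0]
    resolved = next(when for when, state in entries if state == "resolved")
    return resolved - start


print(time_to_resolution(parse_timeline(SAMPLE)))  # 0:51:00
```

Incidents that cross midnight UTC would need full dates in each entry, since a time of day alone cannot express that.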