System Reliability

Every service fails. The question is how we detect the failure, recover from it, and record what happened. These essays treat reliability not as a checkbox but as a behaviour that keeps systems alive.