Tales from the Bare Metal, Episode 02

« Thou shalt give every destructive tool a floor below which it must refuse! »

At 9:37 in the morning, Pacific time, on Tuesday 28 February 2017, an authorised engineer in Amazon's S3 team typed a command. The command was meant to remove a small number of servers from the S3 billing subsystem, an internal cost-tracking layer that had been showing an issue worth debugging. By 13:54, roughly half of the public-facing internet had been down for four hours, and Amazon had spent most of those four hours unable to update its own status dashboard to say so.

This is a long-documented incident. Amazon's official postmortem, posted on aws.amazon.com on the day after, is brief, factual, and self-critical in the AWS house style. The story has been retold many times. The point of revisiting it now is the architecture, which is what survives the retelling.

What Happened, in Sequence

The S3 team was investigating slow billing reports.…