Following a data center outage last week, Amazon.com Inc. (AMZN) apologized Friday in a 6,000-word statement.
The Seattle-based company's web hosting service, Elastic Compute Cloud (EC2), suffered a massive glitch on April 21. The glitch shut down its servers, instantly knocking out access to major websites including Foursquare, Reddit and Quora.
“We want to apologize,” the statement read.
Amazon is a leading provider of cloud web hosting, renting out storage space on its powerful servers to customers around the world.
The company has offered a 10-day credit to web services customers who lost substantial data, though it did not divulge the credit's value in the statement on its website. The statement also came with a promise to revamp future communication.
“We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services,” Amazon said.
Glitch at a glance
The statement attributed the glitch to a change in network configuration.
“As with any complicated operational issue, this one was caused by several root causes interacting with one another,” Amazon wrote.
According to the statement, human error brought about the shutdown. Compounding the damage, the error sent an automated error-recovery mechanism out of control, leaving many computers “stuck” in recovery mode.
Amazon Web Services (AWS) had been attempting to upgrade capacity in one “availability zone” of its network in Northern Virginia.
These availability zones exist in every region, "with information spread across several zones in an effort to protect against data loss or downtime", CNN Money reports.
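The redundancy CNN Money describes can be sketched in a few lines. This is a purely illustrative toy model, not Amazon's actual implementation; the zone names and the two-copy rule are assumptions made for the example.

```python
# Illustrative sketch of cross-zone replication: data written to several
# availability zones survives the failure of any one zone.
# Zone names and the two-copy success rule are hypothetical.

class ZoneStore:
    """In-memory stand-in for one availability zone's storage."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True

def replicated_write(zones, key, value):
    """Write to every online zone; succeed if at least two copies land."""
    copies = 0
    for zone in zones:
        if zone.online:
            zone.data[key] = value
            copies += 1
    return copies >= 2

def read_any(zones, key):
    """Read from the first online zone that holds the key."""
    for zone in zones:
        if zone.online and key in zone.data:
            return zone.data[key]
    return None

zones = [ZoneStore("us-east-1a"), ZoneStore("us-east-1b"), ZoneStore("us-east-1c")]
replicated_write(zones, "user:42", "profile-data")
zones[0].online = False              # one zone fails, as on April 21
print(read_any(zones, "user:42"))    # the data survives in another zone
```

Because each write lands in several zones, losing one zone leaves the data readable from the others, which is the protection against data loss and downtime the report refers to.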
The upgrade required redirecting traffic within Amazon's primary network, but the traffic was accidentally sent to a backup network instead.
This secondary network was not built to handle the flood of traffic, which upset the system “by clogging it up and cutting out a bunch of storage nodes from the network”.
“The traffic shift was executed incorrectly,” Amazon said.
As Amazon sorted out the traffic flow, a fail-safe triggered: storage volumes got “stuck” trying to back up their data somewhere.
That set a “re-mirroring storm” in motion, consuming all available storage space.
The height of the glitch saw nearly 13 percent of the availability zone’s volumes stuck.
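The dynamic behind the “re-mirroring storm” can be shown with a toy calculation. The numbers below are invented for illustration and have no connection to Amazon's real capacity figures; the point is only that when many volumes lose their mirrors at once, their simultaneous search for new replica space swamps the spare capacity that would easily absorb a handful of failures.

```python
# Toy model of a "re-mirroring storm": stuck volumes each need one free
# slot to re-create a mirror. Figures are made up for illustration only.

def re_mirror(stuck_volumes, free_slots):
    """Each stuck volume claims one free slot if any remain.

    Returns (volumes that recovered, volumes still stuck).
    """
    recovered = min(stuck_volumes, free_slots)
    return recovered, stuck_volumes - recovered

# Normal operation: a few isolated failures recover immediately.
print(re_mirror(stuck_volumes=5, free_slots=100))     # (5, 0)

# Storm: thousands fail at once and exhaust the spare capacity,
# so most volumes stay stuck -- roughly what trapped 13 percent
# of the zone's volumes at the height of the glitch.
print(re_mirror(stuck_volumes=5000, free_slots=100))  # (100, 4900)
```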
Amazon claims that customers who ran computing tasks across multiple zones remained mostly unaffected, though the error also made switching between zones difficult.
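Why multi-zone customers fared better comes down to simple failover: requests route around a stuck zone to a healthy one. The sketch below is hypothetical; the zone names and health map are assumptions, not AWS APIs.

```python
# Hypothetical failover sketch: a request is served by the first zone
# that reports healthy, so a single stuck zone does not take the site down.

def handle_request(request, zones, healthy):
    """Route a request to the first healthy zone in the list."""
    for zone in zones:
        if healthy.get(zone, False):
            return f"served {request} from {zone}"
    raise RuntimeError("all zones down")

zones = ["us-east-1a", "us-east-1b"]
healthy = {"us-east-1a": False, "us-east-1b": True}  # 1a is "stuck"
print(handle_request("GET /home", zones, healthy))   # served from us-east-1b
```

The catch the article notes is that during the outage this switch itself became difficult, blunting the protection for some customers.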
Amazon is still working to restore some of the computers that crashed that day.
The error revealed “many opportunities to protect the service against any similar event reoccurring,” Amazon wrote.
Amazon is making changes to prevent such errors in the future. It has also promised to be more “forthcoming.”
“The trigger for this event was a network configuration change… We will audit our change process and increase the automation to prevent this mistake from happening in the future,” the company assured.
“In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications.”
AWS contributes only a small percentage of the retailer's total revenue, but Amazon eyes greater heights for the service, “which rents out computer time by the hour.”