Real stories about disasters, major failures and near disasters in IT

Last update June 27 2012 with info on RBS computer glitch

Many organizations are not well prepared for a significant failure of part or all of their IT infrastructure. Every organization makes backups, but a backup is only useful when it can be restored to data that is complete and up to date. You will not be the first to find out that a backup is worthless at the moment you really need it.
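One way to find out whether a backup is usable before you need it is to restore it to a scratch location and compare the result with the source. The sketch below is only an illustration of that idea: the function names `hash_tree` and `verify_restore` are my own, and it assumes the data consists of ordinary files small enough to read into memory.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file at once; fine for a sketch, not for huge files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[path.relative_to(root).as_posix()] = digest
    return digests


def verify_restore(source_root, restored_root):
    """Compare a test restore with the source.

    Returns (missing, changed): files absent from the restore and files
    whose content differs from the source.
    """
    source = hash_tree(source_root)
    restored = hash_tree(restored_root)
    missing = sorted(name for name in source if name not in restored)
    changed = sorted(
        name for name in source
        if name in restored and source[name] != restored[name]
    )
    return missing, changed
```

A restore test passes when both returned lists are empty. Files that changed on the source after the backup was taken will show up as changed, so such a check is best run against a snapshot or a quiet dataset.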

Many organizations think the chance of a major disruption is low. Still, I hear a lot of stories about disasters and near disasters. In most cases the cause of the disaster was not a natural one like fire, earthquake or flooding. Almost all causes were related to human error or software/hardware failures.

The image below is taken from an EMC survey on data losses: http://netherlands.emc.com/about/news/press/2011/20111123-01.htm

To give an idea of how easily a (near) disaster can hit your datacenter, I collected some real stories about disasters.

1. Failure of Active Directory because of an NTP time server misconfiguration.
2. Complete shutdown of a datacenter because of a power failure and failure of redundant systems.
3. Loss of hundreds of Active Directory accounts because of a failure in a script.
4. Failure of a RAMsan because of a firmware fault and an out-of-date update process.
5. Loss of tens of servers because an electrician put 380V on a 220V circuit.
6. Near loss of a storage array because someone put a plastic cover over it to protect it against dust from drilling in the server room.
7. Failure of a SAN with a huge impact on State of Virginia services.
8. Failure of internet banking at Dutch bank ING because of human error.
9. Failure of HP EVA storage because of firmware issues at Gemeente Groningen.
10. Disaster at Revlon datacenter in Venezuela due to fire.
11. Hospital in Maastricht needs to cancel operations due to server failure.
12. Tech fault at RBS and NatWest freezes millions of UK bank balances.
13. Hardware failure at ApplicationNet causes 1000s of unavailable desktops.
14. Fire in a French datacenter.
15. University of Twente fire in datacenter.

Failure of Active Directory because of an NTP time server misconfiguration.
A company used an external NTP time server for its internal systems, including Active Directory. One day the external time server suddenly jumped one year ahead, and nobody noticed. When someone managing the time server spotted the error, he or she set the time back by one year. In the meantime the company's Active Directory had synchronized with the wrong time, so objects in AD had received timestamps one year in the future. It took Microsoft a lot of effort to fix this because the backups were not correct, and a lot of production time was lost.
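A story like this suggests never trusting a single time source blindly. As a rough sketch of the idea (not how NTP or Active Directory actually implement it), the hypothetical function below compares readings from several sources and only accepts those close to the median. The name `plausible_sources` and the five-minute tolerance are my own assumptions.

```python
from datetime import timedelta
from statistics import median


def plausible_sources(readings, max_offset=timedelta(minutes=5)):
    """Return the names of time sources that agree with the majority.

    readings maps a source name to a timezone-aware datetime reported by
    that source. A source is accepted when its time lies within max_offset
    of the median of all readings.
    """
    if len(readings) < 3:
        # With fewer than three sources there is no majority to compare with.
        raise ValueError("need at least three time sources to cross-check")
    mid = median(t.timestamp() for t in readings.values())
    limit = max_offset.total_seconds()
    return sorted(
        name for name, t in readings.items()
        if abs(t.timestamp() - mid) <= limit
    )
```

With three sources of which one is a year ahead, the two sane sources are returned and the faulty one is left out, so its time would never reach the domain.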

Complete shutdown of a datacenter because of power failure and failure of redundant systems.
A hospital had a power generator for its server room as well as a general power generator for the hospital. Both depended on one particular circuit. When the electricity went down, the primary power generator kicked in but failed. The power generator of the server room failed as well. The UPS in the server room had a capacity of 30 minutes, and there was no shutdown script to take care of a clean shutdown of the servers.
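With 30 minutes of UPS capacity, a shutdown script needs to know when it has to start at the latest. The sketch below works that out for servers that are shut down one after another. The function name `plan_shutdown`, the per-server shutdown durations and the five-minute safety margin are all assumptions for illustration; a real script would also have to trigger the shutdowns.

```python
def plan_shutdown(ups_runtime_min, servers, safety_margin_min=5):
    """Plan a sequential clean shutdown that fits within the UPS runtime.

    servers is a list of (name, shutdown_minutes) in the order in which the
    servers must go down. Returns a list of (name, start_minute), where
    start_minute counts from the moment the UPS takes over.
    """
    total = sum(minutes for _, minutes in servers)
    latest_start = ups_runtime_min - safety_margin_min - total
    if latest_start < 0:
        raise ValueError("UPS runtime is too short for a clean shutdown")
    schedule = []
    offset = latest_start
    for name, minutes in servers:
        schedule.append((name, offset))
        offset += minutes
    return schedule
```

For example, with a 30-minute UPS, an application server that needs 5 minutes and a database server that needs 10, the first shutdown has to begin no later than 10 minutes after the power fails.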

Loss of hundreds of Active Directory accounts because of a failure in a script
A system administrator was testing a script. Instead of running it against a test OU, he ran it against all OUs. The script deleted hundreds of user accounts in Active Directory.
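Mistakes like this are easier to survive when a destructive script refuses to work outside an explicitly named scope and defaults to a dry run. The sketch below shows that pattern in general terms. It does not talk to Active Directory: `delete_accounts`, its parameters and the `delete` callback are hypothetical names, and accounts are represented by their distinguished names as plain strings.

```python
def delete_accounts(dns, allowed_ou, delete, dry_run=True, max_deletions=10):
    """Delete accounts only if every one of them lies inside allowed_ou.

    dns is a list of distinguished names, allowed_ou is the distinguished
    name of the only OU the script may touch, and delete is a callable that
    performs the real deletion of one account. Returns the list of names
    that were actually deleted (empty for a dry run).
    """
    suffix = "," + allowed_ou.lower()
    outside = [dn for dn in dns if not dn.lower().endswith(suffix)]
    if outside:
        raise ValueError(f"{len(outside)} account(s) outside {allowed_ou}")
    if len(dns) > max_deletions:
        raise ValueError(f"refusing to delete more than {max_deletions} accounts")
    if dry_run:
        # Nothing is touched until the caller explicitly passes dry_run=False.
        return []
    for dn in dns:
        delete(dn)
    return list(dns)
```

The two checks run before anything is deleted, so a script pointed at the wrong OU stops with an error instead of removing the first few accounts it finds.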

Failure of a RAMsan because of a firmware fault and an out-of-date update process
A RAMsan uses memory to store data. To provide redundancy, memory modules work in pairs, like RAID 1 for hard disks: data is written to two memory modules, and if one module fails, the other takes over. Because of a firmware error, however, the failover did not happen. The result was several hours of storage downtime while this was fixed.

Virginia fights computer failures
Apparently, a supposedly high-end, enterprise-class EMC Symmetrix DMX storage system supporting 26 different state agencies in Virginia crashed on August 25th 2010. More than a week later, many of those agencies were still down, including the Department of Motor Vehicles and the Department of Taxation and Revenue.
Read here and here about the major disaster that happened in the state of Virginia.

Several failures in internet banking at Dutch bank ING
ING suffered several severe problems that left customers unable to use internet banking for several hours. According to ING, this was caused by human error. Read more here (in Dutch).

Failure of HP EVA storage because of firmware issues at Gemeente Groningen
The IT infrastructure of Gemeente Groningen (the municipality of Groningen, a city in the Netherlands) had serious issues when firmware bugs troubled its HP EVA SAN. Read more about this issue here, or in Dutch here.

Disaster at Revlon datacenter in Venezuela due to fire
Revlon’s disaster recovery plan was put to the test in May 2011 when one of its factories in Venezuela burned to the ground. “The data center phoned home to say, ‘I’m getting hot,’” Giambruno said of the automated warning he received. Within two hours, the data center workloads from Venezuela were electronically moved to another Revlon facility in Edison, N.J., and most of that time was spent trying to track down a critical IT manager who was “on a beach” somewhere.

Using technology from F5 Networks, he explained, Revlon made the connection from Venezuela to New Jersey and directed another vendor, Infoblox, which manages Internet Protocol (IP) addresses, to move the server IP addresses in the burning building to the new location.

“We spun up the servers, made sure all the replication of the data was right, and then literally told Infoblox … ‘All that connectivity, move it,’” Giambruno said. The last step was to e-mail all the employees who had been accessing the Venezuelan data center a link to a Riverbed Mobile Client from Riverbed Technology so they could access a virtual desktop image of their application from Edison and resume their work.
source: informationweek.com

Hospital in Maastricht, the Netherlands, needs to cancel operations due to server failure
Academisch Ziekenhuis Maastricht had to cancel 18 operations and 200-300 appointments because a part of one of its most important servers burned out. The backup server was not able to take over most of the functionality. See this article (in Dutch) on Automatiseringsgids.nl

Major computer outage at British banks
NatWest, RBS and Ulster Bank branches are opening longer this weekend to help hundreds of thousands of customers affected by a computer glitch. (June 2012).

The computer glitch at the Royal Bank of Scotland  which left millions of customers unable to access their accounts could have been caused by just one junior technician in India, it was suggested last night.

The “inexperienced operative” accidentally wiped information during a routine software upgrade, it has been claimed. Read more at Yahoo.com
The Register has more information in an article titled “RBS collapse details revealed: Arrow points to defective part”, subtitled “Software that caused cockup apparently run from India”.
See also: RBS may sue CA over software glitch that hit Ulster Bank and NatWest

Hardware failure at ApplicationNet causes 1000s of unavailable desktops

Dutch PostNL outsourced the management of 13000 workplaces to ApplicationNet, which offers a cloud service. At the beginning of October 2012 a major hardware failure in the ApplicationNet datacenter made all virtual desktops (some 1000s) in the PostNL headquarters unavailable for three days. The fault was in the provider's Storage Area Network. Although PostNL states that a recovery option was available, it was not used for some reason. The exact cause is not known.

Fire in a French datacenter used by ADP for salary administration

A fire in a French datacenter left 1 million Dutch people unable to view their leave balances and salary slips. More info (in Dutch) here.

Fire in the datacenter of Twente University (2002)
More information here.

Employee of Unit4 sets fire to server room on his last working day.
An employee of Unit4 deliberately set fire to the server room on his last working day and reported the fire himself. By doing so he wanted to impress his boss so that his contract would be extended. More info here.

Nuon datacenter out of service because a fire extinguishing system went off.
Powder from the fire extinguishing system damaged the servers and SAN of the Nuon datacenter in Amsterdam in 2007. The services had to be failed over to a datacenter in Groningen. An aerosol extinguishing agent is believed to have been used in this case.
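Some links with more information (in Dutch):
Nuon stelt poederspuitend datacentrum buiten gebruik (Nuon takes powder-spraying datacenter out of service) [27-5-2007]
Nuon datacentrum blijft buiten gebruik door bluspoeder vervuiling (Nuon datacenter remains out of service due to extinguishing powder contamination)
Rechtspraak.nl
webwereld article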

About Marcel van den Berg
I am a technical consultant with a strong focus on server virtualization, desktop virtualization, cloud computing and business continuity/disaster recovery.
