Computer News No. 117  Sep.-Oct. 2005

A Review of SAN Storage Failure on September 15, 2005

    1. Summary of the SAN Failure Incident
    2. Recommendations for Future Improvement 

1. Summary of the SAN Failure Incident

On September 15, 2005, at around 0:42 a.m., the SAN (Storage Area Network) system that serves a number of central servers failed.  The failure was caused by a very rare event: two disk units in the same RAID (Redundant Array of Independent Disks) group of the system failed in succession within an interval of two minutes.  A RAID5 group can tolerate the loss of only one disk at a time, so the second failure left the data on that group unrecoverable from the remaining disks.  According to HP, the equipment vendor for this SAN, the incident had hit the technology limit of storage systems based on RAID5, the technology that the Computer Centre had implemented for its disk storage systems.
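To illustrate why two disk failures in one RAID5 group are fatal while a single failure is not, the following is a minimal sketch of XOR parity over one stripe.  It is not HP's implementation: the four-disk layout, the block contents and the names xor_blocks and rebuild are assumptions made for illustration, and a real RAID5 array rotates the parity block across the disks from stripe to stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across a hypothetical 4-disk group: three data blocks plus parity.
data = [b"mail", b"www.", b"grad"]
parity = xor_blocks(data)
stripe = data + [parity]


def rebuild(stripe, failed):
    """Rebuild the stripe given the set of failed disk indexes.

    Returns the rebuilt stripe, or None if it cannot be recovered.
    """
    if len(failed) > 1:
        # A single parity equation cannot solve for two unknown blocks.
        return None
    survivors = [blk for i, blk in enumerate(stripe) if i not in failed]
    rebuilt = list(stripe)
    for i in failed:
        # The missing block is the XOR of all surviving blocks.
        rebuilt[i] = xor_blocks(survivors)
    return rebuilt
```

Calling rebuild(stripe, {1}) recovers the lost block from the three survivors, whereas rebuild(stripe, {0, 2}) returns None.  This is the situation the SAN was in once the second disk failed before the first could be rebuilt.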

Several servers that used storage on the SAN system failed as a result.  The affected servers included those supporting the University's email and www services, namely hkusua.hku.hk, hkucc.hku.hk, www.hku.hk, graduate.hku.hk and extranet.hku.hk.  Services on these systems were affected extensively, ranging from a partial outage of 13 hours for hkucc.hku.hk to a total outage of 34 hours for www.hku.hk.

While the equipment engineers took about 12 hours to repair the SAN storage, it took the Computer Centre much longer to restore all the data files and email files from backup tapes, owing to the large volume of disk storage involved and the lack of equipment for supporting fast data recovery.  It took the Computer Centre over 5 days to complete the restoration of the latest available data files for the affected user accounts, including 1,700 of the 7,000 accounts on hkucc.hku.hk, 5,500 of the 32,000 accounts on hkusua.hku.hk and all of the 40,000 accounts on graduate.hku.hk.

Faced with such widespread unavailability of disk storage, the Computer Centre took immediate action to minimize the impact on users of the affected systems:

  • setting up a temporary web server to keep users informed of the system recovery progress and provide the necessary homepage for users to gain access to the unaffected HKU Portal and Student Connect services,
  • mobilizing staff resources in information dissemination and enquiry answering, and
  • escalating requests for urgent technical support to senior levels of the related hardware and software vendors, namely HP and Veritas.

Our systems staff acted speedily and professionally to recover the affected services at the earliest possible time, staying in the computer room for over 40 hours before they could take a rest.

The recovery process was lengthy because a software bug in the operating system of the hkucc.hku.hk, www.hku.hk and graduate.hku.hk systems slowed down the data restoration significantly.  Even without the software bug, however, recovering all the corrupted data from the backup tapes would have taken a long time, and this is an area that we must consider for improvement.  In addition, insufficient staffing in the Centre's Systems team contributed to the fact that the www.hku.hk and graduate.hku.hk systems could only be recovered after an outage of more than one day, and that the latest version of data from backup tapes could only be restored on September 20, 2005.

2. Recommendations for Future Improvement

As an interim measure, HP has been requested to monitor the behaviour of the SAN storage and its RAID groups more closely, so that any inherent defects of the equipment can be exposed and fixed early, before similar problems recur.  As a near-term measure, the existing configuration of the SAN storage systems should be reviewed and enhanced to provide the capacity and capability needed for carrying out full-scale data recovery from backup tapes regularly.  As a medium-term measure, implementation of a real-time data replication function, which would enable much faster storage recovery, would be considered.  As a long-term measure, setting up a disaster recovery site, with the necessary server, storage and network equipment installed for quick recovery of the mission-critical systems and services, would be considered.

The Computer Centre would like to apologize to University members for all the inconvenience caused by this unfortunate incident.