Every DBA knows the importance of a good backup and recovery plan, but finding one that actually works isn't so...
For weeks, Craig MacPherson, a database team leader for a Canadian energy company, studied the details of a strategy that would get the company's databases back up within 24 hours of a disaster, losing no more than 15 minutes of data.
The plan seemed failsafe on paper. Of course, MacPherson would not have been chosen to star in our weekly newsletter dedicated to DBA bloopers had the plan been perfect. He recalled his small oversight, and big headache, for SearchDatabase.com readers.
"I lead the database team for a large energy company, and earlier this year, we were engaged to participate in a disaster recovery project. We were asked to provide for the disaster recovery of 20 large production databases.
The process we designed followed the KISS methodology and worked like this: Force archive log generation every 15 minutes to the disaster recovery (DR) site and supplement that with weekly online hot backups to the DR site. At the DR site, recovery is done by simply bringing up the hot backup copy and rolling the database forward to within 15 minutes of the disaster.
The online backup process to the DR site was automated via a simple Unix shell script that ran over the weekend. Its job was to delete the old database copies at the DR site, fire a remote script in production to provide online hot backups of the production databases, and copy the production online hot backups out to the DR site. Of course, we wrapped the script with all sorts of error checking to ensure the process worked and to alert us on any failures.
We had tested this process for some weeks without any problems. It seemed pretty slick. We could recover all production databases within a 24-hour period to within 15 minutes of the disaster simulation. We were ready to begin a large-scale test of the DR site which included all systems, applications and databases, as well as end users to verify the applications.
Late one Friday afternoon, the network was cut between the DR site and the production site to simulate a disaster. Recoveries were to start on Monday, and application testing was to begin on Tuesday. We were proud of our work and confident we would be able to have the databases recovered and ready for Tuesday morning application testing.
Imagine my horror on Monday morning, when I discovered we had no databases at the DR site. During the weekend, the scripts worked perfectly; they deleted the old backup copies, but of course, they couldn't contact production to create the new ones--because we had broken the connection!
A very obvious and embarrassing error in the original process was instantly revealed. Doh!"
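The flaw comes down to ordering: the old copies were deleted before anyone knew whether new ones could be created. A minimal sketch of a safer sequence follows, in Python rather than the Unix shell the team used, and it is not MacPherson's actual script. The function name `refresh_dr_copy` and the `fetch_backup` callable are hypothetical stand-ins for "run the production hot backup and copy it to the DR site." The idea is to stage the new backup first and discard the old copy only once the new one has fully arrived.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def refresh_dr_copy(dr_dir: Path, fetch_backup: Callable[[Path], None]) -> bool:
    """Replace the backup in dr_dir only after a new one has fully arrived.

    fetch_backup is a hypothetical callable: it writes the new hot backup
    into the directory it is given and raises on any failure, for example
    when production cannot be reached.
    Returns True if the copy was refreshed, False if the old copy was kept.
    """
    # Stage the new backup next to the live copy, never on top of it.
    staging = Path(tempfile.mkdtemp(prefix="staging_", dir=str(dr_dir.parent)))
    try:
        fetch_backup(staging)
        if not any(staging.iterdir()):
            raise RuntimeError("backup step produced no files")
    except Exception as exc:
        # The refresh failed: discard the staging area, keep the old copy.
        shutil.rmtree(staging, ignore_errors=True)
        print(f"ALERT: refresh failed, keeping previous backup: {exc}")
        return False

    # Only now is it safe to retire the previous copy.
    old = dr_dir.with_name(dr_dir.name + ".old")
    if old.exists():
        shutil.rmtree(old)
    if dr_dir.exists():
        dr_dir.rename(old)
    staging.rename(dr_dir)
    shutil.rmtree(old, ignore_errors=True)
    return True
```

With this ordering, a severed link between the two sites makes the weekend refresh fail and raise an alert, but last week's backup is still at the DR site on Monday morning.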
Have your own tale of woe to share? Submit your backup/recovery snafus, tuning disasters and ugly upgrades. Stories of good intentions gone bad, over-ambitious and under-trained newbies, clueless consultants, and even more clueless managers will all be accepted. The submitter of the most amusing or wince-inducing blooper of the month will receive a free copy of Craig Mullins' new book, Database Administration: The Complete Guide to Practices and Procedures. Send your bloopers to us today!