The following is based upon a real-life experience in a Tier 1 bank. We used Sybase tools such as ASE and Replication Server, but this case applies equally to other engines, such as Oracle with Data Guard.
When I was working last year for a large bank, we had a disastrous experience that was saved by replication. Now I'm convinced that any hardware-based solution such as clustering is nowhere near as good as a software-based solution such as replication. Why bother spending large sums of money on hardware clustering when you can achieve the same results with a bit of ingenuity and thinking? The notes below will make my points a bit clearer.
I have always been fascinated by Peter Thawley's article entitled "High Availability in the Internet Age," published in the International Sybase User Group (ISUG) journal in Q3 1999. The author brought up the downtime that eBay suffered as a result of a hardware fault that rendered their Oracle database corrupt. It took 24 hours for the Oracle database to be restored, and in the meantime the business lost a lot of income. Regardless of what caused the downtime, one would have expected the architects and DBAs to account for such failures. Although failures due to hardware problems are relatively rare thanks to redundancy and mirroring, they can and do happen.
The bank I worked for had a trading system that had been running for two years. It had relatively large databases on the order of
Within two weeks of being on site we managed to create a BC system using Sybase replication, capable of handling the above criteria. By mid-November, we put the system into production and the client ran some initial tests over a weekend that established that replication was working in a timely manner and the trades were reaching the BC site. What is interesting is the rest of the story.
In the afternoon of a fateful business day, disaster struck in the form of a hardware problem on the data server. The Sybase error log started emitting the following:
00:00000:00000:2003/xx/xx xx:xx:xx.xx kernel  sddone: write error on virtual disk 2 block 8601952:
00:00000:00000:2003/xx/xx xx:xx:xx.xx kernel  sddone: I/O error
09:00000:00757:2003/xx/xx xx:xx:xx.xx server  Error: 694, Severity: 24, State: 10
09:00000:00757:2003/xx/xx xx:xx:xx.xx server  An attempt was made to read logical page '11311459', virtpage '42156387' from virtual device '2' for object '1723869208' in database '5'. The page was not read successfully. You may have a device problem or an operating system problem.
00:00000:00006:2003/xx/xx xx:xx:xx.xx server  Error: 823, Severity: 24, State: 2
00:00000:00006:2003/xx/xx xx:xx:xx.xx server  I/O error detected during wait for BUF pointer = '0xb8b3d5a0', MASS pointer = '0xb8b3d5a0', (Buf#: '0'), page ptr = '0x9801c000', dbid = '5', Mass virtpage = '42156384', Buffer page = '11311456', Mass status = '0x46891100', Buffer status = '0x1', size = '16384', cache (id: 0) = 'default data cache'.
09:00000:00757:2003/xx/xx xx:xx:xx.xx kernel  ************************************
ASE was running on an E4500 server with two Sun A5200 arrays for data storage. The problem was that an entire array was taken out of action as a result of a single disk failure! You may argue the reasons for this; however, it appears to be an inherent design fault of this type of array. (Three specific disk slots on the A5200 must be populated. These disks are used to re-drive the FC-AL signal, and if one of them fails and the signal does not get around the loop, the whole array can fail.) As we had two arrays, one would have expected the mirror array to save us. Unfortunately, that didn't happen in this case.
The problem was that although only one of the two storage arrays was taken out of action, a Sybase device was residing on a volume that had both the underlying damaged disk and its mirror on the same array. This in turn was traced to a Veritas hot-swap disk replacement a few months before, which had placed the replacement disk in the same array as the originally affected disk. As a result of human error, this problem either went undetected or was simply ignored. So when the single array failed in the middle of writing data, a hole appeared in the trading database.
The point that I am trying to make is about the speed with which we managed to recover. We checked the BC database and the last trade entry there was recorded at two minutes before the crash. The counters showed that this particular trade entry was indeed the last trade to go into the trading system. In this scenario all we did was swap the interfaces files and ask the users to start their applications again. In effect we carried out a full test of the BC database and it turned out to be fine, so that was the sign-off!
The fundamental question is whether or not hardware redundancy in the form of hardware-level disk mirroring can ever provide complete protection for databases. Many organizations use a form of local-area (traditional disk mirroring) or wide-area (Storage Area Network) hardware mirroring to create redundancy and resiliency. EMC's SRDF, for example, is an extended-distance replication solution at the hardware level. As an alternative, software replication can be deployed. I tend to argue in favor of software solutions for the reasons explained below.
As a practitioner, I have always believed that software high availability (HA) solutions in the form of data replication (albeit asynchronous) have an edge over hardware-based disk-to-disk mirroring. My argument is based on the axiom that although databases are built on file systems or raw partitions, they fundamentally behave differently from pure file systems. Indeed, it is a simplistic view to assume that disk-to-disk, bit-by-bit mirroring will be sufficient protection for your data.
HA disk-to-disk mirroring in its most common form is designed to make a bit-by-bit copy of data from the primary disk to the mirror. As long as the distance between the two disk locations is not great and a fast connection such as Gigabit Ethernet exists between the two sites, the writes will be almost simultaneous. At first glance such solutions may look ideal; they can be almost off-the-shelf. Apply the solution using local or remote clustering and we have high availability of data.
In reality, databases are more complicated than a simple matter of bit-by-bit copying. A database engine performs a lot of tasks through background processes in a process-based RDBMS like Oracle, or through internal threads in a thread-based model like Sybase. These processes routinely perform tasks such as checkpointing and flushing of transaction logs and data that are necessary for maintenance of the database but in reality have no bearing on the availability of data. In other words, most of this form of data need not be copied or replicated. In comparison, in HA designs based on data replication, this redundant data is never replicated because it is not deemed necessary for the BC database. This brings us to the point of how software replication is different from hardware mirroring.
Replication in its most common form moves transactions (inserts, updates and deletes) at the table level from a source database to one or more destination databases. Compared to hardware mirroring, which is basically all or nothing, replication can:
- Move data from one source to another.
- Move only a subset of data from source to destination. So you can subscribe to a subset of data; i.e., replicate selected tables.
- Allow manipulation/transformation of data when moving from source to destination.
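The capabilities above can be sketched in a few lines of code. The following is a hypothetical, much-simplified illustration of table-level logical replication (it is not Replication Server itself, and all names in it are made up for this sketch): transactions are read from a source log, only subscribed tables are applied to the destination, and rows can be transformed in flight.

```python
# Hypothetical sketch of table-level logical replication.
# Nothing here is Sybase API; all names are illustrative only.
from dataclasses import dataclass

@dataclass
class Txn:
    table: str   # table the operation targets
    op: str      # 'insert', 'update' or 'delete'
    row: dict    # column -> value

def replicate(log, destination, subscribed, transform=None):
    """Apply only subscribed tables' transactions, optionally transformed."""
    for txn in log:
        if txn.table not in subscribed:       # subscribe to a subset of tables
            continue
        row = transform(txn.row) if transform else txn.row
        destination.setdefault(txn.table, [])
        if txn.op == "insert":                # updates/deletes omitted for brevity
            destination[txn.table].append(row)

source_log = [
    Txn("trades", "insert", {"id": 1, "amount": 100.0, "ccy": "GBP"}),
    Txn("audit",  "insert", {"id": 7, "note": "checkpoint"}),  # not subscribed
]
bc_site = {}
replicate(source_log, bc_site, subscribed={"trades"},
          transform=lambda r: {**r, "ccy": r["ccy"].upper()})
```

Note how the internal "audit" traffic never reaches the BC site: only the subscribed business data is moved, which is precisely what a bit-by-bit mirror cannot do.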
This points out another benefit of using replication technology for availability. Often customers use disk mirroring and disk snapshot technologies to maintain a hot database, often at a relatively high cost. These technologies do a good job with physical problems like a bad block on a disk. However, if there is data corruption caused by a software error in the I/O device driver or DBMS layer, that corruption will simply be propagated to the remote device! Since Sybase Replication Server operates at the logical level by turning transactions back into SQL operations against the replicate server(s), this risk is entirely avoided.
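The contrast can be made concrete with another hypothetical sketch (again, illustrative names only). A block-level mirror copies whatever bytes are on the primary disk, corrupt or not, whereas a logical replica is rebuilt from the stream of SQL operations and never sees the bad bytes.

```python
# Hypothetical sketch contrasting block-level mirroring with logical replay.
# A corrupted page is copied verbatim by a bit-for-bit mirror, while a
# logical replica reconstructed from operations never contains the bad bytes.

def mirror_blocks(primary_disk):
    """Bit-for-bit copy: faithfully reproduces any corruption."""
    return [bytes(block) for block in primary_disk]

def replay_sql(operations):
    """Logical replay: rows are rewritten from the operations, not the disk."""
    table = []
    for op, row in operations:
        if op == "insert":
            table.append(row)
    return table

good_page = b"trade id=1 amount=100"
corrupt_page = b"\x00\xff garbage \x00"   # e.g. a driver bug wrote junk
primary_disk = [good_page, corrupt_page]

mirror = mirror_blocks(primary_disk)       # corruption propagated to the mirror
replica = replay_sql([("insert", {"id": 1, "amount": 100})])
```

The mirror ends up holding the same junk page as the primary; the logical replica holds only the rows the transactions described.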
This was first published in May 2004