This is an excerpt from Chapter 1 of Oracle Database 10g Real Application Clusters Handbook by K. Gopalakrishnan, copyright 2007 from Oracle Press, a division of McGraw-Hill.
In today's super-fast world, data and application availability can make or break a business. With access to these businesses granted via the ubiquitous and "always-on" Internet, data availability is an extremely important component in any business function.
Database systems are growing at an enormous rate in both the number of simultaneously connected, active users and the volume of data they handle. Even though the servers used to store huge, active databases have also improved in performance and capacity, a single server, however powerful, may not be able to handle the database load and capacity requirements, making it necessary to scale the processing load or scale the hardware or software to accommodate these requirements.
When availability is everything for a business, extremely high levels of disaster tolerance must allow the business to continue in the face of a disaster, without the end users or customers noticing any adverse consequences. Global operations that span time zones and run 24 × 7 × forever, e-commerce, and the risks associated with the modern world all drive businesses to achieve a level of disaster tolerance capable of ensuring continuous survival and profitability.
Different businesses tolerate different levels of risk with regard to loss of data and potential downtime. A variety of technical solutions can be used to provide various levels of protection with respect to these business needs. The ideal solution would have no downtime and allow no data to be lost. Such solutions do exist, but they are expensive, and their costs must be weighed against the potential impact of a disaster on the business.
Computers are working faster and faster, and the businesses that depend on them are placing more and more demands on them. The various interconnections and dependencies in the computing fabric, which consists of different components and technologies, are becoming more complex every day. The availability of worldwide access via the Internet is placing extremely high demands on businesses and on the IT departments and administrators that run and maintain these computers in the background.
Adding to this complexity is the globalization of businesses, which leaves no "quiet time" or "out of office hours" of the kind that maintenance of these computer systems depends on. Hence, businesses' computer systems -- the lifeblood of the organization -- must be available at all times: day or night, weekday or weekend, local holiday or workday. The term 24 × 7 × forever effectively describes business computer system availability and is so popular that it is used in everyday language to describe non-computer–based entities such as 9-1-1 call centers and other emergency services.
The dictionary defines the word available as follows: 1) Present and ready for use; at hand; accessible. 2) Capable of being gotten; obtainable. 3) Qualified and willing to be of service or assistance. When applied to computer systems, the word's meaning is a combination of all these factors. Thus, access to an application should be present and ready for use, capable of being accessed, and qualified and willing to be of service. In other words, an application should be readily available for use at any time and should perform at a level that is both acceptable and useful. Although this is a broad, sweeping statement, a great deal of complexity and many different factors come into play when availability is the goal.
The term high availability (HA), when applied to computer systems, means that the application or service in question is available all the time, regardless of time of day, location and other factors that can influence the availability of such an application. In general, it is the ability to continue a service for extremely long durations without any interruptions. Typical technologies for HA include redundant power supplies and fans for servers, RAID (redundant array of inexpensive/independent disks) configuration for disks, clusters for servers, multiple network interface cards and redundant routers for networks.
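To make the effect of redundancy concrete, the following is a minimal sketch of the arithmetic, not something taken from the book. It assumes that redundant components fail independently of one another, which real systems only approximate, and the function names and the 99 percent figure are purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years for simplicity


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` identical redundant components.

    Assumption (not from the text): failures are independent, so the group
    is unavailable only when every copy is unavailable at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical component that is available 99% of the time.
    for copies in (1, 2, 3):
        a = redundant_availability(0.99, copies)
        print(f"copies={copies} availability={a:.6f} "
              f"downtime={expected_downtime_hours(a):.2f} h/year")
```

Under these assumptions, each added copy multiplies the chance of a total failure by the failure probability of a single copy, which is why redundant power supplies, disks, network cards and servers appear in the list of typical HA technologies.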
A fault-tolerant computer system or component is designed so that, in the event of component failure, a backup component or procedure can immediately take its place with no loss of service. Fault tolerance can be provided with software, embedded in hardware, or provided by some combination of the two. It goes one step further than HA to provide the highest possible availability within a single data center.
Disaster recovery (DR) is the ability to resume operations after a disaster -- including destruction of an entire data center site and everything in it. In a typical DR scenario, significant time elapses before a data center can resume IT functions, and some amount of data typically needs to be re-entered to bring the system data back up to date.
Disaster tolerance (DT) is the art and science of preparing for disaster so that a business is able to continue operation after a disaster. The term is sometimes used incorrectly in the industry, particularly by vendors who can't really achieve it. Disaster tolerance is much more difficult to achieve than DR, as it involves designing systems that enable a business to continue in the face of a disaster, without the end users or customers noticing any adverse effects. The ideal DT solution would result in no downtime and no lost data, even during a disaster. Such solutions do exist, but they cost more than solutions that have some amount of downtime or data loss associated with a disaster.
Planned and unplanned outages
So what happens when an application stops working or stops behaving as expected, due to the failure of even one of the crucial components? Such an application is deemed down and the event is called an outage. This outage can be planned for -- for example, consider the outage that occurs when a component is being upgraded or worked on for maintenance reasons.
While planned outages are a necessary evil, an unplanned outage can be a nightmare for a business. Depending on the business in question and the duration of the downtime, an unplanned outage can result in such overwhelming losses that the business is forced to close. Regardless of the nature, outages are something that businesses usually do not tolerate. There is always pressure on IT to eliminate unplanned downtime totally and drastically reduce, if not eliminate, planned downtime. We will see later how these two requirements can be effectively met for at least the Oracle database component.
Note that an application or computer system does not have to be totally down for an outage to occur. It is possible that the performance of an application degrades to such a degree that it is unusable. In this case, although the application is accessible, it does not meet the third and final qualification of being willing to serve in an adequately acceptable fashion. As far as the business or end user is concerned, this application is down, even though it is still accessible. We will see later in this book how Oracle Real Application Clusters (RAC) can provide the horizontal scalability that can significantly reduce the risk of an application not providing adequate performance.
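The idea that a slow application counts as down can be expressed as a simple check. The sketch below is illustrative only: the `ProbeResult` type, the `is_effectively_available` function and the two-second threshold are hypothetical names and values chosen for the example, not part of any Oracle tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of one hypothetical health probe against an application."""
    reachable: bool
    response_seconds: Optional[float] = None  # None when no response arrived


def is_effectively_available(probe: ProbeResult,
                             max_acceptable_seconds: float = 2.0) -> bool:
    """Treat the application as available only if it answers fast enough.

    An application that is reachable but slower than the threshold is
    counted as down, matching the end user's point of view.
    """
    if not probe.reachable or probe.response_seconds is None:
        return False
    return probe.response_seconds <= max_acceptable_seconds


if __name__ == "__main__":
    samples = [
        ProbeResult(reachable=True, response_seconds=0.3),   # healthy
        ProbeResult(reachable=True, response_seconds=45.0),  # accessible but unusable
        ProbeResult(reachable=False),                        # hard outage
    ]
    for sample in samples:
        print(sample, "->", is_effectively_available(sample))
```

The threshold is a business decision rather than a technical constant; what matters is that monitoring measures acceptable service and not mere reachability.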
About the author
K. Gopalakrishnan is a senior principal consultant with the Advanced Technology Services group at Oracle Corp., specializing in performance tuning, high availability and disaster recovery. He is a recognized expert in Oracle RAC and database internals and has more than a decade of Oracle experience backed by an engineering degree in computer science and engineering from the University of Madras, India. Gopalakrishnan was awarded an Editor's Choice award for Oracle Author of the Year by Oracle Magazine in 2005. He was also chosen as an Oracle ACE by Oracle Technology Network in 2006.