With end users getting accustomed to instantaneous response times, Oracle, more than ever, is challenged to provide continuous availability for its database products. An important tool the folks at Redwood Shores have to help them accomplish that is Oracle Real Application Clusters (RAC).
What is RAC? In a nutshell, it is a software option that allows a single database to be accessed by many Oracle instances running on separate servers. If one server fails, transactions can be redirected to another live server with a minimum of downtime.
Oracle advertises RAC as a cure for many ailments. IT shops can misunderstand such marketing hype, however, and not recognize the cost and benefits of using RAC in a high availability (HA) environment.
Let’s explore some Oracle RAC best practices and in the process shed some light on common mistakes users make when using this cluster-based technology. In this Oracle RAC guide, we’ll take a look at:
- RAC planning best practices
- RAC implementation best practices
- RAC infrastructure considerations
- Hardware architecture and RAC performance
- RAC backup and recovery best practices
- Performance and tuning best practices
One of the most common mistakes with Oracle RAC is misunderstanding its functions and limitations. Oracle Real Application Clusters is used as part of a comprehensive capacity planning strategy, but the technology’s strengths and limitations are not always understood. Here is a list of the most common misperceptions about the technology.
Oracle RAC is ideal for scalability
Even though Oracle Corporation wants you to buy tiny “blade servers” and use their grid computing solution for “horizontal scaling,” that’s not how most shops use RAC. Keep in mind that RAC is only a legitimate scalability option for very large IT shops that need more horsepower than a single server can deliver.
Instead, it’s an Oracle best practice to scale up first, building up within a single server through “vertical scaling.” Only after you have saturated a large server do you need to use RAC to “scale out” the application across multiple servers. Today, a single server’s memory and CPU horsepower can be expanded far beyond what was possible just several years ago, making it easier to add resources than to plunk a new server into the RAC environment. In real-world environments, a single server can handle thousands of transactions per second. Only the world’s largest Oracle databases need to scale out using RAC nodes.
Oracle RAC is a standalone high-availability solution
Remember that RAC only protects you against instance failure, and that’s only one of many components that can cause an unplanned outage. For true continuous availability, we must deploy triple-mirrored disks (with a mean-time-to-failure rate expressed in centuries) and redundant network components.
For complete availability on each RAC server node, you will want multiple host bus adapters, multiple network cards and multiple power sources. Just as you have failover at the instance layer, you need to purchase software that allows the multiple host bus adapter cards to fail over automatically and issue a notification when one has failed.
RAC systems require a cluster interconnect in order to accommodate RAM-to-RAM transfers of data blocks in the RAC cache fusion layer. This interconnect must be very fast, with high bandwidth and low latency. Interconnects include:
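To see why instance failover alone does not deliver continuous availability, here is a rough back-of-the-envelope sketch in Python. The per-component availability figures are illustrative assumptions, not measurements, and the model assumes that components fail independently of one another, which real hardware does not always do.

```python
from math import prod

def redundant(availability: float, copies: int) -> float:
    """Availability of `copies` independent components where any one suffices."""
    return 1.0 - (1.0 - availability) ** copies

def series(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)

# Hypothetical per-component availabilities (assumptions for illustration only).
INSTANCE = 0.99
DISK = 0.99
NETWORK = 0.99

# Two RAC instances, but a single disk path and a single network path.
rac_only = series(redundant(INSTANCE, 2), DISK, NETWORK)

# Two instances plus triple-mirrored disks and dual network paths.
fully_redundant = series(
    redundant(INSTANCE, 2), redundant(DISK, 3), redundant(NETWORK, 2)
)

print(f"RAC only:        {rac_only:.6f}")
print(f"fully redundant: {fully_redundant:.6f}")
```

Under these assumed numbers, the unprotected disk and network layers dominate the result, which is the point of the paragraph above: every layer in the chain needs its own redundancy.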
- Dark fiber: Dense Wavelength Division Multiplexing (DWDM) technology
This cache fusion bottleneck is another reason why RAC scale-out, or horizontal scalability, is problematic. If your cluster interconnect cannot handle the traffic, extra servers will actually degrade your performance instead of helping it. The only way around this problem is to change your entire application to accommodate RAC, or to purchase faster storage such as Solid State Disk.
Oracle RAC ensures fast response time
Response time for transactions is always important, but it’s especially important for RAC databases. This is because of the connection wait-time that is used to detect whether a RAC node, or server, has failed. Consequently, you must plan to ensure that new transactions are serviced in less than one second wall-clock time so that you can set a failover time of two seconds.
Oracle RAC does not need a disaster recovery component
Except in the rare cases where you can deploy Dense Wavelength Division Multiplexing (DWDM) technology, known as dark fiber, you still need to create a disaster recovery solution. Because RAC nodes are normally located within a few miles of each other, a natural disaster like a hurricane would still cause a global outage. So it has become a RAC best practice to also deploy a fast-failover geographical solution like Data Guard -- or better still, n-way Streams.
Now that we understand the planning aspect of best practices, let’s take a closer look at RAC best practices issues after we have implemented our new database.
Oracle RAC implementation best practices
Operational RAC databases follow many of the same best practices as any Oracle database, but there are some that are unique to Oracle RAC systems. First, it’s an important best practice to plan RAC servers in a way that minimizes the geographical distance between the RAC nodes while still keeping them separate, in order to avoid a failure of all nodes.
As a reference, you can take a look at what I wrote on RAC implementation guidelines.
In a busy RAC database, the speed of the server interconnect is critical for fast response times. It’s a commonly accepted best practice to use the fastest possible interconnect, typically a fiber optics solution like dark fiber.
Some shops will place RAC nodes in separate buildings in the same neighborhood, but with the advent of the superfast dark fiber interconnect, you can use “Extended RAC” and place RAC nodes up to 100 miles apart. This allows you to combine high availability with disaster recovery.
Dark fiber is rather expensive, however. To reduce costs, most shops adopt a best practice where they combine RAC with disaster recovery solutions like n-way Streams replication.
The whole point of RAC is to make end users automatically reconnect to a surviving server when one server fails. This is done either at the Web-server level or with the Oracle Transparent Application Failover (TAF) option. Whichever tool you choose, you should wait about three seconds before assuming that the server is dead and re-trying a new RAC server.
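The retry rule above can be sketched at the application level in Python. This is not how TAF itself is configured; it is only a minimal illustration of "wait about three seconds, then try a surviving node." The `connect` callable, the node names, and `fake_connect` are hypothetical stand-ins for whatever database driver and hosts you actually use.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def connect_with_failover(
    nodes: Sequence[str],
    connect: Callable[[str, float], T],
    timeout_seconds: float = 3.0,
) -> T:
    """Try each node in order, giving each one `timeout_seconds` to answer.

    `connect(node, timeout)` is a caller-supplied (hypothetical) function that
    returns a connection or raises an exception, including on timeout.
    """
    errors: List[str] = []
    for node in nodes:
        try:
            return connect(node, timeout_seconds)
        except Exception as exc:  # a real client would catch driver-specific errors
            errors.append(f"{node}: {exc}")
    raise ConnectionError("no surviving RAC node; " + "; ".join(errors))

# Usage with a fake driver in which the first node is down.
def fake_connect(node: str, timeout: float) -> str:
    if node == "rac1":
        raise TimeoutError(f"no answer within {timeout} s")
    return f"session on {node}"

session = connect_with_failover(["rac1", "rac2"], fake_connect)
print(session)
```

Keeping the timeout as a parameter makes it easy to align the application's patience with whatever failover window you have planned for.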
Next, let’s take a closer look at specific RAC technical best practices.
Oracle RAC interconnect best practices
Since RAC is a method in which many instances share the same database, shared data blocks are transferred between the servers over a high-speed interconnect using a mechanism called “cache fusion.” In order to keep performance fast, it’s critical that you pay close attention to the interconnect layer and remember these points:
- RAC likes small block sizes.
- The interconnect must have extremely fast network hardware.
- RAC load balancing is critical to performance.
Oracle RAC node load balancing best practices
I disagree with Oracle’s practice of load balancing using a least-loaded approach because of the overhead it lays on top of the cache fusion layer. In the real world, like-minded end users are directed to the same RAC server. If we have a RAC system with different types of end users, we would want to load balance according to their data needs. For example, customer processing might be on node one, order processing on node two, and product processing on node three. Grouping RAC end users by data needs ensures that cache fusion overhead is minimized.
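As a rough illustration of grouping end users by data needs, the Python sketch below routes each workload to a preferred node and falls back to a survivor only when that node is down. The workload names, node names, and the `route` helper are hypothetical and simply mirror the customer, order and product example above; this is a sketch of the idea, not of any Oracle feature.

```python
from typing import Dict, Set

# Hypothetical mapping of workload to preferred node, following the example in the text.
PREFERRED_NODE: Dict[str, str] = {
    "customer": "node1",
    "order": "node2",
    "product": "node3",
}

def route(workload: str, live_nodes: Set[str]) -> str:
    """Return the preferred node for a workload, or a survivor if it is down."""
    if not live_nodes:
        raise RuntimeError("no live RAC nodes")
    preferred = PREFERRED_NODE.get(workload)
    if preferred in live_nodes:
        return preferred
    # Fall back deterministically so the displaced workload stays on one node.
    return sorted(live_nodes)[0]

print(route("order", {"node1", "node2", "node3"}))  # preferred node is up
print(route("order", {"node1", "node3"}))           # node2 is down
```

Choosing the fallback deterministically keeps a displaced group of users together on one surviving node, so their shared blocks still tend to sit in a single cache.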
Oracle RAC disk storage management best practices
In order to implement a RAC system, you must use a shared storage device because many servers need concurrent access to the disks. A single-instance database can use Direct Attached Storage (DAS), which is an array of inexpensive disks connected to a single server. With RAC, however, you must use what is known as a Storage Area Network (SAN). A SAN, which is more expensive and complex, is a disk array capable of connecting to many servers, usually through Fibre Channel. This requires a unique set of hardware, ranging from host bus adapters to the SAN itself. It’s important that your DBA have complete knowledge of the internals of the data storage layer.
Oracle RAC block size best practices
It has become a best practice in RAC to use a small 2 kilobyte block size in order to minimize the “baggage” shipped across the cache fusion layer. Because the block size is the unit of work, the smaller the block size, the higher the granularity of data being transferred, with less overhead. If you have long rows (greater than 2 kilobytes), then you will want to move to a 4 kilobyte block size.
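The rule of thumb above, and the “baggage” it is trying to avoid, can be expressed as a small Python sketch. It is a simplification under stated assumptions: it ignores block header overhead and assumes one whole block is shipped to deliver one row. Both helper names are hypothetical.

```python
def rac_block_size(longest_row_bytes: int) -> int:
    """Apply the rule of thumb from the text: 2 KB blocks, or 4 KB for long rows."""
    return 4096 if longest_row_bytes > 2048 else 2048

def shipped_overhead(row_bytes: int, block_bytes: int) -> int:
    """Bytes transferred beyond the row itself when one whole block is shipped."""
    return max(block_bytes - row_bytes, 0)

# A 200-byte row shipped in a 2 KB block versus an 8 KB block.
small = shipped_overhead(200, 2048)
large = shipped_overhead(200, 8192)
print(small, large)
```

In this simplified model, fetching the same 200-byte row drags along roughly four times as much unrelated data across the interconnect with an 8 KB block as with a 2 KB block.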
The implementation of a RAC cluster is only the beginning, and it’s critical to constantly monitor the health of your RAC clusters so that you can spot and fix impending problems before you inconvenience your end users.
Oracle RAC monitoring best practices
To ensure that a RAC database never experiences a global outage, a proper monitoring infrastructure is an absolute requirement. RAC databases rarely fail without warning. If the DBA understands the proper metrics to watch, he can create an alert system that notifies him of a looming problem so that he can fix it before the instance crashes.
The DBA must monitor the cluster, the shared disk setup, ASM (or OCFS), the database instance, listeners, and more in-depth metrics such as cache coherency, interconnect latency, disk times from multiple systems, and a range of other things.
While higher-cost performance monitoring tools such as Oracle Grid Control can help perform rudimentary RAC monitoring for beginners, a RAC DBA should have the coding skills to build his own RAC monitoring infrastructure using dictionary queries, dbms_scheduler and email alert mechanisms.
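A home-grown alerting layer of this kind might look something like the following Python sketch. It covers only the threshold check and the construction of the alert e-mail; collecting the samples (for example, from dictionary queries run on a schedule) and actually sending the message are left out. The metric names, limits, and e-mail addresses are hypothetical placeholders, not Oracle-defined values.

```python
from dataclasses import dataclass
from email.message import EmailMessage
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    unit: str

# Hypothetical metric names and limits; tune them to your own baseline.
THRESHOLDS = [
    Threshold("interconnect_latency_ms", 2.0, "ms"),
    Threshold("disk_read_time_ms", 20.0, "ms"),
]

def breaches(sample: Dict[str, float]) -> List[str]:
    """Return one line per metric in `sample` that exceeds its threshold."""
    lines = []
    for t in THRESHOLDS:
        value = sample.get(t.metric)
        if value is not None and value > t.limit:
            lines.append(f"{t.metric} = {value} {t.unit} (limit {t.limit} {t.unit})")
    return lines

def build_alert(node: str, sample: Dict[str, float]) -> Optional[EmailMessage]:
    """Build an alert e-mail for `node`, or return None if nothing is wrong."""
    problems = breaches(sample)
    if not problems:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"RAC warning on {node}: {len(problems)} metric(s) over limit"
    msg["From"] = "rac-monitor@example.com"  # placeholder address
    msg["To"] = "dba@example.com"            # placeholder address
    msg.set_content("\n".join(problems))
    return msg

alert = build_alert("node2", {"interconnect_latency_ms": 5.5, "disk_read_time_ms": 8.0})
if alert is not None:
    print(alert["Subject"])
```

Separating the threshold table from the checking logic means the DBA can adjust limits as the baseline changes without touching the alerting code.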
Wrapping up the discussion of Oracle RAC best practices, let’s focus on the best way to define job roles for a RAC database.
Oracle RAC staffing best practices
One best practice for RAC databases is to always hire an experienced RAC DBA to manage your cluster, avoiding people who have had the RAC training but have no job experience.
It’s important to recognize that human resources are the largest cost in an Oracle shop. Over the decades, hardware costs have steadily fallen while manpower costs have remained the same.
It’s important to note that Oracle professionals with RAC skills command a hefty premium over an ordinary DBA. A recent Oracle salary survey notes that an average DBA earns about $97,000 per year, whereas RAC experts commonly earn $140,000 a year. Those who manage multi-billion-dollar RAC databases typically command upwards of $250,000 per year.
Sadly, there is no easy way to “grow your own” RAC DBA. The training courses are very expensive, and there is no substitute for real-world experience. And training your own DBA in RAC may make him more marketable. It’s not uncommon to spend tens of thousands of dollars teaching RAC to your DBA only to lose him to a better job offer.
Oracle RAC job role best practices
There is a perpetual conflict between systems administrators (SAs), who traditionally manage servers and disks, and the RAC DBAs who are responsible for managing the RAC database. There are also clearly defined job roles for network administrators, who are especially challenged in a RAC database environment to manage the cluster interconnect and packet shipping between servers.
If your DBA is going to be held responsible for the performance of the RAC database, then it’s only fair that he be given root access to the servers and disk storage subsystem. However, not every DBA will have the required computer science skills to manage a complex server and SAN environment, so each shop makes this decision on a case-by-case basis.
Oracle RAC training best practices
One of the surest ways to set your company up for an unplanned outage is to fail to train your SA, DBA and network administrator properly. SAN environments like EMC, TagmaStore and NetApp have complex architectures, and they frequently require training classes.
Disk configuration is also challenging, and RAC will function only when using specific disk setups such as ASM, OCFS, RAW, or a third-party cluster file system. These tools require training classes.
Network administrators must also receive training on how to work with the cluster interconnect, as well as specialized interconnects such as InfiniBand and DWDM.
Of all those on a RAC staff, DBAs will have the greatest learning curve. They will have to understand how to set up and administer all of the complex RAC components, including the clusterware and file system storage.
In summary, while RAC offers continuous availability, it’s not magic. There is a lot of work required to ensure that a RAC database is always available. Every RAC database has some unique properties, but there are some well-known perils and pitfalls as well. Using Oracle RAC best practices from other shops is a must for ensuring success. The vast majority of the best practices with RAC relate to properly planning the infrastructure and configuring and deploying the RAC database.
About the author:
Donald K. Burleson is a leading Oracle expert, with more than 25 years of DBA experience. He has authored more than 30 Oracle books, including five officially authorized O books on Oracle tuning. Burleson also manages a popular DBA website, www.dba-oracle.com.