Organizations are increasingly expanding their deployments of enterprise performance management (EPM) technologies to a wider user base. EPM applications help them share financial information enterprise-wide and bring analytics on business performance to bear across the board -- not just in the finance department. That makes these systems more important to companies than ever before. This article covers considerations for implementing highly available and fault-tolerant Oracle EPM systems to protect against failures that cause costly downtime and to enable rapid recovery from failures that do occur.
The first step in creating a highly available system for EPM is to establish what service levels you need. This involves examining the probability of different types of failure and assessing the tolerance your business has for the system downtime and data loss that can result from such problems. Filling out a simple matrix can help document the service-level requirements. For example, you could create a matrix with two service-level metrics for each failure event: a recovery point objective (RPO), which sets the maximum span of time for which it is tolerable to lose data, and a recovery time objective (RTO), which sets the maximum acceptable time for restoring the system after a failure.
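As a rough illustration, a service-level matrix of this kind could be captured as data so that recovery tests can be checked against it. The sketch below is only one way to do that: the failure events come from this article, but every RPO and RTO value, and the names `ServiceLevel`, `MATRIX`, `meets_target` and `format_matrix`, are hypothetical placeholders rather than recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ServiceLevel:
    failure_event: str
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore the system


# Hypothetical targets; replace with values agreed with the business.
MATRIX = [
    ServiceLevel("data corruption", rpo=timedelta(hours=24), rto=timedelta(hours=4)),
    ServiceLevel("hardware failure", rpo=timedelta(0), rto=timedelta(minutes=15)),
    ServiceLevel("data center failure", rpo=timedelta(hours=1), rto=timedelta(hours=8)),
]


def meets_target(level, data_lost, downtime):
    """Return True if an observed (or tested) recovery stayed within both targets."""
    return data_lost <= level.rpo and downtime <= level.rto


def format_matrix(levels):
    """Render the matrix as plain text, one failure event per row."""
    lines = [f"{'Failure event':<22}{'RPO':<16}{'RTO':<16}"]
    for level in levels:
        lines.append(f"{level.failure_event:<22}{str(level.rpo):<16}{str(level.rto):<16}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_matrix(MATRIX))
```

Keeping the targets in one place makes it easier to compare the results of each disaster recovery test against what the business actually agreed to.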
Now let's look at some common types of failure and the steps you can take to avoid them or minimize their impact on your Oracle EPM system so you can meet your service-level needs.
Data corruption. Most failures in EPM systems are due to human error, which means that data corruption is the most common type of failure. Combat data corruption issues with periodic exports for backup purposes.
The Hyperion product line is the cornerstone of Oracle's EPM suite. The lifecycle management utility in Hyperion can be scripted and scheduled to automatically take periodic exports of security settings, applications, data, reports and other EPM artifacts. Scripts need to be maintained and monitored, since they will most likely need to be continuously massaged to fit the application profile as the system changes over time.
It's also important to take backups of the EPM server itself in case you get corruption in the operating system layer. Finally, a backup of the relational database repository is needed in case it encounters corruption problems.
Timing is important, too: The database, operating system and file system may need to be restored to the same point in time. You'll need to coordinate system, database and Hyperion administrators for backup recovery planning to ensure consistency across the platform.
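To make the point-in-time requirement concrete, here is a minimal sketch of a check that backups from different layers line up closely enough to be restored together. It assumes each team can supply a list of backup completion times; the layer names, the tolerance and the functions `restore_points_consistent` and `latest_consistent_set` are illustrative, not part of any Oracle tool.

```python
from datetime import datetime, timedelta


def restore_points_consistent(backup_times, tolerance):
    """True if all chosen backups were taken within `tolerance` of each other."""
    stamps = list(backup_times.values())
    return max(stamps) - min(stamps) <= tolerance


def latest_consistent_set(history, anchor_layer, tolerance):
    """Walk the anchor layer's backups from newest to oldest. For each one,
    pick the closest backup from every other layer and return the first
    combination that is consistent. Returns None if no combination qualifies.
    Assumes every layer has at least one backup."""
    for anchor in sorted(history[anchor_layer], reverse=True):
        chosen = {anchor_layer: anchor}
        for layer, stamps in history.items():
            if layer != anchor_layer:
                chosen[layer] = min(stamps, key=lambda t: abs(t - anchor))
        if restore_points_consistent(chosen, tolerance):
            return chosen
    return None


if __name__ == "__main__":
    # Hypothetical backup history for three layers.
    history = {
        "database": [datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 2, 2, 0)],
        "operating system": [datetime(2024, 1, 1, 2, 10)],
        "file system": [datetime(2024, 1, 1, 2, 5), datetime(2024, 1, 2, 2, 5)],
    }
    print(latest_consistent_set(history, "database", timedelta(minutes=30)))
```

In this made-up history, the newest database backup has no matching operating system backup, so the usable restore point is a day older than the database team alone might assume. That gap is exactly what joint recovery planning is meant to expose.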
Hardware failure. Failures in server hardware are less common these days, but they're still possible. A common strategy for handling system downtime related to hardware failure is the use of server clusters, which fail over processing services when a system goes offline. There are two different types of clusters: active-active clusters and active-passive clusters.
In active-active clusters, servers are configured to distribute EPM workloads across multiple servers, all of which are running the same services simultaneously. This is useful for load balancing. Commonly, a physical load balancer is used as the single entry point to the cluster -- it distributes requests for processing resources across the servers. If one server fails, the remaining healthy nodes continue operations. Active-active clusters are commonly used in the web layer of Oracle EPM products.
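The active-active behavior described above can be sketched in a few lines. This is a simplified model of round-robin distribution that skips failed nodes, not how any particular load balancer or Oracle component is implemented; the class `RoundRobinBalancer` and the node names are hypothetical.

```python
class RoundRobinBalancer:
    """Single entry point that spreads requests across the healthy nodes."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._healthy = set(self._nodes)
        self._next = 0  # index of the next node to try

    def mark_down(self, node):
        """Record that a health check failed for this node."""
        self._healthy.discard(node)

    def mark_up(self, node):
        """Return a known node to the rotation once it is healthy again."""
        if node in self._nodes:
            self._healthy.add(node)

    def pick(self):
        """Return the next healthy node, skipping any that are marked down."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node in self._healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web1", "web2", "web3"])
    balancer.mark_down("web2")  # simulate a server failure
    print([balancer.pick() for _ in range(4)])
```

With `web2` marked down, requests simply alternate between the two remaining nodes, which is the "healthy nodes continue operations" behavior in miniature.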
In some cases, though, having multiple active-active load balancing components is either unsupported or impossible to set up in the Oracle EPM suite. For these situations, an active-passive arrangement is needed. In an active-passive cluster, only one server runs processing services at any given time. If it fails, a standby server detects the failure and starts services to resume operations. Active-passive clusters are common for systems in the EPM data layer, like the Essbase multidimensional database that underpins many Hyperion deployments.
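The standby side of an active-passive cluster can be modeled as a heartbeat monitor with a timeout. The following is a minimal sketch under that assumption: the class `StandbyMonitor`, the timeout value and the `start_services` callback are hypothetical, and real cluster software also has to deal with concerns this sketch ignores, such as making sure the failed node does not come back and run services at the same time.

```python
class StandbyMonitor:
    """Runs on the standby server and takes over when heartbeats stop."""

    def __init__(self, timeout_seconds, start_services, now):
        self.timeout = timeout_seconds
        self.start_services = start_services  # callback that starts local services
        self.last_heartbeat = now
        self.active = False

    def record_heartbeat(self, now):
        """Called whenever a heartbeat arrives from the active server."""
        self.last_heartbeat = now

    def check(self, now):
        """Start services once if the active server has been silent for too long.
        Returns True when this standby is (now) running the services."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.start_services()
            self.active = True
        return self.active


if __name__ == "__main__":
    import time

    monitor = StandbyMonitor(
        timeout_seconds=10.0,
        start_services=lambda: print("standby: starting services"),
        now=time.monotonic(),
    )
    # In a real deployment, record_heartbeat() and check() would be called
    # from a polling loop; here we only show a single check.
    print(monitor.check(time.monotonic()))
```

Passing the current time in explicitly keeps the logic easy to test, and the `active` flag ensures services are started only once per failover.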
Data center failure. Depending on the available technology, there are a variety of ways to protect against entire data center outages. One way is to do frequent lifecycle management exports of EPM artifacts to a disaster recovery instance in another data center. This process can be complicated, and requires a lot of scripting, automation and maintenance, but it's a common approach for users who don't have expensive data replication technology and have a higher tolerance for downtime.
For users with less tolerance for downtime, replication software can be used to synchronize systems from one data center to another in real time. This means that there is little to no loss of data in the event of a data center failure. The downside is that the software can be costly.
Failure prevention via quality assurance
Perhaps the best way to fight downtime is to prevent it in the first place. Make sure to establish strict quality controls that define clear roles and responsibilities -- with the associated security access and authority -- for developers, testers and others involved in your EPM project. Quality controls also provide a framework that enforces proper testing of all development objects before they go into production. In addition, they set up workflow and approvals, audit trails, back-out procedures and pass/fail criteria.
It's also very important to do your due diligence on the "care and feeding" needed by Hyperion. Just like with any other system, there are things that need to be done on a daily, weekly and monthly basis to keep an Oracle EPM system properly tuned. These tasks include log rotation, file system cleanup, system health monitoring, disaster recovery tests and performance monitoring.
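Some of these routine tasks are easy to script. As one example, the sketch below handles file system cleanup by listing, and optionally deleting, files older than a given age. The function names, the directory and the retention period are all assumptions for illustration; it defaults to a dry run so nothing is deleted until the list has been reviewed.

```python
import time
from pathlib import Path


def stale_files(directory, max_age_days, now=None):
    """Return files under `directory` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def clean_up(directory, max_age_days, dry_run=True):
    """List stale files; delete them only when dry_run is False."""
    victims = stale_files(directory, max_age_days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims


if __name__ == "__main__":
    # Hypothetical log directory and retention period.
    for path in clean_up("/tmp/epm_logs", max_age_days=30):
        print("would delete:", path)
```

A script like this would typically run from a scheduler, with its output reviewed as part of the regular system health checks.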
EPM system monitoring and security
Of course, all the preparation and protective processes in the world can't completely eliminate failures. Plan for these events and make sure you have procedures in place to detect them and react quickly.
Issue detection is key to minimizing downtime. There are many commercial and free software packages readily available to monitor EPM system health. These packages can detect a failure and immediately alert the appropriate people to resolve it, sometimes before end users even know there's a problem.
In addition, alerts can be set to flag events related to issues that are dangerously close to becoming failures before they cause actual outages. These alerts include slower response times, disk space approaching full, some services down, errors in system logs and either CPU or memory utilization getting close to the maximum level.
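A minimal sketch of such early-warning checks follows. The threshold values and metric names are hypothetical, and the function only compares numbers it is given; collecting CPU, memory and response-time readings is left to whatever monitoring package is in use. Disk usage is the one metric gathered here, via Python's standard `shutil.disk_usage`.

```python
import shutil

# Hypothetical warning thresholds, set below the point of actual failure.
WARN_THRESHOLDS = {
    "disk_used_pct": 85.0,
    "cpu_pct": 90.0,
    "memory_pct": 90.0,
    "response_seconds": 5.0,
}


def disk_used_pct(path="/"):
    """Percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def early_warnings(metrics, thresholds=WARN_THRESHOLDS):
    """Return one message per metric that has reached its warning threshold."""
    return [
        f"{name} at {value:.1f} (warning threshold {thresholds[name]:.1f})"
        for name, value in sorted(metrics.items())
        if name in thresholds and value >= thresholds[name]
    ]


if __name__ == "__main__":
    readings = {"disk_used_pct": disk_used_pct("/")}
    for message in early_warnings(readings):
        print("WARNING:", message)
```

Setting the thresholds well below the hard limits is what gives administrators time to act before end users are affected.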
Security is a top priority for CIOs in every industry -- or it should be. Oracle EPM systems commonly hold sensitive and confidential financial data. Security processes and tools need to be put in place at every level of a system. These processes include password strength and rotation policies, operating system hardening, network firewalls, segregation of duties, continuous intrusion detection monitoring and encryption of data, both at rest and in transit.
If this all sounds like a lot -- well, it is. And, of course, there are costs involved. However, you also need to calculate the costs of not doing anything. In most cases, the costs associated with protecting business data, hardware and data center facilities are significantly less than what EPM system failures and the resulting data loss could cost your company. Doing nothing could turn out to be very expensive indeed.
About the author:
Eric Helmer is VP of application services at Mercury Technology Group and an Oracle ACE Director.