One of the many advantages of upgrading to Oracle 10g is that the cost-based optimizer (CBO) has been improved to use a cost model that accounts for both I/O and CPU. This cost model allows the optimizer to choose the best execution plan based on statistics about the data. However, the decisions made by the CBO depend on the statistics available at the time the execution plan is parsed, and one of the most common causes of performance issues in 10g databases is bad statistics. Fortunately, Oracle 10g also introduces several extremely useful tools for monitoring database performance and activity, including the Automatic Workload Repository (AWR) and the Automatic Database Diagnostic Monitor (ADDM). These tools can be used to pinpoint where bad statistics might be causing the optimizer to make bad decisions.
The obvious question is, "What constitutes bad statistics?" Unfortunately, the answer is not as obvious as the question. In fact, unless you understand the nuts and bolts of how the CBO chooses its execution plan and how parsed execution plans are reused from the library cache, it is unlikely that you will even know when bad statistics are causing your database to fail. The term "database failure" can have different meanings, but what I am referring to here is when CPU, I/O and wait times spike and stay elevated until the database is unusable, usually necessitating a database bounce.
I am going to discuss one case where a client required a root cause analysis for recurring database failures. Each time their database failed, they noticed the same symptoms: suddenly, the CPU, I/O and wait times spiked and stayed elevated. If the database failure occurred during production hours, the DBAs would bounce the database almost immediately and rely on Statspack reports to figure out what had happened. But if the database failed outside normal business hours, management rightfully insisted on a root cause determination prior to any corrective action. Since this client is a fairly large company, this meant a phone conference of approximately 20 people -- network administrators, database developers, DBAs, consultants and managers -- trying to figure out what had changed while the database sat inaccessible for 6 to 18 hours. I witnessed this happen numerous times in the few weeks I was at this client. The first time, the team decided to double the CPUs on the machine to combat the spiking CPU consumption. When this didn't solve the problem, the team of 20 determined that the corrective action was to create indexes and assign them to the keep pool. After each corrective action was taken and the database restarted, response times would return to normal.
These intermittent failures were a top priority for this client since each failure had a substantial cost in resources and lost production time. Adding to the client's frustration, they had recently completed an engagement with a consulting company for performance tuning that same database.
My general approach to determining a root cause was similar to the client's: find out what changed, then determine the cause of that change. Using the AWR to analyze the database's CPU and I/O, I was able to pinpoint the exact snapshot where the sudden resource spike occurred.
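As a sketch of that step, a query like the following against the AWR time-model history can locate the snapshot where DB CPU jumped. The snapshot range is illustrative, not the client's actual output; these values are cumulative since instance startup, so it is the deltas between consecutive snapshots that reveal the spike.

```sql
-- Illustrative sketch: find where DB CPU and DB time jumped.
-- Assumes a single-instance database; snapshot IDs are hypothetical.
SELECT s.snap_id,
       s.begin_interval_time,
       t.stat_name,
       ROUND(t.value / 1000000) seconds  -- time model values are in microseconds
FROM   dba_hist_sys_time_model t,
       dba_hist_snapshot       s
WHERE  t.snap_id         = s.snap_id
AND    t.dbid            = s.dbid
AND    t.instance_number = s.instance_number
AND    t.stat_name IN ('DB CPU', 'DB time')
AND    s.snap_id BETWEEN 39240 AND 39257
ORDER  BY s.snap_id;
```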
Using the data above, it is obvious that something changed during snapshot 39244. The source of resource consumption on a segment can be ascertained by understanding all of the operations taking place in the database at a particular time. The next step in the diagnosis was to look at the SQL statistics in AWR for the top SQL during the trouble snapshot. I focused on the major consumers identified in the top SQL of the Statspack report during the trouble period, verified by running an ADDM report. Starting with the greatest consumer, we see that the logical and physical read deltas had increased greatly while the executions remained fairly consistent.
SQL> SELECT s.snap_id "BEGIN|SNAPSHOT",
            st.executions_delta "EXEC|DELTA",
            st.buffer_gets_delta "Buffer|Gets|DELTA",
            st.ccwait_delta "Cncr|WAIT|DELTA",
            st.cpu_time_delta "CPU|TIME|DELTA",
            st.elapsed_time_delta "Elpsd|TIME|DELTA"
     FROM   dba_hist_sqlstat st,
            dba_hist_snapshot s
     WHERE  s.snap_id = st.snap_id
     AND    st.sql_id = :1
     ORDER  BY s.begin_interval_time ASC;
This pattern held true for each statement in the top SQL. Furthermore, I noticed that the text of each of the top SQL statements had a similar predicate on the same table. Naturally, the next question was, "What caused the per-execution cost of the same SQL queries to increase so dramatically?" This question can be answered by querying the AWR's SQL statistics again for additional information.
SQL> SELECT b.snap_id "BEGIN|SNAPSHOT",
            a.plan_hash_value
     FROM   dba_hist_sqlstat a,
            dba_hist_snapshot b
     WHERE  a.sql_id IN (:1)
     AND    a.snap_id BETWEEN 39240 AND 39257
     AND    a.snap_id = b.snap_id
     ORDER  BY a.snap_id ASC;
Immediately, we notice that the SQL plan hash value had changed. In other words, the execution plan for the same SQL had changed during the trouble snapshot. Comparing the two execution plans using AWR, I saw that the original plan used an efficient index (on an ID column) while the new plan used a very inefficient access path (a range scan on a DATE column).
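For readers who want to reproduce the plan comparison, DBMS_XPLAN.DISPLAY_AWR retrieves every historical plan that AWR has captured for a statement. The SQL_ID below is a placeholder, not the client's actual statement:

```sql
-- Show all execution plans captured in AWR for one statement.
-- '&sql_id' is a placeholder for the statement's actual SQL_ID.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('&sql_id'));
```

Each captured plan is printed with its plan hash value, so the before and after plans can be compared side by side.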
The CREATION_TS column is a DATE field recording when the record was first created. This particular table has several thousand rows inserted daily. Why would the optimizer choose to range scan a time period containing thousands of records to return one record, rather than using a selective index whose keys match only one to three rows? To any reasonable person viewing these two plans, it is obvious that the old plan is much more efficient. So what was causing the CBO to choose such a poor plan? And what was causing the execution plan to be reparsed, allowing the plan to change?
Since my objective was to determine the root cause of the database failure, my primary concern was why the CBO was choosing a poor execution plan. There are a number of reasons a plan might be reparsed, but an occasional invalidation from the library cache should not cause a major problem. In our case, however, each invalidation was producing a new, extremely inefficient execution plan, leading to the failure of the database. Given this situation, I could easily have recommended plan stability, but this would have been treating the symptom instead of curing the cause.
There are a few ways to examine the optimizer's cost-based decisions. Arguably, the most effective is a CBO trace. In this particular situation, though, the root problem was already suggested by circumstantial data. Specifically, all of the top SQL in the ADDM and Statspack reports during the trouble snapshot showed the exact same join on the same table within their SQL text. As shown above, the optimizer was choosing between two predicates -- a DATE column and an ID column. For this table, there were only a few rows for each ID value while there were thousands of records for each date. The optimizer had changed execution plans from using an index on the ID column to using an index on the DATE column.
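For reference, a CBO trace is produced with event 10053; the resulting trace file, written to the directory named by user_dump_dest, details how the optimizer costed each candidate plan. A minimal session-level sketch (the traced statement itself is a placeholder):

```sql
-- Enable the optimizer (10053) trace for this session only.
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

-- Force a hard parse of the statement of interest, for example:
-- EXPLAIN PLAN FOR <the problem query>;

-- Turn the trace back off; the file appears in user_dump_dest.
ALTER SESSION SET EVENTS '10053 trace name context off';
```

Note that the trace is only written when the statement is actually hard parsed, so a statement already cached in the library cache must be modified or flushed first.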
Comparing the two plans, the cost of the plan using the ID was calculated as equal to the cost of the plan using the DATE. However, after executing each plan, the session statistics showed that the plan using the ID scanned only 17 blocks while the plan using the DATE scanned over 38,000 blocks. Why was the cost of such a poor plan being evaluated as comparable to that of a far more efficient one?
This client runs an OLTP environment. The queries identified as top consumers were all querying for the most recent day's data; in other words, the DATE range was between TRUNC(SYSDATE) and SYSDATE. Given all of this diagnostic data, my suspicion was that a column histogram on the DATE field was causing the CBO to calculate costs from stale statistics. Sure enough, I was able to confirm that there was a height-balanced histogram on the DATE column.
SQL> SELECT histogram
     FROM   dba_tab_cols
     WHERE  table_name = 'LEADS'
     AND    column_name = 'CREATION_TS';

HISTOGRAM
---------------
HEIGHT BALANCED
With statistics being gathered only weekly or biweekly, it was logical that the CBO believed the cost of querying the most recent dates would be very low. In other words, when reparsing the SQL, the CBO believed there were no values in the specified date range, since the column histogram showed no values in that bucket. A CBO trace verified this: it showed that the CBO calculated the cost of that date range to be very low and chose the DATE index for the execution plan. In reality, this proved catastrophic for the database, since the new plan consumed massive amounts of I/O and was executed many thousands of times per hour.
This situation was easy to demonstrate by comparing an old date range with a new one. All queries using a date range prior to the date of the latest statistics produced execution plans using the index on the ID column, while any date range after the latest statistics used the index on the DATE field. This demonstrated that any plan using these predicates that was reparsed after the latest statistics gathering would favor the DATE index, resulting in very inefficient execution plans.
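A sketch of that demonstration follows. The table name and date ranges follow the article, but the ID column name (LEAD_ID) is hypothetical, and the plan outcomes noted in the comments describe what was observed in this client's case, not guaranteed behavior:

```sql
-- Date range well before the latest statistics gathering:
EXPLAIN PLAN FOR
  SELECT *
  FROM   leads
  WHERE  lead_id = :id          -- hypothetical ID column name
  AND    creation_ts BETWEEN TRUNC(SYSDATE - 30) AND TRUNC(SYSDATE - 29);
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);  -- here: index on the ID column

-- The most recent day's data, after the latest statistics gathering:
EXPLAIN PLAN FOR
  SELECT *
  FROM   leads
  WHERE  lead_id = :id
  AND    creation_ts BETWEEN TRUNC(SYSDATE) AND SYSDATE;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);  -- here: index on CREATION_TS
```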
According to Oracle, column histograms are only appropriate when the values of the column are highly skewed. Obviously, a DATE column with a fairly uniform distribution of values does not fit this prescription. When I asked the client's DBAs why they were gathering histograms on all indexed columns, they said that their developers had requested this option.
Finally, we had the root cause: histograms on a DATE field in an OLTP database where the queries on that field specified the newest dates. This caused the CBO to miscalculate costs based on its belief that there was no data for the specified dates and hence fewer blocks to scan. The triggering event for the sudden degradation in performance was the reparsing of the SQL long after the date of the latest statistics. By removing the histogram on this column and reparsing the SQL, we realized an immediate database-wide improvement and lasting stability.
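The fix itself can be sketched with DBMS_STATS: regathering the column's statistics with SIZE 1 keeps the basic column statistics but drops the histogram. Table and column names follow the article; running as the table's owner is an assumption:

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,                                  -- assumed: table owner
    tabname    => 'LEADS',
    method_opt => 'FOR COLUMNS CREATION_TS SIZE 1',      -- SIZE 1 = no histogram
    cascade    => TRUE);                                 -- refresh index stats too
END;
/
```

Alternatively, the gathering job's METHOD_OPT default can be adjusted so histograms are collected only on genuinely skewed columns rather than on every indexed column.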
The truly remarkable part of this root cause analysis is that the timing of the analysis was completely independent of the date of the actual failure. In other words, Oracle 10g's AWR and ADDM have made the once-crucial window for analysis between crash and recovery a thing of the past. These tools are instrumental in analyzing database issues while allowing for minimal downtime.
About the author
V.J. Jain is an Oracle Database and Applications consultant. With over 12 years of experience with database systems, he specializes in database performance, development, interfaces and high-performance solutions. Based in Orange County, California, he is active in exploring Oracle's newest technologies and is a member of the Oracle Partner Network and Beta program. Additional material by him can be found at Oracle-Developer.com.