Home > Oracle Tips > Oracle Database Administrator > Using the cost-based optimizer to improve Database 10g performance
Oracle Tips:
EMAIL THIS
 TIPS & NEWSLETTERS TOPICS 

ORACLE DATABASE ADMINISTRATOR

Using the cost-based optimizer to improve Database 10g performance


V.J. Jain
07.20.2007
Rating: -4.68- (out of 5)


Oracle tips, scripts, and expert advice
Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google


One of the many advantages of upgrading to Oracle 10g is that the cost-based optimizer (CBO) has been improved to allow for better performance by using a cost model that accounts for I/O and CPU. This cost model allows the optimizer to choose the best execution plans based on statistics of the data. However, the decisions made by the CBO depend on the statistics that are available at the time that the execution plan is parsed. One of the most common causes of performance issues in 10g databases is bad statistics. Luckily, Oracle 10g also introduces several extremely useful tools for monitoring database performance and activity including Active Workload Repository (AWR) and Automatic Database Diagnostic Monitor (ADDM). These tools can be used to pinpoint where bad statistics might be causing the optimizer to make bad decisions.

The obvious question is, "What constitutes bad statistics?" Unfortunately, the answer is not as obvious as the question. In fact, unless you understand the nuts and bolts of how the CBO chooses its execution plan and how parsed execution plans are used in the library cache, it is unlikely that you will even know when bad statistics are causing your database to fail. The term "database failure" can have different meanings but what I am referring to is when CPU, I/O and wait times spike and stay elevated until the database is unusable, usually necessitating a database bounce.

I am going to discuss one case where a client required a root cause analysis for reoccurring database failures. Each time their database failed, they noticed the same symptoms -- suddenly the CPU, I/O and wait times spiked and stayed elevated. If the database failure occurred during production hours, the DBAs would bounce the database almost immediately and rely on Statspack reports to figure out what happened. But if the database failed outside normal business hours, the management rightfully insisted on a root cause determination prior to any corrective action. Since this client is a fairly large company, this meant that there would be a phone conference consisting of approximately 20 people trying to figure out what changed while the database was inaccessible for 6 to 18 hours. This team consisted of network administrators, database developers, DBAs, consultants and managers. I witnessed this happen numerous times in the few weeks I was at this client. The first time, the team decided to double the CPUs on the machine to combat the spiking CPU consumption. After this didn't solve the problem, the team of 20 determined that the corrective action was to create indexes and assign them to the keep pool. After the corrective action was taken and the database was restarted, the response times would become normal.

These intermittent failures were a top priority for this client since each failure had a substantial cost in resources and lost production time. Adding to the client's frustration, they had recently completed an engagement with a consulting company for performance tuning that same database.

My general approach to determining a root cause was similar to the client's approach. I wanted to find out what changed and then determine the cause for that change. Using the AWR to analyze the database CPU and I/O, I was able to pinpoint the exact snapshot where the sudden resource spike occurred.

BEGIN|
SNAPSHOT
CPU|BUSY PHYSICAL|READS LOGICAL|READS
39240 10.83 2,632,960 5,540
39241 10.35 2,389,283 5,540
39242 9.84 1,928,372 4,765
39243 12.07 2,358,480 319,280
39244 19.41 367,416,720 76,615
39245 41.47 4,383,027,600 191,815
39246 41.79 5,344,862,800 350,030
39247 48.62 7,070,087,680 599,760
39248 44.97 2,385,390,000 420,200
39249 53.50 6,154,475,760 138,420
39250 63.46 8,277,101,680 135,990
39251 67.05 8,269,566,480 200,240
39252 86.25 4,821,330,360 230,178
39253 75.40 5,879,349,080 420,036
39254 86.19 7,777,096,448 719,712
39255 92.51 2,623,929,000 504,240
39256 89.90 6,769,923,336 166,104
39257 13.35 383,027,600 91,815

Using the data above, it is obvious that something changed during snapshot 39244. The source of resource consumption on a segment can be ascertained by understanding all of the operations taking place in the database at a particular time. The next action in the diagnosis was to look at the SQL stats in AWR for the top SQL during the trouble snapshot. I focused on the major consumers identified in the top SQL of the Statspack report during the trouble period, verified by running an ADDM report. Starting with the greatest consumer, we see that the logical and physical read deltas had greatly increased while the executions remained fairly consistent.

SQL> SELECT s.snap_id "BEGIN|
SNAPSHOT", SQL.executions_delta EXEC|DELTA, SQL.buffer_gets_delta "Buffer|Gets|DELTA", SQL.ccwait_delta "Cncr|WAIT|DELTA", SQL.cpu_time_delta "CPU|TIME|DELTA", SQL.elapsed_time_delta "Elpsd|TIME|DELTA" FROM dba_hist_sqlstat SQL, dba_hist_snapshot s WHERE s.snap_id = SQL.snap_id AND sql_id = (:1) ORDER BY BEGIN_INT_TIME ASC; /

BEGIN|
SNAPSHOT
EXEC|
DELTA
Buffer|Gets|DELTA Cncr|WAIT|DELTA CPU|TIME|DELTA Elpsd|TIME|DELTA
39240 3,658 9,854 0 329,982 329,982
39241 3,473 8,987 0 311,709 311,709
39242 3,161 8,630 0 293,170 293,170
39243 3,470 9,414 0 324,012 324,012
39244 1,014 20,939,367 5,225 717,640,129 859,772,865
39245 3,857 81,967,204 133,178 2,754,815,969 3,478,521,604
39246 3,976 88,320,579 11,639 2,973,254,882 3,751,332,591
39247 4,237 98,588,734 1,719,105 3,266,404,245 4,415,732,409
39248 3,643 88,857,830 906,682 2,937,868,965 3,831,412,484
39249 4,427 112,303,040 5,374,334 3,645,190,672 5,224,145,417
39250 4,460 117,456,518 31,105,732 3,770,210,494 6,212,396,813
39251 4,535 123,818,077 35,105,732 3,908,183,700 6,647,598,948
39252 5,171 148,715,473 893,610,223 4,390,247,982 18,029,668,378
39253 2,546 75,187,326 25,417,157 2,343,609,592 4,284,188,925
39254 3,657 112,091,347 1,531,886,898 3,285,867,054 20,014,193,638
39255 4,884 156,864,666 3,182,023,828 4,571,435,731 34,772,659,966
39256 2,289 91,288,199 1,281,384,071 2,123,128,775 19,283,382,187
39257 0 0 0 0 0

This situation held true for each SQL statement in the top SQL. Furthermore, I noticed that the text of each of the top SQL had a similar predicate on the same table. Naturally, the next question was, "What caused the marginal cost of the same SQL query to increase exponentially?" This question can be answered by querying the AWR's SQL stats again for additional information.

SQL> SELECT b.snap_id "BEGIN|SNAPSHOT", a.plan_hash_value 
FROM 
dba_hist_sqlstat a, dba_hist_snapshot b 
WHERE a.sql_id IN (:1) 
AND a.snap_id BETWEEN 39240 AND 39257 
AND a.snap_id = b.snap_id ORDER BY a.snap_id ASC;
/

BEGIN|SNAPSHOT PLAN_HASH_VALUE
39240 3058762805
39241 3058762805
39242 3058762805
39243 3058762805
39244 6038888
39245 6038888
39246 6038888
39247 6038888
39248 6038888
39249 6038888
39250 6038888
39251 6038888
39252 6038888
39253 6038888
39254 6038888
39255 6038888
39256 6038888
39257 6038888

Immediately, we notice that the sql plan hash value had changed. In other words, the execution plans for the same SQL has changed during the trouble snapshot. Comparing the two execution plans using AWR, I saw that the original plan used an efficient index (using an ID) while the new execution plan used a very inefficient plan (using a DATE).


Click image to enlarge.

The field CREATION_TS is a date field that shows the time that the record was first created. This particular table has several thousands of rows inserted daily. Why would the optimizer choose to range scan a time period with thousands of records to return one record, rather than using a selective index that only has one to three values? To any reasonable person viewing these two plans, it is obvious that the old plan is much more efficient. So what is causing the CBO to choose such a poor plan? Also, what is causing the execution plan to be reparsed causing the plans to change?

Since my objective was to determine the root cause for the database failure, my primary concern was the cause for the CBO using a poor execution plan. There are a number of reasons that the plan might be reparsed but an infrequent invalidation from the library cache should not cause a major problem. However, in our case, this invalidation was causing a new execution plan that was extremely inefficient leading to the failure of the database. Given this situation, I could easily have recommended using plan stability, but this would have been treating the symptom instead of curing the cause.

There are a few ways to determine the optimizer's cost-based decisions. Arguably, the most effective is a CBO trace. With this particular situation, the root problem was signified by more circumstantial data. Specifically, all of the TOP sql in the ADDM and Statspack reports during the trouble snapshot showed the exact same join on the same table within their SQL text. As shown above, the optimizer was driving decisions based on two predicates -- a date column and an ID column. For this table, there were only a few values for each ID while there are thousands of records for each date. The optimizer changed execution plans from using an index on the ID column to using an index on the DATE column.

Comparing the two plans, the cost for the plan using ID was being calculated equal to the cost using the DATE. However, after executing each plan, the session stats show that the plan using ID has only scanned 17 blocks while the plan using DATE had scanned over 38,000 blocks. Why is the cost for the poor plan being evaluated as lower than a much more efficient plan?

This client has an OLTP environment. Each of the queries that were identified as the top consumers were all querying for the most recent day's data. In other words, the value of the DATE range is between TRUNC(SYSDATE) and SYSDATE. Given all of this diagnostic data, my suspicion was that there was a column histogram on the DATE field causing the CBO to calculate costs based on old statistics. Sure enough, I was able to confirm that there was a height balanced column histogram on the DATE field.

SQL> SELECT histogram FROM dba_tab_cols where table_name 
= 'LEADS' and column_name = 'CREATION_TS'
/
SQL> HEIGHT BALANCED

With statistics being gathered weekly to biweekly, it was logical that the CBO believed the cost of using the most recent dates would be very low. In other words, when reparsing the SQL, the CBO believed that there were no values in the specified date range since the column histogram showed no values for that bucket. A CBO trace verified this situation and showed that the CBO calculated the cost for using that date range to be very low and chose the DATE index in the execution plan. In reality, this proved to be catastrophic to the database since this new plan consumed massive amounts of I/O and was executed many thousands of times per hour.

This situation was easy to demonstrate by using an old date range compared with a new date range. All queries using a date range prior to the date of the latest statistics formed execution plans using the index on the ID column while any date ranges after the latest statistics used the index on the DATE field. This demonstration showed that any plans using these predicates that were reparsed after the date of the latest statistics would favor using the index on the DATE field, resulting in very inefficient execution plans.

According to Oracle, it is only proper to use column histograms when the values of the column are highly skewed. Obviously a DATE field with a fairly consistent distribution of values does not fit this prescription. After asking the client's DBAs why they were gathering histograms for all indexed columns, they said that their developers had requested this option.

Finally, we had the root cause determination: the use of histograms on DATE fields in an OLTP database where the queries on this field were specifying the newest dates. This caused the CBO to miscalculate costs based on its belief that there was no data for the specified dates, hence resulting in fewer blocks to scan. The triggering event for the sudden degradation of performance was the reparsing of the SQL much after the date of the latest statistics. By removing the histogram on this column and reparsing the SQL, an immediate database-wide improvement and lasting stability was realized.

The really remarkable part about this root cause analysis is that the time of the analysis was completely independent from the date of the actual failure. In other words, Oracle 10g AWR and ADDM have made the once crucial need for analysis between crash and recovery a thing of the past. These tools are instrumental in analyzing database issues while allowing for the minimum downtime.

About the author

V.J. Jain is an Oracle Database and Applications consultant. With over 12 years of experience with database systems, he specializes in database performance, development, interfaces and high-performance solutions. Based in Orange County, California, he is active in exploring Oracle's newest technologies and is a member of the Oracle Partner Network and Beta program. Additional material by him can be found at Oracle-Developer.com.

Rate this Tip
To rate tips, you must be a member of SearchOracle.com.
Register now to start rating these tips. Log in if you are already a member.


Submit a Tip




Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google


RELATED CONTENT
Oracle Database Administrator
What is the difference between a database engineer, architect and administrator?
Import on one table from dump file
Error during RMAN backup
Can I drop a column in SYS schema?
STATSPACK tool: transaction vs. execution measurement
Should I port from Microsoft Access?
Installing multiple Oracle homes
How can I find statistics on total memory usage and database connections?
Modifying SYS password in a RAC environment
How to create Datafiles in a Data Guard (10g) environment

Oracle database performance problems and tuning
Oracle 11g data compression
Varchar or number for better performance?
Do statistics on SYS-owned objects hurt performance in 10g?
Inside the Oracle 11g SQL Performance Advisor, part 1
Inside the Oracle 11g SQL Performance Advisor, part 2
Difference between driving table and driver table in Oracle
Best design for E-Business Suite on hard drive
20GB data dictionary causing performance problems
Online tablespace reorganization in Oracle 9i
How to read a STATSPACK report

Oracle database installation, upgrades and patches
Oracle extends Real Application Testing to 10g, 9i
Third-party Oracle database tools remain an attractive option
Enterprise search a key component of Oracle portal project
Oracle Enterprise Manager leverages multiple environments for pharmaceutical giant
Oracle 11g upgrade veteran offers step-by-step advice
Oracle Database 11g gets early use at Burlington Coat Factory
Oracle database upgrades require soul searching
With software upgrades, pain leads to gain
How do I do that in Oracle?
The Oracle Database user's guide to Oracle OpenWorld 2007

RELATED GLOSSARY TERMS
Terms from Whatis.com − the technology online dictionary
10g  (SearchOracle.com)

RELATED RESOURCES
2020software.com, trial software downloads for accounting software, ERP software, CRM software and business software systems
Search Bitpipe.com for the latest white papers and business webcasts
Whatis.com, the online computer dictionary

DISCLAIMER: Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.

HomeNewsTopicsTipsAsk the ExpertsMultimediaWhite PapersProductsBlogs
About Us  |  Contact Us  |  For Advertisers  |  For Business Partners  |  Site Index  |  RSS
SEARCH 
TechTarget provides enterprise IT professionals with the information they need to perform their jobs - from developing strategy, to making cost-effective IT purchase decisions and managing their organizations' IT projects - with its network of technology-specific Web sites, events and magazines.

TechTarget Corporate Web Site  |  Media Kits  |  Reprints  |  Site Map




All Rights Reserved, Copyright 2003 - 2008, TechTarget | Read our Privacy Policy
  TechTarget - The IT Media ROI Experts