The following is the final part of a 12-part series on Oracle10g CBO internals and SQL tuning optimization. Each tip is excerpted from the not-yet-released Rampant TechPress book, "Oracle SQL and index internals," by Kimberly Floss. Check back to the main series page for upcoming installments.
In some cases, the distribution of values within a column of a table will affect the optimizer's decision to use an index vs. performing a full-table scan. This scenario occurs when the values in a column referenced in a where clause are not evenly distributed, making a full-table scan cheaper than index access for some values. The problem is that, without more information, the optimizer can't tell whether it is using a value that matches a few rows or a value that matches a large number of rows.
A column histogram should be created only when data skew exists or is suspected. In the real world, that happens rarely, and one of the most common mistakes with the optimizer is the unnecessary introduction of histograms into optimizer statistics. A histogram signals to the optimizer that the column values are not evenly distributed, and the optimizer will peek at the literal value in the SQL where clause and compare that value to the buckets in the histogram statistics.
While they are used to make a yes-or-no decision about the use of an index to access the table, histograms are most commonly used to predict the size of the intermediate result set from a multi-way table join.
For example, assume that we have a five-way table join whose result set will be only 10 rows. Oracle will want to join the tables together in such a way as to make the result set (cardinality) of the first join as small as possible. By carrying less baggage in the intermediate result sets, the query will run faster. To minimize intermediate results, the optimizer attempts to estimate the cardinality of each result set during the parse phase of SQL execution. Having histograms on skewed columns will greatly aid the optimizer in making a proper decision. (Remember, you can create a histogram even if the column does not have an index and does not participate as a join key.)
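One way to see the cardinality estimates the optimizer arrives at is to display the plan for a query. The following is a minimal sketch, assuming the standard SCOTT demo tables (emp and dept) and an available plan table; the query itself is only an illustration and is not taken from the book.

```sql
-- Sketch: show the optimizer's estimated row counts for a join.
-- Assumes the SCOTT demo tables EMP and DEPT and an available PLAN_TABLE.
explain plan for
select e.ename, d.dname
from   emp e, dept d
where  e.deptno = d.deptno
and    d.loc = 'DALLAS';

-- The "Rows" column of the output is the estimated cardinality of each plan step.
select * from table(dbms_xplan.display);
```

Comparing the Rows column before and after a histogram is added to a skewed column shows whether the histogram changed the optimizer's estimate.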
Because a complex schema might have tens of thousands of columns, it is impractical to evaluate each column for skew by hand, so Oracle provides an automated method for building histograms as part of the dbms_stats utility. By using the method_opt=>'for all columns size skewonly' option of dbms_stats, you can direct Oracle to automatically create histograms for those columns whose values are heavily skewed.
As a general rule, histograms are used to predict cardinality, that is, the number of rows returned in the result set. For example, assume that we have an index on a product_type column and 70% of the values are for the HARDWARE type. Whenever SQL with where product_type='HARDWARE' is specified, a full-table scan is the fastest execution plan, while a query with where product_type='SOFTWARE' would be fastest using index access.
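Before adding a histogram, the skew itself can be measured with a simple aggregate query. This is a sketch only: the table name product is hypothetical (the text above names only the product_type column), and the percentage column is added for readability.

```sql
-- Sketch: measure how skewed a column is.
-- PRODUCT is a hypothetical table name; PRODUCT_TYPE comes from the example above.
select product_type,
       count(*) as row_count,
       round(100 * ratio_to_report(count(*)) over (), 1) as pct_of_rows
from   product
group  by product_type
order  by row_count desc;
```

If one value accounts for a large share of the rows while others account for very few, the column is a candidate for a histogram.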
Because histograms add overhead to the parsing phase of SQL, you should avoid them unless they are required for a faster optimizer execution plan. However, there are several conditions under which creating histograms is advised:
- When the column is referenced in a query: There is no point in creating histograms if the queries do not reference the column.
- When there is a significant skew in the distribution of column values: This skew should be sufficiently significant that the value in the WHERE clause will make the optimizer choose a different execution plan.
- When the column values cause an incorrect assumption: If the optimizer makes an incorrect guess about the size of an intermediate result set, it may choose a sub-optimal table join method. Adding a histogram to this column will often provide the information required for the optimizer to use the best join method.
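When only one column meets the conditions above, a histogram can be requested for that column alone rather than for the whole schema. The sketch below assumes a hypothetical SCOTT.PRODUCT table with a skewed product_type column; the bucket count of 254 is the maximum in Oracle10g and is used here only as an example value.

```sql
-- Sketch: build a histogram on one skewed column only.
-- SCOTT.PRODUCT and PRODUCT_TYPE are hypothetical names used for illustration.
begin
  dbms_stats.gather_table_stats(
    ownname    => 'SCOTT',
    tabname    => 'PRODUCT',
    method_opt => 'for columns product_type size 254',  -- up to 254 buckets
    cascade    => true                                  -- also gather index statistics
  );
end;
/
```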
So how do you find the columns that are appropriate for histograms? The dbms_stats package can automatically look for columns that should have histograms and create them. Multi-bucket histograms add a huge parsing overhead to SQL statements, so histograms should be used only when the SQL will choose a different execution plan based upon the column value.
To aid in intelligent histogram generation, Oracle uses the method_opt parameter of dbms_stats. Among the options available within method_opt are two important new ones, skewonly and auto:
    method_opt=>'for all columns size skewonly'
    method_opt=>'for all columns size auto'

The first is the "skewonly" option, which is very time-intensive because it examines the distribution of values for every column within every index. If dbms_stats discovers an index with columns that are unevenly distributed, it will create histograms for that index to aid the cost-based SQL optimizer in making a decision about index vs. full-table scan access. For example, if an indexed column has one value that appears in 50% of the rows, a full-table scan is faster than an index scan to retrieve these rows.
Histograms are also used with SQL that has bind variables and SQL with cursor_sharing enabled. In these cases, the optimizer determines whether the column value could affect the execution plan and, if so, replaces the bind variable with a literal and performs a hard parse.
    -- Gather schema statistics, creating histograms only on skewed columns
    begin
       dbms_stats.gather_schema_stats(
          ownname          => 'SCOTT',
          estimate_percent => dbms_stats.auto_sample_size,
          method_opt       => 'for all columns size skewonly',
          degree           => 7
       );
    end;
    /

The auto option is used when monitoring is implemented (alter table xxx monitoring) and creates histograms based upon data distribution and the manner in which the column is accessed by the application (e.g., the workload on the column as determined by monitoring). Using the auto option in method_opt is similar to using gather auto in the options parameter of dbms_stats:
    -- Gather schema statistics, letting Oracle decide where histograms are needed
    begin
       dbms_stats.gather_schema_stats(
          ownname          => 'SCOTT',
          estimate_percent => dbms_stats.auto_sample_size,
          method_opt       => 'for all columns size auto',
          degree           => 7
       );
    end;
    /
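After gathering statistics with either option, it can be useful to confirm which columns actually received histograms. The query below is a sketch that assumes access to the DBA_ dictionary views and the SCOTT schema used in the examples above.

```sql
-- Sketch: list the columns in a schema that currently have histograms.
-- Requires access to the DBA_ views; USER_TAB_COL_STATISTICS (without the
-- owner filter) can be used when querying your own schema.
select table_name,
       column_name,
       num_distinct,
       num_buckets,
       histogram
from   dba_tab_col_statistics
where  owner = 'SCOTT'
and    histogram <> 'NONE'
order  by table_name, column_name;
```

A long list here may point to the unnecessary histograms described earlier, each of which adds to parsing overhead.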
Finding the poorly running SQL
While complex queries may have extremely complex execution plans, most Oracle professionals must tune SQL with the following problems:
- Sub-optimal index access to a table: This problem occurs when the optimizer cannot find an index, or when the most restrictive where clause in the SQL is not matched with an index. When the optimizer cannot find an appropriate index to access table rows, it will always invoke a full-table scan, reading every row in the table. Hence, a full-table scan of a large table might indicate a sub-optimal SQL statement that can be tuned by adding an index that matches the where clause of the query.
- Sub-optimal join methods: The optimizer has many join methods available, including the merge join, nested loop join, hash join and star join. To choose the right join method, the optimizer must guess at the size of the intermediate result sets from multi-way table joins, and it makes this guess with incomplete information. Even if histograms are present, the optimizer cannot know for certain the exact number of rows returned from a join. The most common remedy is to use hints to change the join method (use_nl, use_hash) or to re-analyze the statistics on the target tables.
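Both problems can be investigated from SQL*Plus. The sketch below first looks in the library cache for statements whose plans contain a full-table scan, then shows a join hint applied to a query. It assumes select access to V$SQL_PLAN and V$SQL, and it uses the SCOTT demo tables purely for illustration; neither statement is taken from the book.

```sql
-- Sketch 1: find cached SQL whose plan contains a full-table scan.
-- The SCOTT owner filter is an assumption; adjust it for your schema.
select p.object_owner,
       p.object_name,
       s.executions,
       s.disk_reads,
       s.sql_text
from   v$sql_plan p, v$sql s
where  p.address      = s.address
and    p.hash_value   = s.hash_value
and    p.child_number = s.child_number
and    p.operation    = 'TABLE ACCESS'
and    p.options      = 'FULL'
and    p.object_owner = 'SCOTT'
order  by s.disk_reads desc;

-- Sketch 2: request a hash join between two tables with a hint.
-- The aliases in the hint must match the aliases in the from clause.
select /*+ use_hash(e d) */
       e.ename, d.dname
from   emp e, dept d
where  e.deptno = d.deptno;
```

A full-table scan against a small table is usually harmless, so the first query is most useful when its results are compared against table sizes.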
This is when business knowledge comes in handy. Very frequently, and I'm talking from experience, queries are monstrous because the developer doesn't understand what the goal is and what the data actually means. The old saying that there is no substitute for experience is confirmed here again.
About the author
Kimberly Floss is one of the most-respected Oracle database administrators in the U.S., and is president of the International Oracle Users Group (IOUG). With more than a decade of experience, Kimberly specializes in Oracle performance tuning and is a respected expert in SQL tuning techniques. She is an active member of the Chicago Oracle Users Group, and the Midwest Oracle Users Group, in addition to the IOUG. Kimberly Floss has over 15 years of experience in the information technology industry, with specific focus on relational database technology, including Oracle, DB2, Microsoft SQL Server and Sybase. She holds a bachelor's of science degree in computer information systems from Purdue University, specializing in systems analysis and design, and has an MBA with emphasis in management information systems from Loyola University.