This tip is excerpted from Craig Mullins' new book Database administration: The complete guide to practices and...
procedures. The second and third parts of this series are also available.
Database performance focuses on tuning and optimizing the design, parameters, and physical construction of database objects, specifically tables and indexes, and the files in which their data is stored. The actual composition and structure of database objects must be monitored continually and changed accordingly if the database becomes inefficient. No amount of SQL tweaking or system tuning can optimize the performance of queries run against a poorly designed or disorganized database.
Techniques for Optimizing Databases
The DBA must be cognizant of the features of the DBMS in order to apply the proper techniques for optimizing the performance of database structures. Most of the major DBMSs support the following techniques although perhaps by different names. Each of the following techniques can be used to tune database performance and will be discussed in subsequent sections.
- Partitioning — breaking a single database table into sections stored in multiple files.
- Raw partitions versus file systems — choosing whether to store database data in an OS-controlled file or not.
- Indexing — choosing the proper indexes and options to enable efficient queries.
- Denormalization — varying from the logical design to achieve better query performance.
- Clustering — enforcing the physical sequence of data on disk.
- Interleaving data — combining data from multiple tables into a single, sequenced file.
- Free space — leaving room for data growth.
- Compression — algorithmically reducing storage requirements.
- File placement and allocation — putting the right files in the right place.
- Page size — using the proper page size for efficient data storage and I/O.
- Reorganization — removing inefficiencies from the database by realigning and restructuring database objects.
A database table is a logical manifestation of a set of data that physically resides on computerized storage. One of the decisions that the DBA must make for every table is how to store that data. Each DBMS provides different mechanisms that accomplish the same thing — mapping physical files to database tables. The DBA must decide from among the following mapping options for each table:
- Single table to a single file. This is, by far, the most common choice. The data in the file is formatted such that the DBMS understands the table structure and every row inserted into that table is stored in the same file. However, this setup is not necessarily the most efficient.
- Single table to multiple files. This option is used most often for very large tables or tables requiring data to be physically separated at the storage level. Mapping to multiple files is accomplished by using partitioned tablespaces or by implementing segmented disk devices.
- Multiple tables to a single file. This type of mapping is used for small tables such as lookup tables and code tables, and can be more efficient from a disk utilization perspective.
Partitioning helps to accomplish parallelism. Parallelism is the process of using multiple tasks to access the database in parallel. A parallel request can be invoked to use multiple, simultaneous read engines for a single SQL statement. Parallelism is desirable because it can substantially reduce the elapsed time for database queries.
Multiple types of parallelism are based on the resources that can be invoked in parallel. For example, a single query can be broken down into multiple requests each utilizing a different CPU engine in parallel. In addition, parallelism can be improved by spreading the work across multiple database instances. Each DBMS offers different levels of support for parallel database queries. To optimize database performance, the DBA should be cognizant of the support offered in each DBMS being managed and exploit the parallel query capabilities.
Raw Partition vs. File System
For a UNIX-based DBMS environment, the DBA must choose between a raw partition and using the UNIX file system to store the data in the database. A raw partition is the preferred type of physical device for database storage because writes are cached by the operating system when a file system is utilized. When writes are buffered by the operating system, the DBMS does not know whether the data has been physically copied to disk or not. When the DBMS cache manager attempts to writes the data to disk, the operating system may delay the write until later because the data may still be in the file system cache. If a failure occurs, data in a database using the file system for storage may not be 100% recoverable. This is to be avoided.
If a raw partition is used instead, the data is written directly from the database cache to disk with no intermediate file system or operating system caching, as shown in Figure 11-1. When the DBMS cache manager writes the data to disk, it will physically be written to disk with no intervention. Additionally, when using a raw partition, the DBMS will ensure that enough space is available and write the allocation pages. When using a file system, the operating system will not preallocate space for database usage.
Figure 11-1 Using raw partitions to avoid file system caching
From a performance perspective, there is no advantage to having a secondary layer of caching at the file system or operating system level; the DBMS cache is sufficient. Actually, the additional work required to cache the data a second time consumes resources, thereby negatively impacting the overall performance of database operations. Do not supplement the DBMS cache with any type of additional cache.
Creating the correct indexes on tables in the database is perhaps the single greatest performance tuning technique that a DBA can perform. Indexes are used to enhance performance. Indexes are particularly useful for
- Locating rows by value(s) in column(s)
- Making joins more efficient (when the index is defined on the join columns)
- Correlating data across tables
- Aggregating data
- Sorting data to satisfy a query
Without indexes, all access to data in the database would have to be performed by scanning all available rows. Scans are very inefficient for very large tables.
Designing and creating indexes for database tables actually crosses the line between database performance tuning and application performance tuning. Indexes are database objects created by the DBA with database DDL. However, an index is built to make SQL statements in application programs run faster. Indexing as a tuning effort is applied to the database to make applications more efficient when the data access patterns of the application vary from what was anticipated when the database was designed.
Before tuning the database by creating new indexes, be sure to understand the impact of adding an index. The DBA should have an understanding of the access patterns of the table on which the index will be built. Useful information includes the percentage of queries that access rather than update the table, the performance thresholds set within any service level agreements for queries on the table, and the impact of adding a new index to running database utilities such as loads, reorganizations, and recovery.
One of the big unanswered questions of database design is: "How many indexes should be created for a single table?" There is no set answer to this question. The DBA will need to use his expertise to determine the proper number of indexes for each table such that database queries are optimized and the performance of database inserts, updates, and deletes does not degrade. Determining the proper number of indexes for each table requires in-depth analysis of the database and the applications that access the database.
The general goal of index analysis is to use less I/O to the database to satisfy the queries made against the table. Of course, an index can help some queries and hinder others. Therefore, the DBA must assess the impact of adding an index to all applications and not just tune single queries in a vacuum. This can be an arduous but rewarding task.
An index affects performance positively when fewer I/Os are used to return results to a query. Conversely, an index negatively impacts performance when data is updated and the indexes have to be changed as well. An effective indexing strategy seeks to provide the greatest reduction in I/O with an acceptable level of effort to keep the indexes updated.
Some applications have troublesome queries that require significant tuning to achieve satisfactory performance. Creating an index to support a single query is acceptable if that query is important enough in terms of ROI to the business (or if it is run by your boss or the CEO). If the query is run infrequently, consider creating the index before the process begins and dropping the index when the process is complete.
Whenever you create new indexes, be sure to thoroughly test the performance of the queries it supports. Additionally, be sure to test database modification statements to gauge the additional overhead of updating the new indexes. Review the CPU time, elapsed time, and I/O requirements to assure that the indexes help. Keep in mind that tuning is an iterative process, and it may take time and several index tweaks to determine the impact of a change. There are no hard and fast rules for index creation. Experiment with different index combinations and measure the results.
When to Avoid Indexing
There are a few scenarios where indexing may not be a good idea. When tables are very small, say less than ten pages, consider avoiding indexes. Indexed access to a small table can be less efficient than simply scanning all of the rows because reading the index adds I/O requests.
Index I/O notwithstanding, even a small table can sometimes benefit from being indexed — for example, to enforce uniqueness or if most data access retrieves a single row using the primary key.
You may want to avoid indexing variable-length columns if the DBMS in question expands the variable column to the maximum length within the index. Such expansion can cause indexes to consume an inordinate amount of disk space and might be inefficient. However, if variable-length columns are used in SQL WHERE clauses, the cost of disk storage must be compared to the cost of scanning. Buying some extra disk storage is usually cheaper than wasting CPU resources to scan rows. Furthermore, the SQL query might contain alternate predicates that could be indexed instead of the variable-length columns.
Additionally, avoid indexing any table that is always accessed using a scan, that is, the SQL issued against the table never supplies a WHERE clause.
Query performance can be enhanced in certain situations by overloading an index with additional columns. Indexes are typically based on the WHERE clauses of SQL SELECT statements. For example, consider the following SQL statement.
select emp_no, last_name, salary
where salary > 15000.00
Creating an index on the salary column can enhance the performance of this query. However, the DBA can further enhance the performance of the query by overloading the index with the emp_no and last_name columns, as well. With an overloaded index, the DBMS can satisfy the query by using only the index. The DBMS need not incur the additional I/O of accessing the table data, since every piece of data that is required by the query exists in the overloaded index.
DBAs should consider overloading indexes to encourage index-only access when multiple queries can benefit from the index or when individual queries are very important.
This tip is excerpted from Craig Mullins' new book Database administration: The complete guide to practices and procedures. The second and third parts of this series are also available.
For more information
- Read Craig Mullins' new book Database administration: The complete guide to practices and procedures.
- Craig's webcast Database Administration 101.
- Craig answers your tough DB2 questions.
- More great DBA, Oracle, SQL Server, and DB2 tech tips.