This tip is excerpted from Craig Mullins' new book Database administration: The complete guide to practices and...
procedures. The first and second parts of this series are also available.
Free space, sometimes called fill factor, can be used to leave a portion of a tablespace or index empty and available to store newly added data. The specification of free space in a tablespace or index can reduce the frequency of reorganization, reduce contention, and increase the efficiency of insertion. Each DBMS provides a method of specifying free space for a database object in the CREATE and ALTER statements. A typical parameter is PCTFREE, where the DBA specifies the percentage of each data page that should remain available for future inserts. Another possible parameter is FREEPAGE, where the DBA indicates the specified number of pages after which a completely empty page is available.
Ensuring a proper amount of free space for each database object provides the following benefits:
- Inserts are faster when free space is available.
- As new rows are inserted, they can be properly clustered.
- Variable-length rows and altered rows have room to expand, potentially reducing the number of relocated rows.
- Fewer rows on a page results in better concurrency because less data is unavailable to other users when a page is locked.
However, free space also has several disadvantages.
- Disk storage requirements are greater.
- Scans take longer.
- Fewer rows on a page can require more I/O operations to access the requested information.
- Because the number of rows per page decreases, the efficiency of data caching can decrease because fewer rows are retrieved per I/O.
The DBA should monitor free space and ensure that the appropriate amount is defined for each database object. The correct amount of free space must be based on
- Frequency of inserts and modifications
- Amount of sequential versus random access
- Impact of accessing unclustered data
- Type of processing
- Likelihood of row chaining, row migration, and page splits
Don't define a static table with free space — it will not need room in which to expand.
Compression can be used to shrink the size of a database. By compressing data, the database requires less disk storage. Some DBMSs provide internal DDL options to compress database files; third-party software is available for those that do not provide such features.
When compression is specified, data is algorithmically compressed upon insertion into the database and decompressed when it is read. Reading and writing compressed data consumes more CPU resources than reading and writing uncompressed data: The DBMS must execute code to compress and decompress the data as users insert, update, and read the data.
So why compress data? Consider an uncompressed table with a row size of 800 bytes. Five of this table's rows would fit in a 4K data page (or block). Now what happens if the data is compressed? Assume that the compression routine achieves 30% compression on average (a very conservative estimate). In that case, the 800-byte row will consume only 560 bytes (800 x 0.30 = 560). After compressing the data, seven rows will fit on a 4K page. Because I/O occurs at the page level, a single I/O will retrieve more data, which will optimize the performance of sequential data scans and increase the likelihood of data residing in the cache because more rows fit on a physical page.
Of course, compression always requires a trade-off that the DBA must analyze. On the positive side, we have disk savings and the potential for reducing I/O cost. On the negative side, we have the additional CPU cost required to compress and decompress the data.
However, compression is not an option for every database index or table. For smaller amounts of data, it is possible that a compressed file will be larger than an uncompressed file. This is so because some DBMSs and compression algorithms require an internal dictionary to manage the compression. The dictionary contains statistics about the composition of the data that is being compressed. For a trivial amount of data, the size of the dictionary may be greater than the amount of storage saved by compression.
File Placement and Allocation
The location of the files containing the data for the database can have a significant impact on performance. A database is very I/O intensive, and the DBA must make every effort to minimize the cost of physical disk reading and writing.
This discipline entails
- Understanding the access patterns associated with each piece of data in the system
- Placing the data on physical disk devices in such a way as to optimize performance
The first consideration for file placement on disk is to separate the indexes from the data, if possible. Database queries are frequently required to access data from both the table and an index on that table. If both of these files reside on the same disk device, performance degradation is likely. To retrieve data from disk, an arm moves over the surface of the disk to read physical blocks of data on the disk. If a single operation is accessing data from files on the same disk device, latency will occur; reads from one file will have to wait until reads from the other file are processed. Of course, if the DBMS combines the index with the data in the same file, this technique cannot be used.
Another rule for file placement is to analyze the access patterns of your applications and separate the files for tables that are frequently accessed together. The DBA should do this for the same reason he should separate index files from table files.
A final consideration for placing files on separate disk devices occurs when a single table is stored in multiple files (partitioning). It is wise in this case to place each file on a separate disk device to encourage and optimize parallel database operations. If the DBMS can break apart a query to run it in parallel, placing multiple files for partitioned tables on separate disk devices will minimize disk latency.
Database Log Placement
Placing the transaction log on a separate disk device from the actual data allows the DBA to back up the transaction log independently from the database. It also minimizes dual writes to the same disk. Writing data to two files on the same disk drive at the same time will degrade performance even more than reading data from two files on the same disk drive at the same time. Remember, too, every database modification (write) is recorded on the database transaction log.
Distributed Data Placement
The goal of data placement is to optimize access by reducing contention on physical devices. Within a client/server environment, this goal can be expanded to encompass the optimization of application performance by reducing network transmission costs.
Data should reside at the database server where it is most likely, or most often, to be accessed. For example, Chicago data should reside at the Chicago database server, Los Angeles-specific data should reside at the Los Angeles database server, and so on. If the decision is not so clear-cut (e.g., San Francisco data, if there is no database server in San Francisco), place the data on the database server that is geographically closest to where it will be most frequently accessed (in the case of San Francisco, L.A., not Chicago).
Be sure to take fragmentation, replication, and snapshot tables into account when deciding upon the placement of data in your distributed network.
The DBMS may require disk devices to be allocated for database usage. If this is the case, the DBMS will provide commands to initialize physical disk devices. The disk initialization command will associate a logical name for a physical disk partition or OS file. After the disk has been initialized, it is stored in the system catalog and can be used for storing table data.
Before initializing a disk, verify that sufficient space is available on the physical disk device. Likewise, make sure that the device is not already initialized.
Use meaningful device names to facilitate more efficient usage and management of disk devices. For example, it is difficult to misinterpret the usage of a device named DUMP_DEV1 or TEST_DEV7. However, names such as XYZ or A193 are not particularly useful. Additionally, maintain documentation on initialized devices by saving script files containing the actual initialization commands and diagrams indicating the space allocated by device.
Page Size (Block Size)
Most DBMSs provide the ability to specify a page, or block, size. The page size is used to store table rows (or more accurately, records that contain the row contents plus any overhead) on disk. For example, consider a table requiring rows that are 125 bytes in length with 6 additional bytes of overhead. This makes each record 131 bytes long. To store 25 records on a page, the page size would have to be at least 3275 bytes. However, each DBMS requires some amount of page overhead as well, so the practical size will be larger. If page overhead is 20 bytes, then the page size would be 3295 — that is, 3275 + 20 bytes of overhead.
This discussion, however, is simplistic. In general practice, most tablespaces will require some amount of free space to accommodate new data. Therefore, some percentage of free space will need to be factored into the above equation.
To complicate matters, many DBMSs limit the page sizes that can be chosen. For example, DB2 for OS/390 limits page size to 4K, 8K, 16K, or 32K. In this case, the DBA will need to calculate the best page size based on row size, the number of rows per page, and free space requirements.
Consider this question: "In DB2 for OS/390, what page size should be chosen if 0% free space is required and the record size is 2500 bytes?"
The simplistic answer is 4K, but it might not be the best answer. A 4K page would hold one 2500-byte record per page, but an 8K page would hold three 2500-byte records. The 8K page would provide for more efficient I/O, because reading 8K of data would return three rows, whereas reading 8K of data using two 4K pages would return only two rows.
Choosing the proper page size is an important DBA task for optimizing database I/O performance.
Relational technology and SQL make data modification easy. Just issue an INSERT, UPDATE, or DELETE statement with the appropriate WHERE clause, and the DBMS takes care of the actual data navigation and modification. In order to provide this level of abstraction, the DBMS handles the physical placement and movement of data on disk. Theoretically, this makes everyone happy. The programmer's interface is simplified, and the RDBMS takes care of the hard part — manipulating the actual placement of data. However, things are not quite that simple. The manner in which the DBMS physically manages data can cause subsequent performance problems.
Every DBA has encountered the situation where a query or application that used to perform well slows down after it has been in production for a while. These slowdowns have many potential causes — perhaps the number of transactions issued has increased, or the volume of data has expanded. However, the performance problem might be due to database disorganization. Database disorganization occurs when a database's logical and physical storage allocations contain many scattered areas of storage that are too small, not physically contiguous, or too disorganized to be used productively. Let's review the primary culprits.
- The first possibility is unclustered data. If the DBMS does not strictly enforce clustering, a clustered table or index can become unclustered as data is added and changed. If the data becomes significantly unclustered, the DBMS cannot rely on the clustering sequence. Because the data is no longer clustered, queries that were optimized to access data cannot take advantage of the clustering sequence. In this case, the performance of queries run against the unclustered table will suffer.
- Fragmentation is a condition in which there are many scattered areas of storage in a database that are too small to be used productively. It results in wasted space, which can hinder performance because additional I/Os are required to retrieve the same data.
- Row chaining or row migration occurs when updated data does not fit in the space it currently occupies, and the DBMS must find space for the row. With row chaining, the DBMS moves a part of the new, larger row to a location within the tablespace where free space exists. With row migrations, the full row is placed elsewhere in the tablespace. In each case, a pointer is used to locate either the rest of the row or the full row. Both row chaining and row migration will result in the issuance of multiple I/Os to read a single row. Performance will suffer because multiple I/Os are more expensive than a single I/O.
- Page splits can cause disorganized databases, too. If the DBMS performs monotonic page splits when it should perform normal page splits, or vice versa, space may be wasted. When space is wasted, fewer rows exist on each page, causing the DBMS to issue more I/O requests to retrieve data. Therefore, once again, performance suffers.
- File extents can negatively impact performance. An extent is an additional file that is tied to the original file and can be used only in conjunction with the original file. When the file used by a tablespace runs out of space, an extent is added for the file to expand. However, file extents are not stored contiguously with the original file. As additional extents are added, data requests will need to track the data from extent to extent, and the additional code this requires is unneeded overhead. Resetting the database space requirements and reorganizing can clean up file extents.
Let's take a look at a disorganized tablespace by comparing Figures 11-4 and 11-5. Assume that a tablespace consists of three tables across multiple blocks, such as the tablespace and tables depicted in Figure 11-4. Each box represents a data page.
Figure 11-4 Organized tablespace
Figure 11-5 Disorganized tablespace
Now, let's make a couple of changes to the data in these tables. First, we'll add six rows to the second table. However, no free space exists into which these new rows can be stored. How can the rows be added? The DBMS requires an additional extent to be taken into which the new rows can be placed. This results in fragmentation: The new rows have been placed in a noncontiguous space. For the second change, let's update a row in the first table to change a variable-length column; for example, let's change the value in a LASTNAME column from WATSON to BEAUCHAMP. Issuing this update results in an expanded row size because the value for LASTNAME is longer in the new row: "BEAUCHAMP" contains 9 characters whereas "WATSON" only consists of 6. This action results in row chaining. The resultant tablespace shown in Figure 11-5 depicts both the fragmentation and the row chaining.
Depending on the DBMS, there may be additional causes of disorganization. For example, if multiple tables are defined within a tablespace, and one of the tables is dropped, the tablespace may need to be reorganized to reclaim the space.
To correct disorganized database structures, the DBA can run a database or tablespace reorganization utility, or REORG, to force the DBMS to restructure the database object, thus removing problems such as unclustered data, fragmentation, and row chaining. The primary benefit of reorganization is the resulting speed and efficiency of database functions because the data is organized in a more optimal fashion on disk. In short, reorganization maximizes availability and reliability for databases.
Tablespaces and indexes both can be reorganized. How the DBA runs a REORG utility depends on the DBMS. Some DBMS products ship with a built-in reorganization utility. Others require the customer to purchase the utility. Still others claim that the customer will not need the utility at all when using their DBMS. I have found the last claim to be untrue. Every DBMS incurs some degree of disorganization as data is added and modified.
Of course, DBAs can manually reorganize a database by completely rebuilding it. However, accomplishing such a reorganization requires a complex series of steps. Figure 11-6 depicts the steps entailed by a manual reorganization.
Figure 11-6 Typical steps for a manual reorganization
If a utility is available for reorganizing, either from the DBMS vendor or a third=party vendor, the process is greatly simplified. Sometimes the utility is as simple as issuing a simple command such as
REORG TABLESPACE TSNAME
A traditional reorganization requires the database to be down. The high cost of downtime creates pressures both to perform and to delay preventive maintenance — a no-win situation familiar to most DBAs. Some REORG utilities are available that perform the reorganization while the database is online. Such a reorganization is accomplished by making a copy of the data. The online REORG utility reorganizes the copy while the original data remains online. When the copied data has been reorganized, an online REORG uses the database log to "catch up" by applying to the copy any data changes that occurred during the process. When the copy has caught up to the original, the online REORG switches the production tablespace from the original to the copy. Performing an online reorganization requires additional disk storage and a slow transaction window. If a large number of transactions occur during the online reorganization, REORG may have a hard time catching up.
Determining When to Reorganize
System catalog statistics can help to determine when to reorganize a database object. Each DBMS provides a method of reading through the contents of the database and recording statistical information about each database object. Depending on the DBMS, this statistical information is stored either in the system catalog or in special pages within the database object itself.
One statistic that can help a DBA determine when to reorganize is cluster ratio. Cluster ratio is the percentage of rows in a table that are actually stored in a clustering sequence. The closer the cluster ratio is to 100%, the more closely the actual ordering of the rows on the data pages matches the clustering sequence. A low cluster ratio indicates bad clustering, and a reorganization may be required. A low cluster ratio, however, may not be a performance hindrance if the majority of queries access data randomly instead of sequentially.
Tracking down the other causes of disorganization can sometimes be difficult. Some DBMSs gather statistics on fragmentation, row chaining, row migration, space dedicated to dropped objects, and page splits; others do not. Oracle provides a plethora of statistics in dynamic performance tables that can be queried. Refer to the sidebar "Oracle Dynamic Performance Tables" for more details.
Oracle Dynamic Performance Tables
Oracle stores vital performance statistics about the database system in a series of dynamic performance tables. These tables are sometimes referred to as the "V$ tables" because the table names are prefixed with the characters V$.
The V$ tables are used by the built-in Oracle performance monitoring facilities and can be queried by the DBA for insight into the well-being and performance of an Oracle instance. Examples of some of the statistics that can be found in the V$ tables include
- Free space available
- Chained rows
- Rollback segment contention
- Memory usage
- Disk activity
Of course, there is quite a lot of additional performance information to be found in these tables. Oracle DBAs should investigate the V$ tables and query these tables regularly to analyze the performance of the Oracle system, its databases, and applications.
Tablespaces are not the only database objects that can be reorganized. Indexes, too, can benefit from reorganization. As table data is added and modified, the index too must be changed. Such changes can cause the index to become disorganized.
A vital index statistic to monitor is the number of levels. Recall from Chapter 4 that most relational indexes are b-tree structures. As data is added to the index, the number of levels of the b-tree will grow. When more levels exist in the b-tree, more I/O requests are required to move from the top of the index structure to the actual data that must be accessed. Reorganizing an index can cause the index to be better structured and require fewer levels.
Another index statistic to analyze to determine if reorganization is required is the distance between the index leaf pages, or leaf distance. Leaf distance is an estimate of the average number of pages between successive leaf pages in the index. Gaps between leaf pages can develop as data is deleted from an index or as a result of page splitting. Of course, the best value for leaf distance is zero, but achieving a leaf distance of zero in practice is not realistic. In general, the lower this value, the better. Review the value over time to determine a high-water mark for leaf distance that will indicate when indexes should be reorganized.
If possible, the DBA should look into using the database utilities or third-party tools to automate reorganizations. The automation tool can be used to query the database statistic and trigger reorganization only for those database objects that have surpassed the high-water mark for a combination of statistics. For example, the DBA may want to automatically reorganize all tablespaces where the cluster ratio is below 85% and all indexes where the leaf distance has a value greater than 100.
Reorganizations can be costly in terms of downtime and computing resources. Furthermore, determining when reorganization will actually create performance gains can be difficult. However, the performance gains that can be accrued are tremendous when data is unclustered, fragmented, or disorganized in any way. The wise DBA will plan for and schedule reorganizations to resolve disorganization problems in their database systems.
Applications and data are constantly changing. Users require instant response time and 24/7 availability. The database structures that support these applications must be properly maintained to ensure optimal application performance. Proper database design, appropriate choice of clustering, and reorganizing databases based on statistics help to deliver efficient databases. Furthermore, DBAs can ensure database performance by automating these processes to reduce the risk and errors associated with manual database maintenance.
This tip is excerpted from Craig Mullins' new book Database administration: The complete guide to practices and procedures. The first and second parts of this series are also available.
For more information
- Read Craig Mullins' new book Database administration: The complete guide to practices and procedures.
- Craig's webcast Database Administration 101.
- Craig answers your tough DB2 questions.
- More great DBA, Oracle, SQL Server, and DB2 tech tips.