The fundamental objective for a relational database management system (RDBMS) is to keep data independent of the applications or analysis that use it. (Date, C. J., An Introduction to Database Systems, Seventh Edition, Chapter 1) There are two reasons for this:
- Different applications require different views of the same data.
- Data must be extendable and new applications and analysis introduced without having to modify existing applications.
Applications implemented in pre-relational database systems tend to be data dependent, meaning that the physical representation of the data and the methods of accessing it are built into the application code. The main objective of the RDBMS, which has come to dominate the database industry, is data independence. For years, the RDBMSs employed to run businesses (often called on-line transaction processing, or OLTP, systems) made data independence obligatory. Data was stored in tables that eliminated redundancy and demanded that every column of the table be rigorously related to it alone and to no other tables. This ensured that information was available to all applications and analysis that used it, and it provided for consistency and reliability across applications because there was a single source of each particular data element.
Data independence worked well for OLTP systems because the applications accessing the data generally accessed single tables or joined together only a few, and the queries were relatively simple. With the introduction of the relational data warehouse, suddenly huge demands were placed on the relational systems. Large tables were scanned and large result sets were returned, many tables were joined together, calculations were made and detailed data aggregated directly in the queries. In addition, large data volumes were being extracted, transformed and loaded into the tables while users running queries were accessing the data. Databases that had been created and tuned for OLTP could not handle the demands of business intelligence, with the main issue being performance. The OLTP databases could not perform the queries in time or, in some cases, at all.
The key to data independence is data normalization, and normalized data schemas are the most performance demanding. To address the issue of performance in many environments, data independence was abandoned and denormalized schemas were created to address particular analytical needs. Although this addressed short-term decision support needs, it compromised adaptability and the enterprise view. Data independence, adaptability and cross-enterprise functionality go hand in hand, and the normalized data schema is critical to reaching these objectives. Here's why.
1. The star schema--the most common form of denormalization used in a data warehouse--cannot handle every kind of relationship found in a normalized environment, e.g., recursive relationships.
A recursive relationship exists when the parent of a member in a hierarchical relationship is a member of the same entity. There are two ways that this can manifest itself--allowing only a single parent or allowing multiple parents:
In a recursive structure, we can have an unlimited number of levels without knowing how many levels each member hierarchy will have. One hierarchy may be only two levels deep, and another 15. Herein lies the limitation of the star schema. The star schema demands a fixed number of levels because each level is set up by a series of fixed columns in a dimension table. Because we do not know the number of levels in a recursive structure, we cannot predefine the columns.
Often the most critical entities in the data model have recursive structures. Organizational hierarchies (e.g. internal, customer, supplier, competitor) usually are recursive. Other examples of typical recursive structures include employee tables (because employees have managers who are also employees), bill of material and project-work breakdown structures.
2. It is often argued that denormalization is better for the end user than third normal form. However, usability, when separated from performance, actually favors normalization.
Denormalized schemas certainly seem more user-friendly than the complexity of a highly generalized, normalized data model. However, denormalization is driven by known queries, so the ease is only an illusion. The normalized model will allow more flexible new analysis. In any case, users are not typically writing queries; this is often the responsibility of a tool or dedicated application developer. Moreover, views can still make the normalized data model easier to navigate. Few DBAs allow users to query base tables directly, so DBAs can create views from the base tables that look exactly like their denormalized table counterparts.
If there were no issues of performance, all denormalization could be handled in database views. Star schemas, snowflakes, summary tables, derived data, etc. could be built as virtual clusters of tables that look exactly like their physical counterparts. By handling denormalization virtually, the relationships within the underlying base tables remain intact, and referential integrity is supported regardless of how many denormalized relationships are created. This frees the DBA to create any number of denormalized database views for users while maintaining data integrity and eliminating the redundant data requirements of denormalized models.
Database views actually are better for end users than denormalized data because the DBA can create virtual, subject-oriented schemas for specific applications without destroying the underlying base-table data. Views can be created to apply security constraints for groups of business users.
A physical star schema has physical dimensions that support a physical fact table. However, for some dimensions there may be mutually exclusive substitutes for the same data. For example, an airline is interested in both point-to-point travel between segments and between the true origins and destinations (O&D) of their customers. The true O&D dimension is different from the segment O&D, although it looks the same. Moreover, the consolidation of facts is different as well, although the detailed base table data is the same. If the star schemas are physicalized, two very large schemas need to be maintained and coordinated, whereas with virtual star schemas, the data is maintained only in the base tables.
3. It also is argued that denormalization offers better performance than third normal form. However, to address performance, the first consideration should not be to sacrifice the data architecture; rather, it should be to acquire a database engine that can handle a normalized model and scale linearly. If there still is a need to improve performance, aside from scaling the database, then implement a well-planned data-propagation strategy that maintains and complements the underlying normalized base table substructure.
To begin, consider propagating denormalized data within the same database instance. Look at propagating to another environment only if there are other considerations for the target data source, such as geographic needs or the need to support proprietary data structures. In any case, the propagation is from the data warehouse and not directly from source systems.
There are three reasons for this strategy:
- The full parallel capabilities of the data warehouse can be used to propagate data.
- By keeping the data in the same instance of the database, hybrid queries can be implemented that take advantage of both the denormalized and normalized data. For example, large volumes of nonvolatile data from transaction detail can be propagated into new physical fact tables, and smaller volume, highly volatile dimensional data can be built into virtual dimension tables.
- Administration of the complete data warehousing environment will be easier and less expensive.
4. A denormalized model creates ambiguity, so that within a denormalized table it is impossible to tell which columns are related. Normalized models maintain all relationships through the association of primary keys with foreign keys.
Consider, for example, the following denormalized table: Commodity_ID (Primary Key), Commodity_Name, Group_ID, Group_Name, Family_ID, Family_Name and Sequence.
A user viewing this table does not know the relationships among Commodity_ID, Group_ID and Family_ID. It might be that the commodity has separate relationships with groups and families, and it might be that the commodity is related to groups, which are related to families. Moreover, Sequence is ambiguous. Is it used to order commodities in groups, commodities in families or groups within families? It is surprising how often such dangling relationships appear in denormalized models. In contrast, normalized models maintain all relationships through the association of primary keys with foreign keys. Users can clearly discern the relationships simply by looking at the data model.
5. Referential integrity cannot be effectively maintained within a denormalized structure.
Denormalized structures compromise data quality. An RDBMS supports referential integrity to ensure that when foreign keys exist, the instances of the objects to which they refer also exist. Denormalized tables cannot support referential integrity. For example, business users have no guarantee that the children in a denormalized table have "real" parents in the same table, unless some other, more costly method of programmatically ensuring data integrity has been used.
Also, as new incremental data loads are inserted, changed fields will not be reflected in the old data. When an attribute is changed in the source system, only the new records loaded will reflect the changes. For example, suppose the marital status, last name or another field in the customer table changed. Any new loads for that customer will reflect the different data in the changed fields, causing inconsistencies in the data and false results in certain types of analysis because both the changed and unchanged attributes are stored redundantly.
6. The normalized structure with denormalized views is far more flexible in creating and managing dimensional consolidation hierarchies for on-line analytical processing (OLAP) models than the star schema.
Typically, modelers who use denormalized methods define a single set of mutually exclusive dimensions for an OLAP model. In reality, though, we can create many nonexclusive dimensional paths for a single model, which also are reusable across many models. The next figure shows two denormalized virtual schemas that share a number of common dimension views:
There are no limits to the number of virtual schemas that can be created and aligned with particular users. As time goes on and DBAs become more sophisticated, they will be able to manage an extensive number of database views tailored toward the specific needs of individuals and groups of users. It is not possible to do such extensive schema management when the data must be physically propagated for performance reasons.
Consider another example: eight potential dimensional consolidation paths could arise from three entities (households, customers and accounts).
1. Households 2. Customers 3. Accounts 4. Households ............. • Customers ............. ............. • Accounts 5. Households ............. • Accounts ............. ............. • Customers 6. Households ............. • Customers 7. Households ............. • Accounts 8. Customers ............. • Accounts
When these paths or the previously mentioned schemas are created virtually, we have no difficulty managing referential integrity because we need to maintain the data only in the base tables. However, when these are created as separate physical tables, maintaining referential integrity across alternate dimensional paths becomes difficult. As we add more dimensional entities and potentially hundreds more dimensional consolidation paths, it becomes impossible to continue to physically propagate data. As a result, if the physical star schema is used, then compromises will be made.
7. A star schema is designed for dimensional analysis. Yet, not all analysis is dimensional. Many types of quantitative analysis, such as data mining, statistical analysis and case-based reasoning, will be inhibited by the star schema design.
If we consider the household/customer/ account example in a different light, we find an interesting problem with denormalization. An OLAP modeler, focused on creating dimensional views and drill downs, probably would assume that a household could have many customers but that the customer belonged to only one household. However, the underlying normalized data model may reveal a many-to-many relationship between households and customers. Over time, a customer may belong to many households.
The OLAP modeler typically would not be concerned with this relationship because it is unlikely that users would want to drill down into previous households. However, the data miner might find tremendous value in examining this relationship for patterns of behavior that revolve around the changing composition of households: Children leave their parents' household to start their own; couples marry and form one household from two; couples divorce and form two households from one; etc. We need a logical model to address the precise needs of any analytical technique (OLAP, data mining and others).
8. Not all levels in a dimensional model are balanced. A star schema dimension table, which requires fixed columns for each level, has considerable difficulty in handling unbalanced hierarchies.
Unbalanced or ragged hierarchies (see next figure) are hierarchies in which a particular level in the hierarchy can have its parent at different levels above it. An account executive, for example, might report to a region or a sub-region.
Because the star schema creates specific columns to handle each level, it requires that unbalanced hierarchies be balanced so that the account executive is always at level three in this example.
9. Many types of analysis are driven by the normalized relationships in the RDBMS.
There is a considerable amount of analysis within a data warehouse that is not quantitative at all. For example, market-basket studies are based on relational algebra.
Tracking the cycle times for various entities in a data warehouse requires a sophisticated method of tracking the relationships between statuses and events. Modeling strategic frameworks (such as the value chain, supply chain management, competency models, industry structure analysis and total quality management) requires sophisticated relational models for which denormalized approaches are inadequate.
10. In the long term, the actual cost of maintaining a denormalized environment for an enterprise data warehouse will exceed the costs of a normalized environment, and the opportunity cost of decreased cross-functional information will be much greater.
There are economic costs associated with denormalization that often are not considered. In a star schema, each row in a dimension table contains all of the attributes of every entity mapped within it. In the accounts/customers/households example, we would carry all the redundant household data and all the redundant customer data for each account. Not only does the redundancy from the expansion of columns exist, but in the case of the many-to-many relationship between accounts and customers, the number of rows also would increase. A row for each legitimate account/customer combination would be required. When there are millions of accounts, thousands of which are jointly held, this horizontal and vertical redundancy can add considerable storage overhead.
These costs generally are not significant in light of performance gains, but they are significant in light of additional DBA costs to programmatically maintain referential integrity.
More important is the cost of lost opportunity due to compromise. The total benefits are much greater for the normalized approach because of its adaptability. As an enterprise begins to define its business analysis requirements it will initially identify only a fraction of what it ultimately will need. Possibly as much as 95% of its real needs will go undefined. The key is flexibility. By building an adaptable structure from the beginning, the enterprise will be able to address new business needs as they are identified without having to compromise or restructure the database. They will also be able to address these new challenges in the quickest time frame as they do not need to involve IT in designing, building and propagating data for the new queries. The faster a company can respond, the higher the value that can be achieved.
This brings up another related point-normalizing for a data warehouse where adaptability is important requires a high degree of generalization in abstracting the database. Generalization provides flexibility as unanticipated changes are made.
11. Cross-functional analysis becomes increasingly difficult as more and more denormalized models are added.
Data warehousing provides a means to build complex interrelated models for cross-subject area analysis in ways that no other system can. These models can move beyond the traditional financial measures. They can begin to interrelate internal process measures and customer-oriented measures as well. More importantly, they can begin to push from results-oriented or outcome-oriented measures toward those directly linked to organizational activities. The next figure shows an example of interrelated measures from a credit-card model.
The problem with denormalizing is that measures in this example will cross many different denormalized schemas. Star joins are extremely cumbersome, and creating joins across many models makes them extremely complex. In this model we might have four or more distinct star schema structures.
12. The tendency to compromise is much greater for those building denormalized models than for those building third normal form models.
While working with individuals using denormalized methods and raising issues of future growth and applications that a denormalized approach will not support, frequent responses are, "We will never do that." As business discovery proceeds, it becomes clear these opinions do not reflect the requirements and vision of the business.
13. Not only is normalization a consideration, but normalization tailored specifically for business intelligence also is important. Often third normal form data warehouses are built as if they were OLTP systems.
Designers often fail to consider:
- The integration of subject areas
- Versioning over time
- The need to generalize certain entities in anticipation of change
- Life cycles interrelated by many cross-subject area entities and events
- The integration of measures, calculations and dimensional context paths across the enterprise
These reasons show the immediate and long-term benefits that a normalized data model provides. The normalized model provides the structure for continued growth and greater insight for business users. Do not compromise your model when you select tools to extend the value and deliver information to end users. When necessary, to facilitate ease of use or to improve tool performance, views will help ensure the continued integrity of your data model. When evaluating what type of data model you want to install in your data warehouse, make sure the model will support your business today and well into the future.
About the Authors
Tom Russell is a principal consultant with Teradata. He has more than 15 years of experience in the project management, design, development and implementation of data warehousing systems. His experience spans the complete data warehousing process, with particular emphasis on design and planning, including the development of data warehousing architectures and processes and the design of object-oriented applications. Rob Armstrong is director of Teradata's Strategic Warehousing and Technologies team. With more than 15 years in the justification, design and implementation of some of the world's most profitable data warehouse systems, Rob is well-versed in both the business and technical aspects of such systems. He also co-authored "Secrets of the Best Data Warehouses in the World."
For More Information
- What do you think about this article? E-mail the Editor at [email protected] with your feedback.
- The Best Database Web Links: tips, tutorials, scripts, and more.
- Have a tip to offer your fellow DBAs and developers? The best tips submitted will receive a cool prize--submit your tip today!
- Ask your technical admin questions--or help out your peers by answering them--in our live discussion forums.
- Ask the Experts: Our industry gurus are waiting to answer your toughest technical questions.