There are (at least) four different ways to design your data warehouse/data mart environment. Which one you chose to use should not really make a difference. Having stated that, I expect a lot of hate mail. That's fine. Go ahead. But wait until after you have read the rest of the story.
The four ways to design your data warehouse/data mart (DW/DM):
- Normalized Data Structures
- Multi-Dimensional Database Structures (cubes, etc.)
- Star Schemas (a group of stars are constellations)
- Snowflake Schemas (a group of snowflakes make a snow storm)
While there may be other terms out in the marketplace, they are not really the strong players and are most likely a combination of the ones above.
When designing your DW/DM, you need to take into account many things:
- direct access to the data by end-users
- the purpose of your DW/DM
- the end-user tools that will be supported
Normalized Data Structures
This is a point of strong discussion between my good friend Ray Ferrara at KPMG and myself. We have had many length discussions over our favorite beverages (Coke for me!). Ray believes that normalized data structures are the only way to build the data warehouse. The data warehouse feeds the data mart into whatever type of environment is best for the end-user: multi-dimensional database, snowflake, or star. He believes in the true Wholesale/Retail data warehousing environment. In this environment, the end-user will not touch the data warehouse directly, much like we generally cannot purchase directly from a wholesaler. The data mart is where the end-user reads the data. I believe (and have implemented) many data warehouses where the designs are star or snowflake schemas (although I tend to lean toward the star!). I believe it will be difficult to provide every user with the data set the want to do analysis on. But this is a very common practice.
Ralph Kimball is attributed to working with the team the created the concept of the star schema. He certainly is the person fully involved in the promotion of this concept. The basis of the star is that you have dimensions (character descriptions that end-users will like to look at measurements of the company by) and Facts (the actual measurements). Examples of dimensions are customer information (name, address, gender, postal code), product information (name, vendor, category, subcategory) and point of sale information (store, address, salesperson). Examples of facts are quantity sold, unit price, discount applied, tax amount, and total amount. The intersection of the dimensions is where the fact is. In fact (no pun intended), a star is a flat representation of a multi-dimensional cube.
This is popular because the performance is good (not as many joins as a normalized data structure and DBMSs are not yet good a dealing with 5+ way joins!) and is easily understood by our business users. You can, with most query tools, define the semantic layer to look like a star schema if you are using a normalized data structure, but you will still have to do lots more joins to get the data. The biggest negative against the star schema is that you have so much duplicate data. I, like Ralph, believe this is a non-issue, since dimensions, where most duplicate data arises, are smaller than fact tables, and thus take up minimal disk requirements as compared to the whole data warehouse.
I have often said that the snowflake schema is a star schema with normalized dimensions. Mostly I think snowflakes make data modelers feel better since it doesn't have duplicate data in the dimensions. The biggest problem that I have with the snowflake schema is that it will require more joins to provide the results to the end-users. As I said earlier, 5+ way joins are very difficult for most DBMS optimizers to deal with, so having fewer joins are will provide better performance.
There is a group of products that store data in a proprietary format that provides extremely quick access to data and its functions. They represent the data as a cube. The good thing about multi-dimensional databases is they are fast for drill down, drill up and drill across. The negative centers on what happens when the dimensions change - do I have to rebuild the whole cube? Other questions center on whether multi-dimensional databases are able to hold all the data required for your DW/DM environment. Vendors say yes, but I am not yet convinced.
I think the most important part of defining the data warehouse/data mart environment is that you design the data that best fits the tool your end-users are using. If your tool works best in a snowflake schema, build them a snowflake. If you are never going to let the end-users access the data warehouse and you feel more confident in your abilities of building a normalized data structure, then build that. However, don't discount the star schema. It has proven itself in many data warehouse/data mart environments. I believe that the star schema is the best general-purpose design for the data warehouse/data mart environment.
About the Author
Chuck Kelley is president and founder of Excellence In Data, Inc. and an internationally known expert in database technology. He has more than 20 years of experience in designing and implementing operational/production systems and data warehouses. Kelley has worked in some facet of the design and implementation phase of more than 35 data warehouses and data marts. He also teaches seminars, co-authored a book with W. H. Inmon on data warehousing and has been published in many trade magazines on database technology, data warehousing and enterprise data strategies. Please feel free to email him at firstname.lastname@example.org with comments (negative or positive) about this column or ideas for future columns.
For More Information
- The best data warehousing and business intelligence web links: tips, tutorials, scripts, reviews, and much more.
- Do you have any technical questions about data warehousing or BI? Post them--or help out your peers by answering them--in our live discussion forums
- Have a data warehousing or BI tip to offer your fellow developers and administrators? The best tips submitted will receive a cool prize--submit your tip today!