Data Management.com

data warehouse

By Mary K. Pratt

A data warehouse is a repository of data from an organization's operational systems and other sources that supports analytics applications to help drive business decision-making. Data warehousing is a key part of an overall data management strategy: The data stored in data warehouses is processed and organized for analysis by business analysts, executives, data scientists and other users.

Typically, a data warehouse is a relational database or columnar database housed on a computer system in an on-premises data center or, increasingly, the cloud. Data from online transaction processing (OLTP) applications and additional internal or external sources is extracted and consolidated in the data warehouse for business intelligence (BI) uses that include ad hoc querying, decision support and enterprise reporting. Users access the data through BI software and other types of analytics tools.

Basic components of a data warehouse architecture

At a fundamental level, a data warehouse architecture contains a set of three tiers that include the following main components:

An architecture commonly also contains a data integration layer with tools that extract and combine data from operational systems and a staging area where data is cleansed, transformed and organized before being loaded into the data warehouse. A combination of data integration and data quality software is used to carry out the tasks at the staging level.

An enterprise data warehouse stores analytical data for all of an organization's business operations; alternatively, individual business units may have their own data warehouses, particularly in large companies. Data warehouses can also be connected to multiple data marts, which are smaller systems containing subsets of an organization's data for a specific department or group of users.

Data warehouses also support online analytical processing (OLAP) technologies, which organize data into cubes categorized by different dimensions to help accelerate the analysis process. In addition, the data records stored in a data warehouse contain detailed metadata and summary data to help make them searchable and useful to business users.

Types of data warehouses and deployment options

There are two main approaches to implementing a data warehouse: the top-down method and the bottom-up method.

The top-down method was created by data warehouse pioneer William H. Inmon. It calls for building the enterprise data warehouse first and then using the data stored in it to set up data marts for business units and departments. Under Inmon's approach, data is extracted from source systems and validated in a staging area before being integrated into a normalized data model and transformed for planned BI and analytics uses.

The bottom-up method was developed by consultant Ralph Kimball as an alternative data warehousing approach that calls for dimensional data marts to be created first. Data is extracted from sources and modeled into a star schema design, with one or more fact tables connected to one or more dimensional tables. The data is then processed and loaded into data marts, which can be integrated with one another or used to populate an enterprise data warehouse.

There is also a hybrid approach to data warehouse design that includes aspects of both the top-down and bottom-up methods. Organizations that adopt a hybrid strategy often seek to combine the development speed of the bottom-up approach with the data integration capabilities that can be achieved in a top-down design.

Finally, some organizations have adopted federated data warehouses that integrate separate analytical systems already put in place independently of one another -- an approach that proponents describe as a practical way to take advantage of existing deployments.

Data warehouse benefits

Data warehouses can benefit organizations from both a business and IT perspective. For example, potential benefits include the following:

Common data warehouse use cases and applications

Because a data warehouse enables faster and more efficient access to various data sets and typically has the compute and memory resources needed to run complicated queries, it can help businesses quickly derive insights and, thus, value from their data. To help meet business objectives, data warehouses are commonly used for the following purposes:

Best practices for designing and managing a data warehouse

Here are some best practices to adopt as part of the data warehouse design and management process.

Understand the business goals and strategies that drive the need for a data warehouse. The data warehouse holds data that's structured and processed so it's ready for analytical queries. That's why it's important to start by first understanding the organization's data warehousing needs and the business reasons behind them. IT leaders and data management teams should involve business stakeholders in these discussions, as the business objectives driving the need for the data warehouse will help shape decisions on the data to include, required data sources and how to format data sets.

Review the organization's data governance program, data management plan and supporting processes. The value of a data warehouse stems from the data it holds, not the underlying hardware and software. As a result, an organization should review its data governance program and overall data management strategy and update them if needed to support the data warehouse's planned business use cases. Data management processes at all the designated source systems should also be reviewed to help ensure the data being fed into the data warehouse is clean, accurate and consistent.

This planning should also consider how often data needs to be loaded and whether to go with batch or real-time data processing, based on the business use cases. Additionally, project managers and their teams should confirm that there's an ongoing process for revisiting such considerations and updating data management plans and processes. For example, this step might include deploying change data capture capabilities, so that any changes made in databases are replicated in the data warehouse. Teams need to define user permissions and access controls and address broader data security and compliance requirements upfront, too.

Select the right data warehouse architecture, platform and tools. Business use cases and requirements should also be considered to determine the right technologies for the data warehouse. There's a host of questions to ask. For example, do business requirements necessitate an on-premises data warehouse or would a cloud-based one better meet the organization's objectives? Also, should extract, transform and load (ETL) tools or the alternative extract, load and transform (ELT) method be used to feed data into the warehouse? Should there be a source-agnostic integration layer? Should the organization deploy a data warehouse platform itself or use a managed service?

Adopt new processes to optimize the data warehouse and maximize its business value. Established and emerging practices can help organizations optimize the management of a data warehouse and maximize the value it delivers. For example, data observability techniques can aid in maintaining the health of data in enterprise systems and data pipelines. Applying Agile development methodologies to data warehouse management can make it possible to deliver business value more quickly and with lower risks than using a traditional waterfall approach. Enabling self-service BI and analytics capabilities can similarly speed up value delivery by making it easier for business users to analyze data themselves.

On-premises vs. cloud data warehouses

Like other types of IT systems, data warehouses increasingly are moving to the cloud. All major data warehouse vendors now offer cloud-based systems that are available both for conventional user-managed deployments and as part of fully managed DWaaS offerings. They include vendors that first offered on-premises data warehouses -- most notably, IBM, Microsoft, Oracle, SAP and Teradata -- as well as ones that developed data warehouses specifically for the cloud, such as Amazon Web Services (AWS), Google and Snowflake.

In general, the underlying architecture is the same in on-premises and cloud data warehouses. But they both come with potential benefits and drawbacks in comparison to one another. For example, a data warehouse running on an on-premises server can have the following advantages over a cloud-based one:

On the other hand, cloud data warehouses offer the following advantages:

Another option is a hybrid data warehouse environment that combines cloud and on-premises systems. For example, an organization could keep sensitive data in an on-premises data warehouse for data privacy and regulatory compliance reasons, while moving other data sets to a cloud-based repository.

Data warehouses vs. data lakes vs. databases

Data lakes and data warehouses both support analytics applications, but there are notable differences between the two data repositories. Data warehouses typically store processed data in predefined schemas designed for specific BI, analytics and reporting applications. Usually, they hold conventional structured data from transaction processing systems and other business applications.

By contrast, a data lake is a repository for all types of raw data, whether it's structured, unstructured or semistructured. Data lakes are most commonly built on Hadoop, Spark or other big data platforms, not databases. A schema doesn't need to be defined upfront in a data lake -- instead, data sets can be analyzed as is or filtered and prepared for individual analytics applications. ELT processes are common in data lakes, vs. the ETL approaches most often used in data warehouses.

As a result of their flexibility and support for varied data sets, data lakes can handle more advanced types of analytics than data warehouses. For example, they can be used for machine learning, predictive modeling, text mining and other data science applications.

Data warehouses also differ from operational databases. While a data warehouse stores data from multiple sources for analysis, an operational database is generally used to collect, process and store data from a single system or application to support ongoing business processes. The data in such databases is then consolidated, cleansed and fed into data warehouses.

Data warehouse history and innovations

The concept of data warehousing can be traced back to work conducted in the mid-1980s by IBM researchers Barry Devlin and Paul Murphy. The duo coined the term business data warehouse in their 1988 paper, "An architecture for a business and information system," which described the framework of an information retrieval and reporting system built for use within IBM. The paper stated:

"The architecture is based on the assumption that such a service runs against a repository of all required business information that is known as the Business Data Warehouse (BDW). ... A necessary prerequisite for the physical implementation of a business data warehouse service is a business process and information architecture that defines (1) the reporting flow between functions and (2) the data required."

Bill Inmon, as he is more familiarly known, furthered data warehouse development with his 1992 book Building the Data Warehouse, as well as by writing some of the first columns about the topic. As part of his top-down design methodology for creating a data warehouse, Inmon described the technology as a subject-oriented, integrated, time-variant and non-volatile collection of data that supports an organization's decision-making process.

The technology's growth continued with the founding in 1995 of The Data Warehousing Institute, an education and research organization now known as TDWI that focuses more broadly on data analytics and management technologies. That was followed by the 1996 publication of Ralph Kimball's book The Data Warehouse Toolkit, which introduced his dimensional modeling and bottom-up approach to data warehouse design.

Early adopters began deploying data warehouses in the mid-1990s, and mainstream usage by organizations began to grow later that decade and became widespread in the 2000s. In 2008, Inmon updated his methodology by introducing a Data Warehouse 2.0 concept that focused on the inclusion of data lifecycle management capabilities, unstructured data, metadata and compartmentalized processing environments.

The next big evolution of the data warehouse came in 2012, when AWS launched the first cloud-based data warehouse, Amazon Redshift. Other vendors soon followed suit, and the number of cloud data warehouse options proliferated in subsequent years. Vendors have also layered more automation and intelligence features into their data warehouse software. For example, Oracle in 2018 released Oracle Autonomous Data Warehouse, a version of its Autonomous Database technology that the company describes as "self-driving."

In addition, big data systems have become a valuable extension of data warehouses in many organizations. In some cases, Hadoop clusters or other big data platforms serve as a staging area for traditional data warehouses. In others, data warehouses and data lakes are deployed in a unified analytics environment.

That also has led to the development of the data lakehouse, which combines a data lake's flexibility and scalability with the querying and data management features of a data warehouse. The concept was first outlined in 2017, and data lakehouse technologies have become available from various vendors since then.

28 Feb 2023

All Rights Reserved, Copyright 2005 - 2024, TechTarget | Read our Privacy Statement