The construction of data warehouses, which involves data cleaning and data integration, can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data mining. Furthermore, many other data mining functions, such as classification, prediction, association, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction.
Hence, the data warehouse has become an increasingly important platform for data analysis and on-line analytical processing, and it will provide an effective platform for data mining. Therefore, prior to presenting a systematic coverage of data mining technology in the remainder of this book, we devote this chapter to an overview of data warehouse technology. Such an overview is essential for understanding data mining technology.
2.1 What is a data warehouse?
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to keep customers by learning more about their needs.
"So," you may ask, full of intrigue, "what exactly is a data warehouse?"
Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated, historical data for analysis.
According to W. H. Inmon, a leading architect in the construction of data warehouse systems, "a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process" [Inm96]. This short but comprehensive definition presents the most important characteristics of a data warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems. Let's explain each of these key features.
Subject-oriented: A data warehouse is organized around important subjects, such as customer, vendor, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5 to 10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment....
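To make the integrated and time-variant features described above more concrete, the following is a minimal Python sketch, not a description of any real warehouse product. The two sources, their field names, the gender codes, and the helper functions (`from_source_a`, `from_source_b`, `load_warehouse`) are all hypothetical and were invented for this illustration. The sketch shows how records that use different naming conventions, encodings, and units might be mapped onto one consistent schema in which every row carries an explicit element of time.

```python
from datetime import date

# Hypothetical records from two operational sources. Both describe sales,
# but they differ in field names, gender encoding, currency unit, and
# date format. None of these names come from a real system.
SOURCE_A = [{"cust_id": 17, "sex": "M", "amount_usd": 120.0, "day": "2023-01-05"}]
SOURCE_B = [{"customer": "17", "gender": 1, "amount_cents": 4550, "date": "05/02/2023"}]

# Assumed mapping from each source's encoding to one shared encoding.
GENDER_CODES = {"M": "male", "F": "female", 1: "male", 2: "female"}


def from_source_a(rec):
    """Map a source-A record onto the shared warehouse schema."""
    return {
        "customer_id": int(rec["cust_id"]),
        "gender": GENDER_CODES[rec["sex"]],
        "amount_usd": float(rec["amount_usd"]),
        "sale_date": date.fromisoformat(rec["day"]),  # explicit time element
    }


def from_source_b(rec):
    """Map a source-B record (cents, DD/MM/YYYY dates) onto the same schema."""
    day, month, year = (int(part) for part in rec["date"].split("/"))
    return {
        "customer_id": int(rec["customer"]),
        "gender": GENDER_CODES[rec["gender"]],
        "amount_usd": rec["amount_cents"] / 100,  # unify the attribute measure
        "sale_date": date(year, month, day),
    }


def load_warehouse(source_a, source_b):
    """Integrate both sources into one list of rows ordered by sale date."""
    rows = [from_source_a(r) for r in source_a]
    rows += [from_source_b(r) for r in source_b]
    return sorted(rows, key=lambda row: row["sale_date"])


warehouse = load_warehouse(SOURCE_A, SOURCE_B)
for row in warehouse:
    print(row)
```

In this sketch the cleaning rules live in one small function per source, so adding a further source would mean adding one more mapping function rather than changing the shared schema.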