Many people are so instantly drawn to data visualization, whether it is powered by Power BI, Tableau, or other "big data" tools, that they forget the beauty should be built on a solid foundation: a sound data warehouse. When the data set is small, we can get away with reporting directly off the OLTP database, but as it grows, performance becomes sluggish. The first reaction from some IT managers might be to throw more resources at it, be it allocating more virtual memory or processors, but this should be considered a temporary fix. This "solution" only buys time to develop a permanent fix because, like everything else, the economic law of diminishing marginal returns still applies. In this post, I want to cover some basics of how to design a data warehouse.
First, let's differentiate the entity-relationship (ER) model, on which an online transaction processing (OLTP) database is based, from the dimensional model, on which a data warehouse is based. ER modeling seeks to represent the business rules of a system in terms of data types and the cardinalities of the relationships among data members. The primary goal is to get rid of all non-key data redundancy. In dimensional modeling, data redundancy is controlled by conformed dimension and fact tables. The relationships do not represent business rules; they are used only to help write reports or create graphs.
In the BI world, measures are the mathematically manipulable elements. A dimensional model is designed to deliver lots of measures to users; measures are the core of the model. Relationships in a dimensional model are not intended to enforce referential integrity (the data is supposedly clean after the ETL process) but to associate the keys in the fact table with the expanded definitions found in the dimension tables.
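To make this concrete, here is a minimal star-schema sketch using Python's built-in sqlite3 module. All table and column names (`fact_sales`, `dim_date`, `dim_product`, etc.) are illustrative assumptions, not from any particular system; the point is that the joins merely expand surrogate keys into readable attributes so measures can be grouped and summed.

```python
import sqlite3

# A toy star schema: one fact table whose keys point at dimension tables
# that carry the descriptive attributes. Names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER,
                          quantity INTEGER, amount REAL);  -- the measures

INSERT INTO dim_date    VALUES (1, '2024-01-05', 'Jan'), (2, '2024-02-10', 'Feb');
INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware'), (11, 'Gadget', 'Hardware');
INSERT INTO fact_sales  VALUES (1, 10, 3, 29.97), (1, 11, 1, 19.99), (2, 10, 2, 19.98);
""")

# The joins only "expand" the surrogate keys into readable attributes;
# they do not enforce business rules or referential integrity.
rows = cur.execute("""
    SELECT d.month, p.name, SUM(f.quantity) AS units, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.name
    ORDER BY d.month, p.name
""").fetchall()

for row in rows:
    print(row)
```

Note that `fact_sales` declares no foreign-key constraints at all; in a warehouse the ETL process is responsible for guaranteeing that every key in the fact table resolves to a dimension row.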
To create a dimensional model from an ER model, first separate the ER model into its discrete business processes, each of which is expressed as a data mart: a modular, focused component of the enterprise data warehouse. Examples of single-source data marts are retail sales, purchase orders, shipments, RMAs, etc. A multiple-source data mart could be customer buying behavior.
The grain of a fact table is the level of detail being captured. When designing a fact table, we need to be careful not to summarize prematurely: for example, a fact table that contains each individual sales transaction versus one in which transactions are already grouped by customer (a bad idea!). The finer the granularity, the more robust the design, meaning the model can easily expand to accommodate new sources of data. The counterargument is performance.
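The asymmetry is easy to demonstrate. In this sketch (all names are made up for illustration), transaction-grain rows can always be rolled up to the customer level, but a table stored at customer grain could never be drilled back down to the individual transactions.

```python
from collections import defaultdict

# Hypothetical transaction-grain fact rows: one row per sale.
transactions = [
    {"customer": "alice", "product": "widget", "amount": 10.0},
    {"customer": "alice", "product": "gadget", "amount": 5.0},
    {"customer": "bob",   "product": "widget", "amount": 10.0},
]

# Rolling up from the fine grain to a coarser one is trivial...
per_customer = defaultdict(float)
for t in transactions:
    per_customer[t["customer"]] += t["amount"]

# ...and the same grain also supports rollups along other dimensions,
# which a pre-aggregated per-customer table could not answer at all.
per_product = defaultdict(float)
for t in transactions:
    per_product[t["product"]] += t["amount"]
```

Had the fact table stored only `per_customer`-style rows, the per-product question would be unanswerable: the detail is gone and cannot be reconstructed.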
Most data marts consist of multiple related fact tables drawn from a single value chain. When designing, it is beneficial to understand what information users will request from the data.
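For instance, orders and shipments are two fact tables in one value chain. The sketch below (all names and figures are invented for illustration) shows why sharing a conformed product dimension matters: because both facts use the same product keys, their results can be combined into a single report, here a backlog of units ordered but not yet shipped.

```python
# A conformed product dimension shared by two fact tables in one
# value chain. Keys and quantities are hypothetical.
dim_product = {10: "widget", 11: "gadget"}

fact_orders    = [(10, 5), (11, 2), (10, 3)]   # (product_key, qty ordered)
fact_shipments = [(10, 4), (11, 2)]            # (product_key, qty shipped)

ordered = {}
for key, qty in fact_orders:
    ordered[key] = ordered.get(key, 0) + qty

shipped = {}
for key, qty in fact_shipments:
    shipped[key] = shipped.get(key, 0) + qty

# Combining the two facts is only possible because they agree on keys.
backlog = {dim_product[k]: ordered[k] - shipped.get(k, 0) for k in ordered}
```

If each fact table carried its own private product codes, this cross-fact comparison would require a fragile mapping step instead of a simple key lookup.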