The standard data warehouse design from Kimball with facts and dimensions has been around for almost 25 years. In this article, Vince Iacoboni describes another way to design slowly changing dimensions.

We owe a lot to Ralph Kimball and friends. His practical warehouse design and conformed-dimension bus architecture are the industry standard. Business users can understand and query these warehouses directly and gain valuable insights into the business. Kimball’s practical approach focuses squarely on clarity and ease of use for the business users of the warehouse. Kudos to you and yours, Mr. Kimball.

That said, can the mainstay Type 2 slowly changing dimension be improved? I here present the concept of historical dimensions as a way to solve some issues with the basic Type 2 slowly changing dimension promoted by Kimball. As we will see, clearly distinguishing between current and past dimension values pays off in clarity of design, flexibility of presentation, and ease of ETL maintenance.

Warehouse facts are inherently historical since transactions happen on a transaction date, balances are kept on a balance date, and so on. The values of Dimensions are either static (date and time, limited code sets) or change slowly. Not every dimension change needs to be recorded as history, but many do. When dimensions change, how should it be handled?

Kimball’s general answer is to choose between the standard slowly changing dimension (SCD) Types 1, 2, and 3. For each column in the dimension table, a determination should be made to 1) overwrite the old value, 2) insert a new row in the dimension with a new dimension key to record the new value, preserving the old, or 3) copy the old value to a previous value column in the row.

SCD Type 1 is a simple overwrite, and SCD Type 3 is somewhat special-purpose and limited. The workhorse of dimension history is, therefore, SCD Type 2. It is made possible by the use of a surrogate key on the dimension rather than the natural key. Historical fact rows are linked through the surrogate key to the version of the dimension row that was current when the fact was recorded. Usually, dimensions containing Type 2 history have effective and expiration dates, as well as a current indicator, which must be maintained as Type 2 SCD rows are inserted.

Limitations of Type 2 SCD

Type 2 SCD is usually presented as one of several choices as to how history is stored – a somewhat technical distinction that can be hidden from the business users. A designer might get the sense that he could present the first version of the warehouse using SCD Type 1 overwrites and add Type 2 SCD history in a later version. Since no structural changes are required, it should be able to drop right in, but this is not the case.

Storing Type 2 history in a dimension table fundamentally changes what that dimension contains. That leads directly to user confusion and incorrect results.

Consider how we treat fact tables. Kimball makes a strong point that one must declare the grain of the fact table, stating precisely what a fact table record represents. He writes that the most common error he sees is to not declare the grain at the beginning of the design process. Further, while the grain may be equivalent to the primary key of the fact table, the grain is properly declared in business terms first. Once the business definition is clear, the dimensional keys used in the fact become obvious.

Should we not treat dimension tables in a similar fashion? We must know exactly what a row in a dimension table represents, in business terms. While the primary key will always be the surrogate key of the dimension, both designers and users should be clear about what each row in the dimension table represents.

For example, what does each row in a Dim_Customer table represent? If SCD Type 1 overwrites are in place, we can say that it represents the latest information for each customer. If SCD Type 2 inserts are in place, the row now represents customer information at a certain point in time. Therefore, a business user must be fully aware of the history technique before he can understand what he is looking at in that dimension.

We can imagine the business user who is attempting to answer the question, “How many different customers have we had?” A simple

SELECT COUNT ( * ) FROM Dim_Customer

provides that answer for an SCD Type 1 table, but one would need to use

SELECT COUNT ( DISTINCT Customer_Id ) FROM Dim_Customer

(assuming Customer_Id is a natural key of the customer from the operational system) or

SELECT COUNT ( * ) FROM Dim_Customer WHERE Current_Indicator = ‘ Y’

to answer that simple question in an SCD Type 2 table.

SCD Type 2 introduces complications because it is trying to be both the current view and the historical view at the same time. It is like an actor on a stage who is trying to play two characters in the same scene. The audience is confused, and so is the actor.

Let’s examine another limitation of SCD Type 2. What happens if we wish to use current dimension values when examining historical facts? For example, we may wish to send emails to all customers who bought certain products in the previous quarter. We use the fact table to obtain the list of distinct Customer_Keys, but those keys refer to potentially historical records in the Type 2 Dim_Customer table. We cannot simply pull the email address from the dimension matching that key, because the customer may well have updated their email address since the last transaction we have recorded. In this case, we don’t want historical customer values; we want current customer values. Going back into the dimension to retrieve the current rows requires some tricky SQL that is likely beyond our business users.

One more irritation of SCD Type 2 arises with accumulating snapshot fact tables. This kind of fact table tracks statistics of a dimensional entity (e.g. a customer) as it changes over time. When our dimension is using SCD Type 2, there are several dimension keys that point to the same dimensional entity. We must ensure that we update the accumulating snapshot’s dimension key with the latest SCD Type 2 dimension key to avoid double-counting the rows.

#bi #homepage #automate #data science

Clarifying Data Warehouse Design with Historical Dimensions
1.35 GEEK