0

Type II Dimension or a Fake Fact Table!

You are sales manager of North-East region in 2010 to Oct 23 2011 and did total sales of 1 million. The table that stores your information looks like….
Dim Sales Person
Sales Person Key =99
Sales Person Name = Tony Peterson
Sales Person Designation = Sales Manager
Sales Region = North-East
Phone = 203 203 9999
Date Of Birth = 9/9/1977

Sales fact
Transaction Date Key
Sales Person Key
Product Key
Product Qty
Product $Amount

As of Oct 23 management asks for the sales amount by sales person and region. The result would look like: Tony Peterson, North-East, and 1 Million.
From 24th Oct 2012 there is change and you have been transferred to South-East region. So some IT guy updated the sales region in Sales Person table to South-East.
Sales Person
Sales Person Key =99
Sales Person Name = Tony Peterson
Sales Person Designation = Sales Manager
Sales Region = South-East
Phone = 203 203 9999
Date Of Birth = 9/9/1977

As you do your sales next few month you did a sales of .4 million in South-East. On Jan 2012 management again asked for the sales figure by employee and region. What will the report look like?
Tony Peterson, South-East, and 1.4 Million.
Is the report correct? The answer is obvious total sales figure is incorrect for South-East as it will show 1.4 millions. You only did .4 million in the South-East and not 1.4. The reason it is incorrect is due to the fact that somebody updated the dimension record with the current region.

Now that we know the problem and exact cause of it, we need to look for a solution. By this time most you will say …this is no big deal it’s a classic issue of slowly changing dimension. Let me create a type II dimension… So I can preserve history of transfer.
Dim Sales Person II
Sales Person Key =99, 100
Sales Person SSN=000-00-9999, => 000-00-9999
Sales Person Name = Tony Peterson, => Tony Peterson,
Sales Person Designation= Sales Manager, => Sales Manager
Sales Region = North-East, => South-East
Phone = 203 203 9999,=> 203 203 9999
Date Of Birth = 9/9/1977, => 9/9/1977

Now since there are two records for same sales person you can join with sales facts and you can separate the regions. The correct facts are as follows.
Tony Peterson, North-East, and 1.0 Million
Tony Peterson, South-East, and 0.4 Million

Observe the Type II dimension carefully. You will note that
- The key identifier (PK) for the Sales person will keep changing. Note it changed from 99 to 100.
- Instead of having 1 one row per salesman now it has two or even more with time.
- The Sales Region is time variant fact about the sales person
- We are indeed storing time variant facts in type II dimension.
So Type II dimension is really more of a fact table. So is storing time variants facts a good design at all? Why not store the facts in a fact table? Read other articles on pitfalls of Type II dimension.

2

How ETL-IA is beneficial to you?

As the previous ETL-IA blog post mentions that an addition of a source, adds an additional ETL process. It means that the time and effort that it takes to build the data warehouse is directly proportional to the number of sources. The more the number of sources, the more the warehouse system would cost.

An enterprise data warehouse contains a unified data model to obtain integrated data. Therefore, a unified entity is defined for various business entities. For example, there will be tables defined for various business entities such as employee, customer, and sales transaction. Furthermore, regardless of the source systems or the number of source systems, the attributes of business entity remain same. Therefore, in an ideal scenario in a unified data model, a unified process should populate such unified tables. Can this ideal scenario be translated into a meaningful solution in an imperfect world? Can the time and money spent on a project be independent of the number of sources?

Effort and Cost chart in Traditional ETL v/s ETL-IA
Can we achieve something that is represented by ETL-IA represented by Red line? The ETL-IA approach indicates that the cost and effort of the project does not linearly increase with the increase in sources. It happens because the process that ultimately loads into data warehouse is reusable. Doesn’t matter how many sources you add, the cost and effort becomes non-linear after the initial development cost. The above chart displays how the ETL-IA is better to the traditional ETL development methodology.
If we look back and analyze the traditional code, what is the reusability of the code from source to source; may be no reusability or very limited.

Traditional ETL Methodology

Traditional Approach

Another Traditional ETL Methodology

Traditional Approach

Problems with Traditional Approaches
There are many problems with both the above ETL methodologies. The first methodology does not offer any reusability. It requires a separate process every time a new source is added, which is very costly solution to implement. Unfortunately, most of the enterprises are spending $s on this methodology.
And the problem with the second approach is that it contains the logic for the entire source at one place. If something happens, the whole system is at risk. And also, the troubleshooting becomes more difficult in such a design or lack of design.

Then what is the solution?
The ETL-IA specifies that the target is in unified format and designed for data integration. Then why not having only one process irrespective of the number of sources and their type? ETL-IA designs the target process that is source independent. The whole solution is based on this theory, which is further implemented using the object oriented programming concepts such as interface and controller objects.
After the implementation of ETL-IA, the ETL process facilitates the addition of new source as plug and play device. Thus, the more the number of sources added to the system, the more the $s savings.

Stay tuned for more information on ETL-IA.

7

Testing in ETL, Data-Centric Projects

Testing is an investigation process that is conducted to check the quality of the product. Product can either be an applications or data. The quality of the data can only be determined by checking data against some existing standards by following a set of processes. By doing so, you find out the symptoms in form of invalid/incorrect data that happened because of erroneous processes. So the data-centric testing results in achieving high quality data by getting the erroneous processes fixed.

This article highlights the types of data-centric testing approaches and discovers what each testing approach uncovers.

Types of Testing

  • Application Testing: The focus of this article is data-centric testing so we’ll not discuss application testing here.
  • Data-Centric Testing: Data-centric testing revolves around testing quality of the data. The objective of the data-centric testing is to ensure valid and correct data is in the system. Following are the couple of reasons that cause the requirement of performing data-centric testing:
    • ETL Processes/Data Movement: When you apply ETL processes on source database, and transform and load data in the target database
    • System Migration/Upgrade: When you migrate your database from one database to another or you upgrade an existing system where the database is currently running.

Data-Centric Testing
The data-centric testing validates data using the following approaches:

  • Technical Testing: Technical testing ensures that the data is moved, copied, or loaded from the source system to target system correctly and completely. Technical testing is performed by comparing the target data against the source data. Following is a list of functions that can be performed under technical testing:
    • Checksum Comparison: The data-centric testing makes use of checksum approach to discover errors. Checksum can be performed on source and target databases in n number of ways such as counting the number of rows, and adding the data of a column. Later the result of checksum calculated on source database is compared against the checksum calculated on the target database.
      For example, row count compares the number of rows in the target database with the number of corresponding rows in the source database. Or the target database may contain the summarized annual data for monthly salaries in the source database. So the target database should contain sum of the monthly salaries paid within a year for each year.
    • Domain comparison: The domain list in the target database is compared against the corresponding domain list in the source database. For example, the source system has 100 employees and the target system also has 100 employees but it does not guarantee that the employees in both the source and target domain lists are same unless compared. In Domain comparison, employee names in the target domain list are compared against the employee names in the source domain list.
    • Multi-value comparison: Similar to List comparison, multi-value comparison compares the whole record or the critical columns in a record against the corresponding values in the source system.
      For example, the domain comparison reports that all the employees are same and checksum comparison reports that the salaries on source and target match, but still it does not guarantee of a valid record because the data movement may result in assigning wrong employee name to the salary entry. Such issues are not discovered by Checksum comparison and Domain comparison. The multi-value comparison discovers such issues by comparing the key attributes of each entity in source and target databases.
  • Business Testing: Business testing is done to ensure that the data fulfills the requirements of the business. Data may have been moved, copied, or loaded completely and accurately, and technical testing does not report any issue still there are chances that the system still contains the invalid data. To ensure high quality data, the data is evaluated against the business rules.

    For example, value of certain fields such as salary and commission cannot be less than zero. These types of errors can occur because of data manipulation between the source and target systems. Unless you test data for any such errors, the quality of the data is not guaranteed.
    For every business, the list of the business requirements is different. An exhaustive list of business rules against which the data is compared, ensures high quality data.

  • Reconciliation: Reconciliation ensures that the data in the target system is in agreement with the overall system requirements. Following are the couple of examples of how the reconciliation helps in achieving high quality data:
    • Internal reconciliation: In this type of reconciliation, the data is compared within the system against the corresponding data set. For example shipping would also always be less than or equal to the orders. If the shipping ever exceeds the orders then it means data is invalid.
    • External reconciliation: In this type of reconciliation, data in system is compared against its counterpart in other systems. For example, in a module or in an application, number of employees can never be more than the number of employees in HR Employee database. Because HR Employee database is the master database that keeps record of all the employees. If such a situation occurs where the number of employees anywhere in the system is more than the HR Employee database, then the data is invalid.

My company developed the ICE± product, which targets to accomplish all types of testing mentioned in this article. For more information, visit www.iCEdq.com