Type II Dimension or a Fake Fact Table!

January 10th, 2012

You are the sales manager of the North-East region from 2010 to Oct 23, 2011, and did total sales of 1 million. The tables that store your information look like this:
Dim Sales Person
Sales Person Key =99
Sales Person Name = Tony Peterson
Sales Person Designation = Sales Manager
Sales Region = North-East
Phone = 203 203 9999
Date Of Birth = 9/9/1977

Sales fact
Transaction Date Key
Sales Person Key
Product Key
Product Qty
Product $Amount

As of Oct 23, management asks for the sales amount by salesperson and region. The result would look like: Tony Peterson, North-East, and 1 Million.
From Oct 24, 2011 there is a change: you have been transferred to the South-East region. So some IT guy updates the sales region in the Sales Person table to South-East.
Sales Person
Sales Person Key =99
Sales Person Name = Tony Peterson
Sales Person Designation = Sales Manager
Sales Region = South-East
Phone = 203 203 9999
Date Of Birth = 9/9/1977

As you continue selling over the next few months, you do sales of 0.4 million in the South-East. In Jan 2012 management again asks for the sales figures by employee and region. What will the report look like?
Tony Peterson, South-East, and 1.4 Million.
Is the report correct? Obviously not: the total sales figure for the South-East is incorrect, because it shows 1.4 million when you did only 0.4 million there. It is incorrect because somebody updated the dimension record with the current region, overwriting the old one.

Now that we know the problem and its exact cause, we need to look for a solution. By this time most of you will say, "This is no big deal, it's a classic slowly changing dimension issue. Let me create a Type II dimension so I can preserve the history of the transfer."
Dim Sales Person II
Record 1: Sales Person Key = 99, SSN = 000-00-9999, Name = Tony Peterson, Designation = Sales Manager, Sales Region = North-East, Phone = 203 203 9999, Date of Birth = 9/9/1977
Record 2: Sales Person Key = 100, SSN = 000-00-9999, Name = Tony Peterson, Designation = Sales Manager, Sales Region = South-East, Phone = 203 203 9999, Date of Birth = 9/9/1977

Now, since there are two records for the same salesperson, you can join the dimension to the sales facts and separate the regions (a query sketch follows). The correct facts are as follows.
Tony Peterson, North-East, and 1.0 Million
Tony Peterson, South-East, and 0.4 Million
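As a rough sketch, assuming table and column names derived from the structures above (they are not spelled out in the post), the report could be produced with a query like:

SELECT D.Sales_Person_Name, D.Sales_Region, SUM(F.Product_Amount) AS Total_Sales
From
DIM_SALES_PERSON_II D, SALES_FACT F
WHERE
F.Sales_Person_Key = D.Sales_Person_Key
GROUP BY D.Sales_Person_Name, D.Sales_Region;

This splits the 1.0 and 0.4 million correctly only because each fact row was loaded with the surrogate key (99 or 100) that was current at transaction time.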

Observe the Type II dimension carefully. You will note that:
– The key identifier (PK) for the salesperson keeps changing. Note that it changed from 99 to 100.
– Instead of one row per salesperson, there are now two, and possibly more over time.
– The Sales Region is a time-variant fact about the salesperson.
– We are indeed storing time-variant facts in the Type II dimension.
So a Type II dimension is really more of a fact table. Is storing time-variant facts in a dimension a good design at all? Why not store the facts in a fact table? Read the other articles on the pitfalls of the Type II dimension.

How is ETL-IA beneficial to you?

July 20th, 2011

As the previous ETL-IA blog post mentions, every additional source adds an additional ETL process. This means the time and effort it takes to build the data warehouse is directly proportional to the number of sources: the more sources, the more the warehouse system costs.

An enterprise data warehouse contains a unified data model to obtain integrated data. Therefore, a unified entity is defined for each business entity; for example, there will be tables defined for business entities such as employee, customer, and sales transaction. Furthermore, regardless of which source systems feed it, or how many, the attributes of a business entity remain the same. Therefore, in an ideal scenario a unified process should populate such unified tables. Can this ideal scenario be translated into a meaningful solution in an imperfect world? Can the time and money spent on a project be independent of the number of sources?

Effort and Cost chart in Traditional ETL v/s ETL-IA
Can we achieve something like the ETL-IA curve, represented by the red line? The ETL-IA approach means that the cost and effort of the project do not increase linearly with the number of sources, because the process that ultimately loads the data warehouse is reusable. No matter how many sources you add, cost and effort flatten out after the initial development. The chart above shows how ETL-IA compares to the traditional ETL development methodology.
If we look back and analyze traditional code, how much of it is reusable from source to source? Usually none, or very little.

Traditional ETL Methodology
(figure: Traditional Approach)

Another Traditional ETL Methodology
(figure: Traditional Approach)

Problems with Traditional Approaches
There are many problems with both of the above ETL methodologies. The first methodology does not offer any reusability: it requires a separate process every time a new source is added, which is a very costly solution to implement. Unfortunately, most enterprises are spending their dollars on this methodology.
The problem with the second approach is that it contains the logic for every source in one place. If anything breaks, the whole system is at risk, and troubleshooting becomes more difficult in such a design (or lack of design).

Then what is the solution?
ETL-IA specifies that the target is in a unified format and designed for data integration. So why not have only one process, irrespective of the number and type of sources? ETL-IA designs the target process to be source independent. The whole solution is based on this idea, which is implemented using object-oriented programming concepts such as interface and controller objects.
Once ETL-IA is implemented, a new source can be added to the ETL process like a plug-and-play device. Thus, the more sources added to the system, the greater the savings.

Stay tuned for more information on ETL-IA.

Testing in ETL, Data-Centric Projects

May 19th, 2011

Testing is an investigation process conducted to check the quality of a product. The product can be either an application or data. The quality of data can only be determined by checking it against existing standards by following a set of processes. By doing so, you find the symptoms, in the form of invalid or incorrect data, caused by erroneous processes. Data-centric testing therefore results in high-quality data by getting the erroneous processes fixed.

This article highlights the types of data-centric testing approaches and discovers what each testing approach uncovers.

Types of Testing

  • Application Testing: The focus of this article is data-centric testing so we’ll not discuss application testing here.
  • Data-Centric Testing: Data-centric testing revolves around testing the quality of the data. The objective is to ensure that valid and correct data is in the system. A couple of situations that create the need for data-centric testing are:
    • ETL Processes/Data Movement: When ETL processes read a source database and transform and load data into a target database.
    • System Migration/Upgrade: When you migrate from one database to another, or upgrade an existing system on which the database is currently running.

Data-Centric Testing
Data-centric testing validates data using the following approaches (a few illustrative SQL sketches follow the list):

  • Technical Testing: Technical testing ensures that the data is moved, copied, or loaded from the source system to target system correctly and completely. Technical testing is performed by comparing the target data against the source data. Following is a list of functions that can be performed under technical testing:
    • Checksum Comparison: Data-centric testing uses checksums to discover errors. A checksum can be computed on the source and target databases in any number of ways, such as counting rows or summing a column; the checksum calculated on the source is then compared against the checksum calculated on the target.
      For example, a row count compares the number of rows in the target database with the number of corresponding rows in the source database. Or the target may contain annual summaries of the monthly salary data held in the source, in which case the target should contain, for each year, the sum of the monthly salaries paid in that year.
    • Domain comparison: The domain list in the target database is compared against the corresponding domain list in the source database. For example, the source system has 100 employees and the target system also has 100 employees, but that does not guarantee that the employees in the two lists are the same unless they are compared. In domain comparison, employee names in the target list are compared against employee names in the source list.
    • Multi-value comparison: Similar to domain comparison, multi-value comparison compares the whole record, or the critical columns of a record, against the corresponding values in the source system.
      For example, domain comparison may report that all the employees match and checksum comparison may report that the salary totals match, yet a record can still be invalid because the data movement assigned the wrong salary to an employee. Such issues are not discovered by checksum or domain comparison; multi-value comparison finds them by comparing the key attributes of each entity in the source and target databases.
  • Business Testing: Business testing ensures that the data fulfills the requirements of the business. Even when data has been moved, copied, or loaded completely and accurately and technical testing reports no issues, the system may still contain invalid data. To ensure high-quality data, the data is evaluated against the business rules.

    For example, value of certain fields such as salary and commission cannot be less than zero. These types of errors can occur because of data manipulation between the source and target systems. Unless you test data for any such errors, the quality of the data is not guaranteed.
    For every business, the list of business requirements is different. An exhaustive list of business rules against which the data is compared ensures high-quality data.

  • Reconciliation: Reconciliation ensures that the data in the target system is in agreement with the overall system. Here are a couple of examples of how reconciliation helps achieve high-quality data:
    • Internal reconciliation: The data is compared within the system against a corresponding data set. For example, shipments should always be less than or equal to orders; if shipments ever exceed orders, the data is invalid.
    • External reconciliation: Data in the system is compared against its counterpart in other systems. For example, the number of employees in a module or application can never be more than the number of employees in the HR Employee database, because the HR Employee database is the master that keeps a record of all employees. If the number of employees anywhere in the system exceeds that in the HR Employee database, the data is invalid.
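A few minimal SQL sketches of the checks above, assuming a salary table with EMP_ID, EMP_NAME, and SALARY columns in a source schema SRC and a target schema TGT, plus order/shipment tables for the reconciliation case (all of these names are assumptions, and MINUS is the Oracle spelling of EXCEPT):

-- Checksum comparison: row counts and column totals, run on both sides and compared
SELECT COUNT(*) AS ROW_CNT, SUM(SALARY) AS SALARY_SUM FROM SRC.EMP_SALARY;
SELECT COUNT(*) AS ROW_CNT, SUM(SALARY) AS SALARY_SUM FROM TGT.EMP_SALARY;

-- Domain comparison: employees present in the source but missing in the target
SELECT EMP_NAME FROM SRC.EMP_SALARY
MINUS
SELECT EMP_NAME FROM TGT.EMP_SALARY;

-- Multi-value comparison: compare the critical columns of each record
SELECT EMP_ID, EMP_NAME, SALARY FROM SRC.EMP_SALARY
MINUS
SELECT EMP_ID, EMP_NAME, SALARY FROM TGT.EMP_SALARY;

-- Business rule: salary can never be negative
SELECT EMP_ID FROM TGT.EMP_SALARY WHERE SALARY < 0;

-- Internal reconciliation: shipped quantity can never exceed ordered quantity
-- (assumes one row per order in ORDERS)
SELECT O.ORDER_ID
FROM TGT.ORDERS O, TGT.SHIPMENTS S
WHERE O.ORDER_ID = S.ORDER_ID
GROUP BY O.ORDER_ID
HAVING SUM(S.SHIPPED_QTY) > MAX(O.ORDERED_QTY);

Each check should return zero rows (or matching checksums); anything else is flagged for investigation.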

My company developed the ICE± product, which aims to accomplish all the types of testing mentioned in this article. For more information, visit www.iCEdq.com.

Pitfalls of type II dimension

September 15th, 2009

The Type II dimension has been popularized by R. Kimball. It has become so popular that in any data warehouse interview the interviewer will surely ask you to explain the concept. And chances are that if you don't know, they will laugh at your ignorance and reject you.
Here's your chance to laugh at them…
If you read this article, you will probably end up knowing something more than they do. This is not because you will find the definition of a Type II dimension here, but for an entirely different reason.

To be continued…

To clearly explain the pitfalls of Type II dimension, let’s take an example. In the example, there are three tables, that is, DIM_Instrument, FACT_Trade, and FACT_Settlement. Each of the tables contains data as shown below:

DIM_Instrument table
DIM_Instrument

In the DIM_Instrument table, the property of instrument IBM changes from X to Y. To maintain the Type II dimension, a new entry is added to the table: the status of the current entry is updated to obsolete (denoted by 'O') and a date is written to its ToDT column, while in the new entry the ToDT column is NULL and the status is current (denoted by 'C').

FACT_Trade table
FACT_Trade

The trade table contains information about just one trade, executed on April 29th, 2011, which means it was processed with InstrumentKey '1', as InstrumentKey '2' did not exist on April 29th, 2011.

FACT_Settlement table
FACT_Settlement

Generally, it takes three days for a trade to settle, so the trade executed on April 29th, 2011 settled only on May 2nd, 2011. In between, the property of instrument 'IBM' changed on May 1st, 2011, and a new entry was made in the DIM_Instrument table for IBM, which incremented the InstrumentKey to 2.

Now, in the settlement table, the instrument key against the same trade is different, which causes the issues and concerns described in the scenarios below.

Scenario 1:
Get data for each trade and settlement for the current instrument, using the following query:

SELECT T.INSTRUMENTKEY, T.TradeDate as Sec_Date
From
DIM_INSTRUMENT I, FACT_TRADE T
WHERE
T.InstrumentKey=I.InstrumentKey
and I.Status='C'
UNION ALL
SELECT S.INSTRUMENTKEY, S.SettlementDt as Sec_Date
From
DIM_INSTRUMENT I, FACT_SETTLEMENT S
WHERE
S.InstrumentKey=I.InstrumentKey
and I.Status='C';

Scenario1

The output does not show any data from the FACT_Trade table, because the current record in DIM_Instrument has InstrumentKey '2' and FACT_Trade does not contain InstrumentKey '2'.

This happens because the InstrumentKey changed between trade execution and trade settlement. Although the settlement is for the trade processed on April 29th, 2011, with the InstrumentKey change in between there is no way to retrieve it with this query.

Scenario 2:

Rather than querying data for current instrument, get data for all the instrument records. You can do so by using the following query:

SELECT T.INSTRUMENTKEY, T.TradeDate as Sec_Date
From
DIM_INSTRUMENT I, FACT_TRADE T
WHERE
T.InstrumentKey=I.InstrumentKey
UNION ALL
SELECT S.INSTRUMENTKEY, S.SettlementDt as Sec_Date
From
DIM_INSTRUMENT I, FACT_SETTLEMENT S
WHERE
S.InstrumentKey=I.InstrumentKey

The above query returns the following output:
Scenario2OP1

If you analyze the output closely, it returns trade information for InstrumentKey '1' but no settlement information for it, and for InstrumentKey '2' it returns settlement information but no trade information. This output is best viewed as the following:
Scenario2OP2

For an instrument when trade exists, settlement does not exist and vice versa.

Scenario 3:
Maintain a relationship between the two fact tables with a common key (i.e., TradeID) and join the tables on that key. In this case, the TradeID uniquely identifies the trade and settlement rows that refer to the same trade (a hedged sketch of such a join follows the list of reasons below).
But this results in a fact-to-fact relationship, which is not recommended for the following reasons:

  • Fact tables are typically the largest tables, so joining them directly hurts performance.
  • It is not practical when multiple fact tables need to be joined, because doing so complicates the design and is not always possible to achieve.
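If a TradeID really is carried on both fact tables as described above, the fact-to-fact query would look something like the sketch below (the column names are assumptions, since the table images are not reproduced here):

SELECT T.TradeID, T.InstrumentKey as Trade_InstrumentKey, T.TradeDate,
       S.InstrumentKey as Settlement_InstrumentKey, S.SettlementDt
From
FACT_TRADE T, FACT_SETTLEMENT S
WHERE
T.TradeID = S.TradeID;

This does tie each trade to its settlement even though the surrogate keys differ, but it joins the two largest tables directly, which is exactly the performance and design concern listed above.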

All the above scenarios highlight various pitfalls of Type II dimension. All the scenarios and examples used throughout this article are purely for demonstrating the highlighted pitfalls.

Difference between Reference Data and Master Data

June 19th, 2008

It is not unusual for people to use 'Reference Data' and 'Master Data' interchangeably without understanding the differences.
Let's try to understand the differences with the example of a sales transaction.

A sales transaction contains information like….
Store,
Products Sold,
Sales Person,
Store Name,
Sales Date,
Customer,
Price,
Quantity,
etc.

Attributes from the above example can be separated into two types: factual (transactional) information and dimensional information.
Price and Quantity are measurable attributes of the transaction.
Store, Products Sold, Sales Person, Store Name, Sales Date, and Customer are dimensional attributes of the transaction.

We can see that the dimensional data is already embedded in the transaction, and with these dimensional attributes we can successfully complete the transaction. Dimensional data that directly participates in a transaction is master data.

But is the list of dimensional attributes in the transaction complete? 

Asking a few analytical questions can help us discover the answer.
     -What is the Male to Female ratio of customers doing purchase at the store?
     -What type of products are customers buying? Ex: Electronic, Computers, Toys
     -What type of Store is it?  Ex: Web store, Brick & Mortar, Telesales, Catalog Sales

The above questions cannot be answered by the attributes in the transaction; this dimensional data is missing from it. The missing dimensional data that does not directly participate in the transaction, but consists of attributes of the dimensions, is reference data.

Why is it important for an ETL person to understand the difference? Well, 'Reference Data Management' (RDM) was once the popular term, and then suddenly in the last few years there is this new term, 'Master Data Management' (MDM). They mean different things, and that has significant implications for how the data is managed. But that will be a topic for a future post! I hope this article helps clear at least some of the confusion.

 

 

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse -Contents of ETL Book

November 5th, 2007

Section A ,The Beginning
01 ETL The Basics
02 ETL Strategy

Section B, Analysis
03 Target Systems Analysis
04 Source Systems Analysis
05 Source Target Mapping- Part I
06 Understanding Data Quality
07 Data Profiling

Section C, Develop Part I
08 Understanding Data Patterns for ETL
09 Simple ETL Development

Section D,  ETL Architecture & Design
10 ETL & Data Integration
11 ETL-IA (Interface Architecture)
12 ETL-IA Implementation
13 Designing Standard ETL Templates

Section E, Develop Part II
14 File Management & Transportation
15 Extraction
16 Staging Data
17 ETL Transformation Development
18 Unit Testing ETL Processes
19 Coding Wrappers
20 Automation of ETL Processes

Section F, Migration
21 Migration ETL Processes

Section G, Post Production
22 Reference Data Management
23 Exception & Error Management
24 Production Support & Change Management For ETL Processes
25 ETL & Performance Tuning

Section H, Other
26 ETL Tools
27 ETL & Metadata Management

Slower Development Databases and Servers

November 2nd, 2007

It is a normal trend in IT to buy the most powerful machine for production use. For example, your production database box will have 16 CPUs, your QA box 8, and your development box 4 or 6. A similar ratio is maintained for ETL servers, application servers, and disk performance.

The logic is that the production environment is critical for the end-user experience.
Agreed, response time is critical for end users; however, that does not mean you buy a slower machine for development. Imagine the time wasted while a resource sits in front of the machine waiting for a query to return data. A developer spends practically all of his time working in the development database. So imagine a development box that is 50% slower than production: the developer is losing 50% of his time waiting for results. It's not just his time but also the focus and concentration he loses because his experience with the system is bad. It would not be an exaggeration to say that he loses 4 hours out of an 8-hour day.

In the US, the total cost of an ETL developer is about $100 an hour, so the company is losing $400 per developer per day. You do the math for 5 to 10 developers working for a year.

I am not saying that production and development systems must be the same, but I firmly believe the difference should be in the reliability and availability of the servers rather than in their performance.

I thought of this article while waiting for a query to return from a development box that runs at half the speed of the production server, and I remembered that management is already discussing upgrading the development server to improve performance.

Additional Informatica FAQs

October 22nd, 2007

What are the different types of Lookups?
The different types of lookups are: static lookup, dynamic lookup, and static persistent lookup.

What is Test Load in informatica?
If we want to test the execution of the mapping without loading any data in the target database
then we check the option of test load.

Why is parameterization important and how does it help?
Parameterization helps us avoid making changes to the mappings when any of the variables used in the
mapping change. By putting values that are likely to change into a parameter file (such as schema name or
update user_id), we improve the maintainability of the code.

Is there any advantage of having less number of Transformations in a mapping?
More Transformations mean increased time of execution for a mapping as the data has to be
passed thru another process of Transformation.

What is the overhead of a dynamic lookup over a static lookup?
Because a dynamic lookup's values can change while the mapping is executing,
PowerCenter has to continuously check for such changes, which causes additional overhead.

What is the advantage of sending sorted data into a joiner transformation?
When sorted data is sent thru a joiner Transformation, we can take advantage of the sorted input option
available in Joiner. This causes the join operation to take less time than otherwise.

Is it possible to have a mapping with update strategy, using bulk mode?
No, Update strategy can be executed only in Normal mode. 

How can a mapping be failed if the number of rejects is more than a certain threshold?
The session option 'abort on error' lets us set the number of errors at which the session will be aborted.
By default it is zero.
By Sujith Nair

Find me the attribute!……….. What attribute?

September 26th, 2007

An attribute without a context is a meaningless attribute (even if it has a definition associated with it).

One of the interesting phases in ETL is the data mapping exercise. By the time the data-mapping phase starts, the target model and source model are already defined. But sometimes a user pulls out a report or an Excel file, just like a magician pulls a rabbit out of a hat, and declares it to be the target. They will ask for the data to be mapped from the source. There is no target data model.

Probably the first thing will be to identify the list of attributes from the target and then off you go finding them in various source systems. Is this the right approach? Absolutely not! Will it work? Probably yes, but at the cost of time, money, resources and lots of unnecessary confusion. Why?

Let's take an example. The user wants to see the quantity of products sold by the Sales Representative (SR): DATE, QUANTITY, PRICE, etc.

Is Sales Representative a person or a group of people?
Is ‘DATE’ Date of order, date of execution or date of shipping?
Is ‘QUANTITY’ qty of order, qty of execution or qty of shipping, also do we consider for qty of return?

Is ‘PRICE’ price of Order, price of actual execution or price of invoice?

The point is that a floating data element is meaningless without the context of a subject area or entity. The attribute might itself be a combination of many other attributes from many other entities.

So what is the right approach?
Define a data model for the target even if it is not an actual database. It will help you understand the entity to which each attribute belongs. Then it will be easy to find the similar entity (or entities) in the source and map them to the target elements.

In short, build a data model before you actually start mapping the data.

Conclusion: an attribute has no meaning unless it is attached to an entity.

 

For every rule there is an exception; for each exception there are more exceptions…

May 13th, 2007

To implement an ETL process there are many steps that are followed. One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target and the rules of data transformation.

E.g. the table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, the range of an attribute, transformation rules, etc.


Without going into further details of the document, let's analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document, and starts developing; in a few days the code is ready for data loading.
But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed. And then it fails again. What's happening? The analysis was done properly; rules were chalked out and implemented according to the mapping document. So why are issues popping up? Was something missed?
Maybe not! Isn't it normal to have more issues in the initial lifetime of a process?
 
Maybe yes! You have surely missed 'Source System Data Profiling'. The business analyst has told you the rules for how the data is structured in the source system and how it is supposed to behave, but he or she has not told you the ifs and buts, the EXCEPTIONS to those rules.
 
To be realistic, it is not possible for anyone to just recite all the rules and exceptions to you like a parrot. You have to collaborate and dig out the truth. The choice is yours: do data profiling on the source system and try to break every rule the analyst told you, or wait for the process to go live and then wake up every night as the load fails. If you are lucky, you also get to deal with an unhappy user every morning you go to the office.
 
Make the right choice; don't skip 'source system data profiling' before actually writing a single line of code. Question every rule. Try to find the exceptions to the rules. Suppose there are at least 20 tables, each table has on average 30 columns, and each column has on average 100k values. Multiply tables x columns x data values and you get the number of reasons why your assumptions may be wrong. It's like unit testing the source data even before loading it. There is a reason why machines alone cannot do your job; there is a reason why IT jobs pay more.
 
Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’

Type II dimension: What is expiry date? Does it come from Source or is it ETL date?

February 6th, 2007

I assume everyone is aware of the Type II dimension concept. In a Type II dimension, when we get an updated row we retire the old row and insert a new one. But what do you populate in the from_date and to_date attributes: the ETL create/update datetime, or the time when that particular row was retired in the source system?

The question can be asked in a different way: when an employee moves from one department to another, and you are populating the Type II dimension EMPLOYEE table in the data warehouse, do you use the actual date of the transfer or the ETL date on which you populate the table?

What are you doing in your ETL code?

ETL Strategy for Reversing wrong file/data load.

January 19th, 2007

We all know 'garbage in, garbage out'. But what happens when the data provider (source system) realizes that it has provided incorrect data to the data warehouse, and now wants to correct the mistake by providing you new data? By then the ETL has already loaded the data; rows have been inserted, updated, and deleted! Simple inserts are easy to undo, but what about data that has been integrated through updates? How do you recover from such a situation? Can you actually recover? What should the ETL strategy be?

Has your Data warehouse Architect/Manager/ETL Lead come up with an ETL Strategy?

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

ETL Interface Architecture (ETL-IA) tm

December 23rd, 2006

An enterprise data warehouse has, by definition, more than one data source. In simple language, an employee table in the EDW will have multiple sources such as the HR department, the Sales department, and so on. Also, if the company is a group of companies, the same department can repeat across the various companies.

Let's say the DIM_EMPLOYEE table was designed by the data modeler, who then asks the ETL team to develop an ETL process to populate it. Very easy: write an ETL process to read from the source and load the target table.
Now, after a few days, they say they have identified another source. Should the ETL team start all over again and repeat the whole process? Is this efficient? Or is there a better way?

Well, welcome to my world of “ETL Interface Architecture ™ (ETL-IA)”

With ETL-IA, adding 3X sources does not result in 3X work. Instead, the additional work is just what is needed to handle the uniqueness of each source, thus saving time and money.

The question is how?

I will be discussing this in detail on the website. Detailed information on implementing the process is the core of the book ETLGuru.com: ETL Strategies and Solutions for Data Warehouse.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

Please click here to leave comments or Questions

Reference data Management & Reference Data Hub

December 22nd, 2006

This is not a new concept, but it is fast becoming a specialized area of data management. For example, in a company the employee list and details are managed in a central repository and distributed all over the organization. The same can happen with customer lists/details, stock lists/details, etc.

What is the relationship of such repositories with data warehouse? Are they part of data warehouse? How can they be integrated into the data warehouse? And many more questions.
 
I will add more to this article, but meanwhile here is a link to a practical example.

Readers welcome to write at ETLGuru.com

December 5th, 2006

Readers,

I will be focusing on the ETL book; any readers who wish to write for or contribute to the site are welcome. Let me know if you are interested. I will review, edit, and publish.

However, this does not mean that I will not write here. In fact, I think that the website and the ETL book will be complementary.

Sandesh

Informatica interview questions & FAQs

November 21st, 2006
What is a source qualifier?
What is a surrogate key?
What is difference between Mapplet and reusable transformation?
What is DTM session?
What is a Mapplet?
What is a look up function? What is default transformation for the look up function?
What is difference between a connected look up and unconnected look up?
What is update strategy and what are the options for update strategy?
What is subject area?
What is the difference between truncate and delete statements?
What kind of Update strategies are normally used (Type 1, 2 & 3) & what are the differences?
What is the exact syntax of an update strategy?
What are bitmap indexes and how and why are they used?
What is bulk bind? How does it improve performance?
What are the different ways to filter rows using Informatica transformations?
What is referential Integrity error? How do you rectify it?
What is DTM process?
What is target load order?
What exactly is a shortcut and how do you use it?
What is a shared folder?
What are the different transformations where you can use a SQL override?
What is the difference between a Bulk and Normal mode and where exactly is it defined?
What is the difference between Local & Global repository?
What are data driven sessions?
What are the common errors while running a Informatica session?
What are worklets and what is their use?
What is change data capture?
What exactly is tracing level?
What is the difference between constraints based load ordering and target load plan?
What is a deployment group and what is its use?
When and how a partition is defined using Informatica?
How do you improve performance in an Update strategy?
How do you validate all the mappings in the repository at once?
How can you join two or more tables without using the source qualifier override SQL or a Joiner transformation?
How can you define a transformation? What are different types of transformations in Informatica?
How many repositories can be created in Informatica?
How many minimum groups can be defined in a Router transformation?
How do you define partitions in Informatica?
How can you improve performance in an Aggregator transformation?
How does the Informatica know that the input is sorted?
How many worklets can be defined within a workflow?
How do you define a parameter file? Give an example of its use.
If you join two or more tables, pull about two columns from each table into the source qualifier, then pull just one column from the source qualifier into an Expression transformation, and then do a 'generate SQL' in the source qualifier, how many columns will show up in the generated SQL?
In a Type 1 mapping with one source and one target table what is the minimum number of update strategy transformations to be used?
At what levels can you define parameter files and what is the order?
In a session log file where can you find the reader and the writer details?
For joining three heterogeneous tables how many joiner transformations are required?
Can you look up a flat file using Informatica?
While running a session what default files are created?
Describe the use of Materialized views and how are they different from a normal view.

Contributed by Mukherjee, Saibal (ETL Consultant)

Many readers are asking, "Where are the answers?" Well, it will take some time before I get time to write them. But there is no reason to get upset; the Informatica help files should have all of these answers! :-)

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse……..ETL book by Sandesh Gawande

November 3rd, 2006

Ok! I am finally planning to publish ETL book with detailed information on all aspects related to ETL and Data Integration. The ETL Book will be used by Data Warehouse Managers, Data Warehouse Architects & ETL Leads. It will also contain solutions for ETL developers. The Book will be independent of ETL tools like Informatica, DataStage, etc.

The good news is that readers will not have to wait for the completion of all chapters. I will make individual chapters available as they are completed, so readers do not have to wait for the rest. Once all the chapters are complete, readers can return the individual chapters and get the complete ETL book free or by paying the difference. The ETL forum will have a thread for each chapter so that readers can discuss, recommend, and suggest. This will make it a complete ETL package.

Comments are welcome on the chapters, the title of the book, what you would like to see in the contents, etc. I will also offer a money-back guarantee if you dislike the book. As for competing ETL books, I would suggest you buy one (The Data Warehouse ETL Toolkit) only if the bookstore near you permits returns, so that you can compare for yourself and decide.

Please wait for link to purchase ETLGuru.com Book /Chapters… Coming soon… First few chapters before Nov 30 2006.  Dec 30 2006.

Oracle recycle bin flashback table feature

October 14th, 2006
Sometimes when you run 'SELECT * FROM tab;' you will be surprised to see tables with garbage names. Welcome to the world of the Oracle recycle bin feature. Because of this feature, Oracle keeps dropped tables in the recycle bin until you clear it.
1. To empty the recycle bin, use:
PURGE RECYCLEBIN;
2. To drop a table without storing it in the recycle bin, use:
DROP TABLE employee PURGE;
3. To restore a table from the recycle bin, use:
FLASHBACK TABLE employee TO BEFORE DROP;
So don't forget to clean your database/schema once in a while.

Data staging table / area design.

October 11th, 2006

This could be a long topic of discussion. The following are the main issues I would like to discuss about staging table/database design.

1. Why is a staging area needed?

Unlike OLTP systems, which create their own data through a user interface, data warehouses source their data from other systems, so there is physical data movement from the source database to the data warehouse database. The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse. The staging area serves many purposes above and beyond that primary function:
a. The data is most consistent with the source. It is devoid of any transformation or has only minor format changes.
b. The staging area in a relational database can be read/scanned/queried using SQL without logging into the source system or reading files (text/xml/binary).
c. It is a prime location for validating data quality from the source and for auditing and tracking down data issues.
d. The staging area acts as a repository for historical data, if it is not truncated.
e. Etc.

2. How is the staging area different from other areas of the data warehouse?

a. Normally, tables in a relational database are not standalone; they have relationships with at least one or more other tables. The staging area greatly differs in this respect: its tables are standalone and batch oriented. They are staged in the hope that in the next phase of the load a process will identify their relationships with other tables, and during that load the relationships will be established.
 

3. What should the staging table look like?

Staging Table Format

a. The key shown is a meaningless surrogate key, but it has been added for a reason: many times the data coming from a source has no unique identifier, or the unique identifier is a composite key. In such cases, when a data issue is found with a row it is very difficult to identify or even refer to that particular row. When a unique row number is assigned to each row in the staging table, it becomes really easy to reference it.

b. Various dates have been added to the table; please refer to the date discussion here.

c. The data types have been kept as strings, because this ensures that a row with a bad format or wrong data type will at least be populated in the stage table for further analysis or follow-up.

d. A source system column has been added to keep a reference to the data's origin, so that the next process step can use this value and behave dynamically based on the source system. It also supports reuse of the table, data partitioning, etc.

e. Note that the table name has the source as a qualifying prefix; this distinguishes the table from the same entity coming from another source system, for example customer from another system called MKT.

f. Other columns can be added, for example a processed flag to indicate whether the row has been processed by the downstream application, which also provides incremental restart ability for the downstream process. An exception flag can also be added to indicate that an exception or error was raised while processing the row, and hence the row was not processed. A sketch of such a staging table follows.
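Since the 'Staging Table Format' image is not reproduced here, the DDL below is only an illustrative sketch of the columns described in points a–f; every table and column name is an assumption (CRM is used as the example source prefix):

CREATE TABLE STG_CRM_CUSTOMER (
  STG_ROW_NUM          NUMBER,         -- (a) meaningless surrogate row number for easy reference
  SRC_SYS_CD           VARCHAR2(30),   -- (d) source system code, enables dynamic behavior and reuse
  CUST_ID              VARCHAR2(100),  -- natural identifier exactly as received
  CUST_NAME            VARCHAR2(255),  -- (c) data columns kept as strings so badly formatted rows still land
  CUST_BIRTH_DT        VARCHAR2(50),
  SRC_SYS_UPDATE_DTTM  VARCHAR2(50),   -- (b) source system timestamp, as received
  ETL_CREATE_DTTM      DATE,           -- (b) when the ETL loaded the row
  PROCESSED_FLAG       CHAR(1),        -- (f) set once the downstream process has consumed the row
  EXCEPTION_FLAG       CHAR(1)         -- (f) set if processing the row raised an error
);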

4. Which design to choose?

a. Should the table be truncated and loaded?
b. Should the table will be append only?
c. Should default data type to be left as alpha numeric string (VARCHAR)?
d. Should constraints be enforced?
e. Should there be a primary Key?

It normally depends on the situation, but if you are not sure, or don't want to think about it, the design suggested here should more than suffice for your requirements.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc. 

ETL Strategy to store data validation rules

September 6th, 2006

Every time data is moved, the results have to be tested against the expected results. For every ETL process, test conditions for the data are defined before or during the design and development phase itself; any that are missed can be added later on.

Various test conditions are used to validate data as the ETL process is migrated from DEV to QA to PRD. These test conditions may exist only in the developer's or tester's mind, or be documented in Word or Excel. With time, the test conditions get lost, ignored, or scattered around too much to be really useful.

If the ETL process runs successfully in production without errors, that is a good thing, but it does not really mean much. You still need rules to validate the data processed by the ETL. At this point you need the data validation rules again!

A better ETL strategy is to store the ETL business rules in a RULES table, keyed by target table and source system. These rules can be stored as SQL text. This creates a repository of all the rules in a single location, which can be called by any ETL process or auditor at any phase of the project life cycle.

There is also no need to rewrite or rethink the rules. Any or all of these rules can be made optional, tolerances can be defined, and they can be called immediately after the process runs or the data can be audited at leisure.

This data validation/auditing system will basically contain:
a table that contains the rules,
a process to call them dynamically, and
a table to store the results from the execution of the rules (a sketch follows).
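A minimal sketch of such a repository; all table and column names here are assumptions:

CREATE TABLE ETL_VALIDATION_RULE (
  RULE_ID        NUMBER,
  TARGET_TABLE   VARCHAR2(61),
  SOURCE_SYSTEM  VARCHAR2(30),
  RULE_SQL       VARCHAR2(4000),  -- the validation query text
  TOLERANCE      NUMBER,          -- allowed number of failing rows before the rule fails
  ACTIVE_FLAG    CHAR(1)          -- rules can be switched on/off without code changes
);

CREATE TABLE ETL_VALIDATION_RESULT (
  RULE_ID        NUMBER,
  RUN_DTTM       DATE,
  FAILED_COUNT   NUMBER,
  PASS_FAIL_FLAG CHAR(1)
);

A generic process reads the active rules for a target table, executes each RULE_SQL, and writes the outcome to the results table.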

Benefits:

Rules can be added dynamically with no change to the code.

Rules are stored permanently.

Tolerance levels can be changed without ever changing the code.

Business rules can be added or validated by business experts without worrying about the ETL code.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.

Hard references to database objects, Synonyms and ETL

September 6th, 2006

1. Any time hard references are made to a database object, it is an invitation to trouble, especially in ETL, where the object (table/view/procedure) can live in a different schema.

2. Oracle has –> Database server name –> Schema –> Object

3. SCHEMA_NAME.TABLE_NAME is a bad way of referencing objects, from the ETL architecture point of view as well as from the data warehouse architecture point of view, because any hard coding of objects will cause trouble when they are moved from one schema to another.

4. An alternative could be to log in with the same SCHEMA_NAME where the objects exist, but this creates a security hazard, as the ETL processes would then be in the same schema as the data warehouse structures and processes.

5. Referring to an object directly is also an issue because, if the table name or a column name changes, it results in an immediate code change.

6. Much argument can be made that it's not really an issue, just a little more work. But what most people don't understand is that it is not how much work, but when the work comes up: most likely when you are migrating, during a production run, or when the team is really short of time.

7. Hard coded references also reduce flexibility… what if ETL wants to read data from some other schema to test an ETL process? What then?

Ok! What’s the solution then… 

(figure: hard-coded references vs. synonym-based references)

A better way is to have a dedicated schema for the ETL user, and to reference all objects (tables/views/procedures) through public synonyms. This way any object can be replaced by a new reference, a temporary object, a view, or a private synonym without ever changing the code. It is a much better way; there are many more advantages, but of course there will be more maintenance.
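As a minimal Oracle sketch (the schema names are hypothetical), the ETL code references only the synonym, which can later be repointed without touching any code:

CREATE PUBLIC SYNONYM DIM_EMPLOYEE FOR DW_OWNER.DIM_EMPLOYEE;

-- Later, point the same name at a test copy without changing a single ETL process
CREATE OR REPLACE PUBLIC SYNONYM DIM_EMPLOYEE FOR DW_TEST.DIM_EMPLOYEE;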

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.

Loading & testing fact/transactional/balance data that is valid between dates!

July 25th, 2006
This is going to be a very interesting topic for ETL developers and data modelers who design processes/tables to load fact or transactional data that keeps changing between dates, e.g. share prices, company ratings, etc.

(figure: variable bond interest — source table)

The table above shows an entity in the source system that contains time-variant values, but they don't change daily: the values are valid over a period of time, and then they change.

(figure: variable bond interest — fact table)

1. What table structure should be used in the data warehouse?

Maybe Ralph Kimball or Bill Inmon can come up with a better data model! :-) But for ETL developers and ETL leads the decision is already made, so let's look for a solution.

2. What should be the ETL design to load such a structure?

Design A

  • There is one to one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute, which means that every time the ETL process gets a new value it has to add a new row with the current flag and go back to the previous row and retire it. This is a very costly ETL step and it slows down the ETL process.
  • From the report writer's point of view this model is a major challenge to use, because what if the report wants a rate that is not current? Imagine the complex query.

Design B

  • In this design a snapshot of the source table is taken every day.
  • The ETL is very easy. But can you imagine the size of the fact table when the source has more than 1 million rows (1 million x 365 days = 365 million rows per year)? And what if the values change every hour or minute?
  • But you have a very happy user who can write SQL reports very easily.

Design C

  • Can there be a compromise? How about using from date (time) – to date (time)! The report writer can simply provide a date (time) and straight SQL can return the value/row that was valid at that moment (see the query sketch after the design comparison).
  • However, the ETL is as complex as in design A, because while the current row runs from the current date to infinity, the previous row has to be retired from its from date to today's date minus 1.
  • This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and time only one instance of the row exists (for the primary key).

Which design is better? I have used all of them, depending on the situation.
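For design C, a point-in-time lookup is a straight filter on the date range. The table and column names below are assumptions, since the source images are not reproduced here:

SELECT INSTRUMENT_ID, INTEREST_RATE
From
FACT_BOND_INTEREST
WHERE
INSTRUMENT_ID = :instrument_id
and :as_of_date >= FROM_DT
and :as_of_date < NVL(TO_DT, DATE '9999-12-31');

The NVL handles the current row, whose TO_DT is still open (or set to a far-future date).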

3. What should be the unit test plan?

There are various cases where the ETL can go wrong, and when planning your test cases you should precisely target those. Here are some examples of test conditions (a sample check follows the list):

a. There should be only one value for a given date/date time

b. During the initial load when the data is available for multiple days the process should go sequential and create snapshots/ranges correctly.

c. At any given time there should be only one current row.

d. etc
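Test case (a), for example, can be automated with a self-join that looks for overlapping validity ranges for the same key; the names are the same assumptions as in the design C sketch, and ROWID is Oracle-specific:

SELECT A.INSTRUMENT_ID, A.FROM_DT, B.FROM_DT
From
FACT_BOND_INTEREST A, FACT_BOND_INTEREST B
WHERE
A.INSTRUMENT_ID = B.INSTRUMENT_ID
and A.ROWID <> B.ROWID
and A.FROM_DT < NVL(B.TO_DT, DATE '9999-12-31')
and B.FROM_DT < NVL(A.TO_DT, DATE '9999-12-31');

Zero rows returned means that, for every instrument, no two rows are valid at the same moment.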

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc.

Introduction to Error and exception management.

July 3rd, 2006

ETL is all about the transportation, transformation, and organization of data. Any time something moves (in fact, even if you are perfectly stationary and items around you move), accidents are bound to happen. So any ETL specialist who believes that their code is perfect and nothing can happen obviously lives in a fool's paradise.

The next obvious thing is to design for managing accidents, like making a safer car or factory. As an ETL specialist, if you don't do it you are no different from the others. As in any country, there are laws for accidents and for accidents due to criminal negligence; the latter is worse.

How many times have I seen people put ETL code into production without actually designing processes to prevent, manage, or report accidents. Writing code is one thing; writing production-worthy code is another. Do ask yourself or your developers, 'Is the code production worthy?'

Next

ERRORS: A programmatic error that causes the program to fail or makes it run for an uncontrolled length of time.

EXCEPTIONS: Code written to handle expected or unexpected errors gracefully, so that the program either continues to run (logging the error and bypassing the erroneous condition) or logs the error and exits gracefully with an error message.

A more detailed description will come with the topic 'Unhandled exceptions result in errors'.

Note: The topic on error and exceptions is relevant to Informatica, Data Stage, Abinitio, Oracle warehouse builder, PLSQL, SQLLDR, Transact SQL or any ETL other tools.

Aggregate aware queries

June 19th, 2006

Imagine a transaction table holding 20 million rows spanning 10 years, and a user writing a query to aggregate the data at a monthly level. Even after extensive database tuning, the best guess for the time required to return the result set can be tens of minutes or even an hour.

What is the solution?
There are three solutions based on the same concept of pre-aggregation, but they are used quite differently.
1. Write different queries in advance to hit different (pre-aggregated) tables.
2. Have a reporting tool that is aggregate aware (client side). The reporting tool itself decides which tables to hit and rewrites its own queries.
3. Have a database that supports transparent query rewrite. The database itself takes a simple query and rewrites it against the aggregate tables.

Here's an Oracle sketch of query rewrite, based on point 3 above.
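The fact table and columns below are assumptions; the point is the ENABLE QUERY REWRITE clause, which lets the optimizer answer monthly aggregate queries from the much smaller summary:

CREATE MATERIALIZED VIEW SALES_MONTHLY_MV
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT PRODUCT_KEY,
       TRUNC(TXN_DATE, 'MM') AS TXN_MONTH,
       SUM(TXN_AMOUNT)       AS TOTAL_AMOUNT,
       COUNT(*)              AS TXN_COUNT
FROM SALES_TRANSACTION
GROUP BY PRODUCT_KEY, TRUNC(TXN_DATE, 'MM');

With QUERY_REWRITE_ENABLED set to TRUE, a user query that aggregates SALES_TRANSACTION by month can be transparently redirected to this materialized view.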

 More details to come………… 

Lazy links to ETL basics.

June 15th, 2006

Ok check this on Wikipedia

http://web.archive.org/web/20070523031241/http://en.wikipedia.org/wiki/ETL

I will add more..

Talk to me!

June 15th, 2006

Dear Guest Readers,

This is the third month of ETL Guru. I am getting visitors from all over the world (USA, India, France, Germany, Australia, Canada). Based on my understanding I will continue to write about data integration. But it will be only from my point of view. Also I never know if the reader is really interested in those articles or if you really understand it.

Hence, if you have any questions, suggestions, or requests, please take the initiative and respond to the articles (you can comment on each article where it says 'no comments' or 'comments')…

OR Please put comment in the guest book (– Click here)

OR Go to the ETL forum & Raise questions/Discuss here (– Click here)

Sandesh

NOTE: ETL Guru is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc

Multiple executions of an ETL process against the same set of data.

June 15th, 2006

Every ETL designer, developer & tester should always ask this question…”What will happen, if I run the ETL process multiple times, against the same data set?”

Answer: 1. I get the same result set.
Answer: 2. I get multiple result set.

If you go back to the original article on what ETL is and what ETL is not, you will immediately come to the conclusion that answer 2 is incorrect, as ETL is not allowed to create data.

Why would the process run more than once against the same set of data? Many reasons; the most common are an operator's mistake, an accidental kickoff, an old data file remaining in the directory, a staging table loaded more than once, an intentional rerun of the ETL process after correcting some data in the source data set, etc. Without going into further details, I would advise ETL folks to always build into the process ways to prevent this from happening, using one or more combinations of the following methods (a merge-based sketch of method 1 follows the list)…
1. Identify the primary key (logical/physical) and put update else insert logic.
2. Deleting the target data set before processing again (based on logical/physical primary key)
3. Preventing occurrences of multiple runs by flagging processed dates
4. Marking processed records with processed flags after commit
5. Prevention of multiple loads in the staging area
6. identifying duplicate records in stage area before the data gets processed
7. more…
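A hedged Oracle-style sketch of method 1, update-else-insert keyed on the logical primary key; all table and column names are assumptions:

MERGE INTO DIM_CUSTOMER T
USING STG_CUSTOMER S
ON (T.CUSTOMER_ID = S.CUSTOMER_ID)
WHEN MATCHED THEN
  UPDATE SET T.CUSTOMER_NAME   = S.CUSTOMER_NAME,
             T.ETL_UPDATE_DTTM = SYSDATE
WHEN NOT MATCHED THEN
  INSERT (T.CUSTOMER_ID, T.CUSTOMER_NAME, T.ETL_CREATE_DTTM)
  VALUES (S.CUSTOMER_ID, S.CUSTOMER_NAME, SYSDATE);

Running this twice against the same staging data leaves the target with the same rows, which is answer 1 above.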

So do this experiment in the development or test environment: run the ETL process more than once and check the result! If you get result 2 (copies of rows, with no way to distinguish them or retire the old rows), the designer or the developer is wrong, and if the process has passed QA or testing, then the tester is wrong too.

Bottom line:
A test case that checks multiple runs is a must in the life cycle of an ETL process.

ETL delta logic & de-normalization of data model.

June 6th, 2006

It is normal practice in a data warehouse to de-normalize (once autocorrected as 'demoralize') the data model for performance. I am not going to discuss the benefits and issues of de-normalization, because by the time it comes to the ETL guy the fate of the model is already decided.

Let's look at the model on the source side, which is perfectly normalized.

(figure: normalized source model)

Now let's look at the de-normalized model on the target side.

(figure: de-normalized target model)
Next, let's think about the delta logic for loading the dim_employee table. Ideally you would only check for changes in the employee table; if there are any changes after the last load date-time, you get those rows from ref_employee, do the lookups to get the department and the designation, and load them into the target table.
 
The issue with this delta logic is that it does not consider the effect of de-normalizing the employee table on the target side. If you look carefully at the two de-normalized attributes, dept_name and emp_designation_desc, the ETL process will miss any changes in the parent tables: only new or updated employees will get the new definition of department and designation, and any employee not updated on the source side will still carry the old dept_name and emp_designation_desc. This is wrong.

It is wrong because the ETL delta logic only picked up a row from the employee table when that row changed, and ignored changes in the department and designation tables. The truth of the matter is: for any de-normalized target table, the affected rows should be re-captured from the source any time there is a change in the driving/core table as well as when there is a change in any parent table the driving table refers to. In this case, even if there is a change only in the department or designation table, all the affected rows in the employee table should be reprocessed.

It might seem very simple, but ETL developers/designers/modelers always miss this point. Also once developed it is very difficult to catch.

The next question is how you would catch the affected rows. There are ways to write SQL that combines the three tables (in this case), treats them as one single entity, and pulls rows based on any update_dttm greater than the last ETL run. Figure out the SQL… (one hedged attempt is sketched below).
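One hedged way to write it; only ref_employee, dept_name, and emp_designation_desc come from the article, so the parent table names, join columns, and update_dttm columns below are assumptions:

SELECT E.EMP_ID, E.EMP_NAME, D.DEPT_NAME, G.EMP_DESIGNATION_DESC
From
REF_EMPLOYEE E, REF_DEPARTMENT D, REF_DESIGNATION G
WHERE
E.DEPT_ID = D.DEPT_ID
and E.DESIGNATION_ID = G.DESIGNATION_ID
and GREATEST(E.UPDATE_DTTM, D.UPDATE_DTTM, G.UPDATE_DTTM) > :last_etl_run_dttm;

By taking the greatest of the three update timestamps, a change in any of the parent tables also pulls the affected employee rows.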

Types of data elements and entities (tables) for ETL.

June 3rd, 2006

It is important for an ETL developer to understand the types of tables and data in order to design ETL processes intelligently. Once the common types of objects are understood, reusable ETL templates can be developed regardless of the business logic. This will greatly improve the efficiency of an ETL developer.

1. Reference data

2. Dimensional data (master data)

3. Transactional data

4. Transactions

5. Balances

6. Summary/Aggregations

7. Snapshots

8. Staging

9. Out triggers/mini dimensions

10. Log tables

11. Meta data tables

12. Security tables

13. Configuration tables

 

Data integration basics – Dimension Conformance.

May 25th, 2006

The purpose of this topic is to establish the basics for the design of ETL processes. Without this understanding, an ETL process for data integration cannot be designed or developed.

A data warehouse can have multiple sources. The sources may contain data sets from entirely different subject areas, but some data sets will intersect. For example, sales data and salary data will have employee as the common set.

[Figure: data set intersection]

Between two or more sources, the subjects, entities, or even attributes can be common. So can we integrate the data easily? Mathematically it seems very easy, but the real world is not just about numbers or exactly matching string values. Everything can be similar, or even the same, yet not be represented in exactly the same manner in all the sources. The differences in how the same information and facts are represented across two or more sources create some of the most interesting challenges in data integration.

Data integration: the first step in data integration is identification of the common elements.
1. Identify the common entities.
For example, the Employee dimension can come from the sales system, the payroll system, etc. Products can come from manufacturing, sales, purchasing, etc. Once the common entity is identified, its definition should be standardized. For example, does employee include full-time employees as well as temporary workers?
 
2. Identify the common attributes.
What are the attributes common to employee: first name, middle name, last name, date of joining, etc.? Each attribute should be defined.

3. Identify the common values.
The same information can be represented in different forms in multiple source systems. For example, male sex can be represented as 'M', '1', 'male', or something else in each source system. A common representation must be decided (for example, 'Male'). If necessary, a finite set of values should also be established; for example, employee sex = ('Male', 'Female') and no other values are allowed. A hedged conformance sketch appears after the fourth step below.

The second step is the identification of a Data Steward, who takes responsibility and ownership for a particular set of data elements.

The third step is to design an ETL process to integrate the data into the target. This is the most important area in the implementation of an ETL process for data integration, and it will be discussed in more detail under its own heading.

The fourth and final step is to establish a process for the maintenance, review, and reporting of such elements.
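Returning to step 3 (identify the common values), here is a hedged sketch of conforming a sex code during the load; the staging table stg_employee and column src_gender_cd are illustrative assumptions.

SELECT CASE
         WHEN UPPER(src_gender_cd) IN ('M', '1', 'MALE')   THEN 'Male'
         WHEN UPPER(src_gender_cd) IN ('F', '2', 'FEMALE') THEN 'Female'
         ELSE 'Unknown'   -- route to the data steward for review
       END AS employee_sex
FROM   stg_employee;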

ETL and the importance of dates.

May 24th, 2006

ETL is all about data movement, batch processes, and so on. The biggest criteria for data movement are dates, such as:

ETL_CREATE_DTTM,

ETL_UPDATE_DTTM,

SRC_SYS_CREATE_DTTM,

SRC_SYS_UPDATE_DTTM,

BUSINESS_CREATE_DATE,

BUSINESS_UPDATE_DATE

A proper understanding of dates is one of the most essential requirements when designing, developing, testing, and scheduling an ETL process.

Let's discuss each of these in detail.

ETL Create Date Time: This is the date on which the row is loaded into the data warehouse. It has nothing to do with the business or business logic; it simply documents the exact date and time the row was populated in the database by the ETL process. Once this timestamp is added, it is never updated for the life of the database; it defines the first time the row arrived in the database. It is also important to note that during a batch load, the date and time at the beginning of the process should be stored in a variable, and that same exact date and time should be used for the whole instance of the batch process. Why? Think about it; more details will be offered soon. A sketch is below.
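A minimal sketch (Oracle PL/SQL) of capturing the batch timestamp once and reusing it for every row in the run; stg_employee and dim_employee are illustrative names.

DECLARE
  v_batch_dttm DATE := SYSDATE;  -- captured once at the start of the batch
BEGIN
  INSERT INTO dim_employee (emp_name, etl_create_dttm, etl_update_dttm)
  SELECT emp_name, v_batch_dttm, v_batch_dttm  -- every row in this run gets the same value
  FROM   stg_employee;
  COMMIT;
END;
/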

ETL Update Date Time: This is the same as ETL Create Date Time except that, while it may be populated during inserts, unlike the create timestamp it is updated every time any value in the row is updated. It tells you the last time the database row was updated.

Source System Create Date Time: This is the date and time on which the incoming row was created in the source system. Sometimes, because rows are combined from different tables or summarized, it is difficult to find the create date; a best guess can be used. Sometimes the source may not have a create date-time at all, in which case the system date on the source may be used.

Source System Update Date Time: This is similar to the source create timestamp, but it applies to the update date and time.

Now let's focus on one of the least understood and most misunderstood concepts about dates. It sometimes takes so much time to explain this at a worksite that it prompted me to write this article.

Business Date: This is the official business date to which the data/row belongs. It has nothing to do with the system date or clock time on the source application, or with when the row was loaded into the data warehouse.

Example 1: If a sales order was booked on 1/1/2006 but was not entered into the system until the system date became 1/4/2006, the business date is still 1/1/2006.

Example 2: Trades happen during the day on 22/12/2006 and the batch process starts within the application at 8 pm; even after the system clock passes midnight into the next day, 23/12/2006, the data still belongs to the 22nd and not to the 23rd. This is normally controlled by a function that holds the business date and only changes it when the triggering element instructs it to move to the new business date. This also helps because data can be processed for any historical date simply by controlling the business date in the system. A minimal sketch of such a control is below.
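A minimal sketch of a business-date control table; the table and column names are illustrative and the date arithmetic shown is Oracle-style (use DATEADD in Sybase/SQL Server).

CREATE TABLE etl_business_date (
  system_nm     varchar(55) NOT NULL,
  business_date date        NOT NULL
);

-- Only the end-of-day trigger advances the date; a back-dated run simply
-- sets business_date to the historical date to be processed.
UPDATE etl_business_date
SET    business_date = business_date + 1
WHERE  system_nm = 'TRADING';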

What are the uses of these six types of dates? Think…

Delta processing, Batch processing, Back dated processing, Auditing, etc.

ETL data file processing questions you must ask!

May 17th, 2006

File processing is often taken quite lightly these days. The reason is either ignorance or sheer carelessness. Another reason is that most people come from an RDBMS background and are simply not aware of the intricacies of file management.

So wake up…

Pre process
Before processing the file, have you checked that the contents of the header/trailer records actually match the file content?

Did the checksum of the file match that of the file on the remote server?

Does the file contain delta rows or a repeat of all rows?

Does the file contain flags for insert, update, and delete?

Does the file transfer protocol support restart capability?

Have you asked for the data file envelope information (e.g., source name, date) to be included in the contents of the data file or in the file name?

Have you asked for a header or trailer record? If yes, is it in the data file itself or in a separate file?

How do you know that the file is ready to be pulled?

How do you know there are no binary characters in the ASCII file?

If the file has to be pulled, what is the remote server name, IP address, user id, and password?

If the same file format is received from multiple sources, how do you distinguish files from the various source systems (by content, by name, or by location)?

If the same file is processed on multiple schedules, how do you distinguish between, say, a daily file, a monthly file, or a yearly file?

If the file is being pulled from a remote server, will a backup of the file be maintained on the remote server?

Is the remote file server identified by IP address rather than by name? In that case, how does the system handle failover and a change of server?

Is a timestamp appended to the file name?

Who are the main and backup contact persons for the file, and what are their contact details?

What file transport protocol should be used?

What is the delimiter of the file?

What is the expected size of the file?

What is the file type: binary, ASCII, EBCDIC, or something else?

What is the format of the file: multi-line, XML, etc.?

What is the frequency of the file: continuous, daily, weekly, bimonthly, etc.?

What is the location of the file?

What is the name of the file?

What is the source of the file?

What kind of access is provided: FTP, secured, or proprietary?

Will the file be overwritten daily on the source (remote) server?

Will the file be pushed or pulled?

In process

How do you know that the file being processed is not an old file?

How will the file be moved during processing?

What is the archive policy?

How should the process deal with deletes?

Can you avoid processing the complete file by taking a delta against the previous day's file?

Is the file compressed after load?

How do you know that all the file movements actually happened and did not fail at any step?

Did you check for the exit code after every step?

Did you convert the file columns into the correct format?

What will you do if some of the columns are not in the correct format?

Will you skip the row, replace the value or stop the process?

After how many failures will you terminate the process?

What happens if the load fails partway through, leaving a half-completed load?

How will you identify the half-loaded records?

How will you deal with the remaining records? Will they be corrected manually?

Will you rerun the whole file to fix the problem?

What will happen if, during the rerun, the file is not completely processed?

Post process

Did the row count in the table match that of the file?

Does the Checksum match?

Did a column shift occur in the file?

Have you stored the result of the load in a table? (A sketch follows.)
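A hedged sketch of recording the load result; the etl_load_audit table, its columns, and the values shown are all illustrative (use getdate() or SYSDATE instead of CURRENT_TIMESTAMP where the database requires it).

INSERT INTO etl_load_audit
       (file_nm, load_dttm, rows_in_file, rows_loaded, rows_rejected, checksum_match_flg)
VALUES ('sales_20060517.dat', CURRENT_TIMESTAMP, 100000, 99998, 2, 'Y');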

 

What ETL performance (processing speed) is good enough?

May 14th, 2006

Every ETL developer / data warehouse manager wants to know whether the ETL processing speed (usually measured in number of rows processed per second on the target side) is good enough. What is the industry standard? I could give a politically correct answer like… well, it depends…

But I would rather be blunt and say that usually 1000 rows/sec for INSERTs and 300 rows/sec for UPDATEs should be good enough. If you don't have special requirements, then any average around these numbers should make you comfortable.

What ETL is not?

May 10th, 2006

ETL should not be confused with a data creation process. It never creates new data. If a list of a hundred employees is being loaded, one more employee cannot be added to make it a hundred and one. If the last name of a customer is absent, an arbitrary last name cannot be substituted.

Data warehouses are not OLTP systems. Duplication of calculations in the source system and the data warehouse should not be attempted, as the numbers will be very difficult to match during QA. Also, the process in the source system can change in the future, which will leave the two sets of data out of sync.

ETL cannot change the meaning of data. For example, the sex codes 'M' and 'F' in the source system may be stored as '1' and '2' respectively in the data warehouse. This is OK because it does not change the business meaning of the data; it only changes its representation.

Why Informatica sequences & Sybase/SQL Server identity columns should not be used?

May 7th, 2006

Informatica provides the sequence object to create surrogate keys during load. This object is also sharable across various mappings and hence can be used in parallel. But I recommend never using it.
Here's the reason why…

1. MIGRATION ISSUE: The sequence is stored in the Informatica repository, which means it is disconnected from the target database. So during migration of code between environments, the sequence has to be reset manually. A bigger problem arises when tables with data are brought from the production environment to the QA environment; the mapping cannot be run immediately because the sequences are out of sync.

2. Sequences belong to the target schema and database, not to the processes: the table object should define and hold the values of its attributes, not something external to the system.

3. At first it might seem that Sybase/SQL Server does not support sequences, but check out the post below on simulating Oracle sequences; that solves the problem.

4. Logically it seems that Informatica calling a database procedure would slow down the ETL process, but in the real world I have found that not to be true. Additionally, this procedure is only called when a new reference data/dimension row is added. Reference data is not as volatile as transactions, so the adverse effect is nullified.

5. If identity columns are used instead, they cause more problems, as you lose programmatic control over them. For example, any Type II dimension changes become a nightmare to manage.

Simulating Oracle Sequences in Sybase & SQL Server

April 26th, 2006

Programmatic control is lost when identity columns are used in Sybase and SQL Server, and I do not recommend using identity columns to create surrogate keys during the ETL process; there are many more reasons for that. Oracle has the sequence feature, which is used extensively by Oracle programmers; I have no clue why other vendors do not provide the same. The custom code below has been used extensively by me and thoroughly tested: I ran multiple processes simultaneously to check for deadlocks and also made sure that the procedure returns different sequences to different client processes.

Notes: –

1. The table should have ROW LEVEL LOCKING enabled.

2. The sequence generator procedure is stateless (see the discussion of statelessness in object-oriented programming for more details).

3. Create one row for each target table in the sequence master table. Do not try to use one sequence for multiple tables; it will work, but it is probably not a good idea.

Step 1: Create a table with the following structure.

CREATE TABLE sequence_master (
sequence_nm varchar (55) NOT NULL ,
sequence_num integer NOT NULL
)
GO

Step 2: Create a stored procedure that will return the next sequence.

CREATE PROCEDURE p_get_next_sequence
@sequence_name varchar(55)
AS
BEGIN
DECLARE @sequence_num INTEGER
-- Returns -1 (an error) if no row exists for the requested sequence name.
SET @sequence_num = -1

-- Increment the counter and capture the new value in a single atomic statement.
UPDATE sequence_master
SET @sequence_num = sequence_num = sequence_num + 1
WHERE sequence_nm = @sequence_name

RETURN @sequence_num
END
GO
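Here is a hedged usage sketch; the table name dim_employee is illustrative. Seed one row per target table in sequence_master, then call the procedure to get the next surrogate key.

INSERT INTO sequence_master (sequence_nm, sequence_num) VALUES ('dim_employee', 0)
GO

DECLARE @next_key INTEGER
EXEC @next_key = p_get_next_sequence 'dim_employee'
SELECT @next_key AS next_key   -- returns 1; the next call returns 2, and so on
GO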

 

What Is ETL?

April 24th, 2006

ETL is the automated and auditable data acquisition process from source systems, involving one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading, and data cleaning.

Discussion:

A source system can be any application or data store that creates or stores data and acts as a data source for other systems. If data is created in such a source system, it can just as well be called a data factory.

Automation is critical, or the very purpose of ETL is defeated. ETL is no good if its processes need to be manually scheduled, executed, or monitored.

Data quality is a critical factor for the success of data warehousing projects. If data is of inadequate quality, then the knowledge workers who query the data warehouse and the decision makers who receive the information cannot trust the results. In order to obtain clean and reliable data, it is imperative to focus on data quality. While many data warehouse projects do take data quality into consideration, it is often a delayed afterthought. Even QA after ETL is not good enough; the quality process needs to be incorporated into the ETL process itself. Data quality has to be maintained for individual records, or even small bits of information, to ensure the accuracy of the complete database. One rotten tomato can ruin the whole basket. It has to be ensured that the data is consistently updated to represent the source system. The accuracy of each individual data element is important, and so is the integrity of the data across various pieces of information.

Extraction is the first major step in the physical implementation of ETL. Extraction initiates or triggers further downstream processes.

Needless to say, once data is extracted it has to be hauled and transported to the target, because the physical location of the source system might be different from that of the target warehouse. Many aspects like security, encryption, bandwidth, and transportation time also have to be considered; these will be discussed in the upcoming chapters.

Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is secured, you have to worry about its format and structure, because it will not be in the format needed for the target; for example, the grain level or the data types might be different. The data cannot be used as-is; rules and functions need to be applied to transform it. There are many types of transformation that occur on the incoming data, which will be discussed later on.

One of the purposes of ETL is to consolidate the data in a central repository, or to bring it to one logical or physical place. Data can be consolidated from similar systems, different subject areas, etc.

ETL must support data integration for data coming from multiple sources and arriving at different times. This has to be a seamless operation; otherwise you risk overwriting existing data, creating duplicate data, or, even worse, simply being unable to load the data into the target.

The loading part of the process is critical to integration and consolidation. The loading process decides how the data is added to the warehouse, or whether it is simply rejected. Methods like inserting, updating, or deleting are executed at this step. What happens to the existing data? Should the old data be deleted because of the new information? Should it be archived? Should the new data be treated as an addition to the existing data? A sketch of this decision follows.
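For illustration, a hedged sketch of the load step deciding between insert and update, written as an Oracle-style MERGE; the table and column names are assumptions, not a prescribed design.

MERGE INTO dim_employee t
USING stg_employee s
  ON (t.emp_id = s.emp_id)
WHEN MATCHED THEN
  UPDATE SET t.emp_name        = s.emp_name,
             t.etl_update_dttm = SYSDATE
WHEN NOT MATCHED THEN
  INSERT (emp_id, emp_name, etl_create_dttm, etl_update_dttm)
  VALUES (s.emp_id, s.emp_name, SYSDATE, SYSDATE);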

The incoming data might not be perfect. New York City might be shown in India, or the Taj Mahal in France. Data has to be cleaned before it goes into the database.

Data can be loaded with lots of care, but does that mean the data loaded into the warehouse is correct? What is the confidence level in the data? Only a data auditing process can establish that confidence level. Auditing normally happens after the data is loaded.

Data derivation is the process of adding data values derived from reference or transactional data using simple string or mathematical functions. Example: income - expense = profit; NY = New York. Ideally, no business rules should be involved, and complex or time-variant calculations should be avoided.

Data enrichment is adding or combining external data values and rules to enrich the information already in the data. For example, if we have a list that relates Zip Code, City, and State, then an address with Zip Code 06905 can safely be enriched by a lookup on that table: Zip Code 06905 –> City Stamford –> State CT. A lookup sketch follows.
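A hedged sketch of that Zip-code enrichment, assuming a reference table ref_zip_city_state(zip_code, city, state) and a staging table stg_address; the names are illustrative and the syntax is Oracle-style.

UPDATE stg_address a
SET    (city, state) = (SELECT r.city, r.state
                        FROM   ref_zip_city_state r
                        WHERE  r.zip_code = a.zip_code)
WHERE  (a.city IS NULL OR a.state IS NULL)
AND    EXISTS (SELECT 1 FROM ref_zip_city_state r
               WHERE  r.zip_code = a.zip_code);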

 

Slow Running ETL process (read side) & Table Statistics (Oracle).

April 23rd, 2006

Sometimes an ETL process runs at a considerably slow speed. During tests with a small result set it might fly, but when a million rows are applied the performance takes a nosedive. There can be many reasons for a slow ETL process: it can be slow because of the read, the transformation, or the load. Let's eliminate the transformation and the load for the sake of this discussion.

For an ETL process to be slow on the read side, here are some common reasons: 1. no indexes on joins and/or the 'where' clause; 2. a badly written query; 3. the source tables have not been analyzed. Out of these three, let's rule out 1 and 2.
In the past most databases had the RULE-based optimizer set in the INIT.ORA file, but for new development, and especially for data warehouses, the 'CHOOSE' optimizer is preferred. With the 'CHOOSE' option the query uses the cost-based optimizer if statistics are available for the tables in the query.

There are two methods to gather statistics: 1. the DBMS_STATS package, and 2. the ANALYZE command. Oracle does not recommend the ANALYZE command going forward, for various reasons (it is a command, it cannot analyze external tables, it gathers statistics that are not essential, it produces inaccurate statistics on partitioned tables and indexes, in the future ANALYZE will not support the cost-based optimizer, there is no monitoring for stale statistics, etc.).

DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'DWH', options => 'GATHER AUTO');
This package gathers all the necessary statistics automatically (there is no need to write a process to check for tables that have stale or missing statistics). Oracle implicitly determines which objects need new statistics and how to gather them. So once it is put into the ETL flow, you can sleep at home and everybody will be happy.

Bottom Line:
After a major data load, the tables should be analyzed by an automated process. Also, please do not use ANALYZE, as in the future it will not collect the statistics needed by the cost-based optimizer (CBO). The code can be found here.

This package should be used as part of the ETL workflow to make sure that, as batch processes run, the benefit of fresh statistics is available to the next process in the queue.

ETL programmers can also call this package within an ETL process/procedure at the beginning, so that if the tables required by the process have not been analyzed, they will be analyzed automatically. And if a procedure is going to modify a lot of data in a table, the package can be called just before the end of the procedure; a sketch follows.
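A minimal sketch (Oracle) of such a call near the end of a load procedure; the schema name DWH comes from the example above, while FACT_SALES is an illustrative table name.

BEGIN
  -- refresh statistics on the table this process has just loaded heavily
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH', tabname => 'FACT_SALES');
END;
/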

Shell Scripts for Oracle

April 23rd, 2006

Here's a lazy post for shell scripts: links only. In the future I will add some specialized scripts for ETL.