Type II Dimension or a Fake Fact Table!

You are sales manager of North-East region in 2010 to Oct 23 2011 and did total sales of 1 million. The table that stores your information looks like…. Dim Sales Person Sales Person Key =99 Sales Person Name = Tony Peterson Sales Person Designation = Sales Manager Sales Region = North-East Phone = 203 203 [...]

Posted in Data Modeling | No Comments

How ETL-IA is beneficial to you?

As the previous ETL-IA blog post mentions that an addition of a source, adds an additional ETL process. It means that the time and effort that it takes to build the data warehouse is directly proportional to the number of sources. The more the number of sources, the more the warehouse system would cost. An [...]

Posted in ETL Interface Architecture (TM) | 2 Comments

Testing in ETL, Data-Centric Projects

Testing is an investigation process that is conducted to check the quality of the product. Product can either be an applications or data. The quality of the data can only be determined by checking data against some existing standards by following a set of processes. By doing so, you find out the symptoms in form [...]

Posted in ETL Testing | 7 Comments

Pitfalls of type II dimension

Type II dimension has been popularized by R Kimball. it has become so popular that in any interview related to data warehouse, the interviewer will surely ask to explain the concepts. And chances are that if you don’t know; they will laugh at your ignorance and reject you. Here’s your chance to laugh at them.. [...]

Posted in Data Architecture, Data Modeling, Realtime ETL | 9 Comments

Difference between Reference Data and Master Data

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences. Lets try to understand the differences with an example of sales transaction. A sales transaction contains information like…. Store, Products Sold, Sales Person, Store Name, Sales Date, Customer, Price, Quantity, etc. Attributes from the above example can be separated [...]

Posted in Data Architecture, ETL Basics, ETL Book, Uncategorized | 7 Comments

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse -Contents of ETL Book

Section A ,The Beginning 01 ETL The Basics 02 ETL Strategy Section B, Analysis 03 Target Systems Analysis 04 Source Systems Analysis 05 Source Target Mapping- Part I 06 Understanding Data Quality 07 Data Profiling Section C, Develop Part I 08 Understanding Data Patterns for ETL 09 Simple ETL Development Section D,  ETL Architecture & Design 10 ETL & Data Integration 11 ETL-IA (Interface Architecture) 12 ETL-IA Implementation 13 Designing [...]

Posted in ETL Book | 28 Comments

Slower Development Databases and Servers

It is a normal trend in it to buy the most powerful machine for production usage. Example you production database box will have 16 CPUs your QA 8 and your Development box 4 or 6. Similar ratio is maintained on ETL servers, Application servers, and Hard Disk performance. Logic being the production environment is critical [...]

Posted in ETL Performance Tuning, ETL Strategy, Informatica | 2 Comments

Informatica FAQ’s Additional

What are the different types of Lookups? The different types of lookups are static Lookup, dynamic lookup, static persistent Lookup  What is Test Load in informatica? If we want to test the execution of the mapping without loading any data in the target database then we check the option of test load. Why is parameterization [...]

Posted in Informatica FAQs, Interview FAQs | 4 Comments

Find me the attribute!……….. What attribute?

An attribute with a context is a meaningless attribute. (Even if it has a definition associated with it). One of the interesting phases in ETL is the data mapping exercise. By the time data-mapping phase starts the target model and source model is already defined. But sometimes a report or excel file is pulled out [...]

Posted in ETL Mapping Document | No Comments

For every rule there is an exception; for each exception there are more exceptions…

To implement an ETL process there are many steps that are followed. One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target and the rules of data transformation. Ex. Table / column map between source and target, rules to identify unique rows, not [...]

Posted in Data Profiling, Data Quality, ETL Exception & Error Handling, Source System Analysis, Uncategorized | 1 Comment

Type II dimension: What is expiry date? Does it come from Source or is it ETL date?

I assume everyone is aware of type II dimension concept.  In type II dimension when we get an updated row we retire old row and insert new one. But what do you populate  in the from_date —to_date attributes. ETL create/update datetime or when that particular row was retired in the source system? The question can be [...]

Posted in Data Architecture, ETL Strategy, Uncategorized | 1 Comment

ETL Strategy for Reversing wrong file/data load.

We all know ‘garbage in, garbage out’. But what happens when the data provider (source system) realizes that they have provided incorrect data to the data warehouse? And now they want to correct the mistake by providing you new data! By then ETL has already loaded the data. The data is inserted, updated & deleted! [...]

Posted in ETL Strategy, File Management, Uncategorized | 2 Comments

ETL Interface Architecture (ETL-IA) tm

An enterprise data warehouse has by definition has more then one data source. In simple language a table employee in the EDW will have multiple sources like HR dept, Sales dept, and so on. Also if the company is a group of companies then same department can repeat across various companies. Let’s say the DIM_EMPLOYEE [...]

Posted in ETL Interface Architecture (TM), Uncategorized | No Comments

Reference data Management & Reference Data Hub

This is not a new concept, but it is fast becoming a specialized area in Data Management. Example in a company employee list & details are managed in central repository and distributed it all over the organization. Similar thing can happen to customer list/details, stock list/ details etc. What is the relationship of such repositories [...]

Posted in Reference Data, Uncategorized | No Comments

Readers welcome to write at ETLGuru.com

Readers, I will be focusing on ETL BOOK, any reader wishes to write /contribute to the site are welcome. Let me know if anyone is interested. I will review, edit and publish. However this does not mean that I will not write here. In fact I  think that the website & the ETL book will be [...]

Posted in Uncategorized | 5 Comments

Informatica interview questions & FAQs

What is a source qualifier? What is a surrogate key? What is difference between Mapplet and reusable transformation? What is DTM session? What is a Mapplet? What is a look up function? What is default transformation for the look up function? What is difference between a connected look up and unconnected look up? What is [...]

Posted in ETL Tools, Informatica, Informatica FAQs, Interview FAQs, Uncategorized | 45 Comments

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse……..ETL book by Sandesh Gawande

Ok! I am finally planning to publish ETL book with detailed information on all aspects related to ETL and Data Integration. The ETL Book will be used by Data Warehouse Managers, Data Warehouse Architects & ETL Leads. It will also contain solutions for ETL developers. The Book will be independent of ETL tools like Informatica, DataStage, etc. The good [...]

Posted in ETL Book, Uncategorized | 12 Comments

Oracle recycle bin flashback table feature

When you do ‘Select * from tab;’ some times, you will be surprised to see some tables with garbage names. Welcome to the world of Oracle Recycle Bin Feature. Because of this feature, Oracle saves the dropped table in recycle bins until you clear it. 1. Empty recycle bin use the command: PURGE RECYCLEBIN; 2. [...]

Posted in Oracle, Uncategorized | No Comments

Data staging table / area design.

This could be long topic of discussion.  Following are the main issues I would like to discuss on staging table /database design. 1. Why staging area is needed? Unlike OLTP systems that create their own data through an user interface data warehouses source their data from other systems. There is physical data movement from source [...]

Posted in Data Staging, Uncategorized | 1 Comment

ETL Startegy to store data validation rules

Every time there is movement of data the results have to be tested against the expected results. For every ETL process, test conditions for testing data are defined before/during design and development phase itself.  Some that are missed can be added later on. Various test conditions are used to validate data when the ETL process is migrated from [...]

Posted in ETL Exception & Error Handling, ETL Testing, Uncategorized | 1 Comment

Hard references to database objects, Synonyms and ETL

1. Any time hard references are made to a database object it is an invitation to trouble. Especially in ETL where the object (table /view /procedure) can belong to schema. 2. Oracle has –> Database server name –> Schema –> Object 3. SCHEMA_NAME.TABLE_NAME is a bad way of referencing objects from the ETL architecture point [...]

Posted in Data Architecture, Uncategorized | No Comments

Loading & testing fact/transactional/balances (data), which is valid between dates!

This is going to be a very interesting topic for ETL & Data modelers who design processes/tables to load fact or transactional data which keeps on changing between dates. ex: prices of shares, Company ratings, etc. The table above shows an entity in the source system that contains time variant values but they don’t change [...]

Posted in Data Architecture, Data Modeling, Data Transformation, ETL Strategy, ETL Testing, ETL Tools, Informatica, Uncategorized | 2 Comments

Introduction to Error and exception management.

ETL is all about transportation, transformation and organizing of data. Of anytime something moves (as a matter of fact even if you are perfectly stationary and items around moves) accidents are bound to happen. So any ETL specialist believes that their code is perfect and nothing can happen obviously lives in a fool’s paradise. The [...]

Posted in ETL Exception & Error Handling, ETL Strategy | No Comments

Aggregate aware queries

Imagine a scenario table having 20 million transactions for 10 years and a user writing a query to get data aggregated at monthly level. The best guess for the time required to return the result set, even after extensive database tuning can be in multiples of 10 minutes or even an hour. What is the [...]

Posted in Data Aggregration, Data Modeling, Database Tuning, ETL Performance Tuning, Oracle, Uncategorized | No Comments

Lazy links to ETL basics.

Ok check this on Wikipedia http://web.archive.org/web/20070523031241/http://en.wikipedia.org/wiki/ETL I will add more..

Posted in ETL Basics | No Comments

Talk to me!

Dear Guest Readers, This is the third month of ETL Guru. I am getting visitors from all over the world (USA, India, France, Germany, Australia, Canada). Based on my understanding I will continue to write about data integration. But it will be only from my point of view. Also I never know if the reader [...]

Posted in Uncategorized | No Comments

Multiple executions ETL process against same set of data.

Every ETL designer, developer & tester should always ask this question…”What will happen, if I run the ETL process multiple times, against the same data set?” Answer: 1. I get the same result set. Answer: 2. I get multiple result set. If you go back to the original article on What is ETL & What [...]

Posted in Data Quality, Data Quality Issues, Data Staging, ETL Exception & Error Handling, Testing ETL & Data, Uncategorized | 2 Comments

ETL delta logic & de-normalization of data model.

It is a normal practice in data warehouse to de normalizes (Or once auto corrected as demoralize) as the data model for performance. I am not going to discuss the benefits vs. issues with de-normalization.  As by the time it comes to the ETL guy the fate of the model is already decided. Let’s look [...]

Posted in Data Extraction, ETL Automation, Uncategorized | No Comments

Types data elements and entities (Tables) for ETL.

It is important for an ETL developer to understand the types of tables and data, to intelligently design ETL processes. Once the common types objects are understood, reusable templates for ETL can be developed, regardless of business logic. This will greatly improve the efficiency of an ETL developer. 1. Reference data 2. Dimensional data (master [...]

Posted in Data Loading, Data Modeling, Uncategorized | 2 Comments

Data integration basics-Dimension Confirmation.

The purpose of this topic is to establish the basics for design of ETL processes. With out the understanding an ETL process for data integration cannot be designed or developed.  A database can have multiple sources. Multiple sources may contain a data set of entirely different subject areas, but some data set will intersect. Example [...]

Posted in Enterprise Dimension Confirmation, ETL Interface Architecture (TM), Reference Data, Uncategorized | No Comments

ETL and the importance of dates.

ETL is all about data movement, batch processes and so on. The biggest criterias for data movement are based on dates. Dates like ETL_CREATE_DTTM, ETL_UPDATE_DTTM, SRC_SYS_CREATE_DTTM, SRC_SYS_UPDATE_DTTM, BUSINESS_CREATE_DATE, BUSINESS_UPDATE_DATE Proper understanding of dates is one of the most essential requirements, while designing, developing, testing, and scheduling ETL process. Lets discuss each of these in details.. [...]

Posted in ETL Strategy, Uncategorized | No Comments

ETL Data file processing questions, you must ask!

Many times file processing is taken quite lightly these days. The reason is either ignorance or sheer carelessness. Another reason being most of them being from RDBMS background they are just not aware of the intricacies of file management. So wakeup…. Pre process Before processing the file have you checked if the contents in header/trailer [...]

Posted in File Management | No Comments

What ETL performance (processing speed) is good enough?

Every ETL developer / Data Warehouse Manager wants to know if the ETL processing speed  (Usually measured in NUMBER OF ROWS PROCESSED PER SECOND on target side)is good enough? What is the industry standard? I can give a politically correct answer like… Well it depends…  But I would rather be blunt and say that usually [...]

Posted in Database Tuning, ETL Performance Tuning, Informatica, Uncategorized | 1 Comment

What ETL is not?

ETL should not be confused with a data creation process. It never creates new data. If a list of hundred employees is being loaded, one more employee cannot be added to the list and make it hundred and one. Or if last name of customer is absent an arbitrary last name cannot be substituted. Data [...]

Posted in ETL Basics, Uncategorized | No Comments

Why Informatica sequences & Sybase/SQL server identity columns, should not be used?

Informatica provides the sequence object to create surrogate keys during load. These object is also sharable within various mappings hence can be used in parallel. But I will recommend never using it.  Here’s the Reason why…. 1. MIGRATION ISSUE:  The sequence is stored in the Informatica repository. That means it is disconnected from the target [...]

Posted in Informatica, Oracle, SQL Server, Sybase | No Comments

Simulating Oracle Sequences in Sybase & SQL Server

Programmatic control is lost when identity columns are used in Sybase and SQL Server. I do not recommend using Identity columns to create surrogate keys during ETL process. There are many more reasons for that. Oracle has the sequence feature which is used extensively by Oracle programmers. I have no clue why other vendors are [...]

Posted in Data Loading, Informatica, SQL Server, Sybase, Uncategorized | No Comments

What Is ETL?

ETL is the automated and auditable data acquisition process from source system that involves one or more sub processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning. Disscussion: Source System can be any application or data store that creates or stores data and acts as a data [...]

Posted in ETL Basics | No Comments

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sometimes an ETL process runs considerably slow speed. During test for the small result set it might fly but when a million rows are applied the performance takes a nosedive. There can be many reasons for slow ETL process.  The process can be slow because read, transformation, load. Lets eliminate the transformation and load for [...]

Posted in Database Tuning, ETL Performance Tuning, Oracle, Uncategorized | No Comments

Shell Scripts for Oracle

Here’s a lazy post for shell scripts. Links In future I will add some specalised scripts for ETL.

Posted in Shell Scripts for ETL | 3 Comments