Tuesday, June 5, 2012

Data Integration Using Enterprise-wide Libraries

Traditionally, ETL tools are something of a black box. You program a ton of business logic into them, but that logic is specific to the tool and cannot be ported to another one. This lock-in causes problems when you try to switch to a new ETL vendor.

Lately I've been working on a unique data integration requirement. We want to build an ETL process that takes data from several sources, then cleanses and de-duplicates it, eventually loading it into a master database. Our requirement was to abstract out the business rules that would normally live in an ETL tool, making them reusable and available to other applications. We also wanted to make sure we didn't lock ourselves in to a particular ETL vendor.
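As a rough sketch of the de-duplication step described above: records arriving from several sources get reduced to one per normalized match key. The class and key logic here (`Deduplicator`, `matchKey`) are purely illustrative, not our actual production rules.

```java
import java.util.*;

// Illustrative sketch: collapse duplicate records from multiple sources
// by normalizing name + email into a match key. Not the real rule set.
public class Deduplicator {

    // Build a match key: trim, collapse whitespace, lowercase.
    static String matchKey(String name, String email) {
        String n = name == null ? "" : name.trim().replaceAll("\\s+", " ").toLowerCase();
        String e = email == null ? "" : email.trim().toLowerCase();
        return n + "|" + e;
    }

    // Keep the first record seen for each match key.
    static List<String[]> dedupe(List<String[]> records) {
        Map<String, String[]> seen = new LinkedHashMap<>();
        for (String[] r : records) {
            seen.putIfAbsent(matchKey(r[0], r[1]), r);
        }
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"John  Smith", "JOHN@EXAMPLE.COM"},
            new String[]{" john smith ", "john@example.com"},
            new String[]{"Jane Doe", "jane@example.com"});
        // The two John Smith variants collapse into one record.
        System.out.println(dedupe(rows).size()); // prints 2
    }
}
```

Because this logic lives in a plain Java class rather than inside an ETL component, any application that needs the same matching behavior can reuse it.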

Now, unlike traditional data/data warehouse teams, we have a strong set of developers with core programming skills who have a strong say in the data integration process. I am personally approach-agnostic: I believe the right tool must be used for the right kind of problem, and the requirements should drive which tool(s) you end up using. I've worked as a developer writing C# and Java backend/web applications, and also at the other end of the spectrum using data integration tools such as Talend and SSIS. Both approaches have their own advantages and disadvantages.

We ended up creating a Java library in Eclipse. We followed the Test-Driven Development methodology and wrote ample JUnit tests. We used Git for version control and GitHub for hosting our code. The build server was Jenkins, configured to pull from GitHub on every commit, and artifacts such as jar files were published via Artifactory. The idea was to have a central repository of business rules/logic which would undergo constant testing and quality assurance.
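A minimal sketch of what one rule in such a library might look like, with the kind of assertion a JUnit test would make shown in `main` for self-containment. The class name and behavior (`NameCleanser`) are hypothetical, not our actual API.

```java
// Hypothetical example of a cleansing rule packaged in the shared library.
public class NameCleanser {

    // Normalize a raw name: trim, collapse whitespace, title-case each word.
    public static String cleanse(String raw) {
        if (raw == null) return "";
        String collapsed = raw.trim().replaceAll("\\s+", " ");
        StringBuilder out = new StringBuilder();
        for (String word : collapsed.split(" ")) {
            if (word.isEmpty()) continue;
            out.append(Character.toUpperCase(word.charAt(0)))
               .append(word.substring(1).toLowerCase())
               .append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // On the build server this would be a JUnit assertion, e.g.
        // assertEquals("John Smith", NameCleanser.cleanse("  jOHN   smITH "));
        System.out.println(cleanse("  jOHN   smITH ")); // prints John Smith
    }
}
```

With Jenkins running the test suite on every commit, a rule like this cannot silently regress before the jar is published to Artifactory.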

This allowed us to package the business rules as a jar and use them in our ETL tool (Talend). Our requirement was a solution that was highly modular and reusable by other applications. If we had put our business logic and transformations into the ETL tool, we would have been highly dependent on it if we ever had to switch tools. Instead, we used the ETL tool as a workflow engine that triggered parts of the process, handled failures, generated emails, and so on.
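The pattern looks roughly like the sketch below: the ETL side is thin glue that delegates every transformation to the shared jar. Here a plain class stands in for the row objects an ETL tool would generate, and `RuleLibrary` stands in for the hypothetical shared jar; neither name is from our actual project.

```java
// Sketch of the "ETL as workflow engine" pattern: the job itself holds no
// business logic, only calls into the shared library jar.
public class GlueSketch {

    // Stand-in for the shared business-rule jar published to Artifactory.
    static class RuleLibrary {
        static String cleanseName(String raw) {
            return raw == null ? "" : raw.trim().replaceAll("\\s+", " ");
        }
    }

    public static void main(String[] args) {
        // Equivalent of one row passing through the ETL job:
        // one library call per output column, nothing more.
        String inputFullName = "  Ada   Lovelace ";
        try {
            String outputFullName = RuleLibrary.cleanseName(inputFullName);
            System.out.println(outputFullName); // prints Ada Lovelace
        } catch (RuntimeException e) {
            // The ETL tool's remaining job: catch failures, alert, retry.
            System.err.println("row failed: " + e.getMessage());
        }
    }
}
```

Because the glue is this thin, swapping Talend for another ETL tool means rewriting only the orchestration, not the rules.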

You might want to try a similar approach in your next data integration project. Let me know how it goes!
