Wednesday, February 10, 2010

MapReduce ETL for Logs

While researching how MapReduce can be applied to ETL in organizations, I found an interesting article that shows how RackSpace.com used MapReduce to search its logs.

Check it out:
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

Keep in mind that RackSpace.com generates several hundred gigabytes of logs every day, which puts it at the extreme end of data processing compared with the data generated in an average organization. So far, MapReduce seems to be a good fit for data-intensive ETL such as POS data in retail, clickstream data, etc.
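
To make the idea concrete, here is a minimal sketch of the map, shuffle, and reduce steps applied to log lines. It runs in a single Python process and is not Hadoop or the pipeline described in the article. The log format ("<timestamp> <host> <status>"), the sample lines, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical log format (an assumption): "<timestamp> <host> <status>"
SAMPLE_LOGS = [
    "2010-02-10T10:00:01 mail1 delivered",
    "2010-02-10T10:00:02 mail2 bounced",
    "2010-02-10T10:00:03 mail1 delivered",
    "malformed line",
    "2010-02-10T10:00:04 mail1 bounced",
]


def map_phase(line):
    """Extract and transform: emit ((host, status), 1) for each valid line."""
    parts = line.split()
    if len(parts) != 3:
        return  # skip lines that do not match the assumed format
    _, host, status = parts
    yield (host, status), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate: sum the counts for one (host, status) key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    for key, count in sorted(run_job(SAMPLE_LOGS).items()):
        print(key, count)
```

In a real cluster the map and reduce functions would run in parallel on many machines and the shuffle would be handled by the framework. The per-key counts produced here are the kind of summary rows an ETL job would then load into a warehouse table.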

Update: Here is another article, published by the ACM, that compares the use of MapReduce and parallel DBMSs. A very interesting read.

http://db.csail.mit.edu/pubs/p64-stonebraker.pdf
