Saturday, January 29, 2011

Download tweets using Talend

Modules in Talend (or components, as they are called in the Talend world) open up a new opportunity to slice and dice the massive amount of web data now available. A lot of user content is generated every day. What if you want to monitor it and scrape it? Take tweets, for example. Sentiment analysis, anyone?

I built a simple JSON parser with Talend using the tFileInputJSON component. It takes Twitter search results (capped at 1500) and stores them in a file.
The URL string for tFileInputJSON reads:
"http://search.twitter.com/search.json?q=egypt&rpp=100&page="+((Integer)globalMap.get("row1.Page_Max15")).toString()

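Here Page_Max15 is a sequence generated by tRowGenerator; at 100 results per page, 15 pages cover the 1500-result cap. With a few modifications you can run this incrementally to refresh your warehouse every day. To do that, you'll need to capture the highest tweet id from each run and pass it as since_id on the next one.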
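
To make the paging concrete, here is a rough standalone Java sketch of the URLs the job ends up requesting. It is not the Talend job itself: the class and method names are made up for illustration, and I am assuming the page sequence starts at 1 and that the query term should be URL-encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Talend: it mirrors the URL expression
// used in tFileInputJSON, once per page.
public class SearchPageUrls {
    // Values taken from the post: 100 results per page, 1500-result cap.
    static final int RESULTS_PER_PAGE = 100;
    static final int MAX_RESULTS = 1500;
    static final String BASE = "http://search.twitter.com/search.json";

    // Builds the search URL for a single page.
    static String pageUrl(String query, int page) {
        try {
            return BASE + "?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&rpp=" + RESULTS_PER_PAGE
                    + "&page=" + page;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Builds one URL per page; assumes pages are numbered from 1.
    static List<String> allPageUrls(String query) {
        int pages = MAX_RESULTS / RESULTS_PER_PAGE;
        List<String> urls = new ArrayList<String>();
        for (int page = 1; page <= pages; page++) {
            urls.add(pageUrl(query, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : allPageUrls("egypt")) {
            System.out.println(url);
        }
    }
}
```

In the Talend job the loop is replaced by the generated rows: each row supplies one page number, and the expression above reads it back from globalMap.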

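The incremental idea can be sketched in plain Java as well. This is only an illustration under a few assumptions: the tweet ids below are invented, the helper names are hypothetical, and where the last id is stored between runs (a file, a warehouse table) is left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: tracks the highest tweet id seen so far and
// appends it to the search URL as since_id on the next run.
public class IncrementalSearch {

    // Returns the largest id among the fetched tweets, or the previous
    // value when nothing newer came back.
    static long highestId(List<Long> tweetIds, long previousSinceId) {
        long max = previousSinceId;
        for (long id : tweetIds) {
            if (id > max) {
                max = id;
            }
        }
        return max;
    }

    // Appends since_id to a URL that already has a query string.
    // A value of 0 is treated as "first run, no since_id yet".
    static String incrementalUrl(String baseUrl, long sinceId) {
        return sinceId > 0 ? baseUrl + "&since_id=" + sinceId : baseUrl;
    }

    public static void main(String[] args) {
        String base = "http://search.twitter.com/search.json?q=egypt&rpp=100";
        long lastSeen = 0L; // would be loaded from wherever the last run saved it

        // Invented ids standing in for the ids parsed from one run's results.
        List<Long> fetched = Arrays.asList(101L, 250L, 175L);
        lastSeen = highestId(fetched, lastSeen);

        // URL for the next run: only tweets newer than lastSeen are requested.
        System.out.println(incrementalUrl(base, lastSeen));
    }
}
```

Keeping the previous value when a run returns nothing means a quiet day does not reset the job back to a full reload.
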
This opens up plenty of possibilities: enhance your data warehouse, fetch data for sentiment tracking, or index it in Solr and make it available to your executive team. Kinda nice to walk in to a meeting and say, "Hey! This is what people thought about us yesterday!"

2 comments:

  1. Hi Yash,
    I am also working on the same type of project, but I'm facing an issue.
    I have the Twitter URL, and the job is a one-to-one extract and dump, with tFileInputJSON as input and tFileOutputJSON as output, using the URL. The job throws error 400.
    Can you please help?

    1. Check whether the URL is valid by entering it manually in the browser.
