Sunday, February 27, 2011

Using Talend to do ETL into Solr

After many painful hours I was finally able to index something in Solr using Talend. Solr and Talend are both awesome free tools, but they are not the easiest to integrate at first, mainly because using SolrJ (the Java library for Solr) requires importing several jars. Working with those jars, and getting code completion on their methods, can be a pain in Talend.

First, you need the following jars from your Solr instance. Put them in the lib/java folder of your Talend installation, then load them in your job using the tLibraryLoad component.

From /dist:
  • apache-solr-solrj-*.jar
From /dist/solrj-lib:
  • commons-codec-1.3.jar
  • commons-httpclient-3.1.jar
  • commons-io-1.4.jar
  • jcl-over-slf4j-1.5.5.jar
  • slf4j-api-1.5.5.jar
From /lib:
  • slf4j-jdk14-1.5.5.jar
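
A missing jar can be hard to spot, so a quick check that the expected classes can actually be loaded may save some time. The following is only a sketch under a few assumptions: the ClasspathCheck class and its method names are made up for illustration, it assumes Java 7 or later, and the class names it probes are ones shipped in the jars listed above (slf4j-jdk14 is not checked here). The body of main could be pasted into a tJava placed after the tLibraryLoad components.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which of the expected classes cannot be loaded.
class ClasspathCheck {

    // Returns true if the named class can be found by this class's class loader.
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of classNames that could not be loaded.
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<String>();
        for (String name : classNames) {
            if (!isLoadable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(
            "org.apache.solr.client.solrj.impl.CommonsHttpSolrServer", // apache-solr-solrj
            "org.apache.commons.httpclient.HttpClient",                // commons-httpclient
            "org.apache.commons.codec.binary.Base64",                  // commons-codec
            "org.apache.commons.io.IOUtils",                           // commons-io
            "org.apache.commons.logging.LogFactory",                   // jcl-over-slf4j
            "org.slf4j.Logger");                                       // slf4j-api
        System.out.println(missing.isEmpty()
            ? "All expected classes found"
            : "Missing classes: " + missing);
    }
}
```

If anything is reported as missing, the corresponding jar is most likely not in lib/java or not loaded by a tLibraryLoad in the job.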



Start code of the tJavaFlex

// Start code: runs once, before the first row.
// Creates the SolrJ client used by the rest of the component.
String url = "http://localhost:8983/solr";
SolrServer server = new CommonsHttpSolrServer(url);

Main code

// Main code: runs once for each incoming row.
// Builds a Solr document from the current row and sends it to the server.
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", row1.CustID, 1.0f);
doc1.addField("name", row1.FName, 1.0f);
server.add(doc1);

End code

// End code: runs once, after the last row.
// Commits so that the added documents become searchable.
server.commit();

In the Advanced settings of the tJavaFlex, add the following imports:

import org.apache.solr.client.solrj.*;
import org.apache.solr.common.*;
import org.apache.solr.client.solrj.impl.*;



I got sample data for 350k customers from http://www.briandunning.com/sample-data/ and was able to load it into Solr at an average rate of 300 records/second. The awesome thing about this approach is that you can update your Solr index in real time, depending on how you plan to do ETL. The DataImportHandler, by contrast, is restrictive in that it only indexes files of certain types.
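
The job above sends one document per row. If throughput matters, one option is to collect documents and send them in batches instead. The following is a minimal sketch of that idea, not what the job above does: BatchBuffer is a hypothetical helper, it assumes Java 8 or later, and it is written against a generic sink so it can be tried without a running Solr instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects items and hands them to a sink in fixed-size batches.
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Buffers one item and flushes automatically once the batch is full.
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered to the sink; does nothing if the buffer is empty.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // pass a copy so the sink owns its list
        pending.clear();
    }

    public static void main(String[] args) {
        // Stand-in sink that only reports batch sizes; a real job would index here.
        BatchBuffer<String> buffer =
            new BatchBuffer<>(2, batch -> System.out.println("sending " + batch.size() + " docs"));
        buffer.add("a");
        buffer.add("b");
        buffer.add("c");
        buffer.flush(); // sends the final partial batch
    }
}
```

In the tJavaFlex, the buffer would be created in the Start code, buffer.add(doc1) would replace server.add(doc1) in the Main code, and buffer.flush() would go before server.commit() in the End code. The sink would call server.add(batch), which accepts a collection of SolrInputDocument but throws checked exceptions, so it needs a try/catch wrapper inside the lambda. Whether batching actually improves on the 300 records/second figure is something to measure rather than assume.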

7 comments:

  1. Thank you a lot for your tutorial that is easy to follow and works perfectly!

  2. I'm glad it could be of help. Thanks.

  3. Just came across this blog. This is great. Saved me a lot of time. Thanks!!! :)

  4. Thank you for your post.
    In case it is of interest, I developed Solr components that are available on Talend Forge. There are 5 components (tSOLRConnection, tSOLRCommit, tSOLRRollback, tSOLRInput, and tSOLROutput).
    When inserting data with tSOLROutput, batch inserts are used for faster response times. With tSOLRInput you can retrieve facet results too.

    Documentation on how to use them is available on my blog: http://inrage-blog.blogspot.com/2012/03/solrtalend-components-tutorial-this.html

    I hope this helps Talend and Solr users.

  5. Thanks, Sebastien, for creating the components. I'm sure they will be of great help to the Talend community.

  6. I am interested in learning more about how Talend and Solr can be integrated using XML files formatted the way Solr needs.

  7. Thanks for sharing the information. Can you please also tell us whether the custom library import is still needed in the latest Talend Big Data / Data Integration products, or whether these jars are included out of the box?
