Sunday, February 27, 2011

Using Talend to do ETL into Solr

After many painful hours, I was finally able to index something in Solr using Talend. Solr and Talend are both awesome free tools, but they are not the easiest to integrate at first, mainly because using SolrJ (the Java library for Solr) requires importing several jars. Working with those jars and getting code completion on their methods can be a pain in Talend.

First, you need the following jars from your Solr instance. Put them in the lib/java folder of your Talend installation, and load them in your job using the tLibraryLoad component.

From /dist:
  • apache-solr-solrj-*.jar
From /dist/solrj-lib:
  • commons-codec-1.3.jar
  • commons-httpclient-3.1.jar
  • commons-io-1.4.jar
  • jcl-over-slf4j-1.5.5.jar
  • slf4j-api-1.5.5.jar
From /lib:
  • slf4j-jdk14-1.5.5.jar



Start code of the tJavaFlex

// Start code: runs once, before the first row.
// Create the SolrJ client pointing at the local Solr instance.
String url = "http://localhost:8983/solr";
SolrServer server = new CommonsHttpSolrServer(url);
  

Main code

// Main code: runs once for each incoming row.
// Build one Solr document per row and send it to the server.
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", row1.CustID, 1.0f);   // 1.0f is the default field boost
doc1.addField("name", row1.FName, 1.0f);
server.add(doc1);

End code

// End code: runs once, after the last row.
// Commit so the added documents become searchable.
server.commit();

In the Advanced settings of the tJavaFlex, enter these imports:



import org.apache.solr.client.solrj.*;
import org.apache.solr.common.*;
import org.apache.solr.client.solrj.impl.*;
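
To make it clearer how the three tJavaFlex sections fit together, here is a minimal standalone sketch. It assumes Talend runs the Start code once, the Main code once per incoming row, and the End code once after the last row. Row and FakeServer are hypothetical stand-ins (for Talend's generated row1 struct and for CommonsHttpSolrServer) so the example runs without SolrJ or a live Solr instance; it is an illustration of the flow, not the code Talend actually generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the row1 struct Talend generates from the schema.
class Row {
    final String CustID;
    final String FName;

    Row(String custId, String fName) {
        this.CustID = custId;
        this.FName = fName;
    }
}

// Hypothetical stand-in for the SolrJ server: it only records what it is given.
class FakeServer {
    final List<Map<String, Object>> pending = new ArrayList<>();
    final List<Map<String, Object>> committed = new ArrayList<>();

    void add(Map<String, Object> doc) {
        pending.add(doc);
    }

    // Move everything added so far into the "searchable" list.
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }
}

class FlexLifecycleSketch {
    static FakeServer run(List<Row> rows) {
        // Start code: executed once, before any row arrives.
        FakeServer server = new FakeServer();

        // Main code: executed once per row (Talend supplies the loop).
        for (Row row1 : rows) {
            Map<String, Object> doc1 = new LinkedHashMap<>();
            doc1.put("id", row1.CustID);
            doc1.put("name", row1.FName);
            server.add(doc1);
        }

        // End code: executed once, after the last row.
        server.commit();
        return server;
    }
}
```

The point of the split is that the server object is created once and the commit happens once, while only the document-building code repeats for every row.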



I got sample data for 350k customers from http://www.briandunning.com/sample-data/ and was able to load it into Solr at an average rate of 300 records/second. The awesome thing about this is that you can update your Solr index in real time, depending on how you plan to do ETL. The DataImportHandler is more restrictive, in that it only indexes files of certain types.
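
If the per-row server.add(doc1) call turns out to be the bottleneck, one common variation is to buffer documents and send them in batches. The sketch below is an assumption on my part, not what the job above does, and I have not measured it against the 300 records/second figure. BatchBuffer is a hypothetical helper and the sink is a plain callback; in the tJavaFlex it would presumably wrap SolrJ's server.add(Collection<SolrInputDocument>). It uses a lambda-friendly Consumer, so it needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects items and hands them to a sink in fixed-size batches.
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Called from the Main code, once per row.
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Called from the End code (before the commit) to send any leftover items.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Hand over a copy so the sink never sees the buffer being cleared.
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

In a job like the one above, the buffer would be created in the Start code, add would replace server.add(doc1) in the Main code, and flush would run in the End code just before server.commit().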

8 comments:

  1. Thank you a lot for your tutorial that is easy to follow and works perfectly!

  2. I'm glad it could be of help. Thanks.

  3. Just came across this blog. This is great. Saved me a lot of time. Thanks!!! :)

  4. Thanks Sebastien for creating the component. That will be of great help to the Talend Community I'm sure.

  5. I am interested in learning more about how Talend and Solr can be integrated using XML files formatted the way Solr needs.

  6. Thanks for sharing the information. Can you please also suggest whether we need to follow the custom library import in the latest Talend Big Data / Data Integration products too? Or are these present out of the box?

  7. Hi There

    I am using your custom components for Talend, but I am having trouble getting multivalued data to insert correctly. Can you give an example of how to get it to insert? I have changed the type to "list", but then how does the data in my input file (pipe-delimited text) need to be structured? Say I have a file with two columns, ID and Weight, and I have 2 values for "weight".

  8. Using Talend for ETL processes into Solr is a strategic choice. Its user-friendly interface and robust capabilities streamline data extraction, transformation, and loading.
