Showing posts with label Talend Solr ETL SolrJ real-time.

Sunday, February 27, 2011

Using Talend to do ETL into Solr

After a lot of pain and many hours, I was finally able to index something in Solr using Talend. Solr and Talend are both awesome free tools, but they are not the easiest to integrate at first, mainly because using SolrJ (the Java library for Solr) requires importing several jars. Working with those jars, and getting IntelliSense-style completion on their methods, can be a pain in Talend.

First, you need the following jars from your Solr instance. Put them in the lib/java folder of your Talend installation, and load them in your job using the tLibraryLoad component.

From /dist:
  • apache-solr-solrj-*.jar
From /dist/solrj-lib:
  • commons-codec-1.3.jar
  • commons-httpclient-3.1.jar
  • commons-io-1.4.jar
  • jcl-over-slf4j-1.5.5.jar
  • slf4j-api-1.5.5.jar
From /lib:
  • slf4j-jdk14-1.5.5.jar
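
Before wiring up the job, it can help to confirm that every jar actually made it into the folder. The following is a minimal sketch, not anything shipped with Talend or Solr: the class name SolrJarCheck and the default "lib/java" path are assumptions for illustration, and the patterns are copied from the list above.

```java
import java.io.File;
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper (not part of Talend or Solr): reports which of the
// jar patterns from the list above have no matching file name.
class SolrJarCheck {

    // Glob patterns copied from the jar list in this post.
    static final List<String> REQUIRED = Arrays.asList(
        "apache-solr-solrj-*.jar",
        "commons-codec-1.3.jar",
        "commons-httpclient-3.1.jar",
        "commons-io-1.4.jar",
        "jcl-over-slf4j-1.5.5.jar",
        "slf4j-api-1.5.5.jar",
        "slf4j-jdk14-1.5.5.jar");

    // Returns every required pattern that none of the given file names match.
    static List<String> missing(Collection<String> fileNames) {
        List<String> notFound = new ArrayList<>();
        for (String pattern : REQUIRED) {
            PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + pattern);
            boolean found = false;
            for (String name : fileNames) {
                if (matcher.matches(Paths.get(name))) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                notFound.add(pattern);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Assumed default; pass the real lib/java path of your Talend install.
        String dir = args.length > 0 ? args[0] : "lib/java";
        String[] names = new File(dir).list(); // null if dir is not a directory
        if (names == null) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        System.out.println("Missing jars: " + missing(Arrays.asList(names)));
    }
}
```

Run it with the path to your Talend lib/java folder as the only argument; an empty list means all seven jars were found.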



Start code of the tJavaFlex:

// Start code: runs once, before the first row.
// Point this URL at your own Solr instance.
String url = "http://localhost:8983/solr";
SolrServer server = new CommonsHttpSolrServer(url);

Main code:

// Main code: runs once for every row in the flow (row1 here).
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", row1.CustID, 1.0f);   // third argument is the field boost
doc1.addField("name", row1.FName, 1.0f);
server.add(doc1);

End code:

// End code: runs once, after the last row, and makes the added documents searchable.
server.commit();

In the Advanced settings of the tJavaFlex, add these imports:



import org.apache.solr.client.solrj.*;
import org.apache.solr.common.*;
import org.apache.solr.client.solrj.impl.*;
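
To see how the Start, Main, and End code sections fit together, here is a rough standalone sketch of the control flow. It does not use SolrJ at all: FakeSolrServer is a made-up stand-in that only mimics add() and commit(), the sample rows are invented, and the code Talend really generates will look different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for SolrServer so the sketch runs without Solr.
// It only mimics add() and commit(); it is not a SolrJ class.
class FakeSolrServer {
    final List<Map<String, Object>> pending = new ArrayList<>();
    final List<Map<String, Object>> committed = new ArrayList<>();

    void add(Map<String, Object> doc) {
        pending.add(doc);
    }

    void commit() {
        committed.addAll(pending);
        pending.clear();
    }
}

class FlexLifecycleSketch {

    // Mirrors the three tJavaFlex sections for rows of {CustID, FName}.
    static FakeSolrServer run(String[][] rows) {
        // Start code: runs once, before the first row.
        FakeSolrServer server = new FakeSolrServer();

        for (String[] row : rows) {
            // Main code: runs once per incoming row.
            Map<String, Object> doc1 = new LinkedHashMap<>();
            doc1.put("id", row[0]);
            doc1.put("name", row[1]);
            server.add(doc1);
        }

        // End code: runs once, after the last row.
        server.commit();
        return server;
    }

    public static void main(String[] args) {
        String[][] rows = { {"1", "Ada"}, {"2", "Grace"} }; // made-up sample rows
        FakeSolrServer server = run(rows);
        System.out.println("Committed documents: " + server.committed.size());
    }
}
```

The point of the split is that the connection is created once and the commit happens once, while only the cheap add() call sits inside the row loop.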



I got sample data for 350k customers from http://www.briandunning.com/sample-data/ and was able to load it into Solr at an average rate of 300 records/second. The awesome thing about this approach is that you can update your Solr index in real time, depending on how you plan to do ETL. The DataImportHandler, by contrast, is restrictive in that it only indexes files of certain types.
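
For a sense of scale, the two figures above translate into a total load time of a bit under 20 minutes, assuming the average rate held for the whole run. A tiny sketch of that arithmetic (the class and method names are made up for illustration):

```java
// Back-of-the-envelope estimate from the figures quoted in this post.
// Assumes the average rate applies to the entire load.
class LoadTimeEstimate {

    // Seconds needed to load `records` rows at `recordsPerSecond`.
    static double secondsToLoad(long records, double recordsPerSecond) {
        return records / recordsPerSecond;
    }

    public static void main(String[] args) {
        double seconds = secondsToLoad(350_000L, 300.0);
        System.out.printf("About %.0f seconds (%.1f minutes)%n",
                seconds, seconds / 60.0);
    }
}
```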