After great pain and many hours I was finally able to index something in Solr using Talend. Solr and Talend, both awesome free tools, are not the easiest to integrate at first, mainly because in order to use SolrJ (the Java client library for Solr) you have to import several jars, and getting Talend to pick up those jars and autocomplete the methods in them can be a pain.
First, you need the following jars from your Solr instance; put them in the lib/java folder of your Talend installation, then load them in your job using the tLibraryLoad component.
From /dist:
- apache-solr-solrj-*.jar
From /dist/solrj-lib:
- commons-codec-1.3.jar
- commons-httpclient-3.1.jar
- commons-io-1.4.jar
- jcl-over-slf4j-1.5.5.jar
- slf4j-api-1.5.5.jar
From /lib:
- slf4j-jdk14-1.5.5.jar
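If one of these jars is missing, the job tends to fail with compile errors rather than a clear message. As a quick sanity check you can probe the classpath with Class.forName before wiring up the job; this helper class is my own (only the SolrJ class names come from the imports used later), and it needs nothing beyond the JDK:

```java
// Hypothetical helper: probe the classpath for the SolrJ classes the
// tJavaFlex code needs, so a missing jar is reported up front.
public class ClasspathCheck {

    // Returns true if the named class can be loaded from the classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.solr.client.solrj.SolrServer",
            "org.apache.solr.client.solrj.impl.CommonsHttpSolrServer",
            "org.apache.solr.common.SolrInputDocument",
        };
        for (String name : required) {
            System.out.println(name + (isOnClasspath(name) ? " - found" : " - MISSING"));
        }
    }
}
```

Each SolrJ class should print "found" once the jars above are loaded.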
Start Code of the tJavaFlex
// start part of your Java code: open one connection to Solr for the whole flow
String url = "http://localhost:8983/solr";
SolrServer server = new CommonsHttpSolrServer(url); // SolrJ 1.x HTTP client
Main code
// here is the main part of the component,
// a piece of code executed in the row
// loop
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField( "id", row1.CustID, 1.0f );  // third argument is the index-time field boost
doc1.addField( "name", row1.FName, 1.0f );
server.add(doc1); // sends the document; it becomes searchable after commit()
End code
server.commit();
In the advanced settings of the tJavaFlex, under Import, type:
import org.apache.solr.client.solrj.*;
import org.apache.solr.common.*;
import org.apache.solr.client.solrj.impl.*;
I got sample data for 350k customers from http://www.briandunning.com/sample-data/ and was able to load it at an average rate of 300 records/second into Solr. The awesome thing about this is that you can update your Solr index in near real-time, depending on how you plan to do your ETL. The DataImportHandler, by contrast, is restrictive in that it only indexes sources of certain types.
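For context on what server.add() and server.commit() do under the hood: SolrJ posts update messages to Solr's /update handler as XML. Here is a minimal, JDK-only sketch of that message format; the helper class and method names are my own, and the field names match the tJavaFlex example above:

```java
// Sketch of the XML <add> message that server.add(doc) conceptually posts
// to Solr's /update handler. Built with the JDK only; helper names are
// illustrative, not part of SolrJ.
public class SolrUpdateXml {

    // Minimal XML escaping for field values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build an <add> message for one document from field name/value pairs.
    static String addMessage(String[][] fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (String[] f : fields) {
            sb.append("<field name=\"").append(f[0]).append("\">")
              .append(escape(f[1])).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        System.out.println(addMessage(new String[][] {
            { "id", "42" }, { "name", "Bob & Alice" }
        }));
    }
}
```

A commit corresponds to posting a `<commit/>` message to the same handler, which is why nothing shows up in search results until the end code runs.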