Indexing the documents stored in a database
NOTE: Also check the update note at the end of this post.
These steps were done on my machine running Fedora 17. The commands can be easily adapted for other distributions.
Set up a MySQL database with documents
Install MySQL Server:
# yum install mysql-server
# service mysqld start
Also install the Java library for connecting to MySQL (Solr will need it later):
# yum install -y mysql-connector-java
Set up a MySQL database for storing binary files:
CREATE DATABASE binary_files;

CREATE TABLE tbl_files (
  id_files tinyint(3) unsigned NOT NULL auto_increment,
  bin_data longblob NOT NULL,
  description tinytext NOT NULL,
  filename varchar(255) NOT NULL,
  filesize integer NOT NULL,
  filetype varchar(255) NOT NULL,
  PRIMARY KEY (id_files)
);

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER ON binary_files.*
  TO binary_user@localhost IDENTIFIED BY 'binary_password';
Now let's write a Ruby script to populate the database with documents. We will need a Ruby-MySQL driver and a MIME detection library:
# yum install ruby-mysql
# yum install rubygem-mime-types
Here is the script: https://gist.github.com/4706365#file-insert-mysql-rb
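In case the gist is ever unavailable, here is a minimal sketch of what such an insert script could look like. This is a hypothetical reconstruction, not the gist itself: it assumes the table and credentials created above and the ruby-mysql and mime-types packages installed earlier.

```ruby
#!/usr/bin/env ruby
# Sketch of insert-mysql.rb -- the gist linked above is the authoritative version.
require 'mysql'       # from the ruby-mysql package
require 'mime/types'  # from the rubygem-mime-types package

file, desc = ARGV
abort "usage: insert-mysql.rb FILE DESCRIPTION" unless file && desc

# Read the file as raw bytes and guess its MIME type from the filename.
data  = File.binread(file)
types = MIME::Types.type_for(file)
type  = types.empty? ? 'application/octet-stream' : types.first.content_type

# Insert the blob plus its metadata using a prepared statement.
con  = Mysql.new('localhost', 'binary_user', 'binary_password', 'binary_files')
stmt = con.prepare('INSERT INTO tbl_files
                      (bin_data, description, filename, filesize, filetype)
                    VALUES (?, ?, ?, ?, ?)')
stmt.execute(data, desc, File.basename(file), data.bytesize, type)
stmt.close
con.close
```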
Let's add some documents to the database:
$ mkdir sample-docs
$ cd sample-docs
$ wget http://www.manning.com/lowagie/sample-ch03_Lowagie.pdf
$ wget http://www.xmlw.ie/aboutxml/wordsample2.doc
$ wget http://www.columbia.edu/~fdc/sample.html
$ cd ..
$ for f in sample-docs/* ; do DESC=`basename "$f" | tr ' ' '-'`; ruby insert-mysql.rb "$f" "$DESC"; done
If you get the error «Lost connection to MySQL server during query» when running insert/update queries, you may need to raise your MySQL server's limits.
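With large blobs, the usual culprit is the max_allowed_packet setting (an assumption on my part; check your MySQL error log to confirm). A typical fix is to raise it in /etc/my.cnf:

```ini
# /etc/my.cnf -- allow larger packets so big blob inserts fit in one query
[mysqld]
max_allowed_packet=64M
```

Restart mysqld afterwards (`service mysqld restart`) for the change to take effect.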
Setup Apache Solr
Download and unpack Apache Solr 3.6.2:
$ wget -c http://apache.techartifact.com/mirror/lucene/solr/3.6.2/apache-solr-3.6.2.tgz
$ tar zxf apache-solr-3.6.2.tgz
$ cd apache-solr-3.6.2
$ cd example/
Here you will see an example of data import from an HSQL database, but we want to work with MySQL, so we create a new configuration. (This will be easier to follow if you have read the README files in the Apache Solr package.)
$ cp -r example-DIH/ dih-mysql/
$ cd dih-mysql/
$ rm -rf hsqldb/
Remove everything except db/, and symlink the MySQL connector into its lib/ directory:
$ cd solr/
$ rm -rf solr mail rss tika
$ ln -s /usr/share/java/mysql-connector-java.jar db/lib/
Now the directory structure should look something like this:
$ find dih-mysql/ -type d
dih-mysql/
dih-mysql/solr
dih-mysql/solr/db
dih-mysql/solr/db/conf
dih-mysql/solr/db/conf/xslt
dih-mysql/solr/db/lib
dih-mysql/solr/db/data
dih-mysql/solr/db/data/index
Let's now update the Solr configuration. Here it is necessary to make sure the Tika content parser libraries are referenced in the configuration file, as follows.
In solrconfig.xml, add the libraries needed to parse the content (this avoids ClassNotFound errors):
<lib dir="../../../../contrib/extraction/lib/" regex="tika-core-\d.*\.jar" />
<lib dir="../../../../contrib/extraction/lib/" regex="tika-parsers-\d.*\.jar" />
<lib dir="../../../../contrib/extraction/lib/" regex=".*\.jar" />
Next, in schema.xml, add the relevant fields that will be indexed along with the binary content (PDF/DOC/HTML etc.).
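For example, a sketch of such field definitions, assuming the column names from tbl_files above and the "string"/"text" field types shipped with the example schema:

```xml
<!-- Hypothetical schema.xml additions; adjust names and types to your setup. -->
<field name="id_files"    type="string" indexed="true" stored="true" required="true" />
<field name="description" type="text"   indexed="true" stored="true" />
<field name="filename"    type="string" indexed="true" stored="true" />
<field name="filetype"    type="string" indexed="true" stored="true" />
<field name="body"        type="text"   indexed="true" stored="true" />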
In db-data-config.xml, configure the database, table, and columns from which to fetch the content to be indexed.
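A db-data-config.xml along these lines ties it together. This is a sketch, not my exact file: it assumes the binary_files database and credentials created above, and the field names must match what you declared in schema.xml (see also the update at the end of this post):

```xml
<dataConfig>
  <!-- JDBC source for the table rows; FieldStreamDataSource streams the blob
       column to Tika without loading it as a string. -->
  <dataSource name="db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/binary_files"
              user="binary_user" password="binary_password" />
  <dataSource name="fieldReader" type="FieldStreamDataSource" />
  <document>
    <entity name="root" dataSource="db"
            query="SELECT id_files, bin_data, description, filename, filetype FROM tbl_files">
      <field column="id_files"    name="id_files" />
      <field column="description" name="description" />
      <field column="filename"    name="filename" />
      <field column="filetype"    name="filetype" />
      <!-- Nested entity: run each bin_data blob through Tika for text extraction. -->
      <entity dataSource="fieldReader" processor="TikaEntityProcessor"
              dataField="root.bin_data" format="text">
        <field column="text" name="body" />
      </entity>
    </entity>
  </document>
</dataConfig>
```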
I would recommend going through the official documentation.
Now we are all set with the configuration. It's time to index the documents.
Start the Solr server. Notice how we are specifying the configuration path:
$ cd apache-solr-3.6.2/example
$ java -Dsolr.solr.home="./dih-mysql/solr/" -jar start.jar
Then invoke the indexer by hitting this URL: http://localhost:8983/solr/db/dataimport?command=full-import (you can poll progress by replacing full-import with status).
The steps described above worked just fine for me. I hope this helps others facing similar issues.
I had to add an extra field/entity in both schema.xml and db-data-config.xml to make the indexing work. Perhaps there is some other problem with my configuration. I don't understand why this works (and why the officially documented way doesn't), but this is the workaround I figured out.
Update in schema.xml
<field name="text2" type="text" indexed="false" stored="false"/>
Update in db-data-config.xml
<entity dataSource="fieldReader" processor="TikaEntityProcessor"
        dataField="root.bin_data" format="text">
  <field column="text2" name="body" />
</entity>