Indexing the documents stored in a database using Apache Solr and Apache Tika


Indexing the documents stored in a database

Outline:

  • Set up a MySQL database [1] containing documents (PDF/DOC/HTML, etc.).
  • Set up Apache Solr / Tika.
  • Import the documents just by hitting an import URL.

NOTE: Also check the update note at the end of this post.

These steps were done on my machine running Fedora 17. The commands can be easily adapted for other distributions.

Set up a MySQL database with documents

Install MySQL Server:

# yum install mysql-server
# service mysqld start

Also install the Java library for connecting to MySQL (Solr will need it later):

# yum install -y mysql-connector-java

Set up a MySQL database [1] for storing binary files:

CREATE DATABASE binary_files;

CREATE TABLE tbl_files (
 id_files tinyint(3) unsigned NOT NULL auto_increment,
 bin_data longblob NOT NULL,
 description tinytext NOT NULL,
 filename varchar(255) NOT NULL,
 filesize integer NOT NULL,
 filetype varchar(255) NOT NULL,
 PRIMARY KEY (id_files)
);

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER
 ON binary_files.*
 TO binary_user@localhost
 IDENTIFIED BY 'binary_password';

Now let's create a Ruby script to populate the database with documents. We will need a Ruby-MySQL driver [2][3] and a MIME-detection library [4]:

# yum install ruby-mysql
# yum install rubygem-mime-types

Here is the script: https://gist.github.com/4706365#file-insert-mysql-rb
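In case the gist is unavailable, here is a rough sketch of what the insert script does. The database, user, password, and column names match the SQL above; the extension-to-MIME map is a simplified stand-in for the mime-types gem used in the real script:

```ruby
# insert-mysql.rb -- sketch: store one file as a BLOB in tbl_files.
# Usage: ruby insert-mysql.rb <file> <description>

# Simplified stand-in for the mime-types gem: map extension -> MIME type.
MIME_MAP = {
  '.pdf'  => 'application/pdf',
  '.doc'  => 'application/msword',
  '.htm'  => 'text/html',
  '.html' => 'text/html',
}.freeze

def mime_type_for(filename)
  MIME_MAP.fetch(File.extname(filename).downcase, 'application/octet-stream')
end

if ARGV.length == 2
  require 'mysql' # the ruby-mysql driver installed above
  path, description = ARGV
  data = File.binread(path)

  db = Mysql.connect('localhost', 'binary_user', 'binary_password', 'binary_files')
  stmt = db.prepare('INSERT INTO tbl_files ' \
                    '(bin_data, description, filename, filesize, filetype) ' \
                    'VALUES (?, ?, ?, ?, ?)')
  stmt.execute(data, description, File.basename(path), data.bytesize,
               mime_type_for(path))
  db.close
end
```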

Let's add some documents to the database:

$ mkdir sample-docs
$ cd sample-docs
$ wget http://www.manning.com/lowagie/sample-ch03_Lowagie.pdf
$ wget http://www.xmlw.ie/aboutxml/wordsample2.doc
$ wget http://www.columbia.edu/~fdc/sample.html
$ cd ..
$ for f in sample-docs/* ; do DESC=`basename "$f" | tr ' ' '-' `; ruby insert-mysql.rb "$f" "$DESC"; done

If you get the error "Lost connection to MySQL server during query" when running MySQL insert/update queries, you may need to raise the server's packet-size limit [6] in /etc/my.cnf:

max_allowed_packet=32M

Set up Apache Solr with Apache Tika integration

$ wget -c http://apache.techartifact.com/mirror/lucene/solr/3.6.2/apache-solr-3.6.2.tgz
$ tar zxf apache-solr-3.6.2.tgz
$ cd apache-solr-3.6.2
$ cd examples/

Here you will see an example of data import from an HSQLDB database, but we want to work with MySQL, so we create a new configuration. (This will be easier to follow if you have read the README files in the Apache Solr package.)

$ cp -r example-DIH/ dih-mysql/
$ cd dih-mysql/
$ rm -rf hsqldb/

Remove everything except db/

$ cd solr/
$ rm -rf solr mail rss tika
$ ln -s /usr/share/java/mysql-connector-java.jar db/lib/

Now the directory structure should look something like this:

$ find  dih-mysql/ -type d
dih-mysql/
dih-mysql/solr
dih-mysql/solr/db
dih-mysql/solr/db/conf
dih-mysql/solr/db/conf/xslt
dih-mysql/solr/db/lib
dih-mysql/solr/db/data
dih-mysql/solr/db/data/index

Let's now update the Solr configuration. It is necessary to make sure that the Tika content-parser libraries are referenced in the configuration file, as shown below.

Configuration file: dih-mysql/solr/db/conf/solrconfig.xml

Source: https://gist.github.com/4706517#file-solrconfig-xml

We just added the libraries needed to parse the content (to avoid ClassNotFound errors):

  <lib dir="../../../../contrib/extraction/lib/" regex="tika-core-\d.*\.jar" />
  <lib dir="../../../../contrib/extraction/lib/" regex="tika-parsers-\d.*\.jar" />
  <lib dir="../../../../contrib/extraction/lib/" regex=".*\.jar" />

Configuration file: dih-mysql/solr/db/conf/schema.xml

Source: https://gist.github.com/4706509#file-schema-xml

Add the relevant fields that will be indexed along with the binary content (PDF/DOC/HTML, etc.).
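If the gist goes away: the essential additions are fields mirroring the table columns, plus a text field for the body extracted by Tika. A sketch (the exact type names are assumptions based on the stock example schema):

```xml
<!-- sketch: fields added to schema.xml (names mirror tbl_files) -->
<field name="id_files" type="string" indexed="true" stored="true" required="true"/>
<field name="filename" type="string" indexed="true" stored="true"/>
<field name="filetype" type="string" indexed="true" stored="true"/>
<field name="body"     type="text"   indexed="true" stored="true"/>
```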

Configuration file: dih-mysql/solr/db/conf/db-data-config.xml

Source: https://gist.github.com/4706528#file-db-data-config-xml

Here we configured the database, table, and columns from which to fetch the content to be indexed.
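For reference, a minimal sketch of what such a db-data-config.xml looks like, assuming the database and credentials created earlier. The outer entity selects the rows; the inner TikaEntityProcessor entity streams the BLOB column through Tika:

```xml
<!-- sketch: db-data-config.xml wiring MySQL rows through Tika -->
<dataConfig>
  <dataSource name="db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/binary_files"
              user="binary_user" password="binary_password"/>
  <dataSource name="fieldReader" type="FieldStreamDataSource"/>
  <document>
    <entity name="root" dataSource="db"
            query="SELECT id_files, bin_data, filename, filetype FROM tbl_files">
      <entity dataSource="fieldReader" processor="TikaEntityProcessor"
              dataField="root.bin_data" format="text">
        <field column="text" name="body"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```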

I recommend going through the official documentation [7].

Now we are all set with the configuration. It's time to index the documents.

Index the documents

Start the Solr server. Notice how we are specifying the configuration path:

$ cd apache-solr-3.6.2/example
$ java -jar -Dsolr.solr.home="./dih-mysql/solr/" start.jar

And invoke the indexer by hitting this url: http://localhost:8983/solr/db/dataimport?command=full-import
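The same can be done from a script. A small sketch using Ruby's standard library; the host and port match the default Jetty setup started above, and dih_url is just a hypothetical helper name:

```ruby
require 'net/http'
require 'uri'

# Build a DataImportHandler command URL (defaults match the Jetty example).
def dih_url(command, host = 'localhost', port = 8983)
  URI("http://#{host}:#{port}/solr/db/dataimport?command=#{command}")
end

if ARGV[0] == 'run'
  puts Net::HTTP.get(dih_url('full-import')) # kick off the import
  puts Net::HTTP.get(dih_url('status'))      # then check its progress
end
```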

The steps described above worked just fine for me. I hope this helps resolve the issues faced by others [8][9].

UPDATE:

I had to add an extra field and entity in schema.xml and db-data-config.xml to make the indexing work. Perhaps there is some other problem with my configuration. I don't understand why this works (and why the officially documented approach [7] doesn't), but this is the workaround I figured out.

Update in schema.xml

   <field name="text2" type="text" indexed="false" stored="false"/>

Update in db-data-config.xml

   <entity dataSource="fieldReader"
           processor="TikaEntityProcessor"
           dataField="root.bin_data" format="text">
     <field column="text2" name="body" />
   </entity>

References: