Indexing the documents stored in a database
Outline:
- Set up a MySQL database [1] containing documents (PDF/DOC/HTML etc.).
- Set up Apache Solr / Tika.
- Import the documents just by hitting an import URL.
NOTE: Also check the update note at the end of this post.
These steps were done on my machine running Fedora 17, but the commands can easily be adapted for other distributions.
Set up a MySQL database with documents
Install MySQL Server:
# yum install mysql-server
# service mysqld start
Also install the Java library for connecting to MySQL (Solr will need it later):
# yum install -y mysql-connector-java
Set up a MySQL database [1] for storing binary files:
CREATE DATABASE binary_files;
CREATE TABLE tbl_files (
id_files tinyint(3) unsigned NOT NULL auto_increment,
bin_data longblob NOT NULL,
description tinytext NOT NULL,
filename varchar(255) NOT NULL,
filesize integer NOT NULL,
filetype varchar(255) NOT NULL,
PRIMARY KEY (id_files)
);
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER
ON binary_files.*
TO binary_user@localhost
IDENTIFIED BY 'binary_password';
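You can quickly verify that the new account works before moving on:
$ mysql -u binary_user -pbinary_password binary_files -e 'SHOW TABLES;'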
Now let's create a Ruby script to populate the database with documents. We will need a Ruby-MySQL driver [2][3] and a MIME detection library [4]:
# yum install ruby-mysql
# yum install rubygem-mime-types
Here is the script [5]: https://gist.github.com/4706365#file-insert-mysql-rb
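For reference, here is a minimal sketch of what the script does, assuming the classic mysql gem API (Mysql.new plus prepared statements) and the mime-types gem; the gist [5] remains the authoritative version:

#!/usr/bin/env ruby
# insert-mysql.rb -- insert one document into the binary_files database.
# Usage: ruby insert-mysql.rb <file> <description>
require 'mysql'
require 'mime/types'

file, description = ARGV
abort 'usage: ruby insert-mysql.rb <file> <description>' unless file && description

data     = File.open(file, 'rb') { |f| f.read }
filename = File.basename(file)
filesize = File.size(file)
# Fall back to a generic MIME type if the extension is unknown
filetype = (MIME::Types.type_for(filename).first || 'application/octet-stream').to_s

db = Mysql.new('localhost', 'binary_user', 'binary_password', 'binary_files')
# A prepared statement keeps the binary blob properly escaped
stmt = db.prepare('INSERT INTO tbl_files (bin_data, description, filename, filesize, filetype) VALUES (?, ?, ?, ?, ?)')
stmt.execute(data, description, filename, filesize, filetype)
stmt.close
db.close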
Let's add some documents to the database:
$ mkdir sample-docs
$ cd sample-docs
$ wget http://www.manning.com/lowagie/sample-ch03_Lowagie.pdf
$ wget http://www.xmlw.ie/aboutxml/wordsample2.doc
$ wget http://www.columbia.edu/~fdc/sample.html
$ cd ..
$ for f in sample-docs/* ; do DESC=`basename "$f" | tr ' ' '-' `; ruby insert-mysql.rb "$f" "$DESC"; done
If you get the error “Lost connection to MySQL server during query” when running insert/update queries, you might need to raise the MySQL packet size limit [6]. Update the limit in /etc/my.cnf:
max_allowed_packet=32M
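Then restart the server and confirm that the new value is in effect:
# service mysqld restart
mysql> SHOW VARIABLES LIKE 'max_allowed_packet';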
Set up Apache Solr with Apache Tika integration
$ wget -c http://apache.techartifact.com/mirror/lucene/solr/3.6.2/apache-solr-3.6.2.tgz
$ tar zxf apache-solr-3.6.2.tgz
$ cd apache-solr-3.6.2
$ cd example/
Here you will see an example of data import from an HSQLDB database, but we want to work with MySQL, so we create a new configuration. (It will be easier to follow if you have read the README files in the Apache Solr package.)
$ cp -r example-DIH/ dih-mysql/
$ cd dih-mysql/
$ rm -rf hsqldb/
Remove everything except db/
$ cd solr/
$ rm -rf solr mail rss tika
$ ln -s /usr/share/java/mysql-connector-java.jar db/lib/
Now the directory structure should look something like this:
$ find dih-mysql/ -type d
dih-mysql/
dih-mysql/solr
dih-mysql/solr/db
dih-mysql/solr/db/conf
dih-mysql/solr/db/conf/xslt
dih-mysql/solr/db/lib
dih-mysql/solr/db/data
dih-mysql/solr/db/data/index
Let's now update the Solr configuration. Here it is necessary to make sure that the Tika content-parser libraries are loaded in the configuration file; how to do this is shown below.
Configuration file: dih-mysql/solr/db/conf/solrconfig.xml
Source: https://gist.github.com/4706517#file-solrconfig-xml
We just add the libraries needed to parse the content (to avoid ClassNotFound errors):
<lib dir="../../../../contrib/extraction/lib/" regex="tika-core-\d.*\.jar" />
<lib dir="../../../../contrib/extraction/lib/" regex="tika-parsers-\d.*\.jar" />
<lib dir="../../../../contrib/extraction/lib/" regex=".*\.jar" />
Configuration file: dih-mysql/solr/db/conf/schema.xml
Source: https://gist.github.com/4706509#file-schema-xml
Add the relevant fields which will be indexed along with the binary content (PDF/DOC/HTML etc.).
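The gist has the exact version; roughly, the additions mirror the table columns plus a body field to hold the text extracted by Tika (the field names here are assumptions based on the table created above):

<field name="id_files" type="string" indexed="true" stored="true" required="true" />
<field name="filename" type="string" indexed="true" stored="true" />
<field name="filetype" type="string" indexed="true" stored="true" />
<field name="description" type="text" indexed="true" stored="true" />
<field name="body" type="text" indexed="true" stored="true" />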
Configuration file: dih-mysql/solr/db/conf/db-data-config.xml
Source: https://gist.github.com/4706528#file-db-data-config-xml
Here we configure the database, table, and columns from which to fetch the content to be indexed.
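Again the gist has the exact file; its shape, following the official guide [7], is roughly this: a JDBC data source for the MySQL connection, a FieldStreamDataSource to stream the blob column, and a TikaEntityProcessor sub-entity that extracts text from bin_data. A sketch (names like "db" and "root" are illustrative):

<dataConfig>
  <dataSource name="db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/binary_files"
              user="binary_user" password="binary_password" />
  <dataSource name="fieldReader" type="FieldStreamDataSource" />
  <document>
    <entity name="root" dataSource="db"
            query="SELECT id_files, bin_data, filename, filesize, filetype, description FROM tbl_files">
      <entity dataSource="fieldReader" processor="TikaEntityProcessor"
              dataField="root.bin_data" format="text">
        <field column="text" name="body" />
      </entity>
    </entity>
  </document>
</dataConfig>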
I would also recommend going through the official documentation [7].
Now we are all set with the configuration. It's time to index the documents:
Index the documents
Start the Solr server. Notice how we are specifying the configuration path:
$ cd apache-solr-3.6.2/example
$ java -jar -Dsolr.solr.home="./dih-mysql/solr/" start.jar
And invoke the indexer by hitting this URL: http://localhost:8983/solr/db/dataimport?command=full-import
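You can watch the progress at http://localhost:8983/solr/db/dataimport?command=status and, once the import has finished, verify that documents were indexed with a query such as http://localhost:8983/solr/db/select?q=*:*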
The steps described above worked fine for me. I hope this helps resolve the issues others have faced [8][9].
UPDATE:
I had to add an extra field and entity in both schema.xml and db-data-config.xml to make the indexing work. Perhaps there is some other problem with my configuration. I don't understand why this works (and why the officially documented setup [7] doesn't), but this is the workaround I figured out.
Update in schema.xml
<field name="text2" type="text" indexed="false" stored="false"/>
Update in db-data-config.xml
<entity dataSource="fieldReader"
        processor="TikaEntityProcessor"
        dataField="root.bin_data" format="text">
  <field column="text2" name="body" />
</entity>
References:
- [1] http://onlamp.com/pub/a/php/2000/09/15/php_mysql.html
- [2] http://www.tmtm.org/en/ruby/mysql/
- [3] http://zetcode.com/db/mysqlrubytutorial/
- [4] http://stackoverflow.com/questions/2082293/get-mimetype-from-filename
- [5] https://gist.github.com/4706365#file-insert-mysql-rb
- [6] http://stackoverflow.com/questions/5688403/how-to-check-and-set-max-allowed-packet-mysql-variable
- [7] http://lucidworks.lucidimagination.com/display/lweug/Indexing+Binary+Data+Stored+in+a+Database
- [8] http://stackoverflow.com/questions/11339840/indexing-binary-files-from-database-issue-no-errors/14638909#14638909
- [9] http://stackoverflow.com/questions/14671461/tika-fetches-the-binary-content-stored-in-database-but-does-not-indexes-it