Apache Solr and Apache Tika Integration Tutorial
This article is a tutorial about Apache Solr and Apache Tika Integration.
1. Introduction
A Solr index can accept data from many different sources, such as CSV, XML, databases and common binary files. If the data to be indexed is in binary format, such as WORD, PPT, XLS, and PDF, the Solr Content Extraction Library (the Solr Cell framework) built upon Apache Tika is used for ingesting binary files or structured files. In this example we are going to show you how Apache Solr and Apache Tika integration works.
2. Technologies Used
The steps and commands described in this example are for Apache Solr 8.5 on Windows 10. The JDK version we use to run Solr in this example is OpenJDK 13. Before we start, please make sure your computer meets the system requirements. Also, please download the binary release of Apache Solr 8.5.
3. Apache Solr And Apache Tika Integration
3.1 The Basics
Apache Tika is a content analysis toolkit which detects and extracts metadata and text from over a thousand different file types (such as WORD, PPT, XLS, and PDF). This makes Tika very useful for indexing binary data in Solr. The Solr Cell framework uses code from the Tika project internally to support uploading binary files for data extraction and indexing. Let’s see how to set up the integration in the next section.
3.2 Setting Up The Integration
We don’t need to download Apache Tika for the integration. Solr Cell, as a contrib module, contains all the dependencies required to run Tika. It is not automatically included in the configSet, however, and needs to be configured.
3.2.1 Putting Jars On Classpath
To use Solr Cell, we must add additional jars to Solr’s classpath. There are a few options to make other plugins available to Solr, as described in Solr Plugins. We use the standard approach, the <lib/> directive in solrconfig.xml, as shown below:
<lib dir="${solr.install.dir:../../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
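The regex attribute decides which files in each directory end up on the classpath. As a rough illustration, the Python sketch below applies the two patterns from the directives above to a few file names. The file names are hypothetical (they assume a Solr 8.5.0 layout), and whole-name matching is an assumption of this sketch.

```python
import re

# Patterns copied from the two <lib/> directives above.
EXTRACTION_LIB_PATTERN = r".*\.jar"
SOLR_CELL_PATTERN = r"solr-cell-\d.*\.jar"


def matching_files(pattern, file_names):
    """Return the file names that match the pattern as a whole."""
    compiled = re.compile(pattern)
    return [name for name in file_names if compiled.fullmatch(name)]


# Hypothetical directory listing, for illustration only.
dist_files = ["solr-cell-8.5.0.jar", "solr-core-8.5.0.jar", "solr-cell-notes.txt"]
print(matching_files(SOLR_CELL_PATTERN, dist_files))
```

Only the Solr Cell jar passes the second pattern, because the name must start with solr-cell- followed by a digit and end in .jar.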
3.2.2 ExtractingRequestHandler Parameters And Configuration
A SolrRequestHandler defines the logic executed for any request sent to Solr. When working with the Solr Cell framework, Solr’s ExtractingRequestHandler, which implements the SolrRequestHandler interface, uses Tika internally to support uploading binary files for data extraction and indexing. The parameters listed in the table below are accepted by the ExtractingRequestHandler. We can specify them as request parameters for each indexing request, or add them to the ExtractingRequestHandler configuration in solrconfig.xml so that they apply to all requests.
Parameter | Description | Example of Request Parameter |
---|---|---|
capture | Captures XHTML elements with the specified name. | capture=p |
captureAttr | Indexes attributes of the Tika XHTML elements into separate fields. | captureAttr=true |
commitWithin | Add the document within the specified number of milliseconds. | commitWithin=5000 |
defaultField | A default field to use if the uprefix parameter is not specified and a field is not defined in the schema. | defaultField=_text_ |
extractOnly | If true, returns the extracted content from Tika without indexing the document. Default is false. | extractOnly=true |
extractFormat | The serialization format of the extracted content: xml (default) or text. | extractFormat=text |
fmap.source_field | Maps source field in incoming document to another field. | fmap.content=_text_ |
ignoreTikaException | If true, skips exceptions raised by Tika while processing a document. | ignoreTikaException=true |
literal.fieldname | Populates a field with the specified value for each document. | literal.id=word-doc-1 |
literalsOverride | If true (default), overrides field values with literal values; otherwise appends to the same field which must be multivalued. | literalsOverride=false |
lowernames | If true, maps all field names to lowercase, with underscores where needed. | lowernames=true |
multipartUploadLimitInKB | Maximum size of an uploaded document, in kilobytes. Default is 2048 (2 MB). | multipartUploadLimitInKB=1024000 |
parseContext.config | Specifies a Tika parser config file. | parseContext.config=doc-config.xml |
passwordsFile | Specifies a filename-password mapping file when indexing encrypted documents. | passwordsFile=/path/to/passwords.txt |
resource.name | Specifies the name of the file to index. | resource.name=jcg_examples.doc |
resource.password | Defines the password for an encrypted document. | resource.password=secret123 |
tika.config | Specifies a custom Tika config file. | tika.config=/path/to/tika.config |
uprefix | Prefixes all fields that are not defined in the schema with the given prefix. | uprefix=ignored_ |
xpath | Defines an XPath expression to restrict the XHTML returned by Tika. | xpath=/xhtml:html/xhtml:body/xhtml:div//node() |
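To show how the parameters in the table can be passed as request parameters, here is a minimal Python sketch that assembles an /update/extract URL with the standard library. The helper name build_extract_url is hypothetical, and the host, port and core name are the ones used later in this article.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, params):
    """Build the /update/extract URL for a core with URL-encoded parameters."""
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983",
    "jcg_example_core",
    {"literal.id": "word-doc-1", "commitWithin": 5000, "lowernames": "true"},
)
print(url)
```

Letting urlencode handle the escaping matters for values such as the xpath parameter, which contains characters that are not safe in a query string.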
An example of the ExtractingRequestHandler configuration in solrconfig.xml is below:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">_text_</str>
    <!--<str name="uprefix">ignored_</str>-->
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_div</str>
  </lst>
</requestHandler>
In the example configuration above, we map all field names to lowercase with underscores and map the content field of incoming documents to the _text_ field. As the sample Word document we are going to index contains several links, we set captureAttr to true to capture them and map the captured hrefs to the links field. In addition, the uprefix parameter is commented out for now; we will see an example later that sets uprefix to ignored_ in order to ignore all fields that are extracted by Tika but not defined in the schema.
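Values placed under defaults are used for every request unless the request supplies its own value for the same parameter. The Python sketch below models that with plain dictionaries; it is a simplification (real Solr parameters can be multi-valued), and the function name effective_params is hypothetical.

```python
# Defaults copied from the <lst name="defaults"> section above.
HANDLER_DEFAULTS = {
    "lowernames": "true",
    "fmap.content": "_text_",
    "captureAttr": "true",
    "fmap.a": "links",
    "fmap.div": "ignored_div",
}


def effective_params(defaults, request_params):
    """Combine handler defaults with request parameters; the request wins."""
    merged = dict(defaults)
    merged.update(request_params)
    return merged


params = effective_params(
    HANDLER_DEFAULTS,
    {"literal.id": "word-doc-1", "captureAttr": "false"},
)
print(params["captureAttr"])  # the request value replaces the default
print(params["fmap.content"])  # untouched defaults still apply
```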
3.2.3 Defining Schema
Open the managed-schema file of the jcg_example_configs configSet with any text editor. It is located under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf. Make sure the following fields have been defined:
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true"/>
<field name="links" type="strings" indexed="true" stored="true"/>
<field name="last_modified" type="pdate" indexed="true" stored="true"/>
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
For your convenience, a jcg_example_configs.zip file containing all configurations and the schema is attached to the article. You can simply download and extract it to the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf\.
3.2.4 Starting Solr Instance
For simplicity, instead of setting up a SolrCloud on your local machine as demonstrated in Apache Solr Clustering Example, we run a single Solr instance on our local machine with the command below:
bin\solr.cmd start
The output would be:
Waiting up to 30 to see Solr running on port 8983
Started Solr server on port 8983. Happy searching!
3.2.5 Creating A New Core
As we are running Solr in standalone mode, we need to create a new core named jcg_example_core with the jcg_example_configs configSet on the local machine. For example, we can do it via the CoreAdmin API:
curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs
The output would be:
{ "responseHeader":{ "status":0, "QTime":641}, "core":"jcg_example_core"}
If jcg_example_core already exists, you can remove it via the CoreAdmin API as below:
curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=UNLOAD --data-urlencode core=jcg_example_core --data-urlencode deleteInstanceDir=true
The output would be:
{ "responseHeader":{ "status":0, "QTime":37 } }
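The same CoreAdmin requests can be prepared from a script. The Python sketch below only builds the two URLs used above and checks a response for status 0, as seen in the outputs above. The helper names core_admin_url and succeeded are hypothetical, and nothing is sent over the network here.

```python
import json
from urllib.parse import urlencode

CORE_ADMIN_URL = "http://localhost:8983/solr/admin/cores"


def core_admin_url(**params):
    """Build a CoreAdmin API URL with URL-encoded query parameters."""
    return f"{CORE_ADMIN_URL}?{urlencode(params)}"


def succeeded(response_text):
    """Return True when the response header reports status 0."""
    return json.loads(response_text)["responseHeader"]["status"] == 0


create_url = core_admin_url(
    action="CREATE", name="jcg_example_core", configSet="jcg_example_configs"
)
unload_url = core_admin_url(
    action="UNLOAD", core="jcg_example_core", deleteInstanceDir="true"
)
print(create_url)
print(unload_url)
```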
3.3 Examples
Apache Tika supports several document formats and is able to extract metadata and/or textual content from the Supported Document Formats. Time to see some examples of how the Solr Cell works.
3.3.1 Indexing Data
Download and extract the sample data file attached to this article and index jcg_example_articles.docx with the following command:
curl "http://localhost:8983/solr/jcg_example_core/update/extract?literal.id=word-doc-1&commit=true" -F "myfile=@jcg_example_articles.docx"
The output would be:
{ "responseHeader":{ "status":0, "QTime":1789 } }
Based on the configuration we have for the ExtractingRequestHandler, the URL above calls the ExtractingRequestHandler, uploads the file jcg_example_articles.docx, and assigns it the unique ID word-doc-1. Note that specifying a unique ID for the document being indexed is very important in our example. Without it, if we index the same document again by running the command above, a new document will be created in the index with a new unique ID, because we have the uuid update processor defined in solrconfig.xml. In other use cases, we may choose to map a metadata field to the ID, generate a new UUID, or generate an ID from a signature (hash) of the content. The commit=true parameter lets Solr commit changes after indexing the document so that we can find it immediately by query. For optimum performance when loading many documents, don’t call the commit command until you are done. The -F flag allows us to specify HTTP multipart POST data for curl to upload a binary file.
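To make the -F flag less opaque, here is a minimal Python sketch of the kind of multipart/form-data body such an upload carries. The helper name build_multipart_body, the placeholder file bytes and the fixed boundary are illustrative assumptions; curl picks its own boundary and the exact headers it sends may differ.

```python
import uuid


def build_multipart_body(field_name, file_name, file_bytes, boundary=None):
    """Return (content_type, body) for a single-file multipart/form-data upload."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    # The closing delimiter ends with two extra dashes.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return f"multipart/form-data; boundary={boundary}", head + file_bytes + tail


content_type, body = build_multipart_body(
    "myfile",
    "jcg_example_articles.docx",
    b"placeholder bytes",  # in practice, the bytes read from the .docx file
    boundary="example-boundary",
)
print(content_type)
```

The body could then be posted to the /update/extract URL with the Content-Type header set to the first returned value, for instance through urllib.request.Request.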
Another useful parameter is extractOnly. We can set it to true to extract data without indexing it, which is handy for testing. The example below sets the extractOnly=true parameter:
curl "http://localhost:8983/solr/jcg_example_core/update/extract?extractOnly=true" -F "myfile=@jcg_example_articles.docx"
The output would be:
{ "responseHeader":{ "status":0, "QTime":59}, "jcg_example_articles.docx":"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"date\"\ncontent=\"2020-07-18T09:49:00Z\"/>\n<meta name=\"Total-Time\"\ncontent=\"8\"/>\n<meta name=\"extended-properties:AppVersion\"\ncontent=\"12.0000\"/>\n<meta name=\"stream_content_type\"\n content=\"application/octet-stream\"/>\n<meta\nname=\"meta:paragraph-count\" content=\"1\"/>\n<meta name=\"subject\"\n content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"Word-Count\" content=\"103\"/>\n<meta name=\"meta:line-count\"\ncontent=\"4\"/>\n<meta name=\"Template\" content=\"Normal.dotm\"/>\n<meta\nname=\"Paragraph-Count\" content=\"1\"/>\n<meta name=\"stream_name\"\n content=\"jcg_example_articles.docx\"/>\n<meta\nname=\"meta:character-count-with-spaces\" content=\"694\"/>\n<meta\nname=\"dc:title\" content=\"Articles Written By Kevin Yang\"/>\n<meta\nname=\"modified\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"meta:author\" content=\"Kevin Yang\"/>\n<meta\nname=\"meta:creation-date\" content=\"2020-07-18T09:41:00Z\"/>\n<meta\nname=\"extended-properties:Application\"\n content=\"Microsoft Office Word\"/>\n<meta\nname=\"stream_source_info\" content=\"myfile\"/>\n<meta name=\"Creation-Date\"\n content=\"2020-07-18T09:41:00Z\"/>\n<meta\nname=\"Character-Count-With-Spaces\" content=\"694\"/>\n<meta\nname=\"Last-Author\" content=\"Kevin Yang\"/>\n<meta name=\"Character Count\"\ncontent=\"592\"/>\n<meta name=\"Page-Count\" content=\"1\"/>\n<meta\nname=\"Application-Version\" content=\"12.0000\"/>\n<meta\nname=\"extended-properties:Template\" content=\"Normal.dotm\"/>\n<meta\nname=\"Author\" content=\"Kevin Yang\"/>\n<meta name=\"publisher\"\ncontent=\"Java Code Geeks\"/>\n<meta name=\"meta:page-count\"\ncontent=\"1\"/>\n<meta name=\"cp:revision\" content=\"3\"/>\n<meta\nname=\"Keywords\" content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"Category\" 
content=\"example\"/>\n<meta name=\"meta:word-count\"\ncontent=\"103\"/>\n<meta name=\"dc:creator\" content=\"Kevin Yang\"/>\n<meta\nname=\"extended-properties:Company\" content=\"Java Code Geeks\"/>\n<meta\nname=\"dcterms:created\" content=\"2020-07-18T09:41:00Z\"/>\n<meta\nname=\"dcterms:modified\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"Last-Modified\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"Last-Save-Date\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"meta:character-count\" content=\"592\"/>\n<meta name=\"Line-Count\"\ncontent=\"4\"/>\n<meta name=\"meta:save-date\"\n content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"Application-Name\" content=\"Microsoft Office Word\"/>\n<meta\nname=\"extended-properties:TotalTime\" content=\"8\"/>\n<meta\nname=\"Content-Type\"\n content=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document\"/>\n<meta\nname=\"stream_size\" content=\"11162\"/>\n<meta name=\"X-Parsed-By\"\n content=\"org.apache.tika.parser.DefaultParser\"/>\n<meta\nname=\"X-Parsed-By\"\n content=\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\"/>\n<meta\nname=\"creator\" content=\"Kevin Yang\"/>\n<meta name=\"dc:subject\"\n content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"meta:last-author\" content=\"Kevin Yang\"/>\n<meta\nname=\"xmpTPg:NPages\" content=\"1\"/>\n<meta name=\"Revision-Number\"\ncontent=\"3\"/>\n<meta name=\"meta:keyword\"\n content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"cp:category\" content=\"example\"/>\n<meta name=\"dc:publisher\" content=\"Java Code Geeks\"/>\n<title>Articles Written By Kevin Yang</title>\n</head>\n<body>\n<h1 class=\"title\">Articles written by Kevin Yang</h1>\n<h1>Apache Solr</h1>\n<p/>\n<p>Examples of Apache Solr.</p>\n<p>\n <a href=\"https://examples.javacodegeeks.com/apache-solr-function-query-example/\">Apache Solr Function Query Example</a>\n</p>\n<p>\n <a 
href=\"https://examples.javacodegeeks.com/apache-solr-standard-query-parser-example/\">Apache Solr Standard Query Parser Example</a>\n</p>\n<p>\n <a href=\"https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/\">Apache Solr Fuzzy Search Example</a>\n</p>\n<p>\n <a href=\"https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial/\">Apache Solr OpenNLP Tutorial – Part 1</a>\n</p>\n<p>\n <a href=\"https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial-part-2/\">Apache Solr OpenNLP Tutorial – Part 2</a>\n</p>\n</body>\n</html>\n", "jcg_example_articles.docx_metadata":[ "date",["2020-07-18T09:49:00Z"], "Total-Time",["8"], "extended-properties:AppVersion",["12.0000"], "stream_content_type",["application/octet-stream"], "meta:paragraph-count",["1"], "subject",["articles; kevin yang; examples"], "Word-Count",["103"], "meta:line-count",["4"], "Template",["Normal.dotm"], "Paragraph-Count",["1"], "stream_name",["jcg_example_articles.docx"], "meta:character-count-with-spaces",["694"], "dc:title",["Articles Written By Kevin Yang"], "modified",["2020-07-18T09:49:00Z"], "meta:author",["Kevin Yang"], "meta:creation-date",["2020-07-18T09:41:00Z"], "extended-properties:Application",["Microsoft Office Word"], "stream_source_info",["myfile"], "Creation-Date",["2020-07-18T09:41:00Z"], "Character-Count-With-Spaces",["694"], "Last-Author",["Kevin Yang"], "Character Count",["592"], "Page-Count",["1"], "Application-Version",["12.0000"], "extended-properties:Template",["Normal.dotm"], "Author",["Kevin Yang"], "publisher",["Java Code Geeks"], "meta:page-count",["1"], "cp:revision",["3"], "Keywords",["articles; kevin yang; examples"], "Category",["example"], "meta:word-count",["103"], "dc:creator",["Kevin Yang"], "extended-properties:Company",["Java Code Geeks"], "dcterms:created",["2020-07-18T09:41:00Z"], "dcterms:modified",["2020-07-18T09:49:00Z"], "Last-Modified",["2020-07-18T09:49:00Z"], "title",["Articles Written By Kevin Yang"], 
"Last-Save-Date",["2020-07-18T09:49:00Z"], "meta:character-count",["592"], "Line-Count",["4"], "meta:save-date",["2020-07-18T09:49:00Z"], "Application-Name",["Microsoft Office Word"], "extended-properties:TotalTime",["8"], "Content-Type",["application/vnd.openxmlformats-officedocument.wordprocessingml.document"], "stream_size",["11162"], "X-Parsed-By",["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"], "creator",["Kevin Yang"], "dc:subject",["articles; kevin yang; examples"], "meta:last-author",["Kevin Yang"], "xmpTPg:NPages",["1"], "Revision-Number",["3"], "meta:keyword",["articles; kevin yang; examples"], "cp:category",["example"], "dc:publisher",["Java Code Geeks"]]}
3.3.2 Verifying The Results
Now we can execute a query and find that document with the request below.
curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=kevin"
The output would be:
{ "responseHeader":{ "status":0, "QTime":0, "params":{ "q":"kevin"}}, "response":{"numFound":1,"start":0,"docs":[ { "meta":["date", "2020-07-18T09:49:00Z", "Total-Time", "8", "extended-properties:AppVersion", "12.0000", "stream_content_type", "application/octet-stream", "meta:paragraph-count", "1", "subject", "articles; kevin yang; examples", "Word-Count", "103", "meta:line-count", "4", "Template", "Normal.dotm", "Paragraph-Count", "1", "stream_name", "jcg_example_articles.docx", "meta:character-count-with-spaces", "694", "dc:title", "Articles Written By Kevin Yang", "modified", "2020-07-18T09:49:00Z", "meta:author", "Kevin Yang", "meta:creation-date", "2020-07-18T09:41:00Z", "extended-properties:Application", "Microsoft Office Word", "stream_source_info", "myfile", "Creation-Date", "2020-07-18T09:41:00Z", "Character-Count-With-Spaces", "694", "Last-Author", "Kevin Yang", "Character Count", "592", "Page-Count", "1", "Application-Version", "12.0000", "extended-properties:Template", "Normal.dotm", "Author", "Kevin Yang", "publisher", "Java Code Geeks", "meta:page-count", "1", "cp:revision", "3", "Keywords", "articles; kevin yang; examples", "Category", "example", "meta:word-count", "103", "dc:creator", "Kevin Yang", "extended-properties:Company", "Java Code Geeks", "dcterms:created", "2020-07-18T09:41:00Z", "dcterms:modified", "2020-07-18T09:49:00Z", "Last-Modified", "2020-07-18T09:49:00Z", "Last-Save-Date", "2020-07-18T09:49:00Z", "meta:character-count", "592", "Line-Count", "4", "meta:save-date", "2020-07-18T09:49:00Z", "Application-Name", "Microsoft Office Word", "extended-properties:TotalTime", "8", "Content-Type", "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "stream_size", "11162", "X-Parsed-By", "org.apache.tika.parser.DefaultParser", "X-Parsed-By", "org.apache.tika.parser.microsoft.ooxml.OOXMLParser", "creator", "Kevin Yang", "dc:subject", "articles; kevin yang; examples", "meta:last-author", "Kevin Yang", "xmpTPg:NPages", "1", 
"Revision-Number", "3", "meta:keyword", "articles; kevin yang; examples", "cp:category", "example", "dc:publisher", "Java Code Geeks"], "h1":["title"], "links":["https://examples.javacodegeeks.com/apache-solr-function-query-example/", "https://examples.javacodegeeks.com/apache-solr-standard-query-parser-example/", "https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/", "https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial/", "https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial-part-2/"], "id":"word-doc-1", "date":"2020-07-18T09:49:00Z", "total_time":8, "extended_properties_appversion":12.0, "stream_content_type":["application/octet-stream"], "meta_paragraph_count":1, "subject":["articles; kevin yang; examples"], "word_count":103, "meta_line_count":4, "template":["Normal.dotm"], "paragraph_count":1, "stream_name":["jcg_example_articles.docx"], "meta_character_count_with_spaces":694, "dc_title":["Articles Written By Kevin Yang"], "modified":"2020-07-18T09:49:00Z", "meta_author":["Kevin Yang"], "meta_creation_date":"2020-07-18T09:41:00Z", "extended_properties_application":["Microsoft Office Word"], "stream_source_info":["myfile"], "creation_date":"2020-07-18T09:41:00Z", "character_count_with_spaces":694, "last_author":["Kevin Yang"], "character_count":592, "page_count":1, "application_version":12.0, "extended_properties_template":["Normal.dotm"], "author":["Kevin Yang"], "publisher":["Java Code Geeks"], "meta_page_count":1, "cp_revision":3, "keywords":["articles; kevin yang; examples"], "category":["example"], "meta_word_count":103, "dc_creator":["Kevin Yang"], "extended_properties_company":["Java Code Geeks"], "dcterms_created":"2020-07-18T09:41:00Z", "dcterms_modified":"2020-07-18T09:49:00Z", "last_modified":"2020-07-18T09:49:00Z", "title":["Articles Written By Kevin Yang"], "last_save_date":"2020-07-18T09:49:00Z", "meta_character_count":592, "line_count":4, "meta_save_date":"2020-07-18T09:49:00Z", 
"application_name":["Microsoft Office Word"], "extended_properties_totaltime":8, "content_type":["application/vnd.openxmlformats-officedocument.wordprocessingml.document"], "stream_size":11162, "x_parsed_by":["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"], "creator":["Kevin Yang"], "dc_subject":["articles; kevin yang; examples"], "meta_last_author":["Kevin Yang"], "xmptpg_npages":1, "revision_number":3, "meta_keyword":["articles; kevin yang; examples"], "cp_category":["example"], "dc_publisher":["Java Code Geeks"], "_version_":1672550496610549760}] }}
We can see that several metadata fields associated with the example document have been extracted. Each of them has its own field created because we are running in schemaless mode, which is configured in solrconfig.xml by having the add-unknown-fields-to-the-schema update request processor chain enabled.
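The field names in the response above differ from the metadata names that Tika reports because lowernames is set to true. The Python sketch below approximates that renaming; the rule (lowercase letters and digits, everything else replaced by an underscore) is inferred from the names in the output and is an assumption, not Solr's actual code. The function name lowerize is also hypothetical.

```python
def lowerize(field_name):
    """Lowercase letters and digits; replace every other character with '_'."""
    return "".join(ch.lower() if ch.isalnum() else "_" for ch in field_name)


# Metadata names taken from the output of the sample document.
for name in ["extended-properties:AppVersion", "Character Count", "xmpTPg:NPages"]:
    print(name, "->", lowerize(name))
```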
3.3.3 A Simplified Example
The behaviour of adding new fields for all extracted metadata, as shown above, may not be desired in your use case: you may only care about a few specific fields that you have defined in your schema. How can we deal with the other extracted fields we don’t care about? The uprefix parameter and the ignored field type can be used for this.
Firstly, we can uncomment the following line within the ExtractingRequestHandler configuration in solrconfig.xml:
<str name="uprefix">ignored_</str>
Then, make sure the ignored field type and the ignored_* dynamic field are defined in managed-schema:
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored"/>
By doing this, we tell Solr not to index any of the unknown fields extracted by Solr Cell. To see how these configurations work, we need to restart Solr and recreate jcg_example_core with the attached configSet jcg_example_configs.zip, or with a copy of the _default configSet that has the configurations mentioned before. Otherwise, the autogenerated fields from the previous example will remain. Once finished, we can run the command in section 3.3.1 Indexing Data to index the example document.
Lastly, run the query below to see the indexed document:
curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=kevin"
The output would be:
{ "responseHeader":{ "status":0, "QTime":1, "params":{ "q":"kevin"}}, "response":{"numFound":1,"start":0,"docs":[ { "links":["https://examples.javacodegeeks.com/apache-solr-function-query-example/", "https://examples.javacodegeeks.com/apache-solr-standard-query-parser-example/", "https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/", "https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial/", "https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial-part-2/"], "id":"word-doc-1", "author":"Kevin Yang", "last_modified":"2020-07-18T09:49:00Z", "_version_":1672565163665915904}] }}
We can see from the output above that all link addresses in jcg_example_articles.docx have been extracted successfully and added to the links field. In addition, both the author field and the last_modified field have been extracted and added to the index correctly. All unknown fields in the document being indexed have been ignored, and no corresponding fields have been created.
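The effect of uprefix can be summarised with a small model. The Python sketch below is a simplified, hypothetical rendering of the rule described in the parameter table: a field defined in the schema is kept, otherwise uprefix is applied, and defaultField is used only when uprefix is not set. It assumes lowernames and fmap have already been applied, and the function name resolve_field is not part of Solr.

```python
from fnmatch import fnmatchcase

# Fields and dynamic field pattern from the schema used in this example.
SCHEMA_FIELDS = {"id", "author", "links", "last_modified", "_text_"}
DYNAMIC_PATTERNS = ["ignored_*"]


def resolve_field(name, uprefix="", default_field=""):
    """Return the field a value would be routed to, or None if nothing matches."""

    def defined(candidate):
        return candidate in SCHEMA_FIELDS or any(
            fnmatchcase(candidate, pattern) for pattern in DYNAMIC_PATTERNS
        )

    if defined(name):
        return name
    if uprefix:
        candidate = uprefix + name
    elif default_field:
        candidate = default_field
    else:
        return None
    return candidate if defined(candidate) else None


print(resolve_field("author", uprefix="ignored_"))
print(resolve_field("meta_author", uprefix="ignored_"))
```

In this model author stays as it is, while meta_author becomes ignored_meta_author, which matches the ignored_* dynamic field and is therefore neither indexed nor stored.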
4. Download the Sample Data File
You can download the sample data file of this example here: Apache Solr and Apache Tika Integration Tutorial