Apache Solr and Apache Tika Integration Tutorial

This article is a tutorial about Apache Solr and Apache Tika Integration.

1. Introduction

A Solr index can accept data from many different sources, such as CSV, XML, databases and common binary files. If the data to be indexed is in a binary format such as Word, PowerPoint, Excel or PDF, the Solr Content Extraction Library (the Solr Cell framework), which is built on Apache Tika, is used to ingest the binary or structured files. In this example we are going to show you how the Apache Solr and Apache Tika integration works.


2. Technologies Used

The steps and commands described in this example are for Apache Solr 8.5 on Windows 10. The JDK version we use to run Solr in this example is OpenJDK 13. Before we start, please make sure your computer meets the system requirements. Also, please download the binary release of Apache Solr 8.5.

3. Apache Solr And Apache Tika Integration

3.1 The Basics

Apache Tika is a content analysis toolkit which detects and extracts metadata and text from over a thousand different file types (such as WORD, PPT, XLS, and PDF). This makes Tika very useful for indexing binary data in Solr. The Solr Cell framework uses code from the Tika project internally to support uploading binary files for data extraction and indexing. Let’s see how to set up the integration in the next section.

3.2 Setting Up The Integration

We don’t need to download Apache Tika for the integration. Solr Cell, shipped as a contrib module, contains all the dependencies required to run Tika. It is not automatically included in the configSet, however, and needs to be configured.

3.2.1 Putting Jars On Classpath

To use Solr Cell, we must add additional jars to Solr’s classpath. There are a few options for making plugins available to Solr, as described in Solr Plugins. We use the standard approach, the <lib/> directive in solrconfig.xml, as shown below:

<lib dir="${solr.install.dir:../../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
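
To sanity-check which files the regex of a <lib/> directive would pick up, the pattern can be tried against candidate file names. The sketch below uses Python purely for illustration. It assumes the regex has to match the whole file name, and the jar names in the listing are hypothetical examples rather than the actual contents of a Solr distribution.

```python
import re

# Regex copied from the second <lib/> directive above.
SOLR_CELL_JAR = re.compile(r"solr-cell-\d.*\.jar")


def matching_jars(file_names):
    """Return the file names that the pattern matches in full."""
    return [name for name in file_names if SOLR_CELL_JAR.fullmatch(name)]


# Hypothetical listing of the dist directory, for illustration only.
candidates = ["solr-cell-8.5.0.jar", "solr-core-8.5.0.jar", "solr-cell-readme.txt"]
print(matching_jars(candidates))  # ['solr-cell-8.5.0.jar']
```

The \d after the hyphen is what keeps files such as solr-cell-readme.txt out, since the character following "solr-cell-" has to be a digit.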

3.2.2 ExtractingRequestHandler Parameters And Configuration

A SolrRequestHandler defines the logic executed for any request sent to Solr. When working with the Solr Cell framework, Solr’s ExtractingRequestHandler, which implements the SolrRequestHandler interface, uses Tika internally to support uploading binary files for data extraction and indexing. The parameters listed in the table below are accepted by the ExtractingRequestHandler. We can specify them as request parameters on each indexing request, or add them to the ExtractingRequestHandler configuration in solrconfig.xml so that they apply to all requests.

Parameter | Description | Example of Request Parameter
capture | Captures XHTML elements with the specified name. | capture=p
captureAttr | Indexes attributes of the Tika XHTML elements into separate fields. | captureAttr=true
commitWithin | Commits the document to the index within the specified number of milliseconds. | commitWithin=5000
defaultField | A default field to use if the uprefix parameter is not specified and a field is not defined in the schema. | defaultField=_text_
extractOnly | If true, returns the extracted content from Tika without indexing the document. Default is false. | extractOnly=true
extractFormat | The serialization format of the extracted content: xml (default) or text. | extractFormat=text
fmap.source_field | Maps a source field in the incoming document to another field. | fmap.content=_text_
ignoreTikaException | If true, exceptions raised during processing are skipped. | ignoreTikaException=true
literal.fieldname | Populates a field with the specified value for each document. | literal.id=word-doc-1
literalsOverride | If true (default), overrides field values with literal values; otherwise appends to the same field, which must be multivalued. | literalsOverride=false
lowernames | If true, maps all field names to lowercase with underscores. | lowernames=true
multipartUploadLimitInKB | Maximum upload document size allowed. Default is 2048 KB. | multipartUploadLimitInKB=1024000
parseContext.config | Specifies a Tika parser config file. | parseContext.config=doc-config.xml
passwordsFile | Specifies a filename-to-password mapping file used when indexing encrypted documents. | passwordsFile=/path/to/passwords.txt
resource.name | Specifies the name of the file to index. | resource.name=jcg_examples.doc
resource.password | Defines the password for an encrypted document. | resource.password=secret123
tika.config | Specifies a custom Tika config file. | tika.config=/path/to/tika.config
uprefix | Prefixes all fields that are not defined in the schema with the given prefix. | uprefix=ignored_
xpath | Defines an XPath expression to restrict the XHTML returned by Tika. | xpath=/xhtml:html/xhtml:body/xhtml:div//node()
Table 1. ExtractingRequestHandler Parameters
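
When parameters from the table are passed on the request rather than configured in solrconfig.xml, values such as the xpath expression need URL encoding. The following is a minimal sketch of how such a request URL could be assembled in Python. The host and core name are the ones used later in this tutorial, and the helper name extract_url is made up for illustration.

```python
from urllib.parse import urlencode

# Host and core name as used in the curl commands later in this tutorial.
BASE_URL = "http://localhost:8983/solr/jcg_example_core/update/extract"


def extract_url(request_params):
    """Build an /update/extract URL with properly encoded request parameters."""
    return BASE_URL + "?" + urlencode(request_params)


url = extract_url({
    "literal.id": "word-doc-1",
    "extractOnly": "true",
    "xpath": "/xhtml:html/xhtml:body/xhtml:div//node()",
})
print(url)
```

Characters such as "/", ":" and the parentheses of the xpath value are percent-encoded by urlencode, which avoids quoting problems on the command line.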

An example of the ExtractingRequestHandler configuration in solrconfig.xml is below:

<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">_text_</str>
    <!--<str name="uprefix">ignored_</str>-->
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_div</str>
  </lst>
</requestHandler>

In the example configuration above, we lowercase all field names, replacing other characters with underscores (lowernames), and map the content field of incoming documents to the _text_ field. As the sample Word document we are going to index contains several links, we set captureAttr to true to capture them and map the captured hrefs to the links field. The uprefix parameter is commented out for now; a later example sets uprefix to ignored_ to ignore all fields that are extracted by Tika but not defined in the schema.
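
The effect of lowernames and the fmap.* settings on field names can be modelled in a few lines. This is a simplified sketch, not Solr’s actual implementation: it assumes that lowercasing is applied before the fmap lookup and that every character other than a letter or digit becomes an underscore, which is consistent with the field names shown in the query output later in this article.

```python
def lower_name(field_name):
    """Model lowernames=true: lowercase letters and digits, everything else becomes '_'."""
    return "".join(ch.lower() if ch.isalnum() else "_" for ch in field_name)


def mapped_name(field_name, fmap, lowernames=True):
    """Apply lowernames first (assumed order), then the fmap.<source> renaming, if any."""
    name = lower_name(field_name) if lowernames else field_name
    return fmap.get(name, name)


# fmap entries taken from the configuration above (fmap.content, fmap.a, fmap.div).
FMAP = {"content": "_text_", "a": "links", "div": "ignored_div"}

for source in ["Last-Author", "dc:title", "content", "a"]:
    print(source, "->", mapped_name(source, FMAP))
# Last-Author -> last_author
# dc:title -> dc_title
# content -> _text_
# a -> links
```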

3.2.3 Defining Schema

Open the managed-schema file of the jcg_example_configs configSet, located under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf, with any text editor. Make sure the following fields have been defined:

<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true"/>
<field name="links" type="strings" indexed="true" stored="true"/>
<field name="last_modified" type="pdate" indexed="true" stored="true"/>
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
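
Checking the schema by eye is easy to get wrong, so the check can also be scripted. The sketch below parses schema XML with Python’s standard library and reports which of the five fields above are missing. The sample schema string is a trimmed, hypothetical stand-in for a real managed-schema file, used only to demonstrate the check.

```python
import xml.etree.ElementTree as ET

# The five field names listed above.
REQUIRED_FIELDS = {"id", "author", "links", "last_modified", "_text_"}


def missing_fields(schema_xml):
    """Return the required field names that have no <field> definition."""
    root = ET.fromstring(schema_xml)
    defined = {field.get("name") for field in root.iter("field")}
    return sorted(REQUIRED_FIELDS - defined)


# Trimmed, hypothetical schema used only to demonstrate the check.
sample = """<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="author" type="string" indexed="true" stored="true"/>
  <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
</schema>"""
print(missing_fields(sample))  # ['last_modified', 'links']
```

To run it against a real configSet, the content of the managed-schema file would be read from disk and passed to missing_fields instead of the sample string.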

For your convenience, a jcg_example_configs.zip file containing all configurations and the schema is attached to the article. You can simply download and extract it to the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf\.

3.2.4 Starting Solr Instance

For simplicity, instead of setting up a SolrCloud on your local machine as demonstrated in Apache Solr Clustering Example, we run a single Solr instance on our local machine with the command below:

bin\solr.cmd start

The output would be:

Waiting up to 30 to see Solr running on port 8983
Started Solr server on port 8983. Happy searching!

3.2.5 Creating A New Core

As we are running Solr in standalone mode, we need to create a new core named jcg_example_core with the jcg_example_configs configSet on the local machine. For example, we can do it via the CoreAdmin API:

curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs

The output would be:


If jcg_example_core already exists, you can remove it first via the CoreAdmin API as below:

curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=UNLOAD --data-urlencode core=jcg_example_core --data-urlencode deleteInstanceDir=true

The output would be:


3.3 Examples

Apache Tika can extract metadata and textual content from a wide range of document formats (see Supported Document Formats in the Tika documentation). Time to see some examples of how Solr Cell works.

3.3.1 Indexing Data

Download and extract the sample data file attached to this article and index the jcg_example_articles.docx with the following command:

curl "http://localhost:8983/solr/jcg_example_core/update/extract?literal.id=word-doc-1&commit=true" -F "myfile=@jcg_example_articles.docx"

The output would be:


Based on the configuration we have for the ExtractingRequestHandler, the URL above calls the ExtractingRequestHandler, uploads the file jcg_example_articles.docx, and assigns it the unique ID word-doc-1. Specifying a unique ID for the document being indexed is very important in our example. Without it, indexing the same document again with the command above would create a new document in the index with a new unique ID, because we have the uuid update processor defined in solrconfig.xml. In other use cases, we may choose to map a metadata field to the ID, generate a new UUID, or generate an ID from a signature (hash) of the content.

The commit=true parameter lets Solr commit changes after indexing the document so that we can find it immediately by query. For optimum performance when loading many documents, don’t call the commit command until you are done. The -F flag allows us to specify HTTP multipart POST data for curl to upload a binary file.
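
As an illustration of the last ID option mentioned above, an ID derived from a hash of the content, the following sketch computes a stable identifier from the file bytes on the client side and formats it as a literal.id request parameter. This is not a Solr feature being configured here: the doc- prefix and the truncation to 16 hex characters are arbitrary choices made for the example, and the payload bytes are a stand-in for the real file.

```python
import hashlib


def content_signature_id(data, prefix="doc-"):
    """Derive a stable ID from the document bytes (SHA-256, first 16 hex chars)."""
    return prefix + hashlib.sha256(data).hexdigest()[:16]


# Stand-in bytes; in practice this would be the content of jcg_example_articles.docx.
payload = b"example bytes standing in for the real document"
doc_id = content_signature_id(payload)
print("literal.id=" + doc_id)
```

Because the ID depends only on the bytes, uploading an unchanged file again overwrites the existing document instead of creating a duplicate, while any change to the file yields a new ID.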

Another useful parameter is extractOnly. Setting it to true extracts the data without indexing it, which is handy for testing, as in the example below:

curl "http://localhost:8983/solr/jcg_example_core/update/extract?extractOnly=true" -F "myfile=@jcg_example_articles.docx"

The output would be:

  "jcg_example_articles.docx":"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"date\"\ncontent=\"2020-07-18T09:49:00Z\"/>\n<meta name=\"Total-Time\"\ncontent=\"8\"/>\n<meta name=\"extended-properties:AppVersion\"\ncontent=\"12.0000\"/>\n<meta name=\"stream_content_type\"\n            content=\"application/octet-stream\"/>\n<meta\nname=\"meta:paragraph-count\" content=\"1\"/>\n<meta name=\"subject\"\n            content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"Word-Count\" content=\"103\"/>\n<meta name=\"meta:line-count\"\ncontent=\"4\"/>\n<meta name=\"Template\" content=\"Normal.dotm\"/>\n<meta\nname=\"Paragraph-Count\" content=\"1\"/>\n<meta name=\"stream_name\"\n            content=\"jcg_example_articles.docx\"/>\n<meta\nname=\"meta:character-count-with-spaces\" content=\"694\"/>\n<meta\nname=\"dc:title\" content=\"Articles Written By Kevin Yang\"/>\n<meta\nname=\"modified\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"meta:author\" content=\"Kevin Yang\"/>\n<meta\nname=\"meta:creation-date\" content=\"2020-07-18T09:41:00Z\"/>\n<meta\nname=\"extended-properties:Application\"\n            content=\"Microsoft Office Word\"/>\n<meta\nname=\"stream_source_info\" content=\"myfile\"/>\n<meta name=\"Creation-Date\"\n            content=\"2020-07-18T09:41:00Z\"/>\n<meta\nname=\"Character-Count-With-Spaces\" content=\"694\"/>\n<meta\nname=\"Last-Author\" content=\"Kevin Yang\"/>\n<meta name=\"Character Count\"\ncontent=\"592\"/>\n<meta name=\"Page-Count\" content=\"1\"/>\n<meta\nname=\"Application-Version\" content=\"12.0000\"/>\n<meta\nname=\"extended-properties:Template\" content=\"Normal.dotm\"/>\n<meta\nname=\"Author\" content=\"Kevin Yang\"/>\n<meta name=\"publisher\"\ncontent=\"Java Code Geeks\"/>\n<meta name=\"meta:page-count\"\ncontent=\"1\"/>\n<meta name=\"cp:revision\" content=\"3\"/>\n<meta\nname=\"Keywords\" content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"Category\" 
content=\"example\"/>\n<meta name=\"meta:word-count\"\ncontent=\"103\"/>\n<meta name=\"dc:creator\" content=\"Kevin Yang\"/>\n<meta\nname=\"extended-properties:Company\" content=\"Java Code Geeks\"/>\n<meta\nname=\"dcterms:created\" content=\"2020-07-18T09:41:00Z\"/>\n<meta\nname=\"dcterms:modified\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"Last-Modified\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"Last-Save-Date\" content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"meta:character-count\" content=\"592\"/>\n<meta name=\"Line-Count\"\ncontent=\"4\"/>\n<meta name=\"meta:save-date\"\n            content=\"2020-07-18T09:49:00Z\"/>\n<meta\nname=\"Application-Name\" content=\"Microsoft Office Word\"/>\n<meta\nname=\"extended-properties:TotalTime\" content=\"8\"/>\n<meta\nname=\"Content-Type\"\n            content=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document\"/>\n<meta\nname=\"stream_size\" content=\"11162\"/>\n<meta name=\"X-Parsed-By\"\n            content=\"org.apache.tika.parser.DefaultParser\"/>\n<meta\nname=\"X-Parsed-By\"\n            content=\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\"/>\n<meta\nname=\"creator\" content=\"Kevin Yang\"/>\n<meta name=\"dc:subject\"\n            content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"meta:last-author\" content=\"Kevin Yang\"/>\n<meta\nname=\"xmpTPg:NPages\" content=\"1\"/>\n<meta name=\"Revision-Number\"\ncontent=\"3\"/>\n<meta name=\"meta:keyword\"\n            content=\"articles; kevin yang; examples\"/>\n<meta\nname=\"cp:category\" content=\"example\"/>\n<meta name=\"dc:publisher\" content=\"Java Code Geeks\"/>\n<title>Articles Written By Kevin Yang</title>\n</head>\n<body>\n<h1 class=\"title\">Articles written by Kevin Yang</h1>\n<h1>Apache Solr</h1>\n<p/>\n<p>Examples of Apache Solr.</p>\n<p>\n            <a href=\"https://examples.javacodegeeks.com/apache-solr-function-query-example/\">Apache Solr Function Query Example</a>\n</p>\n<p>\n            <a 
href=\"https://examples.javacodegeeks.com/apache-solr-standard-query-parser-example/\">Apache Solr Standard Query Parser Example</a>\n</p>\n<p>\n            <a href=\"https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/\">Apache Solr Fuzzy Search Example</a>\n</p>\n<p>\n            <a href=\"https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial/\">Apache Solr OpenNLP Tutorial – Part 1</a>\n</p>\n<p>\n            <a href=\"https://examples.javacodegeeks.com/apache-solr-opennlp-tutorial-part-2/\">Apache Solr OpenNLP Tutorial – Part 2</a>\n</p>\n</body>\n</html>\n",
    "subject",["articles; kevin yang; examples"],
    "dc:title",["Articles Written By Kevin Yang"],
    "meta:author",["Kevin Yang"],
    "extended-properties:Application",["Microsoft Office Word"],
    "Last-Author",["Kevin Yang"],
    "Character Count",["592"],
    "Author",["Kevin Yang"],
    "publisher",["Java Code Geeks"],
    "Keywords",["articles; kevin yang; examples"],
    "dc:creator",["Kevin Yang"],
    "extended-properties:Company",["Java Code Geeks"],
    "title",["Articles Written By Kevin Yang"],
    "Application-Name",["Microsoft Office Word"],
    "creator",["Kevin Yang"],
    "dc:subject",["articles; kevin yang; examples"],
    "meta:last-author",["Kevin Yang"],
    "meta:keyword",["articles; kevin yang; examples"],
    "dc:publisher",["Java Code Geeks"]]}

3.3.2 Verifying The Results

Now we can execute a query and find that document with the request below:

curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=kevin"

The output would be:

          "articles; kevin yang; examples",
          "Articles Written By Kevin Yang",
          "Kevin Yang",
          "Microsoft Office Word",
          "Kevin Yang",
          "Character Count",
          "Kevin Yang",
          "Java Code Geeks",
          "articles; kevin yang; examples",
          "Kevin Yang",
          "Java Code Geeks",
          "Microsoft Office Word",
          "Kevin Yang",
          "articles; kevin yang; examples",
          "Kevin Yang",
          "articles; kevin yang; examples",
          "Java Code Geeks"],
        "subject":["articles; kevin yang; examples"],
        "dc_title":["Articles Written By Kevin Yang"],
        "meta_author":["Kevin Yang"],
        "extended_properties_application":["Microsoft Office Word"],
        "last_author":["Kevin Yang"],
        "author":["Kevin Yang"],
        "publisher":["Java Code Geeks"],
        "keywords":["articles; kevin yang; examples"],
        "dc_creator":["Kevin Yang"],
        "extended_properties_company":["Java Code Geeks"],
        "title":["Articles Written By Kevin Yang"],
        "application_name":["Microsoft Office Word"],
        "creator":["Kevin Yang"],
        "dc_subject":["articles; kevin yang; examples"],
        "meta_last_author":["Kevin Yang"],
        "meta_keyword":["articles; kevin yang; examples"],
        "dc_publisher":["Java Code Geeks"],

We can see that several pieces of metadata associated with the example document have been extracted. Each of them has its own field, created because we are running in schemaless mode, which is configured in solrconfig.xml by enabling the add-unknown-fields-to-the-schema update request processor chain.

3.3.3 A Simplified Example

Adding a new field for every piece of extracted metadata, as above, may not be what you want: you may only care about a few specific fields that you have defined in your schema. How can we deal with the other extracted fields we don’t care about? The uprefix parameter and the ignored field type can be used for this.

Firstly, we can uncomment the following line within the ExtractingRequestHandler in solrconfig.xml:

<str name="uprefix">ignored_</str>

Then, make sure the ignored field type and ignored dynamic field are defined in managed-schema:

<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored"/>

By doing this we tell Solr not to index any unknown fields extracted by Solr Cell. To see how these configurations work, we need to restart Solr and recreate jcg_example_core with the attached configSet jcg_example_configs.zip, or with a copy of the _default configSet that contains the configurations mentioned before. Otherwise the fields autogenerated in the previous example will remain. Once finished, we can run the command in section 3.3.1 Indexing Data to index the example document again.
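
How uprefix and the ignored_* dynamic field work together can be sketched as follows. This is a simplified model rather than Solr’s code: it assumes the schema knows only the five explicit fields defined earlier plus the ignored_* dynamic field, and the extracted field names are taken from the earlier query output.

```python
from fnmatch import fnmatchcase

# Explicit fields and the dynamic-field pattern assumed from this tutorial's schema.
SCHEMA_FIELDS = {"id", "author", "links", "last_modified", "_text_"}
DYNAMIC_PATTERNS = ["ignored_*"]


def in_schema(name):
    """True if the name is an explicit field or matches a dynamic-field pattern."""
    return name in SCHEMA_FIELDS or any(fnmatchcase(name, p) for p in DYNAMIC_PATTERNS)


def resolve_field(name, uprefix=""):
    """Return the field a value would land in, prefixing names the schema does not know."""
    if in_schema(name) or not uprefix:
        return name
    return uprefix + name


for name in ["author", "dc_creator", "meta_keyword"]:
    print(name, "->", resolve_field(name, uprefix="ignored_"))
# author -> author
# dc_creator -> ignored_dc_creator
# meta_keyword -> ignored_meta_keyword
```

The prefixed names match the ignored_* dynamic field, whose type is neither indexed nor stored, so their values are effectively dropped.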

Lastly, run the query below to see the indexed document:

curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=kevin"

The output would be:

        "author":"Kevin Yang",

We can see from the output above that all link addresses in jcg_example_articles.docx have been extracted successfully and added to the links field. In addition, both the author field and the last_modified field have been extracted and added to the index correctly. All unknown fields in the indexed document have been ignored and no corresponding fields have been created.
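
One way to cross-check the links field is to scan the XHTML returned by an extractOnly=true request (section 3.3.1) for anchor elements. The sketch below does this with Python’s standard HTML parser. The sample string is a shortened excerpt modelled on the extraction output shown earlier, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(xhtml):
    """Return all link targets found in the given XHTML string, in document order."""
    collector = LinkCollector()
    collector.feed(xhtml)
    return collector.links


# Shortened excerpt modelled on the extractOnly output shown in section 3.3.1.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><a href="https://examples.javacodegeeks.com/apache-solr-function-query-example/">'
    "Apache Solr Function Query Example</a></p>"
    '<p><a href="https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/">'
    "Apache Solr Fuzzy Search Example</a></p>"
    "</body></html>"
)
print(extract_links(sample))
```

The resulting list can then be compared with the values of the links field returned by the query.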

4. Download the Sample Data File

You can download the sample data file of this example here: Apache Solr and Apache Tika Integration Tutorial

Kevin Yang

A software design and development professional with seventeen years’ experience in the IT industry, especially with Java EE and .NET, I have worked for software companies, scientific research institutes and websites.