
Apache Solr OpenNLP Tutorial – Part 2

1. Introduction

In Part 1 we set up the Apache Solr OpenNLP integration and used its analysis components, the tokenizer and filters, to process and analyze the sample data.

In this example, we are going to explore another powerful feature provided by the Solr OpenNLP integration: extracting named entities at index time by using an OpenNLP NER (Named Entity Recognition) model.

2. Technologies Used

The steps and commands described in this example are for Apache Solr 8.5 on Windows 10. Pre-trained models for OpenNLP 1.5 are used in this example. To train your own models, please refer to Apache OpenNLP for details. The JDK version we use to run Solr in this example is OpenJDK 13.
Before we start, please make sure your computer meets the system requirements. Also, please download the binary release of Apache Solr 8.5.

3. Solr OpenNLP NER Integration

3.1 Named Entity Recognition

In information extraction, a named entity is a real-world object, such as a person, a location, or an organization. Named Entity Recognition (NER) uses pre-trained models to locate and classify named entities in text into pre-defined categories. Each pre-trained model depends on the language and entity type it was trained for. The Solr OpenNLP integration provides an update request processor to extract named entities using an OpenNLP NER model at index time. Let’s see how to set up the OpenNLP NER integration in the next section.

3.2 Setting Up The Integration

Please follow the steps described in section 3.2 Set Up The Integration of Apache Solr OpenNLP Tutorial – Part 1 to put the jars on the classpath and add the required resources to the configSet. Once completed, first make sure the following directives are in solrconfig.xml of the jcg_example_configs configSet:

  <lib dir="${solr.install.dir:../../../../../}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../../../}/contrib/analysis-extras/lib" regex=".*\.jar"/>
  <lib path="${solr.install.dir:../../../../../}/dist/solr-analysis-extras-8.5.2.jar"/>

Second, download the pre-trained models for the English language and copy them to the jcg_example_configs configSet under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf\opennlp:

D:\Java\solr-8.5.2\server\solr\configsets\jcg_example_configs\conf\opennlp>dir
 Volume in drive D is Data
 Volume Serial Number is 24EC-FE37

 Directory of D:\Java\solr-8.5.2\server\solr\configsets\jcg_example_configs\conf\opennlp

06/30/2020  11:28 PM    <DIR>          .
06/30/2020  11:28 PM    <DIR>          ..
06/28/2020  08:25 PM         2,560,304 en-chunker.bin
06/30/2020  11:24 PM         1,632,029 en-lemmatizer.bin
06/28/2020  08:24 PM         5,030,307 en-ner-date.bin
06/28/2020  08:25 PM         5,110,658 en-ner-location.bin
06/28/2020  08:25 PM         4,806,234 en-ner-money.bin
06/28/2020  08:25 PM         5,297,172 en-ner-organization.bin
06/28/2020  08:25 PM         4,728,645 en-ner-percentage.bin
06/28/2020  08:25 PM         5,207,953 en-ner-person.bin
06/28/2020  08:25 PM         4,724,357 en-ner-time.bin
06/28/2020  08:26 PM        36,345,477 en-parser-chunking.bin
06/28/2020  08:24 PM         5,696,197 en-pos-maxent.bin
06/28/2020  08:24 PM         3,975,786 en-pos-perceptron.bin
06/28/2020  08:24 PM            98,533 en-sent.bin
06/28/2020  08:24 PM           439,890 en-token.bin
06/30/2020  10:34 PM                35 stop.pos.txt
              15 File(s)     85,653,577 bytes
               2 Dir(s)  47,963,561,984 bytes free

Third, add the text_en_opennlp field type to managed-schema of the jcg_example_configs configSet under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf as below:

<fieldType name="text_en_opennlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="opennlp/en-sent.bin" tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="opennlp/en-chunker.bin"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="opennlp/en-lemmatizer.bin"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TypeAsPayloadFilterFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="opennlp/stop.pos.txt"/>
  </analyzer>
</fieldType>

Finally, let’s set up the update request processors using the OpenNLP NER models. Detailed usage of solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory can be found in its Javadoc. In this example we are going to extract organization names from the introduction field of an article by using the OpenNLP NER model en-ner-organization.bin, so the configuration is as below:

Open managed-schema, add the following two fields:

<field name="introduction" type="string" indexed="true" stored="true"/>
<field name="organization" type="string" indexed="true" stored="true"/>
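Note that an introduction may mention more than one organization, and the update processor adds one value per extracted entity. If you expect multiple organizations per document, a multi-valued destination field can be declared instead; this is an optional variant of the field definition above, not required for the sample data in this example:

```xml
<!-- Optional variant: allow multiple extracted organization names per document -->
<field name="organization" type="string" indexed="true" stored="true" multiValued="true"/>
```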

Open solrconfig.xml, add the following update request processor chain with an OpenNLP NER update processor:

<!-- Update request processor chain with OpenNLP NER Update Request Processor -->
<updateRequestProcessorChain name="extract-organization" default="true"
         processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">opennlp/en-ner-organization.bin</str>
    <str name="analyzerFieldType">text_en_opennlp</str>
    <str name="source">introduction</str>
    <str name="dest">organization</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

If you have another update request processor chain configured as the default, such as the add-unknown-fields-to-the-schema chain, please comment it out.
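Alternatively, if you prefer not to make extract-organization the default chain, you can drop the default="true" attribute and select the chain per indexing request with Solr’s standard update.chain request parameter. A sketch of the non-default variant:

```xml
<!-- Non-default variant: select this chain explicitly per request
     with the request parameter update.chain=extract-organization -->
<updateRequestProcessorChain name="extract-organization"
         processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">opennlp/en-ner-organization.bin</str>
    <str name="analyzerFieldType">text_en_opennlp</str>
    <str name="source">introduction</str>
    <str name="dest">organization</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this variant, indexing requests must pass update.chain=extract-organization to the /update handler, otherwise the NER processor will not run.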

For your convenience, a jcg_example_configs.zip file containing all configurations and the schema is attached to this article. You can simply download and extract it to the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs.

3.3 Examples

3.3.1 Trying The Pre-trained Model With The OpenNLP Name Finder

Before we start Solr and use the pre-trained NER model to index data, there is an easy way to try out the pre-trained NER model with the Apache OpenNLP Name Finder, a command-line tool for demonstration and testing purposes. Download the English organization model en-ner-organization.bin and start the Name Finder tool with the following command:

opennlp TokenNameFinder en-ner-organization.bin

The output would be:

D:\Java\apache-opennlp-1.9.2\bin>opennlp TokenNameFinder en-ner-organization.bin
Loading Token Name Finder model ... done (0.717s)

The name finder is now waiting to read one tokenized sentence per line from stdin; an empty line indicates a document boundary. Just copy the text below into the terminal:

Kevin Yang wrote an article with title "Java Array Example" for Microsoft in Beijing China in June 2018
This article was written by Kevin Yang for IBM in Sydney Australia in 2020

The name finder will output the text with markup for organization names:

Kevin Yang wrote an article with title "Java Array Example" for <START:organization> Microsoft <END> in Beijing China in June 2018
This article was written by Kevin Yang for <START:organization> IBM <END> in Sydney Australia in 2020

The pre-trained model works well on its own. Time to see some examples of how the Solr OpenNLP NER integration works.

3.3.2 Indexing Data

Start a single Solr instance on the local machine with the command below:

bin\solr.cmd start

The output would be:

D:\Java\solr-8.5.2>bin\solr.cmd start
Waiting up to 30 to see Solr running on port 8983
Started Solr server on port 8983. Happy searching!

Then create a new Solr core with the command below:

curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs

The output would be:

D:\Java\solr-8.5.2>curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs
{
  "responseHeader":{
    "status":0,
    "QTime":641},
  "core":"jcg_example_core"}

Download and extract the sample data file attached to this article and index the articles-opennlp.csv with the following command:

java -jar -Dc=jcg_example_core -Dauto post.jar articles-opennlp.csv

The output would be:

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/jcg_example_core/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file articles-opennlp.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/jcg_example_core/update...
Time spent: 0:00:00.670

Note that post.jar is included in the Solr distribution under the example\exampledocs directory. It is also included in the sample data file attached to this article.

3.3.3 Verifying Named Entity Extraction

To verify that the named entity extraction works, we can simply run a search query that returns all articles with the title, author, introduction, and organization fields:

curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=*:*" --data-urlencode fl=title,author,introduction,organization

The output would be:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "fl":"title,author,introduction,organization"}},
  "response":{"numFound":13,"start":0,"docs":[
      {
        "title":["Java Array Example"],
        "author":["Kevin Yang"],
        "introduction":" Kevin Yang wrote an article with title \"Java Array Example\" for Microsoft in Beijing China in June 2018",
        "organization":"Microsoft"},
      {
        "title":["Java Arrays Showcases"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for IBM in Sydney Australia in 2020",
        "organization":"IBM"},
      {
        "title":["Java ArrayList 101"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Atlanssian in Sydney Australia in 2020"},
      {
        "title":["Java Remote Method Invocation Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Oracle in Beijing China in 2010",
        "organization":"Oracle"},
      {
        "title":["Thread"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for HP in Sydney Australia in 2020",
        "organization":"HP"},
      {
        "title":["Java StringTokenizer Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Apple in Sydney Australia in 2020",
        "organization":"Apple"},
      {
        "title":["Java HashMap Example"],
        "author":["Evan Swing"],
        "introduction":"This article was written by Evan Swing for Google in Boston USA in 2018"},
      {
        "title":["Java HashSet Example"],
        "author":["Evan Swing"],
        "introduction":"This article was written by Kevin Yang for Goldman Sachs in Sydney Australia in 2020",
        "organization":"Goldman Sachs"},
      {
        "title":["Apache SolrCloud Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Tripadvisor in Sydney Australia in 2020"},
      {
        "title":["The Solr Runbook"],
        "author":["James Cook"],
        "introduction":"This article was written by James Cook for Samsung in London UK in 2020",
        "organization":"Samsung"}]
  }}

The original articles-opennlp.csv we just indexed doesn’t have an organization field. As we can see from the search results above, organization names are extracted from the text of the introduction field and put into the organization field, so the Solr OpenNLP NER integration works as expected. You may also notice from the search results that some well-known organizations such as Google, Atlassian, and Tripadvisor are not recognized by the en-ner-organization.bin model. This is because the training data used to train this model doesn’t cover these organization names. As an exercise, you can try other pre-trained models such as en-ner-person.bin to extract person names. Furthermore, it can be great fun to follow the instructions in the Apache OpenNLP manual to train your own models with data from your business domain and use them with the Solr OpenNLP NER integration.
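As a sketch of the person-name exercise, one more processor could be added to the extract-organization chain in solrconfig.xml, right next to the existing organization processor. The person destination field name below is an assumption for illustration; a matching field would also need to be declared in managed-schema:

```xml
<!-- Hypothetical sketch: extract person names from the introduction field
     into a "person" field using the pre-trained en-ner-person.bin model -->
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
  <str name="modelFile">opennlp/en-ner-person.bin</str>
  <str name="analyzerFieldType">text_en_opennlp</str>
  <str name="source">introduction</str>
  <str name="dest">person</str>
</processor>
```

The en-ner-person.bin model would need to be downloaded and placed in the conf\opennlp directory of the configSet, just like the organization model.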

4. Download the Sample Data File

You can download the full source code of this example here: Apache Solr OpenNLP Tutorial – Part 2

Kevin Yang

A software design and development professional with seventeen years’ experience in the IT industry, especially with Java EE and .NET, I have worked for software companies, scientific research institutes and websites.