Apache Solr OpenNLP Tutorial – Part 2
1. Introduction
In Part 1 we set up the Apache Solr OpenNLP integration and used its analysis components (tokenizer and filters) to process and analyze the sample data.
In this example, we are going to explore another powerful feature provided by the Solr OpenNLP integration: extracting named entities at index time by using an OpenNLP NER (Named Entity Recognition) model.
2. Technologies Used
The steps and commands described in this example are for Apache Solr 8.5 on Windows 10, using the pre-trained models for OpenNLP 1.5. To train your own models, please refer to Apache OpenNLP for details. The JDK version we use to run Solr in this example is OpenJDK 13.
Before we start, please make sure your computer meets the system requirements. Also, please download the binary release of Apache Solr 8.5.
3. Solr OpenNLP NER Integration
3.1 Named Entity Recognition
In information extraction, a named entity is a real-world object such as a person, location, or organization. Named Entity Recognition (NER) uses pre-trained models to locate named entities in text and classify them into pre-defined categories. Each pre-trained model depends on the language and entity type it was trained for. The Solr OpenNLP integration provides an update request processor that extracts named entities with an OpenNLP NER model at index time. Let’s see how to set up the OpenNLP NER integration in the next section.
3.2 Setting Up The Integration
Please follow the steps described in section 3.2 Set Up The Integration of Apache Solr OpenNLP Tutorial – Part 1 to put the jars on the classpath and add the required resources to the configSet. Once completed, first make sure the following directives are in solrconfig.xml of the jcg_example_configs configSet:
<lib dir="${solr.install.dir:../../../../../}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../../../}/contrib/analysis-extras/lib" regex=".*\.jar"/>
<lib path="${solr.install.dir:../../../../../}/dist/solr-analysis-extras-8.5.2.jar"/>
Secondly, download the pre-trained models for the English language and copy them to the jcg_example_configs configSet under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf\opennlp.
D:\Java\solr-8.5.2\server\solr\configsets\jcg_example_configs\conf\opennlp>dir
 Volume in drive D is Data
 Volume Serial Number is 24EC-FE37

 Directory of D:\Java\solr-8.5.2\server\solr\configsets\jcg_example_configs\conf\opennlp

06/30/2020  11:28 PM    <DIR>          .
06/30/2020  11:28 PM    <DIR>          ..
06/28/2020  08:25 PM         2,560,304 en-chunker.bin
06/30/2020  11:24 PM         1,632,029 en-lemmatizer.bin
06/28/2020  08:24 PM         5,030,307 en-ner-date.bin
06/28/2020  08:25 PM         5,110,658 en-ner-location.bin
06/28/2020  08:25 PM         4,806,234 en-ner-money.bin
06/28/2020  08:25 PM         5,297,172 en-ner-organization.bin
06/28/2020  08:25 PM         4,728,645 en-ner-percentage.bin
06/28/2020  08:25 PM         5,207,953 en-ner-person.bin
06/28/2020  08:25 PM         4,724,357 en-ner-time.bin
06/28/2020  08:26 PM        36,345,477 en-parser-chunking.bin
06/28/2020  08:24 PM         5,696,197 en-pos-maxent.bin
06/28/2020  08:24 PM         3,975,786 en-pos-perceptron.bin
06/28/2020  08:24 PM            98,533 en-sent.bin
06/28/2020  08:24 PM           439,890 en-token.bin
06/30/2020  10:34 PM                35 stop.pos.txt
              15 File(s)     85,653,577 bytes
               2 Dir(s)  47,963,561,984 bytes free
Thirdly, the text_en_opennlp field type is added to managed-schema in the jcg_example_configs configSet under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf as below:
<fieldType name="text_en_opennlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="opennlp/en-sent.bin"
               tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="opennlp/en-chunker.bin"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="opennlp/en-lemmatizer.bin"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TypeAsPayloadFilterFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="opennlp/stop.pos.txt"/>
  </analyzer>
</fieldType>
Finally, let’s set up an update request processor chain that uses an OpenNLP NER model. Detailed usage of solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory can be found in its Javadoc. In this example we are going to extract organization names from the introduction field of an article by using the OpenNLP NER model en-ner-organization.bin, so the configuration is as below:
Open managed-schema and add the following two fields:
<field name="introduction" type="string" indexed="true" stored="true"/>
<field name="organization" type="string" indexed="true" stored="true"/>
Open solrconfig.xml and add the following update request processor chain with an OpenNLP NER update processor:
<!-- Update request processor chain with OpenNLP NER Update Request Processor -->
<updateRequestProcessorChain name="extract-organization" default="true"
    processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">opennlp/en-ner-organization.bin</str>
    <str name="analyzerFieldType">text_en_opennlp</str>
    <str name="source">introduction</str>
    <str name="dest">organization</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
If you have another update request processor chain configured as the default, such as the add-unknown-fields-to-the-schema chain, please comment it out.
For your convenience, a jcg_example_configs.zip file containing all configurations and the schema is attached to this article. You can simply download and extract it to the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs.
3.3 Examples
3.3.1 Trying The Pre-trained Model With OpenNLP Name Finder
Before we start Solr and use the pre-trained NER model to index data, there is an easy way to try out the model with the Apache OpenNLP name finder, a command line tool for demonstration and testing purposes. Download the English organization model en-ner-organization.bin and start the name finder tool with the following command:
opennlp TokenNameFinder en-ner-organization.bin
The output would be:
D:\Java\apache-opennlp-1.9.2\bin>opennlp TokenNameFinder en-ner-organization.bin
Loading Token Name Finder model ... done (0.717s)
The name finder is now waiting to read a tokenized sentence per line from stdin; an empty line indicates a document boundary. Just copy the text below to the terminal:
Kevin Yang wrote an article with title "Java Array Example" for Microsoft in Beijing China in June 2018
This article was written by Kevin Yang for IBM in Sydney Australia in 2020
The name finder will output the text with markup for organization names:
Kevin Yang wrote an article with title "Java Array Example" for <START:organization> Microsoft <END> in Beijing China in June 2018
This article was written by Kevin Yang for <START:organization> IBM <END> in Sydney Australia in 2020
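The `<START:organization> ... <END>` markup the name finder emits is easy to post-process. As a small illustration, here is a sketch of a helper that pulls entity values out of that markup with a regular expression; the class and method names are our own for this example, not part of OpenNLP:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NerMarkupParser {

    // Extracts the entity values of a given type from OpenNLP name finder
    // markup, e.g. "<START:organization> Microsoft <END>" yields "Microsoft".
    static List<String> extract(String markedUp, String type) {
        Pattern p = Pattern.compile("<START:" + Pattern.quote(type) + ">\\s*(.*?)\\s*<END>");
        Matcher m = p.matcher(markedUp);
        List<String> entities = new ArrayList<>();
        while (m.find()) {
            entities.add(m.group(1));
        }
        return entities;
    }

    public static void main(String[] args) {
        String line = "Kevin Yang wrote an article for <START:organization> Microsoft <END> in Beijing China";
        System.out.println(extract(line, "organization")); // [Microsoft]
    }
}
```

This kind of post-processing is only needed when driving the command line tool yourself; the Solr integration below handles the extraction for us.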
The pre-trained model works well without Solr. Time to see some examples of how the Solr OpenNLP NER integration works.
3.3.2 Indexing Data
Start a single Solr instance on the local machine with the command below:
bin\solr.cmd start
The output would be:
D:\Java\solr-8.5.2>bin\solr.cmd start
Waiting up to 30 seconds to see Solr running on port 8983
Started Solr server on port 8983. Happy searching!
Then create a new Solr core with the command below:
curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs
The output would be:
D:\Java\solr-8.5.2>curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs
{
  "responseHeader":{
    "status":0,
    "QTime":641},
  "core":"jcg_example_core"}
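If you prefer to script core creation from Java rather than curl, the same CoreAdmin CREATE call can be assembled as a URL. This is a minimal sketch; the class and method names are ours, and the parameters simply mirror the curl command above:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CoreAdminUrl {

    // Builds the CoreAdmin CREATE URL equivalent to the curl command above.
    // action, name and configSet are standard CoreAdmin API parameters.
    static String createCoreUrl(String solrBase, String coreName, String configSet) {
        return solrBase + "/solr/admin/cores"
                + "?action=CREATE"
                + "&name=" + URLEncoder.encode(coreName, StandardCharsets.UTF_8)
                + "&configSet=" + URLEncoder.encode(configSet, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The URL below can then be fetched with java.net.http.HttpClient.
        System.out.println(createCoreUrl("http://localhost:8983",
                "jcg_example_core", "jcg_example_configs"));
    }
}
```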
Download and extract the sample data file attached to this article and index articles-opennlp.csv with the following command:
java -jar -Dc=jcg_example_core -Dauto post.jar articles-opennlp.csv
The output would be:
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/jcg_example_core/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file articles-opennlp.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/jcg_example_core/update...
Time spent: 0:00:00.670
Note that post.jar is included in the Solr distribution under the example\exampledocs directory. It is also included in the sample data file attached to this article.
3.3.3 Verifying Named Entity Extraction
To verify that the named entity extraction works, we can simply run a search query that returns all articles with the organization field:
curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=*:*" --data-urlencode fl=title,author,introduction,organization
The output would be:
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "fl":"title,author,introduction,organization"}},
  "response":{"numFound":13,"start":0,"docs":[
      {
        "title":["Java Array Example"],
        "author":["Kevin Yang"],
        "introduction":" Kevin Yang wrote an article with title \"Java Array Example\" for Microsoft in Beijing China in June 2018",
        "organization":"Microsoft"},
      {
        "title":["Java Arrays Showcases"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for IBM in Sydney Australia in 2020",
        "organization":"IBM"},
      {
        "title":["Java ArrayList 101"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Atlanssian in Sydney Australia in 2020"},
      {
        "title":["Java Remote Method Invocation Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Oracle in Beijing China in 2010",
        "organization":"Oracle"},
      {
        "title":["Thread"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for HP in Sydney Australia in 2020",
        "organization":"HP"},
      {
        "title":["Java StringTokenizer Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Apple in Sydney Australia in 2020",
        "organization":"Apple"},
      {
        "title":["Java HashMap Example"],
        "author":["Evan Swing"],
        "introduction":"This article was written by Evan Swing for Google in Boston USA in 2018"},
      {
        "title":["Java HashSet Example"],
        "author":["Evan Swing"],
        "introduction":"This article was written by Kevin Yang for Goldman Sachs in Sydney Australia in 2020",
        "organization":"Goldman Sachs"},
      {
        "title":["Apache SolrCloud Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang for Tripadvisor in Sydney Australia in 2020"},
      {
        "title":["The Solr Runbook"],
        "author":["James Cook"],
        "introduction":"This article was written by James Cook for Samsung in London UK in 2020",
        "organization":"Samsung"}]
  }}
The original articles-opennlp.csv we just indexed doesn’t have an organization field. As we can see from the search results above, organization names are extracted from the text of the introduction field and put into the organization field: the Solr OpenNLP NER integration works as expected. You may also notice that some well-known organizations such as Google, Atlassian, and Tripadvisor are not recognized by the en-ner-organization.bin model. This is because the training data used to build this model doesn’t cover those organization names. As an exercise, you can try other pre-trained models, such as en-ner-person.bin, to extract person names. Furthermore, it can be great fun to follow the instructions in the Apache OpenNLP manual to train your own models with data from your business domain and use them with the Solr OpenNLP NER integration.
4. Download the Sample Data File
You can download the full source code of this example here: Apache Solr OpenNLP Tutorial – Part 2