
Apache Solr OpenNLP Tutorial – Part 1


1. Introduction

Natural Language Processing (NLP) is a field focused on processing and analyzing human languages with computers. Applying NLP to search helps search service providers better understand what their customers really mean in their queries, so they can run searches more efficiently and return results that better meet customers' needs.

In this example, we are going to show you how the Apache Solr OpenNLP integration works and how the customer search experience can be improved by using OpenNLP.

2. Technologies Used

Apache Solr 8.5
Apache OpenNLP 1.9.2
OpenJDK 13

The steps and commands described in this example are for Apache Solr 8.5 on Windows 10. Pre-trained models for OpenNLP 1.5 are used in this example. To train your own models, please refer to Apache OpenNLP for details. The JDK version we use to run Solr in this example is OpenJDK 13.

Before we start, please make sure your computer meets the system requirements. Also, please download the binary release of Apache Solr 8.5.

3. Solr OpenNLP Integration

3.1 The Basics

NLP processes and analyzes natural languages. To understand how it works with Solr, we need to know where analysis takes place. There are two places in which text analysis happens in Solr: index time and query time. Analyzers, which consist of tokenizers and filters, are used in both places. At index time, the token stream generated by analysis is added to the index, defining the terms for a field. At query time, the terms generated by analyzing the values being searched for are matched against those stored in the index.

Solr OpenNLP integration provides several analysis components: an OpenNLP tokenizer, an OpenNLP part-of-speech tagging filter, an OpenNLP phrase chunking filter, and an OpenNLP lemmatization filter. In addition to these analysis components, Solr also provides an update request processor to extract named entities using an OpenNLP NER model. Let’s see how to set up the OpenNLP integration in the next section.

3.2 Setting Up The Integration

3.2.1 Putting jars on classpath

To use the OpenNLP components, we must add additional jars to Solr's classpath. There are a few options to make plugins available to Solr, as described in Solr Plugins. We use the standard approach of adding <lib/> directives to solrconfig.xml, as shown below:

  <lib dir="${solr.install.dir:../../../../../}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../../../}/contrib/analysis-extras/lib" regex=".*\.jar"/>
  <lib path="${solr.install.dir:../../../../../}/dist/solr-analysis-extras-8.5.2.jar"/>

3.2.2 Adding required resources to configset

We need to go to the Apache OpenNLP website to download the pre-trained models for OpenNLP 1.5. They are fully compatible with Apache OpenNLP 1.9.2.

Also, we need to download and unzip apache-opennlp-1.9.2-bin.zip. Then go to the URL for the lemmatizer training file and save it as en-lemmatizer.dict. Next, let's train the lemmatizer model by going to the bin directory of the apache-opennlp distribution we just unzipped and executing the following command:

opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data /path/to/en-lemmatizer.dict -encoding UTF-8

The output will be:

D:\java\apache-opennlp-1.9.2\bin\opennlp  LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data en-lemmatizer.dict -encoding UTF-8
Indexing events with TwoPass using cutoff of 5

        Computing event counts...  done. 301403 events
        Indexing...  done.
Sorting and merging events... done. Reduced 301403 events to 297776.
Done indexing in 12.63 s.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 297776
            Number of Outcomes: 431
          Number of Predicates: 69122
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-1828343.1766817758     0.6328968192088333
  2:  ... loglikelihood=-452189.7053988425      0.8768227257193857
  3:  ... loglikelihood=-211064.45129182754     0.9506474719893299
  4:  ... loglikelihood=-132189.41066218202     0.9667289310325379
  5:  ... loglikelihood=-95473.57210099498      0.9743997239576249
  6:  ... loglikelihood=-74894.1935626126       0.9794693483475613
  7:  ... loglikelihood=-61926.78603360762      0.9831056757895575
  8:  ... loglikelihood=-53069.688593599894     0.9856438058015348
  9:  ... loglikelihood=-46655.871988011146     0.9877439839683082
 10:  ... loglikelihood=-41801.50242291499      0.9893265826816587
 11:  ... loglikelihood=-37998.3432302135       0.9905608106090517
 12:  ... loglikelihood=-34935.28330041361      0.9915196597246876
 13:  ... loglikelihood=-32412.054562775495     0.9923325248919221
 14:  ... loglikelihood=-30294.265898838632     0.9930259486468284
 15:  ... loglikelihood=-28488.56869622921      0.9936132022574427
 16:  ... loglikelihood=-26928.219836178196     0.9941340995278747
 17:  ... loglikelihood=-25564.30190282366      0.9945521444710238
 18:  ... loglikelihood=-24360.17747454469      0.9948806083549268
 19:  ... loglikelihood=-23287.876071165214     0.9951924831537842
 20:  ... loglikelihood=-22325.67856216146      0.9954744975995594
 21:  ... loglikelihood=-21456.463866609512     0.9956437062670246
 22:  ... loglikelihood=-20666.55205863062      0.9958195505685079
 23:  ... loglikelihood=-19944.878511734943     0.9959953948699913
 24:  ... loglikelihood=-19282.394080308608     0.9961845104395112
 25:  ... loglikelihood=-18671.622759799964     0.9963570369239855
 26:  ... loglikelihood=-18106.330904658702     0.9965196099574324
 27:  ... loglikelihood=-17581.276656339858     0.9966357335527516
 28:  ... loglikelihood=-17092.017845561142     0.9967551749650799
 29:  ... loglikelihood=-16634.763075140218     0.9968712985603992
 30:  ... loglikelihood=-16206.255072812444     0.9969675152536637
 31:  ... loglikelihood=-15803.678430914795     0.9970902744830011
 32:  ... loglikelihood=-15424.585970349774     0.9971964446272931
 33:  ... loglikelihood=-15066.839470007333     0.9972860256865392
 34:  ... loglikelihood=-14728.561581223981     0.9973722889287764
 35:  ... loglikelihood=-14408.0965283682       0.9974618699880227
 36:  ... loglikelihood=-14103.977768763696     0.9975381797792324
 37:  ... loglikelihood=-13814.901208117759     0.997581311400351
 38:  ... loglikelihood=-13539.702883330643     0.9976509855575426
 39:  ... loglikelihood=-13277.340262355141     0.9976941171786611
 40:  ... loglikelihood=-13026.876491519615     0.997747202250807
 41:  ... loglikelihood=-12787.467059226115     0.997770426969871
 42:  ... loglikelihood=-12558.348451930819     0.9978069229569713
 43:  ... loglikelihood=-12338.828461585104     0.9978401011270625
 44:  ... loglikelihood=-12128.277868995287     0.9978799149311719
 45:  ... loglikelihood=-11926.123279039519     0.9979164109182722
 46:  ... loglikelihood=-11731.840924598388     0.9979263643692996
 47:  ... loglikelihood=-11544.951288710525     0.9979595425393908
 48:  ... loglikelihood=-11365.01442068802      0.9979993563435002
 49:  ... loglikelihood=-11191.625843150192     0.9980557592326553
 50:  ... loglikelihood=-11024.41296410639      0.9980955730367648
 51:  ... loglikelihood=-10863.031922256287     0.9981320690238651
 52:  ... loglikelihood=-10707.16480518142      0.998158611559938
 53:  ... loglikelihood=-10556.517189551667     0.9981917897300292
 54:  ... loglikelihood=-10410.81596029103      0.998211696632084
 55:  ... loglikelihood=-10269.807372149957     0.9982249679001204
 56:  ... loglikelihood=-10133.255322511463     0.998241556985166
 57:  ... loglikelihood=-10000.939808806212     0.998268099521239
 58:  ... loglikelihood=-9872.655547678738      0.9982913242403029
 59:  ... loglikelihood=-9748.21073625716       0.9983311380444123
 60:  ... loglikelihood=-9627.425938565784      0.9983609983974944
 61:  ... loglikelihood=-9510.13308241278       0.9983941765675856
 62:  ... loglikelihood=-9396.174554023093      0.9984140834696403
 63:  ... loglikelihood=-9285.40237935212       0.9984240369206677
 64:  ... loglikelihood=-9177.677482426574      0.9984306725546859
 65:  ... loglikelihood=-9072.869012278017      0.9984638507247772
 66:  ... loglikelihood=-8970.853731087096      0.9984738041758044
 67:  ... loglikelihood=-8871.515457047639      0.9984804398098227
 68:  ... loglikelihood=-8774.74455624773       0.9985036645288866
 69:  ... loglikelihood=-8680.437478540607      0.9985136179799139
 70:  ... loglikelihood=-8588.496332961782      0.9985268892479504
 71:  ... loglikelihood=-8498.82849876398       0.9985401605159869
 72:  ... loglikelihood=-8411.346268577978      0.9985467961500052
 73:  ... loglikelihood=-8325.966520610862      0.9985633852350507
 74:  ... loglikelihood=-8242.610417120377      0.9985799743200964
 75:  ... loglikelihood=-8161.203126709595      0.9985832921371055
 76:  ... loglikelihood=-8081.67356824808       0.9985932455881328
 77:  ... loglikelihood=-8003.954174455548      0.9986197881242058
 78:  ... loglikelihood=-7927.98067338463       0.9986264237582241
 79:  ... loglikelihood=-7853.691886230994      0.9986463306602787
 80:  ... loglikelihood=-7781.029540039709      0.9986463306602787
 81:  ... loglikelihood=-7709.938094037545      0.9986496484772879
 82:  ... loglikelihood=-7640.364578431137      0.9986695553793427
 83:  ... loglikelihood=-7572.258444629405      0.9986927800984065
 84:  ... loglikelihood=-7505.5714259522365     0.9986994157324247
 85:  ... loglikelihood=-7440.257407963147      0.998706051366443
 86:  ... loglikelihood=-7376.272307657644      0.9987093691834521
 87:  ... loglikelihood=-7313.57396080075       0.9987259582684976
 88:  ... loglikelihood=-7252.12201677264       0.9987458651705524
 89:  ... loglikelihood=-7191.877840340969      0.9987525008045707
 90:  ... loglikelihood=-7132.80441983102       0.9987657720726071
 91:  ... loglikelihood=-7074.866281202995      0.9987823611576527
 92:  ... loglikelihood=-7018.029407597901      0.9987989502426983
 93:  ... loglikelihood=-6962.261163947286      0.9988022680597074
 94:  ... loglikelihood=-6907.530226271331      0.9988055858767165
 95:  ... loglikelihood=-6853.806515329603      0.9988221749617622
 96:  ... loglikelihood=-6801.061134311805      0.9988221749617622
 97:  ... loglikelihood=-6749.266310279299      0.9988321284127896
 98:  ... loglikelihood=-6698.39533909719       0.998845399680826
 99:  ... loglikelihood=-6648.422533612705      0.9988487174978351
100:  ... loglikelihood=-6599.323174858488      0.9988586709488625
Writing lemmatizer model ... done (1.541s)

Wrote lemmatizer model to
path: D:\en-lemmatizer.bin

Execution time: 339.410 seconds
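For reference, each line of the lemmatizer training file (en-lemmatizer.dict) is a tab-separated triple of word, POS tag and lemma. A few illustrative lines (made up for this sketch, not taken from the actual file) look like this:

shows	VBZ	show
showed	VBD	show
showing	VBG	show
articles	NNS	article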

In this example, our test data is English only, so we just need to download the English pre-trained models and train the English lemmatizer model as described above. Now that all required resources are ready, we copy them into the jcg_example_configs configSet under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf\opennlp, because Solr typically resolves such resources from the configSet. Since we are going to run Solr in standalone mode, the configSet lives on the file system; in SolrCloud mode, the configSet and its resources would instead be stored in ZooKeeper and shared by all Solr instances in the cluster. The listing below shows the models in the opennlp directory:

D:\Java\solr-8.5.2\server\solr\configsets\jcg_example_configs\conf\opennlp>dir
 Volume in drive D is Data
 Volume Serial Number is 24EC-FE37

 Directory of D:\Java\solr-8.5.2\server\solr\configsets\jcg_example_configs\conf\opennlp

06/30/2020  11:28 PM    <DIR>          .
06/30/2020  11:28 PM    <DIR>          ..
06/28/2020  08:25 PM         2,560,304 en-chunker.bin
06/30/2020  11:24 PM         1,632,029 en-lemmatizer.bin
06/28/2020  08:24 PM         5,030,307 en-ner-date.bin
06/28/2020  08:25 PM         5,110,658 en-ner-location.bin
06/28/2020  08:25 PM         4,806,234 en-ner-money.bin
06/28/2020  08:25 PM         5,297,172 en-ner-organization.bin
06/28/2020  08:25 PM         4,728,645 en-ner-percentage.bin
06/28/2020  08:25 PM         5,207,953 en-ner-person.bin
06/28/2020  08:25 PM         4,724,357 en-ner-time.bin
06/28/2020  08:26 PM        36,345,477 en-parser-chunking.bin
06/28/2020  08:24 PM         5,696,197 en-pos-maxent.bin
06/28/2020  08:24 PM         3,975,786 en-pos-perceptron.bin
06/28/2020  08:24 PM            98,533 en-sent.bin
06/28/2020  08:24 PM           439,890 en-token.bin
06/30/2020  10:34 PM                35 stop.pos.txt
              15 File(s)     85,653,577 bytes
               2 Dir(s)  47,963,561,984 bytes free

3.2.3 Defining Schema

Before we define the schema, it helps to have a basic understanding of TextField, analyzers, tokenizers and filters in Solr. TextField is the basic type for configurable text analysis. It allows the specification of custom text analyzers consisting of a tokenizer and a list of token filters, and different analyzers may be specified for indexing and querying. For more information on customizing your analyzer chain, please see Understanding Analyzers, Tokenizers, and Filters.

Now let’s see how to configure OpenNLP analysis components.

The OpenNLP Tokenizer takes two language-specific binary model files as required parameters: a sentence detector model and a tokenizer model. For example:

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="en-sent.bin" tokenizerModel="en-token.bin"/>
</analyzer>

The OpenNLP Part-Of-Speech Filter takes one language-specific binary model file as the required parameter: a POS tagger model. Normally we don’t want to include punctuation in the index, so the TypeTokenFilter is included in the examples below, with stop.pos.txt containing the following:

stop.pos.txt

#
$
''
``
,
-LRB-
-RRB-
:
.

The OpenNLP Part-Of-Speech Filter example:

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="en-sent.bin" tokenizerModel="en-token.bin"/>
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
  <filter class="solr.TypeAsPayloadFilterFactory"/>
  <filter class="solr.TypeTokenFilterFactory" types="stop.pos.txt"/>
</analyzer>
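To get a feel for what this chain produces, take the sentence "Kevin wrote an article". Conceptually, the POS filter assigns a Penn Treebank tag to each token as its token type, which TypeAsPayloadFilterFactory then copies into the payload (an illustration; actual tags depend on the model):

Kevin   -> NNP
wrote   -> VBD
an      -> DT
article -> NN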

The OpenNLP Phrase Chunking Filter takes one language-specific binary model file as the required parameter: a phrase chunker model. For example:

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="en-sent.bin" tokenizerModel="en-token.bin"/>
  <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="en-chunker.bin"/>
</analyzer>
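The chunker overwrites each token's type with a phrase-chunk label in IOB notation, e.g. B-NP for the first token of a noun phrase and I-NP for subsequent tokens inside it. For the same sample sentence, the types would conceptually become (again model-dependent):

Kevin   -> B-NP
wrote   -> B-VP
an      -> B-NP
article -> I-NP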

The OpenNLP Lemmatizer Filter takes two optional parameters, a lemmatizer dictionary and a lemmatizer model; at least one of them must be configured. In this example, we perform model-based lemmatization only, preserving the original token and emitting the lemma as a synonym.

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="en-sent.bin" tokenizerModel="en-token.bin"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="en-lemmatizer.bin"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
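In this chain, KeywordRepeatFilterFactory emits every token twice, one copy flagged as a keyword so the lemmatizer leaves it untouched; the lemmatizer then rewrites the other copy to its lemma, and RemoveDuplicatesTokenFilterFactory drops the extra copy whenever the lemma is identical to the original. A hedged sketch of the effect on the input "wrote articles":

wrote    -> wrote, write      (original kept, lemma added as a synonym)
articles -> articles, article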

Putting everything above together, the analyzer configuration becomes:

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="en-sent.bin" tokenizerModel="en-token.bin"/>
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
  <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="en-chunker.bin"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="en-lemmatizer.bin"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <filter class="solr.TypeAsPayloadFilterFactory"/>
  <filter class="solr.TypeTokenFilterFactory" types="stop.pos.txt"/>
</analyzer>

Open the managed-schema file with any text editor in the jcg_example_configs configSet under the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs\conf. Add a field type text_en_opennlp using the OpenNLP-based analysis components described above, and then a field introduction of type text_en_opennlp, as below:

<!-- English TextField OpenNLP -->
<fieldType name="text_en_opennlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="opennlp/en-sent.bin" tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="opennlp/en-chunker.bin"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="opennlp/en-lemmatizer.bin"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TypeAsPayloadFilterFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="opennlp/stop.pos.txt"/>
  </analyzer>
</fieldType>
<field name="introduction" type="text_en_opennlp" indexed="true" stored="true"/>

If extracting named entities from text sounds interesting and useful for your use cases, you can set up an update request processor using the OpenNLP NER models. This step is optional and out of the scope of this article; feel free to check out the detailed usage of solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory in the Javadoc. An example configuration that extracts company names from the introduction field using the OpenNLP NER model en-ner-organization.bin is listed below.

Open solrconfig.xml, add the following snippet:

<updateRequestProcessorChain name="single-extract">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">opennlp/en-ner-organization.bin</str>
    <str name="analyzerFieldType">text_en_opennlp</str>
    <str name="source">introduction</str>
    <str name="dest">company</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Open managed-schema, add the following field:

<field name="company" type="text_general" indexed="true" stored="true"/>

For your convenience, a jcg_example_configs.zip file containing all configurations and schema is attached to the article. You can simply download and extract it to the directory ${solr.install.dir}\server\solr\configsets\jcg_example_configs.

3.2.4 Starting Solr Instance

For simplicity, instead of setting up a SolrCloud on your local machine as demonstrated in Apache Solr Clustering Example, we run a single Solr instance on our local machine with the command below:

bin\solr.cmd start

The output would be:

D:\Java\solr-8.5.2>bin\solr.cmd start
Waiting up to 30 to see Solr running on port 8983
Started Solr server on port 8983. Happy searching!
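To confirm the instance is up, we can ask for its status:

bin\solr.cmd status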

3.2.5 Creating A New Core

As we are running Solr in standalone mode, we need to create a new core named jcg_example_core with the jcg_example_configs configSet on the local machine. For example, we can do it via the CoreAdmin API:

curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs

The output would be:

D:\Java\solr-8.5.2>curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs
{
  "responseHeader":{
    "status":0,
    "QTime":641},
  "core":"jcg_example_core"}

If you would like to remove a core, you can do it via the CoreAdmin API as below:

curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=UNLOAD --data-urlencode core=jcg_example_core --data-urlencode deleteInstanceDir=true

The output would be:

D:\Java\solr-8.5.2>curl -G http://localhost:8983/solr/admin/cores --data-urlencode action=UNLOAD --data-urlencode core=jcg_example_core --data-urlencode deleteInstanceDir=true
{
  "responseHeader":{
    "status":0,
    "QTime":37}}

3.3 Examples

Time to see some examples of how Solr OpenNLP works.

3.3.1 Indexing Data

Download and extract the sample data file attached to this article, then index articles-opennlp.csv with the following command:

java -jar -Dc=jcg_example_core -Dauto post.jar articles-opennlp.csv

The output would be:

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/jcg_example_core/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file articles-opennlp.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/jcg_example_core/update...
Time spent: 0:00:00.670

The post.jar is included in the Solr distribution under example\exampledocs. It is also included in the sample data file attached to this article.
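If you prefer not to use post.jar, the same CSV file can be posted directly to the update handler with curl (an equivalent alternative):

curl "http://localhost:8983/solr/jcg_example_core/update?commit=true" -H "Content-Type: text/csv" --data-binary @articles-opennlp.csv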

3.3.2 Semantic Search Examples

As we know, when searching with Solr, specifying the field to be searched usually gives us more relevant results. But in real-world applications, customers normally have no idea which fields exist, or they are simply given a text input box to enter the keywords they are looking for. For example, is it possible to find the author of the article "Java Array Example" without knowing which field to search? With the OpenNLP integration we've set up, we can do this easily by sending the sentence "author of java array example" to Solr as below:

curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=author of java array example" --data-urlencode fl=title,author,introduction

The output would be:

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"author of java array example",
      "fl":"title,author,introduction"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "title":["Java Array Example"],
        "author":["Kevin Yang"],
        "introduction":" Kevin Yang wrote an article with title \"Java Array Example\" for Microsoft in Beijing China in June 2018"}]
  }}

How exciting! It feels like we are talking to the search engine in natural human language. Let's try another one by saying "articles written by James Cook in 2019" as below:

curl -G http://localhost:8983/solr/jcg_example_core/select --data-urlencode "q=articles written by James Cook in 2019" --data-urlencode fl=title,author,introduction,score

The output would be:

{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "q":"articles written by James Cook in 2019",
      "fl":"title,author,introduction,score"}},
  "response":{"numFound":13,"start":0,"maxScore":3.8089,"docs":[
      {
        "title":["The Apache Solr Cookbook"],
        "author":["James Cook"],
        "introduction":"This article was written by James Cook in Oxford UK in 2019",
        "score":3.8089},
      {
        "title":["The Solr Runbook"],
        "author":["James Cook"],
        "introduction":"This article was written by James Cook in London UK in 2020",
        "score":2.5949912},
      {
        "title":["Java ArrayList 101"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang in Sydney Australia in 2020",
        "score":0.1685594},
      {
        "title":["Java Remote Method Invocation Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang in Beijing China in 2010",
        "score":0.1685594},
      {
        "title":["Thread"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang in Sydney Australia in 2020",
        "score":0.1685594},
      {
        "title":["Java StringTokenizer Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang in Sydney Australia in 2020",
        "score":0.1685594},
      {
        "title":["Java HashMap Example"],
        "author":["Evan Swing"],
        "introduction":"This article was written by Evan Swing in Boston USA in 2018",
        "score":0.1685594},
      {
        "title":["Java HashSet Example"],
        "author":["Evan Swing"],
        "introduction":"This article was written by Kevin Yang in Sydney Australia in 2020",
        "score":0.1685594},
      {
        "title":["Apache SolrCloud Example"],
        "author":["Kevin Yang"],
        "introduction":"This article was written by Kevin Yang in Sydney Australia in 2020",
        "score":0.1685594},
      {
        "title":["The Solr REST API"],
        "author":["Steven Thomas"],
        "introduction":"This article was written by Steven Thomas in Seattle USA in 2020",
        "score":0.1685594}]
  }}

From the output above we can see that the article “The Apache Solr Cookbook” written by James Cook in 2019 is returned as the first result with the highest relevance score.

4. Download the Sample Data File

You can download the sample data file of this example here: Apache Solr OpenNLP Tutorial – Part 1
