
Apache Solr Fuzzy Search Example

1. Introduction

In this example, we build queries using the fuzzy search feature of Apache Solr. Fuzzy search finds inexact matches in the Solr index, for instance terms that are misspelled or that differ slightly from the search term.

2. Technologies Used

The steps and commands described in this example are for Apache Solr 8.5 on Windows 10. We use OpenJDK 13 to run SolrCloud in this example.

Before we start, please make sure your computer meets the system requirements. Also, please download the binary release of Apache Solr 8.5. In addition, it will save you some time if you can follow the Apache Solr Clustering Example to get a SolrCloud up and running on your local machine.

3. Fuzzy Search Examples

Solr supports several kinds of inexact matching: wildcard searches, range searches, fuzzy searches, and proximity searches. This article focuses on fuzzy searches and proximity searches. See Apache Solr Standard Query Parser Example for examples of wildcard searches and range searches.

3.1 Preparation

Before we start, we suggest downloading the sample data file for this article, which contains jcg_example_configs.zip and articles.csv. Then follow the steps in Section 4.1 Upload a ConfigSet and Section 4.2 Indexing Data of Apache Solr Function Query Example to index and store the article data in the jcgArticles collection. In jcg_example_configs, the default search field of the /select request handler is set to the title field. Instead of using the Solr Admin UI in a browser, we use the command-line tool curl to run all the queries in this article.

3.2 Edit Distance

Edit distance is a way to measure the similarity of two strings. It is defined as the minimum number of primitive operations needed to convert one string into the other. Solr calculates edit distance with the Damerau–Levenshtein algorithm. The primitive operations, each shown with an example at distance 1, are:

  • insertion: geek –> greek
  • deletion: geek –> gee
  • substitution: geek –> geez
  • transposition: geek –> geke
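
As a minimal sketch, the four operations above can be checked with a small Python function. It computes the restricted Damerau–Levenshtein distance (also known as optimal string alignment) with a dynamic-programming table. The function name osa_distance is our own, and Solr's internal implementation may differ; the code only illustrates the definition.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


for target in ("greek", "gee", "geez", "geke"):
    print("geek ->", target, osa_distance("geek", target))
```

Each of the four pairs is one operation apart, so the function reports a distance of 1 for all of them.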

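Fuzzy searches apply this calculation to the characters of a single term. Proximity searches apply a similar idea at the word level: the distance is the number of term movements needed within a phrase. Let’s see some examples in the following sections.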

3.3 Fuzzy Searches

A fuzzy search is written by appending the tilde symbol ~ to a single-word term, optionally followed by a maximum edit distance between 0 and 2. The default is 2. The query matches terms that are similar to the specified term. For example:

curl "http://localhost:8983/solr/jcgArticles/select?q=array~"

The output would be:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":17,
    "params":{
      "q":"array~"}},
  "response":{"numFound":2,"start":0,"maxScore":0.6837484,"docs":[
      {
        "id":"0553573333",
        "category":["java"],
        "title":["Java Array Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":2560,
        "likes":256,
        "dislikes":6,
        "comments":3,
        "publish_date":"2020-05-06T00:00:00Z",
        "_version_":1669841842345082880},
      {
        "id":"0626166238",
        "category":["java"],
        "title":["Java Arrays Showcases"],
        "published":true,
        "author":["Kevin Yang"],
        "views":565,
        "likes":234,
        "dislikes":8,
        "comments":14,
        "publish_date":"2020-03-06T00:00:00Z",
        "_version_":1669841842373394432}]
  }}
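
For readers who prefer a script to curl, here is a minimal Python sketch that builds the same request URL with the query percent-encoded. The helper name build_select_url is hypothetical; the host, port and collection name are taken from the curl command above and assume a local SolrCloud with the jcgArticles collection.

```python
from urllib.parse import urlencode


def build_select_url(query, collection="jcgArticles",
                     base="http://localhost:8983/solr"):
    """Return a /select URL for the given query (hypothetical helper).

    urlencode percent-encodes characters such as quotes and spaces,
    which is what curl's --data-urlencode option does on the command line.
    """
    return f"{base}/{collection}/select?{urlencode({'q': query})}"


print(build_select_url("array~"))
print(build_select_url('"java example"'))
```

When Solr is running, the resulting URL can be passed to urllib.request.urlopen to fetch the JSON response.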

Note that a fuzzy search applies only to single terms, not to phrases. To search for several inexact terms, append the tilde to each term individually like this:

curl -G http://localhost:8983/solr/jcgArticles/select --data-urlencode "q=arra~1 AND examp~2"

The output would be:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":17,
    "params":{
      "q":"arra~1 AND examp~2"}},
  "response":{"numFound":1,"start":0,"maxScore":0.72990763,"docs":[
      {
        "id":"0553573333",
        "category":["java"],
        "title":["Java Array Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":2560,
        "likes":256,
        "dislikes":6,
        "comments":3,
        "publish_date":"2020-05-06T00:00:00Z",
        "_version_":1669927522290106368}]
  }}
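
The per-term rule can be sketched as a small Python helper that formats each term with its own tilde and joins the clauses with a Boolean operator. The function names fuzzy_clause and fuzzy_query are our own, and the sketch does not escape special query characters.

```python
def fuzzy_clause(term, max_edits=None):
    """Format one fuzzy term such as 'arra~1'.

    With max_edits=None the number is omitted, so Solr applies its default.
    """
    if not term or any(ch.isspace() for ch in term):
        raise ValueError("a fuzzy search applies to a single term, not a phrase")
    if max_edits is None:
        return f"{term}~"
    if max_edits not in (0, 1, 2):
        raise ValueError("the edit distance must be 0, 1 or 2")
    return f"{term}~{max_edits}"


def fuzzy_query(pairs, operator="AND"):
    """Join (term, max_edits) pairs into one query string."""
    return f" {operator} ".join(fuzzy_clause(t, d) for t, d in pairs)


print(fuzzy_query([("arra", 1), ("examp", 2)]))  # arra~1 AND examp~2
```

The printed string is the same query that the curl command above sends.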

3.4 Proximity Searches

A proximity search is written by appending the tilde symbol ~ and a numeric value to a quoted search phrase. It matches documents in which the terms occur within the given distance (the number of term movements needed) of one another. For example:

curl -G http://localhost:8983/solr/jcgArticles/select --data-urlencode "q=\"java example\"~3"

The query above searches article titles for java and example within 3 words of each other. The output would be:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":11,
    "params":{
      "q":"\"java example\"~3"}},
  "response":{"numFound":5,"start":0,"maxScore":0.4862815,"docs":[
      {
        "id":"055357342Y",
        "category":["java"],
        "title":["Java StringTokenizer Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":699,
        "likes":30,
        "dislikes":0,
        "comments":0,
        "publish_date":"2020-06-01T00:00:00Z",
        "_version_":1669843269986549760},
      {
        "id":"0928237471",
        "category":["java"],
        "title":["Java HashSet Example"],
        "published":true,
        "author":["Evan Swing"],
        "views":3828,
        "likes":123,
        "dislikes":8,
        "comments":2,
        "publish_date":"2018-02-16T00:00:00Z",
        "_version_":1669843269989695488},
      {
        "id":"0553573333",
        "category":["java"],
        "title":["Java Array Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":2560,
        "likes":256,
        "dislikes":6,
        "comments":3,
        "publish_date":"2020-05-06T00:00:00Z",
        "_version_":1669843269982355456},
      {
        "id":"0553292123",
        "category":["java"],
        "title":["Java HashMap Example"],
        "published":true,
        "author":["Evan Swing"],
        "views":5897,
        "likes":1033,
        "dislikes":1,
        "comments":86,
        "publish_date":"2018-03-23T00:00:00Z",
        "_version_":1669843269998084096},
      {
        "id":"0553579908",
        "category":["java"],
        "title":["Java Remote Method Invocation Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":389,
        "likes":26,
        "dislikes":3,
        "comments":0,
        "publish_date":"2010-05-23T00:00:00Z",
        "_version_":1669843269993889792}]
  }}
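
To see why these titles match, the following Python sketch estimates the smallest distance value that a two-term phrase needs in order to match a title. It is a simplified model based on our understanding of how term movements are counted, not Solr's actual code. The function name required_slop is hypothetical, and splitting on whitespace after lowercasing is an assumption that stands in for the real analyzer of the title field.

```python
def required_slop(title, first, second):
    """Smallest proximity value for the phrase "first second" to match title.

    Simplified two-term model: the terms match with no movement when
    `second` directly follows `first`; every extra position costs one move,
    and swapping two adjacent terms costs two moves.
    Returns None when either term is missing from the title.
    """
    tokens = title.lower().split()  # assumption: naive tokenization
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [i for i, t in enumerate(tokens) if t == second]
    if not first_pos or not second_pos:
        return None
    return min(abs((q - 1) - p) for p in first_pos for q in second_pos)


for title in ("Java Array Example",
              "Java Remote Method Invocation Example",
              "Java Arrays Showcases"):
    print(title, "->", required_slop(title, "java", "example"))
```

Under this model, Java Array Example needs a distance of 1 and Java Remote Method Invocation Example needs 3, so both fall within the ~3 used in the query, while a title without the word example cannot match at all.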

4. Download the Sample Data File

You can download the full source code of this example here: Apache Solr Fuzzy Search Example

Kevin Yang

A software design and development professional with seventeen years’ experience in the IT industry, especially with Java EE and .NET, I have worked for software companies, scientific research institutes and websites.