Apache Solr Fuzzy Search Example
1. Introduction
In this example, we build queries using the fuzzy search feature provided by Apache Solr. Fuzzy search is a powerful tool for finding inexact matches in the Solr index.
2. Technologies Used
The steps and commands described in this example are for Apache Solr 8.5 on Windows 10. The JDK version we use to run SolrCloud in this example is OpenJDK 13.
Before we start, please make sure your computer meets the system requirements. Also, please download the binary release of Apache Solr 8.5. In addition, it will save you some time if you can follow the Apache Solr Clustering Example to get a SolrCloud up and running on your local machine.
3. Fuzzy Search Examples
Solr supports several types of queries for inexact matching: wildcard searches, range searches, fuzzy searches, and proximity searches. We focus on fuzzy searches and proximity searches in this article. You can check out Apache Solr Standard Query Parser Example for examples of wildcard searches and range searches.
3.1 Preparation
Before we start, we suggest downloading the sample data file of this article, which contains jcg_example_configs.zip and articles.csv. Then follow the steps described in Section 4.1 Upload a ConfigSet and Section 4.2 Indexing Data in Apache Solr Function Query Example to get the article data indexed and stored in the jcgArticles collection. In jcg_example_configs, the default search field of the /select request handler is set to the field title. Additionally, instead of using "Solr Admin" in a browser, we use the command-line tool curl to run all queries in the examples of this article.
3.2 Edit Distance
Edit distance is a way to measure the similarity of two strings. It is defined as the minimum number of primitive operations to convert one string to the other. In Solr, edit distance or similarity is calculated by using an algorithm called Damerau–Levenshtein distance. The primitive operations include:
- insertion: geek –> greek
- deletion: geek –> gee
- substitution: geek –> geez
- transposition: geek –> geke
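The four operations above can be checked with a small distance function. The following is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance, written for illustration only; the function name osa_distance is hypothetical, and this is not Solr's own code, which evaluates fuzzy matches inside Lucene.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertion, deletion, substitution,
    and transposition of two adjacent characters as one operation each."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Each example from the list above is a single operation away from "geek".
for target in ("greek", "gee", "geez", "geke"):
    print("geek ->", target, osa_distance("geek", target))
```

Each of the four pairs comes out at distance 1, which is why a fuzzy search for geek with an edit distance of 1 would treat all of them as candidates.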
Fuzzy searches and proximity searches are based on the similarity calculation shown above. Let’s see some examples in the following sections.
3.3 Fuzzy Searches
A fuzzy search is written by appending the tilde symbol ~ to the end of a single-word term, optionally followed by an edit distance parameter, an integer between 0 and 2 (the default is 2). It matches terms similar to the specified term. For example:
curl "http://localhost:8983/solr/jcgArticles/select?q=array~"
The output would be:
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":17,
    "params":{
      "q":"array~"}},
  "response":{"numFound":2,"start":0,"maxScore":0.6837484,"docs":[
      {
        "id":"0553573333",
        "category":["java"],
        "title":["Java Array Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":2560,
        "likes":256,
        "dislikes":6,
        "comments":3,
        "publish_date":"2020-05-06T00:00:00Z",
        "_version_":1669841842345082880},
      {
        "id":"0626166238",
        "category":["java"],
        "title":["Java Arrays Showcases"],
        "published":true,
        "author":["Kevin Yang"],
        "views":565,
        "likes":234,
        "dislikes":8,
        "comments":14,
        "publish_date":"2020-03-06T00:00:00Z",
        "_version_":1669841842373394432}]
  }}
Note that a fuzzy search can only be used with single terms, not phrases. Instead, we can append the tilde to each term of a phrase individually, like this:
curl -G http://localhost:8983/solr/jcgArticles/select --data-urlencode "q=arra~1 AND examp~2"
The output would be:
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":17,
    "params":{
      "q":"arra~1 AND examp~2"}},
  "response":{"numFound":1,"start":0,"maxScore":0.72990763,"docs":[
      {
        "id":"0553573333",
        "category":["java"],
        "title":["Java Array Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":2560,
        "likes":256,
        "dislikes":6,
        "comments":3,
        "publish_date":"2020-05-06T00:00:00Z",
        "_version_":1669927522290106368}]
  }}
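The fuzzy queries above can also be sent from a script instead of curl. The sketch below uses only the Python standard library and assumes the same local SolrCloud and jcgArticles collection as the curl commands; the helper names build_select_url and run_query are hypothetical. Encoding the q parameter with urlencode plays the role that -G and --data-urlencode play on the command line.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the curl commands in this article.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(collection, query, base=SOLR_BASE):
    """Return a /select URL with the query string URL-encoded."""
    return f"{base}/{collection}/select?{urlencode({'q': query})}"


def run_query(collection, query):
    """Send the query to Solr and return the parsed JSON response.

    Requires a running Solr instance; nothing here is executed on import.
    """
    with urlopen(build_select_url(collection, query)) as response:
        return json.load(response)


# Building the URL does not need a running server.
print(build_select_url("jcgArticles", "arra~1 AND examp~2"))
```

With Solr running, run_query("jcgArticles", "arra~1 AND examp~2")["response"]["docs"] should contain the same documents as the curl output.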
3.4 Proximity Searches
A proximity search is written by appending the tilde symbol ~ and a numeric value to the end of a search phrase. It matches terms that are within a specific distance (the number of term movements needed) of one another. For example:
curl -G http://localhost:8983/solr/jcgArticles/select --data-urlencode "q=\"java example\"~3"
The query above searches article titles for java and example within 3 words of each other. The output would be:
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":11,
    "params":{
      "q":"\"java example\"~3"}},
  "response":{"numFound":5,"start":0,"maxScore":0.4862815,"docs":[
      {
        "id":"055357342Y",
        "category":["java"],
        "title":["Java StringTokenizer Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":699,
        "likes":30,
        "dislikes":0,
        "comments":0,
        "publish_date":"2020-06-01T00:00:00Z",
        "_version_":1669843269986549760},
      {
        "id":"0928237471",
        "category":["java"],
        "title":["Java HashSet Example"],
        "published":true,
        "author":["Evan Swing"],
        "views":3828,
        "likes":123,
        "dislikes":8,
        "comments":2,
        "publish_date":"2018-02-16T00:00:00Z",
        "_version_":1669843269989695488},
      {
        "id":"0553573333",
        "category":["java"],
        "title":["Java Array Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":2560,
        "likes":256,
        "dislikes":6,
        "comments":3,
        "publish_date":"2020-05-06T00:00:00Z",
        "_version_":1669843269982355456},
      {
        "id":"0553292123",
        "category":["java"],
        "title":["Java HashMap Example"],
        "published":true,
        "author":["Evan Swing"],
        "views":5897,
        "likes":1033,
        "dislikes":1,
        "comments":86,
        "publish_date":"2018-03-23T00:00:00Z",
        "_version_":1669843269998084096},
      {
        "id":"0553579908",
        "category":["java"],
        "title":["Java Remote Method Invocation Example"],
        "published":true,
        "author":["Kevin Yang"],
        "views":389,
        "likes":26,
        "dislikes":3,
        "comments":0,
        "publish_date":"2010-05-23T00:00:00Z",
        "_version_":1669843269993889792}]
  }}
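To see why a title such as "Java Remote Method Invocation Example" still matches with a distance of 3, the number of term movements can be estimated by hand. The sketch below is a simplified model for a two-term phrase, not Solr's implementation: it assumes titles are tokenized by lowercasing and splitting on whitespace (the real field analyzer may differ), and the function name required_slop is hypothetical.

```python
def required_slop(title, first, second):
    """Estimate the smallest proximity value at which the two-term
    phrase "first second" would match the title, or None if either
    term is missing.

    Simplified model: the phrase expects `second` one position after
    `first`, so the cost is how far the actual positions deviate from that.
    """
    tokens = title.lower().split()
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(p1 - (p2 - 1))
               for p1 in first_positions
               for p2 in second_positions)


titles = [
    "Java Array Example",
    "Java Remote Method Invocation Example",
]
for title in titles:
    print(title, "->", required_slop(title, "java", "example"))
```

Under this model, "Java Array Example" needs 1 movement and "Java Remote Method Invocation Example" needs 3, so both fall within the ~3 limit used in the query, while a title with the terms in reverse order ("Example Java") would need 2.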
4. Download the Sample Data File
You can download the full source code of this example here: Apache Solr Fuzzy Search Example