Apache Solr Multilingual Search: Language Identification Example
This is an article related to the Apache Solr Multilingual Search: Language Identification. Generally, Apache Solr is used for search and browsing categories and facets.
1. Introduction
Apache Solr is an open-source software java search engine. It is scalable and can process a high volume of data. It is used to index the content and search a huge amount of content. It is a popular search engine. It is used as a document-based NoSQL data source. It can be also used as a key-value store. Solr has JSON, XML, and HTTP REST API.
2. Apache Solr Multilingual Search
2.1 Prerequisites
Java 7 or 8 is required on the Linux, windows, or Mac operating system. Apache Solr 4.7.0 is required for this example.
2.2 Download
You can download Java 8 can be downloaded from the Oracle website. Apache Solr’s latest releases are available from the Apache Solr website.
2.3 Setup
You can set the environment variables for JAVA_HOME and PATH. They can be set as shown below:
Setup
JAVA_HOME="/desktop/jdk1.8.0_73" export JAVA_HOME PATH=$JAVA_HOME/bin:$PATH export PATH
2.4 How to download and install Apache Solr
Apache Solr’s latest releases are available from the Apache Solr website. After downloading the zip file can be extracted to a folder.
To start the Apache Solr, you can use the command below:
Solr start command
bin/solr start
The output of the above command is shown below:
Solr start command output
apples-MacBook-Air:solr-8.8.2 bhagvan.kommadi$ bin/solr start *** [WARN] *** Your open file limit is currently 2560. It should be set to 65000 to avoid operational disruption. If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh *** [WARN] *** Your Max Processes Limit is currently 1392. It should be set to 65000 to avoid operational disruption. If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh Waiting up to 180 seconds to see Solr running on port 8983 [-] Started Solr server on port 8983 (pid=3054). Happy searching!
You can access the Solr application from the browse at : http://localhost:8983/solr/. The screenshot below shows the Solr application.
2.5 Apache Solr
Apache Solr merged into Lucene around 2010. Lucene was created by Doug Cutting in 1999. Solr was developed by Yonik Seeley at CNET. Solr had a cloud feature released in 4.0. Solr 6.0 supported parallel SQL queries. Solr is based on Lucene. It has REST API support. It has an inverted index feature to get documents for a query using the search word. The search word is entered by the user to link the documents to the word. Solr has the features such as support for XML/JSON/HTTP, recommendations, automatic load balancing, spell suggestions, autocompletion, geospatial search, authentication, authorization, multilingual keyword search, type ahead prediction, batch processing, streaming, machine learning models, high volume web traffic support, schema, schemaless configuration, faceted search, filtering, and cluster configuration.
2.6 Apache Solr – Language Identification Example
To handle multiple languages, a field per language approach can be used in Apache Solr. Solr supports different languages. We need to set up the schema to search across three languages: English, Spanish, and French.
First, let us look at the configuration of language identification to find the language of a document.
solr configuration
<?xml version="1.0" encoding="UTF-8" ?> <config> <!-- Begin everything else --> <luceneMatchVersion>4.7</luceneMatchVersion> <lib dir="../../../contrib/langid/lib/" /> <lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" /> <dataDir>${solr.data.dir:}</dataDir> <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/> <updateHandler class="solr.DirectUpdateHandler2"> <updateLog> <str name="dir">${solr.ulog.dir:}</str> </updateLog> <autoCommit> <maxTime>15000</maxTime> <openSearcher>false</openSearcher> </autoCommit> </updateHandler> <query> <maxBooleanClauses>1024</maxBooleanClauses> <useColdSearcher>false</useColdSearcher> <maxWarmingSearchers>1</maxWarmingSearchers> </query> <requestDispatcher handleSelect="false" > <httpCaching never304="true" /> </requestDispatcher> <requestHandler name="/select" class="solr.SearchHandler"> <lst name="defaults"> <str name="echoParams">none</str> <str name="df">content</str> <str name="wt">json</str> <str name="indent">true</str> </lst> </requestHandler> <updateRequestProcessorChain name="langid"> <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory"> <lst name="invariants"> <str name="langid.fl">content,content_lang1,content_lang2,content_lang3</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <str name="langid.map">true</str> <str name="langid.map.individual">true</str> <str name="langid.map.fl">content_lang1,content_lang2,content_lang3</str>str> <str name="langid.whitelist">en,es,fr</str> <str name="langid.map.lcmap">en:english es:spanish fr:french</str> </lst> </processor> <processor class="solr.LogUpdateProcessorFactory" /> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain> <requestHandler name="/update" class="solr.UpdateRequestHandler"> <lst name="defaults"> <str name="update.chain">langid</str> </lst> </requestHandler> <queryResponseWriter name="json" class="solr.JSONResponseWriter"> <str name="content-type">text/plain; charset=UTF-8</str> </queryResponseWriter> <admin> <defaultQuery>*:*</defaultQuery> </admin> </config>
Let us look at a schema to support the above three languages.
schema file
<?xml version="1.0" encoding="UTF-8" ?> <schema name="example" version="1.5"> <types> <fieldType name="text_english" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" /> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EnglishPossessiveFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.KStemFilterFactory"/> </analyzer> </fieldType> <fieldType name="text_spanish" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball"/> <filter class="solr.SpanishLightStemFilterFactory"/> </analyzer> </fieldType> <fieldType name="text_french" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/> <filter class="solr.FrenchLightStemFilterFactory"/> </analyzer> </fieldType> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> <fieldType name="string" class="solr.StrField" sortMissingLast="true" /> <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/> </types> <fields> <field name="id" type="string" indexed="true" stored="true" /> <field name="_version_" type="long" indexed="true" stored="true"/> <field name="text" type="string" indexed="true" stored="false" multiValued="true"/> <field name="title" type="string" indexed="true" stored="true" /> <field name="content" type="string" indexed="false" stored="false" /> <field name="content_lang1" type="string" indexed="false" stored="false" /> <field name="content_lang2" type="string" indexed="false" stored="false" /> <field name="content_lang3" type="string" indexed="false" stored="false" /> <field name="language" type="string" indexed="true" stored="true" /> <field name="languages" type="string" indexed="true" stored="true" multiValued="true"/> <dynamicField name="*_english" type="text_english" indexed="true" stored="true" multiValued="true"/> <dynamicField name="*_spanish" type="text_spanish" indexed="true" stored="true" multiValued="true"/> <dynamicField name="*_french" type="text_french" indexed="true" stored="true" multiValued="true"/> </fields> <uniqueKey>id</uniqueKey> <defaultSearchField>text</defaultSearchField> </schema>
You can implement this by copying the directory example
within solr installation and rename it as langdetect
. Ensure that it is a deep copy of the directory. Within the directory, remove the unused directories like example-DIH
, multicore
,and example-schemaless
. Remove the directories under the solr folder except for bin. You can copy the folders and files from the source code ($SOURCE_CODE
) provided to solr
. You can execute the commands below from the langdetect
directory
Solr initial setup
cd $SOLR_INSTALL cp -R example langdetect cd langdetect rm -r example-DIH rm -r multicore cd solr rm -r collection1 cp * $SOURCE_CODE/* .
Note that the core.properties
is changed the name from collection to langdetect
directory. You can restart Solr from langdetect
folder by using the command below:
Solr start command
cd $SOLR_INSTALL/langdetect java -jar start.jar
The output of the command is shown below:
Solr start output
apples-MacBook-Air:languagedetection bhagvan.kommadi$ java -jar start.jar 0 [main] INFO org.eclipse.jetty.server.Server – jetty-8.1.10.v20130312 42 [main] INFO org.eclipse.jetty.deploy.providers.ScanningAppProvider – Deployment monitor /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/contexts at interval 0 53 [main] INFO org.eclipse.jetty.deploy.DeploymentManager – Deployable added: /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/contexts/solr-jetty-context.xml 1414 [main] INFO org.eclipse.jetty.webapp.StandardDescriptorProcessor – NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet 1494 [main] INFO org.apache.solr.servlet.SolrDispatchFilter – SolrDispatchFilter.init() 1518 [main] INFO org.apache.solr.core.SolrResourceLoader – JNDI not configured for solr (NoInitialContextEx) 1519 [main] INFO org.apache.solr.core.SolrResourceLoader – solr home defaulted to 'solr/' (could not find system property or JNDI) 1522 [main] INFO org.apache.solr.core.SolrResourceLoader – new SolrResourceLoader for directory: 'solr/' 1667 [main] INFO org.apache.solr.core.ConfigSolr – Loading container configuration from /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/solr.xml 1900 [main] INFO org.apache.solr.core.CoresLocator – Config-defined core root directory: /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr 1915 [main] INFO org.apache.solr.core.CoreContainer – New CoreContainer 1720339 1916 [main] INFO org.apache.solr.core.CoreContainer – Loading cores into CoreContainer [instanceDir=solr/] 1935 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting socketTimeout to: 0 1935 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting urlScheme to: null 1942 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting connTimeout to: 0 1943 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting maxConnectionsPerHost to: 20 1947 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting corePoolSize to: 0 1948 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting maximumPoolSize to: 2147483647 1949 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting maxThreadIdleTime to: 5 1949 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting sizeOfQueue to: -1 1950 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – Setting fairnessPolicy to: false 2250 [main] INFO org.apache.solr.logging.LogWatcher – SLF4J impl is org.slf4j.impl.Log4jLoggerFactory 2251 [main] INFO org.apache.solr.logging.LogWatcher – Registering Log Listener [Log4j (org.slf4j.impl.Log4jLoggerFactory)] 2252 [main] INFO org.apache.solr.core.CoreContainer – Host Name: 2460 [main] INFO org.apache.solr.core.CoresLocator – Looking for core definitions underneath /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr 2472 [main] INFO org.apache.solr.core.CoresLocator – Found core langdetect in /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/ 2473 [main] INFO org.apache.solr.core.CoresLocator – Found 1 core definitions 2476 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.CoreContainer – Creating SolrCore 'langdetect' using instanceDir: /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect 2476 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrResourceLoader – new SolrResourceLoader for directory: '/Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/' 2511 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrConfig – Adding specified lib dirs to ClassLoader 2513 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrResourceLoader – Adding 'file:/Users/bhagvan.kommadi/Desktop/solr-4.7.0/contrib/langid/lib/jsonic-1.2.7.jar' to classloader 2514 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrResourceLoader – Adding 'file:/Users/bhagvan.kommadi/Desktop/solr-4.7.0/contrib/langid/lib/langdetect-1.1-20120112.jar' to classloader 2517 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrResourceLoader – Adding 'file:/Users/bhagvan.kommadi/Desktop/solr-4.7.0/dist/solr-langid-4.7.0.jar' to classloader 2547 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrConfig – Using Lucene MatchVersion: LUCENE_47 2698 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.Config – Loaded SolrConfig: solrconfig.xml 2707 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.schema.IndexSchema – Reading Solr Schema from schema.xml 2725 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.schema.IndexSchema – [langdetect] Schema name=example 2790 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.schema.IndexSchema – default search field in schema is text 2791 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.schema.IndexSchema – unique key field: id 2925 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – solr.NRTCachingDirectoryFactory 2931 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – [langdetect] Opening new SolrCore at /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/, dataDir=/Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/data/ 2931 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – JMX monitoring not detected for core: langdetect 2943 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.CachingDirectoryFactory – return new directory for /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/data 2943 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – New index directory detected: old=null new=/Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/data/index/ 2944 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.CachingDirectoryFactory – return new directory for /Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/data/index 2955 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – created json: solr.JSONResponseWriter 3023 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.update.processor.UpdateRequestProcessorChain – creating updateRequestProcessorChain "langid" 3674 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.update.processor.UpdateRequestProcessorChain – inserting DistributedUpdateProcessorFactory into updateRequestProcessorChain "langid" 3674 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – no updateRequestProcessorChain defined as default, creating implicit default 3680 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.RequestHandlers – created /select: solr.SearchHandler 3683 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.RequestHandlers – created /update: solr.UpdateRequestHandler 3709 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.handler.loader.XMLLoader – xsltCacheLifetimeSeconds=60 3767 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – Hard AutoCommit: if uncommited for 15000ms; 3768 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – Soft AutoCommit: disabled 3819 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onInit: commits: num=1 commit{dir=NRTCachingDirectory(NIOFSDirectory@/Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/data/index lockFactory=NativeFSLockFactory@/Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection/solr/langdetect/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_4,generation=4} 3821 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – newest commit generation = 4 3859 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.search.SolrIndexSearcher – Opening Searcher@12a34218[langdetect] main 3869 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – [langdetect] Registered new searcher Searcher@12a34218[langdetect] main{StandardDirectoryReader(segments_4:11:nrt _2(4.7):C4)} 3870 [coreLoadExecutor-4-thread-1] INFO org.apache.solr.core.CoreContainer – registering core: langdetect 3874 [main] INFO org.apache.solr.servlet.SolrDispatchFilter – user.dir=/Users/bhagvan.kommadi/Desktop/solr-4.7.0/languagedetection 3875 [main] INFO org.apache.solr.servlet.SolrDispatchFilter – SolrDispatchFilter.init() done 3920 [main] INFO org.eclipse.jetty.server.AbstractConnector – Started SocketConnector@0.0.0.0:8983
You can post the multilanguage file (provided in the source code) from $SOLR_INSTALL/example-docs using the command below.
posting multilanguage files to solr
apples-MacBook-Air:exampledocs bhagvan.kommadi$ java -Durl=http://localhost:8983/solr/langdetect/update -jar post.jar /Users/bhagvan.kommadi/desktop/JavacodeGeeks/code/apachesollangidentification/multi_lang.xml SimplePostTool version 1.5 Posting files to base url http://localhost:8983/solr/langdetect/update using content-type application/xml.. POSTing file multi_lang.xml 1 files indexed. COMMITting Solr index changes to http://localhost:8983/solr/langdetect/update.. Time spent: 0:00:00.491
You can launch the browser pointing to http://localhost:8983/solr/. The screenshot below shows the query results in langdetect core.
3. Download the Source Code
You can download the full source code of this example here: Apache Solr Multilingual Search: Language Identification Example