1.29.9.25. Ignoring Accent Symbols. New Analyzer Setting.

In this example, we will create a new Analyzer, set it in the QueryHandler configuration, and run a query to check it.

The standard analyzer does not normalize accented characters such as é, è, and à, so a word like 'tréma' is stored in the index as 'tréma'. But what if we want to normalize such characters, so that 'tréma' is stored as 'trema'?
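
To make the goal concrete, here is a minimal sketch of accent normalization using only the JDK. It is not how Lucene's ISOLatin1AccentFilter is implemented: it swaps in Unicode decomposition (java.text.Normalizer) as a stand-in, and the class and method names (AccentFolding, fold) are hypothetical. Results may differ from the Lucene filter for characters that have no decomposition, such as ligatures.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only; not part of eXo JCR or Lucene.
final class AccentFolding {

    // Removes accents by decomposing each character (NFD) and then
    // dropping the combining marks that the decomposition produces.
    static String fold(String text) {
        // NFD turns "\u00E9" (é) into "e" followed by U+0301 (combining acute accent)
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches any Unicode combining mark
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("tr\u00E9ma")); // prints "trema"
        System.out.println(fold("na\u00EFve")); // prints "naive"
    }
}
```

With this kind of folding applied at both index time and query time, 'tréma' and 'trema' end up as the same term.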

There is only one way to set up a new Analyzer, whether a standard one or our own: create the Analyzer (unless a suitable one already exists) and set it in the SearchIndex configuration.

Create the new analyzer class:

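package org.exoplatform.services.jcr.impl.core;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer
{
   @Override
   public TokenStream tokenStream(String fieldName, Reader reader)
   {
      // split the input text into tokens
      StandardTokenizer tokenizer = new StandardTokenizer(reader);
      // process all text with the standard filter:
      // removes 's (as in "Peter's") from the end of words and removes dots from acronyms
      TokenStream result = new StandardFilter(tokenizer);
      // this filter normalizes token text to lower case
      result = new LowerCaseFilter(result);
      // this one replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents
      result = new ISOLatin1AccentFilter(result);
      // finally, return the token stream
      return result;
   }
}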

Then register the new analyzer in the query-handler configuration of the workspace:

<workspace name="ws">
   ...
   <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
      <properties>
         <property name="analyzer" value="org.exoplatform.services.jcr.impl.core.MyAnalyzer"/>
         ...
      </properties>
   </query-handler>
   ...
</workspace>

After that, check it with a query:

Find nodes with mixin type 'mix:title' whose 'jcr:title' property contains the strings "tréma" and "naïve".
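
The check described above could be expressed as a JCR SQL statement. The sketch below only builds the statement text, assuming JCR 1.0 SQL syntax with the CONTAINS function, where whitespace-separated terms are combined with an implicit AND. The class and method names (AccentQueryExample, buildStatement) are hypothetical, and the words are assumed to contain no quote characters. Running the statement needs a live repository session, so that part appears only as comments.

```java
// Hypothetical helper, for illustration only; not part of eXo JCR.
final class AccentQueryExample {

    // Builds a full-text query on jcr:title for nodes of mixin type mix:title.
    // Assumes the words contain no quote characters.
    static String buildStatement(String... words) {
        String terms = String.join(" ", words);
        return "SELECT * FROM mix:title WHERE CONTAINS(jcr:title, '" + terms + "')";
    }

    public static void main(String[] args) {
        String statement = buildStatement("tr\u00E9ma", "na\u00EFve");
        System.out.println(statement);

        // With an open javax.jcr.Session (not shown here), the statement
        // could be executed roughly like this:
        //   QueryManager queryManager = session.getWorkspace().getQueryManager();
        //   Query query = queryManager.createQuery(statement, Query.SQL);
        //   QueryResult result = query.execute();
        //   NodeIterator nodes = result.getNodes();
    }
}
```

Because the same analyzer is expected to process both the indexed text and the query text, the accented and unaccented spellings of each word should match the same nodes.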

Copyright ©2012. All rights reserved. eXo Platform SAS