Class WhitespaceTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class WhitespaceTokenizer
    extends org.apache.lucene.analysis.CharTokenizer
    Created by The eXo Platform SAS. Author: eXoPlatform (exo@exoplatform.com), Apr 9, 2013.
    A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
    You must specify the required Version compatibility when creating WhitespaceTokenizer:
    • As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See CharTokenizer.isTokenChar(int) and CharTokenizer.normalize(int) for details.
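The splitting rule described above can be sketched without Lucene. The class below is a hypothetical, stand-alone illustration (the names WhitespaceSplitSketch and tokenize are assumptions, not part of this API). It walks the input by code point, as an int-based API would, so a supplementary character stored as a surrogate pair is tested as a single unit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use or extend the Lucene classes.
class WhitespaceSplitSketch {

    // Mirrors the documented rule: a code point belongs to a token
    // unless Character.isWhitespace(int) accepts it.
    static boolean isTokenChar(int c) {
        return !Character.isWhitespace(c);
    }

    // Splits text into maximal runs of token characters, walking by code point
    // so that supplementary characters (two chars in UTF-16) stay intact.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // Whitespace ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) is encoded as a surrogate pair.
        System.out.println(tokenize("  foo\tbar \uD834\uDD1E\nbaz "));
    }
}
```

Leading, trailing, and repeated whitespace produce no empty tokens in this sketch, because a token is only emitted once at least one token character has been collected.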
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

        org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
    • Field Summary

      • Fields inherited from class org.apache.lucene.analysis.Tokenizer

        input
    • Constructor Summary

      Constructors 
      Constructor Description
      WhitespaceTokenizer(org.apache.lucene.util.Version matchVersion, Reader in)
      Construct a new WhitespaceTokenizer.
      WhitespaceTokenizer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.util.AttributeSource.AttributeFactory factory, Reader in)
      Construct a new WhitespaceTokenizer using a given AttributeSource.AttributeFactory.
      WhitespaceTokenizer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.util.AttributeSource source, Reader in)
      Construct a new WhitespaceTokenizer using a given AttributeSource.
    • Method Summary

      Modifier and Type Method Description
      protected boolean isTokenChar(int c)
      Collects only characters which do not satisfy Character.isWhitespace(int).
      • Methods inherited from class org.apache.lucene.analysis.CharTokenizer

        end, incrementToken, isTokenChar, normalize, normalize, reset
      • Methods inherited from class org.apache.lucene.analysis.Tokenizer

        close, correctOffset
      • Methods inherited from class org.apache.lucene.analysis.TokenStream

        reset
      • Methods inherited from class org.apache.lucene.util.AttributeSource

        addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
    • Constructor Detail

      • WhitespaceTokenizer

        public WhitespaceTokenizer(org.apache.lucene.util.Version matchVersion,
                                   Reader in)
        Construct a new WhitespaceTokenizer.
        Parameters:
        matchVersion - Lucene version to match
        in - the input to split up into tokens
      • WhitespaceTokenizer

        public WhitespaceTokenizer(org.apache.lucene.util.Version matchVersion,
                                   org.apache.lucene.util.AttributeSource source,
                                   Reader in)
        Construct a new WhitespaceTokenizer using a given AttributeSource.
        Parameters:
        matchVersion - Lucene version to match
        source - the attribute source to use for this Tokenizer
        in - the input to split up into tokens
      • WhitespaceTokenizer

        public WhitespaceTokenizer(org.apache.lucene.util.Version matchVersion,
                                   org.apache.lucene.util.AttributeSource.AttributeFactory factory,
                                   Reader in)
        Construct a new WhitespaceTokenizer using a given AttributeSource.AttributeFactory.
        Parameters:
        matchVersion - Lucene version to match (see the Version compatibility note in the class description)
        factory - the attribute factory to use for this Tokenizer
        in - the input to split up into tokens
    • Method Detail

      • isTokenChar

        protected boolean isTokenChar(int c)
        Collects only characters which do not satisfy Character.isWhitespace(int).
        Overrides:
        isTokenChar in class org.apache.lucene.analysis.CharTokenizer
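Because the predicate is defined through Character.isWhitespace(int), its edge cases follow the JDK's definition of whitespace rather than Unicode's broader notion of a space. The following is a minimal sketch of that predicate as a free-standing helper (IsTokenCharSketch is a hypothetical name used only for illustration). It shows that a non-breaking space (U+00A0) is kept as a token character, while an ideographic space (U+3000) is treated as a separator.

```java
// Hypothetical helper class for illustration; not part of Lucene or eXo.
class IsTokenCharSketch {

    // Same rule as documented above: keep every code point that
    // Character.isWhitespace(int) rejects.
    static boolean isTokenChar(int c) {
        return !Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        // Letter, space, tab, newline, non-breaking space, ideographic space.
        int[] samples = {'a', ' ', '\t', '\n', 0x00A0, 0x3000};
        for (int c : samples) {
            System.out.printf("U+%04X -> %b%n", c, isTokenChar(c));
        }
    }
}
```

One practical consequence is that text containing non-breaking spaces is not split at those positions, so such characters end up inside tokens unless they are normalized beforehand.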