Class ICUTokenizerFactory
java.lang.Object
org.apache.lucene.analysis.AbstractAnalysisFactory
org.apache.lucene.analysis.TokenizerFactory
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
- All Implemented Interfaces:
ResourceLoaderAware
Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
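As a rough illustration of what "broken across script boundaries" means, the sketch below splits a string wherever the Unicode script changes. It uses only the JDK's Character.UnicodeScript and is not ICUTokenizer's implementation; the class name ScriptRuns and the choice to let COMMON and INHERITED characters (spaces, digits, combining marks) stay in the current run are assumptions made for the example, and ICU's own script handling is more involved.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or ICU.
final class ScriptRuns {

    /** Splits text into runs that each contain a single script. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript current = null; // script of the run being built; null until one is seen
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: script-neutral characters never start a new run.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i)); // close the previous run
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin letters followed by Cyrillic letters (written as escapes).
        System.out.println(split("hello\u043c\u0438\u0440"));
    }
}
```

In the real tokenizer, each such run is then segmented further by the BreakIterator configured for that script.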
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair consists of a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
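The shape of the "rulefiles" value can be made concrete with a small parser. This is a minimal sketch using only the JDK, not the factory's own parsing code; the class RuleFilesArg, its parse method, and the strictness of the validation (exactly four characters before the first colon, non-empty path) are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene.
final class RuleFilesArg {

    /** Parses "code:rulefile,code:rulefile" into a script-code-to-path map. */
    static Map<String, String> parse(String rulefiles) {
        Map<String, String> byScript = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            String trimmed = pair.trim();
            int colon = trimmed.indexOf(':');
            // Assumption: the script code is exactly four characters (ISO 15924)
            // and the resource path after the colon must not be empty.
            if (colon != 4 || colon + 1 == trimmed.length()) {
                throw new IllegalArgumentException(
                        "expected <script code>:<resource path>, got: " + trimmed);
            }
            byScript.put(trimmed.substring(0, colon), trimmed.substring(colon + 1));
        }
        return byScript;
    }

    public static void main(String[] args) {
        // Same value as the rulefiles attribute in the XML example.
        System.out.println(parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"));
    }
}
```

Each resource path is resolved through the ResourceLoader passed to inform, so the rule files must be visible to that loader.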
- Since:
- 3.1
-
Field Summary
Fields (Modifier and Type, Field, Description)
private final boolean cjkAsWords
private ICUTokenizerConfig config
private final boolean myanmarAsWords
static final String NAME
    SPI name
(package private) static final String RULEFILES
private final IntObjectHashMap<String> tailored
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
Constructor Summary
Constructors (Constructor, Description)
ICUTokenizerFactory()
    Default ctor for compatibility with SPI
ICUTokenizerFactory(Map<String, String> args)
    Creates a new ICUTokenizerFactory
-
Method Summary
Modifier and Type, Method, Description
Tokenizer create(AttributeFactory factory)
    Creates a TokenStream of the specified input using the given AttributeFactory
void inform(ResourceLoader loader)
    Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
private com.ibm.icu.text.BreakIterator parseRules(String filename, ResourceLoader loader)
Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
-
Field Details
-
NAME
SPI name
- See Also:
-
RULEFILES
- See Also:
-
tailored
-
config
-
cjkAsWords
private final boolean cjkAsWords
-
myanmarAsWords
private final boolean myanmarAsWords
-
-
Constructor Details
-
ICUTokenizerFactory
public ICUTokenizerFactory(Map<String, String> args)
Creates a new ICUTokenizerFactory
-
ICUTokenizerFactory
public ICUTokenizerFactory()
Default ctor for compatibility with SPI
-
-
Method Details
-
inform
Description copied from interface: ResourceLoaderAware
Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
- Specified by:
inform in interface ResourceLoaderAware
- Throws:
IOException
-
parseRules
private com.ibm.icu.text.BreakIterator parseRules(String filename, ResourceLoader loader) throws IOException
- Throws:
IOException
-
create
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
- Specified by:
create in class TokenizerFactory
-