Class UnicodeRegex

java.lang.Object
com.ibm.icu.impl.UnicodeRegex
All Implemented Interfaces:
StringTransform, Transform<String,String>, Freezable<UnicodeRegex>, Cloneable

public class UnicodeRegex extends Object implements Cloneable, Freezable<UnicodeRegex>, StringTransform
Contains utilities to supplement the JDK Regex, since it doesn't handle Unicode well.

TODO: Move to com.ibm.icu.dev.somewhere. 2015-sep-03: This is used there, and also in CLDR and in UnicodeTools.

  • Field Details

    • SUPP_ESCAPE

      private static final Pattern SUPP_ESCAPE
    • symbolTable

      private SymbolTable symbolTable
    • STANDARD

      private static final UnicodeRegex STANDARD
    • bnfCommentString

      private String bnfCommentString
    • bnfVariableInfix

      private String bnfVariableInfix
    • bnfLineSeparator

      private String bnfLineSeparator
    • LongestFirst

      private Comparator<Object> LongestFirst
  • Constructor Details

    • UnicodeRegex

      public UnicodeRegex()
  • Method Details

    • getSymbolTable

      public SymbolTable getSymbolTable()
      Returns the symbol table used for internal processing.
    • setSymbolTable

      public UnicodeRegex setSymbolTable(SymbolTable symbolTable)
      Sets the symbol table used for internal processing.
    • transform

      public String transform(String regex)
      Adds full Unicode property support, with the latest version of Unicode, to Java Regex, bringing it up to Level 1 (see http://www.unicode.org/reports/tr18/). It does this by preprocessing the regex pattern string and interpreting the character classes (\p{...}, \P{...}, [...]) according to their syntax and meaning in UnicodeSet. With this utility, Java regular expressions can be updated to work with the latest version of Unicode and with all Unicode properties. Note, however, that the UnicodeSet syntax has not yet been made completely consistent with Java regex syntax, so watch for the differences.

      Not thread-safe; create a separate copy for different threads.

      In the future, we may extend this to support other regex packages.

      Specified by:
      transform in interface StringTransform
      Specified by:
      transform in interface Transform<String,String>
      Parameters:
      regex - A modified Java regex pattern, as in the input to Pattern.compile(), except that all "character classes" are processed as if they were UnicodeSet patterns. Example: "abc[:bc=N:]". See UnicodeSet for the differences in syntax.
      Returns:
      A processed Java regex pattern, suitable for input to Pattern.compile().
    • fix

      public static String fix(String regex)
      Convenience static function, using standard parameters.
      Parameters:
      regex - as in transform(...)
      Returns:
      processed regex pattern, as in transform(...)
    • compile

      public static Pattern compile(String regex)
      Compile a regex string, after processing by fix(...).
      Parameters:
      regex - Raw regex pattern, as in fix(...).
      Returns:
      Pattern
    • compile

      public static Pattern compile(String regex, int options)
      Compile a regex string with the given match flags, after processing by fix(...).
      Parameters:
      regex - Raw regex pattern, as in fix(...).
      options - Match flags, as accepted by Pattern.compile(String, int).
      Returns:
      Pattern
    • compileBnf

      public String compileBnf(String bnfLines)
      Compile a composed string from a set of BNF lines; see the List version for more information.
      Parameters:
      bnfLines - Series of BNF lines.
      Returns:
      The composed pattern string.
    • compileBnf

      public String compileBnf(List<String> lines)
      Compile a composed string from a set of BNF lines, such as for composing a regex expression. The lines can be in any order, but there must not be any cycles. The result can be used as input for fix().

      Example:

       uri = (?: (scheme) \\:)? (host) (?: \\? (query))? (?: \\u0023 (fragment))?;
       scheme = reserved+;
       host = // reserved+;
       query = [\\=reserved]+;
       fragment = reserved+;
       reserved = [[:ascii:][:alphabetic:]];
       

      Caveats: at this point the parsing is simple. For example, # cannot be quoted (use \\u0023 instead); you can set the comment string to null to disable comments. The equals sign and a few other delimiters can be changed with the setBnfX() methods.

      Parameters:
      lines - Series of lines that represent a BNF expression. The lines contain a series of statements of the form x=y;. A statement can span multiple lines, but there can't be multiple statements on one line. A hash (#) starts a comment that runs to the end of the line.
      Returns:
      The composed pattern string.
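The substitution described above can be illustrated with a small, self-contained sketch. It is not the ICU implementation: the class BnfSketch, its methods, and the explicit root-variable argument are hypothetical, and it hard-codes "#", "=" and ";" rather than using the configurable BNF strings. It only shows the general idea of parsing x=y; statements, replacing longer variable names first, and rejecting cycles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical, simplified illustration of BNF-style composition; NOT ICU code. */
class BnfSketch {

    /** Parses "name = value;" statements, dropping everything after '#'. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> vars = new LinkedHashMap<>();
        StringBuilder statement = new StringBuilder();
        for (String line : lines) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip the comment
            }
            statement.append(line.trim());      // a statement may span several lines
            int len = statement.length();
            if (len > 0 && statement.charAt(len - 1) == ';') {
                String s = statement.substring(0, len - 1);
                int eq = s.indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException("Missing '=' in: " + s);
                }
                vars.put(s.substring(0, eq).trim(), s.substring(eq + 1).trim());
                statement.setLength(0);
            }
        }
        if (statement.length() > 0) {
            throw new IllegalArgumentException("Unterminated statement: " + statement);
        }
        return vars;
    }

    /** Expands the root variable by plain textual substitution of the other variables. */
    static String compose(List<String> lines, String root) {
        Map<String, String> vars = parse(lines);
        String result = vars.get(root);
        if (result == null) {
            throw new IllegalArgumentException("Unknown root variable: " + root);
        }
        // Longest names first, so a short name never clobbers part of a longer one.
        List<String> names = new ArrayList<>(vars.keySet());
        names.sort(Comparator.comparingInt(String::length).reversed());
        // An acyclic grammar settles within names.size() passes; more means a cycle.
        for (int pass = 0; pass <= names.size(); pass++) {
            boolean replaced = false;
            for (String name : names) {
                if (result.contains(name)) {
                    result = result.replace(name, vars.get(name));
                    replaced = true;
                }
            }
            if (!replaced) {
                return result;
            }
        }
        throw new IllegalArgumentException("Cycle detected while expanding: " + root);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "pair = (key)=(value);  # root rule",
                "key = [a-z]+;",
                "value = [0-9]+;");
        String composed = compose(lines, "pair");
        System.out.println(composed);                             // ([a-z]+)=([0-9]+)
        System.out.println(Pattern.matches(composed, "abc=123")); // true
    }
}
```

Because the sketch substitutes raw text, a variable name that also appears as ordinary pattern text would be replaced too, which is one reason to keep variable names distinctive.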
    • getBnfCommentString

      public String getBnfCommentString()
    • setBnfCommentString

      public void setBnfCommentString(String bnfCommentString)
    • getBnfVariableInfix

      public String getBnfVariableInfix()
    • setBnfVariableInfix

      public void setBnfVariableInfix(String bnfVariableInfix)
    • getBnfLineSeparator

      public String getBnfLineSeparator()
    • setBnfLineSeparator

      public void setBnfLineSeparator(String bnfLineSeparator)
    • appendLines

      public static List<String> appendLines(List<String> result, String file, String encoding) throws IOException
      Utility for loading lines from a file.
      Parameters:
      result - The list to which the lines are appended.
      file - The file to read from.
      encoding - if null, then UTF-8
      Returns:
      filled list
      Throws:
      IOException - If there were problems opening the file for input stream.
    • appendLines

      public static List<String> appendLines(List<String> result, InputStream inputStream, String encoding) throws UnsupportedEncodingException, IOException
      Utility for loading lines from an input stream.
      Parameters:
      result - The list to which the lines are appended.
      inputStream - The input stream.
      encoding - if null, then UTF-8
      Returns:
      filled list
      Throws:
      IOException - If there were problems opening the input stream for reading.
      UnsupportedEncodingException
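As a rough illustration of the documented contract (append each line to the given list, fall back to UTF-8 when encoding is null, return the same list), a stand-alone equivalent might look like the sketch below. The class AppendLinesSketch is hypothetical and is not the ICU source; in particular, closing the stream after reading is a choice made here, not documented behavior.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in showing the described behavior; NOT ICU code. */
class AppendLinesSketch {

    static List<String> appendLines(List<String> result, InputStream inputStream, String encoding)
            throws IOException {
        // A null encoding means UTF-8, as the parameter description states.
        String charsetName = (encoding == null) ? "UTF-8" : encoding;
        // InputStreamReader throws UnsupportedEncodingException for an unknown charset name.
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(inputStream, charsetName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        }
        return result; // the same list that was passed in, now filled
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "key = [a-z]+;\nvalue = [0-9]+;\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = appendLines(new ArrayList<>(), new ByteArrayInputStream(data), null);
        System.out.println(lines.size()); // 2
    }
}
```

A list loaded this way has the shape that compileBnf(List<String>) expects.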
    • cloneAsThawed

      public UnicodeRegex cloneAsThawed()
      Description copied from interface: Freezable
      Provides for the clone operation. Any clone is initially unfrozen.
      Specified by:
      cloneAsThawed in interface Freezable<UnicodeRegex>
    • freeze

      public UnicodeRegex freeze()
      Description copied from interface: Freezable
      Freezes the object.
      Specified by:
      freeze in interface Freezable<UnicodeRegex>
      Returns:
      the object itself.
    • isFrozen

      public boolean isFrozen()
      Description copied from interface: Freezable
      Determines whether the object has been frozen or not.
      Specified by:
      isFrozen in interface Freezable<UnicodeRegex>
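The three Freezable methods follow a general freeze/thaw pattern that can be sketched generically. The class below is a hypothetical toy, not UnicodeRegex, and this page does not say whether UnicodeRegex itself rejects setter calls after freeze(). The sketch only shows how freeze(), isFrozen() and cloneAsThawed() typically cooperate, so that a shared instance stays fixed while each thread works on its own modifiable copy.

```java
/** Hypothetical toy class illustrating the freeze/thaw pattern; NOT ICU code. */
class FreezableSketch implements Cloneable {
    private String commentString = "#";
    private boolean frozen = false;

    FreezableSketch setCommentString(String commentString) {
        if (frozen) {
            // Assumption of this sketch: frozen objects refuse modification.
            throw new UnsupportedOperationException("Attempt to modify a frozen object");
        }
        this.commentString = commentString;
        return this;
    }

    String getCommentString() {
        return commentString;
    }

    /** Freezes the object and returns it, so calls can be chained. */
    FreezableSketch freeze() {
        frozen = true;
        return this;
    }

    boolean isFrozen() {
        return frozen;
    }

    /** Returns a modifiable copy; the original stays frozen. */
    FreezableSketch cloneAsThawed() {
        try {
            FreezableSketch copy = (FreezableSketch) super.clone();
            copy.frozen = false;
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }

    public static void main(String[] args) {
        FreezableSketch shared = new FreezableSketch().setCommentString("//").freeze();
        FreezableSketch local = shared.cloneAsThawed().setCommentString(";");
        System.out.println(shared.getCommentString() + " " + local.getCommentString()); // prints: // ;
    }
}
```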
    • processSet

      private int processSet(String regex, int i, StringBuilder result, UnicodeSet temp, ParsePosition pos)
    • getVariables

      private Map<String,String> getVariables(List<String> lines)