Class StringUtil

java.lang.Object
org.docx4j.org.apache.poi.util.StringUtil

public class StringUtil
extends java.lang.Object
Title: String Utility
Description: Collection of string handling utilities

Note - none of the methods in this class deals with org.docx4j.org.apache.poi.hssf.record.ContinueRecords. For such functionality, consider using RecordInputStream.

Author:
Andrew C. Oliver, Sergei Kozello (sergeikozello at mail.ru), Toshiaki Kamoshida (kamoshida.toshiaki at future dot co dot jp)
  • Nested Class Summary

    Nested Classes 
    Modifier and Type Class Description
    static class  StringUtil.StringsIterator
    An Iterator over an array of Strings.
  • Method Summary

    Modifier and Type Method Description
    static java.lang.String format​(java.lang.String message, java.lang.Object[] params)
    Apply printf() like formatting to a string.
    static int getEncodedSize​(java.lang.String value)  
    static java.lang.String getFromCompressedUnicode​(byte[] string, int offset, int len)
    Read 8 bit data (in ISO-8859-1 codepage) into a (unicode) Java String and return.
    static java.lang.String getFromUnicodeLE​(byte[] string)
    Given a byte array of 16-bit unicode characters in little endian format (most significant byte last), return a Java String representation of it.
    static java.lang.String getFromUnicodeLE​(byte[] string, int offset, int len)
    Given a byte array of 16-bit unicode characters in little endian format (most significant byte last), return a Java String representation of it.
    static java.lang.String getPreferredEncoding()  
    static byte[] getToUnicodeLE​(java.lang.String string)
    Convert String to 16-bit unicode characters in little endian format
    static boolean hasMultibyte​(java.lang.String value)
    Checks whether the given string contains at least one multibyte character.
    static boolean isUnicodeString​(java.lang.String value)
    Checks to see if a given String needs to be represented as Unicode
    static void mapMsCodepoint​(int msCodepoint, int unicodeCodepoint)  
    static java.lang.String mapMsCodepointString​(java.lang.String string)
    Some strings may contain encoded characters of the unicode private use area.
    static void putCompressedUnicode​(java.lang.String input, byte[] output, int offset)
    Takes a unicode (Java) string and writes it as 8-bit data (in ISO-8859-1 codepage) into the supplied byte array.
    static void putCompressedUnicode​(java.lang.String input, LittleEndianOutput out)  
    static void putUnicodeLE​(java.lang.String input, byte[] output, int offset)
    Takes a unicode string and writes it as little endian (most significant byte last) bytes into the supplied byte array.
    static void putUnicodeLE​(java.lang.String input, LittleEndianOutput out)  
    static java.lang.String readCompressedUnicode​(LittleEndianInput in, int nChars)  
    static java.lang.String readUnicodeLE​(LittleEndianInput in, int nChars)  
    static java.lang.String readUnicodeString​(LittleEndianInput in)
    InputStream in is expected to contain: (1) ushort nChars, (2) byte is16BitFlag, (3) byte[]/char[] characterData. For this encoding, the is16BitFlag is always present even if nChars==0.
    static java.lang.String readUnicodeString​(LittleEndianInput in, int nChars)
    InputStream in is expected to contain: (1) byte is16BitFlag, (2) byte[]/char[] characterData. For this encoding, the is16BitFlag is always present even if nChars==0.
    static void writeUnicodeString​(LittleEndianOutput out, java.lang.String value)
    OutputStream out will get: (1) ushort nChars, (2) byte is16BitFlag, (3) byte[]/char[] characterData. For this encoding, the is16BitFlag is always present even if nChars==0.
    static void writeUnicodeStringFlagAndData​(LittleEndianOutput out, java.lang.String value)
    OutputStream out will get: (1) byte is16BitFlag, (2) byte[]/char[] characterData. For this encoding, the is16BitFlag is always present even if nChars==0.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • getFromUnicodeLE

      public static java.lang.String getFromUnicodeLE​(byte[] string, int offset, int len) throws java.lang.ArrayIndexOutOfBoundsException, java.lang.IllegalArgumentException
      Given a byte array of 16-bit unicode characters in little endian format (most significant byte last), return a Java String representation of it. For example, the bytes { 0x16, 0x00 } represent the character 0x16.
      Parameters:
      string - the byte array to be converted
      offset - the initial offset into the byte array. It is assumed that string[ offset ] and string[ offset + 1 ] contain the first 16-bit unicode character
      len - the length of the final string, in characters
      Returns:
      the converted string, never null.
      Throws:
      java.lang.ArrayIndexOutOfBoundsException - if offset is out of bounds for the byte array (i.e., is negative or is greater than or equal to string.length)
      java.lang.IllegalArgumentException - if len is too large (i.e., there is not enough data in string to create a String of that length)
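      The offset and length contract described above can be illustrated with the JDK alone. The following is a minimal sketch, not the StringUtil implementation: the class name Utf16LeDecodeSketch and its decode method are hypothetical, and the bounds checks simply mirror the documented Throws clauses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for getFromUnicodeLE(byte[], int, int); JDK only.
final class Utf16LeDecodeSketch {

    // Decodes len UTF-16LE characters (two bytes each) starting at offset.
    static String decode(byte[] data, int offset, int len) {
        if (offset < 0 || offset >= data.length) {
            throw new ArrayIndexOutOfBoundsException("Illegal offset " + offset);
        }
        if (len < 0 || (data.length - offset) / 2 < len) {
            throw new IllegalArgumentException("Illegal length " + len);
        }
        return new String(data, offset, len * 2, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Low byte first: { 0x41, 0x00 } is 'A', { 0x42, 0x00 } is 'B'.
        byte[] data = { 0x16, 0x00, 0x41, 0x00, 0x42, 0x00 };
        System.out.println(decode(data, 2, 2)); // prints "AB"
    }
}
```

      Note that len counts characters, so the sketch consumes len * 2 bytes.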
    • getFromUnicodeLE

      public static java.lang.String getFromUnicodeLE​(byte[] string)
      Given a byte array of 16-bit unicode characters in little endian format (most significant byte last), return a Java String representation of it. For example, the bytes { 0x16, 0x00 } represent the character 0x16.
      Parameters:
      string - the byte array to be converted
      Returns:
      the converted string, never null
    • getToUnicodeLE

      public static byte[] getToUnicodeLE​(java.lang.String string)
      Convert String to 16-bit unicode characters in little endian format
      Parameters:
      string - the string
      Returns:
      the byte array of 16-bit unicode characters
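      To show the byte order this conversion produces, here is a small JDK-only sketch. It assumes the result is equivalent to encoding with the standard UTF-16LE charset; the class Utf16LeEncodeSketch is hypothetical and does not call StringUtil.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the little endian byte layout; JDK only.
final class Utf16LeEncodeSketch {

    // Encodes each 16-bit char as two bytes, low byte first.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // 'A' is U+0041 and the euro sign is U+20AC.
        for (byte b : encode("A\u20AC")) {
            System.out.printf("%02X ", b & 0xFF); // prints "41 00 AC 20 "
        }
        System.out.println();
    }
}
```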
    • getFromCompressedUnicode

      public static java.lang.String getFromCompressedUnicode​(byte[] string, int offset, int len)
      Read 8 bit data (in ISO-8859-1 codepage) into a (unicode) Java String and return. (In Excel terms, read compressed 8 bit unicode as a string)
      Parameters:
      string - byte array to read
      offset - the offset into the byte array at which to start reading
      len - the number of bytes to read from the byte array
      Returns:
      the String instance generated by reading the byte array
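      Because ISO-8859-1 maps each byte value directly to the Unicode code point with the same number, the read can be sketched with the JDK alone. The class CompressedUnicodeSketch below is a hypothetical stand-in, not the StringUtil implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for getFromCompressedUnicode; JDK only.
final class CompressedUnicodeSketch {

    // Reads len bytes starting at offset, one character per byte.
    static String decode(byte[] data, int offset, int len) {
        return new String(data, offset, len, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = { 0x48, (byte) 0xE9 }; // 'H', then U+00E9 (e with acute)
        System.out.println(decode(data, 0, 2).length()); // prints 2
    }
}
```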
    • readCompressedUnicode

      public static java.lang.String readCompressedUnicode​(LittleEndianInput in, int nChars)
    • readUnicodeString

      public static java.lang.String readUnicodeString​(LittleEndianInput in)
      InputStream in is expected to contain:
      1. ushort nChars
      2. byte is16BitFlag
      3. byte[]/char[] characterData
      For this encoding, the is16BitFlag is always present even if nChars==0. This structure is also known as an XLUnicodeString.
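      The three-field layout above can be sketched as follows. This is an illustration under stated assumptions, not the StringUtil code: a java.nio.ByteBuffer stands in for LittleEndianInput, the class XlUnicodeStringReadSketch is hypothetical, and the sketch assumes the low bit of is16BitFlag selects 16-bit data.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for the nChars / is16BitFlag / characterData layout.
final class XlUnicodeStringReadSketch {

    static String read(ByteBuffer in) {
        in.order(ByteOrder.LITTLE_ENDIAN);
        int nChars = in.getShort() & 0xFFFF;      // 1. ushort nChars
        boolean is16Bit = (in.get() & 0x01) != 0; // 2. byte is16BitFlag (assumed: low bit)
        byte[] raw = new byte[is16Bit ? nChars * 2 : nChars];
        in.get(raw);                              // 3. characterData
        return new String(raw,
                is16Bit ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] record = { 0x02, 0x00, 0x00, 'H', 'i' };
        System.out.println(read(ByteBuffer.wrap(record))); // prints "Hi"
    }
}
```

      The flag byte is consumed even when nChars is 0, which matches the note that it is always present.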
    • readUnicodeString

      public static java.lang.String readUnicodeString​(LittleEndianInput in, int nChars)
      InputStream in is expected to contain:
      1. byte is16BitFlag
      2. byte[]/char[] characterData
      For this encoding, the is16BitFlag is always present even if nChars==0.
      This method should be used when the nChars field is not stored as a ushort immediately before the is16BitFlag. Otherwise, readUnicodeString(LittleEndianInput) can be used.
    • writeUnicodeString

      public static void writeUnicodeString​(LittleEndianOutput out, java.lang.String value)
      OutputStream out will get:
      1. ushort nChars
      2. byte is16BitFlag
      3. byte[]/char[] characterData
      For this encoding, the is16BitFlag is always present even if nChars==0.
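      The output layout listed above can be sketched with a ByteArrayOutputStream standing in for LittleEndianOutput. The class XlUnicodeStringWriteSketch is hypothetical, and the rule used to choose between 8-bit and 16-bit data (any char above 0xFF) is an assumption made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical writer for the nChars / is16BitFlag / characterData layout.
final class XlUnicodeStringWriteSketch {

    // Assumed rule: use 16-bit data if any char does not fit in one byte.
    static boolean needs16Bit(String value) {
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0xFF) {
                return true;
            }
        }
        return false;
    }

    // Does not guard against strings longer than 65535 chars.
    static byte[] write(String value) {
        boolean is16Bit = needs16Bit(value);
        byte[] data = value.getBytes(
                is16Bit ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int nChars = value.length();
        out.write(nChars & 0xFF);          // 1. ushort nChars, low byte first
        out.write((nChars >>> 8) & 0xFF);
        out.write(is16Bit ? 0x01 : 0x00);  // 2. byte is16BitFlag
        out.write(data, 0, data.length);   // 3. characterData
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(write("Hi").length); // prints 5
    }
}
```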
    • writeUnicodeStringFlagAndData

      public static void writeUnicodeStringFlagAndData​(LittleEndianOutput out, java.lang.String value)
      OutputStream out will get:
      1. byte is16BitFlag
      2. byte[]/char[] characterData
      For this encoding, the is16BitFlag is always present even if nChars==0.
      This method should be used when the nChars field is not stored as a ushort immediately before the is16BitFlag. Otherwise, writeUnicodeString(LittleEndianOutput, String) can be used.
    • getEncodedSize

      public static int getEncodedSize​(java.lang.String value)
      Returns:
      the number of bytes that would be written by writeUnicodeString(LittleEndianOutput, String)
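      Given the layout written by writeUnicodeString (ushort nChars, byte is16BitFlag, then the character data), the size can be worked out as 2 + 1 + nChars bytes for 8-bit data, or 2 + 1 + 2 * nChars bytes for 16-bit data. The sketch below assumes 16-bit data is chosen whenever any char exceeds 0xFF; the class EncodedSizeSketch is hypothetical.

```java
// Hypothetical size calculation for the nChars / flag / data layout.
final class EncodedSizeSketch {

    static int encodedSize(String value) {
        // Assumed rule: 16-bit data if any char does not fit in one byte.
        boolean is16Bit = value.chars().anyMatch(c -> c > 0xFF);
        int header = 2 + 1; // ushort nChars + byte is16BitFlag
        return header + value.length() * (is16Bit ? 2 : 1);
    }

    public static void main(String[] args) {
        System.out.println(encodedSize("Hi"));       // prints 5
        System.out.println(encodedSize("A\u20AC"));  // prints 7
    }
}
```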
    • putCompressedUnicode

      public static void putCompressedUnicode​(java.lang.String input, byte[] output, int offset)
      Takes a unicode (Java) string and writes it as 8-bit data (in ISO-8859-1 codepage) into the supplied byte array. (In Excel terms, write compressed 8-bit unicode)
      Parameters:
      input - the String containing the data to be written
      output - the byte array to which the data is to be written
      offset - the offset into the byte array at which the written data starts
    • putCompressedUnicode

      public static void putCompressedUnicode​(java.lang.String input, LittleEndianOutput out)
    • putUnicodeLE

      public static void putUnicodeLE​(java.lang.String input, byte[] output, int offset)
      Takes a unicode string and writes it as little endian (most significant byte last) bytes into the supplied byte array. (In Excel terms, write uncompressed unicode)
      Parameters:
      input - the String containing the unicode data to be written
      output - the byte array to hold the uncompressed unicode, should be twice the length of the String
      offset - the offset to start writing into the byte array
    • putUnicodeLE

      public static void putUnicodeLE​(java.lang.String input, LittleEndianOutput out)
    • readUnicodeLE

      public static java.lang.String readUnicodeLE​(LittleEndianInput in, int nChars)
    • format

      public static java.lang.String format​(java.lang.String message, java.lang.Object[] params)
      Apply printf() like formatting to a string. Primarily used for logging.
      Parameters:
      message - the string with embedded formatting info, e.g. "This is a test %2.2"
      params - array of values to format into the string
      Returns:
      The formatted string
    • getPreferredEncoding

      public static java.lang.String getPreferredEncoding()
      Returns:
      the encoding we want to use, currently hardcoded to ISO-8859-1
    • hasMultibyte

      public static boolean hasMultibyte​(java.lang.String value)
      Checks whether the given string contains at least one multibyte character.
      Parameters:
      value - string to check
      Returns:
      true if the string has at least one multibyte character, false otherwise
    • isUnicodeString

      public static boolean isUnicodeString​(java.lang.String value)
      Checks to see if a given String needs to be represented as Unicode
      Parameters:
      value - the string to check
      Returns:
      true if string needs Unicode to be represented.
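      One way to express "needs Unicode" is to ask whether every character fits in ISO-8859-1, the 8-bit codepage used elsewhere in this class. The sketch below uses the JDK's CharsetEncoder for that check; it is an assumption about the intent, not the StringUtil implementation, and the class NeedsUnicodeSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check: does the string contain chars outside ISO-8859-1?
final class NeedsUnicodeSketch {

    static boolean needsUnicode(String value) {
        // canEncode returns false as soon as one char has no 8-bit mapping.
        return !StandardCharsets.ISO_8859_1.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        System.out.println(needsUnicode("caf\u00E9")); // prints false
        System.out.println(needsUnicode("\u20AC"));    // prints true
    }
}
```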
    • mapMsCodepointString

      public static java.lang.String mapMsCodepointString​(java.lang.String string)
      Some strings may contain encoded characters of the unicode private use area. Currently the characters of the symbol fonts are mapped to the corresponding characters in the normal unicode range.
      Parameters:
      string - the original string
      Returns:
      the string with mapped characters
      See Also:
      Private Use Area (symbol), Symbol font - Unicode alternatives for Greek and special characters in HTML
    • mapMsCodepoint

      public static void mapMsCodepoint​(int msCodepoint, int unicodeCodepoint)
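      The relationship between mapMsCodepoint and mapMsCodepointString can be pictured as a lookup table from private use area code points to ordinary Unicode code points, applied character by character. The sketch below is a guess at that shape, not the actual implementation: the class MsCodepointMapSketch is hypothetical, and the single mapping registered in main (0xF0B7 to U+2022, a bullet) is chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a registered code point mapping; JDK only.
final class MsCodepointMapSketch {

    private static final Map<Integer, Integer> MAP = new HashMap<>();

    // Registers a replacement for one private use area code point.
    static void map(int msCodepoint, int unicodeCodepoint) {
        MAP.put(msCodepoint, unicodeCodepoint);
    }

    // Replaces every registered code point; all others pass through unchanged.
    static String mapString(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(MAP.getOrDefault(cp, cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        map(0xF0B7, 0x2022); // illustrative entry only
        System.out.println(mapString("\uF0B7 item").equals("\u2022 item")); // prints true
    }
}
```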