Sunday, 15 June 2014

regex - Detect any combining character in Java


I'm looking for a way to find out whether a character in a Java string is a combining character or not, for example:

  String khmerCombiningVowel = new String(new byte[] {(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // Unicode U+17C0

U+17C0 is one such character. I have tried "\\p{InCombiningDiacriticalMarks}", but it does not match these particular combining characters. Alternatively, if there is a comprehensive list of all the Unicode blocks that contain combining characters, could I build a regex from them? From what I have read, there are several blocks for combining characters.

Java has several useful functions for this. Try:

  String codePointStr = new String(new byte[] {(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // Unicode U+17C0
  // Regex check: general category Mc ("Mark, Spacing Combining")
  System.out.println(codePointStr.matches("\\p{Mc}"));
  // Same check through the Character API
  System.out.println(Character.COMBINING_SPACING_MARK == Character.getType(codePointStr.codePointAt(0)));

(prints true in both cases)

In this case, Character.COMBINING_SPACING_MARK and the related regex \p{gc=Mc} (or just \p{Mc}) both refer to the Unicode general category "Mark, Spacing Combining", which basically covers any character that attaches to the previous character while also adding width.
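To illustrate why the block-based pattern from the question fails while the category-based one works, here is a minimal sketch. The class and method names are hypothetical, chosen only for this example; the patterns themselves are the ones discussed above.

```java
final class BlockVersusCategory {
    // Block-based test: true only for a single code point in U+0300..U+036F.
    static boolean inDiacriticsBlock(String s) {
        return s.matches("\\p{InCombiningDiacriticalMarks}");
    }

    // Category-based test: true for a single "Mark, Spacing Combining" character.
    static boolean isSpacingMark(String s) {
        return s.matches("\\p{gc=Mc}");
    }

    public static void main(String[] args) {
        String khmerVowel = "\u17C0";  // KHMER VOWEL SIGN IE, category Mc
        String acuteAccent = "\u0301"; // COMBINING ACUTE ACCENT, category Mn

        System.out.println(inDiacriticsBlock(khmerVowel));  // false: outside that block
        System.out.println(isSpacingMark(khmerVowel));      // true
        System.out.println(inDiacriticsBlock(acuteAccent)); // true
        System.out.println(isSpacingMark(acuteAccent));     // false: non-spacing mark
    }
}
```

A block name only describes a range of code points, whereas a general category describes what the character does, so the category check carries over to scripts such as Khmer.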

Another regular expression that may be useful is \p{M}, which matches any mark. You can get the same behaviour from Character.getType() by checking whether the type is COMBINING_SPACING_MARK, ENCLOSING_MARK, or NON_SPACING_MARK.

ENCLOSING_MARK is a character that surrounds the one it attaches to, like a circle. It also adds width to that character.

NON_SPACING_MARK covers the combining diacritical marks for the Latin alphabet, etc. (which basically go above or below the character and do not add any width to it).
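The three mark types can be combined into one check. The following is a rough sketch rather than a definitive implementation: the class CombiningMarks and its methods are hypothetical names, and String.codePoints() assumes Java 8 or later.

```java
import java.util.regex.Pattern;

final class CombiningMarks {
    // Matches any character in general category M (Mn, Mc or Me).
    private static final Pattern ANY_MARK = Pattern.compile("\\p{M}");

    // True if the code point is a non-spacing, spacing combining, or enclosing mark.
    static boolean isCombiningMark(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
                return true;
            default:
                return false;
        }
    }

    // Scans the string code point by code point using Character.getType().
    static boolean containsCombiningMark(String s) {
        return s.codePoints().anyMatch(CombiningMarks::isCombiningMark);
    }

    // Regex equivalent: looks for any mark anywhere in the string.
    static boolean containsCombiningMarkRegex(String s) {
        return ANY_MARK.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsCombiningMark("\u17C0"));       // true (Mc)
        System.out.println(containsCombiningMark("e\u0301"));      // true (Mn)
        System.out.println(containsCombiningMark("a\u20DD"));      // true (Me)
        System.out.println(containsCombiningMark("abc"));          // false
        System.out.println(containsCombiningMarkRegex("e\u0301")); // true
    }
}
```

Iterating over code points instead of char values keeps the check correct for marks outside the Basic Multilingual Plane, which are stored as surrogate pairs.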

