How to get the category name of the character type in Java?
Character.getType(int codePoint) returns an int, and I couldn't find a way to
get the Unicode category name, such as "Lu" or "Cn", out of it.
What I want is a method such as Character.getCategoryTypeName(int codePoint) that returns a String representing the type.
The category names appear in the comments on the constants, so one way would be to write a switch on the returned type and encode each name by hand.
My original plan was something like this:
Map<Integer, String> map = new HashMap<>();
for (int i = 0; i <= 0x10FFFF; i++) {
switch (Character.getType(i)) {
// General category "Sc" in the Unicode specification.
// public static final byte CURRENCY_SYMBOL = 26;
case Character.CURRENCY_SYMBOL:
map.put(i, "Sc");
break;
....
}
}
However, this would be very tedious. Is there an automatic way, or a library, to do the task?
Solution 1:[1]
As commented, there seems to be no such functionality currently bundled with Java. As Marcono1234 commented, a feature-request is on the books but is not yet implemented.
Note to self: If I ever rub elbows with Brian Goetz or Mark Reinhold at a conference, ask/plead/beg them for a major revamp of working with code points & Unicode in Java: Project Papyrus.
I whipped up a couple ways to produce your desired two-letter name for each of the thirty "General Category" items defined by Unicode. (This two-letter name is called an “alias” in the Unicode spec.) One of my implementations is merely a switch hard-coded for each alias. The other is more complicated, defining a couple of enums.
Both of these implementations were created by me merely as an exercise. I have not used them in production. I am not saying they are the best route, but hopefully they might prove useful or at least inspire a better effort by someone else.
Basic
In Java 14+, use a switch expression over the "General Category" constants defined on the Character class. If unfamiliar, see JEP 361: Switch Expressions.
public String unicodeGeneralCategoryAliasForCodePoint ( int codePoint ) {
return switch ( Character.getType( codePoint ) ) {
// L, Letter
case Character.UPPERCASE_LETTER -> "Lu";
case Character.LOWERCASE_LETTER -> "Ll";
case Character.TITLECASE_LETTER -> "Lt";
case Character.MODIFIER_LETTER -> "Lm";
case Character.OTHER_LETTER -> "Lo";
// M, Mark
case Character.NON_SPACING_MARK -> "Mn";
case Character.COMBINING_SPACING_MARK -> "Mc";
case Character.ENCLOSING_MARK -> "Me";
// N, Number
case Character.DECIMAL_DIGIT_NUMBER -> "Nd";
case Character.LETTER_NUMBER -> "Nl";
case Character.OTHER_NUMBER -> "No";
// P, Punctuation
case Character.CONNECTOR_PUNCTUATION -> "Pc";
case Character.DASH_PUNCTUATION -> "Pd";
case Character.START_PUNCTUATION -> "Ps";
case Character.END_PUNCTUATION -> "Pe";
case Character.INITIAL_QUOTE_PUNCTUATION -> "Pi";
case Character.FINAL_QUOTE_PUNCTUATION -> "Pf";
case Character.OTHER_PUNCTUATION -> "Po";
// S, Symbol
case Character.MATH_SYMBOL -> "Sm";
case Character.CURRENCY_SYMBOL -> "Sc";
case Character.MODIFIER_SYMBOL -> "Sk";
case Character.OTHER_SYMBOL -> "So";
// Z, Separator
case Character.SPACE_SEPARATOR -> "Zs";
case Character.LINE_SEPARATOR -> "Zl";
case Character.PARAGRAPH_SEPARATOR -> "Zp";
// C, Other
case Character.CONTROL -> "Cc";
case Character.FORMAT -> "Cf";
case Character.SURROGATE -> "Cs";
case Character.PRIVATE_USE -> "Co";
case Character.UNASSIGNED -> "Cn";
default -> "ERROR - Unexpected General Category type for code point " + codePoint + ". Message # 5d44e5fd-d60e-4b02-9431-ad57c56657f5.";
};
}
Usage:
String alias = x.unicodeGeneralCategoryAliasForCodePoint( 65 );
Lu
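One way to sanity-check these aliases is Java's regex engine, which accepts general-category aliases in property classes such as \p{Lu}. The sketch below is not part of the answer above: the class name AliasRegexCheck and the method matchesAlias are hypothetical, and it assumes Java 11+ for Character.toString( int ).

```java
import java.util.regex.Pattern;

// Hypothetical helper for cross-checking a two-letter alias against java.util.regex.
final class AliasRegexCheck {

    // Returns true when the single code point matches the property class \p{alias}.
    // The alias is assumed to be one of the two-letter names, such as "Lu" or "Sc".
    static boolean matchesAlias ( int codePoint , String alias ) {
        String text = Character.toString( codePoint );  // Java 11+
        return Pattern.matches( "\\p{" + alias + "}" , text );
    }

    public static void main ( String[] args ) {
        System.out.println( matchesAlias( 65 , "Lu" ) );   // 'A' against uppercase letter
        System.out.println( matchesAlias( 97 , "Lu" ) );   // 'a' against uppercase letter
        System.out.println( matchesAlias( '$' , "Sc" ) );  // dollar sign against currency symbol
    }
}
```

If the alias returned by the switch is passed as the second argument, a false result would suggest a typo in one of the cases.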
Deluxe
In this alternate approach, I defined a pair of enums:
UnicodeGeneralCategory
30 objects, one for each of the General Category items defined on pages 170-172 of section 4.5 of the Unicode 13 spec. These items are listed on this Wikipedia page.
UnicodeMajorClass
Collects the UnicodeGeneralCategory objects into the groups defined by the Unicode spec: letter, mark, number, punctuation, symbol, separator, and other. That last one, "other", is noteworthy in that it covers the non-printable “control” characters as well as the vast majority of code points not assigned to any character. When looping over all the possible code points, we want to skip these.
The names of my enum objects in UnicodeGeneralCategory are copied from the subset of constants declared on the Character class whose description starts with the phrase “General category”. These constant names differ somewhat from the official Unicode names, but are close enough. I defined these in the same order as listed in the Unicode spec.
package work.basil.unicode.category;
import java.util.Arrays;
import java.util.Optional;
// For more info about Unicode General Category, see section 4.5 of the Unicode 13.0 spec, pages 170-172.
// https://www.unicode.org/versions/Unicode13.0.0/ch04.pdf
public enum UnicodeGeneralCategory {
// See the Wikipedia page listing the General Category values defined in Unicode 13.
// L, Letter
UPPERCASE_LETTER( Character.UPPERCASE_LETTER , "Lu" , "Letter" , "uppercase" ),
LOWERCASE_LETTER( Character.LOWERCASE_LETTER , "Ll" , "Letter" , "lowercase" ),
TITLECASE_LETTER( Character.TITLECASE_LETTER , "Lt" , "Letter" , "titlecase" ),
MODIFIER_LETTER( Character.MODIFIER_LETTER , "Lm" , "Letter" , "modifier" ),
OTHER_LETTER( Character.OTHER_LETTER , "Lo" , "Letter" , "other" ),
// M, Mark
NON_SPACING_MARK( Character.NON_SPACING_MARK , "Mn" , "Mark" , "nonspacing" ),
COMBINING_SPACING_MARK( Character.COMBINING_SPACING_MARK , "Mc" , "Mark" , "spacing combining" ),
ENCLOSING_MARK( Character.ENCLOSING_MARK , "Me" , "Mark" , "enclosing" ),
// N, Number
DECIMAL_DIGIT_NUMBER( Character.DECIMAL_DIGIT_NUMBER , "Nd" , "Number" , "decimal digit" ),
LETTER_NUMBER( Character.LETTER_NUMBER , "Nl" , "Number" , "letter" ),
OTHER_NUMBER( Character.OTHER_NUMBER , "No" , "Number" , "other" ),
// P, Punctuation
CONNECTOR_PUNCTUATION( Character.CONNECTOR_PUNCTUATION , "Pc" , "Punctuation" , "connector" ),
DASH_PUNCTUATION( Character.DASH_PUNCTUATION , "Pd" , "Punctuation" , "dash" ),
START_PUNCTUATION( Character.START_PUNCTUATION , "Ps" , "Punctuation" , "open" ),
END_PUNCTUATION( Character.END_PUNCTUATION , "Pe" , "Punctuation" , "close" ),
INITIAL_QUOTE_PUNCTUATION( Character.INITIAL_QUOTE_PUNCTUATION , "Pi" , "Punctuation" , "initial quote" ),
FINAL_QUOTE_PUNCTUATION( Character.FINAL_QUOTE_PUNCTUATION , "Pf" , "Punctuation" , "final quote" ),
OTHER_PUNCTUATION( Character.OTHER_PUNCTUATION , "Po" , "Punctuation" , "other" ),
// S, Symbol
MATH_SYMBOL( Character.MATH_SYMBOL , "Sm" , "Symbol" , "math" ),
CURRENCY_SYMBOL( Character.CURRENCY_SYMBOL , "Sc" , "Symbol" , "currency" ),
MODIFIER_SYMBOL( Character.MODIFIER_SYMBOL , "Sk" , "Symbol" , "modifier" ),
OTHER_SYMBOL( Character.OTHER_SYMBOL , "So" , "Symbol" , "other" ),
// Z, Separator
SPACE_SEPARATOR( Character.SPACE_SEPARATOR , "Zs" , "Separator" , "space" ),
LINE_SEPARATOR( Character.LINE_SEPARATOR , "Zl" , "Separator" , "line" ),
PARAGRAPH_SEPARATOR( Character.PARAGRAPH_SEPARATOR , "Zp" , "Separator" , "paragraph" ),
// C, Other
CONTROL( Character.CONTROL , "Cc" , "Other" , "control" ),
FORMAT( Character.FORMAT , "Cf" , "Other" , "format" ),
SURROGATE( Character.SURROGATE , "Cs" , "Other" , "surrogate" ),
PRIVATE_USE( Character.PRIVATE_USE , "Co" , "Other" , "private use" ),
UNASSIGNED( Character.UNASSIGNED , "Cn" , "Other" , "not assigned" );
// Fields.
private byte characterClassConstantForGeneralCategory;
private String alias, major, minor;
// Constructor.
UnicodeGeneralCategory ( byte characterClassConstantForGeneralCategory , String alias , String major , String minor ) {
this.characterClassConstantForGeneralCategory = characterClassConstantForGeneralCategory;
this.alias = alias;
this.major = major;
this.minor = minor;
}
public static UnicodeGeneralCategory forCodePoint ( int codePoint ) {
if ( ! Character.isValidCodePoint( codePoint ) ) {
throw new IllegalArgumentException( "Code point " + codePoint + " is invalid. Must be within 0 to U+10FFFF ( 1,114,111 ) inclusive." );
}
Optional < UnicodeGeneralCategory > optionalUnicodeGeneralCategory = Arrays.stream( UnicodeGeneralCategory.values() ).filter( category -> category.characterClassConstantForGeneralCategory == Character.getType( codePoint ) ).findAny();
if ( optionalUnicodeGeneralCategory.isEmpty() ) {
throw new IllegalStateException( "No general category defined in this enum matching `Character.getType( codePoint )`: " + Character.getType( codePoint ) );
} else {
return optionalUnicodeGeneralCategory.get();
}
}
public static UnicodeGeneralCategory forAlias ( String abbrev ) {
Optional < UnicodeGeneralCategory > optionalUnicodeGeneralCategory = Arrays.stream( UnicodeGeneralCategory.values() ).filter( category -> category.alias.equals( abbrev ) ).findAny();
if ( optionalUnicodeGeneralCategory.isEmpty() ) {
throw new IllegalArgumentException( "No general category defined in this enum for abbreviation " + abbrev );
} else {
return optionalUnicodeGeneralCategory.get();
}
}
// Getters
public String getAlias () {
return this.alias;
}
public String getMajor () {
return this.major;
}
public String getMinor () {
return this.minor;
}
public byte getCharacterClassConstant () {
return this.characterClassConstantForGeneralCategory;
}
public String getDisplayName () {
return this.alias + " – " + this.major + ", " + this.minor;
}
}
… and …
package work.basil.unicode.category;
import java.util.EnumSet;
import java.util.Set;
public enum UnicodeMajorClass {
L_LETTER( "L" , "Letter" , EnumSet.of( UnicodeGeneralCategory.UPPERCASE_LETTER , UnicodeGeneralCategory.LOWERCASE_LETTER , UnicodeGeneralCategory.TITLECASE_LETTER , UnicodeGeneralCategory.MODIFIER_LETTER , UnicodeGeneralCategory.OTHER_LETTER ) ),
M_MARK( "M" , "Mark" , EnumSet.of( UnicodeGeneralCategory.NON_SPACING_MARK , UnicodeGeneralCategory.COMBINING_SPACING_MARK , UnicodeGeneralCategory.ENCLOSING_MARK ) ),
N_NUMBER( "N" , "Number" , EnumSet.of( UnicodeGeneralCategory.DECIMAL_DIGIT_NUMBER , UnicodeGeneralCategory.LETTER_NUMBER , UnicodeGeneralCategory.OTHER_NUMBER ) ),
P_PUNCTUATION( "P" , "Punctuation" , EnumSet.of( UnicodeGeneralCategory.CONNECTOR_PUNCTUATION , UnicodeGeneralCategory.DASH_PUNCTUATION , UnicodeGeneralCategory.START_PUNCTUATION , UnicodeGeneralCategory.END_PUNCTUATION , UnicodeGeneralCategory.INITIAL_QUOTE_PUNCTUATION , UnicodeGeneralCategory.FINAL_QUOTE_PUNCTUATION , UnicodeGeneralCategory.OTHER_PUNCTUATION ) ),
S_SYMBOL( "S" , "Symbol" , EnumSet.of( UnicodeGeneralCategory.MATH_SYMBOL , UnicodeGeneralCategory.CURRENCY_SYMBOL , UnicodeGeneralCategory.MODIFIER_SYMBOL , UnicodeGeneralCategory.OTHER_SYMBOL ) ),
Z_SEPARATOR( "Z" , "Separator" , EnumSet.of( UnicodeGeneralCategory.SPACE_SEPARATOR , UnicodeGeneralCategory.LINE_SEPARATOR , UnicodeGeneralCategory.PARAGRAPH_SEPARATOR ) ),
C_OTHER( "C" , "Other" , EnumSet.of( UnicodeGeneralCategory.CONTROL , UnicodeGeneralCategory.FORMAT , UnicodeGeneralCategory.SURROGATE , UnicodeGeneralCategory.PRIVATE_USE , UnicodeGeneralCategory.UNASSIGNED ) );
private String alias;
private String name;
private Set < UnicodeGeneralCategory > categories;
UnicodeMajorClass ( String alias , String name , Set < UnicodeGeneralCategory > categories ) {
this.alias = alias;
this.name = name;
this.categories = categories;
}
public String getAlias () {
return alias;
}
public String getName () {
return name;
}
public Set < UnicodeGeneralCategory > getCategories () {
return categories;
}
public String getDisplayName () {
return this.alias + " – " + this.name;
}
public boolean coversCodePoint ( int codePoint ) {
return this.getCategories().contains( UnicodeGeneralCategory.forCodePoint( codePoint ) );
}
}
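The forCodePoint method above streams over every enum value on each call, which may matter when looping over all 1,114,112 code points. A possible alternative, sketched here under assumptions, is a table indexed by the value returned from Character.getType and filled once. The class GeneralCategoryAliasTable and its methods are hypothetical names, and the table size of 31 assumes the Character constants stay within 0 to 30 as they do today.

```java
// Hypothetical helper, not part of the enums above: alias lookup by array index.
final class GeneralCategoryAliasTable {

    // Assumption: Character's general-category constants range from 0 to 30
    // (the value 17 is unused), so 31 slots are enough.
    private static final String[] ALIAS_BY_TYPE = new String[ 31 ];

    static {
        put( Character.UPPERCASE_LETTER , "Lu" );  put( Character.LOWERCASE_LETTER , "Ll" );
        put( Character.TITLECASE_LETTER , "Lt" );  put( Character.MODIFIER_LETTER , "Lm" );
        put( Character.OTHER_LETTER , "Lo" );
        put( Character.NON_SPACING_MARK , "Mn" );  put( Character.COMBINING_SPACING_MARK , "Mc" );
        put( Character.ENCLOSING_MARK , "Me" );
        put( Character.DECIMAL_DIGIT_NUMBER , "Nd" );  put( Character.LETTER_NUMBER , "Nl" );
        put( Character.OTHER_NUMBER , "No" );
        put( Character.CONNECTOR_PUNCTUATION , "Pc" );  put( Character.DASH_PUNCTUATION , "Pd" );
        put( Character.START_PUNCTUATION , "Ps" );  put( Character.END_PUNCTUATION , "Pe" );
        put( Character.INITIAL_QUOTE_PUNCTUATION , "Pi" );  put( Character.FINAL_QUOTE_PUNCTUATION , "Pf" );
        put( Character.OTHER_PUNCTUATION , "Po" );
        put( Character.MATH_SYMBOL , "Sm" );  put( Character.CURRENCY_SYMBOL , "Sc" );
        put( Character.MODIFIER_SYMBOL , "Sk" );  put( Character.OTHER_SYMBOL , "So" );
        put( Character.SPACE_SEPARATOR , "Zs" );  put( Character.LINE_SEPARATOR , "Zl" );
        put( Character.PARAGRAPH_SEPARATOR , "Zp" );
        put( Character.CONTROL , "Cc" );  put( Character.FORMAT , "Cf" );
        put( Character.SURROGATE , "Cs" );  put( Character.PRIVATE_USE , "Co" );
        put( Character.UNASSIGNED , "Cn" );
    }

    // Stores one alias in the slot named by the Character constant.
    private static void put ( byte type , String alias ) {
        ALIAS_BY_TYPE[ type ] = alias;
    }

    // Looks up the alias for a code point; throws if the type has no slot.
    static String aliasFor ( int codePoint ) {
        int type = Character.getType( codePoint );
        if ( type < 0 || type >= ALIAS_BY_TYPE.length || ALIAS_BY_TYPE[ type ] == null ) {
            throw new IllegalStateException( "No alias known for Character.getType value " + type );
        }
        return ALIAS_BY_TYPE[ type ];
    }
}
```

The same idea could back forCodePoint by storing enum objects in the array instead of strings. Whether the extra table is worth it depends on how often the lookup runs.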
Usages:
UnicodeGeneralCategory.forCodePoint( yourCodePointGoesHere ).getAlias() is what you asked for at the top of your Question.
UnicodeMajorClass.C_OTHER.coversCodePoint( codePoint ) identifies those pesky non-printable/unassigned code points so that you can skip over them.
Also, to get a String of a single character being represented by a code point number, call Character.toString( codePoint ).
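A minimal sketch of that call follows, assuming Java 11 or later, where the Character.toString( int ) overload is available. The class name CodePointText and the method textOf are hypothetical. Note that a code point above U+FFFF yields a String of two char values (a surrogate pair) that still counts as one code point.

```java
// Hypothetical wrapper showing code point to String conversion.
final class CodePointText {

    // Returns the text for one code point; Character.toString( int ) needs Java 11+.
    static String textOf ( int codePoint ) {
        return Character.toString( codePoint );
    }

    public static void main ( String[] args ) {
        String basic = textOf( 65 );            // U+0041 LATIN CAPITAL LETTER A
        String astral = textOf( 0x1F600 );      // U+1F600, outside the Basic Multilingual Plane
        System.out.println( basic + " has length " + basic.length() );
        System.out.println( astral + " has length " + astral.length()
                + " but code point count " + astral.codePointCount( 0 , astral.length() ) );
    }
}
```

On older Java versions, new String( Character.toChars( codePoint ) ) should give the same result.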
We can use those two enums to report on all code points.
package work.basil.text;
import work.basil.unicode.category.UnicodeMajorClass;
public class DumpCharacters {
public static void main ( String[] args ) {
System.out.println( "INFO - Demo starting. " );
for ( int codePoint = 0 ; codePoint <= Character.MAX_CODE_POINT ; codePoint++ ) {
if ( Character.isValidCodePoint( codePoint ) ) // If code point is valid.
{
if ( UnicodeMajorClass.C_OTHER.coversCodePoint( codePoint ) ) // If in major class C: control, format, surrogate, private use, or unassigned.
{
// No code needed. Skip over this code point as it is not a printable character.
} else {
System.out.println( codePoint + " code point is named: " + Character.getName( codePoint ) + " = " + Character.toString( codePoint ) );
}
} else {
System.out.println( "ERROR - Invalid code point number: " + codePoint );
}
}
System.out.println( "INFO - Demo ending. " );
}
}
When run:
INFO - Demo starting.
32 code point is named: SPACE =
33 code point is named: EXCLAMATION MARK = !
34 code point is named: QUOTATION MARK = "
35 code point is named: NUMBER SIGN = #
36 code point is named: DOLLAR SIGN = $
37 code point is named: PERCENT SIGN = %
38 code point is named: AMPERSAND = &
…
122 code point is named: LATIN SMALL LETTER Z = z
123 code point is named: LEFT CURLY BRACKET = {
124 code point is named: VERTICAL LINE = |
125 code point is named: RIGHT CURLY BRACKET = }
126 code point is named: TILDE = ~
160 code point is named: NO-BREAK SPACE =
161 code point is named: INVERTED EXCLAMATION MARK = ¡
…
917998 code point is named: VARIATION SELECTOR-255 = ?
917999 code point is named: VARIATION SELECTOR-256 = ?
INFO - Demo ending.
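For a more compact report than a full dump, one could tally how many code points fall into each general category. The sketch below is an illustration only: the class CategoryTally and method countByType are hypothetical names, it uses Character.getType directly rather than the enums above, and it assumes the returned values stay within 0 to 30. Most counts depend on the Unicode version bundled with the running JDK.

```java
// Hypothetical report helper: counts code points per Character.getType value.
final class CategoryTally {

    // Returns an array where index t holds the number of code points whose
    // Character.getType value is t. Assumes type values lie within 0 to 30.
    static int[] countByType () {
        int[] counts = new int[ 31 ];
        for ( int codePoint = Character.MIN_CODE_POINT ; codePoint <= Character.MAX_CODE_POINT ; codePoint++ ) {
            counts[ Character.getType( codePoint ) ]++;
        }
        return counts;
    }

    public static void main ( String[] args ) {
        int[] counts = countByType();
        System.out.println( "Lu (uppercase letters): " + counts[ Character.UPPERCASE_LETTER ] );
        System.out.println( "Sc (currency symbols): " + counts[ Character.CURRENCY_SYMBOL ] );
        System.out.println( "Cn (unassigned): " + counts[ Character.UNASSIGNED ] );
    }
}
```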
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
