Read next character (full Unicode code point) from Java input stream
I need to parse UTF-8 input (from a text file) character by character, where by "character" I mean a full Unicode code point, not Java's 16-bit char.
What approach should I use?
Solution 1:[1]
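Try this. It wraps a Reader and combines each surrogate pair into a single code point. The Reader itself has to decode the file as UTF-8, for example an InputStreamReader created with StandardCharsets.UTF_8.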
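
    import java.io.IOException;
    import java.io.Reader;

    public class CodePointReader {
        private final Reader in;

        public CodePointReader(Reader in) {
            this.in = in;
        }

        /** Returns the next Unicode code point, or -1 at end of input. */
        public int read() throws IOException {
            int first = in.read();
            if (first == -1)
                return -1;
            // Not a high surrogate: return the char as is (a BMP code point;
            // an unpaired low surrogate also passes through unchanged).
            if (!Character.isHighSurrogate((char) first))
                return first;
            // A high surrogate must be followed by a low surrogate.
            int second = in.read();
            if (second == -1)
                throw new IOException("low surrogate expected after %d".formatted(first));
            if (!Character.isLowSurrogate((char) second))
                throw new IOException("invalid surrogate pair (%d, %d)".formatted(first, second));
            return Character.toCodePoint((char) first, (char) second);
        }
    }
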
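and a test:

    @Test
    public void testCodePointReader() throws IOException {
        // Any two supplementary characters work here (U+1F600 and U+1F601 are used).
        // Each occupies two Java chars, so the second code point starts at index 2.
        String s = "\uD83D\uDE00\uD83D\uDE01";
        CodePointReader reader = new CodePointReader(new StringReader(s));
        assertEquals(s.codePointAt(0), reader.read());
        assertEquals(s.codePointAt(2), reader.read());
        assertEquals(-1, reader.read());
    }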
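
Since the question starts from a byte stream rather than a Reader, here is a minimal, self-contained sketch of the whole path from UTF-8 bytes to code points. It assumes the same surrogate-pairing logic as above, folded into a static helper. The names `CodePointDemo`, `readCodePoint`, and `readAll` are made up for this illustration, and the in-memory byte array stands in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CodePointDemo {

    // Hypothetical helper: reads one code point from a Reader, or -1 at end of input.
    static int readCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first;
        }
        int second = in.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            throw new IOException(
                "unpaired high surrogate U+" + Integer.toHexString(first).toUpperCase());
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    // Hypothetical helper: decodes the stream as UTF-8 and collects every code point.
    static List<Integer> readAll(InputStream bytes) throws IOException {
        List<Integer> codePoints = new ArrayList<>();
        try (Reader in = new InputStreamReader(bytes, StandardCharsets.UTF_8)) {
            int cp;
            while ((cp = readCodePoint(in)) != -1) {
                codePoints.add(cp);
            }
        }
        return codePoints;
    }

    public static void main(String[] args) throws IOException {
        // 'a' is one UTF-8 byte; U+1F600 is four bytes and one surrogate pair.
        byte[] utf8 = "a\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        for (int cp : readAll(new ByteArrayInputStream(utf8))) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

For an actual text file, the Reader would more likely come from `Files.newBufferedReader(path, StandardCharsets.UTF_8)`, which also adds buffering so the char-at-a-time reads stay cheap.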
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
