Read next character (full Unicode code point) from Java input stream

I need to parse UTF-8 input (from a text file) character by character, where "character" means a full Unicode code point, not a Java char.

What approach should I use?



Solution 1:[1]

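Wrap a Reader and combine surrogate pairs yourself: read one char, and if it is a high surrogate, read the next char and merge the two with Character.toCodePoint.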

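import java.io.IOException;
import java.io.Reader;

public class CodePointReader {

    private final Reader in;

    public CodePointReader(Reader in) {
        this.in = in;
    }

    // Returns the next Unicode code point, or -1 at end of input.
    // Note: String.formatted requires Java 15 or later.
    public int read() throws IOException {
        int first = in.read();
        if (first == -1)
            return -1;
        // Anything that is not a high surrogate is a complete code point on its own.
        if (!Character.isHighSurrogate((char)first))
            return first;
        // A high surrogate must be followed by a low surrogate.
        int second = in.read();
        if (second == -1)
            throw new IOException("low surrogate expected after %d".formatted(first));
        if (!Character.isLowSurrogate((char)second))
            throw new IOException("invalid surrogate pair (%d, %d)".formatted(first, second));
        return Character.toCodePoint((char)first, (char)second);
    }
}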

and

@Test
public void testCodePointReader() throws IOException {
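    // Two supplementary code points (U+1F600, U+1F601); each occupies two chars in UTF-16.
    String s = "\uD83D\uDE00\uD83D\uDE01";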
    CodePointReader reader = new CodePointReader(new StringReader(s));
    assertEquals(s.codePointAt(0), reader.read());
    assertEquals(s.codePointAt(2), reader.read());
    assertEquals(-1, reader.read());
}
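
Since the question starts from a UTF-8 byte stream rather than a Reader, the same idea can be applied by decoding the stream with an InputStreamReader first. The following is a minimal sketch, not part of the original answer: the class name CodePointStreams and the method readAll are hypothetical, and it assumes the whole input can be collected into a list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; the name is not from the answer above.
public class CodePointStreams {

    // Decodes the stream as UTF-8 and returns every code point in order.
    public static List<Integer> readAll(InputStream in) throws IOException {
        List<Integer> codePoints = new ArrayList<>();
        // InputStreamReader turns UTF-8 bytes into UTF-16 chars;
        // BufferedReader avoids hitting the underlying stream for every char.
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int first;
            while ((first = reader.read()) != -1) {
                if (!Character.isHighSurrogate((char) first)) {
                    // A single char that is already a complete code point.
                    codePoints.add(first);
                    continue;
                }
                // A high surrogate must be completed by a low surrogate.
                int second = reader.read();
                if (second == -1 || !Character.isLowSurrogate((char) second)) {
                    throw new IOException("unpaired high surrogate: " + first);
                }
                codePoints.add(Character.toCodePoint((char) first, (char) second));
            }
        }
        return codePoints;
    }
}
```

For a file, something like CodePointStreams.readAll(new FileInputStream("input.txt")) would work, where the file name is only a placeholder. For input that is too large to hold in memory, the streaming read() of CodePointReader above is probably the better fit.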

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1