How to read Unicode codepoints greater than 0xFFFF from a file in Java

I'm writing a lexical analyzer for a compiler and I was wondering how I can read a UTF-8 file that contains Unicode codepoints greater than 0xFFFF. The char data type is only two bytes (16 bits) wide, so it can't hold such a codepoint by itself. How can I read an int codepoint from the file?



Solution 1:[1]

I had to do this recently; here's the code I used. It's a Spliterator.OfInt implementation that can be used to create an IntStream of codepoints from a Reader, or used directly if that's easier. Alternatively, just extract the logic from the nextCP method.

package org.raevnos.util.iterator;

import java.util.Objects;
import java.util.Spliterator;
import java.util.function.IntConsumer;
import java.io.Reader;
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;

/**
 * A {@code Spliterator.OfInt} used to iterate over codepoints read from a {@code Reader}.
 */
public class CPSpliterator
    implements Spliterator.OfInt, Closeable {
    private final Reader input;

    /**
     * Create a new spliterator.
     * @param input The {@code Reader} to get codepoints from.
     */
    public CPSpliterator(Reader input) {
        this.input = Objects.requireNonNull(input);
    }

    /**
     * Fetch the next codepoint from the underlying stream, accounting for
     * surrogate pairs.
     * @return a codepoint, or -1 on end of file.
     * @throws UncheckedIOException on input errors.
     */
    private int nextCP() {
        try {
            int first_char = input.read();
            if (first_char == -1) {
                return -1;
            } else if (Character.isHighSurrogate((char)first_char)) {
                int second_char = input.read();
                if (second_char == -1
                    || !Character.isLowSurrogate((char)second_char)) {
                    // Unpaired high surrogate. Hopefully shouldn't happen;
                    // the Reader's decoder should catch malformed input first.
                    throw new CharacterCodingException();
                } else {
                    return Character.toCodePoint((char)first_char, (char)second_char);
                }
            } else {
                return first_char;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int characteristics() { return ORDERED | NONNULL; }

    @Override
    public long estimateSize() { return Long.MAX_VALUE; }

    @Override
    public void forEachRemaining(IntConsumer f) {
        int cp;
        while ((cp = nextCP()) != -1) {
            f.accept(cp);
        }
    }

    @Override
    public boolean tryAdvance(IntConsumer f) {
        int cp = nextCP();
        if (cp != -1) {
            f.accept(cp);
            return true;
        } else {
            return false;
        }
    }

    @Override
    public Spliterator.OfInt trySplit() { return null; }

    @Override
    public void close() throws IOException { input.close(); }
}

Example usage:

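// Needs java.nio.file.Files, java.nio.file.Path,
// java.util.stream.IntStream and java.util.stream.StreamSupport.
// whereEver is a placeholder for the path of the file to read.
try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {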
    IntStream codepoints = StreamSupport.intStream(sp, false);
    // do something with the stream
}

or

try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {
    sp.forEachRemaining(cp -> doSomething(cp));
}

etc.
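
If the spliterator machinery is more than you need, the surrogate-pairing logic in nextCP can be pulled out into a small static helper. The following is only a sketch of that idea: the class name CodePointReader and the method name readCodePoint are made up for illustration, and it throws a plain IOException on an unpaired high surrogate instead of wrapping it in an UncheckedIOException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CodePointReader {
    /**
     * Hypothetical standalone version of nextCP: returns the next codepoint
     * from the reader, or -1 at end of input.
     */
    static int readCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of input, or a codepoint that fits in one char
        }
        int second = in.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            throw new IOException("unpaired high surrogate in input");
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    public static void main(String[] args) throws IOException {
        // 'x' followed by U+1F600, which Java stores as a surrogate pair.
        Reader in = new StringReader("x\uD83D\uDE00");
        int cp;
        while ((cp = readCodePoint(in)) != -1) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

A lexer could call such a helper wherever it currently calls Reader.read() and keep the result in an int instead of a char.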

You can also use Files.readString() to read an entire file into a String and call String#codePoints (or the other codepoint methods) on it, but the class above is more memory-efficient, if that matters, because it consumes the input one character at a time instead of holding the whole file in memory. Another option is to read a line at a time and convert each line to codepoints.
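
The line-at-a-time variant might look roughly like the sketch below. The class name LineCodePoints and its codePoints helper are hypothetical, and a StringReader stands in for a real file. Note that BufferedReader.lines() strips line terminators, so a lexer that needs to see newlines would have to add them back itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.stream.IntStream;

public class LineCodePoints {
    /**
     * Hypothetical helper: streams the codepoints of every line in turn.
     * Line terminators are not included, because lines() removes them.
     */
    static IntStream codePoints(BufferedReader reader) {
        return reader.lines().flatMapToInt(String::codePoints);
    }

    public static void main(String[] args) throws IOException {
        // Two lines: 'a' followed by U+1F600, then 'b'.
        String text = "a\uD83D\uDE00\nb";
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            codePoints(reader).forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }
}
```

For a real file, the BufferedReader would come from Files.newBufferedReader, as in the usage examples above. Only one line is held in memory at a time, which sits between the spliterator and Files.readString() in terms of memory use.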

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1