How to read Unicode codepoints greater than 0xFFFF from a file in Java
I'm writing a lexical analyzer for a compiler, and I was wondering how I can read a UTF-8 file that contains Unicode codepoints greater than 0xFFFF. A Java char is only 16 bits wide, so it can't hold such a codepoint on its own. How can I read each codepoint from the file as an int?
Solution 1:[1]
I had to do this recently; here's the code I used. It's a Spliterator.OfInt implementation that can be used to create an IntStream of codepoints from a Reader, or used directly if that's easier. Or just extract the surrogate-pair logic from the nextCP method.
package org.raevnos.util.iterator;

import java.util.Objects;
import java.util.Spliterator;
import java.util.function.IntConsumer;
import java.io.Reader;
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;

/**
 * A {@code Spliterator.OfInt} used to iterate over codepoints read from a file.
 */
public class CPSpliterator
    implements Spliterator.OfInt, Closeable {
    private final Reader input;

    /**
     * Create a new spliterator.
     * @param input The {@code Reader} to get codepoints from.
     */
    public CPSpliterator(Reader input) {
        this.input = Objects.requireNonNull(input);
    }

    /**
     * Fetch the next codepoint from the underlying stream, accounting for
     * surrogate pairs.
     * @return a codepoint, or -1 on end of file.
     * @throws UncheckedIOException on input errors.
     */
    private int nextCP() {
        try {
            int first_char = input.read();
            if (first_char == -1) {
                return -1;
            } else if (Character.isHighSurrogate((char)first_char)) {
                int second_char = input.read();
                if (second_char == -1
                    || !Character.isLowSurrogate((char)second_char)) {
                    // Hopefully shouldn't happen; caught by Reader first.
                    throw new CharacterCodingException();
                } else {
                    return Character.toCodePoint((char)first_char, (char)second_char);
                }
            } else {
                return first_char;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int characteristics() { return ORDERED | NONNULL; }

    @Override
    public long estimateSize() { return Long.MAX_VALUE; }

    @Override
    public void forEachRemaining(IntConsumer f) {
        int cp;
        while ((cp = nextCP()) != -1) {
            f.accept(cp);
        }
    }

    @Override
    public boolean tryAdvance(IntConsumer f) {
        int cp = nextCP();
        if (cp != -1) {
            f.accept(cp);
            return true;
        } else {
            return false;
        }
    }

    @Override
    public Spliterator.OfInt trySplit() { return null; }

    @Override
    public void close() throws IOException { input.close(); }
}
Example usage:
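// Needs java.nio.file.Files, java.nio.file.Path,
// java.util.stream.IntStream and java.util.stream.StreamSupport.
try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {
    IntStream codepoints = StreamSupport.intStream(sp, false);
    // do something with the stream
}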
or
try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {
    sp.forEachRemaining(cp -> doSomething(cp));
}
etc.
You can also use Files.readString() to read an entire file into a String and then call String#codePoints or the other codepoint methods on it. The class above is more memory efficient, if that matters, because it handles one codepoint at a time instead of holding the whole file in memory. A middle ground is to read the file a line at a time and convert each line to codepoints.
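The String-based alternatives can be sketched as follows. This is a minimal illustration, not part of the original answer: the class name CodePointLines and its two helper methods are hypothetical, and the input comes from a StringReader rather than a real file so the example is self-contained. With a file, Files.newBufferedReader(path) would supply the reader instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.stream.IntStream;

// Hypothetical helper class, for illustration only.
class CodePointLines {
    /** Codepoints of one string; String#codePoints combines valid surrogate pairs. */
    public static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    /**
     * Lazily streams codepoints one line at a time.
     * Note: BufferedReader#lines drops line terminators, so a lexer that
     * needs newline tokens would have to re-insert them itself.
     */
    public static IntStream codePoints(BufferedReader reader) {
        return reader.lines().flatMapToInt(String::codePoints);
    }

    public static void main(String[] args) throws IOException {
        // U+1F600 has to be written as the surrogate pair D83D DE00 in Java source.
        String text = "a\uD83D\uDE00\nb";
        try (BufferedReader r = new BufferedReader(new StringReader(text))) {
            // Prints U+0061, U+1F600 and U+0062, one per line.
            codePoints(r).forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }
}
```

Reading by line keeps memory bounded by the longest line rather than by the file size, at the cost of losing the exact line terminators.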
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
