How do I truncate a Java string to fit in a given number of bytes, once UTF-8 encoded?
How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?
Solution 1:[1]
You should use CharsetEncoder. The simple approach of calling getBytes() and copying as many bytes as fit can cut UTF-8 characters in half.
Something like this:
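import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Encodes as many whole characters of input as fit into output; returns the number of bytes written.
public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());
    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    // The encoder stops before any character that would not fit completely.
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}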
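If you want the truncated String back rather than a filled byte array, the same encoder can tell you how many chars were consumed. The following is a minimal sketch under that assumption; the class name Utf8EncoderTruncate and the method truncate are made up for illustration, and the REPLACE settings are a choice made here so that unpaired surrogates do not stop the encoder early.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
final class Utf8EncoderTruncate {
    /** Returns the longest prefix of input whose UTF-8 encoding fits in maxBytes. */
    static String truncate(String input, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        ByteBuffer outBuf = ByteBuffer.allocate(maxBytes);
        CharBuffer inBuf = CharBuffer.wrap(input);
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // The encoder stops before a character (or surrogate pair) that does not
        // fit, so inBuf.position() counts only chars that were fully encoded.
        encoder.encode(inBuf, outBuf, true);
        return input.substring(0, inBuf.position());
    }
}
```

For example, Utf8EncoderTruncate.truncate("A\u00e6\u20ac", 3) keeps "A" and "æ" (1 + 2 bytes) and drops "€", which would need 3 more bytes.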
Solution 2:[2]
Here's what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs, etc. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with checks added for null and for avoiding decoding when the encoded string is no longer than maxBytes.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/**
 * Truncates a string to the number of characters that fit in X bytes, avoiding multi-byte characters being cut in
 * half at the cut-off point. Also handles surrogate pairs, where 2 chars in the string are actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}
Solution 3:[3]
UTF-8 encoding has a neat trait that lets you tell where in a byte sequence you are.
Check the byte at the limit you want, i.e. the position where the terminating 0 would go.
- If its high bit is 0, it's a single-byte char; just replace it with 0 and you're fine.
- If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
- If the high bit is 1 but the next bit is 0, then you're in the middle of a character. Travel back along the buffer until you hit a byte that has 2 or more 1s in the high bits, and replace that byte with 0.
Example: If your stream is: 31 33 31 C2 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C2, which is the start of a multi-byte char.
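In Java there is no terminating 0 to write, but the same bit test can pick the cut position in an encoded byte array. This is a rough sketch of that idea, not the answer's own code; Utf8BoundaryCut, safeCut and truncate are names invented here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class illustrating the bit test described above.
final class Utf8BoundaryCut {
    /** Returns the largest cut position <= limit that does not split a character. */
    static int safeCut(byte[] utf8, int limit) {
        if (limit >= utf8.length) {
            return utf8.length;
        }
        int cut = Math.max(limit, 0);
        // Continuation bytes have the form 10xxxxxx. If the byte at the cut is one,
        // we are inside a character, so step back to its lead byte.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    /** Encodes s, cuts at a character boundary, and decodes the prefix again. */
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }
}
```

Because the byte at the cut position is inspected rather than the last byte kept, a character that ends exactly at the limit is preserved.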
Solution 4:[4]
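You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
Note that this does not respect character boundaries: if maxLen falls inside a multi-byte character, the partial bytes are decoded as the replacement character U+FFFD, and the call throws if maxLen exceeds the encoded length.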
Solution 5:[5]
You can calculate the number of bytes without doing any conversion.
foreach character in the Java string
    if 0 <= character <= 0x7f
        count += 1
    else if 0x80 <= character <= 0x7ff
        count += 2
    else if 0x800 <= character <= 0xd7ff    // excluding the surrogate area
        count += 3
    else if 0xdc00 <= character <= 0xffff
        count += 3
    else {    // surrogate, a bit more complicated
        count += 4
        skip one extra character in the input stream
    }
You would have to detect surrogate pairs (U+D800–U+DBFF followed by U+DC00–U+DFFF) and count 4 bytes for each valid surrogate pair. If the first value is in the first range and the second is in the second range, all is well: skip both and add 4. If not, the surrogate pair is invalid. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
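The counting idea can be turned into a truncation routine by walking the string one code point at a time and stopping before the budget is exceeded. The sketch below is one possible reading of that idea; Utf8CountTruncate and its methods are illustrative names. It counts an unpaired surrogate as 3 bytes, which is an assumption chosen to over-estimate rather than under-estimate, so the result should still fit.

```java
// Hypothetical helper class illustrating truncation by byte counting.
final class Utf8CountTruncate {
    /** Number of bytes UTF-8 uses for one code point. */
    static int utf8Length(int codePoint) {
        if (codePoint <= 0x7F) {
            return 1;
        } else if (codePoint <= 0x7FF) {
            return 2;
        } else if (codePoint <= 0xFFFF) {
            return 3; // also covers unpaired surrogates (conservative assumption)
        }
        return 4; // supplementary code point, stored as a surrogate pair
    }

    /** Returns the longest prefix of s whose counted UTF-8 length is <= maxBytes. */
    static String truncate(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            // codePointAt combines a valid surrogate pair into one code point.
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) {
                break;
            }
            bytes += len;
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return s.substring(0, i);
    }
}
```

This never allocates the encoded byte array, which may matter for long strings with a small byte budget.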
Solution 6:[6]
Based on billjamesdev's answer I've come up with the following method which, as far as I can tell, is the simplest and still works OK with surrogate pairs:
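// Assumes 0 < trimSize <= encoded length; otherwise the array access below throws.
// Conservative: a multibyte character that ends exactly at trimSize is dropped as well.
public static String utf8ByteTrim(String s, int trimSize) {
    final byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if ((bytes[trimSize-1] & 0x80) != 0) { // inside a multibyte sequence
        while ((bytes[trimSize-1] & 0x40) == 0) { // 2nd, 3rd, 4th bytes
            trimSize--;
        }
        trimSize--;
    }
    return new String(bytes, 0, trimSize, StandardCharsets.UTF_8);
}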
Some testing:
String test = "Aæ???";
IntStream.range(1, 16).forEachOrdered(i ->
System.out.println("Size " + i + ": " + utf8ByteTrim(test, i))
);
---
Size 1: A
Size 2: A
Size 3: A
Size 4: Aæ
Size 5: Aæ
Size 6: Aæ
Size 7: Aæ
Size 8: Aæ?
Size 9: Aæ?
Size 10: Aæ?
Size 11: Aæ??
Size 12: Aæ??
Size 13: Aæ???
Size 14: Aæ???
Size 15: Aæ???
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | hongsy |
| Solution 2 | logtwo |
| Solution 3 | |
| Solution 4 | Suresh Gupta |
| Solution 5 | |
| Solution 6 | |
