How to set 'charset' for DatumWriter || write Avro that contains Arabic characters to HDFS

Some of the data contains values in Arabic. After the data is written, both the reader code and the hadoop fs -text command show ?? instead of the Arabic characters.

1) Writer

// avro object is provided as SpecificRecordBase
Path path = new Path(pathStr);
DatumWriter<SpecificRecord> datumWriter = new SpecificDatumWriter<>();
FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf); // HDFS File System

FSDataOutputStream outputStream = fs.create(path);
DataFileWriter<SpecificRecord> dataFileWriter = new DataFileWriter<>(datumWriter);

Schema schema = getSchema(); // method to get schema
dataFileWriter.setCodec(CodecFactory.snappyCodec());
dataFileWriter.create(schema, outputStream);
dataFileWriter.append(avroObject);
dataFileWriter.close(); // flushes the last block and closes the underlying stream

2) Reader

Configuration conf = new Configuration();
FsInput in = new FsInput(new Path(hdfsFilePathStr), conf);
DatumReader<Row> datumReader = new GenericDatumReader<>();
DataFileReader<Row> dataFileReader = new DataFileReader<>(in, datumReader);
GenericRecord outputData = (GenericRecord) dataFileReader.iterator().next();

I've also tried the hadoop fs -text {filePath} command; there, too, the Arabic values appear as ??.

It will be really difficult to change the format in which data is written because there are numerous consumers of the same file.

Tried reading through SpecificRecordBase, still getting ??.

Edit

Also tried these (in both reader and writer):

Configuration conf = new Configuration();
conf.set("file.encoding", StandardCharsets.UTF_16.displayName());

AND

System.setProperty("file.encoding", StandardCharsets.UTF_16.displayName());

Doesn't help.



Solution 1:[1]

Apparently, HDFS does not handle many non-English characters well. To work around that, change the field type from string to bytes in your Avro schema.

To convert your value from String to bytes, use:

ByteBuffer.wrap(str.getBytes(StandardCharsets.UTF_8))

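Then, while reading, convert it back to a String. Decoding the buffer itself respects its position and limit, whereas byteData.array() returns the whole backing array and can include stale bytes if the buffer was reused:

StandardCharsets.UTF_8.decode(byteData).toString()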

The rest of the code in your reader and writer stays the same.
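
The String-to-bytes conversion described above can be sketched with the JDK alone. This is a minimal illustration, not the poster's actual code: the class name Utf8BytesField and its two helper methods are hypothetical, and the ByteBuffer stands in for the value of an Avro bytes field.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Utf8BytesField {
    // Writer side: encode the text as UTF-8 and wrap it for a "bytes" field.
    static ByteBuffer toBytesField(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reader side: decode only the bytes between position and limit.
    // duplicate() leaves the caller's buffer position untouched.
    static String fromBytesField(ByteBuffer field) {
        return StandardCharsets.UTF_8.decode(field.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Arabic sample text, written with Unicode escapes so the source
        // file's own encoding does not matter.
        String original = "\u0645\u0631\u062D\u0628\u0627";
        ByteBuffer stored = toBytesField(original);
        String restored = fromBytesField(stored);
        System.out.println(original.equals(restored));
    }
}
```

The reader side decodes through the buffer rather than calling array(), so a buffer whose backing array is larger than its content still yields the right text.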

With this change, the hadoop fs -text command will show English characters properly but may show gibberish for non-English characters; your reader will still be able to reconstruct the UTF-8 String from the ByteBuffer.
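
To tell whether the stored data or only its display is affected, it can help to see how question marks arise in plain Java. The sketch below rests on an assumption the question does not confirm: that the ?? comes from the text passing through a charset that cannot represent Arabic. The class name QuestionMarkCheck and the method asShownIn are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic, for illustration only.
class QuestionMarkCheck {
    // Encodes the text into the given charset and decodes it back,
    // mimicking a display path limited to that charset.
    // String.getBytes(Charset) substitutes the charset's replacement
    // byte for every character it cannot map.
    static String asShownIn(Charset charset, String text) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // Arabic sample text
        // US-ASCII cannot represent Arabic, so each character is replaced.
        System.out.println(asShownIn(StandardCharsets.US_ASCII, arabic));
        // UTF-8 can represent it, so the text survives unchanged.
        System.out.println(asShownIn(StandardCharsets.UTF_8, arabic).equals(arabic));
    }
}
```

If the reader's String compares equal to the original value in code, the file content is intact and only the console or the command's output encoding is at fault.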

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1