Load data from CSV with encoding UTF-16LE

I am using Spark version 3.1.2, and I need to load data from a CSV file with encoding UTF-16LE.

df = (spark.read.format("csv")
    .option("delimiter", ",")
    .option("header", "true")
    .option("encoding", "utf-16le")
    .load(file_path))
df.show(4)

It seems Spark can only read the first line correctly: from the second row onward, the values are either garbled characters or nulls.

However, Python reads the data correctly with this code:

with open(file_path, encoding='utf-16le', mode='r') as f:
    text = f.read()
    print(text)
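A likely explanation for the difference (a sketch, assuming the reader splits records on the raw newline byte, which is not stated in the question): in UTF-16LE a newline is the two bytes 0x0A 0x00, so a byte-level split leaves a stray 0x00 at the start of every record after the first, shifting its byte alignment and garbling the decode. A minimal demonstration:

```python
# Encode a small CSV as UTF-16LE, then split on the raw 0x0A byte,
# the way a byte-oriented line reader would.
data = "id,name\n1,alice\n2,bob\n".encode("utf-16le")

records = data.split(b"\n")  # byte-level split, ignores the encoding

first = records[0].decode("utf-16le", errors="replace")
second = records[1].decode("utf-16le", errors="replace")

print(repr(first))   # decodes cleanly: the split fell on an even byte boundary
print(repr(second))  # a leading 0x00 shifts the alignment, so this is garbled
```

Python's `open(..., encoding='utf-16le')` decodes the whole stream before splitting lines, which is why it does not hit this problem.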

The printed output shows the file contents decoded correctly.



Solution 1:[1]

Add these options when creating the Spark DataFrame from the CSV file source:

.option('encoding', 'UTF-16')
.option('multiline', 'true')

Solution 2:[2]

The multiline option ignores the encoding option when using the DataFrameReader; it is not possible to use both options at the same time.

A workaround may be to fix the multiline problems in your data first, and then specify an encoding to read the characters correctly.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Vikramsinh Shinde
Solution 2: Ali BOUHLEL