Load data from CSV with encoding UTF-16LE
I am using Spark version 3.1.2, and I need to load data from a CSV file with encoding UTF-16LE.
df = (spark.read.format("csv")
    .option("delimiter", ",")
    .option("header", True)
    .option("encoding", "utf-16le")
    .load(file_path))
df.show(4)
It seems Spark can only read the first line correctly: from the second row onward, the values are either garbled or null.
However, Python reads the data correctly with this code:
with open(file_path, encoding='utf-16le', mode='r') as f:
    text = f.read()
    print(text)
The printed output shows that Python reads the file correctly.
Solution 1:[1]
Add these options when creating the Spark DataFrame from the CSV file source:
.option('encoding', 'UTF-16')
.option('multiline', 'true')
Solution 2:[2]
When using the DataFrameReader, the multiline option ignores the encoding option; it is not possible to use both options at the same time.
One workaround is to fix the multiline problems in your data first, and then specify the encoding to read the characters correctly.
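A related workaround, sketched below in plain Python, is to re-encode the UTF-16LE file to UTF-8 before handing it to Spark, so that the default reader (and the multiline option) can process it without the encoding option. The file names and the helper `reencode_utf16le_to_utf8` are illustrative, not part of the original answers:

```python
def reencode_utf16le_to_utf8(src_path, dst_path):
    """Read a UTF-16LE text file and write it back out as UTF-8."""
    with open(src_path, encoding="utf-16le", mode="r") as src:
        text = src.read()
    with open(dst_path, encoding="utf-8", mode="w") as dst:
        dst.write(text)

# Example: create a small UTF-16LE CSV and convert it.
with open("data_utf16.csv", "w", encoding="utf-16le") as f:
    f.write("id,name\n1,Ana\n2,Bo\n")

reencode_utf16le_to_utf8("data_utf16.csv", "data_utf8.csv")

# Spark can then read the UTF-8 copy without the encoding option, e.g.:
# df = (spark.read.option("header", True)
#       .option("multiline", True)
#       .csv("data_utf8.csv"))
```

This trades one extra pass over the file for the ability to use the multiline option, which may be acceptable for small to medium files; for very large files, re-encoding in a distributed job would be preferable.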
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Vikramsinh Shinde |
| Solution 2 | Ali BOUHLEL |