Read Spark CSV dataframe as pandas
After processing a large dataset in PySpark, I saved it to CSV using the following command:
df.repartition(1).write.option("header", "true").option("delimeter", "\t").csv("csv_data", mode="overwrite")
Now, I want to use pd.read_csv() to load it again.
info = pd.read_csv('part0000.csv', sep='\t', header='infer')
info comes back as a single column, with the values separated by commas rather than '\t':
col1name,col2name,col3name
val1,val2,val3
I tried specifying sep=',' but got a parsing error because some rows have more than 3 columns.
How can I fix this without skipping any rows? Is there anything I can do on the Spark side, such as specifying '|' as the delimiter?
Solution 1:[1]
The CSV writer does NOT have a delimeter option; what you actually need is the sep option.
Please refer to the Spark CSV data source documentation.
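For reference, a minimal sketch of the corrected write step (path and DataFrame name taken from the question; only the option name changes):

```python
# Write the Spark DataFrame as a single tab-separated CSV file.
# "sep" is the option the CSV writer recognizes; the misspelled
# "delimeter" option is silently ignored, which is why the output
# fell back to the default comma separator.
(
    df.repartition(1)
      .write
      .option("header", "true")
      .option("sep", "\t")
      .csv("csv_data", mode="overwrite")
)
```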
Solution 2:[2]
As mentioned in the pandas documentation, pandas treats the " character as a quote character and expects a closing " for every opening ", which is not always true in my case. To fix this, specify quoting=3 so that quoting is disabled.
data = pd.read_csv('data.csv', header='infer', sep='\t', engine='python', quoting=3)
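Equivalently, using the named constant from the standard csv module instead of the magic number 3 makes the intent clearer (a sketch, assuming the same tab-separated file name as above):

```python
import csv
import pandas as pd

# quoting=csv.QUOTE_NONE (value 3) tells the parser to treat every
# '"' as a literal character rather than the start of a quoted field.
data = pd.read_csv(
    "data.csv",
    sep="\t",
    header="infer",
    engine="python",
    quoting=csv.QUOTE_NONE,
)
```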
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | 过过招 |
| Solution 2 | LearnToGrow |
