Finding error/mismatched records in a DataFrame when reading a CSV file with multiLine and delimiter options using PySpark
I am trying to find the error/mismatched records in a DataFrame read with the multiLine and delimiter options in PySpark. The input CSV is:
Id,Address Line1,City,State,Zipcode
1,9182 Clear Water Rd,Fayetteville,AR,72704
2,"9920 State
Highway 89",Ringling,OK,73456
3,9724 E Landon Ln,Kennewick,WA,99338
4,9725 E Landon,
5,Kennewick,WA,99338
Converting to a DataFrame:
df = spark.read.option("header", True).option("multiLine", True).option("delimiter", ",").csv("src/main/resources/address-multiline.csv")
df.show()
col_count = 5

def func1(row):
    # count the fields that actually have a value in this row
    no_of_columns = len([c for c in row if c is not None])
    if no_of_columns != col_count:
        return 'having issue, actual fields {}: {}'.format(no_of_columns, row)
    return None

rdd2 = df.rdd.map(lambda x: func1(x))
Total no of columns: 5
Rows 4 and 5 are mismatched records.
Expected output: the error / number of mismatched fields for each bad row:
Error at line no 4, actual fields 2, 9725 E Landon,
Error at line no 5, actual fields 4, 5,Kennewick,WA,99338
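Outside Spark, the same field-count check can be sketched with Python's standard csv module, which also handles quoted multi-line fields the way the multiLine option does. This is a minimal standalone sketch, not the Spark solution itself; the data is inlined rather than read from the file path above, and the function name and message format are illustrative:

```python
import csv
import io

# Sample data from the question: record 2 spans two physical lines,
# record 4 is missing fields, record 5 lost its address field.
data = '''Id,Address Line1,City,State,Zipcode
1,9182 Clear Water Rd,Fayetteville,AR,72704
2,"9920 State
Highway 89",Ringling,OK,73456
3,9724 E Landon Ln,Kennewick,WA,99338
4,9725 E Landon,
5,Kennewick,WA,99338
'''

COL_COUNT = 5

def find_mismatches(text, expected=COL_COUNT):
    """Return (record_no, actual_fields, raw_record) for each bad record."""
    errors = []
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    # enumerate counts logical records, so a quoted multi-line
    # address still counts as a single record
    for record_no, fields in enumerate(reader, start=1):
        non_empty = [f for f in fields if f != '']
        if len(non_empty) != expected:
            errors.append((record_no, len(non_empty), ','.join(non_empty)))
    return errors

for rec, n, raw in find_mismatches(data):
    print('Error at record {}, actual fields {}, {}'.format(rec, n, raw))
```

Counting only non-empty fields matches the question's expected output: record 4 reports 2 actual fields (the dangling comma produces an empty trailing field) and record 5 reports 4.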
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
