Finding error/mismatched records in a DataFrame when reading a CSV file with multiLine and delimiter options using PySpark
I am trying to find the error/mismatched records in a DataFrame read with the multiLine and delimiter options in PySpark. The input CSV is:
Id,Address Line1,City,State,Zipcode
1,9182 Clear Water Rd,Fayetteville,AR,72704
2,"9920 State
Highway 89",Ringling,OK,73456
3,9724 E Landon Ln,Kennewick,WA,99338
4,9725 E Landon,
5,Kennewick,WA,99338
Converting to a DataFrame:
df = spark.read.option("header", True).option("multiLine", True).option("delimiter", ",").csv("src/main/resources/address-multiline.csv")
df.show()
col_count = 5

def func1(row):
    # count the fields that actually have a value in this row
    no_of_columns = len([c for c in row if c is not None])
    if no_of_columns != col_count:
        return 'having issue, actual fields {}: {}'.format(no_of_columns, row)
    return None

rdd2 = df.rdd.map(lambda x: func1(x))
Total no of columns: 5
Rows 4 and 5 are mismatched records.
Expected output: the error / number of mismatched fields for each bad row:
Error at line no 4, actual fields 2, 9725 E Landon,
Error at line no 5, actual fields 4, 5,Kennewick,WA,99338
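Outside Spark, the same field-count check can be sketched with Python's standard csv module, which also handles quoted multi-line fields the way the multiLine option does. This is a minimal standalone sketch, not the Spark solution itself; the data is inlined rather than read from the file path above, and the function name and message format are illustrative:

```python
import csv
import io

# Sample data from the question: record 2 spans two physical lines,
# record 4 is missing fields, record 5 lost its address field.
data = '''Id,Address Line1,City,State,Zipcode
1,9182 Clear Water Rd,Fayetteville,AR,72704
2,"9920 State
Highway 89",Ringling,OK,73456
3,9724 E Landon Ln,Kennewick,WA,99338
4,9725 E Landon,
5,Kennewick,WA,99338
'''

COL_COUNT = 5

def find_mismatches(text, expected=COL_COUNT):
    """Return (record_no, actual_fields, raw_record) for each bad record."""
    errors = []
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    # enumerate counts logical records, so a quoted multi-line
    # address still counts as a single record
    for record_no, fields in enumerate(reader, start=1):
        non_empty = [f for f in fields if f != '']
        if len(non_empty) != expected:
            errors.append((record_no, len(non_empty), ','.join(non_empty)))
    return errors

for rec, n, raw in find_mismatches(data):
    print('Error at record {}, actual fields {}, {}'.format(rec, n, raw))
```

Counting only non-empty fields matches the question's expected output: record 4 reports 2 actual fields (the dangling comma produces an empty trailing field) and record 5 reports 4.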
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
