Databricks: read a CSV file as an RDD
I need to read a CSV file as an RDD and then extract the desired data with two functions. The issue is that the first function does not produce an RDD in the shape the second function expects.
This is my code:
def extract_vin_key_value(line):
    sr = line.split(',')
    if sr[1] == 'I':
        # Keep only the 'I' (initial sale) records.
        return sr[2], sr[3], sr[4], sr[5]

def populate_make(make):
    return make

raw_rdd = sc.textFile('/FileStore/tables/data.csv')
vin_kv = raw_rdd.map(lambda x: extract_vin_key_value(x))
vin_kv = vin_kv.filter(lambda x: x is not None)
enhance_make = vin_kv.groupByKey().flatMap(lambda kv: populate_make(kv[1]))
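For context, groupByKey only works on a pair RDD, i.e. an RDD whose elements are exact two-element (key, value) tuples. A minimal standalone sketch of that expected shape (illustrative only, not taken from the pipeline above; it just reuses the same SparkContext sc):

pairs = sc.parallelize([('VIN1', ('Nissan', 'Altima')),
                        ('VIN1', ('Nissan', 'Altima')),
                        ('VIN2', ('Mercedes', 'C300'))])
# Each element is an exact 2-tuple, so groupByKey can unpack it.
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())
# e.g. [('VIN1', [('Nissan', 'Altima'), ('Nissan', 'Altima')]),
#       ('VIN2', [('Mercedes', 'C300')])]  (ordering may vary)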
The CSV file is:
1 I VXIO456XLBB630221 Nissan Altima 2003 2002-05-08 Initial sales from TechMotors
2 I INU45KIOOPA343980 Mercedes C300 2015 2014-01-01 Sold from EuroMotors
3 A VXIO456XLBB630221 2014-07-02 Head on collision
4 R VXIO456XLBB630221 2014-08-05 Repair transmission
5 I VOME254OOXW344325 Mercedes E350 2015 2014-02-01 Sold from Carmax
6 R VOME254OOXW344325 2015-02-06 Wheel alignment service
7 R VXIO456XLBB630221 2015-01-01 Replace right head light
8 I EXOA00341AB123456 Mercedes SL550 2016 2015-01-01 Sold from AceCars
9 A VOME254OOXW344325 2015-10-01 Side collision
10 R VOME254OOXW344325 2015-09-01 Changed tires
11 R EXOA00341AB123456 2015-05-01 Repair engine
12 A EXOA00341AB123456 2015-05-03 Vehicle rollover
13 R VOME254OOXW344325 2015-09-01 Replace passenger side door
14 I UXIA769ABCC447906 Toyota Camery 2017 2016-05-08 Initial sales from Carmax
15 R UXIA769ABCC447906 2020-01-02 Initial sales from Carmax
16 A INU45KIOOPA343980 2020-05-01 Side collision
The output of the first function (after filtering out the None rows) is:
[('VXIO456XLBB630221', 'Nissan', 'Altima', '2003'),
('INU45KIOOPA343980', 'Mercedes', 'C300', '2015'),
('VOME254OOXW344325', 'Mercedes', 'E350', '2015'),
('EXOA00341AB123456', 'Mercedes', 'SL550', '2016'),
('UXIA769ABCC447906', 'Toyota', 'Camery', '2017')]
And the error message is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 256.0 failed 1 times, most recent failure: Lost task 0.0 in stage 256.0 (TID 338) (ip-10-172-200-7.us-west-2.compute.internal executor driver): org.apache.spark.api.python.PythonException: 'ValueError: too many values to unpack (expected 2)'.
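The unpacking error points at the element shape: groupByKey tries to split each element into exactly (key, value), but extract_vin_key_value returns a 4-tuple such as ('VXIO456XLBB630221', 'Nissan', 'Altima', '2003'), which cannot be unpacked into two parts. For reference, a minimal sketch of an extractor that emits exact (key, value) pairs; the (vin, (make, model, year)) layout is an assumption about the intended grouping key, and raw_rdd is reused from the code above:

def extract_vin_key_value(line):
    sr = line.split(',')
    if sr[1] == 'I':
        # Exact 2-tuple: VIN as the key, the remaining fields as one value.
        return sr[2], (sr[3], sr[4], sr[5])

def populate_make(values):
    # 'values' is the grouped iterable of (make, model, year) tuples.
    return list(values)

vin_kv = raw_rdd.map(extract_vin_key_value).filter(lambda x: x is not None)
enhance_make = vin_kv.groupByKey().flatMap(lambda kv: populate_make(kv[1]))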