Is there any way to convert this into a numerical output in PySpark?
I am a beginner trying to understand how to use PySpark, so I am experimenting with different snippets and checking whether the output matches my expectations.
However, one block has been a challenge: I can't work out how to get a numerical output from it. Have a look:
Loading a data file:
raw_data = sc.textFile("./comedy_comparisons.txt")
Here, I split each line of the text file on commas to get CSV-style fields. Then I want to select the rows whose 3rd column (index 2) is "right" and return the total duration of the entire process:
csv = raw_data.map(lambda x: x.split(","))
normal_data = csv.filter(lambda x: x[2]=="right")
duration = normal_data.map(lambda x: str(x[0]))
total_duration = duration.reduce(lambda x, y: x+y)
total_duration
The output I received is gibberish rather than the numerical result I expected:
'Vr4D8xO2lBYwSMh4E3xxdwHZPUQQNRvOgsP3YdfG_UioSG2B7dIqbNQrhjIStU0JvIbfU4rTa-PfQwvybYNEDQWEAi6DSPPo0VIg6lJ2k3TCFgLr4SS1zxRYgUPTrASEx_p8g2HylVyOyzcJmtITSojOF8BXPvq89jMU...
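A string like the one above is what reducing with `+` over strings produces: the first column holds video IDs, so `str(x[0])` yields text and `x+y` concatenates it. The following is a minimal sketch of that behaviour, assuming plain Python lists and `functools.reduce` as stand-ins for the RDD and its `reduce` (no Spark involved). The helper name `summarize` is hypothetical, and the rows are the five sample lines from the question.

```python
from functools import reduce

# Sample rows copied from the data set shown in the question.
lines = [
    "sNabaB-eb3Y,wHkPb68dxEw,left",
    "sNabaB-eb3Y,y2emSXSE-N4,left",
    "fY_FQMQpjok,sNabaB-eb3Y,left",
    "Vr4D8xO2lBY,sNabaB-eb3Y,right",
    "sNabaB-eb3Y,dDtRnstrefE,left",
]

rows = [line.split(",") for line in lines]


def summarize(rows, side):
    """Hypothetical helper: return (concatenated IDs, row count) for one vote side."""
    selected = [r for r in rows if r[2] == side]
    # Same shape as the original pipeline: "+" on strings concatenates them.
    concatenated = reduce(lambda x, y: x + y, [str(r[0]) for r in selected], "")
    # Mapping every row to 1 turns "+" into integer addition, i.e. a count.
    count = reduce(lambda x, y: x + y, [1 for _ in selected], 0)
    return concatenated, count


left_ids, left_count = summarize(rows, "left")
right_ids, right_count = summarize(rows, "right")
print(left_ids, left_count)
print(right_ids, right_count)
```

In PySpark itself, the numeric equivalent would presumably be `normal_data.count()`, or mapping each row to `1` before the `reduce`. Which number is actually wanted is an assumption here, since the sample data contains no duration field.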
For context, the data set I am using looks like this:
- sNabaB-eb3Y,wHkPb68dxEw,left
- sNabaB-eb3Y,y2emSXSE-N4,left
- fY_FQMQpjok,sNabaB-eb3Y,left
- Vr4D8xO2lBY,sNabaB-eb3Y,right
- sNabaB-eb3Y,dDtRnstrefE,left
Each row in this text file represents one anonymous user vote. Each line contains three comma-separated fields. The first two fields are YouTube video IDs. The third field is either 'left' or 'right'. Left indicates the first video from the pair was voted to be funnier than the second. Right indicates the opposite preference.
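To make the vote semantics described above concrete, here is a small sketch in plain Python (not Spark) that reads each line as "first ID, second ID, vote" and tallies which video was judged funnier. The function name `funnier_video` is hypothetical, and the input is just the five sample rows without the list bullets.

```python
from collections import Counter

# The five sample rows from the question, without the leading "- ".
sample = [
    "sNabaB-eb3Y,wHkPb68dxEw,left",
    "sNabaB-eb3Y,y2emSXSE-N4,left",
    "fY_FQMQpjok,sNabaB-eb3Y,left",
    "Vr4D8xO2lBY,sNabaB-eb3Y,right",
    "sNabaB-eb3Y,dDtRnstrefE,left",
]


def funnier_video(line):
    """Hypothetical helper: return the ID of the video voted funnier on this line."""
    first, second, vote = line.strip().split(",")
    # "left" means the first video won; "right" means the second one did.
    return first if vote == "left" else second


# Count how many votes each video won across the sample.
wins = Counter(funnier_video(line) for line in sample)
print(wins.most_common())
```

Each tally here is an integer, which is the kind of numerical result the three columns can support, since none of them is numeric on its own.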
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
