How to convert or save a CSV file into a TXT file using PySpark
I'm learning PySpark and I don't know how to save the sum of RDD values to a file. I've tried the code below without success:
from typing import KeysView

counts = rdd.flatMap(lambda line: line.split(",")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

k = counts.keys().saveAsTextFile("out/out_1_2a.txt")
sc.parallelize(counts.values().sum()).saveAsTextFile('out/out_1_3.txt')
While I could save the keys to a file, I couldn't save the sum of the values. The error I get is: "TypeError: 'int' object is not iterable".
Can someone help?
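The error comes from `sc.parallelize`, which expects an iterable, while `counts.values().sum()` returns a plain `int`. A minimal plain-Python illustration of the same failure and the fix (no Spark needed):

```python
# sc.parallelize iterates over its argument; a bare int cannot be iterated
try:
    iter(5)
except TypeError as e:
    print(e)  # 'int' object is not iterable

# Wrapping the value in a one-element list makes it iterable
print(list(iter([5])))  # [5]
```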
Solution 1:[1]
See the logic below; `sum()` returns a plain `int`, so wrap it in a list before passing it to `sc.parallelize`:
counts = rdd.flatMap(lambda line: line.split(",")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

cnt_sum = counts.values().sum()
# Note: saveAsTextFile creates a directory at this path containing part files,
# not a single plain-text file; coalesce(1) keeps it to one part file.
sc.parallelize([cnt_sum]).coalesce(1).saveAsTextFile("<path>/filename.txt")
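For readers without a cluster handy, the same aggregation can be checked in plain Python; `collections.Counter` plays the role of `reduceByKey` here, and the sample `lines` list is a hypothetical stand-in for the RDD's contents:

```python
from collections import Counter

# Hypothetical sample data standing in for the RDD's lines
lines = ["a,b,a", "b,c"]

# flatMap(split) -> map((word, 1)) -> reduceByKey(add), done with Counter
counts = Counter(word for line in lines for word in line.split(","))

cnt_sum = sum(counts.values())  # equivalent of counts.values().sum()
print(cnt_sum)  # 5
```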
More compact (the sum of all per-word counts is just the total number of words):
count = rdd.flatMap(lambda x: x.split(",")).count()  # count() avoids collecting the whole RDD to the driver
sc.parallelize([count]).coalesce(1).saveAsTextFile("<path>/filename.txt")
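The shortcut works because summing the per-word counts simply recovers the total number of words. A quick plain-Python check of that equivalence (the `lines` sample is hypothetical):

```python
from collections import Counter

lines = ["spark,is,fun", "spark,rdd"]  # stand-in for the RDD's contents
words = [w for line in lines for w in line.split(",")]

# reduceByKey-style counts, then summed back up
total = sum(Counter(words).values())
print(total == len(words))  # True: both equal the number of words
```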
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
