Split PySpark RDD into (key, 0) and (value, 1) pairs
I have an RDD like this:
['1:2', '3:4', '5:1 2 3']
and want to split it into:
[1,0], [2,1], [3,0], [4,1], [5,0], [1,1], [2,1], [3,1]
Logic: given x:y, the left side of the colon should produce (x, 0) and the right side should produce (y, 1).
Given x:y a b c, where the right side of the colon contains multiple values separated by spaces, every value should produce a pair with 1: (y,1) (a,1) (b,1) (c,1).
How can I get the above result in PySpark?
Solution 1
You can achieve this with flatMap:
from pyspark.sql import SparkSession

data = ['1:2', '3:4', '5:1 2 3']

spark = SparkSession.builder.master("local[4]").appName("Q71346701") \
    .getOrCreate()

def generate_output(row):
    final_elements = []
    # idx is 0 for the left side of the colon, 1 for the right side
    items = row.split(':')
    for idx, elm in enumerate(items):
        inner_list = elm.split(' ')
        if len(inner_list) == 1:
            final_elements.append((int(elm), idx))
        else:
            # multiple space-separated values all get the label 1
            for el in inner_list:
                final_elements.append((int(el), 1))
    return final_elements

rdd = spark.sparkContext.parallelize(data)
final_rdd = rdd.flatMap(generate_output)
print(final_rdd.collect())
# [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (1, 1), (2, 1), (3, 1)]
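The per-line parsing can also be written as a small standalone function and unit-tested without a Spark session before handing it to flatMap. This is a sketch, not part of the original answer: `parse_line` is a hypothetical helper name, and it assumes (per the stated logic) that the left side of the colon is always a single value.

```python
def parse_line(row):
    # Hypothetical helper: same per-line logic as generate_output above,
    # but written without Spark so it can be tested on plain Python lists.
    left, right = row.split(':', 1)
    # Left side maps to (x, 0); every space-separated value on the
    # right side maps to (y, 1).
    return [(int(left), 0)] + [(int(v), 1) for v in right.split()]

lines = ['1:2', '3:4', '5:1 2 3']
# Flatten the per-line results, mirroring what rdd.flatMap(parse_line) does
pairs = [p for line in lines for p in parse_line(line)]
print(pairs)
# [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (1, 1), (2, 1), (3, 1)]
```

Once the function behaves as expected locally, the Spark version is just `spark.sparkContext.parallelize(lines).flatMap(parse_line)`.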
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | srinivas |
