Split a PySpark RDD into (key, 0) and (value, 1) pairs

I have an RDD as below:

['1:2','3:4','5:1 2 3']

and want to split it like this:

[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [1, 1], [2, 1], [3, 1]

Logic: for a record x:y, the left side of the colon should produce (x, 0) and the right side should produce (y, 1).

x: y a b c

If the right side of the colon contains multiple values separated by spaces, then each of those values should produce a pair tagged with 1: (y, 1) (a, 1) (b, 1) (c, 1).

How can I get the above result in PySpark?



Solution 1:[1]

You can achieve this with the code below:

from pyspark.sql import SparkSession

data = ['1:2', '3:4', '5:1 2 3']
spark = SparkSession.builder.master("local[4]").appName("Q71346701") \
    .getOrCreate()

def generate_output(row):
    # Split the record on ':' -- enumerate() yields index 0 for the
    # left side of the colon and index 1 for the right side.
    final_elements = []
    items = row.split(':')
    for idx, elm in enumerate(items):
        # Each side may hold several space-separated values; every
        # value is tagged with its side's index (0 or 1).
        for el in elm.split(' '):
            final_elements.append((int(el), idx))
    return final_elements


rdd = spark.sparkContext.parallelize(data)
final_rdd = rdd.flatMap(generate_output)
print(final_rdd.collect())
# [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (1, 1), (2, 1), (3, 1)]
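The same mapping can also be written more compactly as a nested comprehension. This is just a sketch equivalent to the function above (the helper name `split_pairs` is my own), and the mapping itself can be checked in plain Python without a Spark session:

```python
def split_pairs(row):
    # 'side' is 0 for the left of the colon and 1 for the right;
    # str.split() with no argument handles the space-separated values.
    return [(int(value), side)
            for side, part in enumerate(row.split(':'))
            for value in part.split()]

# Pure-Python check of the mapping, independent of Spark:
print(split_pairs('5:1 2 3'))  # [(5, 0), (1, 1), (2, 1), (3, 1)]

# With Spark, use it the same way as generate_output:
# final_rdd = rdd.flatMap(split_pairs)
```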

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 srinivas