'Need to extract the data based on delimiter and map to data frame in pyspark

I need to extract the data with ~~ delimiter and map accordingly to the required columns.

But the output somehow is random and getting wrong results/mappings. How can we achieve this using pyspark?

Sample Date: MESSAGE from Dataframe column

{5:~~:2016:ABCDEF123~~:2323:002~~:2016:567~~::555:~~XXABC~~:2016:123~~:555:~~YYYYY~~-}
{5:~~:2016:DEF~~:2323:009~~:2016:666~~::555:~~ZZZZ~~:2016:788~~:555:~~DDDDD~~:2016:5013~~:555:~~TTTTTTTT~~-}

Expected Data Frame Output:

PARENT_REF|PARENT_TXN||CHILD_REF|_CHILD_ORG

ABCDEF123|002|567|XXABC

ABCDEF123|002|123|YYYYY

DEF|009|666|ZZZZ

DEF|009|788|DDDDD

DEF|009|5013|TTTTTTTT

First 2016 is PARENT_REF. First 2323 is Parent TXN. Subsequent 2016 is child Ref. Susequent 555 is Child org.

Note - Child records can vary for a parent Record

Code Snippet:

from pyspark.sql import functions as F

df2=df1.select("MESSAGE")

df3=df2.withColumn("PARENT_REF",F.regexp_extract(F.col('MESSAGE'),'\{5:*:.*:2016:(.*?)~~:"',1))
        .withColumn("PARENT_TXN",F.regexp_extract(F.col('MESSAGE'),'\{5:*:.*:2323:(.*?)~~:"',1))
        .withColumn("CHILD_REF",F.regexp_extract(F.col('MESSAGE'),'\{5:*:.*:2016:(.*?)~~:"',1))
        .withColumn("CHILD_ORG",F.regexp_extract(F.col('MESSAGE'),'\{5:*:.*:555:(.*?)~~:"',1))

df3.show()

pyspark

Solution 1:^[1]

You just need to write a correct regex

Sample data

df = spark.createDataFrame([
    ('{5:~~:2016:ABCDEF123~~:2323:002~~:2016:567~~::555:~~XXABC~~:2016:123~~:555:~~YYYYY~~-}',),
    ('{5:~~:2016:DEF~~:2323:009~~:2016:666~~::555:~~ZZZZ~~:2016:788~~:555:~~DDDDD~~:2016:5013~~:555:~~TTTTTTTT~~-}',),
], ['message'])

+------------------------------------------------------------------------------------------------------------+
|message                                                                                                     |
+------------------------------------------------------------------------------------------------------------+
|{5:~~:2016:ABCDEF123~~:2323:002~~:2016:567~~::555:~~XXABC~~:2016:123~~:555:~~YYYYY~~-}                      |
|{5:~~:2016:DEF~~:2323:009~~:2016:666~~::555:~~ZZZZ~~:2016:788~~:555:~~DDDDD~~:2016:5013~~:555:~~TTTTTTTT~~-}|
+------------------------------------------------------------------------------------------------------------+

Transformation

(df
    .withColumn('parent_ref',F.regexp_extract(F.col('message'), '\{5:~~:2016:([^~]+)~~:2323:([^~]+)~~([^-]+)-}', 1))
    .withColumn('parent_txn',F.regexp_extract(F.col('message'), '\{5:~~:2016:([^~]+)~~:2323:([^~]+)~~([^-]+)-}', 2))
    .withColumn('children',  F.regexp_extract(F.col('message'), '\{5:~~:2016:([^~]+)~~:2323:([^~]+)~~([^-]+)-}', 3))
    .withColumn('child',  F.explode(F.split('children', ':2016:')))
    .where(F.col('child') != '')
    .withColumn('child_ref', F.regexp_extract(F.col('child'), '^([^~]+)', 1))
    .withColumn('child_org', F.regexp_extract(F.col('child'), '~~([^~]+)~~$', 1))
    .drop('message', 'children', 'child')
    .show(10, False)
)

+----------+----------+---------+---------+
|parent_ref|parent_txn|child_ref|child_org|
+----------+----------+---------+---------+
|ABCDEF123 |002       |567      |XXABC    |
|ABCDEF123 |002       |123      |YYYYY    |
|DEF       |009       |666      |ZZZZ     |
|DEF       |009       |788      |DDDDD    |
|DEF       |009       |5013     |TTTTTTTT |
+----------+----------+---------+---------+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	pltc

'Need to extract the data based on delimiter and map to data frame in pyspark

Solution 1:[1]

Sample data

Transformation

Sources

Related Questions

Solution 1:^[1]