'Spark Regex extraction

Have a string below in a table. The content inside messageBody is a JSON string. How to read using a Spark and extract the JSON inside messageBody

Input Data:

{
  "audit_id": "",
  "audit_name": "GFSVpFeox/KrjEpFkIELEgltPGcqVU7/I0Oh9iVfdWA=",
  "audit_info": "ingest-eventss",
  "messageBody": "{\"Id\":\"8607379a-348b-4fdd-909e-80b85ac402d1\",\"EventId\":31,\"EventName\":\"LandingPage\",\"TriggerId\":38,\"TriggerName\":\"Agent.StartInterview\",\"TopicId\":5,\"TopicName\":\"businessevents.data\",\"SourceAppId\":22,\"SourceAppName\":\"TEST\",\"EventCorrelationId\":\"e3f091d9-86cf-4516-a173-22d891e1f20a\",\"Environment\":\"en1\",\"Timestamp\":\"2022-04-15T20:11:48.9505708Z\",\"Detail\":{\"Data\":{\"LineContent\":\"Business\",\"ReferenceNumber\":\"6834555\"}}},
  "partitionKey": null,
  "replyTo": null
}

Expected Output:

audit_info messageBody
ingest-eventss {"Id":"8607379a-348b-4fdd-909e-80b85ac402d1","EventId":31,"EventName":"LandingPage","TriggerId":38,"TriggerName":"Agent.StartInterview","TopicId":5,"TopicName":"businessevents.data","SourceAppId":22,"SourceAppName":"TEST","EventCorrelationId":"e3f091d9-86cf-4516-a173-22d891e1f20a","Environment":"en1","Timestamp":"2022-04-15T20:11:48.9505708Z","Detail":{"Data":{"LineContent":"Business","ReferenceNumber":"6834555"}}}

Need to do this in Spark 3. Any regexp_extract or Split function ? the Split seems hard as the delimiter : is inside the json message as well.



Solution 1:[1]

You might need this, but it's not recommended because it's not maintainable

find:

{([^\\]*)"{\\"Id\\":\\"([^\\]*)\\",\\"EventId\\":([^,]*),\\"EventName\\":\\"([^\\]*)\\",\\"TriggerId\\":([^,]*),\\"TriggerName\\":\\"([^\\]*)\\",\\"TopicId\\":([^,]*),\\"TopicName\\":\\"([^\\]*)\\",\\"SourceAppId\\":([^,]*),\\"SourceAppName\\":\\"([^\\]*)\\",\\"EventCorrelationId\\":\\"([^\\]*)\\",\\"Environment\\":\\"([^\\]*)\\",\\"Timestamp\\":\\"([^\\]*)\\",\\"Detail\\":{\\"([^\\]*)\\":{\\"LineContent\\":\\"([^\\]*)\\",\\"ReferenceNumber\\":\\"([^\\]*)\\"}}}([^}]*)}

replace:

{"Id":"$2","EventId":$3,"EventName":"$4","TriggerId":$5,"TriggerName":"$6","TopicId":$7,"TopicName":"$8","SourceAppId":$9,"SourceAppName":"($10)","EventCorrelationId":"$11","Environment":"$12","Timestamp":"$13","Detail":{"$14":{"LineContent":"$15","ReferenceNumber":"$16"}}}

I'm not familiar with Spark syntax, so maybe I'll do something like this: regexp_replace(string A, string B, string C) Replace the part of string A that conforms to Java regular expression B with C.

Why not use Python or Java dict to concatenate strings?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1