'Spark Regex extraction
Have a string below in a table. The content inside messageBody is a JSON string. How to read using a Spark and extract the JSON inside messageBody
Input Data:
{
"audit_id": "",
"audit_name": "GFSVpFeox/KrjEpFkIELEgltPGcqVU7/I0Oh9iVfdWA=",
"audit_info": "ingest-eventss",
"messageBody": "{\"Id\":\"8607379a-348b-4fdd-909e-80b85ac402d1\",\"EventId\":31,\"EventName\":\"LandingPage\",\"TriggerId\":38,\"TriggerName\":\"Agent.StartInterview\",\"TopicId\":5,\"TopicName\":\"businessevents.data\",\"SourceAppId\":22,\"SourceAppName\":\"TEST\",\"EventCorrelationId\":\"e3f091d9-86cf-4516-a173-22d891e1f20a\",\"Environment\":\"en1\",\"Timestamp\":\"2022-04-15T20:11:48.9505708Z\",\"Detail\":{\"Data\":{\"LineContent\":\"Business\",\"ReferenceNumber\":\"6834555\"}}},
"partitionKey": null,
"replyTo": null
}
Expected Output:
| audit_info | messageBody |
|---|---|
| ingest-eventss | {"Id":"8607379a-348b-4fdd-909e-80b85ac402d1","EventId":31,"EventName":"LandingPage","TriggerId":38,"TriggerName":"Agent.StartInterview","TopicId":5,"TopicName":"businessevents.data","SourceAppId":22,"SourceAppName":"TEST","EventCorrelationId":"e3f091d9-86cf-4516-a173-22d891e1f20a","Environment":"en1","Timestamp":"2022-04-15T20:11:48.9505708Z","Detail":{"Data":{"LineContent":"Business","ReferenceNumber":"6834555"}}} |
Need to do this in Spark 3. Any regexp_extract or Split function ? the Split seems hard as the delimiter : is inside the json message as well.
Solution 1:[1]
You might need this, but it's not recommended because it's not maintainable
find:
{([^\\]*)"{\\"Id\\":\\"([^\\]*)\\",\\"EventId\\":([^,]*),\\"EventName\\":\\"([^\\]*)\\",\\"TriggerId\\":([^,]*),\\"TriggerName\\":\\"([^\\]*)\\",\\"TopicId\\":([^,]*),\\"TopicName\\":\\"([^\\]*)\\",\\"SourceAppId\\":([^,]*),\\"SourceAppName\\":\\"([^\\]*)\\",\\"EventCorrelationId\\":\\"([^\\]*)\\",\\"Environment\\":\\"([^\\]*)\\",\\"Timestamp\\":\\"([^\\]*)\\",\\"Detail\\":{\\"([^\\]*)\\":{\\"LineContent\\":\\"([^\\]*)\\",\\"ReferenceNumber\\":\\"([^\\]*)\\"}}}([^}]*)}
replace:
{"Id":"$2","EventId":$3,"EventName":"$4","TriggerId":$5,"TriggerName":"$6","TopicId":$7,"TopicName":"$8","SourceAppId":$9,"SourceAppName":"($10)","EventCorrelationId":"$11","Environment":"$12","Timestamp":"$13","Detail":{"$14":{"LineContent":"$15","ReferenceNumber":"$16"}}}
I'm not familiar with Spark syntax, so maybe I'll do something like this: regexp_replace(string A, string B, string C) Replace the part of string A that conforms to Java regular expression B with C.
Why not use Python or Java dict to concatenate strings?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
