Import JSON content from CSV file using Spark
Currently, I'm working with the following architecture: I have a DocumentDB database whose data is exported to S3 using DMS (a CDC task). Once this data lands on S3, I need to load it into Databricks.
I'm already able to read the CSV content (each row contains a JSON string), but I don't know how to parse it and insert it into a Databricks table.
Below is the JSON payload that is exported to S3:
{
  "_id": {
    "$oid": "12332334"
  },
  "processed": false,
  "col1": "0000000122",
  "startedAt": {
    "$date": 1635667097000
  },
  "endedAt": {
    "$date": 1635667710000
  },
  "col2": "JFHFGALJF-DADAD",
  "col3": 2.75,
  "created_at": {
    "$date": 1635726018693
  },
  "updated_at": {
    "$date": 1635726018693
  },
  "__v": 0
}
To extract the data into a DataFrame, I'm using the following Spark command:
df = spark.read \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "false") \
    .option("lineSep", "\n") \
    .option("encoding", "ISO-8859-1") \
    .option("quote", '"') \
    .option("escape", "\"") \
    .csv("dbfs:/mnt/s3-data-2/folder_name/LOAD00000001.csv")
Solution 1 [1]
Thank you Alex Ott. As per your suggestion and as per this document, you can use from_json to parse the JSON strings inside your CSV file.
In order to read a JSON string from a CSV file, first read the CSV file into a Spark DataFrame using spark.read.csv("path"), and then parse the JSON string column and convert it to columns using the from_json() function. This function takes a JSON column name as its first argument and a JSON schema as its second argument.
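A minimal sketch of that approach is shown below. It assumes the CSV rows carry the whole JSON document in a single column (called json_col here purely for illustration) and writes to a hypothetical table my_database.my_table; adjust both names to your actual layout. The schema mirrors the DocumentDB extended-JSON payload from the question.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, StringType,
                               BooleanType, DoubleType, LongType, IntegerType)

# Schema matching the extended-JSON payload shown in the question.
# Mongo-style wrapper fields ($oid, $date) become nested structs.
json_schema = StructType([
    StructField("_id", StructType([StructField("$oid", StringType())])),
    StructField("processed", BooleanType()),
    StructField("col1", StringType()),
    StructField("startedAt", StructType([StructField("$date", LongType())])),
    StructField("endedAt", StructType([StructField("$date", LongType())])),
    StructField("col2", StringType()),
    StructField("col3", DoubleType()),
    StructField("created_at", StructType([StructField("$date", LongType())])),
    StructField("updated_at", StructType([StructField("$date", LongType())])),
    StructField("__v", IntegerType()),
])

# Parse the JSON string column into a struct, then flatten the struct
# into top-level columns with select("parsed.*").
parsed_df = (df
             .withColumn("parsed", from_json(col("json_col"), json_schema))
             .select("parsed.*"))

# Persist the parsed result as a Databricks table (name is an assumption).
parsed_df.write.mode("append").saveAsTable("my_database.my_table")
Since the $date fields hold epoch milliseconds, you can convert them to proper timestamps afterwards, for example with (col("created_at.$date") / 1000).cast("timestamp").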
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SaiSakethGuduru-MT |
