Import JSON content from CSV file using Spark

Currently, I’m working with the following architecture: I have a DocumentDB database whose data is exported to S3 by DMS (a CDC task). Once this data lands on S3, I need to load it into Databricks.

I’m already able to read the CSV content (which contains a lot of JSON strings), but I don't know how to parse it and insert it into a Databricks table.

Below is the JSON payload that is exported to S3:

{
    "_id": {
        "$oid": "12332334"
    },
    "processed": false,
    "col1": "0000000122",
    "startedAt": {
        "$date": 1635667097000
    },
    "endedAt": {
        "$date": 1635667710000
    },
    "col2": "JFHFGALJF-DADAD",
    "col3": 2.75,
    "created_at": {
        "$date": 1635726018693
    },
    "updated_at": {
        "$date": 1635726018693
    },
    "__v": 0
}

To extract the data into a DataFrame, I'm using the following Spark command:

df = spark.read \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "false") \
    .option("lineSep", "\n") \
    .option("encoding", "ISO-8859-1") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .csv("dbfs:/mnt/s3-data-2/folder_name/LOAD00000001.csv")


Solution 1: [1]

Thank you Alex Ott. As per your suggestion, and as per this document, you can use from_json to parse the JSON strings inside the CSV file.

In order to read a JSON string from a CSV file, first read the CSV file into a Spark DataFrame using spark.read.csv("path"), then parse the JSON string column and convert it into columns using the from_json() function. from_json() takes the JSON column as its first argument and the JSON schema as its second, as in the sketch below.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: SaiSakethGuduru-MT