Why does reading a JSON file result in all records going to _corrupt_record in PySpark?

I am reading data from an API call, and the data is JSON-like, as shown below:

    {'success': True, 'errors': [], 'requestId': '151a2#fg', 'warnings': [], 'result': [{'id': 10322433, 'name': 'sdfdgd', 'desc': '', 'createdAt': '2016-09-20T13:48:58Z+0000', 'updatedAt': '2020-07-16T13:08:03Z+0000', 'url': 'https://eda', 'subject': {'type': 'Text', 'value': 'Register now'}, 'fromName': {'type': 'Text', 'value': 'ramjdn fg'}, 'fromEmail': {'type': 'Text', 'value': '[email protected]'}, 'replyEmail': {'type': 'Text', 'value': '[email protected]'}, 'folder': {'type': 'Folder', 'value': 478, 'folderName': 'sjha'}, 'operational': False, 'textOnly': False, 'publishToMSI': False, 'webView': False, 'status': 'approved', 'template': 1031, 'workspace': 'Default', 'isOpenTrackingDisabled': False, 'version': 2, 'autoCopyToText': True, 'preHeader': None}]}

Now, when I create a dataframe out of this data using the code below:

    df = spark.read.json(sc.parallelize([data]))

I am getting only one column, _corrupt_record; below is the dataframe output I am getting. I have tried setting the multiline option to true, but I am still not getting the desired output.

    +--------------------+
    |     _corrupt_record|
    +--------------------+
    |{'id': 12526, 'na...|
    +--------------------+

The expected output is a dataframe with the JSON exploded into separate columns, like id as one column, name as another, and so on.

I have tried a lot of things but am not able to fix this.
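
One likely cause, for context: the payload shown above is a Python dict (note the single quotes, True, and None), not a JSON string, so when Spark stringifies each RDD element it gets invalid JSON and routes every record to _corrupt_record. Below is a minimal sketch of one workaround, assuming data is the dict returned by the API client; the session setup is included only to make the snippet self-contained:

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # 'data' is assumed to be the Python dict returned by the API client,
    # e.g. data = response.json(). Its string form uses single quotes and
    # Python literals (True/None), which is not valid JSON.

    # Serialising the dict back into a proper JSON string first lets
    # spark.read.json parse it instead of flagging it as corrupt.
    df = spark.read.json(sc.parallelize([json.dumps(data)]))
    df.printSchema()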



Solution 1:[1]

I made a couple of changes and it worked.

  1. I needed to define a custom schema (see the sketch after this list).

  2. Then I used this bit of code:

    # 'items' holds the records pulled from the API response,
    # and 'schema' is the custom schema defined in step 1.
    data = sc.parallelize([items])
    df = spark.createDataFrame(data, schema=schema)


And it worked.
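
For reference, here is a minimal sketch of what steps 1 and 2 might look like end to end. The field list covers only a few of the fields from the sample payload, and the names data, items, and schema are assumptions matching the snippets above; adjust them to the real response:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, LongType, BooleanType)

    spark = SparkSession.builder.getOrCreate()

    # Step 1: a custom schema for one record of the 'result' array.
    # Only a subset of the fields is listed here for brevity.
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
        StructField("createdAt", StringType(), True),
        StructField("status", StringType(), True),
        StructField("operational", BooleanType(), True),
    ])

    # Step 2: 'data' is assumed to be the dict from the API call,
    # so the records to load are the entries of data['result'].
    items = data["result"]
    df = spark.createDataFrame(items, schema=schema)
    df.show()

With an explicit schema, Spark skips inference entirely, and any field missing from a record simply comes through as null.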

If there is a more optimized solution to this, please feel free to share.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: joanis