Create dataframe from json string having true false value

I want to create a Spark dataframe from a JSON string in Python without specifying a schema. The JSON is multilevel nested and may contain arrays.

I used the following to create the dataframe, but I get 'Cannot infer Schema':

spark.createDataFrame(jsonStr)

I tried loading the same JSON from a file using the following:

spark.read.option("multiline", "true").json("/path")

This statement worked without any issue and loaded the data into a Spark dataframe.

Is there a similar way to load the data from a JSON variable?

It is fine even if not all the values are normalized.

Edit:

I found that the issue might be due to the true and false (boolean) values present in the JSON: when I try to use createDataFrame, Python treats true and false as variable names.

Is there any way to bypass this? The file also contains true and false. I also tried converting the list (a list of nested dictionaries) to JSON with json.dumps(), but that gives the error:

Can not infer schema for type : <class 'str'>
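
Both symptoms can be reproduced without Spark. The following is a minimal sketch using only the standard-library json module; the variable names are illustrative. It shows that JSON text containing false has to be kept as a string and parsed, and that json.dumps returns a plain str rather than a structure createDataFrame can infer a schema from.

```python
import json

# JSON text kept as a string: `false` is valid JSON but not a Python name,
# so it cannot appear in a bare Python list/dict literal.
json_str = '[{"a": "testA", "b": "testB", "c": false}]'

# json.loads parses the text into Python objects; JSON false becomes False.
records = json.loads(json_str)
print(type(records), records)

# json.dumps goes the other way and returns a plain str, which is the type
# named in the "Can not infer schema" error message.
serialized = json.dumps(records)
print(type(serialized))
```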

Edit 2: Input:

data = [
  {
    "a":"testA",
    "b":"testB",
    "c":false
  }
]

Required output dataframe

a     |  b    |   c 
--------------------
testA | testB | false

I get the required output when loading from the file. The file contains exactly the same content as data.

spark.read.option("multiline", "true").json("/path/test.json")

Also, if the data is a string, I get the error: Can not infer schema for type: <class 'str'>



Solution 1:[1]

If you don't want to load the data from a JSON file, you have to provide a schema for the JSON and use from_json to parse it. Here data is assumed to be the JSON held as a string, since from_json parses a string column:

from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.ArrayType(T.StructType([
    T.StructField('a', T.StringType()),
    T.StructField('b', T.StringType()),
    T.StructField('c', T.BooleanType()),
]))

df = (spark
    .createDataFrame([('dummy',)], ['x'])
    .withColumn('x', F.from_json(F.lit(data), schema))
)
df.show(10, False)
df.printSchema()

+-----------------------+
|x                      |
+-----------------------+
|[{testA, testB, false}]|
+-----------------------+

root
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: boolean (nullable = true)
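
To get from the single array column x to the flat a | b | c layout asked for in the question, the parsed array can be exploded and its struct fields selected. This is a minimal sketch, assuming df is the dataframe built above. The names flatten_parsed and as_json_string are hypothetical helpers, not part of the answer; the second one only reflects the assumption that the value passed to F.lit is JSON text rather than a Python list.

```python
import json


def as_json_string(data):
    """Return data as JSON text; from_json parses a string, not a Python list."""
    return data if isinstance(data, str) else json.dumps(data)


def flatten_parsed(df, column="x"):
    """Explode an array-of-struct column and promote its fields to columns."""
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark.sql import functions as F

    return (
        df
        .select(F.explode(F.col(column)).alias("row"))  # one row per array element
        .select("row.*")                                 # struct fields become columns
    )
```

Calling flatten_parsed(df).show() on the dataframe above should then list a, b and c as separate columns.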

Solution 2:[2]

If your input is a JSON string, you can deserialize it to a list of dictionaries before creating a Spark dataframe:

import json

spark.createDataFrame(json.loads(data))
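
A fuller sketch of this approach, assuming data is a JSON string and spark is an existing SparkSession. All function names are hypothetical. The last function is an alternative that does not come from the answer: spark.read.json also accepts an RDD of JSON strings, so the JSON reader can infer the schema from a variable much as it does from a file.

```python
import json


def load_records(json_str):
    """Parse JSON text into a list of dicts (a single object is wrapped in a list)."""
    parsed = json.loads(json_str)
    return parsed if isinstance(parsed, list) else [parsed]


def dataframe_from_records(spark, json_str):
    """Deserialize first, then let createDataFrame infer the schema."""
    return spark.createDataFrame(load_records(json_str))


def dataframe_from_json_rdd(spark, json_str):
    """Alternative: hand the raw JSON text to the JSON reader via an RDD of strings."""
    rdd = spark.sparkContext.parallelize([json_str])
    return spark.read.json(rdd)
```

With deeply nested input the two routes may not produce the same schema: createDataFrame typically infers nested dictionaries as map columns, while the JSON reader infers structs, so the RDD route is likely closer to what loading from a file gives.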

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution   | Source
-----------------------------
Solution 1 | pltc
Solution 2 | Marco Paruscio