Differentiate SQL NULL and JSON null in a PySpark DataFrame
I have a JSON file with an inconsistent schema, in which some fields may or may not be present in successive rows.
Sample JSON file
{"table":"TABLEA","ID":1,"COLUMN1":283,"COLUMN2":0,"COLUMN3":0}
{"table":"TABLEA","ID":1,"COLUMN1":null,"COLUMN2":null,"COLUMN3":null}
{"table":"TABLEA","ID":1,"COLUMN4":"CLOSE","COLUMN5":"user"}
{"table":"TABLEA","ID":1,"COLUMN5":"user","COLUMN6":355}
{"table":"TABLEA","ID":1,"COLUMN5":"user","COLUMN4":"NOTE"}
{"table":"TABLEA","ID":1,"COLUMN5":"user","COLUMN4":"NOTE"}
The above JSON represents various updates that happened to a particular table. In the above JSON,
- the first event has updates on only those 3 columns
- the second event has null updates for those same 3 columns
- the third event has updates for 2 other columns
Basically, each event contains only the columns that have updates in it. If a column has no update, it won't be present in the event.
Problem
I want to differentiate the nulls that come as part of updates in an event from the nulls that get generated when loading this data into a DataFrame. Since the schema here is dynamic, when I loaded the JSON into a DataFrame and displayed it, this is how it got stored:
+-------+-------+-------+-------+-------+-------+---+------+
|COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6| ID| table|
+-------+-------+-------+-------+-------+-------+---+------+
|    283|      0|      0|   null|   null|   null|  1|TABLEA|
|   null|   null|   null|   null|   null|   null|  1|TABLEA|
|   null|   null|   null|  CLOSE|   user|   null|  1|TABLEA|
|   null|   null|   null|   null|   user|    355|  1|TABLEA|
|   null|   null|   null|   NOTE|   user|   null|  1|TABLEA|
|   null|   null|   null|   NOTE|   user|   null|  1|TABLEA|
+-------+-------+-------+-------+-------+-------+---+------+
Here the second row actually represents null updates that happened to the first 3 columns, whereas for the other rows, since those columns are not part of the event, they were loaded with null values by default.
I want to differentiate the nulls that come in the JSON file from the nulls that are loaded by default because of the schema inconsistency.
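At the raw JSON level the two cases are still distinguishable: a key can be absent, or it can be present with an explicit null. A minimal plain-Python sketch of that distinction (the sample lines are hypothetical, mirroring the events above):

```python
import json

# Hypothetical sample lines mirroring the events above
lines = [
    '{"table":"TABLEA","ID":1,"COLUMN1":283}',
    '{"table":"TABLEA","ID":1,"COLUMN1":null}',
    '{"table":"TABLEA","ID":1,"COLUMN4":"CLOSE"}',
]

MISSING = object()  # sentinel distinguishing "key absent" from "key is JSON null"

for line in lines:
    record = json.loads(line)
    value = record.get("COLUMN1", MISSING)
    if value is MISSING:
        print("COLUMN1: absent")
    elif value is None:
        print("COLUMN1: explicit null")
    else:
        print(f"COLUMN1: {value}")
```

The distinction is lost once Spark merges all the per-line schemas into one, which is exactly the problem below.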
What I tried
I tried a couple of approaches, but nothing worked:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession \
    .builder \
    .appName("Test") \
    .getOrCreate()
applicationId = spark.sparkContext.applicationId
sc = spark.sparkContext
print(sc.getConf().getAll())
input_file_path = "above json file"
print(str(input_file_path))
json_df = spark.read.json(input_file_path)
json_df.show()

# method 1: replace nulls with a sentinel after reading --
# but by then the real nulls and the schema-filled nulls look identical
json_df.withColumn("testcol", F.when(F.isnull('COLUMN1'), F.lit('NaN')).otherwise(F.col('COLUMN1'))).show()

# method 2: check whether the column exists in the schema --
# but after schema merging the column exists for every row
def has_column(df, col):
    try:
        df[col]
        return F.lit(True)
    except Exception:
        return F.lit(False)

json_df.withColumn("testcol", F.when(has_column(json_df, 'COLUMN1'), F.col('COLUMN1')).otherwise(F.lit('NaN'))).show()
Any help would be appreciated. Thanks!
Solution 1:[1]
As the nulls become indistinguishable as soon as the data is read, I would replace the nulls in the file first and then read it.
with open(input_file_path) as f:
newText=f.read().replace(":null",":NaN")
with open(input_file_path, "w") as f:
f.write(newText)
Then json_df.show() should give the table below:
+-------+-------+-------+-------+-------+-------+---+------+
|COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6| ID| table|
+-------+-------+-------+-------+-------+-------+---+------+
| 283.0| 0.0| 0.0| null| null| null| 1|TABLEA|
| NaN| NaN| NaN| null| null| null| 1|TABLEA|
| null| null| null| CLOSE| user| null| 1|TABLEA|
| null| null| null| null| user| 355| 1|TABLEA|
| null| null| null| NOTE| user| null| 1|TABLEA|
| null| null| null| NOTE| user| null| 1|TABLEA|
+-------+-------+-------+-------+-------+-------+---+------+
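This works because JSON parsers that accept the non-standard NaN literal will read the patched values as floating-point NaN rather than null; Spark's JSON reader does so by default (its allowNonNumericNumbers option). Python's json module accepts NaN too, so the effect can be sketched locally with a hypothetical sample line:

```python
import json
import math

# Hypothetical line in the style of the events above
line = '{"table":"TABLEA","ID":1,"COLUMN1":null,"COLUMN2":283}'
patched = line.replace(":null", ":NaN")

record = json.loads(patched)  # json.loads accepts the NaN literal by default

assert math.isnan(record["COLUMN1"])  # explicit null became a detectable NaN
assert record["COLUMN2"] == 283       # other values are untouched
```

One caveat: this only maps cleanly onto numeric columns (note COLUMN1–COLUMN3 now read as doubles in the table above). Explicit nulls in string columns may behave differently after the replacement and are worth checking separately.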
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Anjaneya Tripathi |
