'Row column names are ignored when using createDataFrame with schema in pyspark

Column names in a Row are ignored and only columns order is considered when passing schema to createDataFrame. Any way to correct this, or is there a different way to pass data to createDataFrame that respects column names?

schema = T.StructType([
    T.StructField("A", T.StringType(), False),
    T.StructField("B", T.StringType(), False)
])
data = [
    Row(B="b1", A="a1"),
    Row(B="b2", A="a2"),
    Row(B="b3", A="a3"),
]
df = spark.createDataFrame(data, schema=schema)
df.show()
# +---+---+
# |  A|  B|
# +---+---+
# | b1| a1|
# | b2| a2|
# | b3| a3|
# +---+---+

In the above snippet, I would expect that column A contains a1, a2, a3 and column B contains b1, b2, b3.

I know that I can do df = spark.createDataFrame([x.asDict(True) for x in data], schema=schema) but then there is no point in using Row.

ORIGINAL QUESTION:

My original question is a subcase of the above question.

When using df = spark.createDataFrame(data, schema) for a schema containing a StructType, instead of using the column names in the struct Row in data, it only considers the order of the row.

Is there a way to change this (ignore the order, use matching column names)?

In the following example, I would expect to see column AAA contain values AAA1, AAA2, etc. But instead, it will take BBB1, BBB2, etc. because data has BBB in the first place.

    import pyspark.sql.types as T
    import pyspark.sql.functions as F

    schema = T.StructType([
        T.StructField("E", T.StringType(), True),
        T.StructField("C", T.StructType([
            T.StructField("AAA", T.StringType(), True),
            T.StructField("BBB", T.StringType(), True),
            T.StructField("CCC", T.StringType(), True),
            T.StructField("DDD", T.StringType(), True),
        ]), True)
    ])
    data = [
        Row(E="E1", C=Row(BBB="BBB1", DDD="DDD1", CCC="CCC1", AAA="AAA1")),
        Row(E="E2", C=Row(BBB="BBB2", DDD=None, CCC="CCC2", AAA="AAA2")),
        Row(E="E3", C=Row(BBB="BBB3", DDD="DDD3", CCC="CCC3", AAA="AAA3")),
        Row(E="E4", C=None)
    ]
    df = spark.createDataFrame(data, schema)
    df2 = df\
        .withColumn("AAA", F.col("C")["AAA"])\
        .withColumn("BBB", F.col("C")["BBB"])\
        .withColumn("CCC", F.col("C")["CCC"])\
        .withColumn("DDD", F.col("C")["DDD"])
    df2.show()

The result of the show is:

+---+--------------------+----+----+----+----+
|  E|                   C| AAA| BBB| CCC| DDD|
+---+--------------------+----+----+----+----+
| E1|{BBB1, DDD1, CCC1...|BBB1|DDD1|CCC1|AAA1|
| E2|{BBB2, null, CCC2...|BBB2|null|CCC2|AAA2|
| E3|{BBB3, DDD3, CCC3...|BBB3|DDD3|CCC3|AAA3|
| E4|                null|null|null|null|null|
+---+--------------------+----+----+----+----+

EDITS:

Changed the order and title to simplify the question.

pyspark apache-spark-sql

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source

'Row column names are ignored when using createDataFrame with schema in pyspark

Sources

Related Questions