'Row column names are ignored when using createDataFrame with schema in pyspark
Column names in a Row are ignored and only columns order is considered when passing schema to createDataFrame. Any way to correct this, or is there a different way to pass data to createDataFrame that respects column names?
schema = T.StructType([
T.StructField("A", T.StringType(), False),
T.StructField("B", T.StringType(), False)
])
data = [
Row(B="b1", A="a1"),
Row(B="b2", A="a2"),
Row(B="b3", A="a3"),
]
df = spark.createDataFrame(data, schema=schema)
df.show()
# +---+---+
# | A| B|
# +---+---+
# | b1| a1|
# | b2| a2|
# | b3| a3|
# +---+---+
In the above snippet, I would expect that column A contains a1, a2, a3 and column B contains b1, b2, b3.
I know that I can do df = spark.createDataFrame([x.asDict(True) for x in data], schema=schema) but then there is no point in using Row.
ORIGINAL QUESTION:
My original question is a subcase of the above question.
When using df = spark.createDataFrame(data, schema) for a schema containing a StructType, instead of using the column names in the struct Row in data, it only considers the order of the row.
Is there a way to change this (ignore the order, use matching column names)?
In the following example, I would expect to see column AAA contain values AAA1, AAA2, etc. But instead, it will take BBB1, BBB2, etc. because data has BBB in the first place.
import pyspark.sql.types as T
import pyspark.sql.functions as F
schema = T.StructType([
T.StructField("E", T.StringType(), True),
T.StructField("C", T.StructType([
T.StructField("AAA", T.StringType(), True),
T.StructField("BBB", T.StringType(), True),
T.StructField("CCC", T.StringType(), True),
T.StructField("DDD", T.StringType(), True),
]), True)
])
data = [
Row(E="E1", C=Row(BBB="BBB1", DDD="DDD1", CCC="CCC1", AAA="AAA1")),
Row(E="E2", C=Row(BBB="BBB2", DDD=None, CCC="CCC2", AAA="AAA2")),
Row(E="E3", C=Row(BBB="BBB3", DDD="DDD3", CCC="CCC3", AAA="AAA3")),
Row(E="E4", C=None)
]
df = spark.createDataFrame(data, schema)
df2 = df\
.withColumn("AAA", F.col("C")["AAA"])\
.withColumn("BBB", F.col("C")["BBB"])\
.withColumn("CCC", F.col("C")["CCC"])\
.withColumn("DDD", F.col("C")["DDD"])
df2.show()
The result of the show is:
+---+--------------------+----+----+----+----+
| E| C| AAA| BBB| CCC| DDD|
+---+--------------------+----+----+----+----+
| E1|{BBB1, DDD1, CCC1...|BBB1|DDD1|CCC1|AAA1|
| E2|{BBB2, null, CCC2...|BBB2|null|CCC2|AAA2|
| E3|{BBB3, DDD3, CCC3...|BBB3|DDD3|CCC3|AAA3|
| E4| null|null|null|null|null|
+---+--------------------+----+----+----+----+
EDITS:
Changed the order and title to simplify the question.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
