How to create a new column in PySpark from existing columns with nullable False
Is it possible to create a new column with "nullable: False" in a PySpark DataFrame, derived from an existing column that contains no null values but is "nullable: True" in the schema? I've been struggling to find an answer, but there's nothing relevant. Any direction or help would be greatly appreciated. Thanks in advance!
Solution 1
You can achieve this by creating a new DataFrame from the old one using an explicit schema, e.g.:
```python
from pyspark.sql.types import StringType

df = spark.createDataFrame([(1, "John Doe", 21), (2, "Simple", 33)], ("id", "name", "age"))
new_schema = df.schema.add("noNulls", StringType(), False)  # 3rd parameter sets the nullable property
df = df.withColumn('noNulls', (df.id + df.age).cast('string'))  # cast to string to match the schema field
new_df = spark.createDataFrame(df.rdd, new_schema)
new_df.printSchema()
```
This returns new_df with the schema:
```
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- noNulls: string (nullable = false)
```
You can read more about StructType in the PySpark documentation.
Note
If you add a column whose value is a literal via the lit function, it will be created with nullable = false by default.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
