How to create a new column in PySpark from existing columns with nullable False

Is it possible to create a new column in a PySpark DataFrame with "nullable: False", derived from an existing column that contains no null values but is marked "nullable: True" in the schema? I've been struggling to find an answer, but there's nothing relevant. Any direction or help would be appreciated. Thanks in advance!



Solution 1:[1]

You can achieve this by creating a new DataFrame from the old one using an explicit schema, e.g.:

from pyspark.sql.types import StringType
from pyspark.sql.functions import lit

df = spark.createDataFrame([(1, "John Doe", 21), (2, "Simple", 33)], ("id", "name", "age"))

new_schema = df.schema.add("noNulls", StringType(), False) # the 3rd parameter sets the nullable property

df = df.withColumn('noNulls', (df.id + df.age).cast(StringType())) # cast to string so the values match the StringType declared in new_schema

new_df = spark.createDataFrame(df.rdd, new_schema)

new_df.printSchema()

This returns new_df with the schema:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- noNulls: string (nullable = false)

You can read more about StructType in the PySpark documentation.

Note

If you add a column whose value is a literal (using the lit function), it will be created with nullable = false by default.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
