Is it possible to define a recursive DataType in a PySpark DataFrame?

I want to create a schema like this example:

friendSchema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("friends", friendSchema, True)])

I understand the data should be normalized, but I was wondering whether Spark has the functionality to create a schema like the one above. If so, how can one do it? Is it doable using a UDT?

Solution 1:^[1]

Yes, it's possible, at least to a fixed depth. What you're looking for is called a nested struct: a StructType schema can itself include StructType fields. For example:

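from pyspark.sql.types import StringType, StructField, StructType

def test_build_nested_schema(self):
    # Inner struct used as the type of the "friends" field
    internal_struct = StructType([StructField("friend_struct", StringType())])
    friend_schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
        StructField("friends", internal_struct, True)])
    # self.spark is assumed to be a SparkSession provided by the test class
    empty_df = self.spark.createDataFrame([], schema=friend_schema)
    empty_df.printSchema()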

Which will output:

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- friends: struct (nullable = true)
 |    |-- friend_struct: string (nullable = true)

Documentation link.

Solution 2:^[2]

I think maybe you should take a step back and rethink your solution.

You are trying to model relationships between friends. The best way to work with this is probably a graph.

Try reading this: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
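As a rough illustration of the graph idea above, nested friend records can be flattened into vertex rows and edge rows before they ever reach Spark. This is only a sketch in plain Python: the function name flatten_friends, the record layout, and the "friend" relationship label are assumptions made for the example, and each nested occurrence is treated as a separate person. GraphFrames expects a vertex DataFrame with an "id" column and an edge DataFrame with "src" and "dst" columns, so rows shaped like these could be loaded into such DataFrames.

```python
from itertools import count


def flatten_friends(person, vertices, edges, ids=None):
    """Walk a nested person dict and append vertex and edge rows.

    `person` is a hypothetical record shaped like
    {"firstname": ..., "middlename": ..., "friends": [person, ...]}.
    Returns the id assigned to `person`.
    """
    ids = ids if ids is not None else count()
    person_id = next(ids)
    # Vertex row: (id, firstname, middlename); middlename may be missing
    vertices.append((person_id, person["firstname"], person.get("middlename")))
    for friend in person.get("friends", []):
        friend_id = flatten_friends(friend, vertices, edges, ids)
        # Edge row: (src, dst, relationship)
        edges.append((person_id, friend_id, "friend"))
    return person_id


# Made-up sample data, nested two levels deep
nested = {
    "firstname": "Ann",
    "friends": [
        {"firstname": "Bob", "friends": [{"firstname": "Cid"}]},
        {"firstname": "Dee"},
    ],
}

vertices, edges = [], []
flatten_friends(nested, vertices, edges)
print(vertices)
print(edges)
```

Both lists are flat, so their schemas need no recursion no matter how deep the original nesting goes.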

Solution 3:^[3]

What you are asking for is not possible: a schema that references itself would have infinitely many subschemas.

Nesting to a fixed depth n, however, can be generated with a recursive function:

from pyspark.sql.types import StringType, StructField, StructType

def friendSchema(n):
    # Base case: the innermost level has no "friends" field
    if n == 0:
        return StructType([
            StructField("firstname", StringType(), True),
            StructField("middlename", StringType(), True)])
    # Recursive case: nest a schema that is one level shallower
    else:
        return StructType([
            StructField("firstname", StringType(), True),
            StructField("middlename", StringType(), True),
            StructField("friends", friendSchema(n - 1), True)])

Solution 4:^[4]

It's not possible,

but you can work around it by storing the data as JSON and reading it as a virtual table. I know this costs extra I/O, but after this step you can create a table from a select on the virtual table.

Convert the data to JSON (with your recursion).
Store it in a temporary table.
Create a table from a select on your temporary table.
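The first of the steps above, converting recursive data to JSON, might look like the following in plain Python. This is a sketch only: the record layout and the helper count_people are assumptions for illustration, and the temporary-table steps are not shown. The point is that the serialized value is an ordinary string, which fits a plain string column regardless of nesting depth.

```python
import json


def count_people(person):
    """Recursively count a person and everyone nested under "friends"."""
    return 1 + sum(count_people(f) for f in person.get("friends", []))


# Made-up recursive record; the depth is not fixed in advance
record = {
    "firstname": "Ann",
    "middlename": None,
    "friends": [
        {"firstname": "Bob", "middlename": None,
         "friends": [{"firstname": "Cid", "middlename": None, "friends": []}]},
    ],
}

payload = json.dumps(record)    # one flat string, storable in a string column
restored = json.loads(payload)  # parsed back only when the nesting is needed

print(payload)
print(count_people(restored))
```

The trade-off is that the nested fields are opaque to the schema until the string is parsed again.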

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Shay Nehmad
Solution 2	Alex
Solution 3	Alfilercio
Solution 4	Beny Gj
