Is it possible to define a recursive DataType in a PySpark DataFrame?

I want to create a schema like this example:

friendSchema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("friends", friendSchema, True)])

I understand the data must be normalized but I was wondering if Spark has the functionality to create a schema like the above. If so, how can one do it? Is it doable using UDT?



Solution 1:[1]

Yes, it's possible. What you're looking to do is called a nested struct. A StructType schema can itself include StructType fields, which will do what you want. So for example:

from pyspark.sql.types import StructType, StructField, StringType

def test_build_nested_schema(self):
    # Inner struct that describes a single friend.
    internal_struct = StructType([StructField("friend_struct", StringType())])
    friend_schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
        StructField("friends", internal_struct, True)])
    # self.spark is assumed to be a SparkSession provided by the test class.
    empty_df = self.spark.createDataFrame([], schema=friend_schema)
    empty_df.printSchema()

Which will output:

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- friends: struct (nullable = true)
 |    |-- friend_struct: string (nullable = true)


Solution 2:[2]

I think maybe you should take a step back and rethink your solution.

You are trying to model relationships between friends, and the best way to work with that kind of data is probably a graph.

Try reading this: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
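
As a rough sketch of what the graph approach means for the data, the nested records can be flattened into vertex rows and edge rows. GraphFrames expects a vertex table with an "id" column and an edge table with "src" and "dst" columns. The input shape below (each person holding a "friends" list), the function name to_graph_rows and the integer ids are assumptions made for illustration. It is plain Python, so the resulting rows would still need to be loaded into DataFrames.

```python
def to_graph_rows(root):
    """Flatten one nested person record into vertex rows and edge rows."""
    vertices, edges = [], []
    stack = [(root, None)]  # (person, id of the person who listed them)
    while stack:
        person, parent_id = stack.pop()
        person_id = len(vertices)  # hypothetical surrogate key
        vertices.append({
            "id": person_id,
            "firstname": person["firstname"],
            "middlename": person.get("middlename"),
        })
        if parent_id is not None:
            edges.append({"src": parent_id, "dst": person_id})
        # "friends" is assumed to be a list of nested person records.
        for friend in person.get("friends") or []:
            stack.append((friend, person_id))
    return vertices, edges


# Hypothetical sample data: Ann lists Bob and Dee, and Bob lists Cid.
sample = {
    "firstname": "Ann", "middlename": None,
    "friends": [
        {"firstname": "Bob", "middlename": None,
         "friends": [{"firstname": "Cid", "middlename": None}]},
        {"firstname": "Dee", "middlename": None},
    ],
}
vertices, edges = to_graph_rows(sample)
```

Every nested occurrence becomes its own vertex here. If the same person can appear under several friends, the ids would have to come from a real key instead of a counter.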

Solution 3:[3]

What you are asking for is not possible as written: a truly recursive schema would contain infinitely many nested subschemas.

What you can do is build a schema nested to a fixed depth n with a recursive function:

from pyspark.sql.types import StructType, StructField, StringType

def friendSchema(n):
    # Fields shared by every level of the schema.
    fields = [
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True)]
    if n > 0:
        # Nest one more level, until the depth budget n is used up.
        fields.append(StructField("friends", friendSchema(n - 1)))
    return StructType(fields)
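
A depth-limited schema like friendSchema(n) needs a value for n. One possible way to choose it is to measure how deeply the records themselves are nested. The helper below is a sketch in plain Python. Its name nesting_depth and the sample record are assumptions, and each record is assumed to hold at most one nested "friends" record, as in the schema above.

```python
def nesting_depth(person):
    """Count how many levels of nested "friends" a record contains."""
    depth = 0
    current = person.get("friends")
    while current is not None:
        depth += 1
        current = current.get("friends")
    return depth


# Hypothetical record with two levels of nesting below the root.
record = {
    "firstname": "Ann", "middlename": None,
    "friends": {
        "firstname": "Bob", "middlename": None,
        "friends": {"firstname": "Cid", "middlename": None},
    },
}
n = nesting_depth(record)
```

Taking the maximum of nesting_depth over all records gives a depth large enough for the whole dataset.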

Solution 4:[4]

It's not possible, but you can work around it by storing the data as JSON and reading it as a virtual table. I know this costs extra I/O, but after this step you can create a table from a select on the virtual table.

  • Convert the data to JSON (with your recursion).
  • Store it in a temporary table.
  • Create a table from a select on your temporary table.
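
The first step in the list above can be sketched with Python's standard json module. A recursive record serializes to a single string, and a string value needs no recursive schema. The sample record and variable names are hypothetical, and storing and querying the string in Spark is left out.

```python
import json

# Hypothetical recursive record: friends can contain friends, to any depth.
person = {
    "firstname": "Ann",
    "middlename": None,
    "friends": [
        {"firstname": "Bob", "middlename": None,
         "friends": [{"firstname": "Cid", "middlename": None, "friends": []}]},
    ],
}

# Serialize the whole tree into one string value.
encoded = json.dumps(person)

# Parsing the string restores the full nested structure.
decoded = json.loads(encoded)
```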

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Shay Nehmad
Solution 2    Alex
Solution 3    Alfilercio
Solution 4    Beny Gj