Is it possible to define a recursive DataType in a PySpark DataFrame?
I want to create a schema like this example:
friendSchema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("friends", friendSchema, True)  # recursive reference to the schema being defined
])
I understand the data must be normalized, but I was wondering whether Spark has the functionality to create a schema like the one above. If so, how can one do it? Is it doable using a UDT (user-defined type)?
Solution 1:[1]
Yes, it's possible. What you're looking for is called a nested struct: a StructType schema can itself include StructType fields, which will do what you want. Note that the nesting depth is fixed when the schema is built. For example:
from pyspark.sql.types import StructType, StructField, StringType

# Method of a test class; self.spark is assumed to be an active SparkSession.
def test_build_nested_schema(self):
    internal_struct = StructType([StructField("friend_struct", StringType())])
    friend_schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
        StructField("friends", internal_struct, True)])
    empty_df = self.spark.createDataFrame([], schema=friend_schema)
    empty_df.printSchema()
Which will output:
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- friends: struct (nullable = true)
 |    |-- friend_struct: string (nullable = true)
Solution 2:[2]
I think maybe you should take a step back and rethink your solution.
You are trying to model relationships between friends, and probably the best way to work with that kind of data is to use graphs.
Try reading this: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
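As a rough illustration of the graph approach, the sketch below flattens nested friend records into vertex and edge rows in plain Python. GraphFrames expects a vertex DataFrame with an `id` column and an edge DataFrame with `src` and `dst` columns. The `id` field on each person, the sample data, and the `flatten_friends` helper are assumptions made for this example and are not part of the original question.

```python
from typing import Dict, List, Set, Tuple


def flatten_friends(
    people: List[dict],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split nested person records into vertex rows (id, firstname)
    and edge rows (src, dst)."""
    vertices: Dict[str, str] = {}
    edges: Set[Tuple[str, str]] = set()
    stack = list(people)
    while stack:
        person = stack.pop()
        vertices[person["id"]] = person["firstname"]
        # Each nested friend becomes its own vertex plus one edge.
        for friend in person.get("friends", []):
            edges.add((person["id"], friend["id"]))
            stack.append(friend)
    return sorted(vertices.items()), sorted(edges)


# Hypothetical sample data: Alice -> Bob -> Carol.
people = [
    {"id": "a", "firstname": "Alice", "friends": [
        {"id": "b", "firstname": "Bob", "friends": [
            {"id": "c", "firstname": "Carol"},
        ]},
    ]},
]

vertex_rows, edge_rows = flatten_friends(people)
```

The two lists could then be passed to `spark.createDataFrame` with the column names `["id", "firstname"]` and `["src", "dst"]` to build the inputs for a GraphFrame. Both tables are flat, so no recursive schema is needed.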
Solution 3:[3]
What you are asking for is not possible as stated, because a truly recursive schema would contain infinitely many nested subschemas.
What can be done is to build a schema nested to a fixed depth n with a recursive function:
from pyspark.sql.types import StructType, StructField, StringType

def friendSchema(n):
    if n == 0:
        # Base case: the innermost level has no "friends" field.
        return StructType([
            StructField("firstname", StringType(), True),
            StructField("middlename", StringType(), True)])
    else:
        # Recursive case: nest a schema that is one level shallower.
        return StructType([
            StructField("firstname", StringType(), True),
            StructField("middlename", StringType(), True),
            StructField("friends", friendSchema(n - 1))])
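Because a schema built this way stops at depth n, any data nested more deeply has to be cut off to match it. The following is a minimal sketch of that step in plain Python, assuming each person is a dict whose `friends` key holds a single nested person or `None`. The `truncate_friend` helper and the sample data are hypothetical and only illustrate the idea.

```python
from typing import Optional


def truncate_friend(person: Optional[dict], depth: int) -> Optional[tuple]:
    """Convert a nested dict into nested tuples, keeping at most
    `depth` levels of friends."""
    if person is None:
        return None
    base = (person.get("firstname"), person.get("middlename"))
    if depth == 0:
        # Innermost level: a depth-0 schema has no "friends" field.
        return base
    return base + (truncate_friend(person.get("friends"), depth - 1),)


# Hypothetical sample data, nested three levels deep.
alice = {
    "firstname": "Alice", "middlename": None,
    "friends": {
        "firstname": "Bob", "middlename": "J",
        "friends": {"firstname": "Carol", "middlename": None, "friends": None},
    },
}

row = truncate_friend(alice, 1)
# row == ("Alice", None, ("Bob", "J"))
```

The resulting tuples have the same shape as a schema of depth 1, so Carol is dropped. Choosing n therefore means choosing how much of the recursion to keep.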
Solution 4:[4]
It's not possible, but you can work around it by storing the data as JSON and reading it through a virtual table. This costs extra I/O, but after this step you can create a table from a select on the virtual table:
- Convert the data to JSON (including the recursive part).
- Store it in a temporary table.
- Create a table from a select on the temporary table.
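The first step above can be sketched in plain Python: the recursive part of each record is serialized into a single JSON string, so the row itself stays flat and the column could be an ordinary string column. The helper names `to_row` and `friends_from_row` and the sample data are assumptions made for this example, and writing the rows to a temporary table is not shown.

```python
import json


def to_row(person: dict) -> tuple:
    """Return (firstname, middlename, friends_json), with the
    recursive part stored as one JSON string."""
    friends_json = json.dumps(person.get("friends", []))
    return (person["firstname"], person.get("middlename"), friends_json)


def friends_from_row(row: tuple) -> list:
    """Parse the JSON string column back into nested Python objects."""
    return json.loads(row[2])


# Hypothetical sample data with friends nested to arbitrary depth.
alice = {
    "firstname": "Alice", "middlename": None,
    "friends": [
        {"firstname": "Bob", "middlename": "J", "friends": [
            {"firstname": "Carol", "middlename": None, "friends": []},
        ]},
    ],
}

row = to_row(alice)
```

The trade-off is that the nested structure is opaque to the schema: it has to be parsed again whenever the friends are needed.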
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Shay Nehmad |
| Solution 2 | Alex |
| Solution 3 | Alfilercio |
| Solution 4 | Beny Gj |
