Having trouble converting a PySpark DataFrame to a Scala DataFrame and passing it to a Scala function
I am trying to submit a PySpark application with a Scala library jar as a dependency. I pass this jar via the --jars parameter when submitting the PySpark job on GCP Dataproc.
In my Python driver program, I have a PySpark DataFrame df. When I check its type, it shows what is expected:
print(type(df)) -> <class 'pyspark.sql.dataframe.DataFrame'>
The Scala jar has a function that takes a Scala Spark DataFrame as input. To pass the PySpark DataFrame df to this Scala function, I use its ._jdf attribute: df._jdf
But I get this error:
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1266, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1266, in <listcomp>
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'JavaMember' object has no attribute '_get_object_id'
I think this is because df._jdf is not of type spark.sql.DataFrame but is instead of the type below:
print(type(df._jdf)) -> <class 'py4j.java_gateway.JavaMember'>
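To illustrate why the type surprised me: py4j's JavaMember represents a member (method) of a Java object that has been looked up but not yet called, much like a bound method in plain Python. A minimal Python-only analogy (the class and method names here are made up for illustration):

```python
class FakeDataFrame:
    """Stand-in for a py4j-wrapped object; jdf plays the role of a member."""

    def jdf(self):
        # In the real case this would return the underlying JVM DataFrame.
        return "underlying JVM DataFrame"


f = FakeDataFrame()

# Referencing the member without calling it yields the member object itself,
# analogous to py4j handing back a JavaMember instead of a JavaObject.
print(type(f.jdf))    # <class 'method'>

# Calling it yields the actual value you wanted to pass along.
print(type(f.jdf()))  # <class 'str'>
```

So getting a JavaMember suggests that somewhere a method is being referenced rather than called, or that the attribute being accessed is not the plain JavaObject one would expect.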
Is df._jdf not the correct way to pass a PySpark DataFrame to Scala, or is there a better alternative way to achieve what I am trying to do?
I am following this source:
https://diogoalexandrefranco.github.io/scala-code-in-pyspark/
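For reference, the interop pattern I am trying to follow looks roughly like the sketch below. The Scala object and method names (com.example.MyTransforms.transform) are placeholders for the ones in my jar, not real APIs:

```python
def call_scala_transform(spark, df):
    """Sketch: pass a PySpark DataFrame's underlying JVM DataFrame to a
    hypothetical Scala function exposed by a jar on --jars, e.g.

        object MyTransforms { def transform(df: DataFrame): DataFrame }

    and wrap the returned JVM DataFrame back into a PySpark DataFrame.
    """
    # Local import so this module can be inspected without a Spark install.
    from pyspark.sql import DataFrame

    # df._jdf should be the underlying JVM DataFrame (a py4j JavaObject),
    # which py4j can marshal as an argument to the JVM call.
    jvm_result = spark._jvm.com.example.MyTransforms.transform(df._jdf)

    # Wrap the JVM DataFrame back into a PySpark DataFrame.
    # (df.sql_ctx on Spark 3.0/3.1, which matches py4j 0.10.9;
    # newer versions accept a SparkSession here.)
    return DataFrame(jvm_result, df.sql_ctx)
```

This is the pattern I understood from the article above; I would expect df._jdf at the call site to be a JavaObject, not a JavaMember.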
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow