Having trouble converting a PySpark DataFrame to a Scala DataFrame and passing it to a Scala function

I am trying to submit a PySpark application with a Scala library/jar as a dependency. I pass this Scala jar via the --jars parameter when submitting the PySpark job on GCP Dataproc.

In my Python driver program, I have a PySpark DataFrame df. When I check its type, it shows what is expected:

print(type(df)) -> <class 'pyspark.sql.dataframe.DataFrame'>

The Scala jar has a function that takes a Scala Spark DataFrame as input. To pass the PySpark DataFrame df to this Scala function, I use its ._jdf attribute -> df._jdf
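For context, the call pattern looks roughly like the sketch below. The package, object, and method names are placeholders (not the real library), and this requires a live SparkSession with the jar on the classpath, so it is illustrative rather than runnable on its own:

```python
# Hypothetical sketch -- com.example.MyScalaLib and transform() are
# placeholder names, not the actual library being called.
from pyspark.sql import DataFrame

scala_lib = spark._jvm.com.example.MyScalaLib   # reach the Scala object via the py4j gateway
result_jdf = scala_lib.transform(df._jdf)       # pass the underlying Java DataFrame
result = DataFrame(result_jdf, spark)           # wrap the returned Java DataFrame back into PySpark
```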

But I get this error:

  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1266, in _build_args
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1266, in <listcomp>
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'JavaMember' object has no attribute '_get_object_id'

I think this is because df._jdf is not of type spark.sql.DataFrame but of the type below:

  print(type(df._jdf)) -> <class 'py4j.java_gateway.JavaMember'>
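For what it's worth, py4j hands back a JavaMember when a JVM method is referenced but not invoked, much like plain Python returns a bound method object if you leave off the parentheses. A pure-Python analogy (no Spark needed; the class and method names are made up):

```python
# Pure-Python analogy: referencing a method without calling it yields a
# method object, analogous to py4j returning a JavaMember instead of a
# JavaObject. FakeDataFrame/to_java are illustrative names only.
class FakeDataFrame:
    def to_java(self):
        return "underlying Java DataFrame"

fdf = FakeDataFrame()

print(type(fdf.to_java))    # <class 'method'> -- the reference, not the object
print(type(fdf.to_java()))  # <class 'str'>   -- the actual value
```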

Is df._jdf not the correct way to hand a PySpark DataFrame to Scala? Or is there a better alternative way to achieve what I am trying to do?

I am following these sources:

https://diogoalexandrefranco.github.io/scala-code-in-pyspark/

https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/



Source: Stack Overflow, licensed under CC BY-SA 3.0.