Saving a PySpark dataframe to MongoDB gives an error
I'm trying to save a PySpark dataframe to MongoDB from a Google Cloud Dataproc cluster, but it keeps failing with an error.
I'm using Spark 2.4.7, Python 3.7, and the MongoDB Spark connector 2.4.3.
Here is my code:
spark = SparkSession.builder \
    .master("yarn") \
    .appName("demo") \
    .config("spark.mongodb.input.uri",
            "mongodb+srv://my_host:27017/people_db") \
    .config("spark.mongodb.output.uri",
            "mongodb+srv://my_host:27017/people_db") \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12-2.4.3") \
    .getOrCreate()

df = spark.read \
    .format("csv") \
    .options(header=True) \
    .load(csv_path)

# ---------- Some data processing ----------

# This is the block of code that raises the error
df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("collection", "people") \
    .save()
Here is the error message:
Solution 1:
The MongoDB Java driver JAR is not on the classpath. Both MongoDB JARs (the Spark connector and the Java driver) need to be on Spark's classpath, i.e. in the spark/jars directory. I was able to run this both locally and as a Dataproc job by referring to the link below, with the following versions: MongoDB Spark connector 2.12:3.0.1, MongoDB Java driver 3.12, Spark 3.0.2.
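As a rough sketch (not part of the original answer), this is how that setup could look in PySpark, assuming the versions mentioned above and a cluster that can resolve packages from Maven Central; the host name, database, and csv_path are placeholders carried over from the question. Note that the Maven coordinate is written as group:artifact:version, separated by colons, and declaring the connector this way pulls in the MongoDB Java driver as a transitive dependency:

from pyspark.sql import SparkSession

# Sketch based on the versions in Solution 1 (Spark 3.0.x, Scala 2.12, connector 3.0.1).
# If the cluster cannot reach Maven Central, copy the connector and the
# mongo-java-driver JARs into the spark/jars directory instead.
spark = SparkSession.builder \
    .master("yarn") \
    .appName("demo") \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .config("spark.mongodb.input.uri",
            "mongodb+srv://my_host/people_db") \
    .config("spark.mongodb.output.uri",
            "mongodb+srv://my_host/people_db") \
    .getOrCreate()

df = spark.read \
    .format("csv") \
    .options(header=True) \
    .load(csv_path)  # csv_path as in the question (assumed to be defined)

df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("collection", "people") \
    .save()

The placeholder mongodb+srv URIs above are written without a port, since SRV connection strings do not take one. The same package can also be supplied at submission time instead of in the session config, e.g. spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 job.py.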
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow