'Python kafka module when used with pyspark causes 'ModuleNotFound' error?
I am able to submit and run a pyspark script I wrote.
Command to Run the script pyspark_client.py:
❯ clear;/opt/spark/bin/pyspark < pyspark_client.py
I am also able to run a python-kafka script separately without any issues. This means I have the module installed.
But the issue arises when i combine both the scripts and submit it to pyspark.
from pyspark import SparkContext
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import *
from json import dumps
from kafka import KafkaProducer
It cannot find the python-kafka module. The error is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
ModuleNotFoundError: No module named 'kafka'
All I want to do is write the counts of tweets in each window to a kafka topic. The way I get the topic and count from a window is that I iterate through rows of the the window df.
topic,count = row.__getitem__("hashtag").__getitem__("text"),row.__getitem__("count")
Minimum reproducable code for error:
from pyspark import SparkContext
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import *
from json import dumps
from kafka import KafkaProducer
print("Hello")
Version details:
- kafka-python==2.0.2
- pyspark==3.2.1
Solution 1:[1]
Pyspark has its own Kafka support, that doesn't depend on kafka-python module. You should remove it, along with the import. You should also be using spark-submit pyspark_client.py to run the code, not input redirection into the pyspark interpreter.
Alternatively, assuming you're using tweepy, you don't need Spark to produce data to Kafka
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | OneCricketeer |
