Running a Spark Streaming job in Zeppelin throws connection refused 8998 error
I'm working in a virtual machine. I run a Spark Streaming job which I basically copied from a Databricks tutorial.
%pyspark
query = (
    streamingCountsDF
    .writeStream
    .format("memory")        # memory = store in-memory table
    .queryName("counts")     # counts = name of the in-memory table
    .outputMode("complete")  # complete = all the counts should be in the table
    .start()
)
Py4JJavaError: An error occurred while calling o101.start.
: java.net.ConnectException: Call From VirtualBox/127.0.1.1 to localhost:8998 failed on connection exception: java.net.ConnectException:
I checked, and no service is listening on port 8998. I learned that this port is associated with the Apache Livy server, which I am not using. Can someone point me in the right direction?
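For anyone who wants to repeat the port check from inside the notebook rather than from a shell, here is a minimal sketch using only the Python standard library. The helper name `is_port_open` is made up for illustration; the host and port are the ones from the error message above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so no exception handling is needed for a refused connection
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Host and port taken from the error message in the question
    print(is_port_open("localhost", 8998))
```

A result of `False` only confirms that nothing accepts connections on that port; it does not say which component is trying to reach it.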
Solution 1:[1]
OK, so I fixed this issue. First, I added the 'file://' scheme when specifying the input folder. Second, I added an explicit checkpoint location, also with the 'file://' scheme. See the code below:
from pyspark.sql.functions import window

# `spark` and `schema` are assumed to be defined earlier in the notebook
inputFolder = 'file:///home/sallos/tmp/'  # explicit file:// scheme for the local input folder

streamingInputDF = (
    spark
    .readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .csv(inputFolder)
)

streamingCountsDF = (
    streamingInputDF
    .groupBy(
        streamingInputDF.SrcIPAddr,
        window(streamingInputDF.Datefirstseen, "30 seconds"))
    .sum('Bytes')
    .withColumnRenamed("sum(Bytes)", "sum_bytes")
)

query = (
    streamingCountsDF
    .writeStream
    .format("memory")        # store the result in an in-memory table
    .queryName("sumbytes")   # name of the in-memory table
    .outputMode("complete")
    .option("checkpointLocation", "file:///home/sallos/tmp_checkpoint/")  # explicit local checkpoint location
    .start()
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sallos |
