Running a Spark Streaming job in Zeppelin throws a "connection refused" error on port 8998

I'm working in a virtual machine, running a Spark Streaming job that I essentially copied from a Databricks tutorial. Starting the query fails with the error shown after the code:


    %pyspark
    
    query = (
      streamingCountsDF
        .writeStream
        .format("memory")        # memory = store in-memory table 
        .queryName("counts")     # counts = name of the in-memory table
        .outputMode("complete")  # complete = all the counts should be in the table
        .start()
    )
    
    Py4JJavaError: An error occurred while calling o101.start.
    : java.net.ConnectException: Call From VirtualBox/127.0.1.1 to localhost:8998 failed on connection exception: java.net.ConnectException:

I checked, and no service is listening on port 8998. I learned that this port is associated with the Apache Livy server, which I am not using. Can someone point me in the right direction?
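For anyone who wants to repeat that port check from Python rather than from the shell, here is a minimal sketch. The helper name `is_port_open` is hypothetical and not part of Spark, Zeppelin, or Livy. It only reports whether something accepts TCP connections on the port, not which service it is.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
```

Calling `is_port_open("localhost", 8998)` on the machine from the question would be expected to return `False`, matching the observation that nothing is listening there.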



Solution 1:[1]

I fixed this with two changes. First, I prefixed the input folder path with 'file://'. Second, I set an explicit checkpoint location, also with the 'file://' prefix. See the code below:

    from pyspark.sql.functions import window

    # Explicit file:// scheme so the path is read from the local filesystem
    inputFolder = 'file:///home/sallos/tmp/'

    # `schema` is assumed to be defined earlier in the notebook
    streamingInputDF = (
      spark
        .readStream
        .schema(schema)
        .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
        .csv(inputFolder)
    )

    streamingCountsDF = (
      streamingInputDF
        .groupBy(
          streamingInputDF.SrcIPAddr,
          window(streamingInputDF.Datefirstseen, "30 seconds"))
        .sum('Bytes').withColumnRenamed("sum(Bytes)", "sum_bytes")
    )

    query = (
      streamingCountsDF
        .writeStream
        .format("memory")
        .queryName("sumbytes")
        .outputMode("complete")
        .option("checkpointLocation", "file:///home/sallos/tmp_checkpoint/")  # explicit local checkpoint directory
        .start()
    )
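A likely explanation, stated here as an assumption rather than something confirmed in the question: paths without a scheme are resolved against Hadoop's default filesystem, so if that setting pointed at localhost:8998, both the input folder and the default checkpoint directory would trigger the refused connection. The sketch below shows one way to build the 'file://' form from an absolute local path using only the Python standard library. The helper name `local_uri` is hypothetical.

```python
from pathlib import PurePosixPath


def local_uri(path: str) -> str:
    """Return a file:// URI for an absolute POSIX path.

    Raises ValueError if the path is relative.
    """
    return PurePosixPath(path).as_uri()
```

For example, `local_uri("/home/sallos/tmp_checkpoint/")` gives `file:///home/sallos/tmp_checkpoint` (the trailing slash is dropped), which could then be passed as the `checkpointLocation` option.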

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sallos