Half of messages lost when Spark Streaming (4 concurrent drivers) reads from Kafka and writes to MongoDB

I have set up a docker network with:

  • 4 producer containers (each scraping a different forum) that produce to a Kafka container (1 topic per forum)
  • A Kafka container and Zookeeper container
  • 1 Spark master container and 4 Spark worker containers
  • 4 Spark Streaming driver containers to process the forum posts (1 container per forum, i.e. per Kafka topic); each driver does a spark-submit to the Spark master and runs on 1 Spark worker
  • 1 MongoDB container that the Spark Streaming containers write data to

However, when I checked MongoDB, I realised that around half of the messages for all 4 topics were missing. I only see this issue when running multiple Spark drivers at the same time; it was fine when I was running just 1. What could be a possible reason for this?
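
The driver configuration is not shown here, so the following is only a sketch of one thing worth ruling out, not a confirmed cause. Loss that appears only with several drivers can come from something the drivers share: a Kafka consumer `group.id`, a checkpoint directory, or a MongoDB `_id` scheme in which equal offsets from different topics map to the same document. The topic names, the path layout, the `id_prefix` key and the helper functions below are all hypothetical; only `group.id` is a real Kafka consumer property. The check is plain Python and does not touch Spark, Kafka or MongoDB.

```python
from collections import Counter

# Hypothetical topic names; the setup only states "one topic per forum".
TOPICS = ["forum_a", "forum_b", "forum_c", "forum_d"]


def driver_settings(topic):
    """Build per-driver settings in which every identifier is derived from the topic."""
    return {
        "topic": topic,
        # Kafka consumer group (`group.id` is the standard consumer property).
        "group.id": f"spark-{topic}",
        # Checkpoint directory; this path layout is an assumption.
        "checkpoint_dir": f"/checkpoints/{topic}",
        # Prefix for MongoDB document ids, so that equal Kafka offsets in
        # different topics cannot produce the same _id.
        "id_prefix": f"{topic}:",
    }


def shared_values(settings_list, keys=("group.id", "checkpoint_dir", "id_prefix")):
    """Return {key: [values used by more than one driver]} for the given keys."""
    clashes = {}
    for key in keys:
        counts = Counter(s[key] for s in settings_list)
        dupes = sorted(value for value, n in counts.items() if n > 1)
        if dupes:
            clashes[key] = dupes
    return clashes


def document_id(settings, partition, offset):
    """Compose an _id that is unique across topics, partitions and offsets."""
    return f"{settings['id_prefix']}{partition}:{offset}"


if __name__ == "__main__":
    settings = [driver_settings(t) for t in TOPICS]
    # An empty dict means no identifier is shared between drivers.
    print(shared_values(settings))
```

If a check like this reports a clash for the real settings of the four drivers, giving each driver its own value is a cheap experiment before looking at resource limits or the write path.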



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
