Why would anybody choose Flink over Spark? [closed]

I see Spark as superior to Flink. Below is my research.

  1. I see that most of the features of Spark are covered in Flink, except for Spark's fair scheduling. I tried googling and going through the Flink documentation, but had no luck.
  2. Also, if you look at GitHub, Apache Spark has almost double the popularity (number of stars and forks) compared to Flink. So I am curious why Flink doesn't have as much popularity as Spark.
  3. I also see that the connectors written for Flink are far fewer and less maintained than the connectors for Spark (e.g. MongoDB). Does this mean Flink is yet to mature and gain market traction?

The answers to the above will help us decide on the appropriate technology.

Edit-1

I am giving more input after having read some answers here.

  1. This will mostly be used for batch processing. Real-time streaming may be 10% of the use cases.

  2. What's more important for me is community support when we run into issues. We can't keep scratching our heads for weeks, if not months, unable to handle issues -- this is where the GitHub stars influence my decision a lot. [priority 1]

  3. The deployment will be in the cloud, so cost is super important. Mostly we want to run the cluster with 25% of the nodes as spot instances (because of cost) [priority 2]. If the business incurs more cost but keeps running, we are OK. But I don't want to fall into the cost-optimization trap and end up denting the business.

  4. Fair scheduling is super important. I can't starve one customer just because another customer is hogging the cluster resources and won't release them for several hours.

  5. One more concern I have: most new/emerging technologies support Spark first (e.g. Delta Lake). So I am thinking: even if I pick Flink and Flink is really performant -- what's the point? Will I end up writing connectors for everything rather than concentrating on writing business logic?

Note that the underlying database is MongoDB, which can't be fully consumed by this processing. Headroom should be left for real-time microservices to act, meaning the Spark/Flink cluster will be limited in size.
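For context on the fair-scheduling point above: in Spark this is a configuration switch rather than a missing feature. A minimal sketch follows (the pool names and file path are illustrative, not prescribed):

```
# spark-defaults.conf
spark.scheduler.mode               FAIR
spark.scheduler.allocation.file    /path/to/fairscheduler.xml
```

```xml
<!-- fairscheduler.xml: two illustrative per-customer pools -->
<allocations>
  <pool name="customerA">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="customerB">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

Jobs submitted from a given thread are then routed to a pool with sc.setLocalProperty("spark.scheduler.pool", "customerA"), so one customer's long-running jobs cannot monopolize the cluster within a single Spark application.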

Edit-2

So the question is: why would anybody pick Apache Flink over Spark? Clearly I am failing to see the important value-add of Flink. Can anybody help, with pointers to evidence?



Solution 1:[1]

  1. At the beginning, I would consider why you need to build a data processing platform -- will it be batch processing, real-time processing, or both? If batch, that's a plus for Apache Spark; if real-time, a plus for Apache Flink; if both, a minor plus for Apache Spark.

  2. The second element is which API you would like to use. With Apache Spark, I definitely prefer the Scala API, but it also has a very well documented Python API, a.k.a. PySpark. I mention this because it is much easier to find people on the market who know Python than Scala/Java. Of course, Apache Flink also has a Python API, but it is much less documented than PySpark.

  3. The third element is that you will find more people on the market who are familiar with Apache Spark than with Apache Flink. If you would like to expand your team, it will be easier to find new people already familiar with the technology.

  4. Platform cost. On-premise you don't have to count every dollar of consumption, but if you want to move the platform to the cloud or build it there, you have to recalculate the cost. This mostly depends on point 1, i.e. whether it is batch or real-time. Batch has the plus that you don't need a 24/7 environment up and running: you only run the Spark cluster while you are processing data, and then the environment is shut down.

Here is a link to my blog, where you may find other Spark topics that interest you: https://bigdata-etl.com/articles/big-data/apache-spark/

Solution 2:[2]

Disclaimer: I'm a Flink committer and I work on Flink at Ververica. It's been quite some time since I worked directly with Spark.

I recommend watching this talk from the Flink Forward conference, where Regina Chen from Goldman Sachs describes how they got significantly better performance and reduced costs by switching to Flink: Dynamically Generated Flink Jobs at Scale.

As for fair scheduling, Flink doesn't have anything exactly like that. Most of the Flink community's related efforts over the past few releases have focused on better support for containerized, per-application deployments and elastic scaling, rather than session clusters. The adaptive batch scheduler coming in Flink 1.15 might be of interest, for example.
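For reference, enabling the adaptive batch scheduler mentioned above is a configuration change rather than an API change. A minimal sketch of flink-conf.yaml (key names as documented for Flink 1.15; verify against your version, and the max-parallelism value is illustrative):

```yaml
# flink-conf.yaml -- adaptive batch scheduler (Flink 1.15)
jobmanager.scheduler: AdaptiveBatch
# The adaptive batch scheduler requires all-blocking shuffles:
execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING
# -1 lets the scheduler decide each vertex's parallelism from data volume:
parallelism.default: -1
jobmanager.adaptive-batch-scheduler.max-parallelism: 128
```

The scheduler then sizes each job vertex based on the amount of data it actually has to process, which helps on clusters whose capacity varies, such as ones partly built from spot instances.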

Solution 3:[3]

As always, choosing a technology depends on the best fit for the business problem you're trying to solve. I can't answer that based on the original question. Are you interested in batch or streaming data? Is latency important? Do you want to perform stateful or stateless processing? How about exactly-once state consistency? Which sources and sinks do you need to connect to? Do you want to write your logic in Java, Scala, SQL, Python, or something else? These are just some of the questions you need to consider.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Paweł Cieśla
Solution 2
Solution 3 Martijn Visser