'Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?
If I use Spark standalone cluster manager and have my data distributed in HDFS cluster, how will Spark know that data is located locally on the nodes?



Solution 1:[1]

YARN is a resource manager. It deals with memory and processes, and not with the workings of HDFS or data-locality.

Since Spark can read from HDFS sources, and the namenodes & datanodes take care of all that HDFS block data management outside of YARN, then I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark into YARN?

Solution 2:[2]

the stand alone mode has its own cluster manager/resource manager which will talk to name node for locality. client/driver will put tasks based on the results.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 OneCricketeer
Solution 2 Cobb Von