What is the difference between running a PySpark program with and without a cluster?

I have a program that contains a few functions that use PySpark (the rest is plain Python).

The portion of my code that uses PySpark:

import time

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext.getOrCreate()  # the SparkContext used below

# X is a pandas DataFrame produced by the plain-Python part of the program;
# write it out as a space-separated text file (no header, no index)
X.to_csv(r'first.txt', header=False, index=False, sep=' ', mode='a')

# load the dataset back as a NumPy array
rows = np.loadtxt('first.txt')

# distribute the rows and wrap them in a distributed matrix
rows = sc.parallelize(rows)
mat = RowMatrix(rows)

start_time = time.time()  # to measure the execution time of the call below

# compute the SVD, keeping the top 20 singular values and computing U
svd = mat.computeSVD(20, computeU=True)

example_one = time.time() - start_time
print("---Example one : %s seconds ---" % example_one)

first.txt is a text file containing a 2346x27 matrix; each line looks like:

0.0 0.0 ... 0.0 0.0 0.06664409020350408 0.0 0.0 0.0 0.0 0.0 .... 0 0.0 0.0

Is there any difference between running my program on a cluster (such as YARN) and running it on my own machine with the plain python command? And if so, what are the differences?



Solution 1:[1]

There are no differences in terms of the result you will get.
Depending on your workload, you might have resource issues when running locally.
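
For context, when you launch the script with the plain python command you are in Spark's local mode: the driver and all task threads live in a single JVM on your machine, so the SVD can only use that machine's cores and memory. A minimal sketch of that setup (the app name is an illustrative assumption, not taken from the question):

from pyspark.sql import SparkSession

# Local mode: everything runs in one JVM on this machine.
# local[*] means one task thread per available CPU core.
spark = (SparkSession.builder
         .appName("svd-local")      # illustrative name
         .master("local[*]")
         .getOrCreate())

sc = spark.sparkContext  # the same `sc` the question's snippet relies on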

Spark lets you plug in a resource manager (such as YARN) to scale your application: the driver requests executors from the resource manager and distributes the work across them.
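
As a rough sketch of what changes on a cluster, the code itself stays the same; only the master and the resource settings differ (the executor count and memory below are illustrative assumptions, not values from the question):

from pyspark.sql import SparkSession

# Cluster mode: the driver asks YARN for executors, and the tasks
# (e.g. the row operations behind computeSVD) run on those executors.
# This assumes HADOOP_CONF_DIR points at the cluster's configuration.
spark = (SparkSession.builder
         .appName("svd-on-yarn")                   # illustrative name
         .master("yarn")
         .config("spark.executor.instances", "4")  # illustrative sizing
         .config("spark.executor.memory", "2g")
         .getOrCreate())

sc = spark.sparkContext

In practice the master and resources are usually passed on the command line, e.g. spark-submit --master yarn --num-executors 4 your_script.py, rather than hard-coded, so the same script can run locally or on the cluster unchanged.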

Please check the cluster mode overview in Spark's official documentation, and see if you have more specific questions.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Guy