How to traverse a massive dataframe pair-wise and store the values in an n*n matrix?
Problem Description:
I have a dataset of about 35 million rows and 10 columns.
I want to calculate the distance between every pair of rows, using a distance function like distance(row1,row2), and then store the values in a huge matrix.
The total number of operations needed is nearly 6*10^15, which I think is very large.
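As a rough check on that scale, the number of unordered row pairs and the storage they would need can be worked out directly. This is only a back-of-the-envelope sketch: the object name `PairCount` is made up for illustration, and the 8 bytes per value assumes each distance is stored as a single Double with no index overhead.

```scala
object PairCount {
  // Row count taken from the question; Long avoids Int overflow.
  val n: Long = 35000000L

  // Unordered pairs (i < j): n * (n - 1) / 2.
  val pairs: Long = n * (n - 1) / 2

  // Assumption: one 8-byte Double per stored distance, nothing else.
  val bytes: Long = pairs * 8L

  def main(args: Array[String]): Unit = {
    println(s"pairs = $pairs")
    println(s"bytes = $bytes")
  }
}
```

That comes to roughly 6.1*10^14 pairs and about 4.9*10^15 bytes for the upper triangle alone. If each distance touches all 10 columns, the pair count is consistent with the operation estimate above.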
What I've tried:
- upload datafile to HDFS
- read data as dataframe
- df.collect() and get array1: Array[Row]
- traverse array1 pair-wise and calculate distance
- store the distance(rowi,rowj) in matrix(i,j)
Scala code :
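// Assumes the ml package; use org.apache.spark.mllib.linalg.Vectors if that is what the project uses
import org.apache.spark.ml.linalg.Vectors
val array1 = df.collect() // pulls every row to the driver
val l = array1.length
for (i <- 0 until l) {
  for (j <- i + 1 until l) {
    // one (i, j, distance) triple per pair of rows
    val vd = Vectors.dense(i.toDouble, j.toDouble, distance(array1(i), array1(j)))
  }
}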
I want to save each value in a Vector like the one above and add it to an RDD/DataFrame.
But the only way I have found is to use union, and I don't think that is good enough.
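One way to avoid both collect() and repeated union calls is to let Spark enumerate the pairs itself. The following is a minimal sketch under stated assumptions, not a solution tested at 35 million rows: the names `PairwiseSketch` and `pairwise` are hypothetical, and the `distance` here is a stand-in (Euclidean, assuming every column is a Double) for whatever the real distance function is.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object PairwiseSketch {
  // Hypothetical stand-in for distance(row1, row2):
  // Euclidean distance, assuming every column holds a Double.
  def distance(a: Row, b: Row): Double = {
    val squared = (0 until a.length).map { k =>
      val d = a.getDouble(k) - b.getDouble(k)
      d * d
    }.sum
    math.sqrt(squared)
  }

  // Emits one (i, j, distance) triple per pair with i < j.
  // Nothing is collected to the driver.
  def pairwise(df: DataFrame): RDD[(Long, Long, Double)] = {
    // Attach a stable row index, keyed as (index, row).
    val indexed: RDD[(Long, Row)] = df.rdd.zipWithIndex().map(_.swap)
    indexed.cartesian(indexed)
      .filter { case ((i, _), (j, _)) => i < j } // upper triangle only
      .map { case ((i, ri), (j, rj)) => (i, j, distance(ri, rj)) }
  }
}
```

The resulting RDD could then be written out with something like `pairwise(df).map { case (i, j, d) => s"$i,$j,$d" }.saveAsTextFile(path)`, where `path` is a placeholder for an HDFS location. This removes the driver-side OutOfMemoryError, but the work and the output are still quadratic in the number of rows.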
So there are three questions that need to be solved:
- collect is an action, and df.collect() throws the exception java.lang.OutOfMemoryError: Java heap space. Can this be avoided?
- As soon as I get a distance(rowi,rowj), I want to store it. How?
- Can I store the final matrix in HDFS and read it as a matrix in Python?
P.S.: If none of the above can be solved, what other approach can I use?
Any answer will help me a lot, thank you!
Solution 1:[1]
You can check https://github.com/ma7555/evalify (disclaimer: I am the owner)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ma7555 |
