How to accelerate the creation of a huge matrix in Python?

I am trying to create a huge matrix (60M x 3.1M), and I have written two scripts: one tries to use the GPU to speed things up, the other computes on the CPU. I submitted the scripts to SLURM and they have been running for almost a week; it seems they will run forever. I am a complete beginner at Python and data analysis, so I wonder if there is any way to accelerate the process. Here is my code:

CPU version:

f = open('/etc/d_pair.txt', 'r')  # the file contains 30M pairs of diagnoses (two diseases per pair)
f2 = open('/etc/ehr.txt', 'r')  # the EHR file: 60M patients and their diagnoses
nf = open('/result/C_matrix.txt', 'w', newline='')
fline = f.readlines()  # list of all disease pairs
f2line = f2.readlines()  # list of all patient records
for Id in f2line:  # each one of the 60M patients
    b = Id.split(' ')  # turn the id-and-diagnoses line into a list
    nf.write(str(b[0]))
    nf.write(' ')
    for name in fline:  # loop over every disease pair
        pair = name.split(',')
        if pair[0] in b and pair[1] in b:  # the patient has both diseases of the pair
            nf.write('1')
        else:  # the patient has at most one of the two diseases
            nf.write('0')
        nf.write(' ')
    nf.write('\n')

f.close()
f2.close()
nf.close()
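For what it's worth, a large part of the cost here may be that `pair[0] in b` is a linear scan over a Python list, repeated 30M times per patient; converting each patient's record to a `set` makes membership tests O(1) on average, and the pairs only need to be split once, up front. A minimal sketch of that idea (with tiny made-up data standing in for the real files):

```python
# Tiny stand-ins for the contents of d_pair.txt and ehr.txt (illustrative only).
pair_lines = ["d1,d2\n", "d2,d3\n"]
ehr_lines = ["p1 d1 d2\n", "p2 d3\n"]

# Split each disease pair once, up front, instead of on every patient.
pairs = [line.strip().split(',') for line in pair_lines]

rows = []
for record in ehr_lines:
    fields = record.split()  # split() with no argument also strips the trailing '\n'
    pid, diagnoses = fields[0], set(fields[1:])  # set -> O(1) membership tests
    bits = ['1' if a in diagnoses and b in diagnoses else '0'
            for a, b in pairs]
    rows.append(pid + ' ' + ' '.join(bits))

print('\n'.join(rows))
```

Note that `record.split()` (no argument) also avoids a subtle bug in the original: `Id.split(' ')` leaves the trailing `'\n'` attached to the last diagnosis, so comparisons against it can silently fail.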

The GPU version:

import torch
import pandas as pd
torch.cuda.set_device(0)
f = open('/etc/d_pair.txt', 'r')
f2 = open('/etc/ehr.txt', 'r')
fline = f.readlines()  # list of all disease pairs
f2line = f2.readlines()  # list of all patient records
All = []
for Id in f2line:  # each one of the 60M patients
    TIMES = []
    b = Id.split(' ')  # turn the id-and-diagnoses line into a list
    TIMES.append(int(b[0]))
    for name in fline:  # loop over every disease pair
        pair = name.split(',')
        if pair[0] in b and pair[1] in b:  # the patient has both diseases of the pair
            TIMES.append(1)
        else:
            TIMES.append(0)
    All.append(TIMES)
c = torch.as_tensor(All).cuda()  # turn the 2-dimensional list into a tensor on the GPU
pc = pd.DataFrame(c).astype("int")
pc.to_csv('/result/matrix.csv', index=False, header=False)
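As a side note on the GPU version: `All` accumulates the entire dense matrix in host memory before `torch.as_tensor` is ever called, and a quick back-of-the-envelope estimate shows a dense 60M x 3.1M matrix cannot fit in RAM or GPU memory at all, even at one byte per entry:

```python
rows, cols = 60_000_000, 3_100_000
bytes_per_entry = 1  # even with the smallest dense dtype (e.g. torch.int8)
total_tb = rows * cols * bytes_per_entry / 1024**4  # convert bytes to TiB
print(f"{total_tb:.0f} TiB")  # prints "169 TiB"
```

So the dense approach would need on the order of 169 TiB of memory regardless of how fast the loops run, which suggests the representation, not just the loop speed, is the bottleneck.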

I have read some similar questions, but some of their matrices are not as huge as mine, and others aim to compute with the matrix (dot products, etc.), so I am not sure whether PyTables or scipy.sparse.csc_matrix is suitable for my circumstances.
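On the scipy.sparse question: since the vast majority of entries will be 0 (a patient rarely has both diseases of a given pair), storing only the coordinates of the 1-entries is exactly what a sparse matrix does, so it does look like a fit. A minimal sketch of the coordinate route with `csr_matrix` (tiny made-up coordinates; real code would stream them out of the files rather than hold the dense matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Coordinates of the 1-entries: (patient_index, pair_index) for every patient
# who has both diseases of a pair (illustrative values for a 3 x 2 matrix).
row_idx = [0, 2]
col_idx = [1, 0]
data = np.ones(len(row_idx), dtype=np.int8)

# Build the matrix; only the nonzero entries are actually stored.
m = csr_matrix((data, (row_idx, col_idx)), shape=(3, 2))
print(m.toarray())
```

`scipy.sparse.save_npz` can then persist the matrix far more compactly than a space-separated text file.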

Thanks in advance!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
