Utilising maximum GPU power
I have a machine learning engineering problem. Here is a description of my system:
- I have two GPUs with 16 GB memory each
- My NLP model (stanza pipeline) occupies roughly 1.5 GB of memory.
- I have 12 cores on my device.
In order to utilize all available GPU power, I want to create 20 instances of my model (roughly 10 per 16 GB GPU at ~1.5 GB each) and 20 child processes, each of which gets its own instance. Once all the child processes have started, I want each of them to consume data points from a shared pandas data frame, so that no GPU instance sits idle at any point during execution.
Current approach: I currently create 20 instances and 20 child processes in my Python program, and I partition the data into 20 equal parts. This gives me parallel execution and full GPU utilization at first, but once a single child process finishes its partition, the GPU instance corresponding to that child process sits idle doing nothing.
So, once any process finishes the data partition assigned to it, I want that child process to help the other child processes finish their work. The ideal solution in my mind is:
- Parent process should initialize 20 instances of the model to use all GPU memory.
- Parent process should also initialize 20 child processes and assign one independent model instance to each child process.
- Each child process should consume data points from a shared pandas data frame.
I'm not sure whether it is possible to implement this solution.
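Something along these lines is what I imagine. This is a rough sketch, not working code: I am assuming each child would still build its own pipeline (as far as I know, CUDA objects don't transfer cleanly between processes), and that the "shared data frame" would in practice be a `multiprocessing.Queue` of rows that every worker pulls from; `data_points` is the data loaded from disk, as in my real code:

```python
import multiprocessing as mp

import pandas as pd
import stanza
import torch


def worker(gpu, task_queue, result_queue):
    # Each child builds its own pipeline on its assigned GPU.
    torch.cuda.set_device(gpu)
    nlp = stanza.Pipeline(lang='en', processors='tokenize', logging_level='ERROR')
    sentences = []
    while True:
        row = task_queue.get()
        if row is None:  # sentinel: no more work left
            break
        doc = nlp(row['text'])
        for sent in doc.sentences:
            sentences.append({'doc_id': row['doc_id'], 'item_no': row['item_no'],
                              'sentence_id': sent.id, 'text': sent.text})
    result_queue.put(sentences)


if __name__ == '__main__':
    mp.set_start_method('spawn')          # safer with CUDA than the default fork
    raw_data = pd.DataFrame(data_points)  # data_points loaded from disk as in my code

    nworkers = 20
    gpus = [0] * (nworkers // 2) + [1] * (nworkers // 2)
    task_queue = mp.Queue()
    result_queue = mp.Queue()

    # Every worker pulls from the same queue, so a worker that finishes a row
    # immediately takes the next one instead of going idle.
    for _, row in raw_data.iterrows():
        task_queue.put({'doc_id': row.doc_id, 'item_no': row.item_no, 'text': row.text})
    for _ in range(nworkers):
        task_queue.put(None)  # one sentinel per worker

    procs = [mp.Process(target=worker, args=(g, task_queue, result_queue)) for g in gpus]
    for p in procs:
        p.start()
    # Drain results before joining, so workers aren't blocked on a full pipe.
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()
```

For reference, here is the code of my current approach: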
import pickle
from multiprocessing import Pool, current_process, freeze_support

import numpy as np
import pandas as pd
import stanza
import torch
from tqdm import tqdm


def process_df(partition_tuple):
    # Each worker receives (DataFrame partition, GPU id) and builds its own pipeline.
    current = current_process()
    pos = current._identity[0] - 1
    raw_df, gpu = partition_tuple
    torch.cuda.set_device(gpu)
    raw_sentences = []
    nlp = stanza.Pipeline(lang='en', processors='tokenize', logging_level='ERROR')
    with tqdm(total=len(raw_df), desc=f"Process: {pos}", position=pos) as pbar:
        for _, row in raw_df.iterrows():
            doc = nlp(row.text)
            for sent in doc.sentences:
                raw_sentences.append({
                    'doc_id': row.doc_id,
                    'item_no': row.item_no,
                    'sentence_id': sent.id,
                    'text': sent.text,
                })
            pbar.update(1)
    return pd.DataFrame(raw_sentences)
if __name__ == "__main__":
    freeze_support()

    #
    # Here is the logic to load data_points from disk.
    #
    raw_data = pd.DataFrame(data_points)
    print(raw_data.head())  # quick sanity check

    ncores = 20
    data_partitions = np.array_split(raw_data, ncores)
    gpu_mappings = [0] * (ncores // 2) + [1] * (ncores // 2)
    partitions = list(zip(data_partitions, gpu_mappings))

    print(f"Number of samples: {len(raw_data)}")
    print(f"Total partitions = {len(data_partitions)}, "
          f"Size of each partition: {len(data_partitions[0])}")

    # Share tqdm's lock with the workers so their progress bars don't clash.
    with Pool(processes=ncores, initializer=tqdm.set_lock,
              initargs=(tqdm.get_lock(),)) as pool:
        # Pool.map blocks and returns a plain list; there is no .get() to call,
        # and the context manager takes care of close()/join().
        results_list = pool.map(process_df, partitions)

    with open('res_list_complete.p', 'wb') as f:
        pickle.dump(results_list, f)
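An alternative I've been wondering about keeps the Pool but drops the fixed partitioning: the pipeline is created once per worker in a Pool initializer, and rows are handed out with imap_unordered so scheduling becomes dynamic. Again just a sketch under my assumptions (init_worker and process_row are my own helper names, and data_points is loaded from disk as above):

```python
import multiprocessing as mp

import pandas as pd
import stanza
import torch

_nlp = None  # per-process pipeline, created once by the initializer


def init_worker(gpus):
    global _nlp
    # Spread pool workers across the available GPUs by worker identity.
    wid = mp.current_process()._identity[0] - 1
    torch.cuda.set_device(gpus[wid % len(gpus)])
    _nlp = stanza.Pipeline(lang='en', processors='tokenize', logging_level='ERROR')


def process_row(row):
    doc = _nlp(row['text'])
    return [{'doc_id': row['doc_id'], 'item_no': row['item_no'],
             'sentence_id': sent.id, 'text': sent.text}
            for sent in doc.sentences]


if __name__ == '__main__':
    mp.set_start_method('spawn')          # safer with CUDA than the default fork
    raw_data = pd.DataFrame(data_points)  # data_points loaded from disk as above

    with mp.Pool(processes=20, initializer=init_worker, initargs=([0, 1],)) as pool:
        # imap_unordered hands out one chunk at a time, so a worker that
        # finishes early simply grabs the next chunk instead of idling.
        nested = pool.imap_unordered(process_row, raw_data.to_dict('records'),
                                     chunksize=32)
        sentences = [point for batch in nested for point in batch]

    result = pd.DataFrame(sentences)
```

The chunksize is a trade-off: smaller chunks balance load more evenly across workers but cost more inter-process overhead.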
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.