Empty PCA matrix

For a pathway p_i, I want to extract data from the G matrix to produce an intermediate matrix B ∈ ℝ^{n×r_i}, where r_i is the number of genes involved in pathway p_i. That is, B has samples in rows and the genes of the given pathway in columns.

The intermediate matrix B is the transpose of the common_mrna dataframe. common_mrna has 8053 x 347 dimensions (genes x samples), so B has 347 x 8053 dimensions. The kegg_list has the same length (347 items) as the number of columns in common_mrna, meaning that each pathway in kegg_list corresponds to a row number (not an index label) of the common_mrna dataframe. If the position in kegg_list matches the row position in common_mrna, I want to append that row to the empty matrix B, which is then transposed and converted into a 347 x 8053 dataframe.
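A minimal sketch of the transposition described above, using the three-gene sample input given later in this question. It assumes every row of common_mrna belongs to the pathway; picking only the genes of one pathway would need a pathway-to-gene mapping that is not shown here. The name `B_df` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Sample input from this question: genes in rows, samples in columns.
common_mrna = pd.DataFrame(
    [[0.6747, -1.4892, -2.0670, 0.2337, 0.1255],
     [0.0051, 0.2122, -0.6536, 1.3746, -1.6958],
     [-0.4994, -0.2472, -0.1614, 0.9809, 1.3159]],
    columns=['TCGA-28-5207-01', 'TCGA-02-0089-01', 'TCGA-87-5896-01',
             'TCGA-06-5410-01', 'TCGA-16-0861-01'],
    index=["DIABLO", "MRPL33", "RBM39"],
)

# Transposing puts samples in rows and genes in columns,
# with no explicit loop over rows.
B_df = common_mrna.T      # hypothetical name, DataFrame of shape (5, 3)
B = B_df.to_numpy()       # plain NumPy array with the same layout
```

On the sample input this gives a 5 x 3 array whose rows are samples, which matches the layout of the desired B.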

Next, using PCA, I want to decompose matrix B into uncorrelated components, yielding G_{p_i} ∈ ℝ^{n×q}, where q = 5 is the number of principal components (PCs).
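The decomposition step can be sketched as follows, assuming scikit-learn's PCA and a small random stand-in for B (`B_demo`, `G_demo` and the 20 x 8 shape are hypothetical, chosen only for illustration). One thing worth noting: scikit-learn requires n_components to be at most min(n_samples, n_features), so q = 5 would not be accepted for a matrix with only three gene columns. The sketch caps q accordingly.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for B: 20 samples (rows) x 8 genes (columns).
rng = np.random.default_rng(0)
B_demo = rng.normal(size=(20, 8))

q = 5
# Cap q so it never exceeds min(n_samples, n_features).
n_components = min(q, *B_demo.shape)

pca = PCA(n_components=n_components)
G_demo = pca.fit_transform(B_demo)   # one row per sample, one column per PC

# The PC scores should be mutually uncorrelated:
# off-diagonal covariances are numerically zero.
cov = np.cov(G_demo, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
```

Here `G_demo` has one row per sample and q columns, which is the n × q shape described above.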

The problem with my code below is that it yielded an empty dataframe G.

Code:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

p = [] # Initialize pathway list (columns)
G = np.zeros((8053,347)) # Initialize mRNA expression matrix
B = np.zeros((8053,347)) # Initialize intermediate matrix B
q = 5 # Number of PCs

# Populate intermediate matrix B
for i, p in enumerate(kegg_list):
  Bi = 0
  for index, row in common_mrna.iterrows():
    if i==len(index):
      np.append(B, row)
B = B.transpose()

# PCA for yielding matrix G
pca_G = PCA(n_components=q)
pc = pca_G.fit_transform(B)
G = pd.DataFrame(pc)
G.to_csv("./gbm_tcga/PCA_mrna.csv", index=False)
G
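Two details in the loop above may explain why B is never filled. The sketch below checks them in isolation, using a tiny hypothetical array in place of the real B and assuming the index labels are gene-name strings, as in the sample input.

```python
import numpy as np

# Tiny hypothetical stand-in for the pre-allocated matrix.
B = np.zeros((2, 3))
row = np.array([1.0, 2.0, 3.0])

# 1) np.append returns a NEW array; it does not modify B in place.
#    If the return value is discarded, B stays all zeros.
result = np.append(B, row)
b_unchanged = np.array_equal(B, np.zeros((2, 3)))

# 2) Without an axis argument, np.append flattens its inputs,
#    so even the returned array is 1-D rather than a matrix.
result_shape = result.shape

# 3) iterrows() yields (index_label, row). For a string label,
#    len(index) is the number of characters in the gene name,
#    not a row position.
label_len = len("DIABLO")
```

Under these assumptions, the condition `i == len(index)` compares a pathway position with the length of a gene name, and any row that does pass the test is appended to a temporary array that is thrown away. B would then still contain only zeros when it is passed to PCA.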

common_mrna input dataframe

common_mrna = pd.DataFrame([[0.6747, -1.4892, -2.0670, 0.2337, 0.1255], [0.0051, 0.2122, -0.6536, 1.3746, -1.6958], [-0.4994, -0.2472, -0.1614, 0.9809, 1.3159]], columns=['TCGA-28-5207-01', 'TCGA-02-0089-01','TCGA-87-5896-01', 'TCGA-06-5410-01','TCGA-16-0861-01'], index=["DIABLO", "MRPL33", "RBM39"])

kegg_list input list

    kegg_list = ['Glycolysis_/_Gluconeogenesis',
     'Citrate_cycle_(TCA_cycle)',
     'Pentose_phosphate_pathway',
     'Pentose_and_glucuronate_interconversions',
     'Fructose_and_mannose_metabolism',
     'Galactose_metabolism']

Desired output:

B = array([[ 0.6747,  0.0051, -0.4994], [-1.4892,  0.2122, -0.2472], [-2.067 , -0.6536, -0.1614], [0.2337, 1.3746, 0.9809], [0.1255, -1.6958, 1.3159]])

G dataframe output example

    0   1   2   3   4
0   38.212563   84.281414   -16.431037  10.795291   -9.838612
1   4.576981    -17.445719  -8.810916   13.394762   -19.474494
2   28.976645   -15.513577  -40.145026  24.518149   -4.071515
3   -13.337420  51.460401   32.327822   29.451669   -11.260542
4   -70.198273  10.969363   -11.111083  8.880538    -7.346486
... ... ... ... ... ...
342 16.002266   -32.598450  10.614456   20.556477   -6.707023
343 78.455711   90.474320   33.427067   5.214462    1.915552
344 34.473105   -35.156964  -2.786122   28.337833   17.163662
345 -21.152972  -7.683508   6.547692    13.456135   -23.355560
346 12.167149   8.383400    -61.680875  5.856363    7.181636
347 rows × 5 columns


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
