Empty PCA matrix
For a pathway $p_i$, I want to extract data from the G matrix to produce an intermediate matrix $B \in \mathbb{R}^{n \times r_i}$, where $r_i$ is the number of genes involved in pathway $p_i$. That is, the matrix B has samples in rows and the genes of the given pathway in columns.
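Selecting the columns for a single pathway could be sketched as below. The question does not show a pathway-to-gene mapping, so `pathway_genes`, the sample labels `S1`/`S2`, and the helper `pathway_matrix` are all hypothetical and only illustrate the shape of B.

```python
import pandas as pd

# Toy expression frame: genes in rows, samples in columns.
# The sample labels S1 and S2 are made up for this sketch.
expr = pd.DataFrame(
    [[0.6747, -1.4892],
     [0.0051, 0.2122],
     [-0.4994, -0.2472]],
    index=["DIABLO", "MRPL33", "RBM39"],
    columns=["S1", "S2"],
)

# Hypothetical pathway -> gene mapping, for illustration only.
pathway_genes = {"Example_pathway": ["DIABLO", "RBM39", "NOT_MEASURED"]}


def pathway_matrix(expr, genes):
    """Return a samples x genes frame restricted to genes present in expr."""
    present = expr.index.isin(genes)  # boolean mask over the gene rows
    return expr.loc[present].T        # transpose: samples become rows


B_i = pathway_matrix(expr, pathway_genes["Example_pathway"])
print(B_i.shape)  # (2, 2): two samples, two measured pathway genes
```

Genes listed for the pathway but absent from the expression data are skipped here; whether that is the right policy depends on the analysis.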
The intermediate matrix B is the transpose of the common_mrna dataframe, so it has 347 x 8053 dimensions. The kegg_list has the same length (347 items) as the number of columns in the common_mrna dataframe (8053 x 347 dimensions), meaning that each pathway in kegg_list corresponds to a row number (not the index label) of the common_mrna dataframe. If the position of an item in kegg_list matches the row number of the common_mrna dataframe, I want to append that row to the empty matrix B, which is then transposed and converted into a 347 x 8053 dataframe.
Next, using PCA, I want to decompose matrix B into uncorrelated components, yielding $G_{p_i} \in \mathbb{R}^{n \times q}$, where $q = 5$ is the number of principal components (PCs).
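In matrix terms, the PCA step can be written as follows. This is the usual SVD formulation, offered as an assumption about what the post intends; the symbols $\mu$, $\bar{B}$, $U$, $\Sigma$ and $V$ are introduced here for illustration and do not appear in the question.

```latex
\mu = \tfrac{1}{n} B^{\top} \mathbf{1}_n, \qquad
\bar{B} = B - \mathbf{1}_n \mu^{\top}
\\
\bar{B} = U \Sigma V^{\top}
\\
G_{p_i} = \bar{B} V_q = U_q \Sigma_q \in \mathbb{R}^{n \times q},
\qquad q \le \min(n, r_i)
```

Here $V_q$ holds the first $q$ right singular vectors. The columns of $G_{p_i}$ are uncorrelated because the columns of $U_q$ are orthogonal and, after centering, each has zero mean.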
The problem with my code below is that it yields an empty dataframe G.
Code:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
p = [] # Initialize pathway list (columns)
G = np.zeros((8053,347)) # Initialize mRNA expression matrix
B = np.zeros((8053,347)) # Initialize intermediate matrix B
q = 5 # Number of PCs
# Populate intermediate matrix B
for i, p in enumerate(kegg_list):
    Bi = 0
    for index, row in common_mrna.iterrows():
        if i==len(index):
            np.append(B, row)
B = B.transpose()
# PCA for yielding matrix G
pca_G = PCA(n_components=q)
pc = pca_G.fit_transform(B)
G = pd.DataFrame(pc)
G.to_csv("./gbm_tcga/PCA_mrna.csv", index=False)
G
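Two details of the loop above are worth checking in isolation. `np.append` returns a new array instead of modifying `B`, and without an `axis` argument it flattens both inputs. `len(index)` is the character count of the gene-name string, not a row position. A minimal sketch with toy arrays (the names `B_demo`, `row`, `flat`, `stacked` and `gene` are illustrative, not taken from the code above):

```python
import numpy as np

B_demo = np.zeros((2, 3))
row = np.array([1.0, 2.0, 3.0])

# np.append returns a new, flattened array; B_demo itself is untouched.
flat = np.append(B_demo, row)
print(flat.shape)    # (9,)
print(B_demo.shape)  # (2, 3)

# The result has to be assigned to a name; np.vstack keeps the 2-D layout.
stacked = np.vstack([B_demo, row])
print(stacked.shape)  # (3, 3)

# len() of a string index label is its number of characters.
gene = "DIABLO"
print(len(gene))  # 6
```

Since the return value of `np.append` is discarded in the loop, `B` would still hold its initial zeros when it reaches PCA.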
common_mrna input dataframe
common_mrna = pd.DataFrame(
    [[0.6747, -1.4892, -2.0670, 0.2337, 0.1255],
     [0.0051, 0.2122, -0.6536, 1.3746, -1.6958],
     [-0.4994, -0.2472, -0.1614, 0.9809, 1.3159]],
    columns=['TCGA-28-5207-01', 'TCGA-02-0089-01', 'TCGA-87-5896-01',
             'TCGA-06-5410-01', 'TCGA-16-0861-01'],
    index=["DIABLO", "MRPL33", "RBM39"])
kegg_list input list
kegg_list = ['Glycolysis_/_Gluconeogenesis',
             'Citrate_cycle_(TCA_cycle)',
             'Pentose_phosphate_pathway',
             'Pentose_and_glucuronate_interconversions',
             'Fructose_and_mannose_metabolism',
             'Galactose_metabolism']
Desired output:
B = array([[ 0.6747,  0.0051, -0.4994],
           [-1.4892,  0.2122, -0.2472],
           [-2.067 , -0.6536, -0.1614],
           [ 0.2337,  1.3746,  0.9809],
           [ 0.1255, -1.6958,  1.3159]])
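For the sample data in this post, the desired B is simply the transpose of `common_mrna`, so no explicit loop seems necessary. A sketch, assuming every gene (row) of `common_mrna` should become a column of B:

```python
import numpy as np
import pandas as pd

common_mrna = pd.DataFrame(
    [[0.6747, -1.4892, -2.0670, 0.2337, 0.1255],
     [0.0051, 0.2122, -0.6536, 1.3746, -1.6958],
     [-0.4994, -0.2472, -0.1614, 0.9809, 1.3159]],
    columns=["TCGA-28-5207-01", "TCGA-02-0089-01", "TCGA-87-5896-01",
             "TCGA-06-5410-01", "TCGA-16-0861-01"],
    index=["DIABLO", "MRPL33", "RBM39"],
)

# Transpose so that samples are rows and genes are columns,
# then drop the labels to get a plain NumPy array.
B = common_mrna.T.to_numpy()
print(B.shape)  # (5, 3)
```

If the row and column labels are still needed later, `common_mrna.T` on its own keeps them as a dataframe.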
G dataframe output example
             0          1          2          3          4
0    38.212563  84.281414 -16.431037  10.795291  -9.838612
1     4.576981 -17.445719  -8.810916  13.394762 -19.474494
2    28.976645 -15.513577 -40.145026  24.518149  -4.071515
3   -13.337420  51.460401  32.327822  29.451669 -11.260542
4   -70.198273  10.969363 -11.111083   8.880538  -7.346486
...        ...        ...        ...        ...        ...
342  16.002266 -32.598450  10.614456  20.556477  -6.707023
343  78.455711  90.474320  33.427067   5.214462   1.915552
344  34.473105 -35.156964  -2.786122  28.337833  17.163662
345 -21.152972  -7.683508   6.547692  13.456135 -23.355560
346  12.167149   8.383400 -61.680875   5.856363   7.181636
347 rows × 5 columns
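A table of this shape could be produced along the following lines. This is a sketch, not a verified fix for the real data: random numbers stand in for the expression matrix (200 genes instead of 8053, to keep it quick), and the names `rng`, `expr` and `scores` are illustrative. Note that scikit-learn requires `n_components` to be at most min(n_samples, n_features), so the three-gene sample from this post could not yield five components.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for common_mrna: genes in rows, 347 samples in columns.
expr = pd.DataFrame(rng.normal(size=(200, 347)))

# Samples x genes matrix, shape (347, 200).
B = expr.T.to_numpy()

q = 5  # number of principal components
scores = PCA(n_components=q).fit_transform(B)

# One row per sample, one column per principal component.
G = pd.DataFrame(scores, index=expr.columns)
print(G.shape)  # (347, 5)
```

Passing `index=expr.columns` keeps the sample labels on the rows of G, which the original `pd.DataFrame(pc)` call would replace with 0 to 346.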
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
