'How to perform McNemar (or Chi-square) test on multi-dimensional data?
Question:
Assume for this part that the total count for every sample is 5000 (i.e., sum of column = 5000).
Imagine there was a row (gene G) in this dataset for which the count is expected to be 1 in 10% of samples and 0 in the remaining 90% of samples. We are doing an experiment where we would like to know if the expression of gene G changes in experimental vs control conditions, and we will measure n samples (single cells) from each condition.
Plot the statistical power to detect a 10% increase in the expression of G in experimental vs control at Bonferroni-corrected p < 0.05 as a function of n, assuming that we will be performing a similar test for significance on 1000 genes total. How many samples from each condition do we need to measure to achieve a power of 95%?
First, for gene G, I created a pandas dataframe for control and experimental conditions, where the ratio of 0:1 is 10% and 20%, respectively. I extrapolated the conditions for 1000 genes and then perform cross-tabulation analysis.
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower
n = 5000 # no of records
nog = 1000 # no of genes
gene_list = ["gene_" + str(i) for i in range(0,nog)]
def generate_gene_df(gene, n):
df = pd.DataFrame.from_dict(
{"Gene" : gene,
"Cells": (f'Cell{x}' for x in range(1, n+1)),
"Control": np.random.choice([1,0], p=[0.1, 0.9], size=n),
"Experimental": np.random.choice([1,0], p=[0.1+0.1, 0.9-0.1], size=n)},
orient='columns'
)
df = df.set_index(["Gene","Cells"])
return df
# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]
df = pd.concat(gene_df_list)
df = df.reset_index()
table = pd.crosstab([df["Gene"], df["Cells"]],
[df["Control"], df["Experimental"]]).to_numpy()
Table:
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0],
...,
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0]])
Now, I want to plot the statistical power at Bonferroni-corrected p < 0.05 as a function of n. I also want to find out how many samples from each condition do we need to measure to achieve a power of 95%.
My attempt:
McNemar's test
result = mcnemar(table, exact=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))
alpha=0.05
if result.pvalue > alpha:
print('Same proportions of errors (fail to reject H0)')
else:
print('Different proportions of errors (reject H0)')
Output:
statistic=0.000, p-value=1.000 Same proportions of errors (fail to reject H0)
I calculated the power analysis using:
nobs = 5000
alpha = 0.05
effect_size = chisquare_effectsize(0.5, 0.5*1.1, correction=None, cohen=True, axis=0)
analysis = GofChisquarePower()
power_chisquare = analysis.solve_power(effect_size=effect_size, nobs=nobs, alpha=alpha)
print('Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: %.3f' % power_chisquare)
Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: 0.050
Why does the power curve look atypical? Did I perform the analyses correctly? Is McNemar an appropriate statistical test method in this case?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|

