'How to perform McNemar (or Chi-square) test on multi-dimensional data?

Question:

Assume for this part that the total count for every sample is 5000 (i.e., sum of column = 5000).

Imagine there was a row (gene G) in this dataset for which the count is expected to be 1 in 10% of samples and 0 in the remaining 90% of samples. We are doing an experiment where we would like to know if the expression of gene G changes in experimental vs control conditions, and we will measure n samples (single cells) from each condition.

Plot the statistical power to detect a 10% increase in the expression of G in experimental vs control at Bonferroni-corrected p < 0.05 as a function of n, assuming that we will be performing a similar test for significance on 1000 genes total. How many samples from each condition do we need to measure to achieve a power of 95%?


First, for gene G, I created a pandas dataframe for control and experimental conditions, where the ratio of 0:1 is 10% and 20%, respectively. I extrapolated the conditions for 1000 genes and then perform cross-tabulation analysis.

import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower


n = 5000 # no of records
nog = 1000   # no of genes
gene_list = ["gene_" + str(i) for i in range(0,nog)]

def generate_gene_df(gene, n):

    df = pd.DataFrame.from_dict(
         {"Gene" : gene,
          "Cells": (f'Cell{x}' for x in range(1, n+1)),
          "Control": np.random.choice([1,0], p=[0.1, 0.9], size=n),
          "Experimental": np.random.choice([1,0], p=[0.1+0.1, 0.9-0.1], size=n)},
         orient='columns'
    )

    df = df.set_index(["Gene","Cells"])
    
    return df

# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]

df = pd.concat(gene_df_list)
df = df.reset_index()

table = pd.crosstab([df["Gene"], df["Cells"]], 
                    [df["Control"], df["Experimental"]]).to_numpy()

Table:

array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       ...,
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0]])

Now, I want to plot the statistical power at Bonferroni-corrected p < 0.05 as a function of n. I also want to find out how many samples from each condition do we need to measure to achieve a power of 95%.

My attempt:

McNemar's test

result = mcnemar(table, exact=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

alpha=0.05
if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')

Output:

statistic=0.000, p-value=1.000 Same proportions of errors (fail to reject H0)

I calculated the power analysis using:

nobs = 5000
alpha = 0.05
effect_size = chisquare_effectsize(0.5, 0.5*1.1, correction=None, cohen=True, axis=0)

    analysis = GofChisquarePower()
    power_chisquare = analysis.solve_power(effect_size=effect_size, nobs=nobs, alpha=alpha)
    print('Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: %.3f' % power_chisquare)

Output

Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: 0.050

Why does the power curve look atypical? Did I perform the analyses correctly? Is McNemar an appropriate statistical test method in this case?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source