CSV with Embedded JSON using Python

I'm working with GWAS data and need help reshaping it.

My data looks like this:

IID,rs098083,kgp794789,rs09848309,kgp8300747,.....
63,CC,AG,GA,AA,.....
54,AT,CT,TT,AG,.....
12,TT,GA,AG,AA,.....
.
.
.

As shown above, I have a total of 512 rows and 2 million columns.

Desired output: CSV with embedded JSON

SNP,Genotyping
{rs098083,"{""CC"" : [ 1, 63, 6, 18, 33, ...],""CT"" : [ 2, 54, 6, 7, 8, ...],""TT"" : [ 4, 9, 12, 13, ...],""AA"" : [86, 124, 4, 19, ...],""AT"" : [86, 98, 4, 74, ....],...}}"     
{kgp794789,"{""CC"" : [ 11, 3, 68, 10, 3, ...],""CT"" : [ 20, 58, 06, 47, 98, ...],""TT"" : [ 4, 99, 82, 190, ...],""AA"" : [89, 13, 54, 19, ...],""AT"" : [8, 88, 44, 74, ....],...}}"
{rs09848309,"{""CC"" : [ 18, 78, 9, 98, 23, ...],""CT"" : [ 20, 55, 6, 78, 84, ...],""TT"" : [ 94, 19, 54, 39, ...],""AA"" : [76, 134, 46, 19, ...],""AT"" : [58, 88, 39, 434, ....],...}}"
.
.
.
.

The SNP column of each row contains the ID of the SNP. The Genotyping column contains a JSON blob: a set of key-value pairs in which the key is a particular genotype (e.g., CC, CT, TT, ...) and the value is a list of the IIDs whose genotype matches the key.
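
To make the row format concrete, here is a minimal sketch that assumes Python's standard csv and json modules. json.dumps serializes the genotype-to-IID mapping, and csv.writer quotes the field and doubles its inner quotes. The helper name format_row and the sample values are illustrative only, not taken from the real data.

```python
import csv
import io
import json


def format_row(snp_id, genotype_to_iids):
    """Return one CSV line: the SNP ID plus the JSON blob as a quoted field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # json.dumps builds the blob; csv.writer quotes it and doubles inner quotes.
    writer.writerow([snp_id, json.dumps(genotype_to_iids)])
    return buffer.getvalue()


# Illustrative values only.
row = format_row("rs098083", {"CC": [63], "AT": [54], "TT": [12]})
print(row, end="")
```

Letting csv.writer do the quoting avoids hand-building the doubled quotes, which is easy to get wrong.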

The Bash command I used to perform the above task:

%%bash

jq -Rrn '
  [ inputs / "," ] | transpose | (.[0][1:] | map(tonumber)) as $h | .[1:][]
  | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
  | join(", ")
' SampleSNPsData.csv >> SampleSNPsModified.csv

The output of the above bash command:

{rs098083, {"CC":[6,74,421,350,302,413,155,48,106,368,173,169,325,...], "AA":[351,434,17,96,39,170,115,343,180,285,299,...], "AT":[403,213,312,8,184,2,21,5,103,42,4,122,267,86,423,442,191,12,232,334,214,166,289,367,45]}
{kgp794789, {"AC":[6,74,421,350,302,413,155,48,106,368,173,169,325,351,434,17,96,39,115,180,285,2,21,5,103,42,4,122,267,86,423,12,334,214,166,45],"CC":[170,343,299,403,213,312,8,184,442,191,232,289,367]}
.
.
.
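
For a file small enough to hold in memory, the jq pipeline above can probably be mirrored in Python by transposing the parsed rows with zip. This is only a sketch under that assumption: the function name transpose_to_genotype_map and the inline sample are hypothetical, and the approach would need chunking before it could handle 2 million columns.

```python
import csv
import io
from collections import defaultdict


def transpose_to_genotype_map(lines):
    """Map each SNP ID to {genotype: [IIDs]} by transposing the parsed rows."""
    rows = list(csv.reader(lines))
    columns = list(zip(*rows))  # columns[0] is the IID column
    iids = [int(value) for value in columns[0][1:]]
    result = {}
    for column in columns[1:]:
        snp_id, genotypes = column[0], column[1:]
        by_genotype = defaultdict(list)
        for iid, genotype in zip(iids, genotypes):
            by_genotype[genotype].append(iid)
        result[snp_id] = dict(by_genotype)
    return result


# Hypothetical three-sample input in the same layout as the data above.
sample = io.StringIO(
    "IID,rs098083,kgp794789\n"
    "63,CC,AG\n"
    "54,AT,CT\n"
    "12,CC,AG\n"
)
genotype_map = transpose_to_genotype_map(sample)
print(genotype_map["rs098083"])
```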

Please help me do this in Python; I'm new to Python.

Python logic something like this:

read header line
select a fraction of SNPs
for each data line
   for snps in fraction
      build a dict of genotype to sample ID

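Dict[str(snpID), Dict[str(genotype), List[int(sample)]]]

import csv
import json
from collections import defaultdict

# defaultdict needs a callable factory, hence the lambda
snp_content = defaultdict(lambda: defaultdict(list))

...
fields = line.rstrip('\n').split(',')
sample_id = int(fields[0])
# do some math to figure out column index in tranche
snp = header_fields[i]
genotype = fields[i]
snp_content[snp][genotype].append(sample_id)

# csv.writer quotes the JSON field and doubles the quotes inside it
writer = csv.writer(f)
for snp, genotypes in snp_content.items():
    writer.writerow([snp, json.dumps(genotypes)])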
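
Putting the outline above together, one possible sketch processes the SNP columns in tranches and re-reads the input once per tranche, so only one tranche of dictionaries is in memory at a time. The function name convert_in_tranches, its parameters, and the default tranche size are assumptions made for illustration; the right tranche size depends on available memory, and the repeated passes trade extra I/O for a smaller footprint.

```python
import csv
import json
from collections import defaultdict


def convert_in_tranches(in_path, out_path, tranche_size=10000):
    """Rewrite a wide genotype CSV as one row per SNP with a JSON blob.

    Each pass re-reads the input and handles up to `tranche_size` SNP
    columns, so only one tranche of SNP dictionaries is held in memory.
    """
    with open(in_path, newline="") as src:
        header = next(csv.reader(src))
    last = len(header)  # column 0 is IID, SNP columns are 1..last-1

    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["SNP", "Genotyping"])
        for start in range(1, last, tranche_size):
            stop = min(start + tranche_size, last)
            snp_content = defaultdict(lambda: defaultdict(list))
            with open(in_path, newline="") as src:
                reader = csv.reader(src)
                next(reader)  # skip the header line
                for fields in reader:
                    sample_id = int(fields[0])
                    for i in range(start, stop):
                        snp_content[header[i]][fields[i]].append(sample_id)
            # Write the tranche in header order, one SNP per output row.
            for i in range(start, stop):
                writer.writerow([header[i], json.dumps(snp_content[header[i]])])
```

A call such as convert_in_tranches("SampleSNPsData.csv", "SampleSNPsModified.csv") would then write the header row followed by one row per SNP. Reading the result back with csv.reader and json.loads is a quick way to check that the embedded JSON survived the quoting.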


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
