Using Entrez.efetch to determine SNP reference allele

I have a list of 100,000 rs reference numbers and want to use Python to find the reference allele for each one. From my research, I think Entrez.efetch is the most appropriate function for this. However, the output of:

from Bio import Entrez

handle = Entrez.efetch(db="SNP", id="rs114525117", retmode="text")

print(handle.read())

is:

<DocumentSummary uid="114525117"><SNP_ID>114525117</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>1000Genomes</STUDY><FREQ>A=0.027756/139</FREQ></MAF><MAF><STUDY>ALSPAC</STUDY><FREQ>A=0.036585/141</FREQ></MAF><MAF><STUDY>GnomAD</STUDY><FREQ>A=0.027927/3856</FREQ></MAF><MAF><STUDY>GoNL</STUDY><FREQ>A=0.031062/31</FREQ></MAF><MAF><STUDY>Korea1K</STUDY><FREQ>A=0.000546/1</FREQ></MAF><MAF><STUDY>NorthernSweden</STUDY><FREQ>A=0.03/18</FREQ></MAF><MAF><STUDY>Qatari</STUDY><FREQ>A=0.046296/10</FREQ></MAF><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/18</FREQ></MAF><MAF><STUDY>Siberian</STUDY><FREQ>G=0.5/2</FREQ></MAF><MAF><STUDY>TWINSUK</STUDY><FREQ>A=0.036947/137</FREQ></MAF><MAF><STUDY>ALFA</STUDY><FREQ>A=0.045749/1301</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES/><ACC>NC_000001.11</ACC><CHR>1</CHR><HANDLE>EVA_UK10K_ALSPAC,USC_VALOUEV,ILLUMINA,JJLAB,SSMP,EVA-GONL,SGDP_PRJ,GNOMAD,KHV_HUMAN_GENOMES,EVA_DECODE,ACPOP,CSHL,TOPMED,KOGIC,SWEGEN,1000GENOMES,EVA,HUMAN_LONGEVITY,WEILL_CORNELL_DGM,EVA_UK10K_TWINSUK</HANDLE><SPDI>NC_000001.11:823655:G:A</SPDI><FXN_CLASS/><VALIDATED>by-frequency,by-alfa,by-cluster</VALIDATED><DOCSUM>HGVS=NC_000001.11:g.823656G&gt;A,NC_000001.10:g.759036G&gt;A|SEQ=[G/A]|LEN=1</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>132</ORIG_BUILD><UPD_BUILD>155</UPD_BUILD><CREATEDATE>2010/07/14 10:57</CREATEDATE><UPDATEDATE>2021/04/26 
02:13</UPDATEDATE><SS>230395467,479934148,482400003,533405090,647516140,779484328,781105187,834954380,974769275,1289338992,1599378337,1642372370,1917960251,2019498395,2137544192,2147484379,2159368230,2632465446,2750637571,2986148872,3066397479,3343272513,3626006481,3630505572,3685992019,3726716316,3745720889,3798743491,3847995174,3943629349</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>1:823656</CHRPOS><CHRPOS_PREV_ASSM>1:759036</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>0114525117</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0000823656</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>

This output lists all the studies, but it does not obviously state the reference allele. On the website, https://www.ncbi.nlm.nih.gov/snp/rs114525117, the summary at the top clearly shows the reference allele (G) with the alternative allele next to it.

How can I use Python to pull this information for all of the rs numbers?
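One possible approach, offered as a sketch under stated assumptions rather than a confirmed solution: the output above contains an <SPDI> element (NC_000001.11:823655:G:A). SPDI notation is sequence:position:deletion:insertion, so the third field is the reference allele (G here, matching the website) and the fourth is the alternative allele. The function names extract_alleles and fetch_docsums, the batch size of 200, the pause between requests, and the email argument are all hypothetical choices for illustration. The sketch also assumes that multi-allelic records list several comma-separated SPDI entries and that the server accepts comma-separated rs IDs in a single request.

```python
import re
import time

# Patterns for the pieces of each <DocumentSummary> record shown in the question.
DOC_RE = re.compile(r"<DocumentSummary\b.*?</DocumentSummary>", re.DOTALL)
SNP_ID_RE = re.compile(r"<SNP_ID>(\d+)</SNP_ID>")
SPDI_RE = re.compile(r"<SPDI>([^<]*)</SPDI>")


def extract_alleles(xml_text):
    """Map each rs ID to (reference_allele, [alternative_alleles]).

    Relies on the SPDI field: sequence:position:deleted:inserted,
    where the deleted sequence is the reference allele.
    """
    results = {}
    for doc in DOC_RE.findall(xml_text):
        snp_id = SNP_ID_RE.search(doc)
        spdi = SPDI_RE.search(doc)
        if not snp_id or not spdi:
            continue  # record without an ID or SPDI field: skip it
        ref = None
        alts = []
        # Assumption: multi-allelic SNPs carry comma-separated SPDI entries.
        for entry in spdi.group(1).split(","):
            parts = entry.strip().split(":")
            if len(parts) != 4:
                continue  # not a well-formed SPDI string
            _seq, _pos, deleted, inserted = parts
            if ref is None:
                ref = deleted
            alts.append(inserted)
        if ref is not None:
            results["rs" + snp_id.group(1)] = (ref, alts)
    return results


def fetch_docsums(rs_ids, email, batch_size=200):
    """Fetch records in batches (hypothetical helper; batch size is a guess)."""
    # Imported lazily so extract_alleles can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    chunks = []
    for start in range(0, len(rs_ids), batch_size):
        batch = rs_ids[start:start + batch_size]
        handle = Entrez.efetch(db="SNP", id=",".join(batch), retmode="text")
        data = handle.read()
        handle.close()
        if isinstance(data, bytes):
            data = data.decode("utf-8")
        chunks.append(data)
        time.sleep(0.4)  # stay under NCBI's request rate limit without an API key
    return "".join(chunks)


# Offline check using a trimmed copy of the record from the question.
sample = (
    '<DocumentSummary uid="114525117"><SNP_ID>114525117</SNP_ID>'
    "<SPDI>NC_000001.11:823655:G:A</SPDI></DocumentSummary>"
)
print(extract_alleles(sample))  # {'rs114525117': ('G', ['A'])}
```

With network access, something like extract_alleles(fetch_docsums(my_rs_list, "you@example.com")) would then give the alleles for the whole list, where my_rs_list and the email address are placeholders. For 100,000 IDs it may be worth checking whether a bulk download from dbSNP is faster than repeated web requests.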



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
