Parent-child relation data scraping with Selenium and BeautifulSoup
I hope you're all doing well! I'm trying to scrape this list of lineages (https://cov-lineages.org/lineage_list.html). The lineages are related as parents and children. What I have to do:
- loop through the list (https://cov-lineages.org/lineage_list.html), click each element, and scrape its data;
- then follow the link (on the same page) to the mutation table of each lineage and scrape that as well;
- scroll down to the table listing the children of that lineage, loop through them, click each one, and scrape its data. If a child has children of its own, repeat the same process for them.
I've included a PDF with explanatory screenshots. Please take a look and see if you can come up with an idea for how I can implement this with trees or nested dictionaries.
Solution 1:[1]
You do not need Selenium for this task; requests will do the job.
This code will get all the rows in the list:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://cov-lineages.org/lineage_list.html')
soup = BeautifulSoup(res.text, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    print(row)
From here you can get the individual cells of each row with row.find_all('td'). Use the browser inspector (Ctrl+Shift+I) to identify the HTML elements you need.
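As a minimal sketch of that step, the snippet below pulls the cell text out of each row. It runs on a small inline HTML table (a made-up stand-in, so the example works offline); the column layout of the real page is an assumption and should be checked in the inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for res.text; the real page's columns may differ.
sample_html = """
<table>
  <tr><th>Lineage</th><th>Description</th></tr>
  <tr><td>A</td><td>Root lineage</td></tr>
  <tr><td>B</td><td>Second major haplotype</td></tr>
</table>
"""

def table_rows(html):
    """Return one list of cell strings per data row."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for row in soup.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:  # header rows contain <th> cells only, so they are skipped
            rows.append(cells)
    return rows

parsed_rows = table_rows(sample_html)
print(parsed_rows)
```

With the live page you would pass res.text instead of sample_html.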
Solution 2:[2]
All of the data is in the JSON source that the site uses to render the page, so just get that data directly; it's more efficient. This gets everything you'd scrape with Selenium in a fraction of the time: seconds rather than hours. With Selenium you would have to click on each of the 1907 parent links, followed by the sublinks under them (I don't know exactly how many, but it appears to be around 2181 links in total).
Converting the data into the desired nested output was a little tricky: I had to work out which lineages are descendants of which parents, and then construct the tree from the leaf nodes up. I'm sure there is a better way to code it, but I think this manages to do it:
import requests
import pandas as pd
import re

# Source data
# This will get each individual lineage data into the desired form
_url = 'https://raw.githubusercontent.com/cov-lineages/lineages-website/master/_data/lineage_data.json'
jsonData = requests.get(_url).json()
jsonData = list(jsonData.values())

sourceData = {}
for _each in jsonData:
    _lineage = _each['Lineage']
    _description = _each['Description']
    _most_common_countries = _each['Countries']
    _earliest_date = _each['Earliest date']
    _number_designated = _each['Number designated']
    _number_assigned = _each['Number assigned']
    sourceData[_lineage] = {
        'id': _lineage,
        'description': _description,
        'most_common_countries': _most_common_countries,
        'earliest_date': _earliest_date,
        'number_designated': _number_designated,
        'number_assigned': _number_assigned,
        'children': []}
# This parses the yml file to work out which child belongs to which parent
_url = 'https://cov-lineages.org/data/lineages.yml'
_response = requests.get(_url).text
# Each match is a (key, value) tuple such as ('name: ', 'B.1') or ('parent: ', 'B')
_lineages = re.findall('(name: |parent: )(.*)', _response)

parent_children = {}
# Create dictionary of all parent lineages
for _idx, _lineage in enumerate(_lineages):
    if _lineage[0] == 'parent: ' and _lineage[1] != '' and _lineage[1] not in parent_children.keys():
        parent_children[_lineage[-1]] = {'children': []}
    if _lineage[1] == '' and _lineages[_idx-1][1] not in parent_children.keys():
        parent_children[_lineages[_idx-1][1]] = {'children': []}

# Match parent with appropriate children
for _idx, _lineage in enumerate(_lineages):
    if (_idx+1 == len(_lineages) or (_lineages[_idx][0] == 'name: ' and _lineages[_idx+1][0] == 'name: ')) or (_lineages[_idx+1][-1] == ''):
        continue
    if _lineages[_idx+1][0] == 'parent: ':
        parent_children[_lineages[_idx+1][-1]]['children'].append(_lineages[_idx][-1])
# Creates a list and dictionary so that I can look up the parent
# of a child by its key/lineage id
parent_child_relations = []
child_parent_relations = {}
for parent, children in parent_children.items():
    child_list = children['children']
    for child in child_list:
        parent_child_relations.append([parent, child])
        child_parent_relations.update({child: parent})
# Creates the "family tree" of each child to then iterate through:
# each child maps to its ancestors, ordered from nearest parent to root
nested_child_parent = {}
for each in child_parent_relations:
    familyOrder = []
    current = each
    belong_to = child_parent_relations[current]
    familyOrder.append(belong_to)
    continueLoop = True
    while continueLoop:
        current = belong_to
        try:
            belong_to = child_parent_relations[current]
            familyOrder.append(belong_to)
        except KeyError:
            # current has no parent, so the root has been reached
            continueLoop = False
    #familyOrder.reverse()
    nested_child_parent[each] = familyOrder
# Sorts that list from the "deepest" branches so that I can
# reconstruct from bottom leaf
sorted_nested_child_parent = {}
for each in nested_child_parent.items():
    length_of_branches = len(each[-1])
    if length_of_branches not in sorted_nested_child_parent.keys():
        sorted_nested_child_parent[length_of_branches] = []
    sorted_nested_child_parent[length_of_branches].append(each)

lengthKeys = list(sorted_nested_child_parent.keys())
lengthKeys.sort()
lengthKeys.reverse()
# Starts to add the children lineage data into appropriate parent's children list
# in the source data
for x in lengthKeys:
    listToAggregate = sorted_nested_child_parent[x]
    for each in listToAggregate:
        current = each[0]
        for parent in each[1]:
            lineageData = sourceData[current]
            if parent not in sourceData.keys():
                sourceData[parent] = {
                    'id': parent,
                    'description': 'NA',
                    'most_common_countries': 'NA',
                    'earliest_date': 'NA',
                    'number_designated': 'NA',
                    'number_assigned': 'NA',
                    'children': []}
            # if lineageData not already in children, add it
            if lineageData not in sourceData[parent]['children']:
                sourceData[parent]['children'].append(lineageData)
            current = parent
# Gets the list of the main/top lineages
mainNodes = []
parent_list = list(pd.read_html('https://cov-lineages.org/lineage_list.html')[0]['Lineage'])
for each in parent_list:
    # A lineage with no entry in child_parent_relations has no parent
    if each not in child_parent_relations:
        print(f'{each} is not a child.')
        mainNodes.append(each)

# Gets the main/top lineages from the source data
# and puts into the output list
output = []
for each in mainNodes:
    output.append(sourceData[each])
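To inspect the result, the nested list can be serialized with the standard json module. The sketch below uses a tiny hand-written stand-in for output (hypothetical values) so that it runs on its own; with the real data you would pass output directly.

```python
import json

# Hypothetical stand-in with the same shape as `output` above.
example_output = [
    {'id': 'A', 'description': 'NA', 'children': [
        {'id': 'B', 'description': 'NA', 'children': []},
    ]},
]

# indent=2 pretty-prints the nesting; json.dumps escapes non-ASCII
# characters by default (ensure_ascii=True).
as_json = json.dumps(example_output, indent=2)
print(as_json)

# Optionally write it to disk (the file name is only an example):
# with open('lineages.json', 'w') as fh:
#     json.dump(example_output, fh, indent=2)
```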
Sample Output:
[
  {
    "id": "A",
    "description": "Root of the pandemic lies within lineage A. Many sequences originating from China and many global exports; including to South East Asia Japan South Korea Australia the USA and Europe represented in this lineage",
    "most_common_countries": "United States of America 27.0%, United_Arab_Emirates 12.0%, China 9.0%, Germany 8.0%, Canada 5.0%",
    "earliest_date": "2019-12-30",
    "number_designated": 1698,
    "number_assigned": 2317,
    "children": [
      {
        "id": "B",
        "description": "Second major haplotype (and first to be discovered)",
        "most_common_countries": "United States of America 37.0%, United Kingdom 20.0%, China 7.0%, Mexico 6.0%, Germany 3.0%",
        "earliest_date": "2019-12-24",
        "number_designated": 4009,
        "number_assigned": 9162,
        "children": [
          {
            "id": "B.1",
            "description": "A large European lineage the origin of which roughly corresponds to the Northern Italian outbreak early in 2020.",
            "most_common_countries": "United States of America 46.0%, United Kingdom 8.0%, Turkey 8.0%, Canada 4.0%, France 4.0%",
            "earliest_date": "2020-01-03",
            "number_designated": 46252,
            "number_assigned": 95711,
            "children": [
              {
                "id": "B.1.1",
                "description": "European lineage with 3 clear SNPs `28881GA`,`28882GA`,`28883GC`",
                "most_common_countries": "United Kingdom 27.0%, United States of America 14.0%, Japan 7.0%, Russia 5.0%, Turkey 4.0%",
                "earliest_date": "2020-01-08",
                "number_designated": 22834,
                "number_assigned": 49224,
                "children": [
                  {
                    "id": "B.1.1.1",
                    "description": "England",
                    "most_common_countries": "United Kingdom 53.0%, Peru 10.0%, Belgium 4.0%, United States of America 3.0%, Italy 2.0%",
                    "earliest_date": "2020-03-02",
                    "number_designated": 1745,
                    "number_assigned": 2913,
                    "children": [
                      {
                        "id": "C.36",
                        "description": "Alias of B.1.1.1.36, Egypt mainly and other countries",
                        "most_common_countries": "Egypt 33.0%, Germany 11.0%, United Kingdom 10.0%, United States of America 7.0%, Denmark 6.0%",
                        "earliest_date": "2020-03-13",
                        "number_designated": 220,
                        "number_assigned": 1042,
                        "children": [
                          {
                            "id": "C.36.3",
                            "description": "Alias of B.1.1.1.36.3, Europe and USA lineage, from pango-designation issue #80",
                            "most_common_countries": "Germany 18.0%, United States of America 18.0%, Switzerland 9.0%, Italy 8.0%, United Kingdom 7.0%",
                            "earliest_date": "2021-01-04",
                            "number_designated": 493,
                            "number_assigned": 1681,
                            "children": [
                              {
                                "id": "C.36.3.1",
                                "description": "Alias of B.1.1.1.36.3.1, Europe and USA lineage, from pango-designation issue #80",
                                "most_common_countries": "Germany 64.0%, United States of America 18.0%, Belgium 9.0%, Bulgaria 3.0%, Netherlands 3.0%",
                                "earliest_date": "2021-03-29",
                                "number_designated": 54,
                                "number_assigned": 324,
                                "children": []
                              }
                            ]
                          },
                          {
                            "id": "C.36.1",
                            "description": "Alias of B.1.1.1.36.1, Canada",
                            "most_common_countries": "Canada 97.0%, United States of America 2.0%, Burkina_Faso 1.0%, Egypt 1.0%",
                            "earliest_date": "2020-06-24",
                            "number_designated": 21,
                            "number_assigned": 199,
                            "children": []
                          },
                          {
                            "id": "C.36.2",
                            "description": "Alias of B.1.1.1.36.2, Switzerland",
                            "most_common_countries": "Switzerland 80.0%, Norway 7.0%, Germany 3.0%, United States of America 3.0%, Sweden 3.0%",
                            "earliest_date": "2020-10-16",
                            "number_designated": 18,
                            "number_assigned": 30,
                            "children": []
                          }
                        ]
                      },
                      {
                        "id": "C.1",
                        "description": "Alias of B.1.1.1.1, South Africa",
                        "most_common_countries": "South_Africa 91.0%, Zambia 4.0%, United States of America 3.0%, Mozambique 1.0%, Zimbabwe 0.0%",
                        "earliest_date": "2020-01-03",
                        "number_designated": 242,
                        "number_assigned": 351,
                        "children": [
                          {
                            "id": "C.1.1",
                            "description": "Alias of B.1.1.1.1.1, Mozambique",
                            "most_common_countries": "Mozambique 100.0%",
                            "earliest_date": "2020-11-25",
                            "number_designated": 12,
                            "number_assigned": 13,
                            "children": []
                          },
                          {
                            "id": "C.1.2",
                            "description": "Alias of B.1.1.1.1.2, mostly South Africa, from pango-designation issue #139",
                            "most_common_countries": "South_Africa 88.0%, Eswatini 4.0%, Russia 2.0%, United Kingdom 1.0%, Botswana 1.0%",
                            "earliest_date": "2021-04-07",
                            "number_designated": 15,
                            "number_assigned": 281,
                            "children": []
                          }
                        ]
                      },
                      {
                        "id": "C.2",
                        "description": "Alias of B.1.1.1.2, South Africa and some European",
                        "most_common_countries": "South_Africa 44.0%, Zimbabwe 32.0%, Denmark 8.0%, United Kingdom 8.0%, Australia 6.0%",
                        "earliest_date": "2020-06-09",
                        "number_designated": 25,
                        "number_assigned": 50,
                        "children": [
                          {
                            "id": "C.2.1",
                            "description": "Alias of B.1.1.1.2.1, Aruba and Curacao",
                            "most_common_countries": "Aruba 60.0%, United States of America 28.0%, Cura\u00e7ao 9.0%, Netherlands 3.0%, Finland 1.0%",
                            "earliest_date": "2020-12-18",
                            "number_designated": 58,
                            "number_assigned": 150,
                            "children": []
                          }
                        ]
                      }
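For comparison, the same kind of nesting can be built in a single pass once a child-to-parent mapping is available. Python dictionaries are shared by reference, so appending a child's record to its parent's children list keeps any later additions to that child visible. This is a sketch of an alternative technique, not the method used in the answer; the function name build_tree and the toy records and child_to_parent values are made up for illustration.

```python
def build_tree(records, child_to_parent):
    """Nest records under their parents and return the list of root records.

    records: dict mapping lineage id -> dict with at least a 'children' list
             (modified in place).
    child_to_parent: dict mapping child id -> parent id.
    """
    for child, parent in child_to_parent.items():
        # Create placeholder records for ids missing from the source data.
        for key in (child, parent):
            records.setdefault(key, {'id': key, 'children': []})
        records[parent]['children'].append(records[child])
    # Roots are the records that never appear as a child.
    return [rec for key, rec in records.items() if key not in child_to_parent]

# Toy data (hypothetical), mirroring the shape of sourceData.
records = {name: {'id': name, 'children': []} for name in ['A', 'B', 'B.1', 'A.1']}
child_to_parent = {'B': 'A', 'B.1': 'B', 'A.1': 'A'}

tree = build_tree(records, child_to_parent)
print([node['id'] for node in tree])
```

Because each child is attached exactly once, this avoids sorting branches by depth, at the cost of assuming every lineage has at most one parent.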
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mark |
| Solution 2 |
