How do I get all hrefs in a ul on a page with a scrollbar?

I would like to get all hrefs that are within the li elements of a particular ul on the page.

So far I have written this:

  import bs4, requests

  def get_product_pages(url):
      # Fetch the category page and print every href inside the matching li elements
      res = requests.get(url)
      soup = bs4.BeautifulSoup(res.text, 'html.parser')
      for li in soup.find_all('li', attrs={'class': 'taxonomy-sub-selector_root__3rtWx'}):
          for a in li.find_all('a', href=True):
              print(a.attrs['href'])

  get_product_pages('https://www.ah.nl/producten/aardappel-groente-fruit')

But it only gives me the hrefs from the first three li elements. Why only the first three, and how do I get all eight?

The page has a scrollbar. Could that be causing the problem?



Solution 1:[1]

The taxonomies and all other page data are stored inside a <script> tag in the page, so BeautifulSoup doesn't see them as HTML elements. To get all child taxonomies of the current category, you can use the following example (parsing the <script> tag with re/json):

import re
import json
import requests


base_url = "https://www.ah.nl/producten"
url = base_url + "/aardappel-groente-fruit/fruit"

html_doc = requests.get(url).text

data = re.search(r"window\.__INITIAL_STATE__= ({.*})", html_doc)
data = data.group(1).replace("undefined", "null")
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

taxonomies = {t["id"]: t for t in data["taxonomy"]["topLevel"]}
for t in data["taxonomy"]["taxonomies"]:
    taxonomies[t["id"]] = t


def get_taxonomy(t, current, dupl=None):
    if dupl is None:
        dupl = set()
    tmp = current + "/" + t["slugifiedName"]
    yield tmp
    for c in t["children"]:
        if c in taxonomies and c not in dupl:
            dupl.add(c)
            yield from get_taxonomy(taxonomies[c], tmp, dupl)


for top in taxonomies.values():
    if top["parents"] == [0]:  # start from the root categories
        for path in get_taxonomy(top, base_url):
            if url in path:  # print only URLs under the current category
                print(path)

Prints:

https://www.ah.nl/producten/aardappel-groente-fruit/fruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/appels
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/appels/groente-en-fruitbox
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/bananen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/sinaasappels-mandarijnen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/peren
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/ananas-mango-kiwi
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/aardbeien-frambozen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/druiven-kersen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/bramen-bessen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/abrikozen-pruimen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/abrikozen-pruimen/exotisch-fruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/perziken-nectarines
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/meloen-kokosnoot
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/grapefruit-minneola
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/citroen-limoen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/fruit-spread
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/vijgen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/kaki-papaya-cherimoya
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/granaatappel-passiefruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/fruitsalade-mix
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/gedroogd-fruit
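
The same extraction and traversal can be tried offline. The following is a minimal sketch, not the answer's exact code: SAMPLE_HTML is a made-up, heavily shortened stand-in for the real page, and the helper names extract_initial_state and walk are hypothetical. It assumes the state object sits on a single line, as the regex in the answer does. It also raises a clear error when the script variable is missing, and it replaces `undefined` only after a colon, which is narrower than a blanket str.replace but still a heuristic.

```python
import json
import re

# Hypothetical, shortened stand-in for the real page (assumption for illustration).
SAMPLE_HTML = """
<html><body>
<script>window.__INITIAL_STATE__= {"taxonomy": {"topLevel": [{"id": 1, "slugifiedName": "fruit", "parents": [0], "children": [2, 3], "image": undefined}], "taxonomies": [{"id": 2, "slugifiedName": "appels", "parents": [1], "children": []}, {"id": 3, "slugifiedName": "bananen", "parents": [1], "children": []}]}}</script>
</body></html>
"""


def extract_initial_state(html):
    """Return the embedded state as a dict, or raise ValueError if it is absent."""
    match = re.search(r"window\.__INITIAL_STATE__\s*=\s*({.*})", html)
    if match is None:
        raise ValueError("window.__INITIAL_STATE__ not found in page")
    # JSON has no `undefined`; swap it for null where it follows a colon (heuristic)
    raw = re.sub(r":\s*undefined\b", ": null", match.group(1))
    return json.loads(raw)


def walk(taxonomies, node, prefix, seen=None):
    """Yield the URL of `node` and of every descendant, depth first."""
    if seen is None:
        seen = set()
    path = prefix + "/" + node["slugifiedName"]
    yield path
    for child_id in node["children"]:
        # Skip unknown ids and ids already visited, so shared children appear once
        if child_id in taxonomies and child_id not in seen:
            seen.add(child_id)
            yield from walk(taxonomies, taxonomies[child_id], path, seen)


state = extract_initial_state(SAMPLE_HTML)

# Index every taxonomy by id so children can be looked up directly
by_id = {t["id"]: t for t in state["taxonomy"]["topLevel"]}
by_id.update({t["id"]: t for t in state["taxonomy"]["taxonomies"]})

roots = [t for t in by_id.values() if t["parents"] == [0]]
paths = [p for root in roots for p in walk(by_id, root, "https://www.ah.nl/producten")]
for p in paths:
    print(p)
```

Against the live site, the html argument would be requests.get(url).text. The key names used here (taxonomy, topLevel, taxonomies, slugifiedName, parents, children) are taken from the answer above and may change whenever the site is updated.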

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Andrej Kesely