BeautifulSoup: getting the href of every link in a list with 10k+ records

I have the following soup:

<a href="some_url">next</a>
<span class="class">...</span>

From this I want to extract the href, "some_url", and likewise the href of every page linked from this page: https://www.catholic-hierarchy.org/diocese/laa.html

Note: that page links to a large number of sub-pages, which I need to parse. At the moment I'm working from the standard documentation over at Crummy, but I'm looking for something a little more organized.

What I have is:

from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])

The expected output should be:

Found the URL: some_url
Found the URL: another_url

On the page mentioned above I want every tag that has an href, so I think I have to omit the name parameter:

href_tags = soup.find_all(href=True)

Any ideas on how to get the first steps done?



Solution 1:[1]

This example grabs the URLs of all dioceses, collects some information about each of them, and builds a final DataFrame. To speed up the process, multiprocessing.Pool is used:

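import pandas as pd
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool

# Module-level so that Pool workers can see it under the "spawn" start method
# (the default on Windows and macOS); a copy defined only under __main__ is
# not visible to workers there.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}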


def get_dioceses_urls(section_url):
    dioceses_urls = set()

    while True:
        print(section_url)

        soup = BeautifulSoup(
            requests.get(section_url, headers=headers).content, "lxml"
        )
        for a in soup.select('ul a[href^="d"]'):
            dioceses_urls.add(
                "https://www.catholic-hierarchy.org/diocese/" + a["href"]
            )

        # is there Next Page button?
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if next_page:
            section_url = (
                "https://www.catholic-hierarchy.org/diocese/"
                + next_page["href"]
            )
        else:
            break

    return dioceses_urls


def get_diocese_info(url):
    print(url)

    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html5lib")

    data = {
        "Title 1": soup.h1.get_text(strip=True),
        "Title 2": soup.h2.get_text(strip=True),
        "Title 3": soup.h3.get_text(strip=True) if soup.h3 else "-",
        "URL": url,
    }

    li = soup.find(
        lambda tag: tag.name == "li"
        and "type of jurisdiction:" in tag.text.lower()
        and tag.find() is None
    )
    if li:
        for l in li.find_previous("ul").find_all("li"):
            t = l.get_text(strip=True, separator=" ")
            if ":" in t:
                k, v = t.split(":", maxsplit=1)
                data[k.strip()] = v.strip()

    # get other info about the diocese
    # ...

    return data


if __name__ == "__main__":
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
    }

    # get main sections:
    url = "https://www.catholic-hierarchy.org/diocese/laa.html"
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    main_sections = [url]
    for a in soup.select("a[target='_parent']"):
        main_sections.append(
            "https://www.catholic-hierarchy.org/diocese/" + a["href"]
        )

    all_data, dioceses_urls = [], set()
    with Pool() as pool:
        # get all dioceses urls:
        for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
            dioceses_urls.update(urls)

        # get info about all dioceses:
        for info in pool.imap_unordered(get_diocese_info, dioceses_urls):
            all_data.append(info)

    # create dataframe from the info about dioceses
    df = pd.DataFrame(all_data).sort_values("Title 1")

    # save it to csv file
    df.to_csv("data.csv", index=False)
    # DataFrame.to_markdown() needs the optional "tabulate" package
    print(df.head().to_markdown())

Prints (first five rows; only the columns that hold a value in at least one of these rows are shown, every other column is nan):

|      | Title 1            | Title 2       | Title 3                     | URL                                                   | Type of Jurisdiction | Established | Description                           | Italian Title  |
|-----:|:-------------------|:--------------|:----------------------------|:------------------------------------------------------|:---------------------|:------------|:--------------------------------------|:---------------|
| 1934 | Abaradira          | (Titular See) | Abaradirensis               | https://www.catholic-hierarchy.org/diocese/d2a01.html | Titular See          | 1933        | pr. Bizacena                          | nan            |
|  632 | Abari              | (Titular See) | Abaritanus                  | https://www.catholic-hierarchy.org/diocese/d2a02.html | Titular See          | 1933        | pr. Bizacena                          | nan            |
| 5441 | Abbir Germaniciana | (Titular See) | Abbiritanus Germanicianorum | https://www.catholic-hierarchy.org/diocese/d2a03.html | Titular See          | 1933        | pr. Proconsolare; m. Cartagine        | nan            |
| 4388 | Abbir Maius        | (Titular See) | Abbiritanus                 | https://www.catholic-hierarchy.org/diocese/d2a04.html | Titular See          | nan         | pr. Proconsolare; m. Cartagine        | Abbir Maggiore |
| 6065 | Abdera             | (Titular See) | Abderitanus                 | https://www.catholic-hierarchy.org/diocese/d4a57.html | Titular See          | nan         | pr. Rhodope in Tracia; m. Traianopoli | nan            |

The full header row, as printed: Title 1 Title 2 Title 3 URL Type of Jurisdiction Established Description Elevated Metropolitan Rite Country Mailing Address Italian Title Erected Square Kilometers Telephone Official Web Site Fax Province Conference Region Catholic Directory Abbreviation Name Changed State Region Web Site United Split Restored Cardinal’s Blog The Pilot Territory Added Offcial Web Site Cathedral See Transferred Diocesan Newspaper Catholic Communications Network Square Miles Vatican Web Site Official Web Site (old) Santuario della Santa Casa di Loreto Blog Official Blog Cathedral Web Site Archdiocesan Newspaper Catholic News Service Basilica de Esquipulas Parish

The script also saves the result to data.csv.
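
The script above builds absolute URLs by prefixing each href with the /diocese/ base, which works as long as every href is a bare file name. As a hedged alternative, here is a minimal sketch that resolves hrefs with the standard library's urllib.parse.urljoin, so root-relative and already-absolute links are handled too. The helper name absolutize and the sample hrefs (other than d2a01.html, which appears in the output above) are assumptions made for illustration, not values taken from the site.

```python
from urllib.parse import urljoin

# Page the hrefs were found on (the listing page from the question).
PAGE_URL = "https://www.catholic-hierarchy.org/diocese/laa.html"


def absolutize(page_url, hrefs):
    """Resolve each href against page_url, keeping first-seen order and dropping duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs: a bare file name, a root-relative path,
# an already-absolute URL, and a repeat of the first one.
sample_hrefs = [
    "d2a01.html",
    "/country/d2.html",
    "https://example.org/page.html",
    "d2a01.html",
]

resolved = absolutize(PAGE_URL, sample_hrefs)
for url in resolved:
    print(url)
```

In the script this would replace the string concatenation, for example `urljoin(section_url, a["href"])`, with the page that was actually fetched as the base.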

Solution 2:[2]

This is my first answer on Stack Overflow; I hope I can help :)

I want to show you a simple way to get the URLs on the page you specified.

For this we will use a module called scrapeasy, a simple web-scraping library for Python. Although it is very simple to use, it is quite functional.

Let's go to our example:

from scrapeasy import Website

# replace with the site you want to scrape
web = Website("https://targetsite.com")
links = web.getSubpagesLinks()

for link in links:
    print(link)

Using Scrapeasy is as simple as that. I hope you review the documentation and discover its other methods.
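
If adding a dependency is not an option, the same idea (collect every href on a page, whatever tag carries it) can be sketched with the standard library's html.parser. This is a stand-in for illustration only; it is not how scrapeasy works internally, and the names HrefCollector and collect_hrefs are made up for this example. The sample HTML is the snippet from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the value of every href attribute, regardless of tag name."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def collect_hrefs(html):
    """Return all href values found in the given HTML string, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

for href in collect_hrefs(sample):
    print("Found the URL:", href)
```

For real pages the html argument would be the downloaded page text; BeautifulSoup's `soup.find_all(href=True)` from the question does the same job with a more forgiving parser.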

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2