BeautifulSoup: getting the href of a list with 10k+ records

I have the following soup:

<a href="some_url">next</a>
<span class="class">...</span>

From this I want to extract the href, "some_url", along with the hrefs of all the pages listed on this page: https://www.catholic-hierarchy.org/diocese/laa.html

Note: that page links to a large number of sub-pages, all of which I need to parse. At the moment I am working from the standard documentation at Crummy, but I am looking for something a little more organized.

What I have so far:

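# BeautifulSoup 4 is imported from bs4; the old BeautifulSoup 3 import does not provide find_all()
from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])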

The expected output should be:

Found the URL: some_url
Found the URL: another_url

If you look at the page mentioned above, I want every tag that has an href, so I thought I would have to omit the name parameter:

href_tags = soup.find_all(href=True)

Any ideas on how to get the first steps done?
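
Dropping the tag name means that every element with an href is matched, including <link> tags in the page head, and the values come back exactly as written, so relative links such as d2a01.html still have to be resolved against the page URL. Below is a minimal sketch of both points. It uses only the standard library's html.parser as a stand-in for find_all(href=True), so it runs without any third-party package; the HrefCollector class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href of every start tag that has one, whatever the tag name."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "d2a01.html" against the page URL.
                self.urls.append((tag, urljoin(self.base_url, value)))


# Stand-in markup (an assumption); the real page would be downloaded first.
html = """<link rel="stylesheet" href="/style.css">
<a href="d2a01.html">Abaradira</a>
<span class="class"><a href="another_url">later</a></span>
<a name="top">no href here</a>"""

collector = HrefCollector("https://www.catholic-hierarchy.org/diocese/laa.html")
collector.feed(html)
for tag, url in collector.urls:
    print(tag, url)
```

The <link> entry in the result shows why filtering on the tag name, or on a pattern in the href, is usually still needed after omitting the name.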

Solution 1:^[1]

This example grabs the URLs of all dioceses, collects some information about each of them, and builds a final DataFrame. To speed up the process, multiprocessing.Pool is used:

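import pandas as pd  # needed for the final DataFrame
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool

# Defined at module level so that worker processes can see it under the
# "spawn" start method (the default on Windows and macOS); the assignment
# under __main__ further down is then redundant.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}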


def get_dioceses_urls(section_url):
    dioceses_urls = set()

    while True:
        print(section_url)

        soup = BeautifulSoup(
            requests.get(section_url, headers=headers).content, "lxml"
        )
        for a in soup.select('ul a[href^="d"]'):
            dioceses_urls.add(
                "https://www.catholic-hierarchy.org/diocese/" + a["href"]
            )

        # is there Next Page button?
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if next_page:
            section_url = (
                "https://www.catholic-hierarchy.org/diocese/"
                + next_page["href"]
            )
        else:
            break

    return dioceses_urls


def get_diocese_info(url):
    print(url)

    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html5lib")

    data = {
        "Title 1": soup.h1.get_text(strip=True),
        "Title 2": soup.h2.get_text(strip=True),
        "Title 3": soup.h3.get_text(strip=True) if soup.h3 else "-",
        "URL": url,
    }

    li = soup.find(
        lambda tag: tag.name == "li"
        and "type of jurisdiction:" in tag.text.lower()
        and tag.find() is None
    )
    if li:
        for l in li.find_previous("ul").find_all("li"):
            t = l.get_text(strip=True, separator=" ")
            if ":" in t:
                k, v = t.split(":", maxsplit=1)
                data[k.strip()] = v.strip()

    # get other info about the diocese
    # ...

    return data


if __name__ == "__main__":
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
    }

    # get main sections:
    url = "https://www.catholic-hierarchy.org/diocese/laa.html"
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    main_sections = [url]
    for a in soup.select("a[target='_parent']"):
        main_sections.append(
            "https://www.catholic-hierarchy.org/diocese/" + a["href"]
        )

    all_data, dioceses_urls = [], set()
    with Pool() as pool:
        # get all dioceses urls:
        for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
            dioceses_urls.update(urls)

        # get info about all dioceses:
        for info in pool.imap_unordered(get_diocese_info, dioceses_urls):
            all_data.append(info)

    # create dataframe from the info about dioceses
    df = pd.DataFrame(all_data).sort_values("Title 1")

    # save it to csv file
    df.to_csv("data.csv", index=False)
    print(df.head().to_markdown())

Prints:

	Title 1	Title 2	Title 3	URL	Type of Jurisdiction	Established	Description	Elevated	Metropolitan	Rite	Country	Mailing Address	Italian Title	Erected	Square Kilometers	Telephone	Official Web Site	Fax	Province	Conference Region	Catholic Directory Abbreviation	Name Changed	State	Region	Web Site	United	Split	Restored	Cardinal’s Blog	The Pilot	Territory Added	Offcial Web Site	Cathedral	See Transferred	Diocesan Newspaper	Catholic Communications Network	Square Miles	Vatican Web Site	Official Web Site (old)	Santuario della Santa Casa di Loreto	Blog	Official Blog	Cathedral Web Site	Archdiocesan Newspaper	Catholic News Service	Basilica de Esquipulas	Parish
1934	Abaradira	(Titular See)	Abaradirensis	https://www.catholic-hierarchy.org/diocese/d2a01.html	Titular See	1933	pr. Bizacena	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
632	Abari	(Titular See)	Abaritanus	https://www.catholic-hierarchy.org/diocese/d2a02.html	Titular See	1933	pr. Bizacena	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
5441	Abbir Germaniciana	(Titular See)	Abbiritanus Germanicianorum	https://www.catholic-hierarchy.org/diocese/d2a03.html	Titular See	1933	pr. Proconsolare; m. Cartagine	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
4388	Abbir Maius	(Titular See)	Abbiritanus	https://www.catholic-hierarchy.org/diocese/d2a04.html	Titular See	nan	pr. Proconsolare; m. Cartagine	nan	nan	nan	nan	nan	Abbir Maggiore	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
6065	Abdera	(Titular See)	Abderitanus	https://www.catholic-hierarchy.org/diocese/d4a57.html	Titular See	nan	pr. Rhodope in Tracia; m. Traianopoli	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan

and saves the result to data.csv.
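
The pagination loop in get_dioceses_urls can be tried offline. The sketch below is a simplified variant, not the answer's code: fetch_links and FAKE_PAGES are hypothetical stand-ins for the download-and-parse step, the file name laa2.html is invented for illustration, and urljoin replaces the string concatenation so that relative hrefs are resolved against whichever page they came from.

```python
from urllib.parse import urljoin

BASE = "https://www.catholic-hierarchy.org/diocese/"

# Hypothetical stand-in for the site:
# page URL -> (diocese hrefs on that page, next-page href or None)
FAKE_PAGES = {
    BASE + "laa.html": (["d2a01.html", "d2a02.html"], "laa2.html"),
    BASE + "laa2.html": (["d2a02.html", "d2a03.html"], None),
}


def fetch_links(url):
    """Pretend to download and parse one listing page."""
    return FAKE_PAGES[url]


def crawl_section(start_url):
    """Follow next-page links from start_url; return the set of absolute diocese URLs."""
    found, seen, url = set(), set(), start_url
    # The seen set stops the loop if a next-page link ever points back to a visited page.
    while url and url not in seen:
        seen.add(url)
        hrefs, next_href = fetch_links(url)
        found.update(urljoin(url, h) for h in hrefs)
        url = urljoin(url, next_href) if next_href else None
    return found


for diocese_url in sorted(crawl_section(BASE + "laa.html")):
    print(diocese_url)
```

Collecting into a set, as the answer does, means a diocese listed on two pages is only fetched once in the second stage.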

Solution 2:^[2]

This is my first answer on Stack Overflow. I hope I can help :)

I want to show you a simple method to get the URLs on the page you specified.

For this we will use a module called scrapeasy, a simple web-scraping library for Python. Although it is very simple to use, it is quite functional.

Here is an example:

from scrapeasy import Website, Page

web = Website("https://targetsite.com")
links = web.getSubpagesLinks()

for link in links:
    print(link)

Using Scrapeasy is as simple as that. I recommend reviewing its documentation to discover the other available methods.
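
A list of every link on a site will usually contain more than the diocese pages. The sketch below shows one way to narrow it down, assuming the links arrive as a plain list of URL strings (the return type is not confirmed here) and that diocese detail pages are named d....html under /diocese/, as the URLs printed in Solution 1 suggest. The sample list, the DIOCESE_PAGE pattern, and the diocese_links helper are hypothetical.

```python
import re

# Assumed shape of the data: a plain list of URL strings, possibly with duplicates.
links = [
    "https://www.catholic-hierarchy.org/diocese/d2a01.html",
    "https://www.catholic-hierarchy.org/diocese/laa.html",
    "https://www.catholic-hierarchy.org/diocese/d2a01.html",
    "https://www.catholic-hierarchy.org/diocese/d4a57.html",
]

# Matches file names that start with "d" directly under /diocese/.
DIOCESE_PAGE = re.compile(r"/diocese/d[^/]*\.html$")


def diocese_links(urls):
    """Return the diocese detail URLs in first-seen order, without duplicates."""
    matching = (u for u in urls if DIOCESE_PAGE.search(u))
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(matching))


for link in diocese_links(links):
    print(link)
```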

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
