BeautifulSoup getting href of a list with 10k+ records
I have the following soup:
<a href="some_url">next</a>
<span class="class">...</span>
From this I want to extract the href, "some_url", and then the whole list of pages linked from this page: https://www.catholic-hierarchy.org/diocese/laa.html
Note: the page has a lot of links to sub-pages, which I need to parse. At the moment I'm working from the standard documentation over at Crummy, but I'm looking for something a little more organized.
What I have is:
from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
The expected output should be:
Found the URL: some_url
Found the URL: another_url
And if you look at the page mentioned above, I want all tags with an href, so I thought I would have to omit the name parameter:
href_tags = soup.find_all(href=True)
Any ideas on how to get the first steps done?
Solution 1:[1]
This example grabs the URLs of all dioceses, collects some info about each of them, and creates a final dataframe. To speed up the process, multiprocessing.Pool is used:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool

# defined at module level so that worker processes can see it, even when
# they are spawned rather than forked (Windows, macOS)
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}


def get_dioceses_urls(section_url):
    dioceses_urls = set()

    while True:
        print(section_url)

        soup = BeautifulSoup(
            requests.get(section_url, headers=headers).content, "lxml"
        )
        # links to diocese pages start with "d", e.g. d2a01.html
        for a in soup.select('ul a[href^="d"]'):
            dioceses_urls.add(
                "https://www.catholic-hierarchy.org/diocese/" + a["href"]
            )

        # is there Next Page button?
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if next_page:
            section_url = (
                "https://www.catholic-hierarchy.org/diocese/"
                + next_page["href"]
            )
        else:
            break

    return dioceses_urls


def get_diocese_info(url):
    print(url)

    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html5lib")

    data = {
        "Title 1": soup.h1.get_text(strip=True),
        "Title 2": soup.h2.get_text(strip=True),
        "Title 3": soup.h3.get_text(strip=True) if soup.h3 else "-",
        "URL": url,
    }

    # find the <li> without child tags that holds "Type of Jurisdiction:"
    li = soup.find(
        lambda tag: tag.name == "li"
        and "type of jurisdiction:" in tag.text.lower()
        and tag.find() is None
    )
    if li:
        # every "key: value" item of the enclosing list becomes a column
        for l in li.find_previous("ul").find_all("li"):
            t = l.get_text(strip=True, separator=" ")
            if ":" in t:
                k, v = t.split(":", maxsplit=1)
                data[k.strip()] = v.strip()

    # get other info about the diocese
    # ...

    return data


if __name__ == "__main__":
    # get main sections:
    url = "https://www.catholic-hierarchy.org/diocese/laa.html"
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    main_sections = [url]
    for a in soup.select("a[target='_parent']"):
        main_sections.append(
            "https://www.catholic-hierarchy.org/diocese/" + a["href"]
        )

    all_data, dioceses_urls = [], set()
    with Pool() as pool:
        # get all dioceses urls:
        for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
            dioceses_urls.update(urls)

        # get info about all dioceses:
        for info in pool.imap_unordered(get_diocese_info, dioceses_urls):
            all_data.append(info)

    # create dataframe from the info about dioceses
    df = pd.DataFrame(all_data).sort_values("Title 1")

    # save it to csv file
    df.to_csv("data.csv", index=False)

    # to_markdown() needs the tabulate package
    print(df.head().to_markdown())
Prints:
|    | Title 1 | Title 2 | Title 3 | URL | Type of Jurisdiction | Established | Description | Elevated | Metropolitan | Rite | Country | Mailing Address | Italian Title | Erected | Square Kilometers | Telephone | Official Web Site | Fax | Province | Conference Region | Catholic Directory Abbreviation | Name Changed | State | Region | Web Site | United | Split | Restored | Cardinal’s Blog | The Pilot | Territory Added | Offcial Web Site | Cathedral | See Transferred | Diocesan Newspaper | Catholic Communications Network | Square Miles | Vatican Web Site | Official Web Site (old) | Santuario della Santa Casa di Loreto | Blog | Official Blog | Cathedral Web Site | Archdiocesan Newspaper | Catholic News Service | Basilica de Esquipulas | Parish |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1934 | Abaradira | (Titular See) | Abaradirensis | https://www.catholic-hierarchy.org/diocese/d2a01.html | Titular See | 1933 | pr. Bizacena | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 632 | Abari | (Titular See) | Abaritanus | https://www.catholic-hierarchy.org/diocese/d2a02.html | Titular See | 1933 | pr. Bizacena | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 5441 | Abbir Germaniciana | (Titular See) | Abbiritanus Germanicianorum | https://www.catholic-hierarchy.org/diocese/d2a03.html | Titular See | 1933 | pr. Proconsolare; m. Cartagine | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4388 | Abbir Maius | (Titular See) | Abbiritanus | https://www.catholic-hierarchy.org/diocese/d2a04.html | Titular See | nan | pr. Proconsolare; m. Cartagine | nan | nan | nan | nan | nan | Abbir Maggiore | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 6065 | Abdera | (Titular See) | Abderitanus | https://www.catholic-hierarchy.org/diocese/d4a57.html | Titular See | nan | pr. Rhodope in Tracia; m. Traianopoli | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
and saves data.csv.
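The link selection and the "Next Page" loop can be tried without touching the network. The following is only a minimal sketch: the two-page HTML in PAGES is made up to resemble the listing pages, and collect_links and its fetch parameter are hypothetical names, not part of the code above. It uses the same two CSS selectors, and urljoin to resolve relative links instead of string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.catholic-hierarchy.org/diocese/"

# Hypothetical two-page listing standing in for the real site.
PAGES = {
    BASE + "laa.html": """
        <ul>
          <li><a href="d2a01.html">Abaradira</a></li>
          <li><a href="d2a02.html">Abari</a></li>
          <li><a href="../country/xx.html">not a diocese</a></li>
        </ul>
        <a href="laa2.html"><img alt="[Next Page]" src="next.gif"></a>
    """,
    BASE + "laa2.html": """
        <ul>
          <li><a href="d2a03.html">Abbir Germaniciana</a></li>
        </ul>
    """,
}


def collect_links(start_url, fetch):
    """Follow "Next Page" links and gather every diocese URL on the way."""
    urls, url = set(), start_url
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        # only links inside a <ul> whose href starts with "d"
        for a in soup.select('ul a[href^="d"]'):
            urls.add(urljoin(url, a["href"]))
        # the <a> that wraps the "Next Page" image, if there is one
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        url = urljoin(url, next_page["href"]) if next_page else None
    return urls


# PAGES.__getitem__ plays the role of requests.get(...).content here
links = collect_links(BASE + "laa.html", PAGES.__getitem__)
for link in sorted(links):
    print(link)
```

Passing the fetch function in as a parameter keeps the parsing logic testable; for the real site it could be swapped for a small wrapper around requests.get.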
Solution 2:[2]
This is my first answer on Stack Overflow. I hope I can help :)
Here is a simple way to get the URLs on the page you specified.
We will use a module called scrapeasy, a simple web-scraping library for Python. It is very easy to use, yet quite functional.
Let's go to the example:
from scrapeasy import Website

web = Website("https://targetsite.com")
links = web.getSubpagesLinks()

# print every sub-page link found on the site
for link in links:
    print(link)
That is how simple Scrapeasy is to use. I recommend reviewing the documentation to discover its other methods.
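For comparison, roughly the same result can be had with BeautifulSoup alone, using the find_all(href=True) call from the question. This is a minimal sketch, not a drop-in replacement: extract_links is a hypothetical helper and sample_html is made-up markup. Omitting the tag name means that every tag carrying an href is matched, including link tags in the page head, not only a tags.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute, de-duplicated href of every tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    seen, links = set(), []
    # no tag name given: matches <a>, <link>, <area>, ... with an href
    for tag in soup.find_all(href=True):
        absolute = urljoin(base_url, tag["href"])
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up markup for illustration only.
sample_html = """
<link rel="stylesheet" href="/style.css">
<a href="d2a01.html">Abaradira</a>
<span class="class"><a href="d2a02.html">Abari</a></span>
<a href="d2a01.html">Abaradira (duplicate)</a>
<a name="top">anchor without href</a>
"""

base = "https://www.catholic-hierarchy.org/diocese/laa.html"
result = extract_links(sample_html, base)
for link in result:
    print(link)
```

If only hyperlinks are wanted, soup.find_all("a", href=True) restricts the match to a tags.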
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |

