Beautiful Soup: Extract text at the a anchor after url
I have some HTML where the URL in the <a> tag's href comes before the title that appears on the page. I am trying to extract both that title and the URL into a data frame. The following code is what I have so far.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://patentsview.org/download/data-download-tables'
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find_all("div", class_="file-title")
print(results)
pd.DataFrame([a.text for a in soup.select('.file-title a')], columns=['Title'])
As it stands, I only have the one column. I would like the results to be in the following format:
| Title | URL |
|---|---|
| application | URL1 |
| assignee | URL2 |
| ... | ... |
I was following this page on Real Python, but I have come to a standstill because I cannot seem to translate their next part to my needs.
Any help with this would be wonderful. Thank you in advance for your help.
EDIT 1: I have made some edits to the original question. I want to expand it to also include, in a second column, the URL that the title is attached to. I have also incorporated the code that was provided in the first answer.
Solution 1:[1]
Just call .text on the <a> in each <div> to print your information:
    for e in soup.find_all("div", class_="file-title"):
        print(e.a.text)
Or with a CSS selector:
    for a in soup.select('.file-title a'):
        print(a.text)
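To see how the two approaches relate without hitting the network, here is a minimal sketch run against a small inline HTML string. The markup is hypothetical: only the class name file-title comes from the question, and the third div (with no link) is added to show where the two approaches can differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; only the class name
# "file-title" is taken from the question, everything else is made up.
html = """
<div class="file-title"><a href="/files/application.zip">application</a></div>
<div class="file-title"><a href="/files/assignee.zip">assignee</a></div>
<div class="file-title">no link here</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns the <div> elements; .a is None when a div has no anchor,
# so guard against that before reading .text.
via_find_all = [
    div.a.text
    for div in soup.find_all("div", class_="file-title")
    if div.a
]

# The CSS selector matches the anchors directly, so a div without an <a>
# simply contributes nothing.
via_select = [a.text for a in soup.select(".file-title a")]

print(via_find_all)
print(via_select)
```

Without the `if div.a` guard, the find_all version would raise an AttributeError on the div that has no anchor, which is one reason the selector form can be a little more forgiving.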
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://patentsview.org/download/data-download-tables'
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
for e in soup.find_all("div", class_="file-title"):
    print(e.a.text)
Output
application
assignee
botanic
cpc_current
cpc_group
cpc_subgroup
cpc_subsection
figures
...
Or as a DataFrame:
pd.DataFrame([a.text for a in soup.select('.file-title a')], columns=['Title'])
Output:
| Title |
|---|
| application |
| assignee |
| botanic |
| cpc_current |
| cpc_group |
| cpc_subgroup |
| cpc_subsection |
| figures |
| foreigncitation |
| foreign_priority |
| government_interest |
| government_organization |
| inventor |
| ipcr |
| lawyer |
| location |
| mainclass |
| mainclass_current |
EDIT
Based on the comment, to get both "Title" and "Url":
data = []
for a in soup.select('.file-title a'):
    # Collect the link text and its href as one row
    data.append({
        'Title': a.text,
        'Url': a['href']
    })
pd.DataFrame(data)
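If some of the hrefs turn out to be relative, or some anchors have no href at all, a slightly more defensive variant may help. The sketch below is an assumption-laden illustration, not a description of the actual page: the inline HTML and its link targets are placeholders, and resolving links with urljoin against the page URL is an addition that the original answer does not use.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

# Page URL from the question, used here only as the base for relative links.
base_url = "https://patentsview.org/download/data-download-tables"

# Hypothetical markup; the hrefs are placeholders, not the site's real links.
html = """
<div class="file-title"><a href="/files/application.zip">application</a></div>
<div class="file-title"><a href="https://example.com/assignee.zip"> assignee </a></div>
<div class="file-title"><a>no href</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# "a[href]" only matches anchors that actually carry an href attribute,
# so a['href'] below cannot raise a KeyError.
for a in soup.select(".file-title a[href]"):
    rows.append({
        "Title": a.get_text(strip=True),       # strip surrounding whitespace
        "URL": urljoin(base_url, a["href"]),   # make relative links absolute
    })

df = pd.DataFrame(rows, columns=["Title", "URL"])
print(df)
```

Passing columns explicitly keeps the Title and URL headers even if no links are found, so downstream code sees the same shape either way.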
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
