Web scraping DOI with BeautifulSoup

I'm currently working on a web scraping project and I need information from Google Scholar records. I need to extract the DOI of an article, and the relevant part of the HTML page looks like this:

<span data-v-d3a5356a="" class="metadata--doi">DOI:
      <a data-v-d3a5356a="" id="article--doi--link-metadataSec" href="//doi.org/10.1007/s00508-019-1485-6">10.1007/s00508-019-1485-6</a>&nbsp;</span>

I'm not able to extract it with the following code:

page = BeautifulSoup(response.text, 'html.parser')
page.find_all("span", "data-v-d3a5356a")

How can I extract the string "10.1007/s00508-019-1485-6"?



Solution 1:[1]

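That webpage is a dynamic page, which means the data is loaded by JavaScript after the initial HTML arrives. BeautifulSoup only parses the HTML that the request returns, so the DOI element is not in response.text at all. To scrape the rendered page you have to use a browser automation tool such as Selenium.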
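Once you do have the rendered HTML (for example from Selenium's page_source), the lookup itself is short. Here is a minimal sketch that runs on the snippet from the question pasted in as a string. The variable names are mine, and using https://europepmc.org/ as the base for the protocol-relative href is an assumption.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The snippet from the question, as it looks after JavaScript has rendered it.
rendered_html = """
<span data-v-d3a5356a="" class="metadata--doi">DOI:
      <a data-v-d3a5356a="" id="article--doi--link-metadataSec" href="//doi.org/10.1007/s00508-019-1485-6">10.1007/s00508-019-1485-6</a>&nbsp;</span>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# "data-v-d3a5356a" is an attribute name, not a class. A string passed as the
# second argument of find_all() is treated as a class filter, which is why
# find_all("span", "data-v-d3a5356a") finds nothing. Filter on the real class.
span = soup.find("span", class_="metadata--doi")
link = span.find("a") if span else None

if link is not None:
    doi = link.get_text(strip=True)
    # The href is protocol-relative ("//doi.org/..."), so resolve it against
    # the page URL (assumed here) to get an absolute link.
    doi_url = urljoin("https://europepmc.org/", link["href"])
    print(doi)
    print(doi_url)
```

This does not help with response.text from requests, because the span is not in that HTML in the first place.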

However, if you look at the Network tab in Chrome DevTools, you can see that the data is loaded from an API. You can fetch the data directly from that API, using the URL shown in the code below.

Here is how to extract the DOI from that API endpoint:

import requests

url = 'https://europepmc.org/api/get/articleApi?query=(EXT_ID:30980146%20AND%20SRC:med)&format=json&resultType=core'
r = requests.get(url)
x = r.json()

print(f"DOI: {x['resultList']['result'][0]['doi']}")

Output:

DOI: 10.1007/s00508-019-1485-6
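If you need this for more than one article, a slightly more defensive version might look like the sketch below. The helper names extract_doi and fetch_doi are hypothetical, and the URL template simply assumes that other records can be requested by swapping the EXT_ID value in the same query; I have only seen it work for the record above.

```python
import requests

# Assumed template: the endpoint from this answer with the record ID made variable.
API_URL = (
    "https://europepmc.org/api/get/articleApi"
    "?query=(EXT_ID:{ext_id}%20AND%20SRC:med)&format=json&resultType=core"
)


def extract_doi(payload):
    """Return the DOI of the first result in an API response, or None."""
    results = payload.get("resultList", {}).get("result", [])
    if not results:
        return None
    return results[0].get("doi")


def fetch_doi(ext_id):
    """Hypothetical helper: request one record and return its DOI, or None."""
    response = requests.get(API_URL.format(ext_id=ext_id), timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_doi(response.json())


# Offline check with a payload trimmed down to the keys used above.
sample = {"resultList": {"result": [{"doi": "10.1007/s00508-019-1485-6"}]}}
print(extract_doi(sample))
print(extract_doi({"resultList": {"result": []}}))
```

Calling fetch_doi("30980146") would hit the same URL as the code above. Keeping the parsing in extract_doi means a record without a DOI, or an empty result list, gives you None instead of a KeyError or IndexError.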

Solution 2:[2]

Ram has already shown how to get the DOI from europepmc.org. I have added code to extract the DOI link and the abstract as well, and combined everything with parsing the same fields (DOI, DOI URL, abstract) from ieeexplore.ieee.org.

Have a look at the parsed JSON string from ieeexplore.ieee.org:

from bs4 import BeautifulSoup
import requests, re, json

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

links = [
    "https://europepmc.org/api/get/articleApi?query=(EXT_ID:30980146%20AND%20SRC:med)&format=json&resultType=core",
    "https://ieeexplore.ieee.org/abstract/document/9599583"
]

data = []

for link in links:
    if "ieeexplore" in link:
        html = requests.get(link, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        # https://regex101.com/r/8vfYNp/1
        # Pull the embedded metadata JSON out of the <script> tags once, then read the fields from it.
        metadata = json.loads(re.findall(r"xplGlobal\.document\.metadata=(.*?);", str(soup.select("script")))[0])
        doi = metadata["doi"]
        doi_link = metadata["doiLink"]
        abstract = metadata["abstract"]

        data.append({
            "parsed_url": link,
            "doi": doi,
            "doi_link": doi_link,
            "abstract": abstract,
        })
    else:
        # This endpoint returns JSON, so no HTML parsing is needed.
        result = requests.get(link, headers=headers, timeout=30).json()["resultList"]["result"][0]

        doi = result["doi"]
        doi_link = result["fullTextUrlList"]["fullTextUrl"][0]["url"]
        abstract = result["abstractText"]

        data.append({
            "parsed_url": link,
            "doi": doi,
            "doi_link": doi_link,
            "abstract": abstract,
        })

print(json.dumps(data, indent=2))

Full output:

[
  {
    "parsed_url": "https://europepmc.org/api/get/articleApi?query=(EXT_ID:30980146%20AND%20SRC:med)&format=json&resultType=core",
    "doi": "10.1007/s00508-019-1485-6",
    "doi_link": "https://doi.org/10.1007/s00508-019-1485-6",
    "abstract": "This position statement is based on current evidence available on the safety and benefits of continuous subcutaneous insulin infusion therapy (CSII, pump therapy) in diabetes with an emphasis on the effects of CSII on glycemic control, hypoglycaemia rates, occurrence of ketoacidosis, quality of life and the use of insulin pump therapy in pregnancy. The current article represents the recommendations of the Austrian Diabetes Association for the clinical praxis of insulin pump treatment in children, adolescents and adults."
  },
  {
    "parsed_url": "https://ieeexplore.ieee.org/abstract/document/9599583",
    "doi": "10.1109/JPHOT.2021.3124611",
    "doi_link": "https://doi.org/10.1109/JPHOT.2021.3124611",
    "abstract": "This paper comprehensively investigated noise characteristics of superluminal propagation based on low-noise single-frequency Brillouin lasing oscillation with the aid of a population inversion dynamic grating. Thanks to high-degree polarization alignment between the Brillouin pump and the lased Stokes lightwaves in polarization maintaining fibers, efficient Brillouin lasing resonance with over 10-dB relative intensity noise suppression has been demonstrated to activate Brillouin loss-induced anomalous dispersion in the vicinity of pump signals, benefiting a noise-insensitive superluminal propagation along kilometer-long optical fibers with robust resistance to ambient disturbance. Consequently, sinusoidally modulated pump signals experienced the time advancement of 4634.0 ns at the group velocity of 10.63\n<italic xmlns:mml=\"http://www.w3.org/1998/Math/MathML\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">c</i>\n. Results show that the variance of the fractional advancement with polarization maintaining fibers is 2.54 \u00d7 10\n<sup xmlns:mml=\"http://www.w3.org/1998/Math/MathML\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\u22124</sup>\n which is two orders of magnitude lower than that of conventional single mode fibers. Furthermore, the dependence of the group velocity on the modulation frequency was experimentally investigated, showing good agreement with the theoretical analysis."
  }
]
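To see the IEEE part in isolation: the page assigns a JSON object to xplGlobal.document.metadata inside a script tag, and the code above cuts it out with a non-greedy regex that stops at the first semicolon. That would presumably break if a semicolon ever appeared inside one of the JSON strings. Here is a sketch of an alternative that uses json.JSONDecoder().raw_decode instead. The helper name extract_metadata and the sample_html string are made up for illustration (the abstract text in it is not real data), and I am assuming you pass the raw response text.

```python
import json
import re

# Made-up stand-in for the page source; the real page is much larger.
sample_html = (
    "<script>"
    'xplGlobal.document.metadata={"doi":"10.1109/JPHOT.2021.3124611",'
    '"doiLink":"https://doi.org/10.1109/JPHOT.2021.3124611",'
    '"abstract":"placeholder text; with a semicolon"};'
    "</script>"
)


def extract_metadata(page_source):
    """Return the object assigned to xplGlobal.document.metadata, or None."""
    match = re.search(r"xplGlobal\.document\.metadata=\s*", page_source)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index and
    # ignores whatever follows, so semicolons inside strings do no harm.
    metadata, _end = json.JSONDecoder().raw_decode(page_source, match.end())
    return metadata


metadata = extract_metadata(sample_html)
print(metadata["doi"])
print(metadata["doiLink"])
```

With the real page you would call extract_metadata(html.text) and read the "doi", "doiLink" and "abstract" keys the same way as before.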

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1
Solution 2   Dmitriy Zub