Get a list of elements into separated arrays

Hello fellow developer out there,

I'm new to Python and I need to write a web scraper to collect info from Google Scholar.

I ended up coding this function to get values using XPath:

def clothoSpins(exp, atr=False):
    thread = browser.find_elements(By.XPATH, exp)
    xArray = []

    for t in thread:
        if not atr:
            xThread = t.text
        else:
            xThread = t.get_attribute('href')

        xArray.append(xThread)

    return xArray

I don't know whether this is a good or a bad solution, so I humbly accept any suggestions to make it work better.

Anyway, my actual problem is that I am getting all of the author names from the page I am scraping, but what I really need is the names grouped by result. When I print the results, I wish I could get something like this:

[[author1, author2, author3], [author4, author5, author6]]

What I am getting right now is:

[author1, author3, author4, author5, author6]

The structure is as follows:

<div class="gs_a">
    LR Hisch,
<a href="/citations?user=xuBuLKYAAAAJ&amp;hl=es&amp;oi=sra">AM Gobin</a>
    ,AR Lowery,
<a href="/citations?user=ziumTX0AAAAJ&amp;hl=es&amp;oi=sra">F Tam</a>
 ... -Annals of biomedical ...,2006 - Springer
</div>

And the same structure is repeated all over the page for different documents and authors.

And this is the call to the function I explained earlier:

authors = clothoSpins(".//*[@class='gs_a']//a")

Which gets me the entire list of authors.



Solution 1:[1]

Here is the logic (Selenium is used in the code below, but adapt it as per your need).

Logic:

url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=python&btnG="
driver.get(url)

# get the authors of each result and group them per result
listBooks = []
books = driver.find_elements(By.XPATH, "//div[@class='gs_a']")
for bookNum in range(len(books)):
    auths = []
    # XPath positions are 1-indexed, hence bookNum + 1
    authors = driver.find_elements(
        By.XPATH,
        "(//div[@class='gs_a'])[%s]/a|(//div[@class='gs_a'])[%s]/self::*[not(a)]"
        % (bookNum + 1, bookNum + 1),
    )
    for author in authors:
        auths.append(author.text)
    listBooks.append(auths)

Output:

[['F Pedregosa', 'G Varoquaux', 'A Gramfort'], ['PD Adams', 'PV Afonine'], ['TE Oliphant'], ['JW Peirce'], ['S Anders', 'PT Pyl', 'W Huber'], ['MF Sanner'], ['S Bird', 'E Klein'], ['M Lutz - 2001 - books.google.com'], ['G Rossum - 1995 - dl.acm.org'], ['W McKinney - … of the 9th Python in Science Conference, 2010 - pdfs.semanticscholar.org']]

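The grouping the question asks for can also be reproduced without a browser at all: collect the text of each gs_a block separately, then split that block's text into names. Below is a minimal, stdlib-only sketch of that idea — the sample HTML and the AuthorGrouper class are illustrative constructions adapted from the structure shown in the question, and real gs_a blocks would additionally need the trailing venue/year text stripped, as Solution 2 does:

```python
from html.parser import HTMLParser

# sample HTML adapted from the structure shown in the question
SAMPLE_HTML = """
<div class="gs_a">LR Hisch, <a href="#">AM Gobin</a>, AR Lowery, <a href="#">F Tam</a></div>
<div class="gs_a">CR Woese, <a href="#">JB Reece</a></div>
"""

class AuthorGrouper(HTMLParser):
    """Collects the text content of each <div class="gs_a"> separately."""

    def __init__(self):
        super().__init__()
        self.in_block = False   # currently inside a gs_a div?
        self.depth = 0          # nesting depth inside the current block
        self.blocks = []        # accumulated text, one entry per gs_a div

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "gs_a") in attrs:
            self.in_block = True
            self.depth = 0
            self.blocks.append("")
        elif self.in_block:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.in_block:
            if self.depth == 0 and tag == "div":
                self.in_block = False
            elif self.depth > 0:
                self.depth -= 1

    def handle_data(self, data):
        if self.in_block:
            self.blocks[-1] += data

parser = AuthorGrouper()
parser.feed(SAMPLE_HTML)

# split each block's text on commas -> one sub-list of names per result
grouped = [[name.strip() for name in block.split(",") if name.strip()]
           for block in parser.blocks]
print(grouped)
# [['LR Hisch', 'AM Gobin', 'AR Lowery', 'F Tam'], ['CR Woese', 'JB Reece']]
```

Because the text is gathered per div before splitting, the names come out already grouped by result — the same effect the per-result XPath above achieves inside the browser.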

Solution 2:[2]

To group by result, you can create an empty list, iterate over the results, and append the extracted data to the list as a dict; the returned result can then be serialized to a JSON string using the json.dumps() method, e.g.:

temp_list = []

for result in results:
    # extracting title, link, etc.

    temp_list.append({
         "title": title,
         # other extracted elements
     })

print(json.dumps(temp_list, indent=2))

"""
The returned result is a list of dictionaries:
[
  {
    "title": "A new biology for a new century",
    # other extracted elements..
  }
]

"""
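A runnable toy version of this append-then-serialize pattern — the input records here are placeholders standing in for parsed page results, not real scraped data:

```python
import json

# placeholder records standing in for results parsed from the page
results = [
    {"title": "A new biology for a new century"},
    {"title": "Campbell biology"},
]

temp_list = []
for result in results:
    # in real code, title/link/etc. are extracted from the page here
    temp_list.append({"title": result["title"]})

print(json.dumps(temp_list, indent=2))
```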

Code and full example in the online IDE:

from parsel import Selector
import requests, json, re

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "biology",  # search query
    "hl": "en"       # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

data = []

for result in selector.css(".gs_ri"):

    # xpath("normalize-space()") also picks up bare text nodes, returning the full string
    title = result.css(".gs_rt a").xpath("normalize-space()").get()

    # https://regex101.com/r/7bmx8h/1
    authors = re.search(r"^(.*?)-", result.css(".gs_a").xpath("normalize-space()").get()).group(1).strip()
    snippet = result.css(".gs_rs").xpath("normalize-space()").get()

    # https://regex101.com/r/47erNR/1
    year = re.search(r"\d+", result.css(".gs_a").xpath("normalize-space()").get()).group(0)

    # https://regex101.com/r/13468d/1
    publisher = re.search(r"\d+\s?-\s?(.*)", result.css(".gs_a").xpath("normalize-space()").get()).group(1)
    cited_by = int(re.search(r"\d+", result.css(".gs_or_btn.gs_nph+ a::text").get()).group(0))

    data.append({
        "title": title,
        "snippet": snippet,
        "authors": authors,
        "year": year,
        "publisher": publisher,
        "cited_by": cited_by
        })

print(json.dumps(data, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "A new biology for a new century",
    "snippet": "… A society that permits biology to become an engineering discipline, that allows that science … science of biology that helps us to do this, shows the way. An engineering biology might still …",
    "authors": "CR Woese",
    "year": "2004",
    "publisher": "Am Soc Microbiol",
    "cited_by": 743
  }, ... other results
  {
    "title": "Campbell biology",
    "snippet": "… Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point …",
    "authors": "JB Reece, LA Urry, ML Cain, SA Wasserman…",
    "year": "2014",
    "publisher": "fvsuol4ed.org",
    "cited_by": 1184
  }
]

Note: the example above uses the parsel library, which is very similar to beautifulsoup and selenium in terms of data extraction.
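As a quick sanity check, here is how the three regular expressions in the code above slice a single gs_a summary string (the sample string is taken from the output above):

```python
import re

summary = "CR Woese - Microbiology and molecular biology reviews, 2004 - Am Soc Microbiol"

# everything before the first dash -> authors
authors = re.search(r"^(.*?)-", summary).group(1).strip()
# first run of digits -> year
year = re.search(r"\d+", summary).group(0)
# everything after "<year> -" -> publisher
publisher = re.search(r"\d+\s?-\s?(.*)", summary).group(1)

print(authors, year, publisher, sep=" | ")
# CR Woese | 2004 | Am Soc Microbiol
```

Note that all three patterns lean on the "authors - venue, year - publisher" layout of the gs_a line; entries that deviate from it (e.g. missing year) would need extra handling.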


Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to create the parser from scratch, maintain it, or figure out how to scale it without getting blocked.

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
  "api_key": os.getenv("API_KEY"),  # SerpApi API key
  "engine": "google_scholar",       # parsing engine
  "q": "biology",                   # search query 
  "hl": "en"                        # language
}

search = GoogleSearch(params)       # where data extraction happens
results = search.get_dict()         # JSON -> Python dictionary

for result in results["organic_results"]:
    print(json.dumps(result, indent=2))

Output:

{
  "position": 0,
  "title": "A new biology for a new century",
  "result_id": "KNJ0p4CbwgoJ",
  "link": "https://journals.asm.org/doi/abs/10.1128/MMBR.68.2.173-186.2004",
  "snippet": "\u2026 A society that permits biology to become an engineering discipline, that allows that science \u2026 science of biology that helps us to do this, shows the way. An engineering biology might still \u2026",
  "publication_info": {
    "summary": "CR Woese - Microbiology and molecular biology reviews, 2004 - Am Soc Microbiol"
  },
  "resources": [
    {
      "title": "nih.gov",
      "file_format": "HTML",
      "link": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/"
    },
    {
      "title": "View it @ CTU",
      "link": "https://scholar.google.com/scholar?output=instlink&q=info:KNJ0p4CbwgoJ:scholar.google.com/&hl=en&as_sdt=0,11&scillfp=15047057806408271473&oi=lle"
    }
  ],
  "inline_links": {
    "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=KNJ0p4CbwgoJ",
    "html_version": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/",
    "cited_by": {
      "total": 743,
      "link": "https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=80005&sciodt=0,11&hl=en",
      "cites_id": "775353062728716840",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=775353062728716840&engine=google_scholar&hl=en"
    },
    "related_pages_link": "https://scholar.google.com/scholar?q=related:KNJ0p4CbwgoJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
    "versions": {
      "total": 20,
      "link": "https://scholar.google.com/scholar?cluster=775353062728716840&hl=en&as_sdt=0,11",
      "cluster_id": "775353062728716840",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=775353062728716840&engine=google_scholar&hl=en"
    }
  }
}
{
  "position": 9,
  "title": "Campbell biology",
  "result_id": "YnWp49O_RTMJ",
  "type": "Book",
  "link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf",
  "snippet": "\u2026 Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point \u2026",
  "publication_info": {
    "summary": "JB Reece, LA Urry, ML Cain, SA Wasserman\u2026 - 2014 - fvsuol4ed.org"
  },
  "resources": [
    {
      "title": "fvsuol4ed.org",
      "file_format": "PDF",
      "link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf"
    }
  ],
  "inline_links": {
    "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=YnWp49O_RTMJ",
    "cited_by": {
      "total": 1184,
      "link": "https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=80005&sciodt=0,11&hl=en",
      "cites_id": "3694569986105898338",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=3694569986105898338&engine=google_scholar&hl=en"
    },
    "related_pages_link": "https://scholar.google.com/scholar?q=related:YnWp49O_RTMJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
    "versions": {
      "total": 33,
      "link": "https://scholar.google.com/scholar?cluster=3694569986105898338&hl=en&as_sdt=0,11",
      "cluster_id": "3694569986105898338",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=3694569986105898338&engine=google_scholar&hl=en"
    },
    "cached_page_link": "http://scholar.googleusercontent.com/scholar?q=cache:YnWp49O_RTMJ:scholar.google.com/+biology&hl=en&as_sdt=0,11"
  }
}

If you need to parse data from all Google Scholar organic results, there's a dedicated blog post of mine at SerpApi, "Scrape historic 2017-2021 Organic, Cite Google Scholar results to CSV, SQLite", that shows how to do it with the API.

Disclaimer, I work for SerpApi.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Dmitriy Zub