Web-scraping using Python

I am trying to extract data from this website. It is almost impossible to scrape because the URL does not change after a search.

I want to search by publisher IPI '00144443097' and extract all the data inside the element with class="items-container".

My code

import urllib.request

from bs4 import BeautifulSoup

quote_page = 'https://portal.themlc.com/search'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('section', attrs={'class': 'items-container'})
name = name_box.text
print(name)

Because the URL doesn't change after a search, this code doesn't return any value.

After extracting the values, I want to sort them in pandas.



Solution 1:[1]

When the URL doesn't change, you can use the browser's developer tools to see whether an API is being called. In this case there are two APIs: one gives basic information about the writer, and the other gives information on the works. You can parse the JSON response however you wish from here.

Note: this is a POST request, not a GET.

import requests

url = 'https://api.ptl.themlc.com/api/search/writer?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
requests.post(url, json=payload).json()

url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
requests.post(url, json=payload).json()

url = 'https://api.ptl.themlc.com/api/search/publisher?page=1&limit=10'
payload = {"publisherIpi":"00144443097"}
requests.post(url, json=payload).json()

# This URL returns the 161 works for the publisherIpId you want. It's convoluted, and you may be able to automate it, but I used the developer tools to find the right publisherIpId.
url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'publisherIpId': "7305902"}
requests.post(url, json=payload).json()
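
The calls above request only page 1 with limit=10, so collecting all 161 works means walking through the pages. The following is a minimal sketch of one way to do that and then sort the result in pandas. It rests on assumptions that are not confirmed by the API: that each response body is itself a list of records (pass a different get_items if the records sit under a key), and that a page shorter than limit marks the end. The helper names fetch_all_pages and to_sorted_frame, and the column name "title", are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional


def fetch_all_pages(
    url: str,
    payload: Dict[str, Any],
    limit: int = 10,
    max_pages: int = 100,
    post: Optional[Callable[..., Any]] = None,
    get_items: Callable[[Any], List[Any]] = lambda body: body,
) -> List[Any]:
    """Collect records from every page of a paginated POST endpoint."""
    if post is None:
        import requests  # imported lazily so a stub can be injected instead

        post = requests.post

    records: List[Any] = []
    for page in range(1, max_pages + 1):
        response = post(f"{url}?page={page}&limit={limit}", json=payload)
        # ASSUMPTION: get_items turns the decoded JSON body into a list of records.
        items = get_items(response.json())
        records.extend(items)
        # ASSUMPTION: a page with fewer than `limit` records is the last one.
        if len(items) < limit:
            break
    return records


def to_sorted_frame(records: List[Dict[str, Any]], sort_by: str):
    """Flatten the records into a DataFrame and sort it by one column."""
    import pandas as pd

    frame = pd.json_normalize(records)
    return frame.sort_values(by=sort_by).reset_index(drop=True)


# Example call (performs real network requests; "title" is a hypothetical column):
# works = fetch_all_pages(
#     "https://api.ptl.themlc.com/api/search/work",
#     {"publisherIpId": "7305902"},
# )
# df = to_sorted_frame(works, sort_by="title")
```

Inspect one raw response first to confirm where the records live and which columns exist before choosing get_items and sort_by.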

Solution 2:[2]

To find the publisherIpId, open some of the works listed under the author while the developer tools are open, and look for the request to the work endpoint.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Jonathan Leon