The list of items I scrape from a webpage differs from the source of the page

I'm trying to scrape a list of zpids from this webpage using the requests module. The zpids are available within a list right next to searchListZpids in the page source (Ctrl + U). There are 40 of them.

The script below fetches the zpids without errors. The problem is that the list it produces differs from the one on that webpage. Some of the zpids in the list I receive exactly match those on the page.

Sometimes the list I get is accurate, but most of the time it is different.

The script that I'm using:

import re
import requests

link = 'https://www.zillow.com/ct/9_p/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

res = requests.get(link,headers=headers)
zpids = re.findall(r"searchListZpids[\s\S]+?\[(.*?)\]",res.text)[0]
print(zpids)

Output I get at this moment:

57912175, 177202011, 57838346, 57702376, 2083150985, 2091636205, 59028017, 2066602375, 57843835, 2066598335, 58845027, 58904562, 58118011, 58838731, 57930222, 2066611590, 59977275, 197747278, 57932219, 57893209, 58775017, 2066600444, 2066601022, 58059157, 177275234, 58819070, 59297439, 58859881, 2078457589, 58775318, 57790587, 57689409, 2066601997, 57394605, 177286302, 58133143, 59068957, 58096934, 240506947, 83121293

How can I scrape the exact list of zpids from that webpage using requests?

EDIT:

To clarify where the list of zpids sits in that site's page source: after navigating to this link, press Ctrl + U to open the page source in a separate tab, then press Ctrl + F and search for searchListZpids. The list of zpids appears right next to searchListZpids in that page. This is the list I want to extract. Sometimes the list the script above produces is identical to the one in the page source, but most of the time it is different.



Solution 1:[1]

How can I scrape the exact list of zpids from that webpage using requests?

Try the following pattern, which captures the list itself, brackets included:

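import ast

zpids = re.findall(r"searchListZpids[\s\S]+?(\[.*?\])", res.text)[0]
# literal_eval only parses Python literals, so it is safer than eval() on scraped text
zpids = ast.literal_eval(zpids)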

print(zpids)

Output:

[57893930, 58081832, 58860541, 58890802, 69057773, 210140012, 58913561, 58820158, 82973247, 197791135, 58838105, 2066643456, 57784503, 210141981, 57312365, 58776838, 58859881, 60103968, 58088978, 333569598, 58952600, 177281260, 2066618828, 2066555488, 57246785, 57336201, 58960631, 58042110, 58028998, 174037330, 60005139, 174072877, 210140037, 210145877, 57248627, 57278888, 57330507, 57958372, 174447170, 58875088]
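
As a rough sketch, the same extraction can be wrapped in a small helper that returns a list of integers and copes with a page where the marker is missing. The helper name extract_zpids and the sample string are hypothetical, and the sample only approximates how the list is embedded in the page source; the regular expression is the one from this answer.

```python
import json
import re
from typing import List

# Same pattern as above: the first bracketed list after the searchListZpids marker.
ZPID_PATTERN = re.compile(r"searchListZpids[\s\S]+?(\[.*?\])")


def extract_zpids(html: str) -> List[int]:
    """Return the zpids found after searchListZpids, or [] if the marker is absent."""
    match = ZPID_PATTERN.search(html)
    if match is None:
        return []
    # The captured text is a JSON-style array of numbers, so json.loads can parse it.
    return [int(zpid) for zpid in json.loads(match.group(1))]


# Hypothetical excerpt standing in for res.text (an assumption, not real page source).
sample = 'foo "searchListZpids","[57893930, 58081832, 58860541]" bar'
print(extract_zpids(sample))
```

With a live response, this would be called as extract_zpids(res.text).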

Solution 2:[2]

I have run your code several times and not found a mismatch once.

t.py file

import re
import requests

link = 'https://www.zillow.com/ct/9_p/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

res = requests.get(link,headers=headers)
zpids = re.findall(r"searchListZpids[\s\S]+?\[(.*?)\]",res.text)[0]
print(zpids)
with open("html.txt","w") as f:
    f.write(res.text)
    f.write("\n")

in terminal

date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2

output

(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:11 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:13 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:15 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:17 EST 2022
(.picamenv) anupamkumar@m1 lib % date && python3 t.py>1 && perl -ne 'print "$1\n" if /searchListZpids\",\"\[((.*?))\]\"/' html.txt>2 && diff 1 2 && rm html.txt 1 2
Thu Jan 27 13:08:19 EST 2022

Solution 3:[3]

Your code is working correctly; you are mistaken in thinking that you are getting an incorrect list of zpids.

This list of zpids is a list of agent listings that are displayed on the current page (in your case 9th page, because you are using the 9_p route in your URL).

In fact, your request matches more than 5000 agent listings, and you are not specifying any order for them, so they can differ from request to request (you should see this in your browser too).

You can try setting a sort order in your request. For example, this URL shows agent listings sorted by price from low to high. That is not a complete solution either, because the full set of listings on the source website can change at any time.
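
One way to check whether two responses really differ, rather than comparing them by eye, is to compare the two lists as sets. The sketch below is only an illustration: the function name compare_zpids is hypothetical, and the two short lists are excerpts of the outputs posted in this thread, not complete results.

```python
from typing import Dict, List


def compare_zpids(first: List[int], second: List[int]) -> Dict[str, List[int]]:
    """Split two zpid lists into shared ids and ids unique to each list."""
    first_set, second_set = set(first), set(second)
    return {
        "common": sorted(first_set & second_set),
        "only_first": sorted(first_set - second_set),
        "only_second": sorted(second_set - first_set),
    }


# Shortened excerpts of two outputs posted in this thread (not the full 40 items each).
question_output = [57912175, 177202011, 58859881]
solution_1_output = [57893930, 58081832, 58859881]

result = compare_zpids(question_output, solution_1_output)
print(result["common"])       # ids present in both responses
print(result["only_first"])   # ids only in the first response
print(result["only_second"])  # ids only in the second response
```

A partial overlap like this is consistent with unsorted results being reshuffled between requests.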

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 anu
Solution 3 Oleksii Tambovtsev