Scraping using Beautiful Soup and Getting Error "Access to this page has been denied."
I am working on an HTML parser and scraping Trulia with Beautiful Soup in Python. I am fairly new to Python and believe my code is correct, but I keep getting "access denied". I assumed this was because I was hitting the website too often, so I added a sleep call, but I still get access denied. I want to use a for loop to scrape multiple pages in one run. Scraping one page at a time works, but whenever I use the for loop to scrape several pages I get access denied.
```
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse
import time

real_estate_new = pd.DataFrame(columns=['Address', 'Beds', 'Baths', 'Price', 'sqft'])
address = []
beds = []
baths = []
prices = []
sqft = []

for i in range(1, 6):
    time.sleep(5)
    website = requests.get('https://www.trulia.com/for_sale/Knoxville,TN/1p_beds/' + str(i) + '_p/')
    #print('https://www.trulia.com/for_sale/Knoxville,TN/1p_beds/' + str(i) + '_p/')
    soup = BeautifulSoup(website.content, 'html.parser')
    result = soup.find_all('li', {'class' : 'Grid__CellBox-sc-144isrp-0 SearchResultsList__WideCell-b7y9ki-2 jiZmPM'})
    result_update = [k for k in result if k.has_attr('data-testid')]
    for result in result_update:
        try:
            address.append(result.find('div', {'data-testid':'property-address'}).get_text())
        except:
            address.append('n/a')
        print(address)
        try:
            beds.append(result.find('div', {'data-testid':'property-beds'}).get_text())
        except:
            beds.append('n/a')
        try:
            baths.append(result.find('div', {'data-testid':'property-baths'}).get_text())
        except:
            baths.append('n/a')
        try:
            prices.append(result.find('div', {'data-testid':'property-price'}).get_text())
        except:
            prices.append('n/a')
        try:
            sqft.append(result.find('div', {'data-testid':'property-price'}).get_text())
        except:
            sqft.append('n/a')

for j in range(len(address)):
    real_estate_new = real_estate_new.append({'Address':address[j], 'Beds':beds[j],
                                              'Baths':baths[j], 'Price':prices[j], 'sqft':sqft[j]}, ignore_index=True)
print(soup.prettify())
```
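One way to see which iteration of the loop was refused is to test each response before parsing it. The sketch below is only an illustration: `looks_blocked` and `BLOCK_MARKER` are hypothetical names, the marker text is taken from the error message in the title, and the assumption that a refused request comes back as HTTP 403 has not been verified against the site.

```python
# Text of the block page, as quoted in the question title.
BLOCK_MARKER = "Access to this page has been denied"


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if a response looks like the block page rather than listings."""
    # Assumption: a refused request has status 403 and/or contains the marker text.
    return status_code == 403 or BLOCK_MARKER.lower() in body.lower()


# Usage with hypothetical response data:
print(looks_blocked(200, "<li data-testid='home-card'>...</li>"))
print(looks_blocked(200, "<title>Access to this page has been denied.</title>"))
```

Inside the loop this could be called as `looks_blocked(website.status_code, website.text)`, stopping as soon as it returns `True` instead of appending `'n/a'` rows for a page that contains no listings.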
Solution 1:[1]
I would suggest using Trulia's GraphQL endpoint instead. First we need a payload for the query. Inside it, we can change the page, the cities, and any other search parameters we need. Here is an example for the first page with a limit of 190 results for the city of Knoxville, TN.
```
import json

payload = json.dumps({
    "operationName": "WEB_searchResultsMapQuery",
    "variables": {
        "isSwipeableFactsEnabled": False,
        "heroImageFallbacks": [
            "STREET_VIEW",
            "SATELLITE_VIEW"
        ],
        "searchDetails": {
            "searchType": "FOR_SALE",
            "location": {
                "cities": [
                    {
                        "city": "Knoxville",
                        "state": "TN"
                    }
                ]
            },
            "filters": {
                "sort": {
                    "type": "DATE",
                    "ascending": False
                },
                "page": 1,
                "limit": 190,
                "isAlternateListingSource": False,
                "bedrooms": {
                    "min": "1",
                    "max": "*"
                },
                "propertyTypes": [],
                "listingTypes": [],
                "pets": [],
                "rentalListingTags": [],
                "foreclosureTypes": [],
                "buildingAmenities": [],
                "unitAmenities": [],
                "landlordPays": [],
                "offset": 40,
                "propertyAmenityTypes": []
            }
        },
        "includeOffMarket": False,
        "includeLocationPolygons": True,
        "isSPA": False,
        "includeNearBy": True
    },
    "query": "query WEB_searchResultsMapQuery($searchDetails: SEARCHDETAILS_Input!, $heroImageFallbacks: [MEDIA_HeroImageFallbackTypes!], $includeOffMarket: Boolean!, $includeLocationPolygons: Boolean!, $isSPA: Boolean!, $includeNearBy: Boolean!, $isSwipeableFactsEnabled: Boolean = false) {\n searchResultMap: searchHomesByDetails(searchDetails: $searchDetails, includeNearBy: $includeNearBy) {\n ...SearchResultsMapClientFragment\n __typename\n }\n offMarketHomes: searchOffMarketHomes(searchDetails: $searchDetails) @include(if: $includeOffMarket) {\n ...HomeMarkerLayersContainerFragment\n ...HoverCardLayerFragment\n __typename\n }\n}\n\nfragment SearchResultsMapClientFragment on SEARCH_Result {\n ...HomeMarkerLayersContainerFragment\n ...HoverCardLayerFragment\n ...SearchLocationBoundaryFragment @include(if: $includeLocationPolygons)\n ...SchoolSearchMarkerLayerFragment\n ...TransitLayerFragment\n __typename\n}\n\nfragment HomeMarkerLayersContainerFragment on SEARCH_Result {\n ...HomeMarkersLayerFragment\n __typename\n}\n\nfragment HomeMarkersLayerFragment on SEARCH_Result {\n homes {\n location {\n coordinates {\n latitude\n longitude\n __typename\n }\n __typename\n }\n url\n metadata {\n compositeId\n __typename\n }\n ...HomeMarkerFragment\n __typename\n }\n nearByHomes {\n ...HomeMarkerFragment\n __typename\n }\n __typename\n}\n\nfragment HomeMarkerFragment on HOME_Details {\n media {\n hasThreeDHome\n __typename\n }\n location {\n coordinates {\n latitude\n longitude\n __typename\n }\n __typename\n }\n displayFlags {\n enableMapPin\n __typename\n }\n price {\n calloutMarkerPrice: formattedPrice(formatType: SHORT_ABBREVIATION)\n __typename\n }\n url\n ... on HOME_Property {\n activeForSaleListing {\n openHouses {\n formattedDay\n __typename\n }\n __typename\n }\n __typename\n }\n ...HomeDetailsTopThirdFragment @include(if: $isSPA)\n __typename\n}\n\nfragment HomeDetailsTopThirdFragment on HOME_Details {\n bathrooms {\n summaryBathrooms: formattedValue(formatType: COMMON_ABBREVIATION)\n __typename\n }\n bedrooms {\n summaryBedrooms: formattedValue(formatType: COMMON_ABBREVIATION)\n __typename\n }\n floorSpace {\n formattedDimension\n __typename\n }\n location {\n city\n coordinates {\n latitude\n longitude\n __typename\n }\n neighborhoodName\n stateCode\n zipCode\n cityStateZipAddress: formattedLocation(formatType: CITY_STATE_ZIP)\n homeFormattedAddress: formattedLocation\n summaryFormattedLocation: formattedLocation(formatType: STREET_COMMUNITY_BUILDER)\n __typename\n }\n media {\n metaTagHeroImages: heroImage(fallbacks: $heroImageFallbacks) {\n url {\n desktop: custom(size: {width: 2048, height: 200})\n __typename\n }\n __typename\n }\n topThirdHeroImages: heroImage(fallbacks: $heroImageFallbacks) {\n __typename\n url {\n extraSmallSrc: custom(size: {width: 375, height: 275})\n smallSrc: custom(size: {width: 570, height: 275})\n mediumSrc: custom(size: {width: 768, height: 500})\n largeSrc: custom(size: {width: 992, height: 500})\n hiDipExtraSmallSrc: custom(size: {width: 1125, height: 825})\n hiDpiSmallSrc: custom(size: {width: 1710, height: 825})\n hiDpiMediumSrc: custom(size: {width: 2048, height: 1536})\n __typename\n }\n webpUrl: url(compression: webp) {\n extraSmallWebpSrc: custom(size: {width: 375, height: 275})\n smallWebpSrc: custom(size: {width: 570, height: 275})\n mediumWebpSrc: custom(size: {width: 768, height: 500})\n largeWebpSrc: custom(size: {width: 992, height: 500})\n hiDipExtraSmallWebpSrc: custom(size: {width: 1125, height: 825})\n hiDpiSmallWebpSrc: custom(size: {width: 1710, height: 825})\n hiDpiMediumWebpSrc: custom(size: {width: 2048, height: 1536})\n __typename\n }\n }\n totalPhotoCount\n __typename\n }\n metadata {\n compositeId\n currentListingId\n __typename\n }\n pageText {\n title\n metaDescription\n __typename\n }\n price {\n formattedPrice\n ... on HOME_LastSoldPrice {\n formattedPriceDifferencePercent\n formattedSoldDate(dateFormat: \"MMM D, YYYY\")\n listingPrice {\n formattedPrice(formatType: SHORT_ABBREVIATION)\n __typename\n }\n priceDifferencePercent\n pricePerDimension {\n formattedDimension\n __typename\n }\n __typename\n }\n ... on HOME_ForeclosureEstimatePrice {\n price\n typeDescription\n __typename\n }\n ... on HOME_PriceRange {\n currencyCode\n max\n min\n __typename\n }\n ... on HOME_SinglePrice {\n currencyCode\n price\n __typename\n }\n __typename\n }\n tracking {\n key\n value\n __typename\n }\n url\n ... on HOME_Property {\n currentStatus {\n isOffMarket\n isRecentlySold\n isForeclosure\n isActiveForRent\n isActiveForSale\n isRecentlyRented\n label\n __typename\n }\n __typename\n }\n ... on HOME_RentalCommunity {\n location {\n rentalCommunityFormattedLocation: formattedLocation(formatType: STREET_COMMUNITY_NAME)\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment HoverCardLayerFragment on SEARCH_Result {\n homes {\n ...HomeHoverCardFragment\n __typename\n }\n nearByHomes {\n ...HomeHoverCardFragment\n __typename\n }\n __typename\n}\n\nfragment HomeHoverCardFragment on HOME_Details {\n ...HomeDetailsCardFragment\n ...HomeDetailsCardHeroFragment\n ...HomeDetailsCardPhotosFragment\n ...HomeDetailsGroupInsightsFragment @include(if: $isSwipeableFactsEnabled)\n location {\n coordinates {\n latitude\n longitude\n __typename\n }\n __typename\n }\n displayFlags {\n enableMapPin\n showMLSLogoOnMapMarkerCard\n __typename\n }\n __typename\n}\n\nfragment HomeDetailsCardFragment on HOME_Details {\n __typename\n location {\n city\n stateCode\n zipCode\n fullLocation: formattedLocation(formatType: STREET_CITY_STATE_ZIP)\n partialLocation: formattedLocation(formatType: STREET_ONLY)\n __typename\n }\n price {\n formattedPrice\n __typename\n }\n url\n tags(include: MINIMAL) {\n level\n formattedName\n icon {\n vectorImage {\n svg\n __typename\n }\n __typename\n }\n __typename\n }\n fullTags: tags {\n level\n formattedName\n icon {\n vectorImage {\n svg\n __typename\n }\n __typename\n }\n __typename\n }\n floorSpace {\n formattedDimension\n __typename\n }\n lotSize {\n ... on HOME_SingleDimension {\n formattedDimension(minDecimalDigits: 2, maxDecimalDigits: 2)\n __typename\n }\n __typename\n }\n bedrooms {\n formattedValue(formatType: TWO_LETTER_ABBREVIATION)\n __typename\n }\n bathrooms {\n formattedValue(formatType: TWO_LETTER_ABBREVIATION)\n __typename\n }\n isSaveable\n preferences {\n isSaved\n __typename\n }\n metadata {\n compositeId\n legacyIdForSave\n __typename\n }\n tracking {\n key\n value\n __typename\n }\n displayFlags {\n showMLSLogoOnListingCard\n addAttributionProminenceOnListCard\n __typename\n }\n ... on HOME_RoomForRent {\n numberOfRoommates\n availableDate: formattedAvailableDate(dateFormat: \"MMM D\")\n providerListingId\n __typename\n }\n ... on HOME_RentalCommunity {\n activeListing {\n provider {\n summary(formatType: SHORT)\n listingSource {\n logoUrl\n __typename\n }\n __typename\n }\n __typename\n }\n location {\n communityLocation: formattedLocation(formatType: STREET_COMMUNITY_NAME)\n __typename\n }\n providerListingId\n __typename\n }\n ... on HOME_Property {\n currentStatus {\n isRecentlySold\n isRecentlyRented\n isActiveForRent\n isActiveForSale\n isOffMarket\n isForeclosure\n __typename\n }\n priceChange {\n priceChangeDirection\n __typename\n }\n activeListing {\n provider {\n summary(formatType: SHORT)\n extraShortSummary: summary(formatType: EXTRA_SHORT)\n listingSource {\n logoUrl\n __typename\n }\n __typename\n }\n dateListed\n __typename\n }\n lastSold {\n provider {\n summary(formatType: SHORT)\n extraShortSummary: summary(formatType: EXTRA_SHORT)\n listingSource {\n logoUrl\n __typename\n }\n __typename\n }\n __typename\n }\n providerListingId\n __typename\n }\n ... on HOME_FloorPlan {\n priceChange {\n priceChangeDirection\n __typename\n }\n provider {\n summary(formatType: SHORT)\n __typename\n }\n __typename\n }\n}\n\nfragment HomeDetailsCardHeroFragment on HOME_Details {\n media {\n heroImage(fallbacks: $heroImageFallbacks) {\n url {\n small\n __typename\n }\n webpUrl: url(compression: webp) {\n small\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment HomeDetailsCardPhotosFragment on HOME_Details {\n media {\n __typename\n heroImage(fallbacks: $heroImageFallbacks) {\n url {\n small\n __typename\n }\n webpUrl: url(compression: webp) {\n small\n __typename\n }\n __typename\n }\n photos {\n url {\n small\n __typename\n }\n webpUrl: url(compression: webp) {\n small\n __typename\n }\n __typename\n }\n }\n __typename\n}\n\nfragment HomeDetailsGroupInsightsFragment on HOME_Details {\n ... on HOME_Property {\n groupedInsights {\n insights {\n ... on HOME_FeatureInsights {\n insightTags {\n formattedName\n __typename\n }\n __typename\n }\n ... on HOME_SmartInsights {\n insightTags {\n formattedName\n __typename\n }\n __typename\n }\n ... on HOME_ContextualPhrases {\n phrases {\n description\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment SearchLocationBoundaryFragment on SEARCH_Result {\n location {\n encodedPolygon\n ... on SEARCH_ResultLocationCity {\n locationId\n __typename\n }\n ... on SEARCH_ResultLocationCounty {\n locationId\n __typename\n }\n ... on SEARCH_ResultLocationNeighborhood {\n locationId\n __typename\n }\n ... on SEARCH_ResultLocationPostalCode {\n locationId\n __typename\n }\n ... on SEARCH_ResultLocationState {\n locationId\n __typename\n }\n __typename\n }\n __typename\n}\n\nfragment SchoolSearchMarkerLayerFragment on SEARCH_Result {\n schools {\n ...SchoolMarkersLayerFragment\n __typename\n }\n __typename\n}\n\nfragment SchoolMarkersLayerFragment on School {\n id\n latitude\n longitude\n categories\n ...SchoolHoverCardFragment\n __typename\n}\n\nfragment SchoolHoverCardFragment on School {\n id\n name\n gradesRange\n providerRating {\n rating\n __typename\n }\n streetAddress\n studentCount\n latitude\n longitude\n __typename\n}\n\nfragment TransitLayerFragment on SEARCH_Result {\n transitStations {\n stationName\n iconUrl\n coordinates {\n latitude\n longitude\n __typename\n }\n radius\n __typename\n }\n __typename\n}\n"
})
```
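Because the city, page, and limit all live under `variables`, they can be swapped without touching the long query string. The following is a minimal sketch of that idea. `with_search_params` and `base_variables` are hypothetical names introduced here, and `base_variables` is a trimmed-down stand-in for the full `variables` dict above. How the server combines `page` with the `offset` field is not something this sketch verifies.

```python
import copy


def with_search_params(variables: dict, city: str, state: str, page: int, limit: int) -> dict:
    """Return a copy of the GraphQL variables with location and paging replaced."""
    updated = copy.deepcopy(variables)  # leave the caller's dict untouched
    details = updated["searchDetails"]
    details["location"]["cities"] = [{"city": city, "state": state}]
    details["filters"]["page"] = page
    details["filters"]["limit"] = limit
    return updated


# Trimmed-down stand-in for the "variables" part of the payload.
base_variables = {
    "searchDetails": {
        "searchType": "FOR_SALE",
        "location": {"cities": [{"city": "Knoxville", "state": "TN"}]},
        "filters": {"page": 1, "limit": 190},
    }
}

page_two = with_search_params(base_variables, "Knoxville", "TN", page=2, limit=190)
print(page_two["searchDetails"]["filters"])  # {'page': 2, 'limit': 190}
```

To fetch several pages, the same helper could be called in a loop over page numbers, with each result passed to `json.dumps` together with the unchanged `operationName` and `query` entries.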
Now we need to set up the headers:
```
headers = {
    'authority': 'www.trulia.com',
    'accept': '*/*',
    'content-type': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
}
```
The POST request returns a large amount of information. I will extract only the fields you had in your example.
```
import pandas as pd
import requests

url = "https://www.trulia.com/graphql"
response = requests.post(url, headers=headers, data=payload)

# Collect one row per home from the JSON response.
results = []
for home in response.json()['data']['searchResultMap']['homes']:
    results.append([home['location']['fullLocation'], home['bedrooms']['formattedValue'],
                    home['bathrooms']['formattedValue'], home['price']['formattedPrice'],
                    home['floorSpace']['formattedDimension']])

real_estate_new = pd.DataFrame(data=results, columns=['Address', 'Beds', 'Baths', 'Price', 'sqft'])
print(real_estate_new)
```
Output:
```
                                          Address  Beds  ...       Price        sqft
0          8121 Corteland Dr, Knoxville, TN 37909   3bd  ...    $450,000  2,340 sqft
1           525 Brunello Way, Knoxville, TN 37919   4bd  ...    $749,900  2,864 sqft
2           529 Brunello Way, Knoxville, TN 37919   4bd  ...    $749,900  2,864 sqft
3     1211 Highland Ave #204, Knoxville, TN 37916   1bd  ...    $150,000    598 sqft
4           6836 Old Kent Dr, Knoxville, TN 37919   4bd  ...  $1,299,900  3,308 sqft
..                                            ...   ...  ...         ...         ...
185           Sevier Meadows, Knoxville, TN 37920   4bd  ...   $330,990+  2,804 sqft
186  1900 Ridgecrest Dr #201, Knoxville, TN 37918   2bd  ...    $299,900  1,747 sqft
187       9919 Dayflower Way, Knoxville, TN 37932   3bd  ...    $400,000  2,688 sqft
188              Coward Mill, Knoxville, TN 37919   4bd  ...   $321,990+  1,764 sqft
189     315 Justice Valley St, Knoxville, TN 37934   4bd  ...    $999,900  3,123 sqft
```
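The extraction loop indexes straight into nested fields, so a listing with a missing or null entry would raise an exception. As a hedged sketch, the lookup can be made tolerant in the same spirit as the `'n/a'` fallbacks in the question. `dig` and `home_to_row` are hypothetical helpers, the sample record is made up, and whether the API actually returns null for fields such as `floorSpace` is an assumption.

```python
def dig(data, *keys, default="n/a"):
    """Follow keys through nested dicts; return default if a level is missing or None."""
    for key in keys:
        if not isinstance(data, dict):
            return default
        data = data.get(key)
    return default if data is None else data


def home_to_row(home: dict) -> list:
    """Build one [Address, Beds, Baths, Price, sqft] row from a home record."""
    return [
        dig(home, "location", "fullLocation"),
        dig(home, "bedrooms", "formattedValue"),
        dig(home, "bathrooms", "formattedValue"),
        dig(home, "price", "formattedPrice"),
        dig(home, "floorSpace", "formattedDimension"),
    ]


# Made-up record with two null fields, for illustration only.
sample = {
    "location": {"fullLocation": "123 Example St, Knoxville, TN 37900"},
    "bedrooms": {"formattedValue": "3bd"},
    "bathrooms": None,
    "price": {"formattedPrice": "$100,000"},
    "floorSpace": None,
}
print(home_to_row(sample))
# ['123 Example St, Knoxville, TN 37900', '3bd', 'n/a', '$100,000', 'n/a']
```

With this in place, the body of the loop would reduce to `results.append(home_to_row(home))`.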
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sergey K |
