Scraping Booking comments with Python
I am trying to get the titles of Booking.com comments from this website:
https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75,
where r_lang=all basically says that the website should show comments in every language.
In order to obtain the titles from this page I do this:
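from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75"
page = urlopen(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(page, "html.parser")
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)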
From the website (see screenshot), the first two titles should be "Sencillamente placentera" and "It could have been great.". However, the URL somehow only loads comments in Spanish:
“Sencillamente placentera”
“La atención de la chica del restaurante”
“El desayuno estilo buffet, completo ”
“Me gusto la ubicación, y la vista.”
“Su ubicación es muy buena.”
I noticed that if I change 'museo.es.' to 'museo.en.' in the URL, I get the titles of the English comments. But this is inconsistent, because when I load the original URL in a browser, I get comments in English, French, Spanish, etc. How can I fix this? Thanks
Solution 1:[1]
Servers can be configured to send different responses based on the browser making the request. Adding a User-Agent header to the request seems to fix the problem:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75'
# Send a browser-like User-Agent instead of urllib's default one
req = urllib.request.Request(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    }
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'), 'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)
Output:
“Sencillamente placentera”
“It could had been great.”
“will never stay their in the future.”
“Hôtel bien situé.”
...
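To see why the header matters, it can help to look at what urllib sends when no User-Agent is set. The sketch below makes no network request: it prints the default User-Agent of urllib's opener and builds a request that carries a browser-like one instead. `build_request`, `default_user_agent` and `BROWSER_UA` are names introduced here for illustration and are not part of the answer above.

```python
import urllib.request

# Browser-like User-Agent string, copied from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
)


def default_user_agent() -> str:
    """Return the User-Agent urllib's default opener sends when none is set."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)["User-agent"]


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Hypothetical helper: build a request carrying an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


url = "https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75"
req = build_request(url)
print(default_user_agent())          # starts with "Python-urllib/"
print(req.get_header("User-agent"))  # urllib stores the header name as "User-agent"
```

A server that inspects this header can tell the default Python client apart from a browser, which would be consistent with the different review lists seen in the question.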
Solution 2:[2]
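You could always use a browser as a plan B. Selenium doesn't have this problem:
from selenium import webdriver
from selenium.webdriver.common.by import By
d = webdriver.Chrome()
d.get('https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75')
# find_elements_by_css_selector was removed in Selenium 4; find_elements(By.CSS_SELECTOR, ...) replaces it
titles = [item.text for item in d.find_elements(By.CSS_SELECTOR, '.review_item_review_header [itemprop=name]')]
print(titles)
d.quit()  # close the browser when done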
Solution 3:[3]
A newer way to access Booking.com reviews is the reviewlist.html endpoint. For example, for the hotel in the original question, the reviews are located at:
This endpoint is particularly great because it supports many filters and offers up to 25 reviews per page.
Here's a snippet in Python with parsel and httpx:
import asyncio
from typing import List
from urllib.parse import urlencode
from parsel import Selector
def parse_reviews(html: str) -> List[dict]:
    """Parse a review page for review data."""
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        # first match for a CSS selector, or "" if nothing matches
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed
async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """Scrape all reviews of a hotel; session is an async HTTP client such as httpx.AsyncClient."""
    async def scrape_page(page, page_size=25):  # 25 is the largest possible page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",  # this varies by hotel country, e.g. in OP's case it would be "co" for Colombia
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                # pages are 1-based, so page 1 starts at offset 0
                "offset": (page - 1) * page_size,
            }
        )
        return await session.get(url)
    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    # fall back to a single page when no pagination links are present
    total_pages = max((int(page) for page in total_pages), default=1)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])
    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results
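How page numbers translate into the `offset` parameter can be checked offline. The following sketch builds the request URL for one page without sending anything. `build_review_url` is a hypothetical helper, the pagename `ibis-bogota-museo` is an assumption taken from the URL in the question, and mapping page 1 to offset 0 is an assumption about how the endpoint counts rows.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_review_url(hotel_id: str, page: int, country: str, page_size: int = 25) -> str:
    """Hypothetical helper: URL for one 1-based page of reviews."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {
        "type": "total",
        "lang": "en-us",
        "sort": "f_recent_desc",
        "cc1": country,          # hotel country code, e.g. "co"
        "dist": 1,
        "pagename": hotel_id,    # assumed to be the hotel slug from the page URL
        "rows": page_size,
        "offset": (page - 1) * page_size,  # assumption: page 1 starts at offset 0
    }
    return "https://www.booking.com/reviewlist.html?" + urlencode(params)


url = build_review_url("ibis-bogota-museo", page=3, country="co")
query = parse_qs(urlsplit(url).query)
print(query["offset"], query["rows"])  # ['50'] ['25']
```

The resulting string is what `session.get` would receive for the third page of 25 reviews.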
I write more about scraping this endpoint on my blog, How to Scrape Booking.com, which has more illustrations and videos if more information is needed.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bitto Bennichan |
| Solution 2 | |
| Solution 3 | Granitosaurus |

