How to scrape links from a dynamically loaded website (Fincaraiz) using Scrapy
I would like help on how to use Scrapy in Python to extract data from the following page
https://fincaraiz.com.co/apartamentos/arriendos?ubicacion=cali
I need to extract the link of each item; for example, hovering over the first item's photo reveals a detail link:
https://fincaraiz.com.co/inmueble/apartamento-en-arriendo/florida-blanca/bogota/6738284

Problem
This page loads its content dynamically, so when I make the request from Scrapy, the response contains HTML, CSS, JavaScript, and little else: the apartment data itself is missing.
I therefore cannot apply XPath, because the response does not contain the data; it is loaded dynamically after the initial page load.
Question
How can I scrape it without using Selenium, Scrapy Splash, or some other external library?
Solution 1:[1]
If you check the browser's network panel while the site loads, you can find the API call it uses to load the listings dynamically. You can then replicate that API call by copying it as a cURL request and converting it to Python:
import requests

headers = {
    'authority': 'api.fincaraiz.com.co',
    'accept': '*/*',
    'origin': 'https://fincaraiz.com.co',
    'referer': 'https://fincaraiz.com.co/',
}

json_data = {
    'filter': {
        'offer': {
            'slug': ['rent'],
        },
        'property_type': {
            'slug': ['apartment'],
        },
        'location_path': 'cali',
    },
    'fields': {
        'exclude': [],
        'facets': [
            'rooms.slug',
            'baths.slug',
            'locations.countries.slug',
            'locations.states.slug',
            'locations.cities.slug',
            'locations.neighbourhoods.slug',
            'locations.groups.slug',
            'locations.groups.subgroups.slug',
            'offer.slug',
            'property_type.slug',
            'categories.slug',
            'stratum.slug',
            'age.slug',
            'media.floor_plans.with_content',
            'media.photos.with_content',
            'media.videos.with_content',
            'products.slug',
            'is_new',
        ],
        'include': [
            'area',
            'baths.id',
            'baths.name',
            'baths.slug',
            'client.client_type',
            'client.company_name',
            'client.first_name',
            'client.fr_client_id',
            'client.last_name',
            'client.logo.full_size',
            'garages.name',
            'is_new',
            'locations.cities.fr_place_id',
            'locations.cities.name',
            'locations.cities.slug',
            'locations.countries.fr_place_id',
            'locations.countries.name',
            'locations.countries.slug',
            'locations.groups.name',
            'locations.groups.slug',
            'locations.groups.subgroups.name',
            'locations.groups.subgroups.slug',
            'locations.neighbourhoods.fr_place_id',
            'locations.neighbourhoods.name',
            'locations.neighbourhoods.slug',
            'locations.states.fr_place_id',
            'locations.states.name',
            'locations.states.slug',
            'locations.location_point',
            'max_area',
            'max_price',
            'media.photos.list.image.full_size',
            'media.photos.list.is_main',
            'media.videos.list.is_main',
            'media.videos.list.video',
            'media.logo.full_size',
            'min_area',
            'min_price',
            'offer.name',
            'price',
            'products.configuration.tag_id',
            'products.configuration.tag_name',
            'products.label',
            'products.name',
            'products.slug',
            'property_id',
            'property_type.name',
            'fr_property_id',
            'fr_parent_property_id',
            'rooms.id',
            'rooms.name',
            'rooms.slug',
            'stratum.name',
            'title',
        ],
        'limit': 25,
        'offset': 0,  # set to 25 for the second page, 50 for the third, etc.
        'ordering': [],
        'platform': 41,
        'with_algorithm': False,
    },
}

response = requests.post(
    'https://api.fincaraiz.com.co/document/api/1.0/listing/search',
    headers=headers,
    json=json_data,
)
data = response.json()
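The listings themselves sit in data['hits']['hits'] (see the item count noted below). As a minimal sketch, assuming the usual Elasticsearch envelope where each hit keeps its document under a _source key (an assumption; inspect a real response to confirm the field layout), the ids behind the detail links can be read like this:

# Sketch: pull listing ids/titles out of the response.
# Assumption: Elasticsearch-style envelope, document under hit['_source'].
for hit in data['hits']['hits']:
    listing = hit.get('_source', hit)
    # 'fr_property_id' and 'title' are among the requested fields; the
    # numeric id at the end of a detail URL (e.g. .../bogota/6738284)
    # appears to be this property id, but verify against a live hit.
    print(listing.get('fr_property_id'), listing.get('title'))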
For the second URL (https://fincaraiz.com.co/apartamentos/arriendos/florida-blanca/zona-occidente/bogota?pagina=1), the request is identical except that the filter selects the neighbourhood by slug and with_algorithm is True:
import requests

headers = {
    'authority': 'api.fincaraiz.com.co',
    'accept': '*/*',
    'origin': 'https://fincaraiz.com.co',
    'referer': 'https://fincaraiz.com.co/',
}

json_data = {
    'filter': {
        'offer': {
            'slug': ['rent'],
        },
        'property_type': {
            'slug': ['apartment'],
        },
        'locations': {
            'neighbourhoods': {
                'slug': ['colombia-cundinamarca-bogot\xE1-3632371-florida-blanca'],
            },
        },
    },
    'fields': {
        'exclude': [],
        'facets': [
            'rooms.slug',
            'baths.slug',
            'locations.countries.slug',
            'locations.states.slug',
            'locations.cities.slug',
            'locations.neighbourhoods.slug',
            'locations.groups.slug',
            'locations.groups.subgroups.slug',
            'offer.slug',
            'property_type.slug',
            'categories.slug',
            'stratum.slug',
            'age.slug',
            'media.floor_plans.with_content',
            'media.photos.with_content',
            'media.videos.with_content',
            'products.slug',
            'is_new',
        ],
        'include': [
            'area',
            'baths.id',
            'baths.name',
            'baths.slug',
            'client.client_type',
            'client.company_name',
            'client.first_name',
            'client.fr_client_id',
            'client.last_name',
            'client.logo.full_size',
            'garages.name',
            'is_new',
            'locations.cities.fr_place_id',
            'locations.cities.name',
            'locations.cities.slug',
            'locations.countries.fr_place_id',
            'locations.countries.name',
            'locations.countries.slug',
            'locations.groups.name',
            'locations.groups.slug',
            'locations.groups.subgroups.name',
            'locations.groups.subgroups.slug',
            'locations.neighbourhoods.fr_place_id',
            'locations.neighbourhoods.name',
            'locations.neighbourhoods.slug',
            'locations.states.fr_place_id',
            'locations.states.name',
            'locations.states.slug',
            'locations.location_point',
            'max_area',
            'max_price',
            'media.photos.list.image.full_size',
            'media.photos.list.is_main',
            'media.videos.list.is_main',
            'media.videos.list.video',
            'media.logo.full_size',
            'min_area',
            'min_price',
            'offer.name',
            'price',
            'products.configuration.tag_id',
            'products.configuration.tag_name',
            'products.label',
            'products.name',
            'products.slug',
            'property_id',
            'property_type.name',
            'fr_property_id',
            'fr_parent_property_id',
            'rooms.id',
            'rooms.name',
            'rooms.slug',
            'stratum.name',
            'title',
        ],
        'limit': 25,
        'offset': 0,
        'ordering': [],
        'platform': 41,
        'with_algorithm': True,
    },
}

response = requests.post(
    'https://api.fincaraiz.com.co/document/api/1.0/listing/search',
    headers=headers,
    json=json_data,
)
data = response.json()
There are 11 items in data['hits']['hits'] for this query.
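Since the offset field pages through the results in steps of limit (as the comment in the first snippet notes), a small loop collects every page. A sketch reusing headers and json_data from above; treating the first empty page as the end of the results is an assumption about the API, not documented behaviour:

# Sketch: page through all results by bumping offset in page-size steps.
all_hits = []
offset = 0
while True:
    json_data['fields']['offset'] = offset
    resp = requests.post(
        'https://api.fincaraiz.com.co/document/api/1.0/listing/search',
        headers=headers,
        json=json_data,
    )
    page = resp.json()['hits']['hits']
    if not page:
        break  # assumed end-of-results signal: an empty page
    all_hits.extend(page)
    offset += json_data['fields']['limit']  # advance by one page (25)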
Solution 2:[2]
The previous answer requires the bothersome effort of manually collecting the JSON payload for every URL you work with. I found an easier approach that lets you feed in multiple start_urls and shape the JSON however you need.
I integrated scrapy-playwright rather than scrapy-splash, because Splash returned a blank screen and never loaded the page, no matter how long I set the timeout.
Playwright takes quite a few seconds to load each page, but Scrapy is asynchronous, so you can still retrieve the JSON for all pages quickly.
Here's the script:
import scrapy
from scrapy import http
from scrapy_playwright.page import PageCoroutine

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:97.0) Gecko/20100101 Firefox/97.0',
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://fincaraiz.com.co/finca-raiz/venta/apartamento-en-arriendo/florida-blanca/bogota?pagina=1',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'TE': 'trailers',
}


class ListingsSpider(scrapy.Spider):
    name = 'listings'
    start_urls = [
        f'https://fincaraiz.com.co/finca-raiz/venta/apartamento-en-arriendo/florida-blanca/bogota?pagina={page}'
        for page in range(2, 41)
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_page_coroutines=[
                        # Screenshot to check that you landed on the page.
                        PageCoroutine('screenshot', path='image.png', full_page=True),
                        PageCoroutine('wait_for_timeout', 30000),
                    ],
                ),
            )

    def parse(self, response):
        """Get the href of each link on the page; the last path
        segment is the unique id used in the JSON request."""
        container = response.xpath("//div[@id='listingContainer']/div")
        listing_id = []
        for links in container:
            data = links.xpath('.//a//@href').get()
            if data:
                listing_id.append(data.split('/')[-1])

        # Send the JSON request to another parser.
        for listings in listing_id:
            yield http.JsonRequest(
                url=f'https://fincaraiz.com.co/_next/data/build/es/proyecto-de-vivienda/mirador-del-puerto/puerto-colombia/puerto-colombia/{listings}.json',
                callback=self.parse_listing,
                headers=headers,
            )

    def parse_listing(self, response):
        """Parse the JSON here however you wish."""
        yield {'json': response.json()}
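For the playwright=True meta key to have any effect, the project also needs the download-handler and reactor wiring from the scrapy-playwright README in settings.py (plus a one-time playwright install to download a browser):

# settings.py -- required by scrapy-playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"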
Output:
{'json': {'pageProps': {'additional': {'benefit': None, 'deliveryTime': None, 'downPayment': None, 'finishes': None, 'schedule': None, 'financing': None, 'units': None, 'tourUrl': None}, 'capacity': {'name': 'Sin especificar', 'id': 0, 'slug': 'UNDEFINED'}, 'dates': {'renovated': '2022-02-19T20:01:47.783000+00:00', 'deleted': '0001-01-01', 'expired': '2022-05-20T21:16:31.453000+00:00', 'moderated': '2021-10-09T23:25:09+00:00', 'created': '2021-10-09T23:24:31.770000+00:00', 'published': '2022-02-19T21:20:44.207000+00:00', 'updated': '2022-02-19T21:20:44.207000+00:00'}, 'description': 'Espectacular casa Esquinera para uso residencial o comercial, 1 piso:Sala,Comedor,Estudio,Cocina,Deposito,Baño y garaje 2 carros, 2Piso:5 habiraciones y 2 baños, 3Piso: Zona lavadero,lavadoras,1 habiracion y cocina con terraza (El apto del 3 piso esta en obra gris), a 2 cuadras de plaza del quirigua y 20 min portal 80, alimentadores y servicio publico a 1 cuadra, en frente posee parqueadero y zona verde.', ...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
