Category "web-scraping"

How to scrape Wikipedia text from <p> without id or class?

I am scraping a Wikipedia text but the <p> does not have any class or id: import requests as r from bs4 import BeautifulSoup as bs url=r.get("https://en.

How to use scrapy to scrape google play reviews of applications?

I wrote this spider to scrape reviews of apps from google play. I am partially successful in this. I am able to extract the name, date, and review only. My ques

How to do Scrapy historical output comparison using Spidermon

So Scrapinghub is releasing a new feature for Scrapy quality assurance. It says it has historical comparison features where it can detect if the current scrape

removing `\n` using bs4 get_text()

from bs4 import BeautifulSoup # current output as below """ 'DOMINGUEZ, JONATHAN D. VS. RAMOS,\n SILVIA M' """ # d
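A minimal sketch of one common way to collapse the embedded newline, assuming the goal is a single-spaced string. The helper name `normalize_whitespace` is hypothetical; the input is the string quoted in the excerpt, standing in for whatever `get_text()` returned.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace, so joining with " " is enough.
    return " ".join(text.split())


# Stand-in for the value returned by tag.get_text() in the question.
raw = "DOMINGUEZ, JONATHAN D. VS. RAMOS,\n SILVIA M"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

Passing `strip=True` to `get_text()` only trims each text node; it does not remove a newline that sits inside a single node, which is why the split-and-join step is used here.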

Trouble modifying the language option in selenium python bindings

I've created a script in python in combination with selenium to scrape different app names from google play store and they all are coming through when I execute

Can't grab coordinates from ArcGIS iframe in a webpage using requests

I've created a script to get coordinates (-119.412 49.023 in this case) from a map located in a webpage using requests module. When I try using my script below

how to use same cookies over multiple requests when using python requests

I am new to Python requests and am using it to scrape a website and get to a certain webpage. First I log in and then I do a few requests for other webpages: im

OSError: [Errno 22] Invalid argument: 'downloaded/misc/jquery.js?v=1.4.4'

tfp = open(filename, 'wb') OSError: [Errno 22] Invalid argument: 'downloaded/misc/jquery.js?v=1.4.4' Can anyone help me with this error? I figure it has somet
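A likely cause, assuming the script runs on Windows, is the `?` in the file name: it comes from the URL's query string and is not a legal file-name character there. The sketch below strips the query string before the path is used; the helper name `local_path_without_query` is hypothetical and not part of the original script.

```python
from urllib.parse import urlsplit


def local_path_without_query(path: str) -> str:
    """Return the path with any '?query' or '#fragment' suffix removed."""
    # urlsplit separates the path from the query and fragment even when
    # the input has no scheme or host.
    return urlsplit(path).path


filename = local_path_without_query("downloaded/misc/jquery.js?v=1.4.4")
print(filename)
```

The cleaned name can then be passed to `open(filename, 'wb')`. If two URLs differ only by query string they will map to the same file, so a real downloader may need to encode the query into the name instead of dropping it.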

Scraping content from urls in dataframe using R

Sorry, I'm relatively new to R and don't know it very well yet. I have also seen that similar questions have been asked more often. However, the corresponding s

Why can't I scrape table data in order?

I'm trying to scrape table data off of this website: https://www.nfl.com/standings/league/2019/REG I have working code (below), however, it seems like the table

Python - BeautifulSoup - How to return two different elements or more, with different attributes?

HTML Example <html> <div book="blue" return="abc"> <h4 class="link">www.example.com</h4> <p class="author">RODRIGO</p> </

Python get string from an html page

I have to create an array which contains all the element within title="", for example: title="xxxxx", title="xxx2", title='xxx4', etc... I need to get xxxx,

How can I download images on a page using puppeteer?

I'm new to web scraping and want to download all images on a webpage using puppeteer: const puppeteer = require('puppeteer'); let scrape = async () => {

Can't manipulate dataframe in pandas

I don't understand why I can't do even the most simple data manipulation with this data I've scraped. I've tried all sorts of methods to manipulate the data but

soup.find() function is not working, how do I find the ID value?

If I have the following HTML that was found with BeautifulSoup, can someone explain why print(soup.find(id="style")) or print(soup.find(id="id")) does not work

How to scrape all data from first page to last page using beautifulsoup

I have been trying to scrape all data from the first page to the last page, but it returns only the first page as the output. How can I solve this? Below is my

Web Scraping price AirBnB data with Python

I have been trying to web scrape an Airbnb website to obtain the price without much luck. I have successfully been able to bring in the other areas of interest

How to get text from a div span in soup?

Hi I am trying to get the text within a span from beautiful soup however it doesn't return the 631. I want to get the 631 from this html. <div class="jsx-302

scraping yell with python requests gives 403 error

I have this code from requests.sessions import Session url = "https://www.yell.com/s/launderettes-birmingham.html" s = Session() headers = { 'user-agent':"

Find the CSRF token from head tag in HTML using BeautifulSoup

HTML looks like this: <head csrf-token="eCUDIDdtOwAHTgR4WE9ZWydwIAYvKQYIFRtXKWw7Nn4=..."> I was trying to extract this way: token = soup.find('input', {'
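Since the excerpt shows the token as an attribute of `<head>` rather than of an `<input>`, one approach is to read the attribute from the `head` tag directly. This is a minimal sketch, assuming BeautifulSoup 4 is installed; the token value `PLACEHOLDER_TOKEN` and the surrounding markup are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page; the real token from the question is not reproduced here.
html = """
<html>
  <head csrf-token="PLACEHOLDER_TOKEN"><title>Example</title></head>
  <body></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up the <head> tag itself, then read its attribute.
head = soup.find("head")
# .get() returns None instead of raising if the tag or attribute is missing.
token = head.get("csrf-token") if head is not None else None
print(token)
```

Searching for `input` only matches `<input>` elements, so it cannot find an attribute that lives on `<head>`.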