Downloaded images have the same file size and are corrupted
All images downloaded by my image scraper have the same file size of 130 KB and are corrupted; none of them can be opened in an image viewer.
I really have no idea what the problem is.
Could anyone please give me some advice on this?
import requests
import parsel
import os
import time
url = 'https://movie-screencaps.com/movie-directory/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
response = requests.get(url, headers=headers)
selector = parsel.Selector(response.text)
movie_list = selector.xpath('//div[@class="tagindex"]/ul/li')
for li in movie_list:
    movie_name = li.xpath('.//a/text()').get().strip()
    movie_url = li.xpath('.//a/@href').get()
    print(movie_name, movie_url)
    # dir = f'download/{movie_name}'
    dir = f'{movie_name}'
    if not os.path.exists(dir):
        os.makedirs(dir)
    page_response = requests.get(movie_url, headers=headers)
    page_selector = parsel.Selector(page_response.text)
    page_text = page_selector.xpath('//div[@class="wp-pagenavi"]/text()').get()
    last_page = int(page_text.split(' ')[-1])
    for page in range(1, last_page + 1):
        page_url = f'{movie_url}/page/{page}'
        print(f'===== Downloading from page {page} =====')
        image_response = requests.get(url=page_url, headers=headers)
        image_selector = parsel.Selector(image_response.text)
        images_url_list = image_selector.xpath('//div[@align="center"]/a/@href').getall()
        for image_url in images_url_list:
            image_data = requests.get(url=page_url, headers=headers).content
            # print(image_data)
            file_name = image_url.split('/')[-1]
            with open(f'{dir}/{file_name}', mode='wb') as f:
                f.write(image_data)
            print(file_name)
            time.sleep(2)
Solution 1:[1]
The problem is a typo: inside the innermost loop you fetch page_url again for every image_url instead of fetching the image itself. Each saved "image" is therefore the HTML of the listing page, which explains why every file has the same size and cannot be opened as an image:
...
for image_url in images_url_list:
    image_data = requests.get(url=page_url, headers=headers).content
    file_name = image_url.split('/')[-1]
...
Should be:
...
for image_url in images_url_list:
    # Typo was here: fetch image_url, not page_url
    image_data = requests.get(url=image_url, headers=headers).content
    file_name = image_url.split('/')[-1]
...
Solution 2:[2]
I tested your code, and you just made a small mistake.
Change:
image_data = requests.get(url=page_url, headers=headers).content
to:
image_data = requests.get(url=image_url, headers=headers).content
Tested, and it works just fine :)
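A cheap way to catch this class of bug early is to sanity-check the downloaded bytes before writing them to disk: real image files start with well-known magic numbers, while the HTML pages the buggy code was saving do not. The helper below is a minimal sketch (the function name `looks_like_image` is my own, not part of the original script):

```python
def looks_like_image(data: bytes) -> bool:
    """Return True if data begins with a known image file signature."""
    signatures = (
        b'\xff\xd8\xff',        # JPEG
        b'\x89PNG\r\n\x1a\n',   # PNG
        b'GIF87a',              # GIF (1987 variant)
        b'GIF89a',              # GIF (1989 variant)
        b'RIFF',                # WebP lives in a RIFF container
    )
    # bytes.startswith accepts a tuple of prefixes
    return data.startswith(signatures)

# An HTML page saved by mistake fails the check:
looks_like_image(b'<!DOCTYPE html><html>...')   # False
# A real JPEG passes:
looks_like_image(b'\xff\xd8\xff\xe0JFIF...')    # True
```

In the scraper you could call this on `image_data` and skip (or log) anything that fails, so a typo like fetching the wrong URL shows up immediately instead of after downloading hundreds of broken files.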
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | rcshon |
| Solution 2 | tomsouris |
