wget --warc-file gets only main page and robots pages?
I am trying to do a little project on a small-ish WARC file. I used this command:
[ ! -f course.warc.gz ] && wget -r -l 3 "https://www.ru.nl/datascience/" --delete-after --no-directories --warc-file="course" || echo Most likely, course.warc.gz already exists
The first time I ran it, everything went fine: I got over 150 pages' worth, amazing. Now I wanted to redo it from scratch, so I deleted the file 'course.warc.gz'. The problem is, when I run the same command now I get only 3 pages: the one I requested, plus two robots.txt fetches. Why is this happening?
Solution 1:[1]
Wget can follow links in HTML, [...] This is sometimes referred to as “recursive downloading.” While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). (wget manual)
The site's robots.txt includes the following rule:
# Block all other spiders
User-agent: *
Disallow: /
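
You can check what the site serves right now (a quick sanity check; /robots.txt is the standard location):

# Print the site's current robots.txt to stdout
wget -qO- https://www.ru.nl/robots.txt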
It is difficult to say what happened during the previous run of wget. Maybe the robots.txt changed in the meantime?
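
If you need the recursive crawl regardless, wget can be told to ignore robots.txt through the wgetrc variable robots, set on the command line with -e. A minimal sketch of your original command with that override; note that it deliberately bypasses the site's stated crawl policy, so use it responsibly:

# -e robots=off disables robots.txt processing (and the nofollow convention) for this run
[ ! -f course.warc.gz ] && wget -e robots=off -r -l 3 "https://www.ru.nl/datascience/" --delete-after --no-directories --warc-file="course" || echo Most likely, course.warc.gz already exists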
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sebastian Nagel |
