Contradictory rules in robots.txt
I'm attempting to scrape a website, and these two rules in its robots.txt seem contradictory:

```
User-agent: *
Disallow: *
Allow: /
```
Does `Allow: /` mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.
Solution 1:[1]
If you are following the original robots.txt standard:
- The `*` in the `Disallow` line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a `/`, so that rule disallows nothing.
- The `Allow` rule isn't in the specification, so that line would be ignored.
- Anything that isn't specifically disallowed is allowed to be crawled.
Verdict: You can crawl the site.
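You can check this reading with Python's standard-library `urllib.robotparser`, which treats `Disallow` values as literal path prefixes (no wildcards) but does honour `Allow` lines. Under that parser both URLs below should come back as fetchable; the user agent string and URLs are placeholders, and this is a sketch rather than a definitive compliance check.

```python
import urllib.robotparser

# The rules from the question, fed to the parser directly
# instead of being fetched over HTTP.
robots_txt = """\
User-agent: *
Disallow: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# 'Disallow: *' is treated as the literal prefix '*', which no URL path
# starts with, so nothing is actually disallowed.
print(parser.can_fetch("MyScraperBot", "https://example.com/"))           # expected: True
print(parser.can_fetch("MyScraperBot", "https://example.com/some/page"))  # expected: True
```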
Google and a few other crawlers support wildcards and allows. If you are following Google's extensions to robots.txt, here is how Google would interpret this robots.txt:
- Both `Allow: /` and `Disallow: *` match any specific path on the site.
- In the case of such a conflict, the more specific (i.e. longer) rule wins. `/` and `*` are each one character, so neither is considered more specific than the other.
- In a case of a tie for specificity, the least restrictive rule wins. `Allow` is considered less restrictive than `Disallow`.
Verdict: You can crawl the site.
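Here is a minimal sketch of that tie-breaking logic, assuming only `*` wildcards and `$` end anchors; the function names and rule representation are illustrative, not Google's actual implementation.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Translate a robots.txt pattern ('*' wildcard, '$' end anchor)
    into a regex and test it against the URL path."""
    regex = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in pattern
    )
    return re.match(regex, path) is not None

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of (directive, pattern) pairs.
    The matching rule with the longest pattern wins; on a length tie,
    the least restrictive directive (allow) wins."""
    matches = [(len(pat), directive) for directive, pat in rules if rule_matches(pat, path)]
    if not matches:
        return True  # nothing matched: crawling is allowed
    longest = max(length for length, _ in matches)
    tied = {directive for length, directive in matches if length == longest}
    return "allow" in tied  # allow wins a specificity tie

rules = [("disallow", "*"), ("allow", "/")]
print(is_allowed("/", rules))           # True: '*' and '/' tie at length 1, allow wins
print(is_allowed("/some/page", rules))  # True for the same reason
```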
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
