Trying to add code that extracts only lines that contain "word" and writes a new .txt file from requests
This code opens a text file (list.txt) containing websites, extracts the URLs archived for those websites from web.archive.org, and writes them to a new text file (url.txt). I need to keep only the web.archive.org links that contain "word", but I am getting this error:
if 'word' in url: IndentationError: unexpected indent
Can someone explain why and give the right code here?
The code:
urls = []
with open("list.txt", "r") as f_in:
    for line in map(str.strip, f_in):
        if line == "":
            continue
        urls.append(line)

archive_url = "http://web.archive.org/cdx/search/cdx?url=*.{}&output=text&fl=original&collapse=urlkey"

with open("url.txt", "w") as f_out:
    for url in urls:
        r = requests.get(archive_url.format(url))
         if 'word' in url:
        print(r.text, file=f_out)
        print("\n", file=f_out)
Solution 1:[1]
There are two issues:
- You have a leading space before the `if` statement.
- In the line after this statement, you must indent the code.
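The two issues can be reproduced without touching the network. The sketch below is only an illustration: `indentation_error` is a hypothetical helper (not part of the original code) that compiles a snippet without running it and returns the message Python reports, and the three snippets are reduced stand-ins for the loop in the question.

```python
def indentation_error(source):
    """Compile `source` without running it; return the IndentationError message, or None."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as exc:
        return exc.msg
    return None


# Issue 1: the `if` line has one extra leading space (five instead of four).
stray_space = (
    "for url in ['a']:\n"
    "    r = url\n"
    "     if 'word' in url:\n"
    "        print(r)\n"
)

# Issue 2: the line after the `if` is not indented under it.
unindented_body = (
    "for url in ['a']:\n"
    "    if 'word' in url:\n"
    "    print(url)\n"
)

# Both issues fixed: `if` aligned with the loop body, `print` indented under it.
fixed = (
    "for url in ['a']:\n"
    "    if 'word' in url:\n"
    "        print(url)\n"
)

print(indentation_error(stray_space))
print(indentation_error(unindented_body))
print(indentation_error(fixed))
```

The first snippet triggers the same "unexpected indent" message as in the question; the second fails because an `if` needs an indented body.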
This should solve your problem:
import requests

# Read the non-empty lines of list.txt; each one is a site to look up.
urls = []
with open("list.txt", "r") as f_in:
    for line in map(str.strip, f_in):
        if line == "":
            continue
        urls.append(line)

archive_url = "http://web.archive.org/cdx/search/cdx?url=*.{}&output=text&fl=original&collapse=urlkey"

with open("url.txt", "w") as f_out:
    for url in urls:
        r = requests.get(archive_url.format(url))
        # The `if` is aligned with the line above it, and its body is indented one level.
        if 'word' in url:
            print(r.text, file=f_out)
            print("\n", file=f_out)
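Note that `'word' in url` tests the site name read from list.txt, not the links returned by web.archive.org. If the goal is to keep only archived links that contain "word", one option is to filter the response text line by line. This is a minimal sketch, assuming the response contains one original URL per line (which is what `output=text&fl=original` suggests); `filter_lines` and the `sample` string are hypothetical names introduced for illustration, with `sample` standing in for `r.text`.

```python
def filter_lines(text, word):
    """Return the lines of `text` that contain `word`."""
    return [line for line in text.splitlines() if word in line]


# Stand-in for r.text: one archived URL per line (assumed response shape).
sample = (
    "http://example.com/word/page\n"
    "http://example.com/other\n"
    "http://example.com/a?q=word\n"
)

matches = filter_lines(sample, "word")
print("\n".join(matches))
```

In the loop above, this would mean replacing the `if` block with something like `for link in filter_lines(r.text, 'word'): print(link, file=f_out)`, so that each matching link is written on its own line.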
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Desi Pilla |
