Why do I get different headers with requests vs. urllib2/urllib3/curl/wget?
I've been struggling to figure this out. With wget and curl I get the same headers that I see with urllib2/urllib3, but with requests the 'Content-Length' header is missing on a small number of HEAD requests. It looks like I need to abandon requests altogether. Passing allow_redirects=True and trying other options did not solve the problem. What's going on?
Example:
>>> import requests
>>> import urllib3
>>> requests.__version__
'2.27.1'
>>> urllib3.__version__
'1.26.9'
>>> url='https://files.pushshift.io/reddit/submissions/sha256sums.txt'
>>> http = urllib3.PoolManager()
>>> resp_req = requests.head(url)
>>> resp_urllib = http.request('HEAD', url)
>>> resp_req.headers['content-length']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gfariello/venv/p3.8.5/lib/python3.8/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-length'
>>> resp_urllib.headers['content-length']
'15633'
Note: the headers dictionaries in urllib2, urllib3, and requests (which uses urllib3 on the backend) are case-insensitive. When I print the headers, 'Content-Length' is clearly missing from the requests response but present in the response from the direct urllib3 call. When I inspect the underlying urllib3 headers object used by the requests response (via resp_req.raw.headers in the example above), it's not there either.
>>> resp_req.raw.headers
HTTPHeaderDict({
'Date': 'Fri, 25 Mar 2022 02:39:01 GMT',
'Content-Type': 'text/plain; charset=UTF-8',
'Connection': 'keep-alive',
'last-modified': 'Thu, 05 Aug 2021 23:52:19 GMT',
...})
>>> resp_urllib.headers
HTTPHeaderDict({
'Date': 'Fri, 25 Mar 2022 02:39:18 GMT',
'Content-Type': 'text/plain; charset=UTF-8',
'Content-Length': '15633',
'Connection': 'keep-alive',
'last-modified': 'Thu, 05 Aug 2021 23:52:19 GMT',
...})
The output of curl -v -I https://files.pushshift.io/reddit/submissions/sha256sums.txt also shows content-length (lowercase):
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 200
HTTP/2 200
< date: Fri, 25 Mar 2022 02:30:25 GMT
date: Fri, 25 Mar 2022 02:30:25 GMT
< content-type: text/plain; charset=UTF-8
content-type: text/plain; charset=UTF-8
< content-length: 15633
content-length: 15633
< last-modified: Thu, 05 Aug 2021 23:52:19 GMT
last-modified: Thu, 05 Aug 2021 23:52:19 GMT
< etag: "610c79b3-3d11"
etag: "610c79b3-3d11"
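One difference worth ruling out is the request side rather than the response side. By default requests advertises compression support (an Accept-Encoding header that includes gzip and deflate), whereas a bare urllib3 PoolManager and curl without --compressed do not ask for compression. A server that compresses on the fly may leave Content-Length out of such responses. Whether files.pushshift.io actually does this is an assumption, not something the output above proves. The sketch below uses only the standard library and a hypothetical local ToyHandler that mimics that behaviour, to show how two otherwise identical HEAD requests can come back with different headers:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY_LENGTH = 15633  # the size urllib3 and curl report in the question


class ToyHandler(BaseHTTPRequestHandler):
    """Hypothetical server: omits Content-Length when the client accepts gzip."""

    def do_HEAD(self):
        accepts_gzip = "gzip" in self.headers.get("Accept-Encoding", "")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=UTF-8")
        if not accepts_gzip:
            # Only uncompressed responses have a length known up front.
            self.send_header("Content-Length", str(BODY_LENGTH))
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def head_content_length(port, accept_encoding):
    """Send one HEAD request and return the Content-Length header (or None)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("HEAD", "/", headers={"Accept-Encoding": accept_encoding})
        resp = conn.getresponse()
        resp.read()  # HEAD responses have no body; this just finishes the exchange
        return resp.getheader("Content-Length")
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), ToyHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

with_gzip = head_content_length(port, "gzip, deflate")  # compression advertised
with_identity = head_content_length(port, "identity")   # no compression requested

server.shutdown()
server.server_close()

print(with_gzip, with_identity)
```

To test the same idea against the real URL, compare resp_req.request.headers with the headers the other clients send, and retry with requests.head(url, headers={'Accept-Encoding': 'identity'}). If Content-Length reappears, the difference comes from content negotiation rather than from requests dropping a header.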
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
