Why do I get different headers with requests vs. urllib2/urllib3/curl/wget?

I've been struggling to figure this out. With wget and curl I get the same headers that I see with urllib2/urllib3, but with requests the 'Content-Length' header is missing on a small number of HEAD requests. It looks like I need to abandon requests altogether. Passing allow_redirects=True and trying other options did not solve the problem. What's going on?

Example:

>>> import requests
>>> import urllib3
>>> requests.__version__
'2.27.1'
>>> urllib3.__version__
'1.26.9'
>>> url='https://files.pushshift.io/reddit/submissions/sha256sums.txt'
>>> http = urllib3.PoolManager()
>>> resp_req = requests.head(url)
>>> resp_urllib = http.request('HEAD', url)
>>> resp_req.headers['content-length']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gfariello/venv/p3.8.5/lib/python3.8/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-length'
>>> resp_urllib.headers['content-length']
'15633'

Note: the header dictionaries in urllib2, urllib3, and requests (which uses urllib3 as its backend) are case-insensitive. When I print them, 'Content-Length' is clearly missing from the requests response but present in the response fetched with urllib3 directly. When I inspect the headers of the underlying urllib3 response that requests wraps (via resp_req.raw.headers in the example above), it's not there either.

>>> resp_req.raw.headers
HTTPHeaderDict({
  'Date': 'Fri, 25 Mar 2022 02:39:01 GMT',
  'Content-Type': 'text/plain; charset=UTF-8',
  'Connection': 'keep-alive',
  'last-modified': 'Thu, 05 Aug 2021 23:52:19 GMT',
  ...})
>>> resp_urllib.headers
HTTPHeaderDict({
  'Date': 'Fri, 25 Mar 2022 02:39:18 GMT',
  'Content-Type': 'text/plain; charset=UTF-8',
  'Content-Length': '15633',
  'Connection': 'keep-alive',
  'last-modified': 'Thu, 05 Aug 2021 23:52:19 GMT',
  ...})
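
To list exactly which headers differ between the two responses, rather than comparing the printouts by eye, a small helper along these lines may help. It is only a sketch: `header_diff` is a hypothetical name, not part of requests or urllib3, and it assumes both arguments are mappings that yield header names when iterated, as plain dicts and the header objects shown above do.

```python
def header_diff(reference, candidate):
    """Return header names found in `reference` but not in `candidate`.

    Names are compared case-insensitively, so 'Last-Modified' and
    'last-modified' count as the same header.
    """
    candidate_names = {name.lower() for name in candidate}
    return sorted(name for name in reference if name.lower() not in candidate_names)
```

With the responses from the example, `header_diff(resp_urllib.headers, resp_req.headers)` would be expected to list 'Content-Length' among the missing names.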

The output of curl -v -I https://files.pushshift.io/reddit/submissions/sha256sums.txt also includes content-length (lowercase). Each header appears twice because -v and -I both print it:

* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 200
HTTP/2 200
< date: Fri, 25 Mar 2022 02:30:25 GMT
date: Fri, 25 Mar 2022 02:30:25 GMT
< content-type: text/plain; charset=UTF-8
content-type: text/plain; charset=UTF-8
< content-length: 15633
content-length: 15633
< last-modified: Thu, 05 Aug 2021 23:52:19 GMT
last-modified: Thu, 05 Aug 2021 23:52:19 GMT
< etag: "610c79b3-3d11"
etag: "610c79b3-3d11"
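
One difference that might be worth ruling out is the request headers each client sends. By default, requests advertises `Accept-Encoding: gzip, deflate`, while plain curl and a bare urllib3 PoolManager do not ask for compression. Whether that explains the missing header for this particular server is only a hypothesis. The sketch below uses the standard library to send the same HEAD request with an explicit Accept-Encoding value so the two cases can be compared. The function names are made up for illustration.

```python
import urllib.request

URL = "https://files.pushshift.io/reddit/submissions/sha256sums.txt"


def build_head_request(url, accept_encoding):
    """Build (but do not send) a HEAD request with an explicit Accept-Encoding."""
    return urllib.request.Request(
        url,
        method="HEAD",
        headers={"Accept-Encoding": accept_encoding},
    )


def head_content_length(url, accept_encoding, timeout=10):
    """Send the HEAD request and return the Content-Length header, or None."""
    req = build_head_request(url, accept_encoding)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Header lookup on the response is case-insensitive.
        return resp.headers.get("Content-Length")
```

Comparing `head_content_length(URL, "identity")` with `head_content_length(URL, "gzip, deflate")` would show whether the header disappears only when compression is requested. If it does, passing `headers={'Accept-Encoding': 'identity'}` to `requests.head` is one thing to try.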


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow