Files retrieved with urllib over https are truncated #129264
Is the file position of the response file-like object at 0 or not? Also, does it happen for files that contain non-NUL bytes, or only for files that are all NULs? It might be the OS that is actually truncating the file itself, so you may also want to check that the retrieved buffer has the appropriate size (the result size and the actual size on disk can differ due to an optimized copyfileobj, though I don't know whether that is the case here).
Thanks for the fast response.
It does happen with non-NUL files. The problem manifested itself when I was trying to install my own snap packages using Ansible. First I ruled out Ansible as the problem (with this script) and then I tested a file of same length with NULs to eliminate the content as a factor.
I don't seem to be able to query r.tell() or r.fp.tell() - I get an UnsupportedOperation error - but I can deduce that it starts at zero because the (non-NUL) file always starts correctly and then ends with a varying amount of data omitted. I confirmed this with Beyond Compare in hex comparison mode: the files are byte-for-byte identical until the one retrieved with Python stops early, and there is no other corruption before the truncation. I'm really bewildered. My next step will be to use a proxy so I can read inside the TLS connection, but I'll have to come back to that.
FWIW: the script works for me (macOS, Python 3.12 from the python.org installer), giving the same result as curl -O for the URL.
Check the response header to see what kind of response it is - in particular the Content-Length and Transfer-Encoding fields. In my case, on an older Python, the correct file was apparently received:

```
>>> print(r.headers)
x-amz-id-2: eeLzjyXFqvH1+xLS401aIGUwu1aSVCgsVkDoZH5//QIGDCi54HLUpHR/TTfhWv7izKg0K5XAHDrrKd1zy6qpkSGtB4UQLqzJtTDHFLGHUvA=
x-amz-request-id: 96ZBWZVGSWAQKRGZ
Date: Mon, 27 Jan 2025 02:33:10 GMT
Last-Modified: Fri, 24 Jan 2025 14:31:32 GMT
ETag: "7fd748cb6861a39b13525537ccd3293d-11"
x-amz-server-side-encryption: AES256
x-amz-version-id: yljzBue_PWs52Ozr0ybXOAyS6YEnk419
Accept-Ranges: bytes
Content-Type: binary/octet-stream
Content-Length: 187527168
Server: AmazonS3
Connection: close
```
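For reference, the Content-Length / Transfer-Encoding check can be done directly on raw header bytes with the stdlib email parser; a minimal sketch (the raw bytes below are a hypothetical stand-in, not the actual capture):

```python
import email.parser

# Hypothetical raw response bytes; only the fields relevant here are included.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 187527168\r\n"
       b"Connection: close\r\n"
       b"\r\n"
       b"...body...")

# Split the header block from the start of the body at the blank line.
header_block, body_start = raw.split(b"\r\n\r\n", 1)
status_line, _, field_lines = header_block.decode("iso-8859-1").partition("\r\n")
fields = email.parser.Parser().parsestr(field_lines)

length = int(fields["Content-Length"])      # bytes the server promises to send
assert fields["Transfer-Encoding"] is None  # plain byte stream, not chunked
print(status_line, length)
```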
I would try bypassing the higher-level shutil, urllib and http modules. Assuming the response header is close enough to the above, the function below should do the request using only the socket and ssl modules. You should be able to paste it directly into the interpreter and run it with do_request().

```python
def do_request(host="electricworry-public.s3.eu-west-1.amazonaws.com",
               port=443, target="/test", length=187527168):
    import socket, ssl
    context = ssl.create_default_context()
    with socket.create_connection((host, port)) as tcp_conn, \
            context.wrap_socket(tcp_conn, suppress_ragged_eofs=False,
                                server_hostname=host) as ssl_conn:
        request = f'GET {target} HTTP/1.1\r\n' \
                  f'Host: {host}\r\n' \
                  '\r\n'
        ssl_conn.sendall(request.encode('ascii'))
        with ssl_conn.makefile('rb') as reader:
            header = reader.read(10000)
            [header, body] = header.split(b'\r\n\r\n', 1)
            print(repr(header))
            header = header.lower().decode('ascii')
            assert f'\r\ncontent-length: {length}\r\n' in header
            assert '\ntransfer-encoding:' not in header
            body += reader.read(length - len(body))
            print(len(body))
            assert len(body) == length
```
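As a side note, the final reader.read(length - len(body)) relies on the file wrapper looping internally until it has the requested bytes or hits EOF. An explicit loop that fails loudly on early EOF makes the truncation case unmistakable; read_exact below is a hypothetical helper, demonstrated on an in-memory stream rather than a live socket:

```python
import io

def read_exact(stream, n, chunk_size=65536):
    """Read exactly n bytes from a binary stream; raise EOFError on early end."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(min(remaining, chunk_size))
        if not chunk:  # EOF before n bytes arrived: the truncation case
            raise EOFError(f"stream ended with {remaining} bytes still missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# A complete stream succeeds; a short one raises instead of silently truncating.
data = read_exact(io.BytesIO(b"\x00" * 1000), 1000, chunk_size=64)
```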
Bug report
Bug description:
I am finding that some files downloaded with urllib are always truncated. I have a demonstration file which is 187527168 bytes of NULs.
If I download it with wget, it is always retrieved intact.
If I retrieve it with python3 and urllib instead, I always end up with a slightly truncated file, short by a varying amount.
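A minimal reproduction of the pattern described - urlopen piped through shutil.copyfileobj, which the later comments suggest the script used - with a size check added; the fetch helper and its return value are my own framing, not the original script:

```python
import os
import shutil
import urllib.request

def fetch(url, dest):
    """Download url to dest; return (promised, actual) byte counts."""
    with urllib.request.urlopen(url) as r, open(dest, "wb") as out:
        promised = int(r.headers["Content-Length"])
        shutil.copyfileobj(r, out)
    return promised, os.path.getsize(dest)

# Against a hypothetical URL; a truncated download shows up as actual < promised:
# promised, actual = fetch("https://example.com/test", "test")
```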
I've tried this on several computers.
A wireshark packet capture seems to indicate that the remote side completes and closes the connection (FIN, PSH, ACK), which it should, since urllib sends "Connection: close" in its request headers by default.
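That urllib sends Connection: close by default is easy to confirm against a throwaway local server (plain HTTP here, since the point is only the request header; the handler below is a stand-in written for this sketch):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = {}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record the Connection header that urllib sent us.
        captured["connection"] = self.headers.get("Connection")
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as r:
    body = r.read()
server.shutdown()
print(captured["connection"])
```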
Is this a known problem? The problem doesn't happen when I switch from https to http.
CPython versions tested on:
3.11, 3.12
Operating systems tested on:
Linux