Files retrieved with urllib over https are truncated #129264
Is the file position of the response file-like object at 0 or not? Also, does it happen for files that contain non-NUL bytes, or only for files that are all NULs? It might be the OS that is actually truncating the file itself, so you may also want to check that the retrieved buffer has the appropriate size (the result size and the actual size on disk can differ due to an optimized copyfileobj, though I don't know whether that is the case here).
Thanks for the fast response.
It does happen with non-NUL files. The problem manifested itself when I was trying to install my own snap packages using Ansible. First I ruled out Ansible as the problem (with this script) and then I tested a file of same length with NULs to eliminate the content as a factor.
I don't seem to be able to query r.tell() or r.fp.tell() - I get an UnsupportedOperation error - but I can deduce that it starts at zero because the (non-NUL) file always starts correctly and then ends with a varying amount of data omitted. I confirmed this with Beyond Compare in hex comparison mode: the files are byte-for-byte identical until the one retrieved with Python stops early, and there is no other corruption before the truncation. I'm really bewildered. My next step will be to use a proxy so I can read inside the TLS connection, but I'll have to come back to that.
FWIW: the script works for me (macOS, Python 3.12 from the python.org installer), giving the same result as curl -O for the URL.
Check the response header to see what kind of response it is - in particular the Content-Length and Transfer-Encoding fields. In my case, on an older Python, the correct file was apparently received:

```
>>> print(r.headers)
x-amz-id-2: eeLzjyXFqvH1+xLS401aIGUwu1aSVCgsVkDoZH5//QIGDCi54HLUpHR/TTfhWv7izKg0K5XAHDrrKd1zy6qpkSGtB4UQLqzJtTDHFLGHUvA=
x-amz-request-id: 96ZBWZVGSWAQKRGZ
Date: Mon, 27 Jan 2025 02:33:10 GMT
Last-Modified: Fri, 24 Jan 2025 14:31:32 GMT
ETag: "7fd748cb6861a39b13525537ccd3293d-11"
x-amz-server-side-encryption: AES256
x-amz-version-id: yljzBue_PWs52Ozr0ybXOAyS6YEnk419
Accept-Ranges: bytes
Content-Type: binary/octet-stream
Content-Length: 187527168
Server: AmazonS3
Connection: close
```
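For reference, the Content-Length / Transfer-Encoding check can be done directly on raw header bytes with the stdlib email parser; a minimal sketch (the raw bytes below are a hypothetical stand-in, not the actual capture):

```python
import email.parser

# Hypothetical raw response bytes; only the fields relevant here are included.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 187527168\r\n"
       b"Connection: close\r\n"
       b"\r\n"
       b"...body...")

# Split the header block from the start of the body at the blank line.
header_block, body_start = raw.split(b"\r\n\r\n", 1)
status_line, _, field_lines = header_block.decode("iso-8859-1").partition("\r\n")
fields = email.parser.Parser().parsestr(field_lines)

length = int(fields["Content-Length"])      # bytes the server promises to send
assert fields["Transfer-Encoding"] is None  # plain byte stream, not chunked
print(status_line, length)
```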
I would try bypassing the higher-level shutil, urllib and http modules. Assuming the response header is close enough to the above, the function below should do the request using only the socket and ssl modules. You should be able to paste it directly into the interpreter and run it with do_request().

```python
def do_request(host="electricworry-public.s3.eu-west-1.amazonaws.com",
               port=443, target="/test", length=187527168):
    import socket, ssl
    context = ssl.create_default_context()
    with socket.create_connection((host, port)) as tcp_conn, \
            context.wrap_socket(tcp_conn, suppress_ragged_eofs=False,
                                server_hostname=host) as ssl_conn:
        request = f'GET {target} HTTP/1.1\r\n' \
                  f'Host: {host}\r\n' \
                  '\r\n'
        ssl_conn.sendall(request.encode('ascii'))
        with ssl_conn.makefile('rb') as reader:
            header = reader.read(10000)
            [header, body] = header.split(b'\r\n\r\n', 1)
            print(repr(header))
            header = header.lower().decode('ascii')
            assert f'\r\ncontent-length: {length}\r\n' in header
            assert '\ntransfer-encoding:' not in header
            body += reader.read(length - len(body))
            print(len(body))
            assert len(body) == length
```
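As a side note, the final reader.read(length - len(body)) relies on the file wrapper looping internally until it has the requested bytes or hits EOF. An explicit loop that fails loudly on early EOF makes the truncation case unmistakable; read_exact below is a hypothetical helper, demonstrated on an in-memory stream rather than a live socket:

```python
import io

def read_exact(stream, n, chunk_size=65536):
    """Read exactly n bytes from a binary stream; raise EOFError on early end."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(min(remaining, chunk_size))
        if not chunk:  # EOF before n bytes arrived: the truncation case
            raise EOFError(f"stream ended with {remaining} bytes still missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# A complete stream succeeds; a short one raises instead of silently truncating.
data = read_exact(io.BytesIO(b"\x00" * 1000), 1000, chunk_size=64)
```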
Bug report
Bug description:
I am finding that some files downloaded with urllib are always truncated. I have a demonstration file which is 187527168 bytes of NULs.
If I download it with wget, it is always retrieved intact.
If I retrieve it with python3 and urllib instead, I always end up with a slightly truncated file, short by a varying amount.
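A minimal reproduction of the pattern described - urlopen piped through shutil.copyfileobj, which the later comments suggest the script used - with a size check added; the fetch helper and its return value are my own framing, not the original script:

```python
import os
import shutil
import urllib.request

def fetch(url, dest):
    """Download url to dest; return (promised, actual) byte counts."""
    with urllib.request.urlopen(url) as r, open(dest, "wb") as out:
        promised = int(r.headers["Content-Length"])
        shutil.copyfileobj(r, out)
    return promised, os.path.getsize(dest)

# Against a hypothetical URL; a truncated download shows up as actual < promised:
# promised, actual = fetch("https://example.com/test", "test")
```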
I've tried this on several computers.
A wireshark packet capture seems to indicate that the remote side completes and closes the connection (FIN, PSH, ACK), which it should, since urllib sends "Connection: close" in its request headers by default.
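That urllib sends Connection: close by default is easy to confirm against a throwaway local server (plain HTTP here, since the point is only the request header; the handler below is a stand-in written for this sketch):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = {}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record the Connection header that urllib sent us.
        captured["connection"] = self.headers.get("Connection")
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as r:
    body = r.read()
server.shutdown()
print(captured["connection"])
```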
Is this a known problem? The problem doesn't happen when I switch from https to http.
CPython versions tested on:
3.11, 3.12
Operating systems tested on:
Linux