S3 ContentEncoding is disregarded #743
Comments
Thank you for the report. The following two statements seem inconsistent to me:
Why is it difficult to show the precise source code for 2)?
@mpenkov I don't know whether these instructions are correct. For example, is the Metadata uncompressed_size required (as used in https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L331)?
Create the file and upload it (I'm referring to the bucket as <bucket>):
echo "hello world" | gzip -c > a.txt
aws s3 cp a.txt <bucket>/a.txt --content-encoding gzip
Check that ContentEncoding is set:
In [36]: import boto3
In [37]: client = boto3.client("s3")
In [38]: obj = client.get_object(Bucket="<bucket>", Key="a.txt")
In [39]: obj["ContentEncoding"]
Out[39]: 'gzip'
Reading with smart_open:
In [1]: import smart_open
In [2]: smart_open.open("<bucket>/a.txt", "rb").read()
Out[2]: b'\x1f\x8b\x08\x00\xa2k\x87c\x00\x03\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\xe1\x02\x00-;\x08\xaf\x0c\x00\x00\x00'
In [3]: smart_open.open("<bucket>/a.txt").read()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In [3], line 1
----> 1 smart_open.open("<bucket>/a.txt").read()
File /opt/python/3.9.14/lib/python3.9/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
319 def decode(self, input, final=False):
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
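The failure above can be reproduced without S3 or smart_open. The following is a minimal sketch using only the standard library; the helper names `try_decode_as_text` and `read_gzipped_text` are hypothetical and are not part of any library. It shows that the bytes returned in binary mode are a raw gzip stream (magic bytes 0x1f 0x8b), which is why decoding them directly as UTF-8 fails, and that decompressing first recovers the text.

```python
import gzip
from typing import Optional


def try_decode_as_text(payload: bytes) -> Optional[UnicodeDecodeError]:
    """Decode payload as UTF-8; return the error if decoding fails, else None."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


def read_gzipped_text(payload: bytes) -> str:
    """Decompress a gzip stream first, then decode the result as UTF-8."""
    return gzip.decompress(payload).decode("utf-8")


# Same content as `echo "hello world" | gzip -c` (the header timestamp may differ).
payload = gzip.compress(b"hello world\n")

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
assert payload[:2] == b"\x1f\x8b"

error = try_decode_as_text(payload)  # fails on byte 0x8b at position 1
text = read_gzipped_text(payload)    # the original text
```

Because the key ends in .txt rather than .gz, nothing in the file name hints that decompression is needed; only the ContentEncoding header carries that information.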
Is there a solution to this problem yet? 🤔
I think your best bet is to do a
My problem was a little different. In my scenario, I'm uploading a gzip file to S3, but the
Problem description
I believe this is the same issue as #422, but for S3.
Certain libraries, like django_s3_storage, use ContentEncoding (https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L330) to express on-the-fly compression/decompression. smart_open does not support this, so I have to manually check for the presence of ContentEncoding when reading such files. The S3 documentation specifies:
Is this something that can/will be implemented at some point?
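For illustration, the manual check described above might look like the sketch below. This is not smart_open's behavior or API; `decode_body` and `read_s3_object` are hypothetical helper names, and the sketch assumes ContentEncoding holds a single value and that only gzip needs handling. The boto3 call is the same `get_object` used elsewhere in this thread.

```python
import gzip
from typing import Optional


def decode_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the encoding named by an object's ContentEncoding, if any."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("", "identity"):
        # No encoding declared: return the bytes unchanged.
        return raw
    if encoding == "gzip":
        return gzip.decompress(raw)
    raise ValueError(f"unsupported ContentEncoding: {content_encoding!r}")


def read_s3_object(bucket: str, key: str) -> bytes:
    """Fetch an object and decompress it according to its ContentEncoding."""
    # Imported lazily so decode_body can be used and tested without boto3.
    import boto3

    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    # ContentEncoding is absent from the response when it was not set on upload.
    return decode_body(obj["Body"].read(), obj.get("ContentEncoding"))
```

Calling `read_s3_object("<bucket>", "a.txt")` on an object uploaded with `--content-encoding gzip` would then return the decompressed bytes, regardless of the key's extension. Note that this sketch reads the whole object into memory rather than streaming it.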
Steps/code to reproduce the problem
It's hard to give precise steps, but simply put: a file with a .txt extension whose content has been gzipped and whose ContentEncoding value is "gzip" should be automatically decompressed when read, but it is not.
Versions