Doesn't seem to track chardet #31
Comments
yeah, most likely needs an update/reword :(
... PR welcome, but I don't have time/knowledge for that :D
On Mon, Aug 26, 2019 at 10:37 AM Saranga Komanduri ***@***.***> wrote:
We are getting very different results from rchardet compared to the Python
package, where the Python version is noticeably more reliable.
Simple example:
File 1
Column 1,Column 2,Column 3,Column 4,Column 5
1,López,López,2,2
rchardet result:
irb(main):001:0> require 'rchardet'
=> true
irb(main):002:0> cd = CharDet.detect(File.new(file1).read)
=> {"encoding"=>"GB18030", "confidence"=>0.99}
Python chardet result:
$ pip install chardet
Successfully installed chardet-3.0.4
$ chardetect file1
file1: utf-8 with confidence 0.7525
The encoding that rchardet reports is incorrect, even though its confidence
is extremely high:
$ iconv -f GB18030 -t utf8 file1
Column 1,Column 2,Column 3,Column 4,Column 5
1,L贸pez,L贸pez,2,2
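To illustrate why the output above turns "ó" into "贸", here is a minimal Ruby sketch that uses only the standard `String` encoding methods. It assumes the file was saved as UTF-8, as the Python result suggests; the variable names are made up for illustration.

```ruby
# "ó" is stored as the two bytes C3 B3 when the text is UTF-8.
raw = "L\xC3\xB3pez".b

# Reading the bytes as UTF-8 gives the intended name.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

# Reading the same bytes as GB18030 treats C3 B3 as one Chinese character,
# which is what the iconv command above does.
as_gb18030 = raw.dup.force_encoding(Encoding::GB18030).encode(Encoding::UTF_8)

puts as_utf8                  # López
puts as_gb18030               # L贸pez
puts as_utf8.valid_encoding?  # true
```

The bytes are valid in both encodings, so a detector has to rely on statistics to choose between them, and here the GB18030 guess is the wrong one.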
Just want to follow up here that I tried out another implementation of chardet and it matches the Python version's result:
I think it goes beyond a documentation issue, but I totally get that you don't have time to fix it. Thanks for the quick response.
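For anyone hitting the same thing, one possible workaround is to check whether the bytes are valid UTF-8 before asking a statistical detector at all. The sketch below is only an assumption about what might help, not something rchardet provides: `guess_encoding` and its `fallback:` parameter are hypothetical names, and the fallback is passed in as a lambda so the snippet runs without the gem.

```ruby
# Hypothetical helper: prefer UTF-8 whenever the bytes form valid UTF-8,
# and only consult a detector (passed in as a callable) otherwise.
def guess_encoding(bytes, fallback: nil)
  return "UTF-8" if bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  fallback ? fallback.call(bytes) : nil
end

# The data row from File 1, as UTF-8 bytes.
utf8_row = "1,L\xC3\xB3pez,L\xC3\xB3pez,2,2\n".b
puts guess_encoding(utf8_row)  # UTF-8

# The same name as Latin-1 bytes is not valid UTF-8, so the fallback is used.
# The lambda here is a stand-in that returns a fixed answer.
latin1_row = "1,L\xF3pez\n".b
puts guess_encoding(latin1_row, fallback: ->(_bytes) { "ISO-8859-1" })
```

With rchardet installed, the fallback could be something like `->(b) { CharDet.detect(b)["encoding"] }`. Note that plain ASCII also counts as valid UTF-8, and very short non-UTF-8 inputs can occasionally pass the check by accident.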