Doesn't seem to track chardet #31

Open
skomanduri opened this issue Aug 26, 2019 · 2 comments

@skomanduri

We are getting very different results from rchardet than from the Python chardet package, and the Python version is noticeably more reliable.

Simple example:
File 1

Column 1,Column 2,Column 3,Column 4,Column 5
1,López,López,2,2

rchardet result:

irb(main):001:0> require 'rchardet'
=> true
irb(main):002:0> cd = CharDet.detect(File.new(file1).read)
=> {"encoding"=>"GB18030", "confidence"=>0.99}

Python chardet result:

$ pip install chardet
Successfully installed chardet-3.0.4
$ chardetect file1
file1: utf-8 with confidence 0.7525

The encoding rchardet reports is wrong, even though the reported confidence is 0.99. Decoding the file as GB18030 garbles the accented characters:

$ iconv -f GB18030 -t utf8 file1
Column 1,Column 2,Column 3,Column 4,Column 5
1,L贸pez,L贸pez,2,2
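
The same effect can be reproduced in Ruby without iconv. This is only a minimal sketch using core String methods; the variable names are illustrative and not part of rchardet. It shows that the bytes of "López" are valid UTF-8, and that reading those same bytes as GB18030 yields the garbled text above:

```ruby
# "L\u00F3pez" is "López"; in UTF-8 the "ó" is stored as the two bytes C3 B3.
utf8 = "L\u00F3pez"

# The byte sequence is well-formed UTF-8.
valid_utf8 = utf8.valid_encoding?

# Relabel the same bytes as GB18030, then transcode to UTF-8 for display.
# This mirrors `iconv -f GB18030 -t utf8`: C3 B3 is read as one 2-byte character.
misread = utf8.dup.force_encoding("GB18030").encode("UTF-8")

puts valid_utf8
puts misread
```

Both readings are byte-wise legal, which is why a detector has to rely on statistics to choose between them.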

@grosser (Collaborator)

grosser commented Aug 26, 2019 via email

@skomanduri (Author)

Following up: I also tried another implementation of chardet (uchardet), and it matches the Python version's result:

$ brew install uchardet
$ uchardet file1
UTF-8

I think this goes beyond a documentation issue, but I totally get that you don't have time to fix it. Thanks for the quick response.
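
For anyone hitting the same problem, one possible workaround is to run a strict UTF-8 validity check before consulting the detector. The sketch below is an assumption, not something rchardet provides: `guess_encoding` and its `fallback` parameter are hypothetical names, and the fallback is a stub so the example runs without the gem.

```ruby
# Hypothetical helper: prefer a strict UTF-8 check over a statistical guess.
# `fallback` is any callable that takes the raw bytes and returns an encoding name.
def guess_encoding(bytes, fallback)
  candidate = bytes.dup.force_encoding("UTF-8")
  # Well-formed UTF-8 (including plain ASCII) is accepted without asking the detector.
  return "UTF-8" if candidate.valid_encoding?
  fallback.call(bytes)
end

# Stub detector standing in for a real one; it always answers GB18030.
stub_detector = ->(_raw) { "GB18030" }

sample = "1,L\u00F3pez,L\u00F3pez,2,2".b   # raw bytes of the CSV row above
puts guess_encoding(sample, stub_detector)
puts guess_encoding("\xF3".b, stub_detector) # a lone F3 byte is not valid UTF-8
```

With rchardet installed, the fallback could be `->(raw) { CharDet.detect(raw)["encoding"] }`, based on the hash shape shown in the irb session above. The validity check can in principle accept legacy-encoded text that happens to be well-formed UTF-8, so treat this as a heuristic.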
