Trouble with accent or other special characters - Export encoding format #87
The underlying problem is that one of the goals of nccsv was to be readable by spreadsheet programs like Excel. Traditionally, those programs supported only ASCII in csv files (I don't know whether that is still true). Although it isn't proper, one solution is to use the ISO 8859-1 character set in the nccsv files that you create. That gives you all of the common accented characters in European languages (but not the Euro sign or any other character above U+00FF). ERDDAP already reads .nccsv files as if they use the ISO 8859-1 character set. (There is a programmer's saying: "Be strict in what you write but liberal in what you read.") Doing this may cause trouble down the road (e.g., if someone ever tries to load one of these files into Excel), but you may be willing to accept that risk. Is that a reasonable/acceptable solution for your purposes? |
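The limit of ISO 8859-1 mentioned above can be checked directly. The following is a minimal sketch using only Python's standard codecs; the helper name `fits_latin1` and the sample strings are illustrative assumptions, not part of ERDDAP or the NCCSV specification.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded as ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        # Raised for any character above U+00FF, such as the Euro sign (U+20AC).
        return False


# Illustrative samples: plain ASCII, accented Latin-1, and the Euro sign.
samples = ["Peaches", "Pêches", "bouées", "\u20ac"]
results = {s: fits_latin1(s) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {'fits' if ok else 'does not fit'} in ISO 8859-1")
```

Accented characters such as ê and é encode without error, while the Euro sign does not, which is the trade-off described in the comment.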
Hi Bob,
I don't follow the "non-proper" solution. Is there a way to download .nccsv
files from an ERDDAP dataset with accented characters in the metadata and
not have them end up as ASCII-escaped Unicode (e.g., "P\u00eaches" instead of
"Pêches")?
ASCII-escaped Unicode isn't interpreted by Excel, so it seems to be less
widely supported than UTF-8, which Excel can read if the file is loaded in a
certain way.
Thanks
Nate
|
Sorry, my bad. I skipped over the first line of the original question "While exporting data written with special characters in .nccsv" and answered the question as if pauline-chauvet were making files to be imported into ERDDAP. Instead, you and pauline-chauvet are interested in the exported nccsv files. But the bulk of my answer remains. A goal of nccsv was to define a csv standard that could be read into spreadsheets and traditionally that has meant that they needed to be ASCII files. Yes, you're right, the encoding system is ugly compared to the UTF-8 characters, but it worked. That said, I see that I (or my successor) should revisit this to see if all (or most) modern spreadsheet programs can now import UTF-8 CSV files. If so, then yes, I (or my successor) will change the nccsv specification to use UTF-8 and change ERDDAP to read and write UTF-8 CSV files. Since I no longer work for NOAA and no longer have access to Excel, you (and others) can help. Can you please make a tiny UTF-8 CSV file with ASCII, 8859-1 (e.g., accented), and UTF-8-specific (e.g., the Euro symbol \u20ac) characters, then import the csv file into various common spreadsheet programs (Excel, Google Docs, Open Office?, Libre Office?, others?) to make sure each can read the file correctly? For each program, post a message here with
|
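One possible way to produce the tiny test file requested above is sketched here in Python. The file name `utf8_test.csv` and the row contents are assumptions chosen for illustration; any ASCII, ISO 8859-1, and UTF-8-only samples would serve.

```python
import csv
from pathlib import Path

# Hypothetical file name and sample rows; adjust as needed.
path = Path("utf8_test.csv")
rows = [
    ["kind", "sample"],
    ["ASCII", "Peaches"],
    ["ISO 8859-1", "Pêches"],
    ["UTF-8 only", "Price: 5 \u20ac"],  # \u20ac is the Euro sign
]

# newline="" lets the csv module control line endings itself.
with path.open("w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm that every character survived the round trip.
with path.open("r", encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print("round trip ok:", read_back == rows)
```

The resulting file has no byte order mark, so it exercises the plain UTF-8 case when opened in each spreadsheet program.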
Thanks for taking a look. I tried opening this CSV file in various tools:
|
That is great! Thank you very much for doing that. Can someone else verify that it works in Windows Excel? |
Excellent! Thank you all very much for doing the tests. Okay. I will make the changes to the NCCSV specification and to the way the ERDDAP reads and writes NCCSV files. Hopefully, it will be in the next release. |
This is great! |
I was wondering if the encoding could be changed from UTF-8 to UTF-8 with BOM? |
For the uninitiated, BOM = Byte Order Mark. I didn't know there was a problem reading the UTF-8 nccsv files into Excel. Do other people have this problem? I didn't have Excel when I made the changes above and so relied on the tests from those users. (I still don't have Excel.) I'm disappointed in Excel. It should be able to detect and read UTF-8 without a BOM. See the top answer for So I think adding the BOM is a bad idea, in general. It seems like trouble to add it just to fix the Excel use case (and probably then cause problems elsewhere), but that is a primary use case. Comments? |
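For readers who have not met a BOM before, the sketch below shows what it looks like at the byte level, using Python's standard `utf-8-sig` codec. The sample text is an assumption for illustration; nothing here reflects how ERDDAP itself writes files.

```python
import codecs

text = "Pêches,5 \u20ac\n"  # illustrative sample line

plain = text.encode("utf-8")         # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # same bytes with the BOM prepended

# The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
print(with_bom[:3] == codecs.BOM_UTF8)

# A BOM-aware reader strips the mark...
decoded_sig = with_bom.decode("utf-8-sig")
# ...while a plain UTF-8 reader keeps it as an invisible U+FEFF character.
decoded_plain = with_bom.decode("utf-8")

print(repr(decoded_sig[:2]), repr(decoded_plain[:2]))
```

The leftover U+FEFF in the second case is the kind of side effect that can trouble tools that do not expect a BOM.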
Reading UTF-8 into Excel isn't necessarily a problem. However, as previously mentioned in this conversation, users must use the import wizard to load csv without BOM correctly. |
Although I don't like BOMs and don't like going against the utf-8 file recommendations, your point about spreadsheets being a primary objective is strong. (MS/Excel are non-standard trouble yet again.) Can you (or someone) please check: if a utf-8 nccsv file has a BOM, can it be read correctly in all of the common spreadsheet programs (see above), hopefully automatically? |
I converted my test file above to UTF-8 with BOM: https://gist.github.com/n-a-t-e/005339cdb23905cc4118db57f41cfb72 I tested it in: Google sheets
LibreOffice Community 7.2.5.2 (Mac)
Apple Numbers 10.3.5 (Mac)
Excel 16.77.1 (Mac)
|
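A conversion like the one described above (an existing UTF-8 file rewritten as UTF-8 with BOM) could be done as follows. This is a sketch under assumptions: the function name `add_utf8_bom` and the temporary file names are hypothetical, and it is not the method used to produce the linked gist.

```python
import codecs
import tempfile
from pathlib import Path


def add_utf8_bom(src: Path, dst: Path) -> None:
    """Copy src to dst as UTF-8 with exactly one leading BOM."""
    # Decoding with "utf-8-sig" drops a BOM if one is already present,
    # so running this on an already converted file does not stack BOMs.
    text = src.read_bytes().decode("utf-8-sig")
    dst.write_bytes(text.encode("utf-8-sig"))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "no_bom.csv"      # hypothetical input file
    dst = Path(tmp) / "with_bom.csv"    # hypothetical output file
    src.write_bytes("name,price\nPêches,5 \u20ac\n".encode("utf-8"))

    add_utf8_bom(src, dst)
    add_utf8_bom(dst, dst)  # a second pass must leave the file unchanged

    original = src.read_bytes()
    converted = dst.read_bytes()

print("starts with BOM:", converted.startswith(codecs.BOM_UTF8))
```

Working on bytes rather than text-mode file handles avoids any platform-specific newline translation during the copy.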
I think I should clarify my previous long message: the BOM file works great in Excel on Mac and Windows (tested both). There is a tiny quirk, at least in the Mac version -
|
Thanks for clarifying. So BOM sounds like a good idea. @ChrisPJohn, it is up to you, but it sounds to me like a good idea to:
Thank you @kutoso, @robitaillej, and @n-a-t-e. This is a good improvement. |
That sounds like a solid plan. I'll add this to the todo list. |
When exporting data that contains special characters as .nccsv, the encoding makes the text unpleasant for the user to read.
For example:
-Original
"Observation temps réel des bouées Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"
-Becomes
"Observation temps r\u00e9el des bou\u00e9es Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"
Right now, the encoding is ASCII: ".nccsv - Download a NetCDF-3-like 7-bit ASCII NCCSV .csv file with COARDS/CF/ACDD metadata."
Would it be possible to have it in UTF-8 in the future?
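Until the export format changes, the `\uXXXX` escapes shown in the example can be turned back into readable characters on the client side. The sketch below is a minimal illustration, not ERDDAP's decoder: the helper name `unescape_u` is hypothetical, it handles only four-digit `\uXXXX` sequences, it leaves other escapes such as `\n` and `\t` untouched, and it does not account for escaped backslashes.

```python
import re

# Matches a backslash, the letter u, and exactly four hexadecimal digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text: str) -> str:
    """Replace each \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw string, so the backslashes are literal, as they are in the exported file.
escaped = r"Observation temps r\u00e9el des bou\u00e9es Viking du PMZA (provisoire)"
decoded = unescape_u(escaped)
print(decoded)
```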