Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lr_get_spec with csv mixed with others #53

Open
bittonp opened this issue Jun 9, 2023 · 2 comments
Open

lr_get_spec with csv mixed with others #53

bittonp opened this issue Jun 9, 2023 · 2 comments

Comments

@bittonp
Copy link

bittonp commented Jun 9, 2023

lr_get_spec() has challenges when using multiple extensions.

When ‘csv’ files are included with others e.g. ‘Master.Transmission’ we have a problem because the ‘csv’ requires the sep = ‘,’ argument.

For:

extension <- c('csv', 'Master.Transmission', 'ProcSpec') 

you would need:

lr_get_spec(ext = extension, sep = ',') # .Master.Transmission and ProcSpec not affected by sep argument

But this may affect how other file types are read

@Bisaloo
Copy link
Member

Bisaloo commented Jun 9, 2023

Thanks for the report!

The use case you are describing is already possible in lightr:

library(lightr)

spec <- lr_get_spec(
  system.file("testdata", package = "lightr"), 
  ext = c(
   "txt", "csv"
  )
)
#> 6 files found; importing spectra:
#> Warning in value[[3L]](cond): Parsing failed.
#> Please a different value for "sep" argument
#> Warning: Could not import one or more files:
#> /home/hugo/.local/share/R/x86_64-pc-linux-gnu-library/4.3/lightr/testdata/spec.csv

spec <- lr_get_spec(
  system.file("testdata", package = "lightr"), 
  ext = c(
    "txt", "csv"
  ),
  sep = ","
)
#> 6 files found; importing spectra:
head(spec)
#>    wl CRAIC_export OOusb4000 OceanView      UK5  avasoft8      spec
#> 1 300     5.710900  8.148667 -1.759185 64.27088 252.16336  9.885286
#> 2 301     5.794053  8.090286 -2.337772 52.36948   0.00000  7.078429
#> 3 302     4.326612  8.085000 -2.361941 55.71795  74.36793  1.991190
#> 4 303     4.178850  8.042000 -2.142622 36.30163  34.58589  5.415524
#> 5 304     3.671765  7.959000 -2.577687 43.05407 101.20355 11.820286
#> 6 305     2.784688  8.022333 -2.478087 42.62310   0.00000  1.497571

Created on 2023-06-09 with reprex v2.0.2

In fact, it's even something we're doing in our test suite:

expect_message(
expect_snapshot_value(
lr_get_spec(test.file(),
ext = c("TRM", "ttt", "jdx", "jaz", "JazIrrad", "csv", "txt",
"Transmission", "spc"),
sep = ","),
style = "serialize",
cran = TRUE,
tolerance = 1e-10
),
"16 files"
)

However, as you rightfully guessed, this can cause issues for spectra exported on machines on a non-English locale which may use "," as a decimal separator (examples). This is the reason why we don't put "," in the defaults for sep alongside tabs, spaces and ";".

The worst part about it is that lr_get_spec() will successfully return but the output will be completely wrong:

library(lightr)

# Fails if we don't specify dec = ","
lr_get_spec(
  system.file("testdata", "non_english", package = "lightr"), 
  ext = "txt"
)
#> 2 files found; importing spectra:
#> Warning in storage.mode(rawsplit) <- "numeric": NAs introduced by coercion
#> Warning in value[[3L]](cond): subscript out of bounds
#> Warning in storage.mode(rawsplit) <- "numeric": NAs introduced by coercion
#> Warning in value[[3L]](cond): subscript out of bounds
#> Warning: File import failed.
#> Check input files and function arguments.
#> NULL

# Works BUT RETURNS WRONG RESULT if we passed sep = ","
spec_wrong <- lr_get_spec(
  system.file("testdata", "non_english", package = "lightr"), 
  ext = "txt",
  sep = ","
)
#> 2 files found; importing spectra:
head(spec_wrong)
#>    wl OO_comma OceanView_nonEN
#> 1 300      407             453
#> 2 301      426             434
#> 3 302      520             411
#> 4 303      634             412
#> 5 304      691             410
#> 6 305      726             422

# Works once we say decimal = ","
spec_correct <- lr_get_spec(
  system.file("testdata", "non_english", package = "lightr"), 
  ext = "txt",
  decimal = ","
)
#> 2 files found; importing spectra:
head(spec_correct)
#>    wl OO_comma OceanView_nonEN
#> 1 300 5.500324       0.4324595
#> 2 301 5.398784       0.4268919
#> 3 302 5.428368       0.4334054
#> 4 303 5.589730       0.4080000
#> 5 304 5.593622       0.4225405
#> 6 305 5.678289       0.4128378

Created on 2023-06-09 with reprex v2.0.2

A marginally better & safer solution would be to set sep = "," only for csv files. We would then limit the number of files that could have a wrong result. It is not possible with the current codebase but I am working on a version where default options for parsers (dec, sep and specnum) are not set globally by lr_get_spec() but can be set individually by each parser.

This will also come with a dedicated parser for csv files, which will lead to better performance, and if all goes well, the ability to also read csv files with decimal = "," (to be confirmed).

After this change, you might still not be able to parse files with "." and "," decimal point in a single call but you should at least be able to parse csv files without running the risk of creating of wrong output for files with "," decimal point.

@bittonp
Copy link
Author

bittonp commented Jun 12, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants