
How to handle large data sets #129

Open
bryanhanson opened this issue May 18, 2020 · 12 comments
Assignees: ximeg
Labels: Topic: datasets 📅 Related to datasets in hyperSpec · Type: proposal 💡 Proposed ideas for all to consider.
Milestone: Version 1.0

Comments

@bryanhanson
Collaborator

This topic is touched on in several threads, so I thought I'd start a dedicated one. A while back I had a large data set to deal with, and ever since I've been watching R-pkg-devel (among other places) for this topic. Going through my saved e-mails, I found this discussion, which seems quite relevant, in particular the last two messages mentioning R.cache and drat. drat has several vignettes. These look like promising ways to package the data and access it as/when needed. The issue of where to put the data remains, of course.

@bryanhanson
Collaborator Author

This discussion might also be useful: https://www.r-bloggers.com/persistent-config-and-data-for-r-packages/

@cbeleites
Owner

This will be relevant for chondro.

@cbeleites
Owner

Record of GSoC weekly video call on 2020-05-18:

@cbeleites added the Type: proposal 💡 and question ❔ labels May 18, 2020
@cbeleites added this to the Version 1.0 milestone May 18, 2020
@ximeg self-assigned this May 19, 2020
@ximeg
Collaborator

ximeg commented May 19, 2020

I will try to move chondro into a separate package with the help of drat.
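
For concreteness, a rough sketch of what the drat route could look like. The package name hyperSpec.chondro and the repository URL below are illustrative assumptions, not decisions:

# Maintainer side: insert the built data-only package into a local
# checkout of the drat repository (served e.g. via GitHub Pages).
# Package file name and repo path are made up for illustration.
drat::insertPackage("hyperSpec.chondro_0.1.0.tar.gz",
                    repodir = "~/git/drat")

# User side: install the data package on demand from that repository.
install.packages("hyperSpec.chondro",
                 repos = "https://r-hyperspec.github.io/drat")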

@GegznaV
Collaborator

GegznaV commented May 19, 2020

I haven't delved deeply enough to understand the essence of the problem, but I will state my point of view and ask for explanations.

While I was translating some of the vignettes, it was not clear to me why datasets like chondro are created again and again every time the package is built. (As a reader of a vignette, I couldn't reproduce them fully, since it was not clear from the vignette where I should find the data, but that is another story.) It is also not clear to me why the original spectroscopic files, and the instructions on how to create the datasets that illustrate the capabilities of hyperSpec, are not in a data-raw/ folder. From my point of view, the whole procedure is too complicated and could be simplified (though most probably I am missing something important here). In my opinion, datasets like chondro should be created only once and converted to a regular package dataset, e.g., by using usethis::use_data(chondro). You can read more on this at https://r-pkgs.org/data.html and in the documentation of:

?usethis::use_data_raw

So, could you summarize why this process of building the example datasets again and again is needed? Is it for unit testing?
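
To make the proposal concrete, a minimal sketch of the standard use_data_raw() workflow; the raw file name and preprocessing step are illustrative placeholders:

# data-raw/chondro.R -- run once by a maintainer, not at build time.
# The raw file name below is a made-up placeholder.
library(usethis)

chondro <- read.csv("data-raw/chondro-spectra.csv")
# ... any preprocessing needed to build the final object ...

# Writes the compressed object to data/chondro.rda, so it ships with
# the package and shows up in data(package = "hyperSpec").
use_data(chondro, overwrite = TRUE)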

@GegznaV
Collaborator

GegznaV commented May 19, 2020

data(package = "hyperSpec")

Returns me this:

[screenshot: list of data sets shipped with hyperSpec]

Some datasets (e.g., chondro) are not present in the list. Why is that?

@bryanhanson
Collaborator Author

@GegznaV you are basically correct: the storage and (re)generation of the data is complicated and a bit opaque. This summer we have a student, @eoduniyi, working on streamlining the whole package, thanks to Google Summer of Code. Data issues are getting a close look, but it will take a while to address the wide range of issues.

@cbeleites
Owner

cbeleites commented May 21, 2020

The reason for "no data-raw/" is basically history:

  • The "externally built" vignettes in hyperSpec were around before .Rbuildignore existed. Back then, the only possibility was to keep the Sweave documents somewhere separate and then copy what should go into the package to the appropriate place in the package directory structure.
    The data-raw/ convention is AFAIK quite recent (advanced R book?).
  • And yes, the re-generation, in particular of fileio, is/was basically a poor man's unit test: the underlying files could not be shipped with hyperSpec, since that would have meant a package size >> 100 MB (there are/were even some code chunks in there that were labeled as unit tests).
  • Early on, the internals of hyperSpec objects changed every once in a while. Regenerating the objects from their raw data ensured that things kept working.
  • chondro is special in that it would be too large to ship with hyperSpec. I therefore decided to ship basically a PCA-compressed version. The parts of that data set are internal (in sysdata.rda), and the chondro object is created on the fly when required (via delayedAssign(); see the sketch below). This apparently has the side effect that it looks to R like a normal variable rather than a data set, including the need to @export it.
    The same will probably be the case with @bryanhanson's synthetic data set.
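
A minimal sketch of that mechanism (not hyperSpec's actual code; the matrix dimensions and number of retained components are illustrative assumptions):

# Build time: compress the large matrix and keep only small pieces.
raw <- matrix(rnorm(875 * 1600), nrow = 875)   # stand-in for the full data
pca <- prcomp(raw, center = TRUE)
k <- 25                                        # retained components (illustrative)
scores   <- pca$x[, 1:k]
loadings <- pca$rotation[, 1:k]
centers  <- pca$center
# scores, loadings, and centers would go into R/sysdata.rda.

# Load time: rebuild the approximate data lazily; the expression runs
# only when `chondro` is first accessed.
delayedAssign("chondro",
              sweep(scores %*% t(loadings), 2, centers, `+`))

Because chondro is then an ordinary (promised) variable rather than a file under data/, it does not appear in data(package = "hyperSpec"), which also explains @GegznaV's observation above.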

@GegznaV added the Topic: datasets 📅 Related to datasets in hyperSpec label May 21, 2020
@bryanhanson
Collaborator Author

Found this package and an interesting discussion of options while looking for something else. We should look it over before going down any path.

@bryanhanson
Collaborator Author

Another post that might suggest some options https://blog.r-hub.io/2020/05/29/distribute-data/

@bryanhanson
Collaborator Author

A recent change in R-devel might be of some use, but it might also mess us up: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17777
