
How to handle large data sets #129

Open
bryanhanson opened this issue May 18, 2020 · 12 comments
Assignees: ximeg
Labels: Topic: datasets 📅 Related to datasets in hyperSpec · Type: proposal 💡 Proposed ideas for all to consider.
Milestone: Version 1.0

Comments

@bryanhanson
Collaborator

This topic is touched on in several threads, so I thought I'd start a dedicated one. A while back I had a large data set to deal with, and ever since I've been watching R-pkg-devel (among other places) for this topic. Going through my saved e-mails, I found this discussion, which seems quite relevant, in particular the last two messages mentioning R.cache and drat. drat has several vignettes. These look like promising ways to package the data and access it as/when needed. The issue of where to put the data remains, of course.

@bryanhanson
Collaborator Author

This discussion might also be useful: https://www.r-bloggers.com/persistent-config-and-data-for-r-packages/

@cbeleites
Owner

This will be relevant for chondro.

@cbeleites
Owner

Record of GSoC weekly video call on 2020-05-18:

@cbeleites added the Type: proposal 💡 and question ❔ labels May 18, 2020
@cbeleites added this to the Version 1.0 milestone May 18, 2020
@ximeg self-assigned this May 19, 2020
@ximeg
Collaborator

ximeg commented May 19, 2020

I will try to move chondro into a separate package with the help of drat.
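
For concreteness, a rough sketch of what the drat route could look like. The package name hyperSpec.chondro and the repository URL below are illustrative assumptions, not decisions:

# Maintainer side: insert the built data-only package into a local
# checkout of the drat repository (served e.g. via GitHub Pages).
# Package file name and repo path are made up for illustration.
drat::insertPackage("hyperSpec.chondro_0.1.0.tar.gz",
                    repodir = "~/git/drat")

# User side: install the data package on demand from that repository.
install.packages("hyperSpec.chondro",
                 repos = "https://r-hyperspec.github.io/drat")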

@GegznaV
Collaborator

GegznaV commented May 19, 2020

I haven't delved deeply enough to understand the essence of the problem, but I will state my point of view and ask for explanations.

While I was translating some of the vignettes, it was not clear to me why datasets like chondro are created again and again every time the package is built. (As a reader of a vignette, I couldn't reproduce them fully, since it was not clear from the vignette where I should find the data, but that is another story.) It is also not clear to me why the original spectroscopic files, and the instructions on how to create the datasets that illustrate the capabilities of hyperSpec, are not in a data-raw/ folder. From my point of view, the whole procedure is too complicated and could be simplified (though most probably I am missing something important here). In my opinion, datasets like chondro should be created only once and converted to a regular package dataset, e.g., by using usethis::use_data(chondro). You can read more on this at https://r-pkgs.org/data.html and in the documentation of:

?usethis::use_data_raw

So, could you summarize why this process of building the example datasets again and again is needed? Is it for unit testing?
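
To make the proposal concrete, a minimal sketch of the standard use_data_raw() workflow; the raw file name and preprocessing step are illustrative placeholders:

# data-raw/chondro.R -- run once by a maintainer, not at build time.
# The raw file name below is a made-up placeholder.
library(usethis)

chondro <- read.csv("data-raw/chondro-spectra.csv")
# ... any preprocessing needed to build the final object ...

# Writes the compressed object to data/chondro.rda, so it ships with
# the package and shows up in data(package = "hyperSpec").
use_data(chondro, overwrite = TRUE)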

@GegznaV
Collaborator

GegznaV commented May 19, 2020

data(package = "hyperSpec")

Returns me this:

[screenshot: list of data sets shipped with hyperSpec]

Some datasets (e.g., chondro) are not present in the list. Why is that?

@bryanhanson
Collaborator Author

@GegznaV you are basically correct: the storage and (re)generation of the data is complicated and a bit opaque. This summer we have a student, @eoduniyi, working on streamlining the whole package, thanks to Google Summer of Code. Data issues are getting a close look, but it will take a while to address the wide range of issues.

@cbeleites
Owner

cbeleites commented May 21, 2020

The reason for "no data-raw/" is basically history:

  • The "externally built" vignettes in hyperSpec were around before .Rbuildignore existed. Back then, the only possibility was to keep the Sweave documents somewhere separate and then copy what should go into the package to the appropriate place in the package directory structure.
    The data-raw/ convention is AFAIK quite recent (advanced R book?).
  • And yes, the re-generation, in particular of fileio, is/was basically a poor man's unit test: the underlying files could not be shipped with hyperSpec, since that would have meant a package size >> 100 MB (there are/were even some code chunks in there that were labeled as unit tests).
  • Early on, the internals of hyperSpec objects changed every once in a while. Regenerating the objects from their raw data ensured that things kept working.
  • chondro is special in that it would be too large to ship with hyperSpec. I therefore decided to ship basically a PCA-compressed version. The parts of that data set are internal (in sysdata.rda), and the chondro object is created on the fly when required (via delayedAssign(); see the sketch below). This apparently has the side effect that it looks to R like a normal variable rather than a data set, including the need to @export it.
    The same will probably be the case with @bryanhanson's synthetic data set.
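
A minimal sketch of that mechanism (not hyperSpec's actual code; the matrix dimensions and number of retained components are illustrative assumptions):

# Build time: compress the large matrix and keep only small pieces.
raw <- matrix(rnorm(875 * 1600), nrow = 875)   # stand-in for the full data
pca <- prcomp(raw, center = TRUE)
k <- 25                                        # retained components (illustrative)
scores   <- pca$x[, 1:k]
loadings <- pca$rotation[, 1:k]
centers  <- pca$center
# scores, loadings, and centers would go into R/sysdata.rda.

# Load time: rebuild the approximate data lazily; the expression runs
# only when `chondro` is first accessed.
delayedAssign("chondro",
              sweep(scores %*% t(loadings), 2, centers, `+`))

Because chondro is then an ordinary (promised) variable rather than a file under data/, it does not appear in data(package = "hyperSpec"), which also explains @GegznaV's observation above.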

@GegznaV added the Topic: datasets 📅 Related to datasets in hyperSpec label May 21, 2020
@bryanhanson
Collaborator Author

Found this package and an interesting discussion of options while looking for something else. We should look it over before going down any path.

@bryanhanson
Collaborator Author

Another post that might suggest some options https://blog.r-hub.io/2020/05/29/distribute-data/

@bryanhanson
Collaborator Author

A recent change in R-devel might be of some use, but it might also mess us up: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17777
