[BUG] Currently, when uploading in H5AD format, the original data is left under uploads/files. We can either handle this case differently or just remove the original data.
[FEATURE] We need to determine how best to allow using existing unstructured metadata, layers, or observation/variable-level matrices (e.g. UMAP embeddings); see the inspection sketch below.
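For orientation, a minimal sketch of what "existing" content an uploaded H5AD file can carry, assuming it is read with anndata (the file path is illustrative):

```python
import anndata as ad

# Load the uploaded H5AD file (path is illustrative).
adata = ad.read_h5ad("uploads/files/dataset.h5ad")

# Cell-level metadata columns (these are what the curator view can display).
print(adata.obs.columns.tolist())

# Stored analyses as observation/variable-level matrices, e.g. 'X_pca', 'X_tsne', 'X_umap'.
print(list(adata.obsm.keys()), list(adata.varm.keys()))

# Alternative expression matrices (e.g. raw counts, normalized values).
print(list(adata.layers.keys()))

# Unstructured metadata (colors, parameters of previous analyses, etc.).
print(list(adata.uns.keys()))
```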
Data upload (large data) - [ENHANCEMENT]
For relatively large datasets (e.g. a 10 GB H5AD file), the current upload mechanism is not suitable: it either takes forever or gets interrupted.
Meanwhile, I added a new apache2 config unlimited_uploads.conf with LimitRequestBody 0 and further raised the PHP limits (sketched below), but there may be other timeout settings that can still interrupt PHP execution. We need to think of a longer-term solution. See #14; I think this will work, at least for now. If we keep this solution, we should clean up the PHP upload script and add proper logging.
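For reference, a minimal sketch of the settings involved. The config file name and LimitRequestBody 0 are from this issue; the PHP directive names are standard, but the exact values and file locations are assumptions and should be checked against the actual server setup:

```apache
# /etc/apache2/conf-available/unlimited_uploads.conf (enable with a2enconf)
# 0 disables Apache's request body size limit.
LimitRequestBody 0
```

```ini
; Assumed php.ini overrides; values are illustrative, not what is deployed.
upload_max_filesize = 16G
post_max_size = 16G
max_execution_time = 0   ; no PHP script time limit
max_input_time = -1      ; unlimited input parsing time
memory_limit = 2G
```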
Data upload (general)
[QUESTION] It looks like the original metadata is left under uploads/files. This might raise security issues. Should we remove it?
[DOCUMENTATION] We need to update the documentation, in particular for H5AD (and prioritize this format, at least for scRNA-seq).
[REMARK] File names must match exactly, otherwise the upload fails without any meaningful error message, e.g. when using gene.tab instead of genes.tab. The documentation should either be clear about this, or we should allow some fuzziness in file names during upload, or we should make sure an appropriate error message is displayed.
[REMARK] For failed uploads, some files may remain under /tmp or files/uploads.
Commit 1981d80 addresses H5AD data and metadata upload.
As for using existing unstructured metadata, layers, or observation/variable-level matrices, I need more time to figure out how exactly this is handled. Available display types are taken either from the columns (adata.obs), if primary, and/or from obsm if there are stored analyses. However, if obsm entries such as 'X_pca', 'X_tsne', 'X_umap' are present but not mirrored in the columns, they are shown as display parameters (e.g. X, Y) while the data is actually not accessible, i.e. we get ERROR: Value of 'x' is not the name of a column in 'data_frame'. So observation matrices need to be in the columns to be usable in the curator view, and the remaining unstructured metadata, layers, etc. are unused in primary analyses. A possible workaround is sketched below.
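A minimal sketch of that workaround, assuming the dataset is handled with anndata; the destination column names (e.g. X_umap_1, X_umap_2) are illustrative and would need to match whatever the curator view expects:

```python
import anndata as ad

adata = ad.read_h5ad("uploads/files/dataset.h5ad")  # path is illustrative

# Copy stored embeddings (e.g. 'X_pca', 'X_tsne', 'X_umap') into adata.obs
# so they become regular columns and are accessible in the curator view.
for key in ("X_pca", "X_tsne", "X_umap"):
    if key in adata.obsm:
        coords = adata.obsm[key]
        # Only the first two dimensions are needed for X/Y display parameters.
        for i in range(min(2, coords.shape[1])):
            adata.obs[f"{key}_{i + 1}"] = coords[:, i]

adata.write_h5ad("uploads/files/dataset_with_embeddings.h5ad")
```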