rewritten parallelism to BiocParallel package #104

gorgitko · 2021-06-28T06:03:42Z

Hello,

I have been looking at #83 and decided to rewrite the parallelism to the BiocParallel package. The reasons could be reviewed in the first bullet below. I have successfully tested those modifications on a relatively small dataset (1k PBMC) and also run devtools::check() resulting in zero errors and warnings.

Parallel processing is now controlled by user through a BiocParallel::BiocParallelParam object. This gives more freedom in choosing the parallel backend, as well as brings a better performance in case a computational cluster is started in the beginning and reused multiple times. This is also a case of the sc3() function, which now reuses the same BPPARAM for each of the pipeline's subfunctions. Also, BiocParallel is itself a robust wrapper around the parallel and snow packages, providing a unified interface to backend parallel methods, and comes bundled with powerful logging capabilities (error tracing) and some handy features, such as a native progress bar.
Vignette, DESCRIPTION, and NAMESPACE have been updated accordingly.
Some other files were changed due to formatting. This is caused by the usage of devtools package, which tries to unify the code and documentation.

- Parallel processing is now controlled by user through a BiocParallel::BiocParallelParam object. This gives more freedom in choosing of parallel backend, as well as it brings a better performance in case a computational cluster is started in the beginning and reused multiple times. This is also case of the sc3() function, which now reuses the same BPPARAM for each of the pipeline's subfunctions. Also, BiocParallel is itself a robust wrapper around the parallel and snow packages, providing an unified interface to backend parallel methods, and comes bundled with powerful logging capabilities (error tracing) and some handy features, such as native progress bar. - Vignette, DESCRIPTION, and NAMESPACE have been updated accordingly. - Some other files were changed due to formatting. This is caused by the usage of devtools package, which tries to unify the code and documentation.

wikiselev · 2021-06-30T08:54:44Z

Hi Jiri, many thanks for your request! It all looks awesome and I will try to merge it ASAP, depending on my availability. Thanks again!

wikiselev · 2021-07-16T14:22:40Z

Hi Jiri,

I just fetched your changes to my laptop and tried to run the vignette in a Terminal. When I run the first parallelisation it feels super slow, gets stuck at 67% and does not progress further:

> library(BiocParallel)
> BPPARAM <- SnowParam(workers = 4, type = "SOCK", RNGseed = 1, progressbar = TRUE)
> bpstart(BPPARAM)
> BiocParallel::register(BPPARAM)
> sce <- sc3(sce, ks = 2:4, biology = TRUE)
Setting SC3 parameters...
Calculating distances between the cells...
  |===============================================                       |  67%

It may be a problem with my laptop, but here is my sessionInfo():

> sessionInfo()
R Under development (unstable) (2020-09-24 r79253)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] BiocParallel_1.26.1         scater_1.20.1
 [3] ggplot2_3.3.5               scuttle_1.2.0
 [5] SC3_1.15.1                  SingleCellExperiment_1.14.1
 [7] SummarizedExperiment_1.22.0 Biobase_2.52.0
 [9] GenomicRanges_1.44.0        GenomeInfoDb_1.28.1
[11] IRanges_2.26.0              S4Vectors_0.30.0
[13] BiocGenerics_0.38.0         MatrixGenerics_1.4.0
[15] matrixStats_0.59.0

loaded via a namespace (and not attached):
 [1] viridis_0.6.1             BiocSingular_1.8.1
 [3] viridisLite_0.4.0         DelayedMatrixStats_1.14.0
 [5] shiny_1.6.0               GenomeInfoDbData_1.2.6
 [7] vipor_0.4.5               robustbase_0.93-8
 [9] pillar_1.6.1              lattice_0.20-41
[11] glue_1.4.2                beachmat_2.8.0
[13] digest_0.6.27             RColorBrewer_1.1-2
[15] promises_1.2.0.1          XVector_0.32.0
[17] colorspace_2.0-2          cowplot_1.1.1
[19] htmltools_0.5.1.1         httpuv_1.6.1
[21] Matrix_1.3-4              pcaPP_1.9-74
[23] WriteXLS_6.3.0            pkgconfig_2.0.3
[25] pheatmap_1.0.12           zlibbioc_1.38.0
[27] purrr_0.3.4               xtable_1.8-4
[29] mvtnorm_1.1-2             snow_0.4-3
[31] scales_1.1.1              ScaledMatrix_1.0.0
[33] later_1.2.0               tibble_3.1.2
[35] proxy_0.4-26              farver_2.1.0
[37] generics_0.1.0            ellipsis_0.3.2
[39] withr_2.4.2               ROCR_1.0-11
[41] magrittr_2.0.1            crayon_1.4.1
[43] mime_0.11                 fansi_0.5.0
[45] class_7.3-17              beeswarm_0.4.0
[47] tools_4.1.0               lifecycle_1.0.0
[49] munsell_0.5.0             cluster_2.1.0
[51] DelayedArray_0.18.0       irlba_2.3.3
[53] compiler_4.1.0            e1071_1.7-7
[55] rsvd_1.0.5                rlang_0.4.11
[57] grid_4.1.0                RCurl_1.98-1.3
[59] BiocNeighbors_1.10.0      labeling_0.4.2
[61] bitops_1.0-7              gtable_0.3.0
[63] rrcov_1.5-5               R6_2.5.0
[65] gridExtra_2.3             dplyr_1.0.7
[67] fastmap_1.1.0             utf8_1.2.1
[69] ggbeeswarm_0.6.0          Rcpp_1.0.7
[71] vctrs_0.3.8               DEoptimR_1.0-9
[73] tidyselect_1.1.1          sparseMatrixStats_1.4.0

Do you know what the problem can be?

gorgitko · 2021-07-19T05:19:22Z

Hello Vladimir,

I was trying my changes on an HPC server (~80 cores, 1TB RAM) without any problems, so could the performance of your laptop be the problem? Even on that HPC using 32 workers, the whole calculation takes a considerable amount of time.

The calculation of distances between cells takes a lot of memory and during parallel processing, the amount of used memory is even higher (basically three times more). Could you try first sequential processing with BPPARAM <- SerialParam()? Also, what is the size (dimensionality) of your sce object? I might suspect your laptop to start using swap space when there isn't enough RAM, which obviously leads to a significant calculation speed drop.

Just for curiosity, if you would try the original implementation of parallelism using the same number of workers, it would finish flawlessly?

wikiselev · 2021-07-19T22:20:37Z

Hi Jiri, many thanks for your reply! I am on vacation until Friday, will try your suggestions then. Thanks!

gorgitko · 2021-09-27T10:30:46Z

Hi, any updates on this? Soon I will be publishing a package to the Bioconductor, which doesn't allow remote sources (so I can't put my fork to DESCRIPTION).

wikiselev · 2021-10-14T13:07:22Z

Very sorry! I've changed jobs and have absolutely no time to work on this right now. Probably the best is not to rely on me updating it soon... I hope to get to this at some point... Very sorry again.

gorgitko mentioned this pull request Jun 28, 2021

Consider using Forks for parallelism? #83

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rewritten parallelism to BiocParallel package #104

rewritten parallelism to BiocParallel package #104

gorgitko commented Jun 28, 2021 •

edited

Loading

wikiselev commented Jun 30, 2021

wikiselev commented Jul 16, 2021

gorgitko commented Jul 19, 2021 •

edited

Loading

wikiselev commented Jul 19, 2021

gorgitko commented Sep 27, 2021

wikiselev commented Oct 14, 2021

rewritten parallelism to BiocParallel package #104

Are you sure you want to change the base?

rewritten parallelism to BiocParallel package #104

Conversation

gorgitko commented Jun 28, 2021 • edited Loading

wikiselev commented Jun 30, 2021

wikiselev commented Jul 16, 2021

gorgitko commented Jul 19, 2021 • edited Loading

wikiselev commented Jul 19, 2021

gorgitko commented Sep 27, 2021

wikiselev commented Oct 14, 2021

gorgitko commented Jun 28, 2021 •

edited

Loading

gorgitko commented Jul 19, 2021 •

edited

Loading