Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewritten parallelism to BiocParallel package #104

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gorgitko
Copy link

@gorgitko gorgitko commented Jun 28, 2021

Hello,

I have been looking at #83 and decided to rewrite the parallelism to the BiocParallel package. The reasons could be reviewed in the first bullet below. I have successfully tested those modifications on a relatively small dataset (1k PBMC) and also run devtools::check() resulting in zero errors and warnings.

  • Parallel processing is now controlled by user through a BiocParallel::BiocParallelParam object. This gives more freedom in choosing the parallel backend, as well as brings a better performance in case a computational cluster is started in the beginning and reused multiple times. This is also a case of the sc3() function, which now reuses the same BPPARAM for each of the pipeline's subfunctions. Also, BiocParallel is itself a robust wrapper around the parallel and snow packages, providing a unified interface to backend parallel methods, and comes bundled with powerful logging capabilities (error tracing) and some handy features, such as a native progress bar.
  • Vignette, DESCRIPTION, and NAMESPACE have been updated accordingly.
  • Some other files were changed due to formatting. This is caused by the usage of devtools package, which tries to unify the code and documentation.

- Parallel processing is now controlled by user through a BiocParallel::BiocParallelParam object.
  This gives more freedom in choosing of parallel backend, as well as it brings a better performance
  in case a computational cluster is started in the beginning and reused multiple times.
  This is also case of the sc3() function, which now reuses the same BPPARAM for each of the pipeline's subfunctions.
  Also, BiocParallel is itself a robust wrapper around the parallel and snow packages, providing an unified interface to
  backend parallel methods, and comes bundled with powerful logging capabilities (error tracing) and some
  handy features, such as native progress bar.
- Vignette, DESCRIPTION, and NAMESPACE have been updated accordingly.
- Some other files were changed due to formatting. This is caused by the usage of devtools package,
  which tries to unify the code and documentation.
@wikiselev
Copy link
Member

Hi Jiri, many thanks for your request! It all looks awesome and I will try to merge it ASAP, depending on my availability. Thanks again!

@wikiselev
Copy link
Member

Hi Jiri,

I just fetched your changes to my laptop and tried to run the vignette in a Terminal. When I run the first parallelisation it feels super slow, gets stuck at 67% and does not progress further:

> library(BiocParallel)
> BPPARAM <- SnowParam(workers = 4, type = "SOCK", RNGseed = 1, progressbar = TRUE)
> bpstart(BPPARAM)
> BiocParallel::register(BPPARAM)
> sce <- sc3(sce, ks = 2:4, biology = TRUE)
Setting SC3 parameters...
Calculating distances between the cells...
  |===============================================                       |  67%

It may be a problem with my laptop, but here is my sessionInfo():

> sessionInfo()
R Under development (unstable) (2020-09-24 r79253)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] BiocParallel_1.26.1         scater_1.20.1
 [3] ggplot2_3.3.5               scuttle_1.2.0
 [5] SC3_1.15.1                  SingleCellExperiment_1.14.1
 [7] SummarizedExperiment_1.22.0 Biobase_2.52.0
 [9] GenomicRanges_1.44.0        GenomeInfoDb_1.28.1
[11] IRanges_2.26.0              S4Vectors_0.30.0
[13] BiocGenerics_0.38.0         MatrixGenerics_1.4.0
[15] matrixStats_0.59.0

loaded via a namespace (and not attached):
 [1] viridis_0.6.1             BiocSingular_1.8.1
 [3] viridisLite_0.4.0         DelayedMatrixStats_1.14.0
 [5] shiny_1.6.0               GenomeInfoDbData_1.2.6
 [7] vipor_0.4.5               robustbase_0.93-8
 [9] pillar_1.6.1              lattice_0.20-41
[11] glue_1.4.2                beachmat_2.8.0
[13] digest_0.6.27             RColorBrewer_1.1-2
[15] promises_1.2.0.1          XVector_0.32.0
[17] colorspace_2.0-2          cowplot_1.1.1
[19] htmltools_0.5.1.1         httpuv_1.6.1
[21] Matrix_1.3-4              pcaPP_1.9-74
[23] WriteXLS_6.3.0            pkgconfig_2.0.3
[25] pheatmap_1.0.12           zlibbioc_1.38.0
[27] purrr_0.3.4               xtable_1.8-4
[29] mvtnorm_1.1-2             snow_0.4-3
[31] scales_1.1.1              ScaledMatrix_1.0.0
[33] later_1.2.0               tibble_3.1.2
[35] proxy_0.4-26              farver_2.1.0
[37] generics_0.1.0            ellipsis_0.3.2
[39] withr_2.4.2               ROCR_1.0-11
[41] magrittr_2.0.1            crayon_1.4.1
[43] mime_0.11                 fansi_0.5.0
[45] class_7.3-17              beeswarm_0.4.0
[47] tools_4.1.0               lifecycle_1.0.0
[49] munsell_0.5.0             cluster_2.1.0
[51] DelayedArray_0.18.0       irlba_2.3.3
[53] compiler_4.1.0            e1071_1.7-7
[55] rsvd_1.0.5                rlang_0.4.11
[57] grid_4.1.0                RCurl_1.98-1.3
[59] BiocNeighbors_1.10.0      labeling_0.4.2
[61] bitops_1.0-7              gtable_0.3.0
[63] rrcov_1.5-5               R6_2.5.0
[65] gridExtra_2.3             dplyr_1.0.7
[67] fastmap_1.1.0             utf8_1.2.1
[69] ggbeeswarm_0.6.0          Rcpp_1.0.7
[71] vctrs_0.3.8               DEoptimR_1.0-9
[73] tidyselect_1.1.1          sparseMatrixStats_1.4.0

Do you know what the problem can be?

@gorgitko
Copy link
Author

gorgitko commented Jul 19, 2021

Hello Vladimir,

I was trying my changes on an HPC server (~80 cores, 1TB RAM) without any problems, so could the performance of your laptop be the problem? Even on that HPC using 32 workers, the whole calculation takes a considerable amount of time.

The calculation of distances between cells takes a lot of memory and during parallel processing, the amount of used memory is even higher (basically three times more). Could you try first sequential processing with BPPARAM <- SerialParam()? Also, what is the size (dimensionality) of your sce object? I might suspect your laptop to start using swap space when there isn't enough RAM, which obviously leads to a significant calculation speed drop.

Just for curiosity, if you would try the original implementation of parallelism using the same number of workers, it would finish flawlessly?

@wikiselev
Copy link
Member

Hi Jiri, many thanks for your reply! I am on vacation until Friday, will try your suggestions then. Thanks!

@gorgitko
Copy link
Author

Hi, any updates on this? Soon I will be publishing a package to the Bioconductor, which doesn't allow remote sources (so I can't put my fork to DESCRIPTION).

@wikiselev
Copy link
Member

Very sorry! I've changed jobs and have absolutely no time to work on this right now. Probably the best is not to rely on me updating it soon... I hope to get to this at some point... Very sorry again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants