Easy way of adding empty entries to a List #53

Open
LTLA opened this issue Oct 1, 2019 · 5 comments

Comments

@LTLA
Contributor

LTLA commented Oct 1, 2019

The use case is as follows:

  • I have a SummarizedExperiment containing data for the full set of samples.
  • I have a List containing data for a subset of those samples.
  • I want to store the List in the colData of the SE.

To do so, I would like to generate empty values for samples that aren't in the List, so that my expanded List can nicely fit into my SE object. The same scenario applies for GRangesList objects that I might want to store as rowRanges but don't have entries for all rows of the SE (e.g., because I have some transgenes or spike-ins that don't have genomic coordinates).

My solution is this:

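expandList <- function(x, names) {
    # Flag the requested names that have no entry in 'x'.
    lost <- !names %in% names(x)
    if (!any(lost)) {
        return(x[names])
    }

    # Build one zero-length element per missing name by splitting an
    # empty vector over a factor whose levels are the missing names.
    y <- split(extractROWS(unlist(x), 0L), factor(names[0], levels=names[lost]))
    if (!all(lost)) {
        x0 <- x[names[!lost]]
        mcols(x0) <- NULL
        y <- c(x0, y)[names]
    }

    # Realign the metadata columns of 'x' (if any) to the requested names.
    # drop=FALSE keeps a single-column mcols() from collapsing to a vector.
    mcols(y) <- mcols(x)[match(names, names(x)), , drop=FALSE]
    y
}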

Example:

library(S4Vectors)
a <- List(A=1:5, B=2:3)
expandList(a, LETTERS[1:5])
@hpages
Contributor

hpages commented Oct 1, 2019

Hi Aaron,

This is more like a subsetting operation that allows extracting non-existing list elements as "empty" elements. Note that this is something that ordinary lists in R support:

as.list(a)[LETTERS[1:5]]
# $A
# [1] 1 2 3 4 5
#
# $B
# [1] 2 3
# 
# $<NA>
# NULL
#
# $<NA>
# NULL
# 
# $<NA>
# NULL

Except that with our List derivatives empty list elements cannot necessarily be represented with NULLs. The "canonical empty list element" of a List derivative actually depends on the particular type of List derivative. It can be computed with something like unlist(extractROWS(x, 0L), use.names=FALSE). Not sure this is the idiom for this though (I would need to spend some time convincing myself that this will always do the right thing). Note that I inverted the order of the operations with respect to your extractROWS(unlist(x), 0L). Both produce the same thing but the latter could be very expensive e.g. when x is a very long SimpleList derivative.
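
As a small illustration of the ordering point, the sketch below computes the empty element both ways for an IntegerList and checks that the results agree. It assumes IRanges is installed; `empty_first` and `unlist_first` are illustrative names only, and this is a check on one List type rather than proof that the idiom holds for every derivative.

```r
library(IRanges)  # also attaches S4Vectors, which provides extractROWS()

x <- IntegerList(A=1:5, B=2:3)

# Subset to zero list elements first, then unlist: only an empty List is unlisted.
empty_first <- unlist(extractROWS(x, 0L), use.names=FALSE)

# Unlist everything first, then drop all rows: all of 'x' gets flattened.
unlist_first <- extractROWS(unlist(x, use.names=FALSE), 0L)

stopifnot(identical(empty_first, unlist_first))
stopifnot(identical(empty_first, integer(0)))
```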

2 more things: there are no reasons to (1) limit the subscript to a character vector and (2) not support duplicated subscripts:

as.list(a)[c(3:1, 2:3)]
# $<NA>
# NULL
# 
# $B
# [1] 2 3
# 
# $A
# [1] 1 2 3 4 5
# 
# $B
# [1] 2 3
# 
# $<NA>
# NULL

So if we were to add something like this to S4Vectors, my proposal would be to go for something along the lines of:

library(IRanges)

subset_and_expand_List <- function(x, i)
{
    stopifnot(is(x, "list_OR_List"))
    i <- normalizeSingleBracketSubscript(i, x, allow.append=TRUE)
    if (length(i) == 0L)
        return(extractROWS(x, i))
    y_len <- max(i) - length(x)
    if (y_len <= 0L)
        return(extractROWS(x, i))
    ## Append 'y_len' empty list elements to 'x'.
    empty_element <- unlist(extractROWS(x, 0L), use.names=FALSE)
    partitioning <- PartitioningByEnd(integer(y_len))
    y <- relist(empty_element, partitioning)
    x <- c(x, y)
    extractROWS(x, i)
}

Then:

a <- List(A=1:5, B=2:3)
subset_and_expand_List(a, LETTERS[1:5])
# IntegerList of length 5
# [["A"]] 1 2 3 4 5
# [["B"]] 2 3
# [[""]] integer(0)
# [[""]] integer(0)
# [[""]] integer(0)

library(GenomicRanges)
example(GRangesList)
subset_and_expand_List(grl, c("B", "gr3", "gr2", "A", "gr2", "A"))
subset_and_expand_List(grl, c(2, 7:2, 4))

Maybe extractROWS and [ should just allow extracting non-existing list elements, in which case we wouldn't need to introduce a dedicated function for this. This would be a big game changer though so would need to be considered very cautiously.

H.

@LTLA
Contributor Author

LTLA commented Oct 1, 2019

> Both produce the same thing but the latter could be very expensive e.g. when x is a very long SimpleList derivative.

Makes sense.

> 2 more things: there are no reasons to (1) limit the subscript to a character vector and (2) not support duplicated subscripts:

I don't mind, but I don't like the look of those empty names in the subset_and_expand_List(a, LETTERS[1:5]) example. Empty names seem to cause more problems than they're worth, e.g., #47. Also, it means that I can't get an empty vector by using $C, even though I just expanded the List to include it.
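
To make the naming concern concrete, here is a base R sketch on a plain list (no S4Vectors involved, so it only mirrors the List behaviour by analogy). When the names are not propagated from the subscript, the appended elements end up with NA names and cannot be found by name until the names are reset. The variable names `wanted` and `expanded` are illustrative.

```r
a <- list(A=1:5, B=2:3)
wanted <- LETTERS[1:5]

# Unknown names are returned as NULL elements whose names are NA.
expanded <- a[wanted]
names(expanded)

# "C" was asked for, yet no element carries that name.
"C" %in% names(expanded)

# Resetting the names from the subscript makes the new entries addressable.
names(expanded) <- wanted
"C" %in% names(expanded)
```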

> This would be a big game changer though so would need to be considered very cautiously.

That probably warrants a bigger discussion about handling of missing values in Vectors.

@hpages
Contributor

hpages commented Oct 1, 2019

I purposely left the question of names propagation aside for my first shot at subset_and_expand_List(). Note that base R doesn't care to propagate the names either when the subscript is a character vector (e.g. as.list(a)[LETTERS[1:5]]), which I also find ugly. Given that Vector derivatives support subsetting by a character-Rle or factor-Rle, a version of subset_and_expand_List() that propagates the names would look something like:

subset_and_expand_List <- function(x, i)
{
    stopifnot(is(x, "list_OR_List"))
    i0 <- normalizeSingleBracketSubscript(i, x, allow.append=TRUE)
    if (length(i0) == 0L)
        return(extractROWS(x, i0))
    y_len <- max(i0) - length(x)
    if (y_len <= 0L)
        return(extractROWS(x, i0))
    ## Append 'y_len' empty list elements to 'x'.
    empty_element <- unlist(extractROWS(x, 0L), use.names=FALSE)
    partitioning <- PartitioningByEnd(integer(y_len))
    y <- relist(empty_element, partitioning)
    x <- c(x, y)
    ans <- extractROWS(x, i0)
    if (is(i, "Rle")) {
        i_runvals <- runValue(i)
        set_names <- is.character(i_runvals) || is.factor(i_runvals)
    } else {
        set_names <- is.character(i) || is.factor(i)
    }
    if (set_names)
        names(ans) <- as.character(i)  # expand an Rle or factor subscript to plain character names
    ans
}

Anyway, what do you have in mind? Did you just want to discuss alternate implementations for your expandList() function or are you advocating for adding something like this to S4Vectors? If the latter, it's important that it extends [, i.e. that it behaves like [ for any subscript that selects existing list elements. Also the function would need a better name.
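
The "extends [" requirement can be phrased as a check: for any subscript that only selects existing elements, the tolerant function and [ must return identical results. Below is a rough base R sketch of that check on plain lists. `subset_with_fill()` is a hypothetical stand-in written for this illustration, not the S4Vectors function discussed above, and it assumes the input list has no genuine NULL elements (it cannot tell those apart from missing ones).

```r
# Hypothetical base R analogue for plain lists; 'empty' is the filler element.
subset_with_fill <- function(x, i, empty=integer(0)) {
    ans <- x[i]
    # On a plain list, unknown or out-of-range subscripts come back as NULL.
    missing_elt <- vapply(ans, is.null, logical(1))
    ans[missing_elt] <- list(empty)
    # Propagate names when the subscript is a character vector.
    if (is.character(i))
        names(ans) <- i
    ans
}

a <- list(A=1:5, B=2:3)

# Subscripts selecting only existing elements: must match `[` exactly.
for (i in list(c("B", "A"), 2:1, c(TRUE, FALSE), -1L)) {
    stopifnot(identical(subset_with_fill(a, i), a[i]))
}

# Subscripts reaching past the existing elements: filled with 'empty'.
expanded <- subset_with_fill(a, LETTERS[1:3])
stopifnot(identical(expanded, list(A=1:5, B=2:3, C=integer(0))))
```

The same kind of check, run against the List version with character, numeric, logical and negative subscripts, would be a reasonable unit test for whatever function ends up being added.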

H.

@LTLA
Contributor Author

LTLA commented Oct 1, 2019

I was thinking that something like this could be added to S4Vectors; I have already encountered a need for it in two different contexts (one for GRLs, another time for DataFrameLists).

@hpages
Contributor

hpages commented Oct 1, 2019

I could give this a shot. What does @lawremi think about integrating this into extractROWS (and thus into [)? If that doesn't seem like a good option then we would need a good name for this dedicated tolerant-to-missing-list-elements subsetting function. Suggestions? Also, to make the function easier to discover, the error message one gets when trying to extract non-existing List elements with extractROWS or [ could be modified to suggest the use of this dedicated function.
