Easy way of adding empty entries to a List #53

LTLA · 2019-10-01T02:36:03Z

The use case is as follows:

I have a SummarizedExperiment containing data for the full set of samples.
I have a List containing data for a subset of those samples.
I want to store the List in the colData of the SE.

To do so, I would like to generate empty values for samples that aren't in the List, so that my expanded List can nicely fit into my SE object. The same scenario applies for GRangesList objects that I might want to store as rowRanges but don't have entries for all rows of the SE (e.g., because I have some transgenes or spike-ins that don't have genomic coordinates).

My solution is this:

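expandList <- function(x, names) {
    # Flag the requested names that have no entry in 'x'.
    lost <- !names %in% names(x)
    if (!any(lost)) {
        return(x[names])
    }

    # Build one empty element per missing name by splitting a zero-length
    # version of the unlisted data on a factor that has levels but no values.
    y <- split(extractROWS(unlist(x), 0L), factor(names[0], levels=names[lost]))
    if (!all(lost)) {
        # Drop the mcols before combining; they are restored below.
        x0 <- x[names[!lost]]
        mcols(x0) <- NULL
        y <- c(x0, y)[names]
    }

    # Carry the metadata columns over in the requested order.
    mcols(y) <- mcols(x)[match(names, names(x)),]
    y
}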

Example:

library(S4Vectors)
a <- List(A=1:5, B=2:3)
expandList(a, LETTERS[1:5])

hpages · 2019-10-01T06:42:04Z

Hi Aaron,

This is more like a subsetting operation that allows extracting non-existing list elements as "empty" elements. Note that this is something that ordinary lists in R support:

as.list(a)[LETTERS[1:5]]
# $A
# [1] 1 2 3 4 5
#
# $B
# [1] 2 3
# 
# $<NA>
# NULL
#
# $<NA>
# NULL
# 
# $<NA>
# NULL

Except that with our List derivatives empty list elements cannot necessarily be represented with NULLs. The "canonical empty list element" of a List derivative actually depends on the particular type of List derivative. It can be computed with something like unlist(extractROWS(x, 0L), use.names=FALSE). Not sure this is the idiom for this though (I would need to spend some time convincing myself that this will always do the right thing). Note that I inverted the order of the operations with respect to your extractROWS(unlist(x), 0L). Both produce the same thing but the latter could be very expensive e.g. when x is a very long SimpleList derivative.
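
To make the comparison concrete, here is a minimal sketch of the two ways of computing the empty element, assuming an IntegerList built with the IRanges package (the variable names e1 and e2 are only for illustration). Names are dropped in both calls so that the two results can be compared directly:

```r
library(IRanges)

x <- IntegerList(A=1:5, B=2:3)

# Subset the list down to zero elements first, then unlist: cheap,
# because only the already-empty list gets unlisted.
e1 <- unlist(extractROWS(x, 0L), use.names=FALSE)

# Unlist everything first, then keep zero rows: same result, but the
# full unlist() can be costly on a very long SimpleList derivative.
e2 <- extractROWS(unlist(x, use.names=FALSE), 0L)

identical(e1, e2)
```

On a small list like this the cost difference is negligible; the ordering only starts to matter when x has many elements.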

2 more things: there are no reasons to (1) limit the subscript to a character vector and (2) not support duplicated subscripts:

as.list(a)[c(3:1, 2:3)]
# $<NA>
# NULL
# 
# $B
# [1] 2 3
# 
# $A
# [1] 1 2 3 4 5
# 
# $B
# [1] 2 3
# 
# $<NA>
# NULL

So if we were to add something like this to S4Vectors, my proposal would be to go for something along the lines of:

library(IRanges)

subset_and_expand_List <- function(x, i)
{
    stopifnot(is(x, "list_OR_List"))
    i <- normalizeSingleBracketSubscript(i, x, allow.append=TRUE)
    if (length(i) == 0L)
        return(extractROWS(x, i))
    y_len <- max(i) - length(x)
    if (y_len <= 0L)
        return(extractROWS(x, i))
    ## Append 'y_len' empty list elements to 'x'.
    empty_element <- unlist(extractROWS(x, 0L), use.names=FALSE)
    partitioning <- PartitioningByEnd(integer(y_len))
    y <- relist(empty_element, partitioning)
    x <- c(x, y)
    extractROWS(x, i)
}

Then:

a <- List(A=1:5, B=2:3)
subset_and_expand_List(a, LETTERS[1:5])
# IntegerList of length 5
# [["A"]] 1 2 3 4 5
# [["B"]] 2 3
# [[""]] integer(0)
# [[""]] integer(0)
# [[""]] integer(0)

library(GenomicRanges)
example(GRangesList)
subset_and_expand_List(grl, c("B", "gr3", "gr2", "A", "gr2", "A"))
subset_and_expand_List(grl, c(2, 7:2, 4))

Maybe extractROWS and [ should just allow extracting non-existing list elements, in which case we wouldn't need to introduce a dedicated function for this. This would be a big game changer though so would need to be considered very cautiously.

H.

LTLA · 2019-10-01T15:35:45Z

> Both produce the same thing but the latter could be very expensive e.g. when x is a very long SimpleList derivative.

Makes sense.

> 2 more things: there are no reasons to (1) limit the subscript to a character vector and (2) not support duplicated subscripts:

I don't mind, but I don't like the look of those empty names in the subset_and_expand_List(a, LETTERS[1:5]) example. They seem to cause more problems than they're worth, e.g., #47. Also, it means that I can't get an empty vector by using $C, even though I just expanded the List to include it.

> This would be a big game changer though so would need to be considered very cautiously.

That probably warrants a bigger discussion about handling of missing values in Vectors.

hpages · 2019-10-01T18:19:25Z

I purposely left the question of names propagation aside for my first shot at subset_and_expand_List(). Note that base R doesn't care to propagate the names either when the subscript is a character vector (e.g. as.list(a)[LETTERS[1:5]]), which I also find ugly. Given that Vector derivatives support subsetting by a character- or factor-Rle, a version of subset_and_expand_List() that propagates the names would look something like:

subset_and_expand_List <- function(x, i)
{
    stopifnot(is(x, "list_OR_List"))
    i0 <- normalizeSingleBracketSubscript(i, x, allow.append=TRUE)
    if (length(i0) == 0L)
        return(extractROWS(x, i0))
    y_len <- max(i0) - length(x)
    if (y_len <= 0L)
        return(extractROWS(x, i0))
    ## Append 'y_len' empty list elements to 'x'.
    empty_element <- unlist(extractROWS(x, 0L), use.names=FALSE)
    partitioning <- PartitioningByEnd(integer(y_len))
    y <- relist(empty_element, partitioning)
    x <- c(x, y)
    ans <- extractROWS(x, i0)
    if (is(i, "Rle")) {
        i_runvals <- runValue(i)
        set_names <- is.character(i_runvals) || is.factor(i_runvals)
    } else {
        set_names <- is.character(i) || is.factor(i)
    }
    if (set_names)
        names(ans) <- as.character(i)  # expand an Rle or factor subscript to plain names
    ans
}
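
As a small illustration of the Rle branch above, assuming the IRanges package is loaded: the run values of a character-Rle are a plain character vector, and as.character() expands the Rle to one name per selected element. The subscript i below is made up for the example:

```r
library(IRanges)

a <- IntegerList(A=1:5, B=2:3)
i <- Rle(c("B", "A"), c(2L, 1L))   # stands for c("B", "B", "A")

runValue(i)       # the distinct run values, a character vector
as.character(i)   # the expanded subscript, one entry per element
a[i]              # subsetting by the Rle selects B, B, A
```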

Anyway, what do you have in mind? Did you just want to discuss alternate implementations for your expandList() function or are you advocating for adding something like this to S4Vectors? If the latter, it's important that it extends [, i.e. that it behaves like [ for any subscript that selects existing list elements. Also the function would need a better name.

H.

LTLA · 2019-10-01T21:00:08Z

I was thinking that something like this could be added to S4Vectors; I have already encountered a need for it in two different contexts (one for GRLs, another time for DataFrameLists).

hpages · 2019-10-01T21:39:43Z

I could give this a shot. What does @lawremi think about integrating this into extractROWS (and thus into [)? If that doesn't seem like a good option then we would need a good name for this dedicated tolerant-to-missing-list-elements subsetting function. Suggestions? Also, to make the function easier to discover, the error message one gets when trying to extract non-existing List elements with extractROWS or [ could be modified to suggest the use of this dedicated function.
