drop dataframes that don't match condition R - r

I have a list of 310 data.frames, mrns[[i]], that I am subsetting based on the value of a factor, mrns[[i]]$ar.cat. I am able to use subset on them all in a way that those data.frames that don't match the condition are left with 0 observations, but I would like the code to just remove these data.frames rather than leave them in the new list as empty.
My code is:
arlow <- lapply(mrns, function(x) subset(x, x$ar.cat[1] == "Arousals Index: LOW"))
Which gives me:
length(arlow)
[1] 310
When I see the contents of the arlow list, I see this for the data.frames that don't meet the condition:
[[98]]
[1] raw.Number raw.Reading_Status raw.Month raw.Day raw.Year
[6] raw.Hour raw.Minute raw.Systolic raw.Diastolic raw.MAP
[11] raw.PP raw.HR raw.Event_Code raw.Edit_Status raw.Diary_Activity
[16] na.strings raw.facility raw.lastname raw.firstname raw.id
[21] raw.hookup raw.datetime raw.mrn unis ar.value
[26] ar.cat ID avg.hr.prhr avg.sys.prhr avg.dias.prhr
[31] avg.map.prhr avg.pp.prhr time time_60 raw.Minutee
<0 rows> (or 0-length row.names)
Let's say that the x$ar.cat[1] == "Arousals Index: LOW" condition is only met in 180 of my 310 mrns[[i]] data.frames, I would want the result of length(arlow) to equal 180.
Anyone have any suggestions on how to remove those data.frames not matching the condition?
Thanks!

How about this:
arlow <- Filter(function(y) nrow(y) > 0, lapply(mrns, function(x) subset(x, x$ar.cat[1] == "Arousals Index: LOW")))
First you filter as you did, and then you keep only the data frames that still have rows.

So you want to remove the NULLs from arlow?
Try:
arlow <- arlow[!sapply(arlow, is.null)]
As in:
lst <- list(data.frame(x=1:10,y=rnorm(10)), NULL, data.frame(x=1:10,y=rnorm(10)))
length(lst)
# [1] 3
result <- lst[!sapply(lst, is.null)]
length(result)
# [1] 2
Here's another way:
result <- Filter(Negate(is.null), lst)
length(result)
# [1] 2

edit: Actually, my answer does not make much sense. I did not do the subsetting in each data frame that you want. I still think which() is useful to subset without NA and NULL values, though.
mrns[which(sapply(seq_along(mrns), function(i) mrns[[i]]$ar.cat[1] == "Arousals Index: LOW"))]
This solution tests whether the first value of the category (ar.cat) is "Arousals Index: LOW" for each data frame in your list of data frames. The resulting logical vector should have 310 elements, one per data frame, with TRUE where the condition is met.
Now we use which() to get the indices of the TRUE values. which() skips any FALSE or NA entries in the vector we produced.
As a last step we subset the list of data frames with the indices we want.

Thanks everyone for your responses! I found that adding the following code gave me what I was looking for.
> arlow <- arlow[sapply(arlow, function(x) dim(x)[1]) > 0]
> length(arlow)
[1] 103
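The subsetting and the dropping can also be done in one pass with Filter. This is only a sketch on made-up data: the three small data frames below stand in for the real mrns list, and is_low is a hypothetical helper name, not something from the thread.

```r
# Toy stand-in for the question's `mrns` list (hypothetical data)
mrns <- list(
  data.frame(ar.cat = rep("Arousals Index: LOW", 3), raw.HR = c(60, 62, 61)),
  data.frame(ar.cat = rep("Arousals Index: HIGH", 2), raw.HR = c(80, 85)),
  data.frame(ar.cat = rep("Arousals Index: LOW", 2), raw.HR = c(58, 59))
)

# Hypothetical predicate: TRUE only when the first ar.cat value is the LOW label.
# isTRUE() turns NA or a missing column into FALSE instead of an error.
is_low <- function(df) {
  nrow(df) > 0 && isTRUE(as.character(df$ar.cat[1]) == "Arousals Index: LOW")
}

# Keep whole data frames that satisfy the predicate; the others are dropped
arlow <- Filter(is_low, mrns)
length(arlow)
```

Because the condition only looks at ar.cat[1], each data frame is either kept whole or dropped, so there is no need to call subset first and clean up the empty results afterwards.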

Related

How do you test if a matrix exists in a matrix list? (Wordle Project)

I am an infrequent R user so my apologies if any of my terminology is incorrect. I am working on a project around the game Wordle to see if a given Wordle submission in my family group chat is unique or if it has already been submitted before. The inspiration for this came from the Twitter account "Scorigami" which tracks every NFL game and tweets whether or not that score has occurred before in the history of the league.
To load the Wordle entries into R, I've decided to turn each submission into a Matrix where 0 = incorrect letter, 1 = right letter/wrong position, and 2 = right letter/correct position. In R this looks like this:
wordle_brendan <- rbind(c(1,0,0,0,0),c(2,2,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
I then combine them into a list that will be used to check against any future Wordle submissions to see if they have been previously submitted.
list <- list(wordle_brendan, wordle_jack)
I think I am on the right track, but I don't know how to create a new wordle matrix to test whether that submission has been given before. Say I recreated "wordle_brendan" with the same values but under a different name... How would I then get R to check if that matrix exists in my preexisting list of matrices? I've tried using the %in% function 1,000 different ways but can't get it to work.. Any help would be much appreciated! Thanks! (And if you can think of a better way to do this, please let me know!)
There are multiple ways to do this, but this is pretty simple. We need some samples to check:
new1 <- list[[2]] # The same as your second matrix
new2 <- new1
new2[3, 5] <- 0 # Change one position from 2 to 0.
To compare
any(sapply(list, identical, y=new1))
# [1] TRUE
any(sapply(list, identical, y=new2))
# [1] FALSE
So new1 matches an existing matrix, but new2 does not. To see which matrix:
which(sapply(list, identical, y=new1))
# [1] 2
which(sapply(list, identical, y=new2))
# integer(0)
So new1 matches the second matrix in list, but new2 does not match any matrix.
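For the Scorigami-style workflow, the check can be wrapped in a small helper. This is a sketch only: seen_before and the variable name seen are hypothetical (seen is used instead of list to avoid masking the base function).

```r
# Hypothetical helper: TRUE if `new` exactly matches any matrix in `seen`
seen_before <- function(new, seen) {
  any(vapply(seen, identical, logical(1), y = new))
}

wordle_brendan <- rbind(c(1,0,0,0,0), c(2,2,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
seen <- list(wordle_brendan, wordle_jack)

# A new submission with the same values as wordle_jack
candidate <- rbind(c(2,0,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
already_played <- seen_before(candidate, seen)

# Only record the submission if it is new
if (!already_played) {
  seen <- c(seen, list(candidate))
}
length(seen)
```

vapply is used instead of sapply so that an empty seen list still yields a logical vector, and any() then returns FALSE.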
Here is a way with a matequal function. The base function identical compares the objects themselves, not just their values: if the matrices have the same values but different attributes, such as dimnames, identical returns FALSE.
This is often too strict. A function that compares values only will return TRUE in these cases.
I will use dcarlson's new1 to illustrate this point.
matequal <- function(x, y) {
ok <- is.matrix(x) && is.matrix(y) && all(dim(x) == dim(y))
ok && all(x == y)
}
wordle_brendan <- rbind(c(1,0,0,0,0),c(2,2,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
list <- list(wordle_brendan, wordle_jack)
new1 <- list[[2]] # The same as your second matrix
wordle_john <- wordle_jack
dimnames(wordle_john) <- list(1:3, letters[1:5])
list2 <- list(wordle_brendan, wordle_jack, wordle_john)
sapply(list2, identical, y=new1)
#> [1] FALSE TRUE FALSE
sapply(list2, matequal, y=new1)
#> [1] FALSE TRUE TRUE
Created on 2022-09-27 with reprex v2.0.2
Edit
identical is not a function to compare two objects' values, it's a function to compare the objects themselves. In the following example identical returns FALSE though x and y have equal values, in the usual sense of equal.
matequal <- function(x, y) {
ok <- is.matrix(x) && is.matrix(y) && all(dim(x) == dim(y))
ok && all(x == y)
}
x <- matrix(1:5, ncol = 1)
y <- matrix(1 + 0:4, ncol = 1)
all(x == y)
#> [1] TRUE
identical(x, y)
#> [1] FALSE
matequal(x, y)
#> [1] TRUE
Created on 2022-09-28 with reprex v2.0.2
This is because x and y have different internal storage types, which mirror the C types used underneath. One of the objects stores its elements as "integer" and the other as "double" (reported as class "numeric"). Both matrices have the same class attribute ("matrix" "array"); the storage type of the elements is the main difference.
In a comment it is asked
Thank you and dcarlson for the response! Regarding your two sapply lines, can you explain what the use would be behind using matequal as opposed to identical? Is the only difference that matequal takes into account the column and row names?
So the answer to the question in the comment is no: the attributes, in this case dimnames, are not the only reason why identical is often not ideal for comparing R objects.
typeof(x)
#> [1] "integer"
typeof(y)
#> [1] "double"
class(x[1])
#> [1] "integer"
class(y[2])
#> [1] "numeric"
class(x)
#> [1] "matrix" "array"
class(y)
#> [1] "matrix" "array"
Created on 2022-09-28 with reprex v2.0.2
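If an exact identical test is still preferred, one possible workaround (my assumption, not something proposed in the answer) is to bring both matrices to the same storage mode first. A minimal sketch:

```r
x <- matrix(1:5, ncol = 1)      # integer storage
y <- matrix(1 + 0:4, ncol = 1)  # double storage, same values

before <- identical(x, y)       # storage types differ

# Convert a copy of y to integer storage; dim is preserved
y_int <- y
storage.mode(y_int) <- "integer"

after <- identical(x, y_int)    # same type, same dim, same values
c(before = before, after = after)
```

This only addresses the storage type. Differences in other attributes, such as dimnames, would still make identical return FALSE.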

R: loop over list of lists to retrieve headers of sublists that contain a hit

I have a list of lists in R. Each sublist in the list of lists contains multiple elements. These sublists do not necessarily all have the same length. All sublists have a specific header name. Like this:
#create list of lists
vector1 = c("apple","banana","cherry")
vector2 = c("banana","date","fig")
vector3 = c("fig","jackfruit","mango","plum")
listoflists = list(vector1 , vector2, vector3)
names(listoflists) = c("listA", "listB", "listC")
The list of lists looks like this:
listoflists
$listA
[1] "apple" "banana" "cherry"
$listB
[1] "banana" "date" "fig"
$listC
[1] "fig" "jackfruit" "mango" "plum"
Next, I have a vector that contains elements that can also be found within the sublists. Like this:
wanted = c("apple","banana","fig")
wanted
[1] "apple" "banana" "fig"
For each element in the vector wanted I want to extract the header names of each sublist in the list of lists that contains this particular element. For the here presented example the output should look something like this:
#desired output
apple listA
banana listA listB
fig listB listC
I thought about putting this into a for loop to obtain something like this:
output_list = list()
for (i in wanted){
output = EXTRACT LIST HEADER WHEN i IS PRESENT IN SUBLIST
output_list[[i]] = output
}
However, it is not clear whether I can, and if yes how to, loop over the list of lists to extract header names of only those sublists that contain the element in the vector wanted. I looked into using the unlist function but that did not seem to be useful for this problem. I looked on stackoverflow, as well as other forums but could not find any question outlining a similar problem. It would thus be really helpful if someone can point me into the right direction to solve this issue.
Thanks already!
There are multiple ways to get the output.
1) An option is to loop over the 'listoflists', subset the vector based on the 'wanted' values, stack it to a two column data.frame and split into a list again by 'values'
with(stack(lapply(listoflists, function(x)
x[x %in% wanted])), split(as.character(ind), values))
#$apple
#[1] "listA"
#$banana
#[1] "listA" "listB"
#$fig
#[1] "listB" "listC"
2) or we can stack first to a two column 'data.frame', then subset the rows, and split
with(subset(stack(listoflists), values %in% wanted),
split(as.character(ind), values))
#$apple
#[1] "listA"
#$banana
#[1] "listA" "listB"
#$fig
#[1] "listB" "listC"
3) Or another option is to loop over the 'wanted' and get the names of the 'listoflists' based on a match
setNames(lapply(wanted, function(x)
names(which(sapply(listoflists, function(y) x %in% y)))), wanted)
#$apple
#[1] "listA"
#$banana
#[1] "listA" "listB"
#$fig
#[1] "listB" "listC"
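For completeness, the explicit for loop sketched in the question can be filled in along the same lines as option 3. This is just one possible way to write it; the loop variable w and the intermediate hit are names I made up.

```r
listoflists <- list(
  listA = c("apple", "banana", "cherry"),
  listB = c("banana", "date", "fig"),
  listC = c("fig", "jackfruit", "mango", "plum")
)
wanted <- c("apple", "banana", "fig")

output_list <- list()
for (w in wanted) {
  # One TRUE/FALSE per sublist: does it contain the wanted element?
  hit <- vapply(listoflists, function(v) w %in% v, logical(1))
  # Keep the header names of the sublists that contain it
  output_list[[w]] <- names(listoflists)[hit]
}
output_list
```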
Here is another base R option
u <- unlist(listoflists)
sapply(wanted, function(x) rep(names(listoflists),lengths(listoflists))[u %in% x])
which gives
$apple
[1] "listA"
$banana
[1] "listA" "listB"
$fig
[1] "listB" "listC"
You could use stack + unstack
unstack(subset(stack(listoflists), values%in%wanted), ind~values)
$apple
[1] "listA"
$banana
[1] "listA" "listB"
$fig
[1] "listB" "listC"

R List with sub-lists: Extract all elements that match a rule into array

I have an R list of objects which are again lists of various types. I want the "cost" value for all objects whose category is "internal". What's a good way of achieving this?
If I had a data frame I'd have done something like
my_dataframe$cost[my_dataframe$category == "internal"]
What's the analogous idiom for a list?
mylist<-list(list(category="internal",cost=2),
list(category="bar",cost=3),list(category="internal",cost=4),
list(category='foo',age=56))
Here I'd want to get c(2,4). Subsetting like this does not work:
mylist[mylist$category == "internal"]
I can do part of this by:
temp<-sapply(mylist,FUN = function(x) x$category=="internal")
mylist[temp]
[[1]]
[[1]]$category
[1] "internal"
[[1]]$cost
[1] 2
[[2]]
[[2]]$category
[1] "internal"
[[2]]$cost
[1] 4
But how do I get just the costs so that I can (say) sum them up etc.? I tried this but does not help much:
unlist(mylist[temp])
category cost category cost
"internal" "2" "internal" "4"
Is there a neat, compact idiom for doing what I want?
The idiom you are looking for is
sapply(mylist, "[[", "cost")
which returns a list holding the extracted value for each element where it exists, and NULL where it does not.
[[1]]
[1] 2
[[2]]
[1] 3
[[3]]
[1] 4
[[4]]
NULL
If you just want the sum of categories that are internal you can do the following assuming you want a vector.
sum(sapply(mylist[temp], "[[", "cost"))
And if you want a list of the same result you can do
sapply(mylist,function(x) x[x$category == "internal"]$cost)
One of the beautiful but challenging things about R is that there are so many ways to express the same thing.
You might note from the other answers that you can interchange sapply and lapply, since lists are just heterogeneous vectors. The following will also return 6.
do.call("sum",lapply(mylist, function(x) x[x[["category"]] == "internal"]$cost))
Yet another attempt, this time using ?Filter and a custom function to do the necessary selecting:
sum(sapply(Filter(function(x) x$category=="internal", mylist), `[[`, "cost"))
#[1] 6
Could try something like this. For all sublists, if the category is "internal", get the cost, otherwise return NULL which will be ignored when you unlist the result:
sum(unlist(lapply(mylist, function(x) if(x$category == "internal") x$cost)))
# [1] 6
A safer way is to also check if category exists in the sublist by checking the length of category:
sum(unlist(lapply(mylist, function(x) if(length(x$category) && x$category == "internal") x$cost)))
# [1] 6
This will avoid raising an error if the sublist doesn't contain the category field.
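A type-stable variant of the same idea, as a sketch only: vapply states the expected result type up front, and identical handles a missing category without an explicit length check. The names is_internal and costs are mine.

```r
mylist <- list(
  list(category = "internal", cost = 2),
  list(category = "bar", cost = 3),
  list(category = "internal", cost = 4),
  list(category = "foo", age = 56)
)

# identical() is FALSE (not an error) when category is missing or differs
is_internal <- vapply(mylist, function(x) identical(x$category, "internal"), logical(1))

# Extract cost as a plain numeric vector from the matching sublists
costs <- vapply(mylist[is_internal], function(x) x$cost, numeric(1))
costs
sum(costs)
```

Note that vapply will stop with an error if an "internal" sublist has no cost, which may or may not be what you want.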
I approached your question with the rlist package. This method is similar to the purrr package method @alistaire mentioned.
library(rlist); library(dplyr)
mylist %>%
list.filter(category=="internal") %>%
list.mapv(cost) %>% sum()
# list.mapv maps each member of a list by an expression and returns the results as a vector.
The purrr package has some nice utilities for manipulating lists. Here, keep lets you specify a predicate function that returns a Boolean for whether to keep a list element:
library(purrr)
mylist %>%
keep(~.x[['category']] == 'internal') %>%
# now select the `cost` element of each, and simplify to numeric
map_dbl('cost') %>%
sum()
## [1] 6
The predicate structure with ~ and .x is a shorthand equivalent to
function(x){x[['category']] == 'internal'}
Here's a dplyr option:
library(dplyr)
bind_rows(mylist) %>%
filter(category == 'internal') %>%
summarize(total = sum(cost))
# A tibble: 1 x 1
total
<dbl>
1 6

ifelse() with unexpected results in R

I have several bins in my data frame.
[1] "bin1" "bin2" "bin3" "bin4" "bin5" "bin6"
I have a bin number, and would like to exclude everything EXCEPT that bin and the previous bin. If bin=1, I would only like to exclude everything except bin1 (bin0 does not exist).
To produce a vector of names of bins to exclude later from my data frame, I produce:
BinsToDelete <- ifelse(i>1, paste("bin",1:6,sep="")[-((i-1):i)],paste("bin",1:6,sep="")[-i])
For ease of understanding
> i=3
> paste("bin",1:6,sep="")[-((i-1):i)]
[1] "bin1" "bin4" "bin5" "bin6"
> paste("bin",1:6,sep="")[-i]
[1] "bin1" "bin2" "bin4" "bin5" "bin6"
Weirdly an ifelse statement produces this:
> i=3
> BinsToDelete <- ifelse(i==1, paste("bin",1:6,sep="")[-i],paste("bin",1:6,sep="")[-((i-1):i)])
> BinsToDelete
[1] "bin1"
What happened there?
A normal if-else statement gives the desired results:
> if(i==1){
BinsToDelete <- paste("bin",1:6,sep="")[-i]
} else { BinsToDelete <- paste("bin",1:6,sep="")[-((i-1):i)]}
> BinsToDelete
[1] "bin1" "bin4" "bin5" "bin6"
Thanks for helping me understand how ifelse arrives at this conclusion.
From ?ifelse
Value:
A vector of the same length and attributes (including dimensions
and ‘"class"’) as ‘test’ and data values from the values of ‘yes’
or ‘no’.
In your case:
> i <- 3
> length(i)
[1] 1
So you got a length 1 output
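A short sketch of what that means in practice. ifelse works element by element and returns one value per element of test; the helper bins_to_delete is a hypothetical name, and max(1, i - 1) is my own way of handling the bin1 case without any branching.

```r
bins <- paste0("bin", 1:6)
i <- 3

# test has length 1, so only the first element of the chosen branch is returned
short <- ifelse(i == 1, bins[-i], bins[-((i - 1):i)])

# With a length-3 test, ifelse picks element-wise from yes and no
elementwise <- ifelse(c(TRUE, FALSE, TRUE), c("a", "b", "c"), c("x", "y", "z"))

# Hypothetical helper: drop bin i and the previous bin (only bin i when i is 1)
bins_to_delete <- function(i) bins[-(max(1, i - 1):i)]
bins_to_delete(3)
bins_to_delete(1)
```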
I generally avoid ifelse when possible because I think the resulting code is aesthetically unpleasing. The fact that it removes class attributes and makes handling factors, Dates, and date-times difficult is a further reason to avoid it. The grep function is designed to return a vector suitable for indexed selection:
> x <- paste0("bin", 1:6)  # assumed: x holds the bin names from the question
> z=3
> grep( paste0("bin",(z-1):z, collapse="|") , x)
[1] 2 3
> x[ grep( paste0("bin",(z-1):z, collapse="|") , x)]
[1] "bin2" "bin3"
> z=1
> x[ grep( paste0("bin",(z-1):z, collapse="|") , x)]
[1] "bin1"
My understanding is that the dplyr if_else function addresses some of those issues.

How do I use `[` correctly with (l|s)apply to select a specific column from a list of matrices?

Consider the following situation where I have a list of n matrices (this is just dummy data in the example below) in the object myList
mat <- matrix(1:12, ncol = 3)
myList <- list(mat1 = mat, mat2 = mat, mat3 = mat, mat4 = mat)
I want to select a specific column from each of the matrices and do something with it. This will get me the first column of each matrix and return it as a matrix (lapply() would give me a list either is fine).
sapply(myList, function(x) x[, 1])
What I can't seem to be able to do is use [ directly as a function in my sapply() or lapply() incantations. ?'[' tells me that I need to supply argument j as the column identifier. So what am I doing wrong that this doesn't work?
> lapply(myList, `[`, j = 1)
$mat1
[1] 1
$mat2
[1] 1
$mat3
[1] 1
$mat4
[1] 1
Where I would expect this:
$mat1
[1] 1 2 3 4
$mat2
[1] 1 2 3 4
$mat3
[1] 1 2 3 4
$mat4
[1] 1 2 3 4
I suspect I am getting the wrong [ method but I can't work out why? Thoughts?
I think you are getting the 1 argument form of [. If you do lapply(myList, `[`, i =, j = 1) it works.
After two pints of Britain's finest ale and a bit of cogitation, I realise that this version will work:
lapply(myList, `[`, , 1)
i.e. don't name anything and treat it like I had done mat[ ,1]. Still don't grep why naming j doesn't work...
...actually, having read ?'[' more closely, I notice the following section:
Argument matching:
Note that these operations do not match their index arguments in
the standard way: argument names are ignored and positional
matching only is used. So ‘m[j=2,i=1]’ is equivalent to ‘m[2,1]’
and *not* to ‘m[1,2]’.
And that explains my quandary above. Yeah for actually reading the documentation.
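The positional matching described in the help page can be seen directly on a small matrix. A sketch using the same dummy data as in the question, shortened to two list elements:

```r
mat <- matrix(1:12, ncol = 3)
myList <- list(mat1 = mat, mat2 = mat)

# Argument names are ignored by `[`, so this is mat[2, 1], not mat[1, 2]
by_name <- mat[j = 2, i = 1]

# With a single index, named or not, `[` does vector-style indexing: mat[1]
single <- `[`(mat, j = 1)

# Leaving the first index empty reproduces mat[, 1] inside lapply
first_cols <- lapply(myList, `[`, , 1)

# An explicit wrapper gives the same result and may be easier to read
first_cols_explicit <- lapply(myList, function(m) m[, 1])
```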
It's because [ is a .Primitive function. It has no j argument. And there is no [.matrix method.
> `[`
.Primitive("[")
> args(`[`)
NULL
> methods(`[`)
[1] [.acf* [.AsIs [.bibentry* [.data.frame
[5] [.Date [.difftime [.factor [.formula*
[9] [.getAnywhere* [.hexmode [.listof [.noquote
[13] [.numeric_version [.octmode [.person* [.POSIXct
[17] [.POSIXlt [.raster* [.roman* [.SavedPlots*
[21] [.simple.list [.terms* [.ts* [.tskernel*
Though this really just begs the question of how [ is being dispatched on matrix objects...
