How to chain 2 lapply functions to subset dataframes in R?

I have a list containing 3 dataframes and another list containing 3 vectors of IDs. I'd like to subset each dataframe by keeping the rows whose id appears in the vector at the same position: the 1st dataframe against the 1st vector, the 2nd dataframe against the 2nd vector, and the 3rd dataframe against the 3rd vector. I can do it using nested lapply calls, but I get a list of 3 lists, each containing a dataframe subsetted by each of the 3 vectors in the list of IDs.
I want to get a list of 3 dataframes: the 1st one made of the rows of the 1st dataframe whose id is in the 1st vector of IDs, the 2nd one made of the rows of the 2nd dataframe whose id is in the 2nd vector of IDs, and so on.
n <- seq(1:20)
id <- paste0("ID_", n)
df1 <-data.frame(replicate(3,sample(0:10,10,rep=TRUE)))
df1$id <- replicate(10, sample(id, 1, replace = TRUE))
df2 <-data.frame(replicate(3,sample(0:10,7,rep=TRUE)))
df2$id <- replicate(7, sample(id, 1, replace = TRUE))
df3 <-data.frame(replicate(3,sample(0:10,8,rep=TRUE)))
df3$id <- replicate(8, sample(id, 1, replace = TRUE))
list_df <- list(df1, df2, df3)
list_id <- list(c("ID_13", "ID_1", "ID_5"), c("ID_1", "ID_17", "ID_4",
"ID_9"), c("ID_12", "ID_18"))
subset_df <- lapply(list_df, function(x){
lapply(list_id, function(y) x[x$id %in% y,])
})
Thanks for your help!

As Nicola suggested, you can use Map or mapply in R. mapply takes multiple vectors or lists of the same length as parameters and passes the elements at the same index in each of them to the function.
In your example, mapply will pass the 1st dataframe of list_df and the 1st vector of list_id to df and id respectively, do the required processing, and then continue for i = 2, 3. SIMPLIFY = FALSE keeps the result as a list of dataframes.
mapply(function(df,id){ df[df$id %in% id,]},list_df,list_id,SIMPLIFY = FALSE)
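Since Map is mentioned but not shown, here is a minimal sketch of the same idea with Map, which always returns a list and so needs no SIMPLIFY argument. The toy data frames and ID vectors below are made up for illustration (fixed values instead of the random ones in the question).

```r
# Hypothetical toy data with fixed values, so the result is reproducible
list_df <- list(
  data.frame(x = 1:3, id = c("ID_1", "ID_2", "ID_5"), stringsAsFactors = FALSE),
  data.frame(x = 4:6, id = c("ID_9", "ID_3", "ID_17"), stringsAsFactors = FALSE)
)
list_id <- list(c("ID_1", "ID_5"), c("ID_17"))

# Map() pairs the i-th data frame with the i-th ID vector
subset_df <- Map(function(df, ids) df[df$id %in% ids, ], list_df, list_id)

length(subset_df)   # one data frame per input data frame
subset_df[[1]]$id   # rows of the 1st data frame matched against the 1st vector
subset_df[[2]]$x    # rows of the 2nd data frame matched against the 2nd vector
```

The result has one element per data frame, rather than the 3 x 3 nesting that the two chained lapply calls produce.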

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I guess the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables: an assignment with <- inside the function only creates a local variable. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate over the two lists together:
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
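If the flat vector from the question is still wanted, one possible sketch is to unlist the per-data-frame list. Note that seq(11:20) in the question evaluates to 1:10, which is why the desired output repeats 9 and 10; the example below writes 1:10 directly. The names ids_per_df, ids_flat and filtered are introduced here for illustration only.

```r
# Same shape as the question's data; seq(11:20) is the same as 1:10
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])
mylist <- list(df1, df2)

# One vector of ids per data frame, returned rather than appended to a global
ids_per_df <- lapply(mylist, function(df) df$id[df$id > 8])

# Flatten to a single vector if that form is needed elsewhere
ids_flat <- unlist(ids_per_df)
ids_flat

# Remove the matching rows, pairing each data frame with its own ids
filtered <- Map(function(df, ids) df[!df$id %in% ids, ], mylist, ids_per_df)
sapply(filtered, nrow)
```

Keeping the ids as a list (one vector per data frame) avoids removing an id from a data frame just because it was flagged in a different one.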

Merging two data.frames by two columns each

I have a huge data.frame that I want to reorder. The idea was to split it in half (as the first half contains different information than the second half) and create a third data frame which would be the combination of the two. As I always need the first two columns of the first data frame followed by the first two columns of the second data frame, I need help.
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
df3<-data.frame()
The new data frame should look like the following:
new3[new1[1],new1[2],new2[1],new2[2],new1[3],new1[4],new2[3],new2[4],new1[5],new1[6],new2[5],new2[6], etc.].
Pseudoalgorithmically, cbind 2 columns from data frame new1 then cbind 2 columns from data frame new2 etc.
I tried the following now (thanks to Akrun):
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
new1<-as.data.frame(new1, stringsAsFactors =FALSE)
new2<-as.data.frame(new2, stringsAsFactors =FALSE)
df3<-data.frame()
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
lst1 <- split.default(new1, f1(ncol(new1), 2))
lst2 <- split.default(new2, f1(ncol(new2), 2))
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
However, this gives me an "undefined columns selected" error.
See whether the below code helps
library(tidyverse)
# Two sample data frames of equal number of columns and rows
df1 = mtcars %>% select(-1)
df2 = diamonds %>% slice(1:32)
# get the column names
dn1 = names(df1)
dn2 = names(df2)
# create new ordered list
neworder = map(seq(1,length(dn1),2), # sequence with interval 2
~c(dn1[.x:(.x+1)], dn2[.x:(.x+1)])) %>% # a vector of two columns each
unlist %>% # flatten the list
na.omit # remove NAs arising from odd number of columns
# Get the data frame ordered
df3 = bind_cols(df1, df2) %>%
select(neworder)
It is not clear without a reproducible example. Based on the description, we can split the dataset columns into a list of datasets, use Map to cbind the corresponding pieces, then unlist and use the result to order the columns of the third dataset.
1) Create a function to return a grouping column for splitting the dataset
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
2) split the datasets into a list
lst1 <- split.default(df1, f1(ncol(df1), 2))
lst2 <- split.default(df2, f1(ncol(df2), 2))
3) Map through the corresponding list elements, cbind and unlist and use that to subset the columns of 'df3'
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
data
df1 <- as.data.frame(matrix(letters[1:10], 2, 5), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:10, 2, 5))
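As an alternative to the split/Map steps above, the interleaving can also be sketched as a column reordering of the combined data frame. This is not the approach from the answer, just a base R sketch under the assumption that the goal is "2 columns from the first frame, 2 from the second, and so on". The helper block() and the column names a1..a4 / b1..b4 are made up for illustration.

```r
# Hypothetical small inputs with recognisable column names
df1 <- as.data.frame(matrix(letters[1:8], 2, 4), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:8, 2, 4))
names(df1) <- paste0("a", 1:4)
names(df2) <- paste0("b", 1:4)

# Block number of each column when columns are taken n at a time: 1 1 2 2 ...
block <- function(ncols, n) (seq_len(ncols) - 1) %/% n + 1

combined <- cbind(df1, df2)
grp <- c(block(ncol(df1), 2), block(ncol(df2), 2))

# Sort columns by block; ties keep their original order (df1 before df2)
new3 <- combined[, order(grp, seq_along(grp))]
names(new3)
```

Because only column positions are reordered, the values themselves are never converted or copied through unlist().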

Turning a data.frame into a list of smaller data.frames in R

Suppose I have a data.frame like THIS (or see my code below). As you can see, after every some number of continuous rows, there is a row with all NAs.
I was wondering how I could split THIS data.frame based on every row of NA?
For example, in my code below, I want my original data.frame to be split into 3 smaller data.frames as there are 2 rows of NAs in the original data.frame.
Here is what I tried with no success:
## The original data.frame:
DF <- read.csv("https://raw.githubusercontent.com/izeh/i/master/m.csv", header = T)
## the index number of rows with "NA"s; Here rows 7 and 14:
b <- as.numeric(rownames(DF[!complete.cases(DF), ]))
## split DF by rows that have "NA"s; that is rows 7 and 14:
split(DF, b)
If we also need the NA rows, create a group with cumsum on the 'study.name' column which is blank (or NA)
library(dplyr)
DF %>%
group_split(grp = cumsum(lag(study.name == "", default = FALSE)), keep = FALSE)
Or with base R
split(DF, cumsum(c(FALSE, head(DF$study.name == "", -1))))
Or with NA
i1 <- rowSums(is.na(DF))== ncol(DF)
split(DF, cumsum(c(FALSE, head(i1, -1))))
Or based on 'b'
DF1 <- DF[setdiff(seq_len(nrow(DF)), b), ]
split(DF1, as.character(DF1$study.name))
You can find the positions of b in the sequence of row numbers of DF and use cumsum to create groups.
split(DF, cumsum(seq_len(nrow(DF)) %in% b))
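To see how the cumsum grouping works without downloading the CSV, here is a small sketch on made-up data. It assumes the separator rows are entirely NA (in the real file the study.name column may be blank instead), and it drops the separator rows before splitting. The names all_na, grp and pieces are illustrative.

```r
# Hypothetical data: rows 3 and 7 are all-NA separators
DF <- data.frame(
  study.name = c("A", "A", NA, "B", "B", "B", NA, "C"),
  d = c(0.1, 0.2, NA, 0.3, 0.4, 0.5, NA, 0.6),
  stringsAsFactors = FALSE
)

# TRUE for rows where every column is NA
all_na <- rowSums(is.na(DF)) == ncol(DF)

# The running count of separators seen so far gives a group id per row
grp <- cumsum(all_na)

# Drop the separator rows, then split the rest by group
pieces <- split(DF[!all_na, ], grp[!all_na])
sapply(pieces, nrow)
```

Two separator rows produce three groups, which matches the "3 smaller data.frames" asked for in the question.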

R - How to subset all dataframes stored in a list according to a vector of conditions

This is my first time asking a question here so please let me know if I need to change the way I am doing this. I have been looking for awhile and I haven't been able to find what I need.
I have a list of 3 dataframes. They have the same structure (variables) but not the same number of observations. I would like to get several subsets for each dataframe in my list, according to several conditions stored in a vector.
So if I have 5 conditions, I need to get, for each of the 3 dataframes in my list, 5 subsets of these dataframes, so 15 total.
For instance:
df1 <-data.frame(replicate(3,sample(0:10,10,rep=TRUE)))
df2 <-data.frame(replicate(3,sample(0:10,7,rep=TRUE)))
df3 <-data.frame(replicate(3,sample(0:10,8,rep=TRUE)))
my_list <- list(df1, df2, df3)
conditions <- c(2, 5, 7, 4, 6)
I know how to subset for one of the conditions using lapply
list_subset <- lapply(my_list, function(x) x[which(x$X1 == conditions[1]), ])
But I would like to do that for all the values in the vector conditions.
I hope it makes sense.
Just lapply again, this time over the conditions:
df1 <-data.frame(replicate(3,sample(0:10,10,rep=TRUE)))
df2 <-data.frame(replicate(3,sample(0:10,7,rep=TRUE)))
df3 <-data.frame(replicate(3,sample(0:10,8,rep=TRUE)))
my_list <- list(df1, df2, df3)
conditions <- c(2, 5, 7, 4, 6)
list_subset <- lapply(my_list, function(x) x[which(x$X1 == conditions[1]), ])
#One Way, Conditions on first list
list.of.list_subsets <- lapply(conditions,function(y){
lapply(my_list, function(x) x[which(x$X1 == y), ])
})
#The other way around
list.of.list_subsets2 <- lapply(my_list,function(x){
lapply(conditions, function(y) x[which(x$X1 == y), ])
})
An option would be to filter with %in% and then split based on the 'X1' column
lapply(my_list, function(x) {x1 <- subset(x, X1 %in% conditions); split(x1, x1$X1)})
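With 15 subsets it can help to name the inner list elements after the condition they came from. A minimal sketch, assuming small fixed data frames instead of the random ones above; the "X1_" name prefix is an arbitrary choice for illustration.

```r
# Hypothetical fixed data so the counts below are reproducible
my_list <- list(
  data.frame(X1 = c(2, 5, 2), X2 = 1:3),
  data.frame(X1 = c(7, 7, 4), X2 = 4:6)
)
conditions <- c(2, 5, 7)

subsets <- lapply(my_list, function(df) {
  # One subset per condition, named after the condition value
  out <- lapply(conditions, function(cond) df[df$X1 == cond, ])
  setNames(out, paste0("X1_", conditions))
})

names(subsets[[1]])        # one name per condition
nrow(subsets[[1]]$X1_2)    # rows of the 1st data frame with X1 == 2
nrow(subsets[[2]]$X1_5)    # 0 rows: no match still yields an empty data frame
```

Unlike the split() variant, this keeps an (empty) entry for conditions that match no rows, so every data frame always yields the same number of subsets.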

Vectorizing finding the row-wise mean of data frames within lists within lists

I have a list of sublists. Each sublist contains a data frame with the same structure (only the values inside differ) and a 'yes/no' label. I'd like to find the row-wise mean of the data frames whose yes/no label is TRUE.
#Create the data frames
id <- c("a", "b", "c")
df1 <- data.frame(id=id, data=c(1, 2, 3))
df2 <- df1
df3 <- data.frame(id=id, data=c(1000, 2000, 3000))
#Create the sublists that will store the data frame and the yes/no variable
sub1 <- list(data=df1, useMe=TRUE)
sub2 <- list(data=df2, useMe=TRUE)
sub3 <- list(data=df3, useMe=FALSE)
#Store the sublists in a main list
main <- list(sub1, sub2, sub3)
I want a vectorized function that will return the row-wise average of the data frames, but only if $useMe==TRUE, like so:
> desiredFun(main)
id data
1 a 1
2 b 2
3 c 3
Here's a fairly general way to approach this problem:
# Extract the "data" portion of each "main" list element
# (using lapply to return a list)
AllData <- lapply(main, "[[", "data")
# Extract the "useMe" portion of each "main" list element
# (using sapply to return a vector)
UseMe <- sapply(main, "[[", "useMe")
# Select the "data" list elements where the "useMe" vector elements are TRUE
# and rbind all the data.frames together
Data <- do.call(rbind, AllData[UseMe])
library(plyr)
# Aggregate the resulting data.frame
Avg <- ddply(Data, "id", summarize, data=mean(data))
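The same steps can be wrapped into the desiredFun the question asks for without plyr. This is a base R sketch that swaps ddply for aggregate; it assumes every sublist has a data element with id and data columns, as in the question.

```r
# Rebuild the question's structure
id <- c("a", "b", "c")
main <- list(
  list(data = data.frame(id = id, data = c(1, 2, 3)), useMe = TRUE),
  list(data = data.frame(id = id, data = c(1, 2, 3)), useMe = TRUE),
  list(data = data.frame(id = id, data = c(1000, 2000, 3000)), useMe = FALSE)
)

desiredFun <- function(lst) {
  # Logical vector: which sublists are flagged for use
  use <- vapply(lst, function(s) isTRUE(s$useMe), logical(1))
  # Stack the selected data frames on top of each other
  stacked <- do.call(rbind, lapply(lst[use], `[[`, "data"))
  # Mean of the data column within each id
  aggregate(data ~ id, data = stacked, FUN = mean)
}

desiredFun(main)
```

vapply is used instead of sapply so the selector is guaranteed to be a logical vector even if the list is empty.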
