Iterate over a set of dataframes in R

Let's say I have a set of dataframes: df1, df2, df3, df4. I want to apply the same operation to each of them. Rather than copying the code repeatedly, I want to do this with some sort of for loop. For example, let's say I want to take each df and reassign it so that the first column becomes the row names. The normal way I'd do this is:
df1_b <- df1[,-1]
rownames(df1_b) <- df1[,1]
How would I go about doing this for all four dataframes that I have? I imagine I'd need to somehow group the dataframes into a single set and then do something like
for (i in set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
I tried to do this by combining them with c():
df_set <- c(df1, df2, df3, df4)
for (i in df_set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
But of course that didn't work (I'm pretty sure R does not do string concatenation like this).
Any help would be appreciated!

We can use mget to collect the objects into a list, then process each element of that list with lapply:
lst1 <- lapply(mget(paste0("df", 1:4)), function(x) {
row.names(x) <- x[,1]
x[,-1]
})
If we want to change the original objects (not recommended)
list2env(lst1, .GlobalEnv)
Or another option is tidyverse
library(purrr)
library(tibble)
library(dplyr)
mget(ls(pattern = "^df\\d+$")) %>%
map(~ .x %>%
column_to_rownames(names(.)[1]))
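As a minimal base R sketch of the lapply approach that also produces the df1_b-style names from the question: the two small data frames and the helper name first_col_to_rownames are made up for illustration, and the list is built by hand rather than with mget.

```r
# Hypothetical example data; the first column holds the future row names
df1 <- data.frame(id = c("a", "b"), x = 1:2)
df2 <- data.frame(id = c("c", "d"), x = 3:4)

# A named list keeps track of which data frame is which
dfs <- list(df1 = df1, df2 = df2)

# Illustrative helper: move the first column into the row names
first_col_to_rownames <- function(x) {
  rownames(x) <- x[[1]]
  x[, -1, drop = FALSE]  # drop = FALSE keeps a data frame even with one column left
}

dfs_b <- lapply(dfs, first_col_to_rownames)

# R has no "+" for strings; paste0 builds the new names
names(dfs_b) <- paste0(names(dfs_b), "_b")

dfs_b$df1_b
```

Keeping the results in the list (dfs_b$df1_b) is usually easier to work with afterwards than creating separate df1_b, df2_b objects.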

You can apply a function this way for example:
# getting some dummy data
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
df4 <- mtcars
lst <- list(df1, df2, df3, df4)
# example of applying the function row.names to the data
Map(row.names, lst)

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I suspect the errors are in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, and collect those vectors in a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
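To see why the original sapply attempt left ids_to_remove empty: an assignment with <- inside a function creates a local variable and does not touch the global one. The sketch below assumes both data frames have ids 1 to 10, as in the question; the names ids_global, ids_vector and filtered are made up for illustration.

```r
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])
mylist <- list(df1, df2)

# Assigning inside the function only creates a local copy
ids_global <- c()
invisible(lapply(mylist, function(df) {
  ids_global <- append(ids_global, df$id[df$id > 8])
}))
length(ids_global)  # the global vector is still empty

# Returning the values instead: one vector of ids per data frame
ids_to_remove <- lapply(mylist, function(df) df$id[df$id > 8])

# Flatten to a single vector if that is what is needed
ids_vector <- unlist(ids_to_remove)

# Filter each data frame with its own ids, walking both lists in parallel
filtered <- Map(function(df, ids) df[!df$id %in% ids, , drop = FALSE],
                mylist, ids_to_remove)
```

Using %in% removes all matching ids in one step, so no inner loop over ids_to_remove is needed.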

Merge dataframes stored in two lists of the same length

I have two long lists of large dataframes that are equal in length. I want to merge Dataframe1 (from list1) with Dataframe1 (from list2) and Dataframe2 (from list1) with Dataframe2 (from list2) etc...
Below is a minimal reproducible example and some attempts.
#### EXAMPLE
#Create Dataframes
# rbind makes each vector a row, giving 2 rows x 5 columns to match the names set below
df_1 <- data.frame(rbind(c("Bah",NA,2,3,4),c("Bug",NA,5,6,NA)))
df_2 <- data.frame(rbind(c("Blu",7,8,9,10),c(NA,NA,NA,12,13)))
df_3 <- data.frame(rbind(c("Bah",NA,21,32,43),c("Rgh",NA,51,63,NA)))
df_4 <- data.frame(rbind(c("Gar",7,8,9,10),c("Ghh",NA,NA,121,131)))
#Create Lists
list1 <- list(df_1,df_2)
list2 <- list(df_3,df_4)
#Set column and row names for each dataframe
colnames(list1[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list1[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
rownames(list1[[1]]) <- c("1","2")
rownames(list1[[2]]) <- c("1","2")
rownames(list2[[1]]) <- c("1","2")
rownames(list2[[2]]) <- c("1","2")
My desired output is a list of the same length as the input lists but with each dataframe merged by position into a single dataframe. The following yields my desired output for the dataframes and list but is low throughput.
#### DESIRED OUTPUT
DesiredOutput_DF1_Format <- merge(list1[[1]],list2[[1]], all = TRUE, by = "SampleID")
DesiredOutput_DF2_Format <- merge(list1[[2]],list2[[2]], all = TRUE, by = "SampleID")
DesiredOutput_List <- list(DesiredOutput_DF1_Format, DesiredOutput_DF2_Format)
How can I generate an output list in my desired format in a high-throughput way using an apply-like approach?
#### ATTEMPTS
#Attempt1:
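# The argument is SIMPLIFY (upper case); cbind puts the columns side by side rather than merging on SampleID
attempt1 <- mapply(cbind, list1, list2, SIMPLIFY=FALSE)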
#Attempt2:
My instinct is to use `lapply` but I can't figure out how to make it iterate through two lists simultaneously.
#Attempt3: Works, but the order of the output list appears inverted. This is not intuitive, though it is easily corrected... There has to be a cleaner way.
output_list <- list()
dataset_iterator <- 1:length(list1)
for (x in dataset_iterator) {
df1 <- data.frame(list1[[x]])
df2 <- data.frame(list2[[x]])
df_merged <- data.frame(merge(df1, df2, by = "SampleID", all=TRUE))
# after = 0 prepends each result, which is why the output order is reversed
output_list <- append(output_list, list(df_merged), 0)
}
Based on the code shown, we may need Map (or mapply with SIMPLIFY = FALSE):
out <- Map(merge, list1, list2, MoreArgs = list(all = TRUE, by = "SampleID"))
Checking against the expected output:
> identical(DesiredOutput_List, out)
[1] TRUE
Or using tidyverse
library(purrr)
library(dplyr)
map2(list1, list2, full_join, by = "SampleID")
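A small self-contained sketch of the same pairwise merge, using made-up data frames (the columns m1 and m2 are hypothetical) and an anonymous function in place of MoreArgs, in case that reads more clearly:

```r
# Two lists of equal length; element i of list1 is merged with element i of list2
list1 <- list(
  data.frame(SampleID = c("a", "b"), m1 = 1:2),
  data.frame(SampleID = c("c", "d"), m1 = 3:4)
)
list2 <- list(
  data.frame(SampleID = c("a", "x"), m2 = c(10, 20)),
  data.frame(SampleID = c("d", "c"), m2 = c(30, 40))
)

# Map walks both lists in parallel and always returns a list
merged <- Map(function(x, y) merge(x, y, by = "SampleID", all = TRUE),
              list1, list2)

merged[[1]]  # full outer join: ids present in only one input get NA in the other columns
```

Map preserves the input order, so merged[[i]] always corresponds to list1[[i]] and list2[[i]].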

Change name of the last column in df stored in a list

I have a list with 5 data.frames. Now I want to change the name of the last column of each data.frame.
And I don't know exactly how many columns are in the df.
Example-data:
library(tidyverse)
data(mtcars)
df1 <- tail(mtcars)
df2 <- mtcars[1:5, 2:10]
df3 <- mtcars
df4 <- head(mtcars)
list <- list(df1, df2, df3, df4)
Doing it one by one, this would be the command:
colnames(list$df1)[length(list$df1)] <- "rank"
Within a for loop, I would think that the command would then be:
for (i in seq_along(list)) {
colnames(i)[length(i)] <- "rank"
}
But here I get the error:
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Any idea how to solve this problem? Maybe by the map-command?
Here I don't know how to include the index/length(df) to assign the colnames-command to the last column of the dataframe.
Thank you for your help :)
Kathrin
You can use last_col() from dplyr within map:
library(tidyverse)
list <- map(list,~{
.x %>%
rename(rank = last_col())
})
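The for loop in the question fails because seq_along() yields the integers 1, 2, ..., so colnames(i) is applied to a number rather than to a data frame. A base R sketch of two ways around that, using a made-up list called dfs (the name is chosen to avoid masking the list() function):

```r
dfs <- list(head(mtcars[, 1:3]), head(mtcars[, 2:6]), mtcars)

# Option 1: index into the list with [[i]] so the change is stored back
for (i in seq_along(dfs)) {
  names(dfs[[i]])[ncol(dfs[[i]])] <- "rank"
}

# Option 2: the same with lapply, returning the modified data frame each time
dfs2 <- list(head(mtcars[, 1:3]), head(mtcars[, 2:6]), mtcars)
dfs2 <- lapply(dfs2, function(df) {
  names(df)[ncol(df)] <- "rank"  # ncol(df) is the position of the last column
  df
})

sapply(dfs2, function(df) names(df)[ncol(df)])
```

Neither version needs to know in advance how many columns each data frame has.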

How to iterate multiple data frames in R?

I am trying to remove rows that have missing values from 12 data frames.
I could use na.omit for each of them but that's too much syntax.
I've tried to do it in multiple ways:
Like this:
df <- list(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)
for (i in 1:length(df)){
df[i] <- na.omit(df[i])
}
And like this:
for (df in list(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)){
df <- na.omit(df)
}
None of these methods worked :)
Could someone please let me know what is missing here to properly iterate multiple data frames?
df1 <- df2 <- data.frame(a=c(NA, 1), b=1:2)
dfs <- list(df1, df2)
dfs[] <- lapply(dfs, na.omit)
As for why they don't work:
dfs <- list(df1, df2)
for (i in 1:length(dfs)){
dfs[i] <- na.omit(dfs[i])
}
Here you're using single square bracket subsetting, which returns a list of length 1 whose only element is the first df, and then calling na.omit on that list. na.omit does not look inside list elements, so the list is returned as-is, i.e.,
dfs[1]
#[[1]]
# a b
#1 NA 1
#2 1 2
And...
for (df in list(df1, df2)) {
df <- na.omit(df)
}
Here you're iterating over the dfs but storing the result of each in the loop variable df. R doesn't really handle references (everything is copied on write), so assigning to df changes neither df1, df2 nor the list: df just holds the result of na.omit(df1) after the first iteration and the result of na.omit(df2) when the loop ends.
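If a for loop is preferred over lapply, a minimal sketch of a working version uses double square brackets, so that na.omit receives the data frame itself and the result is written back into the list:

```r
df1 <- df2 <- data.frame(a = c(NA, 1), b = 1:2)
dfs <- list(df1, df2)

for (i in seq_along(dfs)) {
  # [[i]] extracts the data frame; [i] would give a one-element list
  dfs[[i]] <- na.omit(dfs[[i]])
}

sapply(dfs, nrow)  # number of rows left in each data frame
```

The original objects df1 and df2 are unchanged; only the copies stored in dfs are filtered.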

How to create a string of vector names in R that will then be able to be evaluated?

In R, I currently have 100 dataframes, named df.1, ...,df.100. I would like to be able to rbind them but it is costly to write out:
rbind(df.1, df.2, etc)
So, I have tried:
rbind(eval(as.symbol(paste0("df.",1:84, collapse = ", "))))
However, this returns errors. Does anyone know how I can make the dataframes usable? thanks.
You can rbind them one at a time in a loop.
df.1 = iris
df.2 = iris
df.3 = iris
DF = df.1
for(i in 2:3) {
DF = rbind(DF, eval(as.symbol(paste("df", i, sep="."))))
}
Using mget and then do.call or dplyr's bind_rows should work.
df.1 = iris[1:20,]
df.2 = iris[21:50,]
do.call("rbind",mget(paste0("df.",1:2)))
library(dplyr)
bind_rows(mget(paste0("df.",1:2)))
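The attempt in the question fails because paste0(..., collapse = ", ") builds one long string, and as.symbol() turns that into a single symbol named "df.1, df.2, ..." rather than many separate objects. A sketch of the mget route with three made-up slices of iris; passing envir = environment() and calling unname() are choices made here, not part of the answers above:

```r
df.1 <- iris[1:3, ]
df.2 <- iris[4:6, ]
df.3 <- iris[7:9, ]

# One name per data frame, as a character vector (no collapse)
nms <- paste0("df.", 1:3)

# mget looks up each name and returns a list of the objects
df_list <- mget(nms, envir = environment())

# do.call passes the list elements to rbind as separate arguments;
# unname() is used here so the object names are not prefixed to the row names
combined <- do.call(rbind, unname(df_list))

nrow(combined)
```

For 100 data frames only the 1:3 needs to change, or the names can be collected with ls(pattern = ...) as in the first answer on this page.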
