Let's say I have a set of dataframes: df1, df2, df3, df4. I want to apply the same operation to each of these dataframes. Rather than copying the code repeatedly, I want to do this through some sort of for loop. For example, let's say I want to take the df and reassign it so that the first column becomes the row names. The normal way I'd do this is:
df1_b <- df1[,-1]
rownames(df1_b) <- df1[,1]
How would I go about doing this to all four dataframes that I have? I imagine I'd need to somehow group the dataframes into a single set and then do something like:
for (i in set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
I tried to do this by combining them with c():
df_set <- c(df1, df2, df3, df4)
for (i in df_set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
But of course that didn't work (I'm pretty sure R does not do string concatenation like this).
Any help would be appreciated!
We can use mget to collect the objects into a named list and then process each element of that list with lapply:
lst1 <- lapply(mget(paste0("df", 1:4)), function(x) {
row.names(x) <- x[,1]
x[,-1]
})
If we want to change the original objects (not recommended)
list2env(lst1, .GlobalEnv)
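As a minimal sketch of the same idea with the `_b` suffix the question asked for, the processed list can be renamed before it is written out. The toy data frames and the environment name `out_env` are assumptions made for illustration. Note that `x[, -1]` on a two-column data frame collapses to a plain vector, so the sketch uses `drop = FALSE`.

```r
# Toy inputs (hypothetical data standing in for the real df1..df4)
df1 <- data.frame(id = c("a", "b"), x = 1:2)
df2 <- data.frame(id = c("c", "d"), x = 3:4)

# Collect the objects by name into a named list
dfs <- mget(c("df1", "df2"))

# Move the first column into the row names of each data frame
dfs_b <- lapply(dfs, function(x) {
  rownames(x) <- as.character(x[[1]])
  x[, -1, drop = FALSE]  # drop = FALSE keeps a data frame even if one column remains
})

# Rename the list elements to df1_b, df2_b
names(dfs_b) <- paste0(names(dfs_b), "_b")

# Optional: create df1_b, df2_b as objects, here in a separate environment
out_env <- new.env()
list2env(dfs_b, envir = out_env)
```

Writing into a separate environment instead of `.GlobalEnv` keeps the originals untouched; in most cases simply continuing to work with the list `dfs_b` is easier.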
Another option is the tidyverse:
library(purrr)
library(tibble)
library(dplyr)
mget(ls(pattern = "^df\\d+$")) %>%
map(~ .x %>%
column_to_rownames(names(.)[1]))
You can apply a function this way for example:
# getting some dummy data
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
df4 <- mtcars
lst <- list(df1, df2, df3, df4)
# example of applying the function row.names to the data
Map(row.names, lst)
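One caveat: `list(df1, df2, ...)` produces an unnamed list, so the results cannot easily be traced back to their source. A small sketch, assuming shortened copies of mtcars as stand-in data, that names the elements and applies a function with `lapply` and `sapply`:

```r
# Stand-in data (hypothetical subsets of mtcars)
df1 <- head(mtcars, 3)
df2 <- tail(mtcars, 2)

# Naming the elements keeps track of which result belongs to which data frame
lst <- list(df1 = df1, df2 = df2)

# Any function of a single data frame can be applied this way
n_rows <- lapply(lst, nrow)      # list result, one element per data frame
n_rows_vec <- sapply(lst, nrow)  # simplified to a named integer vector
```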
I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I guess the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
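The original `append()` call had no visible effect because `<-` inside a function assigns to a local variable, leaving the global one untouched. A short sketch with the question's data illustrates this and then flattens the per-data-frame result into a single vector; the names `ids_global`, `ids_per_df` and `ids_flat` are made up for this example.

```r
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])  # note: seq(11:20) is 1..10
mylist <- list(df1, df2)

ids_global <- c()
invisible(lapply(mylist, function(df) {
  # This assignment creates a local variable; the global one is not modified
  ids_global <- append(ids_global, df$id[df$id > 8])
}))
length(ids_global)  # still 0

# Returning the values instead, then flattening the list into one vector
ids_per_df <- lapply(mylist, function(df) df$id[df$id > 8])
ids_flat <- unlist(ids_per_df)
```

Keeping the IDs as a list (one element per data frame) is usually the safer choice, since a pooled vector would remove an ID from every data frame even if it only crossed the threshold in one of them.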
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
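If the IDs are only needed in order to drop rows, the intermediate list can be skipped and each data frame filtered directly. This is a sketch under the assumption that `id` contains no missing values and that the threshold is 8, as in the question:

```r
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])
mylist <- list(df1, df2)

# Keep only the rows at or below the threshold in every data frame
filtered <- lapply(mylist, function(df) df[df$id <= 8, , drop = FALSE])
```

With many large data frames this avoids the repeated `filter()` calls of the loop in the question, because each data frame is subset exactly once.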
I have two long lists of large dataframes that are equal in length. I want to merge Dataframe1 (from list1) with Dataframe1 (from list2) and Dataframe2 (from list1) with Dataframe2 (from list2) etc...
Below is a minimal reproducible example and some attempts.
#### EXAMPLE
#Create Dataframes
# Each vector is one row (2 rows x 5 columns), matching the names set below
df_1 <- data.frame(rbind(c("Bah",NA,2,3,4),c("Bug",NA,5,6,NA)))
df_2 <- data.frame(rbind(c("Blu",7,8,9,10),c(NA,NA,NA,12,13)))
df_3 <- data.frame(rbind(c("Bah",NA,21,32,43),c("Rgh",NA,51,63,NA)))
df_4 <- data.frame(rbind(c("Gar",7,8,9,10),c("Ghh",NA,NA,121,131)))
#Create Lists
list1 <- list(df_1,df_2)
list2 <- list(df_3,df_4)
#Set column and row names for each dataframe
colnames(list1[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list1[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
rownames(list1[[1]]) <- c("1","2")
rownames(list1[[2]]) <- c("1","2")
rownames(list2[[1]]) <- c("1","2")
rownames(list2[[2]]) <- c("1","2")
My desired output is a list of the same length as the input lists but with each dataframe merged by position into a single dataframe. The following yields my desired output for the dataframes and list but is low throughput.
#### DESIRED OUTPUT
DesiredOutput_DF1_Format <- merge(list1[[1]],list2[[1]], all = TRUE, by = "SampleID")
DesiredOutput_DF2_Format <- merge(list1[[2]],list2[[2]], all = TRUE, by = "SampleID")
DesiredOutput_List <- list(DesiredOutput_DF1_Format, DesiredOutput_DF2_Format)
How can I generate an output list in my desired format in a highthroughput way using an apply-like approach?
#### ATTEMPTS
#Attempt1:
attempt1 <- mapply(cbind, list1, list2, simplify=FALSE)
#Attempt2:
My instinct is to use `lapply` but I can't figure out how to make it iterate through two lists simultaneously.
#Attempt3: Works but the order of the output list appears inverted. This is not intuitive, though it is easily corrected... There has to be a cleaner way.
output_list <- list()
dataset_iterator <- 1:length(list1)
for (x in dataset_iterator) {
df1 <- data.frame(list1[[x]])
df2 <- data.frame(list2[[x]])
df_merged <- data.frame(merge(df1, df2, by = "Barcodes", all=TRUE))
output_list <- append(output_list, list(df_merged), 0)
}
Based on the code shown, we may need Map (or mapply with SIMPLIFY = FALSE):
out <- Map(merge, list1, list2, MoreArgs = list(all = TRUE, by = "SampleID"))
Checking against the expected output:
> identical(DesiredOutput_List, out)
[1] TRUE
Or using tidyverse
library(purrr)
library(dplyr)
map2(list1, list2, full_join, by = "SampleID")
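A self-contained sketch of the positional merge, using small made-up data frames (the values and the name `merged` are assumptions for illustration). It also explains the inverted order in Attempt 3: the third argument of `append()` is `after`, so `append(x, values, 0)` inserts at the front of the list each time.

```r
# Hypothetical toy data: two lists of data frames sharing a "SampleID" key
list1 <- list(
  data.frame(SampleID = c("A", "B"), Measure1 = c(1, 2)),
  data.frame(SampleID = c("C", "D"), Measure1 = c(3, 4))
)
list2 <- list(
  data.frame(SampleID = c("A", "X"), Measure2 = c(10, 20)),
  data.frame(SampleID = c("C", "D"), Measure2 = c(30, 40))
)

# Fail early if the lists cannot be paired by position
stopifnot(length(list1) == length(list2))

# Full outer join of the i-th element of list1 with the i-th element of list2
merged <- Map(function(a, b) merge(a, b, by = "SampleID", all = TRUE),
              list1, list2)
```

The result has the same length and order as the inputs, so no reordering step is needed afterwards.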
I have a list with 5 data.frames. Now I want to change the name of the last column of each data.frame.
And I don't know exactly how many columns are in the df.
Example-data:
library(tidyverse)
data(mtcars)
df1 <- tail(mtcars)
df2 <- mtcars[1:5, 2:10]
df3 <- mtcars
df4 <- head(mtcars)
list <- list(df1 = df1, df2 = df2, df3 = df3, df4 = df4)
Doing it one by one, this would be the command:
colnames(list$df1)[length(list$df1)] <- "rank"
Within a for loop, I would think that the command would then be:
for (i in seq_along(list)) {
colnames(i)[length(i)] <- "rank"
}
But here I get the error:
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Any idea how to solve this problem? Maybe by the map-command?
Here I don't know how to include the index/length(df) to assign the colnames-command to the last column of the dataframe.
Thank you for your help :)
Kathrin
You can use last_col() from dplyr within map:
library(tidyverse)
list <- map(list,~{
.x %>%
rename(rank = last_col())
})
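For comparison, a base R sketch of the same rename. The list name `dfs` is an assumption, chosen to avoid masking `base::list`. The loop in the question failed because `i` is only an integer index, so `colnames(i)` is applied to a number rather than to a data frame.

```r
# Example data as in the question, shortened to two data frames
dfs <- list(df1 = tail(mtcars), df2 = mtcars[1:5, 2:10])

# Option 1: lapply, returning the modified data frame each time
renamed <- lapply(dfs, function(df) {
  names(df)[ncol(df)] <- "rank"
  df
})

# Option 2: a for loop has to index into the list with [[i]]
for (i in seq_along(dfs)) {
  colnames(dfs[[i]])[ncol(dfs[[i]])] <- "rank"
}
```

Both options work regardless of how many columns each data frame has, because `ncol(df)` is evaluated per data frame.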
I am trying to remove rows that have missing values from 12 data frames.
I could use na.omit for each of them but that's too much syntax.
I've tried to do it in multiple ways:
Like this:
df <- list(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)
for (i in 1:length(df)){
df[i] <- na.omit(df[i])
}
And like this:
for (df in list(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)){
df <- na.omit(df)
}
Neither of these methods worked :)
Could someone please let me know what is missing here to properly iterate multiple data frames?
df1 <- df2 <- data.frame(a=c(NA, 1), b=1:2)
dfs <- list(df1, df2)
dfs[] <- lapply(dfs, na.omit)
As for why they don't work:
dfs <- list(df1, df2)
for (i in 1:length(dfs)){
dfs[i] <- na.omit(dfs[i])
}
Here you're using single square bracket subsetting, which returns a list, so na.omit is called on a list of length 1 whose only element is the first df. na.omit returns a plain list unchanged, so no rows are removed. For reference, this is what dfs[1] looks like:
dfs[1]
#[[1]]
# a b
#1 NA 1
#2 1 2
And...
for (df in list(df1, df2)) {
df <- na.omit(df)
}
Here you're iterating over the dfs but storing the result of each in the loop variable df. R doesn't really handle references (everything is copied on write), so df holds the result of na.omit(df1) after the first iteration and the result of na.omit(df2) when the loop ends, while df1 and df2 themselves are never modified.
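If a loop is preferred over lapply, a minimal corrected version of the first attempt uses double square brackets, which extract the data frame itself rather than a one-element list:

```r
df1 <- df2 <- data.frame(a = c(NA, 1), b = 1:2)
dfs <- list(df1, df2)

# [[i]] extracts the data frame, so na.omit sees a data frame, not a list
for (i in seq_along(dfs)) {
  dfs[[i]] <- na.omit(dfs[[i]])
}
```

The cleaned data frames live in `dfs`; the original objects `df1` and `df2` keep their missing values.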
In R, I currently have 100 dataframes, named df.1, ...,df.100. I would like to be able to rbind them but it is costly to write out:
rbind(df.1, df.2, etc)
So, I have tried:
rbind(eval(as.symbol(paste0("df.",1:84, collapse = ", "))))
However, this returns errors. Does anyone know how I can make the dataframes usable? thanks.
You can rbind them one at a time in a loop.
df.1 = iris
df.2 = iris
df.3 = iris
DF = df.1
for(i in 2:3) {
  DF = rbind(DF, eval(as.symbol(paste("df", i, sep="."))))
}
Using mget and then do.call or dplyr's bind_rows should work.
df.1 = iris[1:20,]
df.2 = iris[21:50,]
do.call("rbind",mget(paste0("df.",1:2)))
library(dplyr)
bind_rows(mget(paste0("df.",1:2)))
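A sketch of the `mget` approach with more than nine data frames, where ordering starts to matter. The three objects and the index vector `idx` are hypothetical stand-ins for df.1 to df.100. Building the names from the indices keeps the numeric order, whereas `ls(pattern = ...)` sorts alphabetically and would place "df.10" before "df.2".

```r
# Hypothetical stand-ins for df.1, df.2, ..., df.100
df.1 <- iris[1:20, ]
df.2 <- iris[21:50, ]
df.10 <- iris[51:60, ]

# Build the object names from the indices so the order is numeric
idx <- c(1, 2, 10)
parts <- mget(paste0("df.", idx))

# unname() stops rbind from prefixing row names with the list names
combined <- do.call(rbind, unname(parts))
```

Binding everything in one call also avoids growing a data frame inside a loop, which copies the accumulated rows on every iteration.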