How to iterate multiple data frames in R? - r

I am trying to remove rows that have missing values from 12 data frames.
I could use na.omit for each of them but that's to much syntax.
I've tried to do it in multiple ways:
Like this:
df <- list(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)
for (i in 1:length(df)){
df[i] <- na.omit(df[i])
}
And like this:
for (df in list(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)){
df <- na.omit(df)
}
None of this methods worked :)
Could someone please let me know what is missing here to properly iterate multiple data frames?

df1 <- df2 <- data.frame(a=c(NA, 1), b=1:2)
dfs <- list(df1, df2)
dfs[] <- lapply(dfs, na.omit)
As for why they don't work:
dfs <- list(df1, df2)
for (i in 1:length(dfs)){
dfs[i] <- na.omit(dfs[i])
}
Here you're using single square bracket subsetting, which returns a list, then calling na.omit on a list of length 1, where element 1 is the first df. Since the df is not NA, it's returned as-is. ie,
dfs[1]
#[[1]]
# a b
#1 NA 1
#2 1 2
And...
for (df in list(df1, df2)) {
df <- na.omit(df)
}
Here you're iterating over the dfs but storing the result of each in df. R doesn't really handle references (everything is copied on write) so df stores the result of na.omit(df1) after the first iteration and the result of na.omit(df2) when the loop ends.

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don´t get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I get the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but couldn´t still find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example generate a vector if IDs to remove for each item in the list as a list
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)

Filter out all data frames which don't have the column Z in a list of data frames?

I have a list of six data frames, from which 5/6 data frames have a column "Z". To proceed with my script, I need to remove the data frame which doesn't have column Z, so I tried the following code:
for(i in 1:length(df)){
if(!("Z" %in% colnames(df[[i]])))
{
df[[i]] = NULL
}
}
This seem'd to actually do the job (it removed the one data frame from the list, which didn't have the column Z), BUT however I still got a message "Error in df[[i]] : subscript out of bounds". Why is that, and how could I get around the error?
The base Filter function works well here:
df <- Filter(\(x) "Z" %in% names(x), df)
As to why your method doesn't work, for(i in 1:length(df)) iterates over each item in the original length(df). As soon as df[[i]] = NULL happens once, then df is shorter than it was when the loop started, so the last iteration will be out of bounds. And you'll also skip some items: if df[[2]] is removed then the original df[[3]] is now df[[2]], and the current df[[3]] was originally df[[4]], so you hop over the original df[[3]] without checking it. Lesson: don't change the length of objects in the midst of iterating over them.
If df is your list of 6 dataframes, you can do this:
df <- df[sapply(df, \(i) "Z" %in% colnames(i))]
The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.
Using discard:
list_df <- list(df1, df2)
purrr::discard(list_df, ~any(colnames(.x) == "Z"))
Output:
[[1]]
A B
1 1 3
2 3 4
As you can see it removed the first dataframe which had column Z.
data
df1 <- data.frame(A = c(1,2),
Z = c(1,4))
df2 <- data.frame(A = c(1,3),
B = c(3,4))

iterate over a set of dataframes in R

Let's say I have a set of dataframes: df1, df2, d3, df4. I want to apply some sort of behaviour over each of these dataframes. Rather than copying the code repeatedly, I want to do this through some sort of for loop. For example, let's say I want to take the df and re assign it so that the first column is row names. The normal way I'd do this is:
df1_b <- df1[,-1]
rownames(df1_b) <- df1[,1]
How would I go about doing this to all four dataframes that I have. I imagine I'd need to somehow make group the dataframes into a single set and then do something like
for (i in set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
I tried to do this with a cbind:
df_set <- c(df1, df2, df3, df4)
for (i in df_set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
But of course that didn't work (I'm pretty sure R does not do string concatenation like this).
Any help would be appreciated!
We can use mget to get the values of the multiple objects into a list and then do the processing in the list by looping over the list with lapply
lst1 <- lapply(mget(paste0("df", 1:4)), function(x) {
row.names(x) <- x[,1]
x[,-1]
})
If we want to change the original objects (not recommended)
list2env(lst1, .GlobalEnv)
Or another option is tidyverse
library(purrr)
library(tibble)
library(dplyr)
mget(ls(pattern = "^df\\d+$")) %>%
map(~ .x %>%
column_to_rownames(names(.)[1]))
You can apply a function this way for example:
# getting some dummy data
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
df4 <- mtcars
lst <- list(df1, df2, df3, df4)
# example of applying the function row.names to the data
Map(row.names, lst)

Access variable dataframe in R loop

If I am working with dataframes in a loop, how can I use a variable data frame name (and additionally, variable column names) to access data frame contents?
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10),Y = sample(c("yes", "no"), 10, replace = TRUE))
for (i in seq_along(dfnames)){
curr.dfname <- dfnames[i]
#how can I do this:
curr.dfname$X <- 42:52
#...this
dfnames[i]$X <- 42:52
#or even this doubly variable call
for (j in 1_seq_along(colnames(curr.dfname)){
curr.dfname$[colnames(temp[j])] <- 42:52
}
}
You can use get() to return a variable reference based on a string of its name:
> x <- 1:10
> get("x")
[1] 1 2 3 4 5 6 7 8 9 10
So, yes, you could iterate through dfnames like:
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
for (cur.dfname in dfnames)
{
cur.df <- get(cur.dfname)
# for a fixed column name
cur.df$X <- 42:52
# iterating through column names as well
for (j in colnames(cur.df))
{
cur.df[, j] <- 42:52
}
}
I really think that this is gonna be a painful approach, though. As the commenters say, if you can get the data frames into a list and then iterate through that, it'll probably perform better and be more readable. Unfortunately, get() isn't vectorised as far as I'm aware, so if you only have a string list of data frame names, you'll have to iterate through that to get a data frame list:
# build data frame list
df.list <- list()
for (i in 1:length(dfnames))
{
df.list[[i]] <- get(dfnames[i])
}
# iterate through data frames
for (cur.df in df.list)
{
cur.df$X <- 42:52
}
Hope that helps!
2018 Update: I probably wouldn't do something like this anymore. Instead, I'd put the data frames in a list and then use purrr:map(), or, the base equivalent, lapply():
library(tidyverse)
stuff_to_do = function(mydata) {
mydata$somecol = 42:52
# … anything else I want to do to the current data frame
mydata # return it
}
df_list = list(df1, df2)
map(df_list, stuff_to_do)
This brings back a list of modified data frames (although you can use variants of map(), map_dfr() and map_dfc(), to automatically bind the list of processed data frames row-wise or column-wise respectively. The former uses column names to join, rather than column positions, and it can also add an ID column using the .id argument and the names of the input list. So it comes with some nice added functionality over lapply()!

rbinding a list of lists of dataframes based on nested order

I have a dataframe, df and a function process that returns a list of two dataframes, a and b. I use dlply to split up the df on an id column, and then return a list of lists of dataframes. Here's sample data/code that approximates the actual data and methods:
df <- data.frame(id1=rep(c(1,2,3,4), each=2))
process <- function(df) {
a <- data.frame(d1=rnorm(1), d2=rnorm(1))
b <- data.frame(id1=df$id1, a=rnorm(nrow(df)), b=runif(nrow(df)))
list(a=a, b=b)
}
require(plyr)
output <- dlply(df, .(id1), process)
output is a list of lists of dataframes, the nested list will always have two dataframes, named a and b. In this case the outer list has a length 4.
What I am looking to generate is a dataframe with all the a dataframes, along with an id column indicating their respective value (I believe this is left in the list as the split_labels attribute, see str(output)). Then similarly for the b dataframes.
So far I have in part used this question to come up with this code:
list <- unlist(output, recursive = FALSE)
list.a <- lapply(1:4, function(x) {
list[[(2*x)-1]]
})
all.a <- rbind.fill(list.a)
Which gives me the final a dataframe (and likewise for b with a different subscript into list), however it doesn't have the id column I need and I'm pretty sure there's got to be a more straightforward or elegant solution. Ideally something clean using plyr.
Not very clean but you can try something like this (assuming the same data generation process).
list.aID <- lapply(1:4, function(x) {
cbind(list[[(2*x) - 1]], list[[2*x]][1, 1, drop = FALSE])
})
all.aID <- rbind.fill(list.aID)
all.aID
all.aID
d1 d2 id1
1 0.68103 -0.74023 1
2 -0.50684 1.23713 2
3 0.33795 -0.37277 3
4 0.37827 0.56892 4

Resources