Access variable dataframe in R loop - r

If I am working with dataframes in a loop, how can I use a variable data frame name (and additionally, variable column names) to access data frame contents?
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10),Y = sample(c("yes", "no"), 10, replace = TRUE))
for (i in seq_along(dfnames)){
curr.dfname <- dfnames[i]
#how can I do this:
curr.dfname$X <- 42:52
#...this
dfnames[i]$X <- 42:52
#or even this doubly variable call
for (j in seq_along(colnames(curr.dfname))){
curr.dfname$[colnames(temp[j])] <- 42:52
}
}

You can use get() to look up an object by its name, given as a string:
> x <- 1:10
> get("x")
[1] 1 2 3 4 5 6 7 8 9 10
So, yes, you could iterate through dfnames like:
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
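for (cur.dfname in dfnames)
{
  cur.df <- get(cur.dfname)
  # for a fixed column name (10 values, to match the 10 rows)
  cur.df$X <- 42:51
  # iterating through column names as well
  for (j in colnames(cur.df))
  {
    cur.df[, j] <- 42:51
  }
  # get() returns a copy, so write the result back under the original name
  assign(cur.dfname, cur.df)
}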
I really think that this is gonna be a painful approach, though. As the commenters say, if you can get the data frames into a list and then iterate through that, it'll probably perform better and be more readable. Unfortunately, get() isn't vectorised as far as I'm aware, so if you only have a string list of data frame names, you'll have to iterate through that to get a data frame list:
# build data frame list
df.list <- list()
for (i in seq_along(dfnames))
{
  df.list[[i]] <- get(dfnames[i])
}
# iterate through data frames; index into the list so the change is kept
for (i in seq_along(df.list))
{
  df.list[[i]]$X <- 42:51
}
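For what it's worth, base R does have a vectorised counterpart to get(): mget() takes a character vector of names and returns a named list. Here is a minimal sketch, assuming the data frames live in the environment the code is run from; the 42:51 values are only placeholders that match the 10-row example:

```r
dfnames <- c("df1", "df2")
df1 <- df2 <- data.frame(X = sample(1:10),
                         Y = sample(c("yes", "no"), 10, replace = TRUE))

# mget() looks up several objects by name and returns them as a named list
df.list <- mget(dfnames)

# modify every data frame and keep the results in the list
df.list <- lapply(df.list, function(d) {
  d$X <- 42:51
  d
})

# optional: copy the modified data frames back over df1 and df2
list2env(df.list, envir = environment())
```

Keeping the results in df.list is usually enough; the list2env() step is only needed if later code still refers to df1 and df2 by name.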
Hope that helps!
2018 Update: I probably wouldn't do something like this anymore. Instead, I'd put the data frames in a list and then use purrr::map(), or the base equivalent, lapply():
library(tidyverse)
stuff_to_do = function(mydata) {
  mydata$somecol = 42:51 # 10 values, to match the 10 rows of df1 and df2
  # … anything else I want to do to the current data frame
  mydata # return it
}
df_list = list(df1, df2)
map(df_list, stuff_to_do)
This brings back a list of modified data frames, although you can use variants of map(), namely map_dfr() and map_dfc(), to automatically bind the list of processed data frames row-wise or column-wise respectively. The former uses column names to join, rather than column positions, and it can also add an ID column using the .id argument and the names of the input list. So it comes with some nice added functionality over lapply()!
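If purrr isn't available, the same pattern can be approximated in base R. The sketch below is an illustration rather than a drop-in replacement for map_dfr(): the list names first and second, the helper add_col and the small data frames are made up for the example.

```r
df1 <- data.frame(X = 1:3)
df2 <- data.frame(X = 4:6)
df_list <- list(first = df1, second = df2)  # names are illustrative

# hypothetical per-data-frame step: add a column and return the data frame
add_col <- function(mydata) {
  mydata$somecol <- seq_len(nrow(mydata))
  mydata
}

processed <- lapply(df_list, add_col)

# bind row-wise and record which list element each row came from
combined <- do.call(rbind, Map(function(d, id) cbind(id = id, d),
                               processed, names(processed)))
rownames(combined) <- NULL
```

Here the id column plays the role that the .id argument plays in map_dfr().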

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I guess the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate over the two lists together:
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
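If the flat vector from the question is still wanted, unlist() on the per-data-frame list gives it. The following is a small base R sketch using the question's data; the names ids_per_df and filtered are introduced here only for illustration:

```r
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])  # seq(11:20) is 1:10, as in the question
mylist <- list(df1, df2)

# one vector of ids per data frame, returned rather than appended to a global
ids_per_df <- lapply(mylist, function(df) df$id[df$id > 8])

# flatten to a single vector
ids_to_remove <- unlist(ids_per_df)
ids_to_remove
# [1]  9 10  9 10

# drop the matching rows from each data frame
filtered <- Map(function(df, ids) df[!df$id %in% ids, ], mylist, ids_per_df)
```

Keeping the ids per data frame, rather than flattened, means each data frame is only filtered by its own ids.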

How can lapply work with addressing columns as unknown variables?

So, I have a list of strings named control_for. I have a data frame sampleTable with some of the columns named as strings from control_for list. And I have a third object dge_obj (DGElist object) where I want to append those columns. What I wanted to do - use lapply to loop through control_for list, and for each string, find a column in sampleTable with the same name, and then add that column (as a factor) to a DGElist object. For example, for doing it manually with just one string, it looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
  x <- as.factor(sampleTable[, x])
  dge_obj$samples$x <- x
})
Which doesn't work. I guess the problem is that R can't recognize addressing columns like this. Can someone help?
Here are two base R ways of doing it. The data set is the example of help("DGEList") and a mock up data.frame sampleTable.
Define a vector common_vars of the table's names in control_for. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
y <- sampleTable[[x]]
dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop.
# lapply() keeps each column a factor; sapply() would simplify the result to a character matrix
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples <- cbind(dge_obj$samples, tmp)
This code can be rewritten as a one-liner.
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
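As for why the original lapply() attempt fails, two things seem to be going on: $x always refers to a column literally named "x", and an assignment inside the anonymous function only changes a local copy of the object. The sketch below illustrates both with a plain list standing in for the DGEList, which is an assumption made so that the example runs without edgeR:

```r
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")

# plain list standing in for the DGEList object (an assumption for illustration)
obj <- list(samples = data.frame(group = factor(c(1, 1, 2, 2))))

x <- "a"
obj$samples$x <- factor(sampleTable[[x]])     # creates a column literally named "x"
obj$samples[[x]] <- factor(sampleTable[[x]])  # creates a column named "a"
names(obj$samples)
# [1] "group" "x"     "a"

# assigning inside the function modifies a copy that is local to the function
invisible(lapply(control_for, function(v) {
  obj$samples[[v]] <- factor(sampleTable[[v]])
}))
"b" %in% names(obj$samples)
# [1] FALSE
```

This is why the for loop with [[x]] works: it runs in the calling environment and uses the value of x as the column name.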

How to add many data frame columns efficiently in R

I need to add several thousand columns to a data frame. Currently, I have a list of 93 lists, where each of the embedded lists contains 4 data frames, each with 19 variables. I want to add each column of all those data frames to an outside file. My code looks like:
vars <- c('tmin_F','tavg_F','tmax_F','pp','etr_grass','etr_alfalfa','vpd','rhmin','rhmax','dtr_F','us','shum','pp_def_grass','pp_def_alfalfa','rw_tot','fdd28_F0','fdd32_F0','fdd35_F0',
'fdd356_F0','fdd36_F0','fdd38_F0','fdd39_F0','fdd392_F0','fdd40_F0','fdd41_F0','fdd44_F0','fdd45_F0','fdd464_F0','fdd48_F0','fdd50_F0','fdd52_F0','fdd536_F0','fdd55_F0',
'fdd57_F0','fdd59_F0','fdd60_F0','fdd65_F0','fdd70_F0','fdd72_F0','hdd40_F0','hdd45_F0','hdd50_F0','hdd55_F0','hdd57_F0','hdd60_F0','hdd65_F0','hdd45_F0',
'cdd45_F0','cdd50_F0','cdd55_F0','cdd57_F0','cdd60_F0','cdd65_F0','cdd70_F0','cdd72_F0',
'gdd32_F0','gdd35_F0','gdd356_F0','gdd38_F0','gdd39_F0','gdd392_F0','gdd40_F0','gdd41_F0','gdd44_F0','gdd45_F0',
'gdd464_F0','gdd48_F0','gdd50_F0','gdd52_F0','gdd536_F0','gdd55_F0','gdd57_F0','gdd59_F0','gdd60_F0','gdd65_F0','gdd70_F0','gdd72_F0',
'gddmod_32_59_F0','gddmod_32_788_F0','gddmod_356_788_F0','gddmod_392_86_F0','gddmod_41_86_F0','gddmod_464_86_F0','gddmod_48_86_F0','gddmod_50_86_F0','gddmod_536_95_F0',
'sdd77_F0','sdd86_F0','sdd95_F0','sdd97_F0','sdd99_F0','sdd104_F0','sdd113_F0')
windows <- c(15,15,15,29,29,29,15,15,15,15,29,29,29,29,15,rep(15,78))
perc_list <- c('obs','smoothed_obs','windowed_obs','smoothed_windowed_obs')
percs <- c('00','02','05','10','20','25','30','33','40','50','60','66','70','75','80','90','95','98','100')
vcols <- seq(1,19,1)
for (v in 1:93){
  for (pl in 1:4){
    for (p in 1:19){
      normals_1981_2010 <- normals_1981_2010 %>%
        mutate(!!paste0(vars[v],'_daily',perc_list[pl],'_perc',percs[p]) := percents[[v]][[pl]][,vcols[p]])
    }
  }
  print(v)
}
The code starts fast, but very quickly slows to a crawl as the outside data frame grows in size. I didn't realize this would be a problem. How do I add all these extra columns efficiently? Is there a better way to do this than by using mutate? I've tried add_column, but that does not work. Maybe it doesn't like the loop or something.
Your example is not reproducible as is (the object normals_1981_2010 doesn't exist but is called within the loop), so I am unsure whether I understood your question.
If I did though, this should work:
First, I am reproducing your dataset structure, except that instead of 93 lists I set it up to have 5, instead of 4 nested tables within each I set it up to have 3, and instead of each table having 19 columns I set them up to have 3.
df_list <- vector("list", 5) # Create an empty list vector, then fill it in.
for(i in 1:5) {
df_list[[i]] <- vector("list", 3)
for(j in 1:3) {
df_list[[i]][[j]] <- data.frame(a = 1:12,
b = letters[1:12],
c = month.abb[1:12])
colnames(df_list[[i]][[j]]) <- paste0(colnames(df_list[[i]][[j]]), "_nest_", i, "subnest_", j)
}
}
df_list # preview the structure.
Then, answering your question:
library(dplyr) # for bind_cols()
# Now, how to bind everything together:
df_out <- vector("list", 5)
for(i in 1:5) {
  df_out[[i]] <- bind_cols(df_list[[i]])
}
# Final step
df_out <- bind_cols(df_out)
ncol(df_out) # Here I have 5*3*3 = 45 columns, but you will have 93*4*19 = 7068 columns
# [1] 45
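The same idea can be sketched in base R while keeping the column naming scheme from the question. The slowdown most likely comes from modifying an ever-growing data frame thousands of times, so the sketch renames each small table first and binds everything once at the end. The shortened vars, perc_list and percs vectors and the mock percents list are stand-ins, not the asker's real data:

```r
vars <- c("tmin_F", "pp")              # shortened stand-ins for the 93 variables
perc_list <- c("obs", "smoothed_obs")  # stand-ins for the 4 table types
percs <- c("00", "50", "100")          # stand-ins for the 19 percentiles

# mock nested list: percents[[v]][[pl]] is a data frame with one column per percentile
percents <- lapply(seq_along(vars), function(v) {
  lapply(seq_along(perc_list), function(pl) {
    as.data.frame(matrix(runif(4 * length(percs)), nrow = 4))
  })
})

# rename the columns of each small table, collect the tables, bind once
pieces <- list()
for (v in seq_along(vars)) {
  for (pl in seq_along(perc_list)) {
    d <- percents[[v]][[pl]]
    names(d) <- paste0(vars[v], "_daily", perc_list[pl], "_perc", percs)
    pieces[[length(pieces) + 1]] <- d
  }
}
new_cols <- do.call(cbind, pieces)

# the result could then be attached in a single step, e.g.
# normals_1981_2010 <- cbind(normals_1981_2010, new_cols)
```

Because paste0() is vectorised over percs, the innermost loop over percentiles is not needed at all.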

Rename the same column in a list of identical data frames in r

I'm fairly new to R and I was wondering if someone could help me?
I have a list of identical data frames (df1, df2, ..., df9) and I'm trying to rename one of the columns, 'value', in all the data frames to be 'value_dataframename'- the renamed column in all 9 data frames should be value_df1 in df1, value_df2 in df2, ..., value_df9 in df9.
Any help would be much appreciated!
The code below does what you want, using an example list (auto.list). Run it to check.
To use it for your own list:
skip the code up to the your.list <- ... line,
save your list as the your.list object,
assign the name of your column ("value") to term.
auto.list <- list()
for (i in seq_len(10)) {
auto.list[[i]] <- data.frame("a" = 1:i, "value" = sample(letters, i))
names(auto.list)[i] <- paste0("df", i)
}
your.list <- auto.list # assign to your.list your own list
term <- "value" # assign your own "value"
for (i in seq_along(your.list)) {
colnames(your.list[[i]])[colnames(your.list[[i]]) == term] <- paste0(term, "_", names(your.list)[i])
}
Try this out:
## these two are my sample data frames for this example
df_1 <- data.frame(first = rbinom(10,size = 2,prob = 0.3), second = rnorm(10))
df_2 <- data.frame(first = rbinom(10,size = 2,prob = 0.3), second = rnorm(10))
# R stores data frames as list, so you can retrieve all your data frames thus:
all_df_names = ls.str(mode = "list")
# to check: all_df_names[1] - the first element - will give you "df_1", which is the name of the first data frame
# be careful though - 'ls.str(mode = "list")' will pick ALL the lists currently in your environment
# if you don't want to use this ls method, it might be wiser to manually create a variable 'all_df_names' and put all your data frame names there yourself.
# rename
for(i in 1:length(all_df_names)) {
# get the actual content via its variable name, and store it in a temporary variable 'x'
x = get(all_df_names[i])
# rename the column you want
names(x)[2] = paste0(names(x)[2], "_", i) # this will replace the column with the previous name plus a '_' and the current iteration
# resave that dataframe, with the new content
assign(all_df_names[i], x)
}
# to remove variables we no longer need when done:
# rm(x, i)
# confirm
# names(df_1) = "first" "second_1"
# names(df_2) = "first" "second_2"
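If the data frames are already in a named list, the renaming can also be written without an explicit loop. This is a minimal sketch, assuming a named list like your.list above; the two small data frames are made up for the example:

```r
your.list <- list(df1 = data.frame(a = 1:2, value = c("x", "y")),
                  df2 = data.frame(a = 1:3, value = c("p", "q", "r")))

# Map() walks over the data frames and their names in parallel
renamed <- Map(function(df, nm) {
  names(df)[names(df) == "value"] <- paste0("value_", nm)
  df
}, your.list, names(your.list))

names(renamed$df1)
# [1] "a"         "value_df1"
```

The result is a new list; the original your.list is left unchanged unless the result is assigned back to it.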

Adding data frames into a list within a forloop

I have a for loop that generates a dataframe every time it loops through. I am trying to create a list of data frames but I cannot seem to figure out a good way to do this.
For example, with vectors I usually do something like this:
my_numbers <- c()
for (i in 1:4){
my_numbers <- c(my_numbers,i)
}
This will result in a vector c(1,2,3,4). I want to do something similar with dataframes, but accessing the list of data frames is quite difficult when i use:
my_dataframes <- list(my_dataframes,DATAFRAME).
Help please. The main goal is just to create a list of dataframes that I can later on access dataframe by dataframe. Thank you.
I'm sure you've noticed that list does not do what you want it to do, nor should it. c also doesn't work in this case because it flattens data frames, even when recursive=FALSE.
You can use append. As in,
data_frame_list = list()
for( i in 1:5 ){
d = create_data_frame(i)
data_frame_list = append(data_frame_list, list(d)) # wrap d in list() so it is added as a single element
}
Better still, you can assign directly to indexed elements, even if those elements don't exist yet:
data_frame_list = list()
for( i in 1:5 ){
data_frame_list[[i]] = create_data_frame(i)
}
This applies to vectors, too. But if you want to create a vector c(1,2,3,4) just use 1:4, or its underlying function seq.
Of course, lapply or the *lply functions from plyr are often better than looping depending on your application.
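As a rough illustration of the lapply route, the whole loop collapses to one call. create_data_frame is a hypothetical stand-in for whatever produces a data frame on each iteration:

```r
# hypothetical stand-in for the code that builds one data frame per iteration
create_data_frame <- function(i) data.frame(iter = i, x = seq_len(i))

# lapply() returns the list of data frames directly, one per input value
data_frame_list <- lapply(1:5, create_data_frame)

length(data_frame_list)
# [1] 5
nrow(data_frame_list[[3]])
# [1] 3
```

Individual data frames are then available as data_frame_list[[1]], data_frame_list[[2]] and so on.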
Continuing with your for loop method, here's a little example of creating and accessing.
> my_numbers <- vector('list', 4)
> for (i in 1:4) my_numbers[[i]] <- data.frame(x = seq(i))
And we can access the first column of each data frame with,
> sapply(my_numbers, "[", 1)
# $x
# [1] 1
#
# $x
# [1] 1 2
#
# $x
# [1] 1 2 3
#
# $x
# [1] 1 2 3 4
Other ways of accessing the data are my_numbers[[1]] for the first data set,
lapply(my_numbers, "[", 1,) to access the first row of each data frame, etc.
You can use operator [[ ]] for this purpose.
l <- list()
df1 <- data.frame(name = 'df1', a = 1:5 , b = letters[1:5])
df2 <- data.frame(name = 'df2', a = 6:10 , b = letters[6:10])
df3 <- data.frame(name = 'df3', a = 11:20 , b = letters[11:20])
df <- rbind(df1,df2,df3)
for(df_name in unique(df$name)){
l[[df_name]] <- df[df$name == df_name,]
}
In this example, the three separate data frames are first combined into one with rbind(); the for loop then pulls out the rows belonging to each name and stores them in the list.
With the [[ operator we can also name each list element as we like while assigning it.
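For this particular pattern, where one data frame is cut into pieces by the values of a column, base R's split() does the same job in a single call. Here is a short sketch using two of the data frames from the example:

```r
df1 <- data.frame(name = "df1", a = 1:5, b = letters[1:5])
df2 <- data.frame(name = "df2", a = 6:10, b = letters[6:10])
df <- rbind(df1, df2)

# split() returns a list of data frames named after the grouping values
l <- split(df, df$name)
names(l)
# [1] "df1" "df2"
```

Each element, for example l[["df2"]], holds the rows whose name column matches.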
