How to add many data frame columns efficiently in R
I need to add several thousand columns to a data frame. Currently, I have a list of 93 lists, where each of the embedded lists contains 4 data frames, each with 19 variables. I want to add every column of all those data frames to an outside data frame. My code looks like:
vars <- c('tmin_F','tavg_F','tmax_F','pp','etr_grass','etr_alfalfa','vpd','rhmin','rhmax','dtr_F','us','shum','pp_def_grass','pp_def_alfalfa','rw_tot','fdd28_F0','fdd32_F0','fdd35_F0',
'fdd356_F0','fdd36_F0','fdd38_F0','fdd39_F0','fdd392_F0','fdd40_F0','fdd41_F0','fdd44_F0','fdd45_F0','fdd464_F0','fdd48_F0','fdd50_F0','fdd52_F0','fdd536_F0','fdd55_F0',
'fdd57_F0','fdd59_F0','fdd60_F0','fdd65_F0','fdd70_F0','fdd72_F0','hdd40_F0','hdd45_F0','hdd50_F0','hdd55_F0','hdd57_F0','hdd60_F0','hdd65_F0','hdd45_F0',
'cdd45_F0','cdd50_F0','cdd55_F0','cdd57_F0','cdd60_F0','cdd65_F0','cdd70_F0','cdd72_F0',
'gdd32_F0','gdd35_F0','gdd356_F0','gdd38_F0','gdd39_F0','gdd392_F0','gdd40_F0','gdd41_F0','gdd44_F0','gdd45_F0',
'gdd464_F0','gdd48_F0','gdd50_F0','gdd52_F0','gdd536_F0','gdd55_F0','gdd57_F0','gdd59_F0','gdd60_F0','gdd65_F0','gdd70_F0','gdd72_F0',
'gddmod_32_59_F0','gddmod_32_788_F0','gddmod_356_788_F0','gddmod_392_86_F0','gddmod_41_86_F0','gddmod_464_86_F0','gddmod_48_86_F0','gddmod_50_86_F0','gddmod_536_95_F0',
'sdd77_F0','sdd86_F0','sdd95_F0','sdd97_F0','sdd99_F0','sdd104_F0','sdd113_F0')
windows <- c(15,15,15,29,29,29,15,15,15,15,29,29,29,29,15,rep(15,78))
perc_list <- c('obs','smoothed_obs','windowed_obs','smoothed_windowed_obs')
percs <- c('00','02','05','10','20','25','30','33','40','50','60','66','70','75','80','90','95','98','100')
vcols <- seq(1,19,1)
for (v in 1:93){
  for (pl in 1:4){
    for (p in 1:19){
      normals_1981_2010 <- normals_1981_2010 %>%
        mutate(!!paste0(vars[v],'_daily',perc_list[pl],'_perc',percs[p]) := percents[[v]][[pl]][,vcols[p]])}}
  print(v)}
The code starts fast, but very quickly slows to a crawl as the outside data frame grows in size. I didn't realize this would be a problem. How do I add all these extra columns efficiently? Is there a better way to do this than by using mutate? I've tried add_column, but that does not work. Maybe it doesn't like the loop or something.
Your example is not reproducible as is (the object normals_1981_2010 doesn't exist but is called within the loop), so I am not sure I understood your question.
If I did though, this should work:
First, I am reproducing your dataset structure, except that instead of 93 lists I set it up to have 5, instead of 4 nested tables within each I set it up to have 3, and instead of each table having 19 columns I set them up to have 3 columns.
df_list <- vector("list", 5) # Create an empty list vector, then fill it in.
for(i in 1:5) {
df_list[[i]] <- vector("list", 3)
for(j in 1:3) {
df_list[[i]][[j]] <- data.frame(a = 1:12,
b = letters[1:12],
c = month.abb[1:12])
colnames(df_list[[i]][[j]]) <- paste0(colnames(df_list[[i]][[j]]), "_nest_", i, "subnest_", j)
}
}
df_list # preview the structure.
Then, answering your question:
# Now, how to bind everything together (bind_cols() comes from dplyr):
library(dplyr)
df_out <- vector("list", 5)
for(i in 1:5) {
df_out[[i]] <- bind_cols(df_list[[i]])
}
# Final step
df_out <- bind_cols(df_out)
ncol(df_out) # Here I have 5*3*3 = 45 columns, but you will have 93*4*19 = 7068 columns
# [1] 45
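If you also need the columns named the way your mutate() call built them, one option is to rename each nested data frame first and then bind everything to normals_1981_2010 in a single pass. This is only a sketch, assuming percents, vars, perc_list and percs are exactly as defined in your question and that every nested data frame has the same number of rows as normals_1981_2010:
library(dplyr)
renamed <- lapply(1:93, function(v) {
  bind_cols(lapply(1:4, function(pl) {
    d <- percents[[v]][[pl]]
    names(d) <- paste0(vars[v], '_daily', perc_list[pl], '_perc', percs)
    d
  }))
})
normals_1981_2010 <- bind_cols(normals_1981_2010, bind_cols(renamed))
Because the growing data frame is only touched once at the end, this avoids the slowdown you saw with repeated mutate() calls.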
Related
Using a loop to select column names from a list
I've been struggling with column selection with lists in R. I've loaded a bunch of csv's (all with different column names and different numbers of columns) with the goal of extracting all the columns that have the same name (just phone_number, Subregion, and PhoneType) and putting them together into a single data frame. I can get the columns I want out of one list element with this:
var <- data[[1]] %>% select("phone_number", "Subregion", "PhoneType")
But I cannot select the columns from all the elements in the list this way, just one at a time. I then tried a for loop that looks like this:
new.function <- function(a) {
  for(i in 1:a) {
    tst <- datas[[i]] %>% select("phone_number", "Subregion", "PhoneType")
  }
  print(tst)
}
But when I try new.function(5), I only get the columns from the 5th element. I know this might seem like a noob question for most, but I am struggling to learn lists and loops in R. I'm sure I'm missing something very easy to make this work. Thank you for your help.
Another way you could do this is to make a function that extracts your columns and apply it to all data frames in your list with lapply:
library(dplyr)
extractColumns = function(x){
  select(x, "phone_number", "Subregion", "PhoneType") # or x[, c("phone_number", "Subregion", "PhoneType")]
}
final_df = lapply(data, extractColumns) %>% bind_rows()
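Equivalently, purrr collapses the lapply() plus bind_rows() steps into one call. A small sketch, assuming the same data list and extractColumns() function as above:
library(purrr)
final_df <- map_dfr(data, extractColumns)  # apply to each element, then row-bind the results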
The way you have your loop set up currently only saves the last iteration of the loop, because tst is not set up to store more than a single value and is overwritten with each step of the loop. You can establish tst as a list first with tst <- list(), then be explicit in your code that each step is saved as a separate element in the list by adding brackets and an index to tst. Here is a full example the way you were doing it:
# Example data.frame that could be in datas
df_1 <- data.frame("not_selected" = rep(0, 5),
                   "phone_number" = rep("1-800", 5),
                   "Subregion" = rep("earth", 5),
                   "PhoneType" = rep("flip", 5))
# Another bare data.frame that could be in datas
df_2 <- data.frame("also_not_selected" = rep(0, 5),
                   "phone_number" = rep("8675309", 5),
                   "Subregion" = rep("mars", 5),
                   "PhoneType" = rep("razr", 5))
# datas is a list of data.frames; we want to pull only specific columns from all of them
datas <- list(df_1, df_2)
# Create a list to store the new data.frames in once columns are selected
tst <- list()
# Function for looping through 'a' elements
new.function <- function(a) {
  for(i in 1:a) {
    tst[[i]] <- datas[[i]] %>% select("phone_number", "Subregion", "PhoneType")
  }
  print(tst)
}
# Proof of concept for 2 elements
new.function(2)
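Since the stated goal was a single combined data frame, here is a small follow-up sketch (not part of the answer above). Note that assigning to tst inside the function only modifies a local copy, so returning the list and row-binding it is the reliable way to get the combined result out:
library(dplyr)
new.function <- function(a) {
  tst <- list()
  for (i in 1:a) {
    tst[[i]] <- datas[[i]] %>% select("phone_number", "Subregion", "PhoneType")
  }
  tst  # return the list instead of only printing it
}
combined <- bind_rows(new.function(2))  # one data frame with all the selected rows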
R: Transpose a results table and add column headers
Setting the scene: I have a directory with 50 .csv files in it. All files have unique names, e.g. 1.csv, 2.csv, ... The contents of each may vary in the number of rows but always have 4 columns. The column headers are: Date, Result 1, Result 2, ID. I want them all merged together into one dataframe (mydf), and then I'd like to ignore any rows where there is an NA value, so that I can count how many complete instances of an "ID" there were, by calling for example:
myfunc("my_files", 1)
myfunc("my_files", c(2,4,6))
My code so far:
myfunc <- function(directory, id = 1:50) {
  files_list <- list.files(directory, full.names=T)
  mydf <- data.frame()
  for (i in 1:50) {
    mydf <- rbind(mydf, read.csv(files_list[i]))
  }
  mydf_subset <- mydf[which(mydf[, "ID"] %in% id),]
  mydf_subna <- na.omit(mydf_subset)
  table(mydf_subna$ID)
}
My issues and where I need help: my results come out this way:
  2   4   6
200 400 600
and I'd like to transpose them to be like this (I'm not sure if calling a table is right, or should I call it as.matrix perhaps?):
2 100
4 400
8 600
I'd also like to have either the headers from the original files or assign new ones:
ID Count
 2   100
 4   400
 8   600
Any and all advice is welcome. Matt
Additional update: I tried amending the code to incorporate some of the helpful comments below, so I also have a set of code that looks like this:
myfunc <- function(directory, id = 1:50) {
  files_list <- list.files(directory, full.names=T)
  mydf <- data.frame()
  for (i in 1:50) {
    mydf <- rbind(mydf, read.csv(files_list[i]))
  }
  mydf_subset <- mydf[which(mydf[, "ID"] %in% id),]
  mydf_subna <- na.omit(mydf_subset)
  result <- data.frame(mydf_subna$ID)
  transposed_result <- t(result)
  colnames(transposed_result) <- c("ID","Count")
}
which I try to call with this:
myfunc("myfiles", 1)
myfunc("myfiles", c(2, 4, 6))
but I get this error:
> myfunc("myfiles", c(2, 4, 6))
Error in `colnames<-`(`*tmp*`, value = c("ID", "Count")) :
  length of 'dimnames' [2] not equal to array extent
I wonder if perhaps I'm not creating this data frame correctly and should be using cbind, or not summing the rows by ID maybe?
You want to change your function to create a data frame rather than a table, and then transpose that data frame. Change the line
table(mydf_subna$ID)
to instead be
result <- data.frame(mydf_subna$ID)
then use the t() function, which transposes your data frame:
transposed_result <- t(result)
colnames(transposed_result) <- c("ID","Count")
Welcome to Stack Overflow. I am assuming that the function you have written returns the table, which is saved in the variable ans. You may give this code a try:
ans <- myfunc("my_files", c(2,4,6))
ans2 <- data.frame(ans)
colnames(ans2) <- c('ID', 'Count')
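A related option, sketched here rather than taken from either answer: as.data.frame() applied to a table already produces the long ID/count layout, so the function itself could return the two-column data frame directly, with no transposing needed:
myfunc <- function(directory, id = 1:50) {
  files_list <- list.files(directory, full.names = TRUE)
  mydf <- do.call(rbind, lapply(files_list, read.csv))
  mydf_subna <- na.omit(mydf[mydf[, "ID"] %in% id, ])
  counts <- as.data.frame(table(mydf_subna$ID))  # two columns: Var1 (the ID) and Freq (the count)
  setNames(counts, c("ID", "Count"))
}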
Access variable dataframe in R loop
If I am working with dataframes in a loop, how can I use a variable data frame name (and additionally, variable column names) to access data frame contents?
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
for (i in seq_along(dfnames)){
  curr.dfname <- dfnames[i]
  # how can I do this:
  curr.dfname$X <- 42:52
  # ...this
  dfnames[i]$X <- 42:52
  # or even this doubly variable call
  for (j in 1_seq_along(colnames(curr.dfname)){
    curr.dfname$[colnames(temp[j])] <- 42:52
  }
}
You can use get() to return a variable reference based on a string of its name:
> x <- 1:10
> get("x")
 [1]  1  2  3  4  5  6  7  8  9 10
So, yes, you could iterate through dfnames like:
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
for (cur.dfname in dfnames) {
  cur.df <- get(cur.dfname)
  # for a fixed column name
  cur.df$X <- 42:52
  # iterating through column names as well
  for (j in colnames(cur.df)) {
    cur.df[, j] <- 42:52
  }
}
I really think that this is going to be a painful approach, though. As the commenters say, if you can get the data frames into a list and then iterate through that, it'll probably perform better and be more readable. Unfortunately, get() isn't vectorised as far as I'm aware, so if you only have a character vector of data frame names, you'll have to iterate through that to get a data frame list:
# build data frame list
df.list <- list()
for (i in 1:length(dfnames)) {
  df.list[[i]] <- get(dfnames[i])
}
# iterate through data frames
for (cur.df in df.list) {
  cur.df$X <- 42:52
}
Hope that helps!
2018 update: I probably wouldn't do something like this anymore. Instead, I'd put the data frames in a list and then use purrr::map() or, the base equivalent, lapply():
library(tidyverse)
stuff_to_do = function(mydata) {
  mydata$somecol = 42:52
  # … anything else I want to do to the current data frame
  mydata # return it
}
df_list = list(df1, df2)
map(df_list, stuff_to_do)
This brings back a list of modified data frames (although you can use variants of map(), namely map_dfr() and map_dfc(), to automatically bind the list of processed data frames row-wise or column-wise respectively). The former uses column names to join, rather than column positions, and it can also add an ID column using the .id argument and the names of the input list. So it comes with some nice added functionality over lapply()!
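One caveat worth sketching (it is not covered in the answer above): modifying cur.df only changes a local copy, so to push the change back into the named object you would pair get() with assign(). A minimal sketch, with the modification itself just a placeholder:
for (cur.dfname in dfnames) {
  cur.df <- get(cur.dfname)
  cur.df$X <- cur.df$X * 2     # some modification of the local copy
  assign(cur.dfname, cur.df)   # write the modified copy back under the original name
}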
Adding data frames into a list within a for loop
I have a for loop that generates a dataframe every time it loops through. I am trying to create a list of data frames, but I cannot seem to figure out a good way to do this. For example, with vectors I usually do something like this:
my_numbers <- c()
for (i in 1:4){
  my_numbers <- c(my_numbers, i)
}
This will result in the vector c(1,2,3,4). I want to do something similar with dataframes, but accessing the list of data frames is quite difficult when I use my_dataframes <- list(my_dataframes, DATAFRAME). Help please. The main goal is just to create a list of dataframes that I can later access dataframe by dataframe. Thank you.
I'm sure you've noticed that list does not do what you want it to do, nor should it. c also doesn't work in this case because it flattens data frames, even when recursive=FALSE. You can use append, as in:
data_frame_list = list()
for (i in 1:5) {
  d = create_data_frame(i)  # create_data_frame() stands in for whatever builds your data frame
  data_frame_list = append(data_frame_list, list(d))
}
Better still, you can assign directly to indexed elements, even if those elements don't exist yet:
data_frame_list = list()
for (i in 1:5) {
  data_frame_list[[i]] = create_data_frame(i)
}
This applies to vectors, too. But if you want to create the vector c(1,2,3,4), just use 1:4, or its underlying function seq. Of course, lapply or the *lply functions from plyr are often better than looping, depending on your application.
Continuing with your for loop method, here's a little example of creating and accessing:
> my_numbers <- vector('list', 4)
> for (i in 1:4) my_numbers[[i]] <- data.frame(x = seq(i))
And we can access the first column of each data frame with:
> sapply(my_numbers, "[", 1)
# $x
# [1] 1
#
# $x
# [1] 1 2
#
# $x
# [1] 1 2 3
#
# $x
# [1] 1 2 3 4
Other ways of accessing the data are my_numbers[[1]] for the first data set, lapply(my_numbers, "[", 1, ) to access the first row of each data frame, etc.
You can use the [[ ]] operator for this purpose:
l <- list()
df1 <- data.frame(name = 'df1', a = 1:5, b = letters[1:5])
df2 <- data.frame(name = 'df2', a = 6:10, b = letters[6:10])
df3 <- data.frame(name = 'df3', a = 11:20, b = letters[11:20])
df <- rbind(df1, df2, df3)
for (df_name in unique(df$name)) {
  l[[df_name]] <- df[df$name == df_name, ]
}
In this example, there are three separate data frames; in order to store them in a list using a for loop, we first combine them into one. Using the [[ operator we can even name each data frame in the list as we want while storing it.
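For what it's worth (an alternative, not part of the answer above), base R's split() builds the same named list in one call when the data already sit in a single data frame:
l <- split(df, df$name)  # named list: l$df1, l$df2, l$df3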
rbinding a list of lists of dataframes based on nested order
I have a dataframe, df, and a function process that returns a list of two dataframes, a and b. I use dlply to split up df on an id column, and then return a list of lists of dataframes. Here's sample data/code that approximates the actual data and methods:
df <- data.frame(id1 = rep(c(1,2,3,4), each = 2))
process <- function(df) {
  a <- data.frame(d1 = rnorm(1), d2 = rnorm(1))
  b <- data.frame(id1 = df$id1, a = rnorm(nrow(df)), b = runif(nrow(df)))
  list(a = a, b = b)
}
require(plyr)
output <- dlply(df, .(id1), process)
output is a list of lists of dataframes; the nested list will always have two dataframes, named a and b. In this case the outer list has length 4. What I am looking to generate is a dataframe with all the a dataframes, along with an id column indicating their respective value (I believe this is left in the list as the split_labels attribute, see str(output)). Then similarly for the b dataframes. So far I have in part used this question to come up with this code:
list <- unlist(output, recursive = FALSE)
list.a <- lapply(1:4, function(x) {
  list[[(2*x) - 1]]
})
all.a <- rbind.fill(list.a)
Which gives me the final a dataframe (and likewise for b with a different subscript into list), however it doesn't have the id column I need and I'm pretty sure there's got to be a more straightforward or elegant solution. Ideally something clean using plyr.
Not very clean, but you can try something like this (assuming the same data generation process):
list.aID <- lapply(1:4, function(x) {
  cbind(list[[(2*x) - 1]], list[[2*x]][1, 1, drop = FALSE])
})
all.aID <- rbind.fill(list.aID)
all.aID
        d1       d2 id1
1  0.68103 -0.74023   1
2 -0.50684  1.23713   2
3  0.33795 -0.37277   3
4  0.37827  0.56892   4
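A tidier alternative worth sketching, using dplyr instead of plyr: since output is a named list keyed by the id1 values, you can pull out each a element and let bind_rows() add the id column from the list names:
library(dplyr)
all.a <- bind_rows(lapply(output, `[[`, "a"), .id = "id1")
# the b data frames already carry id1, so they can be bound directly
all.b <- bind_rows(lapply(output, `[[`, "b"))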