Multiple bind_rows with R dplyr

I need to bind_rows 27 Excel files. Although I could do it manually, I want to do it with a loop. The problem with the loop below is that each iteration binds the still-empty df to file i and overwrites the result, so every file before the last one is lost. How can I fix this?
library(readxl)
library(dplyr)
nm <- list.files(path="sample/sample/sample/")
df <- data.frame()
for(i in 1:27){
my_data <- bind_rows(df, read_excel(path = nm[i]))
}

We can loop over the files with map, read each one with read_excel, and row-bind the results in a single step with map_dfr:
library(purrr)
my_data <- map_dfr(nm, read_excel)
In the OP's code, the issue is that each iteration creates a temporary dataset 'my_data' from the still-empty 'df'; instead, the result should be bound back into the 'df' that was already created:
for(i in 1:27){
df <- rbind(df, read_excel(path = nm[i]))
}

We could use sapply with simplify = FALSE and then bind_rows the resulting list:
result <- sapply(nm, read_excel, simplify = FALSE) %>%
  bind_rows(.id = "id")
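Because sapply keeps the input file names as the names of the list elements, .id = "id" adds a column recording which file each row came from.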

You can still use your for loop:
my_data <- vector('list', 27)
for(i in 1:27){
  my_data[[i]] <- read_excel(path = nm[i])
}
do.call(rbind, my_data)
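If the 27 files do not all share exactly the same columns, do.call(rbind, ...) will error; bind_rows handles that case by filling missing columns with NA:
my_data_all <- bind_rows(my_data)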

Related

Loop through multiple files in different folders in R

I have folders named 2021ClientA, 2021ClientB, ...; each folder has a summary text file named summary.csv.
I want to create a loop that locates the folders starting with 2021, reads each folder's summary.csv, and rbinds them at the end to create one dataframe.
Setting the current directory to the folder containing the 2021XXX folders, you can do:
folders <- list.files(pattern = "^2021")
n <- length(folders)
dfs <- vector("list", length = n)
for(i in 1:n){
dfs[[i]] <- read.csv(file.path(folders[i], "summary.csv"))
}
bigdf <- do.call(rbind, dfs)
Another option, shown here with iris written out as example summary.csv files (the folders must already exist; the 2022 folder is there to show it gets skipped):
iris %>% write.csv('~/Desktop/Stack/2021CustomerA/summary.csv')
iris %>% write.csv('~/Desktop/Stack/2021CustomerB/summary.csv')
iris %>% write.csv('~/Desktop/Stack/2021CustomerC/summary.csv')
iris %>% write.csv('~/Desktop/Stack/2022CustomerA/summary.csv')
setwd('~/Desktop/Stack/')
read_csv_files <- data.frame()
for(i in list.files()){
if(grepl('2021',i)){
setwd(sprintf('~/Desktop/Stack/%s',i))
}else{
next
}
new_df <- read.csv('summary.csv')
read_csv_files <- rbind(read_csv_files,new_df)
setwd('~/Desktop/Stack/')
}
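The repeated setwd() calls are not strictly necessary; a minimal sketch of the same idea built on full paths instead (same example layout assumed):
folders <- list.files('~/Desktop/Stack', pattern = '^2021', full.names = TRUE)
read_csv_files <- do.call(rbind, lapply(file.path(folders, 'summary.csv'), read.csv))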

stack different dataframes in a function

I am having trouble stacking a variable number of data frames within a for loop. Can someone please help me?
# Load libraries
library(dplyr)
library(tidyverse)
library(here)
A function that should open all the Excel files and combine them into a single data frame using plyr::rbind.fill():
stackDfs <- function(filenames){
for(fn in filenames){
df1 <- openxlsx::read.xlsx(here::here("folder1", "folder2", fn), sheet=6, detectDates = TRUE)
# ... do some additional mutations here
}
all_dfs <- plyr::rbind.fill(df1, df2, df3, df4, ...)
return(all_dfs)
}
Here I define which files should be opened and call the stacking function. The number of files to be stacked should be variable.
filenames <- c("filexy-20210202.xlsx", "filexy-2021-20210205.xlsx")
stackDfs(filenames)
With the current setup, we can initialize a list, read the data into the list, and then apply rbind.fill within do.call:
stackDfs <- function(filenames){
  out_lst <- vector('list', length(filenames))
  names(out_lst) <- filenames
  for(fn in filenames){
    out_lst[[fn]] <- openxlsx::read.xlsx(here::here("folder1",
        "folder2", fn), sheet = 6, detectDates = TRUE)
    #....
  }
  all_dfs <- do.call(plyr::rbind.fill, out_lst)
  all_dfs
}
In line with the comment posted about using purrr, if you put all the files in the same folder you can use a package called fs instead of manually entering all the file names:
library(tidyverse)
library(here)
library(fs)
filenames <- fs::dir_ls(here("files"))
all_dfs <- filenames %>%
  map_dfr(openxlsx::read.xlsx, sheet = 6, detectDates = TRUE)
all_dfs
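If the folder holds other file types as well, dir_ls can filter them out with its regexp argument; for example, assuming the Excel files end in .xlsx:
filenames <- fs::dir_ls(here("files"), regexp = "\\.xlsx$")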

Writing a csv file in R with parameters in the file name

I am doing a small log-processing project in R. I am trying to write a function that takes a dataframe and writes it to a csv file whose name is built from some parameters (dataframe name, today's date, etc.).
I have made some progress but didn't manage to write the csv. I hope the code is reproducible.
library(dplyr)
library(readr)
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- paste0('"',"./logs/",dfname, "_", Sys.Date(),'.csv"')
dfpath <- as.data.frame(dfpath)
df %>% write_excel_csv(dfpath)
}
wrt_csv(mtcars)
EDIT- this is a final version that works well. Thanks to Ronak Shah.
wd<- getwd()
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- paste0(wd,'/logs/',dfname, '_', Sys.Date(),'.csv')
df %>% write_excel_csv(dfpath)
}
However, I now have a bunch of dataframes that I want to run the function on. Should I put them in a list? This didn't quite work:
l <- list(df1,df2)
lapply(l , wrt_csv)
Any thoughts?
Thanks!
Keep dfpath as a string. Try:
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- paste0('./logs/', dfname, '_', Sys.Date(), '.csv')  # assumes a ./logs directory exists
write.csv(df, dfpath, row.names = FALSE)
#Or same as OP
#df %>% write_excel_csv(dfpath)
}
wrt_csv(mtcars)
We can also do
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- sprintf('./logs/%s_%s.csv', dfname, Sys.Date())
write.csv(df, dfpath, row.names = FALSE)
}
wrt_csv(mtcars)
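As for running the function over several dataframes: inside lapply, deparse(substitute(df)) no longer sees the original object names (it evaluates to something like "X[[i]]"), so the file names come out wrong. A minimal sketch that instead passes the names explicitly from a named list (wrt_csv2 and the data frames here are hypothetical):
l <- list(df1 = df1, df2 = df2)  # a named list of your data frames
wrt_csv2 <- function(df, dfname) {
  dfpath <- paste0('./logs/', dfname, '_', Sys.Date(), '.csv')  # assumes ./logs exists
  write.csv(df, dfpath, row.names = FALSE)
}
Map(wrt_csv2, l, names(l))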

R efficiently bind_rows over many dataframes stored on the hard drive

I have roughly 50000 .rda files. Each contains a dataframe named results with exactly one row. I would like to append them all into one dataframe.
I tried the following, which works, but is slow:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
load(files[1])
results_table = results
rm(results)
for(i in c(2:length(files))) {
print(paste("We are at step ", i,sep=""))
load(files[i])
results_table= bind_rows(list(results_table, results))
rm(results)
}
Is there a more efficient way to do this?
Using .rds files would be a little easier (a sketch of that route follows the code below). But if we are limited to .rda, the following might be useful. I'm not certain whether it is faster than what you have done:
library(purrr)
library(dplyr)
library(tidyr)
## make and write some sample data to .rda
x <- 1:10
fake_files <- function(x){
df <- tibble(x = x)
save(df, file = here::here(paste0(as.character(x),
".rda")))
return(NULL)
}
purrr::map(x,
~fake_files(x = .x))
## map and load the .rda files into a single tibble
load_rda <- function(file) {
foo <- load(file = file) # foo just provides the name of the objects loaded
return(df) # note df is the name of the rda returned object
}
rda_files <- tibble(files = list.files(path = here::here(""),
pattern = "\\.rda$",
full.names = TRUE)) %>%
mutate(data = pmap(., ~load_rda(file = .x))) %>%
unnest(data)
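For comparison, a minimal sketch of the .rds route mentioned above (this assumes the one-row data frames were saved with saveRDS instead, one per file):
files <- list.files(root_dir, pattern = "\\.rds$", full.names = TRUE)
results_table <- dplyr::bind_rows(lapply(files, readRDS))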
This is untested code but should be pretty efficient:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
data_list <- lapply("mydata.rda", function(f) {
message("loading file: ", f)
name <- load(f) # this should capture the name of the loaded object
return(eval(parse(text = name))) # returns the object with the name saved in `name`
})
results_table <- data.table::rbindlist(data_list)
data.table::rbindlist is very similar to dplyr::bind_rows but a little faster.

Need help converting a for loop to lapply or sapply

I am working in R and learning how to code. I have written a piece of code using a for loop and I find it very slow. I was wondering if I could get some assistance converting it to use either the sapply or lapply function. Here is my working R code:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
files_list <- list.files(directory, full.names=TRUE) #creates a list of files
dat <- data.frame() #creates an empty data frame
for (i in seq_along(files_list)) {
#loops through the files, rbinding them together
dat <- rbind(dat, read.csv(files_list[i]))
}
dat_subset <- filter(dat, dat$ID %in% id) #subsets the rows that match the 'ID' argument
mean(dat_subset[, pollutant], na.rm=TRUE) #identifies the Mean of a Pollutant
}
pollutantmean("specdata", "sulfate", 1:10)
This code takes almost 20 seconds to return, which is unacceptable for 332 files. Imagine if I had a dataset with 10K files and wanted to get the mean of those variables.
You can rbind all elements in a list using do.call, and you can read in all the files into that list using lapply:
mean(
  filter( # here's the filter that will be applied to the rbind-ed data
    do.call("rbind", # call "rbind" on all elements of a list
      lapply( # create a list by reading in the files from list.files()
        # add any necessary args to read.csv:
        list.files("[::DIR_PATH::]"), function(x) read.csv(file = x, ...)
      )
    ),
    ID %in% id)[[pollutant]], # replace id as needed; [[pollutant]] picks the column named in the argument
  na.rm = TRUE
)
The reason your code is slow is that you are incrementally growing your dataframe in the loop. One way to do this using dplyr and map_df from purrr is:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
files_list <- list.files(directory, full.names=TRUE)
purrr::map_df(files_list, read.csv) %>%
filter(ID %in% id) %>%
summarise_at(pollutant, mean, na.rm = TRUE)
}
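It can be called the same way as the original, for example:
pollutantmean("specdata", "sulfate", 1:10)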
