Loop through multiple files in different folders in R

I have folders named 2021ClientA, 2021ClientB, .... Each folder has a summary text file named summary.csv.
I want to create a loop that locates the folders starting with 2021, extracts their summary.csv files, and rbinds them at the end to create one data frame.

After setting the working directory to the folder containing the 2021XXX folders, you can do:
folders <- list.files(pattern = "^2021")
n <- length(folders)
dfs <- vector("list", length = n)
for (i in 1:n) {
  dfs[[i]] <- read.csv(file.path(folders[i], "summary.csv"))
}
bigdf <- do.call(rbind, dfs)
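If you also want to record which folder each row came from, a small extension of the loop above is possible (the source_folder column name is just an example, not something from the original question):
for (i in 1:n) {
  dfs[[i]] <- read.csv(file.path(folders[i], "summary.csv"))
  dfs[[i]]$source_folder <- folders[i]  # tag each row with its 2021* folder
}
bigdf <- do.call(rbind, dfs)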

library(dplyr)

# example setup: create some 2021* folders, each containing a summary.csv
# (the folders are assumed to already exist)
iris %>% write.csv('~/Desktop/Stack/2021CustomerA/summary.csv')
iris %>% write.csv('~/Desktop/Stack/2021CustomerB/summary.csv')
iris %>% write.csv('~/Desktop/Stack/2021CustomerC/summary.csv')
iris %>% write.csv('~/Desktop/Stack/2022CustomerA/summary.csv')

# loop over everything in ~/Desktop/Stack and read summary.csv from the 2021* folders
setwd('~/Desktop/Stack/')
read_csv_files <- data.frame()
for (i in list.files()) {
  if (grepl('^2021', i)) {
    setwd(sprintf('~/Desktop/Stack/%s', i))
  } else {
    next
  }
  new_df <- read.csv('summary.csv')
  read_csv_files <- rbind(read_csv_files, new_df)
  setwd('~/Desktop/Stack/')
}

Related

Multiple bind_rows with R Dplyr

I need to bind_rows() 27 Excel files. Although I can do it manually, I want to do it with a loop. The problem with the loop is that it binds the first file to i and then the first file to i+1, hence losing i. How can I fix this?
nm <- list.files(path = "sample/sample/sample/")
df <- data.frame()
for (i in 1:27) {
  my_data <- bind_rows(df, read_excel(path = nm[i]))
}
We could loop over the files with map, read the data with read_excel, and rbind with the _dfr suffix:
library(purrr)
library(readxl)  # for read_excel
my_data <- map_dfr(nm, read_excel)
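If you also want a column recording which file each row came from, map_dfr() accepts an .id argument that uses the names of its input. A small sketch (the "source_file" column name is just an example):
my_data <- map_dfr(set_names(nm), read_excel, .id = "source_file")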
In the OP's code, the issue is that each iteration creates a temporary dataset 'my_data'; instead, the result should be bound back to the 'df' created earlier:
for (i in 1:27) {
  df <- rbind(df, read_excel(path = nm[i]))
}
We could use sapply, keeping the results in a list and binding them at the end:
result <- sapply(nm, read_excel, simplify = FALSE) %>%
  bind_rows(.id = "id")
You can still use your for loop:
my_data <- vector('list', 27)
for (i in 1:27) {
  my_data[[i]] <- read_excel(path = nm[i])
}
do.call(rbind, my_data)

stack different dataframes in a function

I am having trouble stacking a variable number of data frames within a for loop. Can someone please help me?
# Load libraries
library(dplyr)
library(tidyverse)
library(here)
A function that should open all Excel files and create one single data frame using plyr::rbind.fill():
stackDfs <- function(filenames){
  for(fn in filenames){
    df1 <- openxlsx::read.xlsx(here::here("folder1", "folder2", fn), sheet = 6, detectDates = TRUE)
    # ... do some additional mutations here
  }
  all_dfs <- plyr::rbind.fill(df1, df2, df3, df4, ...)
  return(all_dfs)
}
Here I define which files should be opened and call the stacking function. The number of files to be stacked should be variable.
filenames <- c("filexy-20210202.xlsx", "filexy-2021-20210205.xlsx")
stackDfs(filenames)
With the current setup, we can initialize a list, read the data into the list, and then apply rbind.fill within do.call:
stackDfs <- function(filenames){
  out_lst <- vector('list', length(filenames))
  names(out_lst) <- filenames
  for(fn in filenames){
    out_lst[[fn]] <- openxlsx::read.xlsx(here::here("folder1", "folder2", fn),
                                         sheet = 6, detectDates = TRUE)
    #....
  }
  all_dfs <- do.call(plyr::rbind.fill, out_lst)
  all_dfs
}
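For example, with the same filenames vector as in the question:
filenames <- c("filexy-20210202.xlsx", "filexy-2021-20210205.xlsx")
all_dfs <- stackDfs(filenames)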
In line with the comment posted about using purrr, if you put all the files in the same folder you can use a package called fs instead of manually entering all the file names:
library(tidyverse)
library(here)
library(fs)
filenames <- fs::dir_ls(here("files"))
all_dfs <- filenames %>%
  map_dfr(openxlsx::read.xlsx, sheet = 6, detectDates = TRUE)
all_dfs
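If the folder also contains files you do not want to read, dir_ls() takes a glob argument to restrict the match (a small addition, assuming the files of interest end in .xlsx):
filenames <- fs::dir_ls(here("files"), glob = "*.xlsx")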

R efficiently bind_rows over many dataframes stored on harddrive

I have roughly 50000 .rda files. Each contains a dataframe named results with exactly one row. I would like to append them all into one dataframe.
I tried the following, which works, but is slow:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
load(files[1])
results_table = results
rm(results)
for (i in 2:length(files)) {
  print(paste("We are at step ", i, sep = ""))
  load(files[i])
  results_table <- bind_rows(list(results_table, results))
  rm(results)
}
Is there a more efficient way to do this?
Using .rds is a little bit easier (a short .rds sketch follows after the example below). But if we are limited to .rda, the following might be useful. I'm not certain whether it is faster than what you have done:
library(purrr)
library(dplyr)
library(tidyr)
## make and write some sample data to .rda
x <- 1:10
fake_files <- function(x){
  df <- tibble(x = x)
  save(df, file = here::here(paste0(as.character(x), ".rda")))
  return(NULL)
}
purrr::map(x, ~ fake_files(x = .x))
## map and load the .rda files into a single tibble
load_rda <- function(file) {
  foo <- load(file = file)  # foo just provides the name of the objects loaded
  return(df)                # note: df is the name of the object stored in each .rda
}
rda_files <- tibble(files = list.files(path = here::here(""),
                                       pattern = "*.rda",
                                       full.names = TRUE)) %>%
  mutate(data = pmap(., ~ load_rda(file = .x))) %>%
  unnest(data)
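For comparison, a minimal sketch of the .rds route mentioned above, assuming you can control how the files are written in the first place (readRDS() returns the object directly, so no name lookup is needed):
## hypothetical: write the same sample data as .rds instead of .rda
fake_files_rds <- function(x) {
  saveRDS(tibble(x = x), file = here::here(paste0(x, ".rds")))
}
purrr::walk(1:10, fake_files_rds)

## reading back is then a one-liner per file
rds_files <- list.files(here::here(""), pattern = "\\.rds$", full.names = TRUE)
all_results <- purrr::map_dfr(rds_files, readRDS)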
This is untested code but should be pretty efficient:
root_dir <- paste(path, "models/", sep = "")
files <- paste(root_dir, list.files(root_dir), sep = "")
data_list <- lapply(files, function(f) {
  message("loading file: ", f)
  name <- load(f)                   # load() returns the name of the loaded object
  return(eval(parse(text = name)))  # return the object with the name saved in `name`
})
results_table <- data.table::rbindlist(data_list)
data.table::rbindlist is very similar to dplyr::bind_rows but a little faster.
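If the one-row data frames do not all share exactly the same columns, rbindlist() can also pad the missing ones; a small variation on the call above:
results_table <- data.table::rbindlist(data_list, use.names = TRUE, fill = TRUE)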

Extract rows from csv files in R

I want to extract the lat/long data plus the file name from each csv. I have done the following:
#libraries-----
library(readr)
library("dplyr")
library("tidyverse")
# set wd-----EXAMPLE
setwd("F:/mydata/myfiles/allcsv")
# have R read files as list -----
list <- list.files("F:/mydata/myfiles/allcsv", pattern=NULL, all.files=FALSE,
full.names=FALSE)
list
# lapply function
row.names <- c("Date=0", "Time=3", "Type=2", "Model=1", "Coordinates=nextrow",
               "Latitude = 38.3356", "Longitude = 51.3323")
AllData <- lapply(list, read.table,
                  skip = 5, header = FALSE, sep = ";", row.names = row.names, col.names = NULL)
PulledRows <- lapply(AllData, function(DF)
  DF[fileone$Latitude == 38.3356, fileone$Longitude == 51.3323]
)
# maybe i need to specify a for loop?
[screenshot: how my data looks]
Thank you.
This should work for you. You may have to change the path if the .csv files are not in your working directory, and also the location where the final results are saved.
results <- data.frame(Latitude = NA, Longitude = NA, FileName = NA)  # create empty dataframe
for (i in 1:length(list)) {  # loop through each file obtained from list (called above)
  dat <- read_csv(list[i], col_names = FALSE)      # read in the ith dataset
  df <- data.frame(dat[6, 1], dat[7, 1], list[i])  # new dataframe with values from dat
  df[, 1] <- as.numeric(str_remove(df[, 1], 'Latitude='))  # remove text and make numeric
  df[, 2] <- as.numeric(str_remove(df[, 2], 'Longitude='))
  names(df) <- names(results)    # having the same column names allows the next line
  results <- rbind(results, df)  # 'stacks' the results dataframe and df dataframe
}
results <- na.omit(results) # remove missing values (first row)
write_csv(results,'desired/path')

Need help converting a for loop to lapply or sapply

I am working in R and learning how to code. I have written a piece of code using a for loop, and I find it very slow. I was wondering if I could get some help converting it to use either the sapply or lapply function. Here is my working R code:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)  # creates a list of files
  dat <- data.frame()                                     # creates an empty data frame
  for (i in seq_along(files_list)) {
    # loops through the files, rbinding them together
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- filter(dat, dat$ID %in% id)    # subsets the rows that match the 'ID' argument
  mean(dat_subset[, pollutant], na.rm = TRUE)  # the mean of the pollutant
}
pollutantmean("specdata", "sulfate", 1:10)
This code takes almost 20 seconds to return, which seems unacceptable for 332 records. Imagine if I had a dataset with 10K records and wanted to get the mean of those variables.
You can rbind all elements in a list using do.call, and you can read in all the files into that list using lapply:
mean(
  filter(                # the filter applied to the rbind-ed data
    do.call("rbind",     # call "rbind" on all elements of a list
      lapply(            # create a list by reading in the files from list.files()
        # add any further args read.csv needs:
        list.files("[::DIR_PATH::]", full.names = TRUE),
        function(x) read.csv(file = x)
      )
    ),
    ID %in% id           # make sure id is replaced with what you want
  )[[pollutant]],        # pollutant holds a column name, so index with [[ rather than $
  na.rm = TRUE
)
The reason your code is slow is that you are incrementally growing your data frame in the loop. One way to do this using dplyr and map_df from purrr is:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  purrr::map_df(files_list, read.csv) %>%
    filter(ID %in% id) %>%
    summarise_at(pollutant, mean, na.rm = TRUE)
}
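Called the same way as the original, e.g.:
pollutantmean("specdata", "sulfate", 1:10)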
