stack different dataframes in a function - r

I am having trouble stacking a variable number of data frames within a for loop. Can someone please help me?
# Load libraries
library(dplyr)
library(tidyverse)
library(here)
A function that should open all Excel files and stack them into a single data frame using plyr::rbind.fill():
stackDfs <- function(filenames){
  for(fn in filenames){
    df1 <- openxlsx::read.xlsx(here::here("folder1", "folder2", fn), sheet=6, detectDates = TRUE)
    # ... do some additional mutations here
  }
  all_dfs <- plyr::rbind.fill(df1, df2, df3, df4, ...)
  return(all_dfs)
}
Here I define which files should be opened and call the stacking function. The number of files to be stacked should be variable.
filenames <- c("filexy-20210202.xlsx", "filexy-2021-20210205.xlsx")
stackDfs(filenames)

With the current setup, we can initialize a list, read the data into the list, and then apply rbind.fill within do.call:
stackDfs <- function(filenames){
  out_lst <- vector('list', length(filenames))
  names(out_lst) <- filenames
  for(fn in filenames){
    out_lst[[fn]] <- openxlsx::read.xlsx(here::here("folder1",
      "folder2", fn), sheet=6, detectDates = TRUE)
    #....
  }
  all_dfs <- do.call(plyr::rbind.fill, out_lst)
  all_dfs
}
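Since rbind.fill() may be unfamiliar, here is a tiny self-contained illustration (toy data frames, not the Excel files above) of why the do.call() pattern copes with any number of inputs: columns missing from one data frame are simply filled with NA.
df_a <- data.frame(id = 1:2, x = c("a", "b"))
df_b <- data.frame(id = 3:4, y = c(10, 20))
do.call(plyr::rbind.fill, list(df_a, df_b))
#>   id    x  y
#> 1  1    a NA
#> 2  2    b NA
#> 3  3 <NA> 10
#> 4  4 <NA> 20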

In line with the comment posted about using purrr, if you put all the files in the same folder you can use the fs package instead of manually entering all the file names:
library(tidyverse)
library(here)
library(fs)
filenames <- fs::dir_ls(here("files"))
all_dfs <- filenames %>%
  map_dfr(openxlsx::read.xlsx, sheet = 6, detectDates = TRUE)
all_dfs

Related

Use map() to read and merge another list of files

I have three file lists:
filesl <- list.files(pattern = 'RP-L-.*\\.xls', full.names = TRUE)
filesr <- list.files(pattern = 'RP-R-.*\\.xls', full.names = TRUE)
filesv <- list.files(pattern = 'RP-V-.*\\.xls', full.names = TRUE)
And I've attempted to read and merge two of them as follows:
data <- map2_dfr(filesl, filesr, ~ {
  ldat <- readxl::read_excel(.x)
  rdat <- readxl::read_excel(.y)
  df <- rbind.data.frame(ldat, rdat, by = c('ID', 'GR', 'SES', 'COND'))
})
If I wanted to add the third list, filesv, what function am I supposed to use? I guess I should use one from purrr that enables iteration over multiple elements (such as pmap, imap, and so on), but I am not able to adapt the commands properly.
Any suggestions?
Thanks in advance.
data <- pmap_dfr(list(filesl, filesr, filesv), ~ {
  ldat <- readxl::read_excel(..1)
  rdat <- readxl::read_excel(..2)
  vdat <- readxl::read_excel(..3)
  # rbind() stacks rows and takes no `by` argument, so drop it here
  rbind.data.frame(ldat, rdat, vdat)
})
pmap takes a list of lists to iterate over. The arguments can then be referred to as ..1, ..2, etc. in the anonymous function.
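A minimal self-contained sketch of that convention, using toy character vectors in place of the file paths, just to show how ..1, ..2 and ..3 line up with the three lists:
library(purrr)
pmap_chr(list(c("a", "b"), c("x", "y"), c("1", "2")),
         ~ paste(..1, ..2, ..3, sep = "-"))
#> [1] "a-x-1" "b-y-2"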

Create a list of tibbles with unique names using a for loop

I'm working on a project where I want to create a list of tibbles containing data that I read in from Excel. The idea will be to call on the columns of these different tibbles to perform analyses on them. But I'm stuck on how to name tibbles in a for loop with a name that changes based on the for loop variable. I'm not certain I'm going about this the correct way. Here is the code I've got so far.
filenames <- list.files(path = getwd(), pattern = "xlsx")
RawData <- list()
for(i in filenames) {
  RawData <- list(i <- tibble(read_xlsx(path = i, col_names = c('time', 'intesity'))))
}
I've also got the issue where, right now, the for loop overwrites RawData with each turn of the loop, but I think that is something I can remedy if I can get the naming convention to work. If there is another method or data structure that would better suit this task, I'm open to suggestions.
Cheers,
Your code overwrites RawData in each iteration. You should instead assign each new tibble into the list, e.g. RawData[[i]] <- read_xlsx(...).
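Spelled out, the corrected loop could look like this (a sketch; indexing the list by filename also gives each element a unique name, which is what the question was after; 'intesity' mirrors the column name used in the question):
RawData <- list()
for (i in filenames) {
  # assign into a named slot instead of overwriting the whole list
  RawData[[i]] <- read_xlsx(path = i, col_names = c('time', 'intesity'))
}
# a single tibble's column is then reachable as e.g. RawData[[ filenames[1] ]]$time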
A simpler way would be to use lapply instead of a for loop:
RawData <- lapply(
  filenames,
  read_xlsx,
  col_names = c('time', 'intesity')
)
Here is an approach with map from package purrr (note that set_names() must apply to the whole list, not inside the lambda):
library(tidyverse)
filenames <- list.files(path = getwd(), pattern = "xlsx")
mylist <- map(filenames, ~ read_xlsx(.x, col_names = c('time', 'intesity'))) %>%
  set_names(filenames)
Similar to the answer by @py_b, but this adds a column with the original file name to each element of the list:
filenames <- list.files(path = getwd(), pattern = "xlsx")
Raw_Data <- lapply(filenames, function(x) {
  out_tibble <- read_xlsx(path = x, col_names = c('time', 'intesity'))
  out_tibble$source_file <- basename(x) # add a column with the Excel file name
  return(out_tibble)
})
If you want to merge the list of tibbles into one big one, you can use do.call('rbind', Raw_Data).
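For example (dplyr::bind_rows() is an equivalent one-liner here, and it also tolerates tibbles whose columns differ, filling the gaps with NA):
combined <- do.call('rbind', Raw_Data)
# or equivalently:
combined <- dplyr::bind_rows(Raw_Data)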

import multiple CSV files and use file name as column [duplicate]

This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
I have numerous csv files in multiple directories that I want to read into an R tibble or data.table. I use list.files() with the recursive argument set to TRUE to create a list of file names and paths, then use lapply() to read in the csv files, and then bind_rows() to stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
This approach works fine. However, I need to extract a substring from each file name and add it as a column to the final table. I can get the substring I need with str_extract() like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck, however, on how to add the extracted substring as a column while lapply() runs read_csv() over each file.
I generally use the following approach, based on dplyr/tidyr:
data = tibble(File = filenames) %>%
  extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
  mutate(Data = lapply(File, read_csv)) %>%
  unnest(Data) %>%
  select(-File)
tidyverse approach:
Update:
readr 2.0 (and beyond) has built-in support for reading a list of files with the same columns into one output table in a single command: just pass the vector of file names to the reading function. For example, reading in csv files:
(files <- fs::dir_ls("D:/data", glob="*.csv"))
dat <- read_csv(files, id="path")
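Since the original question needed a substring of each file name, the id column can then be post-processed; a sketch reusing the question's str_extract() pattern:
library(readr)
library(dplyr)
library(stringr)
dat <- read_csv(files, id = "path") %>%
  mutate(site = str_extract(path, "[A-Z]{2}-[A-Za-z0-9]{3}"))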
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
# specify the directory, then read a list of files
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = ".csv$")
# return a single data frame w/ purrr::map_dfr
my_data = data_list %>%
  purrr::map_dfr(read_csv, .id = "source")
# Alternatively, rename source from the file path to the file name
my_data = data_list %>%
  purrr::map_dfr(read_csv, .id = "source") %>%
  dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}") # same length as filenames
library(purrr)
library(dplyr)
library(readr)
stopifnot(length(filenames)==length(sites)) # returns error if not the same length
ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y)) # .x is element in filenames, and .y is element in sites
The output of map2 is a list, similar to lapply
If you have a development version of purrr, you can use imap, which is a wrapper for map2 with an index
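For completeness, a sketch of the final bind step, plus the imap() variant just mentioned (naming the filenames vector first makes .y carry the site label):
tbl <- bind_rows(ans) # one data frame, same result as using map2_dfr directly

names(filenames) <- sites # imap() passes each element's name as .y
ans2 <- imap(filenames, ~ read_csv(.x) %>% mutate(id = .y))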
data.table approach:
If you name the list, then you can use the names to fill an id column when binding the list together.
workflow
files <- list.files( whatever... )
# read the files from the list
l <- lapply( files, fread )
# name the list using the basenames of `files`
# this is also the step where you can manipulate the file names to whatever you like
names(l) <- basename( files )
# bind the rows of the list together, putting the file names into the column "id"
dt <- rbindlist( l, idcol = "id" )
You just need to write your own function that reads the csv and adds the column you want, before combining them.
my_read_csv <- function(x) {
  out <- read_csv(x)
  site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
  cbind(Site = site, out)
}
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a file_names vector based on sites, with one entry per row of tbl, and then combine the two using cbind:
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
### Get length of each csv
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))
### Repeat sites using lengths
file_names <- rep(sites, file_lengths)
### Create table
tbl <- lapply(filenames, read_csv) %>%
  bind_rows()
### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)

R efficiently bind_rows over many dataframes stored on hard drive

I have roughly 50000 .rda files. Each contains a dataframe named results with exactly one row. I would like to append them all into one dataframe.
I tried the following, which works, but is slow:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
load(files[1])
results_table <- results
rm(results)
for(i in c(2:length(files))) {
  print(paste("We are at step ", i, sep=""))
  load(files[i])
  results_table <- bind_rows(list(results_table, results))
  rm(results)
}
Is there a more efficient way to do this?
Using .rds is a little bit easier (a short .rds sketch follows the example below). But if we are limited to .rda, the following might be useful. I'm not certain whether it is faster than what you have done:
library(purrr)
library(dplyr)
library(tidyr)
## make and write some sample data to .rda
x <- 1:10
fake_files <- function(x){
  df <- tibble(x = x)
  save(df, file = here::here(paste0(as.character(x), ".rda")))
  return(NULL)
}
purrr::map(x, ~fake_files(x = .x))
## map and load the .rda files into a single tibble
load_rda <- function(file) {
  foo <- load(file = file) # foo just captures the name(s) of the loaded objects
  return(df) # note: df is the name of the object saved into each .rda above
}
rda_files <- tibble(files = list.files(path = here::here(""),
                                       pattern = "*.rda",
                                       full.names = TRUE)) %>%
  mutate(data = pmap(., ~load_rda(file = .x))) %>%
  unnest(data)
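As promised above, a minimal sketch of the .rds round trip for comparison (hypothetical file names): saveRDS()/readRDS() work with the object directly, so no load()/object-name bookkeeping is needed.
# write one sample .rds file
saveRDS(tibble(x = 1), here::here("1.rds"))
# read all .rds files back and bind them into a single tibble
rds_files <- list.files(path = here::here(""), pattern = "\\.rds$", full.names = TRUE)
all_rds <- purrr::map_dfr(rds_files, readRDS)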
This is untested code but should be pretty efficient:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
data_list <- lapply(files, function(f) {
  message("loading file: ", f)
  name <- load(f) # load() returns the name(s) of the loaded object(s)
  return(eval(parse(text = name))) # return the object whose name is stored in `name`
})
results_table <- data.table::rbindlist(data_list)
data.table::rbindlist is very similar to dplyr::bind_rows but a little faster.
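One practical caveat (a one-line sketch): if the 50000 one-row data frames do not all share exactly the same columns, pass fill = TRUE so rbindlist() pads the missing ones with NA.
results_table <- data.table::rbindlist(data_list, fill = TRUE)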

Merge multiple excel files into R taking only 2nd sheet, retaining file name as 'data source'

I'm trying to merge multiple Excel files into a single data.frame in R. All files are pulled from a common folder, taking only the 2nd sheet, which will always have a specific name ('Value Assessment').
In addition, I want to retain each file name in a column, so the source of the merged data is maintained.
I've been able to load the files and merge them into one data.frame, but I can't figure out how to retain the file name as 'source name'.
setwd(".")
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list,read_excel)
df <- rbindlist(df.list, idcol = "id")
Using setNames():
file.list <- list.files(pattern = '*.xlsx')
file.list <- setNames(file.list, file.list)
df.list <- lapply(file.list, read_excel, sheet = 2)
df.list <- Map(function(df, name) {
  df$source_name <- name
  df
}, df.list, names(df.list))
df <- rbindlist(df.list, idcol = "id")
(Note: probably a typo, you were missing sheet = 2).
Try this to merge all data from all Excel files:
library(xlsx)
setwd("C:/Users/your_path_here/excel_files")
data.files <- list.files(pattern = "*.xlsx")
# read sheet 2 of every file, then bind the list into one data frame
data.list <- lapply(data.files, function(x) read.xlsx(x, sheetIndex = 2))
data <- do.call("rbind", data.list)
