How to consolidate several Excel files with several sheets in R?

I want to consolidate 7 Excel files, each with several sheets, in R. All sheets have the same structure. I tried a for loop, but the result is either only the last workbook or an error. The code is:
files <- list.files(pattern = ".xlsx")
sheets <- excel_sheets(files)
library(xlsx)
setwd("C:/Users/User/Documents")
for(i in 1:7){
file <- files
vari <- sheets %>%
set_names() %>%
map_df(~ read_excel(path,skip = 5 ,sheet = .x), .id = "sheet")
}
Thanks...

I will offer this solution: compose a data.frame with the list of files and the sheets in each, then use map2_dfr().
library(tidyverse)
library(readxl)  # excel_sheets() is not attached by library(tidyverse)
setwd("C:/tmp")
path <- list.files(".", pattern = "\\.xlsx$", full.names = TRUE)
# One row per file/sheet combination
df_files <- data.frame(files = path) %>%
  rowwise() %>%
  mutate(sheets = list(excel_sheets(files))) %>%
  unnest(sheets)
# Read every sheet of every file and row-bind the results
df <- map2_dfr(.x = df_files$files, .y = df_files$sheets, readxl::read_xlsx)
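If you also want to record which file and sheet each row came from, a small variation on the same idea (the `file` and `sheet` column names are my own choice):

```r
library(dplyr)
library(purrr)
library(readxl)

# df_files as built above: one row per file/sheet pair
read_one <- function(file, sheet) {
  read_xlsx(file, sheet = sheet) %>%
    mutate(file = basename(file), sheet = sheet, .before = 1)
}
df <- map2_dfr(df_files$files, df_files$sheets, read_one)
```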

Related

write_csv for a list of csv files maintaining original file names

I have a list of data frames (df_list) and I want a tidyverse approach to write them all back to csv files while maintaining their original file names.
So far I did:
df = dir(pattern = "\\.csv$", full.names = TRUE)
df_list = vector("list", length(df))
for(i in seq_along(df))
{
df_list[[i]] = read.csv(df[[i]], sep = ";")
}
imap(df_list, ~write_csv(.x, paste0(.y,".csv")))
my current output is:
1.csv; 2.csv; 3.csv ...
The below will read in a set of files from an example directory, apply a function to those files, then save the files with the exact same names.
library(purrr)
library(dplyr)
# Create example directory with example .csv files
dir.create(path = "example")
data.frame(x1 = letters) %>% write.csv(., file = "example/example1.csv")
data.frame(x2 = 1:20) %>% write.csv(., file = "example/example2.csv")
# Get relative paths of all .csv files in the example subdirectory
path_list <- list.files(pattern = "example.*csv", recursive = TRUE) %>%
as.list()
# Read every file into list
file_list <- path_list %>%
map(~ read.csv(.x, sep = ","))
# Do something to the data
file_list_updated <- file_list %>%
map( ~ .x %>% mutate(foo = 5))
# Write the updated files to the old file names
map2(.x = file_list_updated,
.y = path_list,
~ write.csv(x = .x, file = .y))
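For what it's worth, the original imap() approach also works once the list elements are named with their file paths, since .y is then the element name rather than the index. A sketch, assuming the paths are collected in `df` as in the question:

```r
library(purrr)
library(readr)

df <- dir(pattern = "\\.csv$", full.names = TRUE)
df_list <- map(df, read.csv, sep = ";") %>%
  set_names(df)                      # names become .y inside imap()
imap(df_list, ~ write_csv(.x, .y))   # writes each data frame back under its original name
```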

Apply function to files from different folders (R)

I discovered R a couple of years ago and it has been very handy for cleaning up data frames, preparing data, and handling other basic tasks.
Now I would like to use R to apply the same basic treatment to many files stored in different folders at once.
Here is the script I would like to turn into a single function that loops through my folders "dataset_2006" and "dataset_2007" and does all the work.
library(dplyr)
library(readr)
library(sf)
library(purrr)
setwd("C:/Users/Downloads/global_data/dataset_2006")
shp2006 <- list.files(pattern = 'data_2006.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2006, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args=listOfShp)
#import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2006_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
set_names() %>%
map_dfr(.f = read_delim,
delim = ";",
.id = "file_name")
new_shp_2006 <- merge(combinedShp, csv_data , by = "ID") %>% filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2006 , "new_shp_2006.shp", overwrite = TRUE)
setwd("C:/Users/Downloads/global_data/dataset_2007")
shp2007 <- list.files(pattern = 'data_2007.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2007, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args=listOfShp)
#import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2007_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
set_names() %>%
map_dfr(.f = read_delim,
delim = ";",
.id = "file_name")
new_shp_2007 <- merge(combinedShp, csv_data , by = "ID") %>% filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2007 , "new_shp_2007.shp", overwrite = TRUE)
This is easy to achieve with a for loop over multiple directories. To allow wildcards, we can combine it with Sys.glob():
myfunction <- function(directories) {
for(dir in Sys.glob(directories)) {
# do something with a single dir
print(dir)
}
}
# you can specify multiple directories manually:
myfunction(c('C:/Users/Downloads/global_data/dataset_2006',
'C:/Users/Downloads/global_data/dataset_2007'))
# or use a wildcard to automatically get all files/directories that match the pattern:
myfunction('C:/Users/Downloads/global_data/dataset_200*')
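Putting the two together, the body of the question's script can move inside the loop. A sketch (paths, patterns, and the `label` filter are taken from the question; `delete_layer = TRUE` is used in place of the question's `overwrite` argument, since that is how sf replaces an existing layer):

```r
library(dplyr)
library(readr)
library(sf)
library(purrr)

process_dataset <- function(dir) {
  year <- sub(".*dataset_", "", dir)  # e.g. "2006" from ".../dataset_2006"
  # Combine all shapefiles for this year
  shp_files <- list.files(dir, pattern = paste0("data_", year, ".*\\.shp$"),
                          full.names = TRUE)
  combinedShp <- do.call(rbind, lapply(shp_files, st_read))
  # Import and merge the CSV files into one data frame
  csv_files <- list.files(dir, pattern = paste0("csv_", year, "_.*\\.csv$"),
                          full.names = TRUE)
  csv_data <- csv_files %>%
    set_names() %>%
    map_dfr(read_delim, delim = ";", .id = "file_name")
  # Join, filter, and write out the result
  merge(combinedShp, csv_data, by = "ID") %>%
    filter(label %in% c("AR45T", "GK879")) %>%
    st_write(file.path(dir, paste0("new_shp_", year, ".shp")),
             delete_layer = TRUE)
}

for (dir in Sys.glob("C:/Users/Downloads/global_data/dataset_200*")) {
  process_dataset(dir)
}
```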

How can I read multiple csvs and retain the number in the file name for each?

I have multiple csv files in a folder, none of which has a header. I want to preserve the order set out by the number at the end of each file name; the files are "output-1.csv", "output-2.csv" and so on. Is there a way to include the file name of each csv so I know which data corresponds to which file? An answer I found gets close to what I want.
library(tidyverse)
#' Load the data ----
mydata <-
list.files(path = "C:\\Users\\Documents\\Manuscripts\\experiment1\\output",
pattern = "*.csv") %>%
map_df( ~ read_csv(., col_names = F))
mydata
You can use:
library(tidyverse)
mydata <- list.files("C:\\Users\\Documents\\Manuscripts\\experiment1\\output",
                     pattern = "\\.csv$", full.names = TRUE) %>%
  set_names(str_sub(basename(.), 1, -5)) %>%
  map_dfr(read_csv, .id = "file")
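If the rows should also follow the numeric order of the file names (so that "output-10" sorts after "output-9" rather than after "output-1"), a sketch using readr::parse_number, assuming `mydata` as built above with a `file` column like "output-1", "output-2", ...:

```r
library(dplyr)
library(readr)

mydata <- mydata %>%
  mutate(file_no = parse_number(file)) %>%  # extract the trailing number
  arrange(file_no)
```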

Read multiple xlsx files with multiple sheets into one R data frame - set_names function issue [duplicate]

This question already has answers here:
Read multiple xlsx files with multiple sheets into one R data frame
(4 answers)
Closed 3 years ago.
I would like to read multiple xlsx files, each with multiple sheets, into R. I have a header (column names) for the first sheet of each file but not for the rest of the sheets; however, the columns are exactly the same.
I found this solution in a post:
dir_path <- "~/test_dir/" # target directory path where the xlsx files are located.
re_file <- "^test[0-9]\\.xlsx" # regex pattern to match the file name format, in this case 'test1.xlsx', 'test2.xlsx' etc, but could simply be 'xlsx'.
read_sheets <- function(dir_path, file){
xlsx_file <- paste0(dir_path, file)
xlsx_file %>%
excel_sheets() %>%
set_names() %>%
map_df(read_excel, path = xlsx_file, .id = 'sheet_name') %>%
mutate(file_name = file) %>%
select(file_name, sheet_name, everything())
}
df <- list.files(dir_path, re_file) %>%
map_df(~ read_sheets(dir_path, .))
but I can't figure out why it won't work; I get this error:
Error in set_names(.) : 1 argument passed to 'names<-' which requires 2
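As an aside, that particular error usually means set_names() is being picked up from magrittr rather than purrr: magrittr's set_names is an alias for `names<-`, which needs both the object and the names, while purrr::set_names() defaults the names to the values themselves. A possible fix is to load purrr and qualify the call, keeping the rest of the posted read_sheets() function unchanged:

```r
library(readxl)
library(purrr)
library(dplyr)

read_sheets <- function(dir_path, file) {
  xlsx_file <- paste0(dir_path, file)
  xlsx_file %>%
    excel_sheets() %>%
    purrr::set_names() %>%  # qualified, so magrittr's `names<-` alias cannot mask it
    map_df(read_excel, path = xlsx_file, .id = "sheet_name") %>%
    mutate(file_name = file) %>%
    select(file_name, sheet_name, everything())
}
```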
I've created this readxl solution with 2 excel workbooks, each having 2 sheets with the same columns. In your problem the 2nd sheet (and further) doesn't have the colnames, so they needed to be set with an additional if statement. It's probably not the fastest solution, but it works:
library(readxl)
# Set path
inputFolder <- "test/"
# Get list of files
fileList <- list.files(path = inputFolder, recursive = TRUE, pattern = "\\.xlsx$")
# Read in each sheet from each Excel workbook
for (f in 1:length(fileList)) {
  # Get the sheet names of this workbook
  sheetList <- excel_sheets(paste0(inputFolder, fileList[f]))
  for (s in 1:length(sheetList)) {
    tempSheet <- read_excel(paste0(inputFolder, fileList[f]), sheet = sheetList[s])
    if (f == 1 & s == 1) {
      df <- tempSheet
    } else {
      # Sheets after the first have no header row, so reuse the first sheet's column names
      if (s != 1) {
        names(tempSheet) <- names(df)
      }
      df <- rbind(df, tempSheet)
    }
  }
}
That seems to work. Here is a different means to the same end.
library(XLConnect)
testDir <- "C:\\your_path_here\\"
re_file <- ".+\\.xls.?"
testFiles <- list.files(testDir, re_file, full.names = TRUE)
# This function rbinds in a single dataframe
# the content of multiple sheets in the same workbook
# (assuming that all the sheets have the same column types)
rbindAllSheets <- function(file) {
wb <- loadWorkbook(file)
sheets <- getSheets(wb)
do.call(rbind,
lapply(sheets, function(sheet) {
readWorksheet(wb, sheet)
})
)
}
# Getting a single dataframe for all the Excel files
result <- do.call(rbind, lapply(testFiles, rbindAllSheets))

Import sheets from Excel files located in different folder in R

Basically I have two Excel files, both named "Checklist", in two different folders (one for 2018 and the other for 2019). Checklist has one sheet per month: "January", "February", etc. All the sheets have exactly the same variables.
I would like to put in the same data frame all the sheets from both Excel files.
For now, I can gather the sheets from one Excel File with :
library(readxl)
library(tibble)
read_excel_allsheets <- function(filename, tibble = TRUE) {
sheets <- readxl::excel_sheets(filename)
x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
if(!tibble) x <- lapply(x, as.data.frame)
names(x) <- sheets
x
}
mysheets <-read_excel_allsheets("C:/Users/Thiphaine/Documents/2018/Checklist.xlsx")
library(dplyr)
mysheets<-bind_rows(mysheets, .id = "column_label")
I just don't know how to write a loop that goes through the 2018 and 2019 folders to gather all the sheets from both Excel files. In 2020 there will be another folder, "2020", that should be included as well... Any idea? Thanks
Try this:
library(dplyr)
allsheets <- list()
for (file in list.files(path = "PATH/TO/DOCUMENTS/",
                        recursive = TRUE, pattern = "\\.xlsx$", full.names = TRUE)) {
  mysheets <- read_excel_allsheets(file)
  mysheets <- bind_rows(mysheets, .id = "column_label")
  allsheets[[file]] <- mysheets
}
where PATH/TO/DOCUMENTS/ is probably something like "C:/Users/Thiphaine/Documents/" for you.
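To end up with a single data frame rather than a list, the result can then be collapsed with bind_rows() (the `source_file` column name is my own choice):

```r
library(dplyr)

# allsheets as built by the loop above: one combined data frame per file,
# named by its path
all_data <- bind_rows(allsheets, .id = "source_file")
```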
If you'd like, you can also vectorize it using a tidyverse approach, especially because all of your files have the same column names and you want to end up with a single data frame.
library(tidyverse)
df <- list.files(path = "your_path",
                 full.names = TRUE,
                 recursive = TRUE,
                 pattern = "\\.xls") %>%
  tibble(value = .) %>%  # tbl_df() is deprecated; build the tibble directly
  mutate(sheetName = map(value, readxl::excel_sheets)) %>%
  unnest(sheetName) %>%
  mutate(myFiles = purrr::map2(value, sheetName, function(x, y) {
    readxl::read_excel(x, sheet = y)
  })) %>%
  unnest(myFiles)
