I am trying to append multiple sheets from multiple Excel files. For instance, each Excel file has 10 sheets, each in a different format, but the 10 sheets of one Excel file have the same names and formats as the corresponding 10 sheets of every other Excel file. Essentially, each Excel file holds different types of information for a different country, but the types of information collected are the same for each country (population, pollution index, GDP, etc.). I have many countries, so I'm thinking of using a loop.
I use "report_1h" as the master Excel file, and append sheets of other Excel files into the master file's sheets.
library(rio)
x1_data <- import_list("report_1h.xlsx")
report_list <- list.files(path = 'E:/Report_folder', pattern = '*.xlsx')
sheet_ <- data.frame()
for (file in report_list){
book <- import_list(file)
for (i in 1:31){
sheet_[i] <- rbind(x1_data[[i]][,],book[[i]][,])
x1_data[[i]][,] <- sheet_[i]
}
}
The loop is intended to append sheets from each Excel file to sheets of the master file "report_1h". But it gives error:
Error in `[<-.data.frame`(`*tmp*`, i, value = c("Data Source(s):", "Data Source(s):", :
replacement has 2 rows, data has 0
Can someone tell me why?
Here's a way to do this -
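library(tidyverse)
library(readxl) # readxl is installed with the tidyverse but not attached by library(tidyverse)
#Get all the filenames, with full paths so they can be read from any working directory
report_list <- list.files(path = 'E:/Report_folder', pattern = '\\.xlsx$', full.names = TRUE)
#Create a list of dataframes
map(report_list, function(x) {
  sheets <- excel_sheets(x)
  map(sheets, function(y) read_excel(x, y))
}) %>% transpose() %>%
  map(bind_rows) -> result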
#assign sheet names
names(result) <- paste0('Sheet', seq_along(result))
#Write master excel file
writexl::write_xlsx(result, 'master_excel.xlsx')
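On the "why" part of the question: as far as I can tell, the error comes from `sheet_ <- data.frame()`, which has zero rows, so `sheet_[i] <- ...` tries to put a multi-row result into a zero-row data frame ("replacement has 2 rows, data has 0"). A minimal sketch of the same loop that appends directly into a list instead, using small made-up data frames in place of the imported workbooks (the names `master` and `books` and the toy columns are assumptions for illustration, not from the question):

```r
# Toy stand-ins for import_list() output: each "workbook" is a named list of data frames
master <- list(
  population = data.frame(country = "A", value = 1),
  gdp        = data.frame(country = "A", value = 10)
)
books <- list(
  list(population = data.frame(country = "B", value = 2),
       gdp        = data.frame(country = "B", value = 20)),
  list(population = data.frame(country = "C", value = 3),
       gdp        = data.frame(country = "C", value = 30))
)

# Append each workbook's sheets to the matching master sheet, matched by sheet name
for (book in books) {
  for (sheet in names(master)) {
    master[[sheet]] <- rbind(master[[sheet]], book[[sheet]])
  }
}

nrow(master$population)
```

A list can hold data frames of any size, so there is no row-count conflict when a sheet grows.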
So, I have many excel files in a folder, and each file has multiple sheets. If the name of the excel file is 'xyz', I want each sheet of each excel file to contain a 'new_column' such that each row of the new column will contain the excel file name (in this example, 'xyz').
Is there any direct way to do that? I would prefer to directly alter the files in the folder without creating new dataframes within rstudio.
Thanks.
You can use double lapply -
library(readxl)
library(writexl)
#Get a vector of xlsx filenames
filenames <- list.files(pattern = '\\.xlsx$', full.names = TRUE)
lapply(filenames, function(x) {
#Read the sheet names
sheetname <- excel_sheets(x)
#For each sheet read the data and create list of dataframe
lapply(sheetname, function(y) {
cbind(read_xlsx(x, y), new_column = tools::file_path_sans_ext(basename(x)))
}) -> res
#Assign names to the list
names(res) <- sheetname
#Write the data back
write_xlsx(res, x)
})
I have 36 excel files and need to read only one sheet from each of them into a dataframe. The sheet names are not all the same but they share a common string (e.g., "Expense Audit", "Jan Expense Audit", "19 Expense Audit"). I would like to write a function to list.files and then use read_excel to pull only the sheets containing the "Expense Audit" string into a single dataframe.
You can try :
library(purrr)
library(readxl)
#List all the excel files
file_path <- list.files(path = '/path/to/excel/files/', pattern = '\\.xlsx$', full.names = TRUE)
#Read each excel file and combine them in one dataframe
map_df(file_path, ~{
#get all the names of the sheet
sheets <- excel_sheets(.x)
#Select the one which has 'Expense Audit' in them
sheet_name <- grep('Expense Audit', sheets, value = TRUE)
#Read the excel with the sheet name
read_excel(.x, sheet_name)
}) -> data
data
I have written a function that, given the path to a folder, takes all the Excel files inside it and merges them into a data frame with some modest modifications.
Yet there are two small things I would like to add but struggle with:
Each file has a country code in its name, and I would like the function to create an additional column in the data frame, "Country", where each observation is assigned that country code. File name example: BGR_CO2_May12
Each file is composed of many sheets, with each sheet representing a year; the sheets are also named after these years. I would like the function to create another column, "Year", where each observation is assigned the name of the sheet it comes from.
Is there a neat way to do it? Possibly without modifying the current function?
multmerge_xls_TEST <- function(mypath2) {
library(dplyr)
library(readxl)
library(XLConnect)
library(XLConnectJars)
library(stringr)
# This function gets the list of files in a given folder
re_file <- ".+\\.xls.?"
testFiles <- list.files(path = mypath2,
pattern = re_file,
full.names = TRUE)
# This function rbinds in a single dataframe the content of multiple sheets in the same workbook
# (assuming that all the sheets have the same column types)
# It also removes the first sheet (no data there)
rbindAllSheets <- function(file) {
wb <- loadWorkbook(file)
removeSheet(wb, sheet = 1)
sheets <- getSheets(wb)
do.call(rbind,
lapply(sheets, function(sheet) {
readWorksheet(wb, sheet)
})
)
}
# Getting a single dataframe for all the Excel files and cutting out the unnecessary variables
result <- do.call(rbind, lapply(testFiles, rbindAllSheets))
result <- result[,c(1,2,31)]
result
}
Try making a wrapper around readWorksheet(). This would store the file name in the variable Country and the sheet name in Year. You would need to apply a regex to the file name, though, to get the country code only.
You could also skip the wrapper and simply add the mutate() line within your current function. Note this uses the dplyr package, which you already have referenced.
read_worksheet <- function(sheet, wb, file) {
readWorksheet(wb, sheet) %>%
mutate(Country = file,
Year = sheet)
}
So then you could do something like this within the function you already have.
rbindAllSheets <- function(file) {
wb <- loadWorkbook(file)
removeSheet(wb, sheet = 1)
sheets <- getSheets(wb)
do.call(rbind,
lapply(sheets, read_worksheet, wb = wb, file = file)
)
}
As another note, bind_rows() is another dplyr function which can take the place of your do.call(rbind, ...) calls.
bind_rows(lapply(sheets, read_worksheet, wb = wb, file = file))
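For the regex step mentioned above, here is a minimal sketch. It assumes the country code is always the part of the file name before the first underscore, as in BGR_CO2_May12. The helper name `country_code` and the example path are hypothetical.

```r
# Hypothetical helper: extract the text before the first underscore in the file name
country_code <- function(file) {
  # Strip the directory and the extension, e.g. "BGR_CO2_May12"
  name <- tools::file_path_sans_ext(basename(file))
  # Drop everything from the first underscore onwards
  sub("_.*$", "", name)
}

country_code("C:/data/BGR_CO2_May12.xlsx")
```

Inside read_worksheet() you could then write Country = country_code(file) instead of Country = file.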
I am supposed to load the data for my master's thesis, which is stored in 74 Excel workbooks, into an R data frame. Every workbook has 4 worksheets, called animals, features, r_words, and verbs. All of the worksheets have the same 12 variables (starttime, word, endtime, ID, etc.). I want to concatenate every worksheet under the one before, so the resulting data frame should have 12 columns, and the number of rows depends on how many answers the 74 subjects produced.
I want to use the readxl-package of the tidyverse and followed this article: https://readxl.tidyverse.org/articles/articles/readxl-workflows.html#csv-caching-and-iterating-over-sheets.
The first problem I face is how to read all 4 worksheets with read_excel(path, sheet = "animals", "features", "r_words", "verbs"). This only works with the first worksheet, so I tried to make a list with all the sheet-names (object sheet). This is also not working. And when I try to use the following code with just one worksheet, the next line throws an error:
Error in basename(.) : a character vector argument expected
So, here is a part of my code, hopefully fulfilling the requirements:
filenames <- list.files("data", pattern = '\\.xlsm',full.names = TRUE)
# indices
subfile_nos <- 1:length(filenames)
# function to read all the sheets in at once and cache to csv
read_then_csv <- function(sheet, path) {
for (i in 1:length(filenames)){
sheet <- excel_sheets(filenames[i])
len.sheet <- 1:length(sheet)
path <- read_excel(filenames[i], sheet = sheet[i]) #only reading in the first sheet
pathbase <- path %>%
basename() %>% #Error in basename(.) : a character vector argument expected
tools::file_path_sans_ext()
path %>%
read_excel(sheet = sheet) %>%
write_csv(paste0(pathbase, "-", sheet, ".csv"))
}
}
You should do a double loop or a nested map, like so:
library(dplyr)
library(purrr)
library(readxl)
library(readr) # for write_csv()
# I suggest looking at
?purrr::map_df
# Function to read all the sheets in at once and save as csv
read_then_csv <- function(input_filenames, output_file) {
# Iterate over files and concatenate results
map_df(input_filenames, function(f){
# Iterate over sheets and concatenate results
excel_sheets(f) %>%
map_df(function(sh){
read_excel(f, sh)
})
}) %>%
# Write csv
write_csv(output_file)
}
# Test function
filenames <- list.files("data", pattern = '\\.xlsm',full.names = TRUE)
read_then_csv(filenames, 'my_output.csv')
You say 'I want to concatenate every worksheet under the one before'. The script below reads every sheet from every file into a nested list (one element per file, one data frame per sheet). Test it on a COPY of your data, in case it doesn't do what you want/need it to do.
# load names of excel files (adjust the pattern if your files use another extension, e.g. .xlsm)
files = list.files(path = "C:\\your_path_here\\", full.names = TRUE, pattern = "\\.xlsx$")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
sheets <- readxl::excel_sheets(filename)
sapply(sheets, function(f) as.data.frame(readxl::read_excel(filename, sheet = f)),
simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
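To get from that nested list to a single data frame, one option is to flatten it by one level and then row-bind the pieces. This is a minimal sketch under the assumption that every sheet has the same columns. The toy `all_data` below is made up to stand in for the real result, and `all_sheets` and `combined` are illustrative names.

```r
# Toy stand-in for all_data: one element per file, each a named list of sheets
all_data <- list(
  list(animals = data.frame(ID = 1, word = "cat"),
       verbs   = data.frame(ID = 1, word = "run")),
  list(animals = data.frame(ID = 2, word = "dog"),
       verbs   = data.frame(ID = 2, word = "eat"))
)

# Drop one level of nesting so every sheet becomes a top-level list element
all_sheets <- unlist(all_data, recursive = FALSE)

# Stack all sheets; unname() keeps sheet names from being treated as argument names
combined <- do.call(rbind, unname(all_sheets))

dim(combined)
```

If the sheets differ in their columns, rbind() will stop with an error, so it is worth checking names() on a few sheets first.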
I need to read specific sheets from a list of Excel files.
I have >500 Excel files, both ".xls" and ".xlsx".
Each file can have different sheets, but I just want to read the sheets whose names match a specific expression, like pattern = "^Abc", and not all files have sheets matching this pattern.
I've written code that reads one file, but when I try to extend it to multiple files, it always returns an error.
library(readxl)
# example with 3rd file
# 2 sheets have the pattern
list_excels <- list.files(path = "path_to_folder", pattern = ".xls*")
sheet_names <- excel_sheets(list_excels[[3]])
list_sheets <- lapply(excel_sheets(list_excels[[3]]), read_excel, path = list_excels[[3]])
names(list_sheets) <- sheet_names
do.call("rbind", list_sheets[grepl(pattern = "^Abc", sheet_names)])
But when I try to write code that reads multiple Excel files, I either get an error or something in the loop slows the computation down a lot.
Here are some examples.
This is a loop that doesn't return an error, but it takes at least 30 seconds for each element of the list; I've never waited for it to finish.
for (i in seq_along(list_excels)) {
sheet_names <- excel_sheets(list_excels[[i]])
list_sheets <- lapply(excel_sheets(list_excels[[i]]), read_excel, path = list_excels[[i]])
names(list_sheets) <- sheet_names[i]
list_sheets[grepl(pattern = "^Abc", sheet_names)]
}
This loop is missing the final part, merging the sheets selected with this code:
list_sheets[grepl(pattern = "^Abc", sheet_names)]
I've tried to count the rows of the matching sheets in each file and store the totals in a vector, but I think the loop breaks when a file has no sheet matching the pattern.
x <- c()
for(i in seq_along(list_excels)) {
x[i] <- nrow(do.call("rbind",
lapply(excel_sheets(list_excels[[i]]),
read_excel,
path = list_excels[[i]])[grepl(pattern = "^Abc",
excel_sheets(list_excels[[i]]))]))
}
I also tried the purrr library to read everything, with the same result as the first loop example.
list_test <- list()
for(i in seq_along(list_excels)) {
list_test[[i]] <- excel_sheets(list_excels[[i]]) %>%
set_names() %>%
map(read_excel, path = list_excels[[i]])
}
The last example works with one Excel file, but not with multiple. It just reads a named sheet.
# One file works
data.frame(readWorksheetFromFile(list_excels[[1]], sheet = "Abc"))
#Multiple file returns an error
for(i in seq_along(list_excels)) {
data.frame(readWorksheetFromFile(list_excels[[i]], sheet = "Abc"))
}
#Returns the following error
#Error: IllegalArgumentException (Java): Sheet index (-1) is out of range (0..1)
Could someone help me?
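One thing that might help is to filter the sheet names before reading anything, so only the matching sheets are read and files with no match are skipped instead of breaking the loop. This is a rough sketch rather than a tested solution for your files. The helper `read_matching_sheets` is hypothetical, it assumes readxl is installed for real use, and it assumes all matching sheets share the same columns. The reader functions are passed as parameters only so the selection logic can be tried on the made-up `fake_books` data below without real workbooks.

```r
# Hypothetical helper: read only the sheets whose names match `pattern`.
# `list_sheets` and `read_sheet` default to readxl functions (assumed installed).
read_matching_sheets <- function(file, pattern,
                                 list_sheets = readxl::excel_sheets,
                                 read_sheet = readxl::read_excel) {
  wanted <- grep(pattern, list_sheets(file), value = TRUE)
  if (length(wanted) == 0) return(NULL)  # no matching sheet: skip this file
  do.call(rbind, lapply(wanted, function(sh) {
    as.data.frame(read_sheet(file, sheet = sh))
  }))
}

# Toy stand-ins for two workbooks, only one of which has "Abc" sheets
fake_books <- list(
  a.xlsx = list(Abc1 = data.frame(x = 1:2), Abc2 = data.frame(x = 3),
                Other = data.frame(x = 99)),
  b.xls  = list(Other = data.frame(x = 99))
)
fake_list <- function(file) names(fake_books[[file]])
fake_read <- function(file, sheet) fake_books[[file]][[sheet]]

per_file <- lapply(names(fake_books), read_matching_sheets, pattern = "^Abc",
                   list_sheets = fake_list, read_sheet = fake_read)

# Drop the skipped files, then stack what is left
result <- do.call(rbind, Filter(Negate(is.null), per_file))
```

With real files you would call lapply(list_excels, read_matching_sheets, pattern = "^Abc") and leave the two reader arguments at their defaults. list.files() probably also needs full.names = TRUE so the paths resolve from any working directory.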