I need to read specific sheets from a list of Excel files.
I have more than 500 Excel files, ".xls" and ".xlsx".
Each file can have different sheets, but I only want to read the sheets whose names match a specific pattern, like pattern = "^Abc", and not all files have sheets with this pattern.
I've written code that reads one file, but when I try to extend it to multiple files, it always returns an error.
# example with 3rd file
# 2 sheets have the pattern
list_excels <- list.files(path = "path_to_folder", pattern = "\\.xlsx?$")
sheet_names <- excel_sheets(list_excels[[3]])
list_sheets <- lapply(excel_sheets(list_excels[[3]]), read_excel, path = list_excels[[3]])
names(list_sheets) <- sheet_names
do.call("rbind", list_sheets[grepl(pattern = "^Abc", sheet_names)])
But when I try to write code that reads multiple Excel files, I get an error, or something in the loop slows the computation down a lot.
Here are some examples.
This is a loop that doesn't return an error, but it takes at least 30 seconds for each element of the list; I've never waited for it to finish.
for (i in seq_along(list_excels)) {
sheet_names <- excel_sheets(list_excels[[i]])
list_sheets <- lapply(excel_sheets(list_excels[[i]]), read_excel, path = list_excels[[i]])
names(list_sheets) <- sheet_names
}
This loop is missing the final part, merging the sheets, which would use this code:
list_sheets[grepl(pattern = "^Abc", sheet_names)]
I've tried to count the rows of each sheet and store them in a vector, but I think the loop breaks when a file has no sheet matching the pattern.
x <- c()
for(i in seq_along(list_excels)) {
x[i] <- nrow(do.call("rbind",
lapply(excel_sheets(list_excels[[i]]),
read_excel,
path = list_excels[[i]])[grepl(pattern = "^Abc",
                             excel_sheets(list_excels[[i]]))]))
}
I also tried with the purrr library, trying to read everything, with the same result as in the first loop example.
list_test <- list()
for(i in seq_along(list_excels)) {
list_test[[i]] <- excel_sheets(list_excels[[i]]) %>%
set_names() %>%
map(read_excel, path = list_excels[[i]])
}
A last example that works with one Excel file, but not with multiple; it just reads a named sheet.
# One file works
data.frame(readWorksheetFromFile(list_excels[[1]], sheet = "Abc"))
# Multiple files return an error
for(i in seq_along(list_excels)) {
  data.frame(readWorksheetFromFile(list_excels[[i]], sheet = "Abc"))
}
# Returns the following error:
# Error: IllegalArgumentException (Java): Sheet index (-1) is out of range (0..1)
Could someone help me?
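Not from the thread, but here is a minimal self-contained sketch of one way to do this (it assumes the readxl, writexl, and purrr packages, and uses two temporary example files in place of the real folder). The key point is to list each file's sheets first, keep only the ones matching the pattern, and return NULL for files with no match so they are simply skipped instead of triggering the "Sheet index (-1)" error:

```r
library(readxl)
library(writexl)
library(purrr)

# Create two small example files (stand-ins for the real folder)
f1 <- tempfile(fileext = ".xlsx")
f2 <- tempfile(fileext = ".xlsx")
write_xlsx(list(Abc1 = data.frame(x = 1:2), Other = data.frame(x = 9)), f1)
write_xlsx(list(Notes = data.frame(x = 0)), f2)  # no matching sheet

list_excels <- c(f1, f2)

# Read only the sheets whose names match `pattern`, or NULL if none do
read_matching_sheets <- function(file, pattern = "^Abc") {
  sheets <- excel_sheets(file)           # called once per file
  keep <- grep(pattern, sheets, value = TRUE)
  if (length(keep) == 0) return(NULL)    # file without the pattern: skip
  map_dfr(keep, ~ read_excel(file, sheet = .x))
}

# map_dfr row-binds the per-file results and drops the NULLs
result <- map_dfr(list_excels, read_matching_sheets)
```

Here `result` contains only the rows from the "Abc1" sheet of the first file; the second file contributes nothing because none of its sheet names match.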
Related
I am trying to append multiple sheets from multiple Excel files. For instance, each Excel file has 10 sheets (different formats), but the 10 sheets of one Excel file have the same names and formats as the corresponding 10 sheets of another Excel file. Essentially, each Excel file holds the different types of information for a different country, but the types of information collected are the same for each country (population, pollution index, GDP, etc.). And I have many countries, so I'm thinking of using a loop.
I use "report_1h" as the master Excel file, and append sheets of other Excel files into the master file's sheets.
library(rio)
x1_data <- import_list("report_1h.xlsx")
report_list <- list.files(path = 'E:/Report_folder', pattern = '*.xlsx')
sheet_ <- data.frame()
for (file in report_list){
book <- import_list(file)
for (i in 1:31){
sheet_[i] <- rbind(x1_data[[i]][,],book[[i]][,])
x1_data[[i]][,] <- sheet_[i]
}
}
The loop is intended to append sheets from each Excel file to sheets of the master file "report_1h". But it gives error:
Error in `[<-.data.frame`(`*tmp*`, i, value = c("Data Source(s):", "Data Source(s):", :
replacement has 2 rows, data has 0
Can someone tell me why?
Here's a way to do this -
library(tidyverse)
# get all the filenames
report_list <- list.files(path = 'E:/Report_folder', pattern = '*.xlsx$')
#Create a list of dataframes
map(report_list, function(x) {
sheets <- excel_sheets(x)
map(sheets, function(y) read_excel(x, y))
}) %>% transpose() %>%
map(bind_rows) -> result
#assign sheet names
names(result) <- paste0('Sheet', seq_along(result))
#Write master excel file
writexl::write_xlsx(result, 'master_excel.xlsx')
I have two lists, one with the Excel file paths that I would like to read, and another with the names that I would like to assign to each resulting dataframe. I'm trying to create a loop using the code below, but the loop only creates a single dataframe named n. Any idea how to make this work?
files <- c("file1.xlsx","file2.xlsx")
names <- c('name1','name2')
for (f in files) {
for (n in names) {
n <- read_excel(path = f)
}
}
You are overwriting n on each iteration of the loop.
Edit:
#Parfait commented that we shouldn't use assign if we can avoid it, and he is right (e.g. why-is-using-assign-bad)
This does not use assign and puts the data in a neat list:
files <- c("file1.xlsx","file2.xlsx")
names <- c('name1','name2')
result <- list()
for (i in seq_along(files)) {
result[[names[i]]] <- read_excel(path = files[i])
}
Old and not recommended answer (only left here for transparency reasons):
We can use assign to use a character string as variable name:
files <- c("file1.xlsx","file2.xlsx")
names <- c('name1','name2')
for (i in seq_along(files)) {
assign(names[i], read_excel(path = files[i]))
}
An alternative is to loop through all Excel files in a folder, rather than a list. I'm assuming they exist in some kind of folder, somewhere.
# load names of excel files
files = list.files(path = "C:/your_path_here/", full.names = TRUE, pattern = ".xlsx")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
sheets <- readxl::excel_sheets(filename)
sapply(sheets, function(f) as.data.frame(readxl::read_excel(filename, sheet = f)),
simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
I have written a function that, given the path of a folder, takes all the Excel files inside it and merges them into a data frame with some modest modifications.
Yet there are two small things I would like to add but am struggling with:
Each file has a country code in its name, and I would like the function to create an additional column in the data frame, "Country", where each observation would be assigned that country code. Name example: BGR_CO2_May12
Each file is composed of many sheets, with each sheet representing a year; the sheets are also named after these years. I would like the function to create another column, "Year", where each observation would be assigned the name of the sheet it comes from.
Is there a neat way to do it? Possibly without modifying the current function?
multmerge_xls_TEST <- function(mypath2) {
library(dplyr)
library(readxl)
library(XLConnect)
library(XLConnectJars)
library(stringr)
# This function gets the list of files in a given folder
re_file <- ".+\\.xls.?"
testFiles <- list.files(path = mypath2,
pattern = re_file,
full.names = TRUE)
# This function rbinds in a single dataframe the content of multiple sheets in the same workbook
# (assuming that all the sheets have the same column types)
# It also removes the first sheet (no data there)
rbindAllSheets <- function(file) {
wb <- loadWorkbook(file)
removeSheet(wb, sheet = 1)
sheets <- getSheets(wb)
do.call(rbind,
lapply(sheets, function(sheet) {
readWorksheet(wb, sheet)
})
)
}
# Getting a single dataframe for all the Excel files and cutting out the unnecessary variables
result <- do.call(rbind, lapply(testFiles, rbindAllSheets))
  result <- result[, c(1, 2, 31)]
  result
}
Try making a wrapper around readWorksheet(). This would store the file name into the variable Country and the sheet name into Year. You would need to do some regex on the file though to get the code only.
You could also skip the wrapper and simply add the mutate() line within your current function. Note this uses the dplyr package, which you already have referenced.
read_worksheet <- function(sheet, wb, file) {
readWorksheet(wb, sheet) %>%
mutate(Country = file,
Year = sheet)
}
So then you could do something like this within the function you already have.
rbindAllSheets <- function(file) {
wb <- loadWorkbook(file)
removeSheet(wb, sheet = 1)
sheets <- getSheets(wb)
do.call(rbind,
lapply(sheets, read_worksheet, wb = wb, file = file)
)
}
As another note, bind_rows() is another dplyr function which can take the place of your do.call(rbind, ...) calls.
bind_rows(lapply(sheets, read_worksheet, wb = wb, file = file))
I am supposed to load the data for my master's thesis into an R dataframe; the data is stored in 74 Excel workbooks. Every workbook has 4 worksheets, called: animals, features, r_words, verbs. All of the worksheets have the same 12 variables (starttime, word, endtime, ID, etc.). I want to concatenate every worksheet under the one before, so the resulting dataframe should have 12 columns, and the number of rows depends on how many answers the 74 subjects produced.
I want to use the readxl-package of the tidyverse and followed this article: https://readxl.tidyverse.org/articles/articles/readxl-workflows.html#csv-caching-and-iterating-over-sheets.
The first problem I face is how to read all 4 worksheets with read_excel(path, sheet = "animals", "features", "r_words", "verbs"). This only works for the first worksheet, so I tried to make a list with all the sheet names (object sheet). This is also not working. And when I try to use the following code with just one worksheet, the next line throws an error:
Error in basename(.) : a character vector argument expected
So, here is a part of my code, hopefully fulfilling the requirements:
filenames <- list.files("data", pattern = '\\.xlsm',full.names = TRUE)
# indices
subfile_nos <- 1:length(filenames)
# function to read all the sheets in at once and cache to csv
read_then_csv <- function(sheet, path) {
for (i in 1:length(filenames)){
sheet <- excel_sheets(filenames[i])
len.sheet <- 1:length(sheet)
path <- read_excel(filenames[i], sheet = sheet[i]) #only reading in the first sheet
pathbase <- path %>%
basename() %>% #Error in basename(.) : a character vector argument expected
tools::file_path_sans_ext()
path %>%
read_excel(sheet = sheet) %>%
write_csv(paste0(pathbase, "-", sheet, ".csv"))
}
}
You should do a double loop or a nested map, like so:
library(dplyr)
library(purrr)
library(readr)  # for write_csv()
library(readxl)
# I suggest looking at
?purrr::map_df
# Function to read all the sheets in at once and save as csv
read_then_csv <- function(input_filenames, output_file) {
# Iterate over files and concatenate results
map_df(input_filenames, function(f){
# Iterate over sheets and concatenate results
excel_sheets(f) %>%
map_df(function(sh){
read_excel(f, sh)
})
}) %>%
# Write csv
write_csv(output_file)
}
# Test function
filenames <- list.files("data", pattern = '\\.xlsm',full.names = TRUE)
read_then_csv(filenames, 'my_output.csv')
You say...'I want to concatenate every worksheet under the one before'... The script below will combine all sheets from all files. Test it on a COPY of your data, in case it doesn't do what you want/need it to do.
# load names of excel files
files = list.files(path = "C:\\your_path_here\\", full.names = TRUE, pattern = ".xlsx")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
sheets <- readxl::excel_sheets(filename)
sapply(sheets, function(f) as.data.frame(readxl::read_excel(filename, sheet = f)),
simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
I am new to R and am currently trying to apply a function to worksheets of the same index in different workbooks using the package XLConnect.
So far I have managed to read in a folder of excel files:
filenames <- list.files( "file path here", pattern="\\.xls$", full.names=TRUE)
and I have looped through the different files, reading each worksheet of each file
for (i in 1:length(filenames)){
tmp<-loadWorkbook(file.path(filenames[i],sep=""))
lst<- readWorksheet(tmp,
sheet = getSheets(tmp), startRow=5, startCol=1, header=TRUE)}
What I think I want to do is loop through the files in filenames and take the worksheets with the same index (e.g. the 1st worksheet of all the files, then the 2nd worksheet of all the files, etc.), saving them to a new workbook (the first workbook containing all the 1st worksheets, then a second workbook with all the 2nd worksheets, etc.), with a new sheet for each original sheet taken from the previous files, and then use
for (sheet in newlst){
Count<-t(table(sheet$County))}
and apply my function to the parameter Count.
Does anyone know how I can do this or offer me any guidance at all? Sorry if it is not clear, please ask and I will try to explain further! Thanks :)
If I understand your question correctly, the following should solve your problem:
require(XLConnect)
# Example: workbooks w1.xls - w[n].xls each with sheets S1 - S[m]
filenames = list.files("file path here", pattern = "\\.xls$", full.names = TRUE)
# Read data from all workbooks and all worksheets
data = lapply(filenames, function(f) {
wb = loadWorkbook(f)
readWorksheet(wb, sheet = getSheets(wb)) # read all sheets in one go
})
# Assumption for this example: all workbooks contain same number of sheets
nWb = sapply(data, length)
stopifnot(diff(range(nWb)) == 0)
nWb = min(nWb)
for(i in seq(length.out = nWb)) {
# List of data.frames of all i'th sheets
dout = lapply(data, "[[", i)
# Note: write all collected sheets in one go ...
writeWorksheetToFile(file = paste0("wout", i, ".xls"), data = dout,
sheet = paste0("Sout", seq(length.out = length(data))))
}
Using gdata/xlsx might be best. Even though it's the slowest option out there, it's one of the more intuitive ones.
This method fails if the excels are sorted differently.
Given there's no real example, here's some food for thought.
Gdata requires perl to be installed.
library(gdata)
filenames <- list.files( "file path here", pattern="\\.xls$", full.names=TRUE)
amountOfSheets <- # input something
#This snippet gets the first sheet out of all the files and combining them
readXlsSheet <- function(whichSheet, files){
for(i in seq(files)){
piece <- read.xls(files[i], sheet=whichSheet)
#rbinding frames should work if the sheets are similar, use merge if not.
if(i == 1) complete <- piece else complete <- rbind(complete, piece)
}
complete
}
#Now looping through all the sheets using the previous method
animals <- lapply(seq(amountOfSheets), readXlsSheet, files=filenames)
#The output is a list where animals[[n]] has the all the nth sheets combined.
xlsx might only work on 32-bit R, and has its own fair share of issues.
library(xlsx)
filenames <- list.files(pattern="\\.xls$", full.names=TRUE)
amountOfSheets <- 2 # input something
#This snippet gets the first sheet out of all the files and combining them
readXlsSheet <- function(whichSheet, files){
for(i in seq(files)){
piece <- read.xlsx(files[i], sheetIndex=whichSheet)
#rbinding frames should work if the sheets are similar, use merge if not.
if(i == 1) complete <- piece else complete <- rbind(complete, piece)
}
complete
}
readXlsSheet(1, filenames)
#Now looping through all the sheets using the previous method
animals <- lapply(seq(amountOfSheets), readXlsSheet, files=filenames)