How to iterate over excel-sheets with readxl - r

I am supposed to load the data for my master's thesis, which is stored in 74 Excel workbooks, into an R data frame. Every workbook has 4 worksheets, called: animals, features, r_words, verbs. All of the worksheets have the same 12 variables (starttime, word, endtime, ID, etc.). I want to concatenate every worksheet under the one before, so the resulting data frame should have 12 columns, and the number of rows depends on how many answers the 74 subjects produced.
I want to use the readxl package from the tidyverse and followed this article: https://readxl.tidyverse.org/articles/articles/readxl-workflows.html#csv-caching-and-iterating-over-sheets.
The first problem I face is how to read all 4 worksheets with read_excel(path, sheet = "animals", "features", "r_words", "verbs"). This only works for the first worksheet, so I tried to make a list with all the sheet names (object sheet), but that does not work either. And when I try to use the following code with just one worksheet, the next line throws an error:
Error in basename(.) : a character vector argument expected
So, here is a part of my code, hopefully fulfilling the requirements:
filenames <- list.files("data", pattern = '\\.xlsm', full.names = TRUE)
# indices
subfile_nos <- 1:length(filenames)
# function to read all the sheets in at once and cache to csv
read_then_csv <- function(sheet, path) {
  for (i in 1:length(filenames)) {
    sheet <- excel_sheets(filenames[i])
    len.sheet <- 1:length(sheet)
    path <- read_excel(filenames[i], sheet = sheet[i]) # only reading in the first sheet
    pathbase <- path %>%
      basename() %>% # Error in basename(.) : a character vector argument expected
      tools::file_path_sans_ext()
    path %>%
      read_excel(sheet = sheet) %>%
      write_csv(paste0(pathbase, "-", sheet, ".csv"))
  }
}

You should use a double loop or a nested map, like so:
library(dplyr)
library(purrr)
library(readxl)
library(readr) # needed for write_csv()
# I suggest looking at
?purrr::map_df
# Function to read all the sheets in at once and save as csv
read_then_csv <- function(input_filenames, output_file) {
  # Iterate over files and concatenate results
  map_df(input_filenames, function(f) {
    # Iterate over sheets and concatenate results
    excel_sheets(f) %>%
      map_df(function(sh) {
        read_excel(f, sheet = sh)
      })
  }) %>%
    # Write csv
    write_csv(output_file)
}
# Test function
filenames <- list.files("data", pattern = '\\.xlsm', full.names = TRUE)
read_then_csv(filenames, 'my_output.csv')
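If you also want to keep track of which sheet each row came from (animals, features, r_words, verbs), one possible variant — just a sketch, with read_sheets_with_id as a hypothetical helper name and assuming the same libraries are loaded — is to name the vector of sheets and use the .id argument of map_df():
read_sheets_with_id <- function(f) {
  excel_sheets(f) %>%
    set_names() %>%                                    # sheet names become the .id values
    map_df(~ read_excel(f, sheet = .x), .id = "sheet")
}
combined <- map_df(filenames, read_sheets_with_id)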

You say... 'I want to concatenate every worksheet under the one before'... The script below will read all sheets from all files into a list; the note after the code shows how to collapse that list into a single data frame. Test it on a COPY of your data, in case it doesn't do what you want/need it to do.
# load names of excel files
files <- list.files(path = "C:\\your_path_here\\", full.names = TRUE, pattern = "\\.xlsx$")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
  sheets <- readxl::excel_sheets(filename)
  sapply(sheets, function(f) as.data.frame(readxl::read_excel(filename, sheet = f)),
         simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
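Note that all_data is a list with one element per file, and each element is itself a named list of data frames (one per sheet). Since the goal is a single 12-column data frame, here is a small follow-up sketch (assuming dplyr is installed) that collapses the nested list:
library(dplyr)
# Stack the sheets within each file, then stack the files
combined <- bind_rows(lapply(all_data, bind_rows))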

Related

Append multiple sheets from multiple Excel files in R

I am trying to append multiple sheets from multiple Excel files. For instance, each Excel file has 10 sheets (different formats), but the 10 sheets of one Excel file have the same names and formats as the corresponding 10 sheets of another Excel file. Essentially, each Excel file holds the different types of information for a different country, but the types of information collected are the same for each country (population, pollution index, GDP, etc.). And since I have many countries, I'm thinking of using a loop.
I use "report_1h" as the master Excel file, and append sheets of other Excel files into the master file's sheets.
library(rio)
x1_data <- import_list("report_1h.xlsx")
report_list <- list.files(path = 'E:/Report_folder', pattern = '*.xlsx')
sheet_ <- data.frame()
for (file in report_list) {
  book <- import_list(file)
  for (i in 1:31) {
    sheet_[i] <- rbind(x1_data[[i]][,], book[[i]][,])
    x1_data[[i]][,] <- sheet_[i]
  }
}
The loop is intended to append sheets from each Excel file to sheets of the master file "report_1h". But it gives error:
Error in `[<-.data.frame`(`*tmp*`, i, value = c("Data Source(s):", "Data Source(s):", :
replacement has 2 rows, data has 0
Can someone tell me why?
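The error comes from sheet_ being an empty (0-row) data frame: sheet_[i] <- ... tries to assign a multi-row column into a data frame with zero rows, which data frame assignment does not allow. A minimal illustration (made-up values, not your data):
sheet_ <- data.frame()
sheet_[1] <- c("a", "b")
# Error in `[<-.data.frame`(`*tmp*`, 1, value = c("a", "b")) :
#   replacement has 2 rows, data has 0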
Here's a way to do this -
library(tidyverse)
library(readxl) # readxl is not attached by library(tidyverse)
# get all the filenames
report_list <- list.files(path = 'E:/Report_folder', pattern = '\\.xlsx$', full.names = TRUE)
# Create a list of dataframes
map(report_list, function(x) {
  sheets <- excel_sheets(x)
  map(sheets, function(y) read_excel(x, sheet = y))
}) %>%
  transpose() %>%
  map(bind_rows) -> result
# assign sheet names
names(result) <- paste0('Sheet', seq_along(result))
# Write master excel file
writexl::write_xlsx(result, 'master_excel.xlsx')
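Note that transpose() matches sheets by position, so this assumes every file lists its sheets in the same order. If that holds, you could also carry over the real sheet names from the first workbook instead of the generic Sheet1, Sheet2, ... labels — a small optional tweak:
# Optional: reuse the sheet names of the first workbook for the output
names(result) <- excel_sheets(report_list[1])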

Assigning the filename and sheet name to (multiple) observations in R

I have written a function that, given the path of a folder, takes all the Excel files inside it and merges them into a data frame with some modest modifications.
Yet I have two small things I would like to add but struggle with:
Each file has a country code in its name, and I would like the function to create an additional column in the data frame, "Country", where each observation would be assigned that country code. Name example: BGR_CO2_May12
Each file is composed of many sheets, with each sheet representing a year; the sheets are also named after these years. I would like the function to create another column, "Year", where each observation would be assigned the name of the sheet it comes from.
Is there a neat way to do it? Possibly without modifying the current function?
multmerge_xls_TEST <- function(mypath2) {
  library(dplyr)
  library(readxl)
  library(XLConnect)
  library(XLConnectJars)
  library(stringr)
  # This function gets the list of files in a given folder
  re_file <- ".+\\.xls.?"
  testFiles <- list.files(path = mypath2,
                          pattern = re_file,
                          full.names = TRUE)
  # This function rbinds in a single dataframe the content of multiple sheets in the same workbook
  # (assuming that all the sheets have the same column types)
  # It also removes the first sheet (no data there)
  rbindAllSheets <- function(file) {
    wb <- loadWorkbook(file)
    removeSheet(wb, sheet = 1)
    sheets <- getSheets(wb)
    do.call(rbind,
            lapply(sheets, function(sheet) {
              readWorksheet(wb, sheet)
            }))
  }
  # Getting a single dataframe for all the Excel files and cutting out the unnecessary variables
  result <- do.call(rbind, lapply(testFiles, rbindAllSheets))
  result <- result[, c(1, 2, 31)]
  result
}
Try making a wrapper around readWorksheet(). This would store the file name into the variable Country and the sheet name into Year. You would need to do some regex on the file though to get the code only.
You could also skip the wrapper and simply add the mutate() line within your current function. Note this uses the dplyr package, which you already have referenced.
read_worksheet <- function(sheet, wb, file) {
  readWorksheet(wb, sheet) %>%
    mutate(Country = file,
           Year = sheet)
}
So then you could do something like this within the function you already have.
rbindAllSheets <- function(file) {
  wb <- loadWorkbook(file)
  removeSheet(wb, sheet = 1)
  sheets <- getSheets(wb)
  do.call(rbind,
          lapply(sheets, read_worksheet, wb = wb, file = file))
}
As another note, bind_rows() is another dplyr function which can take the place of your do.call(rbind, ...) calls.
bind_rows(lapply(sheets, read_worksheet, wb = wb, file = file))
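For the Country column mentioned above, a possible sketch of the regex step — extract_country is a hypothetical helper, assuming file names like BGR_CO2_May12.xls where the first three letters are the country code:
# Pull the leading three-letter code off the base file name
extract_country <- function(file) {
  sub("^([A-Z]{3})_.*$", "\\1", basename(file))
}
extract_country("BGR_CO2_May12.xls")
# [1] "BGR"
# Inside read_worksheet() you would then use:
#   mutate(Country = extract_country(file), Year = sheet)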

Read specific sheets from an excel file list

I need to read specific sheets from a list of excel files.
I have >500 excel files, ".xls" and ".xlsx".
Each file can have different sheets, but I just want to read the sheets containing a specific expression, like pattern = "^Abc", and not all files have sheets with this pattern.
I've created code that reads one file, but when I try to translate it to multiple files, it always returns an error.
# example with 3rd file
# 2 sheets have the pattern
list_excels <- list.files(path = "path_to_folder", pattern = ".xls*")
sheet_names <- excel_sheets(list_excels[[3]])
list_sheets <- lapply(excel_sheets(list_excels[[3]]), read_excel, path = list_excels[[3]])
names(list_sheets) <- sheet_names
do.call("rbind", list_sheets[grepl(pattern = "^Abc", sheet names)])
But when I try to write code that reads multiple Excel files, I get an error or something in the loop that slows the computation down a lot.
Here are some examples.
This is a loop that doesn't return an error, but it takes at least 30 seconds per element of the list, and I've never waited for it to finish.
for (i in seq_along(list_excels)) {
  sheet_names <- excel_sheets(list_excels[[i]])
  list_sheets <- lapply(excel_sheets(list_excels[[i]]), read_excel, path = list_excels[[i]])
  names(list_sheets) <- sheet_names
}
This loop is missing the final part, merging the sheets with this code:
list_sheets[grepl(pattern = "^Abc", sheet_names)]
I've tried to count the rows of each sheet and store them in a vector, but I think the loop breaks when a file has no sheet with the pattern.
x <- c()
for (i in seq_along(list_excels)) {
  x[i] <- nrow(do.call("rbind",
                       lapply(excel_sheets(list_excels[[i]]),
                              read_excel,
                              path = list_excels[[i]])[grepl(pattern = "^Abc",
                                                             excel_sheets(list_excels[[i]]))]))
}
Also, with the purrr library, trying to read everything, I get the same result as with the first loop example.
list_test <- list()
for (i in seq_along(list_excels)) {
  list_test[[i]] <- excel_sheets(list_excels[[i]]) %>%
    set_names() %>%
    map(read_excel, path = list_excels[[i]])
}
Last example, which works with one Excel file but not with multiple; it just reads a named sheet.
# One file works
data.frame(readWorksheetFromFile(list_excels[[1]], sheet = "Abc"))
# Multiple files return an error
for (i in seq_along(list_excels)) {
  data.frame(readWorksheetFromFile(list_excels[[i]], sheet = "Abc"))
}
# Returns the following error
# Error: IllegalArgumentException (Java): Sheet index (-1) is out of range (0..1)
Could someone help me?
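One possible approach — offered only as an untested sketch against files in path_to_folder — is to select the matching sheet names per file before reading, so files without a matching sheet simply contribute zero rows:
library(readxl)
library(purrr)
library(dplyr)
list_excels <- list.files(path = "path_to_folder", pattern = "\\.xls", full.names = TRUE)
abc_data <- map_df(list_excels, function(f) {
  # Keep only the sheets whose names start with "Abc"
  keep <- grep("^Abc", excel_sheets(f), value = TRUE)
  map_df(keep, ~ read_excel(f, sheet = .x))
})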

Merge multiple Excel files starting at row in R

I have multiple Excel files that I need to merge into one, but only certain rows. The Excel files look like this...
The column headers are identical for all files. I also need to add a new column A to the newly generated file, so I created a separate Excel file with just the headers and the new column A. My script first reads in this file (below) and writes it to the workbook...
Next, I need to read each file, starting at row 9 and merge all the data, one after another. So the final result should look like this (minus the Member site column, I haven't attempted the logic for that yet, but thinking it will be a substring of the Specimen ID value)...
However, my current result is...
I am currently only using 3 files, each with a few dozen rows, to start, but the end goal is to merge about 15-30 files, each with 25 to 200 rows, give or take. So...
1) I know my code is incorrect, but I'm not sure how to get the intended results. For one, my loop is overwriting data because it always starts writing at row/column 2. However, I can't think of how to rewrite this.
2) The dates are coming back in General format ("43008" instead of "9/30/2017").
3) Some columns' data is being placed under different columns (e.g. Nucleic Acid Concentration has the values from the Date of Tissue Content).
Any advice or help would be greatly appreciated!
My code...
library(openxlsx)  # Excel and csv files
library(svDialogs) # Dialog boxes
setwd("C:/Users/Work/Combined Manifest")
# Create and load Excel file
wb <- createWorkbook()
# Add worksheet
addWorksheet(wb, "Template")
# Read in & write header file
df.headers <- read.xlsx("headers.xlsx", sheet = "Template")
writeData(wb, "Template", df.headers, colNames = TRUE)
# Function to get user path
getPath <- function() {
  # Ask for path
  path <- dlgInput("Enter path to files: ", Sys.info()["user"])$res
  if (dir.exists(path)) {
    # If path exists, set the path as the working directory
    return(path)
  } else {
    # If not, issue an error and recall the getPath function
    dlg_message("Error: The path you entered is not a valid directory. Please try again.")$res
    getPath()
  }
}
# Call getPath function
folder <- getPath()
setwd(folder)
# Get list of files in directory
pattern.ext <- "\\.xlsx$"
files <- dir(folder, full=TRUE, pattern=pattern.ext)
# Get basenames and remove extension
files.nms <- basename(files)
files.nms <- gsub(pattern.ext, "", files.nms)
# Set the names
names(files) <- files.nms
# Iterate to read in files and write to new file
for (nm in files.nms) {
  # Read in files
  df <- read.xlsx(files[nm], sheet = "Template", startRow = 9, colNames = FALSE)
  # Write data to sheet
  writeData(wb, "Template", df, startCol = 2, startRow = 2, colNames = FALSE)
}
saveWorkbook(wb, "Combined.xlsx", overwrite = TRUE)
EDIT:
So with the loop below, I am successfully reading in the files and merging them. Thanks for all the help!
for (nm in files.nms) {
  # Read in files
  df <- read.xlsx(files[nm], sheet = "Template", startRow = 8, colNames = TRUE, detectDates = TRUE,
                  skipEmptyRows = FALSE, skipEmptyCols = FALSE)
  # Append the data
  allData <- rbind(allData, df)
}
EDIT: FINAL SOLUTION
Thanks to everyone for the help!!
library(openxlsx)  # Excel and csv files
library(svDialogs) # Dialog boxes
# Create and load Excel file
wb <- createWorkbook()
# Add worksheet
addWorksheet(wb, "Template")
# Function to get user path
getPath <- function() {
  # Ask for path
  path <- dlgInput("Enter path to files: ", Sys.info()["user"])$res
  if (dir.exists(path)) {
    # If path exists, set the path as the working directory
    return(path)
  } else {
    # If not, issue an error and recall the getPath function
    dlg_message("Error: The path you entered is not a valid directory. Please try again.")$res
    getPath()
  }
}
# Call getPath function
folder <- getPath()
# Set working directory
setwd(folder)
# Get list of files in directory
pattern.ext <- "\\.xlsx$"
files <- dir(folder, full=TRUE, pattern=pattern.ext)
# Get basenames
files.nms <- basename(files)
# Set the names
names(files) <- files.nms
# Create empty dataframe
allData <- data.frame()
# Create list (reserve memory)
f.List <- vector("list", length(files.nms))
# Loop and load files
for (nm in 1:length(files.nms)) {
  # Read in files
  f.List[[nm]] <- read.xlsx(files[nm], sheet = "Template", startRow = 8, colNames = TRUE, detectDates = TRUE,
                            skipEmptyRows = FALSE, skipEmptyCols = FALSE)
}
# Append the data
allData <- do.call("rbind", f.List)
# Add a new column as 'Member Site'
allData <- data.frame('Member Site' = "", allData)
# Take the substring of the Specimen.ID column for Member Site
allData$Member.Site <- sapply(strsplit(allData$Specimen.ID, "-"), "[", 2)
# Write data to sheet
writeData(wb, "Template", startCol = 1, allData)
# Save workbook
saveWorkbook(wb, "Combined.xlsx", overwrite = TRUE)
First of all, you are providing a lot of information in your question, which is generally a good thing, but I’m wondering if you could make your problems easier to solve by recreating the problem using fewer and smaller files. Could you figure out how to merge two files, each containing a small amount of data first?
With regards to the first challenge you raise:
1) Yes, you are overwriting the data in the workbook on each loop iteration. I would suggest loading the data, appending it to a data frame, and then writing the end result after loading all the files. Have a look at the example below. Please note that this example uses rbind, which is inefficient if you are combining a large number of files, so if you have many files you may need to use a different structure.
# Create an empty data frame
allData <- data.frame()
# Loop and load files
for (nm in files.nms) {
  # Read in files
  df <- read.xlsx(files[nm], sheet = "Template", startRow = 9, colNames = FALSE)
  # Append the data
  allData <- rbind(allData, df)
}
# Write the combined data to the sheet
writeData(wb, "Template", allData, startCol = 2, startRow = 2, colNames = FALSE)
Hopefully this gets you closer to what you need!
Edit: Updating the answer to address the comments made
If you have more than a few files, rbind will get slow, as @Parfait mentioned, because multiple copies of the data are made. The way to avoid this is to first reserve space in memory by creating an empty list with enough room to hold your data, then fill in the list, and only at the end merge all the data together using do.call("rbind", ...). I've compiled some sample code below that's in line with what you provided in your question.
# Create list (reserve memory)
f.List <- vector("list", length(files.nms))
# Loop and load files
for (eNr in 1:length(files.nms)) {
  # Read in files
  f.List[[eNr]] <- read.xlsx(files[eNr], sheet = "Template", startRow = 9)
}
# Append the data
allData <- do.call("rbind", f.List)
To illustrate this further, below is a small reproducible example. It uses just a couple of data frames, but it illustrates the process of creating a list, populating that list, and merging the data as the last step.
# Sample data
df1 <- data.frame(x = 1:3, y = 3:1)
df2 <- data.frame(y = 4:6, x = 3:1)
df.List <- list(df1, df2)
# Create list
d.List <- vector("list", length(df.List))
# Loop and add data
for (eNr in 1:length(df.List)) {
  d.List[[eNr]] <- df.List[[eNr]]
}
# Bind all at once
dfAll <- do.call("rbind", d.List)
print(dfAll)
Hope this helps! Thanks!

read_excel() and lapply()

I have a .xlsx file that has two sheets, and I want to generate a list of both Excel sheets using read_excel from the readxl package. I have used this code:
my_work <- lapply(excel_sheets("data.xlsx"),
                  read_excel,
                  path = "data.xlsx")
The read_excel() function is called multiple times on the "data.xlsx" file and each sheet is loaded in one after the other. The result is a list of data frames, each data frame representing one of the sheets in data.xlsx.
My question is: why do I need to supply the path argument in the lapply() call, since the file is already in the working directory?
Not sure this is the easiest way, but you can create a short wrapper function around read_excel() that takes both a sheet name and a path, then lapply over that function.
library(readxl)
path <- "data.xlsx"
sheet_names <- excel_sheets(path)
# create function
read_excel_sheet <- function(sheet_name, path) {
  x <- read_excel(path = path, sheet = sheet_name)
}
my_work <- lapply(sheet_names, read_excel_sheet, path = path)
The documentation:
read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)
The parameter path is a required argument with no default, so you need to fill it in; otherwise, an error will pop up. Supplying it by name also means the sheet names that lapply() passes in are matched to the sheet argument instead of being mistaken for file paths.
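To see why naming path matters, here is what lapply() effectively does for each element (with a made-up sheet name):
# lapply(sheets, read_excel, path = "data.xlsx") calls, for each sheet:
read_excel("Sheet1", path = "data.xlsx")  # "Sheet1" is matched to `sheet`, because `path` is already taken
# Without the named path argument it would be:
# read_excel("Sheet1")                    # "Sheet1" would be treated as the file path and fail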
