read_excel() and lapply() - r

I have a .xlsx file that has two sheets, and I want to generate a list of both Excel sheets using read_excel() from the readxl package. I have used this code:
my_work <- lapply(excel_sheets("data.xlsx"),
                  read_excel,
                  path = "data.xlsx")
The read_excel() function is called multiple times on the "data.xlsx" file, and each sheet is loaded one after the other. The result is a list of data frames, each data frame representing one of the sheets in data.xlsx.
My question is: why should I write the path argument in the lapply() call, since the file is already in the working directory?

Not sure this is the easiest way, but you can create a short function wrapping read_excel() that takes both a sheet name parameter and a path, then lapply() over that function.
library(readxl)
path <- "data.xlsx"
sheet_names <- excel_sheets(path)
# create function
read_excel_sheet <- function(sheet_name, path) {
  read_excel(path = path, sheet = sheet_name)
}
my_work <- lapply(sheet_names, read_excel_sheet, path = path)

The documentation:
read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)
The parameter path is a required argument, so you need to supply it; otherwise, an error will pop up.
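To make the mechanics concrete, a minimal sketch (assuming the same "data.xlsx" with its two sheets): lapply() passes each sheet name as the first unnamed argument of read_excel(), so naming path explicitly is what lets the sheet names land in the sheet argument rather than being treated as file paths.
library(readxl)
sheets <- excel_sheets("data.xlsx")
# path is named, so each sheet name matches the `sheet` argument;
# without it, read_excel("Sheet1") would look for a file called "Sheet1"
my_work <- lapply(sheets, read_excel, path = "data.xlsx")
names(my_work) <- sheets  # optional: keep the sheet names on the list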

Related

Read second sheet of xlsx file from various subdirectories of a main directory R

I want to read the sheet that contains the word "All" or "all" of an Excel workbook for every subdirectory, based on a specific pattern.
I have tried list.files(), but it does not work properly.
files_to_read = list.files(
  path = common_path,    # directory to search within
  pattern = "X - GEN",   # regex pattern, some explanation below
  recursive = TRUE,      # search subdirectories
  full.names = TRUE      # return the full path
)
data_lst = lapply(files_to_read, read.xlsx)
I am assuming your sub-directories have similar, identifiable names?
Assumptions, let's say:
your sub-directory starts with 'this',
the files saved in the sub-directory start with the file name 'my_file', and
the tab that you are trying to read in contains the word 'all'.
If the tab that you are reading in is always in the same position (e.g. the 2nd tab of every file), it is easier, since you can specify the sheet within read.xlsx as sheet = 2 (a minimal sketch of that case follows the function below). If that is not the case, one approach is to create your own function that allows for this.
Then
library(openxlsx)
# getting the names of subdirectories starting with the word 'this'
my_dir <- list.files(pattern = "^this", full.names = TRUE)
# getting the names of the files starting with 'my_file', e.g. my_file.xlsx, my_file2.xlsx
my_files <- list.files(my_dir, pattern = "^my_file", full.names = TRUE)
my_read_xlsx <- function(files_to_read, sheets_to_read) {
  # file to import
  wb <- loadWorkbook(files_to_read)
  # getting the sheet names that contain 'all' or any other string that you specify
  # ignore.case is there so that matching is not case sensitive when reading in Excel tabs
  ws <- names(wb)[grepl(sheets_to_read, names(wb), ignore.case = TRUE)]
  # reading in the Excel tab identified above
  xl_data <- read.xlsx(wb, ws)
  return(xl_data)
}
# Using the function created above to import tabs containing 'all'
my_list <- lapply(my_files, FUN = function(x) my_read_xlsx(x, sheets_to_read = "all"))
# Converting the list into a data.frame
my_data <- do.call("rbind", my_list)
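And the simpler fixed-position case mentioned above, as a minimal sketch (assuming the tab you want really is the 2nd one in every file):
library(openxlsx)
# read the 2nd tab of every file and stack the results
my_data <- do.call("rbind", lapply(my_files, read.xlsx, sheet = 2))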

How do I get the file path of a file saved using write.xlsx or another function in R?

I am creating two data frames and one graph in RStudio. I wrote code to transfer them to an Excel file on different sheets, but each time I have to choose the file path using file.choose(). Is it possible to assign the file path to a variable when saving the file for the first time? If such a method exists, how can it be done?
I would also like to receive comments on how to export my data frames to an Excel file more easily. I have shared my code below.
Thank you, everyone.
dataframe1 <- data.frame("A" = 1, "B" = 2)
dataframe2 <- data.frame("C" = 3, "D" = 4)
list_of_datasets <- list("Name of DataSheet1" = dataframe1, "Name of Datasheet2" = dataframe2)
write.xlsx(list_of_datasets, file = "writeXLSX2.xlsx")
dflist <- list("Sonuçlar" = yazılacakdosya0, "Frame" = dtf, "Grafik" = "")
edc <- write.xlsx(dflist, file.choose(new = TRUE), colNames = TRUE,
                  borders = "surrounding",
                  firstRow = TRUE,
                  headerStyle = hs)
require(ggplot2)
q1 <- qplot(hist(yazılacakdosya0$Puan))
print(q1)
insertPlot(wb = edc, sheet = "Grafik")
saveWorkbook(edc, file = file.choose(), overwrite = TRUE)
Just save the file path before you call saveWorkbook:
file <- file.choose()
saveWorkbook(edc, file = file, overwrite = TRUE)
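Applied to the code in the question, a minimal sketch (reusing the question's objects dflist, hs and edc; out_path is a name introduced here for illustration):
# choose the destination once, then reuse it for every save
out_path <- file.choose(new = TRUE)
edc <- write.xlsx(dflist, file = out_path, colNames = TRUE,
                  borders = "surrounding", firstRow = TRUE, headerStyle = hs)
insertPlot(wb = edc, sheet = "Grafik")
saveWorkbook(edc, file = out_path, overwrite = TRUE)  # same path, no second prompt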

Assigning the filename and sheet name to (multiple) observations in R

I have written a function that, after being given the path of a folder, takes all the Excel files inside it and merges them into a data frame with some modest modifications.
Yet there are two small things I would like to add but struggle with:
Each file has a country code in its name, and I would like the function to create an additional column in the data frame, "Country", where each observation would be assigned that country code. Name example: BGR_CO2_May12
Each file is composed of many sheets, with each sheet representing a year; the sheets are also named after these years. I would like the function to create another column, "Year", where each observation would be assigned the name of the sheet it comes from.
Is there a neat way to do this? Possibly without modifying the current function?
multmerge_xls_TEST <- function(mypath2) {
  library(dplyr)
  library(readxl)
  library(XLConnect)
  library(XLConnectJars)
  library(stringr)
  # This gets the list of files in a given folder
  re_file <- ".+\\.xls.?"
  testFiles <- list.files(path = mypath2,
                          pattern = re_file,
                          full.names = TRUE)
  # This function rbinds into a single dataframe the content of multiple sheets in the same workbook
  # (assuming that all the sheets have the same column types)
  # It also removes the first sheet (no data there)
  rbindAllSheets <- function(file) {
    wb <- loadWorkbook(file)
    removeSheet(wb, sheet = 1)
    sheets <- getSheets(wb)
    do.call(rbind,
            lapply(sheets, function(sheet) {
              readWorksheet(wb, sheet)
            })
    )
  }
  # Getting a single dataframe for all the Excel files and cutting out the unnecessary variables
  result <- do.call(rbind, lapply(testFiles, rbindAllSheets))
  result <- result[, c(1, 2, 31)]
  result
}
Try making a wrapper around readWorksheet(). This would store the file name in the variable Country and the sheet name in Year. You would need to do some regex on the file name, though, to get just the country code (a sketch of that step follows the wrapper below).
You could also skip the wrapper and simply add the mutate() line within your current function. Note this uses the dplyr package, which you already reference.
read_worksheet <- function(sheet, wb, file) {
  readWorksheet(wb, sheet) %>%
    mutate(Country = file,
           Year = sheet)
}
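For that regex step, a minimal sketch, assuming file names like "BGR_CO2_May12.xlsx" where the country code is the three letters before the first underscore (country_code() is a hypothetical helper, not part of the original function):
country_code <- function(file) {
  # "data/BGR_CO2_May12.xlsx" -> "BGR_CO2_May12" -> "BGR"
  sub("^([A-Z]{3})_.*$", "\\1", tools::file_path_sans_ext(basename(file)))
}
# inside the wrapper: mutate(Country = country_code(file), Year = sheet)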
So then you could do something like this within the function you already have.
rbindAllSheets <- function(file) {
  wb <- loadWorkbook(file)
  removeSheet(wb, sheet = 1)
  sheets <- getSheets(wb)
  do.call(rbind,
          lapply(sheets, read_worksheet, wb = wb, file = file)
  )
}
As a side note, bind_rows() is a dplyr function that can take the place of your do.call(rbind, ...) calls.
bind_rows(lapply(sheets, read_worksheet, wb = wb, file = file))
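Presumably the outer call in your function could be swapped the same way:
# replaces result <- do.call(rbind, lapply(testFiles, rbindAllSheets))
result <- bind_rows(lapply(testFiles, rbindAllSheets))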

How to iterate over excel-sheets with readxl

I am supposed to load the data for my master's thesis into an R data frame; the data is stored in 74 Excel workbooks. Every workbook has 4 worksheets, called: animals, features, r_words, verbs. All of the worksheets have the same 12 variables (starttime, word, endtime, ID, ... etc.). I want to concatenate every worksheet under the one before, so the resulting data frame should have 12 columns, and the number of rows depends on how many answers the 74 subjects produced.
I want to use the readxl-package of the tidyverse and followed this article: https://readxl.tidyverse.org/articles/articles/readxl-workflows.html#csv-caching-and-iterating-over-sheets.
The first problem I face is how to read all 4 worksheets with read_excel(path, sheet = "animals", "features", "r_words", "verbs"). This only works for the first worksheet, so I tried to make a list with all the sheet names (object sheet). This also did not work. And when I try to use the following code with just one worksheet, the next line throws an error:
Error in basename(.) : a character vector argument expected
So, here is a part of my code, hopefully fulfilling the requirements:
filenames <- list.files("data", pattern = '\\.xlsm', full.names = TRUE)
# indices
subfile_nos <- 1:length(filenames)
# function to read all the sheets in at once and cache to csv
read_then_csv <- function(sheet, path) {
  for (i in 1:length(filenames)) {
    sheet <- excel_sheets(filenames[i])
    len.sheet <- 1:length(sheet)
    path <- read_excel(filenames[i], sheet = sheet[i]) # only reading in the first sheet
    pathbase <- path %>%
      basename() %>% # Error in basename(.) : a character vector argument expected
      tools::file_path_sans_ext()
    path %>%
      read_excel(sheet = sheet) %>%
      write_csv(paste0(pathbase, "-", sheet, ".csv"))
  }
}
You should do a double loop or a nested map, like so:
library(dplyr)
library(purrr)
library(readr)
library(readxl)
# I suggest looking at
?purrr::map_df
# Function to read all the sheets in at once and save as csv
read_then_csv <- function(input_filenames, output_file) {
  # Iterate over files and concatenate results
  map_df(input_filenames, function(f) {
    # Iterate over sheets and concatenate results
    excel_sheets(f) %>%
      map_df(function(sh) {
        read_excel(f, sh)
      })
  }) %>%
    # Write csv
    write_csv(output_file)
}
# Test function
filenames <- list.files("data", pattern = '\\.xlsm', full.names = TRUE)
read_then_csv(filenames, 'my_output.csv')
You say... 'I want to concatenate every worksheet under the one before'... The script below will read in all sheets from all files as a nested list (a sketch of flattening that list into one data frame follows the code). Test it on a COPY of your data, in case it doesn't do what you want/need it to do.
# load names of excel files
files <- list.files(path = "C:\\your_path_here\\", full.names = TRUE, pattern = ".xlsx")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
  sheets <- readxl::excel_sheets(filename)
  sapply(sheets, function(f) as.data.frame(readxl::read_excel(filename, sheet = f)),
         simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
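Since all_data is a nested list (one list of data frames per file), here is a minimal sketch of flattening it into the single 12-column data frame you described, assuming dplyr is available:
library(dplyr)
# bind the sheets within each file, then bind the per-file results together
final_df <- bind_rows(lapply(all_data, bind_rows))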

Read specific sheets from an excel file list

I need to read specific sheets from a list of Excel files.
I have >500 Excel files, ".xls" and ".xlsx".
Each file can have different sheets, but I just want to read the sheets containing a specific expression, like pattern = "^Abc", and not all files have sheets with this pattern.
I've created code that reads one file, but when I try to translate it to multiple files, it always returns an error.
# example with the 3rd file
# 2 sheets have the pattern
list_excels <- list.files(path = "path_to_folder", pattern = ".xls*")
sheet_names <- excel_sheets(list_excels[[3]])
list_sheets <- lapply(excel_sheets(list_excels[[3]]), read_excel, path = list_excels[[3]])
names(list_sheets) <- sheet_names
do.call("rbind", list_sheets[grepl(pattern = "^Abc", sheet_names)])
But when I try to write code that reads multiple Excel files, I get an error or something in the loop that slows the computation down a lot.
Here are some examples.
This is a loop that doesn't return an error, but it takes at least 30 seconds for each element of the list; I've never waited for it to finish.
for (i in seq_along(list_excels)) {
  sheet_names <- excel_sheets(list_excels[[i]])
  list_sheets <- lapply(excel_sheets(list_excels[[i]]), read_excel, path = list_excels[[i]])
  names(list_sheets) <- sheet_names
  list_sheets[grepl(pattern = "^Abc", sheet_names)]
}
This loop is still missing the final step, merging the sheets selected with this code:
list_sheets[grepl(pattern = "^Abc", sheet_names)]
I've tried to sum the rows of each sheet and store them in a vector, but I think the loop breaks when a file has no sheet with the pattern.
x <- c()
for (i in seq_along(list_excels)) {
  x[i] <- nrow(do.call("rbind",
                       lapply(excel_sheets(list_excels[[i]]),
                              read_excel,
                              path = list_excels[[i]])[grepl(pattern = "^Abc",
                                                             excel_sheets(list_excels[[i]]))]))
}
Also, with the purrr library, trying to read everything, I get the same result as with the first loop example.
list_test <- list()
for (i in seq_along(list_excels)) {
  list_test[[i]] <- excel_sheets(list_excels[[i]]) %>%
    set_names() %>%
    map(read_excel, path = list_excels[[i]])
}
Last example: it works with one Excel file, but not with multiple, just reading a named sheet.
# One file works
data.frame(readWorksheetFromFile(list_excels[[1]], sheet = "Abc"))
# Multiple files return an error
for (i in seq_along(list_excels)) {
  data.frame(readWorksheetFromFile(list_excels[[i]], sheet = "Abc"))
}
# Returns the following error
# Error: IllegalArgumentException (Java): Sheet index (-1) is out of range (0..1)
Could someone help me?
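For what it's worth, a minimal sketch built from the single-file code above (readxl only; "path_to_folder" kept as the placeholder from the question) that skips files with no sheet matching "^Abc" instead of breaking:
library(readxl)
list_excels <- list.files(path = "path_to_folder", pattern = "\\.xlsx?$", full.names = TRUE)
read_abc_sheets <- function(file) {
  sheets <- excel_sheets(file)
  keep <- grepl("^Abc", sheets)
  if (!any(keep)) return(NULL)  # no matching sheet in this file: skip it
  do.call("rbind", lapply(sheets[keep], read_excel, path = file))
}
# NULL entries from skipped files are ignored by rbind
all_abc <- do.call("rbind", lapply(list_excels, read_abc_sheets))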
