Import multiple excel sheets using openxlsx - r

I am trying to import a large xlsx file into R that has many sheets of data. I was attempting to do this through XLConnect, but Java memory problems (such as those described in this thread) have prevented this technique from being successful.
Instead, I am trying to use the openxlsx package, which I have read works much faster and avoids Java altogether. But is there a way to use its read.xlsx function within a loop to read all of the sheets into separate data frames? The technique I was using with the other package no longer works because commands like loadWorkbook() and getSheets() can no longer be used.
Thank you for your help.

I think the getSheetNames() function is the right function to use. It will give you a vector of the worksheet names in a file. Then you can loop over this list to read in a list of data.frames.
read_all_sheets = function(xlsxFile, ...) {
  sheet_names = openxlsx::getSheetNames(xlsxFile)
  sheet_list = as.list(rep(NA, length(sheet_names)))
  names(sheet_list) = sheet_names
  for (sn in sheet_names) {
    sheet_list[[sn]] = openxlsx::read.xlsx(xlsxFile, sheet = sn, ...)
  }
  return(sheet_list)
}
read_all_sheets(myxlsxFile)
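For a quick self-check, you can build a small two-sheet workbook with openxlsx and confirm the loop recovers both sheets by name. A minimal sketch (the file and sheet contents here are made up for the demonstration):

```r
library(openxlsx)

# Write a throwaway two-sheet workbook to a temp file
tmp <- tempfile(fileext = ".xlsx")
write.xlsx(list(sales = data.frame(x = 1:3),
                costs = data.frame(y = 4:6)), tmp)

# The same getSheetNames()/read.xlsx() loop, inlined
sheet_names <- getSheetNames(tmp)
sheet_list <- lapply(sheet_names, function(sn) read.xlsx(tmp, sheet = sn))
names(sheet_list) <- sheet_names

names(sheet_list)   # "sales" "costs"
```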

Doing nothing more than perusing the documentation for openxlsx quickly leads one to the function sheets(), which it states is deprecated in favor of names(), which returns the names of all the worksheets in a workbook. You can then iterate over them in a simple for loop.
I'm not sure why you say that loadWorkbook cannot be used. Again, the documentation clearly shows a function in openxlsx by that name that does roughly the same thing as in XLConnect, although its arguments are slightly different.
You can also look into the readxl package, which also does not have a Java dependency.
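A sketch of that openxlsx route, using a demo workbook in place of a real file (in openxlsx, loadWorkbook() returns a Workbook object whose worksheet names you get with names()):

```r
library(openxlsx)

# Demo workbook standing in for your real file
tmp <- tempfile(fileext = ".xlsx")
write.xlsx(list(a = data.frame(v = 1), b = data.frame(v = 2)), tmp)

wb <- loadWorkbook(tmp)   # openxlsx's loadWorkbook, no Java involved
names(wb)                 # worksheet names, replacing the deprecated sheets()

# read.xlsx() also accepts the Workbook object directly
sheet_list <- lapply(names(wb), function(sn) read.xlsx(wb, sheet = sn))
names(sheet_list) <- names(wb)
```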

sapply() can also be used.
read_all_sheets = function(xlsxFile, ...) {
  sheet_names = openxlsx::getSheetNames(xlsxFile)
  # simplify = FALSE keeps the result as a named list of data frames
  # (otherwise sapply may collapse same-shaped sheets into a matrix)
  sheet_list = sapply(sheet_names, function(sn) {
    openxlsx::read.xlsx(xlsxFile, sheet = sn, ...)
  }, simplify = FALSE, USE.NAMES = TRUE)
  return(sheet_list)
}

Related

Is there a way to pass an R object to read.csv?

I have an R function from a package that I need to pass a file path as an argument, but it's expecting a csv and my file is an xlsx.
I've looked at the code for the function and it is using read.csv to load the file, but unfortunately I can't make any changes to the package.
Is there a good way to read in the xlsx and pass it to the function without writing it to a csv and having the function read it back in?
I came across the text argument for read.csv here:
Is there a way to use read.csv to read from a string value rather than a file in R?
This seems like it might be part way there, but as I said, I am unable to alter the function.
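For reference, the text argument from that linked question lets read.csv parse a string directly instead of a file path. A minimal sketch with made-up data:

```r
# read.csv can parse an in-memory string via its text argument
csv_string <- "a,b\n1,2\n3,4"
df <- read.csv(text = csv_string)
df$a   # 1 3
```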
Maybe you could construct your own function that checks if the file is an xlsx and, in that case, creates a temporary csv file, feeds it to your function, and deletes it. Something like
yourfunction = function(path){
  df <- read.csv(path)
  head(df)
}
library(readxl)
modified_function = function(path){
  if(grepl("\\.xlsx$", path)){
    tmp <- read_xlsx(path)
    tmp_path <- paste0(gsub("\\.xlsx$", "", path), "_tmp.csv")
    write.csv(tmp, file = tmp_path, row.names = FALSE)
    output <- yourfunction(tmp_path)
    file.remove(tmp_path)
  } else {
    output <- yourfunction(path)
  }
  return(output)
}
If it is of help, here you can see how to modify only one function of a package: How to modify a function of a library in a module

Select columns when reading in files with st_read

I am trying to read 39 json files into a common sf dataset in R.
Here is the method I've been trying:
path <- "~/directory"
file.names <- as.list(dir(path, pattern='\\.json$', full.names=T))
geodata <- do.call(rbind, lapply(file.names, st_read))
The problem is in the last line: rbind cannot work because the files have different numbers of columns. However, they all have three columns in common, which are the ones I care about: MOVEMENT_ID, DISPLAY_NAME and geometry. How could I select only these three columns when running st_read?
I've tried running geodata<-do.call(rbind, lapply(file.names, st_read, select=c('MOVEMENT_ID', 'DISPLAY_NAME', 'geometry'))) but, in this case, st_read does not seem to recognise the geometry column (error: 'no simple features geometry column present').
I've also tried to use fread in place of st_read but this doesn't work as fread is not adapted to spatial data.
Run lapply over a function that calls st_read and then does what you need to it, something like:
read_my_json = function(f){
  s = st_read(f)
  return(s[, c("MOVEMENT_ID", "DISPLAY_NAME")])
}
(I'm pretty sure you don't have to select the geometry as well, you get that for free when selecting columns of an sf spatial object)
then do.call(rbind, lapply(file.names, read_my_json)) should work.
No extra packages need to be included, and it has the big advantage that you can test this function on a single item to see how it works before throwing a thousand at it.
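The do.call(rbind, ...) step is worth understanding on its own: rbind expects the data frames as separate arguments, and do.call feeds it the whole list at once. A tiny non-spatial sketch with made-up frames:

```r
dfs <- list(data.frame(id = 1, name = "a"),
            data.frame(id = 2, name = "b"))
combined <- do.call(rbind, dfs)   # same as rbind(dfs[[1]], dfs[[2]])
combined$id   # 1 2
```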

How to get the variable name in R function

I would like to use the variable name as a string in a function, but couldn't achieve it.
For example, in one Excel file I have 4 worksheets, so I need to use the following line 4 times:
sales.df<- read_xlsx("abc.xlsx", sheet ="sales")
profit.df<- read_xlsx("abc.xlsx", sheet ="profit")
revenue.df<-read_xlsx("abc.xlsx", sheet ="revenue")
budget.df<- read_xlsx("abc.xlsx", sheet ="budget")
Instead, I want to write a function:
read_func = function(sheet_name){
  sheet_name.df <- read_xlsx("abc.xlsx", sheet = "sheet_name")
  return(sheet_name.df)
}
Then call the function:
read_func(sales)
Unfortunately, it doesn't work. The sheet_name is not dynamically updated.
Thank you in advance for your kind help.
The readxl package has a function excel_sheets() to read all sheets in a file, which you can use with lapply to accomplish the same thing.
library(readxl)
lapply(excel_sheets("abc.xlsx"), read_excel, path = "abc.xlsx")
It is part of the tidyverse, so you can read more on it there.
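If you specifically want to capture a bare variable name as a string inside a function (the literal question in the title), base R's deparse(substitute()) does that. A minimal sketch showing only the name capture (the sheet reading itself is omitted):

```r
# substitute() grabs the unevaluated argument; deparse() turns it into a string
as_name <- function(x) deparse(substitute(x))
as_name(sales)   # "sales" -- works even if `sales` is not defined, thanks to lazy evaluation
```

With that, read_func could be written as function(s) read_xlsx("abc.xlsx", sheet = deparse(substitute(s))), though passing sheet names as plain strings is usually the more robust design.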

Assign read.csv with some set parameters to a name, in order to pass it to a function

I want to read multiple files. To do this I use a generic function read_list
read_list(file_list, read_fun)
Assigning different read functions to the argument read_fun, I can read different kinds of files, i.e. read.csv for csv files, read_dta for Stata files, etc.
Now, I need to read some csv files where the first four lines need to be skipped. Thus, instead of passing read.csv as an argument to read_list, I would like to pass read.csv with the skip argument set to 4. Is it possible to do this in R? I tried
my_read_csv <- function(...){
  read.csv(skip = 4, ...)
}
It seems to work, but I would like to confirm that this is the right way to do it. I think that functions being objects in R is a fantastic and very powerful feature of the language, but I'm not very familiar with R closures and scoping rules, thus I don't want to inadvertently make some big mistake.
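That wrapper is indeed a standard idiom: the ... forwards whatever remaining arguments you supply to read.csv, with skip fixed at 4. You can check it without touching the filesystem by using read.csv's text argument (made-up data):

```r
my_read_csv <- function(...){
  read.csv(skip = 4, ...)
}

# Four junk lines, then a header and one data row
df <- my_read_csv(text = "x\nx\nx\nx\na,b\n1,2")
df$a   # 1
```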
You can simply rewrite your read_list to add the unnamed argument qualifier ... at the end and then replace the call to
read_fun(file) with read_fun(file, ...).
This will allow you to write the following syntax:
read_list(files, read.csv, skip = 4)
which will be equivalent to using your current read_list with a custom read function:
read_list(files, function(file)read.csv(file, skip = 4))
Also, be aware that read_list sounds awfully lot like a "reinvent the wheel" function. If you describe the behaviour of read_list a little more, I can expand.
Possible alternatives may be
read_list <- function(files, read_fun, ...)lapply(files, read_fun, ...)
# in this case read_list is identical to lapply
read_list <- function(files, read_fun, ...)do.call(rbind, lapply(files, read_fun, ...))
# This will rbind() all the files to one data.frame
I'm not sure if read_list is specialized to your specific task in some way but you can use lapply along with read.csv to read a list of files:
# generate fake file names
files <- paste0('file_', 1:10, '.csv')
# Read files using lapply
dfs <- lapply(files, read.csv, skip = 4)
The third argument of lapply is ... which allows you to pass additional arguments to the function you're applying. In this case, we can use ... to pass the skip = 4 argument to read.csv
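A minimal illustration of that ... forwarding, independent of file reading (toy function for the demonstration):

```r
# Extra arguments to lapply are passed through to the applied function
add_k <- function(x, k) x + k
unlist(lapply(1:3, add_k, k = 10))   # 11 12 13
```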

XLConnect - readWorksheet with looping object

I am using R version 3.1.2 in RStudio with the XLConnect package to load, read and write multiple xlsx files. I can do this by duplicating and creating multiple objects, but I am trying to do it using one object (all files are in the same folder). Please see the examples.
I can do this by listing each file, but I want to do it using a loop:
tstA <- loadWorkbook("\\\\FS01\\DEPARTMENTFOLDERS$\\tst\\2015\\Apr\\DeptA.xlsx")
tstB <- loadWorkbook("\\\\FS01\\DEPARTMENTFOLDERS$\\tst\\2015\\Apr\\DeptB.xlsx")
This is the way I'm trying to do it, but I get an error:
dept <- c("DeptA","DeptB","DeptC")
for(dp in 1:length(dept)){
  dept[dp] <- loadWorkbook("\\\\FS01\\DEPARTMENTFOLDERS$\\tst\\2015\\Apr\\",dept[dp],".xlsx")}
After this I want to use the readWorksheet function from XLConnect.
Apologies for the lame question, but I am struggling to work out how best to do this.
Thanks
You can read all the files into a list in one operation as follows (adjust pattern and sheet as needed to get the files/sheets you want):
path = "\\\\FS01\\DEPARTMENTFOLDERS$\\tst\\2015\\Apr\\"
df.list = lapply(list.files(path, pattern="xlsx$"), function(i) {
  readWorksheetFromFile(paste0(path, i), sheet="YourSheetName")
})
If you want to combine all of the data frames into a single data frame, you can do this:
df = do.call(rbind, df.list)
