I want to read a bunch of excel files all located in the same directory and store them in different sheets in a consolidated Excel file.
I am using a combination of xlsx and openxlsx to achieve this. The reason is that openxlsx cannot read .xls files, while xlsx is Java-based and runs out of memory, throwing a GC overhead limit exceeded error, when trying to write large files.
Here is the code I am using:
library(openxlsx)
library(xlsx)
mnth="january"
outputFileName<-"Consolidated.xlsx"
files <- list.files(path="./Original Files", pattern=mnth, full.names=T, recursive=FALSE)
start_row<-1
lapply(files, function(x){
print(x)
xlFile<-read.xlsx2(x, sheetIndex = 1, startRow = 2, header =T) #Reads all columns as factors
#Write to Excel
write.xlsx(xlFile, file=outputFileName, sheetName = mnth, startRow = start_row)
start_row<- start_row + nrow(xlFile)
})
I am trying to read all the files with January (Ex: january2015, january 2016) and append the rows in the same sheet of a Consolidated xlsx file.
However I am getting the error:
Error in write.xlsx(xlFile, file = outputFileName, sheetName = mnth, :
unused argument (startRow = start_row)
The documentation clearly mentions that startRow is an optional parameter. Interestingly, sheetName is also an optional parameter but does not throw errors.
I have run the code with startRow = start_row removed and it behaves as you would expect, i.e., the contents are repeatedly overwritten and only the contents of the last .xls file remain.
UPDATE
I have switched the reading function to one from XLConnect to avoid having several functions with similar names, and I still get the same error:
Error in write.xlsx(xlFile, file = outputFileName, sheetName = placename, :
unused argument (startRow = start_row)
Here is the code with XLConnect:
lapply(files, function(x){
print(x)
xlFile<-readWorksheetFromFile(file = x, sheet=1, startRow=2)
str(xlFile)
l=list(dt,xlFile)
#Write to Excel
write.xlsx(xlFile, file=outputFileName, sheetName = mnth, startRow = start_row)
start_row<- start_row + nrow(xlFile)
})
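Separately from the unused-argument error, the row counter in the code above would not advance even if startRow were accepted: an assignment with <- inside the function passed to lapply only changes a local copy. It may also be worth checking which package's write.xlsx is actually being called, since both xlsx and openxlsx export a function of that name and the one loaded last masks the other. The following is a minimal base-R sketch of the scoping issue, assuming small made-up data frames in place of the real files and ignoring any header rows:

```r
# Hypothetical stand-ins for the data frames read from each Excel file
frames <- list(
  data.frame(a = 1:3),
  data.frame(a = 4:5),
  data.frame(a = 6:9)
)

# Assigning with `<-` inside the function only changes a local copy
start_row <- 1
invisible(lapply(frames, function(df) {
  start_row <- start_row + nrow(df)
}))
start_row  # still 1

# Compute every block's starting row up front instead
row_counts <- sapply(frames, nrow)
start_rows <- 1 + c(0, cumsum(row_counts))[seq_along(frames)]
start_rows  # 1 4 6

# Or avoid row bookkeeping entirely: bind first, write once
combined <- do.call(rbind, frames)
```

Binding everything into one data frame and writing it once is usually simpler than tracking offsets, provided all files share the same columns.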
I am struggling to download an Excel file and then load it into R:
utils::download.file(
url = 'https://servicos.ibama.gov.br/ctf/publico/areasembargadas/downloadListaAreasEmbargadas.php',
destfile = 'C:/users/arthu/Desktop/fines.rar',
mode = "wb"
)
After unzipping and trying to load it into R:
utils::unzip(
zipfile = './fines.rar',
exdir = './ibama_data'
)
dados <- readxl::read_xls(
"./ibama_data/rel_areas_embargadas_0-65000_2020-12-10_080019.xls",
skip = 6,
col_types = c(rep("guess", 13), "date", "guess", "date")
)
I get libxls error: Unable to open file.
If I rename the file to .xlsx as follows and read it with readxl::read_excel, I get an evaluation error saying it is unable to open the file:
file <- file.rename(
from = "./Desktop/ibama_data/rel_areas_embargadas_0-65000_2020-12-10_080019.xls",
to = "./Desktop/ibama_data/test.xlsx"
)
However, if I open the file manually, Excel warns me that the file's extension does not match its type. After saving it as .xlsx, I can finally load it using read_excel.
How can I solve this, given that I want to write a package with a function that downloads such data from the web and then loads it into R?
Edit
The .xls file you are trying to read isn't an Excel document, it's an HTML table.
You could read it using the XML package:
library(XML)
doc <- htmlParse('rel_areas_embargadas_0-65000_2021-01-13_080018.xls')
tableNode <- getNodeSet(doc, '//table')
data <- XML::readHTMLTable(tableNode[[1]])
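# Store the header block (first five rows)
header <- data[1:5, ]
# Store the column names from row 6 as a character vector
col_names <- vapply(data[6, ], as.character, character(1))
# Remove the header block and the column-name row
data <- data[-(1:6), ]
# Set the column names
colnames(data) <- col_names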
head(data)
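For a package function that downloads such files, one option is to check what the file really is before choosing a reader. The sketch below is an assumption-laden illustration, not part of the answer above: sniff_excel_format is a hypothetical helper that compares the first bytes of a file against the OLE2 signature used by genuine .xls files and the ZIP signature used by .xlsx files, and reports anything else (such as an HTML table saved with an .xls extension) as "other".

```r
# Hypothetical helper: classify a file by its leading bytes, not its extension
sniff_excel_format <- function(path) {
  magic <- readBin(path, what = "raw", n = 8)
  ole2 <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))  # legacy .xls
  zip  <- as.raw(c(0x50, 0x4B, 0x03, 0x04))                          # .xlsx container
  if (length(magic) >= 8 && identical(magic, ole2)) return("xls")
  if (length(magic) >= 4 && identical(magic[1:4], zip)) return("xlsx")
  "other"
}

# Demonstration on a temporary file that is really HTML despite its .xls name
tmp <- tempfile(fileext = ".xls")
writeLines("<html><table><tr><td>1</td></tr></table></html>", tmp)
detected <- sniff_excel_format(tmp)
detected  # "other"
```

A caller could then route "other" files to an HTML parser and the remaining cases to a spreadsheet reader.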
I have to compare multiple files (Prod1, Beta1, Prod2, Beta2, etc.) and export the differences, if any, to an Excel sheet. Each difference should go in a separate cell (column C). I am trying the code below with library(xlsx), but I can only store the data in the first cell.
library(xlsx)
for(i in 1:No_of_files){
prod_file_res_name <- sprintf("R/Results/F_Query_Prod_%s.txt", i)
beta_file_res_name <- sprintf("R/Results/F_Query_Beta_%s.txt", i)
if (file.exists(prod_file_res_name) && file.exists(beta_file_res_name))
{
res <- tools::Rdiff(prod_file_res_name, beta_file_res_name, Log = TRUE)
if(res[2] != "character(0)"){
write.xlsx(toString(res[2]), file = "C:/R/diff.xlsx", sheetName = "Sheet1", col.names = FALSE, row.names =FALSE, append = TRUE)
}
else{
com <- "No Difference found"
write.xlsx(com, file = "C:/R/diff.xlsx", sheetName = "ExtractFormulaHistory", col.names = FALSE, row.names =FALSE, append = TRUE)
}
}
else {
print("File doesnt exist")
}
}
Can anyone help me save the differences in column 5 but in different rows (for example, rows 1 to X, where X is the number of files)? Thanks in advance.
The easiest way would be to create a tibble or a list to hold your output, and then, as the last step of your code, write it to Excel using the xlsx library.
Alternatively, you could use writeWorksheet from the XLConnect package to write to an Excel file as you work out the differences.
From documentation of the XLConnect package:
# Load workbook (create if not existing)
wb <- loadWorkbook("writeWorksheet.xlsx", create = TRUE)
# Create a worksheet called 'CO2'
createSheet(wb, name = "CO2")
# Write built-in data set 'CO2' to the worksheet created above;
# offset from the top left corner and with default header = TRUE
writeWorksheet(wb, CO2, sheet = "CO2", startRow = 4, startCol = 2)
# Save workbook (this actually writes the file to disk)
saveWorkbook(wb)
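The first suggestion above (collect the results, then write once) might look roughly like this. It is only a sketch: compare_pair is a hypothetical stand-in for the tools::Rdiff call in the question, and the final write is left commented out because it needs the xlsx package and a real output path.

```r
# Hypothetical stand-in for the per-file comparison; the real loop would
# call tools::Rdiff() on each Prod/Beta pair
compare_pair <- function(i) {
  if (i %% 2 == 0) "No Difference found" else sprintf("difference in file %d", i)
}

no_of_files <- 4
diffs <- character(no_of_files)  # one slot per file pair
for (i in seq_len(no_of_files)) {
  diffs[i] <- compare_pair(i)
}

# One row per file pair, so each result ends up in its own cell
results <- data.frame(
  file_no = seq_len(no_of_files),
  difference = diffs,
  stringsAsFactors = FALSE
)

# Single write at the end (left commented out, see note above):
# xlsx::write.xlsx(results, file = "C:/R/diff.xlsx",
#                  sheetName = "Sheet1", row.names = FALSE)
```

Because the data frame has one row per comparison, a single write places each result on its own row instead of repeatedly overwriting the first cell.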
I am trying to read a value from a .xlsx file using the openxlsx package in R. In simple words, I need to write a row of data, which then populates an output cell that has to be read back into R. I will share an example to better explain the problem.
Initial state of the .xlsx file:
I'm now trying to write new values to the cells A2:A3 = c("c", 5). So ideally, I'm expecting A6 = 15.
Below is the code used :
require(openxlsx)
path <- "C:/path_to_file/for_SO1.xlsx"
input_row <- c("c", 5)
# Load workbook; create if not existing
wb <- loadWorkbook(path)
# createSheet(wb, name = "1")
writeData(wb,
sheet = "Sheet1",
x = data.frame(input_row),
startCol=1,
startRow=1
)
data_IM <- read.xlsx(wb,
sheet = "Sheet1",
rows = c(5,6),
cols = c(1))
# Save workbook
saveWorkbook(wb, file = path, overwrite = TRUE)
#> data_IM
# output_row
#1 3
But I get the initial value (3). However, if I open the .xlsx file, I can see 15 in that cell:
What could be the reason for not being able to read this cell? I tried saving the workbook after writing to it and then reading it again, but even that failed. openxlsx is the only option I have due to Java errors from XLConnect etc.
?read.xlsx
Formulae written using writeFormula to a Workbook object will not get
picked up by read.xlsx(). This is because only the formula is written
and left to be evaluated when the file is opened in Excel. Opening,
saving and closing the file with Excel will resolve this.
So the file needs to be opened in Excel and then saved. I can verify that this works, although it may not be suitable for you.
XLConnect seems to have the desired functionality:
# rjava can run out of memory sometimes, this can help.
options(java.parameters = "-Xmx1G")
library(XLConnect)
file_path = "test.xlsx"
input_row <- c("c", 5)
wb <- loadWorkbook(file_path, create=F)
writeWorksheet(wb, 1, startRow = 1, startCol = 1, data = data.frame(input_row))
setForceFormulaRecalculation(wb, 1, TRUE)
saveWorkbook(wb)
# checking
wb <- loadWorkbook(file_path, create=F)
readWorksheet(wb, 1)
The openxlsx manual (https://cran.r-project.org/web/packages/openxlsx/openxlsx.pdf) says:
Formulae written using writeFormula to a Workbook object will not get picked up by read.xlsx().
This is because only the formula is written and left to be evaluated when the file is opened in Excel.
Opening, saving and closing the file with Excel will resolve this.
So if you are using Windows, save the following VBScript to a file, for example opensaveexcel.vbs:
Set objExcel = CreateObject("Excel.Application")
Set objWorkbook = objExcel.Workbooks.Open("D:\Book2.xlsx")
objWorkbook.Save
objWorkbook.Close
objExcel.Quit
Set objExcel = Nothing
Set objWorkbook = Nothing
Then you can write the R code below. Here cell A4 in Book1.xlsx contains the formula =A3*5:
mywritexlsx(fname="d:/Book1.xlsx",data = 20,startCol = 1,startRow = 3)
file.copy("d:/Book1.xlsx", "d:/Book2.xlsx", overwrite = TRUE)
system("cscript //nologo d:\\opensaveexcel.vbs")
tdt1=read.xlsx(xlsxFile = "d:/Book1.xlsx",sheet = "Sheet1",colNames = FALSE)
tdt2=read.xlsx(xlsxFile = "d:/Book2.xlsx",sheet = "Sheet1",colNames = FALSE)
This works for me. For reference, mywritexlsx is defined as follows:
mywritexlsx<-function(fname="temp.xlsx",sheetname="Sheet1",data,
startCol = 1, startRow = 1, colNames = TRUE, rowNames = FALSE)
{
if(!file.exists(fname))
{
wb = openxlsx::createWorkbook()
sheet = openxlsx::addWorksheet(wb, sheetname)
}
else
{
wb <- openxlsx::loadWorkbook(file =fname)
if(!(sum(openxlsx::getSheetNames(fname)==sheetname)))
sheet = openxlsx::addWorksheet(wb, sheetname)
else
sheet=sheetname
}
openxlsx::writeData(wb,sheet,data,startCol = startCol, startRow = startRow,
colNames = colNames, rowNames = rowNames)
openxlsx::saveWorkbook(wb, fname,overwrite = TRUE)
}
I am trying to read the contents of a score of Excel files into R with XLConnect. This is a simplified version of my code:
# point to a folder
path <- "/path/to/folder"
# get all the Excel files in that folder
files <- list.files(path, pattern = "*.xlsx")
# create an empty data frame
dat <- data.frame(var.1 = character(), var.2 = numeric())
# load XLConnect
library("XLConnect")
# loop over the files
for (i in seq_along(files)) {
# read each Excel file
wb <- loadWorkbook(paste(path, files[i], sep = "/"))
# fill the data frame with data from the Excel file
dat[i, 1:2] <- readWorksheet(wb, "Table1", startRow = 1, startCol = 1, endRow = 2, endCol = 1, header = FALSE)
rm(wb)
}
I can read in a single file when I specify it with loadWorkbook(paste(path, files[1], sep = "/")), but when I loop over the file list with files[i], the code inside the for-loop returns the following error:
Error: InvalidFormatException (Java):
Your InputStream was neither an OLE2 stream, nor an OOXML stream
What am I doing wrong?
The problem had nothing to do with my code.
I had some of the files in that folder open in Excel. When you open a file in Excel, Excel creates an invisible file named "~$filename.xlsx". Since my regular expression searched for files with the suffix ".xlsx", these files were found, too, and since these files are not spreadsheet files, XLConnect couldn't read them and threw an error.
I solved the problem by closing those files in Excel.
Another solution would be to exclude files that begin with a tilde in the regular expression, with something like:
list.files(path, pattern = "^[^~].*\\.xlsx$")
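To see what such a filter keeps and drops without touching the file system, the pattern can be tried on a plain character vector. This is a small sketch with made-up file names; it uses a variant of the pattern with an end anchor ($), which is my addition so that names that merely contain ".xlsx" are rejected as well.

```r
# Made-up file names, including an Excel lock file and two near-misses
candidates <- c("report.xlsx", "~$report.xlsx", "notes.txt", "data.xlsx.bak")

# Keep names that do not start with a tilde and that end in .xlsx
keep <- grepl("^[^~].*\\.xlsx$", candidates)
kept <- candidates[keep]
kept  # "report.xlsx"
```

The same regular expression can be passed as the pattern argument of list.files.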
I want to read a bunch of excel files all located in the same directory and store them in different sheets in a consolidated Excel file.
I initially tried using XLConnect but kept getting the error GC overhead limit exceeded. I stumbled upon this question, which says that it is a common problem with Java-based Excel handling packages such as XLConnect and xlsx. I tried the memory management trick suggested there, but it did not work. One of the comments on the accepted answer suggested using openxlsx, as it is based on Rcpp and hence avoids this particular problem.
My current code is as follows:
library(openxlsx)
mnth="January"
files <- list.files(path="./Original Files", pattern=mnth, full.names=T, recursive=FALSE) #pattern match as multiple files are from the same month
# Read them into a list and write to sheet
wb <- createWorkbook()
lapply(files, function(x){
print(x)
xlFile<-read.xlsx(xlsxFile = x, sheet = 1, startRow = 2, colNames = T) #Also tried
str(xlFile)
#Create a sheet in the new Excel file called Consolidated.xlsx with the month name
#Append current data in sheet
})
The problem I am getting is the error: Error in read.xlsx.default(xlsxFile = x, sheet = 1, startRow = 2, colNames = T) : openxlsx can not read .xls or .xlm files!
I have ensured that the files variable contains all the files of interest (Ex: January 2015.xls, January 2016.xls, etc). I have also ensured that the path to the files is correct and that the Excel files actually exist there.
I have left the code that writes to Excel as a skeleton, as I need to solve the problem with reading the files first.
In case it helps, here is the code attempt with XLConnect
library(XLConnect)
setwd("D:/something/something")
mnth="January"
files <- list.files(path="./Original Files", pattern=mnth, full.names=T, recursive=FALSE)
# Read them into a list
df.list = lapply(files, readWorksheetFromFile, sheet=1, startRow=2)
#combine them into a single data frame and write to disk:
df = do.call(rbind, df.list)
rm(df.list)
outputFileName<-"Consolidated.xlsx"
# Load workbook (create if not existing)
wb <- loadWorkbook(outputFileName, create = TRUE)
createSheet(wb, name = mnth)
writeWorksheet(wb,df,sheet = mnth)
#write.xlsx2(df, outputFileName, sheetName = mnth, col.names = T, row.names = F, append = TRUE)
saveWorkbook(wb)
rm(df)
gc()
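Since openxlsx rejects .xls files, one way to keep a single loop over mixed inputs is to pick the reader from the file extension. The sketch below only decides which function to use and returns its name as a string, so it runs without any Excel package installed; pick_reader is a hypothetical helper, and the mapping (xlsx::read.xlsx2 for .xls, openxlsx::read.xlsx for .xlsx) is just one possible choice based on the packages mentioned in this thread.

```r
# Hypothetical helper: choose a reader by file extension
pick_reader <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    xls  = "xlsx::read.xlsx2",     # legacy format, not supported by openxlsx
    xlsx = "openxlsx::read.xlsx",  # avoids the Java memory overhead
    stop("unsupported extension: ", ext)
  )
}

# Made-up file names in the style of the question
readers <- vapply(
  c("January 2015.xls", "January 2016.xlsx"),
  pick_reader,
  character(1)
)
unname(readers)  # "xlsx::read.xlsx2" "openxlsx::read.xlsx"
```

The extension is only a hint, so a file that fails to open with the chosen reader may still need to be inspected by hand.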