Read Excel file into R with locked cells - r

I have an Excel spreadsheet to read into R, that is both password protected and has locked cells. I can use excel.link to import a password protected file, but I can't figure out how to unlock/unprotect the cells. excel.link gives me this error:
> <checkErrorInfo> 80020009 Error in top_left_corner[["CurrentRegion"]]
> : You cannot use this command on a protected sheet. To use this
> command, you must first unprotect the sheet (Review tab, Changes
> group, Unprotect Sheet button). You may be prompted for a password.
> (Microsoft Excel)
Any advice is welcome. I can manually unprotect the cells, but I have to do this to hundreds of files on a daily basis.
My end goal here is to have the data from the 100s of spreadsheets imported into R for analytics. I do not need to export back into Excel. I also do not need to import the protected cells into R, so if there was a way to skip them that would work.
EDIT: New issue has emerged related to this operation. I get an error in R when I try to do the extraction on a shared workbook:
80020009 Error: Exception occurred.
If I manually go into Excel and unshare the workbook (under Review->Share Workbook->Uncheck Allow changes made by more than one user), the extraction works. Is there a way to do this programmatically with excel.link?

Try the following code:
library(excel.link)
filename = "shared.xlsx"
xl.workbook.open(filename, password = "test")
# here we resave workbook to the temporary folder with exclusive access
new_path = paste0(tempdir(), "\\", filename)
xl()[["Activeworkbook"]]$saveas(new_path, AccessMode=xl.constants$xlExclusive)
###
xl()[["Activesheet"]]$Unprotect(password = "test")
data = crc[a1]
xl.workbook.close()
unlink(new_path) # remove temporary Excel File
UPDATE 2018.07.16 Add code for saving workbook with exclusive access.
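Since the goal is hundreds of files a day, the steps above could be wrapped in a function and applied to a whole folder. The following is only a minimal sketch: it assumes every workbook uses the same password and keeps its data in the region around A1, and it leaves out the exclusive-access resave that shared workbooks need. The names read_protected_workbook and read_all_protected and the folder path are hypothetical; the excel.link calls are the ones used in the answer, and excel.link itself needs Windows with Excel installed.

```r
# Hypothetical helper: open one workbook, unprotect the active sheet,
# read the region around A1, and close the workbook again.
read_protected_workbook <- function(path, password) {
  excel.link::xl.workbook.open(normalizePath(path), password = password)
  on.exit(excel.link::xl.workbook.close(), add = TRUE)
  excel.link::xl()[["Activesheet"]]$Unprotect(password = password)
  excel.link::crc[a1]
}

# Hypothetical driver: read every .xlsx file in a folder into a named list.
# A file that fails is reported with a warning and stored as NULL.
read_all_protected <- function(folder, password) {
  files <- list.files(folder, pattern = "\\.xlsx$", full.names = TRUE)
  result <- lapply(files, function(f) {
    tryCatch(
      read_protected_workbook(f, password),
      error = function(e) {
        warning("Could not read ", f, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  stats::setNames(result, basename(files))
}

# Example call (path and password are placeholders):
# all_data <- read_all_protected("path/to/workbooks", password = "test")
```

Catching errors per file keeps one bad workbook from stopping the whole daily run.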

Related

XLConnect: Error: IllegalArgumentException (Java): Sheet index (-1) is out of range (no sheets)

I am trying to use XLConnect to load in a series of excel workbooks that I have. Using the code:
BASZ <- loadWorkbook("BASZ.xlsx", create = TRUE)
works every time, and gives me a formal class workbook. However when I go to read in the worksheet I wish to use:
data <- readWorksheet("BASZ", sheet = "Sheet1")
I always get the same error:
"Error: IllegalArgumentException (Java): Sheet index (-1) is out of range (no sheets)"
Just yesterday this code worked. I'm new to this and wondering why this continues to occur. Furthermore, it doesn't matter which Excel workbook I try to load; the same error occurs when trying to read in the specific sheet I want to work with. It must be a syntax issue or something I'm doing wrong, right? I fail to understand why it would work, then I close out RStudio, and the next day it won't.
If you have already loaded the excel file using loadWorkbook(), you can use the function readWorksheet() to read individual sheets. You would only use readWorksheetFromFile() if you had not previously loaded the file. So your code should read:
BASZ <- loadWorkbook("BASZ.xlsx", create = TRUE)
data <- readWorksheet(BASZ, sheet = "Sheet1")
Note that in the second line, the first argument is the variable BASZ, not a quoted string.
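Okay, so just in case someone else makes the same mistake as me: you have to be working within the directory your xlsx file is in. Otherwise, because of create = TRUE, loadWorkbook() does not fail on the missing file but quietly creates a new, empty workbook, and an empty workbook has no sheets to read.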
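One way to make this mistake fail loudly is to check for the file first and to load with create = FALSE. This is a small sketch under that assumption; load_existing_workbook is a hypothetical helper name, while loadWorkbook() and readWorksheet() are the XLConnect functions already used above.

```r
# Hypothetical helper: refuse to continue when the file is not where we think it is,
# instead of letting create = TRUE hand back an empty workbook.
load_existing_workbook <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  XLConnect::loadWorkbook(path, create = FALSE)
}

# Example call (file name taken from the question):
# BASZ <- load_existing_workbook("BASZ.xlsx")
# data <- XLConnect::readWorksheet(BASZ, sheet = "Sheet1")
```

The error message includes the working directory, which is usually enough to spot the problem.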

readxl::read_xls returns "libxls error: Unable to open file"

I have multiple .xls files (~100MB each) from which I would like to load multiple sheets into R as data frames. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which run for a very long time (>15 mins) without finishing, so I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per one sheet and it cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline) like XLConnect::loadWorkbook can.
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places, I have seen a recommendation to use the function readxl::read_xls, which seems to be widely recommended for this task and should be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls used above is available here.
Any advice would be appreciated in terms of making the first functions run faster or the read_xls run at all - thank you!
UPDATE:
It seems that some users are able to open the file above using the readxl::read_xls function, while others are not, both on Mac and Windows, using the most up to date versions of R, Rstudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each excel sheet in this way (for example, for sheets "Overall" and "Area"):
install.packages("readxl")
library(readxl)
library(data.table)
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get dt like this (for example, only part of the dataset for the "Area" sheet):
You can just as well use the read_xls function instead of read_excel.
I checked: it also works correctly and is even a little faster, since read_excel is a wrapper over the read_xls and read_xlsx functions from the readxl package.
You can also use the excel_sheets function from the readxl package to list the sheet names of your Excel file, and then read each sheet in turn.
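Reading every sheet of one file could look roughly like this. It is a sketch only: read_all_sheets is a hypothetical helper name, and it assumes all sheets can be read with read_excel's default settings.

```r
# Hypothetical helper: return a named list with one data frame per sheet.
read_all_sheets <- function(path) {
  sheet_names <- readxl::excel_sheets(path)
  sheets <- lapply(sheet_names, function(s) readxl::read_excel(path, sheet = s))
  stats::setNames(sheets, sheet_names)
}

# Example call (file name taken from the question):
# all_sheets <- read_all_sheets("test_file.xls")
# all_sheets[["Area"]]
```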
UPDATE
Benchmarking is done with microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile and readxl::read_excel.
Note that XLConnect is a Java-based solution, so it requires a lot of RAM.
I found that I was unable to open the file with read_xls immediately after downloading it, but if I opened the file in Excel, saved it, and closed it again, then read_xls was able to open it without issue.
My suggested workaround for handling hundreds of files is to build a little C# command line utility that opens, saves, and closes an Excel file. Source code is below, the utility can be compiled with visual studio community edition.
using System.IO;
using Excel = Microsoft.Office.Interop.Excel;
namespace resaver
{
    class Program
    {
        static void Main(string[] args)
        {
            string srcFile = Path.GetFullPath(args[0]);
            Excel.Application excelApplication = new Excel.Application();
            excelApplication.Application.DisplayAlerts = false;
            Excel.Workbook srcworkBook = excelApplication.Workbooks.Open(srcFile);
            srcworkBook.Save();
            srcworkBook.Close();
            excelApplication.Quit();
        }
    }
}
Once compiled, the utility can be called from R using e.g. system2().
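A call from R might look like the sketch below. The executable name resaver.exe and its location are assumptions (the answer only gives the namespace resaver), and resave_workbook is a hypothetical helper name; system2() and shQuote() are base R.

```r
# Hypothetical helper: run the compiled utility on one file and report success.
# `exe` is an assumed path to the compiled program.
resave_workbook <- function(xls_path, exe = "resaver.exe") {
  status <- system2(exe, args = shQuote(normalizePath(xls_path)))
  if (status != 0) {
    warning("Utility exited with status ", status, " for ", xls_path)
  }
  invisible(status == 0)
}

# Example call over many files (folder path is a placeholder):
# files <- list.files("path/to/files", pattern = "\\.xls$", full.names = TRUE)
# ok <- vapply(files, resave_workbook, logical(1))
```

Quoting the path with shQuote() keeps file names with spaces intact when they reach the utility.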
I will propose a different workflow. If you happen to have LibreOffice installed, you can convert your Excel files to csv programmatically. I use Linux, so I do it in bash, but I'm sure it is possible on macOS too.
So open a terminal and navigate to the folder with your excel files and run in terminal:
for i in *.xls
do soffice --headless --convert-to csv "$i"
done
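If you would rather stay in R, the same conversion could be driven with system2(). This is a sketch under a few assumptions: soffice is on the PATH, the helper names csv_path_for and convert_to_csv are hypothetical, and the --outdir option is used to choose where the csv files land. As far as I know, this conversion exports only the first sheet of each workbook, so check that before relying on it for multi-sheet files.

```r
# Hypothetical helper: the csv path LibreOffice is expected to write for each input file.
csv_path_for <- function(xls_files, out_dir) {
  file.path(out_dir, sub("\\.xlsx?$", ".csv", basename(xls_files), ignore.case = TRUE))
}

# Hypothetical helper: convert each file with a headless LibreOffice call.
# `soffice` is assumed to be on the PATH; pass a full path otherwise.
convert_to_csv <- function(xls_files, out_dir, soffice = "soffice") {
  for (f in xls_files) {
    system2(soffice, c("--headless", "--convert-to", "csv",
                       "--outdir", shQuote(out_dir), shQuote(f)))
  }
  csv_path_for(xls_files, out_dir)
}

# Example call (paths are placeholders):
# xls <- list.files("path/to/files", pattern = "\\.xls$", full.names = TRUE)
# csv <- convert_to_csv(xls, out_dir = "path/to/files")
```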
Now in R you can use data.table::fread to read your files with a loop:
Scenario 1: the structure of files is different
If the structure of files is different, then you wouldn't want to rbind them together. You could run in R:
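library(data.table)
# full.names = TRUE returns paths that fread can open from any working directory
files <- dir("path/to/files", pattern = "\\.csv$", full.names = TRUE)
all_files <- list()
for (i in seq_along(files)) {
  # strip the directory and the .csv extension to get the list name
  fileName <- gsub("(^.*/)(.*)(\\.csv$)", "\\2", files[i])
  all_files[[fileName]] <- fread(files[i])
}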
If you want to extract your named elements within the list into the global environment, so that they can be converted into objects, you can use list2env:
list2env(all_files, envir = .GlobalEnv)
Please be aware of two things: First, in the gsub call, the direction of the slash. And second, list2env may overwrite objects in your Global Environment if they have the same name as the named elements within the list.
Scenario 2: the structure of files is the same
In that case it's likely you want to rbind them all together. You could run in R:
library(data.table)
files <- dir("path/to/files", pattern = "\\.csv$", full.names = TRUE)
# rbindlist takes a list of tables, so read every file first and bind once
joined <- rbindlist(lapply(files, fread), fill = TRUE)
On my system, I had to use path.expand.
R> file = "~/blah.xls"
R> read_xls(file)
Error:
filepath: ~/Dropbox/signal/aud/rba/balsheet/data/a03.xls
libxls error: Unable to open file
R> read_xls(path.expand(file)) # fixed
Resaving the file solves the problem easily.
I ran into this problem before too, and got the answer from this discussion.
I then used read_excel() to open the files.
I was seeing a similar error and wanted to share a short-term solution.
library(readxl)
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
MLBpayrolls <- read_excel("MLBpayrolls.xls", sheet = "MLB Payrolls", na = "n/a")
Yields (on some systems in my classroom but not others):
Error: filepath: MLBPayrolls.xls libxls error: Unable to open file
The temporary solution was to paste the URL of the xls file into Firefox and download it via the browser. Once this was done we could run the read_excel line without error.
This was happening today on Windows 10, with R 3.6.2 and R Studio 1.2.5033.
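One possible explanation, which the posts above do not confirm: on Windows, download.file() writes in text mode (mode = "w") by default for an .xls file, and that can alter the bytes of a binary file, whereas a browser download keeps them intact. A sketch of downloading in binary mode follows; download_binary is a hypothetical helper name, and the URL in the example is the one from the answer.

```r
# Hypothetical helper: download a file byte for byte and return the local path.
download_binary <- function(url, destfile) {
  # mode = "wb" writes the bytes unchanged; "w" is text mode
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) {
    stop("Download failed for ", url)
  }
  destfile
}

# Example call:
# path <- download_binary(
#   "https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls",
#   "MLBpayrolls.xls"
# )
# MLBpayrolls <- readxl::read_excel(path, sheet = "MLB Payrolls", na = "n/a")
```

Reusing the returned path in read_excel() also avoids any mismatch between the downloaded file name and the one being read.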
If you downloaded the .xls file from the internet, even MS Excel first opens a prompt asking you to confirm that you trust the source. I am guessing this is the reason R (read_xls) also can't open it: the file is considered unsafe. Save it as an .xlsx file and then use read_xlsx() or read_excel().
Even though this is not a code-based solution, I just changed the file type: instead of xls, I saved the file as csv or xlsx, and then opened it as usual.
It worked for me because, when I opened my xls file, this message popped up: "The file format and extension of 'file.xls' don't match. The file could be corrupted or unsafe..."

XLConnect: saveworkbook update error

The saveWorkbook() function in XLConnect saves the workbook, and the changes and updated calculations are visible in the Excel file but not in R (because the sheet contains a formula that Apache POI does not support).
However, to view the cell I save the file to disk and call it using another function. But when I call the same file again the calculated fields still show the old values. I don't want to save the excel file every time I make a change in the workbook.
Would you know a workaround to be able to call the new values without manually saving excel?
Code -
options(java.parameters = "-Xmx1024m")
library(rJava)
library(XLConnect)
wb = loadWorkbook(file.choose(), create = TRUE)
readWorksheet(wb,16, region = 'D25:D26')
writeWorksheet(wb,-.45,sheet = 16,startRow = 25,startCol = 4)
setForceFormulaRecalculation(wb,sheet = 16, TRUE)
saveWorkbook(wb)
detach("package:XLConnect", unload=TRUE)
detach("package:XLConnectJars", unload=TRUE)
library(xlsx)
y = read.xlsx(file.choose(), sheetIndex = 16)
So the Excel file on the system shows the changes corresponding to the new -.45 value but when I read the file again, the calculated values are the old values and not the new ones. This gets fixed if I save the file manually.
I believe the command you are using is correct but maybe some small modifications would make this work.
I think you could try placing the needed calculations in a different sheet in excel and treat the data you inserted as a dependency for those calculations in the new sheet.
Then read it in as a fresh workbook and call the new sheet. I think that will give you the output you need.
setForceFormulaRecalculation(wb, sheet = "*", TRUE)
I would use this command to force all sheets to recalculate instead.
Hope that helps!

How do you save Excel file and enable cell protection in R?

I have a basic Excel workbook created with the XLSX package. I want to save it as an .xlsx file but lock all columns except for one to protect them from being edited. I'm able to set cell protection to the selected column with the CellProtection() function, but I don't know how to turn password protection on for the worksheet in order to actually make the columns protected.
library(xlsx)
wb = createWorkbook()
s1 = createSheet(wb, "Sheet 1")
addDataFrame(mtcars, s1) #using mtcars as example dataset
cs = CellStyle(wb, cellProtection = CellProtection(locked = FALSE)) # style that unlocks cells
rows <- getRows(s1, rowIndex = 2:101)
cells <- getCells(rows, colIndex = c(2)) # the cells to unlock
lapply(names(cells), function(ii) setCellStyle(cells[[ii]], cs)) # unlock column 2 only; every other column keeps the default locked style
saveWorkbook(wb, "file.xlsx")
When I check the Excel file, the properties of the cells in column 2 say they're unlocked, but then I have to click on "Protect Sheet" and manually enter a password in order to actually lock all the cells.
Is there a way to do this in R and enable worksheet protection?
I have been using @AEF's answer, but today I found out that this can in fact be done in the xlsx package itself:
s1$protectSheet("mypassword")
Of course, call it before you call saveWorkbook.
You can do this directly with apache POI (which is used by xlsx). Just call
.jcall(s1, "V", "protectSheet", "mypassword")
before you call saveWorkbook.
If the sheet is not stored as an object, you can summon the sheet with the getSheet() method within the ".jcall" function:
rJava::.jcall(wb$getSheet("Sheet1"),"V","protectSheet", "MyPassword123")
xlsx::saveWorkbook(wb, "C:/myfilepath")
Also, to provide clarity, the ".jcall" function comes from the "rJava" package. This package must be installed and working properly.
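Putting the question's code and the protectSheet call together gives something like the sketch below. It assumes the default addDataFrame layout (header in row 1, row names in column 1); write_protected_xlsx and its arguments are hypothetical names, while the xlsx functions and the protectSheet call are the ones shown above.

```r
# Hypothetical helper: write a data frame, leave one column editable,
# and protect the rest of the sheet with a password.
write_protected_xlsx <- function(df, path, password, editable_col = 2) {
  wb <- xlsx::createWorkbook()
  sheet <- xlsx::createSheet(wb, "Sheet 1")
  xlsx::addDataFrame(df, sheet)

  # Row 1 holds the header, so the data occupies rows 2..(nrow + 1).
  unlocked <- xlsx::CellStyle(wb, cellProtection = xlsx::CellProtection(locked = FALSE))
  rows <- xlsx::getRows(sheet, rowIndex = seq_len(nrow(df)) + 1)
  cells <- xlsx::getCells(rows, colIndex = editable_col)
  for (cell in cells) {
    xlsx::setCellStyle(cell, unlocked)
  }

  # Protection must be switched on before the workbook is saved.
  sheet$protectSheet(password)
  xlsx::saveWorkbook(wb, path)
  invisible(path)
}

# Example call (file name and password are placeholders):
# write_protected_xlsx(mtcars, "file.xlsx", password = "mypassword", editable_col = 2)
```

Cells are locked by default, so only the editable column needs an explicit style; the lock takes effect once the sheet is protected.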

Import data from excel with HSSF in R

I'm trying to import data from an excel file into R, with the library xlsx. I get the error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod",
cl, : org.apache.poi.EncryptedDocumentException: The supplied
spreadsheet seems to be an Encrypted .xlsx file. It must be decrypted
before use by XSSF, it cannot be used by HSSF
I changed the file from filename.xlsx to filename.xls, but I keep getting the same message
I also tried the advice of this links:
Import password-protected xlsx workbook into R
How to read xlsx file in protect mode to R
but it won't work.
The sheets of my file are protected but not the file itself.
It seems from the xlsx package website that support for password-protected spreadsheets is a feature still being worked on, although a user, Heather, has made a fix.
See https://code.google.com/p/rexcel/issues/detail?id=49
But it is not clear if this extends to protected sheets as well.
Fercho - Can you try other workarounds?
1. Save as csv and use read.csv to get data into R?
2. Save a version of Excel file without protected sheets for your data input?
3. Try other Excel to R programs like XLConnect? This package seems more up to date.
EDIT: Mango Solutions has a comparison of Excel and R tools. openxlsx can handle password protected sheets but is slower than XLConnect.
CODE for 1 Above
' Microsoft for Excel VBA for saving as csv
' First Select your sheet to turn to CSV file and then run code like this
' Save sheet as csv
ThisWorkbook.SaveAs Filename:=strSaveFilename, _
FileFormat:= xlCSV
Workbook.SaveAs Method
' SYNTAX expression .SaveAs(FileName, FileFormat, Password, WriteResPassword, ReadOnlyRecommended, CreateBackup, AccessMode, ConflictResolution, AddToMru, TextCodepage, TextVisualLayout, Local)
Thanks, I finally did it in VBA. It takes a little bit of time, but it works. Here is the VBA code I used:
Sub LoopThroughFiles()
    FolderName = "C:folder with files\"
    If Right(FolderName, 1) <> Application.PathSeparator Then FolderName = FolderName & Application.PathSeparator
    Fname = Dir(FolderName & "*.xls")
    'loop through the files
    Do While Len(Fname)
        With Workbooks.Open(FolderName & Fname)
            Dim ws As Worksheet
            For Each ws In ActiveWorkbook.Worksheets
                On Error Resume Next
                ws.Unprotect Password:="password 1"
                ws.Unprotect Password:="password 2"
                On Error GoTo 0
            Next ws
            For Each w In Application.Workbooks
                w.Save
            Next w
        End With
        ' go to the next file in the folder
        Fname = Dir
    Loop
    Application.Quit
End Sub
I used two passwords to unlock the sheets: I didn't know which password applied to which file, so I tried both on each file.
Thanks again for the help.
