In file.r I ran an extensive analysis based on a huge dataset.
Every time I open the file I just need to load the libraries and everything is ready.
I no longer need to download any of the dataset inputs.
Now I have created an Rmd file, file.rmd, with the same code as file.r to present its findings.
I'm trying to get a preview of how the PDF will look.
The problem is that when I click "Knit to PDF", it starts to download all the packages and datasets again. I have to wait hours to see the effect of small changes in the code.
And there is more:
Some objects created in the R file simply don't work in the Rmd file.
For example, in the R file I wrote:
edx2 <- edx2 %>% mutate(timeRr = yearRating - release)
When I try to run the same code in the Rmd file I get the message:
Error in FUN(X[[i]], ...) : object 'timeRr' not found
Calls: f -> scales_add_defaults -> lapply -> FUN
The same libraries are loaded in both files (.R and .Rmd).
What am I doing wrong?
1) At the end of the data analysis (file.R), save the data you need for the Notebook in a .RDS file.
For example, if you generated 3 results: res1, res2, and res3
results <- list(res1 = res1, res2 = res2, res3 = res3)
saveRDS(results, file = 'results.RDS')
2) Instead of sourcing the analysis script, just read the results in the Notebook (.Rmd):
data <- readRDS('results.RDS')
# Results available for further use in the Notebook
data$res1
data$res2
data$res3
The error you get with edx2 is probably due to the fact that knitting opens a new R session, so objects from your interactive session are not visible there: are you sure that file.R really generates edx2, or is it only available in your current session?
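For example, a minimal setup chunk at the top of file.rmd could look like this (the chunk options and the dplyr library are my assumptions; load whatever your code actually needs):
```{r setup, include = FALSE}
library(dplyr)                  # plus any other libraries the Rmd code needs
data <- readRDS('results.RDS')  # load the precomputed results; nothing is recomputed
```
This way, knitting only reads the small .RDS file instead of re-running the whole analysis.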
I'm using "studio (preview)" from Microsoft Azure Machine Learning to create a pipeline that applies machine learning to a dataset in a blob storage that is connected to our data warehouse.
In the "Designer", an "Exectue R Script" action can be added to the pipeline. I'm using this functionality to execute some of my own machine learning algorithms.
I've got a 'hello world' version of this script working (including using the "script bundle" to load the functions in my own R files). It applies a very simple manipulation (compute the days difference with the date in the date column and 'today'), and stores the output as a new file. Given that the exported file has the correct information, I know that the R script works well.
The script looks like this:
# R version: 3.5.1
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
# The entry point function can contain up to two input arguments:
# Param<medals>: an R DataFrame
# Param<matches>: an R DataFrame
azureml_main <- function(dataframe1, dataframe2) {
    message("STARTING R script run.")
    # If a zip file is connected to the third input port, it is
    # unzipped under "./Script Bundle". This directory is added
    # to sys.path.
    message('Adding functions as source...')
    if (FALSE) {
        # This works...
        source("./Script Bundle/first_function_for_script_bundle.R")
    } else {
        # And this works as well!
        message('Sourcing all available functions...')
        functions_folder <- './Script Bundle'
        list.files(path = functions_folder)  # note: this value is discarded
        # Match only files ending in .R or .r (the original pattern "^.*[Rr]$"
        # would also match any file name merely ending in the letter r)
        list_of_R_functions <- list.files(path = functions_folder, pattern = "\\.[Rr]$",
                                          include.dirs = FALSE, full.names = TRUE)
        for (fun in list_of_R_functions) {
            message(sprintf('Sourcing <%s>...', fun))
            source(fun)
        }
    }
    message('Executing R pipeline...')
    dataframe1 <- calculate_days_difference(dataframe = dataframe1)
    # Return datasets as a Named List
    return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}
And although I do print some messages in the R script, I haven't been able to find the "stdoutlogs" or the "stderrlogs" that should contain these printed messages.
I need the printed messages for 1) information on how the analysis went and, most importantly, 2) debugging in case the code fails.
Now, I have found (in multiple locations) the files "stdoutlogs.txt" and "stderrlogs.txt". These can be found under "Logs" when I click on "Execute R Script" in the "Designer".
I can also find "stdoutlogs.txt" and "stderrlogs.txt" files under "Experiments" when I click on a finished "Run" and then both under the tab "Outputs" and under the tab "Logs".
However... all of these files are empty.
Can anyone tell me how I can print messages from my R Script and help me locate where I can find the printed information?
Can you please click on the "Execute R Script" module and download 70_driver.log? I tried message("STARTING R script run.") in an R sample and could find the output there.
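As general background (base R behavior, not specific to Azure): message() writes to stderr, while cat() and print() write to stdout, so you can choose which log a line ends up in. A minimal sketch inside the entry-point function:
azureml_main <- function(dataframe1, dataframe2) {
    cat("This goes to stdout\n")       # shows up in the stdout log
    message("This goes to stderr")     # shows up in the stderr log
    return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}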
I have multiple .xls (~100MB) files from which I would like to load multiple sheets (from each) into R as data frames. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which always run for a very long time (>15 mins) and never finish, and I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per one sheet and it cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline) like XLConnect::loadWorkbook can.
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places, I have seen the function readxl::read_xls recommended for this task, and it should be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls used above is available here.
Any advice would be appreciated in terms of making the first functions run faster or the read_xls run at all - thank you!
UPDATE:
It seems that some users are able to open the file above using the readxl::read_xls function, while others are not, both on Mac and Windows, using the most up-to-date versions of R, RStudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each Excel sheet in this way (for example, for the sheets "Overall" and "Area"):
install.packages("readxl")
library(readxl)
library(data.table)
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get a data.table like this (screenshot showing part of the "Area" sheet omitted).
Just as well, you can use the read_xls function instead of read_excel.
I checked: it also works correctly and is even a little faster, since read_excel is a wrapper over the read_xls and read_xlsx functions from the readxl package.
Also, you can use excel_sheets function from readxl package to read all sheets of your Excel file.
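For example, a minimal sketch that reads every sheet into a named list (the file name is taken from the question):
library(readxl)
sheets <- excel_sheets("test_file.xls")   # all sheet names in the workbook
all_sheets <- lapply(sheets, function(s) read_excel("test_file.xls", sheet = s))
names(all_sheets) <- sheets               # e.g. all_sheets$Overall, all_sheets$Area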
UPDATE
Benchmarking was done with the microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile, and readxl::read_excel.
Note that XLConnect is a Java-based solution, so it requires a lot of RAM.
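A minimal sketch of such a benchmark (the file and sheet names are my assumptions):
library(microbenchmark)
microbenchmark(
    gdata     = gdata::read.xls("test_file.xls", sheet = "Overall"),
    XLConnect = XLConnect::readWorksheetFromFile("test_file.xls", sheet = "Overall"),
    readxl    = readxl::read_excel("test_file.xls", sheet = "Overall"),
    times = 10
)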
I found that I was unable to open the file with read_xls immediately after downloading it, but if I opened the file in Excel, saved it, and closed it again, then read_xls was able to open it without issue.
My suggested workaround for handling hundreds of files is to build a little C# command-line utility that opens, saves, and closes an Excel file. The source code is below; the utility can be compiled with Visual Studio Community Edition.
using System.IO;
using Excel = Microsoft.Office.Interop.Excel;

namespace resaver
{
    class Program
    {
        static void Main(string[] args)
        {
            // Resolve the file path passed on the command line
            string srcFile = Path.GetFullPath(args[0]);
            // Start Excel invisibly and suppress any dialog prompts
            Excel.Application excelApplication = new Excel.Application();
            excelApplication.Application.DisplayAlerts = false;
            Excel.Workbook srcworkBook = excelApplication.Workbooks.Open(srcFile);
            // Save and close: Excel rewrites the file in a form libxls can read
            srcworkBook.Save();
            srcworkBook.Close();
            excelApplication.Quit();
        }
    }
}
Once compiled, the utility can be called from R using e.g. system2().
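For example, the R side might look like this (the utility name resaver.exe and the folder are my assumptions):
# Re-save every .xls file via the compiled utility before reading it with readxl
files <- list.files("path/to/files", pattern = "\\.xls$", full.names = TRUE)
for (f in files) {
    system2("resaver.exe", args = shQuote(normalizePath(f)))
}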
I will propose a different workflow. If you happen to have LibreOffice installed, you can convert your Excel files to CSV programmatically. I have Linux, so I do it in bash, but it should be possible on macOS as well.
So open a terminal and navigate to the folder with your excel files and run in terminal:
for i in *.xls
do soffice --headless --convert-to csv "$i"
done
Now in R you can use data.table::fread to read your files with a loop:
Scenario 1: the structure of files is different
If the structure of files is different, then you wouldn't want to rbind them together. You could run in R:
files <- dir("path/to/files", pattern = ".csv")
all_files <- list()
for (i in 1:length(files)){
fileName <- gsub("(^.*/)(.*)(.csv$)", "\\2", files[i])
all_files[[fileName]] <- fread(files[i])
}
If you want to extract your named elements within the list into the global environment, so that they can be converted into objects, you can use list2env:
list2env(all_files, envir = .GlobalEnv)
Please be aware that list2env may overwrite objects in your Global Environment if they have the same name as the named elements within the list.
Scenario 2: the structure of files is the same
In that case it's likely you want to rbind them all together. You could run in R:
files <- dir("path/to/files", pattern = ".csv")
joined <- list()
for (i in 1:length(files)){
joined <- rbindlist(joined, fread(files[i]), fill = TRUE)
}
On my system, I had to use path.expand.
R> file = "~/Dropbox/signal/aud/rba/balsheet/data/a03.xls"
R> read_xls(file)
Error:
filepath: ~/Dropbox/signal/aud/rba/balsheet/data/a03.xls
libxls error: Unable to open file
R> read_xls(path.expand(file)) # fixed
Resaving your file can solve your problem easily.
I also ran into this problem before, but I got the answer from this discussion.
I used read_excel() to open those files.
I was seeing a similar error and wanted to share a short-term solution.
library(readxl)
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
MLBpayrolls <- read_excel("MLBpayrolls.xls", sheet = "MLB Payrolls", na = "n/a")
Yields (on some systems in my classroom but not others):
Error:
filepath: MLBPayrolls.xls
libxls error: Unable to open file
The temporary solution was to paste the URL of the xls file into Firefox and download it via the browser. Once this was done we could run the read_excel line without error.
This was happening today on Windows 10, with R 3.6.2 and RStudio 1.2.5033.
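One possible explanation (my assumption, not part of the original answer): on Windows, download.file defaults to text mode for URLs without a known binary extension, which silently corrupts binary .xls files. Forcing binary mode avoids the round trip through the browser:
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls",
              "MLBPayrolls.xls", mode = "wb")  # "wb" = write binary, so the file is not mangled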
If you have downloaded the .xls data from the internet, then even MS Excel will first open a prompt asking you to confirm that you trust the source (screenshot omitted). I am guessing this is the reason R (read_xls) also can't open it, as the file is considered unsafe. Save it as an .xlsx file and then use read_xlsx() or read_excel().
Even though this is not a code-based solution, I just changed the file type. For instance, instead of .xls I saved the file as .csv or .xlsx, then opened it as a regular file.
This worked for me because, when I opened my .xls file, a message popped up: "The file format and extension of 'file.xls' don't match. The file could be corrupted or unsafe..."
I am trying to use the free tier of Amazon Web Services EC2 with Ubuntu and R. I created a simple R file that I hope will read a small CSV input data file from one folder, perform a trivial operation, and write the output to a CSV file in a separate folder. However, the output CSV file is not being created.
Here are the contents of the R file:
my.data <- read.csv('/my_cloud_input_file_test/my_input_test_data_Nov22_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, '/my_cloud_output_file_test/my_output_test_data_Nov22_2019.csv', row.names = FALSE, quote = FALSE)
Here are the contents of the input data file:
a,b
100,12
200,22
300,32
400,42
500,52
Here are the only two lines I used in PuTTY after connecting to the instance:
ubuntu@ip-122-31-22-243:~$ sudo su
root@ip-122-31-22-243:/home/ubuntu# R CMD BATCH Cloud_test_R_file_Nov22_2019.R
The R file is located in the ubuntu folder according to FileZilla, as are my input and output folders.
Can someone please point out my mistake? If I put the R file and the input data set both in the ubuntu folder, then the output data set is created in the ubuntu folder without my having to use a setwd statement (after I modify the read.csv and write.csv statements to remove the input and output folder names). So I am not using a setwd statement here. If I need one, what should it be?
Sorry for such a trivial question.
This code worked:
setwd('/home/ubuntu/')
my.data <- read.csv('retest_input_data/my_input_data_Nov24_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, 'retest_output_data/my_output_data_Nov24_2019.csv', row.names = FALSE, quote = FALSE)
PuTTY line:
ubuntu@ip-122-31-22-243:~$ R CMD BATCH retest_R_file_Nov24_2019.R
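The likely reason (my reading of the question, not stated in the original answer): the original read.csv and write.csv calls used paths starting with "/", so R looked for /my_cloud_input_file_test at the filesystem root rather than under /home/ubuntu. Absolute paths would work too, assuming the folders live in the ubuntu home directory:
my.data <- read.csv('/home/ubuntu/my_cloud_input_file_test/my_input_test_data_Nov22_2019.csv')
write.csv(my.data, '/home/ubuntu/my_cloud_output_file_test/my_output_test_data_Nov22_2019.csv',
          row.names = FALSE, quote = FALSE)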
Is it possible to open (not run using source) an R script for editing from the RStudio console or the R command line?
This is a nice command to edit an R script:
file.edit('foo.R')
You can open a file for editing with edit, e.g.,
edit(file = "test.R")
See help("edit") for more information, in particular regarding different editors.
You could just read it like a text file and do anything you want with it. This can be done using the following syntax
SourceF <- file("Source.R", open = "r")
SourceF_lines <- readLines(SourceF)
To follow with something like:
cat(SourceF_lines, sep = "\n")
## or, on Windows:
writeClipboard(SourceF_lines)
Or replace a specific part:
SourceF_lines[2] <- '## Do not run!!'
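If you then want to persist the change, write the lines back (note this overwrites the original file):
writeLines(SourceF_lines, "Source.R")  # saves the edited lines back to disk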