How to call a parallelized script from the command prompt?

I'm running into this issue and, for the life of me, I can't figure out how to solve it.
Quick summary before the example:
I have several hundred data sets from which I want to create reports every day. To do this efficiently, I parallelized the process with doParallel. From within RStudio the process works fine, but when I try to automate it via Task Scheduler on Windows, I can't get it to work.
The process within RStudio is:
I call a script that sources all of my other scripts. Each individual script has a header section that performs the appropriate package imports, so for instance it would look like:
get_files <- function(){
  path <- get_files.create_path()
  # loop over the files in the directory, sourcing everything that isn't a subdirectory
  for(file in list.files(path)){
    if(!(file.info(paste0(path, file))[['isdir']])){
      source(paste0(path, file))
    }
  }
}

get_files.create_path <- function(){
  return(<path to directory>)
}

#self call
get_files()
This would simply be "Source on Save" and brings everything I need into the .GlobalEnv.
From there, I could simply type parallel_report(), which calls a script that sources another script housing the parallelization of the report generation. There was an issue a while back with calling the parallelization directly (I wonder if this is related?), so I had to put the doParallel code in a script that is not wrapped in a function. That meant it couldn't be brought in by the get_files script, because sourcing it would kick off the report generation every time everything was brought in. Thus, I had to keep it in its own script, saved elsewhere, to be called only when necessary. The parallel_report() function is simply:
parallel_report <- function(){
  source(<path to script>)
}
Then the script that is sourced is the real parallelization script, and would look something like:
library(foreach) # provides foreach() and the %dopar% operator

# register a parallel backend with one core left free
doParallel::registerDoParallel(cl = (parallel::detectCores() - 1))

foreach(name = report.list$names,
        .packages = c('tidyverse', 'knitr', 'lubridate', 'stringr', 'rmarkdown'),
        .export = c('generate_report'),
        .errorhandling = 'remove') %dopar% {
  tryCatch(expr = {
    generate_report(name)
  }, error = function(e){
    error_handler(error = e, caller = paste0("generate report for ", name, " from parallel"), line = 28)
  })
}

doParallel::stopImplicitCluster()
The generate_report function is simply a wrapper around an .Rmd and a render() call:
generate_report <- function(<arguments>){
  #stuff
  generate_report.render(<arguments>)
  #stuff
}

generate_report.render <- function(<arguments>){
  rmarkdown::render(
    paste0(data.information$location, 'report_generator.Rmd'),
    params = list(
      name = name,
      date = date,
      thoughts = thoughts,
      auto = auto),
    output_file = paste0(str_to_upper(stock), '_report_', str_remove_all(date, '-'))
  )
}
So to recap, in RStudio I would simply perform the following:
1 - Source-on-save the script to bring everything in
2 - type parallel_report()
2.a - this directly calls the doParallel-ization of generate_report
2.b - generate_report calls an .Rmd file that houses the required function calls and whatnot to produce the reports
And the process starts and completes without a hitch.
To make this automatic via Task Scheduler, I made a script that Task Scheduler can call, named automatic_caller:
source(<path to the get_files script>) # this brings in all the scripts and data into the global
                                       # environment, just as if it were being done manually

tryCatch(
  expr = {
    parallel_report()
  }, error = function(e){
    error_handler(error = e, caller = "parallel_report from automatic_caller", line = 39)
  })
The error_handler function is just an in-house script used to log errors throughout.
So in the Task Scheduler task I have Rscript.exe as the program and automatic_caller as the argument. Everything within the automatic_caller script works except for the report generation.
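For reference, the scheduled action boils down to a command like the following (the paths and R version here are assumptions; adjust them to your installation):

"C:\Program Files\R\R-4.2.1\bin\Rscript.exe" "C:\path\to\automatic_caller.R"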
The process completes almost immediately, and the only output I get is an error:
"pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available)."
But rmarkdown is within the .packages argument of the doParallel foreach, it is loaded in the scripts that use it explicitly, and in generate_report it is called directly via rmarkdown::render().
So - I am at a complete loss.
Thoughts and suggestions would be completely appreciated.

So pandoc is apparently an executable that converts files from one format to another. RStudio ships with its own pandoc executable, so when running the scripts from RStudio, R knew where to find pandoc when it was required.
From the command prompt, the system did not know to look inside RStudio's installation, so downloading pandoc as a standalone executable gives the system the proper pointer.
Downloaded pandoc and everything works fine.
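Alternatively, instead of installing pandoc separately, you can point the non-RStudio session at RStudio's bundled copy. rmarkdown consults the RSTUDIO_PANDOC environment variable, so a sketch like the following at the top of automatic_caller should work (the install path is an assumption; newer RStudio versions bundle pandoc under a different subfolder):

# Point rmarkdown at RStudio's bundled pandoc (path is an assumption;
# check where pandoc actually lives on your machine)
Sys.setenv(RSTUDIO_PANDOC = "C:/Program Files/RStudio/bin/pandoc")
rmarkdown::pandoc_available()  # should now return TRUE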

Related

using rstudioapi in devtools tests

I'm making a package which contains a function that calls rstudioapi::jobRunScript(), and I would like to be able to write tests for this function that can be run normally by devtools::test(). The package is only intended for use during interactive RStudio sessions.
Here's a minimal reprex:
After calling usethis::create_package() to initialize my package, and then usethis::use_r("rstudio") to create R/rstudio.R, I put:
foo_rstudio <- function(...) {
  script.file <- tempfile()
  write("print('hello')", file = script.file)
  rstudioapi::jobRunScript(
    path = script.file,
    name = "foo",
    importEnv = FALSE,
    exportEnv = "R_GlobalEnv"
  )
}
I then call use_test() to make an accompanying test file, in which I put:
test_that("foo works", {
foo_rstudio()
})
I then run devtools::test() and get an error.
I think I understand the basic problem here: devtools runs a separate R session for the tests, and that session doesn't have access to RStudio. I see here that rstudioapi can work inside child R sessions, but seemingly only those "normally launched by RStudio."
I'd really like to use devtools to test my function as I develop it. I suppose I could modify my function to accept an argument from the test code that simply runs the job in the R session itself, or in some other kind of child R process, instead of as an RStudio job. But then I'm not actually testing the normal intended functionality, and if there's an issue specific to the rstudioapi::jobRunScript() call that could occur during normal use, my tests wouldn't be able to pick it up.
Is there a way to initialize an RStudio process from within a devtools::test() session, or some other solution here?
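No answer was recorded for this question, but a common pattern (my suggestion, not from the thread) is to skip rather than fail such tests outside RStudio, since rstudioapi::isAvailable() reports whether an RStudio session is attached:

test_that("foo works", {
  # Skip instead of fail when no RStudio session is attached
  # (e.g. under devtools::test() in a plain R session, or on CI)
  testthat::skip_if_not(rstudioapi::isAvailable(), "RStudio not running")
  foo_rstudio()
})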

How can I make R's output more verbose so as to reassure me that it hasn't broken yet?

I often run code that eats up a lot of RAM and may take as much as an hour before it gives its outputs. Often, I'll be half an hour into running such code and I'll be worrying that something has gone wrong. Is there any way that I can get R to reassure me that there have not been any errors yet? I suppose that I could put milestones into the code itself, but I'm wondering if there's anything in R (or RStudio) that can automatically do this job at run time. For example, it would be handy to see how much memory the code is using, because then I'd be reassured that it's still working whenever I see the memory use significantly vary.
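For the milestone idea mentioned above, a minimal sketch (the helper name is my own) that prints a timestamped checkpoint with approximate memory use might look like this:

# Hypothetical helper: print a timestamped checkpoint with approximate
# memory use. gc() triggers a garbage collection and returns a matrix
# whose second column holds the Mb currently in use.
log_milestone <- function(label) {
  mem_mb <- sum(gc()[, 2])
  message(sprintf("[%s] %s - approx. %.0f Mb in use",
                  format(Sys.time(), "%H:%M:%S"), label, mem_mb))
}

log_milestone("starting model fit")
# ... long-running step ...
log_milestone("model fit done")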
You might like my package {boomer}.
If you rig() your function, all its calls will be exploded and printed as the code is executed.
For instance
# remotes::install_github("moodymudskipper/boomer")
fun <- function(x) {
  x <- x + 1
  Sys.sleep(3)
  x + 1
}

library(boomer)

# rig() the function and all the calls will be exploded
# and displayed as they're run
rig(fun)(2)
One way is:
to make a standalone file containing all the stuff to be run,
to source it and get warned when the code is done, possibly with an error.
The small function warn_me below:
runs the source file located at "path"
catches an error, if one occurred
plays a sound when the run is over
sends an email reporting the status of the run: OK or fail
optionally: keeps playing a sound until you stop it, so you can't miss that it's over
And here it is:
warn_me = function(path, annoying = FALSE){
  # The run
  info = try(source(path))

  # Sound telling it's over
  library(beepr)
  beep()

  # Send an email with status
  library(mailR)
  msg = if(inherits(info, "try-error")) "The run failed" else "It's over, all went well"
  send.mail(from = "me@somewhere.com",
            to = "me@somewhere.com",
            subject = msg,
            body = "All is in the title.",
            smtp = list(host.name = "smtp.mailtrap.io", port = 25,
                        user.name = "********",
                        passwd = "******", ssl = TRUE),
            authenticate = TRUE,
            send = TRUE)

  if(annoying){
    while(TRUE){
      beepr::beep()
      Sys.sleep(1)
    }
  }
}

warn_me(path)
I didn't test the package mailR myself, but any email sending package would do. See this excellent page on sending emails in R for alternatives.
If you are running an R script file within RStudio, use the "Source with Echo" selection (Ctrl+Shift+Enter, or via dropdown).

Avoid pauses due to readline() while testing

I am running tests in R using the test_dir() function from the testthat package. In some of the test scripts there are functions that call readline(), which - in interactive mode - causes the testing to pause and wait for user input. The functions calling readline() are not my own and I don't have any influence on them. The user input is irrelevant for the output of those functions.
Is there a way to avoid these pauses during testing?
Approaches that come to mind, but which I wouldn't know how to implement:
disable interactive mode while R is running
use another function from the testthat package that runs scripts in non-interactive mode
somehow divert stdin to something other than the terminal (??)
wrap the functions calling readline() in another script that is called in non-interactive mode from my testing script and makes the results available
Testing only from the command line using Rscript is an option, but I'd rather stay in the RStudio workflow.
======
Example Code
with_pause <- function () {
  readline()
  2
}

without_pause <- function () {
  2
}

expect_equal(with_pause(), without_pause())
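One relevant fact about base R (not stated in the question): in a non-interactive session readline() does not pause at all; it immediately returns "". So the command-line route the asker mentions works without any code changes, e.g. (the test directory path is an assumption):

Rscript -e 'testthat::test_dir("tests/testthat")'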
I have a similar problem. I solved it with a global option setting.
original_test_mode <- getOption('my_package.test_mode')
options('my_package.test_mode' = TRUE)
# ... some tests ...
options('my_package.test_mode' = original_test_mode)
In my scripts I have an if statement:
if(getOption('my_package.test_mode', FALSE)) {
  # This happens in test mode
  my_value <- 5
} else {
  # normal processing
  my_value <- readline('please write value: ')
}
Not the nicest way either, but it works for me.
Maybe one more hint: it happened to me that my test script failed. The problem is that the global option then stays TRUE, so on the next round, and also when executing the script in the same session, it will never prompt you to write a value. I guess I should wrap this in a tryCatch or so. But if you keep this problem in mind, an occasional options('my_package.test_mode' = NULL) helps :-)
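A sketch of the cleanup alluded to above (my own wording, not from the answer): options() returns the previous values, so restoring them with on.exit() guarantees the flag is reset even when a test errors:

run_tests_in_test_mode <- function() {
  old <- options('my_package.test_mode' = TRUE)  # options() returns the old values
  on.exit(options(old), add = TRUE)              # restored even if tests error
  testthat::test_dir("tests/testthat")
}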

testthat error on check() but not on test() because of ~/.Rprofile?

EDIT:
Is it possible that ~/.Rprofile is not loaded within check()? It looks like my whole process fails because ~/.Rprofile is not loaded.
DONE EDIT
I have a strange problem on automated testing with testthat. Actually, when I test my package with test() everything works fine. But when I test with check() I get an error message.
The error message says:
1. Failure (at test_DML_create_folder_start_MQ_script.R#43): DML create folder start MQ Script works with "../DML_IC_MQ_DATA/dummy_data" data
capture.output(messages <- source(basename(script_file))) threw an error
Error in sprintf("%s folder got created for each raw file.", subfolder_prefix) :
object 'subfolder_prefix' not found
Before this error I source a script which defines the subfolder_prefix variable, and I guess this is why it works in the test() case. But I expected to get this running within the check() function as well.
I will post the complete test script here; I hope it is not too complicated:
library(testthat)

context("testing DML create folder and start MQ script")

test_dir <- 'dml_ic_mq_test'
start_dir <- getwd()

# list of test file folders
data_folders <- list.dirs('../DML_IC_MQ_DATA', recursive=FALSE)

for(folder in data_folders) { # for each folder with test files
  dir.create(test_dir)
  setwd(test_dir)

  script_file <- a.DML_prepare_IC.script(dbg_level=Inf) # returns filename I will source

  test_that(sprintf('we could copy all files from "%s".', folder), {
    expect_that(
      all(file.copy(list.files(file.path('..', folder), full.names=TRUE),
                    '.',
                    recursive=TRUE)),
      is_true())
  })

  test_that(sprintf('DML create folder start MQ Script works with "%s" data', folder), {
    expect_that(capture.output(messages <- source(basename(script_file))),
                not(throws_error()))
  })

  count_rawfiles <- length(list.files(pattern='.raw$'))
  created_folders <- list.dirs(recursive=FALSE)

  test_that(sprintf('%s folder got created for each raw file.', subfolder_prefix), {
    expect_equal(length(grep(subfolder_prefix, created_folders)),
                 count_rawfiles)
  })

  setwd(start_dir)
  unlink(test_dir, recursive=TRUE)
}
In my script I define the variable subfolder_prefix <- 'IC_', and within the test I check that the same number of folders is created as there are raw files... This is what my script should do...
So as I said, I am not sure how to debug this problem, since test() works but check() fails during the testthat run.
Now that I know to look in devtools, we can find the answer. Per the docs, check "automatically builds and checks a source package, using all known best practices". That includes ignoring ~/.Rprofile. It looks like check calls build, and all of that work is done in a separate (clean) R session. In contrast, test appears to use your currently running session (in a new environment).
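One way to make such tests independent of ~/.Rprofile (my suggestion, not part of the original answer): testthat sources any tests/testthat/helper-*.R files before running the tests, under both test() and check(), so setup such as subfolder_prefix can live there:

# tests/testthat/helper-setup.R  (hypothetical file name)
# testthat sources helper files before the tests run, in both test() and check()
subfolder_prefix <- 'IC_'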

Different results from Rscript and R CMD BATCH

I have an inconsistency issue which I cannot explain when running an R script. I am not able to produce a reproducible example because there is a whole set of files/functions called by the entry script.
Using Rscript or RStudio with R v3.1.2, I obtain the results I'm expecting; however, when calling R CMD BATCH from bash, my script does not produce identical output. From bash, R seems to read the command-line arguments correctly and reports them from the script, BUT only the Rscript and RStudio source methods actually use the parameter correctly in my code.
The 2 command line calls are as follows:
Rscript ./script/forecast_category_script.R "category='razors'" "cores=4L"
R CMD BATCH --no-save "--args category='razors' cores=4L" ./script/forecast_category_script.R ~/data/output/out.out
Is there any obvious reason why these inconsistencies might be occurring? I'd prefer to use R CMD BATCH, as it redirects output to a file, and when I migrate my code to the university cluster as a batch job through the scheduler, I'd like to be able to follow what it has done.
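For context, both invocations expose those arguments through commandArgs(); a common pattern (an assumption about the asker's setup, not shown in the question) for consuming arguments written as R expressions is:

args <- commandArgs(trailingOnly = TRUE)  # e.g. "category='razors'" "cores=4L"
for (a in args) eval(parse(text = a))     # defines `category` and `cores`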
UPDATE: changing this line resolves it but why?
Previously I had the following line in there, basically so that when I was testing I didn't keep reloading the huge dataset if it was already loaded in my RStudio environment:
if(!exists("spi")) spi = f_load.spi(category = category)
Replaced it with this:
spi = f_load.spi(category = category)
The underlying function f_load.spi remained the same, however:
f_load.spi = function(spi = NULL, category = "razors", n = NULL) {
  # check if the data is pre-loaded
  if (is.null(spi)) {
    fil = paste0(pth.data.storage, "categories/", category, "/", category, ".sp_ss.interp.rds")
    print(fil)
    spi = readRDS(fil)
  }

  # subset to a specific set of items
  if (!is.null(n)) {
    fc.items = unique(spi$fc.item)
    rnd = sample(1:length(fc.items), n)
    spi = spi[fc.item %in% fc.items[rnd]]
  }

  spi
}
For some reason the category variable was not being passed through properly into the function, and it was loading a different category (beer rather than razors), an enormous file not suitable for testing.
This still doesn't explain why Rscript and R CMD BATCH behaved differently.
It is possible that one of them is loading a previously saved workspace and using global variables. In fact, R CMD BATCH runs with --restore --save by default, so it restores any .RData saved in the working directory, and the if(!exists("spi")) guard would then pick up a stale spi; Rscript does not restore a workspace. Have you checked whether it matters which directory you are in, or whether there are any .RData or .Rhistory files present? One way to ensure that you don't have any hidden variables is to clear the workspace at the beginning of each script, for example rm(list=ls()) as the first line of your Rscript.
Also, you can pipe output to a file with Rscript using sink().
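A minimal sketch of the sink() approach (the file name is an assumption), which approximates the .Rout file that R CMD BATCH produces:

out <- file("out.log", open = "wt")
sink(out, type = "output")    # redirect printed output to the file
sink(out, type = "message")   # also capture messages and warnings
# ... script body ...
sink(type = "message")        # end the message diversion
sink()                        # end the output diversion
close(out)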
