I am running into several distinct problems while trying to speed up some automated report generation for a large dataset. I'm using R + R Markdown -> HTML to generate the reports, looping over ~10K distinct groupings and pulling the data from Oracle.
The system consists mainly of two parts:
a main script
a markdown template file
The main script sets up the computing environment and parallel processing backends:
library(ROracle)
library(doParallel)  ## ...etc.
## ....
cl <- makeCluster(4)
registerDoParallel(cl)
clusterEvalQ(cl, {
  library(ROracle)
  drv <- dbDriver("Oracle")
  con <- dbConnect(drv, un, pw)  ## pseudocode: un/pw are the DB credentials
})
This is where the first issue arises: R throws an exception stating that the connections on the workers are invalid, BUT when I monitor the live sessions on the Oracle side they appear to be fine...
Next, the main script calls the loop for report generation.
foreach(i = 1:nrow(reportgroups), .packages = c('ROracle', 'ggplot2', 'knitr')) %dopar% {  ## ...etc.
  rmarkdown::render("inputfile.Rmd",
                    output_file = paste0("outputfile_", i, ".html"),  # one output per group
                    params = list(groupParam1 = groupParam1[i],
                                  groupParam2 = groupParam2[i]))      ## ...etc.
}
If I run the foreach loop sequentially, i.e., %do% instead of %dopar%, everything seems to work fine. No errors, and the entire set runs correctly (I have only tested up to ~400 groups; I will do a full run of all 10k overnight).
However, if I attempt to run the loop in parallel, 'pandoc' invariably throws error #1 while converting the file. If I run the broken loop multiple times, the 'task' in the loop (or cluster; I'm not sure which 'task' refers to in this context) that causes the error changes.
The template file is pretty basic: it takes in parameters for the group, runs an SQL query on the connection defined for the cluster worker, and uses ggplot2 + dplyr to generate the results. Since the template seems to work when not run through the cluster, I believe the problem must have something to do with the ROracle connection objects on the cluster nodes, although I don't know enough about the subject to really pinpoint the problem.
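For reference, a minimal sketch of what such a parameterized template might look like; the parameter names, query, and plot below are illustrative stand-ins rather than the actual template, and 'con' is assumed to be the connection created on the worker:

---
title: "Group report"
output: html_document
params:
  groupParam1: NA
  groupParam2: NA
---

```{r report, message=FALSE}
library(ROracle)
library(dplyr)
library(ggplot2)

## 'con' is assumed to already exist on this worker (created via clusterEvalQ)
dat <- dbGetQuery(con, paste0(
  "select * from mydata where group1 = '", params$groupParam1,
  "' and group2 = '", params$groupParam2, "'"))

dat %>%
  filter(!is.na(value)) %>%              # illustrative dplyr step
  ggplot(aes(x = somevar, y = value)) +  # illustrative column names
  geom_line()
```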
If anyone has had a similar experience, or has a hunch about what is going on, any advice would be appreciated!
Let me know if I can clarify anything...
Thanks
The problem has been solved using a variety of hacks. First, the main loop was re-written to pull the data into the local R session instead of running SQL within the markdown report. This seemed to cure some of the collisions between connections to the database, but did not fix the problem entirely. So I added tryCatch() and repeat() functionality to the functions which query data or attempt to connect to the DB. I highly recommend that anyone having issues with ROracle on a cluster implements something similar, and spends the time to review the error messages to see what exactly is going on.
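As an illustration of that change, here is a rough sketch of the kind of retry wrapper I mean; the function name, arguments, and back-off below are illustrative, not the exact code used:

## Illustrative retry wrapper: keep trying to connect and query until it
## succeeds or a maximum number of attempts is reached.
get_data_with_retry <- function(drv, un, pw, sql, max_tries = 5, wait = 5) {
  attempt <- 0
  repeat {
    attempt <- attempt + 1
    result <- tryCatch({
      con <- dbConnect(drv, un, pw)
      dat <- dbGetQuery(con, sql)
      dbDisconnect(con)        # close the connection as soon as the data is local
      dat
    }, error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result) || attempt >= max_tries) return(result)
    Sys.sleep(wait)            # short pause before retrying
  }
}

The worker can then pass the resulting data frame into rmarkdown::render() via params instead of querying inside the template.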
Next, the issues with the pandoc conversion. Most of the problems were solved by fixing a join to a deprecated table in the SQL (the old table lacked data for some groups, so the query returned no rows and the report would not generate). However, some issues remained where rmarkdown::render() would run successfully but the actual output would be empty or broken. I have no idea what caused this, but I solved it by comparing the file sizes of the generated reports (empty ones were ~300 KB, completed ones 400 KB+) and re-running the report generation for those files on a single machine after the cluster was shut down. This cleaned up the last of the issues.
In summary:
Previously: reports generated, but with significant issues and incomplete data within.
Fixes:
For the cluster, make multiple attempts to connect to the DB in case of an issue during the run, and don't forget to close the connection after grabbing the data. In ROracle, creating the driver, i.e.,
driver <- dbDriver("Oracle")
is the most time-consuming part of connecting to a database, but the driver can be reused across multiple connections in a loop, e.g.,
## Create the driver outside the loop, and reuse it inside
for(i in 1:n){
  con <- dbConnect(driver, 'username', 'password')
  data <- dbGetQuery(con, 'Select * from mydata')
  dbDisconnect(con)
  ## ... do something with data
}
is much faster than calling
dbConnect(dbDriver("Oracle"), 'username', 'password')
inside the loop
Wrap connection / SQL attempts inside a function that implements tryCatch and repeat functionality in case of errors (along the lines of the retry wrapper sketched above).
Wrap calls to rmarkdown::render() in tryCatch and repeat functionality, and log the status, file size, file name, file location, etc. (see the sketch after this list).
Load in the logfile created from the rmarkdown::render() calls above and find outlier file sizes (in my case, simply Z-scores of the file size, filtering those with Z < -3) to identify reports where issues still exist. Do something to fix them and re-run on a single worker.
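A sketch of the render wrapper and the log-based check from the last two points; the function name, log layout, and log file name are illustrative, not the original code:

## Render one report with retries, then append status and file size to a log.
render_with_log <- function(input, output_file, params, logfile, max_tries = 3) {
  attempt <- 0
  status <- "failed"
  repeat {
    attempt <- attempt + 1
    ok <- tryCatch({
      rmarkdown::render(input, output_file = output_file, params = params,
                        envir = new.env())
      TRUE
    }, error = function(e) FALSE)
    if (ok) { status <- "ok"; break }
    if (attempt >= max_tries) break
  }
  size <- if (file.exists(output_file)) file.info(output_file)$size else NA
  write.table(data.frame(file = output_file, status = status, size = size),
              file = logfile, append = TRUE, sep = ",",
              col.names = FALSE, row.names = FALSE)
}

## Afterwards, flag reports whose file size is a low outlier (Z-score < -3) or
## that failed outright, and re-run those on a single worker.
render_log <- read.csv("render_log.csv", header = FALSE,
                       col.names = c("file", "status", "size"))
bad <- render_log$file[scale(render_log$size) < -3 | render_log$status != "ok"]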
Overall, ~4500 reports were generated from a subset of my data; the time on a 4-core i5 @ 2.8 GHz is around 1 hour. The total number of 'bad' reports generated is around 1% of the total.
Related
I'm not an R expert and I'm not very good at the English language, so keep that in mind when you respond to my question.
I'm trying to automate some R script execution. My ultimate goal is to automate the execution of a script that queries some data from the Binance exchange API, exports those data into an external .csv file (or updates the file with new observations if it already exists) and then imports the file into the RStudio Global Environment, so that I always have up-to-date data ready to be analyzed.
Searching the web, I learned that I can automate tasks using the Windows Task Scheduler, so I downloaded the taskscheduleR package to speed up the process. It worked quite well, in fact; however, I only managed to automate 2 of the 3 tasks I mentioned above:
Query from API
Export data into .csv
So, using the task scheduler, I can periodically query the web and export data / update existing datasets. However, I'm struggling with the 3rd task: I can't figure out how to automatically and periodically import the data into the RStudio Global Environment.
In order to simplify my question, I'll use a very basic line of code as an example. Imagine I want to automate a script named "rnorm.R":
x <- rnorm(10)
I want the output of rnorm() to be stored in x and then loaded into the Global Environment, and I want this line of code to run every 2 minutes, so that the x values also change every 2 minutes while I'm working in RStudio. I tried many times with different methods.
First I tried with the taskscheduleR package, using the following code:
require(taskscheduleR)
require(lubridate)
taskscheduler_create(taskname = "rnorm", rscript = "mydir/rnorm.R",
                     starttime = format(ceiling_date(Sys.time(), unit = "mins"), "%H:%M"),
                     schedule = "MINUTE", modifier = 2)
Then, I tried scheduling it manually with the Windows Task Scheduler.
Lastly, I tried scheduling a batch file:
@echo off
Rscript.exe mydir/rnorm.R
None of the 3 methods worked. Or rather, the scheduling works: I see a command-line window appear each time the Windows Task Scheduler executes the script. However, it does nothing; the x variable isn't loaded into the Global Environment. What am I doing wrong? I'm sure this problem has a very simple and stupid answer, but I can't figure out what it may be. Thanks for your answers.
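For context, the scheduled script runs in its own R process, so nothing it computes ever lands in the interactive session's Global Environment on its own; the result has to be written to disk by the scheduled script and read back from the interactive session. A minimal sketch of that write-then-read pattern (the .rds path is illustrative):

## rnorm.R -- executed by the task scheduler in a separate R process
x <- rnorm(10)
saveRDS(x, "mydir/x.rds")    # persist the result before the scheduled process exits

## In the interactive RStudio session, load the latest result when needed
x <- readRDS("mydir/x.rds")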
I rewrote my program many times to not hit any memory limits. It again takes up the full VIRT, which does not make any sense to me. I do not save any objects; I write to disk each time I am done with a calculation.
The code (simplified) looks like
lapply(foNames,  # these are just folder names, like c("~/datasets/xyz", "~/datasets/xyy")
       function(foName){
         Filepath <- file.path(foName, "somefile.rds")
         CleanDataObject <- readRDS(Filepath)  # reads the data

         cl <- makeCluster(CONF$CORES2USE)  # spins up a cluster (it does not matter whether I use the cluster or not; the problem is independent, imho)

         mclapply(c(1:noOfDataSets2Generate), function(x, CleanDataObject){
           bootstrapper(CleanDataObject)
         }, CleanDataObject)

         stopCluster(cl)
       })
The bootstrap function simply samples the data and saves the sampled data to disk.
bootstrapper <- function(CleanDataObject){
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
  saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))
  return(newCPADataObject)
}
I do not get how this can accumulate to over 60 GB of RAM. The code is highly simplified, but imho there is nothing else that could be problematic. I can paste more code details if needed.
How does R manage to successively eat up my memory, even though I have already re-written the software to store the generated objects on disk?
I have had this problem with loops in the past. It is more complicated to address in functions and apply.
But what I have done is use two things in combination to fix the problem.
Within each function that generates temporary objects, use rm() on the temporary object and then run gc(), which forces a garbage collection, before exiting the function. This will slow the process somewhat, but reduces memory pressure. This way each iteration of apply will purge memory before moving on to the next step. You may have to go back to your first function in nested functions to accomplish this well. It takes experimentation to figure out where the system is getting backed up.
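A minimal sketch of that pattern, with illustrative object and function names (not from the original code):

## Inside each function that builds a large temporary object: remove it and
## force a garbage collection before returning.
process_one <- function(foName) {
  big_tmp <- readRDS(file.path(foName, "somefile.rds"))  # large temporary object
  result  <- summary(big_tmp)                            # stand-in for the real work
  saveRDS(result, file.path(foName, "result.rds"))
  rm(big_tmp)   # drop the temporary object explicitly
  gc()          # force a garbage collection before leaving the function
  invisible(result)
}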
I find this to be especially necessary if you use ANY methods from packages built on top of rJava. They are extremely wasteful of resources, R has no way of running garbage collection on the Java heap, and most authors of Java packages do not seem to account for the need to collect in their methods.
I found lots of questions here asking how to deal with "cannot allocate vector of size **" and tried the suggestions, but I still can't find out why RStudio crashes every time.
I'm using 64-bit R on Windows 10, and my memory.limit() is 16287.
I'm working with a bunch of large data files (mass spectra) that take up 6-7 GB of memory each, so I've been loading individual files one at a time and saving each as a variable with the XCMS package, like below.
msdata <- xcmsRaw(datafile1, profstep = 0.01, profmethod = "bin", profparam = list(),
                  includeMSn = FALSE, mslevel = NULL, scanrange = NULL)
I do a series of additional operations to clean up the data and make some plots using rawEIC (also in the XCMS package), which increases my memory.size() to 7738.28. Then I remove all the variables I created that are saved in my global environment using rm(list=ls()). But when I try to load a new file, it tells me it cannot allocate a vector of size ** Gb. With the empty environment, my memory.size() is 419.32, and I also checked with gc() to confirm that the used memory (on the Vcells row) is of the same order as when I first open a new R session.
I couldn't find any information on why R still thinks that something is taking up a bunch of memory when the environment is completely empty. But if I terminate the session and reopen the program, I can import the data file. I just have to re-open the session every single time one data file is done processing, which is getting really annoying. Does anyone have suggestions on this issue?
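For reference, the per-file workflow described above written out as a loop with explicit cleanup; the file names are illustrative, and the xcmsRaw() arguments are the ones shown earlier:

library(xcms)

datafiles <- c("sample1.mzXML", "sample2.mzXML")   # illustrative file names

for (f in datafiles) {
  msdata <- xcmsRaw(f, profstep = 0.01, profmethod = "bin", profparam = list(),
                    includeMSn = FALSE, mslevel = NULL, scanrange = NULL)
  ## ... clean up the data, make plots with rawEIC(), save results to disk ...
  rm(msdata)   # drop the large object before loading the next file
  gc()         # run a garbage collection before the next iteration
}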
I want to know if there is a way to run R code (train, mutate, search, ...) in the background, without needing to wait for the execution to end or to manually transfer the related data to a new session.
Opening multiple tabs in RStudio, or running multiple sessions in a Jupyter notebook from localhost:8888, localhost:8889, localhost:8890, localhost:8891, etc., is another crude way.
Be mindful of your system's compute strengths and limitations.
I'm wondering if anyone else has ever encountered this problem. I'm writing a fairly small amount of data to a csv file. It's about 30 lines, 50 times.
I'm using a for loop to write data to the file.
It seems "finicky": sometimes the operation completes successfully, and other times it stops after the first ten writes (300 lines), other times after 3, or 5... telling me
"cannot open connection".
I imagine it is some type of timeout. Is there a way to tell R to "slow down" when writing tables?
Before you ask: there's just too much code to provide an example here.
Code would help, despite your objections. R has a fixed-size connection pool and I suspect you are running out of connections.
So make sure you follow the three-step pattern of:
open the connection (and check for errors as a bonus)
write using the connection
close the connection
I can't reproduce it with R 2.11.1 32-bit on Windows 7 64-bit. For these kinds of things, please provide more info on your system (see e.g. ?R.version, ?Sys.info).
Memory is a lot faster than disk access. 1500 lines are pretty manageable in memory and can be written to a file in one go. If it's different sets of data, add an extra factor variable indicating the set (set1 to set50). All your data is easily manageable in one dataframe and you avoid having to access the disk many times.
In case it really has to be 50 separate writes, this code illustrates the valuable advice of Dirk:
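A quick sketch of that idea, with placeholder data (myData here is a stand-in for one set of ~30 lines):

## Build all 50 sets in memory with an extra variable marking the set, then write once.
all_data <- do.call(rbind, lapply(1:50, function(i) {
  myData <- data.frame(value = rnorm(30))     # stand-in for one set of ~30 lines
  myData$set <- paste("set", i, sep = "")     # extra variable indicating the set
  myData
}))
write.table(all_data, file = "C:/Mydir/Myfile.txt", row.names = FALSE)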
for(i in 1:50){
  ## ... construct myData for this set ...
  ff <- file("C:/Mydir/Myfile.txt", open = "at")
  write.table(myData, file = ff)
  close(ff)
}
See also the help: ?file
EDIT: you should use open="at" instead of open="wt". "at" is append mode; "wt" is write mode. append=TRUE is the same as open="at".