I am currently using the sbrl() function from the sbrl library. The function does the job of any supervised statistical learning algorithm: it takes data, and generates a predictive model.
I have a memory leak issue when using it.
If I run the function in a loop, my RAM will get filled more and more, although I am always pointing to the same object.
Eventually, my computer will reach the RAM limit and crash.
Calling gc() will never help. Only closing the R session releases the memory.
Below is a minimal reproducible example. An eye should be kept on the system's memory management program.
Importantly, the sbrl() function calls, from what I can tell, C code, and also makes use of Rcpp. I guess this relates to the memory leak issue.
Would you know how to force memory to be released?
Configuration: Windows 10, R 3.5.0 (Rstudio or R.exe)
install.packages("sbrl")
library(sbrl)
# Getting / prepping data
data("tictactoe")
# Looping over sbrl
for (i in 1:1e3) {
rules <- sbrl(
tdata = tictactoe, iters=30000, pos_sign="1",
neg_sign="0", rule_minlen=1, rule_maxlen=3,
minsupport_pos=0.10, minsupport_neg=0.10,
lambda=10.0, eta=1.0, alpha=c(1,1), nchain=20
)
invisible(gc())
cat("Rules object size in Mb:", object.size(rules)/1e6, "\n")
}
Related
I'm trying to speed up my R code using future package by using mutlicore plan on Linux. In future definition I'm creating a java object and trying to pass it to .jcall(), But I'm getting a null value for java object in future. Could anyone please help me out to resolve this. Below is sample code-
library("future")
plan(multicore)
library(rJava)
.jinit()
# preprocess is a user defined function
my_value <- preprocess(a = value){
# some preprocessing task here
# time consuming statistical analysis here
return(lreturn) # return a list of 3 components
}
obj=.jnew("java.custom.class")
f <- future({
.jcall(obj, "V", "CustomJavaMethod", my_value)
})
Basically I'm dealing with large streaming data. In above code I'm sending the string of streaming data to user defined function for statistical analysis and returning the list of 3 components. Then want to send this list to custom java class [ java.custom.class ]for further processing using custom Java method [ CustomJavaMethod ].
Without using future my code is running fine. But I'm getting 12 streaming records in one minute and then my code is getting slow, observed delay in processing.
Currently I'm using Unix with 16 cores. After using future package my process is done fast. I have traced back my code, in .jcall something happens wrong.
Hope this clarifies my pain.
(Author of the future package here:)
Unfortunately, there are certain types of objects in R that cannot be sent to another R process for further processing. To clarify, this is a limitation to those type of objects - not to the parallel framework use (here the future framework). This simplest example of such an objects may be a file connection, e.g. con <- file("my-local-file.txt", open = "wb"). I've documented some examples in Section 'Non-exportable objects' of the 'Common Issues with Solutions' vignette (https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html).
As mentioned in the vignette, you can set an option (*) such that the future framework looks for these type of objects and gives an informative error before attempting to launch the future ("early stopping"). Here is your example with this check activated:
library("future")
plan(multisession)
## Assert that global objects can be sent back and forth between
## the main R process and background R processes ("workers")
options(future.globals.onReference = "error")
library("rJava")
.jinit()
end <- .jnew("java/lang/String", " World!")
f <- future({
start <- .jnew("java/lang/String", "Hello")
.jcall(start, "Ljava/lang/String;", "concat", end)
})
# Error in FALSE :
# Detected a non-exportable reference ('externalptr') in one of the
# globals ('end' of class 'jobjRef') used in the future expression
So, yes, your example actually works when using plan(multicore). The reason for that is that 'multicore' uses forked processes (available on Unix and macOS but not Windows). However, I would try my best to limit your software to parallelize only on "forkable" systems; if you can find an alternative approach I would aim for that. That way your code will also work on, say, a huge cloud cluster.
(*) The reason for these checks not being enabled by default is (a) it's still in beta testing, and (b) it comes with overhead because we basically need to scan for non-supported objects among all the globals. Whether these checks will be enabled by default in the future or not, will be discussed over at https://github.com/HenrikBengtsson/future.
The code in the question is calling unknown Method1 method, my_value is undefined, ... hard to know what you are really trying to achieve.
Take a look at the following example, maybe you can get inspiration from it:
library(future)
plan(multicore)
library(rJava)
.jinit()
end = .jnew("java/lang/String", " World!")
f <- future({
start = .jnew("java/lang/String", "Hello")
.jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
[1] "Hello World!"
I've got a largish dataset on an Athena database on AWS. I'd like to read from it in parallel, and I'm accustomed to the foreach package's approach to forking from within R.
I'm using RJDBC
Here's what I am trying:
out <- foreach(i = 1:length(fipsvec), .combine = rbind, .errorhandling = "remove") %dopar% {
coni <- dbConnect(driver, "jdbc:awsathena://<<location>>/",
s3_staging_dir="my_directory",
user="...",
password="...")
print(paste0("starting ", i))
sqlstring <- paste0("SELECT ",
"My_query_body"
fipsvec[i]
)
row <- fetch(dbSendQuery(coni, sqlstring), -1, block = 999)
print(i)
dbDisconnect(coni)
rm(coni)
gc()
return(row)
}
(Sorry I can't make this reproducible -- I obviously can't hand out the keys to the DB online.)
When I run this, the first c = number of cores steps run fine, but then it hangs and does nothing -- indefinitely as far as I can tell. htop shows no activity on any of the cores. And when I change the for loop to only loop over c entries, the output is what I expect. When I change from parallel to serial (%do% instead of %dopar%), it also works fine.
Does this have something to do with the connection not being closed properly, or somehow being defined redundantly? I've placed the connection within the parallel loop, so each core should have its own connection in its own environment. But I don't know enough about databases to tell whether this is sufficiently distinct.
I'd appreciate answers that help me understand what's going on under the hood here -- it's all voodoo to me at this point.
Are you passing the RJDBC package (and it's dependencies-- methods, DBI, and rJava) into the cluster anywhere?
If not, your the first line of your code should look something like below:
results <- foreach(i = 1:length(fipsvec),
.combine = rbind,
.errorhandling = "remove",
.packages=c('methods','DBI','rJava','RJDBC')) %dopar% {
One thing that I suspect (but don't know) might make things a little hairier is that RJDBC uses a JVM to execute the queries. Not super knowledgeable about how rJava handles JVM initialization, and if each of the threads may be trying to re-use the same JVM simultaneously, or if they have enough information about the external environment to properly initialize one in the first place.
Another troubleshooting step if the above doesn't work might be to move the assignment step for driver into the %dopar% environment.
On another track, how many rows are in your result set? If the result set is in the million+ row range and can be returned with a single query, I actually came across an opportunity for optimization within the RJDBC package and have an open pull request on github ( https://github.com/s-u/RJDBC/pull/50 ) that I haven't heard anything on but have been using myself for a couple months. There's a basic benchmark documented in the pull request, I found the speedup to be substantial on the particular query I was running.
If it seems applicable you can install the branch with:
library(devtools)
devtools::install_github("msummersgill/RJDBC",ref = "harmonize", force = TRUE)
I have a project that's downloading ~20 million PDFs multithreaded on an ec2. I'm most proficient in R and it's a one off so my initial assessment was that the time savings from bash scripting wouldn't be enough to justify the time spent on the learning curve. So I decided just to call curl from within an R script. The instance is a c4.8xlarge, rstudio server over ubuntu with 36 cores and 60 gigs of memory.
With any method I've tried it runs up to the max ram fairly quickly. It runs alright but I'm concerned swapping the memory is slowing it down. curl_download or curl_fetch_disk work much more quickly than the native download.file function (one pdf per every .05 seconds versus .2) but those both run up to max memory extremely quickly and then seem to populate the directory with empty files. With the native function I was dealing with the memory problem by suppressing output with copious usage of try() and invisible(). That doesn't seem to help at all with the curl package.
I have three related questions if anyone could help me with them.
(1) Is my understanding of how memory is utilized correct in that needlessly swapping memory would cause the script to slow down?
(2) curl_fetch_disk is supposed to be writing direct to disk, does anyone have any idea as to why it would be using so much memory?
(3) Is there any good way to do this in R or am I just better off learning some bash scripting?
Current method with curl_download
getfile_sweep.fun <- function(url
,filename){
invisible(
try(
curl_download(url
,destfile=filename
,quiet=T
)
)
)
}
Previous method with native download.file
getfile_sweep.fun <- function(url
,filename){
invisible(
try(
download.file(url
,destfile=filename
,quiet=T
,method="curl"
)
)
)
}
parLapply loop
len <- nrow(url_sweep.df)
gc.vec <- unlist(lapply(0:35, function(x) x + seq(
from=100,to=len,by=1000)))
gc.vec <- gc.vec[order(gc.vec)]
start.time <- Sys.time()
ptm <- proc.time()
cl <- makeCluster(detectCores()-1,type="FORK")
invisible(
parLapply(cl,1:len, function(x){
invisible(
try(
getfile_sweep.fun(
url = url_sweep.df[x,"url"]
,filename = url_sweep.df[x,"filename"]
)
)
)
if(x %in% gc.vec){
gc()
}
}
)
)
stopCluster(cl)
Sweep.time <- proc.time() - ptm
Sample of data -
Sample of url_sweep.df:
https://www.dropbox.com/s/anldby6tcxjwazc/url_sweep_sample.rds?dl=0
Sample of existing.filenames:
https://www.dropbox.com/s/0n0phz4h5925qk6/existing_filenames_sample.rds?dl=0
Notes:
1- I do not have such powerful system available to me, so I cannot reproduce every issue mentioned.
2- All the comments are being summarized here
3- It was stated that machine received an upgrade: EBS to provisioned SSD w/ 6000 IOPs/sec, however the issue persists
Possible issues:
A- if memory swap starts to happen then you are nor purely working with RAM anymore and I think R would have harder and harder time to find available continues memory spaces.
B- work load and the time it takes to finish the workload, compared to the number of cores
c- parallel set up, and fork cluster
Possible solutions and troubleshooting:
B- Limiting memory usage
C- Limiting number of cores
D- If the code runs fine on a smaller machine like personal desktop than issue is with how the parallel usage is setup, or something with fork cluster.
Things to still try:
A- In general running jobs in parallel incurs overhead, now more cores you have, you will see the effects more. when you pass a lot of jobs that take very very little time (think smaller than second) this will results in increase of overhead related to constantly pushing jobs. try to limit the core to 8 just like your desktop and try your code? does the code run fine? if yes than try to increase the workload as you increase the cores available to the program.
Start with lower end of spectrum of number of cores and amount of ram an increase them as you increase the workload and see where the fall happens.
B- I will post a summery about Parallelism in R, this might help you catch something that we have missed
What worked:
Limiting the number of cores has fixed the issue. As mentioned by OP, he has also made other changes to the code, however i do not have access to them.
You can use the async interface instead. Short example below:
cb_done <- function(resp) {
filename <- basename(urltools::path(resp$url))
writeBin(resp$content, filename)
}
pool <- curl::new_pool()
for (u in urls) curl::curl_fetch_multi(u, pool = pool, done = cb_done)
curl::multi_run(pool = pool)
being able to multithread on windows would be awesome, but perhaps this problem is harder than i had thought.. :(
inside of survey:::svyby.default there is a a block that's either lapply or mclapply depending on multicore=TRUE and your operating system. windows users get forced into the lapply loop no matter what, and i was wondering if there's any way to go down the mclapply path instead.. speeding up the computation.
i don't know too much about the innards of parallel processing, but i did some experiments to see if any of the windows-acceptable alternatives would work. first i tried overwriting mclapply with
mclapply <-
function( X , FUN , ... ){
clusterApply(
x = X ,
fun = FUN ,
cl = makeCluster( detectCores() ) , ... )
}
next i used fixInNamespace( svyby.default , "survey" ) to remove the line
if (multicore) parallel:::closeAll()
but that only got me to the point where
> svyby(~api99, ~stype, dclus1, svymean , multicore=TRUE )
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: object 'svymean' not found
quoting Dr. Thomas Lumley, author of the R survey package in response to my inquiry--
No. This approach to parallelising relies on forking, which Windows
doesn't support.
It would be necessary to rewrite it to use clusterApply(), and I'm
pretty sure the communications overhead would eat the speed gain. With
forking, the child process gets a copy of the parent process data for
free -- it's all done by the virtual<->physical memory translation
hardware -- but with the cluster approach R has to send data to the
child process explicitly.
When I am using mclapply, from time to time (really randomly) it gives incorrect results. The problem is quite thoroughly described in other posts across the Internet, e.g. (http://r.789695.n4.nabble.com/Bug-in-mclapply-td4652743.html). However, no solution is provided. Does anyone know how to fix this problem? Thank you!
The problem reported by Winston Chang that you cite appears to have been fixed in R 2.15.3. There was a bug in mccollect that occurred when assigning the worker results to the result list:
if (is.raw(r)) res[[which(pid == pids)]] <- unserialize(r)
This fails if unserialize(r) returns a NULL, since assigning a NULL to a list in this way deletes the corresponding element of the list. This was changed in R 2.15.3 to:
if (is.raw(r)) # unserialize(r) might be null
res[which(pid == pids)] <- list(unserialize(r))
which is a safe way to assign an unknown value to a list.
So if you're using R <= 2.15.2, the solution is to upgrade to R >= 2.15.3. If you have a problem using R >= 2.15.3, then presumably it's a different problem then the one reported by Winston Chang.
I also read over the issues discussed in the R-help thread started by Elizabeth Purdom. Without a specific test case, my guess is that the problem is not due to a bug in mclapply because I can reproduce the same symptoms with the following function:
work <- function(i, poison) {
if (i == poison) quit(save='no')
i
}
If a worker started by mclapply dies while executing a task for any reason (receiving a signal, seg faulting, exiting), mclapply will return a NULL for all of the tasks that were assigned to that worker:
> library(parallel)
> mclapply(1:4, work, 3, mc.cores=2)
[[1]]
NULL
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
In this case, NULL's were returned for tasks 1 and 3 due to prescheduling, even though only task 3 actually failed.
If a worker dies when using a function such as parLapply or clusterApply, an error is reported:
> cl <- makePSOCKcluster(3)
> parLapply(cl, 1:4, work, 3)
Error in unserialize(node$con) : error reading from connection
I've seen many such reports, and I think they tend to happen in large programs that use lots of packages that are hard to turn into reproducible test cases.
Of course, in this example, you'll also get an error when using lapply, although the error won't be hidden as it is with mclapply. If the problem doesn't seem to happen when using lapply, it may be because the problem rarely occurs, so it only happens in very large runs that are executed in parallel using mclapply. But it is also possible that the error occurs, not because the tasks are executed in parallel, but because they are executed by forked processes. For example, various graphics operations will fail when executed in a forked process.
I'm adding this answer so others hitting this question won't have to wade through the long thread of comments (I am the bounty granter but not the OP).
mclapply initially populates the list it creates with NULLS. As the worker processes return values, these values overwrite the NULLS. If a process dies without ever returning a value, mclapply will return a NULL.
When memory becomes low, the Linux out of memory killer (oom killer)
https://lwn.net/Articles/317814/
will start silently killing processes. It does not print anything to the console to let you know what it's doing, although the oom killer activities show up in the system log. In this situation the output of mclapply will appear to have been randomly contaminated with NULLS.