Abnormal termination of R session on RStudio Server with DECIPHER alignment

I am running an alignment with the DECIPHER package from Bioconductor using an RStudio instance located on a server.
dna1 <- RemoveGaps(dnaSet, removeGaps = "all", processors = NULL)
alignmentO <- AlignSeqs(dna1, processors = NULL)
For some reason, every time the alignment reaches 99% the R session crashes with the message "The previous R session was abnormally terminated due to an unexpected crash."
Sometimes the program will work for a short time before crashing, but recently it crashes on the first alignment. I have run the code repeatedly with varying input sizes and it always crashes in the exact same place.
Generally, when I've had session crashes in the past, the issue has been memory, but these are small viral genomes, which shouldn't be an issue. I also pulled all the code off the server to run in RStudio on my personal computer, which has less RAM and fewer CPUs, and the code ran without a problem on the exact same inputs. Any ideas as to what the issue could be?
I have tried running it on two separate servers with different R versions (shown below), but I have the same issue on both.
The session info is as follows:
Server 1:
Server 2:

So eventually I re-encountered what I believe is the same issue, but I no longer have access to the original data to test. This time, too, the R session was crashing before the alignment could complete, and it turned out to be a data issue. The sequence that crashed the session was oriented 3' to 5' instead of 5' to 3', so the sequences were too dissimilar to align. Adding an orientation step resolved the issue.
dna1 <- RemoveGaps(dnaSet, removeGaps = "all", processors = NULL)
dna1 <- OrientNucleotides(dna1)
alignmentO <- AlignSeqs(dna1, processors = NULL)
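If it helps to see which inputs were affected, OrientNucleotides can also report the orientation it assigns to each sequence; the type argument used below is based on my reading of the DECIPHER documentation, so treat this as a sketch rather than a verified recipe:
# Sketch: ask which orientation each sequence was assigned, to spot
# inputs that were entered 3' to 5' (type = "orientations" is assumed
# from the DECIPHER docs).
orientations <- OrientNucleotides(dna1, type = "orientations")
table(orientations)

# Then reorient and align as in the fix above.
dna1 <- OrientNucleotides(dna1)
alignmentO <- AlignSeqs(dna1, processors = NULL)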

Related

TCGABiolinks: GDCprepare never terminates and crashes

I recently started using TCGAbiolinks to process some gene expression data from the TCGA database. All I need to do is download the data into an R file, and there are many examples online. However, every time I try the example code, it crashes my R workspace and sometimes my PC entirely.
Here's the code I'm using:
library(TCGAbiolinks)
queryLUAD <- GDCquery(project = "TCGA-LUAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      sample.type = "Primary Tumor",
                      legacy = FALSE,
                      workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(queryLUAD)
LUADRNAseq <- GDCprepare(queryLUAD,
                         save = TRUE,
                         save.filename = "LUAD.R")
As you can see, it's very simple and, as far as I can tell, identical to examples like this one.
When I run this code, it downloads fully (I've checked the folder with the files). Then I run GDCprepare. The progress bar starts and goes to 100%. Then the command never terminates, and eventually either RStudio or my machine crashes.
Here's the terminal output:
> GDCdownload(queryLUAD)
Downloading data for project TCGA-LUAD
Of the 533 files for download 533 already exist.
All samples have been already downloaded
> LUADRNAseq <- GDCprepare(queryLUAD,
+ save = TRUE,
+ save.filename = "LUAD.R")
|==============================================================================================|100% Completed after 13 s
Although it says completed, it never actually finishes. To solve this, I've tried reinstalling TCGAbiolinks, updating R to the latest version, and even running it on an entirely different machine (a Mac instead of Windows). I've tried other datasets ("LUSC") and got the exact same behavior. Nothing has solved the issue, and I haven't found it mentioned anywhere online.
I am sincerely grateful for any and all advice on why this is happening and how I can fix it.
I experienced exactly the same problem. I tried a variety of things and noticed it doesn't crash when the dataset has fewer than 100 samples, or when running with summarizedExperiment = FALSE for datasets with fewer than 300 samples.
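For reference, a minimal sketch of that workaround applied to the query from the question (assuming library(TCGAbiolinks) and queryLUAD from above, and that the download step has already completed); summarizedExperiment = FALSE skips building the SummarizedExperiment object and returns a plain data structure instead:
# Same call as in the question, but without building the
# SummarizedExperiment, which the comment above reports avoids
# the crash for smaller datasets.
LUADRNAseq <- GDCprepare(queryLUAD,
                         summarizedExperiment = FALSE,
                         save = TRUE,
                         save.filename = "LUAD.R")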
I am facing the same issue here. It looks like some kind of memory leak is happening, because my RAM usage goes to 100%. I managed to GDCprepare 500 samples without crashing with ~64 GB of RAM, but even after it finishes, the memory is still occupied by the R session, even if I try garbage collection and removing everything in the environment.
I didn't have this issue with TCGAbiolinks around a year ago...

Can writing to a file while running foreach in R cause connection errors?

I'm fitting models to about 1000 different time series in parallel. The code seems to run on some machines but not on others. The error I'm getting (in some cases) is:
Error in { : task 135 failed - “cannot open the connection”
The task number differs from run to run. Here's a sample of my foreach code:
parallel_df <- foreach(sku = SKUList, .packages = loadedNamespaces(), .combine = rbind) %dopar% {
  # This prints the status of the operation to a file,
  # so that you can check how deep into the SKUList the code is.
  # See "Status.txt"
  writeLines(paste("Training Until: ", dt,
                   "\nFitting for: ", sku,
                   "\nKey", match(sku, SKUList), "of", length(SKUList), "keys"),
             con = "Status.txt")
  # Other things the loop does
}
My question is: could the connection error be caused by this exact line (writeLines)? I'm guessing it could be, because that is the only line in the entire loop that uses an external connection (i.e., a text file to which the code writes its status).
If so, why does this cause problems on certain machines and not on others? It works fine on my personal laptop, but not on other work machines.
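Writing to the same file from every worker is a plausible culprit: each writeLines(..., con = "Status.txt") call opens and closes the file, and on some systems (notably Windows) the open can fail while another worker holds the file, which would explain why only some machines are affected. A minimal sketch of one workaround, reusing SKUList from the question and giving each worker its own status file (the status_<pid>.txt naming is my own illustration, not from the question):
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

parallel_df <- foreach(sku = SKUList, .combine = rbind) %dopar% {
  # One status file per worker process, so no two workers ever open
  # the same connection at the same time.
  status_file <- sprintf("status_%d.txt", Sys.getpid())
  writeLines(paste("Fitting for:", sku,
                   "\nKey", match(sku, SKUList), "of", length(SKUList), "keys"),
             con = status_file)
  # ... model fitting as before ...
}

stopCluster(cl)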

Memory leak and C wrapper

I am currently using the sbrl() function from the sbrl library. The function does the job of any supervised statistical learning algorithm: it takes data, and generates a predictive model.
I have a memory leak issue when using it.
If I run the function in a loop, my RAM will get filled more and more, although I am always pointing to the same object.
Eventually, my computer will reach the RAM limit and crash.
Calling gc() will never help. Only closing the R session releases the memory.
Below is a minimal reproducible example. Keep an eye on the system's memory monitor while it runs.
Importantly, the sbrl() function calls C code (from what I can tell) and also makes use of Rcpp. I suspect this is related to the memory leak.
Would you know how to force memory to be released?
Configuration: Windows 10, R 3.5.0 (Rstudio or R.exe)
install.packages("sbrl")
library(sbrl)
# Getting / prepping data
data("tictactoe")
# Looping over sbrl
for (i in 1:1e3) {
  rules <- sbrl(
    tdata = tictactoe, iters = 30000, pos_sign = "1",
    neg_sign = "0", rule_minlen = 1, rule_maxlen = 3,
    minsupport_pos = 0.10, minsupport_neg = 0.10,
    lambda = 10.0, eta = 1.0, alpha = c(1, 1), nchain = 20
  )
  invisible(gc())
  cat("Rules object size in Mb:", object.size(rules) / 1e6, "\n")
}
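If the leak is in the compiled code, gc() cannot reclaim that memory because it only manages R's own heap. One common workaround, sketched below rather than offered as a fix for sbrl itself, is to run each fit in a short-lived child process via the callr package, so the operating system reclaims everything when the child exits:
library(sbrl)
library(callr)

data("tictactoe")

fit_once <- function(tdata) {
  # Runs in a fresh R process; any memory leaked by the C/Rcpp code
  # is released when that process terminates.
  library(sbrl)
  sbrl(
    tdata = tdata, iters = 30000, pos_sign = "1",
    neg_sign = "0", rule_minlen = 1, rule_maxlen = 3,
    minsupport_pos = 0.10, minsupport_neg = 0.10,
    lambda = 10.0, eta = 1.0, alpha = c(1, 1), nchain = 20
  )
}

for (i in 1:10) {
  rules <- callr::r(fit_once, args = list(tdata = tictactoe))
  cat("Iteration", i, "object size in Mb:", object.size(rules) / 1e6, "\n")
}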

curl memory usage in R for multiple files in parLapply loop

I have a project that's downloading ~20 million PDFs multithreaded on an EC2 instance. I'm most proficient in R, and it's a one-off, so my initial assessment was that the time savings from bash scripting wouldn't justify the time spent on the learning curve. So I decided just to call curl from within an R script. The instance is a c4.8xlarge running RStudio Server over Ubuntu, with 36 cores and 60 GB of memory.
With every method I've tried, it runs up to the maximum RAM fairly quickly. It runs all right, but I'm concerned that swapping memory is slowing it down. curl_download and curl_fetch_disk work much more quickly than the native download.file function (one PDF every 0.05 seconds versus 0.2), but both run up to maximum memory extremely quickly and then seem to populate the directory with empty files. With the native function I was dealing with the memory problem by suppressing output with copious use of try() and invisible(). That doesn't seem to help at all with the curl package.
I have three related questions if anyone could help me with them.
(1) Is my understanding of how memory is utilized correct in that needlessly swapping memory would cause the script to slow down?
(2) curl_fetch_disk is supposed to be writing direct to disk, does anyone have any idea as to why it would be using so much memory?
(3) Is there any good way to do this in R or am I just better off learning some bash scripting?
Current method with curl_download
library(curl)

getfile_sweep.fun <- function(url, filename) {
  invisible(
    try(
      curl_download(url, destfile = filename, quiet = TRUE)
    )
  )
}
Previous method with native download.file
getfile_sweep.fun <- function(url, filename) {
  invisible(
    try(
      download.file(url, destfile = filename, quiet = TRUE, method = "curl")
    )
  )
}
parLapply loop
library(parallel)

len <- nrow(url_sweep.df)
# Indices at which the workers call gc()
gc.vec <- unlist(lapply(0:35, function(x) x + seq(from = 100, to = len, by = 1000)))
gc.vec <- gc.vec[order(gc.vec)]

start.time <- Sys.time()
ptm <- proc.time()
cl <- makeCluster(detectCores() - 1, type = "FORK")
invisible(
  parLapply(cl, 1:len, function(x) {
    invisible(
      try(
        getfile_sweep.fun(
          url = url_sweep.df[x, "url"],
          filename = url_sweep.df[x, "filename"]
        )
      )
    )
    if (x %in% gc.vec) {
      gc()
    }
  })
)
stopCluster(cl)
Sweep.time <- proc.time() - ptm
Sample of data -
Sample of url_sweep.df:
https://www.dropbox.com/s/anldby6tcxjwazc/url_sweep_sample.rds?dl=0
Sample of existing.filenames:
https://www.dropbox.com/s/0n0phz4h5925qk6/existing_filenames_sample.rds?dl=0
Notes:
1- I do not have such a powerful system available to me, so I cannot reproduce every issue mentioned.
2- All the comments are summarized here.
3- It was stated that the machine received an upgrade (EBS to provisioned SSD with 6000 IOPS/sec); however, the issue persists.
Possible issues:
A- If memory swapping starts, you are no longer working purely with RAM, and I think R would have a harder and harder time finding available contiguous memory.
B- The workload, and the time it takes to finish it, compared to the number of cores.
C- The parallel setup, and the fork cluster.
Possible solutions and troubleshooting:
B- Limiting memory usage
C- Limiting the number of cores
D- If the code runs fine on a smaller machine like a personal desktop, then the issue is with how the parallel usage is set up, or something with the fork cluster.
Things to still try:
A- In general, running jobs in parallel incurs overhead, and the more cores you have, the more you will see its effects. When you pass a lot of jobs that each take very little time (think less than a second), the result is increased overhead from constantly dispatching jobs. Try limiting the cores to 8, just like your desktop, and run your code: does it run fine? If yes, then increase the workload as you increase the cores available to the program.
Start at the lower end of the spectrum for the number of cores and amount of RAM, increase them as you increase the workload, and see where things start to fall over.
B- I will post a summary about parallelism in R; this might help you catch something that we have missed.
What worked:
Limiting the number of cores fixed the issue. As mentioned by the OP, they also made other changes to the code; however, I do not have access to them.
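A minimal sketch of the core-limiting change against the loop above (reusing len, url_sweep.df, and getfile_sweep.fun from the question; the cap of 8 workers mirrors the suggestion in the troubleshooting notes and is a starting point to tune, not a known-good value):
library(parallel)

# Cap the FORK cluster well below the 36 available cores so per-worker
# memory stays bounded and scheduling overhead drops.
n_workers <- min(8, detectCores() - 1)
cl <- makeCluster(n_workers, type = "FORK")

results <- parLapply(cl, 1:len, function(x) {
  try(
    getfile_sweep.fun(
      url = url_sweep.df[x, "url"],
      filename = url_sweep.df[x, "filename"]
    )
  )
})

stopCluster(cl)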
You can use the async interface instead. Short example below:
cb_done <- function(resp) {
filename <- basename(urltools::path(resp$url))
writeBin(resp$content, filename)
}
pool <- curl::new_pool()
for (u in urls) curl::curl_fetch_multi(u, pool = pool, done = cb_done)
curl::multi_run(pool = pool)
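If it helps, curl_fetch_multi also accepts a fail callback, so transfers that error out can be logged instead of silently leaving gaps; a sketch (the message text is my own):
cb_fail <- function(err) {
  # Called when a transfer fails outright (DNS error, timeout, etc.);
  # err is the error message string supplied by curl.
  message("download failed: ", err)
}

pool <- curl::new_pool()
for (u in urls) {
  curl::curl_fetch_multi(u, pool = pool, done = cb_done, fail = cb_fail)
}
curl::multi_run(pool = pool)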

mclapply returns NULL randomly

When I am using mclapply, from time to time (really randomly) it gives incorrect results. The problem is quite thoroughly described in other posts across the Internet, e.g. (http://r.789695.n4.nabble.com/Bug-in-mclapply-td4652743.html). However, no solution is provided. Does anyone know how to fix this problem? Thank you!
The problem reported by Winston Chang that you cite appears to have been fixed in R 2.15.3. There was a bug in mccollect that occurred when assigning the worker results to the result list:
if (is.raw(r)) res[[which(pid == pids)]] <- unserialize(r)
This fails if unserialize(r) returns a NULL, since assigning a NULL to a list in this way deletes the corresponding element of the list. This was changed in R 2.15.3 to:
if (is.raw(r)) # unserialize(r) might be null
res[which(pid == pids)] <- list(unserialize(r))
which is a safe way to assign an unknown value to a list.
So if you're using R <= 2.15.2, the solution is to upgrade to R >= 2.15.3. If you have a problem using R >= 2.15.3, then presumably it's a different problem than the one reported by Winston Chang.
I also read over the issues discussed in the R-help thread started by Elizabeth Purdom. Without a specific test case, my guess is that the problem is not due to a bug in mclapply because I can reproduce the same symptoms with the following function:
work <- function(i, poison) {
  if (i == poison) quit(save = 'no')
  i
}
If a worker started by mclapply dies while executing a task for any reason (receiving a signal, seg faulting, exiting), mclapply will return a NULL for all of the tasks that were assigned to that worker:
> library(parallel)
> mclapply(1:4, work, 3, mc.cores=2)
[[1]]
NULL
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
In this case, NULLs were returned for tasks 1 and 3 due to prescheduling, even though only task 3 actually failed.
If a worker dies when using a function such as parLapply or clusterApply, an error is reported:
> cl <- makePSOCKcluster(3)
> parLapply(cl, 1:4, work, 3)
Error in unserialize(node$con) : error reading from connection
I've seen many such reports, and I think they tend to happen in large programs that use lots of packages that are hard to turn into reproducible test cases.
Of course, in this example, you'll also get an error when using lapply, although the error won't be hidden as it is with mclapply. If the problem doesn't seem to happen when using lapply, it may be because the problem rarely occurs, so it only happens in very large runs that are executed in parallel using mclapply. But it is also possible that the error occurs, not because the tasks are executed in parallel, but because they are executed by forked processes. For example, various graphics operations will fail when executed in a forked process.
I'm adding this answer so others hitting this question won't have to wade through the long thread of comments (I am the bounty granter but not the OP).
mclapply initially populates the list it creates with NULLs. As the worker processes return values, these values overwrite the NULLs. If a process dies without ever returning a value, mclapply will return a NULL for it.
When memory becomes low, the Linux out-of-memory killer (OOM killer, https://lwn.net/Articles/317814/) will start silently killing processes. It does not print anything to the console to let you know what it's doing, although OOM killer activity shows up in the system log. In this situation, the output of mclapply will appear to have been randomly contaminated with NULLs.
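A defensive pattern that follows from this (a sketch, not from the original answer; inputs and work_fn are placeholders for your own data and function): check the mclapply result for NULLs and rerun those tasks serially, so a silently killed worker does not contaminate downstream results.
library(parallel)

res <- mclapply(seq_along(inputs), function(i) work_fn(inputs[[i]]), mc.cores = 4)

# Tasks whose worker died (OOM-killed, segfault, etc.) come back as NULL.
failed <- which(vapply(res, is.null, logical(1)))
if (length(failed) > 0) {
  warning(length(failed), " task(s) returned NULL; rerunning them serially")
  res[failed] <- lapply(failed, function(i) work_fn(inputs[[i]]))
}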
