TCGAbiolinks: GDCprepare never terminates and crashes [R]

I recently started using TCGAbiolinks to process some gene expression data from the TCGA database. All I need to do is download the data into an R file, and there are many examples online. However, every time I try the example code, it crashes my R workspace and sometimes my PC entirely.
Here's the code I'm using:
library(TCGAbiolinks)

queryLUAD <- GDCquery(project = "TCGA-LUAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      sample.type = "Primary Tumor",
                      legacy = FALSE,
                      workflow.type = "HTSeq - FPKM-UQ")

GDCdownload(queryLUAD)

LUADRNAseq <- GDCprepare(queryLUAD,
                         save = TRUE,
                         save.filename = "LUAD.R")
As you can see, it's very simple and, as far as I can tell, identical to examples like this one.
When I run this code, the download completes fully (I've checked the folder with the files). Then I run GDCprepare. The progress bar starts and goes to 100%, but the command never terminates; eventually either RStudio or my machine crashes.
Here's the terminal output:
> GDCdownload(queryLUAD)
Downloading data for project TCGA-LUAD
Of the 533 files for download 533 already exist.
All samples have been already downloaded
> LUADRNAseq <- GDCprepare(queryLUAD,
+ save = TRUE,
+ save.filename = "LUAD.R")
|==============================================================================================|100% Completed after 13 s
Although it says "Completed", it never actually returns. To solve this, I've tried reinstalling TCGAbiolinks, updating R to the latest version, and even running it on an entirely different machine (a Mac instead of Windows). I've tried other datasets ("LUSC") and got the exact same behavior. Nothing has solved the issue, and I haven't found it mentioned anywhere online.
I am sincerely grateful for any and all advice on why this is happening and how I can fix it.

I experienced exactly the same problem. I tried a variety of things and noticed it doesn't crash when the dataset has fewer than 100 samples, or when running with summarizedExperiment = FALSE for datasets with fewer than 300 samples. A sketch of that workaround is below.
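As a rough illustration of that workaround (reusing queryLUAD from the question; summarizedExperiment is an argument of GDCprepare, and skipping the SummarizedExperiment assembly is what seems to reduce the memory pressure here), the call would look like:

LUADRNAseq <- GDCprepare(queryLUAD,
                         save = TRUE,
                         save.filename = "LUAD.R",
                         summarizedExperiment = FALSE)  # return a plain table instead of a SummarizedExperiment

Note that with summarizedExperiment = FALSE the result comes back as an ordinary data frame rather than a SummarizedExperiment object, so downstream code may need adjusting.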

I am facing the same issue here. It looks like some kind of memory leak is happening, because my RAM usage goes to 100%. I managed to GDCprepare 500 samples without crashing with ~64 GB of RAM, but even after it finishes, the memory is still occupied by the R session, even if I try garbage collection and remove everything in the environment.
I didn't have this issue with TCGAbiolinks around a year ago...

Related

Abnormal termination of R session on Rstudio server with DECIPHER alignment

I am running an alignment with the DECIPHER package in Bioconductor using an RStudio instance located on a server.
dna1 <- RemoveGaps(dnaSet, removeGaps = "all", processors = NULL)
alignmentO <- AlignSeqs(dna1, processors = NULL)
For some reason, every time the alignment reaches 99% the R session crashes with the message "The previous R session was abnormally terminated due to an unexpected crash."
Sometimes the program will work for a short time before crashing, but recently it crashes on the first alignment. I have run the code repeatedly using varying input sizes and it always crashes in the exact same place.
Generally in the past when I've had session crashes the issue has been memory, but these are small viral genomes, which shouldn't be an issue. I also pulled all the code off the server to run in RStudio on my personal computer, which has less RAM and fewer CPUs, and the code ran with no problem on the exact same inputs. Any ideas as to what the issue could be?
I have tried running it on two separate servers with different R versions, but I have the same issue on both.
So eventually I re-encountered what I believe is the same issue, but I no longer have access to the original data to test. I was again experiencing the R session crashing before the alignment could complete. The second time it was a data issue: the sequence that crashed the session was oriented 3' to 5' instead of 5' to 3', so the sequences were too dissimilar to align. Adding in an orientation function resolved the issue.
dna1 <- RemoveGaps(dnaSet, removeGaps = "all", processors = NULL)
dna1 <- OrientNucleotides(dna1)
alignmentO <- AlignSeqs(dna1, processors = NULL)

R curl::has_internet() returns FALSE even though there is an internet connection

My problem arose when downloading data from EuroSTAT using the R package eurostat:
# Population data by NUTS3
pop_data <- subset(eurostat::get_eurostat("demo_r_pjangrp3", time_format = "num"),
(age == "TOTAL") & (sex == "T") &
(nchar(trimws(geo)) == 5))[, c("time","geo","values")]
#Fejl i eurostat::get_eurostat("demo_r_pjangrp3", time_format = "num") :
# You have no internet connection, please reconnect!
Searching, I have found out that it is this statement (in the eurostat package code) that causes the problem:
if (!curl::has_internet()) { stop("You have no internet connection, please reconnect!") }
However, I do have an internet connection and can, e.g., ping www.eurostat.eu.
I have tried curl::has_internet() on different computers, all with an internet connection. On some it works (returns TRUE), on others it doesn't.
I have talked with our IT department, and we checked whether it could be a firewall problem; removing the firewall did not solve it.
Unfortunately, I am ignorant about network settings, so when trying to read the documentation for the curl package I am lost.
Downloading data from EuroSTAT using the command above has worked for at least the last two years; for me the problem arose at the start of 2020 (January 7).
I hope someone can help with this, as downloading population data from EuroSTAT is a mandatory part of much of my/our regular work.
In the special case of curl::has_internet, you don't need to modify the function to return a specific value. It has its own enclosing environment, from which it reads a state variable indicating whether a proxy connection exists. You can modify that state variable instead.
assign("has_internet_via_proxy", TRUE, environment(curl::has_internet))
curl::has_internet() # will always be TRUE
# [1] TRUE
It's difficult to tell without knowing your settings, but there are a couple of things to try. This issue has been noted and possibly addressed in a development version, which you can install with
install.packages("https://github.com/jeroen/curl/archive/master.tar.gz", repos = NULL)
You could also try updating libcurl, which is the C library for which the R package acts as an R interface. The problem you describe seems to be more common with older versions of libcurl.
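If you want to check which libcurl your R installation is linked against before deciding whether to update, curl::curl_version() reports it; for example:

curl::curl_version()$version
# e.g. "7.59.0" -- older libcurl releases are the ones most often associated with this problem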
If all else fails, you could overwrite the curl::has_internet function like this:
remove_has_internet <- function() {
  unlockBinding(sym = "has_internet", asNamespace("curl"))
  assign("has_internet", function() return(TRUE), envir = asNamespace("curl"))
  lockBinding(sym = "has_internet", asNamespace("curl"))
}
Now if you run remove_has_internet(), any call to curl::has_internet() will return TRUE for the remainder of your R session. However, this will only work if other curl functionality is working properly with your network settings. If it isn't then you will get other strange errors and should abandon this approach.
If, for any reason, you want to restore the functionality of the original curl::has_internet without restarting an R session, you can do this:
restore_has_internet <- function() {
  unlockBinding(sym = "has_internet", asNamespace("curl"))
  assign("has_internet",
         function() !is.null(curl::nslookup("r-project.org", error = FALSE)),
         envir = asNamespace("curl"))
  lockBinding(sym = "has_internet", asNamespace("curl"))
}
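For example, a session using these helpers (assuming curl-based downloads otherwise work on your network) might look like:

remove_has_internet()
curl::has_internet()
# [1] TRUE
# ... run the eurostat download here ...
restore_has_internet()  # put the original connectivity check back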
I just ran into this problem, so here's an additional solution, blending both previous answers. It's reversible and checks whether we actually have internet access, to avoid bigger problems later.
# old value
op = get("has_internet_via_proxy", environment(curl::has_internet))
# check for internet
np = !is.null(curl::nslookup("r-project.org", error = FALSE))
assign("has_internet_via_proxy", np, environment(curl::has_internet))
Within a function, this line can be added to automatically revert the process:
on.exit(assign("has_internet_via_proxy", op, environment(curl::has_internet)))
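Putting the pieces together, a small wrapper along these lines (the name with_forced_internet and the expr argument are just illustrative) temporarily sets the proxy flag and restores the old value on exit:

with_forced_internet <- function(expr) {
  env <- environment(curl::has_internet)
  op  <- get("has_internet_via_proxy", env)             # remember the old value
  np  <- !is.null(curl::nslookup("r-project.org", error = FALSE))
  assign("has_internet_via_proxy", np, env)             # only force it if DNS actually resolves
  on.exit(assign("has_internet_via_proxy", op, env))    # always revert afterwards
  expr
}

pop_data <- with_forced_internet(
  eurostat::get_eurostat("demo_r_pjangrp3", time_format = "num")
)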

Error in file(con, "w") : cannot open the connection [using RStudio to plot interactive bar graphs with rCharts, knitr]

I am getting an error when trying to run the code below in RStudio (R 3.3.2) on a Mac (macOS Sierra):
devtools::install_github('ramnathv/rCharts')
install.packages("knitr")
require(rCharts)
require(knitr)

haireye <- as.data.frame(HairEyeColor)
n1 <- nPlot(Freq ~ Hair, group = 'Eye', type = 'multiBarChart',
            data = subset(haireye, Sex == 'Male'))
n1$save('fig/n1.html', cdn = TRUE)
cat('<iframe src="fig/n1.html" width="100%" height="600"></iframe>')
Please see the output below:
Error in file(con, "w") : cannot open the connection
In addition: Warning message: In file(con, "w") : cannot open file 'fig/n1.html': No such file or directory
But I am able to generate the required bar graph in the viewer when I use
n1$show(cdn = TRUE)
in lieu of n1$save('fig/n1.html', cdn = TRUE).
To take care of write-permission issues (if any), I also tried including the line below, altering the working-directory path wherever necessary.
knitr::knit2html('Users/documents/n1.html')
But it did not help. I see the n1.html file created, but it only opens an empty browser window.
Any help to resolve this is appreciated.
Best,
S
A lot of the time we face this error due to caching in RStudio, and in that case the actual code errors don't show up. Restart RStudio and this error will be gone, and the actual code errors will show.
You have two separate problems.
The connection error appears because the fig/ folder does not exist. Create the folder and the save command will work. R has functions to check the existence of directories and create new ones if you would like to do it in your code, as sketched below.
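A minimal sketch of that check, using the fig folder name from the question:

if (!dir.exists("fig")) {
  dir.create("fig", recursive = TRUE)  # create the missing output folder
}
n1$save('fig/n1.html', cdn = TRUE)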
The second problem comes from the way you save: you should use n1$save('fig/n1.html', standalone = TRUE). Here is a similar situation.
As a side note, I would say rCharts is not currently developed or maintained at all, so I would recommend using another library for your charts. In my opinion Plotly is quite nice. rCharts brought the NVD3 project to R and the chart style is, in my opinion, really nice. However, as far as I know both projects have stopped, so I would look for a library that is still alive.
I have fixed this problem with good old rm(list = ls()). I have fallen into sequences where an error stops execution of my script; I fix the error, and then it won't run. This is likely due to lazy evaluation, but it is a nearly impossible problem to diagnose, so the solution at the top works almost all the time.

R parallelisation error: unserialize(socklist[[n]])

In a nutshell, I am trying to parallelise my whole script over dates using snow and adply, but I continually get the error below.
Error in unserialize(socklist[[n]]) : error reading from connection
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
I have set up the parallelisation process in the following way:
Cores = detectCores(all.tests = FALSE, logical = TRUE)
cl = makeCluster(Cores, type = "SOCK")
registerDoSNOW(cl)
clusterExport(cl, c("Var1", "Var2", "Var3", "Var4"), envir = environment())

exposureDaily <- adply(.data = dateSeries, .margins = 1, .fun = MainCalcFunction,
                       .expand = TRUE, Var1, Var2, Var3, Var4,
                       .parallel = TRUE)
stopCluster(cl)
Where dateSeries might look something like
> dateSeries
marketDate
1 2016-04-22
2 2016-04-26
MainCalcFunction is a very long script with many of my own functions contained within it. As the script is so long, reproducing it wouldn't be practical, and a hypothetical small function would defeat the purpose, as I have already got this methodology to work with other, smaller functions. I can say that within MainCalcFunction I load all my libraries and necessary functions, and source a file containing all other variables aside from those exported above, so that I don't have to export a long list of libraries and other objects.
MainCalcFunction can run successfully in its entirety over two dates using adply without parallelisation, which tells me that it is not a bug in the code itself that is causing the parallel run to fail.
Initially I thought (from experience) that the parallelisation over dates was failing because there was another function within the code that used parallelisation; however, I have since rebuilt the whole code to make sure there is no such function.
I have pored over the script with a fine-tooth comb to see if there was any place where I accidentally didn't export something that I needed, and I can't find anything.
Some ideas as to what could be causing the code to fail are:
The use of various option-valuation functions in fOptions and RQuantLib
The use of cluster type "SOCK"
I am aware of this question already asked, and also this question, and while the first has helped me, it hasn't yet solved the problem. (Note: that may be because I haven't used it correctly, having mainly used loginfo("text") to track where the code is. Potentially there is a way to change that so I log warning and/or error messages instead?)
Please let me know if there is any other information I can provide to help in solving this. I would be very appreciative if someone could provide some guidance, as the code takes close to 40 minutes to run for a single day and I need to run it for close to a year, so parallelisation is essential!
EDIT
I have tried to implement the suggestion in the first question linked above by using the outfile option. Given I am using Windows, I have done this by including the following lines before exporting the key objects and running MainCalcFunction:
reportLogName <- paste("logout_parallel.txt", sep="")
addHandler(writeToFile,
file = paste(Save_directory,reportLogName, sep="" ),
level='DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting log file", getwd()))
mc<-detectCores()
cl<-makeCluster(mc, outfile="")
registerDoParallel(cl)
Similarly, at the beginning of MainCalcFunction, after having sourced my libraries and functions, I have included the following to print to file:
reportLogName <- paste(testDate, "_logout.txt", sep = "")
addHandler(writeToFile,
           file = paste(Save_directory, reportLogName, sep = ""),
           level = 'DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting test function ", getwd(), sep = ""))
In MainCalcFunction I have then put loginfo("text") statements at key junctures to inform me of where the code is.
This has resulted in some text files being available after the code fails with the aforementioned error. However, these text files provide no more information on the cause of the error, only on the point at which it occurs. This is despite having a tryCatch statement embedded in MainCalcFunction where, at the end, on any instance of error I have added the line logerror(e).
I am posting this answer in case it helps anyone else with a similar problem in the future.
Essentially, the error unserialize(socklist[[n]]) doesn't tell you a lot, so solving it is a matter of narrowing down the issue.
Firstly, be absolutely sure the code runs over several dates in non-parallel mode with no errors.
Ensure the parallelisation is set up correctly. There are some obvious initial errors that many other questions cover, e.g., hidden parallelisation inside the code, which means parallelisation occurs twice.
Once you are sure that there is no problem with the code and the parallelisation is set up correctly, start narrowing down. The issue is likely (unless something has been missed above) something in the code which isn't a problem when run in serial, but becomes a problem when run in parallel. The easiest way to narrow down is by setting outfile = "Log.txt" in whichever makeCluster function you use, e.g., cl <- makeCluster(cores - 1, outfile = "Log.txt"), and then adding as many print("Point in code") statements in your function as you need to narrow down where the issue occurs; a minimal sketch follows below.
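A minimal sketch of that debugging set-up (the worker function debugCalc and the example dates are placeholders, not the OP's code):

library(parallel)

cl <- makeCluster(detectCores() - 1, outfile = "Log.txt")  # worker output is captured in Log.txt

debugCalc <- function(d) {
  print(paste("starting date", d))   # these prints appear in Log.txt
  # ... the real per-date work would go here ...
  print(paste("finished date", d))
  d
}

res <- parLapply(cl, c("2016-04-22", "2016-04-26"), debugCalc)
stopCluster(cl)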
In my case, the problem was the line jj = closeAllConnections(). This line works fine in non-parallel mode but breaks the code when run in parallel. I suspect it has something to do with the function closing all connections, including the socket connections required for the parallelisation.
Try running using plain R instead of running in RStudio.

What to do when a NOAA ERDDAP dataset is not found?

I'm trying to download some gridded ERDDAP data using the rnoaa package in R. While the data retrieval works perfectly for some datasets, I'm having problems getting the data for others. For example, when I run:
library(rnoaa)

ds.info <- erddap_info("noaa_pfeg_95de_54ab_a60a")
erddap_grid(ds.info,
            time = c("2005-01-01", "2015-01-01"),
            altitude = c(0, 0),
            latitude = c(3.25, 3.75),
            longitude = c(72.5, 73.25),
            fields = "all")
I get the following error:
`Error: (404) - Resource not found: /erddap/griddap/ncdcOwDly.csv (Currently unknown datasetID=ncdcOwDly)`.
The error is not entirely consistent, because the request sometimes works when I try different time spans. But I get it pretty much every single time I try to download data from the datasets noaa_pfeg_95de_54ab_a60a, noaa_pfeg_1a4b_0c2a_2365, and some others by NOAA-NCDC.
Because erddap_grid works for some datasets but not for others, I'm inclined to think it's not a bug. Maybe it is a problem with the ERDDAP server, or maybe something to do with my API key? Is there a way around it?
Update (2015-01-10): It seems it is a server problem. When trying to download the data using the address generated by the web interface (see below), I get the same error. I guess I'll just have to wait until "they" fix the problem with the database.
http://coastwatch.pfeg.noaa.gov/erddap/griddap/ncdcOw6hr.csv?u[(2006-01-01):1:(2015-01-09T18:00:00Z)][(10.0):1:(10.0)][(3.25):1:(3.75)][(72.5):1:(73.25)],v[(2006-01-01):1:(2015-01-09T18:00:00Z)][(10.0):1:(10.0)][(3.25):1:(3.75)][(72.5):1:(73.25)]
ERDDAP servers often become overloaded and 404 on some requests. These are public-facing servers that do heavy data lifting, after all.
So the answer here is to try again after waiting some time (please wait a while to be nice to the ERDDAP administrators), and contact the server administrator to be sure that your IP address has not been blacklisted for performing too many requests.
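If you want to automate the "wait and retry" part, a small wrapper like the one below is one way to do it (the function name, retry count, and delay are arbitrary illustrative choices, not rnoaa defaults):

get_grid_with_retry <- function(info, ..., tries = 3, wait_s = 60) {
  for (i in seq_len(tries)) {
    res <- try(erddap_grid(info, ...), silent = TRUE)
    if (!inherits(res, "try-error")) return(res)
    message("Request failed (attempt ", i, "); waiting ", wait_s, " s before retrying")
    Sys.sleep(wait_s)   # be gentle with the public ERDDAP server
  }
  stop("ERDDAP request still failing after ", tries, " attempts")
}

out <- get_grid_with_retry(ds.info,
                           time = c("2005-01-01", "2015-01-01"),
                           altitude = c(0, 0),
                           latitude = c(3.25, 3.75),
                           longitude = c(72.5, 73.25),
                           fields = "all")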
