Use save() within sfClusterCall()

I want to run a parallelized simulation using snowfall. Collecting the data with return() causes the cluster to exceed its memory limit, and at some point data is no longer recorded.
So I want to use save() to write the data to a file after each replicate.
sfInit()
try <- function(x) {
  save(list = ls(), file = paste("myfilename_", x, ".RData", sep = ""))
}
sfClusterSetupRNG()
sfClusterCall(try, 1:100)
sfStop()
The error I get is
Error in gzfile(file, "wb") : invalid 'description' argument
Calls: sfClusterCall -> do.call -> -> save -> gzfile
In addition: Warning message:
In if (!nzchar(file)) stop("'file' must be non-empty string") :
the condition has length > 1 and only the first element will be used
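The warning about a condition of length > 1 hints at the likely cause: sfClusterCall() runs the function once per node with the whole vector 1:100 as x, so file becomes a character vector of length 100, which save() rejects. Below is a minimal sketch of a per-replicate alternative using sfLapply(), with an illustrative placeholder workload (the cpus value and the rnorm() body are assumptions, not from the question):

library(snowfall)

sfInit(parallel = TRUE, cpus = 4)   # assumed worker count
sfClusterSetupRNG()

save_replicate <- function(x) {
  result <- rnorm(10)               # placeholder for one replicate's computation
  # each call receives a single x, so 'file' is a single string
  save(result, file = paste("myfilename_", x, ".RData", sep = ""))
  invisible(NULL)                   # avoid shipping large results back to the master
}

sfLapply(1:100, save_replicate)
sfStop()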

Related

Saving text file in the data folder of an R package

I am trying to save a text file in the data folder of a private package I am developing.
I tried the following:
my_text <- "Some text string"
saveRDS(my_text, file = "C/…./package_name/data/mytext.rda")
When I try to build the document, I get the error:
Error in FUN(X[[i]], ...) :
bad restore file magic number (file may be corrupted) -- no data loaded
Calls: suppressPackageStartupMessages ... <Anonymous> -> load_all -> load_data -> unlist -> lapply -> FUN
In addition: Warning message:
file 'mytext.rda' has magic number 'X'
Use of save versions prior to 2 is deprecated
Execution halted
Exited with status 1.
What could I do to save the text?
Try this:
library(readr)
write_rds(x = my_text, path = "C/…./package_name/data/mytext.rda")
The best way is to use devtools::use_data(my_text, internal = TRUE), as mentioned by @PoGibas.
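For completeness, a minimal sketch of the use_data route, run from the package's root directory (the helper now lives in usethis::use_data(); internal = TRUE writes R/sysdata.rda rather than data/):

my_text <- "Some text string"
usethis::use_data(my_text, internal = TRUE)

# the low-level equivalent for an exported dataset is save() into data/
save(my_text, file = "data/my_text.rda")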

Error using DEXSeqDataSetFromHTSeq

Currently I am trying to understand the DEXSeq package. I have a design TSV file and 7 files that contain counts. Now I would like to run the following commands:
library("DEXSeq");
design=read.table("dexseq_design.tsv", header=TRUE, row.names=1);
ecs = DEXSeqDataSetFromHTSeq(countfiles=c("M0.txt", "M1.txt", "M2.txt", "M3.txt", "M4.txt", "M5.txt", "M6.txt", "M7.txt"), design=design, flattenedfile="genome.chr.gff");
The last command gives an error:
Error in class(sampleData) %in% c("data.frame") :
error in evaluating the argument 'x' in selecting a method for function '%in%':
Error: argument "sampleData" is missing, with no default
What does this error mean and how can I fix it? While loading the DEXSeq package, there was a warning:
Warning message:
replacing previous import by ‘ggplot2::Position’ when loading ‘DESeq2’
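The error itself says that the sampleData argument is missing: DEXSeqDataSetFromHTSeq() expects the per-sample table as sampleData (a data.frame, typically with a condition column), while design takes a formula. A hedged sketch of the call along those lines, reusing the file names from the question (the design formula is the package's documented default; whether your TSV has the required columns is an assumption):

library(DEXSeq)

sampleData <- read.table("dexseq_design.tsv", header = TRUE, row.names = 1)
countFiles <- c("M0.txt", "M1.txt", "M2.txt", "M3.txt",
                "M4.txt", "M5.txt", "M6.txt", "M7.txt")

dxd <- DEXSeqDataSetFromHTSeq(
  countfiles    = countFiles,
  sampleData    = sampleData,        # the per-sample table goes here, not in 'design'
  design        = ~ sample + exon + condition,
  flattenedfile = "genome.chr.gff"
)

The "replacing previous import by 'ggplot2::Position'" message is, to the best of my knowledge, a harmless namespace warning emitted while loading DESeq2 and unrelated to this error.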

Tracing back rare error

I'm running code that takes a very long time to compute. I made the code parallel using foreach() %dopar% and run it on a cluster.
It generally runs fine, but sometimes it crashes and I get the following error:
Error in { : task 4 failed - "missing value where TRUE/FALSE needed"
Calls: %dopar% -> <Anonymous>
Execution halted
It says Execution halted, but only for that particular task, so the others keep running; the loop then fails to return output at the end, without telling me beforehand.
I suspect the problem is an if statement. I tried reproducing the error on my own computer, but it is so rare that I can't trigger it.
The code easily runs for 100 hours, doing as many as 100,000 iterations, and only one of them will fail.
My questions are: Can I trace back where the error occurred? (I run the code on a cluster, so I don't have all the nice RStudio tools.)
Also, is it possible to still get output from a foreach() loop even if one of the tasks crashed?
Or is there any method people use that would let me reproduce the crash on my own computer?
I can post the code if needed; please ask if it helps.
The foreach ".errorhandling" argument is intended to help in this situation. If you want foreach to pass errors through, then use .errorhandling="pass". If you want it to filter out errors (which reduces the length of the result), then use .errorhandling="remove". The default value is "stop" which throws an error indicating which task failed.
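For illustration, a minimal sketch of the "pass" pattern with the doParallel backend (the backend choice, toy loop, and forced error are placeholders, not from the question); afterwards you can locate the failed iterations by checking which elements inherit from "error":

library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

res <- foreach(i = 1:10, .errorhandling = "pass") %dopar% {
  if (i == 4) stop("missing value where TRUE/FALSE needed")   # simulated rare failure
  i^2
}
stopCluster(cl)

# indices of the tasks that errored, and the captured condition objects
failed <- which(vapply(res, inherits, logical(1), what = "error"))
failed
res[failed]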
Unfortunately, most parallel backends don't support tracebacks, but doMPI does. You simply call "startMPIcluster" with verbose=TRUE, and the traceback will be written to the log file of the worker that had the error. Here's an example that generates an error on task 42:
suppressMessages(library(doMPI))
cl <- startMPIcluster(4, verbose=TRUE)
registerDoMPI(cl)

g <- function(i) {
  if (i == 42) {
    if (NULL) cat('hello, world\n')   # deliberately fails on task 42
  }
  7
}
f <- function(i) g(i)

r <- foreach(i=1:50, .errorhandling='pass') %dopar% f(i)
print(r)
closeCluster(cl)
mpi.quit()
Since it uses .errorhandling="pass", the script runs to completion, with an error object returned in element 42 of the result list. In addition, one of the log files contains a traceback of the error (along with many other messages):
waiting for a taskchunk...
executing taskchunk 42 containing 1 tasks
error executing task: argument is of length zero
traceback (most recent call first):
> g(i)
> f(i)
> eval(expr, envir, enclos)
> eval(expr, envir)
> executeTask(taskchunk$argslist[[1]])
> executeTaskChunk(cl$workerid, taskchunk, envir, err, cores)
returning error results for taskchunk 42
Unfortunately, doMPI is mostly used on Linux systems, so this isn't helpful to most Mac and Windows users.

Saving a lattice plot

I am trying to open a device, but getting the following error:
> trellis.device(device="pdf", filename="runtime.pdf")
Error in device.call(...) : unused argument (filename = "runtime.pdf")
The same error occurs when I try to open a device with
pdf(filename="c:/R/FSM/runtime.pdf")
Is there a package that I need to load into the library?
The correct argument is file rather than filename, as in pdf(file = "myfile.pdf"). Other functions that open new devices, such as jpeg(), do use a filename argument, so check the help page for each device. In general, an error of the form Error in ...: unused argument means that an argument supplied in the call is not a formal argument of that function.
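A minimal sketch of both routes, assuming a lattice plot of the built-in mtcars data (lattice objects must be print()ed explicitly when a script draws to a file device):

library(lattice)

p <- xyplot(mpg ~ wt, data = mtcars)

# via trellis.device(), which forwards 'file' to the pdf() device
trellis.device(device = "pdf", file = "runtime.pdf")
print(p)
dev.off()

# or directly with the base device function
pdf(file = "runtime.pdf")
print(p)
dev.off()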

Store large dataframes in redis through R

I have a number of large dataframes in R which I was planning to store using redis. I am totally new to redis but have been reading about it today and have been using the R package rredis.
I have been playing around with small data and have saved and retrieved small dataframes using the redisSet() and redisGet() functions. However, when it came to saving my larger dataframes (the largest of which has 4.3 million rows and is 365 MB when saved as an .RData file)
using redisSet('bigDF', bigDF), I get the following error message:
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERR Protocol error: invalid bulk length
In addition: Warning messages:
1: In writeBin(v, con) : problem writing to connection
2: In writeBin(.raw("\r\n"), con) : problem writing to connection
Presumably this is because the dataframe is too large to save. I know that redisSet writes the dataframe as a string, which is perhaps not the best approach for large dataframes. Does anyone know of the best way to do this?
EDIT: I have recreated the error by creating a very large dummy dataframe:
bigDF <- data.frame(
  'lots' = rep('lots', 40000000),
  'of'   = rep('of',   40000000),
  'data' = rep('data', 40000000),
  'here' = rep('here', 40000000)
)
Running redisSet('bigDF',bigDF) gives me the error:
Error in .redisError("Invalid agrument") : Invalid agrument
the first time. Running it again immediately afterwards, I get the error:
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERR Protocol error: invalid bulk length
In addition: Warning messages:
1: In writeBin(v, con) : problem writing to connection
2: In writeBin(.raw("\r\n"), con) : problem writing to connection
Thanks
In short: you cannot. Redis can store a maximum of 512 MB of data in a string value, and your serialized demo data frame is bigger than that:
> length(serialize(bigDF, connection = NULL)) / 1024 / 1024
[1] 610.352
Technical background:
serialize is called in the .cerealize function of the package via redisSet and rredis:::.redisCmd:
> rredis:::.cerealize
function (value)
{
    if (!is.raw(value))
        serialize(value, ascii = FALSE, connection = NULL)
    else value
}
<environment: namespace:rredis>
Off-topic: why would you store such a big dataset in Redis anyway? Redis is meant for small key-value pairs. On the other hand, I have had some success storing big R datasets in CouchDB and MongoDB (with GridFS) by adding the compressed RData file there as an attachment.
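One workaround, not from the original answer and sketched here under the assumption that the data frame splits cleanly by rows, is to store it under several keys so that each serialized chunk stays under the 512 MB per-value limit:

library(rredis)
redisConnect()

chunk_size <- 1e6                                    # assumed rows per chunk
n_chunks   <- ceiling(nrow(bigDF) / chunk_size)

for (i in seq_len(n_chunks)) {
  rows <- ((i - 1) * chunk_size + 1):min(i * chunk_size, nrow(bigDF))
  redisSet(sprintf("bigDF:%03d", i), bigDF[rows, ])  # each chunk serializes far below 512 MB
}
redisSet("bigDF:nchunks", n_chunks)

# reassemble later
n      <- redisGet("bigDF:nchunks")
bigDF2 <- do.call(rbind, lapply(seq_len(n), function(i) redisGet(sprintf("bigDF:%03d", i))))

redisClose()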

Resources