R rowSums in a data.table crashes R

The code below works fine
library(data.table)
dt <- data.table(mtcars)[,.(cyl, gear, mpg)]
colsToSum <- c("cyl", "gear", "mpg")
dt[, F15_49 := rowSums(.SD), .SDcols = colsToSum]
but a version of it crashes R in RStudio with the message "R Session Aborted. R encountered a fatal error. The session was terminated.", followed by a Start New Session button. The code snippet that crashes is:
ageColsToSum <- c("F15_19", "F20_24", "F25_29", "F30_34", "F35_39", "F40_44", "F45_49")
dt.SSP.scen.wide[, F15_49 := rowSums(.SD), .SDcols = ageColsToSum]
When I run the code in R in a shell I get the following message.
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Abort trap: 6
I don't know anything about OpenMP, so I don't know what could be initializing libomp.dylib.

I followed the directions at https://github.com/Rdatatable/data.table/wiki/Installation and installed the development version of data.table 1.10.5. My code now works.
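For reference, the OpenMP message above also names an unsafe, unsupported workaround. If you just need to get past the crash temporarily, something like the following (set before data.table is loaded) may work, with all the caveats the message itself gives:
# Unsafe workaround taken from the OpenMP error hint itself --
# it may crash or silently produce incorrect results.
Sys.setenv(KMP_DUPLICATE_LIB_OK = "TRUE")
library(data.table)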

Related

'mc.cores' > 1 is not supported on Windows

I am new to programming and am trying to use parallel processing in R on Windows, starting from existing code.
The following is a snippet of my code:
if (length(grep("linux", R.version$os)) == 1){
num_cores = detectCores()
impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
}
# else if(length(grep("mingw32", R.version$os)) == 1){
# num_cores = detectCores()
# impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
#
# }
else{
impact_list <- lapply(len_a, impact_func)
}
return(sum(unlist(impact_list, use.names = F)))
This works fine; since I am using R on Windows, the code enters the 'else' branch and runs the code using lapply() rather than in parallel.
I have added the 'else if' block to make it work on Windows. But when I un-comment the 'else if' block and run it, I get the error "'mc.cores' > 1 is not supported on Windows".
Please suggest how I can use parallel processing on Windows so that the code takes less time to run.
Any help will be appreciated.
(Disclaimer: I'm the author of the future framework.)
The future.apply package provides parallel versions of R's built-in "apply" functions. It is cross-platform, i.e. it works on Linux, macOS, and Windows. The package often allows you to just replace an existing lapply() with a future_lapply() call, e.g.
library(future.apply)
plan(multisession)

your_fcn <- function(len_a) {
  impact_list <- future_lapply(len_a, impact_func)
  sum(unlist(impact_list, use.names = FALSE))
}
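A hypothetical call, assuming impact_func is defined elsewhere (the definition below is just a placeholder):
impact_func <- function(i) i^2   # placeholder for the real function
your_fcn(1:100)
# plan(multisession) defaults to availableCores() workers;
# you can cap it explicitly, e.g. plan(multisession, workers = 2)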
Regarding mclapply() per se: if you use parallel::mclapply() in your code, make sure there is always an option not to use it. The reason is that it is not guaranteed to work in all environments; that is, it might be unstable and crash R. In the R-devel thread 'mclapply returns NULLs on MacOS when running GAM' (https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html), the author of mclapply() wrote on 2020-04-28:
Do NOT use mcparallel() in packages except as a non-default option that user can set for the reasons Henrik explained. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don't know the resource available so only the user can tell you when it's safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.
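In that spirit, here is a minimal sketch of keeping parallelisation strictly opt-in; the function and argument names are mine, not from the original code:
library(parallel)

impact_total <- function(len_a, impact_func, cores = 1L) {
  # Parallel only if the caller explicitly asks for it; mclapply() forking
  # is not available on Windows, so fall back to lapply() there as well.
  if (cores > 1L && .Platform$OS.type != "windows") {
    impact_list <- mclapply(len_a, impact_func, mc.cores = cores)
  } else {
    impact_list <- lapply(len_a, impact_func)
  }
  sum(unlist(impact_list, use.names = FALSE))
}

# Serial by default; the user opts in explicitly:
# impact_total(1:100, function(i) i^2, cores = 4)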

R: How make dump.frames() include all variables for later post-mortem debugging with debugger()

I have the following code, which provokes an error and writes a dump of all frames using dump.frames(), as proposed e.g. by Hadley Wickham:
a <- -1
b <- "Hello world!"

bad.function <- function(value)
{
  log(value)  # the log function may cause an error or warning depending on the value
}

tryCatch({
  a.local.value <- 42
  bad.function(a)
  bad.function(b)
},
error = function(e)
{
  dump.frames(to.file = TRUE)
})
When I restart the R session and load the dump to debug the problem via
load(file = "last.dump.rda")
debugger(last.dump)
I cannot find my variables (a, b, a.local.value) or my function bad.function anywhere in the frames.
This makes the dump nearly worthless to me.
What do I have to do to see all my variables and functions for a decent post-mortem analysis?
The output of debugger is:
> load(file = "last.dump.rda")
> debugger(last.dump)
Message: non-numeric argument to mathematical function
Available environments had calls:
1: tryCatch({
a.local.value <- 42
bad.function(a)
bad.function(b)
2: tryCatchList(expr, classes, parentenv, handlers)
3: tryCatchOne(expr, names, parentenv, handlers[[1]])
4: value[[3]](cond)
Enter an environment number, or 0 to exit
Selection:
PS: I am using R 3.3.2 with RStudio for debugging.
Update Nov. 20, 2016: Note that it is not an R bug (see the answer by Martin Maechler). I did not change my answer, for reproducibility. The described workaround still applies.
Summary
I think dump.frames(to.file = TRUE) is currently an anti-pattern (or probably a bug) in R if you want to debug errors of batch jobs in a new R session.
It is better to replace it with
dump.frames()
save.image(file = "last.dump.rda")
or
options(error = quote({dump.frames(); save.image(file = "last.dump.rda")}))
instead of
options(error = dump.frames)
because then the global environment (.GlobalEnv, the user workspace where you normally create your objects) is included in the dump, while it is missing if you save the dump directly via dump.frames(to.file = TRUE).
Impact analysis
Without the .GlobalEnv you lose important top-level objects (and their current values ;-) needed to understand the behaviour of the code that led to the error!
Especially for errors in "non-interactive" R batch jobs you are lost without the .GlobalEnv, since you can debug only in a newly started (empty) interactive workspace, where you can then access only the objects in the call-stack frames.
Using the code snippet above you can examine the object values that led to the error in a new R workspace as usual via:
load(file = "last.dump.rda")
debugger(last.dump)
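Putting the pieces together, a minimal batch-job example (file name and exit status are illustrative):
# In the batch script:
options(error = quote({
  dump.frames()                        # creates last.dump in .GlobalEnv
  save.image(file = "last.dump.rda")   # saves .GlobalEnv including last.dump
  q(status = 1)
}))

a <- -1
b <- "Hello world!"
bad.function <- function(value) log(value)
bad.function(b)   # triggers "non-numeric argument to mathematical function"

# Later, in a fresh interactive session:
# load(file = "last.dump.rda")
# debugger(last.dump)   # a, b and bad.function are available again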
Background
The implementation of dump.frames creates a variable last.dump in the workspace and fills it with the environments of the call stack (sys.frames(); each environment contains the "local variables" of the called function). It then saves this variable to a file using save().
The frame stack (call stack) grows with each call of a function, see ?sys.frames:
.GlobalEnv is given number 0 in the list of frames. Each subsequent
function evaluation increases the frame stack by 1 and the [...] environment for evaluation of that function are returned by [...] sys.frame with the appropriate index.
Observe that the .GlobalEnv has the index number 0.
If I now start debugging the dump produced by the code in the question and select frame 1 (not 0!), I can see a variable parentenv which points to (references) the .GlobalEnv:
Browse[1]> environmentName(parentenv)
[1] "R_GlobalEnv"
Hence I believe that sys.frames() does not contain the .GlobalEnv, and therefore neither does dump.frames(to.file = TRUE), since it only stores the sys.frames() without the other objects of the .GlobalEnv.
Maybe I am wrong, but this looks like an unwanted effect or even a bug.
Discussions welcome!
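For what it's worth, a quick check in a fresh R session is consistent with this (a minimal sketch):
f <- function() sys.frames()
frames <- f()
length(frames)                        # 1: only the evaluation frame of f() itself
identical(frames[[1]], globalenv())   # FALSE: .GlobalEnv is not among the frames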
References
https://cran.r-project.org/doc/manuals/R-exts.pdf
Excerpt from section 4.2 Debugging R code (page 96):
Because last.dump can be looked at later or even in another R session, post-mortem debugging is possible even for batch usage of R. We do need to arrange for the dump to be saved: this can be done either using the command-line flag --save to save the workspace at the end of the run, or via a setting such as
options(error = quote({dump.frames(to.file=TRUE); q()}))
Note that it is often more productive to work with the R Core team rather than just claiming that R has a bug. There clearly is no bug here, as R behaves exactly as documented.
Also there is no problem if you work interactively, as you have full access to your workspace (which may be LARGE) there, so the problem applies only to batch jobs (as you've mentioned).
What we have here is rather a missing feature, and feature requests (and bug reports!) should happen on the R bug site (aka 'R bugzilla'), https://bugs.r-project.org/ ... typically, however, after having read the corresponding page on the R website: https://www.r-project.org/bugs.html.
Note that R bugzilla is searchable, and in the present case, you'd pretty quickly find that Andreas Kersting made a nice proposal (namely as a wish, rather than claiming a bug),
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17116
and consequently I had already added the missing feature to R on Aug. 16.
Yes, of course, to the development version of R, aka R-devel.
See also today's thread on the R-devel mailing list,
https://stat.ethz.ch/pipermail/r-devel/2016-November/073378.html
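If I remember correctly, the feature ended up as an include.GlobalEnv argument to dump.frames() (available from R 3.4.0 onwards), so in newer versions something like this should cover batch jobs:
# Assumes R >= 3.4.0, where dump.frames() gained include.GlobalEnv
options(error = quote({
  dump.frames(to.file = TRUE, include.GlobalEnv = TRUE)
  q(status = 1)
}))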

R Parallelisation Error unserialize(socklist[[n]])

In a nutshell, I am trying to parallelise my whole script over dates using snow and adply, but I continually get the error below.
Error in unserialize(socklist[[n]]) : error reading from connection
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
I have set up the parallelisation process in the following way:
Cores = detectCores(all.tests = FALSE, logical = TRUE)
cl = makeCluster(Cores, type = "SOCK")
registerDoSNOW(cl)
clusterExport(cl, c("Var1", "Var2", "Var3", "Var4"), envir = environment())

exposureDaily <- adply(.data = dateSeries, .margins = 1, .fun = MainCalcFunction,
                       .expand = TRUE, Var1, Var2, Var3, Var4,
                       .parallel = TRUE)
stopCluster(cl)
Where dateSeries might look something like
> dateSeries
marketDate
1 2016-04-22
2 2016-04-26
MainCalcFunction is a very long script with many of my own functions contained within it. As the script is so long, reproducing it here wouldn't be practical, and a hypothetical small function would defeat the purpose, as I have already got this methodology to work with other, smaller functions. I can say that within MainCalcFunction I load all my libraries and necessary functions, and source a file containing all other variables aside from those exported above, so that I don't have to export a long list of libraries and other objects.
MainCalcFunction can run successfully in its entirety over 2 dates using adply without parallelisation, which tells me that it is not a bug in the code itself that is causing the parallelisation to fail.
Initially I thought (from experience) that the parallelisation over dates was failing because there was another function within the code that utilised parallelisation; however, I have since rebuilt the whole code to make sure there is no such function.
I have pored over the script with a fine-toothed comb to see if there was any place where I accidentally didn't export something I needed, and I can't find anything.
Some ideas as to what could be causing the code to fail are:
The use of various option-valuation functions in fOptions and RQuantLib
The use of cluster type "SOCK"
I am aware of this question already asked and also this question, and while the first has helped me, it hasn't yet helped solve the problem. (Note: that may be because I haven't used it correctly, having mainly used loginfo("text") to track where the code is. Potentially there is a way to change that so that I log warning and/or error messages instead?)
Please let me know if there is any other information I can provide to help solve this. I would be very appreciative if someone could provide some guidance, as the code takes close to 40 minutes to run for a single day and I need to run it for close to a year, so parallelisation is essential!
EDIT
I have tried to implement the suggestion in the first question included above by utilising the outfile option. Given that I am using Windows, I have done this by including the following lines before exporting the key objects and running MainCalcFunction:
reportLogName <- paste("logout_parallel.txt", sep="")
addHandler(writeToFile,
file = paste(Save_directory,reportLogName, sep="" ),
level='DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting log file", getwd()))
mc<-detectCores()
cl<-makeCluster(mc, outfile="")
registerDoParallel(cl)
Similarly, at the beginning of MainCalcFunction, after having sourced my libraries and functions, I have included the following to print to a file:
reportLogName <- paste(testDate, "_logout.txt", sep = "")
addHandler(writeToFile,
           file = paste(Save_directory, reportLogName, sep = ""),
           level = 'DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting test function ", getwd(), sep = ""))
In the MainCalcFunction function I have then put loginfo("text") statements at key junctures to tell me how far the code has progressed.
This has resulted in some text files being available after the code fails with the aforementioned error. However, these text files provide no more information on the cause of the error, only on the point at which it occurred. This is despite having a tryCatch statement embedded in MainCalcFunction which, on any error, ends with the line logerror(e).
I am posting this answer in case it helps anyone else with a similar problem in the future.
Essentially, the error in unserialize(socklist[[n]]) doesn't tell you a lot, so solving it is a matter of narrowing down the issue.
Firstly, be absolutely sure the code runs over several dates without parallelisation and with no errors.
Ensure the parallelisation is set up correctly. There are some obvious initial errors that many other questions address, e.g. hidden parallelisation inside the code, which means parallelisation occurs twice.
Once you are sure that there is no problem with the code and the parallelisation is set up correctly, start narrowing down. The issue is likely (unless something has been missed above) something in the code that isn't a problem when run serially but becomes a problem when run in parallel. The easiest way to narrow down is by setting outfile = "Log.txt" in whichever makeCluster() call you use, e.g. cl <- makeCluster(cores - 1, outfile = "Log.txt"). Then add as many print("Point in code") statements as you need in your function to narrow down where the issue is occurring.
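For illustration, a minimal sketch of that outfile-based setup (package choice and names mirror the question; the real MainCalcFunction is of course much larger):
library(parallel)
library(doParallel)
library(plyr)

cl <- makeCluster(detectCores() - 1, outfile = "Log.txt")  # worker output collected in Log.txt
registerDoParallel(cl)

MainCalcFunction <- function(piece, ...) {
  print("Entered MainCalcFunction")    # these markers appear in Log.txt per worker
  # ... the real calculations go here ...
  print("Leaving MainCalcFunction")
  piece
}

# exposureDaily <- adply(dateSeries, 1, MainCalcFunction, .parallel = TRUE)
stopCluster(cl)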
In my case, the problem was the line jj = closeAllConnections(). This line works fine when run serially but breaks the code when run in parallel. I suspect it has something to do with the function closing all connections, including the socket connections required for the parallelisation.
Try running using plain R instead of running in RStudio.

Fatal Error while using Rcpp in RStudio on Windows

I'm trying to use Rcpp on Windows in RStudio. I have R version 3.2.3 and I have installed the Rcpp package. The problem is that I am unable to call any functions defined through C++ code. I tried the following (picked up from an example online):
library(inline)  # cxxfunction() comes from the inline package

body <- '
NumericVector xx(x);
return wrap( std::accumulate( xx.begin(), xx.end(), 0.0));'
add <- cxxfunction(signature(x = "numeric"), body, plugin = "Rcpp")
This gives the following warning, but completes execution successfully.
cygwin warning:
MS-DOS style path detected: C:/R/R-32~1.3/etc/x64/Makeconf
Preferred POSIX equivalent is: /cygdrive/c/R/R-32~1.3/etc/x64/Makeconf
CYGWIN environment variable option "nodosfilewarning" turns off this warning.
Consult the user's guide for more details about POSIX paths:
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
When I try to use the above function,
x <- 1
y <- 2
res <- add(c(x, y))
I get the following error :
R Session Aborted
R encountered a fatal error.
The session was terminated.
Any suggestions? This same 'Fatal Error' happens for any code that I run with Rcpp.
Try rebuilding locally, starting with Rcpp. This is valid code and will work (and the hundreds of unit tests stress way more than this). Sometimes the compiler or something else changes under you and this sort of thing happens. It is then useful to have an alternative build system -- e.g. via Travis at GitHub you get Linux for free.
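In practice, "rebuilding locally" can be as simple as reinstalling Rcpp (and anything that links against it) from source; a minimal sketch:
# Rebuild Rcpp from source with the local toolchain
install.packages("Rcpp", type = "source")
# then reinstall packages that compile against Rcpp, e.g. inline
install.packages("inline", type = "source")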
Also, learn about Rcpp Attributes. Your example can be written as
R> library(Rcpp)
R> cppFunction("double adder(std::vector<double> x) { return std::accumulate(x.begin(), x.end(), 0.0); }")
R> adder(c(1,2))
[1] 3
R>
which is simpler. It works the same way, of course, with Rcpp::NumericVector.
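For completeness, a minimal sketch of the Rcpp::NumericVector variant (the function name is mine):
library(Rcpp)
cppFunction("double adder2(Rcpp::NumericVector x) {
  return std::accumulate(x.begin(), x.end(), 0.0);
}")
adder2(c(1, 2))   # 3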

mclapply returns NULL randomly

When I am using mclapply, from time to time (really randomly) it gives incorrect results. The problem is quite thoroughly described in other posts across the Internet, e.g. (http://r.789695.n4.nabble.com/Bug-in-mclapply-td4652743.html). However, no solution is provided. Does anyone know how to fix this problem? Thank you!
The problem reported by Winston Chang that you cite appears to have been fixed in R 2.15.3. There was a bug in mccollect that occurred when assigning the worker results to the result list:
if (is.raw(r)) res[[which(pid == pids)]] <- unserialize(r)
This fails if unserialize(r) returns a NULL, since assigning a NULL to a list in this way deletes the corresponding element of the list. This was changed in R 2.15.3 to:
if (is.raw(r)) # unserialize(r) might be null
    res[which(pid == pids)] <- list(unserialize(r))
which is a safe way to assign an unknown value to a list.
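The difference is easy to see at the R prompt (a small illustration):
res <- list(1, 2, 3)
res[[2]] <- NULL        # [[<- with NULL removes the element
length(res)             # 2

res <- list(1, 2, 3)
res[2] <- list(NULL)    # [<- with list(NULL) keeps a NULL in place
length(res)             # 3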
So if you're using R <= 2.15.2, the solution is to upgrade to R >= 2.15.3. If you have a problem using R >= 2.15.3, then presumably it's a different problem than the one reported by Winston Chang.
I also read over the issues discussed in the R-help thread started by Elizabeth Purdom. Without a specific test case, my guess is that the problem is not due to a bug in mclapply because I can reproduce the same symptoms with the following function:
work <- function(i, poison) {
  if (i == poison) quit(save = 'no')
  i
}
If a worker started by mclapply dies while executing a task for any reason (receiving a signal, seg faulting, exiting), mclapply will return a NULL for all of the tasks that were assigned to that worker:
> library(parallel)
> mclapply(1:4, work, 3, mc.cores=2)
[[1]]
NULL
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
In this case, NULLs were returned for tasks 1 and 3 due to prescheduling, even though only task 3 actually failed.
If a worker dies when using a function such as parLapply or clusterApply, an error is reported:
> cl <- makePSOCKcluster(3)
> parLapply(cl, 1:4, work, 3)
Error in unserialize(node$con) : error reading from connection
I've seen many such reports, and I think they tend to happen in large programs that use lots of packages that are hard to turn into reproducible test cases.
Of course, in this example, you'll also get an error when using lapply, although the error won't be hidden as it is with mclapply. If the problem doesn't seem to happen when using lapply, it may be because the problem rarely occurs, so it only happens in very large runs that are executed in parallel using mclapply. But it is also possible that the error occurs, not because the tasks are executed in parallel, but because they are executed by forked processes. For example, various graphics operations will fail when executed in a forked process.
I'm adding this answer so others hitting this question won't have to wade through the long thread of comments (I am the bounty granter but not the OP).
mclapply initially populates the list it creates with NULLs. As the worker processes return values, these values overwrite the NULLs. If a process dies without ever returning a value, mclapply will return a NULL for that task.
When memory becomes low, the Linux out-of-memory killer (oom killer, https://lwn.net/Articles/317814/) will start silently killing processes. It does not print anything to the console to let you know what it's doing, although the oom killer's activities show up in the system log. In this situation the output of mclapply will appear to have been randomly contaminated with NULLs.
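If you suspect this is happening, one defensive option is to check the result for NULLs before using it; a minimal sketch (work_fn and x are placeholders):
library(parallel)

res <- mclapply(seq_along(x), work_fn, mc.cores = 2)

# Note: this also flags tasks whose legitimate result is NULL.
bad <- vapply(res, is.null, logical(1))
if (any(bad)) {
  warning("mclapply returned NULL for tasks ",
          paste(which(bad), collapse = ", "),
          " -- a worker may have died (e.g. killed by the oom killer)")
}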
