R file connection when using parallel

I am trying to make a file connection within a cluster (using parallel).
While it works correctly in the global environment, it gives an error message when used on the members of the cluster (see the script below).
Did I miss something?
Any suggestions?
Thanks,
# This part works
#----------------
cat("This is a test file" , file={f <- tempfile()})
con <- file(f, "rt")
# Doing what I think is the same thing gives an error message when executed in parallel
#--------------------------------------------------------------------------------------
library(parallel)
cl <- makeCluster(2)
## Exporting the object f into the cluster
clusterExport(cl, "f")
clusterEvalQ(cl[1], con <- file(f[[1]], "rt"))
#Error in checkForRemoteErrors(lapply(cl, recvResult)) :
# one node produced an error: cannot open the connection
## Creating the object f into the cluster
clusterEvalQ(cl[1],cat("This is a test file" , file={f <- tempfile()}))
clusterEvalQ(cl[1],con <- file(f, "rt"))
#Error in checkForRemoteErrors(lapply(cl, recvResult)) :
# one node produced an error: cannot open the connection
############ Here is my sessionInfo() ###################
# R version 3.3.0 (2016-05-03)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
#
# locale:
# [1] LC_COLLATE=French_Canada.1252 LC_CTYPE=French_Canada.1252
# [3] LC_MONETARY=French_Canada.1252 LC_NUMERIC=C
# [5] LC_TIME=French_Canada.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#

Try changing the code to return a NULL rather than the created connection object:
clusterEvalQ(cl[1], {con <- file(f[[1]], "rt"); NULL})
Connection objects can't be safely serialized and sent between the master and the workers; returning NULL keeps the connection on the worker where it was created, so nothing unserializable is sent back.
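Put together, a minimal sketch of the whole round trip (assuming the workers run on the same machine, so they can see the master's temp file):
library(parallel)

cat("This is a test file", file = {f <- tempfile()})
cl <- makeCluster(2)
clusterExport(cl, "f")

# Open the connection on the worker, but return NULL so no connection
# object is serialized back to the master
clusterEvalQ(cl[1], {con <- file(f, "rt"); NULL})

# Use the connection on the worker; plain character vectors travel back fine
clusterEvalQ(cl[1], {txt <- readLines(con); close(con); txt})

stopCluster(cl)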

Related

glmmTMB_phylo: Error in Matrix::rankMatrix(TMBStruc$data.tmb[[whichX]]) : length(d <- dim(x)) == 2 is not TRUE

I am trying to run the following model:
mod1 <- phylo_glmmTMB(response ~ sv1 +                        # sampling variables
                          sv2 + sv3 + sv4 + sv5 +
                          sv6 + sv7 +
                          (1 | phylo) + (1 | reference_id),   # random effects
                      ziformula = ~ 0,
                      # ar1(pos + 0 | group)  # spatial autocorrelation structure; group is a dummy variable
                      phyloZ = supertreenew,
                      phylonm = "phylo",
                      family = "binomial",
                      data = data)
But I keep getting the error:
Error in Matrix::rankMatrix(TMBStruc$data.tmb[[whichX]]) :
length(d <- dim(x)) == 2 is not TRUE
This error also occurs with other reproducible examples (data) that I found.
Before running the model, I just loaded my data (data and supertree) and computed a Z matrix from supertree:
#Compute Z matrix
#supertreenew <- vcv.phylo(supertreenew)
#or
supertreenew <- phylo.to.Z(supertreenew)
# enforce a match between the rows of the Z matrix and the levels of data$phylo
supertreenew <- supertreenew[levels(factor(data$phylo)), ]
I have installed the development version via:
remotes::install_github("wzmli/phyloglmm/pkg")
But no success.
The dimensions of my supertree are:
[[1]]
... [351]
[[2]]
... [645]
Any guess?
My session info:
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] phyloglmm_0.1.0.9001 brms_2.18.0 cpp_1.0.9 performance_0.10. DHARMa_0.4.6
[6] phytools_1.2-0 maps_3.4.0 ape_5.6-2 lme4_1.1-31 Matrix_1.5-1
[11] TMB_1.9.1 glmmTMB_1.1.5.9000 remotes_2.4.2
(First error, "Error in Matrix::rankMatrix") This is a consequence of the addition of a check of the rank of the fixed-effects matrix in recent versions of glmmTMB. For now, adding
control = glmmTMB::glmmTMBControl(rank_check = "skip")
to your phylo_glmmTMB call should work around the problem.
(Second error, "Error in getParameterOrder(data, parameters, new.env(), DLL = DLL) ...") I just updated the refactor branch to handle this problem [caused by internal changes in glmmTMB]. Use remotes::install_github("wzmli/phyloglmm/pkg@refactor") to install this version, then try your example again.
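Putting both suggestions together, a sketch of the updated workflow (assuming, as the answer implies, that phylo_glmmTMB forwards control to glmmTMB):
# install the updated refactor branch
remotes::install_github("wzmli/phyloglmm/pkg@refactor")

# original call with the rank-check workaround added
mod1 <- phylo_glmmTMB(response ~ sv1 + sv2 + sv3 + sv4 + sv5 + sv6 + sv7 +
                          (1 | phylo) + (1 | reference_id),
                      ziformula = ~ 0,
                      phyloZ = supertreenew,
                      phylonm = "phylo",
                      family = "binomial",
                      data = data,
                      control = glmmTMB::glmmTMBControl(rank_check = "skip"))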

error saving: Error in gzfile(file, "wb") : cannot open the connection

I am trying to run an LDA topic analysis in RStudio (R 3.3.0). I am at the following step but keep getting the error:
Error in gzfile(file, "wb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "wb") :
cannot open compressed file 'results/Gibbs_5_1.rda', probable reason 'No such file or directory'
There is a problem while saving.
D <- nrow(data)
folding <- sample(rep(seq_len(10), ceiling(D))[seq_len(D)])
for (k in topics) {
  for (chain in seq_len(10)) {
    FILE <- paste("Gibbs_", k, "_", chain, ".rda", sep = "")
    training <- LDA(data[folding != chain, ], k = k,
                    control = list(seed = SEED,
                                   burnin = BURNIN, thin = THIN, iter = ITER, best = BEST),
                    method = "Gibbs")
    best_training <- training@fitted[[which.max(logLik(training))]]
    testing <- LDA(data[folding == chain, ], model = best_training,
                   control = list(estimate.beta = FALSE, seed = SEED,
                                  burnin = BURNIN,
                                  thin = THIN, iter = ITER, best = BEST))
    save(training, testing, file = file.path("results", FILE))
  }
}
There is enough free space on my computer, I have tried restarting R several times, and yes, I did look at the other questions, but none of the solutions seem to work.
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] topicmodels_0.2-4 wordcloud_2.5 RColorBrewer_1.1-2 slam_0.1-35 SnowballC_0.5.1
[6] tm_0.6-2 NLP_0.1-9
loaded via a namespace (and not attached):
[1] modeltools_0.2-21 parallel_3.3.0 tools_3.3.0 Rcpp_0.12.5 stats4_3.3.0
I am a beginner in R, and I am following a book to conduct the analysis for my master's thesis.
Thanks!
The error message says it can't save the file. Where is it trying to save it? Looking at the code, it looks like it's trying to save into a folder called "results". Does this folder exist? If it doesn't, I get exactly that error when I try to save something into a non-existent folder:
> save(iris, file=file.path("results","foo.rda"))
Error in gzfile(file, "wb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "wb") :
cannot open compressed file 'results/foo.rda', probable reason 'No such file or directory'
If I create the folder then it works:
> dir.create("results")
> save(iris, file=file.path("results","foo.rda"))
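In the posted loop, a defensive check just before save() (a small sketch; dir.exists() is available from R 3.2.0 onward) makes the script independent of whether the folder already exists:
# create the output folder once if it is missing, then save as before
if (!dir.exists("results")) dir.create("results", recursive = TRUE)
save(training, testing, file = file.path("results", FILE))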

Quantstrat WFA with intraday Data

I've been getting WFA to run on the full set of intraday GBPUSD 30-minute data and have come across a couple of things that need addressing. The first is that I believe the save step needs changing to remove the time of day from the file-name string (as shown in a pull request on the R-Finance/quantstrat repo on GitHub). The walk.forward function throws this error:
Error in gzfile(file, "wb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "wb") :
cannot open compressed file 'wfa.GBPUSD.2002-10-21 00:30:00.2002-10-23 23:30:00.RData', probable reason 'Invalid argument'
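The underlying problem is that Windows file names cannot contain ":", so the timestamps embedded in the audit file name make gzfile() fail. A sketch of the idea behind the linked pull request (an illustration only, not the actual patch) is to strip or replace those characters:
# illustration only: replace characters Windows rejects in file names
fname <- "wfa.GBPUSD.2002-10-21 00:30:00.2002-10-23 23:30:00.RData"
gsub("[: ]", ".", fname)
# [1] "wfa.GBPUSD.2002-10-21.00.30.00.2002-10-23.23.30.00.RData"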
The second is a rare-case scenario where it ends up calling runSum on a data set with fewer rows than the period you are testing (n). This is the traceback():
8: stop("Invalid 'n'")
7: runSum(x, n)
6: runMean(x, n)
5: (function (x, n = 10, ...)
{
ma <- runMean(x, n)
if (!is.null(dim(ma))) {
colnames(ma) <- "SMA"
}
return(ma)
})(x = Cl(mktdata)[, 1], n = 25)
4: do.call(indFun, .formals)
3: applyIndicators(strategy = strategy, mktdata = mktdata, parameters = parameters,
...)
2: applyStrategy(strategy, portfolios = portfolio.st, mktdata = symbol[testing.timespan]) at custom.walk.forward.R#122
1: walk.forward(strategy.st, paramset.label = "WFA", portfolio.st = portfolio.st,
account.st = account.st, period = "days", k.training = 3,
k.testing = 1, obj.func = my.obj.func, obj.args = list(x = quote(result$apply.paramset)),
audit.prefix = "wfa", anchored = FALSE, verbose = TRUE)
The extended GBPUSD data used in the creation of the Luxor demo includes an erroneous date (2002/10/27) with only 1 observation, which causes this problem. I can also foresee this being an issue when testing longer signal periods on instruments like crude, which have only a few trading hours on Sunday evenings (UTC).
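One way to make the indicator tolerant of such degenerate windows is to wrap it so it returns NA instead of stopping. A sketch (the name safeSMA is made up here; it is not part of quantstrat or TTR):
library(xts)
library(TTR)

# return an all-NA column instead of erroring when there are fewer rows than n
safeSMA <- function(x, n = 10, ...) {
  if (NROW(x) < n) {
    out <- xts(rep(NA_real_, NROW(x)), order.by = index(x))
    colnames(out) <- "SMA"
    return(out)
  }
  SMA(x, n = n, ...)
}

# registered in place of SMA when adding the indicator to the strategy, e.g.
# add.indicator(strategy.st, name = "safeSMA",
#               arguments = list(x = quote(Cl(mktdata)), n = 25), label = "SMA25")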
Given that I have been following the Luxor demo exactly, with the same (extended) intraday data set, are these genuine issues, or have they been caused by package updates etc.?
What is the preferred way to report these things to the quantstrat authors, and to find out if/when fixes are likely to be made?
SessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quantstrat_0.9.1739 foreach_1.4.3 blotter_0.9.1741 PerformanceAnalytics_1.4.4000 FinancialInstrument_1.2.0 quantmod_0.4-5 TTR_0.23-1
[8] xts_0.9.874 zoo_1.7-13
loaded via a namespace (and not attached):
[1] compiler_3.3.0 tools_3.3.0 codetools_0.2-14 grid_3.3.0 iterators_1.0.8 lattice_0.20-33
quantstrat is on github here:
https://github.com/braverock/quantstrat
Issues and patches should be reported via github issues.

Possible reasons for " Error in checkForRemoteErrors(lapply(cl, recvResult)) : ... object '.doSnowGlobals' not found" Error?

I have been executing a function script repeatedly in R for many years. Within the function definition, I set up a parallel cluster on my multi-core Windows workstation using:
# cores0 <- 20 (cores set to 20 outside of function definition)
cl.spec <- rep("localhost", cores0)
cl <- makeCluster(cl.spec, type="SOCK", outfile="")
registerDoParallel(cl, cores=cores0)
As of yesterday, my function execution is no longer working; it was getting hung up for hours. (Additionally, using the Resource Monitor, I could see that none of my CPUs were active, despite my script specifying 20 cores.) When I went back into the function and tested it line by line, I discovered that the following lines are not executing (i.e., they hang where they would usually complete in a few seconds):
cl.spec <- rep("localhost", cores0)
cl <- makeCluster(cl.spec, type="SOCK", outfile="")
I tried looking up the problem and found several references to using "PSOCK" type, but could not determine when to use PSOCK versus SOCK. Nonetheless, I attempted the same script using "PSOCK" instead of "SOCK":
cl <- makeCluster(cl.spec, type="PSOCK", outfile="")
registerDoParallel(cl, cores=cores0)
With the PSOCK modification it no longer hung, and both this line and the registerDoParallel() call appeared to execute.
However, when I then ran the complete function definition containing the above two lines and called the function, as below, I got an error I had never seen:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
20 nodes produced errors; first error: object '.doSnowGlobals' not found
I also tried not specifying the type or outfile, but this produced the identical error as type="PSOCK":
cl <- makeCluster(cl.spec)
registerDoParallel(cl, cores=cores0)
My questions:
1. Why might the makeCluster() line be getting hung up when it never has before?
cl <- makeCluster(cl.spec, type="SOCK", outfile="")
2. The problem happens both when I have only the parallel and doParallel packages loaded and when I also have the snow and doSNOW packages loaded. Are all four packages required to execute foreach() commands?
Here is the function definition and function call containing the makeCluster() and registerDoParallel() calls, as above:
# FUNCTION DEFINITION
FX_RFprocessingSNPruns <- function(path, CurrentRoundSNPlist, colSAMP, Nruns, ntreeIN, coresIN,CurrentRoundGTframeRDA){
...do a bunch of steps ...
#&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
# SET UP INTERNAL FUNCTION
#&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
ImpOOBerr <- function(x, y, d) {
  # ...create function...
}
#################################################################
# SET UP THE CLUSTER
#################################################################
#Setup clusters via parallel/DoParallel
cl.spec <- rep("localhost", cores0)
cl <- makeCluster(cl.spec, type="PSOCK", outfile="")
registerDoParallel(cl, cores=cores0)
#################################################################
# *** EMPLOY foreach TO CARRY OUT randomForest IN PARALLEL
#################################################################
system.time(RFoutput_runs <- foreach(i = 1:Nruns0, .combine = 'cbind',
                                     .packages = 'randomForest', .inorder = FALSE,
                                     .multicombine = TRUE, .errorhandling = "remove") %dopar% {
  # ...do a bunch of steps...
  ImpOOBerr(x, y, d)
})
#################################################################
# STOP THE CLUSTER
#################################################################
stopCluster(cl)
return(RFoutput_runs)
}
# CALL FUNCTION
path0="C:/USERS/KDA/WORKING/"
system.time(GTtest_5runs <- FX_RFprocessingSNPruns(
path=path0,
CurrentRoundSNPlist="SNPlist.rda",
colSAMP=20,
Nruns=5,
ntreeIN=150,
coresIN=5,
CurrentRoundGTframeRDA="GT.rda"))
#Error in checkForRemoteErrors(lapply(cl, recvResult)) :
# 20 nodes produced errors; first error: object '.doSnowGlobals' not found.
I found these posts that reference the error, but the solutions are not working for me:
error: object '.doSnowGlobals' not found?
http://grokbase.com/t/r/r-sig-hpc/148880dpsm/error-object-dosnowglobals-not-found
I'm working on a Windows 8 machine (64-bit) with 40 cores.
R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "3.0"
$year
[1] "2016"
$month
[1] "05"
$day
[1] "03"
$`svn rev`
[1] "70573"
$language
[1] "R"
$version.string
[1] "R version 3.3.0 (2016-05-03)"
$nickname
[1] "Supposedly Educational"
R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
It was institutional anti-virus software preventing access to the cores.
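If something similar happens again, a minimal smoke test (a sketch using only parallel/doParallel/foreach, nothing from snow or doSNOW) can quickly show whether workers can be started and reached at all:
library(doParallel)   # attaches foreach and parallel

cl <- makeCluster(4)  # PSOCK workers on localhost
registerDoParallel(cl)

# if the workers start and can communicate, this returns 1:4 almost immediately
res <- foreach(i = 1:4, .combine = c) %dopar% i
print(res)

stopCluster(cl)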

parallel data.table -- what's the correct syntax

Following up on some data.table parallelism threads (1) (2) (3), I'm trying to figure it out. What's wrong with this syntax?
library(data.table)
set.seed(1234)
dt <- data.table(id = factor(sample(1L:10000L, size = 1e6, replace = TRUE)),
                 val = rnorm(n = 1e6), key = "id")
foo <- function(l) sum(l)
dt2 <- dt[, foo(.SD), by = "id"]

library(parallel)
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x = parallel:::splitRows(dt, detectCores()),
                    fun = lapply, FUN = function(x, foo) {
                      x[, foo(data.table:::".SD"), by = "id"]
                    }, foo = foo)
stopCluster(cl)
# note that library(parallel) is annoying and you often have to do this type
# ("::", ":::") of exporting to the parallel package
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: incorrect number of dimensions
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x = parallel:::splitRows(dt, detectCores()),
                    fun = lapply, FUN = function(x, foo) {
                      x <- data.table::data.table(x)
                      x[, foo(data.table:::".SD"), by = "id"]
                    }, foo = foo)
stopCluster(cl)
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: object 'id' not found
I've played around with the syntax quite a bit; these two attempts seem to be the closest I can get, and obviously something's still not right.
My real problem is similarly structured but has many more rows, and I'm using a machine with 24 cores / 48 logical processors, so watching my computer use roughly 4% of its computing power (only 1 core in use) is really annoying.
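For comparison, one pattern that does run the aggregation across cores with just parallel and data.table (a sketch, not necessarily the fastest route: each chunk is aggregated on a worker and then re-aggregated locally, because a keyed group can straddle a chunk boundary; the second pass is valid here only because foo is a plain sum):
library(parallel)
library(data.table)

cl <- makeCluster(detectCores())
clusterEvalQ(cl, library(data.table))   # data.table must be loaded on every worker
clusterExport(cl, "foo")

chunks <- lapply(splitIndices(nrow(dt), length(cl)), function(i) dt[i])
res    <- parLapply(cl, chunks, function(x) x[, foo(val), by = id])
dt3    <- rbindlist(res)[, foo(V1), by = id]   # merge partial sums per id

stopCluster(cl)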
You may want to evaluate an Rserve-based solution for parallelism.
See the example below, built on Rserve, which uses 2 R nodes locally in parallel. It can also be distributed over remote instances.
library(data.table)
set.seed(1234)
dt <- data.table(id = factor(sample(1L:10000L, size = 1e6, replace = TRUE)),
                 val = rnorm(n = 1e6), key = "id")
foo <- function(l) sum(l)

library(big.data.table)
# start 2 R instances
library(Rserve)
port = 6311:6312
invisible(sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save"))))
# client side
rscl = rscl.connect(port = port, pkgs = "data.table") # connect and auto-load packages on nodes
bdt = as.big.data.table(dt, rscl)   # create big.data.table from local data.table and connections to R nodes
rscl.assign(rscl, "foo", foo)       # assign `foo` function to nodes
bdt[, foo(.SD), by = "id"][, foo(.SD), by = "id"]  # first query runs remotely, second locally
# id V1
# 1: 1 10.328998
# 2: 2 -8.448441
# 3: 3 21.475910
# 4: 4 -5.302411
# 5: 5 -11.929699
# ---
# 9996: 9996 -4.905192
# 9997: 9997 -4.293194
# 9998: 9998 -2.387100
# 9999: 9999 16.530731
#10000: 10000 -15.390543
# optionally with special care
# bdt[, foo(.SD), by= "id", outer.aggregate = TRUE]
session info:
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rserve_1.8-5 big.data.table_0.3.3 data.table_1.9.7
loaded via a namespace (and not attached):
[1] RSclient_0.7-3 tools_3.2.3
