slurm_apply a RefClass method from within a RefClass method

EDIT: New version of rslurm makes the solution very easy. See my answer below.
Apologies for the somewhat longer than desired MWE, and a title that I realize after submitting the question may be needlessly complicated. I believe the real issue is getting the environment of a RefClass object into rslurm::slurm_apply.
MWE
Here I define a toy reference class called BankAccount. It has two fields and two methods.
The fields are transactions, a list of all transactions associated with the account, and suspicion_threshold, the value above which the bank will investigate the transaction.
The two methods are is_suspicious, which compares the transactions with the suspicion_threshold on the local machine, and is_suspicious_slurm, which uses rslurm::slurm_apply to spread many calls to is_suspicious over a cluster of computers managed by SLURM. You can imagine that if there were many transactions, or if the is_suspicious function were more complex, this might be necessary.
So, here's the setup
BankAccount <- setRefClass(
    Class = 'BankAccount',
    fields = list(
        transactions = 'numeric',
        suspicion_threshold = 'numeric'
    )
)
BankAccount$methods(
    is_suspicious = function(start_idx = 1, stop_idx = length(transactions)) {
        return(start_idx + which(transactions[start_idx:stop_idx] > suspicion_threshold) - 1)
    }
)
BankAccount$methods(
    is_suspicious_slurm = function(num_nodes) {
        usingMethods(is_suspicious)
        t <- length(transactions)
        t_per_n <- floor(t / num_nodes)
        starts <- seq(from = 1, length.out = num_nodes, by = t_per_n)
        stops <- seq(from = t_per_n, length.out = num_nodes, by = t_per_n)
        stops[num_nodes] <- t
        sjob <- rslurm::slurm_apply(f = is_suspicious,
                                    params = data.frame(start_idx = starts,
                                                        stop_idx = stops),
                                    nodes = num_nodes,
                                    add_objects = .self)
        results_list <- rslurm::get_slurm_out(slr_job = sjob,
                                              outtype = "raw",
                                              wait = TRUE)
        return(unlist(results_list))
    }
)
Now, on my local machine I can run:
library(RCexampleforSE)
set.seed(27599)
b <- BankAccount$new()
b$transactions <- rnorm(n = 500)
b$suspicion_threshold <- 2
b$is_suspicious()
and it works as expected:
62 103 155 171 182 188 297 398 493 499
If I run:
b$is_suspicious_slurm(num_nodes = 3)
I get an error, since my personal computer is not connected to a SLURM cluster.
sh: squeue: command not found
Cannot submit; no SLURM workload manager on path
Submission scripts output in directory _rslurm_13ba46e3c70b0
Error in rslurm::get_slurm_out(slr_job = sjob, outtype = "raw", wait = TRUE):
slr_job has not been submitted
If I log on to my university cluster, which uses SLURM, and run the same script, the setup and local methods work just as they did on my personal computer. When I run:
b$is_suspicious_slurm(num_nodes = 3)
it sends jobs to the cluster, as hoped for:
Submitted batch job 6363868
But these jobs error immediately with the following error message in slurm_0.out, slurm_1.out, and slurm_2.out:
Error in attr(, "mayCall") : argument 1 is empty
Execution halted
Thoughts and Attempts
I figure the job probably needs, but doesn't have available, the BankAccount object. So I tried passing it in as the add_objects parameter to rslurm::slurm_apply:
sjob <- rslurm::slurm_apply(f = is_suspicious,
                            params = data.frame(start_idx = starts,
                                                stop_idx = stops),
                            nodes = num_nodes,
                            add_objects = .self)
I also tried it in quotes and inside eval(), neither of which worked.
How can I make the object accessible to the worker jobs created with rslurm::slurm_apply?

Version 0.4.0 of rslurm completely solved this problem.
Define is_suspicious_slurm() as:
BankAccount$methods(
    is_suspicious_slurm = function(num_nodes) {
        usingMethods(is_suspicious)
        t <- length(transactions)
        t_per_n <- floor(t / num_nodes)
        starts <- seq(from = 1, length.out = num_nodes, by = t_per_n)
        stops <- seq(from = t_per_n, length.out = num_nodes, by = t_per_n)
        stops[num_nodes] <- t
        sjob <- rslurm::slurm_apply(f = is_suspicious,
                                    params = data.frame(start_idx = starts,
                                                        stop_idx = stops),
                                    nodes = num_nodes)
        results_list <- rslurm::get_slurm_out(slr_job = sjob,
                                              outtype = "raw",
                                              wait = TRUE)
        return(unlist(results_list))
    }
)
The only change is that in the call to rslurm::slurm_apply, the add_objects parameter is not specified. It does not need to be specified because, as @Ian pointed out:
"...you don't need to pass self at all when slurm_apply sends the serialized function, which appears to include both ".self" and "transactions" in the enclosing environment."

EDIT: OP's answer is all you need to know.
The add_objects parameter is used for passing a character vector, not the objects themselves. All the objects are then saved in one RData file, assuming they can be found by name. In theory, you should be able to use add_objects = c('.self') within your method definition.
The key here is, "assuming they can be found". I will edit this post once a pending update to the rslurm package (which should make that finding more successful) is released.
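A minimal sketch of that idea (untested, pending the update mentioned above), passing the name rather than the object itself:
sjob <- rslurm::slurm_apply(f = is_suspicious,
                            params = data.frame(start_idx = starts,
                                                stop_idx = stops),
                            nodes = num_nodes,
                            add_objects = c('.self'))  # a character vector of names, not the object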
Be very careful passing objects to cluster nodes: they do not come back. Not only will any side effects be lost; there is also no inter-node communication implemented by rslurm.
Also be careful with which :) Your is_suspicious method will be wrong for arguments that don't start at 1. Try this version:
BankAccount$methods(
    is_suspicious = function(i = 1:length(transactions)) {
        idx <- which(transactions[i] > suspicion_threshold)
        i[idx]
    }
)
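A quick check of this version, re-running the setup from the question (same seed and data); restricting i to a window still returns positions in the original vector:
set.seed(27599)
b <- BankAccount$new()
b$transactions <- rnorm(n = 500)
b$suspicion_threshold <- 2
b$is_suspicious()              # 62 103 155 171 182 188 297 398 493 499, as before
b$is_suspicious(i = 401:500)   # only the last 100 transactions: 493 499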

Related

"Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: could not find function "wincrqa"

I am currently trying to run a parallelized RQA with the following code.
library(snow)
library(doSNOW)
library(crqa)
my_wincrqa = function(x, y){
    wincrqa(x, y, windowstep = 1000, windowsize = 2000,
            radius = .2, delay = 4, embed = 2, rescale = 0, normalize = 0,
            mindiagline = 2, minvertline = 2, tw = 0, whiteline = F,
            side = "both", method = "crqa", metric = "euclidean", datatype = "continuous")
}
cl <- makeCluster(11, type = "SOCK")
start_time <- Sys.time()
WCRQA_list = clusterMap(cl, my_wincrqa, HR_list, RR_list)
end_time <- Sys.time()
end_time - start_time
Unfortunately, I get this:
Error in checkForRemoteErrors(val) : 2 nodes produced errors; first
error: could not find function "wincrqa"
I know there is probably some error in setting up the parallel processing, but I am not able to resolve it. I also tried a similar thing using the parallel package.
I am happy for any help!
Best,
Johnson
The issue is that you’ve loaded and attached the ‘crqa’ package in your main execution environment, but the cluster nodes are running code in separate, isolated R sessions — they don’t see the same loaded packages or global variables!
The easiest solution is to replace use of wincrqa with a fully qualified name, i.e. to use crqa::wincrqa inside your function.
Alternatively, it is possible to attach the ‘crqa’ package on all cluster nodes prior to executing the function:
clusterEvalQ(cl, library(crqa))
WCRQA_list = clusterMap(cl, my_wincrqa, HR_list, RR_list)
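For the first suggestion, the change would look roughly like this (same arguments as in the question); note that the crqa package still has to be installed on each node, it just no longer needs to be attached there:
my_wincrqa = function(x, y){
    crqa::wincrqa(x, y, windowstep = 1000, windowsize = 2000,
                  radius = .2, delay = 4, embed = 2, rescale = 0, normalize = 0,
                  mindiagline = 2, minvertline = 2, tw = 0, whiteline = F,
                  side = "both", method = "crqa", metric = "euclidean", datatype = "continuous")
}
WCRQA_list = clusterMap(cl, my_wincrqa, HR_list, RR_list)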

Error in make_empty_graph(n = 0, directed = directed) : VECTOR_ELT() can only be applied to a 'list', not a 'closure'

As part of research I'm doing this semester, I'm adding onto a network that was already built by previous students, but I'm running into errors I never had before. This code was not written by me; I have only modified it very slightly as I move along.
So I have this list daily_networks_df that contains 85953 data frames with each data frame looking something like this:
And here is the function that generates the different sections of the network (home, work, and other):
generate_el <- function(net_df){
    networki = net_df[,1]
    if(length(networki) > 1){
        # add edge in all possible pairs with prob_interact[i]
        prob_interact = 1/sqrt(length(networki))
        ppl_pairs = combn(networki, 2)
        tmp1 = rbinom(ncol(ppl_pairs), 1, prob_interact)
        el_tmp = t(ppl_pairs[, tmp1 == 1])
        return(el_tmp)
    }
}
The function above is modified a bit for the three different network types, but the other two are very similar so I didn't include them. And here is the code that calls the functions:
# DEMAND NETWORK
work_start = 2
list_el1 = lapply(daily_networks_df[1:work_start], generate_el)
gr_el1 = do.call('rbind', list_el1)
# WORK NETWORK
prob_interact = .4
list_el2 = lapply(daily_networks_df[(work_start+1):homes_start], generate_el_work)
gr_el2 = do.call('rbind', list_el2)
# HOME NETWORK
prob_interact = .8
list_el3 = lapply(daily_networks_df[(homes_start+1):length(daily_networks_df)], generate_el_homes)
gr_el3 = do.call('rbind', list_el3)
# FULL NETWORK
gr_el=rbind(gr_el1, gr_el2, gr_el3)
gr = graph_from_data_frame(gr_el, directed=FALSE, vertices=total_pop)
I'm getting this error when I try to execute it and have no idea why. Any help?
> gr = graph_from_data_frame(gr_el, directed=FALSE, vertices=total_pop)
Error in make_empty_graph(n = 0, directed = directed) :
VECTOR_ELT() can only be applied to a 'list', not a 'closure'

how to write out multiple files in R?

I am a newbie R user. Now, I have a question related to writing out multiple files with different names. Let's say that my data has the following structure:
IV_HAR_m1<-matrix(rnorm(1:100), ncol=30, nrow = 2000)
DV_HAR_m1<-matrix(rnorm(1:100), ncol=10, nrow = 2000)
I am trying to estimate multiple LASSO regressions. At the beginning, I was storing the iterations in one object called Dinamic_beta. This object was stored in only one file, and it saves the required information each time my code iterates.
For doing this I was using stew, which belongs to the pomp package, but the total process takes 5 or 6 days and I am worried about a power outage or a failure of my computer.
Now, I want to save each environment (iteration) in its own .Rnd file. I do not know how to do that, but the code that I am using is the following:
library(glmnet)
library(Matrix)
library(pomp)
space <- 7  # the number of files that I would want to create
Dinamic_betas <- array(NA, c(10, 31, (nrow(IV_HAR_m1) - space)))
dimnames(Dinamic_betas) <- list(NULL, NULL)
set.seed(12345)
stew(  # stew saves the environment in a .Rnd file
    file = "Dinamic_LASSO_RD", {  # the name required by stew for creating one file with all information
        for (i in 1:dim(Dinamic_betas)[3]) {
            tryCatch(  # print messages
                expr = {
                    cv_dinamic <- cv.glmnet(IV_HAR_m1[i:(space+i-1),],
                                            DV_HAR_m1[i:(space+i-1),], alpha = 1, family = "mgaussian",
                                            thresh = 1e-08, maxit = 10^9)
                    LASSO_estimation_dinamic <- glmnet(IV_HAR_m1[i:(space+i-1),], DV_HAR_m1[i:(space+i-1),],
                                                       alpha = 1, lambda = cv_dinamic$lambda.min, family = "mgaussian")
                    coefs <- as.matrix(do.call(cbind, coef(LASSO_estimation_dinamic)))
                    Dinamic_betas[,,i] <- t(coefs)
                },
                error = function(e){
                    message("Caught an error!")
                    print(e)
                },
                warning = function(w){
                    message("Caught a warning!")
                    print(w)
                },
                finally = {
                    message("All done, quitting.")
                }
            )
            if (i %% 400 == 0) {print(i)}
        }
    }
)
If someone can suggest another package that stores the outputs in different files, I will be grateful.
Try adding this just before the close of your loop:
save.image(paste0("Results_iteration_",i,".RData"))
This should save your entire workspace to disk for every iteration. You can then use load() to load the workspace of every environment. Let me know if this works.
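A minimal sketch of where that call would sit in the loop from the question (the surrounding stew() call is omitted, and the file name prefix is just an example):
for (i in 1:dim(Dinamic_betas)[3]) {
    # ... existing cv.glmnet / glmnet code filling Dinamic_betas[,,i] ...
    if (i %% 400 == 0) {print(i)}
    # checkpoint the whole workspace after each iteration
    save.image(paste0("Results_iteration_", i, ".RData"))
}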

How can I correctly use the cluster plan in the R future (furrr) package

I am currently using furrr to create a more organized execution of my model. I use a data.frame to pass parameters to a function in an orderly way, and then use furrr::future_map() to map a function across all the parameters. The function works flawlessly when using the sequential and multicore futures on my local machine (OSX).
Now, I want to test my code by creating my own cluster of AWS instances (just as shown here).
I created a function using the linked article code:
make_cluster_ec2 <- function(public_ip){
    ssh_private_key_file <- Sys.getenv('PEM_PATH')
    github_pac <- Sys.getenv('PAC')
    cl_multi <- future::makeClusterPSOCK(
        workers = public_ip,
        user = "ubuntu",
        rshopts = c(
            "-o", "StrictHostKeyChecking=no",
            "-o", "IdentitiesOnly=yes",
            "-i", ssh_private_key_file
        ),
        rscript_args = c(
            "-e", shQuote("local({p <- Sys.getenv('R_LIBS_USER'); dir.create(p, recursive = TRUE, showWarnings = FALSE); .libPaths(p)})"),
            "-e", shQuote("install.packages('devtools')"),
            "-e", shQuote(glue::glue("devtools::install_github('user/repo', auth_token = '{github_pac}')"))
        ),
        dryrun = FALSE)
    return(cl_multi)
}
Then I create the cluster object and check that it is connected to the right instance:
public_ids <- c('public_ip_1', 'public_ip_2')
cls <- make_cluster_ec2(public_ids)
f <- future(Sys.info())
And when I retrieve the value of f I get the specs of one of my remote instances, which indicates the socket is correctly connected:
> value(f)
sysname
"Linux"
release
"4.15.0-1037-aws"
version
"#39-Ubuntu SMP Tue Apr 16 08:09:09 UTC 2019"
nodename
"ip-xxx-xx-xx-xxx"
machine
"x86_64"
login
"ubuntu"
user
"ubuntu"
effective_user
"ubuntu"
But when I run my code using my cluster plan:
plan(list(tweak(cluster, workers = cls), multisession))
parameter_df %>%
    mutate(model_traj = furrr::future_pmap(list('lat' = latitude,
                                                'lon' = longitude,
                                                'height' = stack_height,
                                                'name_source' = facility_name,
                                                'id_source' = facility_id,
                                                'duration' = duration,
                                                'days' = seq_dates,
                                                'daily_hours' = daily_hours,
                                                'direction' = 'forward',
                                                'met_type' = 'reanalysis',
                                                'met_dir' = here::here('met'),
                                                'exec_dir' = here::here("Hysplit4/exec"),
                                                'cred' = list(creds)),
                                           dirtywind::hysplit_trajectory,
                                           .progress = TRUE)
    )
I get the following error:
Error in file(temp_file, "a") : cannot open the connection
In addition: Warning message:
In file(temp_file, "a") :
cannot open file '/var/folders/rc/rbmg32js2qlf4d7cd4ts6x6h0000gn/T//RtmpPvdbV3/filecf23390c093.txt': No such file or directory
I cannot figure out what is happening under the hood, and I cannot traceback() the error from my remote machines either. I have tested the connection with the examples in the article and things seem to run correctly. I am wondering why it is trying to create a temporary file during the execution. What am I missing here?
(This is also an issue in the furrr repo)
Disable the progress bar, i.e. don't specify .progress = TRUE.
This is because .progress = TRUE assumes your R workers can write to a temporary file that the main R process created. This is typically only possible when you parallelize on the same machine.
A smaller example of this error is:
library(future)
## Set up a cluster with one worker running on another machine
cl <- makeClusterPSOCK(workers = "node2")
plan(cluster, workers = cl)
y <- furrr::future_map(1:2, identity, .progress = FALSE)
str(y)
## List of 2
## $ : int 1
## $ : int 2
y <- furrr::future_map(1:2, identity, .progress = TRUE)
## Error in file(temp_file, "a") : cannot open the connection
## In addition: Warning message:
## In file(temp_file, "a") :
## cannot open file '/tmp/henrik/Rtmp1HkyJ8/file4c4b864a028ac.txt': No such file or directory

R - Parallel Processing and ldply error

I am trying to use the code below to make API calls in a parallel process to speed them up. (I know this isn't the best way to speed up API calls, but it works.)
It only fails when I try to run it in parallel; otherwise it works. In the ldply function I am getting the error below:
Error in do.ply(i) :
task 1 failed - "object of type 'closure' is not subsettable"
In addition:
Warning messages:
1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
Any help would be appreciated!
One <- 26
cl <- makeCluster(4)
registerDoSNOW(cl)
func.time <- Sys.time()
## API CALL ONE FOR "kline"
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=", pairs[1], "&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline <- data.frame(text_content %>% fromJSON())
kline$symbol <- pairs[1]
## API FUNCTION TO BE APPLIED FOR REST
loopfunction <- function(i){
    url <- "https://api.binance.com"
    path <- paste("/api/v1/klines?symbol=", pairs[i], "&interval=1m&limit=1", sep = "")
    raw.results <- GET(url = url, path = path)
    text_content <- content(raw.results, as = "text", encoding = "UTF-8")
    kline_temp <- data.frame(text_content %>% fromJSON())
    kline_temp$symbol <- pairs[i]
    kline <- rbind(kline, kline_temp)
    return(kline)
}
## DPLY PARALLEL FUNCTION
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T,
                           .paropts = c("httr", "jsonlite", "dplyr")))  ## "One" is a list variable created earlier
stopCluster(cl)
func.end.time <- Sys.time()
func.tot.time <- func.end.time - func.time
Your question isn't fully reproducible, so the following is an educated guess.
Your loopfunction() references an object called pairs. It seems from your script that a variable called pairs is defined somewhere in your local environment. However, when loopfunction() is passed to ldply(), it no longer has access to that variable (ordinarily it would, but parallelization requires fresh R environments to be created). Having failed to find an object called pairs in the environment, R continues searching and finds a match in graphics::pairs(). This is a plotting function, not a subsettable object like a vector or data frame. Hence the error message, "object of type 'closure' is not subsettable".
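You can reproduce that message on its own by trying to subset any function, for example:
pairs[1]
## Error in pairs[1] : object of type 'closure' is not subsettable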
I'm not especially familiar with how ldply implements parallel processing, but you could probably modify your function definition like this:
loopfunction <- function(i, pairs) {
    ...[body of function]...
}
And pass pairs as an extra parameter in your ldply call:
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, pairs = pairs,
                           .parallel = T,
                           .paropts = list(.packages = c("httr", "jsonlite", "dplyr"))))
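An untested alternative: since plyr forwards .paropts to foreach, you may also be able to export the variable by name instead of adding a parameter (the .export option is a foreach feature, not something from the original post):
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T,
                           .paropts = list(.packages = c("httr", "jsonlite", "dplyr"),
                                           .export = "pairs")))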
