How to write out multiple files in R?

I am a newbie R user and I have a question about writing out multiple files with different names. Let's say my data has the following structure:
IV_HAR_m1 <- matrix(rnorm(2000 * 30), ncol = 30, nrow = 2000)
DV_HAR_m1 <- matrix(rnorm(2000 * 10), ncol = 10, nrow = 2000)
I am trying to estimate multiple LASSO regressions. At first I stored the iterations in a single object called Dinamic_betas; this object was saved to one file, and it accumulated the required information each time my code iterated.
To do this I was using stew, which belongs to the pomp package, but the whole process takes 5 or 6 days and I am worried about a power outage or a failure of my computer.
Now I want to save each environment (iteration) in its own .Rnd file. I do not know how to do that; the code I am using is the following:
library(glmnet)
library(Matrix)
library(pomp)

space <- 7 # the number of files that I would want to create
Dinamic_betas <- array(NA, c(10, 31, nrow(IV_HAR_m1) - space))
dimnames(Dinamic_betas) <- list(NULL, NULL, NULL) # a 3-d array needs three dimname components
set.seed(12345)

stew( # stew saves the environment in a .Rnd file
  file = "Dinamic_LASSO_RD", { # the name required by stew for creating one file with all the information
    for (i in 1:dim(Dinamic_betas)[3]) {
      tryCatch( # print messages
        expr = {
          cv_dinamic <- cv.glmnet(IV_HAR_m1[i:(space + i - 1), ],
                                  DV_HAR_m1[i:(space + i - 1), ],
                                  alpha = 1, family = "mgaussian",
                                  thresh = 1e-08, maxit = 10^9)
          LASSO_estimation_dinamic <- glmnet(IV_HAR_m1[i:(space + i - 1), ],
                                             DV_HAR_m1[i:(space + i - 1), ],
                                             alpha = 1, lambda = cv_dinamic$lambda.min,
                                             family = "mgaussian")
          coefs <- as.matrix(do.call(cbind, coef(LASSO_estimation_dinamic)))
          Dinamic_betas[, , i] <- t(coefs)
        },
        error = function(e) {
          message("Caught an error!")
          print(e)
        },
        warning = function(w) {
          message("Caught a warning!")
          print(w)
        },
        finally = {
          message("All done, quitting.")
        }
      )
      if (i %% 400 == 0) { print(i) }
    }
  }
)
If someone can suggest another package that stores the outputs in different files, I will be grateful.

Try adding this just before the close of your loop:
save.image(paste0("Results_iteration_", i, ".RData"))
This should save your entire workspace to disk on every iteration. You can then use load() to restore the workspace from any iteration. Let me know if this works.
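If saving the entire workspace on every iteration turns out to be heavy, a lighter variant is to persist only the slice of results produced in iteration i and rebuild the array afterwards. The sketch below is illustrative, not part of the answer above; the file-name pattern and the reassembly loop are assumptions:
# Inside the loop, right after filling Dinamic_betas[, , i]:
saveRDS(Dinamic_betas[, , i], file = paste0("betas_iteration_", i, ".rds"))

# After a crash or restart, rebuild the array from whatever iterations finished
# (the file names are the hypothetical ones used above).
Dinamic_betas <- array(NA, c(10, 31, nrow(IV_HAR_m1) - space))
for (f in list.files(pattern = "^betas_iteration_\\d+\\.rds$")) {
  i <- as.integer(gsub("\\D", "", f))  # recover the iteration index from the file name
  Dinamic_betas[, , i] <- readRDS(f)
}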

Related

R foreach multiple cores accessing a function at the same time

I have 1000 csv files in my working directory and each file has a location Id, rainfall and temperature. The structure of one file is shown below:
set.seed(123)
my.dat <- data.frame(Id = rep(1, each = 365),
                     rain = runif(365, min = 0, max = 20),
                     tmean = sample(20:40, 365, replace = T))
I wrote an Rcpp function that is also stored in my working directory. This function takes in rainfall and temperature data and calculates some derived variables var1 and var2. I want to read each location's weather data, apply the function, and save the corresponding output using the foreach package.
library(foreach)
library(doParallel)
library(data.table)

location.vec <- 1:1000
myClusters <- makeCluster(6)
registerDoParallel(myClusters)

foreach(i = 1:length(location.vec),
        .packages = c('Rcpp', 'dplyr', 'data.table'),
        .noexport = c('myRcppFunc'),
        .verbose = T) %dopar%
{
  Rcpp::sourceCpp('myRcppFunc.cpp')
  idRef <- location.vec[i]
  # read the weather data
  temp_weather <- fread(paste0('weather_', idRef, '.csv'))
  # apply my Rcpp function
  temp_weather[, c("var1", "var2") := myRcppFunc(rain, tmean)]
  # save my output
  fwrite(temp_weather, paste0('weather_', idRef, '_modified.csv'))
}
stopCluster(myClusters)
This loop seems to have weird behaviour. Every time I run it, it just gets stuck, sometimes on iteration 10, sometimes on 40, etc., and then I have to kill the job.
My doubt: is this driven by multiple processes trying to access the Rcpp source file at the same time? How can I fix it? Can I read in the Rcpp function in the foreach arguments so that I don't have to keep loading it? Any other advice?
Thanks
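One pattern that is sometimes suggested for this (a sketch only, untested here, reusing the myRcppFunc.cpp file and CSV naming from the question): compile the Rcpp source once on each worker before the loop with parallel::clusterEvalQ(), and keep .noexport so the compiled function (which holds external pointers that cannot be serialized) is never shipped from the master.
library(foreach)
library(doParallel)

myClusters <- makeCluster(6)
registerDoParallel(myClusters)

# Compile the Rcpp function once per worker instead of once per iteration
parallel::clusterEvalQ(myClusters, {
  library(data.table)
  Rcpp::sourceCpp('myRcppFunc.cpp')
})

res <- foreach(i = 1:1000,
               .packages = 'data.table',
               .noexport = 'myRcppFunc') %dopar% {
  # myRcppFunc now already exists in each worker's global environment
  temp_weather <- fread(paste0('weather_', i, '.csv'))
  temp_weather[, c("var1", "var2") := myRcppFunc(rain, tmean)]
  fwrite(temp_weather, paste0('weather_', i, '_modified.csv'))
  NULL
}

stopCluster(myClusters)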

Read in large text file in chunks

I'm working with limited RAM (AWS free tier EC2 server - 1GB).
I have a relatively large txt file, "vectors.txt" (800 MB), that I'm trying to read into R. Having tried various methods, I have failed to read this vector into memory.
So, I was researching ways of reading it in chunks. I know that the dimensions of the resulting data frame should be 300K x 300. If I were able to read in the file, e.g. 10K lines at a time, and save each chunk as an RDS file, I would be able to loop over the results and get what I need, albeit a little slower and with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns a "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same problem: not enough memory.
word_vectors <- readr::read_tsv_chunked("vector.txt",
                                        callback = function(x, i) saveRDS(x, i),
                                        chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving as a data frame and then to rds? Or any other alternatives?
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
                       every_nlines,
                       table_name,
                       dbname = sub("\\.txt$", ".sqlite", tsv),
                       ...) {
  # Prepare reading
  con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
  init <- TRUE
  fill_sqlite <- function(df) {
    if (init) {
      RSQLite::dbCreateTable(con, table_name, df)
      init <<- FALSE
    }
    RSQLite::dbAppendTable(con, table_name, df)
    NULL
  }
  # Read and fill by parts
  bigreadr::big_fread1(tsv, every_nlines,
                       .transform = fill_sqlite,
                       .combine = unlist,
                       ... = ...)
  # Returns
  con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on disk, e.g. using an SQLite file and dplyr's database functionality. Here's one approach: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package, which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
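Once the data is in SQLite, it can be queried lazily with DBI and dplyr so that only the rows you ask for come into memory. A minimal sketch, assuming the csv2sqlite() call above succeeded and produced vector.sqlite with a table called vectors (dbplyr needs to be installed for the tbl() translation):
library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "vector.sqlite")

vectors_tbl <- tbl(con, "vectors")  # lazy table: nothing is read yet

# dplyr verbs are translated to SQL; only the result of collect() is loaded into RAM
first_chunk <- vectors_tbl %>%
  head(10000) %>%
  collect()

DBI::dbDisconnect(con)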

slurm_apply a RefClass method from within a RefClass method

EDIT: New version of rslurm makes the solution very easy. See my answer below.
Apologies for the somewhat longer-than-desired MWE, and for a title that, I realize after submitting the question, may be needlessly complicated. I believe the real issue is getting the environment of a RefClass object into rslurm::slurm_apply.
MWE
Here I define a toy reference class called BankAccount. It has two fields and two methods.
The fields are transactions, a list of all transactions associated with the account, and suspicion_threshold, the value above which the bank will investigate a transaction.
The two methods are is_suspicious, which compares the transactions with the suspicion_threshold on the local machine, and is_suspicious_slurm, which uses rslurm::slurm_apply to spread many calls to is_suspicious over a cluster of computers managed by SLURM. You can imagine that if there were many transactions, or if the is_suspicious function were more complex, this might be necessary.
So, here's the setup
BankAccount <- setRefClass(
  Class = 'BankAccount',
  fields = list(
    transactions = 'numeric',
    suspicion_threshold = 'numeric'
  )
)
BankAccount$methods(
  is_suspicious = function(start_idx = 1, stop_idx = length(transactions)) {
    return(start_idx + which(transactions[start_idx:stop_idx] > suspicion_threshold) - 1)
  }
)
BankAccount$methods(
  is_suspicious_slurm = function(num_nodes) {
    usingMethods(is_suspicious)
    t <- length(transactions)
    t_per_n <- floor(t/num_nodes)
    starts <- seq(from = 1, length.out = num_nodes, by = t_per_n)
    stops <- seq(from = t_per_n, length.out = num_nodes, by = t_per_n)
    stops[num_nodes] <- t
    sjob <- rslurm::slurm_apply(f = is_suspicious,
                                params = data.frame(start_idx = starts,
                                                    stop_idx = stops),
                                nodes = num_nodes,
                                add_objects = .self)
    results_list <- rslurm::get_slurm_out(slr_job = sjob,
                                          outtype = "raw",
                                          wait = TRUE)
    return(unlist(results_list))
  }
)
Now, on my local machine I can run:
library(RCexampleforSE)
set.seed(27599)
b <- BankAccount$new()
b$transactions <- rnorm(n = 500)
b$suspicion_threshold <- 2
b$is_suspicious()
and it works as expected:
62 103 155 171 182 188 297 398 493 499
If I run:
b$is_suspicious_slurm(num_nodes = 3)
I get an error, since my personal computer is not connected to a SLURM cluster.
sh: squeue: command not found
Cannot submit; no SLURM workload manager on path
Submission scripts output in directory _rslurm_13ba46e3c70b0
Error in rslurm::get_slurm_out(slr_job = sjob, outtype = "raw", wait = TRUE):
slr_job has not been submitted
If I log on to my university cluster, which uses SLURM, and run the same script, the setup and local methods work just as they did on my personal computer. When I run:
b$is_suspicious_slurm(num_nodes = 3)
it sends jobs to the cluster, as hoped for:
Submitted batch job 6363868
But these jobs error immediately with the following error message in slurm_0.out, slurm_1.out, and slurm_2.out:
Error in attr(, "mayCall") : argument 1 is empty
Execution halted
Thoughts and Attempts
I figure the job probably needs, but doesn't have available, the BankAccount object. So I tried passing it in as the add_objects parameter to rslurm::slurm_apply:
sjob <- rslurm::slurm_apply(f = is_suspicious,
                            params = data.frame(start_idx = starts,
                                                stop_idx = stops),
                            nodes = num_nodes,
                            add_objects = .self)
I also tried it in quotes and inside eval(), neither of which worked.
How can I make the object accessible to the worker jobs created with rslurm::slurm_apply?
Version 0.4.0 of rslurm completely solved this problem.
Define is_suspicious_slurm() as:
BankAccount$methods(
  is_suspicious_slurm = function(num_nodes) {
    usingMethods(is_suspicious)
    t <- length(transactions)
    t_per_n <- floor(t/num_nodes)
    starts <- seq(from = 1, length.out = num_nodes, by = t_per_n)
    stops <- seq(from = t_per_n, length.out = num_nodes, by = t_per_n)
    stops[num_nodes] <- t
    sjob <- rslurm::slurm_apply(f = is_suspicious,
                                params = data.frame(start_idx = starts,
                                                    stop_idx = stops),
                                nodes = num_nodes)
    results_list <- rslurm::get_slurm_out(slr_job = sjob,
                                          outtype = "raw",
                                          wait = TRUE)
    return(unlist(results_list))
  }
)
The only change is that in the call to rslurm::slurm_apply, the add_objects parameter is not specified. It does not need to be specified because, as #Ian pointed out:
"...you don't need to pass self at all when slurm_apply sends the serialized function, which appears to include both ".self" and "transactions" in the enclosing environment."
EDIT: OP's answer is all you need to know.
The add_objects parameter is used for passing a character vector, not the objects themselves. All the objects are then saved in one RData file, assuming they can be found by name. In theory, you should be able to use add_objects = c('.self') within your method definition.
The key here is, "assuming they can be found". I will edit this post once a pending update to the rslurm package (which should make that finding more successful) is released.
Be very careful passing objects to cluster nodes: they do not come back. Not only will any side effects be lost, there's no inter-node communication implemented by rslurm.
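For rslurm versions before 0.4.0, a minimal sketch of that "in theory" suggestion inside is_suspicious_slurm() might look like this (untested; note that add_objects gets the name '.self', not the object itself):
# Hypothetical pre-0.4.0 variant: pass the object's *name* so rslurm can save
# it by name into the RData file it ships to the nodes.
sjob <- rslurm::slurm_apply(f = is_suspicious,
                            params = data.frame(start_idx = starts,
                                                stop_idx = stops),
                            nodes = num_nodes,
                            add_objects = c('.self'))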
Also be careful with which :) Your is_suspicious method will be wrong for arguments that don't start at 1. Try this version:
BankAccount$methods(
  is_suspicious = function(i = 1:length(transactions)) {
    idx <- which(transactions[i] > suspicion_threshold)
    i[idx]
  }
)

Only one processor being used while running NetLogo models using parApply

I am using the 'RNetLogo' package to run sensitivity analyses on my NetLogo model. My model has 24 parameters I need to vary - so parallelising this process would be ideal! I've been following along with the example in Thiele's "Parallel processing with the RNetLogo package" vignette, which uses the 'parallel' package in conjunction with 'RNetLogo'.
I've managed to get R to initialise the NetLogo model across all 12 of my processors, which I've verified using gui=TRUE. The problem comes when I try to run the simulation code across the 12 processors using 'parApply'. The parApply call runs without error, but it only runs on one of the processors (using around 8% of my total CPU power). Here's a mock-up of my R code file; I've included some commented-out code at the end, showing how I run the simulation without trying to parallelise:
### Load packages
library(parallel)
### Set up initialisation function
prepro <- function(dummy, gui, nl.path, model.path) {
  library(RNetLogo)
  NLStart(nl.path, gui = gui)
  NLLoadModel(model.path)
}
### Set up finalisation function
postpro <- function(x) {
  NLQuit()
}
### Set paths
# For NetLogo
nl.path <- "C:/Program Files/NetLogo 6.0/app"
nl.jarname <- "netlogo-6.0.0.jar"
# For the model
model.path <- "E:/Model.nlogo"
# For the function "sim" code
sim.path <- "E:/sim.R"
### Set base values for parameters
base.param <- c('prey-max-velocity' = 25,
                'prey-agility' = 3.5,
                'prey-acceleration' = 20,
                'prey-deceleration' = 25,
                'prey-vision-distance' = 10,
                'prey-vision-angle' = 240,
                'time-to-turn' = 5,
                'time-to-return-to-foraging' = 300,
                'time-spent-circling' = 2,
                'predator-max-velocity' = 35,
                'predator-agility' = 3.5,
                'predator-acceleration' = 20,
                'predator-deceleration' = 25,
                'predator-vision-distance' = 20,
                'predator-vision-angle' = 200,
                'time-to-give-up' = 120,
                'number-of-safe-zones' = 1,
                'number-of-target-patches' = 5,
                'proportion-obstacles' = 0.05,
                'obstacle-radius' = 2.0,
                'obstacle-radius-range' = 0.5,
                'obstacle-sensitivity-for-prey' = 0.95,
                'obstacle-sensitivity-for-predators' = 0.95,
                'safe-zone-attractiveness' = 500
)
## Get names of parameters
param.names <- names(base.param)
### Load the code of the simulation function (name: sim)
source(file=sim.path)
### Convert "base.param" to a matrix, as required by parApply
base.param <- matrix(base.param, nrow=1, ncol=24)
### Get the number of simulations we want to run
design.combinations <- length(base.param[[1]])
already.processed <- 0
### Initialise NetLogo
processors <- detectCores()
cl <- makeCluster(processors)
clusterExport(cl, 'sim')
gui <- FALSE
invisible(parLapply(cl, 1:processors, prepro, gui=gui, nl.path=nl.path, model.path=model.path))
### Run the simulation across all processors, using parApply
sim.result.base <- parApply(cl, base.param, 1, sim,
                            param.names,
                            no.repeated.sim = 100,
                            trace.progress = FALSE,
                            iter.length = design.combinations,
                            function.name = "base parameters")
### Run the simulation on a single processor
#sim.result.base <- sim(base.param,
#                       param.names,
#                       no.repeated.sim = 100,
#                       my.nl1,
#                       trace.progress = TRUE,
#                       iter.length = design.combinations,
#                       function.name = "base parameters")
Here's a mock up for the 'sim' function (adapted from Thiele's paper "Facilitating parameter estimation and sensitivity analyses of agent-based models - a cookbook using NetLogo and R"):
sim <- function(param.set, parameter.names, no.repeated.sim, trace.progress, iter.length, function.name) {
  # Some security checks
  if (length(param.set) != length(parameter.names))
  { stop("Wrong length of param.set!") }
  if (no.repeated.sim <= 0)
  { stop("Number of repetitions must be > 0!") }
  if (length(parameter.names) <= 0)
  { stop("Length of parameter.names must be > 0!") }
  # Create an empty object to save the simulation results
  eval.values <- NULL
  # Run the repeated simulations (to control stochasticity)
  for (i in 1:no.repeated.sim)
  {
    # Create a random seed for NetLogo from R, based on min/max of NetLogo's random seed
    NLCommand("random-seed", runif(1, -2147483648, 2147483647))
    ## This is the stuff for one simulation
    cal.crit <- NULL
    # Set NetLogo parameters to current parameter values
    lapply(seq_along(parameter.names), function(x) { NLCommand("set ", parameter.names[x], param.set[x]) })
    NLCommand("setup")
    # This should run "go" until prey-win =/= 5, i.e. when the pursuit ends
    NLDoCommandWhile("prey-win = 5", "go")
    # Report a value
    prey <- NLReport("prey-win")
    # Report another value
    pred <- NLReport("predator-win")
    ## Extract the values we are interested in
    cal.crit <- rbind(cal.crit, c(prey, pred))
    # append to former results
    eval.values <- rbind(eval.values, cal.crit)
  }
  ## Make sure eval.values has column names (it is a matrix built by rbind, so use colnames)
  colnames(eval.values) <- c("PreySuccess", "PredSuccess")
  # Return the mean of the repeated simulation results
  if (no.repeated.sim > 1) {
    return(colMeans(eval.values))
  }
  else {
    return(eval.values)
  }
}
I think the problem might lie in the "nl.obj" string that RNetLogo uses to identify the NetLogo instance you want to run the code on - however, I've tried several different methods of fixing this, and I haven't been able to come up with a solution that works. When I initialise NetLogo across all the processors using the code provided in Thiele's example, I don't set an "nl.obj" value for each instance, so I'm guessing RNetLogo uses some kind of default list? However, in Thiele's original code, the "sim" function requires you to specify which NetLogo instance you want to run it on - so R will spit an error when I try to run the final line (Error in checkForRemoteErrors(val) : one node produced an error: argument "nl.obj" is missing, with no default). I have modified the "sim" function code so that it doesn't require this argument and just accepts the default setting for nl.obj - but then my simulation only runs on a single processor. So, I think that by default, "sim" must only be running the code on a single instance of NetLogo. I'm not certain how to fix it.
This is also the first time I've used the 'parallel' package, so I could be missing something obvious to do with 'parApply'. Any insight would be much appreciated!
Thanks in advance!
I am still in the process of applying a similar technique to perform a Morris Elementary Effects screening with my NetLogo model. For me the parallel execution works fine. I compared your script to mine and noticed that in my version the 'parApply' call of the simulation function (simfun) is wrapped in an anonymous function (see below). Maybe adding that wrapper already solves your issue.
sim.results.morris <- parApply(cl, mo$X, 1, function(x) {
  simfun(param.set = x,
         no.repeated.sim = no.repeated.sim,
         parameter.names = input.names,
         iter.length = iter.length,
         fixed.values = fixed.values,
         model.seed = new.model.seed,
         function.name = "Morris")
})
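Untested, but adapting that wrapper to the parApply call in the question might look like the following; the argument values are taken from the question's own script:
# Same idea applied to the question's call: wrap sim in an anonymous function
# so each worker receives one parameter row as param.set.
sim.result.base <- parApply(cl, base.param, 1, function(x) {
  sim(param.set = x,
      parameter.names = param.names,
      no.repeated.sim = 100,
      trace.progress = FALSE,
      iter.length = design.combinations,
      function.name = "base parameters")
})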

Handling internet connection R

I'm trying to download several stocks from Google, but every time the connection drops, R stops the loop. How can I handle this problem?
stocks <- c(
'MSFT',
'GOOG',
...
)
for (symbol in stocks) {
  stock_price <- getSymbols(symbol, src = 'google', from = startDate, to = endDate, auto.assign = FALSE)
  prices[, j] <- stock_price[, 1]
  j <- j + 1
}
From the R manual "quantmod.pdf":
"If auto.assign=FALSE or env=NULL (as of 0.4-0) the data will be returned from the call, and will require the user to assign the results himself. Note that only one symbol at a time may be requested when auto assignment is disabled."
You are trying to request more than one ticker symbol at a time with the auto.assign parameter set to FALSE, and this is not allowed. However, you should be able to obtain all your symbols at once by adapting the following code:
data <- new.env()
getSymbols.extra(stocks, src = 'google', from = startDate, to = endDate, env = data, auto.assign = T)
plot(data$MSFT)
Pay careful attention to the R manual for getSymbols:
"Data is fetched through one of the available getSymbols methods and saved in the env specified - the .GlobalEnv by default."
