Part of my code is:
sse <-c()
k <- c()
for (i in seq(3, 15, 1)) {
y_pred <-knn(train = newdata.training, test = newdata.test,
cl = newdata.trainLabels, k=i)
pred_y <- as.numeric(levels(y_pred)[y_pred])
sse[i] <- sum((newdata.trainLabels-pred_y)^2)
k[i] <- i
}
pred_y is a vector of predictions for each i. I want to create a data frame that holds all 13 of these vectors as columns. Can this be done with a for loop? If not, how else can it be accomplished? Any suggestions are welcome.
You can use foreach, which has the added advantage that it can run in parallel if your CPU has multiple cores. Here is the non-parallel code:
library("iterators")
library("foreach")
library("FNN")
data(iris3)
newdata.training <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
newdata.test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
newdata.trainlabels <- factor(c(rep(1,25), rep(2,25), rep(3,25)))
k.values = seq(3, 15, 1)
start = 2 # to index sse array using k.values
sse = numeric(length = length(k.values))
results = foreach(i = iter(k.values),.combine = cbind) %do%
{
y_pred <-knn(train = newdata.training, test = newdata.test,
cl = newdata.trainlabels, k=i, prob = TRUE)
pred_y <- as.numeric(levels(y_pred)[y_pred])
sse[i - start] <- sum((as.numeric(newdata.trainlabels)-pred_y)^2)
pred_y
}
results1 = data.frame(results)
colnames(results1) = k.values
Here is the parallel version:
# Parallel version
library("iterators")
library("foreach")
library("parallel")
library("doParallel")
library("FNN")
data(iris3)
newdata.training <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
newdata.test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
newdata.trainlabels <- factor(c(rep(1,25), rep(2,25), rep(3,25)))
num.cores = detectCores()
clusters <- makeCluster(num.cores)
registerDoParallel(clusters)
k.values = seq(3, 15, 1)
start = 2 # to index sse array using k.values
sse = numeric(length = length(k.values))
results = foreach(i = iter(k.values),.combine = cbind, .packages=c("FNN")) %dopar%
{
y_pred <-knn(train = newdata.training, test = newdata.test,
cl = newdata.trainlabels, k=i, prob = TRUE)
pred_y <- as.numeric(levels(y_pred)[y_pred])
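# Note: under %dopar% this assignment changes only the worker's copy of sse;
# the sse vector in the main session is left unchanged.
sse[i - start] <- sum((as.numeric(newdata.trainlabels)-pred_y)^2)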
pred_y
}
results1 = data.frame(results)
colnames(results1) = k.values
stopCluster(clusters)
There are only a few differences between the non-parallel code and the parallel code. First, there are additional libraries to load. Second, you need to create and register the cluster of workers that will do the parallel computation (and stop the cluster when you are done). Third, foreach uses the %dopar% infix operator instead of %do%. Fourth, the foreach function needs the .packages parameter so that the FNN package (which provides knn) is loaded on each worker.
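The original question also asked whether a plain for loop would work. A minimal sketch of that approach follows, assuming a hypothetical `predict_for_k()` stand-in for the `knn()` call (it is not part of any package) and made-up test labels. The idea is to preallocate a matrix with one column per value of k and fill it inside the loop; `sse` is indexed by position rather than by k, so no leading NA entries appear:

```r
# Hypothetical stand-in for knn(): returns one numeric prediction per test row.
predict_for_k <- function(k, n_test = 6) {
  rep(c(1, 2, 3), length.out = n_test) + (k %% 2)
}

true_labels <- c(1, 2, 3, 1, 2, 3)   # illustrative test labels (assumption)
k.values <- seq(3, 15, 1)

# One column per value of k, one row per test observation.
preds <- matrix(NA_real_, nrow = length(true_labels), ncol = length(k.values))
sse <- numeric(length(k.values))

for (idx in seq_along(k.values)) {
  pred_y <- predict_for_k(k.values[idx], n_test = length(true_labels))
  preds[, idx] <- pred_y                       # store predictions as a column
  sse[idx] <- sum((true_labels - pred_y)^2)    # error for this value of k
}

results <- as.data.frame(preds)
colnames(results) <- paste0("k", k.values)
```

Because the loop runs in the main session, both `results` and `sse` are filled in directly, which is not the case for assignments made inside a %dopar% body.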
This is my first question here, so I hope I'm doing it right.
I'm trying to run a variant of random forest called Geographical Random Forest (package SpatialML). To train the models, I'm running a foreach loop in parallel and sampling the calibration data with replacement.
library(SpatialML)
library(doParallel)
rm(list=ls())
ds <- SpatialML::random.test.data()
ds
# Parallel settings
ncores <- detectCores() - 1
cl <- makePSOCKcluster(ncores)
y <- names(ds)[1]
x <- paste(names(ds)[c(2,3)], collapse = "+")
f <- as.formula(paste0(y,"~",x));f
clusterEvalQ(cl , expr = c(library(SpatialML)))
clusterExport(cl, list("ds"))
#### Bootstraps ####
registerDoParallel(cl)
time <- system.time(foreach (i = 1:10) %dopar% {
# sample with replacement
trainingREP <- sample.int(nrow(ds), 1*nrow(ds), replace = T)
# Geographical Random Forest
grf.boot <- grf(formula = f, dframe = ds[trainingREP, ],
bw = round(nrow(ds)/10, digits = 0), kernel = "adaptive",
coords = ds[trainingREP, c(4,5)], ntree = 500, importance = T)
# Save GRF
modelFile <- paste("./bootModGRF_",i,".rds",sep="")
saveRDS( object = grf.boot, file = modelFile)
})
stopCluster(cl)
But when I run this code I get:
Error in { : task 1 failed - "object 'trainingREP' not found
Why can't the foreach loop read an object that is created within the same loop?
I am writing an R script to carry out a Johansen test (Jtest) for pair trading. I first declared a function to find out the correlation between two single stocks, then added a foreach loop to do the task for a list of stocks. However, the foreach loop did not recognize the first function.
I have tried to use doSNOW, as suggested by information on the Internet, but it did not work.
pkgs <- list("quantmod", "doParallel", "foreach", "urca")
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)
jtest <- function(t1, t2) {
start <- sd
getSymbols(t1, from = start)
getSymbols(t2, from = start)
j <- summary(ca.jo(cbind(get(t1)[, 6], get(t2)[, 6])))
r <- data.frame(stock1 = t1, stock2 = t2, stat = j@teststat[2])
r[, c("pct10", "pct5", "pct1")] <- j@cval[2, ]
return(r)
}
pair <- function(lst) {
d2 <- data.frame(t(combn(lst, 2)))
stat <- foreach(i = 1:nrow(d2), .combine = rbind) %dopar% jtest(as.character(d2[i, 1]), as.character(d2[i, 2]))
stat <- stat[order(-stat$stat), ]
rownames(stat) <- NULL
return(stat)
}
sd <- "2018-01-01"
tickers <- c("FITB", "BBT", "MTB", "STI", "PNC", "HBAN", "CMA", "USB", "KEY", "JPM", "C", "BAC", "WFC")
pair(tickers)
Error in jtest(as.character(d2[i, 1]), as.character(d2[i, 2])) :
task 1 failed - "could not find function "jtest""
I had the same problem until I specified the necessary function in the foreach call. In my case the function is supposed to generate lags of a time series variable.
This version does not work:
Ylag = foreach(i = 1:maxlagsY,.combine = 'cbind') %dopar%{mylag(Y,k = i)}
While this one does:
Ylag = foreach(i = 1:maxlagsY,.export = "mylag",.combine = 'cbind') %dopar%{mylag(Y,k = i)}
So, the answer is in specifying the user-defined functions in the foreach call.
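A minimal, self-contained sketch of the working pattern, assuming a hypothetical `mylag()` helper (the answer does not show its definition), an illustrative series `Y`, and a two-worker cluster:

```r
library(foreach)
library(doParallel)

# Hypothetical helper: shift a vector down by k positions, padding with NA.
mylag <- function(y, k) c(rep(NA, k), head(y, -k))

Y <- 1:10          # illustrative series (assumption)
maxlagsY <- 3

cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# .export names the user-defined function that the workers need.
Ylag <- foreach(i = 1:maxlagsY, .export = "mylag", .combine = "cbind") %dopar% {
  mylag(Y, k = i)
}

parallel::stopCluster(cl)
```

Each worker is a separate R session, so anything it calls has to be sent to it. If foreach already detects `mylag` on its own, it may warn that the variable is being exported twice; the result is the same either way.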
I'm new to parallelizing for loops with foreach and am struggling to understand how it works. As an exercise, I created a simple list (input2) based on a data frame (input). I am trying to calculate b by looping through h and j.
library(doParallel)
library(foreach)
library(dplyr)
input <- data.frame(matrix(rnorm(200*200, 0, .5), ncol=200))
input[input <=0] =0
input['X201'] <- seq(from = 0, to = 20, length.out = 10)
input <- input %>% select(c(X201, 1:200))
input2 <- split(input, f= input$X201)
a = 0
b= 0
cl <- parallel::makeCluster(20)
doParallel::registerDoParallel(cl)
tm1 <- system.time(
y <-
foreach (h = length(input2),.combine = 'cbind') %:%
foreach (j = nrow(input2[[h]]),.combine = 'c',packages = 'foreach') %dopar%{
a = input2[[h]][j,3]
b = b + a
}
)
parallel::stopCluster(cl)
registerDoSEQ()
print("Cluster stopped.")
y is about 0.55 (the exact value depends on the random numbers generated), which is the value of input2[[10]][20,3], not the cumulative value I wanted. I checked the manual of the foreach package, but I am still not sure I fully understand how the foreach function works.
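foreach returns results rather than allowing outside variables to be changed, so don't expect a and b to be updated correctly. Note also that h = length(input2) and j = nrow(input2[[h]]) each supply a single value, so the loop body runs only once (for h = 10 and j = 20), which is why y equals input2[[10]][20,3].
Try the following: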
cl <- parallel::makeCluster(20)
doParallel::registerDoParallel(cl)
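# Wrap both statements in braces: a second positional argument to
# system.time() would be taken as gcFirst, not as code to time.
tm2 <- system.time({
results <- foreach(h = seq_along(input2), .combine = "c") %dopar% {
sum(input2[[h]][, 3])  # sum of column 3 within group h
}
b <- sum(results)
})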
parallel::stopCluster(cl)
registerDoSEQ()
b
tm2
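The point that foreach hands back values instead of updating outer variables can be checked with a small sketch. It assumes a two-worker cluster and toy inputs; the names `b` and `total` are illustrative only:

```r
library(foreach)
library(doParallel)

cl <- parallel::makeCluster(2)
registerDoParallel(cl)

b <- 0
total <- foreach(x = 1:5, .combine = "+") %dopar% {
  b <- b + x   # changes only the worker's private copy of b
  x            # the value of the last expression is what foreach returns
}

parallel::stopCluster(cl)

b      # still 0 in the main session
total  # 15, the returned values combined with "+"
```

The accumulation therefore has to happen through `.combine` (or on the returned vector afterwards), not through assignment inside the loop body.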
I wonder whether I can use parallel computing in JAGS the way I want.
Here is my R script.
library(foreach)
list.data2 <- foreach(i=1:n.rep) %do% {
foreach(j=1:2) %do% {list( cap = cap_data[[i]][[j]],
loc = loc_data[[i]][[j]],
eff = eff_data[[i]][[j]],
trap.numb = trap.numb2,
av = av,
forest = env$forest,
crop = env$crop,
bamboo = env$bamboo,
grass = env$grass,
abandoned = env$abandoned,
city = env$city,
rate = env$for_cr_rate,
m.numb = m.numb,
ones = matrix( 1, m.numb, 5 )
) #,bound_mat=bound_mat,bound_numb=bound_numb
}
}
inits2 <- foreach(j=1:2) %do% {list( n=n.inits2[[j]],
b0=0.5, b1=0.1, b2=0.1, b3=0.1, b4=0.1, b5=0.1, b6=0.1,
a0=5, a1=0.5, a2=0.5, a3=0.5, a4=0.5, a5=0.5, a6=0.5,
sd=1,
err=rep(0,m.numb),
r_capt=0.10
)
}
para2 <- c("a0","a1","a2","a3","a4", "a5","a6",
"b0","b1","b2","b3","b4", "b5","b6", "n28", "n29", "r_capt")
library(R2jags)
start.time <- Sys.time()
install.packages("doParallel")
library(doParallel)
registerDoParallel(cores=6)
x_real2 <- foreach( i = 1:2,
.packages = "R2jags"
) %dopar% {jags( "realdata_5years.txt",
data = list.data2[[i]][[?]],
inits = inits2[[i]],
para = para2,
n.chain = 3,
n.iter = n.1000000,
n.burnin = 400000,
n.thin = 200
)
}
sum_real2 <- foreach(i = 1:2) %do% {x_real2[[i]]$BUGSoutput$summary}
---------------------------------------------------------------------
So, I have two data sets, and each has 30 (== n.rep) repetitions.
Therefore I have 60 data lists in total.
I would like to use six cores: one for each combination of the 2 data sets and 3 MCMC chains.
Moreover, I need to repeat this calculation 30 (== n.rep) times.
However, I have no idea how to write this. I have problems in the last 4 lines.
Should I use %dopar% twice?
Or should I use jags.parallel in addition to foreach?
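One possible way to organise the 30 x 2 runs, sketched here under assumptions, is to flatten the (i, j) pairs into a single table of 60 tasks and loop over its rows, so that only one level of the loop needs to be parallelised. `run_model()` below is a hypothetical placeholder for the `jags()` call, and `n.rep` is set to 30 as stated in the question:

```r
n.rep <- 30

# Every (repetition, data set) combination as one row: 30 * 2 = 60 tasks.
tasks <- expand.grid(i = seq_len(n.rep), j = 1:2)

# Hypothetical placeholder for the real call, e.g.
# jags(data = list.data2[[i]][[j]], inits = inits2[[j]], ...).
run_model <- function(i, j) {
  list(rep = i, dataset = j)
}

# A single loop over the flattened task table; this is the loop that could be
# handed to foreach(...) %dopar% instead of nesting two parallel loops.
fits <- lapply(seq_len(nrow(tasks)), function(t) {
  run_model(tasks$i[t], tasks$j[t])
})
```

Whether the three chains then run sequentially inside each task or are parallelised separately is a design choice this sketch leaves open.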
Hi, I am trying to understand how to get DEoptim to work with parallel processing, but I am struggling to pass the correct parameters to the function. Below is a reproducible example (it has a financial context) that creates a random portfolio of 7 assets to optimise for expected shortfall (ES). It was inspired by http://mpra.ub.uni-muenchen.de/28187/1/RJwrapper.pdf and also http://files.meetup.com/1772780/20120201_Ulrich_Parallel_DEoptim.pdf
The second of these does include a parallel option, but I want to use Unix forking rather than SOCK clusters.
require(quantmod)
require(PerformanceAnalytics)
require(DEoptim)
tickers <- c("^GSPC","^IXIC","^TNX", "DIA","USO","GLD","SLV","UNG","^VIX","F","^FTSE","GS","MS","MSFT","MCD","COKE","AAPL","GOOG","T","C","BHP","RIO","CMG")
getSymbols(tickers)
tickers <- gsub("\\^","",tickers)
x <- lapply(tickers, function(x){ClCl(get(x))})
comb <- na.omit(do.call(merge,x))
colnames(comb) <- paste0(tickers,".cc")
obj <- function(w) {
if (sum(w) == 0) { w <- w + 1e-2 }
w <- w / sum(w)
CVaR <- ES(weights = w,
method = "gaussian",
portfolio_method = "component",
mu = mu,
sigma = sigma)
tmp1 <- CVaR$ES
tmp2 <- max(CVaR$pct_contrib_ES - 0.225, 0)
out <- tmp1 + 1e3 * tmp2
}
comb1 <- comb[,sample(1:ncol(comb),7)]
no.of.assets <- ncol(comb1)
mu <- colMeans(comb1)
sigma <- cov(comb1)
## The non-parallel version
output <- DEoptim(fn = obj,lower = rep(0, no.of.assets), upper = rep(1, no.of.assets))
## The parallel version that doesn't seem to work...
output <- DEoptim(fn = obj,lower = rep(0, no.of.assets), upper = rep(1, no.of.assets), DEoptim.control(itermax=10000, trace=250, parallelType="parallel", packages=c("PerformanceAnalytics"), parVar=c("mu","sigma")))
I get the following error message
Error in missing(packages) : 'missing' can only be used for arguments