I'm trying to compute the daily returns from the prices of each stock I have. The data is cross-sectional and very large, so I use doParallel and nested foreach loops.
Here is the code I've been using so far (this is a reproducible example):
stock_name <- as.data.frame(sample(x = 1:100, size = 250, replace=TRUE))
price <- as.data.frame(sample(x = 1:100, size = 1000,replace=TRUE))
## Calculating daily returns.
stock_list<-as.tbl(distinct(stock_name))
numStock<- as.integer(count(stock_list)) #150 #as.integer(count(stock_list))
nCPUcores = detectCores()
if (nCPUcores < 3) {
  registerDoSEQ()
} else {
  cl = makeCluster(nCPUcores - 1)
  registerDoParallel(cl)
}
d_ret <- c()
foreach (stock=1:numStock, .packages = c("doParallel","foreach","data.table","plyr","dplyr")) %dopar% {
  s <- as.integer(unlist(stock_list[stock,]))
  stock_price <- as.matrix(price[which(stock_name[1,]==s),])
  u <- nrow(stock_price)
  d_ret <- foreach (p=2:u) %:% {
    c(d_ret,(stock_price[p,]-stock_price[p-1,])/stock_price[p-1,])
  }
}
stopCluster(cl)
##--
But the code doesn't work. After Florian Prive's remark, I checked the library documentation, and it seems that nested foreach loops should be written like this:
x <- foreach(b=bvec, .combine='cbind') %:%
  foreach(a=avec, .combine='c') %dopar% {
    sim(a, b)
  }
So, as I understand it, nothing should be written between %:% and the second foreach.
However, in my case the range of the second loop changes with each iteration of the first, because the stocks don't all have the same number of prices. Therefore I can't just write 'foreach(a=avec)'.
The second foreach would ideally depend on the variable u:
u<-nrow(stock_price)
Is this even possible with the foreach library?
Thank you for the help
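One way to sidestep the varying inner range is to loop over stocks with a single foreach and compute each stock's returns inside the body. This is only a sketch: it assumes the returns of one stock can be computed with a vectorised diff(), and the toy data frame prices with its columns stock and price is made up for illustration. %do% is used so the sketch runs without a cluster; after registering a backend it could be swapped for %dopar%.

```r
library(foreach)

set.seed(1)
# Hypothetical toy data: one row per observation; stocks end up with
# different numbers of prices, as in the question.
prices <- data.frame(
  stock = sample(1:5, size = 40, replace = TRUE),
  price = sample(1:100, size = 40, replace = TRUE)
)

stock_ids <- unique(prices$stock)

# One task per stock. The per-stock work is vectorised, so the length of
# the price series never has to appear in a second foreach().
d_ret <- foreach(s = stock_ids) %do% {
  p <- prices$price[prices$stock == s]
  diff(p) / head(p, -1)   # (p[t] - p[t-1]) / p[t-1]
}
names(d_ret) <- stock_ids
```

Each element of d_ret is then the return series of one stock, one element shorter than its price series.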
Related
I have been trying to get some output displayed from a foreach loop in R. A reproducible example is:
cl <- makeCluster(2)
registerDoParallel(cl)
ptm1 <- proc.time()
foreach (i = 1:50, .packages = c("MASS"), .combine='+') %dopar% {
  ginv(matrix(rexp(1000000, rate=.001), ncol=1000))
  if (i >49){
    cat("Time taken", proc.time() - ptm1)
  }
}
I expect the time taken to be displayed, but nothing is displayed. Can you please suggest ways of capturing the messages in the foreach loop and displaying them at the end of the loop?
I'm not sure if there's a way to output to the screen, but you can easily output to a log file using the sink function, like so:
ptm1 <- proc.time()
foreach (i = 1:50, .packages = c("MASS"), .combine='+') %dopar% {
  ginv(matrix(rexp(1000000, rate=.001), ncol=1000))
  if (i >49){
    sink("Report.txt", append=TRUE) #open sink file and add output
    cat("Time taken", proc.time() - ptm1)
    sink() #close the sink again so later output is not redirected
  }
}
EDIT: As @Roland points out, this can be dangerous if you want to capture output from every iteration and not just the final one, because you don't want the workers to clobber each other. He links to a better alternative for this scenario in his comment.
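Two other options, sketched here rather than taken from the answers above. First, makeCluster() accepts outfile = "", which leaves the workers' output unredirected, so cat() output can show up in the terminal that started R (it may not appear in GUI consoles). Second, each iteration can return its message as an ordinary result, which the master prints after the loop. The sketch uses Sys.time() rather than proc.time(), because the elapsed value of proc.time() is measured from the start of each process and so is not comparable between master and workers. Sys.sleep() stands in for the real work, and the names started and res are made up.

```r
library(doParallel)

# outfile = "" asks the workers not to redirect stdout/stderr.
cl <- makeCluster(2, outfile = "")
registerDoParallel(cl)

started <- Sys.time()

# Return the message as data instead of printing it on the worker.
res <- foreach(i = 1:4, .combine = c) %dopar% {
  Sys.sleep(0.1)  # stand-in for the real computation
  secs <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  sprintf("iteration %d finished after %.2f s", i, secs)
}

stopCluster(cl)

# Printed on the master, in iteration order.
cat(res, sep = "\n")
```

Because the messages come back as results, the workers never write to a shared file and cannot clobber each other.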
The following is a parallel loop I am trying to run in R:
cl <- makeCluster(30,type="SOCK")
registerDoSNOW(cl)
results <- foreach (i = 1:30, .combine='bindlist', .multicombine=TRUE) %dopar% {
  test <- i
  test <- as.list(test)
  list(test)
}
stopCluster(cl)
The output of my code is always a list, and I want to combine these lists into one large list. Thus I wrote the following .combine function:
bindlist <- function(x,y,...){
append(list(x),list(y),list(...))
}
Since I am doing multiple runs and the number of variables changes, I tried to use '...'. However, it does not work. How can I rewrite the .combine function so that it works with a changing number of variables?
Have you considered using 'c'?
results <- foreach (i = 1:4, .combine='c', .multicombine=TRUE) %dopar% {
  test <- i
  test <- as.list(test)
  list(test)
}
If this adds an unwanted extra level to your results, you could use 'unlist' to remove that level:
unlist(results, recursive = FALSE)
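A small sketch of what this does to the shape of the result, using %do% so it runs without a cluster (the name flat is made up). As a side note on the question's bindlist: the third argument of append() is the position 'after', not more values, which is probably why passing list(...) there does not work.

```r
library(foreach)

# Each iteration returns a list wrapping a one-element list, as in the question.
results <- foreach(i = 1:4, .combine = 'c', .multicombine = TRUE) %do% {
  list(as.list(i))
}

# 'c' concatenates the outer lists, leaving one inner list per iteration.
str(results)

# Dropping one level gives a flat list of the values themselves.
flat <- unlist(results, recursive = FALSE)
```

Because c() accepts any number of arguments, it copes with a changing number of results without a custom combine function.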
This works normally on my computer:
registerDoSNOW(makeCluster(2, type = "SOCK"))
foreach(i = 1:M,.combine = "c") %dopar% {
sum(rnorm(M))
}
So I can say that I can run parallelized code on this computer, right?
Ok. I have a piece of code that I wish to run in parallel with foreach. It runs perfectly when it's written with %do%, but doesn't work properly when I change it to %dopar%. (PS: I have already initialized the cluster with registerDoSNOW(makeCluster(2, type = "SOCK")) in the same way as before.)
My main interest in the code is getting the vector u.varpred. I get it nicely with %do%, but when I run it with %dopar%, the vector comes back as NULL.
Here is the loop with the code that's needed to run it all properly. It uses functions in the geoR package.
#you can pretty much ignore all this, it's just preparation for the loop
N=20
NN=10
set.seed(111);
datap <- grf(N, cov.pars=c(20, 5),nug=1)
grid.o <- expand.grid(seq(0, 1, l=100), seq(0, 1, l=100))
grid.c <- expand.grid(seq(0, 1, l=NN), seq(0,1, l=NN))
beta1=mean(datap$data)
emv<- likfit(datap, ini=c(10,0.4), nug=1)
krieging <- krige.conv(datap, loc=grid.o,
krige=krige.control(type.krige="SK", trend.d="cte",
beta =beta1, cov.pars=emv$cov.pars))
names(grid.c) = names(as.data.frame(datap$coords))
list.geodatas<-list()
valores<-c(datap$data,0)
list.dataframes<-list()
list.krigings<-list(); i=0; u.varpred=NULL;
#here is the foreach code
t<-proc.time()
foreach(i=1:length(grid.c[,1]), .packages='geoR') %do% {
  list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)))
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc=grid.o,
                                   krige=krige.control(type.krige="SK", trend.d="cte",
                                                       beta=beta1, cov.pars=emv$cov.pars));
  u.varpred[i] <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
  list.dataframes[[i]] <- 0 # I don't need these objects anymore, but since they
  # are lists I don't want to assign NULL, as that would ruin their ordering
  list.krigings[[i]] <- 0
  list.geodatas[[i]] <- 0
}
t<-proc.time()-t
t
You can check that this runs nicely (provided you have the following packages: geoR, foreach and doSNOW). But once I use registerDoSNOW(......) and %dopar%, u.varpred comes back as NULL.
Could you please check whether I made a mistake in the foreach statement, or whether this code simply can't be parallelized? (I thought it could, because no iteration depends on any of the iterations before it.)
I am sorry that both the code and this question are so long. Thanks in advance for taking the time to read it.
My friend helped me directly. Here is a version that works:
u.varpred <- foreach(i = 1:length(grid.c[,1]), .packages = 'geoR', .combine = "c") %dopar% {
  list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)));
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
                                   krige = krige.control(type.krige = "SK", trend.d = "cte",
                                                         beta = beta1, cov.pars = emv$cov.pars));
  u.varpred <- mean(krieging$krige.var - list.krigings[[i]]$krige.var);
  list.dataframes[[i]] <- 0;
  list.krigings[[i]] <- 0;
  list.geodatas[[i]] <- 0;
  u.varpred #this makes the results go into u.varpred
}
He gave me an example of why this works:
a <- NULL
foreach(i = 1:10) %dopar% {
  a <- 5
}
print(a)
# a is still NULL
a <- NULL
a <- foreach(i = 1:10) %dopar% {
  a <- 5
  a
}
print(a)
# now it works
Hope this helps anyone.
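The same distinction can be sketched with a vector, assuming a two-worker cluster from doParallel (the names squares, side_effect and returned are made up). Each worker gets its own copy of the variables the loop body refers to, so assignments made on a worker never reach the master; only the value of the last expression of each iteration is sent back.

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

squares <- numeric(5)

# Side effect only: every worker modifies its own copy of `squares`.
invisible(foreach(i = 1:5) %dopar% {
  squares[i] <- i^2
})
side_effect <- squares  # unchanged on the master: still all zeros

# Return value: foreach collects the last expression of each iteration.
returned <- foreach(i = 1:5, .combine = c) %dopar% {
  i^2
}

stopCluster(cl)

print(side_effect)
print(returned)
```

With %do% the body is evaluated in the calling environment, which is why the side-effect version appears to work until the loop is switched to %dopar%.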
I am trying to use foreach to run different classifiers on my data, but it doesn't work. In fact, it doesn't return anything.
My purpose is to parallelize my process. Here is a simplified version of my code:
library(foreach)
library(doParallel)
no_cores <- detectCores() - 1
cl<-makeCluster(no_cores)
registerDoParallel(cl)
registerDoParallel(no_cores)
model_list<-foreach(i = 1:2,
.combine = c,.packages=c("e1071","randomeForest")) %dopar%
if (i==1){
model1<-svm(x = X,y = as.factor(Y),type = "C-classification",probability = T)
}
if (i==2){
mode2<-randomForest(x = X,y = as.factor(Y), ntree=100, norm.votes=FALSE,importance = T)
}
Is my way of parallelizing correct overall?
Thanks indeed.
The main problem is that you're not enclosing the body of the foreach loop in curly braces. Because %dopar% is a binary operator, you have to be careful about precedence, which is why I recommend always using curly braces.
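To illustrate the precedence point with a toy loop (a sketch that uses %do% so it runs without a backend; the variable names are made up): operators of the form %op% bind more tightly than +, so without braces only the first term belongs to the loop body.

```r
library(foreach)

# Parsed as (foreach(...) %do% i) + 1: the loop sums 1, 2, 3 and
# 1 is then added once to the combined result.
no_braces <- foreach(i = 1:3, .combine = '+') %do% i + 1

# With braces the whole expression is the loop body: 2 + 3 + 4.
with_braces <- foreach(i = 1:3, .combine = '+') %do% {
  i + 1
}

print(c(no_braces = no_braces, with_braces = with_braces))
```

In the question's code the effect is similar: the first if statement is the whole loop body, and the second if is a separate statement that runs once after the loop.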
Also, you shouldn't use c as the combine function. Since svm and randomForest return objects, the default behavior of returning the results in a list is appropriate. Combining them with c will give you a garbage result.
Finally, it doesn't make sense to call registerDoParallel twice. It doesn't hurt, but it makes your code confusing.
I suggest:
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(no_cores)
model_list <- foreach(i = 1:2,
                      .packages = c("e1071", "randomForest")) %dopar% {
  if (i == 1) {
    svm(x = X, y = as.factor(Y), type = "C-classification",
        probability = TRUE)
  } else {
    randomForest(x = X, y = as.factor(Y), ntree = 100, norm.votes = FALSE,
                 importance = TRUE)
  }
}
I also removed the two unnecessary variable assignments to model1 and model2. Those variables won't be defined correctly on the master, and it obscures how the foreach loop really works.