How to parallelize a nested for loop involving rasters? - R

I am working with raster data and trying to crop and mask a variety of buffers to each raster in a raster stack for various locations. The result is a list of a list of rasters. I got the code to work for a small subset of the data, but now I am trying it over the whole dataset, and it is working very slowly. See example code:
# Example data ------------------------------------------------------------
library(raster)
library(sf)
#create example raster stack
r1 <- raster(nrows=1000, ncol=1000, xmn=60, xmx=90, ymn=0, ymx=25)
rr <- lapply(1:100, function(i) setValues(r1, runif(ncell(r1))))
rrstack <- stack()
for (i in 1:length(rr)) {
  stacknext <- rr[[i]]
  rrstack <- stack(rrstack, stacknext)
}
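(As an aside, raster::stack() also accepts a list of Raster layers, so the stacking loop above can be collapsed into one call:)
rrstack <- stack(rr)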
#create example shapefile list
lats <- runif(26, min=0, max=25)
lons <- runif(26, min=60, max=90)
exnames <- paste0("city_", letters)
coords <- data.frame(names=exnames, lats=lats, lons=lons)
coords_sf <- st_as_sf(coords, coords=c("lons","lats"), crs=4326, dim="XY")
circle1 <- st_buffer(coords_sf, 1E3)
circle100 <- st_buffer(coords_sf, 1E5)
circle500 <- st_buffer(coords_sf, 5E5)
circlist <- list(circle1=circle1, circle100=circle100, circle500=circle500)
circlist_reproj <- lapply(circlist, function(x) st_transform(x, crs(rrstack[[1]])))
start <- proc.time()
citlist <- vector(mode='list', length=nrow(circlist_reproj[[1]]))
dellist <- vector(mode='list', length=length(circlist_reproj))
mystack <- stack()
for (k in 1:nrow(circlist_reproj[[1]])) {
  for (j in 1:length(circlist_reproj)) {
    for (i in 1:nlayers(rrstack)) {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      maskraster <- raster::crop(maskraster, circlist_reproj[[j]][k,])
      mystack <- stack(mystack, maskraster)
    }
    dellist[[j]] <- mystack
    mystack <- stack()
  }
  citlist[[k]] <- dellist
  dellist <- vector(mode='list', length=length(circlist_reproj))
}
basetime <- proc.time() - start
#time taken for computation
basetime
   user  system elapsed
940.173  84.366 1029.688
As you can see, even for a dataset smaller than the one I actually have, the computation takes a while. I wanted to parallelize the processing but am having trouble figuring out how. I have two issues right now. First, because of the nature of the nested for loop, I am not sure which loop I should change to foreach. According to this post, it looks like it should be the outermost one, though I am not sure that holds for all nested for loops. When I make the first for loop a foreach, I get the error
Error in { : task 1 failed - "could not find function "nlayers""
I then try to add the .packages argument in the foreach call, resulting in a nested loop that looks like:
foreach(k = 1:nrow(circlist_reproj[[1]], .packages='raster')) %dopar% {
  for (j in 1:length(circlist_reproj)) {
    for (i in 1:nlayers(rrstack)) {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      maskraster <- raster::crop(maskraster, circlist_reproj[[j]][k,])
      mystack <- stack(mystack, maskraster)
    }
    dellist[[j]] <- mystack
    mystack <- stack()
  }
  citlist[[k]] <- dellist
  dellist <- vector(mode='list', length=length(circlist_reproj))
}
Which then gives the error
unused argument (.packages = "raster")
So I am not sure how to properly apply the .packages argument to the foreach function. What am I doing wrong here?
EDIT
Taking #HenrikB's comment, I have looked at my code and reworked it (my closing parenthesis was misplaced, so .packages was being passed to nrow() instead of foreach()). I now have the following foreach loops. The code now completes, but it returns all NULL values.
library(doParallel) # also attaches foreach and parallel
cores <- detectCores()
cl <- makeCluster(cores[1] - 2) # leave some cores free so as not to overload the computer
registerDoParallel(cl)
start <- proc.time()
citlist <- vector(mode='list', length=nrow(circlist_reproj[[1]]))
dellist <- vector(mode='list', length=length(circlist_reproj))
mystack <- stack()
foreach(k = 1:nrow(circlist_reproj[[1]])) %:%
  foreach(j = 1:length(circlist_reproj)) %:%
    foreach(i = 1:nlayers(rrstack), .packages=c('raster','sf')) %dopar% {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      maskraster <- raster::crop(maskraster, circlist_reproj[[j]][k,])
      mystack <- stack(mystack, maskraster)
      dellist[[j]] <- mystack
      mystack <- stack()
      citlist[[k]] <- dellist
      dellist <- vector(mode='list', length=length(circlist_reproj))
    }
partime <- proc.time() - start
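(The likely reason for the NULLs, going by how foreach works: assignments made inside the %dopar% block happen on the workers and are discarded; the nested foreach actually returns the masked rasters as a nested list, which the code above never captures. A sketch of capturing that return value, untested on the full data:)
citlist <- foreach(k = 1:nrow(circlist_reproj[[1]])) %:%
  foreach(j = 1:length(circlist_reproj)) %:%
    foreach(i = 1:nlayers(rrstack), .packages=c('raster','sf')) %dopar% {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      raster::crop(maskraster, circlist_reproj[[j]][k,])
    }
# each innermost element is now a list of cropped layers; stack them afterwards
citlist <- lapply(citlist, function(dellist) lapply(dellist, raster::stack))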

After taking #Henrik's comments and reworking my code a bit, I was able to come up with a solution that solves the problem via parallelization; however, it is slower than the sequential version, but that is for another post. Here is the solution:
cores <- detectCores()
cl <- makeCluster(cores[1] - 2) # leave some cores free so as not to overload the computer
registerDoParallel(cl)
citlist <- vector(mode='list', length=nrow(circlist_reproj[[1]]))
dellist <- vector(mode='list', length=length(circlist_reproj))
for (k in 1:nrow(circlist_reproj[[1]])) {
  for (j in 1:length(circlist_reproj)) {
    parrasterstack <- foreach(i = 1:nlayers(rrstack), .packages=c('raster','sf')) %dopar% {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      raster::crop(maskraster, circlist_reproj[[j]][k,])
    }
    parrasterstack <- stack(parrasterstack)
    dellist[[j]] <- parrasterstack
    parrasterstack <- NULL
  }
  citlist[[k]] <- dellist
  dellist <- vector(mode='list', length=length(circlist_reproj))
}
stopCluster(cl)
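For what it's worth, the per-iteration tasks above are small (one mask/crop each), so scheduling overhead can dominate. A sketch of an alternative, untested on the full data: parallelize the outer loop over locations instead, so each task does an entire location's worth of masking and cropping, and let foreach return the nested list rather than assigning into it.
cl <- makeCluster(detectCores() - 2)
registerDoParallel(cl)
citlist <- foreach(k = 1:nrow(circlist_reproj[[1]]), .packages=c('raster','sf')) %dopar% {
  # one task per location: build the list over buffer sizes here
  lapply(seq_along(circlist_reproj), function(j) {
    layers <- lapply(1:nlayers(rrstack), function(i) {
      m <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      raster::crop(m, circlist_reproj[[j]][k,])
    })
    raster::stack(layers)
  })
}
stopCluster(cl)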

Related

Read many files in parallel and extract data

I have 1000 JSON files and I would like to read them in parallel. I have 4 CPU cores.
I have a character vector containing the names of all the files:
cik_files <- list.files("./data/", pattern = ".json")
Using this vector, I load each file, extract the data, and add it to the following list:
data <- list()
Below is the code for extracting the data:
library(jsonlite)
for (i in 1:1000) {
  data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
    try({
      if (nrow(data1) > 0) {
        data1$cik <- strtrim(cik_files[i], 13)
        data[[length(data) + 1]] <- data1
      }
    }, silent = TRUE)
  }
}
This, however, takes quite a lot of time, so I was wondering how I can run the code inside the for loop in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files){
  filename <- file.path(path, cik_files[i])
  data1 <- fromJSON(filename, flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    found <- grep("CY20[0-9]{2}$", data1$frame)
    if (length(found) > 0) {
      tryCatch({
        out <- data1[found, c(3, 9)]
        out$cik <- strtrim(cik_files[i], 13)
        out
      },
      error = function(e) e,
      warning = function(w) w)
    } else NULL
  } else NULL
}
Step 2
Now load the package parallel and run one of the following, depending on the OS.
library(parallel)
library(jsonlite)
# Not on Windows
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files)
# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files") # not strictly needed here, since cik_files is also passed as an argument
clusterEvalQ(cl, library(jsonlite)) # load jsonlite on the workers
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok] # correctly read in
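If the goal is a single combined table, the successfully parsed entries can then be bound together. A minimal sketch (the NULL entries returned for files without the required field are simply ignored by rbind):
all_data <- do.call(rbind, json_list[ok])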

Using foreach instead of for loop

I am trying to learn foreach to parallelise my task.
My for loop looks like this:
# create an empty matrix to store results
mat <- matrix(-9999, nrow = length(unique(dat$mun)), ncol = 2)
mat[, 1] <- unique(dat$mun) # first column holds the municipality ids
for (mun in unique(dat$mun)) {
  dat <- read.csv(paste0("data", mun, ".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[, 1] == mun, 2] <- tot.dat
}
unique(dat$mun) has a length of 5563.
I want to use foreach to parallelise my task.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
foreach(icount(iters)) %dopar% {
  mun <- unique(dat$mun)[mun] # this is where I cannot figure out how to assign mun so that it reads the data for mun
  dat <- read.csv(paste0("data", mun, ".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[, 1] == mun, 2] <- tot.dat
}
This could be one solution.
Do note that I'm on Windows here, and I had to call registerDoParallel() for %dopar% to work.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
registerDoParallel()
mun <- unique(dat$mun)
tableList <- foreach(i = 1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  tot.dat <- sum(dat$x)
}
unlist(tableList)
Essentially, whatever the last expression inside {...} evaluates to becomes one element of the resulting list.
In this case the result (tot.dat, a single number) is collected in tableList, and unlist() converts it to a vector for further use.
The result inside {...} can be anything: a single number, a vector, a data frame, and so on.
Another approach to your problem would be to combine all the existing data together, labelling each row with its source file, so the middle component will look something like
library(plyr)
tableAll <- foreach(i = 1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  dat$source <- mun[i]
  dat # return the labelled data frame (without this line, only mun[i] would be returned)
}
rbind.fill(tableAll)
Then we can use it for further analysis.
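If the two-column structure from the question's mat is still wanted, each iteration can return one row and .combine can assemble the result. A sketch under the same assumptions about the file names:
mat <- foreach(i = 1:iters, .combine = rbind) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  c(mun = mun[i], tot.dat = sum(dat$x)) # one row per municipality (use a data.frame here if mun is not numeric)
}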

How to apply a function to a table in parallel and store the results - R

I'm trying to go through all the rows of a table and apply some functions to each. Something like:
for (i in 1:nrow(df)) {
  df[i,2] <- somefunction1(df[i,1])
  df[i,3] <- somefunction2(df[i,1])
  df[i,4] <- somefunction3(df[i,1])
  df[i,5] <- somefunction4(df[i,1])
}
This works but it takes too long, so I was thinking of parallelizing it with the doParallel library. But when I try:
foreach(i = 1:nrow(df)) %dopar% {
  df[i,2] <- somefunction1(df[i,1])
  df[i,3] <- somefunction2(df[i,1])
  df[i,4] <- somefunction3(df[i,1])
  df[i,5] <- somefunction4(df[i,1])
}
It doesn't change the table; instead it returns a list with the results of the last function.
I guess this may not be the best approach for what I want to achieve, so I am open to other ways of making this kind of code more efficient. Excel applies a formula to all cells at once without spending nearly as much time, so I think R should be able to do this somehow.
If parallelization is the way to go, I would like some orientation on how to store the results in the table directly inside the loop, rather than executing each function separately and storing the results afterwards (which makes the code slower and less reliable when associating results with variables).
Thank you in advance.
That's a really inefficient way to apply a function to every row of a data frame. Do you have to use a for loop at all?
Here is some code that runs some simple functions on each row of a data frame, in parallel:
a <- sample(1:1000)
df <- as.data.frame(cbind(a))
somefunction1 <- function(x) {
  x/1
}
somefunction2 <- function(x) {
  x/2
}
somefunction3 <- function(x) {
  x/3
}
somefunction4 <- function(x) {
  x/4
}
for (i in 1:nrow(df)) {
  df[i,1] <- somefunction1(df[i,1])
  df[i,2] <- somefunction2(df[i,1])
  df[i,3] <- somefunction3(df[i,1])
  df[i,4] <- somefunction4(df[i,1])
}
library(foreach)
library(iterators) # for icount()
library(doMC)
library(abind)
registerDoMC(detectCores() - 1)
acomb <- function(...) abind(..., along = 1)
par_df <- foreach(i = icount(nrow(df)), .combine = 'acomb', .multicombine = TRUE) %dopar% {
  df[i,1] <- somefunction1(df[i,1])
  df[i,2] <- somefunction2(df[i,1])
  df[i,3] <- somefunction3(df[i,1])
  df[i,4] <- somefunction4(df[i,1])
  df[i,]
}
par_df_2 <- data.frame(par_df, stringsAsFactors = FALSE)
This is not an issue with parallelization.
Your code between { } behaves like a function would. See the following example:
myfun <- function() {
  1
  2
  3
}
myfun()
# 3
There is an implicit return of the last evaluated value, and the other evaluated values are lost. The same is occurring with your "function":
foreach(i = 1:nrow(df)) %dopar% {
  df[i,2] <- somefunction1(df[i,1])
  df[i,3] <- somefunction2(df[i,1])
  df[i,4] <- somefunction3(df[i,1])
  df[i,5] <- somefunction4(df[i,1])
}
You can modify it as
foreach(i = 1:nrow(df)) %dopar% {
  c(somefunction1(df[i,1]), somefunction2(df[i,1]),
    somefunction3(df[i,1]), somefunction4(df[i,1]))
}
to return a vector of the evaluated values.
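If you want those per-row vectors assembled into a table rather than a list, .combine can do the assembly; a small sketch:
res <- foreach(i = 1:nrow(df), .combine = rbind) %dopar% {
  c(somefunction1(df[i,1]), somefunction2(df[i,1]),
    somefunction3(df[i,1]), somefunction4(df[i,1]))
}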
Second, and arguably more importantly, you should try to change your functions to accept vectors and return vectors. So instead of
df[i,2] <- somefunction1(df[i,1]) # single element of a vector
try
df[,2] <- somefunction1(df[,1]) # entire vector
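With the toy functions above, which are already vectorized, the whole loop then collapses to four whole-column assignments and needs no parallel backend at all:
df[, 1] <- somefunction1(df[, 1])
df[, 2] <- somefunction2(df[, 1])
df[, 3] <- somefunction3(df[, 1])
df[, 4] <- somefunction4(df[, 1])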

R foreach could not find function "%dopar%"

When I use the doParallel library, I encounter this weird error; the system throws
Error in { : task 1 failed - "could not find function "%dopar%""
To be specific, this is what I did:
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# Read the data
coin95 <- read.csv('~/Documents/coin95.csv')
coin95 <- coin95[, !(names(coin95) %in% c("X"))]
coin95[c("Person")] <- sapply(coin95[c("Person")], as.character)
# create the name list
coin95_name <- as.character(coin95$Person)
coin95_name <- unique(coin95_name)
n <- as.numeric(length(coin95_name))
# the average counting process
ntw <- function(now){
  foreach(Ii = coin95_name, .combine = "+", .export = c("coin95","n")) %dopar% {
    time <- subset(coin95, subset = coin95$Person == Ii)$duration
    stepfun(time, seq(0, length(time)))(now)/n
  }
}
# the average cumulative hazard
lambda <- function(now, params){
  b <- params[1]
  sigma <- params[2]
  mu <- params[3]
  xi <- params[4]
  beta1 <- params[5]
  beta2 <- params[6]
  k <- function(spread){
    L0 <- (1 + (spread - mu)*xi/sigma)^(-1/xi)
    return(L0)
  }
  foreach(Ii = coin95_name, .combine = "+", .export = c("coin95","n")) %dopar% {
    time <- subset(coin95, subset = coin95$Person == Ii)$duration
    noncov <- subset(coin95, subset = coin95$Person == Ii)$noncovered
    reim <- subset(coin95, subset = coin95$Person == Ii)$reimbursement
    (b*now + sum(exp(-k(now - time[(time < now)]) + beta1*noncov[(time < now)] + beta2*reim[(time < now)])))/n
  }
}
So far everything is good: I have created two functions, ntw and lambda, using foreach, and they work perfectly.
Then I create a third function, also using foreach:
# the distance
Time <- coin95$duration
Time <- sort(as.double(Time))
jl <- function(params){
  res <- foreach(Ii = Time, .combine = "rbind", .export = c("ntw","lambda")) %dopar% {
    (ntw(Ii) - ntw(Ii - 1e-7)) * (ntw(Ii) - lambda(Ii, params))^2
  }
  return(sqrt(sum(res)))
}
guess <- c(0.0, 1.3333, 0.0, 0.1, -1.2, 3e-3)
Type jl(guess):
> jl(guess)
Error in { : task 1 failed - "could not find function "%dopar%""
Any idea what's going wrong?
A quick fix for problems with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
The packages above are responsible for parallelism in R. A bug that existed in older versions of these packages has since been fixed. I should mention that this will most likely help even if you are not using all of these packages in your code.
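Another detail worth checking in code like this (an observation about the error, not part of the reinstall fix above): jl() calls ntw() and lambda() inside %dopar%, so their inner foreach()/%dopar% calls run on worker processes where the foreach package may not be attached. .export makes the functions available on the workers, but not the packages they need; .packages can load them there. A hedged sketch of that idea:
jl <- function(params){
  res <- foreach(Ii = Time, .combine = "rbind",
                 .export = c("ntw", "lambda", "coin95", "coin95_name", "n"),
                 .packages = c("foreach", "doParallel")) %dopar% { # load foreach on the workers
    (ntw(Ii) - ntw(Ii - 1e-7)) * (ntw(Ii) - lambda(Ii, params))^2
  }
  return(sqrt(sum(res)))
}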

Foreach code works for %do% but not for %dopar%

This works normally on my computer:
library(doSNOW)
registerDoSNOW(makeCluster(2, type = "SOCK"))
foreach(i = 1:M, .combine = "c") %dopar% {
  sum(rnorm(M))
}
So I can say that I can run parallelized code on this computer, right?
Ok. I have a piece of code that I wish to run in parallel with foreach. It runs perfectly when written with %do%, but doesn't work properly when I change it to %dopar%. (PS: I have already initialized the cluster with registerDoSNOW(makeCluster(2, type = "SOCK")), the same way as before.)
My main interest in the code is getting the vector u.varpred. I get it nicely with %do%, but when I run it with %dopar%, the vector comes back NULL.
Here is the loop, with the code needed to run it properly. It uses functions from the geoR package.
library(geoR)
library(foreach)
# you can pretty much ignore all this, it's just preparation for the loop
N <- 20
NN <- 10
set.seed(111)
datap <- grf(N, cov.pars=c(20, 5), nug=1)
grid.o <- expand.grid(seq(0, 1, l=100), seq(0, 1, l=100))
grid.c <- expand.grid(seq(0, 1, l=NN), seq(0, 1, l=NN))
beta1 <- mean(datap$data)
emv <- likfit(datap, ini=c(10, 0.4), nug=1)
krieging <- krige.conv(datap, loc=grid.o,
                       krige=krige.control(type.krige="SK", trend.d="cte",
                                           beta=beta1, cov.pars=emv$cov.pars))
names(grid.c) <- names(as.data.frame(datap$coords))
list.geodatas <- list()
valores <- c(datap$data, 0)
list.dataframes <- list()
list.krigings <- list(); i <- 0; u.varpred <- NULL
# here is the foreach code
t <- proc.time()
foreach(i = 1:length(grid.c[,1]), .packages='geoR') %do% {
  list.dataframes[[i]] <- rbind(datap$coords, grid.c[i,])
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]], valores)))
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc=grid.o,
                                   krige=krige.control(type.krige="SK", trend.d="cte",
                                                       beta=beta1, cov.pars=emv$cov.pars))
  u.varpred[i] <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
  # I don't need these objects anymore, but since they are lists
  # I don't want to assign NULL, as that would ruin their ordering
  list.dataframes[[i]] <- 0
  list.krigings[[i]] <- 0
  list.geodatas[[i]] <- 0
}
t <- proc.time() - t
t
You can check that this runs nicely (provided you have the geoR, foreach and doSNOW packages). But once I use registerDoSNOW(......) and %dopar%, u.varpred comes back NULL.
Could you please check whether I made a mistake in the foreach statement/process, or whether it's just code that can't be parallelized? (I thought it could, because no iteration depends on the iterations before it.)
I am sorry both the code and this question are so long. Thanks in advance for taking the time to read it.
My friend helped me directly. Here is a way that works:
u.varpred <- foreach(i = 1:length(grid.c[,1]), .packages = 'geoR', .combine = "c") %dopar% {
  list.dataframes[[i]] <- rbind(datap$coords, grid.c[i,])
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]], valores)))
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
                                   krige = krige.control(type.krige = "SK", trend.d = "cte",
                                                         beta = beta1, cov.pars = emv$cov.pars))
  u.varpred <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
  list.dataframes[[i]] <- 0
  list.krigings[[i]] <- 0
  list.geodatas[[i]] <- 0
  u.varpred # this makes the results go into u.varpred
}
He gave me an example of why this works (assignments made inside a %dopar% block happen on the workers and are discarded; only the value of the last expression is sent back):
a <- NULL
foreach(i = 1:10) %dopar% {
  a <- 5
}
print(a)
# a is still NULL

a <- NULL
a <- foreach(i = 1:10) %dopar% {
  a <- 5
  a
}
print(a)
# now it works
Hope this helps someone.
