Application of mclapply() to a function writing to a global variable - r

I'm trying to use parallel::mclapply to speed up the calculation of the following code:
library(raster)
library(HistogramTools)#for AddHistogram
#Create a first h here for the first band... omitted for brevity
readNhist <- function(n, mconst) {
  l <- raster(filename, varname=var, band=n, na.rm=T)
  gain(l) <- mconst
  h <<- AddHistograms(h, hist(l, plot=F, breaks=histbreaks, right=FALSE))
}
lapply(1:10000, readNhist, mconst=1)
#Then do stuff with the h histogram...
When I run the code above, everything works as expected. If I use mclapply instead (below), the result is far from what I want: the histograms are all wrong.
library(raster)
library(HistogramTools)#for AddHistogram
library(parallel)
#Create a first h here for the first band... omitted for brevity
readNhist <- function(n, mconst) {
  l <- raster(filename, varname=var, band=n, na.rm=T)
  gain(l) <- mconst
  h <<- AddHistograms(h, hist(l, plot=F, breaks=histbreaks, right=FALSE))
}
mclapply(2:10000, readNhist, mconst=1, mc.cores=7)
#Then do stuff with the h histogram...
I feel like there's something vital I'm missing with the application of parallel computation to this function.

The problem is the <<- assignment, which is bad practice in general and which simply cannot work here: mclapply forks a separate process for each worker, so every worker updates its own copy of h and none of those updates ever make it back to the parent session.
The function can be rearranged so that it returns each histogram instead:
readNhist <- function(n, mconst) {
  l <- raster(filename, varname=var, band=n, na.rm=T)
  gain(l) <- mconst
  hst <- hist(l, plot=F, breaks=histbreaks, right=FALSE)
  return(hst)
}
And called like this:
hists <- mclapply( 2:nbands, readNhist, mconst=gain, mc.cores=ncores )
ch <- AddHistograms(x=hists)
h <- AddHistograms(h, ch)
rm(ch, hists)
This is pretty fast even with a huge number of layers (and thus histograms).
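The same gather-then-combine pattern works independently of the raster specifics: each worker returns its partial result and the parent reduces them once at the end. A minimal, self-contained sketch (plain simulated numbers stand in for the raster bands, and histbreaks is a made-up set of common break points):
library(parallel)
library(HistogramTools)  # for AddHistograms
histbreaks <- seq(0, 1, by = 0.05)  # hypothetical breaks shared by all histograms
readNhist <- function(n, mconst) {
  vals <- runif(1000) * mconst      # stand-in for reading band n of a raster
  hist(vals, plot = FALSE, breaks = histbreaks, right = FALSE)
}
# Workers build independent histograms; the parent combines them afterwards.
hists <- mclapply(1:100, readNhist, mconst = 1, mc.cores = 2)
h_combined <- AddHistograms(x = hists)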

Related

Code to evaluate an integral: translating from Matlab to R

Consider the following Matlab code to approximate integrals using simulation.
function f
    numSim = 1000000;
    points = rand(numSim,1);
    r3 = mean(feval('func3', points));
    points1 = rand(numSim,1);
    r8 = mean(feval('func8', points, points1));
    disp([r3, r8]);
end %f
%%%%%%%%%% Nested functions %%%%%%%%%%%%
function y = func3(x)
    y = exp(exp(x));
end %func3
function z = func8(x,y)
    z = exp((x+y).^2);
end %func8
What I've tried in R
f <- function (func3, func8){
  numSim <- 1000000
  points <- runif(numSim)
  r3 <- mean(evaluate(points, func3))
  points1 <- runif(numSim)
  r8 <- mean(evaluate(points1, func8))
  newList <- list(r3, r8)
  return(newList)
}
# Nested functions
func3 <- function(x) {
  func3 <- exp(exp(x))
  return(func3)
}
func8 <- function(x,y) {
  func8 <- exp((x+y)^2)
  return(func8)
}
The first problem was a warning message:
In mean.default(evaluate(points, function)) :
  argument is not numeric or logical: returning NA
I added r3 <- mean(evaluate(points, func3), na.rm=TRUE), and when I type r3 the output is [1] NA. Why is it not working correctly?
Additionally, there was a comment about nested functions, and I don't understand how to do that in R.
This appears to work:
f <- function (func3, func8){
  numSim <- 1000000
  vals <- runif(numSim)  ## changed the name: 'points' is a built-in function
  r3 <- mean(sapply(vals, func3))
  vals2 <- runif(numSim)
  ## use mapply() to evaluate over multiple parameter vectors
  r8 <- mean(mapply(func8, vals, vals2))
  newList <- list(r3, r8)
  return(newList)
}
I simplified the function definitions.
func3 <- function(x) {
  return(exp(exp(x)))
}
func8 <- function(x,y) {
  return(exp((x+y)^2))
}
Try it out:
f(func3,func8)
I have no idea if this is correct, but I think it's a correct translation of your MATLAB code. Note that the implementation could be much faster by using vectorization: replace the sapply() and mapply() with mean(func3(vals)) and mean(func8(vals,vals2)) respectively (this only works if the functions to be evaluated are themselves appropriately vectorized, which they are in this case).
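For reference, the vectorized variant suggested above would look something like this (same func3 and func8 definitions as before):
f_vec <- function (func3, func8){
  numSim <- 1000000
  vals <- runif(numSim)
  vals2 <- runif(numSim)
  ## func3 and func8 accept whole vectors, so no sapply()/mapply() loop is needed
  r3 <- mean(func3(vals))
  r8 <- mean(func8(vals, vals2))
  return(list(r3, r8))
}
f_vec(func3, func8)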

How to indicate the level of the for loop in R

Suppose I have a function that contains a for loop which runs for, say, 10 iterations. How can I tell from the output which iteration the function is currently working on, say iteration number 5?
That is, I would like my function to report the current iteration number.
For example, I would like the output to look like this:
Iteration 1 starts
some result
iteration 1 ends
iteration 2 starts
some result
iteration 2 ends
...
...
Please note this is not my original function. In my original function I run optim over a list of models, and I really need to know which model is currently being fitted.
Here is a general example:
Myfun <- function(x, y){
  v <- list()
  for(i in 1:100){
    v[[i]] <- sum(x[[i]], y[[i]])
    cat(v, "\n")
  }
  v
}
x <- rnorm(100)
y <- rnorm(100)
Myfun(x=x, y=y)
Method 1
Output the current iteration step inside the for loop.
Myfun <- function(x, y) {
  v <- list()
  for (i in 1:100) {
    v[[i]] <- sum(x[[i]], y[[i]])
    cat(sprintf("Step %i / 100 done\n", i))
  }
  v
}
Method 2
Use a progress bar (see ?txtProgressBar for details).
Myfun <- function(x, y) {
  v <- list()
  pb <- txtProgressBar(min = 0, max = 100, style = 3)
  for (i in 1:100) {
    v[[i]] <- sum(x[[i]], y[[i]])
    setTxtProgressBar(pb, i)
  }
  close(pb)
  v
}
Note that the line cat(v, "\n") from your original Myfun will give an error.
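Since the real use case is looping over a list of models and fitting each one with optim, the same idea applies: print an identifier for the current model before fitting it. A rough sketch (the model list, objective functions and starting values below are made up purely for illustration):
## hypothetical named list of objective functions, one per model
models <- list(
  modelA = function(par) sum((par - 1)^2),
  modelB = function(par) sum((par + 2)^2)
)
fits <- vector("list", length(models))
names(fits) <- names(models)
for (m in names(models)) {
  message("Fitting ", m, " ...")  # reports which model is currently running
  fits[[m]] <- optim(par = c(0, 0), fn = models[[m]])
  message(m, " done (convergence = ", fits[[m]]$convergence, ")")
}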

Arguments of a function where another function will be called

Consider a hypothetical example:
sim <- function(n, p){
  x <- rbinom(n, 1, p)
  y <- (x==0) * rnorm(n)
  z <- (x==1) * rnorm(n, 5, 2)
  dat <- data.frame(x, y, z)
  return(dat)
}
Now I want to write another function, simfun, in which I call the sim function above and check whether the y and z columns of the data frame are less than a value k.
simfun <- function(n, p, k){
  dat <- sim(n, p)
  dat$threshold <- (dat$y <= k & dat$z <= k)
  return(dat$threshold)
}
But is it standard to use the arguments of sim as the arguments of simfun? Can I instead write simfun <- function(k) and still call sim inside simfun?
I'd say it's fairly standard to do this sort of thing in R. A few pointers to consider:
Usually you should name the arguments explicitly so you don't create any unwanted behaviour if changes are made later. I.e., instead of sim(n, p), write sim(n = n, p = p).
To get simfun() down to just a k argument will require default values for n and p. There are lots of ways to do this. One way would be to hardcode them inside simfun itself. E.g.:
simfun <- function(k) {
  dat <- sim(n = 100, p = c(.4, .6))
  dat$threshold <- (dat$y <= k & dat$z <= k)
  return(dat$threshold)
}
simfun(.5)
A more flexible way would be to add default values in the function declaration. When you do this, it's good practice to put variables with default values AFTER variables without default values. So k would come first, as follows:
simfun <- function(k, n = 100, p = c(.4, .6)){
  dat <- sim(n, p)
  dat$threshold <- (dat$y <= k & dat$z <= k)
  return(dat$threshold)
}
simfun(.5)
The second option is generally preferable because you can still change n or p if you need to.
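For example, with those defaults in place you can still override them at the call site when you need to:
simfun(.5)                          # uses the defaults n = 100, p = c(.4, .6)
simfun(.5, n = 500)                 # same k, larger sample
simfun(.5, n = 500, p = c(.3, .7))  # override both defaults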
While not great, you could define n and p separately
n <- 1
p <- .5
simfun <- function(k){
  dat <- sim(n, p)
  dat$threshold <- (dat$y <= k & dat$z <= k)
  return(dat$threshold)
}
You can read more about R Environments here: http://adv-r.had.co.nz/Environments.html

Package for divide chain of tessellations in R, spatstat package?

I am trying to create pretty figures of clustered points. Is there a package which will create the divide chain between tessellations of points? Ideally it would be fit for plotting in ggplot.
Here is some example code:
# DivideLineExample
library(spatstat)
W <- owin(c(0,1), c(0,1))          # Set up the window
p <- runifpoint(42, win=W)         # Get random points
ll <- cbind(p$x, p$y)              # Get the coordinates of each point
zclust <- kmeans(ll, centers=4)    # Cluster the points spatially into 4 clusters
K <- pp <- D <- list()
plot(W, main="Clustered Points")
for (i in 1:4){  # Break the points into a separate ppp object for each cluster
  K[[i]] <- ll[zclust$cluster==i,]
  pp[[i]] <- as.ppp(K[[i]], W)
  plot(pp[[i]], col=i, add=TRUE, cex=1.5, pch=16)
  D[[i]] <- dirichlet(pp[[i]])     # Dirichlet tessellation of this cluster, plotted below
  plot(D[[i]], col=i, add=TRUE)
}
This outputs as such:
http://imgur.com/CCXeOEB
What I'm looking for is this:
http://imgur.com/7nmtXjo
I know an algorithm exists.
Any ideas/alternatives?
I have written a function that I think will do what you want:
divchain <- function (X) {
  stopifnot(is.ppp(X))
  if(!is.multitype(X)) {
    whinge <- paste(deparse(substitute(X)),
                    "must be a marked pattern with",
                    "factor valued marks.\n")
    stop(whinge)
  }
  X <- unique(X, rule = "deldir", warn = TRUE)
  w <- Window(X)
  require(deldir)
  dd <- deldir(X, z = marks(X), rw = c(w$xrange, w$yrange))
  if (is.null(dd)) return(NULL)
  ddd <- dd$dirsgs
  sss <- dd$summary
  z <- sss[["z"]]
  rslt <- list()
  nsgs <- nrow(ddd)
  K <- 0
  for (i in 1:nsgs) {
    i1 <- ddd[i, 5]
    i2 <- ddd[i, 6]
    c1 <- z[i1]
    c2 <- z[i2]
    if(c1 != c2) {
      K <- K + 1
      rslt[[K]] <- unlist(ddd[i, 1:4])
    }
  }
  class(rslt) <- "divchain"
  attr(rslt, "rw") <- dd$rw
  rslt
}
I have also written a plot method for class "divchain":
plot.divchain <- function(x, add=FALSE, ...){
  if(!add) {
    rw <- attr(x, "rw")
    plot(0, 0, type="n", ann=FALSE, axes=FALSE, xlim=rw[1:2], ylim=rw[3:4])
    bty <- list(...)$bty
    box(bty=bty)
  }
  lapply(x, function(u){segments(u[1], u[2], u[3], u[4], ...)})
  invisible()
}
E.g.:
require(spatstat)
set.seed(42)
X <- runifpoint(50)
z <- factor(kmeans(with(X,cbind(x,y)),centers=4)$cluster)
marks(X) <- z
dcX <- divchain(X)
plot(dirichlet(X),border="brown",main="")
plot(X,chars=20,cols=1:4,add=TRUE)
plot(dcX,add=TRUE,lwd=3)
Let me know whether this is satisfactory. Sorry I can't help you with ggplot stuff; I don't do ggplot.
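If you do want the divide chain in ggplot, one possible route (a sketch only, not part of the answer above, and assuming the divchain() output dcX from the example) is to stack the segment vectors into a data frame and draw them with geom_segment():
library(ggplot2)
## each element of dcX is a named vector (x1, y1, x2, y2); bind them into rows
seg_df <- as.data.frame(do.call(rbind, unclass(dcX)))
ggplot(seg_df) +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2)) +
  coord_equal() +
  theme_void()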
You could also try a point-in-polygon test, for example with a Kirkpatrick point-location structure. A much easier approach is to divide the polygon with horizontal or vertical cuts. Source: http://www.personal.kent.edu/~rmuhamma/Compgeometry/MyCG/Voronoi/DivConqVor/divConqVor.htm

Speed up sample function in R

In a fairly big data frame, I have to pick some random rows and apply a function to them. In my example, the first function I use is the variance, and then a function close to the real one I use in my script, called f below. I will not detail the purpose of f, but it deals with a truncated Gaussian distribution and maximum-likelihood estimation.
My problem is that my code is far too slow with the second function, and I suspect a bit of optimization of the for loop or the sample call could help me.
Here is the code:
df <- as.data.frame(matrix(0, 2e+6, 2))
df$V1 <- runif(nrow(df), 0, 1)
df$V2 <- sample(c(1:10), nrow(df), replace=TRUE)
nb.perm <- 100  # number of permutations
res <- c()
for(i in 1:nb.perm) res <- rbind(res, tapply(df[sample(1:nrow(df)), "V1"], df$V2, var))
library(truncnorm)
f <- function(d)  # d is a vector
{
  f2 <- function(x) -sum(log(dtruncnorm(d, a=0, b=1, mean = x[1], sd = x[2])))
  res <- optim(par=c(mean(d), sd(d)), fn=f2)
  if(res$convergence != 0) warning("Optimization has not converged")
  return(list(res1=res$par[1], res2=res$par[2]^2))
}
res2 <- c()
for(i in 1:nb.perm) res2 <- rbind(res2, tapply(df[sample(1:nrow(df)), "V1"], df$V2, function(x) f(x)$res2))
I hope I am clear enough.
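One avenue worth trying (a sketch only, not a benchmarked solution) is to extract the two columns from the data frame once, so that each permutation only shuffles a plain vector, and to run the permutations on several cores with mclapply, as in the first question above:
library(parallel)
v1 <- df$V1   # extract once to avoid repeated data-frame subsetting
grp <- df$V2
perm_once <- function(i) {
  tapply(v1[sample(length(v1))], grp, function(x) f(x)$res2)
}
## run the nb.perm permutations in parallel (core count is an arbitrary choice here)
res2 <- do.call(rbind, mclapply(1:nb.perm, perm_once, mc.cores = 4))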
