Fastest way to select a valid range for raster data

Using R, I need to select the valid range for a given raster (from package raster) in the fastest possible way. I tried this:
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)
r <- raster(ncol=100, nrow=100)
r[] <- runif(ncell(r))
#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)
#Benchmark!
compare <- microbenchmark(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8),
times=100)
autoplot(compare) #Reclassify is much faster, precompiling doesn't help much.
#Check they are the same...
identical(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA))
) #TRUE
identical(
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8)
) #TRUE
The reclassify method is much faster, but I'm sure that it can be sped up more. How can I do so?

While the accepted answer to this question holds for the example raster, it is important to note that the fastest safe function is highly dependent on raster size: the functions h and i presented by @rengis are only faster with relatively small rasters (and relatively simple reclassifications). Just increasing the size of the raster r in the OP's example by a factor of ten makes reclassify quicker:
# Code from OP @AF7
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)
#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)
# Functions from @rengis
h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}
i <- cmpfun(h)
# Benchmark with larger raster (100k cells, vs 10k originally)
r <- raster(ncol = 1000, nrow = 100)
r[] <- runif(ncell(r))
compare <- microbenchmark(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8),
h(r, 0.2, 0.8),
i(r, 0.2, 0.8),
times=100)
autoplot(compare)
The exact point at which reclassify becomes quicker depends both on the number of cells in the raster and on the complexity of the reclassification, but in this case the cross-over point is at about 50,000 cells (see below).
As the raster becomes even larger (or the calculation more complex), another way to speed up reclassification is using multi-threading, e.g. with the snow package:
# Reclassify, using clusterR to split into two threads
library(snow)
tryCatch({
  beginCluster(n = 2)
  clusterR(r, reclassify, args = list(rcl = c(-Inf, 0.2, NA, 0.8, Inf, NA)))
}, finally = endCluster())
Multi-threading involves even more overhead to set up, and so only makes sense with very large rasters and/or more complex calculations (in fact, I was surprised to note that it did not come out as the best option under any of the conditions I tested below; perhaps it would with a more complex reclassification?).
To illustrate, I've plotted results from microbenchmark using the OP's setup at intervals up to 10 million cells (10 runs of each) below:
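A minimal sketch of such a size sweep, assuming the OP's setup and the h defined above (the exact sizes here are illustrative, not the ones used for the plot):
sizes <- c(1e4, 5e4, 1e5, 1e6, 1e7)
results <- lapply(sizes, function(n) {
  # Build a raster with n cells, fill it with uniform noise, then benchmark
  r <- raster(ncol = n / 100, nrow = 100)
  r[] <- runif(ncell(r))
  microbenchmark(
    reclassify = reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
    h = h(r, 0.2, 0.8),
    times = 10)
})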
As a final note, compiling didn't make a difference at any of the tested sizes.

Here is one more way:
h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}
i <- cmpfun(h)
identical(
i(r, 0.2, 0.8),
g(r, 0.2, 0.8)
)
#Benchmark!
compare <- microbenchmark(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8),
h(r, 0.2, 0.8),
i(r, 0.2, 0.8),
times=100)
autoplot(compare)
Compiling doesn't help much in this instance.
You could even gain some further speed-up by accessing slots of the raster object directly with @ (usually discouraged, and it assumes the values are held in memory).
j <- function(r, min, max) {
  v <- r@data@values
  v[v < min | v > max] <- NA
  r@data@values <- v
  r
}
k <- cmpfun(j)
identical(
j(r, 0.2, 0.8)[],
g(r, 0.2, 0.8)[]
)

The raster package has a function for that: clamp (with useValues = FALSE, so that cells outside the range become NA rather than being truncated to the bounds). It is faster than g but slower than h and i because it has some overhead (safety) built in.
compare <- microbenchmark(
h(r, 0.2, 0.8),
i(r, 0.2, 0.8),
clamp(r, 0.2, 0.8, useValues = FALSE),
g(r, 0.2, 0.8),
times=100)
autoplot(compare)
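As a quick sanity check (a sketch, not part of the original answer), the cell values from clamp with useValues = FALSE should match the reclassify-based g:
identical(clamp(r, 0.2, 0.8, useValues = FALSE)[], g(r, 0.2, 0.8)[]) # compare cell values, ignoring object metadata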

Implementing a particular approach to calculating a log-likelihood using matrix operations

I came across a mathematical expression for a log-likelihood in a CrossValidated.com answer and am unclear how I should implement it in R. I'm not sure if SO can render the math the same way CV does, but this is the first equation in the second (not accepted) answer:
$$
\ell(\mu, \Sigma) = C - \frac{m}{2}\log|\Sigma| - \frac{1}{2} \sum_{i=1}^m \operatorname{tr}\left[(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1} (\mathbf{x}^{(i)}-\mu)\right]
$$
I'm focusing on the third term in that equation, and according to another answer on that page the trace operation is not necessary. I suppose I could look at one of the several implementations in the various packages that exist, but I suspect they use more economical approaches that don't clearly follow that equation's procedure, as @onyambu did in the answer here:
I'm ripping out code from an earlier SO example:
library(MASS)
# Make covariance matrix. See note above re the implications of using a correlation matrix.
S = matrix(c(1.0, 0.2, 0.1, 0.35, 0.0,
0.2, 1.0, 0.0, 0.4, 0.0,
0.1, 0.0, 1.0, 0.0, 0.4,
0.35, 0.4, 0.0, 1.0, 0.6,
0.0, 0.0, 0.4, 0.6, 1.0), ncol = 5)
colnames(S) = c("Y1", "X1", "X2", "Z1" ,"Z2")
rownames(S) = colnames(S)
# Make mean vector
mus = c(1, 2, 3, 4, 5); names(mus) = colnames(S)
# Generate 200 observations
obs = mvrnorm(n = 200, mu = mus, Sigma = S)
This effort was in response to a question that has since been answered correctly, but not by using a summation of a matrix expression. I think I can do it with a for loop that creates the individual contribution of each data point:
llmat.term3 <- matrix(NA, 200, 1)
for (n in 1:200) {
  llmat.term3[n] <- t(obs[n, ] - mus) %*% solve(S) %*% (obs[n, ] - mus)
}
sum(llmat.term3)
#[1] 982.7356
... but I'm wondering if there is a more compact matrix approach? Or, I suppose, something that fills in the gaps in my linear algebra knowledge and explains why sum(u * solve(S, u)) is the same as the sum over i of t(obs[i,] - mus) %*% solve(S) %*% (obs[i,] - mus).
In your code you have:
S = matrix(c(1.0, 0.2, 0.1, 0.35, 0.0,
0.2, 1.0, 0.0, 0.4, 0.0,
0.1, 0.0, 1.0, 0.0, 0.4,
0.35, 0.4, 0.0, 1.0, 0.6,
0.0, 0.0, 0.4, 0.6, 1.0), ncol = 5)
colnames(S) = c("Y1", "X1", "X2", "Z1" ,"Z2")
rownames(S) = colnames(S)
# Make mean vector
mus = c(1, 2, 3, 4, 5); names(mus) = colnames(S)
# Generate 200 observations
set.seed(123)
obs = MASS::mvrnorm(n = 200, mu = mus, Sigma = S)
llmat.term3 <- matrix(NA, 200, 1)
for (n in 1:200) {
  llmat.term3[n] <- t(obs[n, ] - mus) %*% solve(S) %*% (obs[n, ] - mus)
}
sum(llmat.term3)
#[1] 982.7356
Compare to more compact approaches:
u <- t(obs) - mus
sum(diag(solve(S, tcrossprod(u))))
#> [1] 982.7356
sum(u * solve(S, u))
#> [1] 982.7356
Though the two expressions give the same result, the first one turns out to be quicker than the second, which I did not expect. The for loop takes forever to compute.
Unit: milliseconds
 expr       min        lq       mean    median        uq       max neval
    a 4532.6753 4679.4043 5470.94765 4815.1294 6061.3284 7789.5116    10
    b    2.8991    3.2693    3.73495    3.3675    3.7777    6.9719    10
    c    7.8176    8.5473   12.03060    9.2542   16.4089   20.1742    10
set.seed(123)
n <- 200000
obs = MASS::mvrnorm(n = n, mu = mus, Sigma = S)
u <- t(obs) - mus
microbenchmark::microbenchmark(
  a = {
    llmat.term3 <- matrix(NA, n, 1)
    for (i in seq(n)) {
      llmat.term3[i] <- t(obs[i, ] - mus) %*% solve(S) %*% (obs[i, ] - mus)
    }
    sum(llmat.term3)
  },
  b = sum(diag(solve(S, tcrossprod(u)))),
  c = sum(u * solve(S, u)),
  check = 'equal', times = 10)
NB: it took me a while to work out the seed you used. Next time, include it in your data generation.
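For completeness, the algebra behind the equivalence the OP asked about (added for clarity; not part of the original answer): write U = t(obs) - mus, so each column u_i of U is one centered observation. Then
$$
\sum_{i=1}^{N} u_i^{T} S^{-1} u_i
= \operatorname{tr}\left(U^{T} S^{-1} U\right)
= \operatorname{tr}\left(S^{-1} U U^{T}\right)
= \sum_{k,i} U_{ki}\,(S^{-1}U)_{ki}
$$
The last sum is exactly sum(u * solve(S, u)), since solve(S, u) computes S^{-1}U, and the middle trace (obtained via the cyclic property of the trace) is sum(diag(solve(S, tcrossprod(u)))).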

How do I assign values to each other in triplets using R?

The situation is as follows:
I need to create a dataset of triplets where we have a discrete distribution of stock prices S <- c(80, 100, 120, 140, 160) with probabilities P <- c(0.2, 0.3, 0.2, 0.2, 0.1), a call option C = max(S - 120, 0) = c(0, 0, 0, 20, 40), and the liability of an option that pays 30 if the price falls in a certain region and zero otherwise, namely L = 30 * I{110 <= S <= 150} = c(0, 0, 30, 30, 0). It is important that the values stay linked: if S[i] = 80 is drawn, then the corresponding C[i] and L[i] must come with it, and this holds for i = 1, 2, 3, 4, 5. How do you create a dataset of N = 10000 simulations where each drawn value of i carries the other two values for the same i?
This is the code I have for now. Note that X_1 = S, X_2 = C and Y = L.
X_1 <- function(n) {
  sample(c(80, 100, 120, 140, 160), size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = TRUE)
}
X_2 <- function(n) {
  sample(X_1 - 120, size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = TRUE)
}
Y <- function(n) {
  sample(L, size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = TRUE)
}
##Creating triplets##
df <- data.frame(S_T = X_1(10000), C_T = X_2(10000), L_T = Y(10000))
df
I'm not sure if you want C_T to be dependent on the S_T values. If you do, I think you just want to call X_1, assign the result to an object, and then use that as the argument to X_2 (or just subtract 120, which is what X_2 does).
X_1 <- function(n) {
  sample(c(80, 100, 120, 140, 160), size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = TRUE)
}
# Call that function
S_T <- X_1(10) # for practice
C_T <- S_T - 120 # that's all X_2 does; note the payoff described in the question is actually pmax(S_T - 120, 0)
If you want C_T to contain values independent of S_T, you can call one function from within another:
X_1 <- function(n) {
  sample(c(80, 100, 120, 140, 160), size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = TRUE)
}
X_2 <- function(n) {
  X_1(n) - 120
}
S_T <- X_1(10) # Same as above
C_T <- X_2(10) # Gives values not dependent on S_T
EDIT to address the comment below:
It's hard to read the comment, but it looks like you want to create a function that takes the result of X_1 and returns a value based on a condition. Use ifelse to read each element one at a time: create another function and then feed it the results of X_1 with an apply.
Y <- function(X_1_func) {
  ifelse(X_1_func == 80,
         return(0),
         ifelse(X_1_func == 100,
                return(0),
                ifelse(X_1_func == 120,
                       return(30),
                       return(60) # default value, returned when the other tests are FALSE
                )
         )
  )
}
sapply(X_1(10), Y) # inputs one element of X_1's output at a time; assign the result to L or whatever you wish to call it
If this all works for you, you can accept the answer.
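One more compact route, as a hedged sketch (not part of the original answers): draw the row index once and use it to subset all three vectors, which guarantees that each triplet stays aligned.
# Sketch: sample the index i once, then look up S, C and L at that index
S <- c(80, 100, 120, 140, 160)
P <- c(0.2, 0.3, 0.2, 0.2, 0.1)
C <- pmax(S - 120, 0)                   # call payoff: c(0, 0, 0, 20, 40)
L <- ifelse(S >= 110 & S <= 150, 30, 0) # liability:   c(0, 0, 30, 30, 0)
idx <- sample(seq_along(S), size = 10000, prob = P, replace = TRUE)
df <- data.frame(S_T = S[idx], C_T = C[idx], L_T = L[idx])
head(df)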

How to loop over different columns of a data frame?

Basically I was working on a portfolio return problem. The stock returns look like this:
AMZN <- c(0.1, 0.3, 0.4, 0.2)
BBY <- c(0.2, 0.4, 0.5, 0.3)
TGT <- c(-0.1, -0.3, -0.2,-0.5)
df1 <- data.frame(AMZN, BBY, TGT)
date <- c("2000-01-01","2000-02-01", "2000-03-01", "2000-04-01")
date <- as.Date(date, "%Y-%m-%d")
df1 <- cbind(date, df1)
library(xts)
xts <- xts(df1[, -1], order.by = df1[, 1])
I want to use Return.portfolio(xts, weight) from the PerformanceAnalytics package to calculate the portfolio return. The weights are:
w1 <- c(0.2, 0.3, 0.5)
w2 <- c(0.5, 0.1, 0.4)
w3 <- c(0.1, 0.5, 0.4)
Weights <- data.frame(w1, w2, w3)
Since several groups of weights are assigned, I need to get multiple portfolio returns.
The code I tried is:
for (i in colnames(Weights)) {
  Return.portfolio(xts, Weights[[i]])
}
Although R does not report any error, the only thing I get is the loop index i, whose value is "w3"; the computed returns are never stored.
I think you may need to initialize a NULL object first and bind the results to it inside the loop. Maybe something like this:
library(PerformanceAnalytics) # provides Return.portfolio
Return <- NULL
for (i in 1:ncol(Weights)) {
  Return <- cbind(Return, Return.portfolio(xts, Weights[[i]]))
}
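An equivalent that avoids growing an object inside the loop (a sketch in the same spirit, not from the original answer):
# Compute one portfolio-return series per weight column, then bind once
returns <- lapply(Weights, function(w) Return.portfolio(xts, weights = w))
Return <- do.call(cbind, returns)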

Sampling iteratively without a for loop in R

Even though I think the issue I have may be simple, I nevertheless can't figure it out. Here's the thing:
I have the following list and vector. The list is used to fill up the vector:
probabilities = list(c(0.2, 0.3, 0.5), c(0.1, 0.1, 0.8), c(0.3,0.4,0.3))
nextState = c()
for (counter in 1:3) {
  nextState[counter] = sample(1:3, size = 1, prob = probabilities[[counter]])
}
The code works fine. However, when expanding to larger lists (>10,000 elements), the loop becomes aggravatingly slow. Since the loop above is used multiple times in the larger code, the time consumed is way too much. Would there be a way to achieve the same result without looping?
Additional question:
Thanks guys, you've been a big help. One additional question: how would you approach the same issue if the probabilities and the nextState were interdependent? That is, how could I avoid the for loop then? Perhaps some code to clarify:
M <- list(matrix(c(0.1, 0.2, 0.7, 0.2, 0.2, 0.6, 0.3, 0.3, 0.4), nrow = 3, ncol = 3),
          matrix(c(0.3, 0.3, 0.4, 0.5, 0.5, 0, 0.1, 0.1, 0.8), nrow = 3, ncol = 3))
probabilities <- list()
nextState <- c(2, NA, NA)
for (i in 1:2) {
  probabilities[[i]] <- M[[i]][nextState[i], ]
  nextState[i + 1] <- sample(1:3, size = 1, prob = probabilities[[i]])
}
If you've got any idea, then you truly are miracle workers!!
Try sapply:
nextState <- sapply(probabilities, function(x) sample(1:3, size = 1, prob = x))
Benchmarks:
# Unit: microseconds
# expr min lq mean median uq max neval
# for 2115.170 2223.475 2436.0797 2283.2755 2371.546 10048.64 100
# sapply 24.704 29.524 164.0261 37.3565 41.123 12763.03 100
microbenchmark::microbenchmark(
  `for` = {
    nextState = c()
    for (counter in 1:3) {
      nextState[counter] = sample(1:3, size = 1, prob = probabilities[[counter]])
    }
  },
  sapply = sapply(probabilities, function(x) sample(1:3, size = 1, prob = x)),
  times = 100)
Another possibility, with the purrr package:
library(purrr)
nextState <- map_int(probabilities, function(x) sample(1:3, size = 1, prob = x))
Data:
probabilities = list(c(0.2, 0.3, 0.5), c(0.1, 0.1, 0.8), c(0.3,0.4,0.3))
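As for the follow-up about interdependent states: the recurrence is inherently sequential, so it cannot be fully vectorized, but as a hedged sketch (not from the answers above) Reduce() can thread the state through without manual index bookkeeping:
# Each draw depends on the previous state; Reduce() carries it forward
M <- list(matrix(c(0.1, 0.2, 0.7, 0.2, 0.2, 0.6, 0.3, 0.3, 0.4), nrow = 3, ncol = 3),
          matrix(c(0.3, 0.3, 0.4, 0.5, 0.5, 0, 0.1, 0.1, 0.8), nrow = 3, ncol = 3))
nextState <- Reduce(function(state, m) sample(1:3, size = 1, prob = m[state, ]),
                    M, init = 2, accumulate = TRUE)
nextState # the accumulated states, starting from the initial state 2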

Iteratively define user-defined discrete distributions

I am writing a script that, using the distr package, defines some discrete distributions based on the following objects:
margins <- c("discrete1", "discrete2")
vec1 <- list(support=c(0,1,2), probabilities=c(0.2, 0.2, 0.6))
vec2 <- list(support=c(12,14,20), probabilities=c(0.1, 0.15, 0.75))
Here is the code, which works as expected: it creates the two distributions.
library("distr")
discrete1 <- DiscreteDistribution(supp = vec1[[1]], prob = vec1[[2]])
ddiscrete1 <- d(discrete1) # Density function
pdiscrete1 <- p(discrete1) # Distribution function
qdiscrete1 <- q(discrete1) # Quantile function
rdiscrete1 <- r(discrete1)
discrete2 <- DiscreteDistribution(supp = vec2[[1]], prob = vec2[[2]])
ddiscrete2 <- d(discrete2)
pdiscrete2 <- p(discrete2)
qdiscrete2 <- q(discrete2)
rdiscrete2 <- r(discrete2)
Once the two (or possibly more) distributions are defined, my final goal is to sample random numbers from them:
rdiscrete1(100)
rdiscrete2(100)
The problem with this code is that the number of distributions can be very high. I wonder how I could automate the creation of the functions in a more elegant manner.
Also, I need the distributions to be of class DiscreteDistribution and not nested in lists (see is(discrete1) in my example).
l <- list(list(support = c(0, 1, 2), probabilities = c(0.2, 0.2, 0.6)),
          list(support = c(12, 14, 20), probabilities = c(0.1, 0.15, 0.75)))
distrs <- lapply(seq_along(l), function(n) {
  d <- DiscreteDistribution(supp = l[[n]][[1]], prob = l[[n]][[2]])
  list(d = d, dd = d(d), pd = p(d), qd = q(d), rd = r(d))
})
# First object of class DiscreteDistribution
is(distrs[[1]][[1]])
# [1] "DiscreteDistribution" "UnivariateDistribution" "AcDcLcDistribution"
# [4] "Distribution" "UnivDistrListOrDistribution"
# Random numbers
dim(sapply(distrs, function(x) x[[5]](100)))
# [1] 100 2
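If you additionally want the objects at the top level rather than inside a list, one possible route (a sketch, an assumption rather than part of the original answer):
# Name the list, then either index it directly...
names(distrs) <- paste0("discrete", seq_along(distrs))
distrs[["discrete1"]]$rd(100) # 100 random draws from the first distribution
# ...or push the distribution objects into the global environment
list2env(setNames(lapply(distrs, `[[`, "d"), names(distrs)), envir = globalenv())
is(discrete1) # now a top-level DiscreteDistribution object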
