Is there an auto-thresholding algorithm such as 'Otsu' available for R raster object. I have tried using the "authothresholder" package, however it is inefficient as it works on matrix and doesn't work with 32 bit tif files. I am trying to convert an NDWI image into a binary layer.
This is implemented in the EBImage package available from Bioconductor. Here is an example use:
library(EBImage)
img <- readImage(system.file("images", "sample.png", package = "EBImage"))
thr <- img > otsu(img)
display(img)
display(thr)
The implementation is essentially the following (pulled from the function definition of EBImage::otsu i.e. not my work), so you should be able to adapt the following for whatever image analysis toolset you are using:
img # assuming img is a numeric matrix or vector
range = c(0, 1) # assuming values in the matrix range from 0 to 1
levels = 256L
breaks = seq(range[1], range[2], length.out = levels + 1)
h = hist.default(img, breaks = breaks, plot = FALSE)
counts = as.double(h$counts)
mids = as.double(h$mids)
len = length(counts)
w1 = cumsum(counts)
w2 = w1[len] + counts - w1
cm = counts * mids
m1 = cumsum(cm)
m2 = m1[len] + cm - m1
var = w1 * w2 * (m2/w2 - m1/w1)^2
maxi = which(var == max(var, na.rm = TRUE))
(mids[maxi[1]] + mids[maxi[length(maxi)]])/2
Related
I would like to understand how the R kknn package calculates weights, distances, and class probabilities for binary classification problems. In the R code below, there are three observations in the training sample and one observation in the holdout sample. The two predictor variables are height and weight. With Euclidean distance, the distances for each observation in the training sample are then:
sqrt((6-8)^2 + (4-5)^2) = 2.24
sqrt((6-3)^2 + (4-7)^2) = 4.24
sqrt((6-7)^2 + (4-3)^2) = 1.41.
With k=3 and with equal weights, I get a probability for the holdout as:
(1/3 * 1) + (1/3 * 0) + (1/3 * 1) = 0.67.
With k=2 and with equal weights, I get a probability for the holdout as:
(1/2 * 1) + (1/2 * 1) = 1.00.
I would like to understand how the R kknn package makes these same calculations with the "triangular," "gaussian," and "inverse" weights (and more generally).
library(kknn)
training <- data.frame(class = c(1, 0, 1), height = c(8, 3, 7), weight = c(5, 7, 3))
holdouts <- data.frame(class = 1, height = 6, weight = 4)
triangular_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "triangular", k = 3)
triangular_kernel[["fitted.values"]]
triangular_kernel[["W"]]
triangular_kernel[["D"]]
gaussian_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "gaussian", k = 3)
gaussian_kernel[["fitted.values"]]
gaussian_kernel[["W"]]
gaussian_kernel[["D"]]
inverse_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "inv", k = 3)
inverse_kernel[["fitted.values"]]
inverse_kernel[["W"]]
inverse_kernel[["D"]]
Calling kknn::kknn prints the source code for the kknn function in the console. With it, one can go through the function line by line to see what it does.
Distance
kknn calls a compiled C code dmEuclid. To obtain its source code, we follow this guide, writing the following code in R:
untar(download.packages(pkgs = "kknn", destdir = ".", type = "source")[,2])
and then open the src directory of kknn_1.3.1.tar in your working directory (getwd()) to find and open dm.C using any text editor. Scroll about halfway to find dmEuclid. To test the exact outputs of dmEuclid, you could install the build tools, and open a C++ file in Rstudio by selecting it in the dropdown menu, and run the code with different inputs.
Following the function outputs, in your case the dmtmp$dm results in
3.779645e-01 1.133893e+00 1.000000e+150 3.685210e-156
Per your specification k, the first 3 values are chosen as distance D.
This is manually converted to maxdist = 1e-06 by the package author, as the max distance is smaller than that in your case.
Weights
The kknn function uses the following section to allocate a weight scheme, per your defined kernel.
W <- D/maxdist
W <- pmin(W, 1 - (1e-06))
W <- pmax(W, 1e-06)
at this point your W values are larger than 1, and so W is then coerced to approximately 1.
if (kernel == "inv"
W <- 1/W
if (kernel == "triangular")
W <- 1 - W
if (kernel == "gaussian") {
alpha = 1/(2 * (k + 1))
qua = abs(qnorm(alpha))
W = W * qua
W = dnorm(W, sd = 1)
}
the explanation for which can be found in the paper linked by gowerc.
W is then converted to matrix W <- matrix(W, p, k) with 1 row (p=1), 3 columns (k=3)
Fitted value
p = 1 in your case is 1, k=3, cl = c(1,0,1).
C <- matrix(dmtmp$cl, nrow = p, ncol = k + 1)
C <- C[, 1:k] + 1
CL <- matrix(cl[C], nrow = p, ncol = k)
W <- matrix(W, p, k)
fit <- rowSums(W * CL)/pmax(rowSums(W), 1e-06)
I have a linear model with the exchange rate as a dependent variable and 7 others independent variables(e.g. inflation, interest rate etc.). I have quarterly data from 1993Q1-2011Q4.
I would like to create a rolling window regression (with the model above) with window size 60(from 1993Q1-2007Q4) and use the estimated regression to forecast the rest sample. Also, I would like to compare this model with the Random Walk model(exchange rate follows a R.W.). In the end, I would like to perform the dm.test and clarkwest test(does not run). Is my code right?
X = embed(data)
X = as.data.frame(X)
install.packages("foreach")
library(foreach)
w_size=60
n_windows = nrow(X) - 60 #until 2007Q4
forecasts = foreach(i=1:n_windows, .combine = rbind) %do%{
# = Select data for the window (in and out-of-sample) = #
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
return(c(f1, f2))
}
e1 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,2]
library(tseries)
library(forecast)
dm.test(e1,e2, "l") #p-value is more than 5% for all the cases( two.sided, greater, less)
clarkwest(e1,e2)
It seems like the clarkwest() function is not supported anymore. I recently wrote my own function: CW Note that I used normal standard errors and not Newey-West corrected.
To investigate your loop you could try:
i=1
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
Here you can see the composition the loop creates when i=1
I everybody,
I have a function to optimize, subject to linear constraints.
I am actually using maxLik R-package, but this is a wrapper for various method, thus what I am actually running in constrOptim.
The problem is the following: I have a matrix of constraints which is n^2 x n, but n is ~ 10^3, so the matrix is huge and the routine stops for memory problems.
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
It seemed quite natural to me to shift to sparse matrices (indeed my matrix is very sparse) with the Matrix package, but I always get the following error
Error: Matrices must have same dimensions in ineqA * gi.old
even for small n.
Does it mean that sparseMatrix is not supported in constrOptim?
Do you know any way out?
reproducible example
you can find the dataset I am using to optimize here:
http://konect.uni-koblenz.de/downloads/extraction/opsahl.tar.bz2
and here you have the code
#read edgelist
edgelist <- read.table('out.opsahl-usairport',skip=2)
colnames(edgelist) = c('V1','V2','weight')
require(igraph)
g = graph_from_data_frame(edgelist)
s_in = strength(g,v=V(g), mode= 'in')
s_out = strength(g,v=V(g),mode='out')
n = length(s_in)
# optimization function
objective_fun = function(x){
theta_out = x[1:(length(x)/2)]; theta_in = x[(length(x)/2+1):length(x)];
llikelihood(s_out,s_in,theta_out,theta_in)
}
llikelihood = function(s_out,s_in,theta_out, theta_in){
theta_sum_mat = outer(theta_out,rep(1,length(theta_out))) + outer(rep(1,length(theta_in)),theta_in)
theta_sum_mat = log(1-exp(-theta_sum_mat))
diag(theta_sum_mat) = 0 # avoid self loops
f = -sum(s_out*theta_out+s_in*theta_in) + sum(theta_sum_mat)
f
}
#choose appropriate starting point
starting_point = function(s_out,s_in){
s_tot = sum(s_in) # =sum(s_out)
s_mean = mean(mean(s_in),mean(s_out))
z = log((s_tot + s_mean^2)/(s_mean^2))
list(theta_out = rep(1,length(s_out)), theta_in=rep(z-1,length(s_in))) # starting parameters
}
#gradient
grad = function(x){
theta_out = x[1:(length(x)/2)]; theta_in = x[(length(x)/2+1):length(x)];
ret = grad_fun(s_out,s_in,theta_out,theta_in)
ret
}
grad_fun = function(s_out,s_in, theta_out, theta_in){
theta_sum_mat = outer(theta_out,rep(1,length(theta_out))) + outer(rep(1,length(theta_in)),theta_in)
theta_sum_mat = exp(-theta_sum_mat)/(1-exp(-theta_sum_mat))
diag(theta_sum_mat) = 0 # avoid self loops
c(-s_out + rowSums(theta_sum_mat), -s_in + colSums(theta_sum_mat))
}
#constraints
constraints = function(n){
a1 = Diagonal(n); a2 = sparseMatrix(c(1:n),rep(1,n), x=1, dims=c(n,n)) # Diagonal is a sparse diagonal matrix
a12 = cBind(a1,a2)
a12[1,] = 0 # avoid self loops
dd = function(j){
sparseMatrix(c(1:n),rep(j,n), x=rep(1,n), dims=c(n,n))
}
b1 = sparseMatrix(i=1, j=1, x=1, dims=c(n^2,1)) # 1,0,0,... n^2 vector
for(j in c(2:n)) {
a = cBind(Diagonal(n),dd(j))
a[j,]=0 # avoid self loops
a12 = rBind(a12, a)
b1[(j-1)*n+j] = 1 # add 1 to ''self loops'' rows, in order to have the inequality satisfied
}
return(list(A=a12, B=b1))
}
# starting point
theta_0 = starting_point(s_out,s_in)
x_0 = c(theta_0$theta_out, theta_0$theta_in)
#constraints
constr = list(ineqA=constraints(n)$A, ineqB=constraints(n)$B)
# optimization
co = maxControl(printLevel = 1, iterlim=500, tol=1e-4) #tol=1e-8 (def) iterlim=150 (def)
res = maxLik(objective_fun, grad=grad, start=x_0, constraints=constr, control=co)
suppose that we have a pde that describes the evolution of a variable y(t,x) over time t and space x, and I would like to plot its evolution on a three dimensional diagram (t,x,y). With deSolve I can solve the pde, but I have no idea about how to obtain this kind of diagram.
The example in the deSolve package instruction is the following, where y is aphids, t=0,...,200 and x=1,...,60:
library(deSolve)
Aphid <- function(t, APHIDS, parameters) {
deltax <- c (0.5, rep(1, numboxes - 1), 0.5)
Flux <- -D * diff(c(0, APHIDS, 0)) / deltax
dAPHIDS <- -diff(Flux) / delx + APHIDS * r
list(dAPHIDS )
}
D <- 0.3 # m2/day diffusion rate
r <- 0.01 # /day net growth rate
delx <- 1 # m thickness of boxes
numboxes <- 60
Distance <- seq(from = 0.5, by = delx, length.out = numboxes)
APHIDS <- rep(0, times = numboxes)
APHIDS[30:31] <- 1
state <- c(APHIDS = APHIDS) # initialise state variables
times <-seq(0, 200, by = 1)
out <- ode.1D(state, times, Aphid, parms = 0, nspec = 1, names = "Aphid")
"out" produces a matrix containing all the data that we need, t, y(x1), y(x2), ... y(x60). How can I produce a surface plot to show the evolution and variability of y in (t,x)?
The ways change a bit depending on using package. But you can do it with little cost because out[,-1] is an ideal matrix form to draw surface. I showed two examples using rgl and plot3D package.
out2 <- out[,-1]
AphID <- 1:ncol(out2)
library(rgl)
persp3d(times, AphID, out2, col="gray50", zlab="y")
# If you want to change color with value of Z-axis
# persp3d(times, AphID, out2, zlab="y", col=topo.colors(256)[cut(c(out2), 256)])
library(plot3D)
mat <- mesh(times, AphID)
surf3D(mat$x, mat$y, out2, bty="f", ticktype="detailed", xlab="times", ylab="AphID", zlab="y")
Assume A follows Exponential distribution; B follows Gamma distribution
How to plot the PDF of 0.5*(A+B)
This is fairly straight forward using the "distr" package:
library(distr)
A <- Exp(rate=3)
B <- Gammad(shape=2, scale=3)
conv <- 0.5*(A+B)
plot(conv)
plot(conv, to.draw.arg=1)
Edit by JD Long
Resulting plot looks like this:
If you're just looking for fast graph I usually do the quick and dirty simulation approach. I do some draws, slam a Gaussian density on the draws and plot that bad boy:
numDraws <- 1e6
gammaDraws <- rgamma(numDraws, 2)
expDraws <- rexp(numDraws)
combined <- .5 * (gammaDraws + expDraws)
plot(density(combined))
output should look a little like this:
Here is an attempt at doing the convolution (which #Jim Lewis refers to) in R. Note that there are probably much more efficient ways of doing this.
lower <- 0
upper <- 20
t <- seq(lower,upper,0.01)
fA <- dexp(t, rate = 0.4)
fB <- dgamma(t,shape = 8, rate = 2)
## C has the same distribution as (A + B)/2
dC <- function(x, lower, upper, exp.rate, gamma.rate, gamma.shape){
integrand <- function(Y, X, exp.rate, gamma.rate, gamma.shape){
dexp(Y, rate = exp.rate)*dgamma(2*X-Y, rate = gamma.rate, shape = gamma.shape)*2
}
out <- NULL
for(ix in seq_along(x)){
out[ix] <-
integrate(integrand, lower = lower, upper = upper,
X = x[ix], exp.rate = exp.rate,
gamma.rate = gamma.rate, gamma.shape = gamma.shape)$value
}
return(out)
}
fC <- dC(t, lower=lower, upper=upper, exp.rate=0.4, gamma.rate=2, gamma.shape=8)
## plot the resulting distribution
plot(t,fA,
ylim = range(fA,fB,na.rm=TRUE,finite = TRUE),
xlab = 'x',ylab = 'f(x)',type = 'l')
lines(t,fB,lty = 2)
lines(t,fC,lty = 3)
legend('topright', c('A ~ exp(0.4)','B ~ gamma(8,2)', 'C ~ (A+B)/2'),lty = 1:3)
I'm not an R programmer, but it might be helpful to know that for independent random variables with PDFs f1(x) and f2(x), the PDF
of the sum of the two variables is given by the convolution f1 * f2 (x) of the two input PDFs.