Finding a boundary in a density plot - r

I am very new to machine learning, so I am open to suggestions as well. I read about something called minimax risk today and was wondering whether it applies to my case.
I have two datasets and am interested in finding a line (or, more precisely, a boundary) such that the area under the left curve to the right of the vertical line equals the area under the right curve to the left of the vertical line. Is there a way to do this in R, i.e., to find the exact location at which to draw the vertical line?
I put up some sample data here that can be used to plot the following graph: https://gist.github.com/Legend/2f299c3b9ba94b9328b2

Suppose you use the density function to get the estimated kernel density for each response and then build the estimated kernel CDF from it (as done below). Your question then becomes finding a value t such that 1 - cdf1(t) = cdf2(t), which can be solved with a regular root-finding function:
## 'data' holds the sample data from the gist linked above
x1 <- subset(data, Type == 'Curve 1')$Value
x2 <- subset(data, Type == 'Curve 2')$Value
## kernel density estimate for curve 1, turned into a function and its CDF
pdf1 <- density(x1)
f1 <- approxfun(pdf1$x, pdf1$y, yleft = 0, yright = 0)
cdf1 <- function(z){
  integrate(f1, -Inf, z)$value
}
## same for curve 2
pdf2 <- density(x2)
f2 <- approxfun(pdf2$x, pdf2$y, yleft = 0, yright = 0)
cdf2 <- function(z){
  integrate(f2, -Inf, z)$value
}
## the root of this function is the t at which 1 - cdf1(t) = cdf2(t)
Target <- function(t){
  1 - cdf1(t) - cdf2(t)
}
R > uniroot(Target, range(c(x1, x2)))$root
[1] 0.06501821
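To double-check the result, you can verify that the two tail areas match at the returned root and draw the boundary on the estimated densities; a quick sketch reusing the objects defined above:
t0 <- uniroot(Target, range(c(x1, x2)))$root
1 - cdf1(t0)                  # area under curve 1 to the right of the boundary
cdf2(t0)                      # area under curve 2 to the left (should match)
plot(pdf1, main = "Estimated densities and boundary")
lines(pdf2, col = "blue")
abline(v = t0, lty = 2, col = "red")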

Plotting an 'n' sized vector between a given function with given interval in R

Let me make my question clear, because I don't know how to ask it properly (and therefore I don't know whether it has already been answered); I will go through my whole problem:
There is a given function (which is the right-hand side of an explicit first-order differential equation, if that matters):
f = function(t,y){
-2*y+3*t
}
Then there's a given interval from 'a' to 'b'; this is the range over which the function is evaluated in 'n' steps, so the step size in the interval (dt) is:
dt=abs(a-b)/n
In this case 'a' is always 0 and 'b' is always positive, so 'b' is always greater than 'a' but I tried to be generic.
The initial condition:
yt0=y0
The calculation that determines the vector:
yt = vector("numeric", n)
for (i in 1:(n-1))
{
  yt[1] = f(0, yt0)*dt + yt0
  yt[i+1] = (f(dt*i, yt[i]))*dt + yt[i]
}
The created vector is 'n' elements long and is an approximate solution of the differential equation over the interval from 'a' to 'b'. And here comes my problem:
When I try plotting it alongside the exact solution (using deSolve), the two do not line up. The values of the vector are fine, but nothing tells R that these values belong to an approximate solution defined on the interval from 'a' to 'b'.
That's why the graphs of the exact and approximate solutions don't match at all. I feel pretty burnt out, so I might not be describing my issue properly, but is there a solution to this, i.e., a way to make the plot treat the values as lying between 'a' and 'b' on the x axis rather than between '1' and 'n'?
I thank you all for the answers in advance!
The deSolve lines I used (keeping in mind that 'b' is greater than 'a'):
df = function(t, y, params) list(-2*y+3*t)
t = seq(a, b, length.out = n)
ddf = as.data.frame(ode(yt0, t, df, parms=NULL))
I tried to reconstruct the comparison between an "approximate" solution using a loop (which is in fact the Euler method) and a solution with package deSolve. deSolve uses the lsoda solver by default, which is more precise than Euler's method, but it is of course also an approximation (default relative and absolute tolerances set to 1e-6).
As the question omitted some concrete values and the plot calls, it was not clear where the original problem was, but the following example may help to re-formulate the question. I assume the problem is confusion between t (absolute time) and dt in the two approaches. Compare the lines marked as "original code" with the "suggestion":
library(deSolve)
f = function(t, y){
-2 * y + 3 * t
}
## some values
y0 <- 0.1
a <- 3
b <- 5
n <- 100
## Euler method using a loop
dt <- abs(a-b)/n
yt <- vector("numeric", n)
yt[1] <- f(0, y0) * dt + y0                      # written before the loop
for (i in 1:(n-1)) {
  #yt[i+1] = (f( dt * i, yt[i])) * dt + yt[i]    # original code
  yt[i+1] <- (f(a + dt * i, yt[i])) * dt + yt[i] # suggestion: use absolute time a + dt*i
}
## lsoda integration with package deSolve
df <- function(t, y, params) list(-2*y + 3*t)
t <- seq(a, b, length.out = n)
ddf = as.data.frame(ode(y0, t, df, parms=NULL))
## Plot of both solutions
plot(ddf, type="l", lwd=5, col="orange", ylab="y", las=1)
lines(t, yt, lwd=2, lty="dashed", col="blue")
legend("topleft", c("deSolve", "for loop"),
lty=c("solid", "dashed"), lwd=c(5, 2), col=c("orange", "blue"))

r density function of convolution of exponentially distributed r.v. sometimes produces wrong values

I'm trying to replicate the plot in figure 3.4 of "Coalescent Theory: An Introduction" by John Wakeley. It shows the density function of "T total", which I will not define further.
My problem is that I can't seem to write this density function correctly.
The density should be that of a convolution of (n-1) exponentially distributed random variables, where the rate of the i'th r.v. is i/2. For distinct rates lambda_1, ..., lambda_(n-1), the density of such a convolution can be written as
f(t) = sum_i lambda_i * exp(-lambda_i * t) * prod_{j != i} lambda_j / (lambda_j - lambda_i)
I think I have coded this correctly:
DistExpConv <- function(lambdas, t) {
  ## prod_{j != i} lambda_j / (lambda_j - lambda_i) for the i-th rate
  product = function(vector, entrance) {
    sapply(entrance, function(x){ prod(vector[-x] / (vector[-x] - vector[x])) })
  }
  ## density evaluated at each value of t
  sapply(t, function(y){
    sum(sapply(lambdas, function(x){
      x * exp(-x * y) * product(lambdas, which(lambdas == x))
    }))
  })
}
I think this should give the correct result; at least I cannot see where it would be wrong. I have then implemented the density of T total:
T_totaldist <- function(n, t) {
  lambdas = sapply(2:n, function(x){ (x-1)/2 })
  DistExpConv(lambdas, t)
}
As seen, this function just forms n-1 lambdas and then sends them into DistExpConv.
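As a small sanity check (assuming the convolution formula above), with n = 2 there is only a single rate of 1/2, so the result should reduce to an ordinary exponential density:
tt <- seq(0.1, 10, by = 0.1)
all.equal(T_totaldist(2, tt), dexp(tt, rate = 1/2))   # should be TRUE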
So far so good. My problem arises when trying to plot T_totaldist for n = 100 using the curve function:
n = c(2, 5, 10, 20, 50, 100)
col = rainbow(length(n))
for(i in 1:length(n)){
curve(T_totaldist(n = n[i], x), 0, 14, col = col[i], add = i!=1)
}
legend(8,0.5,paste("n = ",n, sep = ""), col=col, lty=1)
The curve produced by n = 100 (pink in this case) jumps back and forth between extremely negative and extremely positive values. Thus I must conclude my functions are doing something wrong. The weird thing is that the curves for the other values of n look exactly as they should, and for t >= 2 the pink curve is also correct. So I think I might be producing some error for small values of t?
I have no idea how to proceed in making the plot look right.
Looking at the x- and y-values of the curve I can see that, e.g., T_totaldist(100, 0.14) returns 4.252961e+13 while T_totaldist(100, 0.28) returns -4.982278e+10. This is clearly not what I want, since the density should be very close to zero at such small values of t, and it should never be negative.
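For what it's worth, the blow-up looks like catastrophic cancellation: with 99 distinct rates the individual terms of the sum are enormous and alternate in sign, so their near-complete cancellation cannot be represented in double precision. A rough diagnostic sketch, reusing the formula above with t = 0.14 as in the example:
lambdas <- (1:99)/2
terms <- sapply(seq_along(lambdas), function(i) {
  lambdas[i] * exp(-lambdas[i] * 0.14) *
    prod(lambdas[-i] / (lambdas[-i] - lambdas[i]))
})
range(terms)   # huge positive and negative terms that should nearly cancel
sum(terms)     # the same ill-conditioned sum that T_totaldist(100, 0.14) computes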

Spatial correlogram using the raster package

Dear Crowd
Problem
I tried to calculate a spatial correlogram with the packages ncf, pgirmess, SpatialPack and spdep. However, I had trouble defining the start and end points of the distance bins: I'm only interested in the spatial autocorrelation at smaller distances, but with finer bins there. Additionally, as the raster is quite large (1.8 megapixels), I ran into memory trouble with all of these packages except SpatialPack.
So I tried to write my own code, using the function Moran from the package raster. But I must have made some error, as the result for the complete dataset differs somewhat from the one from the other packages. If there is no error in my code, it might at least help others with similar problems.
Question
I'm not sure whether my focal matrix is erroneous. Could you please tell me whether the central pixel needs to be incorporated? Using the test data I can't show the differences between the methods, but on my complete dataset differences are visible, as shown in the image below. However, the bins are not exactly the same (50 m vs. 69 m), so this might explain part of the differences, although at the first bin that explanation does not seem plausible to me. Or might the irregular shape of my raster and different ways of handling NAs cause the difference?
Comparison of Own method with the one from SpatialPack
Runnable Example
Testdata
The code for calculating the testdata is taken from http://www.petrkeil.com/?p=1050#comment-416317
# packages used for the data generation
library(raster)
library(vegan) # will be used for PCNM
# empty matrix and spatial coordinates of its cells
side=30
my.mat <- matrix(NA, nrow=side, ncol=side)
x.coord <- rep(1:side, each=side)*5
y.coord <- rep(1:side, times=side)*5
xy <- data.frame(x.coord, y.coord)
# all pairwise euclidean distances between the cells
xy.dist <- dist(xy)
# PCNM axes of the dist. matrix (from 'vegan' package)
pcnm.axes <- pcnm(xy.dist)$vectors
# using the 8th PCNM axis as my artificial z variable
z.value <- pcnm.axes[,8]*200 + rnorm(side*side, 0, 1)
# plotting the artificial spatial data
r <- rasterFromXYZ(xyz = cbind(xy,z.value))
plot(r, axes=F)
Own Code
library(raster)
sp.Corr <- matrix(nrow = 0,ncol = 2)
formerBreak <- 0 #for the first run important
for (i in c(seq(10,200,10))) #Calculate the Morans I for these bins
{
cat(paste0("..",i)) #print the bin, which is currently calculated
w = focalWeight(r,d = i,type = 'circle')
wTemp <- w #temporarily saves the weigtht matrix
if (formerBreak>0) #if it is the second run
{
midpoint <- ceiling(ncol(w)/2) # get the midpoint
w[(midpoint-formerBreak):(midpoint+formerBreak),(midpoint-formerBreak):(midpoint+formerBreak)] <- w[(midpoint-formerBreak):(midpoint+formerBreak),(midpoint-formerBreak):(midpoint+formerBreak)]*(wOld==0)#set the previous focal weights to 0
w <- w*(1/sum(w)) #normalizes the vector to sum the weights to 1
}
wOld <- wTemp #save this weight matrix for the next run
mor <- Moran(r,w = w)
sp.Corr <- rbind(sp.Corr,c(Moran =mor,Distance = i))
formerBreak <- i/res(r)[1]#divides the breaks by the resolution of the raster to be able to translate them to the focal window
}
plot(x=sp.Corr[,2],y = sp.Corr[,1],type = "l",ylab = "Moran's I",xlab="Upper bound of distance")
Other methods to calculate the Spatial Correlogram
library(SpatialPack)
sp.Corr <- summary(modified.ttest(z.value,z.value,coords = xy,nclass = 21))
plot(x=sp.Corr$coef[,1],y = sp.Corr$coef[,4],type = "l",ylab = "Moran's I",xlab="Upper bound of distance")
library(ncf)
ncf.cor <- correlog(x.coord, y.coord, z.value,increment=10, resamp=1)
plot(ncf.cor)
In order to compare the results of the correlograms, two things should be considered in your case. (i) Your code only works for bins proportional to the resolution of your raster; a small difference in the bins can include or exclude a substantial number of pairs. (ii) The irregular shape of the raster has a strong impact on which pairs are considered when computing the correlation for a given distance interval. So your code should deal with both: allow any value for the bin length and take the irregular shape of the raster into account. A small modification of your code to tackle those problems is below.
# SpatialPack correlation
library(SpatialPack)
test <- modified.ttest(z.value,z.value,coords = xy,nclass = 21)
# Own correlation
bins <- test$upper.bounds
library(raster)
sp.Corr <- matrix(nrow = 0,ncol = 2)
for (i in bins) {
  cat(paste0("..", i))               # print the bin which is currently calculated
  w <- focalWeight(r, d = i, type = 'circle')
  wTemp <- w                         # temporarily save the weight matrix
  if (i > bins[1]) {
    midpoint <- ceiling(dim(w)/2)    # get the midpoint
    half_range <- floor(dim(wOld)/2)
    # set the focal weights already covered by the previous bin to 0
    w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
      (midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])] <-
      w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
        (midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])] * (wOld == 0)
    w <- w * (1/sum(w))              # normalize so the weights sum to 1
  }
  wOld <- wTemp                      # save this weight matrix for the next run
  mor <- Moran(r, w = w)
  sp.Corr <- rbind(sp.Corr, c(Moran = mor, Distance = i))
}
# Comparing
plot(x=test$upper.bounds, test$imoran[,1], col = 2,type = "b",ylab = "Moran's I",xlab="Upper bound of distance", lwd = 2)
lines(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
points(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
legend('topright', legend = c('SpatialPack', 'Own code'), col = 2:3, lty = 1, lwd = 2:1)
The image shows that the results from the SpatialPack package and from your own code are the same.

Adding bias in Taylor diagram in R

I am using the taylor.diagram function in the plotrix package e.g.
obs = runif(100,1,100)
mod1 = runif(100,1,100)
mod2 = runif(100,1,100)
mod3 = runif(100,1,100)
taylor.diagram(obs,mod1)
taylor.diagram(obs,mod2,add=TRUE)
taylor.diagram(obs,mod3,add=TRUE)
In the conventional Taylor diagram there is no bias, but in his paper (Taylor, K. E., 2001: Summarizing multiple aspects of model performance in a single diagram. JGR, 106, 7183-7192) Taylor says that
"Although the diagram has been designed to convey information about centered pattern differences it is also possible to indicate differences in overall means (i.e., the bias). This can be done on the diagram by attaching to each plotted point a line segment drawn at a right angle to the straight line defined by the point and the reference point. If the length of the attached line segment is equal to the bias, then the distance from the reference point to the end of the line segment will be equal to the total (uncentered) RMS error"
I admit that I don't know where to start to try and do this. Has anyone succeeded at adding this information on the plot?
If I understand correctly, the bias is the difference in means between the model vector and the observation vector. The problem is then to (a) find the line between the observation and model points, (b) find a line perpendicular to it, and (c) find the point along that perpendicular at a distance from the model point equal to the bias.
One possible solution is:
taylor.bias <- function(ref, model, normalize = FALSE){
  R    <- cor(model, ref, use = "pairwise")
  sd.f <- sd(model)
  sd.r <- sd(ref)
  m.f  <- mean(model)
  m.r  <- mean(ref)
  ## normalize if requested
  if (normalize) {
    m.f  <- m.f/sd.r
    m.r  <- m.r/sd.r
    sd.f <- sd.f/sd.r
    sd.r <- 1
  }
  ## calculate bias
  bias <- m.f - m.r
  ## Cartesian coordinates of the model (mp) and reference (rp) points in the diagram
  dd <- rbind(mp = c(sd.f * R, sd.f * sin(acos(R))), rp = c(sd.r, 0))
  ## find equation of the line passing through both points
  v1 <- solve(cbind(1, dd[,1])) %*% dd[,2]
  ## find the perpendicular line through the model point
  v2 <- c(dd[1,2] + dd[1,1]/v1[2], -1/v1[2])
  ## find the point defined by the bias along that perpendicular
  nm <- dd[1,] - c(0, v2[1])
  nm <- nm / sqrt(sum(nm^2))
  bp <- dd[1,] + bias*nm
  ## plot the bias segment and the dotted helper lines
  arrows(x0 = dd[1,1], x1 = bp[1], y0 = dd[1,2], y1 = bp[2],
         col = "red", length = 0.05, lwd = 1.5)
  lines(rbind(dd[2,], bp), col = "red", lty = 3)
  lines(dd, col = "red", lty = 3)
}
Then,
library(plotrix)
obs = runif(100,1,100)
mod1 = runif(100,1,100)
taylor.diagram(obs,mod1)
taylor.bias(obs,mod1)
The length of the red vector indicates the bias, and the length of the dotted line joining the vector's tip to the reference point is the total RMS error. The direction of the red vector indicates the sign of the bias; in the picture below it is negative.
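The construction relies on the identity total RMSE^2 = centered RMSE^2 + bias^2, which is what makes the distance from the reference point to the segment's tip equal the total RMS error. A quick numeric check of that identity, sketched with the vectors above:
bias  <- mean(mod1) - mean(obs)
rmse  <- sqrt(mean((mod1 - obs)^2))                                  # total RMSE
crmse <- sqrt(mean(((mod1 - mean(mod1)) - (obs - mean(obs)))^2))     # centered RMSE
all.equal(rmse^2, crmse^2 + bias^2)                                  # TRUE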

filling color gradient under normal curve in X direction in r

I am trying to shade under a curve in the x direction (in contrast to the y direction in this post). The following curve is just to illustrate the fill direction I have in mind.
curve(dnorm(x,0,1),xlim=c(-3,3),main='Standard Normal')
I am trying to write a function where I can fill very small polygons with different colors (I do not know if this is the right approach), so that the result looks like a gradient.
The idea is to extend the following filling of a single polygon to n polygons.
codx <- c(-3,seq(-3,-2,0.01),-2)
cody <- c(0,dnorm(seq(-3,-2,0.01)),0)
curve(dnorm(x,0,1),xlim=c(-3,3),main='Standard Normal')
polygon(codx,cody,col='red')
I tried to extend it to a function:
x1 <- NULL
y1 <- NULL
polys <- function (lwt, up, itn) {
  x1 <- c(lwt, seq(lwt, up, itn), up)
  y1 <- c(0, dnorm(seq(lwt, up, itn)), 0)
  out <- list(x1, y1)
  return(out)
}
out <- polys(lwt = 0, up = 1, itn = 0.1)
library(RColorBrewer)
plotclr <- brewer.pal(10,"YlOrRd")
I could neither work out the function nor brew more than 9 colors this way. Help appreciated.
You can use segments to achieve "roughly" what you want
x <- seq(from=-3, to=3,by=0.01)
curve(dnorm(x,0,1), xlim=c(-3,3))
segments(x, rep(0,length(x)),x,dnorm(x,0,1) , col=heat.colors(length(x)), lwd=2)
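If the many-small-polygons idea from the question is preferred, a sketch along those lines could look like the following; it uses colorRampPalette to get past the 9-colour limit of brewer.pal, and the 200-slice grid is an arbitrary choice:
library(RColorBrewer)
xs   <- seq(-3, 3, length.out = 201)                          # slice boundaries
cols <- colorRampPalette(brewer.pal(9, "YlOrRd"))(length(xs) - 1)
curve(dnorm(x, 0, 1), xlim = c(-3, 3), main = 'Standard Normal')
for (i in seq_along(cols)) {
  xx <- seq(xs[i], xs[i + 1], length.out = 5)
  polygon(c(xs[i], xx, xs[i + 1]), c(0, dnorm(xx), 0), col = cols[i], border = NA)
}
curve(dnorm(x, 0, 1), add = TRUE)   # redraw the outline on top of the fill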
