Adding bias in Taylor diagram in R

I am using the taylor.diagram function in the plotrix package, e.g.
obs = runif(100,1,100)
mod1 = runif(100,1,100)
mod2 = runif(100,1,100)
mod3 = runif(100,1,100)
taylor.diagram(obs,mod1)
taylor.diagram(obs,mod2,add=TRUE)
taylor.diagram(obs,mod3,add=TRUE)
In the conventional Taylor diagram there is no bias, but in his paper (Taylor, K.E., 2001. Summarizing multiple aspects of model performance in a single diagram. JGR, 106, 7183-7192) Taylor says that
"Although the diagram has been designed to convey information about centered pattern differences it is also possible to indicate differences in overall means (i.e., the bias). This can be done on the diagram by attaching to each plotted point a line segment drawn at a right angle to the straight line defined by the point and the reference point. If the length of the attached line segment is equal to the bias, then the distance from the reference point to the end of the line segment will be equal to the total (uncentered) RMS error"
I admit that I don't know where to start to try and do this. Has anyone succeeded in adding this information to the plot?

If I understand correctly, the bias is the difference in means between the model vector and the observation vector. The problem is then to (a) find the line between the observation and model points, (b) find a line perpendicular to this line, and (c) find a point along the perpendicular line, at a distance from the model point equal to the bias.
One possible solution is:
taylor.bias <- function(ref, model, normalize = FALSE){
  R <- cor(model, ref, use = "pairwise")
  sd.f <- sd(model)
  sd.r <- sd(ref)
  m.f <- mean(model)
  m.r <- mean(ref)
  ## normalize if requested
  if (normalize) {
    m.f <- m.f/sd.r
    m.r <- m.r/sd.r
    sd.f <- sd.f/sd.r
    sd.r <- 1
  }
  ## calculate bias
  bias <- m.f - m.r
  ## coordinates for model and observations
  dd <- rbind(mp = c(sd.f * R, sd.f * sin(acos(R))), rp = c(sd.r, 0))
  ## find equation of the line passing through both points (intercept, slope)
  v1 <- solve(cbind(1, dd[,1])) %*% dd[,2]
  ## find the perpendicular line (intercept, slope)
  v2 <- c(dd[1,2] + dd[1,1]/v1[2], -1/v1[2])
  ## find the point at distance `bias` along the perpendicular
  nm <- dd[1,] - c(0, v2[1])
  nm <- nm / sqrt(sum(nm^2))
  bp <- dd[1,] + bias*nm
  ## plot lines
  arrows(x0 = dd[1,1], x1 = bp[1], y0 = dd[1,2], y1 = bp[2], col = "red", length = 0.05, lwd = 1.5)
  lines(rbind(dd[2,], bp), col = "red", lty = 3)
  lines(dd, col = "red", lty = 3)
}
Then,
library(plotrix)
obs = runif(100,1,100)
mod1 = runif(100,1,100)
taylor.diagram(obs,mod1)
taylor.bias(obs,mod1)
The length of the red vector indicates the bias, and the length of the dotted line joining the vector's tip to the reference point is the total (uncentered) RMS error. The direction of the red vector indicates the sign of the bias; in the picture below, a negative bias.
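As a quick sanity check on Taylor's statement, the total (uncentered) RMS error decomposes exactly into the bias and the centered RMS error; a small numerical sketch:
## E_total^2 = bias^2 + E_centered^2
obs  <- runif(100, 1, 100)
mod1 <- runif(100, 1, 100)
bias <- mean(mod1) - mean(obs)
## centered RMS error: subtract each vector's mean before differencing
E.centered <- sqrt(mean(((mod1 - mean(mod1)) - (obs - mean(obs)))^2))
## total (uncentered) RMS error
E.total <- sqrt(mean((mod1 - obs)^2))
all.equal(E.total^2, bias^2 + E.centered^2)
#> [1] TRUE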

Computing the Tukey median

I am trying to compute the data depth of two variables with the following function:
library(depth)
x <- data.frame(data$`math score`, data$`reading score`)
depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-8, ndir = 1000)
The first argument of depth is u, which stands for "Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations."
I have 1000 observations; however, I get the following error message:
Error in depth(1000, x, method = "Tukey", approx = FALSE, eps = 1e-08, :
Dimension mismatch between the data and the point u.
Does someone know how to solve this issue?
Thank you in advance!
If you look at the documentation for the function depth, it says:
u    Numerical vector whose depth is to be calculated. Dimension has to be the same as that of the observations.
So u has to be a point in multidimensional space represented by a vector with n components, whereas x has to be a matrix or data frame of m by n components, (m rows for m points). You are comparing u to all the other multidimensional points in the set x to find the minimum number of points that could share a half-space with u.
Let's create a simple example in two-dimensional space:
library(depth)
set.seed(100)
x <- data.frame(x = c(rnorm(10, -5, 2), rnorm(10, 5, 2)), y = rnorm(20, 0, 2))
plot(x)
The depth function calculates the depth of a particular point relative to the data. So let's use the origin:
u <- data.frame(x = 0, y = 0)
points(u, col = "red", pch = 16)
Naively we might think that the origin here has a depth of 10/20 points (i.e. the most obvious way to partition this dataset is a vertical line through the origin with 10 points on each side), but instead we find:
depth(u, x)
#> [1] 0.35
This indicates that there is a half-space including the origin that only contains 0.35 of the points, i.e. 7 points out of 20:
depth(u, x) * nrow(x)
#> [1] 7
And we can see that visually like this:
abline(0, -0.07)
points(x[x$y < (-0.07 * x$x),], col = "blue", pch = 16)
Where we have coloured these 7 points blue.
So it's not clear what result you expect from the depth function, but you will need to give it a value of c(math_score, reading_score) where math_score and reading_score are test values for which you want to know the depth.
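For example, a sketch of a corrected call might look like the following (assuming your data frame data really has the columns math score and reading score; the column names and the choice of the centroid as the query point are just illustrative):
library(depth)
x <- data.frame(math = data$`math score`, reading = data$`reading score`)
## u must be a single point with one value per column of x
u <- c(mean(x$math), mean(x$reading))  # e.g. the centroid of the data
depth(u, x, method = "Tukey", approx = FALSE, eps = 1e-8, ndir = 1000)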

Calculating the distance along a line that each point would intersect at

I would like to fit a line through two points from a random distribution of points, then calculate the location along that line at which each point intersects it orthogonally. I am not interested in the residual distance of each point from the line (points above/below the line are treated equally); I am only interested in the location along the line where that point would intersect it (e.g. points at different distances from the line but at the same orthogonal location would have the same value). The data aren't connected to the line explicitly, as the abline is drawn from the location of only 2 points, so I can't extract these values in a classic residual type way. I don't think this is difficult, but I can't wrap my head around how to calculate it and it's really bugging me!
I have explored the dist2d function, but that calculates the orthogonal distance of each point to the line. Is there a way to use that value to then calculate the hypotenuse from the data point to some fixed constant point on the line, and in turn calculate the adjacent distance from that constant? I would really appreciate any help!
#here is some example starter code here to visualise what I mean
#get random data
r = rnorm(100)
t = rnorm(100)
#bind and turn into a df
data = cbind(r,t)
data = as.data.frame(data)
head(data)
#plot
plot(data)
#want to draw abline between 2 points
#isolate points of interest
#here randomly select first two rows
d = data[c(1:2),]
head(d)
#calculate abline through selected points
lm = lm(t ~ r, d)
abline(lm)
#draw points to see which ones they cut through
points(d$r, d$t, bg = "red", pch = 21)
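Incidentally, the hypotenuse/adjacent idea from the question does work. A minimal sketch, assuming p0 is a fixed point on the line and d_orth is a point's orthogonal distance to the line (e.g. from dist2d):
## unsigned position along the line, measured from the fixed point p0
along_from_p0 <- function(px, py, p0, d_orth) {
  hyp <- sqrt((px - p0[1])^2 + (py - p0[2])^2)  # data point to the fixed point
  sqrt(hyp^2 - d_orth^2)                        # adjacent side of the right triangle
}
That said, the intersection-based approach below is more direct.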
The code below works.
# Create dataframe
data = data.frame(x = rnorm(100), y = rnorm(100))
plot(data, xlim=c(-3, 3), ylim=c(-3, 3))
# Select two points
data$x1_red <- data[1,1]; data$y1_red <- data[1,2]; data$x2_red <- data[2,1]; data$y2_red <- data[2,2];
points(data$x1_red, data$y1_red, bg = "red", pch = 21); points(data$x2_red, data$y2_red, bg = "red", pch = 21);
# Draw a red line through the two points
# Get its slope (m_red) and intercept (b_red)
data$m_red <- (data[2,2] - data[1,2]) / (data[2,1] - data[1,1])
data$b_red <- data$y1_red - data$m_red * data$x1_red
abline(data$b_red, data$m_red, col='red')
# Calculate the orthogonal slope
data$m_blue <- (-1/data$m_red)
abline(0, data$m_blue, col='blue')
# Solve for each point's b-intercept (if using the blue slope)
# y = m_blue * x + b
# b = y - m_blue * x
data$b <- data$y - data$m_blue * data$x
# Solve for where each point (using the m_blue slope) intersects the red line (x' and y')
# y' = m_blue * x' + b
# y' = m_red * x' + b_red
# Set those equations equal to each other and solve for x'
data$x_intersect <- (data$b_red - data$b) / (data$m_blue - data$m_red)
# Then solve for y'
data$y_intersect <- data$m_blue * data$x_intersect + data$b
# Calculate the distance between the point and where it intersects the red line
data$dist <- sqrt( (data$x - data$x_intersect)^2 + (data$y - data$y_intersect)^2 )
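If you want the position along the red line itself rather than the intersection coordinates, a scalar projection onto the line's direction vector gives it directly. A sketch, reusing the columns created above:
# signed distance along the red line, measured from the first red point
dx <- data$x2_red - data$x1_red
dy <- data$y2_red - data$y1_red
len <- sqrt(dx^2 + dy^2)
data$along <- ((data$x - data$x1_red) * dx + (data$y - data$y1_red) * dy) / len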

Identify all local extrema of a fitted smoothing spline via R function 'smooth.spline'

I have a 2-dimensional data set.
I use the R's smooth.spline function to smooth my points graph following an example in this article:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.smooth.spline.html
So that I get the spline graph similar to the green line on this picture
I'd like to know the X values, where the first derivative of the smoothing spline equals zero (to determine exact minimum or maximum).
My problem is that my initial dataset (or a dataset that I could auto-generate) to feed into the predict() function does not contain such exact X values that correspond to the smoothing spline extrema.
How can I find such X values?
Here is the picture of the first derivative of the green spline line above
But the X coordinates of the extrema are still only approximate.
My approximate R script to generate the pictures looks like the following
sp1 <- smooth.spline(df)
pred.prime <- predict(sp1, deriv=1)
pred.second <- predict(sp1, deriv=2)
d1 <- data.frame(pred.prime)
d2 <- data.frame(pred.second)
dfMinimums <- d1[abs(d1$y) < 1e-4, c('x','y')]
I think that there are two problems here: (1) you are using the original x-values, and they are spaced too far apart; and (2) because of the wide spacing of the x's, your threshold for where you consider the derivative "close enough" to zero is too high.
Here is basically your code but with many more x values and requiring smaller derivatives. Since you do not provide any data, I made a coarse approximation to it that should suffice for illustration.
## Coarse approximation of your data
x = runif(300, 0,45000)
y = sin(x/5000) + sin(x/950)/4 + rnorm(300, 0,0.05)
df = data.frame(x,y)
sp1 <- smooth.spline(df)
Spline code
Sx = seq(0,45000,10)
pred.spline <- predict(sp1, Sx)
d0 <- data.frame(pred.spline)
pred.prime <- predict(sp1, Sx, deriv=1)
d1 <- data.frame(pred.prime)
Mins = which(abs(d1$y) < mean(abs(d1$y))/150)
plot(df, pch=20, col="navy")
lines(sp1, col="darkgreen")
points(d0[Mins,], pch=20, col="red")
The extrema look pretty good.
plot(d1, type="l")
points(d1[Mins,], pch=20, col="red")
The points identified look like zeros of the derivative.
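If you need the locations more precisely than the prediction grid allows, one option (a sketch reusing sp1 and d1 from above) is to bracket each sign change of the derivative and polish it with uniroot:
## refine each approximate zero of the derivative with uniroot
f.prime <- function(z) predict(sp1, z, deriv = 1)$y
idx <- which(diff(sign(d1$y)) != 0)  # grid intervals where the derivative changes sign
roots <- sapply(idx, function(i) uniroot(f.prime, c(d1$x[i], d1$x[i + 1]))$root)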
You can use my R package SplinesUtils: https://github.com/ZheyuanLi/SplinesUtils, which can be installed by
devtools::install_github("ZheyuanLi/SplinesUtils")
The functions to be used are SmoothSplineAsPiecePoly and solve. I will just use the example from the documentation.
library(SplinesUtils)
## a toy dataset
set.seed(0)
x <- 1:100 + runif(100, -0.1, 0.1)
y <- poly(x, 9) %*% rnorm(9)
y <- y + rnorm(length(y), 0, 0.2 * sd(y))
## fit a smoothing spline
sm <- smooth.spline(x, y)
## coerce "smooth.spline" object to "PiecePoly" object
oo <- SmoothSplineAsPiecePoly(sm)
## plot the spline
plot(oo)
## find all stationary / saddle points
xs <- solve(oo, deriv = 1)
#[1] 3.791103 15.957159 21.918534 23.034192 25.958486 39.799999 58.627431
#[8] 74.583000 87.049227 96.544430
## predict the "PiecePoly" at stationary / saddle points
ys <- predict(oo, xs)
#[1] -0.92224176 0.38751847 0.09951236 0.10764884 0.05960727 0.52068566
#[7] -0.51029209 0.15989592 -0.36464409 0.63471723
points(xs, ys, pch = 19)
One caveat I found in the @G5W implementation is that it sometimes returns multiple records clustered around an extremum instead of a single one. On the diagram they cannot be seen, since they effectively all fall into one point.
The following snippet from here filters out single extrema points with the minimum value of the first derivative:
library(tidyverse)
df2 <- df %>%
  group_by(round(y, 4)) %>%
  filter(abs(d1) == min(abs(d1))) %>%
  ungroup() %>%
  select(-5)

Finding the parts of the ellipse from my data in R

Given a set of data I have calculated an ellipse fitted to them using the following command:
eli<-ellipse(cor(x,y),scale=c(sd(x),sd(y)), centre=c(mean(x), mean(y)), level = 0.95)
Where "x" and "y" are the columns of my bivariate data. I would like to know how to find the elements of my ellipse (in red), say: the foci and the a" and "b" values.
In an attempt to find the semi-axis distances I tried a linear regression of the data, but I truly doubt my method.
How can I find those parameters? Or get the equation of the ellipse?
Since ellipse generates 100 points, this approach may be accurate enough. Of course you could set npoints to higher value to increase accuracy. I've also made plots to explain.
#rm(list = ls()) #Remove everything from the environment
#Generate some points
set.seed(42)
x = rnorm(20,5,1)
y = rnorm(20,5,2)
#Fit Ellipse
require(ellipse)
eli = ellipse(cor(x,y),scale=c(sd(x),sd(y)), centre=c(mean(x), mean(y)), level = 0.95, npoints = 250)
#Draw ellipse and points
plot(eli[,1], eli[,2], type = "l", asp = 1)
points(x,y)
#Calculate the center of ellipse
eli_center = c(mean(eli[,1]), mean(eli[,2]))
#Plot eli_center
points(eli_center[1], eli_center[2], pch = 19, cex = 1.5)
#A function to calculate distance between points 'x1' and 'x2'
dist_2_points <- function(x1, x2) {
return(sqrt(sum((x1 - x2)^2)))
}
#Compute distance of each point in ellipse from eli_center
distance = numeric(0)
for (i in 1:nrow(eli)){
distance[i] = dist_2_points(eli_center, eli[i,])
}
#The maximum distance from eli_center is 'a'
a = distance[which.max(distance)]
a_point = eli[ which.max(distance), ]
#Draw 'a'
points(a_point[1],a_point[2], pch = 5)
lines(rbind(eli_center, a_point))
#The minimum distance from eli_center is 'b'
b = distance[which.min(distance)]
b_point = eli[ which.min(distance), ]
#Draw 'b'
points(b_point[1],b_point[2], pch = 5)
lines(rbind(eli_center, b_point))
#distance of each focus from the center: c = sqrt(a^2 - b^2)
foci = sqrt(a^2 - b^2)
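The foci themselves lie on the major axis, at that distance on either side of the center. A sketch reusing a_point and eli_center from above:
#foci lie along the major axis, c = sqrt(a^2 - b^2) away from the center
c_dist = sqrt(a^2 - b^2)
major_dir = (a_point - eli_center) / a  #unit vector along the major axis
focus1 = eli_center + c_dist * major_dir
focus2 = eli_center - c_dist * major_dir
points(rbind(focus1, focus2), pch = 3)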
This is the code that the car:::ellipse function uses after doing some error checking and other "housekeeping":
angles <- (0:segments) * 2 * pi/segments
unit.circle <- cbind(cos(angles), sin(angles))
Q <- chol(shape, pivot = TRUE)
order <- order(attr(Q, "pivot"))
ellipse <- t(center + radius * t(unit.circle %*% Q[, order]))
colnames(ellipse) <- c("x", "y")
You will notice that the regression line you drew was a bit "off-axis". If you drew the line from X regressed on Y it would also be "off-axis", in the other direction. Do a search on "total least squares regression" or "Deming regression" (you'll find some other names that I'm not coming up with off the top of my head). Regression lines determined by ordinary least squares do not go through the major axis of the ellipse that you are calculating.
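If you only need the major-axis direction, a total-least-squares-style sketch via principal components (reusing x and y from above) is:
#the major axis of the covariance ellipse follows the first principal component
pc = prcomp(cbind(x, y))
slope_major = pc$rotation[2, 1] / pc$rotation[1, 1]
abline(a = mean(y) - slope_major * mean(x), b = slope_major, lty = 2)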

Spatial correlogram using the raster package

Dear Crowd
Problem
I tried to calculate a spatial correlogram with the packages ncf, pgirmess, SpatialPack and spdep. However, I had trouble defining the start and end points of the distances. I'm only interested in the spatial autocorrelation at smaller distances, but with smaller bins there. Additionally, as the raster is quite large (1.8 megapixels), I ran into memory trouble with all of these packages except SpatialPack.
So I tried to produce my own code, using the function Moran from the package raster. But I must have some error, as the result for the complete dataset is somewhat different from the one from the other packages. If there is no error in my code, it might at least help others with similar problems.
Question
I'm not sure whether my focal matrix is erroneous. Could you please tell me whether the central pixel needs to be incorporated? Using the test data I can't show the differences between the methods, but on my complete dataset there are visible differences, as shown in the image below. However, the bins are not exactly the same (50 m vs. 69 m), so this might explain part of the differences; at the first bin, though, this explanation does not seem plausible to me. Or might the irregular shape of my raster, and different ways of handling NAs, cause the difference?
Comparison of Own method with the one from SpatialPack
Runable Example
Testdata
The code for calculating the testdata is taken from http://www.petrkeil.com/?p=1050#comment-416317
# packages used for the data generation
library(raster)
library(vegan) # will be used for PCNM
# empty matrix and spatial coordinates of its cells
side=30
my.mat <- matrix(NA, nrow=side, ncol=side)
x.coord <- rep(1:side, each=side)*5
y.coord <- rep(1:side, times=side)*5
xy <- data.frame(x.coord, y.coord)
# all pairwise euclidean distances between the cells
xy.dist <- dist(xy)
# PCNM axes of the dist. matrix (from 'vegan' package)
pcnm.axes <- pcnm(xy.dist)$vectors
# using the 8th PCNM axis as my artificial z variable
z.value <- pcnm.axes[,8]*200 + rnorm(side*side, 0, 1)
# plotting the artificial spatial data
r <- rasterFromXYZ(xyz = cbind(xy,z.value))
plot(r, axes=F)
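As an aside on the central-pixel question above: you can inspect the focal weight matrix directly. focalWeight gives the central cell a positive weight, while (if I read the raster documentation correctly) the default weight matrix of Moran has a zero at the center, since Moran's I conventionally excludes the focal cell; a sketch that zeroes it out and re-normalizes:
w <- focalWeight(r, d = 10, type = 'circle')
mid <- ceiling(dim(w)/2)
w[mid[1], mid[2]]       # weight of the central pixel as produced by focalWeight
w[mid[1], mid[2]] <- 0  # exclude the focal pixel from its own neighbourhood
w <- w / sum(w)         # re-normalize so the weights sum to 1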
Own Code
library(raster)
sp.Corr <- matrix(nrow = 0, ncol = 2)
formerBreak <- 0 # important for the first run
for (i in seq(10, 200, 10)) # calculate Moran's I for these bins
{
  cat(paste0("..", i)) # print the bin currently being calculated
  w <- focalWeight(r, d = i, type = 'circle')
  wTemp <- w # temporarily save the weight matrix
  if (formerBreak > 0) # from the second run onwards
  {
    midpoint <- ceiling(ncol(w)/2) # get the midpoint
    ## set the previous focal weights to 0
    w[(midpoint - formerBreak):(midpoint + formerBreak),
      (midpoint - formerBreak):(midpoint + formerBreak)] <-
      w[(midpoint - formerBreak):(midpoint + formerBreak),
        (midpoint - formerBreak):(midpoint + formerBreak)] * (wOld == 0)
    w <- w * (1/sum(w)) # normalize so the weights sum to 1
  }
  wOld <- wTemp # save this weight matrix for the next run
  mor <- Moran(r, w = w)
  sp.Corr <- rbind(sp.Corr, c(Moran = mor, Distance = i))
  formerBreak <- i/res(r)[1] # divide the break by the raster resolution to translate it to the focal window
}
plot(x = sp.Corr[,2], y = sp.Corr[,1], type = "l", ylab = "Moran's I", xlab = "Upper bound of distance")
Other methods to calculate the Spatial Correlogram
library(SpatialPack)
sp.Corr <- summary(modified.ttest(z.value, z.value, coords = xy, nclass = 21))
plot(x = sp.Corr$coef[,1], y = sp.Corr$coef[,4], type = "l", ylab = "Moran's I", xlab = "Upper bound of distance")
library(ncf)
ncf.cor <- correlog(x.coord, y.coord, z.value,increment=10, resamp=1)
plot(ncf.cor)
In order to compare the results of the correlograms, two things should be considered in your case. (i) Your code only works for bins proportional to the resolution of your raster; a small difference in the bins could then include or exclude a substantial number of pairs. (ii) The irregular shape of the raster has a strong impact on the pairs that are considered when computing the correlation for a certain distance interval. So your code should deal with both: allow any value for the bin length, and take the irregular shape of the raster into account. A small modification of your code to tackle those problems is below.
# SpatialPack correlation
library(SpatialPack)
test <- modified.ttest(z.value,z.value,coords = xy,nclass = 21)
# Own correlation
bins <- test$upper.bounds
library(raster)
sp.Corr <- matrix(nrow = 0,ncol = 2)
for (i in bins) {
  cat(paste0("..", i)) # print the bin currently being calculated
  w <- focalWeight(r, d = i, type = 'circle')
  wTemp <- w # temporarily save the weight matrix
  if (i > bins[1]) {
    midpoint <- ceiling(dim(w)/2) # midpoint of the current weight matrix
    half_range <- floor(dim(wOld)/2)
    ## set the focal weights covered by the previous (smaller) window to 0
    w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
      (midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])] <-
      w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
        (midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])] * (wOld == 0)
    w <- w * (1/sum(w)) # re-normalize so the weights sum to 1
  }
  wOld <- wTemp # save this weight matrix for the next run
  mor <- Moran(r, w = w)
  sp.Corr <- rbind(sp.Corr, c(Moran = mor, Distance = i))
}
# Comparing
plot(x=test$upper.bounds, test$imoran[,1], col = 2,type = "b",ylab = "Moran's I",xlab="Upper bound of distance", lwd = 2)
lines(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
points(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
legend('topright', legend = c('SpatialPack', 'Own code'), col = 2:3, lty = 1, lwd = 2:1)
The image shows that the results from the SpatialPack package and from the modified code are the same.
