Extracting the Linear Discriminant Equation in R

So I have this data and I would like to extract the coefficients from the equation it produces. That way I would be able to plug in a new data point and see where it would be placed.
library(MASS)
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
Sp = rep(c("s","c","v"), rep(50,3)))
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your answer may differ
## c s v
## 22 23 30
z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
I know I can get this:
> z
Call:
lda(Sp ~ ., data = Iris, prior = c(1, 1, 1)/3, subset = train)
Prior probabilities of groups:
c s v
0.3333333 0.3333333 0.3333333
Group means:
Sepal.L. Sepal.W. Petal.L. Petal.W.
c 5.969231 2.753846 4.311538 1.3384615
s 5.075000 3.541667 1.500000 0.2583333
v 6.700000 2.936000 5.552000 1.9880000
Coefficients of linear discriminants:
LD1 LD2
Sepal.L. -0.5458866 0.5215937
Sepal.W. -1.5312824 1.7891248
Petal.L. 1.8087255 -1.2637188
Petal.W. 2.8620894 3.2868849
Proportion of trace:
LD1 LD2
0.9893 0.0107
but is there a way to get just the equation so I would not have to calculate the new observation by hand?

Just turning this into an answer: you need predict(). The predict.lda method in the MASS package has your exact example on its help page:
tr <- sample(1:50, 25)
train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])
test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
z <- lda(train, cl)
predict(z, test)$class

The default method is "plug-in", so this is the relevant code from MASS:::predict.lda. Here object is the fitted lda object and x comes from the newdata argument, converted to a matrix:
# snipped preamble and error checking
means <- colSums(prior * object$means)
scaling <- object$scaling
x <- scale(x, center = means, scale = FALSE) %*% scaling
dm <- scale(object$means, center = means, scale = FALSE) %*% scaling
method <- match.arg(method)
dimen <- if (missing(dimen)) length(object$svd) else min(dimen, length(object$svd))
N <- object$N
if (method == "plug-in") {
  dm <- dm[, 1L:dimen, drop = FALSE]
  dist <- matrix(0.5 * rowSums(dm^2) - log(prior), nrow(x),
                 length(prior), byrow = TRUE) -
          x[, 1L:dimen, drop = FALSE] %*% t(dm)
  dist <- exp(-(dist - apply(dist, 1L, min, na.rm = TRUE)))
}
# snipped two other methods
posterior <- dist/drop(dist %*% rep(1, ng))
This is mostly put in to demonstrate why Gregor's answer is the most sensible approach. Trying to pull out an "equation" seems unfruitful. (I can remember using the results of linear regression for such an exercise in my first-year regression class in grad school.)
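That said, if one really wants the "equation", the same plug-in steps can be applied by hand. A minimal sketch, assuming the z, test and cl objects from the example above (this is only the arithmetic, not a recommendation over predict()):
z.prior <- z$prior
z.mean  <- colSums(z.prior * z$means)                                   # prior-weighted overall mean
x.ld    <- scale(test, center = z.mean, scale = FALSE) %*% z$scaling    # LD scores of new data
dm      <- scale(z$means, center = z.mean, scale = FALSE) %*% z$scaling # group means in LD space
d       <- matrix(0.5 * rowSums(dm^2) - log(z.prior), nrow(x.ld),
                  length(z.prior), byrow = TRUE) - x.ld %*% t(dm)
cls     <- factor(z$lev[apply(d, 1, which.min)], levels = levels(cl))   # smallest d wins
all.equal(cls, predict(z, test)$class)                                  # should be TRUE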


Misuse predict.rq in the package quantreg?

I am using the quantreg package to predict new data based on a training set. However, I noticed a discrepancy between predict.rq (or predict) and doing it manually. Here is an example:
The quantile regression setting is
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
The new data set I want to predict is
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
I use predict.rq or predict to predict newdata. Both return the same result:
fit_use_predict <- predict.rq( fit, newdata = as.data.frame(newdata) )
Also I manually do the prediction based on the coefficients matrix:
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
I expected both to be numerically identical, but they are not:
diff <- fit_use_predict - fit_use_multiplication
print(diff)
The differences are not negligible.
However, predicting the original data set X, both return the same result, i.e.,
predict(fit, newdata = data.frame(X)) = X %*% coef_mat ## True
Am I missing something when using the function? Thanks!
A more serious problem here, before we get to prediction, is that the model forces all of the fitted quantile functions through the origin of design space, and since the covariates are centered at the origin, all of the quantile functions are forced to cross there. Even if the X's all lie in the positive orthant, it is quite a strong assumption to say that the distribution of the response is degenerate at the origin.
I think you just have to retain the 'X' name in your data as it was in the training data.
library(quantreg)
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
fit_use_predict <- predict.rq( fit, newdata = data.frame(X=I(newdata)) )
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
diff <- fit_use_predict - fit_use_multiplication
max( abs(diff) )
Output is 0
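The likely reason the mismatch happens in the first place is that with y ~ X - 1 the only variable in the formula is the matrix X, so when newdata has no column named X, model.frame falls back to the X in the calling environment and the training data are reused. An alternative sketch, assuming the objects from the question: fit on a data frame with named columns so that newdata is matched by name.
## sketch: named columns instead of a matrix in the formula
dat  <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2], x3 = X[, 3])
fit2 <- rq(y ~ x1 + x2 + x3 - 1, tau = tauList, data = dat, method = "fn")
nd   <- data.frame(x1 = newdata[, 1], x2 = newdata[, 2], x3 = newdata[, 3])
max(abs(predict(fit2, newdata = nd) - newdata %*% coef(fit2)))  # should be ~0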

Coverage probability calculation for LM

I am trying to calculate the coverage probability for a set of residual bootstrap replicates I generated for the intercept and slope of a regression. Can anyone show me how to calculate the coverage probability of the confidence intervals? Many thanks.
Note that I ran the regression manually using a QR decomposition, but you can use lm() if that's easier. I just thought doing it manually would be faster.
set.seed(42) ## for sake of reproducibility
n <- 100
x <- rnorm(n)
e <- rnorm(n)
y <- as.numeric(50 + 25*x + e)
dd <- data.frame(id=1:n, x=x, y=y)
mo <- lm(y ~ x, data=dd)
# Manual Residual Bootstrap
resi <- residuals(mo)
fit <- fitted(mo)
ressampy <- function() fit + sample(resi, length(resi), replace=TRUE)
# Sample y values:
head(ressampy())
# QR decomposition of X values
qrX <- qr(cbind(Intercept=1, dd[, "x", drop=FALSE]), LAPACK=TRUE)
# faster than lm()
qr.coef(qrX, dd[, "y"])
# One Bootstrap replication
boot1 <- qr.coef(qrX, ressampy())
# 1000 bootstrap replications
boot <- t(replicate(1000, qr.coef(qrX, ressampy())))
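One simple way to turn these replicates into intervals, as a stepping stone toward coverage, is the percentile method; a minimal sketch using the boot matrix above:
## 95% percentile bootstrap CIs from the replicate matrix `boot`
ci.boot <- apply(boot, 2, quantile, probs = c(0.025, 0.975))
ci.boot
## coverage of *these* intervals for the known true values (intercept 50, slope 25);
## the coverage *probability* repeats this check over many simulated data sets
c(intercept = ci.boot[1, 1] <= 50 & 50 <= ci.boot[2, 1],
  x         = ci.boot[1, 2] <= 25 & 25 <= ci.boot[2, 2])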
EDIT
Incorporating jay.sf's answer, I rewrote the code to run the lm() method and compared the first and second approaches to calculating coverage probability from the link shared by jay.sf:
library(lmtest);library(sandwich)
ci <- confint(coeftest(mo, vcov.=vcovHC(mo, type="HC3")))
ci
FUNInter <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X-1)
  confint(bootmod, "X(Intercept)", level = 0.95)
}
FUNBeta <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X-1)
  confint(bootmod, "Xx", level = 0.95)
}
set.seed(42)
R <- 1000
Interres <- replicate(R, FUNInter(), simplify=FALSE)
Betares <- replicate(R, FUNBeta(), simplify=FALSE)
ciinter <- t(sapply(Interres, function(x, y) x[grep(y, rownames(x)), ], "X\\(Intercept\\)"))
cibeta <- t(sapply(Betares, function(x, y) x[grep(y, rownames(x)), ], "Xx"))
#second approach of calculating CP
sum(ciinter[,"2.5 %"] <=50 & 50 <= ciinter[,"97.5 %"])/R
[1] 0.842
sum(cibeta[,"2.5 %"] <=25 & 25 <= cibeta[,"97.5 %"])/R
[1] 0.945
#first approach of calculating CP
sum(apply(ciinter, 1, function(x) {
all(data.table::between(x, ci[1,1], ci[1,2]))
}))/R
[1] 0.076
sum(apply(cibeta, 1, function(x) {
all(data.table::between(x, ci[2,1], ci[2,2]))
}))/R
[1] 0.405
According to Morris et al. (2019), Table 6, the coverage probability is defined as how often the true theta lies within a bootstrapped confidence interval (CI), i.e. the CIs of the model applied to many samples based on the actual data, or, in other words, new experiments.
Hence, we want to compute CIs based on OP's proposed i.i.d. bootstrap R times and calculate the ratio of how often theta is or is not in these CIs.
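In symbols, with CI_r = [lwr_r, upr_r] denoting the interval from the r-th replication, the estimated coverage probability is (1/R) * sum over r of 1(lwr_r <= theta <= upr_r), i.e. the share of the R intervals that contain the true theta.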
First, we estimate our model mo using the actual data.
mo <- lm(y ~ x)
To avoid unnecessarily unpacking the fitted values yhat, the residuals u, the model matrix X, and the true coefficients theta in every replication, we extract them beforehand.
yhat <- mo$fitted.values
u <- as.matrix(mo$residuals)
X <- model.matrix(mo)
theta <- c(50, 25) ## known from data generating process of simulation
In a bootstrap function FUN we wrap all the steps we want to do in one replication. In order to use the very fast .lm.fit, we have to calculate the White standard errors manually (identical to lmtest::coeftest(fit, vcov.=sandwich::vcovHC(fit, type="HC1"))).
FUN <- function() {
  ## resampling residuals
  y.star <- yhat + sample(u, length(u), replace=TRUE)
  ## refit model
  fit <- .lm.fit(X, y.star)
  coef <- fit$coefficients[sort.list(fit$pivot)]
  ## alternatively using QR, but `.lm.fit` is slightly faster
  # qrX <- qr(X, LAPACK=TRUE)
  # coef <- qr.coef(qrX, y.star)
  ## white standard errors
  v.cov <- chol2inv(chol(t(X) %*% X))
  meat <- t(X) %*% diag(diag(u %*% t(u))) %*% X
  ## degrees of freedom adjust (HC1)
  d <- dim(X)
  dfa <- d[1] / (d[1] - d[2])
  white.se <- sqrt(diag(v.cov %*% meat %*% v.cov)*dfa)
  ## 95% CIs
  ci <- coef + qt(1 - .025, d[1] - d[2])*white.se %*% t(c(-1, 1))
  ## coverage
  c(intercept=theta[1] >= ci[1, 1] & theta[1] <= ci[1, 2],
    x=theta[2] >= ci[2, 1] & theta[2] <= ci[2, 2])
}
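As a quick sanity check, the manual HC1 pieces used inside FUN can be compared against sandwich on the original fit mo; a short sketch assuming the mo, X and u objects from above:
## manual HC1 standard errors for mo should match sandwich's vcovHC(type = "HC1")
bread <- chol2inv(chol(t(X) %*% X))            # (X'X)^{-1}
meat0 <- t(X) %*% diag(as.vector(u)^2) %*% X   # X' diag(u_i^2) X
dfa0  <- nrow(X) / (nrow(X) - ncol(X))         # HC1 degrees-of-freedom factor
sqrt(diag(bread %*% meat0 %*% bread) * dfa0)
sqrt(diag(sandwich::vcovHC(mo, type = "HC1"))) # same values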
Now we execute the bootstrap using replicate.
R <- 5e3
set.seed(42)
system.time(res <- t(replicate(R, FUN())))
# user system elapsed
# 71.19 28.25 100.28
head(res, 3)
# intercept x
# [1,] TRUE TRUE
# [2,] FALSE TRUE
# [3,] TRUE TRUE
The mean of rows in which both columns are TRUE simultaneously, or the mean of each column separately, gives the coverage probability we are looking for.
(cp.t <- mean(rowSums(res) == ncol(res))) ## coverage probability total
(cp.i <- colMeans(res)) ## coverage probability individual coefs
(cp <- c(total=cp.t, cp.i))
# total intercept x
# 0.8954 0.9478 0.9444
## values with other R:
# total intercept x
# 0.90700 0.95200 0.95200 ## R == 1k
# 0.89950 0.95000 0.94700 ## R == 2k
# 0.89540 0.94780 0.94440 ## R == 5k
# 0.89530 0.94570 0.94680 ## R == 10k
# 0.89722 0.94694 0.94777 ## R == 100k
And this is how it looks after 100k repetitions (plot not reproduced here).
Code for plot:
r1 <- sapply(seq(nrow(res)), \(i) mean(rowSums(res[1:i,,drop=FALSE]) == ncol(res)))
r2 <- t(sapply(seq(nrow(res)), \(i) colMeans(res[1:i,,drop=FALSE])))
r <- cbind(r1, r2)
matplot(r, type='l', col=2:4, lty=1, main='coverage probability', xlab='R',
ylab='cum. mean',ylim=c(.89, .955))
grid()
sapply(seq(cp), \(i) abline(h=cp[i], lty=2, col=i + 1))
legend('right', col=2:4, lty=1, legend=names(cp), bty='n')
Data:
set.seed(42)
n <- 1e3
x <- rnorm(n)
y <- 50 + 25*x + rnorm(n)

Fitting a probit model in R

For my thesis I have to fit some GLM models with MLEs that R doesn't have built in. I was doing fine for the models with a closed form, but now I have to use the Gaussian CDF, so I decided to fit a simple probit model.
This is the code:
Data:
set.seed(123)
x <-matrix( rnorm(50,2,4),50,1)
m <- matrix(runif(50,2,4),50,1)
t <- matrix(rpois(50,0.5),50,1)
z <- (1+exp(-((x-mean(x)/sd(x)))))^-1 + runif(50)
y <- ifelse(z < 1.186228, 0, 1)
data1 <- as.data.frame(cbind(y,x,m,t))
myprobit <- function (formula, data)
{
  mf <- model.frame(formula, data)
  y <- model.response(mf, "numeric")
  X <- model.matrix(formula, data = data)
  if (any(is.na(cbind(y, X))))
    stop("Some data are missing.")
  loglik <- function(betas, X, y, sigma) { # log-likelihood
    p <- length(betas)
    beta <- betas[-p]
    eta <- X %*% beta
    sigma <- 1 # because of identification, sigma must be equal to 1
    G <- pnorm(y, mean = eta, sd = sigma)
    sum( y*log(G) + (1-y)*log(1-G))
  }
  ls.reg <- lm(y ~ X - 1) # starting values using OLS; the model already has a constant
  start <- coef(ls.reg)
  fit <- optim(start, loglik, X = X, y = y, control = list(fnscale = -1),
               method = "BFGS", hessian = TRUE) # optimize
  if (fit$convergence > 0) {
    print(fit)
    stop("optim failed to converge!") # verify convergence
  }
  return(fit)
}
myprobit(y ~ x + m + t,data = data1)
And I get: Error in X %*% beta : non-conformable arguments. If I change start <- coef(ls.reg) to start <- c(coef(ls.reg), 1), I get wrong estimates compared with:
probit <- glm(y ~ x + m + t,data = data1 , family = binomial(link = "probit"))
What am I doing wrong?
Is it possible to correctly fit this model using pnorm? If not, what algorithm should I use to approximate the Gaussian CDF? Thanks!!
The line of code responsible for your error is the following:
eta <- X %*% beta
Note that "%*%" is the matrix multiplication operator. By reproducing your code I noticed that X is a matrix with 50 rows and 4 columns. Hence, for matrix multiplication to be possible your "beta" needs to have 4 rows. But when you run "betas[-p]" you subset the betas vector by removing its last element, leaving only three elements instead of the four you need for matrix multiplication to be defined. If you remove [-p] the code will work.

Predict using felm output with standard errors

Is there a way to get predict behavior with standard errors from lfe::felm if the fixed effects are swept out using the projection method in felm? This question is very similar to the question here, but none of the answers to that question can be used to estimate standard errors or confidence/prediction intervals. I know that there's currently no predict.felm, but I am wondering if there are workarounds similar to those linked above that might also work for estimating the prediction interval.
library(DAAG)
library(lfe)
model1 <- lm(data = cps1, re74 ~ age + nodeg + marr)
predict(model1, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T, interval="prediction")$fit
# Result: fit lwr upr
# 1 18436.18 2339.335 34533.03
model2 <- felm(data = cps1, re74 ~ age | nodeg + marr)
predict(model2, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T, interval="prediction")$fit
# Does not work
The goal is to estimate a prediction interval for yhat, for which I think I'd need to compute the full variance-covariance matrix (including the fixed effects). I haven't been able to figure out how to do this, and I'm wondering if it's even computationally feasible.
After conversations with several people, I don't believe it is possible to obtain an estimate of the distribution of yhat = Xb (where X includes both the covariates and the fixed effects) directly from felm, which is what this question boils down to. It is possible to bootstrap it, however. The following code does so in parallel. There is scope for performance improvements, but this gives the general idea.
Note: here I do not compute the full prediction interval, just the SEs on Xb, but obtaining the prediction interval is straightforward: combine sigma^2 with the squared SE and take the square root, i.e. sqrt(se^2 + sigma^2).
library(DAAG)
library(lfe)
library(parallel)
model1 <- lm(data = cps1, re74 ~ age + nodeg + marr)
yhat_lm <- predict(model1, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T)
set.seed(42)
boot_yhat <- function(b) {
  print(b)
  n <- nrow(cps1)
  boot <- cps1[sample(1:n, n, replace=T),]
  lm.model <- lm(data=demeanlist(boot[, c("re74", "age")],
                                 list(factor(boot$nodeg), factor(boot$marr))),
                 formula = re74 ~ age)
  fe <- getfe(felm(data = boot, re74 ~ age | nodeg + marr))
  bootResult <- predict(lm.model, newdata = data.frame(age = 40)) +
    fe$effect[fe$fe == "nodeg" & fe$idx==0] +
    fe$effect[fe$fe == "marr" & fe$idx==1]
  return(bootResult)
}
B = 1000
yhats_boot <- mclapply(1:B, boot_yhat)
plot(density(rnorm(10000, mean=yhat_lm$fit, sd=yhat_lm$se.fit)))
lines(density(unlist(yhats_boot)), col="red")
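If you also want an SE or interval for Xb from those draws, a short follow-up sketch (assuming yhats_boot from above):
## summarize the bootstrap draws of Xb
yh <- unlist(yhats_boot)
sd(yh)                          # bootstrap SE of Xb
quantile(yh, c(0.025, 0.975))   # 95% percentile interval for Xb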
From your first model predict(.) yields this:
# fit lwr upr
# 1 18436.18 2339.335 34533.03
Following 李哲源 we can achieve these results manually, too.
beta.hat.1 <- coef(model1) # save coefficients
# model matrix: age=40, nodeg = 0, marr=1:
X.1 <- cbind(1, matrix(c(40, 0, 1), ncol=3))
pred.1 <- as.numeric(X.1 %*% beta.hat.1) # prediction
V.1 <- vcov(model1) # save var-cov matrix
se2.1 <- unname(rowSums((X.1 %*% V.1) * X.1)) # prediction var
alpha.1 <- qt((1-0.95)/2, df = model1$df.residual) # 5 % level
pred.1 + c(alpha.1, -alpha.1) * sqrt(se2.1) # 95%-CI
# [1] 18258.18 18614.18
sigma2.1 <- sum(model1$residuals ^ 2) / model1$df.residual # sigma.sq
PI.1 <- pred.1 + c(alpha.1, -alpha.1) * sqrt(se2.1 + sigma2.1) # prediction interval
matrix(c(pred.1, PI.1), nrow = 1, dimnames = list(1, c("fit", "lwr", "upr")))
# fit lwr upr
# 1 18436.18 2339.335 34533.03
Now, applying your linked example to multiple FEs, we get these results:
lm.model <- lm(data=demeanlist(cps1[, c(8, 2)],
list(as.factor(cps1$nodeg),
as.factor(cps1$marr))), re74 ~ age)
fe <- getfe(model2)
predict(lm.model, newdata = data.frame(age = 40)) + fe$effect[fe$idx=="1"]
# [1] 15091.75 10115.21
The first value is with and the second without added FE (try fe$effect[fe$idx=="1"]).
Now we're following the manual approach above.
beta.hat <- coef(model2) # coefficient
x <- 40 # age = 40
pred <- as.numeric(x %*% beta.hat) # prediction
V <- model2$vcv # var/cov
se2 <- unname(rowSums((x %*% V) * x)) # prediction var
alpha <- qt((1-0.95)/2, df = model2$df.residual) # 5% level
pred + c(alpha, -alpha) * sqrt(se2) # CI
# [1] 9599.733 10630.697
sigma2 <- sum(model2$residuals ^ 2) / model2$df.residual # sigma^2
PI <- pred + c(alpha, -alpha) * sqrt(se2 + sigma2) # PI
matrix(c(pred, PI), nrow = 1, dimnames = list(1, c("fit", "lwr", "upr"))) # output
# fit lwr upr
# 1 10115.21 -5988.898 26219.33
As we see, the fit is the same as with the linked example approach, but now with a prediction interval. (Disclaimer: The logic of the approach should be straightforward; the values of the PI should still be evaluated, e.g. in Stata with reghdfe.)
Edit: In case you want to achieve exactly the same output from felm() that predict.lm() yields with the linear model1, you simply need to "include" the fixed effects in your model again (see model3 below), then follow the same approach. For more convenience you could easily wrap it into a function.
library(DAAG)
library(lfe)
model3 <- felm(data = cps1, re74 ~ age + nodeg + marr)
pv <- c(40, 0, 1) # prediction x-values
predict0.felm <- function(mod, pv.=pv) {
  beta.hat <- coef(mod)                               # coefficient
  x <- cbind(1, matrix(pv., ncol=3))                  # prediction vector
  pred <- as.numeric(x %*% beta.hat)                  # prediction
  V <- mod[['vcv']]                                   # var/cov
  se2 <- unname(rowSums((x %*% V) * x))               # prediction var
  alpha <- qt((1-0.95)/2, df = mod[['df.residual']])  # 5% level
  CI <- structure(pred + c(alpha, -alpha) * sqrt(se2),
                  names=c("CI lwr", "CI upr"))        # CI
  sigma2 <- sum(mod[['residuals']] ^ 2) / mod[['df.residual']]  # sigma^2
  PI <- pred + c(alpha, -alpha) * sqrt(se2 + sigma2)  # PI
  mx <- matrix(c(pred, PI), nrow = 1,
               dimnames = list(1, c("PI fit", "PI lwr", "PI upr")))  # output
  list(CI, mx)
}
predict0.felm(model3)[[2]]
# PI fit PI lwr PI upr
# 1 18436.18 2339.335 34533.03
In this way you can achieve the same prediction interval with felm() as with predict.lm().

Specifying a correlation structure for a linear mixed model using the ramps package in R

I am trying to create a linear mixed model (LMM) that allows for spatial correlation between points (I have lat/long for each point). I would like the spatial correlation to be based upon the great circular distance between points.
The package ramps includes a correlation structure that computes the ‘haversine’ distance – although I am having trouble implementing it. I have previously used other correlation structures (corGaus, corExp) and not had any difficulties. I am assuming the corRGaus with the 'haversine' metric can be implemented in the same way.
I am able to successfully create an lmm with spatial correlation calculated on a planar distance using the lme function.
I am also able to create a linear model (not mixed) with spatial correlation calculated using great circular distance although there are errors with the correlation structure using the gls command.
When trying to use the gls command for a linear model with the great circular distance, I get the following errors:
x = runif(20, 1,50)
y = runif(20, 1,50)
gls(x ~ y, cor = corRGaus(form = ~ x + y))
Generalized least squares fit by REML
Model: x ~ y
Data: NULL
Log-restricted-likelihood: -78.44925
Coefficients:
(Intercept) y
24.762656602 0.007822469
Correlation Structure: corRGaus
Formula: ~x + y
Parameter estimate(s):
Error in attr(object, "fixed") && unconstrained :
invalid 'x' type in 'x && y'
When I increase the size of the data there are memory allocation errors (still a very small dataset):
x = runif(100, 1, 50)
y = runif(100, 1, 50)
lat = runif(100, -90, 90)
long = runif(100, -180, 180)
gls(x ~ y, cor = corRGaus(form = ~ x + y))
Error in glsEstimate(glsSt, control = glsEstControl) :
'Calloc' could not allocate memory (18446744073709551616 of 8 bytes)
When trying to run a mixed model using the lme command and corRGaus from the ramps package, the following results occur:
x = runif(100, 1, 50)
y = runif(100, 1, 50)
LC = c(rep(1, 50) , rep(2, 50))
lat = runif(100, -90, 90)
long = runif(100, -180, 180)
lme(x ~ y,random = ~ y|LC, cor = corRGaus(form = ~ long + lat))
Error in `coef<-.corSpatial`(`*tmp*`, value = value[parMap[, i]]) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In nlminb(c(coef(lmeSt)), function(lmePars) -logLik(lmeSt, lmePars), :
NA/NaN function evaluation
2: In nlminb(c(coef(lmeSt)), function(lmePars) -logLik(lmeSt, lmePars), :
NA/NaN function evaluation
I am unsure about how to proceed with this method. The "haversine" metric is what I want to use to complete my models, but I am having trouble implementing it. There are very few questions anywhere about the ramps package, and I have seen very few implementations. Any help would be greatly appreciated.
I have previously attempted to modify the nlme package and was unable to do so. I posted a question about this, where I was recommended to use the ramps package.
I am using R 3.0.0 on a Windows 8 computer.
OK, here is an option that implements various spatial correlation structures in gls/nlme with haversine distance.
The various corSpatial-type classes already have machinery in place to construct a correlation matrix from spatial covariates, given a distance metric. Unfortunately, dist does not implement haversine distance, and dist is the function called by corSpatial to compute a distance matrix from the spatial covariates.
The distance matrix computations are performed in getCovariate.corSpatial. A modified form of this method will pass the appropriate distance to other methods, and the majority of methods will not need to be modified.
Here, I create a new corStruct class, corHaversine, and modify only getCovariate and one other method (Dim) that determines which correlation function is used. Those methods which do not need modification are copied from the equivalent corSpatial methods. The (new) mimic argument in corHaversine takes the name of the spatial class with the correlation function of interest: by default, it is set to "corSpher".
Caveat: beyond ensuring that this code runs for spherical and Gaussian correlation functions, I haven't really done a lot of checking.
#### corHaversine - spatial correlation with haversine distance
# Calculates the geodesic distance between two points specified by radian latitude/longitude using Haversine formula.
# output in km
haversine <- function(x0, x1, y0, y1) {
  a <- sin( (y1 - y0)/2 )^2 + cos(y0) * cos(y1) * sin( (x1 - x0)/2 )^2
  v <- 2 * asin( min(1, sqrt(a) ) )
  6371 * v
}
# function to compute geodesic haversine distance given a two-column matrix of longitude/latitude
# input is assumed to be in decimal degrees if radians = F
# note fields::rdist.earth is more efficient
haversineDist <- function(xy, radians = F) {
  if (ncol(xy) > 2) stop("Input must have two columns (longitude and latitude)")
  if (radians == F) xy <- xy * pi/180
  hMat <- matrix(NA, ncol = nrow(xy), nrow = nrow(xy))
  for (i in 1:nrow(xy) ) {
    for (j in i:nrow(xy) ) {
      hMat[j,i] <- haversine(xy[i,1], xy[j,1], xy[i,2], xy[j,2])
    }
  }
  as.dist(hMat)
}
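As a quick sanity check of haversineDist, using approximate coordinates for London (lon -0.13, lat 51.51) and Paris (lon 2.35, lat 48.86), whose great-circle distance is roughly 344 km:
## haversineDist expects a two-column matrix of (lon, lat) in decimal degrees
haversineDist(rbind(c(-0.13, 51.51), c(2.35, 48.86)))
## returns a dist object with a single entry of about 344 (km)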
## for most methods, machinery from corSpatial will work without modification
Initialize.corHaversine <- nlme:::Initialize.corSpatial
recalc.corHaversine <- nlme:::recalc.corSpatial
Variogram.corHaversine <- nlme:::Variogram.corSpatial
corFactor.corHaversine <- nlme:::corFactor.corSpatial
corMatrix.corHaversine <- nlme:::corMatrix.corSpatial
coef.corHaversine <- nlme:::coef.corSpatial
"coef<-.corHaversine" <- nlme:::"coef<-.corSpatial"
## Constructor for the corHaversine class
corHaversine <- function(value = numeric(0), form = ~ 1, mimic = "corSpher",
                         nugget = FALSE, fixed = FALSE) {
  spClass <- "corHaversine"
  attr(value, "formula") <- form
  attr(value, "nugget") <- nugget
  attr(value, "fixed") <- fixed
  attr(value, "function") <- mimic
  class(value) <- c(spClass, "corStruct")
  value
} # end corHaversine class
environment(corHaversine) <- asNamespace("nlme")
Dim.corHaversine <- function(object, groups, ...) {
  if (missing(groups)) return(attr(object, "Dim"))
  val <- Dim.corStruct(object, groups)
  val[["start"]] <- c(0, cumsum(val[["len"]] * (val[["len"]] - 1)/2)[-val[["M"]]])
  ## will use third component of Dim list for spClass
  names(val)[3] <- "spClass"
  val[[3]] <- match(attr(object, "function"),
                    c("corSpher", "corExp", "corGaus", "corLin", "corRatio"), 0)
  val
}
environment(Dim.corHaversine) <- asNamespace("nlme")
## getCovariate method for corHaversine class
getCovariate.corHaversine <- function(object, form = formula(object), data) {
  if (is.null(covar <- attr(object, "covariate"))) { # if object lacks covariate attribute
    if (missing(data)) { # if object lacks data
      stop("need data to calculate covariate")
    }
    covForm <- getCovariateFormula(form)
    if (length(all.vars(covForm)) > 0) { # if covariate present
      if (attr(terms(covForm), "intercept") == 1) { # if formula includes intercept
        covForm <- eval(parse(text = paste("~", deparse(covForm[[2]]),"-1",sep=""))) # remove intercept
      }
      # can only take covariates with correct names
      if (length(all.vars(covForm)) > 2) stop("corHaversine can only take two covariates, 'lon' and 'lat'")
      if ( !all(all.vars(covForm) %in% c("lon", "lat")) ) stop("covariates must be named 'lon' and 'lat'")
      covar <- as.data.frame(unclass(model.matrix(covForm, model.frame(covForm, data, drop.unused.levels = TRUE) ) ) )
      covar <- covar[,order(colnames(covar), decreasing = T)] # order as lon ... lat
    }
    else {
      covar <- NULL
    }
    if (!is.null(getGroupsFormula(form))) { # if groups in formula extract covar by groups
      grps <- getGroups(object, data = data)
      if (is.null(covar)) {
        covar <- lapply(split(grps, grps), function(x) as.vector(dist(1:length(x) ) ) ) # filler?
      }
      else {
        giveDist <- function(el) {
          el <- as.matrix(el)
          if (nrow(el) > 1) as.vector(haversineDist(el))
          else numeric(0)
        }
        covar <- lapply(split(covar, grps), giveDist )
      }
      covar <- covar[sapply(covar, length) > 0] # no 1-obs groups
    }
    else { # if no groups in formula extract distance
      if (is.null(covar)) {
        covar <- as.vector(dist(1:nrow(data) ) )
      }
      else {
        covar <- as.vector(haversineDist(as.matrix(covar) ) )
      }
    }
    if (any(unlist(covar) == 0)) { # check that no distances are zero
      stop("cannot have zero distances in \"corHaversine\"")
    }
  }
  covar
} # end method getCovariate
environment(getCovariate.corHaversine) <- asNamespace("nlme")
To test that this runs, given a range parameter of 1000:
## test that corHaversine runs with spherical correlation (not testing that it WORKS ...)
library(MASS)
set.seed(1001)
sample_data <- data.frame(lon = -121:-22, lat = -50:49)
ran <- 1000 # 'range' parameter for spherical correlation
dist_matrix <- as.matrix(haversineDist(sample_data)) # haversine distance matrix
# set up correlation matrix of response
corr_matrix <- 1-1.5*(dist_matrix/ran)+0.5*(dist_matrix/ran)^3
corr_matrix[dist_matrix > ran] = 0
diag(corr_matrix) <- 1
# set up covariance matrix of response
sigma <- 2 # residual standard deviation
cov_matrix <- (diag(100)*sigma) %*% corr_matrix %*% (diag(100)*sigma) # correlated response
# generate response
sample_data$y <- mvrnorm(1, mu = rep(0, 100), Sigma = cov_matrix)
# fit model
gls_haversine <- gls(y ~ 1, correlation = corHaversine(form=~lon+lat, mimic="corSpher"), data = sample_data)
summary(gls_haversine)
# Correlation Structure: corHaversine
# Formula: ~lon + lat
# Parameter estimate(s):
# range
# 1426.818
#
# Coefficients:
# Value Std.Error t-value p-value
# (Intercept) 0.9397666 0.7471089 1.257871 0.2114
#
# Standardized residuals:
# Min Q1 Med Q3 Max
# -2.1467696 -0.4140958 0.1376988 0.5484481 1.9240042
#
# Residual standard error: 2.735971
# Degrees of freedom: 100 total; 99 residual
Testing that it runs with Gaussian correlation, with range parameter = 100:
## test that corHaversine runs with Gaussian correlation
ran = 100 # parameter for Gaussian correlation
corr_matrix_gauss <- exp(-(dist_matrix/ran)^2)
diag(corr_matrix_gauss) <- 1
# set up covariance matrix of response
cov_matrix_gauss <- (diag(100)*sigma) %*% corr_matrix_gauss %*% (diag(100)*sigma) # correlated response
# generate response
sample_data$y_gauss <- mvrnorm(1, mu = rep(0, 100), Sigma = cov_matrix_gauss)
# fit model
gls_haversine_gauss <- gls(y_gauss ~ 1, correlation = corHaversine(form=~lon+lat, mimic = "corGaus"), data = sample_data)
summary(gls_haversine_gauss)
With lme:
## runs with lme
# set up data with group effects
group_y <- as.vector(sapply(1:5, function(.) mvrnorm(1, mu = rep(0, 100), Sigma = cov_matrix_gauss)))
group_effect <- rep(-2:2, each = 100)
group_y = group_y + group_effect
group_name <- factor(group_effect)
lme_dat <- data.frame(y = group_y, group = group_name, lon = sample_data$lon, lat = sample_data$lat)
# fit model
lme_haversine <- lme(y ~ 1, random = ~ 1|group, correlation = corHaversine(form=~lon+lat, mimic = "corGaus"), data = lme_dat, control=lmeControl(opt = "optim") )
summary(lme_haversine)
# Correlation Structure: corHaversine
# Formula: ~lon + lat | group
# Parameter estimate(s):
# range
# 106.3482
# Fixed effects: y ~ 1
# Value Std.Error DF t-value p-value
# (Intercept) -0.0161861 0.6861328 495 -0.02359033 0.9812
#
# Standardized Within-Group Residuals:
# Min Q1 Med Q3 Max
# -3.0393708 -0.6469423 0.0348155 0.7132133 2.5921573
#
# Number of Observations: 500
# Number of Groups: 5
See if this answer on R-Help is useful: http://markmail.org/search/?q=list%3Aorg.r-project.r-help+winsemius+haversine#query:list%3Aorg.r-project.r-help%20winsemius%20haversine+page:1+mid:ugecbw3jjwphu2pb+state:results
I just checked, and it doesn't appear that the ramps or nlme packages have been modified to incorporate the changes suggested by Malcolm Fairbrother, so you will need to do some hacking. I don't want to be considered for the bounty, since I am not posting a tested solution and I didn't dream it up either.
