Remove linear trend from raster stack in R

I'm trying to remove the linear trend (detrend) from a monthly precipitation raster stack for the US covering 1979-2015 (https://www.northwestknowledge.net/metdata/data/monthly/pr_gridMET.nc). Those data are large enough that using them as an example here would be a bit unwieldy, so for the sake of brevity I'll use the sample data from the raster package instead. My current working approach is to run `raster::calc` with a linear model and pull out the residuals. My understanding is that those residuals are the detrended series, but I am not 100% sure that is correct. The code I am using is as follows:
library(raster)
# Build a small example stack: one sample layer plus four shifted copies
fn  <- raster(system.file("external/test.grd", package = "raster"))
fn2 <- fn + 1000
fn3 <- fn + 500
fn4 <- fn + 750
fn5 <- fn + 100
fns <- stack(fn, fn2, fn3, fn4, fn5)
time <- 1:nlayers(fns)  # time index, one value per layer
# Get residuals to detrend the raw data
get_residuals <- function(x) {
  if (is.na(x[1])) {
    rep(NA, length(x))
  } else {
    m <- lm(x ~ time)
    residuals(m)
  }
}
detrended_fns <- calc(fns, get_residuals) # Create our residual (detrended) time series stack
I feel like I'm missing something here. Can anyone confirm that I'm on the right track? If I'm not, any suggestions on how to properly detrend these data would be helpful. Thanks!

The residuals remove both the slope and the intercept, so what you get are anomalies. Perhaps you only want to remove the slope? In that case you could add the intercept back to the residuals in get_residuals:
q <- residuals(m) + coefficients(m)[1]
Or better:
q <- residuals(m) + predict(m)[1]
That way you use year 1 (and not year 0) as the base, and it also works if time is, say, 2000:2004.
You could also take the last year, mid year, or average as base.
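Putting that together, a minimal sketch of the adjusted function (the question's get_residuals with the fitted value at the first time step kept as the base):

get_residuals_keep_level <- function(x) {
  if (is.na(x[1])) {
    rep(NA, length(x))
  } else {
    m <- lm(x ~ time)
    # slope removed, level retained: residuals plus the fit at time[1]
    residuals(m) + predict(m)[1]
  }
}
detrended_keep_level <- calc(fns, get_residuals_keep_level)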

Related

Fitting a confidence interval to dlmForecast in R

I've fit a Dynamic Linear Model to some data using dlmFilter in R (from the dlm package). From that filter I have predicted 7 steps ahead using the dlmForecast function. The predicted outcome is very good, but I would like to add a 95% confidence interval and, after a lot of testing, have struggled to do so.
I've mocked up some similar code, below:
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
  dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Maximum likelihood estimates of the variance parameters
mle <- dlmMLE(data, rep(0, 2), mod.build)
if (mle$convergence == 0) print("converged") else print("did not converge")
# Rebuild the model with the fitted parameters, then filter and forecast
mod1 <- mod.build(mle$par)
mod1Filt <- dlmFilter(data, mod1)
fut1 <- dlmForecast(mod1Filt, n = 7)
The forecast outcome appears to be very good (although the model to some extent over-fits the data due to the small number of observations). However, I would like to add a 95% confidence interval and have struggled to figure out how to do so.
Any advice would be appreciated.
Cheers
The forecast variances in fut1$Q give you the standard errors of each step-ahead prediction; combine them with the normal quantile and the point forecasts fut1$f:
hwidth <- outer(sapply(fut1$Q, FUN = function(x) sqrt(diag(x))), qnorm(0.025, lower = FALSE)) + as.vector(t(fut1$f))
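Spelled out, a minimal sketch (reusing the fut1 object from the code above) that returns both bounds of the 95% interval:

se <- sapply(fut1$Q, function(x) sqrt(diag(x)))  # forecast standard errors
z <- qnorm(0.025, lower.tail = FALSE)            # about 1.96
lower <- as.vector(t(fut1$f)) - z * se
upper <- as.vector(t(fut1$f)) + z * se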

Bootstrapping regression coefficients from random subsets of data

I'm attempting to perform a regression calibration on two variables using the york() function in the IsoplotR package. I would like to estimate the confidence interval of the bootstrapped slope coefficient from this model; however, instead of using the typical bootstrap method below, I'd like to only perform the iterations on 75% of the data (randomly selected) at a time. So far, using the following sample data, I managed to bootstrap the slope coefficient result of the york() function:
library(boot)
library(IsoplotR)
X <- c(9.105,8.987,8.974,8.994,8.996,8.966,9.035,9.215,9.239,
9.307,9.227,9.17, 9.102)
Y <- c(28.1,28.9,29.6,29.5,29.0,28.8,28.5,27.3,27.1,26.5,
27.0,27.5,28.4)
n <- length(X)
sX <- X*0.02
sY <- Y*0.05
rXY <- rep(0.8,n)
dat <- cbind(X,sX,Y,sY,rXY)
fit <- york(dat)
boot.test <- function(data, indices) {
  sample <- data[indices, ]
  mod <- york(sample)
  return(mod$b)
}
result <- boot(data=dat, statistic = boot.test, R=1000)
boot.ci(result, type = 'bca')
...but I'm not really sure where to go from here. Any help to move me in the right direction would be greatly appreciated. I'm new to R, so I apologize if the question is ambiguous. Thanks.
Based on the package documentation, you should be able to use the ran.gen argument, with sim="parametric", to sample using a custom function; here each simulated dataset is a random 75% of the observations. Two details matter: with sim="parametric" the statistic is called on the simulated data directly (with no index argument), and the second argument of ran.gen receives whatever is passed as mle. Something like the following should accomplish what you want:
# with sim="parametric" the statistic takes only the data
boot.test.par <- function(data) {
  mod <- york(data)
  return(mod$b)
}
result <- boot(
  data = dat,
  statistic = boot.test.par,
  R = 1000,
  sim = "parametric",
  ran.gen = function(data, percent) {
    n <- nrow(data)
    indic <- runif(n)
    data[rank(indic, ties.method = "random") <= round(n * percent, 0), ]
  },
  mle = 0.75)  # delivered to ran.gen as its second argument (percent)
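From there, note that boot.ci()'s BCa intervals are only defined for nonparametric resampling, so with sim="parametric" a plain percentile interval over the replicates is a safe way to summarize the slope (a sketch, assuming the slope estimate is the first element returned by york()$b):

quantile(result$t[, 1], c(0.025, 0.975))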

Linear Regression in R for Date and some dependent output

Actually I need to calculate the parameters theta0 and theta1 using linear regression.
My data frame (data.1) consists of two columns: the first is a date-time and the second is a result that depends on this date.
Like this:
data.1[[1]] data.1[[2]]
2004-07-08 14:30:00 12.41
Now, I have this code which iterates a number of times to calculate the parameters theta0 and theta1:
x=as.vector(data.1[[1]])
y=as.vector(data.1[[2]])
plot(x,y)
theta0=10
theta1=10
alpha=0.0001
initialJ=100000
learningIterations=200000
# Cost: half the mean squared error
J=function(x,y,theta0,theta1){
    m=length(x)
    sum=0
    for(i in 1:m){
        sum=sum+((theta0+theta1*x[i]-y[i])^2)
    }
    sum=sum/(2*m)
    return(sum)
}
# One gradient-descent update of theta0 and theta1
updateTheta=function(x,y,theta0,theta1){
    sum0=0
    sum1=0
    m=length(x)
    for(i in 1:m){
        sum0=sum0+(theta0+theta1*x[i]-y[i])
        sum1=sum1+((theta0+theta1*x[i]-y[i])*x[i])
    }
    sum0=sum0/m
    sum1=sum1/m
    theta0=theta0-(alpha*sum0)
    theta1=theta1-(alpha*sum1)
    return(c(theta0,theta1))
}
# Iterate, stopping once the cost stops improving
for(i in 1:learningIterations){
    thetas=updateTheta(x,y,theta0,theta1)
    tempSoln=J(x,y,theta0,theta1)
    if(tempSoln<initialJ){
        initialJ=tempSoln
    }
    if(tempSoln>initialJ){
        break
    }
    theta0=thetas[1]
    theta1=thetas[2]
    plot(x,y)
    lines(x,(theta0+theta1*x), col="red")
}
lines(x,(theta0+theta1*x), col="green")
Now I want to calculate theta0 and theta1 for the following scenarios:
y=data.1[[2]] and x=dates, which are similar irrespective of the year
y=data.1[[2]] and x=months, which are similar irrespective of the year
Please suggest.
As @Nicola said, you need to use the lm function for linear regression in R.
If you'd like to learn more about linear regression, check out this or follow this tutorial.
First you have to determine your formula. You want to estimate Theta0 and Theta1 from data.1[[2]] and the dates/months; in lm the response goes on the left of the ~ and the predictor on the right, and Theta0 and Theta1 come out as the fitted intercept and slope. Your first formula would be something along the lines of:
formula <- data.1[[2]] ~ dates
Then you would create the linear model
variablename <- lm(formula, dataset)
After this you can use the output for various calculations, for example an anova table or the model summary:
anova(variablename)
summary(variablename)
Sidenote: I noticed you're assigning variables using =. This is not recommended; in R it is preferred to use <- for assignment. For more information check out Google's R Style Guide.
Taking the first bit of your code, it would become:
x <- as.vector(data.1[[1]])
y <- as.vector(data.1[[2]])
plot(x,y)
theta0 <- 10
theta1 <- 10
alpha <- 0.0001
initialJ <- 100000
learningIterations <- 200000
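To address the months scenario specifically, here is a hypothetical sketch (assuming data.1[[1]] is a POSIXct date-time and data.1[[2]] is numeric, as described): extract the month number and regress on it, so that Theta0 and Theta1 are the fitted intercept and slope.

# hypothetical sketch: regress the value on the month, ignoring the year
df <- data.frame(value = data.1[[2]],
                 month = as.integer(format(data.1[[1]], "%m")))
fit <- lm(value ~ month, data = df)
coef(fit)  # (Intercept) is Theta0, month is Theta1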

Problems with points and apply R for linear discriminant analysis

I have a couple of coding questions that arose while doing some exercises in linear discriminant analysis. We are using the iris data:
## Read in dataset, set seed, load package
Iris <- iris[,-(1:2)]
grIris <- as.integer(iris[,"Species"])
set.seed(16)
library(MASS)
## Read n
n <- nrow(Iris)
As you can see, we delete the first and second columns of iris. What I want to do is bootstrap this data using linear discriminant analysis; here is my code:
ind <- replicate(B,sample(seq(1:n),n,replace=TRUE))
This generates the indices I want to use. Note B is some large number, e.g. 1000. Now I want to use apply, but why doesn't the following code work?
bst.sample <- apply(ind,2,lda(Species~Petal.Length+Petal.Width,data=Iris[ind,]))
where Species, Petal.Length, etc. are the data from iris. If I use a for loop everything works fine, but of course I would like to implement this in a more elegant way.
My second question is about points. I also wanted to calculate the estimated means, which I've done with the following code:
est.lda <- vector("list", B)
est.qda <- vector("list", B)
mu_hat_1 <- mu_hat_2 <- mu_hat_3 <- matrix(0, ncol = B, nrow = 2)
for (i in 1:B) {
  est.lda[[i]] <- lda(Species ~ Petal.Length + Petal.Width, data = Iris[ind[, i], ])
  mu_hat_1[, i] <- est.lda[[i]]$means[1, ]
  mu_hat_2[, i] <- est.lda[[i]]$means[2, ]
  mu_hat_3[, i] <- est.lda[[i]]$means[3, ]
  est.qda[[i]] <- qda(Species ~ Petal.Length + Petal.Width, data = Iris[ind[, i], ])
}
plot(mu_hat_1[1,],mu_hat_1[2,],pch=4)
points(mu_hat_2[1,],mu_hat_2[2,],pch=4,col=2)
points(mu_hat_3[1,],mu_hat_3[2,],pch=4,col=3)
The plot at the end should show three regions with the estimated means of the three classes. However, only the first set of points is shown.
Thank you for your help.
B <- 10
ind <- replicate(B, sample(seq(1:n), n, replace = TRUE))
# you need to pass a function to apply
bst.sample <- apply(ind, 2,
    function(i) lda(Species ~ Petal.Length + Petal.Width, data = Iris[i, ]))
# extract the group means from each fit
bst.means <- lapply(bst.sample, function(x) x$means)
# bind the means into a 3-d array
library(abind)
bst.means <- do.call(function(...) abind(..., along = 3), bst.means)
# you need to make sure that all points are inside the axis limits
plot(bst.means[1, 1, ], bst.means[1, 2, ],
     xlim = range(bst.means[, 1, ]), ylim = range(bst.means[, 2, ]),
     xlab = dimnames(bst.means)[[2]][1], ylab = dimnames(bst.means)[[2]][2],
     col = 1)
points(bst.means[2, 1, ], bst.means[2, 2, ], col = 2)
points(bst.means[3, 1, ], bst.means[3, 2, ], col = 3)
legend("topleft", legend = dimnames(bst.means)[[1]], col = 1:3, pch = 1)

Obtain t-statistic for regression coefficients of an “mlm” object returned by `lm()`

I've used lm() to fit multiple regression models for multiple (~1 million) response variables in R, e.g.
allModels <- lm(t(responseVariablesMatrix) ~ modelMatrix)
This returns an object of class "mlm", which is essentially one huge object containing all the models. I want to get the t-statistic for the first coefficient in each model, which I can do using the summary(allModels) function, but it's very slow on data this large and returns a lot of unwanted information too.
Is there a way of calculating the t-statistics manually that is faster than using the summary() function?
Thanks!
You can hack the summary.lm() function to get just the bits you need and leave the rest.
If you have
nVariables <- 5
nObs <- 15
y <- rnorm(nObs)
x <- matrix(rnorm(nVariables * nObs), nrow = nObs)
allModels <- lm(y ~ x)
Then this is the code from the summary.lm() function, but with all the excess baggage removed (note: all the error handling has been removed as well).
p <- allModels$rank
rdf <- allModels$df.residual
Qr <- allModels$qr
n <- NROW(Qr$qr)
p1 <- 1L:p
r <- allModels$residuals
f <- allModels$fitted.values
w <- allModels$weights
mss <- if (attr(allModels$terms, "intercept")) sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
resvar <- rss/rdf                            # residual variance
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])   # unscaled coefficient covariance
se <- sqrt(diag(R) * resvar)                 # coefficient standard errors
est <- allModels$coefficients[Qr$pivot[p1]]
tval <- est/se                               # the t statistics
tval is now a vector of the t statistics, as also given by
summary(allModels)$coefficients[,3]
If you have problems on the large model you might want to rewrite the code so that it keeps fewer objects, compounding multiple lines/assignments into fewer lines.
Hacky solution, I know, but it will be about as fast as possible. I suppose it would be neater to put all the lines of code into a function as well.
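Following that last suggestion, a minimal sketch of such a wrapper (the same logic as above, assuming a single-response fit like the demo; for a true "mlm" object the residuals and coefficients are matrices and would need to be handled column by column):

# hypothetical wrapper around the snippet above; assumes a single response
fastTvals <- function(fit) {
  p <- fit$rank
  p1 <- 1L:p
  Qr <- fit$qr
  resvar <- sum(fit$residuals^2) / fit$df.residual  # residual variance
  R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])        # unscaled covariance
  fit$coefficients[Qr$pivot[p1]] / sqrt(diag(R) * resvar)
}
fastTvals(allModels)  # matches summary(allModels)$coefficients[, 3]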
