Calculate trend in array using linear regression in R

I have an array of the dimensions c(54,71,360) which contains climatological data. The first two dimensions describe the grid of the region, while the third one serves as the time dimension. So in this case, there are 360 time steps (months).
Here is code to produce a sample array:
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
Now I would like to calculate the trend of each grid cell. The trend is equal to the slope of the linear regression equation, so the linear regression of every grid cell against time needs to be computed. And this is exactly what I am struggling with.
To clearly show what I wish to do, here is an example with one grid cell, which is taken from the array as a vector of the length 360:
grid_cell <- my_array[1,1,]
The linear regression of this vector with the time needs to be calculated. For that purpose, we create a simple time vector:
time_vec <- 1:360
Since I am only interested in the slope coefficient, it can be done this way:
trend <- lm(grid_cell ~ time_vec)$coefficients[2]
This leads to a value of 1.347029e-05 in this case.
I would like to do this for every grid cell of the array, so that the output is a matrix of the dimensions c(54,71), meaning one trend value for each grid cell.
I tried the following, which did not work:
trend_mat <- apply(my_array, 1:2, lm(my_array ~ time_vec)$coefficients[2])
I receive the error message: Error in model.frame.default: variable lengths differ.
This is surprising, since the third dimension of the array and time_vec are both of length 360.
Anybody with an idea how to achieve this?
Of course I am also open for other solutions which may work totally differently, as long as they lead to the same result.

The problems with the code in the question are that (1) the third argument of apply should be a function, but the question's code provides an expression instead of a function, and (2) it applies lm many times. We show how to do it calling lm only once, and in the second alternative we do not use lm at all; this gives one and two orders of magnitude speedups, as shown in the Performance section below.
It is easier to illustrate if we use smaller data, as shown in the Note at the end. To use it on your example, just replace dims with the commented-out line shown in the Note.
1) First we reshape the array into a matrix, perform lm and then reshape it back. This invokes lm once rather than invoking it prod(dims[1:2]) times.
m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
# check
coef(lm(a[1,1,] ~ timevec))[[2]]
## [1] 0.2636792
coef(lm(a[2,1,] ~ timevec))[[2]]
## [1] -0.4453307
coef(lm(a[1,2,] ~ timevec))[[2]]
## [1] 0.5682025
coef(lm(a[2,2,] ~ timevec))[[2]]
## [1] 0.2338086
coef(lm(a[1,3,] ~ timevec))[[2]]
## [1] -0.255538
coef(lm(a[2,3,] ~ timevec))[[2]]
## [1] 0.254682
2) Alternatively, we can remove lm entirely by using the formula for the slope coefficient like this:
m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
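For a simple regression on a single predictor, the OLS slope equals cov(x, y) / var(x), which is why this works. As a purely illustrative sanity check (not part of the original answer), we can compare the two approaches on one cell of the small test array from the Note:
# covariance formula vs. lm slope for a single grid cell
all.equal(cov(a[1, 1, ], timevec) / var(timevec),
          coef(lm(a[1, 1, ] ~ timevec))[[2]])
## should return TRUE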
Performance
We see that the single lm call runs about 28x faster than apply, and eliminating lm runs about 230x faster than apply (based on the mean timings below). Because apply is brutally slow on my laptop, I only used 3 replications, but if you have a faster machine or more patience you can increase it. The main conclusions are unlikely to change much, though.
library(microbenchmark)
set.seed(5)
dims <- c(54,71,360)
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
microbenchmark(times = 3L,
apply = apply(a, 1:2, function(x) coef(lm(x ~ timevec))[2]),
lm = { m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
},
cov = { m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
})
giving:
Unit: milliseconds
expr min lq mean median uq max neval cld
apply 13446.7953 13523.6016 13605.25037 13600.4079 13684.4779 13768.5479 3 b
lm 264.5883 275.7611 476.82077 286.9338 582.9370 878.9402 3 a
cov 56.9120 57.8830 58.71573 58.8540 59.6176 60.3812 3 a
Note
Test data.
set.seed(5)
# dims <- c(54,71,360)
dims <- 2:4
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])

The question's regression code is missing an anonymous function in the call to apply. Here I will use the new lambda shorthand \(x), introduced in R 4.1.0.
I also use the recommended extractor coef().
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
time_vec <- 1:360
trend_mat <- apply(my_array, 1:2, \(x) coef(lm(x ~ time_vec))[2])
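On R versions older than 4.1.0, where the \(x) lambda shorthand is not available, an ordinary anonymous function does the same thing:
trend_mat <- apply(my_array, 1:2, function(x) coef(lm(x ~ time_vec))[2])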

Related

Trying to store estimates when using "replicate" in R

I am new to RStudio and have a question I was hoping someone could help me with. I am using the replicate function to simulate a log-GARCH model; I get 100 replications, and what I want to do is store the estimates so that I can calculate the average. How can I do that?
Code:
library(lgarch)
replicate(n = 100, {
  x <- lgarchSim(500, constant = 0.5)
  mymod <- lgarch(x)
}, simplify = FALSE)
This produced the following output:
Attached is an image of replications [99] and [100]; what I essentially want is to store "intercept", "arch1" and "garch1" in a list.
The result is a list of models. You can use lapply to extract the coefficients from each model, rbind them together into a matrix, and take colMeans to get your average coefficients:
library(lgarch)
my_list <- replicate(n = 100, {
  x <- lgarchSim(500, constant = 0.5)
  mymod <- lgarch(x)
}, simplify = FALSE)
colMeans(do.call(rbind, lapply(my_list, coefficients)))
#> intercept arch1 garch1 Elnz2
#> 1.15879953 0.04965793 0.82525587 -1.26277188
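As a small follow-up sketch (the intermediate name coef_mat is just illustrative), the same coefficient matrix can be reused if you also want the spread of the estimates across the 100 replications:
coef_mat <- do.call(rbind, lapply(my_list, coefficients))
colMeans(coef_mat)      # average of each estimate across replications
apply(coef_mat, 2, sd)  # standard deviation of each estimate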

get means across samples from bootstrap

I want to get the means and sds across 20 bootstrap samples, but I am not sure how to do that. My current code can give me the means within each sample, not across samples.
## create data
data <- round(rnorm(100, 5, 3))
data[1:10]
## obtain 20 bootstrap samples
## display the first of the bootstrap samples
resamples <- lapply(1:20, function(i) sample(data, replace = T))
resamples[1]
## calculate the mean for each bootstrap sample
r.mean <- sapply(resamples, mean)
r.mean
## calculate the sd of the distribution of means
sqrt(var(r.mean))
From the above code, I got 20 means, one from each bootstrap sample, and the sd of the distribution of those means. How can I get 100 means, one for each data point taken across the 20 samples? And the same for the standard deviation?
Many thanks!!
Though the answer by @konvas is probably what you want, I would still take a look at the boot package when it comes to bootstrapping.
See if the following example can get you closer to what you are trying to do.
set.seed(6929) # Make the results reproducible
data <- round(rnorm(100, 5, 3))
boot_mean <- function(data, indices) mean(data[indices])
boot_sd <- function(data, indices) sd(data[indices])
Runs <- 100
r.mean <- boot::boot(data, boot_mean, Runs)
r.sd <- boot::boot(data, boot_sd, Runs)
r.mean$t
r.sd$t
sqrt(var(r.mean$t))
# [,1]
#[1,] 0.3152989
sd(r.mean$t)
#[1] 0.3152989
Now, look at the distribution of the bootstrapped means and standard deviations.
op <- par(mfrow = c(1, 2))
hist(r.mean$t)
hist(r.sd$t)
par(op)
Make a matrix with your samples
mat <- do.call(rbind, resamples)
Then
rowMeans(mat)
will give you the "within sample" mean and
colMeans(mat)
the "across sample" mean. For other quantities, e.g. standard deviation you can use apply, e.g. apply(mat, 1, sd) or functions from the matrixStats package, e.g. matrixStats::rowSds(mat).

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks, over a rolling 36 month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100 one by one on columns 101-116? Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for( i in window.len:ndates ) {
  window.index = (i - window.len + 1) : i
  fit = lm.constraint( hist.returns[window.index, -1], hist.returns[window.index, 1], constraints )
  style.weights[i,] = fit$coefficients
  style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random toy dataset and using lm(), but you can sub in whatever regression function you want.
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
  models[[i]] <- lm(formula = x ~ ., data = data.frame(x = dat[, i], dat[, 101:116]))
}
# with lapply
models2 <- lapply(1:100, function(i)
  lm(formula = x ~ ., data = data.frame(x = dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]
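As a small follow-up (not part of the original answer), once the fits are in a list you can pull out all the coefficient estimates at once, e.g.:
coef_mat <- sapply(models2, coef)  # 17 x 100 matrix: (intercept + 16 slopes) per model
summary(models2[[1]])              # full summary of an individual fit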

Replacing a loop in R: multivariate k-nearest neighbor regression example

I'm new to R and am trying to replace the loop in the appended block of code with something more efficient. For context, this is a simple, synthetic example of a k-nearest neighbor regression with a multivariate (3-dimensional) target.
rm(list=ls())
set.seed(1)
# Fast nearest neighbor package
library(FNN)
k <- 3
# Synthetic 5d predictor and noisy 3d target data
x <- matrix(rnorm(50), ncol=5)
y <- 5*x[,1:3] + matrix(rnorm(30), ncol=3)
print(x)
print(y)
# New synthetic 5d predictor data (4 cases)
x.new <- matrix(rnorm(20), ncol=5)
print(x.new)
# Identify k-nearest neighbors
nn <- knnx.index(data=x, query=x.new, k=k)
print(nn)
At present, I am taking the unweighted average of the k-nearest neighbours (nn) by the following loop:
# Unweighted k-nearest neighbor regression predictions based on y and nn
y.new <- matrix(0, ncol=ncol(y), nrow=nrow(x.new))
for(i in 1:nrow(nn))
y.new[i,] <- colMeans(y[nn[i,],,drop=FALSE])
print(y.new)
but there must be a simple way to avoid looping here. Thanks.
One option in these situations is to build a big matrix and manipulate the indices:
y2 <- array(colMeans(matrix(y[t(nn), ], nrow = ncol(nn))), dim(y.new))
identical(y2,y.new)
## [1] TRUE
In this case, my code runs about twice as fast as yours:
library(microbenchmark)
microbenchmark(
  loop = for (i in 1:nrow(nn))
    y.new[i, ] <- colMeans(y[nn[i, ], , drop = FALSE]),
  matrix = y2 <- array(colMeans(matrix(y[t(nn), ], nrow = ncol(nn))), dim(y.new)))
## Unit: microseconds
## expr min lq median uq max neval
## loop 43.680 47.8805 49.1675 49.975 128.698 100
## matrix 23.807 25.4330 25.9985 26.761 80.491 100
The loop in this case isn't really that bad. In general, as long as you're doing a lot of work in a loop (in this case subsetting a matrix and calling colMeans), then the amount of overhead per iteration will be small compared to the actual meat of the loop. The times you really need to avoid loops in R are where each iteration is only doing a small amount of work, in which case the overhead of iterating in R will truly be the bottleneck, and avoiding the loop can give a dramatic performance improvement.
The advantage of the loop is that it is very clear what you are doing, whereas my code is pretty incomprehensible. However, doing matrix index manipulation like this will usually be faster, sometimes by a lot, because you're only subsetting the y matrix once, as opposed to once each time through the loop.
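If readability matters more than the last bit of speed, a middle ground (my own sketch, not part of the answer above) is to apply over the rows of nn; it avoids the explicit loop while staying close to the original logic:
# one prediction per query point; t() because apply stacks results as columns
y.new3 <- t(apply(nn, 1, function(idx) colMeans(y[idx, , drop = FALSE])))
all.equal(y.new3, y.new)  # should be TRUE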

ML estimation of Rician distribution parameters in R

I have data samples arranged in a 1000 x 56 array, and I would like to extract the parameters of a Rician distribution that best fits the data in each column. I am using the VGAM package, which seems like a perfect fit, and given the example in the documentation for riceff
vee = exp(2); sigma = exp(1);
y = rrice(n <- 1000, vee, sigma)
fit = vglm(y ~ 1, riceff, trace=TRUE, crit="c")
I figured the following code would work without a problem
nu <- rep(-1,ncol(data))
sigma <- rep(-1,ncol(data))
for (coln in seq(ncol(data))) {
  fdata <- c(data[, coln])
  fit <- vglm(fdata ~ 1, riceff, trace = TRUE, crit = "c")
  sigma[coln] <- matrix(Coef(fit)[1])[1, 1]
  nu[coln] <- matrix(Coef(fit)[2])[1, 1]
}
but instead I get the error
VGLM linear loop 1 : coefficients = -723936.834084, 598.301767
Error in if ((temp <- sum(wz[, 1:M, drop = FALSE] < wzepsilon))) warning(paste(temp, :
argument is not interpretable as logical
as for my data, I ran some basic checks
> is.matrix(data)
[1] TRUE
> dim(data)
[1] 1000 56
> summary(data)
V1
Min. :1.402e-05
1st Qu.:9.533e-04
Median :1.548e-03
Mean :1.640e-03
3rd Qu.:2.175e-03
Max. :4.657e-03
... (omitted for brevity)
V56
Min. :5.252e-05
1st Qu.:1.125e-03
Median :1.692e-03
Mean :1.776e-03
3rd Qu.:2.293e-03
Max. :5.903e-03
None of the information in the summary indicates that there is a NaN hidden somewhere, so I am at a loss as to why vglm is failing.
Does anyone have an idea as to what may be the problem? Any insight is greatly appreciated.
As suggested by Ben Bolker, here is the "solution" to my own problem (for future reference):
The vglm function in the VGAM package does not necessarily behave well for all data inputs. Since a lot of the data is close to being Rayleigh distributed, the command just exits with that bizarre error (the Koay inversion also fails, presumably for similar reasons). If I fit my data against a generalized Rayleigh distribution via genrayleigh, everything works well enough.
One way to try both distributions, as Ben suggested, is to use try or tryCatch, emitting NA values when the fitting function breaks down.
tryCatch({
  fit <- vglm(fdata ~ 1, riceff, trace = TRUE, crit = "c")
  # extract fit parameters here
  # ...
}, error = function(ex) {
  # insert NA value into your data here
  # ...
})
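For reference, here is a minimal sketch of how that guard could be wired into the original column loop. It keeps the same variable names and the same Coef() indexing as the question's code; columns where vglm fails are simply left as NA:
library(VGAM)
nu    <- rep(NA_real_, ncol(data))
sigma <- rep(NA_real_, ncol(data))
for (coln in seq_len(ncol(data))) {
  fdata <- data[, coln]
  fit <- tryCatch(vglm(fdata ~ 1, riceff, trace = FALSE, crit = "c"),
                  error = function(ex) NULL)
  if (!is.null(fit)) {
    sigma[coln] <- Coef(fit)[1]
    nu[coln]    <- Coef(fit)[2]
  }
}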

Resources