I have an array of the dimensions c(54,71,360) which contains climatological data. The first two dimensions describe the grid of the region, while the third one serves as the time dimension. So in this case, there are 360 time steps (months).
Here is code to produce a sample array:
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
Now I would like to calculate the trend of each grid cell. The trend is equal to the slope of the linear regression equation, so a linear regression of every grid cell against time needs to be performed. And this is exactly what I am struggling with.
To clearly show what I wish to do, here is an example with one grid cell, which is taken from the array as a vector of the length 360:
grid_cell <- my_array[1,1,]
The linear regression of this vector with the time needs to be calculated. For that purpose, we create a simple time vector:
time_vec <- 1:360
Since I am only interested in the slope coefficient, it can be done this way:
trend <- lm(grid_cell ~ time_vec)$coefficients[2]
This leads to a value of 1.347029e-05 in this case.
I would like to do this for every grid cell of the array, so that the output is a matrix of the dimensions c(54,71), meaning one trend value for each grid cell.
I tried the following, which did not work:
trend_mat <- apply(my_array, 1:2, lm(my_array ~ time_vec)$coefficients[2])
I receive the error message: Error in model.frame.default: variable lengths differ.
This is kind of surprising, since both the third dimension of the array and time_vec have length 360.
Anybody with an idea how to achieve this?
Of course I am also open for other solutions which may work totally differently, as long as they lead to the same result.
The problems with the code in the question are that
the third argument of apply should be a function, and the question's code provides an expression instead of a function.
it applies lm many times. We show how to do it applying lm only once, and in the second alternative we don't use lm at all. This gives speedups of one and two orders of magnitude, respectively, as shown in the Performance section below.
It is easier to illustrate with smaller data, as shown in the Note at the end. To use it on your example, just replace the dims assignment with the commented-out line in the Note.
1) First we reshape the array into a matrix, perform lm and then reshape it back. This invokes lm once rather than invoking it prod(dims[1:2]) times.
m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
# check
coef(lm(a[1,1,] ~ timevec))[[2]]
## [1] 0.2636792
coef(lm(a[2,1,] ~ timevec))[[2]]
## [1] -0.4453307
coef(lm(a[1,2,] ~ timevec))[[2]]
## [1] 0.5682025
coef(lm(a[2,2,] ~ timevec))[[2]]
## [1] 0.2338086
coef(lm(a[1,3,] ~ timevec))[[2]]
## [1] -0.255538
coef(lm(a[2,3,] ~ timevec))[[2]]
## [1] 0.254682
2) Alternately, we can remove lm entirely by using the formula for the slope coefficient like this:
m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
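As a quick check that the two alternatives agree, here is a minimal sketch on the small test data from the Note. It relies on the fact that the least-squares slope of y on t equals cov(y, t) / var(t); the names slope_lm and slope_cov are only illustrative and not part of the original answer.

```r
set.seed(5)
dims <- 2:4
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dims[3])

# rows = time steps, columns = grid cells (in column-major order)
m <- t(matrix(a, , dims[3]))

# alternative 1: one multi-response lm, keep the slope row
slope_lm <- array(coef(lm(m ~ timevec))[2, ], dims[1:2])

# alternative 2: closed-form slope, cov(y, t) / var(t)
slope_cov <- array(cov(m, timevec) / var(timevec), dims[1:2])

# TRUE up to floating point tolerance
all.equal(slope_lm, slope_cov)
```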
Performance
We see that, by median time, the single lm runs about 47x faster than apply and eliminating lm runs about 230x faster than apply. Because the apply is brutally slow on my laptop I only used 3 replications, but if you have a faster machine or more patience you can increase it. The main conclusions are unlikely to change much though.
library(microbenchmark)
set.seed(5)
dims <- c(54,71,360)
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
microbenchmark(times = 3L,
  apply = apply(a, 1:2, function(x) coef(lm(x ~ timevec))[2]),
  lm = { m <- t(matrix(a,,dim(a)[3]))
         array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
  },
  cov = { m <- t(matrix(a,,dim(a)[3]))
          array(cov(m, timevec) / var(timevec), dims[1:2])
  })
giving:
Unit: milliseconds
expr min lq mean median uq max neval cld
apply 13446.7953 13523.6016 13605.25037 13600.4079 13684.4779 13768.5479 3 b
lm 264.5883 275.7611 476.82077 286.9338 582.9370 878.9402 3 a
cov 56.9120 57.8830 58.71573 58.8540 59.6176 60.3812 3 a
Note
Test data.
set.seed(5)
# dims <- c(54,71,360)
dims <- 2:4
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
There is an anonymous function missing in the question's regression code. Here I will use the new lambdas, introduced in R 4.1.0.
I also use the recommended extractor coef.
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
time_vec <- 1:360
trend_mat <- apply(my_array, 1:2, \(x) coef(lm(x ~ time_vec))[2])
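The original call failed because its third argument, lm(my_array ~ time_vec), is evaluated before apply ever slices the array, so lm sees the whole array on one side and 360 time steps on the other. A small sketch of the corrected pattern on a hypothetical 2 x 3 grid with 12 time steps (the array small is made up for illustration), using the function(x) spelling that also works before R 4.1.0:

```r
set.seed(5)
# hypothetical small grid: 2 x 3 cells, 12 time steps
small <- array(rnorm(2 * 3 * 12), dim = c(2, 3, 12))
time_vec <- seq_len(dim(small)[3])

# apply hands each cell's time series to the function as x
trend_mat <- apply(small, 1:2, function(x) coef(lm(x ~ time_vec))[[2]])
dim(trend_mat)

# spot check one cell against a direct regression
all.equal(trend_mat[2, 3], coef(lm(small[2, 3, ] ~ time_vec))[[2]])
```

The result has one slope per grid cell, so its dimensions are the first two dimensions of the input array.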
For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
c <- cor.test(x,y, method="kendall")
return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data:
library(dplyr)  # provides sample_n
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
output <- cor.test(newdata$wt, newdata$qsec, method="kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result. In this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
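To make the shape of that simplified result concrete, here is a small sketch. It swaps sample_n for base R row indexing so it runs without dplyr, and passes exact = FALSE only to avoid the warnings about ties; both are choices made for this illustration.

```r
set.seed(1)
result <- replicate(n = 10, expr = {
  # base R stand-in for sample_n(mtcars, 10)
  newdata <- mtcars[sample(nrow(mtcars), 10), ]
  cor.test(newdata$wt, newdata$qsec, method = "kendall", exact = FALSE)
})

# one column per replication, one row per component of the test result
dim(result)
rownames(result)

# each cell is a list element, so pull a whole row and flatten it
p_values <- unlist(result["p.value", ])
p_values
```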
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
cor.test(newdata$wt, newdata$qsec, method="kendall")$p.value
})
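Since the question asks for both the correlation coefficient and its significance, one possible variation is to return a small named vector from each replication and transpose the result into a data frame. This is a sketch under the same assumptions as above (built-in mtcars, base R sampling instead of sample_n); the names result_df and tau are made up for illustration.

```r
set.seed(1)
result <- replicate(n = 10, expr = {
  newdata <- mtcars[sample(nrow(mtcars), 10), ]
  ct <- cor.test(newdata$wt, newdata$qsec, method = "kendall", exact = FALSE)
  # keep only the estimate and the p-value
  c(tau = unname(ct$estimate), p.value = ct$p.value)
})

# replicate returns a 2 x 10 matrix; transpose to one row per replication
result_df <- as.data.frame(t(result))
head(result_df)

# to save it, something like the following would work (file name is hypothetical):
# write.csv(result_df, "random_subsample_correlations.csv", row.names = FALSE)
```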
I wrote a simple function for maximum likelihood and would like it to give different results for different values of its parameters, using a for loop in R. That is, my function includes an expression inside a for loop. The function works well and the results are saved in a list. Since I get two different results, I would like to apply the optim function to each part of my function. For example,
ff <- function(x,mu=c(2,0.5),sd=c(0.2,0.3)){
out <- vector("list",2)
for (i in 1:2){
out[[i]] <- -sum(log(dnorm(x,mu[[i]],sd[[i]]))) ## here I have two different parts of my function wrapped as one using a for loop.
}
return(out)
}
set.seed(123)
x <- rnorm(10,2,0.5)
x
Then the result of my function is:
> ff(x)
[[1]]
[1] 25.33975
[[2]]
[1] 101.4637
Then, since my function has two different parts wrapped as one using a for loop, I would like to apply the optim function to each part of it. I tried many approaches of my own and none of them worked. Here is one of my attempts:
op <- vector("list",2)
for(i in 1:2){
op <- optim(c(0.5,0.5),fn=ff[[i]],i=i)
}
That is, I want optim to evaluate my function for the first value of my index, i=1, and then for the second one, i=2.
So my functions without the wrapper are as follows:
ff_1 <- function(x,mu=c(2,0.5),sd=c(0.2,0.3)){
-sum(log(dnorm(x,mu[[1]],sd[[1]]))) # the last expression is the return value
}
ff_2 <- function(x,mu=c(2,0.5),sd=c(0.2,0.3)){
-sum(log(dnorm(x,mu[[2]],sd[[2]]))) # the last expression is the return value
}
and I then need to run optim separately for each function.
I searched many websites and R help sites but could not find a solution to this question.
Any help please?
Try this one, it's just the way of passing the arguments to optim, I suppose
# given data
set.seed(123)
x <- rnorm(10,2,0.5)
# use vector parOpt instead of specifying two; for convenience
# with optim
ff <- function(x, parOpt){
out <- -sum(log(dnorm(x, parOpt[1], parOpt[2])))
return(out)
}
# parameters in mu,sd vectors arranged in list
params <- list(set1 = c(2, 0.2), set2 = c(0.5, 0.3))
# output list
out <- list()
for(i in 1:2){
# pass params (mu and sd) to optim, function ff and the data
# note, since function ff has x argument, specify that in optim
out[[i]] <- optim(par = params[[i]], fn=ff ,x=x)
}
Should give something like this:
[[1]]
[[1]]$par
[1] 2.0372546 0.4523918
[[1]]$value
[1] 6.257931
[[1]]$counts
function gradient
55 NA
[[1]]$convergence
[1] 0
[[1]]$message
NULL
[[2]]
[[2]]$par
[1] 2.037165 0.452433
[[2]]$value
[1] 6.257932
[[2]]$counts
function gradient
73 NA
[[2]]$convergence
[1] 0
[[2]]$message
NULL
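As a sanity check on the output above: for a normal model the maximum likelihood estimates have a closed form, namely the sample mean and the square root of the mean squared deviation (divisor n, not n - 1), so both starting points should converge to roughly the same values. A minimal sketch, where dnorm(..., log = TRUE) is used in place of log(dnorm(...)) and the names fit and mle are only illustrative:

```r
set.seed(123)
x <- rnorm(10, 2, 0.5)

# negative log-likelihood with both parameters in one vector, as above
ff <- function(x, parOpt) -sum(dnorm(x, parOpt[1], parOpt[2], log = TRUE))

fit <- optim(par = c(2, 0.2), fn = ff, x = x)

# closed-form MLE for comparison
mle <- c(mean(x), sqrt(mean((x - mean(x))^2)))

rbind(optim = fit$par, closed_form = mle)
```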
Hope this helps.
As an alternative, you can find the same solution using the command fitdist of the fitdistrplus package:
library(fitdistrplus)
set.seed(123)
x <- rnorm(10,2,0.5)
mu.start <- c(2,0.5)
sd.start <- c(0.2,0.3)
op <- vector("list",2)
for(i in 1:2){
op[[i]] <- fitdist(x,"norm", start=c(mu.start[i],sd.start[i]))
}
op
The result is:
[[1]]
Fitting of the distribution ' norm ' by maximum likelihood
Parameters:
estimate Std. Error
1 2.0372546 0.1430588
2 0.4523918 0.1011464
[[2]]
Fitting of the distribution ' norm ' by maximum likelihood
Parameters:
estimate Std. Error
1 2.037165 0.1430719
2 0.452433 0.1011694
Right now, I have a combn from the built in dataset iris. So far, I have been guided into being able to find the coefficient of lm() of the pair of values.
myPairs <- combn(names(iris[1:4]), 2)
formula <- apply(myPairs, MARGIN=2, FUN=paste, collapse="~")
model <- lapply(formula, function(x) lm(formula=x, data=iris)$coefficients[2])
model
However, I would like to go a few steps further and use the coefficient from lm() to be used in further calculations. I would like to do something like this:
Coefficient <- lm(formula=x, data=iris)$coefficients[2]
Spread <- myPairs[1] - Coefficient*myPairs[2]
library(tseries)
adf.test(Spread)
The procedure itself is simple enough, but I haven't been able to find a way to do this for each combn in the data set. (As a sidenote, the adf.test would not be applied to such data, but I'm just using the iris dataset for demonstration).
I'm wondering, would it be better to write a loop for such a procedure?
You can do all of this within combn.
If you just wanted to run the regression over all combinations, and extract the second coefficient you could do
fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2]
combn(names(iris[1:4]), 2, fun)
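Because combn returns the slopes without labels, it can help to name them by the formula they came from. A small sketch along those lines; the names pairs and slopes are made up here, and as.formula is called explicitly just to make the conversion from text visible:

```r
# slope of the regression of the first variable of each pair on the second
fun <- function(x) coef(lm(as.formula(paste(x, collapse = "~")), data = iris))[[2]]

pairs <- combn(names(iris[1:4]), 2)
slopes <- combn(names(iris[1:4]), 2, fun)

# label each slope with its formula text
names(slopes) <- apply(pairs, 2, paste, collapse = "~")
slopes

# spot check the first pair against a direct fit
all.equal(slopes[["Sepal.Length~Sepal.Width"]],
          coef(lm(Sepal.Length ~ Sepal.Width, data = iris))[[2]])
```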
You can then extend the function to calculate the spread
fun <- function(x) {
est <- coef(lm(paste(x, collapse="~"), data=iris))[2]
spread <- iris[,x[1]] - est*iris[,x[2]]
adf.test(spread)
}
out <- combn(names(iris[1:4]), 2, fun, simplify=FALSE)
out[[1]]
# Augmented Dickey-Fuller Test
#data: spread
#Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
#alternative hypothesis: stationary
Compare results to running the first one manually
est <- coef(lm(Sepal.Length ~ Sepal.Width, data=iris))[2]
spread <- iris[,"Sepal.Length"] - est*iris[,"Sepal.Width"]
adf.test(spread)
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary
Sounds like you would want to write your own function and call it in your myPairs loop (apply):
yourfun <- function(pair){
fm <- paste(pair, collapse='~')
coef <- lm(formula=fm, data=iris)$coefficients[2]
Spread <- iris[,pair[1]] - coef*iris[,pair[2]]
return(Spread)
}
Then you can call this function:
model <- apply(myPairs, 2, yourfun)
I think this is the cleanest way. But I don't know what exactly you want to do, so I was making up the example for Spread. Note that in my example you get warning messages, since column Species is a factor.
A few tips: I wouldn't give the things you create the same names as built-in functions (model and formula come to mind in your original version).
Also, you can simplify the paste you are doing; see below.
Finally, a more general statement: don't feel like everything needs to be done in an *apply of some kind. Sometimes brevity makes code harder to understand, and remember, the *apply functions offer, at best, marginal speed gains over a simple for loop. (This was not always the case with R, but it is at this point.)
# Get pairs
myPairs <- combn(x = names(x = iris[1:4]),m = 2)
# Just directly use paste() here
myFormulas <- paste(myPairs[1,],myPairs[2,],sep = "~")
# Store the models themselves into a list
# This lets you go back to the models later if you need something else
myModels <- lapply(X = myFormulas,FUN = lm,data = iris)
# If you use sapply() and this simple function, you get back a named vector
# This seems like it could be useful to what you want to do
myCoeffs <- sapply(X = myModels,FUN = function (x) {return(x$coefficients[2])})
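# Now compute the spreads pair by pair, matching each coefficient to its own model by position
# (looking coefficients up by name would pick the wrong one, since predictor names repeat)
mapply(function(y, x, b) iris[[y]] - b * iris[[x]], myPairs[1,], myPairs[2,], myCoeffs)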
If I am understanding right, I believe the above will work. Note that the names on the output at present will be nonsensical, you would need to replace them with something of your own design (maybe the values of myFormulas).
I'm working on a project where I need to collect the intercept, slope, and R squared of several linear regressions. Since I need at least 200 samples of different sample sizes, I set up the code below, but it only saves the last iteration of the loop. Any suggestions on how I can record each iteration so that I have all of the coefficients and R-squared values that I require?
for (i in 1:5) {
x <- as.data.frame(mydf[sample(1:1000,25,replace=FALSE),])
mylm <- lm(spd66305~spd66561, data=x)
coefs <- rbind(lman(mylm))
total.coefs <- rbind(coefs)
}
total.coefs
The function used in the loop is below if that is needed.
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
tbl <- c(intercept,slope,r2,r)
}
Thanks for the help.
Before starting your loop, you can write
total.coefs <- data.frame(), to initialise an empty data.frame. Then in your loop you want to update the total.coefs, as follows: total.coefs <- rbind(total.coefs, coefs). Finally replace the last line in lman by:
tbl <- data.frame(intercept=intercept, slope=slope, r2=r2, r=r).
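Put together, the corrected loop might look like the following sketch. Since mydf is not available, it uses the built-in mtcars data with mpg ~ hp as a stand-in for spd66305 ~ spd66561; that substitution, and the use of [[ ]] to drop the coefficient names, are assumptions made for this illustration.

```r
# lman now returns a one-row data frame instead of an unnamed vector
lman <- function(mylm) {
  r2 <- summary(mylm)$r.squared
  data.frame(intercept = coef(mylm)[[1]],
             slope     = coef(mylm)[[2]],
             r2        = r2,
             r         = sqrt(r2))
}

set.seed(1)
total.coefs <- data.frame()  # initialise before the loop
for (i in 1:5) {
  x <- mtcars[sample(nrow(mtcars), 25, replace = FALSE), ]
  mylm <- lm(mpg ~ hp, data = x)
  # append this iteration's row instead of overwriting
  total.coefs <- rbind(total.coefs, lman(mylm))
}
total.coefs
```

Each pass through the loop adds one row, so total.coefs ends up with one row per sample.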
Here's how I'd do it, for example on the mtcars data. Note: it's not advisable to use rbind inside the loop if you're building a data structure. You can call rbind once after the looping is done, which is much less stressful. I prefer to do this type of operation with a list.
Here I wrapped my lapply loop in do.call(rbind, ...), which binds all of the list elements together in a single call. Another thing to note is that I take the samples prior to entering the loop. This makes debugging easier and can be more efficient overall.
reps <- replicate(3, sample(nrow(mtcars), 5), simplify = FALSE)
do.call(rbind, lapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}))
# (Intercept) hp R
# [1,] 33.29360 -0.08467169 0.5246208
# [2,] 29.97636 -0.06043852 0.4770310
# [3,] 28.33462 -0.05113847 0.8514720
The following transposed vapply loop produces the same result, and is often faster when you know the type of result you expect
t(vapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}, numeric(3)))
Another way to record each loop would be to make the work reproducible and keep your datasets around in case you have extreme values, missing values, new questions about the datasets, or other surprises that need to be investigated.
This is a similar case using the iris dataset.
# create sample data
data(iris)
iris <- iris[ ,c('Sepal.Length','Petal.Length')]
# your function with data.frame fix on last line
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
data.frame(intercept,slope,r2,r)
}
# set seed to make reproducible
set.seed(3)
# create all datasets
alldatasets <- lapply(1:200,function(x,df){
df[sample(1:nrow(df),size = 50,replace = F), ]
},df = iris)
# create all models based on alldatasets
allmodels <- lapply(alldatasets,lm,formula = Sepal.Length ~ Petal.Length)
# run custom function on all models
lmanresult <- lapply(allmodels,lman)
# format results
result <- do.call('rbind',lmanresult)
row.names(result) <- NULL
# inspect the 129th sample, model, and result
alldatasets[[129]]
summary(allmodels[[129]])
result[129, ]