I am trying to fit a glm where y, x1, x2, ..., xn are layers in a RasterStack object. I have tried converting the raster stack to a data.frame, but I get a vector size error, as shown below. Instead, I'd like to fit the regression model with the raster layers as input, without converting the layers to a data frame, given the file size and the memory error. Would that be possible, and how would you configure it?
The model that I'd like to fit is of the form m1 <- glm(y1 ~ x1 + x2, family = binomial(), data = layers), but I don't get to this point because I can't convert the data to a data frame for model fitting.
dat<-as.data.frame(stack(layers[c(y1,x1,x2)]))
Error: cannot allocate vector of size 40GB
Here are some regression examples with Raster* data (from ?calc):
# Create example data
r <- raster(nrow=10, ncol=10)
s1 <- lapply(1:12, function(i) setValues(r, rnorm(ncell(r), i, 3)))
s2 <- lapply(1:12, function(i) setValues(r, rnorm(ncell(r), i, 3)))
s1 <- stack(s1)
s2 <- stack(s2)
# Regression of values in one brick (or stack) with another
s <- stack(s1, s2)
# s1 and s2 have 12 layers; coefficients[2] is the slope
fun <- function(x) { lm(x[1:12] ~ x[13:24])$coefficients[2] }
x1 <- calc(s, fun)
# Regression of values in one brick (or stack) with 'time'
time <- 1:nlayers(s)
fun <- function(x) { lm(x ~ time)$coefficients[2] }
x2 <- calc(s, fun)
# Get multiple layers, e.g. the slope and intercept
fun <- function(x) { lm(x ~ time)$coefficients }
x3 <- calc(s, fun)
# In some cases, a much (> 100 times) faster approach is to directly use linear algebra and pre-compute some constants
# add 1 for a model with an intercept
X <- cbind(1, time)
# pre-computing constant part of least squares
invXtX <- solve(t(X) %*% X) %*% t(X)
## much reduced regression model; [2] is to get the slope
quickfun <- function(y) (invXtX %*% y)[2]
x4 <- calc(s, quickfun)
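For the original question (a single binomial glm where every cell is one observation, rather than a per-cell regression), a memory-friendly sketch is to fit on a random subsample of cells and then predict back over the full stack. This assumes the layers in `layers` are named y1, x1, x2 and that fitting on a sample of cells is acceptable:
library(raster)
# draw values for a random sample of cells (rows = cells, columns = layers), skipping NAs
v <- as.data.frame(sampleRandom(layers, size = 100000, na.rm = TRUE))
m1 <- glm(y1 ~ x1 + x2, family = binomial(), data = v)
# predict back onto the full stack; raster processes this in blocks, not all in memory
p <- predict(layers, m1, type = "response")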
Related
I have proposed my own model and am now trying to implement it in R, but I am stuck on how to compute the observed information matrix from my formula. I have used glm() to fit a logistic model with a penalty term on a binary data set x1, x2, x3, y (all binary 0/1); fit1 is the glm() model and dev.new is the penalized deviance.
X.tilde <- as.matrix(x) # n*p matrix of the data set
W <- Diagonal(length(y), weights) # n*n diagonal matrix of the weights
qq <- exp(fit1$fitted.values)/(1 + exp(fit1$fitted.values)) # n*1 vector (pi=probability of the logistic model )
cc <- t(1 - qq) # 1*n row vector
gg <- (dev.new) * t(dev.new) # p*p matrix
ff <- (X.tilde) %*% t(X.tilde) # n*n matrix
pp <- exp(fit1$coefficients)/(1 + exp(fit1$coefficients)) # p*1 matrix
ss <- t(1/(1 + exp(fit1$coefficients))) # 1*p row vector
aa <- t(X.tilde) %*% qq %*% cc %*% W %*% (X.tilde) # p*p matrix
firstP <- (aa + (pp * ss)) # p*p matrix
info.mat <- firstP+gg # p*p matrix
info.mat <- as.matrix(info.mat)
This code returns the following error:
Error in e1 + Matrix(e2) :
Matrices must have same number of rows for arithmetic
In my theory the dimensions work out, but when I implement it they do not.
Any help?
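One plausible culprit (an assumption on my part, since the target formula isn't shown): pp * ss multiplies a length-p vector elementwise with a 1*p matrix, so it yields a 1*p matrix rather than the p*p term the comment expects, and adding that to the p*p matrix aa raises exactly this Matrix error. A sketch that checks the dimensions and uses the outer product, if that is what the theory calls for:
dim(aa)      # p x p
dim(pp * ss) # 1 x p -- not conformable with aa
firstP <- aa + (pp %*% ss) # outer product gives p x p, if the formula intends pp ss^T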
I wrote an algorithm which fits a linear model with lm() and then "updates" the response variable iteratively. The problem is: in a high-dimensional scenario, fitting linear models creates a bottleneck.
On the other hand, most of the work required is a matrix inversion that depends only on the covariate matrix X, i.e., the coefficients are given by: solve(t(X) %*% X) %*% t(X) %*% y.
Reading lm() code, I understand that R uses QR decomposition.
Is it possible to recover the internal matrix operation used and fit a new model with new y values faster?
Here's a minimal example:
set.seed(1)
X <- matrix(runif(400*150000), nrow = 150000)
y1 <- runif(150000)
y2 <- runif(150000)
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
Theoretically, mod2 "repeats" costly matrix operations identical to the ones made in the first lm() call.
I want to keep using lm() for its efficient implementation and its ability to handle rank-deficient matrices automatically.
# Data
set.seed(1)
n = 5
X <- matrix(runif(5*n), nrow = n)
y1 <- runif(n)
y2 <- runif(n)
# lm models
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
# Obtain QR decomposition of X
q = qr(X)
# Reuse 'q' to obtain fitted values repeatedly
mod1_fv = qr.fitted(q, y1)
mod2_fv = qr.fitted(q, y2)
# Compare fitted values from reusing 'q' to fitted values in 'lm' models
Vectorize(all.equal)(unname(mod1$fitted.values), mod1_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Vectorize(all.equal)(unname(mod2$fitted.values), mod2_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Created on 2019-09-06 by the reprex package (v0.3.0)
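The same decomposition can also be reused for coefficients, via base R's qr.coef. One caveat about the example above: lm(y1 ~ X) adds an intercept, and with a square full-rank X the fitted values equal y exactly, which is why qr(X) alone matched; in general, decompose the design matrix with the intercept column included. A sketch on better-conditioned data:
set.seed(1)
X <- matrix(runif(5 * 100), nrow = 100)
y1 <- runif(100)
y2 <- runif(100)
mod1 <- lm(y1 ~ X)
# decompose the full design matrix (with intercept) once, reuse for many responses
qX <- qr(cbind(1, X))
b1 <- qr.coef(qX, y1) # coefficients for y1
b2 <- qr.coef(qX, y2) # coefficients for y2
all.equal(unname(b1), unname(coef(mod1))) # should be TRUE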
Have you tried just fitting a multivariate model? I haven't checked the code, but on my system it takes almost half the time of the two separate fits, so I wouldn't be surprised if it's doing what you suggest behind the scenes. That is,
mods <- lm(cbind(y1, y2) ~ X)
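A rough timing sketch (with sizes shrunk from the question's so it runs quickly; the comparison, not the exact numbers, is the point):
set.seed(1)
X <- matrix(runif(400 * 10000), nrow = 10000)
y1 <- runif(10000)
y2 <- runif(10000)
system.time(lm(cbind(y1, y2) ~ X))     # one multivariate fit
system.time({lm(y1 ~ X); lm(y2 ~ X)})  # two separate fits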
I am trying to do bootstrap regression by re-sampling X and Y from the original sample.
I followed a more manual approach (without using any bootstrapping package).
This is my work so far:
library(dplyr) # for sample_n() and the %>% pipe
set.seed(326581)
X1 <- rnorm(10, 0, 1)
Y1 <- rnorm(10, 0, 2)
data <- data.frame(X1, Y1)
lst <- replicate(
  100,
  data %>% sample_n(10, replace = TRUE),
  simplify = FALSE)
The list contains 100 samples, where each sample has 2 columns (X, Y) and a sample size of 10. These are the bootstrap samples.
To get the bootstrap residuals, I separated the X and Y columns into two separate data frames as follows:
new1 <- data.frame(lapply(lst, `[`, 'X1'))
new2 <- data.frame(lapply(lst, `[`, 'Y1'))
After that I tried to store the residuals from each fitted model using the following code:
res=c()
for(i in 1:100)
{
res[i]=residuals(lm(new2[,i]~new1[,i]))
}
But it seems like something is wrong. Can anyone help me figure it out?
By the way, is there an easier approach than this?
You're making this unnecessarily complicated. The whole advantage of storing objects in a list is that you can easily loop through them with e.g. lapply or sapply.
So for example, to store the residuals of a linear model fit you can do
res <- lapply(lst, function(df) residuals(lm(Y1 ~ X1, data = df)))
This fits a linear model of the form lm(Y1 ~ X1) to all data.frames in lst and stores the residuals in a list of 100 vectors:
length(res)
#[1] 100
You could also store residuals based on an lm fit to all 100 sampled data.frames in a 10x100 matrix by using sapply instead of lapply
res <- sapply(lst, function(df)
residuals(lm(Y1 ~ X1, data = df)))
dim(res)
#[1] 10 100
Update
In response to your comment you can do the following
First calculate and store residuals and residual-derived weights in every data.frame in the list.
# Add residuals and weights to lst
lst <- lapply(lst, function(df) {
  df$res <- residuals(lm(Y1 ~ X1, data = df))
  df$weights <- 1 / fitted(lm(abs(res) ~ X1, data = df))^2
  df
})
Then run a weighted linear regression and return the second (slope) coefficient
# Return 2nd coefficient of weighted regression
coeff <- lapply(lst, function(df)
coefficients(lm(Y1 ~ X1, data = df , weights = weights))[2])
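Since coeff is a list of named length-1 vectors, unlist collapses it into a plain numeric vector of the 100 bootstrap slopes:
slopes <- unlist(coeff) # numeric vector of length 100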
Question to be answered
Does anyone know how to solve the attached problem in two lines of code? I believe as.matrix would work to create a matrix X, and then t(X), %*%, and solve() could be combined to get the answer. However, it does not seem to be working. Any answers will help, thanks.
I would recommend using read.csv instead of read.table.
It would be useful for you to go over the difference between the two functions in this thread: read.csv vs. read.table
df <- read.csv("http://pengstats.macssa.com/download/rcc/lmdata.csv")
model1 <- lm(y ~ x1 + x2, data = df)
coefficients(model1) # get the coefficients of your regression model1
summary(model1) # get the summary of model1
Based on the answer of @kon_u, here is an example of how to do it by hand:
df <- read.csv("http://pengstats.macssa.com/download/rcc/lmdata.csv")
model1 <- lm(y ~ x1 + x2, data = df)
coefficients(model1) # get the coefficients of your regression model1
summary(model1) # get the summary of model1
### Based on the formula
X <- cbind(1, df$x1, df$x2) # the column of 1s accounts for the intercept
Y <- df$y
bhat <- solve(t(X) %*% X) %*% t(X) %*% Y # coefficients
bhat # Note that we got the same coefficients with the lm function
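As a quick sanity check, the by-hand estimates can be compared with lm's directly:
all.equal(as.numeric(bhat), unname(coefficients(model1))) # should be TRUE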
Currently, I am working with data frames in R whose first column is a numeric date. The data are sorted in ascending order of date. I want to fit a model (the code I've provided is a simple OLS model) over a 20-day period, but for now I've had to assume that I have exactly 124 observations per day so that I can use a for loop; that assumption does not actually hold. Is there a way for me to use a 20-day window without making that assumption? The current algorithm is below. Any help would be much appreciated. The inputs are a data set and two integers, predict and predictor.
rollerOLS <- function(data, predict, predictor){
res <- list()
alpha <- c()
beta <- c()
m <- dim(data)[1]
for(i in 1:(floor(m/124)-10)){
data.new <- as.data.frame(data[c((1+(124*(i-1))):((i+9)*124)),])
data.pred <- as.data.frame(data[c((1+(124*(i+9))):((i+10)*124)-1),])
n <- dim(data.new)[1]
k <- dim(data.pred)[1]
x <- data.new[-1,predictor]
y <- data.new[-n, predict]
mod <- lm(y ~ x)
ts <- mod$coefficients[1] + mod$coefficients[2]*data.pred[-1,predictor]
actual <- data.pred[-k,predict]
alpha[i] <- mod$coefficients[1]
beta[i] <- mod$coefficients[2]
}
coef <- as.data.frame(cbind(alpha, beta))
res$coefs <- coef
res <- as.data.frame(res)
return(res)
}
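One way to drop the exactly-124-rows-per-day assumption (a sketch under assumptions: the date is in column 1, the helper name rollerOLS.byDate is mine, and the fit window stays at 10 days as in the loop above) is to index the window by unique dates instead of row counts:
rollerOLS.byDate <- function(data, predict, predictor, window = 10) {
  days <- sort(unique(data[, 1]))
  alpha <- beta <- c()
  for (i in 1:(length(days) - window)) {
    # all rows whose date falls in the current window, however many per day
    data.new <- data[data[, 1] %in% days[i:(i + window - 1)], ]
    n <- dim(data.new)[1]
    # keep the one-row lead/lag structure of the original loop
    x <- data.new[-1, predictor]
    y <- data.new[-n, predict]
    mod <- lm(y ~ x)
    alpha[i] <- mod$coefficients[1]
    beta[i] <- mod$coefficients[2]
  }
  data.frame(alpha, beta)
}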