I have a time-series dataset and am taking a rolling average of the past 2 years of coefficients and applying that to the current year variables. I created a method that applies the average, but I am wondering if I can do this with a function like predict so I don't have to individually write out each variable.
A simplified version of my code is like this:
formula <- as.formula(a ~ b + c)
subset1 <- which(data$year == 2009)
fm1 <- lm(formula, data[subset1, ])
subset2 <- which(data$year == 2010)
fm2 <- lm(formula, data[subset2, ])
# Take the average of the coefficients from these two regressions
avg_coeff <- (fm1$coeff + fm2$coeff) / 2
# Apply the average coefficients to the current year's data
subset3 <- which(data$year == 2011)
subset_data <- data[subset3, ]
a_hat <- avg_coeff[1] + avg_coeff[2] * subset_data$b + avg_coeff[3] * subset_data$c
This method works, but I want to keep an lm object so I can just use the predict method:
a_hat <- predict(fm, subset_data)
where fm$coefficients contains avg_coeff. I tried
fm2$coeff <- avg_coeff
and while this does change fm2$coeff, when I run predict with fm2 it still uses the original coefficients.
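The likely culprit is that $ partially matches names when reading but not when assigning: fm2$coeff <- avg_coeff creates a brand-new component named coeff and leaves fm2$coefficients (the component predict actually reads) untouched. A minimal sketch of the fix, assigning to the full name:
# Assign to the full name; fm2$coeff would create a new list element instead.
# Note: standard errors and prediction intervals from predict() would still be
# based on the original fit; only the point predictions use the averaged coefficients.
fm2$coefficients <- avg_coeff
a_hat <- predict(fm2, newdata = subset_data)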
If you want to do this for each year, maybe you can do something like:
library(plyr)
formula <- as.formula(a ~ b + c)
fn <- function(X) {
  fm <- lm(formula, data = X)
  coef(fm)
}
# One named coefficient vector per year
coeffs <- dlply(data, .(year), fn)
# Average each coefficient across years (not across coefficients within a year)
avg_coeffs <- rowMeans(do.call(cbind, coeffs))
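The same thing works without plyr, in case you prefer base R (a sketch assuming the same data frame with a year column as above):
# sapply simplifies the per-year coefficient vectors into a coefficients-by-years matrix
coeffs <- sapply(split(data, data$year), function(X) coef(lm(formula, data = X)))
avg_coeffs <- rowMeans(coeffs)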
I am new to R and am trying to loop a mixed model across 90 columns in a dataset. My dataset looks like the following one, but has 90 predictors instead of the 7 shown, and I need to evaluate each as a fixed effect in consecutive models.
I then need to store the model output (coefficients and P values) to finally construct a figure summarizing the effect sizes of each predictor. I am aware of the discussion around P value estimates from lme4 mixed models.
For example:
library(dplyr) # for tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times = 25),
                 time = sample(1:800),
                 experiment = rep(1:4, times = 200),
                 Y = sample(1:800),
                 predictor_1 = runif(800),
                 predictor_2 = rnorm(800),
                 predictor_3 = sample(1:800),
                 predictor_4 = sample(1:800),
                 predictor_5 = seq(1:800),
                 predictor_6 = sample(1:800),
                 predictor_7 = runif(800)) %>%
  arrange(id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p-values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set up a loop that repeats the model across the N predictors in my dataset and stores the coefficients and P values in a data frame.
I have been able to iterate across all the predictors fitting linear models (instead of mixed models) using lapply, but I have failed to apply this strategy to mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a vectorized function.
library(lmerTest)
# Restricted model without any of the predictors
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
# Refit the restricted model once per predictor
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
# Row 3 of the coefficient matrix is the added predictor;
# columns 1 and 5 hold the estimate and Pr(>|t|)
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
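To get from res to the data frame and figure the question asks for, something like this should work (a sketch; the plot styling is only a placeholder):
library(ggplot2)
# check.names = FALSE keeps the "Pr(>|t|)" column name intact
res_df <- data.frame(predictor = rownames(res), res, check.names = FALSE)
ggplot(res_df, aes(x = Estimate, y = predictor)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dashed')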
I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would further want to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in 1:7) {
  fit <- lm(distance ~ FAA_unique_with_duration_filtered[x], data = FAA_unique_with_duration_filtered)
  coeff_values <- summary(fit)$coefficients[, 1]
  coeff_value <- coeff_values[2]
  append(coeff_value_vector, coeff_value, after = length(coeff_value_vector))
  append(RCs_sign_vector, sign(coeff_values[2]), after = length(RCs_sign_vector))
}
Here, x will pick the first column, then the 2nd, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
  invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this. (The error itself occurs because single-bracket indexing, FAA_unique_with_duration_filtered[x], returns a one-column data frame, i.e. a list, which lm cannot use as a model variable; double brackets would extract the underlying vector.)
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use the following, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
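Applied to the data frame in the question, the same idea might look like this (a sketch; it assumes the 7 predictors sit in columns 1 to 7 and that distance is a column of the same data frame):
d <- FAA_unique_with_duration_filtered
# reformulate() builds the one-predictor formula distance ~ v
reg <- function(v) coef(lm(reformulate(v, response = "distance"), data = d))[[2]]
coeff_value_vector <- sapply(names(d)[1:7], reg)
RCs_sign_vector <- sign(coeff_value_vector)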
I want to predict a time series from a model I estimate with tslm from the forecast package. Here is some data:
x <- ts(rnorm(120,0,3) + 1:120 + 20*sin(2*pi*(1:120)/12), frequency=12, start= c(2000, 01, 01))
y <- ts(x + rnorm(length(x)), frequency=12, start= c(2000, 01, 01))
df <- data.frame(y, x)
So we have an (independent) variable x with some pattern and a (dependent) variable y which appears to be a noisy version of x. I fit the model like this:
fit <- tslm(y ~ trend + season + x, df)
summary(fit) looks okay, since x is highly significant and the estimate is close to 1. But running forecast(fit, h=20) gives me an error:
... variable lengths differ (found for 'x') ...
forecast(fit, h=length(x)) works (although plot(forecast(fit, h=length(x))) looks very strange, but that is another question).
To forecast y into the future using predictors like x, trend and season, new data for the predictors must be supplied for the number of periods ahead you want to forecast. This can be done using the newdata argument of forecast.lm (see ?forecast.lm). Below is an example with only x as a predictor, where we want to forecast y for the next 12 months.
library(forecast)
n <- 132
set.seed(1337)
x <- ts(rnorm(n,0,3) + 1:n + 20*sin(2*pi*(1:n)/12), frequency=12, start= c(2000, 01, 01))
#Dividing x into a train and test set, where the test set will be the data we use to forecast y
xtrain <- window(x, end=c(2009, 12))
xtest <- data.frame(x=window(x, start=c(2010, 1)))
y <- window(ts(x + rnorm(length(x)), frequency=12, start= c(2000, 01, 01)), end=c(2009,12))
dftrain <- data.frame(y, x=xtrain)
fit <- tslm(y ~ x, dftrain)
f <- forecast(fit, newdata=xtest)
plot(f)
What makes the tslm function a bit 'special' is that it generates data for trend and seasonality automatically if this is specified, e.g.
fit2 <- tslm(y~trend+season)
f2 <- forecast(fit2, h=12)
plot(f2)
Here it automatically generates data for the newdata argument, which can be found here:
f2$newdata #Beware, season is a factor: str(f2$newdata)
If we combine trend, season and x, we get
fit3 <- tslm(y~trend+season+x, data=dftrain)
f3 <- forecast(fit3, newdata=xtest)
f3$newdata
Strange! Even though we expect it to use all predictors for the forecast, trend and season are not included in f3$newdata. We can try to include trend and season manually and check whether we get the same results:
#Using `seasonaldummy` from the `forecast` package to generate the seasonal dummy matrix.
#Beware: `forecast::seasonaldummy` uses December as the reference period by default, while `tslm` uses January.
#This should not affect our results, except for the interpretation of the seasonal coefficients.
dftrain2 <- data.frame(y, x=xtrain, customTrend=1:(n-12), forecast::seasonaldummy(xtrain))
dftest2 <- data.frame(x=xtest, customTrend = (n-12+1):n, forecast::seasonaldummy(xtrain, h=12))
fit4 <- tslm(y~customTrend+Jan+Feb+Mar+Apr+May+Jun+Jul+Aug+Sep+Oct+Nov+x, data=dftrain2)
f4 <- forecast(fit4, newdata = dftest2)
f4$newdata #now everything is included.
#Compare the forecasts generated by fit3 and fit4:
f3$mean - f4$mean #Close enough
all.equal(f3$mean, f4$mean) #Point forecast
all.equal(f3$lower, f4$lower) #PIs
all.equal(f3$upper, f4$upper) #PIs
We can also include the seasonal variable as a factor, which is a bit easier (though less intuitive in my opinion) and yields coefficient estimates completely identical to those of fit3.
dftrain3 <- data.frame(y, x=xtrain, customTrend=1:(n-12), customSeason = rep(factor(1:12, levels=1:12), 10))
dftest3 <- data.frame(x=xtest, customTrend = (n-12+1):n, customSeason = factor(1:12, levels=1:12))
fit5 <- tslm(y~customTrend+customSeason+x, data=dftrain3)
all(coefficients(fit3) == coefficients(fit5))
f5 <- forecast(fit5, newdata = dftest3)
f5$newdata
I performed a regression analysis in R on some dataset and am trying to predict the contribution of each individual independent variable to the dependent variable for each row in the dataset.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1, v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it to multiply the coefficients from the regression result with the variable data, so something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The prediction function is off by 0.1 or so. Also, if you add all the terms from the prediction function together with the constant, it doesn't add up to the total prediction (using type="response"). What does the predict function calculate here, and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?
All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about using type="terms" here.
The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function calculation (terms_predict[,1]) are different is because the columns in the terms matrix are centered around the mean, so their mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
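As a consistency check (a sketch using the objects above), the 'constant' attribute is just the mean of the fitted values, i.e. the intercept plus each coefficient times the mean of its column:
# 'constant' = b0 + b1*mean(v1) + b2*mean(v2) + b3*mean(v3)
all.equal(attr(terms_predict, 'constant'),
          unname(coefficients(regr)[1] + coefficients(regr)[2]*mean(m$v1) +
                 coefficients(regr)[3]*mean(m$v2) + coefficients(regr)[4]*mean(m$v3)))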
Hope this helps.
The function predict(..., type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied by its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
geom_line() +
theme_bw()
Currently, I am working with data frames in R, the first column of which is a numeric date, and the data are sorted in ascending order of date. I want to fit a model (the code I've provided is a simple OLS model) over a rolling 20-day period, but for now I've had to assume that I have exactly 124 observations per day, which forces me to use a for loop over fixed row blocks; that assumption does not actually hold. Is there a way for me to build the window without making that assumption? The current algorithm is below; any help would be much appreciated. The inputs are a data set and two integers, predict and predictor.
rollerOLS <- function(data, predict, predictor) {
  res <- list()
  alpha <- c()
  beta <- c()
  m <- dim(data)[1]
  for (i in 1:(floor(m/124) - 10)) {
    # Assumes exactly 124 observations per day:
    # a 10-day estimation window followed by a 1-day prediction window
    data.new <- as.data.frame(data[c((1 + (124*(i-1))):((i+9)*124)), ])
    data.pred <- as.data.frame(data[c((1 + (124*(i+9))):((i+10)*124) - 1), ])
    n <- dim(data.new)[1]
    k <- dim(data.pred)[1]
    # x and y are offset by one observation
    x <- data.new[-1, predictor]
    y <- data.new[-n, predict]
    mod <- lm(y ~ x)
    ts <- mod$coefficients[1] + mod$coefficients[2] * data.pred[-1, predictor]
    actual <- data.pred[-k, predict]
    alpha[i] <- mod$coefficients[1]
    beta[i] <- mod$coefficients[2]
  }
  coef <- as.data.frame(cbind(alpha, beta))
  res$coefs <- coef
  res <- as.data.frame(res)
  return(res)
}
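One possible direction (a sketch, not a tested solution): build each window from the unique values of the date column instead of a fixed row count per day. Here the window is assumed to be 10 days, as in the loop above, and the one-observation offset between x and y is omitted for brevity:
rollerOLS2 <- function(data, predict, predictor, date_col = 1, window = 10) {
  days <- sort(unique(data[[date_col]]))
  alpha <- beta <- numeric(length(days) - window)
  for (i in seq_along(alpha)) {
    # All rows whose date falls in the current window, however many per day
    train <- data[data[[date_col]] %in% days[i:(i + window - 1)], ]
    mod <- lm(train[[predict]] ~ train[[predictor]])
    alpha[i] <- coef(mod)[1]
    beta[i] <- coef(mod)[2]
  }
  data.frame(alpha = alpha, beta = beta)
}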