I'm trying to perform cross-validation (CV) on my linear model, which has seasonal dummy variables, so I can't take a random sample.
y = rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12)
x = months(ISOdate(2012,1:12,1))
reg.data = data.frame(y, x)
model = lm(y ~ x, data = reg.data)
My CV function is:
cross.valid = function(model, min.fit = as.integer(nrow(model$model) * 0.7), h = 1)
{
  dados = model$model
  n.rows = nrow(dados)
  results = data.frame(pred = numeric(), actual = numeric())
  for (i in seq(1, n.rows - min.fit - h + 1, by = h))
  {
    dados.train = dados[1:(i + min.fit - 1), ]
    model <- update(model, data = dados.train)
    dados.pred = dados[(i + min.fit):(i + min.fit + h - 1), -1, drop = FALSE]
    predic = predict(model, newdata = dados.pred, interval = 'prediction')
    actual = dados[(i + min.fit):(i + min.fit + h - 1), 1]
    results = rbind(results, data.frame(pred = predic[1:h, 'fit'], actual = actual))
  }
  results
}
Example:
cv1 = cross.valid(model, h = 1)
mae = with(cv1, mean(abs(actual - pred )))
print(mae)
The MAE values for different horizons (h) come out almost identical. Is the code itself valid? Is there a better solution or package for doing this?
Thanks!
I don't think there is anything incorrect about your function. Investigate the forecast package; I suspect that it will provide many functions that you need.
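For example, recent versions of forecast provide tsCV() for exactly this kind of rolling-origin evaluation. A minimal sketch (a seasonal naive forecaster stands in for your lm fit here, since tsCV() expects a function of the series and the horizon rather than a fitted model):

library(forecast)
# rolling-origin CV errors for a seasonal naive forecaster on the same series
y.ts <- ts(y, frequency = 12)
e <- tsCV(y.ts, snaive, h = 1, initial = 84)  # 84 = the 70% min.fit you use
mean(abs(e), na.rm = TRUE)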
I have rewritten your function concisely:
set.seed(1)
y = rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12)
x = months(ISOdate(2012,1:12,1))
reg.data = data.frame(y, x)
min.fit <- as.integer(nrow(reg.data) * 0.7)
nrows <- nrow(reg.data)

pred.set <- function(i, h) {
  train <- reg.data[1:(i + min.fit - 1), ]
  test <- reg.data[(i + min.fit):(i + min.fit + h - 1), ]
  pred <- predict(lm(y ~ x, data = train), newdata = test)
  abs(test$y - pred)
}
pred.by.horiz <- function(h)
  mean(sapply(seq(1, nrows - min.fit - h + 1, by = h), pred.set, h = h))
pred.by.horiz matches the output of your function (and post-processing) exactly.
As you mentioned, the horizon does not appear to affect the MAE:
mae.by.h<-sapply(seq(nrows-min.fit),pred.by.horiz)
plot(mae.by.h,type='l',col='red',lwd=2,xlab='Horizon',ylab='Mean absolute error')
Perhaps you expected that the mean error would increase as the prediction horizon increases. For many time-series models this would be true, but in your linear model of months, adding more data doesn't help you predict the next point in the series (unless you add 12 months or more).
For example, consider what happens when h is 1. You begin with 84 months of data, 7 observations for each month. Now you add one point of data, which will be the next January, and attempt to predict the result of February. But that additional point will only help you predict the next January; that is how your linear model works. Look at the summary of the model:
lm(y ~ x, data = reg.data)
Coefficients:
(Intercept) xAugust xDecember xFebruary xJanuary
17.11380 -32.74962 -17.81076 -0.03235 -6.63998
xJuly xJune xMarch xMay xNovember
-26.69203 -17.41170 2.96735 -7.11166 -25.43532
xOctober xSeptember
-33.56517 -36.93474
Each prediction is made solely on the basis of two quantities: the intercept and the coefficient for the predicted month. So predicting one point ahead isn't any easier than predicting five points ahead. That is why the MAE isn't rising as the horizon increases; the problem is in the way you modeled the data, not in the MAE calculation.
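You can check this directly from the fit (a quick illustration using the model object defined above):

# the forecast for any February is just the intercept plus the February
# coefficient, no matter how far ahead that February lies
predict(model, newdata = data.frame(x = "February"))
coef(model)["(Intercept)"] + coef(model)["xFebruary"]  # same number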
One thing I didn't completely understand about your function is why you decided to increment the size of the train set by h on each iteration. It is revealing to look at what happens when you try to increment by 1:
# Code to increment by 1
pred.by.horiz2 <- function(h)
  mean(sapply(seq(1, nrows - min.fit - h + 1, by = 1), pred.set, h = h))
mae.by.h2<-sapply(seq(nrows-min.fit),pred.by.horiz2)
plot(mae.by.h2,type='l',col='red',lwd=2,xlab='Horizon',ylab='Mean absolute error')
The pattern here is complex, but you'll note that the MAE starts to decrease at a horizon of 12, once the horizon is large enough that the next point from the same month can be used.
The basic gist is that I have a set of housing data and need to build a model that minimizes the difference between the predicted and actual price of a house, based on the dataset. So I created this bit of code to test a range of different numerator values, which map to exponents, and find the one that minimizes that difference. I'm using the median instead of the mean because the data isn't exactly normal.
Since I only have experience with lm(), I'm using it to estimate the coefficients and C values. But since the model likes exponents, I also have to test various exponents. The code does this for each of the variables and then goes back to the first and re-evaluates it based on the other exponents. The search starts with all exponents equal to 1, i.e. the basic linear model. I know this is probably horribly inefficient and somewhat wasteful, but I'm in my first R class, so sorry about the mess and/or convoluted coding logic.
Is there any way to do the same thing more efficiently? Also, I can't really reduce the number of variables: the model likes having more variables and produces a greater margin of error when they aren't present.
library(dplyr)  # for %>% and mutate()
library(modelr) # for add_predictions()

w <- seq(1, 10000, 1)
r <- seq(1, 10000, 1)
t <- seq(1, 10000, 1)
z <- seq(1, 10000, 1)
s <- seq(1, 10000, 1)
coef_1 <- rep(6000, 8)
v <- rep(6000, 8)

for (l_1 in 1:10) {
  for (t_1 in 1:8) {
    for (i in 1:10000) {
      coef_1[t_1] <- i
      mod5 <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^((coef_1[1]-5000)/1000)) +
                   I(LND_SQFOOT^((coef_1[2]-5000)/1000)) +
                   I(RAIL_DIST^((coef_1[3]-5000)/1000)) +
                   I(OCEAN_DIST^((coef_1[4]-5000)/1000)) +
                   I(CNTR_DIST^((coef_1[5]-5000)/1000)) +
                   I(HWY_DIST^((coef_1[6]-5000)/1000)) +
                   I(structure_quality^((coef_1[7]-5000)/1000)) +
                   SUBCNTR_DI + SPEC_FEAT_VAL +
                   I(exp(((coef_1[8]-5000)/1000)*SPECIAL_RATIO)) + age,
                 data = kaggle_transform_final)
      kaggle_new <- kaggle_transform_final %>%
        add_predictions(model = mod5, var = "prediction") %>%
        mutate(new_predict = exp(prediction)) %>%
        mutate(new_difference = abs(new_predict - SALE_PRC)/SALE_PRC) %>%
        mutate(average_percent_difference = median(new_difference)) %>%
        mutate(mean_percent_difference = mean(new_difference)) %>%
        mutate(quart_75 = quantile(new_difference, .75))
      w[i] <- kaggle_new$average_percent_difference[1]
      r[i] <- kaggle_new$mean_percent_difference[1]
      t[i] <- kaggle_new$quart_75[1]
      z[i] <- i
      s[i] <- (i - 5000)/1000
      if (i %% 100 == 0) { show(i) }
    }
    u <- data.frame(median_diff = w, mean_diff = r, quart_75 = t,
                    actual = s, number = z) %>%
      arrange(median_diff)
    coef_1[t_1] <- u$number[1]
    v[t_1] <- u$actual[1]
    show(coef_1)
  }
}
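One direction that might be more efficient (a rough sketch I have not verified, reusing the column names above) is handing the exponent search to optim() instead of looping over a grid:

# objective: the median absolute percentage difference used in the loop above
exp_objective <- function(p) {
  mod <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^p[1]) + I(LND_SQFOOT^p[2]) +
              I(RAIL_DIST^p[3]) + I(OCEAN_DIST^p[4]) + I(CNTR_DIST^p[5]) +
              I(HWY_DIST^p[6]) + I(structure_quality^p[7]) +
              SUBCNTR_DI + SPEC_FEAT_VAL + I(exp(p[8]*SPECIAL_RATIO)) + age,
            data = kaggle_transform_final)
  pred <- exp(predict(mod, kaggle_transform_final))
  median(abs(pred - kaggle_transform_final$SALE_PRC) / kaggle_transform_final$SALE_PRC)
}

# start from exponents of 1 (the plain linear model) and adjust all eight jointly
opt <- optim(rep(1, 8), exp_objective, method = "Nelder-Mead")
opt$par   # estimated exponents
opt$value # best median percentage difference found

This adjusts all eight exponents at once rather than one variable at a time, at the cost of only finding a local optimum near the starting point.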
I have made an ODE model in R using the package deSolve. Currently the output of the model gives me the "observed" prevalence of a disease (i.e. the prevalence not accounting for diagnostic imperfection).
However, I want to adjust the model to output the "true" prevalence, using a simple adjustment formula called the Rogan-Gladen estimator (http://influentialpoints.com/Training/estimating_true_prevalence.htm):
True prevalence = (Apparent prevalence + (Specificity - 1)) / (Specificity + (Sensitivity - 1))
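Written as a standalone R helper, the adjustment is just the following (a direct transcription of the formula; the apparent prevalence in the example call is an arbitrary illustration, while se and sp match the parameters used below):

# Rogan-Gladen adjustment of an apparent prevalence
rogan_gladen <- function(ap, se, sp) (ap + sp - 1) / (se + sp - 1)
rogan_gladen(ap = 0.30, se = 0.95, sp = 0.92)
# note: the result goes negative whenever ap < 1 - sp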
As you will see in the code below, I have attempted to adjust only one of the differential equations (diggP).
Running the model without adjustment gives an expected output (a proportion between 0 and 1). However, attempting to adjust the model using the RG-estimator gives a spurious output (a proportion less than 0).
Any advice on what might be going wrong here would be very much appreciated.
# Load required packages
library(tidyverse)
library(broom)
library(deSolve)
# Set time (age) for function
time = 1:80
# Defining exponential decay of lambda over age
y1 = 0.003 + (0.15 - 0.003) * exp(-0.05 * time) %>% jitter(10)
df <- data.frame(t = time, y = y1)
fit <- nls(y ~ SSasymp(time, yf, y0, log_alpha), data = df)
fit
# Values of lambda over ages 1-80 years
data <- as.matrix(0.003 + (0.15 - 0.003) * exp(-0.05 * time))
lambda<-as.vector(data[,1])
t<-as.vector(seq(1, 80, by=1))
foi<-cbind(t, lambda)
foi[,1]
# Making lambda varying by time useable in the ODE model
input <- approxfun(x = foi[,1], y = foi[,2], method = "constant", rule = 2)
# Model
ab <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    # lambda, changing by time
    import <- input(time)
    # Derivatives
    # RG estimator:
    # True prevalence = (apparent prev + (sp-1)) / (sp + (se-1))
    diggP <- (((import * iggN) - iggR * iggP) + (sp_igg - 1)) / (sp_igg + (se_igg - 1))
    diggN <- (-import * iggN) + iggR * iggP
    dtgerpP <- (0.5 * import) * tgerpN - tgerpR * tgerpP
    dtgerpN <- (0.5 * -import) * tgerpN + tgerpR * tgerpP
    # Return results
    return(list(c(diggP, diggN, dtgerpP, dtgerpN)))
  })
}
# Initial values
yini <- c(iggP=0, iggN=1,
tgerpP=0, tgerpN=1)
# Parameters
pars <- c(iggR = 0, tgerpR = (1/8)/12,
se_igg = 0.95, sp_igg = 0.92)
# Solve model
results<- ode(y=yini, times=time, func=ab, parms = pars)
# Plot results
plot(results, xlab="Time (years)", ylab="Proportion")
I have a linear model with the exchange rate as the dependent variable and 7 other independent variables (e.g. inflation, interest rate, etc.). I have quarterly data from 1993Q1 to 2011Q4.
I would like to create a rolling window regression (with the model above) with window size 60 (from 1993Q1 to 2007Q4) and use the estimated regression to forecast the rest of the sample. I would also like to compare this model with a Random Walk model (the exchange rate follows a R.W.). In the end, I would like to perform the dm.test and the clarkwest test (which does not run). Is my code right?
X = embed(data)
X = as.data.frame(X)
install.packages("foreach")
library(foreach)
w_size=60
n_windows = nrow(X) - 60 #until 2007Q4
forecasts = foreach(i = 1:n_windows, .combine = rbind) %do% {
  # = Select data for the window (in and out-of-sample) = #
  X_in = X[i:(w_size + i - 1), ]  # = change to X[1:(w_size + i - 1), ] for expanding window
  X_out = X[w_size + i, ]
  # = Regression Model = #
  m1 = lm(V1 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8, data = X_in)
  f1 = predict(m1, X_out)
  # = Random Walk = #
  f2 = tail(X_in$V1, 1)
  return(c(f1, f2))
}
e1 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,2]
library(tseries)
library(forecast)
dm.test(e1,e2, "l") #p-value is more than 5% for all the cases( two.sided, greater, less)
clarkwest(e1,e2)
It seems the clarkwest() function is not supported anymore, so I recently wrote my own function, CW. Note that I used normal standard errors, not Newey-West corrected ones.
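Something along these lines (a sketch of the Clark-West MSPE-adjusted statistic with plain standard errors; e1 holds the regression errors and e2 the random-walk errors, the random walk being the nested benchmark):

# Clark-West (2007) test for nested models, plain (non-Newey-West) SEs
CW <- function(e1, e2) {
  # adjusted loss differential; note that f1 - f2 = e2 - e1
  f <- e2^2 - e1^2 + (e2 - e1)^2
  t.stat <- mean(f) / (sd(f) / sqrt(length(f)))
  p.value <- 1 - pnorm(t.stat)  # one-sided: does the regression beat the random walk?
  c(t.stat = t.stat, p.value = p.value)
}
CW(e1, e2)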
To investigate your loop you could try:
i=1
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
Here you can see what the loop builds when i = 1.
I am trying to understand how to use mixed linear models to analyse my data by simulating a model, but I can't reproduce the input parameters. What am I missing?
I want to start by simulating a model with a random intercept for each subject. In the notation of the code below, the model I want to simulate and recover is:
y = gamma00 + gamma01*w[j] + uo[j] + beta1*x + e
If beta1 is small (< 11) I find gamma00 as the intercept in the fixed-effects section, but I am completely unable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)

# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1

# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve this parameter.

for (j in 1:numSub) {
  for (i in 1:numObj) {
    y[index] <- gamma00 + gamma01*w[j] + uo[j] + beta1*x[i] + e[index]
    index <- index + 1
  }
}

dataFrame2 <- data.frame(y = y, x = x,
                         subNo = factor(rep(1:numSub, each = numObj)),
                         objNum = factor(rep(1:numObj, numSub)))

model2 <- lmer(y ~ x + (1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed-up index: you should be using index rather than i to index x in your data-generation loop.
Because of the mix-up you were using the first subject's x values to generate the data for all the subjects, while the model used each individual's x values.
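Concretely, the data-generation line becomes:

y[index] <- gamma00 + gamma01*w[j] + uo[j] + beta1*x[index] + e[index]

With that change, summary(model2) recovers the slope beta1.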
I am working through an example from Aguinis, Gottfredson, & Culpepper (2013). They have provided some R code to perform a bootstrapping procedure in R to estimate confidence intervals for slope variances. This is their original R code:
library(lme4)
library(RLRsim)

# STEP 3: Random Intercept and Random Slope model
lmm.fit3 = lmer(Y ~ (Xc|l2id) + Xc + I(Wj - mean(Wj)), data = exdata, REML = F)

# Nonparametric Bootstrap Function
REMLVC = VarCorr(lmer(Y ~ Xc + (Xc|l2id) + I(Wj - mean(Wj)), data = exdata, REML = T))$l2id[1:2, 1:2]
U.R = chol(REMLVC)
REbootstrap = function(Us, es, X, gs){
  nj = nrow(Us)
  idk = sample(1:nj, size = nj, replace = T)
  Usk = as.matrix(Us[idk, ])
  esk = sample(es, size = length(es), replace = T)
  S = t(Usk) %*% Usk / nj
  U.S = chol(S)
  A = solve(U.S) %*% U.R
  Usk = Usk %*% A
  datk = expand.grid(l1id = 1:6, l2id = 1:nj)
  colnames(X) = c('one', 'Xc', 'Wjc')
  datk = cbind(datk, X)
  datk$yk = X %*% gs + Usk[datk$l2id, 1] + Usk[datk$l2id, 2] * X[, 2] + esk
  lmm.fitk = lmer(yk ~ Xc + (Xc|l2id) + Wjc, data = datk, REML = F)
  tau11k = VarCorr(lmm.fitk)$l2id[2, 2]
  tau11k
}

# Implementing Bootstrap
bootks = replicate(1500, REbootstrap(Us = ranef(lmm.fit3)$l2id, es = resid(lmm.fit3),
                                     X = model.matrix(lmm.fit3), gs = fixef(lmm.fit3)))
quantile(bootks, probs = c(.025, .975))
I have been trying to adapt the code to suit my own data and model, but so far that has been unfruitful because (a) I do not fully understand all the lines of code and (b) I have missing data points in one of my predictors. Here is what I have so far:
#reproducible code
library(lme4)
set.seed(855)
exdf <- data.frame(
ID= c(rep(1:105, 28)),
content= sort(c(rep(1:28, 105))),
PrePost= sample(0:1, 105*28, replace=TRUE),
eyeFRF= sort(rep(rnorm(28), 105)),
APMs= sample(0:1, 105*28, replace=TRUE),
Gf= rep(rnorm(105), 28)
)
exdf[which(exdf$ID==62), "eyeFRF"] <- NA
RandomMissing <- sample(rownames(exdf[-which(exdf$ID==62), ]), 17)
exdf[RandomMissing, "eyeFRF"] <- NA
View(exdf)
#model
M03b <- glmer(APMs ~ PrePost + Gf + eyeFRF + (1|content) + (eyeFRF|ID), data=exdf, family=binomial("logit"))
#own adaptation
REMLVC = VarCorr(M03b)$ID[1:2, 1:2]
U.R = chol(REMLVC)
REbootstrap = function(Us, es, X, gs){
  # Us = random effects
  # es = residuals
  # X = design matrix
  # gs = fixed effects
  nj = nrow(Us)  # 104 in this case, one is excluded (#62) b/c no eye-data
  idk = sample(1:nj, size = nj, replace = TRUE)  # 104 IDs
  Usk = as.matrix(Us[idk, ])  # 104 intercepts and slopes
  esk = sample(es, size = length(es), replace = TRUE)  # 2895 datapoints called 'x' (errors?)
  S = t(Usk) %*% Usk / nj  # ?
  U.S = chol(S)  # ?
  A = solve(U.S) %*% U.R  # ?
  Usk = Usk %*% A  # ?
  datk = expand.grid(content = 1:28, ID = 1:nj)
  colnames(X) = c('one', 'PrePost', 'Gf', 'eyeFRF')
  datk = cbind(datk, X)
  datk$APMsk = X %*% gs + Usk[datk$ID, 1] + Usk[datk$ID, 2]*X[, 2] + esk
  lmm.fitk = glmer(APMsk ~ PrePost + Gf + eyeFRF + (1|content) + (eyeFRF|ID),
                   data = datk, family = binomial("logit"))
  tau11k = VarCorr(lmm.fitk)$ID[2, 2]
  tau11k
}
# Implementing Bootstrap
bootks <- replicate(1500, REbootstrap(Us=ranef(M03b)$ID, es=resid(M03b), X=model.matrix(M03b), gs=fixef(M03b)))
quantile(bootks, probs=c(.025,.975))
(upgrading comment to an answer)
If you're trying to get confidence intervals via parametric bootstrapping, would confint(M03b,method="boot") work for you? (I think these methods may be new or better developed since that paper was written ...)
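For example (nsim here is illustrative; parm = "theta_" restricts the output to the random-effects variance-covariance parameters):

# percentile bootstrap CIs for the variance-covariance parameters
confint(M03b, method = "boot", nsim = 500, parm = "theta_", oldNames = FALSE)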