Updated:
I'm trying to examine the variability in my parameter estimates from a merMod object by simulating known data and running the model 100 times. I'd like the result to be a data frame that looks like the following:
| simulation | intercept | est.x1 | est.x2 |
| ---------- | --------- | ------ | ------ |
| sim_study1 | 0.09 | 0.75 | 0.25 |
| sim_study2 | 0.10 | 0.72 | 0.21 |
| sim_study3 | NA | NA | NA |
My code to generate multilevel data with a random intercept and 2 predictors is:
# note. this code block runs as expected, and if I run a lmer() call
# on a simulated data set I get values that one would expect.
gen_fake <- function(i, j){
school <- rep(1:j)
person <- rep(1:i) # students nested in schools
# parameters
mu_a_true <- 0.10 # real intercept
sigma_a_true <- 0.10 # SD of the random intercept (passed to rnorm() as sd)
sigma_y_true <- 0.40
b1_true <- .75
b2_true <- .25
# random intercept for schools
a_true <- rnorm(j, mu_a_true, sigma_a_true)
# random data for predictors
x1 <- rnorm(i, 0, 1)
x2 <- rnorm(i, 0, 1)
# outcome
y <- rnorm(i, a_true[school] + b1_true*x1 + b2_true*x2, sigma_y_true)
return (data.frame(y, person, school, x1, x2))
}
I'm attempting to conduct 100 simulations of a model, generating new data each time. Note that I'm trying to use tryCatch within the loop because, with more complex models where the fit might not terminate normally, I'd like the values returned in the table to be NA for the parameters.
My code for this is as follows:
# create an empty data frame with names of parameters (there's probably
# a slicker way to do this within the loop where I can match names from
# the model call)
sim_results <- data.frame(matrix(nrow=100, ncol=3,
dimnames=list(c(),
c("intercept",
"est.x1", "est.x2"))),
stringsAsFactors=F)
# load library for analysis
library(lme4)
# conduct 100 simulations of the model generating fake data for each run
sim_study <- function (i, j, n.sims){
for (sim in 1:n.sims){
fake_dat <- gen_fake(i, j)
tryCatch({
lmer_sim <- lmer(y ~ x1 + x2 + (1|school), data = fake_dat)
}, error = function(e){
return(NA)
}) #return previous value of fm if error
estimates <- rbind(fixef(lmer_sim))
}
sim_results[sim,] <- estimates
}
# run the simulation study
sim_study (1000,5,100)
The issue I am having is that the function only returns 1 row and it isn’t populating the empty data frame I made:
(Intercept) x1 x2
[1,] 0.09659339 0.7746392 0.2325391
I'm unsure of the issue. Finally, any feedback you might have for how to make this work faster would also be appreciated, as I'd like to learn more about that issue. Thanks for any assistance.
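This may be a bit of a forehead-slapper, but I think you just misplaced the loop brackets? In your version the assignment into sim_results sits after the closing brace of the for loop, so it runs only once, and the function never returns sim_results. This works for me: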
sim_study <- function (i, j, n.sims){
for (sim in 1:n.sims) {
if (sim %% 10 == 0 ) cat(".\n") ## print progress
fake_dat <- gen_fake(i, j)
tryCatch({
lmer_sim <- lmer(y ~ x1 + x2 + (1|school),
data = fake_dat)
}, error = function(e){
return(rep(NA,3)) ## return vector of correct length
}) #return previous value of fm if error
estimates <- rbind(fixef(lmer_sim))
sim_results[sim,] <- estimates
}
return(sim_results)
}
A few more points:
I'm not sure whether the tryCatch() logic works, since I didn't hit any errors (but I think it ought to be modified to return an object with the correct length, as above)
you could replace some of your gen_fake (not the generation of the predictors, but the generation of the response) with the built-in ?simulate.merMod(), but I don't think it would actually work any better (or worse)
speeding this up significantly would be a bit of work/hacky. There is a refit() function that works quickly if only the response variable has changed, but that doesn't hold in this case. You could use the tricks specified here ...
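On the tryCatch() point, here is a minimal sketch of one way to make the fallback actually reach the results table: assign the value of tryCatch() itself, so the NA vector from the error handler is what gets stored. To keep it runnable without lme4, lm() stands in for lmer(), and the failure in iteration 3 is forced by a hypothetical stop() call purely for illustration; none of this is from the original code.

```r
set.seed(123)
n_sims <- 5

# Pre-filled results table, one row per simulation
sim_results <- data.frame(matrix(NA_real_, nrow = n_sims, ncol = 3,
  dimnames = list(NULL, c("intercept", "est.x1", "est.x2"))))

for (sim in seq_len(n_sims)) {
  x1 <- rnorm(50)
  x2 <- rnorm(50)
  y  <- 0.10 + 0.75 * x1 + 0.25 * x2 + rnorm(50, 0, 0.40)

  # The value of tryCatch() is either the coefficients or the fallback
  estimates <- tryCatch({
    if (sim == 3) stop("simulated fitting failure")  # hypothetical failure
    coef(lm(y ~ x1 + x2))                            # stand-in for fixef(lmer(...))
  }, error = function(e) rep(NA_real_, 3))

  sim_results[sim, ] <- estimates
}

sim_results
```

With lme4 loaded, the same pattern should carry over by swapping the lm() line for fixef(lmer(y ~ x1 + x2 + (1|school), data = fake_dat)).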
Related
I am using gamm4::gamm4 to model longitudinal change.
I am trying to use the modelsummary::modelsummary function to create an output table of the following results:
I would like to add t-values and std.error to the fixed-effects output, and remove the empty labels from the random effects.
model_lmer <- gamm4(Y ~ Tract + s(Age, by = Tract, k = 10) + Sex,
data = (DF1),
random = ~ (0 + Tract | ID))
modelsummary(model_lmer$mer,
statistic = c("s.e. = {std.error}",
"t = {statistic}"))
But I am struggling to write the correct syntax to remove the "t" and "s.e." from the random effects output.
This is kind of tricky, actually. The issue is that modelsummary()
automatically drops empty rows when they are filled with NA or an
empty string "". However, since glue strings can include arbitrary
text, it is hard to think of a general way to figure out if the row is
empty or not, because modelsummary() cannot know ex ante what
constitutes an empty string.
If you have an idea of how this check could be implemented, please report it
on GitHub:
https://github.com/vincentarelbundock/modelsummary
In the meantime, you could use the powerful tidy_custom.CLASSNAME
mechanism
to customize the statistic and p.value statistics directly instead
of through a glue string:
library(gamm4)
library(modelsummary)
# simulate
x <- runif(100)
fac <- sample(1:20,100,replace=TRUE)
eta <- x^2*3 + fac/20; fac <- as.factor(fac)
y <- rpois(100,exp(eta))
# fit
mod <- gamm4(y~s(x),family=poisson,random=~(1|fac))
# customize
tidy_custom.glmerMod <- function(x) {
out <- parameters::parameters(x)
out <- insight::standardize_names(out, style = "broom")
out$statistic <- sprintf("t = %.3f", out$statistic)
out$p.value <- sprintf("p = %.3f", out$p.value)
out
}
# summarize
modelsummary(mod$mer,
statistic = c("{statistic}", "{p.value}"))
|              | Model 1    |
| ------------ | ---------- |
| X(Intercept) | 1.550      |
|              | t = 17.647 |
|              | p = 0.000  |
| Xs(x)Fx1     | 0.855      |
|              | t = 4.445  |
|              | p = 0.000  |
| Num.Obs.     | 100        |
| RMSE         | 2.49       |
Note that I used simple glue strings in statistic = "{p.value}", otherwise they would be wrapped up in parentheses, as is default for standard errors.
I would like to derive individual growth rates from our growth model directly, similar to this OP and this OP.
I am working with a dataset that contains the age and weight (wt) measurements for ~2000 individuals in a population. Each individual is represented by a unique id number.
A sample of the data can be found here. Here is what the data looks like:
id age wt
1615 6 15
3468 32 61
1615 27 50
1615 60 145
6071 109 209
6071 125 207
10645 56 170
10645 118 200
I have developed a non-linear growth curve to model growth for this dataset (at the population level). It looks like this:
wt~ A*atan(k*age - t0) + m
which predicts weight (wt) for a given age and has modifiable parameters A, k, t0, and m. I have fit this model to the dataset at the population level using an nlme regression fit where I specified individual id as a random effect and used pdDiag to specify each parameter as uncorrelated. (Note: the random effect would need to be dropped when looking at the individual level.)
The code for this looks like:
nlme.k = nlme(wt~ A*atan(k*age - t0) + m,
data = df,
fixed = A+k+t0+m~1,
random = list(id = pdDiag(A+t0+k+m~1)), #cannot include when looking at the individual level
start = c(A = 99.31,k = 0.02667, t0 = 1.249, m = 103.8), #these values are what we are using at the population level # might need to be changed for individual models
na.action = na.omit,
control = nlmeControl(maxIter = 200, pnlsMaxIter = 10, msMaxIter = 100))
I have our population level growth model (nlme.k), but I would like to use it to derive/extract individual values for each growth constant.
How can I extract individual growth constants for each id using my population level growth model (nlme.k)? Note that I don't need it to be a solution that uses nlme, that is just the model I used for the population growth model.
Any suggestions would be appreciated!
I think this is not possible due to the nature of how random effects are designed. According to this post the effect size (your growth constant) is estimated using partial pooling. This involves using data points from other groups. Thus you cannot estimate the effect size of each group (your individual id) independently.
Strictly speaking (see here) random effects are not really a part of the model at all, but more a part of the error.
However, you can estimate the R2 for all groups together. If you want it on an individual level (e.g. parameter estimates for id 1), then just run the same model on only the data points of this particular individual. This gives you n models with n parameter sets for n individuals.
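A minimal sketch of that per-individual approach, assuming simulated stand-in data (the ids, ages, noise level and starting values below are hypothetical, not taken from the real dataset). Each individual's rows are fit with nls() using the same curve, and tryCatch() returns NA for individuals whose fit fails.

```r
set.seed(1)

# Hypothetical stand-in data: 3 individuals, 12 measurements each
ages <- seq(5, 150, length.out = 12)
df <- do.call(rbind, lapply(1:3, function(id) {
  A <- rnorm(1, 100, 5)  # individual-specific amplitude
  data.frame(id = id, age = ages,
             wt = A * atan(0.03 * ages - 1.2) + 100 +
                  rnorm(length(ages), 0, 2))
}))

# Starting values (assumed here; in practice the population estimates)
start <- list(A = 100, k = 0.03, t0 = 1.2, m = 100)

# Fit one individual's data; return NA for every parameter on failure
fit_one <- function(d) {
  tryCatch(
    coef(nls(wt ~ A * atan(k * age - t0) + m, data = d, start = start)),
    error = function(e) c(A = NA_real_, k = NA_real_, t0 = NA_real_, m = NA_real_)
  )
}

# One row of parameter estimates per id
ind_fits <- do.call(rbind, lapply(split(df, df$id), fit_one))
ind_fits
```

With only a handful of measurements per individual, some fits will likely fail or be poorly determined, which is why the NA fallback is worth keeping.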
We ended up using a few loops to do this.
Note that our answer builds off a model posted in this OP if anyone wants the background script. We will also link to the published script when it is posted.
For now, this should give a general idea of how we did this.
#Individual fits dataframe generation
yid_list <- unique(young_inds$squirrel_id)
indf_prs <- list('df', 'squirrel_id', 'A_value', 'k_value', 'mx_value', 'my_value', 'max_grate', 'hit_asymptote', 'age_asymptote', 'ind_asymptote', 'ind_mass_asy', 'converge') #List of parameters
ind_fits <- data.frame(matrix(ncol = length(indf_prs), nrow = length(yid_list))) #Blank dataframe for all individual fits
colnames(ind_fits) <- indf_prs
#Calculates individual fits for all individuals and appends into ind_fits
for (i in 1:length(yid_list)) {
yind_df <-young_inds%>%filter(squirrel_id %in% yid_list[i]) #Extracts a dataframe for each squirrel
ind_fits[i , 'squirrel_id'] <- as.numeric(yid_list[i]) #Appends squirrel i's id into individual fits dataframe
sex_lab <- unique(yind_df$sex) #Identifies and extracts squirrel "i"s sex
mast_lab <- unique(yind_df$b_mast) #Identifies and extracts squirrel "i"s mast value
Hi_dp <- max(yind_df$wt) #Extracts the largest mass for each squirrel
ind_long <- unique(yind_df$longevity) #Extracts the individual death date
#Sets corresponding values for squirrel "i"
if (mast_lab==0 && sex_lab=="F") { #Female no mast
ind_fits[i , 'df'] <- "fnm" #Squirrel dataframe (appends into ind_fits dataframe)
df_asm <- af_asm #average asymptote value corresponding to sex
df_B_guess <- guess_df[1, "B_value"] #Initial guesses for nls fits corresponding to sex and mast
df_k_guess <- guess_df[1, "k_value"]
df_mx_guess <- guess_df[1, "mx_value"]
df_my_guess <- guess_df[1, "my_value"]
ind_asyr <- indf_asy #growth rate at individual asymptote
} else if (mast_lab==0 && sex_lab=="M") { #Male no mast
ind_fits[i , 'df'] <- "mnm"
df_asm <- am_asm
df_B_guess <- guess_df[2, "B_value"]
df_k_guess <- guess_df[2, "k_value"]
df_mx_guess <- guess_df[2, "mx_value"]
df_my_guess <- guess_df[2, "my_value"]
ind_asyr <- indm_asy
} else if (mast_lab==1 && sex_lab=="F") { #Female mast
ind_fits[i , 'df'] <- "fma"
df_asm <- af_asm
df_B_guess <- guess_df[3, "B_value"]
df_k_guess <- guess_df[3, "k_value"]
df_mx_guess <- guess_df[3, "mx_value"]
df_my_guess <- guess_df[3, "my_value"]
ind_asyr <- indf_asy
} else if (mast_lab==1 && sex_lab=="M") { #Males mast
ind_fits[i , 'df'] <- "mma"
df_asm <- am_asm
df_B_guess <- guess_df[4, "B_value"]
df_k_guess <- guess_df[4, "k_value"]
df_mx_guess <- guess_df[4, "mx_value"]
df_my_guess <- guess_df[4, "my_value"]
ind_asyr <- indm_asy
} else { #If sex or mast is not identified or identified improperly in the data
print("NA")
} #End of if else loop
#Arctangent
#Fits nls model to the created dataframe
nls.floop <- tryCatch({data.frame(tidy(nls(wt~ B*atan(k*(age - mx)) + my, #tryCatch lets nls have alternate results instead of "code stopping" errors
data=yind_df,
start = list(B = df_B_guess, k = df_k_guess, mx = df_mx_guess, my = df_my_guess),
control= list(maxiter = 200000, minFactor = 1/100000000))))
},
error = function(e){
nls.floop <- data.frame(c(0,0), c(0,0)) #Specifies nls.floop as a dummy dataframe if no convergence
},
warning = function(w) {
nls.floop <- data.frame(tidy(nls.floop)) #Fit is the same if warning is displayed
}) #End of nls.floop
#Creates a dummy numerical index from nls.floop for if else loop below
numeric_floop <- as.numeric(nls.floop[1, 2])
#print(numeric_floop) #Taking a look at the values. If numeric_floop...
# == 0, function did not converge on iteration "i"
# != 0, function did converge on iteration "i" and code will run through calculations
if (numeric_floop != 0) {
results_DF <- nls.floop
ind_fits[i , 'converge'] <- 1 #converge = 1 for converging fit
#Extracting, calculating, and appending values into dataframe
B_value <- as.numeric(results_DF[1, "estimate"]) #B value
k_value <- as.numeric(results_DF[2, "estimate"]) #k value
mx_value <- as.numeric(results_DF[3, "estimate"]) #mx value
my_value <- as.numeric(results_DF[4, "estimate"]) #my value
A_value <- ((B_value*pi)/2)+ my_value #A value calculation
ind_fits[i , 'A_value'] <- A_value
ind_fits[i , 'k_value'] <- k_value
ind_fits[i , 'mx_value'] <- mx_value
ind_fits[i , 'my_value'] <- my_value #appends my_value into df
ind_fits[i , 'max_grate'] <- adr(mx_value, B_value, k_value, mx_value, my_value) #Calculates max growth rate
}
} #End of individual fits loop
Which gives this output:
> head(ind_fits%>%select(df, squirrel_id, A_value, k_value, mx_value, my_value))
df squirrel_id A_value k_value mx_value my_value
1 mnm 332 257.2572 0.05209824 52.26842 126.13183
2 mnm 1252 261.0728 0.02810033 42.37454 103.02102
3 mnm 3466 260.4936 0.03946594 62.27705 131.56665
4 fnm 855 437.9569 0.01347379 86.18629 158.27641
5 fnm 2409 228.7047 0.04919819 63.99252 123.63404
6 fnm 1417 196.0578 0.05035963 57.67139 99.65781
Note that you need to create a blank dataframe first before running the loops.
As I am just learning R, I am not sure how to solve this. I am trying to get a data frame that shows me the following:
Model Number | adj.r.squared | sigma | statistic | df
------------------------------------------------------
Model 1 | 0.465 | 0.437 | 459.0. | 8
Model 2 | 0.0465 | 0.0437 | 659.0. | 7
I am using the broom package in order to get these statistics with glance() and created a function for it:
glancing <- function(x) {
glance(x)[c("adj.r.squared", "sigma", "statistic", "df")]
}
I am using a dataset that has 9 variables ("danceability", "energy", "loudness", "speechiness", "acousticness", "liveness", "valence", "tempo", "instrumentalness") and I need all possible combinations of them as predictors in a linear regression that predicts the popularity score.
I found a way to put all the formulas in a list:
characteristics <- c("popularity","danceability","energy", "loudness", "speechiness", "acousticness", "liveness", "valence", "tempo", "instrumentalness")
N <- list(1,2,3,4,5,6,7,8,9)
COMB <- sapply(N, function(m) combn(x=characteristics[2:10], m))
formulas <- list()
k=0
for(i in seq(COMB)){
tmp <- COMB[[i]]
for(j in seq(ncol(tmp))){
k <- k + 1
formulas[[k]] <- formula(paste("popularity", "~", paste(tmp[,j], collapse=" + ")))
}
}
I was also able to assign each formula in the list to an object with the linear model:
#Assign each model to a variables
for(i in 1:length(formulas)) {
assign(paste0("model",i),lm(formulas[[i]], data=training_data))
}
This leaves me with 511 models (objects), which I have to put into the glancing function manually, and then combine into a data frame.
Is there an easier way of doing this altogether?
I already tried to convert the list into a data frame or vector, but it seems to fail due to the fact the class is a "formula".
Your help is appreciated!
Replace this loop using assign:
for(i in 1:length(formulas)) {
assign(paste0("model",i),lm(formulas[[i]], data=training_data))
}
With this loop using a list:
model_list = list()
for(i in 1:length(formulas)) {
model_list[[i]] = lm(formulas[[i]], data=training_data)
}
Then if you want to glance all of them:
library(dplyr)
library(broom)
glance_results = bind_rows(lapply(model_list, glance))
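For a self-contained illustration of the list-based workflow, here is a base R sketch that builds every predictor combination, fits the models into a list and collects one row of summary statistics per model. It uses the built-in mtcars data with three predictors as a stand-in for the real dataset (so 2^3 - 1 = 7 models rather than 2^9 - 1 = 511), and it pulls the statistics from summary() rather than broom::glance(), so the columns only approximate what glance() reports.

```r
# Stand-in data: mtcars, predicting mpg from three predictors
predictors <- c("wt", "hp", "disp")

# All non-empty subsets of the predictors, as a flat list of character vectors
combos <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# One formula and one fitted model per subset, kept in lists
formulas   <- lapply(combos, function(v) reformulate(v, response = "mpg"))
model_list <- lapply(formulas, lm, data = mtcars)

# One row of fit statistics per model
stats <- do.call(rbind, lapply(seq_along(model_list), function(i) {
  s <- summary(model_list[[i]])
  data.frame(model         = paste("Model", i),
             formula       = deparse(formulas[[i]]),
             adj.r.squared = s$adj.r.squared,
             sigma         = s$sigma,
             statistic     = unname(s$fstatistic["value"]),  # F statistic
             df            = unname(s$fstatistic["numdf"]))  # numerator df
}))
stats
```

Keeping the formula next to each row makes it easy to see which predictor set a given "Model i" refers to.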
I am searching for an equivalent function in R of the extremely convenient Stata command simulate. The command basically allows you to declare a program (reg_simulation in the example below) and then invoke such a program from simulate and store desired outputs.
Below is a Stata illustration of the usage of the simulate program, together with my attempt to replicate it using R.
Finally, my main question is: is this how R users would run a Monte Carlo simulation? Or am I missing something in terms of structure or speed bottlenecks? Thank you a lot in advance.
Stata example
Defining reg_simulation program.
clear all
*Define "reg_simulation" to be used later on by "simulate" command
program reg_simulation, rclass
*Declaring Stata version
version 13
*Dropping all variables in memory
drop _all
*Set sample size (n=100)
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
*Estimate OLS
reg y x1 x2
*Store coefficients
matrix B = e(b)
return matrix betas = B
end
Calling reg_simulation from simulate command:
*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
Obtained result (data stored in memory)
_b_x1 _b_x2 _b_cons
.4470155 1.50748 1.043514
.4235979 1.60144 1.048863
.5006762 1.362679 .8828927
.5319981 1.494726 1.103693
.4926634 1.476443 .8611253
.5920001 1.557737 .8391003
.5893909 1.384571 1.312495
.4721891 1.37305 1.017576
.7109139 1.47294 1.055216
.4197589 1.442816 .9404677
R replication of the Stata program above.
Using R I have managed to get the following (not an R expert tho). However, the part that worries me the most is the for-loop structure that iterates over the number of repetitions, nreps.
Defining reg_simulation function.
#Defining a function
reg_simulation<- function(obs = 1000){
data <- data.frame(
#Generate data
x1 <-rnorm(obs, 0 , 1) ,
x2 <-rnorm(obs, 0 , 1) ,
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1) )
#Estimate OLS
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
}
Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
#Set seed internally (to get different values in each run)
set.seed(i)
#Save results into list
results_list[i] <- list(reg_simulation(obs=1000))
}
#unlist results
df_results<- data.frame(t(sapply(results_list,
function(x) x[1:max(lengths(results_list))])))
Obtained result: df_results.
#final results
df_results
# X.Intercept. x1 x2
# 1 1.0162384 0.5490488 1.522017
# 2 1.0663263 0.4989537 1.496758
# 3 0.9862365 0.5144083 1.462388
# 4 1.0137042 0.4767466 1.551139
# 5 0.9996164 0.5020535 1.489724
# 6 1.0351182 0.4372447 1.444495
# 7 0.9975050 0.4809259 1.525741
# 8 1.0286192 0.5253288 1.491966
# 9 1.0107962 0.4659812 1.505793
# 10 0.9765663 0.5317318 1.501162
You're on the right track. Couple of hints/corrections:
1. Don't use <- inside data.frame()
In R, we construct data frames using = for internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).
(There's more to be said about <- vs =, but I don't want to distract from your main question.)
In your case, you don't actually even need to create a data frame, since x1, x2 and y are ordinary variables in the function's own environment, which is where lm() looks for the variables named in the formula. I'll post some code at the end of my answer demonstrating this.
2. When growing a list via a for loop in R, always try to pre-allocate the list first
Always try to pre-allocate the list length and type if you are going to fill it in a (long) for loop. Reason: that way, R can allocate the memory for the object up front instead of repeatedly re-growing it as the loop runs. In the case where you are only doing 10 reps, that would mean starting with something like:
results_list <- vector("list", 10)
3. Consider using lapply instead of for
for loops have a bit of a bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply. I'll hold off on showing you the code for a second, but it will look very similar to a for loop. Just to note quickly, following on from point 2, one immediate benefit is that you don't need to pre-allocate the list with lapply.
4. Run large loops in parallel
A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.
Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
obs <- 1e3
# Defining a function
reg_simulation <- function(...){
x1 <- rnorm(obs, 0 , 1)
x2 <- rnorm(obs, 0 , 1)
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
#Estimate OLS
ols <- lm(y ~ x1 + x2)
# return(ols$coefficients)
return(as.data.frame(t(ols$coefficients)))
}
# N repetitions
nreps <- 10
## Serial version
# results <- lapply(1:nreps, reg_simulation)
## Parallel version
results <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)
## Unlist / convert into a data.table
results <- rbindlist(results)
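If you'd rather avoid the extra packages, a serial base R sketch of the same idea could look like this. It is an illustrative variant, not a replacement for the version above: the function takes the iteration index as an unused first argument so it can be passed straight to lapply, and obs is an ordinary argument with an assumed default of 100.

```r
set.seed(1234)

# One Monte Carlo draw: simulate data, fit OLS, return the coefficients
reg_simulation <- function(i, obs = 100) {
  x1 <- rnorm(obs)
  x2 <- rnorm(obs)
  y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs)
  coef(lm(y ~ x1 + x2))  # variables are found in the function's environment
}

nreps <- 10

# lapply returns a list of named coefficient vectors; rbind stacks them
results <- do.call(rbind, lapply(seq_len(nreps), reg_simulation))
results
```

The result is a 10 x 3 numeric matrix with one row per repetition, which as.data.frame() will convert if a data frame is needed.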
So, following up on the comments, you want to vary your independent variables (x) and also the error term and simulate the coefficients, but you also want to catch errors if any occur. The following would do the trick:
set.seed(42)
#Defining a function
reg_simulation<- function(obs = 1000){
data <- data.frame(
#Generate data
x1 <-rnorm(obs, 0 , 1) ,
x2 <-rnorm(obs, 0 , 1) ,
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1) )
#Estimate OLS
tryCatch(
{
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
},
error = function(e){
return(c('(Intercept)'=NA, 'x1'=NA, 'x2'=NA))
}
)
}
output <- t(data.frame(replicate(10, reg_simulation())))
output
(Intercept) x1 x2
X1 0.9961328 0.4782010 1.481712
X2 1.0234698 0.4801982 1.556393
X3 1.0336289 0.5239380 1.435468
X4 0.9796523 0.5095907 1.493548
...
Here, tryCatch (see also failwith) catches the error and returns NA as the default value.
Note that you only need to set the seed once: every call to a random number generator advances the generator's state deterministically, so successive draws differ from each other while the whole sequence stays reproducible.
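A small sketch of the seeding behaviour, using only base R (the variable names are just for illustration): setting the seed once gives different draws on each call within a run, and resetting it to the same value reproduces the whole sequence.

```r
# Seed once, then draw repeatedly
set.seed(42)
first_run <- replicate(3, mean(rnorm(5)))

# Same seed again: the entire sequence of draws is reproduced
set.seed(42)
second_run <- replicate(3, mean(rnorm(5)))

identical(first_run, second_run)   # the two runs match exactly
length(unique(first_run)) == 3     # yet the draws within a run differ
```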
I'm using either dyn or dynlm to predict time series using lagged variables.
However, the predict function in either case only evaluates one time step at a time, taking a constant time of 24 milliseconds per step on my computer, or about 1.8 hours for my dataset, which is super long, given that the entire regression takes about 10 seconds.
So, I'm thinking that perhaps the fastest thing might be just to evaluate the formula by hand?
So, is there some way of evaluating a formula given values in a data.frame or the current environment or similar?
I'm thinking of something along the lines of:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ) )
I suppose, as I write this, that we need to handle the coefficients somehow, something like:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ), model$coefficients )
.... so this raises the questions of:
isn't this what predict is supposed to do?
why is predict so slow?
what options do I have to make the prediction a bit faster? After all, it's not inverting any matrices or something, it's just a bit of arithmetic!
I wrote my own lag implementation in the end. It's hacky and not beautiful, but it's a lot faster. It can process 1000 rows in 4 seconds on my crappy laptop.
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
addLags <- function( dataset, lags ) {
N <- nrow(dataset)
print(lags)
if( nrow(lags) > 0 ) {
print(lags)
for( j in 1:nrow(lags) ) {
sourcename <- as.character( lags[j,"var"] )
k <- lags[j,"amount"]
cat("k",k,"sourcename",sourcename,"\n")
lagcolname <- sprintf("%s_%d",sourcename,k)
dataset[,lagcolname] <- c(rep(0,k), dataset[1:(N-k),sourcename])
}
}
dataset
}
lmLagged <- function( formula, train, lags ) {
# get largest lag, and skip that
N <- nrow(train)
skip <- 0
for( j in 1:nrow(lags) ) {
k <- lags[j,"amount"]
skip <- max(k,skip)
}
print(train)
train <- addLags( train, lags )
print(train)
lm( formula, train[(skip+1):N,] )
}
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
predictLagged <- function( model, train, test, dependentvarname, lags ) {
Ntrain <- nrow(train)
Ntest <- nrow(test)
test[,dependentvarname] <- NA
testtraindata <- rbind( train, test )
testtraindata <- addLags( testtraindata, lags )
for( i in 1:Ntest ) {
thistestdata <- testtraindata[Ntrain + i,]
result <- predict(model,newdata=thistestdata)
for( j in 1:nrow(lags) ) {
sourcename <- lags[j,"var"]
k <- lags[j,"amount"]
lagcolname <- sprintf("%s_%d",sourcename,k)
testtraindata[Ntrain + i + k,lagcolname] <- result
}
testtraindata[Ntrain+i,dependentvarname] <- result
}
return( testtraindata[(Ntrain+1):(Ntrain + Ntest),dependentvarname] )
}
library("RUnit")
# size of training data
N <- 6
predictN <- 50
# create training data, which we can get exact fit on
set.seed(1)
x = sample( 100, N )
traindata <- numeric()
traindata[1] <- 1 + 1.1 * x[1]
traindata[2] <- 2 + 1.1 * x[2]
for( i in 3:N ) {
traindata[i] <- 0.5 + 0.3 * traindata[i-2] - 0.8 * traindata[i-1] + 1.1 * x[i]
}
train <- data.frame(x = x, y = traindata, foo = 1)
#train$x <- NULL
# create testing data, bunch of NAs
test <- data.frame( x = sample(100,predictN), y = rep(NA,predictN), foo = 1)
# specify which lags we need to handle
# one row per lag, with name of variable we are lagging, and the distance
# we can then use these in the formula, eg y_1, and y_2
# are y lagged by 1 and 2 respectively
# It's hacky but it kind of works...
lags <- data.frame( var = c("y","y"), amount = c(1,2) )
# fit a model
model <- lmLagged( y ~ x + y_1 + y_2, train, lags )
# look at the model, it's a perfect fit. Nice!
print(model)
print(system.time( test <- predictLagged( model, train, test, "y", lags ) ))
#checkEqualsNumeric( 69.10228, test[56-6], tolerance = 0.0001 )
#checkEquals( 2972.159, test$y[106-6] )
print(test)
# nice plot
plot(test, type='l')
Output:
> source("test/test.regressionlagged.r",echo=F)
Call:
lm(formula = formula, data = train[(skip + 1):N, ])
Coefficients:
(Intercept) x y_1 y_2
0.5 1.1 -0.8 0.3
user system elapsed
0.204 0.000 0.204
[1] -19.108620 131.494916 -42.228519 80.331290 -54.433588 86.846257
[7] -13.807082 77.199543 12.698241 64.101270 56.428457 72.487616
[13] -3.161555 99.575529 8.991110 44.079771 28.433517 3.077118
[19] 30.768361 12.008447 2.323751 36.343533 67.822299 -13.154779
[25] 72.070513 -11.602844 115.003429 -79.583596 164.667906 -102.309403
[31] 193.347894 -176.071136 254.361277 -225.010363 349.216673 -299.076448
[37] 400.626160 -371.223862 453.966938 -420.140709 560.802649 -542.284332
[43] 701.568260 -679.439907 839.222404 -773.509895 897.474637 -935.232679
[49] 1022.328534 -991.232631
There's about 12 hours work in those 91 lines of code. Ok, I confess I played Plants and Zombies for a bit. So, 10 hours. Plus lunch and dinner. Still, quite a lot of work anyway.
If we change predictN to 1000, I get about 4.1 seconds from the system.time call.
I think it's faster because:
we don't use timeseries; I suspect that speeds things up
we don't use dynamic lm libraries, just normal lm; I guess that's slightly faster
we only pass a single row of data into predict for each prediction, which I think is significantly faster, eg using dyn$lm or dynlm, if one has a lag of 30, one would need to pass 31 rows of data into predict AFAIK
a lot less data.frame/matrix copying, since we just update the lag values in-place on each iteration
Edit: corrected minor buggette where predictLagged returned a multi-column data-frame instead of just a numeric vector
Edit2: corrected less minor bug where you couldn't add more than one variable. Also reconciled the comments and code for lags, and changed the lags structure to "var" and "amount" in place of "name" and "lags". Also, updated the test code to add a second variable.
Edit: there are tons of bugs in this version, which I know, because I've unit-tested it a bit more and fixed them, but copying and pasting is very time-consuming, so I will update this post in a few days, once my deadline is over.
Maybe you're looking for this:
fastlinpred <- function(formula, newdata, coefs) {
X <- model.matrix( formula, data=newdata)
X %*% coefs
}
coefs <- c(1,2,3)
dd <- data.frame( temperature = 10, time = 4 )
fastlinpred( ~ temperature + time,
dd , coefs )
This assumes that the formula has only a RHS (you can get rid of the LHS of a formula by doing form[-2]).
This certainly gets rid of a lot of the overhead of predict.lm, but I don't know if it is as fast as you want. model.matrix has a lot of internal machinery too.
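As a quick sanity check, here is a sketch that compares this kind of hand-rolled prediction against predict() on a fitted lm. The training data are simulated purely for illustration, and this version of fastlinpred drops the LHS itself with formula[-2], so the model's own two-sided formula can be passed in directly.

```r
set.seed(1)

# Simulated training data (hypothetical values)
train <- data.frame(temperature = rnorm(50, 15, 5), time = 1:50)
train$load <- 3 + 2 * train$temperature - 0.1 * train$time + rnorm(50)

model <- lm(load ~ temperature + time, data = train)

# Build the design matrix from the RHS only and multiply by the coefficients
fastlinpred <- function(formula, newdata, coefs) {
  rhs <- formula[-2]                      # drop the response
  X <- model.matrix(rhs, data = newdata)  # columns: (Intercept), temperature, time
  drop(X %*% coefs)
}

newdata <- data.frame(temperature = c(10, 20), time = c(4, 8))
manual  <- fastlinpred(formula(model), newdata, coef(model))

# Should agree with predict() up to floating-point error
all.equal(unname(manual), unname(predict(model, newdata)))
```

This only covers plain linear predictions; anything predict() adds on top (intervals, handling of rank-deficient fits, and so on) is not reproduced.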