I am trying to bootstrap the lambda1 parameter in LASSO regression (using the penalized library). This is NOT about the coefficient estimates (I KNOW it does not make sense to calculate e.g. 95% CIs for them); this question is about lambda1 ONLY.
This is where I am so far:
df <- read.table(header=T, text="group class v1 v2
1 Ala 1 3.98 23.2
2 Ala 2 5.37 18.5
3 C 1 4.73 22.1
4 B 1 4.17 22.3
5 C 2 4.47 22.4
")
Tried this:
X<-df[,c(3,4)] # data, variables in columns, cases in rows
Y<-df[,2] # dichotomous response
for (i 1:100) {
opt1<-optL1(Y,X)
opt1$lambda
}
But got Error: unexpected "}" in "}"
Tried this:
f<-function(X,Y,i){
opt1<-optL1(Y,X,[i])
}
boot(X,f,100)
But got Error in boot (X,f,100): incorrect number of subscripts on matrix... Can somebody help?
Here is what is wrong with the for loop:
1) It needs the syntax for (i in 1:100) {} in order to work;
2) It needs to save opt1$lambda in a proper object;
3) It most likely needs the values (Y,X) to change from one iteration of the loop to another.
The R code which addresses items 1) and 2) above could be written as follows:
lambda <- NULL
for (i in 1:100) {
opt1 <- optL1(Y,X) # opt1 will NOT change
# since Y and X are the SAME
# over each iteration of the for loop
lambda <- c(lambda, opt1$lambda)
}
lambda
In this code, the object lambda, which stores the value opt1$lambda produced at each iteration, is initialized before the for loop with the command lambda <- NULL and then augmented after each iteration with the command
lambda <- c(lambda, opt1$lambda).
In general, using the NULL trick is not recommended for a large number of iterations. A better alternative would be this:
lambda <- vector("list", 100)
for (i in 1:100) {
opt1 <- optL1(Y,X) # opt1 will NOT change
# since Y and X are the SAME
# over each iteration of the for loop
lambda[[i]] <- opt1$lambda
}
lambda <- unlist(lambda)
lambda
With this second alternative, we pre-allocate lambda before the for loop as a list with 100 components, such that the i-th component will store the value opt1$lambda produced during the i-th iteration. Inside the for loop, we save the value of opt1$lambda in the list named lambda with the command:
lambda[[i]] <- opt1$lambda.
At the end of the loop, we unlist lambda so that it becomes a regular vector (i.e., column of numbers).
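To also address item 3), the data passed to optL1 must change between iterations. A minimal sketch of that, assuming X and Y as defined in the question, is to resample the cases with replacement before each fit:
lambda <- vector("list", 100)
for (i in 1:100) {
idx <- sample(nrow(X), replace = TRUE) # bootstrap the cases
opt1 <- optL1(Y[idx], X[idx, ]) # refit on the resampled data
lambda[[i]] <- opt1$lambda
}
lambda <- unlist(lambda)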
You can alter the function to take a data.frame and specify the columns to use for the response and covariates inside optL1:
library(boot)
library(penalized)
f<-function(data,ind){
fit = optL1(data[ind,"class"],data[ind,c("v1","v2")])
fit$lambda
}
df = data.frame(group=sample(c("A","B","C"),100,replace=TRUE),
class=sample(2,100,replace=TRUE),
v1 = rnorm(100),
v2 = rnorm(100)
)
bo = boot(df,f,100)
bo
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = f, R = 100)
Bootstrap Statistics :
original bias std. error
t1* 2.887399 0.2768409 1.85466
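If the goal is an interval for lambda1, a percentile CI can then be read off the boot object. A minimal sketch, assuming the bo object above (note that R = 100 replicates is on the low side for percentile intervals):
library(boot)
boot.ci(bo, type = "perc")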
I am searching for an R equivalent of the extremely convenient Stata command simulate. The command basically allows you to declare a program (reg_simulation in the example below) and then invoke that program from simulate and store the desired outputs.
Below is a Stata illustration of the usage of simulate, together with my attempt to replicate it in R.
Finally, my main question is: is this how R users would run a Monte Carlo simulation, or am I missing something in terms of structure or speed bottlenecks? Thanks a lot in advance.
Stata example
Defining reg_simulation program.
clear all
*Define "reg_simulation" to be used later on by "simulate" command
program reg_simulation, rclass
*Declaring Stata version
version 13
*Dropping all variables in memory
drop _all
*Set sample size (n=100)
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
*Estimate OLS
reg y x1 x2
*Store coefficients
matrix B = e(b)
return matrix betas = B
end
Calling reg_simulation from simulate command:
*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
Obtained result (data stored in memory):
_b_x1 _b_x2 _b_cons
.4470155 1.50748 1.043514
.4235979 1.60144 1.048863
.5006762 1.362679 .8828927
.5319981 1.494726 1.103693
.4926634 1.476443 .8611253
.5920001 1.557737 .8391003
.5893909 1.384571 1.312495
.4721891 1.37305 1.017576
.7109139 1.47294 1.055216
.4197589 1.442816 .9404677
R replication of the Stata program above.
Using R, I have managed to get the following (not an R expert though). However, the part that worries me most is the for-loop structure that loops over the number of repetitions nreps.
Defining reg_simulation function.
#Defining a function
reg_simulation<- function(obs = 1000){
data <- data.frame(
#Generate data
x1 <-rnorm(obs, 0 , 1) ,
x2 <-rnorm(obs, 0 , 1) ,
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1) )
#Estimate OLS
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
}
Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
#Set seed internally (to get different values in each run)
set.seed(i)
#Save results into list
results_list[i] <- list(reg_simulation(obs=1000))
}
#unlist results
df_results<- data.frame(t(sapply(results_list,
function(x) x[1:max(lengths(results_list))])))
Obtained result: df_results.
#final results
df_results
# X.Intercept. x1 x2
# 1 1.0162384 0.5490488 1.522017
# 2 1.0663263 0.4989537 1.496758
# 3 0.9862365 0.5144083 1.462388
# 4 1.0137042 0.4767466 1.551139
# 5 0.9996164 0.5020535 1.489724
# 6 1.0351182 0.4372447 1.444495
# 7 0.9975050 0.4809259 1.525741
# 8 1.0286192 0.5253288 1.491966
# 9 1.0107962 0.4659812 1.505793
# 10 0.9765663 0.5317318 1.501162
You're on the right track. Couple of hints/corrections:
1. Don't use <- inside data.frame()
In R, we construct data frames using = for internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).
(There's more to be said about <- vs =, but I don't want to distract from your main question.)
In your case, you don't actually even need to create a data frame since x1, x2 and y will all be recognized as "global" variables within the scope of the function. I'll post some code at the end of my answer demonstrating this.
2. When growing a list via a for loop in R, pre-allocate the list first
Pre-allocate the length and type of any object you are going to fill in a (long) for loop. Reason: that way, R knows up front how much memory to allocate to your object. In the case where you are only doing 10 reps, that would mean starting with something like:
results_list <- vector("list", 10)
3. Consider using lapply instead of for
for loops have a bit of a bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply. I'll hold off on showing you the code for a second, but it will look very similar to a for loop. Just to note quickly, following on from point 2: one immediate benefit of lapply is that you don't need to pre-allocate the list.
4. Run large loops in parallel
A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.
Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
obs <- 1e3
# Defining a function
reg_simulation <- function(...){
x1 <- rnorm(obs, 0 , 1)
x2 <- rnorm(obs, 0 , 1)
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
#Estimate OLS
ols <- lm(y ~ x1 + x2)
# return(ols$coefficients)
return(as.data.frame(t(ols$coefficients)))
}
# N repetitions
nreps <- 10
## Serial version
# results <- lapply(1:nreps, reg_simulation)
## Parallel version
results <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)
## Unlist / convert into a data.table
results <- rbindlist(results)
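As a quick follow-up, the Monte Carlo summaries then reduce to column-wise operations on the result. A minimal sketch, assuming the data.table results from above:
results[, lapply(.SD, mean)] # Monte Carlo mean of each coefficient
results[, lapply(.SD, sd)]   # and the simulation standard errors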
So, following up on the comments, you want to vary your independent variables (x) and also the error term and simulate the coefficients, but you also want to catch errors if any occur. The following would do the trick:
set.seed(42)
#Defining a function
reg_simulation <- function(obs = 1000){
#Generate data (assign first, then collect into a data frame,
#rather than using <- inside data.frame())
x1 <- rnorm(obs, 0, 1)
x2 <- rnorm(obs, 0, 1)
y <- 1 + 0.5*x1 + 1.5*x2 + rnorm(obs, 0, 1)
data <- data.frame(x1 = x1, x2 = x2, y = y)
#Estimate OLS
tryCatch(
{
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
},
error = function(e){
return(c('(Intercept)'=NA, 'x1'=NA, 'x2'=NA))
}
)
}
output <- t(data.frame(replicate(10, reg_simulation())))
output
(Intercept) x1 x2
X1 0.9961328 0.4782010 1.481712
X2 1.0234698 0.4801982 1.556393
X3 1.0336289 0.5239380 1.435468
X4 0.9796523 0.5095907 1.493548
...
Here, tryCatch (see also failwith) catches the error and returns NA as the default value.
Note that you only need to set the seed once because the seed changes automatically with every call to random number generator in a deterministic fashion.
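To see that in action, here is a tiny sketch: two batches of three draws after a single set.seed() call reproduce exactly the same stream as six draws after the same seed.
set.seed(42)
first <- rnorm(3)  # draws 1-3 of the stream
second <- rnorm(3) # draws 4-6: different values, still deterministic
set.seed(42)
identical(c(first, second), rnorm(6)) # TRUE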
I am trying to run a Monte Carlo simulation of a difference in differences estimator, but I am running into an error. Here is the code I am running:
# Set the random seed
set.seed(1234567)
library(MonteCarlo)
#Set up problem, doing this before calling the function
# set sample size
n<- 400
# set true parameters: betas and sd of u
b0 <- 1 # intercept for control data (b0 in diffndiff)
b1 <- 1 # shift on both control and treated after treatment (b1 in
#diffndiff)
b2 <- 2 # difference between intercept on control vs. treated (b2-this is
#the level difference pre-treatment to compare to coef on treat)
b3 <- 3 # shift after treatment that is only for treated group (b3-this is
#the coefficient of interest in diffndiff)
b4 <- 0 # parallel time trend (not measured in diffndiff) biases b0,b1 but
#not b3 that we care about
b5 <- 0 # allows for treated group trend to shift after treatment (0 if
#parallel trends holds)
su <- 4 # std. dev for errors
dnd <- function(n,b0,b1,b2,b3,b4,b5,su){
#initialize a time vector (set observations equal to n)
timelength = 10
t <- c(1:timelength)
num_obs_per_period = n/timelength #allows for multiple observations in one
#time period (can simulate multiple states within one group or something)
t0 <- c(1:timelength)
for (p in 1:(num_obs_per_period-1)){
t <- c(t,t0)
}
T<- 5 #set treatment period
g <- t >T
post <- as.numeric(g)
# assign equal amounts of observations to each state to start with (would
#like to allow selection into treatment at some point)
treat <- vector()
for (m in 1:(round(n/2))){
treat <- c(treat,0)
}
for (m in 1:(round(n/2))){
treat <- c(treat,1)
}
u <- rnorm(n,0,su) #This assumes the mean error is zero
#create my y vector now from the data
y<- b0 + b1*post + b2*treat + b3*treat*post + b4*t + b5*(t-T)*treat*post +u
interaction <- treat*post
#run regression
olsres <- lm(y ~ post + treat + interaction)
olsres$coefficients
# assign the coefficients
bhat0<- olsres$coefficients[1]
bhat1 <- olsres$coefficients[2]
bhat2<- olsres$coefficients[3]
bhat3<- olsres$coefficients[4]
bhat3_stderr <- coef(summary(olsres))[4, "Std. Error"] # row 4 = interaction
#Here I will use bhat3 to conduct a t-test and determine if this was a pass
#or a fail
tval <- (bhat3-b3)/ bhat3_stderr
#decision at the 5% significance level (FALSE indicates the absolute t-stat
#was less than 1.96, and we fail to reject the null)
decision <- abs(tval) > 1.96
decision <- unname(decision)
return(list(decision))
}
#Define a parameter grid to simulate over
from <- -5
to <- 5
increment <- .25
gridparts<- c(from , to , increment)
b5_grid <- seq(from = gridparts[1], to = gridparts[2], by = gridparts[3])
parameter <- list("n" = n, "b0" = b0 , "b1" = b1 ,"b2" = b2 ,"b3" = b3 ,"b4"
=
b4 ,"b5" = b5_grid ,"su" = su)
#Now simulate this multiple times in a monte carlo setting
results <- MonteCarlo(func = dnd ,nrep = 100, param_list = parameter)
And the error that comes up is:
Error in results[[i]] <- array(NA, dim = c(dim_vec, nrep)) :
attempt to select less than one element in integerOneIndex
This leads me to believe that somewhere something is attempting to access the "0th" element of a vector, which doesn't exist in R as far as I understand. I suspect this happens internally in the package rather than in my own code, however, and I can't make sense of the package code that runs.
I am also open to hearing about other methods that will essentially replace simulate() from Stata.
The function passed to MonteCarlo must return a list with named components. Changing the return statement at the end of dnd to
return(list("decision" = decision))
should work.
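For illustration, here is a minimal sketch of that named-list convention with a toy statistic; the function and parameter values are made up, not taken from the question:
library(MonteCarlo)
toy <- function(n, mu) {
x <- rnorm(n, mean = mu)
decision <- abs(mean(x) / (sd(x) / sqrt(n))) > 1.96
return(list("decision" = decision)) # named component, as MonteCarlo requires
}
res <- MonteCarlo(func = toy, nrep = 100,
param_list = list("n" = c(50, 100), "mu" = c(0, 0.5)))
MakeTable(output = res, rows = "n", cols = "mu")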
I would like to apply a t-test in R within a loop.
Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525
I tried this code with a loop, but it is not working:
for (i in 2:4){
t.test(table[,c(i)] ~ table$Groups, conf.level = 0.95)
}
Can anyone help me with this?
Thanks!
Your code computes three t-tests (one for each of columns 2 to 4), but the results are lost because you don't do anything with them. Try the following:
info <- read.table(header=TRUE, text="Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525")
results <- list()
for (i in 2:4){
results[[i]] <- t.test(info[,i] ~ info$Groups, conf.level = 0.95)
}
print(results)
When interacting with the REPL/console, typing the t.test call computes the results and returns them, and the console prints everything that is returned. In scripts that you source, t.test still returns its results, but they will not be printed. This is why I put them into a list and printed the list later on.
Btw, I stored your data as info, not as table. R deals fine with variable names that are also function names, but every now and then you will have a hard time reading error messages, so avoid naming variables table or matrix or c or df.
Using apply functions you could also do:
res <- cbind(
do.call(rbind, apply(info[,-1], 2, function(cv)
t.test(cv ~ info$Groups, conf.level = 0.95)[c("statistic","parameter","p.value")])),
t(apply(info[,-1], 2, function(cv)
unlist(t.test(cv ~ info$Groups, conf.level = 0.95)[c("conf.int","estimate")])))
)
> res
statistic parameter p.value conf.int1 conf.int2 estimate.mean in group A estimate.mean in group B
Length 0.7327329 3.991849 0.5044236 -1.13263 1.943907 2.175542 1.769904
Size 0.2339013 3.467515 0.8282072 -0.47739 0.5595231 0.714917 0.6738504
Diet 0.9336103 3.823748 0.4056203 -0.7396173 1.468761 1.460625 1.096053
place 0.9748978 3.162223 0.398155 -159.4359 306.2686 337.4955 264.0791
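If all you need are the p-values, the same apply idea collapses to a single line (assuming the info data frame above):
sapply(info[,-1], function(cv) t.test(cv ~ info$Groups, conf.level = 0.95)$p.value)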
I'm generating regressions of 30 subpopulations in a for-loop and assigning them to successive elements of a list (matrix?). It seems like only the first component of each regression is making it into the list. The simple version goes like this:
i <- 30
num30 <- with(gapMeans, lm(DHt[Gap==i] ~ Time[Gap==i]))
Works just as I want. The loop version goes:
gmRegs <- NULL
for (i in 1:30){
gmRegs[i] <- with(gapMeans,
if(Ht[Gap==i][1] > 1){
lm(DHt[Gap==i] ~ Time[Gap==i])
} else {NULL}
)}
That runs correctly but:
num30
# Call:
# lm(formula = DHt[Gap == i] ~ Time[Gap == i])
#
# Coefficients:
# (Intercept) Time[Gap == i]
# 24.56874 -0.01546
gmRegs[30]
# [[1]]
# (Intercept) Time[Gap == i]
# 24.56874082 -0.01546019
And str() describes num30 as a list of 13 while gmRegs[30] is a list of 1, and when I try to do abline(reg=gmRegs[30]) it won't work. So it seems like my assignment is doing only thing1[1] <- thing2[1], or something to that effect. I just can't figure out how to properly box up the lm() object to fit in the list slot.
When you save an lm as an item to a list, the lm itself is a structured element in R. As you have noted, running str(num30) returns a list of 13 things. If you want to save each lm as an element in a list, you can do the following:
# generate random data
response <- runif(90,0,1)
time <- runif(90,10,20)
gap <- rep(1:30,3)
gapMeans <- data.frame(gap,response,time)
Now, head(gapMeans) returns
gap response time
1 1 0.6809973 12.66655
2 2 0.5473042 11.73821
3 3 0.6095777 18.96527
4 4 0.3081830 15.62343
5 5 0.1640612 13.42454
6 6 0.8473997 12.83730
As Richard pointed out above, you can rewrite your with call as the following lm:
num30 <- lm(response[gap==30] ~ time[gap==30], data = gapMeans)
Now for your loop you can simply write the following:
gmRegs <- NULL
for(i in 1:30){
gmRegs[[i]] <- lm(response[gap==i] ~ time[gap==i], data= gapMeans)
}
Now each element of gmRegs, accessed via gmRegs[[30]] is itself a lm object.
plot(gapMeans$time[gapMeans$gap==30], gapMeans$response[gapMeans$gap==30], xlab = 'time', ylab = 'response')
abline(gmRegs[[30]]$coefficients, col = "red")
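As a follow-up, once all fits sit in a list, the coefficients for every gap can be collected into one matrix (this assumes every element of gmRegs is an lm fit, as in the loop above):
coefs <- t(sapply(gmRegs, coef))
head(coefs)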
I'm using either dyn or dynlm to predict time series using lagged variables.
However, the predict function in either case only evaluates one time step at a time, taking a constant time of 24 milliseconds per step on my computer, or about 1.8 hours for my dataset, which is super long, given that the entire regression takes about 10 seconds.
So, I'm thinking that perhaps the fastest thing might be just to evaluate the formula by hand?
So, is there some way of evaluating a formula given values in a data.frame or the current environment or similar?
I'm thinking of something along the lines of:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ) )
I suppose, as I write this, that we need to handle the coefficients somehow, something like:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ), model$coefficients )
.... so this raises the questions of:
isn't this what predict is supposed to do?
why is predict so slow?
what options do I have to make the prediction a bit faster? After all, it's not inverting any matrices or something, it's just a bit of arithmetic!
I wrote my own lag implementation in the end. It's hacky and not beautiful, but it's a lot faster. It can process 1000 rows in 4 seconds on my crappy laptop.
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
addLags <- function( dataset, lags ) {
N <- nrow(dataset)
if( nrow(lags) > 0 ) {
for( j in 1:nrow(lags) ) {
sourcename <- as.character( lags[j,"var"] )
k <- lags[j,"amount"]
# new column, eg y_1 = y lagged by 1, zero-padded at the start
lagcolname <- sprintf("%s_%d",sourcename,k)
dataset[,lagcolname] <- c(rep(0,k), dataset[1:(N-k),sourcename])
}
}
dataset
}
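To make the bookkeeping concrete, here is a toy illustration (made-up four-row input) of what addLags() produces; lagging y by 1 zero-pads the start and shifts the column down:
toy <- data.frame(y = c(10, 20, 30, 40))
lags1 <- data.frame(var = "y", amount = 1)
addLags(toy, lags1)
#    y y_1
# 1 10   0
# 2 20  10
# 3 30  20
# 4 40  30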
lmLagged <- function( formula, train, lags ) {
# skip the rows at the start, where the largest lag is not yet available
N <- nrow(train)
skip <- max(lags[,"amount"])
train <- addLags( train, lags )
lm( formula, train[(skip+1):N,] )
}
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
predictLagged <- function( model, train, test, dependentvarname, lags ) {
Ntrain <- nrow(train)
Ntest <- nrow(test)
test[,dependentvarname] <- NA
testtraindata <- rbind( train, test )
testtraindata <- addLags( testtraindata, lags )
for( i in 1:Ntest ) {
# predict one step, then feed the result forward into the lag columns
thistestdata <- testtraindata[Ntrain + i,]
result <- predict(model,newdata=thistestdata)
for( j in 1:nrow(lags) ) {
sourcename <- as.character( lags[j,"var"] )
k <- lags[j,"amount"]
lagcolname <- sprintf("%s_%d",sourcename,k)
if( Ntrain + i + k <= nrow(testtraindata) ) { # guard: don't write past the end
testtraindata[Ntrain + i + k,lagcolname] <- result
}
}
testtraindata[Ntrain+i,dependentvarname] <- result
}
return( testtraindata[(Ntrain+1):(Ntrain + Ntest),dependentvarname] )
}
library("RUnit")
# size of training data
N <- 6
predictN <- 50
# create training data, which we can get exact fit on
set.seed(1)
x = sample( 100, N )
traindata <- numeric()
traindata[1] <- 1 + 1.1 * x[1]
traindata[2] <- 2 + 1.1 * x[2]
for( i in 3:N ) {
traindata[i] <- 0.5 + 0.3 * traindata[i-2] - 0.8 * traindata[i-1] + 1.1 * x[i]
}
train <- data.frame(x = x, y = traindata, foo = 1)
#train$x <- NULL
# create testing data, bunch of NAs
test <- data.frame( x = sample(100,predictN), y = rep(NA,predictN), foo = 1)
# specify which lags we need to handle
# one row per lag, with name of variable we are lagging, and the distance
# we can then use these in the formula, eg y_1, and y_2
# are y lagged by 1 and 2 respectively
# It's hacky but it kind of works...
lags <- data.frame( var = c("y","y"), amount = c(1,2) )
# fit a model
model <- lmLagged( y ~ x + y_1 + y_2, train, lags )
# look at the model, it's a perfect fit. Nice!
print(model)
print(system.time( test <- predictLagged( model, train, test, "y", lags ) ))
#checkEqualsNumeric( 69.10228, test[56-6], tolerance = 0.0001 )
#checkEquals( 2972.159, test$y[106-6] )
print(test)
# nice plot
plot(test, type='l')
Output:
> source("test/test.regressionlagged.r",echo=F)
Call:
lm(formula = formula, data = train[(skip + 1):N, ])
Coefficients:
(Intercept) x y_1 y_2
0.5 1.1 -0.8 0.3
user system elapsed
0.204 0.000 0.204
[1] -19.108620 131.494916 -42.228519 80.331290 -54.433588 86.846257
[7] -13.807082 77.199543 12.698241 64.101270 56.428457 72.487616
[13] -3.161555 99.575529 8.991110 44.079771 28.433517 3.077118
[19] 30.768361 12.008447 2.323751 36.343533 67.822299 -13.154779
[25] 72.070513 -11.602844 115.003429 -79.583596 164.667906 -102.309403
[31] 193.347894 -176.071136 254.361277 -225.010363 349.216673 -299.076448
[37] 400.626160 -371.223862 453.966938 -420.140709 560.802649 -542.284332
[43] 701.568260 -679.439907 839.222404 -773.509895 897.474637 -935.232679
[49] 1022.328534 -991.232631
There's about 12 hours' work in those 91 lines of code. OK, I confess I played Plants vs. Zombies for a bit. So, 10 hours. Plus lunch and dinner. Still, quite a lot of work anyway.
If we change predictN to 1000, I get about 4.1 seconds from the system.time call.
I think it's faster because:
we don't use timeseries; I suspect that speeds things up
we don't use dynamic lm libraries, just normal lm; I guess that's slightly faster
we only pass a single row of data into predict for each prediction, which I think is significantly faster; e.g. using dyn$lm or dynlm, if one has a lag of 30, one would need to pass 31 rows of data into predict AFAIK
a lot less data.frame/matrix copying, since we just update the lag values in-place on each iteration
Edit: corrected minor buggette where predictLagged returned a multi-column data-frame instead of just a numeric vector
Edit2: corrected less minor bug where you couldn't add more than one variable. Also reconciled the comments and code for lags, and changed the lags structure to "var" and "amount" in place of "name" and "lags". Also, updated the test code to add a second variable.
Edit: there are tons of bugs in this version, which I know, because I've unit-tested it a bit more and fixed them, but copying and pasting is very time-consuming, so I will update this post in a few days, once my deadline is over.
Maybe you're looking for this:
fastlinpred <- function(formula, newdata, coefs) {
X <- model.matrix( formula, data=newdata)
X %*% coefs
}
coefs <- c(1,2,3)
dd <- data.frame( temperature = 10, time = 4 )
fastlinpred( ~ temperature + time,
dd , coefs )
This assumes that the formula has only a RHS (you can get rid of the LHS of a formula by doing form[-2]).
This certainly gets rid of a lot of the overhead of predict.lm, but I don't know if it is as fast as you want. model.matrix has a lot of internal machinery too.
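As a quick illustration of the form[-2] trick mentioned above, dropping the LHS from a two-sided formula:
form <- load ~ temperature + time
form[-2]
# ~temperature + time
The resulting one-sided formula can then be passed straight to fastlinpred().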