If I use the lme function in the package nlme and write
m <- lme(y ~ Time, random = ~1|Subject)
and then write
Variogram(m, form = ~Time|Subject)
it produces the variogram no problem.
However, if I use lm without the random effect,
m <- lm(y ~ Time)
and write
Variogram(m, form = ~Time)
it produces
Error in Variogram.default(m, form = ~Time) :
argument "distance" is missing, with no default
What's going on? Why does it need a distance when I fit an lm, when it didn't need one before with lme?
How then does one plot a Variogram without needing to specify "distance"? I have the same problem using other modelling methods: glm, gam, gamm, etc.
EDIT:
You can verify all of this yourself using e.g. the BodyWeight data in nlme.
> m <- lm(weight ~ Time, data = BodyWeight)
> Variogram(m, form = ~Time)
Error in Variogram.default(m, form = ~Time) :
argument "distance" is missing, with no default
In nlme there is a Variogram.lme method for lme fits, but there is no equivalent method for lm models, so an lm object falls through to Variogram.default, which requires the distance argument -- hence the error above.
You can use Variogram.default directly as follows:
library(nlme)
mod1 <- lm(weight ~ Time, data = BodyWeight)
n <- nrow(BodyWeight)
variog <- Variogram(resid(mod1), distance=dist(1:n))
head(variog)
############
      variog dist
1 17.4062805    1
2 23.1229516    2
3 29.6500135    3
4 15.6848617    4
5  3.1222878    5
6  0.9818238    6
We can also plot the variogram:
plot(variog)
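Since Variogram.default only needs a vector of residuals and matching pairwise distances, the same recipe should carry over to the other model types mentioned in the question (glm, gam, etc.); a sketch for a glm fit of the same data, using the same index-based distances as above:
mod2 <- glm(weight ~ Time, data = BodyWeight)
variog2 <- Variogram(resid(mod2, type = "pearson"), distance = dist(1:nrow(BodyWeight)))
head(variog2)
plot(variog2)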
I'm performing predictive analysis where I train a model on a portion of my data and test the model with the remaining portion. I'm familiar with the MICE package and the imputation procedure using predictive mean matching.
My understanding is that the proper way to utilize imputation is to create numerous imputed data sets, fit a model to each of those imputed data sets, then combine the coefficients across all of those fitted models into one single model. I know how to do this and view the summary of the coefficients with which I can perform inference on the variables. However, that is not my objective; I need to end up with a single model that I can use to predict new values.
Simply put, when I try to use the predict function with this model I got from using MICE, it doesn't work.
Any suggestions? I am coding this in R.
Edit: using the airquality data set as an example, my code looks like this:
imputed_data <- mice(airquality, method = c(rep("pmm", 6)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Ozone ~ Solar.R + Wind + Temp + Month + Day))
pooled_model <- pool(model)
This gives me a pooled model across my 5 imputed data sets. However, I am unable to use the predict function with this model. When I then execute:
predict(pooled_model, newdata = airquality)
I get this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('mira', 'matrix')"
Not sure exactly what you're looking for, but something like this might work:
library(mice)
library(mitools)
data(mtcars)
mtcars$qsec[c(4,6,8,21)] <- NA
imps <- mice(mtcars, m=10)
comps <- lapply(1:imps$m, function(i) complete(imps, i))
mods <- lapply(comps, function(x) lm(qsec ~ hp + drat + wt, data = x))
pmod <- MIcombine(mods)
pmod$coefficients
#> (Intercept)          hp        drat          wt
#> 18.15389098 -0.02570887  0.11434023  0.92348390
newvals <- data.frame(hp=300, drat=4, wt=2.58)
X <- model.matrix(~hp + drat + wt, data=newvals)
preds <- X %*% pmod$coefficients
preds
#> [,1]
#> 1 13.28118
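If you prefer to stay within mice itself, one pragmatic alternative is to predict from each per-imputation fit and average those predictions; a sketch using the question's airquality example (object names such as fits and pred_mat are just for illustration):
library(mice)
imputed_data <- mice(airquality, method = rep("pmm", 6), m = 5, maxit = 5, seed = 1)
fits <- lapply(1:imputed_data$m, function(i)
  lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = complete(imputed_data, i)))
# predict from each fit on the new data, then average across imputations;
# rows of newdata with missing predictors come back as NA here
pred_mat <- sapply(fits, predict, newdata = airquality)
pred_avg <- rowMeans(pred_mat)
head(pred_avg)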
I'm using the MuMIn package in R to get an averaged model (http://www.inside-r.org/packages/cran/MuMIn/docs/model.avg) and to predict from that. The package also includes a predict method specifically for an object returned by model.avg (http://www.inside-r.org/node/123636). I've tried using the examples listed; code as follows:
# Example from Burnham and Anderson (2002), page 100:
fm1 <- lm(y ~ X1 + X2 + X3 + X4, data = Cement)
ms1 <- dredge(fm1)
# obtain model average for AIC delta <2
avgm <- model.avg(ms1, subset=delta<2)
# predict from the averaged model
averaged.full <- predict(avgm, full = TRUE)
But I keep getting
Error in predict.averaging(avgm, full = TRUE): can predict only from 'averaging' object containing model list
which I don't understand, because I did follow the examples and used an object returned by model.avg. Am I missing something?
When you create an "averaging" object directly from a "model.selection" object, it does not contain the component models, which are required for predict to work. You can use model.avg(..., fit = TRUE), which will fit the models again.
To avoid fitting the models twice, you can instead first create a list of all the models with lapply(dredge(..., evaluate = FALSE), eval) and afterwards use model.avg(..., subset = ...) on that list (see the sketch below).
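A sketch of both routes, building on the question's Cement example (object names like models and avgm2 are just for illustration):
library(MuMIn)
data(Cement, package = "MuMIn")
fm1 <- lm(y ~ X1 + X2 + X3 + X4, data = Cement, na.action = na.fail)  # dredge requires na.fail
ms1 <- dredge(fm1)
# route 1: refit the component models while averaging
avgm <- model.avg(ms1, subset = delta < 2, fit = TRUE)
predict(avgm, full = TRUE)
# route 2: evaluate the model list once, then average it
models <- lapply(dredge(fm1, evaluate = FALSE), eval)
avgm2 <- model.avg(models, subset = delta < 2)
predict(avgm2, full = TRUE)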
I'm anticipating that I'm missing something glaringly obvious here.
I'm trying to build a demonstration of overfitting. I've got a quadratic generating function from which I've drawn 20 samples, and I now want to fit polynomial linear models of increasing degree to the sampled data.
For some reason, regardless of which model I use, every time I run predict I get N predictions back, where N is the number of records used to train my model.
set.seed(123)
N=20
xv = seq(1,5,length.out=1e4)
x=sample(xv,N)
gen=function(v){v^2 + 2*rnorm(length(v))}
y=gen(x)
df = data.frame(x,y)
# convenience function for building formulas for polynomial regression
build_formula = function(N){
fpart = paste(lapply(2:N, function(i) {paste('+ poly(x,',i,',raw=T)')} ), collapse="")
paste('y~x',fpart)
}
## Example:
## build_formula(4)="y~x + poly(x, 2 ,raw=T)+ poly(x, 3 ,raw=T)+ poly(x, 4 ,raw=T)"
model = lm(build_formula(10), data=df)
predict(model, data=xv) # returns 20 values instead of 1000
predict(model, data=1) # even *this* spits out 20 results. WTF?
This behavior is present regardless of the degree of polynomial in the formula, including the trivial case 'y~x':
formulas = sapply(c(2,10,20), build_formula)
formulas = c('y~x', formulas)
pred = lapply(formulas
,function(f){
predict(
lm(f, data=df)
,data=xv)
})
lapply(pred, length) # 4 x 20 predictions, expecting 4 x 1000
# unsuccessful sanity check
m1 = lm('y~x', data=df)
predict(m1,data=xv)
This is driving me insane. What am I doing wrong?
The second argument to predict is newdata, not data.
Also, you don't need multiple calls to poly in your model formula: a single poly(x, N, raw = TRUE) already contains all the lower-order terms, so the extra poly() calls are collinear with it.
Also^2, to generate a sequence of predictions using xv, you have to put it in a data frame with the appropriate name: data.frame(x=xv).
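So, a sketch of the corrected call, reusing the question's df and xv:
# a single raw polynomial term already contains all the lower powers
model <- lm(y ~ poly(x, 10, raw = TRUE), data = df)
# newdata must be a data frame whose column name matches the variable in the formula
pred <- predict(model, newdata = data.frame(x = xv))
length(pred)  # 10000: one prediction per value in xv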
I am trying to predict fitted values over data containing NAs, and based on a model generated by plm. Here's some sample code:
require(plm)
test.data <- data.frame(id=c(1,1,2,2,3), time=c(1,2,1,2,1),
y=c(1,3,5,10,8), x=c(1, NA, 3,4,5))
model <- plm(y ~ x, data=test.data, index=c("id", "time"),
model="pooling", na.action=na.exclude)
yhat <- predict(model, test.data, na.action=na.pass)
test.data$yhat <- yhat
When I run the last line I get an error stating that the replacement has 4 rows while data has 5 rows.
I have no idea how to get predict to return a vector of length 5...
If instead of running a plm I run an lm (as in the line below) I get the expected result.
model <- lm(y ~ x, data=test.data, na.action=na.exclude)
As of version 2.6.2 of plm (2022-08-16), this should work out of the box, including out-of-sample prediction from fixed effects models (from the NEWS file):
prediction implemented for fixed effects models incl. support for argument newdata and out-of-sample prediction. Help page (?predict.plm) added to specifically explain the prediction for fixed effects models and the out-of-sample case.
I think this is something that predict.plm ought to handle for you -- seems like an oversight on the package authors' part -- but you can use ?napredict to implement it for yourself:
pp <- predict(model, test.data)
na.stuff <- attr(model$model,"na.action")
(yhat <- napredict(na.stuff,pp))
## [1] 1.371429 NA 5.485714 7.542857 9.600000
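And, as noted above, with plm >= 2.6.2 a direct call along the lines of the question should work without this workaround (a sketch, not verified against that exact version):
# requires plm >= 2.6.2, which added newdata support to predict.plm
yhat <- predict(model, newdata = test.data)
# depending on how NAs in newdata are handled, the row with NA in x may be dropped or returned as NA
length(yhat)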
I'm using lm on a time series, which works quite well actually, and it's super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
  x        y
1 4 4.333333
2 5 5.333333
3 6 6.333333
This works super nicely, and it's really speedy.
I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:
> train$y_1 <- c(0,train$y[1:nrow(train)-1])
> train
  x y y_1
1 1 2   0
2 2 1   2
3 3 4   1
update the formula:
formula <- y ~ x * y_1
... and training will work just fine:
> model <- lm( formula, train )
> # no errors here
However, the problem is that there is no way of using 'predict', because there is no way of populating y_1 in a test set in a batch manner.
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.
So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:
formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?
Have a look at e.g. the dynlm package which gives you lag operators. More generally the Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its examples -- a one and twelve month lag:
R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm
Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)
Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))
Coefficients:
(Intercept)     L(uk, 1)    L(uk, 12)
      0.183        0.431        0.511
R>
Following Dirk's suggestion of dynlm, I couldn't quite figure out how to predict with it, but searching for that led me to the dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package
Then, after several hours of experimentation, I came up with the following function to handle the prediction. There were quite a few gotchas on the way, e.g. you can't seem to rbind time series, and the result of predict is offset by start(result), so I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.
So, a solution that works is:
use the dyn package
use the following method for prediction
predictDyn method:
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name, so that it can make this into a timeseries
predictDyn <- function( model, train, test, dependentvarname ) {
Ntrain <- nrow(train)
Ntest <- nrow(test)
# can't rbind ts's apparently, so convert to numeric first
train[,dependentvarname] <- as.numeric(train[,dependentvarname])
test[,dependentvarname] <- as.numeric(test[,dependentvarname])
testtraindata <- rbind( train, test )
testtraindata[,dependentvarname] <- ts( as.numeric( testtraindata[,dependentvarname] ) )
for( i in 1:Ntest ) {
result <- predict(model,newdata=testtraindata,subset=1:(Ntrain+i-1))
testtraindata[Ntrain+i,dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
}
return( testtraindata[(Ntrain+1):(Ntrain + Ntest),] )
}
Example usage:
library("dyn")
# size of training and test data
N <- 6
predictN <- 10
# create training data, which we can get exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] }
train <- data.frame( y = ts( traindata ), foo = 1)
# create testing data, bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)
# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model, it's a perfect fit. Nice!
print(model)
test <- predictDyn( model, train, test, "y" )
print(test)
# nice plot
plot(test$y, type='l')
Output:
> model
Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)
Coefficients:
(Intercept)   lag(y, -1)   lag(y, -2)
        0.5          1.7          1.3
> test
             y foo
7     143.2054   1
8     325.6810   1
9     740.3247   1
10   1682.4373   1
11   3823.0656   1
12   8686.8801   1
13  19738.1816   1
14  44848.3528   1
15 101902.3358   1
16 231537.3296   1
Edit: hmmm, this is super slow though. Even if I limit the data in the subset to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
Try the arima() function in the stats package. The AR order handles the auto-regressive part, which means lagged y, and the xreg argument allows you to add other X variables. You can then get predictions with predict() (the predict.Arima method); see the sketch below.
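A minimal sketch of that idea with stats::arima (the AR(2) order and the simulated data are just for illustration):
set.seed(1)
y <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 100)  # an AR(2) series
x <- rnorm(100)                                          # an external regressor
fit <- arima(y, order = c(2, 0, 0), xreg = x)
# forecast 3 steps ahead, supplying future values of the regressor
predict(fit, n.ahead = 3, newxreg = rnorm(3))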
Here's a thought:
Why don't you create a new data frame? Fill a data frame with the regressors you need. You could have columns like L1, L2, ..., Lp for all the lags of any variable you want, and then use your functions exactly as you would for a cross-section type of regression (see the sketch below).
Because you will not have to operate on your data every time you call the fitting and prediction functions, but will have transformed the data once, it will be considerably faster. I know that EViews and Stata provide lag operators, and there is some convenience to them. But it is also inefficient if you do not need everything that functions like lm compute. If you have a few hundred thousand iterations to perform and you just need the forecast, or the forecast plus information criteria like BIC or AIC, you can beat lm on speed by avoiding computations you will not use -- just write an OLS estimator in a function and you're good to go.
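A minimal sketch of that pre-computed-lags approach (the column names y_L1 and y_L2 are my own):
dat <- data.frame(x = 1:8, y = c(2, 1, 4, 3, 5, 6, 7, 9))
dat$y_L1 <- c(NA, head(dat$y, -1))      # lag 1 of y
dat$y_L2 <- c(NA, NA, head(dat$y, -2))  # lag 2 of y
fit <- lm(y ~ x + y_L1 + y_L2, data = dat)  # rows with NA lags are dropped automatically
# one-step-ahead prediction: supply the known lagged values explicitly
newrow <- data.frame(x = 9, y_L1 = dat$y[8], y_L2 = dat$y[7])
predict(fit, newdata = newrow)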