I am trying to fit a model that uses an interaction term and then generate out-of-sample predictions from it.
My training sample has 3 variables and 11 rows.
My test sample has 3 variables and 1 row.
My code is the following.
inter.model <- lm(Y.train ~ Y.lag.train + X.1.train + X.1.train:X.2.train)
However, I am not quite sure how R handles the interaction terms.
I have coded the predictions using the coefficients from the model and the test data.
inter.prediction <- inter.model$coef[1] + inter.model$coef[2]*Y.lag.test +
inter.model$coef[3]*X.1.test + (inter.model$coef[4]*X.1.test*X.2.test)
I wanted to make sure that these predictions were coded correctly, so I tried to reproduce them with R's predict() function.
inter.pred.function <- predict(inter.model, newdata=test_data)
However, I am getting an error message:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
variable lengths differ (found for 'X.2.train')
In addition: Warning message:
'newdata' had 1 row but variables found have 11 rows
names(test_data)
[1] "Y.lag.test" "X.1.test" "X.1.test:X.2.test"
So, my question is, how do you code and make linear regression predictions with interaction terms in R?
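You won't need "X.1.test:X.2.test" in your new data; the interaction is created automatically in stats:::predict.lm via the model.matrix. What newdata does need is columns named exactly like the variables in the model formula, so fit the model with a data argument and plain column names: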
fit <- lm(mpg ~ hp*am, mtcars[1:10, ])
test <- mtcars[-(1:10), c('mpg', 'hp', 'am')]
as.numeric(predict(fit, newdata=test))
# [1] 20.220513 17.430053 17.430053 17.430053 16.206167 15.716612 14.982281 25.658824 27.141176 25.764706
# [11] 21.493355 18.898716 18.898716 14.247949 17.674830 25.658824 23.011765 20.682353 4.694118 14.117647
# [21] -2.823529 21.105882
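To check a hand-coded prediction against predict(), a minimal sketch on simulated data may help. The data frames `train` and `test`, the column names `Y`, `Y.lag`, `X.1`, `X.2`, and all the numbers are made up for illustration; only the model structure follows the question.

```r
# Simulated stand-in for the 11-row training sample (values are arbitrary).
set.seed(1)
train <- data.frame(Y.lag = rnorm(11), X.1 = rnorm(11), X.2 = rnorm(11))
train$Y <- 1 + 0.5 * train$Y.lag + 2 * train$X.1 +
  0.3 * train$X.1 * train$X.2 + rnorm(11, sd = 0.1)

# One-row test sample with the SAME column names as the training data.
test <- data.frame(Y.lag = 0.2, X.1 = -0.4, X.2 = 1.5)

# Fit with data = ..., so predict() can look the variables up in newdata.
fit <- lm(Y ~ Y.lag + X.1 + X.1:X.2, data = train)

# Manual prediction from the coefficients, addressed by name.
b <- coef(fit)
manual <- b[["(Intercept)"]] +
  b[["Y.lag"]] * test$Y.lag +
  b[["X.1"]] * test$X.1 +
  b[["X.1:X.2"]] * test$X.1 * test$X.2

# predict() builds the interaction column itself from X.1 and X.2.
auto <- unname(predict(fit, newdata = test))

all.equal(manual, auto)
```

Addressing coefficients by name rather than by position avoids silently using the wrong term if the formula is later reordered.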
I have two models that I am running across an imputed dataset in order to produce pooled estimates. My understanding is that because both models are run on many imputed data frames, I have to pool, or essentially "average out", all the regression model estimates into one "overall" estimate. Below are the steps I took:
#1 IMPUTE MASTER DATASET
imputed_data <- mice(master, m=20, maxit=50, seed=5798713)
#2 RUN LINEAR MODEL
model.linear <- with(imputed_data, lm(outcome~exposure+age+gender+weight))
summary(pool(model.linear))
#3 RUN NON-LINEAR RESTRICTED CUBIC SPLINE (3-KNOT) MODEL
model.rcs <- with(imputed_data, lm(outcome~rcs(exposure,3)+age+gender+weight))
summary(pool(model.rcs))
#4 COMPARE BOTH MODELS USING POOL.COMPARE FUNCTION
pool.compare(model.rcs, model.linear)
Both the linear and RCS models produce "pooled" estimates, 95% CIs, and p-values once I call summary(pool(...)). However, when I run the pool.compare function, I get the following error:
Error: Model 'fit0' not contained in 'fit1'
In addition: Warning message:
'pool.compare' is deprecated.
Use 'D1' instead.
See help("Deprecated")
I'm confused as to why the model says fit0 is not contained in fit1 when the "exposure", "outcome", and all the covariates listed are the same between the linear and RCS models. Is there an option that I'm missing here?
Any help/guidance would be very appreciated.
P.S. I am unfortunately unable to provide a sample datacut considering how large the imputed dataset is. Let me know how I can better improve my question if there's any confusion.
As the warning says, pool.compare is deprecated. Use D1 instead:
library(mice)
library(rms)
D1(model.rcs, model.linear)
# test statistic df1 df2 dfcom p.value riv
# 1 ~~ 2 6.248565 2 8.635754 20 0.02098072 0.449098
In some examples pool.compare gives only the warning, but in others it gives both the error and the warning:
pool.compare(model.rcs, model.linear)
#Error: Model 'fit0' not contained in 'fit1'
#In addition: Warning message:
# 'pool.compare' is deprecated.
#Use 'D1' instead.
#See help("Deprecated")
The error seems to be caused by the model itself, i.e. the rcs model fitted as follows, whereas the reproducible example below compares two linear models:
imp <- mice(nhanes)
model.linear <- with(imp, lm(age ~ bmi + hyp + chl))
model.rcs <- with(imp, lm(age ~ rcs(bmi, 3) + hyp + chl))
Reproducible example
imp <- mice(nhanes2, print=FALSE, m=50, seed=00219)
fit0 <- with(data=imp,expr=lm(bmi~age+hyp))
fit1 <- with(data=imp,expr=lm(bmi~age+hyp+chl))
stat <- pool.compare(fit1, fit0)
#Warning message:
#'pool.compare' is deprecated.
#Use 'D1' instead.
#See help("Deprecated")
stat <- D1(fit1, fit0)
stat
# test statistic df1 df2 dfcom p.value riv
# 1 ~~ 2 7.606026 1 16.2182 20 0.01387548 0.3281893
After fitting a model with glm I got this as a result:
Warning message:
glm.fit: Adjusted probabilities with numerical value 0 or 1.
After some research on Google, I tried with the brglm package. When I try to apply backward elimination on the model, I get the following error:
Error in do.call("glm.control", control) : second argument must be a list.
I searched on Google but I didn't find anything.
Here is my code with brglm:
library(mlbench)
#require(Amelia)
library(caTools)
library(mlr)
library(ciTools)
library(brglm)
data("BreastCancer")
data_bc <- BreastCancer
data_bc
head(data_bc)
dim(data_bc)
#Delete id column
data_bc<- data_bc[,-1]
data_bc
dim(data_bc)
str(data_bc)
# convert all factors columns to be numeric except class.
for(i in 1:9){
data_bc[,i]<- as.numeric(as.character(data_bc[,i]))
}
str(data_bc)
#convert class: benign and malignant to binary 0 and 1:
data_bc$Class<-ifelse(data_bc$Class=="malignant",1,0)
# now convert class to factor
data_bc$Class<- factor(data_bc$Class, levels = c(0,1))
str(data_bc)
model <- brglm(formula = Class~.^2, data = data_bc, family = "binomial",
na.action = na.exclude )
summary(model)
#Backward Elimination:
final <- step(model, direction = "backward")
You can work around this by using the brglm2 package, which supersedes the brglm package anyway:
library(brglm2)  # provides the "brglmFit" fitting method
model <- glm(formula = Class~.^2, data = na.omit(data_bc), family = "binomial",
na.action = na.fail, method="brglmFit" )
final <- step(model, direction = "backward")
length(coef(model)) ## 46
length(coef(final)) ## 42
setdiff(names(coef(model)), names(coef(final)))
## [1] "Cl.thickness:Epith.c.size" "Cell.size:Marg.adhesion"
## [3] "Cell.shape:Bl.cromatin" "Bl.cromatin:Mitoses"
Some general concerns about your approach:
Stepwise reduction is one of the worst forms of model reduction (cf. lasso, ridge, elasticnet ...).
In the presence of missing data, model comparison (e.g. by AIC) is questionable, as different models will be fitted to different subsets of the data. Given that you only lose a small fraction of your data by using na.omit() (compare nrow(data_bc) with sum(complete.cases(data_bc))), I would strongly recommend dropping observations with NA values from the data set before starting.
It's also not clear to me that comparing penalized models via AIC is statistically appropriate (see here).
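The missing-data concern can be illustrated with a small sketch. It uses the built-in airquality data rather than BreastCancer, purely as an assumption for illustration, and plain lm() instead of a bias-reduced glm; the point is only how many rows each fit uses.

```r
# With NAs left in, each model silently drops a different set of rows.
fit_small <- lm(Ozone ~ Wind, data = airquality)
fit_big   <- lm(Ozone ~ Wind + Solar.R, data = airquality)
n_small <- nobs(fit_small)
n_big   <- nobs(fit_big)
c(small = n_small, big = n_big)  # different sample sizes, so AICs are not comparable

# Dropping incomplete rows first puts both models on the same observations.
aq_complete <- na.omit(airquality)
n_small_c <- nobs(lm(Ozone ~ Wind, data = aq_complete))
n_big_c   <- nobs(lm(Ozone ~ Wind + Solar.R, data = aq_complete))
c(small = n_small_c, big = n_big_c)

# Fraction of rows lost by na.omit(), analogous to the nrow() check above.
1 - nrow(aq_complete) / nrow(airquality)
```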
I am getting the following error: $ operator is invalid for atomic vectors. I am getting the error when trying to calculate the prediction error for a logistic regression model.
Here is the code and data I am using:
install.packages("ElemStatLearn")
library(ElemStatLearn)
# training data
train = vowel.train
# only looking at the first two classes
train.new = train[1:3]
# test data
test = vowel.test
test.new = test[1:3]
# performing the logistic regression
train.new$y <- as.factor(train.new$y)
mylogit <- glm(y ~ ., data = train.new, family = "binomial")
train.logit.values <- predict(mylogit, newdata=test.new, type = "response")
# this is where the error occurs (below)
train.logit.values$se.fit
I tried to convert it to a list, but that did not seem to work. Is there a quick fix so that I can obtain either the prediction error or the misclassification rate?
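One likely cause, sketched here under assumptions: predict() on a glm returns a plain numeric vector unless se.fit = TRUE is passed, and `$` cannot be used on an atomic vector. The example below uses the built-in mtcars data and a hypothetical model `am ~ wt` instead of the vowel data, so the numbers have nothing to do with the question.

```r
# Hypothetical logistic model on built-in data (not the vowel data).
fit <- glm(am ~ wt, data = mtcars, family = binomial)
new_cars <- mtcars[c(1, 15, 20), ]

# Without se.fit, the result is an atomic numeric vector: `$` would fail on it.
p_vec <- predict(fit, newdata = new_cars, type = "response")
is.atomic(p_vec)

# With se.fit = TRUE, the result is a list with components fit and se.fit.
pr <- predict(fit, newdata = new_cars, type = "response", se.fit = TRUE)
pr$fit
pr$se.fit

# Misclassification rate using a 0.5 cut-off on the predicted probability.
pred_class <- as.integer(pr$fit > 0.5)
misclass_rate <- mean(pred_class != new_cars$am)
misclass_rate
```

Note that se.fit is the standard error of each fitted value, which is not the same thing as a prediction error rate; the misclassification rate is computed separately from the predicted classes.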
I did a multiple linear regression in R using the function lm and I want to use it to predict several values. So I'm trying to use the function predict().
Here is my code:
new=data.frame(t=c(10, 20, 30))
v=1/t
LinReg<-lm(p ~ log(t) + v)
Pred=predict(LinReg, new, interval="confidence")
So I would like to predict the values of p when t=c(10,20,30...). However, this is not working and I don't see why. The error message I get is:
"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : variable lengths differ (found for 'vart')
In addition: Warning message:
'newdata' had 3 rows but variables found have 132 rows "
132 is the length of my vector of variables upon which I run the regression. I checked my vector 1/t and it is well-defined and has the right number of coefficients. What is curious is that if I do a simple linear regression (of one variable), the same code works well...
new=data.frame(t=c(10, 20, 30))
LinReg<-lm(p ~ log(t))
Pred=predict(LinReg, new, interval="confidence")
Can anyone help me please! Thanks in advance.
The problem is you defined v as a new, distinct variable from t when you fit your model. R doesn't remember how a variable was created so it doesn't know that v is a function of t when you fit the model. So when you go to predict values, it uses the existing values of v which would have a different length than the new values of t you are specifying.
Instead you want to fit
new <- data.frame(t=c(10, 20, 30))
LinReg <- lm(p ~ log(t) + I(1/t))
Pred <- predict(LinReg, new, interval="confidence")
If you did want v to be a completely independent variable, then you would need to supply values for v as well in your new data.frame in order to predict p.
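Both options can be compared in a short sketch on simulated data. The data frame `d`, the true coefficients, and the noise level are invented for illustration; only the 132-row length and the model form follow the question.

```r
# Simulated stand-in for the 132 observations (values are arbitrary).
set.seed(42)
d <- data.frame(t = 1:132)
d$p <- 3 + 2 * log(d$t) + 5 / d$t + rnorm(132, sd = 0.2)

new <- data.frame(t = c(10, 20, 30))

# Option 1: keep the transformation inside the formula with I().
fit_I  <- lm(p ~ log(t) + I(1 / t), data = d)
pred_I <- predict(fit_I, new, interval = "confidence")

# Option 2: treat v as its own column, and supply it in newdata as well.
d$v    <- 1 / d$t
fit_v  <- lm(p ~ log(t) + v, data = d)
new_v  <- transform(new, v = 1 / t)
pred_v <- predict(fit_v, new_v, interval = "confidence")

# Both routes give the same fitted values and confidence limits.
all.equal(pred_I, pred_v, check.attributes = FALSE)
```

Option 1 is usually less error-prone, because newdata only has to carry the raw variable t.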
I am having problems with predict() after a multinomial logit regression by multinom(). I generate a design matrix with model.matrix() and use it to estimate the model. Then, if I pass the entire design matrix to predict(), it returns the same output as fitted(), which is expected. But if I pass only a few rows of the design matrix, it throws this error:
Error in model.frame.default(Terms, newdata, na.action = na.omit, xlev
= object$xlevels) : variable lengths differ (found for 'z') In addition:
Warning message: 'newdata' had 6 rows but variables found have 15 rows
This is a minimal example:
require(nnet)
y<-factor(rep(c(1,2,3),5), levels=1:3, labels=c("good","bad","ugly"))
x<-rnorm(15)+.2*rep(1:3,5)
z<-factor(rep(c(1,2,2),5), levels=1:2, labels=c("short","tall"))
df<-data.frame(y=y, x=x, z=z)
mm<-model.matrix(~x+z, data=df)[,2:3]
m<-multinom(y ~ x+z, data=df)
p1<-predict(m,mm,"probs")
p2<-predict(m,head(mm),"probs")
My actual goal is out-of-sample prediction, but I could not make it work and, while debugging it, I reduced it to this problem.
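A possible explanation, offered as a sketch rather than a definitive diagnosis: predict() expects newdata to be a data frame with the original variables (here x and z), not a design matrix. When the columns it needs are not found in newdata, the full-length x and z from the workspace are picked up instead, which would explain the length mismatch. The version below builds the variables only inside the data frame and passes data frames to predict(); the out-of-sample values in `new_obs` are made up.

```r
library(nnet)

set.seed(123)
# Variables live only inside the data frame, not in the workspace.
df <- data.frame(
  y = factor(rep(c(1, 2, 3), 5), levels = 1:3, labels = c("good", "bad", "ugly")),
  x = rnorm(15) + 0.2 * rep(1:3, 5),
  z = factor(rep(c(1, 2, 2), 5), levels = 1:2, labels = c("short", "tall"))
)

m <- multinom(y ~ x + z, data = df, trace = FALSE)

# In-sample check on a subset: pass rows of the data frame, not of model.matrix().
p_head <- predict(m, newdata = head(df), type = "probs")
all.equal(p_head, head(fitted(m)), check.attributes = FALSE)

# Out-of-sample prediction: same column names, factor with the same levels.
new_obs <- data.frame(
  x = c(-0.5, 1.2),
  z = factor(c("short", "tall"), levels = c("short", "tall"))
)
p_new <- predict(m, newdata = new_obs, type = "probs")
p_new
```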