Predicting values in R (multiple regression) [duplicate]

I did a multiple linear regression in R using the function lm and I want to use it to predict several values. So I'm trying to use the function predict().
Here is my code:
new=data.frame(t=c(10, 20, 30))
v=1/t
LinReg<-lm(p ~ log(t) + v)
Pred=predict(LinReg, new, interval="confidence")
So I would like to predict the values of p when t=c(10,20,30...). However, this is not working and I don't see why. The error message I get is:
"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : variable lengths differ (found for 'vart')
In addition: Warning message:
'newdata' had 3 rows but variables found have 132 rows "
132 is the length of the vectors the regression was run on. I checked my vector 1/t and it is well defined and has the right length. What is curious is that if I do a simple linear regression (with one predictor), the same code works fine:
new=data.frame(t=c(10, 20, 30))
LinReg<-lm(p ~ log(t))
Pred=predict(LinReg, new, interval="confidence")
Can anyone help me please! Thanks in advance.

The problem is you defined v as a new, distinct variable from t when you fit your model. R doesn't remember how a variable was created so it doesn't know that v is a function of t when you fit the model. So when you go to predict values, it uses the existing values of v which would have a different length than the new values of t you are specifying.
Instead you want to fit
new <- data.frame(t=c(10, 20, 30))
LinReg <- lm(p ~ log(t) + I(1/t))
Pred <- predict(LinReg, new, interval="confidence")
If you did want v to be a completely independent variable, then you would need to supply values for v as well in your new data.frame in order to predict p.
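For instance, with the original fit LinReg <- lm(p ~ log(t) + v), a minimal sketch of that case (the v values below are made up purely for illustration):
new <- data.frame(t = c(10, 20, 30), v = c(0.5, 0.3, 0.2))  # hypothetical v values supplied alongside t
Pred <- predict(LinReg, new, interval = "confidence")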

Related

Linear regression prediction using interaction terms in R

I am trying to code a model which uses an interaction term, and to generate out-of-sample predictions from the model.
My training sample has 3 variables and 11 rows.
My test sample has 3 variables and 1 row.
My code is the following.
inter.model <- lm(Y.train ~ Y.lag.train + X.1.train + X.1.train:X.2.train)
However, I am not quite sure how R handles the interaction terms.
I have coded the predictions using the coefficients from the model and the test data.
inter.prediction <- inter.model$coef[1] + inter.model$coef[2]*Y.lag.test +
inter.model$coef[3]*X.1.test + (inter.model$coef[4]*X.1.test*X.2.test)
I wanted to make sure that these predictions were correctly coded, so I tried to reproduce them with R's predict() function:
inter.pred.function <- predict(inter.model, newdata=test_data)
However, I am getting an error message:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
variable lengths differ (found for 'X.2.train')
In addition: Warning message:
'newdata' had 1 row but variables found have 11 rows
names(test_data)
[1] "Y.lag.test" "X.1.test" "X.1.test:X.2.test"
So, my question is, how do you code and make linear regression predictions with interaction terms in R?
You won't need "X.1.test:X.2.test" in your new data; the interaction is created automatically inside stats:::predict.lm via the model.matrix. Note, however, that for predict() to find the variables, the columns of newdata must carry the same names used in the model formula, i.e. Y.lag.train, X.1.train and X.2.train.
fit <- lm(mpg ~ hp*am, mtcars[1:10, ])
test <- mtcars[-(1:10), c('mpg', 'hp', 'am')]
as.numeric(predict(fit, newdata=test))
# [1] 20.220513 17.430053 17.430053 17.430053 16.206167 15.716612 14.982281 25.658824 27.141176 25.764706
# [11] 21.493355 18.898716 18.898716 14.247949 17.674830 25.658824 23.011765 20.682353 4.694118 14.117647
# [21] -2.823529 21.105882

Marginal Effect from svyglm object with a subsample in R

I need to compute marginal effects for a subsample from a generalized linear model (family = poisson) estimated via the svyglm function from the R package survey.
First, I declared the survey design with:
myDesign = svydesign(id=data$id, strata=data$strata, weights=data$sw, data=data)
Second, I estimated my model as:
fit = svyglm(y ~ x1 + x2, design=myDesign, data=data, subset= x3 == 1, family= poisson(link = "log"))
Finally, when I want to get the average marginal effect for, say, x1, I run:
summary(margins(fit, variables = "x1", design=myDesign))
... but I get the following error message:
"Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': 'x' and 'w' must have the same length"
Running the following does not work either:
summary(margins(fit, variables = "x1", design=myDesign, subset=x3==1))
Solution:
summary(margins(fit, variables = "x1", design=myDesign[myDesign$variables$x3 == 1]))
Subsetting complex surveys leads to problems in the error estimation. When you are interested in a parameter for a specific subsample, you should use the desired subsample to estimate the parameter of interest, but the full sample for the estimation of its error.
For example, svyglm(y ~ x, design = myDesign, subset = z == 1) does exactly this (beta_hat is estimated using only the observations for which z = 1, while se(beta_hat) uses the full sample).
Subsetting a survey design is possible, and it keeps the original design information about the number of clusters and strata. The code shown above is the "manual" way of doing so. Alternatively, one can rely directly on the subset.survey.design function from the survey package:
myDesign_subset <- subset(myDesign, x3 == 1)
The two methods are equivalent and produce correct z-stats.
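A self-contained illustration using the api data bundled with the survey package (data set and variable names come from the package examples, not from the question):
library(survey)
data(api)  # loads apiclus1, a cluster sample of California schools
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
dsub <- subset(dclus1, stype == "H")  # design-aware subset: keeps the full-sample design information
fit <- svyglm(api00 ~ ell + meals, design = dsub)
summary(fit)  # standard errors still account for the full design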

Problem predicting glmnet(): "contrasts can be applied only to factors with 2 or more levels"

I trained a penalized regression model using R's glmnet package, with X constructed via sparse.model.matrix and the formula ~ . * var1, to get every term from my data plus an interaction with var1:
X3 <- sparse.model.matrix(object = ~.*(var1), data = X)[,-1]
cv_lasso <- cv.glmnet(x = X3, y = Y3,
                      alpha = 1,
                      nfolds = 10,
                      family = "binomial",
                      nlambda = 100,
                      lambda.min.ratio = 0.001,
                      type.measure = "auc",
                      keep = TRUE,
                      parallel = TRUE)
Now I'm trying to predict on a couple of data points, converting the new data with sparse.model.matrix() for use with predict.glmnet(), like below:
X_pred <- sparse.model.matrix(object = ~.*(var1), data = X_holdout)
predict(object = cv_lasso,
        newx = X_pred,
        s = "lambda.min")
But I get the following error:
Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I believe this might be caused by a couple of columns in X_holdout that are basically constant (which is expected, since I'm only predicting now; training already completed successfully).
How can I avoid this problem? My understanding is that, since I trained my model using interactions, I have to create a model matrix with the same interactions in my predictions.
Found the root of my problem: some of the prediction X columns were constant, since the holdout data is significantly smaller than the training data.
To fix this, I needed to use the "xlev" argument when creating the sparse matrices for both the training data and the prediction data, with the same xlev for both.
In case you don't know what "xlev" is, it's basically a named list of character vectors that indicates the levels to use when expanding factor variables into dummy/one-hot columns. This way, even if you have a column with only one observed value, sparse.model.matrix() can understand that there are more levels; they are just not present in the data. This argument will also help you make sure that the training and prediction matrices have the same number of columns, which is important for predict.glmnet().
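A minimal sketch of that fix, assuming X and X_holdout are data frames sharing the same factor columns:
# collect the full level sets from the training data
xlev <- lapply(Filter(is.factor, X), levels)
# build both matrices with the same level information
X3 <- sparse.model.matrix(object = ~ . * var1, data = X, xlev = xlev)[, -1]
X_pred <- sparse.model.matrix(object = ~ . * var1, data = X_holdout, xlev = xlev)[, -1]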

How to deal with the error "variable lengths differ"

I have panel data covering several individuals over a five-year period, and I want to do a 2SLS estimation using the plm package. My instrumental variables are the first lags of the endogenous variables, S and S_active. When I run the second-stage IV regression I get the following error:
variable lengths differ (found for 'S_hat')
Please, can someone help me out on how to resolve this error?
Apparently, by making use of the first lag of the S and S_active variables in the first-stage regression, I lose one year of observations for each panel. As such, I understand the error to mean that the fitted values of my endogenous variables have a shorter length than the original data, and that is why my second-stage IV regression throws that error.
I have googled how to deal with this error and came across quite similar questions here. But none has particularly addressed my specific situation.
I tried another suggested solution (i.e. adding the fitted values to the original data) but got another error:
arguments imply differing number of rows: 9196, 7192
This is how my code looks:
library(bootstrap)
library(AER)
library(systemfit)
library(sandwich)
library(lmtest)
library(boot)
library(laeken)
library(smoothmest)
library(glm2)
library(tidyverse)
library(foreign)
library(plm)
data_09=read.csv("panel2009.csv")
attach(data_09)
table(is.na(data_09))
FALSE
1480556
#First stage: run plm, regressing S on exogenous independent variables including the IV, lag(S)
firstpan=plm(S~ act + plm::lag(S) + S_active + C_tot, data=data_09, na.action = na.exclude, index = c("individual","year"), method = "within", effect = "twoways")
summary((firstpan))
#First stage: run plm, regressing S_active on exogenous independent variables including the IV, lag(S_active)
secondpan=plm(S_active ~act + act*plm::lag(towater) + S + C_tot, data=data_09, na.action = na.exclude, index = c("individual","year"), method = "within", effect = "twoways")
summary((secondpan))
#Collect fitted values and add them to data
S_hat=fitted(firstpan)
S_active_hat=fitted(secondpan)
#Run the standard 2SLS using S_hat=fitted and S_active_hat as instruments
iv=ivreg(Y ~ act+ S + S_active + C_tot|act + S_hat +S_active_hat + C_tot,data=data_09, na.action=na.exclude)
summary(iv)
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'S_hat')
The first-stage regressions run successfully, but the second-stage IV regression throws up the error shown above.
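One way to attack the length mismatch (a sketch, not from the thread above): align the first-stage fitted values with the full data by the panel index before running the second stage, so the rows dropped by the lag simply become NA. This assumes index(model) and fitted(model) line up row by row, as they do in recent plm versions.
# merge fitted values back by (individual, year); rows lost to the lag stay NA
fit1 <- data.frame(plm::index(firstpan), S_hat = as.numeric(fitted(firstpan)))
fit2 <- data.frame(plm::index(secondpan), S_active_hat = as.numeric(fitted(secondpan)))
data_09 <- merge(data_09, fit1, by = c("individual", "year"), all.x = TRUE)
data_09 <- merge(data_09, fit2, by = c("individual", "year"), all.x = TRUE)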

predict.glm() with three new categories in the test data (r)(error)

I have a data set called data which has 481,092 rows.
I split data into two equal halves:
The first half (rows 1:240546) is called train and was used for the glm();
the second half (rows 240547:481092) is called test and should be used to validate the model.
Then I started the regression:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
train$manufacturerID + train$salutation + train$state +
train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
Now the prediction:
prediction <- predict.glm(testreg, newdata=test, type="response")
gives me an error:
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
Now, I know that these levels were omitted from the regression because no coefficients are shown for them.
I have tried this: predict.lm() with an unknown factor level in test data. But it somehow doesn't work for me, or maybe I just don't get how to implement it. I want to predict the dependent binary variable, but of course only with the existing coefficients. The link above suggests telling R that rows with new levels should just be treated as NA.
How can I proceed?
Edit: suggested approach by Z. Li
I got a problem in the first step:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
but mID125 is NULL! What have I done wrong?
It is impossible to get estimates for new factor levels in fixed-effect modelling, including linear models and generalized linear models. glm (as well as lm) keeps a record of which factor levels were present and used during model fitting; it can be found in testreg$xlevels.
Your model formula for model estimation is:
returnShipment ~ size + color + price + manufacturerID + salutation +
state + age + deliverytime
then predict complains about new factor levels 125, 136, 137 for manufacturerID. This means these levels are not inside testreg$xlevels$manufacturerID and therefore have no associated coefficients for prediction. In this case, we have to drop this factor variable and use the prediction formula:
returnShipment ~ size + color + price + salutation +
state + age + deliverytime
However, the standard predict routine cannot take your customized prediction formula. There are commonly two solutions:
extract the model matrix and the model coefficients from testreg, and manually predict the terms we want by matrix-vector multiplication (this is what the link given in your post suggests; a sketch follows this list);
reset the factor levels in test to any one of the levels in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1], so that the standard predict can still be used.
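A hedged sketch of the first option, assuming the model was refitted in the good style shown in the update below (plain variable names, data = train) and that the remaining factor variables carry the same levels in test:
# manual prediction that drops the manufacturerID term entirely
beta <- coef(testreg)
keep <- !grepl("^manufacturerID", names(beta))  # drop manufacturerID coefficients
Xnew <- model.matrix(~ size + color + price + salutation + state + age + deliverytime, data = test)
eta <- drop(Xnew[, names(beta)[keep]] %*% beta[keep])  # align columns by coefficient name
testreg$family$linkinv(eta)  # back to the response (probability) scale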
For the second solution, let's first pick a factor level that was used during model fitting:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
And we are ready to predict:
pred <- predict(testreg, test, type = "link") ## don't use type = "response" here!!
In the end, we adjust this linear predictor by subtracting the factor estimate:
est <- coef(testreg)[paste0("manufacturerID", mID125)]
pred <- pred - est
Finally, if you want predictions on the original scale, apply the inverse of the link function:
testreg$family$linkinv(pred)
update:
You mentioned that you ran into various problems when trying the above solutions. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color +
train$price + train$manufacturerID + train$salutation +
train$state + train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a very bad way to specify your model formula. Writing train$returnShipment, etc., restricts variable lookup strictly to the data frame train, and you will have trouble later when predicting with other data sets, like test.
As a simple example of this drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo) ## bad style
> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now, we see everything comes with a prefix foo$. During prediction:
newdata <- foo[1:2, ] ## take first 2 rows of "foo" as "newdata"
rm(foo) ## remove "foo" from R session
predict(toy, newdata)
we get an error:
Error in eval(expr, envir, enclos) : object 'foo' not found
The good style is to let the data argument of the function supply the variables:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
then foo$ goes away.
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This explains two things:
when you do testreg$xlevels$manufacturerID you get NULL, because the entry is actually stored under the name train$manufacturerID;
the prediction error you posted,
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
complains about train$manufacturerID rather than test$manufacturerID.
As you divided your train and test samples based on row numbers, some factor levels of your variables are not represented in both samples.
You need to do stratified sampling to ensure that both train and test samples contain all factor levels. Use stratified from the splitstackshape package; a sketch follows.
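A minimal sketch of such a split, stratifying on manufacturerID (you can pass several column names to group; treat the exact call as illustrative):
library(splitstackshape)
set.seed(1)
# 50/50 split within each manufacturerID level, so both halves see every level
parts <- stratified(data, group = "manufacturerID", size = 0.5, bothSets = TRUE)
train <- parts[[1]]
test <- parts[[2]]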
