I'm using the Ames, Iowa housing prices data set.
I have a train set and a test set. The test set is missing the dependent variable SalePrice (no SalePrice column exists at all).
I have fit a linear model and am now trying to predict the SalePrice values on the test set. But when doing so, I always get the same predicted values for SalePrice, regardless of the model used.
Then, when trying to calculate RMSE, I get NA.
Here is my model:
lm2 <- lm(SalePrice ~
GarageCars +
Neighborhood +
I(OverallQual^2) + OverallQual +
OverallQual*GrLivArea +
log2(LotArea) +
log2(GrLivArea) +
KitchenQual +
I(TotalBsmtSF^2) +
TotalBsmtSF
, data=train)
# Add an empty column to the test set,
# to be later filled in by predictions
# (Is this even necessary?):
test[, "SalePrice"] <- NA
# My predictions:
predictions <- predict(lm2, newdata = test)
head(predictions)
1 2 3 4 5 6
121093.5 170270.7 170029.5 187012.1 239359.2 172962.1
I suspect I'm just not understanding predict(); perhaps I am only getting predicted values based on my train set rather than on my test set.
I know that the variable names need to match exactly those used in the model, but what other aspect of predict() am I not understanding? Do I need to perform the same predictor variable transformations in the test set, and must I create variables to hold them?
Then I calculate the RMSE:
# Formula function for calculating RMSE:
rmse <- function(actual, pred) sqrt(mean((actual-pred)^2))
# Calculate rmse on test set:
rmse(test$SalePrice, predictions)
[1] NA
Could you please tell me what I'm doing wrong? Let me know if you need to see the data.
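For reference, a minimal sketch that reproduces the NA with made-up numbers (the values here are hypothetical, mirroring the all-NA SalePrice column created above):
actual <- rep(NA_real_, 3)               # like test$SalePrice after the NA assignment
pred <- c(121093.5, 170270.7, 170029.5)  # first few predictions from above
sqrt(mean((actual - pred)^2))            # [1] NA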
I have a dataset demos_mn of demographics and an outcome variable. There are 5 variables of interest, so my glm and null models look like this:
# binomial model
res.binom <- glm(var.bool ~ var1 + var2*var3 + var4 + var5,
data = demos_mn, family = "binomial")
# null model
res.null <- glm(var.bool ~ 1,
data = demos_mn, family = "binomial")
# calculate marginal R2
print(r.squaredGLMM(res.binom))
# show p value
print(anova(res.null, res.binom))
That is my workflow for glm mixed models, but for my binomial model I do not get a p-value for the overall model, only for the predictors. I'm hoping someone could enlighten me.
I did have some success using glmer for a repeated measures version of the model, however that unfortunately means I had to get rid of some key variables that were not measured repeatedly.
Perhaps you forgot test = "Chisq"? From ?anova.glm:
test: a character string, (partially) matching one of ‘"Chisq"’,
‘"LRT"’, ‘"Rao"’, ‘"F"’ or ‘"Cp"’. See ‘stat.anova’.
example("glm") ## to set up / fit the glm.D93 model
null <- update(glm.D93, . ~ 1)
anova(glm.D93, null, test="Chisq")
Analysis of Deviance Table
Model 1: counts ~ outcome + treatment
Model 2: counts ~ 1
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         4     5.1291
2         8    10.5814 -4  -5.4523    0.244
test="Chisq" is poorly named: it's a likelihood ratio test, note it's an asymptotic test [relies on a large sample size]. For GLMs with an adjustable scale parameter (Gaussian, Gamma, quasi-likelihood) you would use test="F".
I have a series of regressions where I would like to test different null hypotheses within the same regression.
Specifically, I would like to test whether one independent variable's coefficient is equal to 1 and the others' coefficients are equal to 0.
netew3 <- summary(lm(ewvw[, 3] - factors$RF ~ factors$Mkt.RF + factors$SMB + factors$HML + factors$MOM, data = ewvw, na.action = na.exclude))
I would like to test whether the first variable (factors$Mkt.RF) is equal to 1 and the others (SMB, HML, and MOM) are equal to zero.
summary() of an lm object gives you p-values for all coefficients under the null hypothesis that each coefficient equals 0. However, it also gives you all the information necessary to conduct your own test with a different null hypothesis, e.g. that a coefficient equals 1.
The t-test of regression coefficients is explained in detail in many places. Essentially, you get the t-value by calculating (estimate - reference) / SE, where SE is the standard error and reference is the assumed value of the coefficient under the null hypothesis (usually 0). So all you have to do is change the reference from 0 to 1 and you have your t-value.
I automated this in the function below. h0.value is the assumed value under the null hypothesis. You can check that it works properly with your data/model by running it with h0.value = 0 and comparing the result to what you get from summary(). If it matches, use it with h0.value = 1.
estim_test <- function(lm.mod, h0.value = 0) {
  coefm <- as.data.frame(summary(lm.mod)$coefficients)
  # t-statistic for H0: coefficient == h0.value
  coefm$`t value` <- (coefm$Estimate - h0.value) / coefm$`Std. Error`
  # two-sided p-value from the t distribution with the model's residual df
  coefm$`Pr(>|t|)` <- 2 * pt(-abs(coefm$`t value`), df = lm.mod$df.residual)
  coefm
}
# Testing the function
data("swiss")
mod1 <- lm(Fertility ~ Agriculture + Education + Catholic, data=swiss)
summary(mod1)
estim_test(mod1, h0.value=0)
estim_test(mod1, h0.value=1)
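If you instead want a single joint test of all four restrictions at once (Mkt.RF = 1 and SMB = HML = MOM = 0), one option is linearHypothesis() from the car package. A sketch under the assumption that the model is refit with the factor columns supplied via data, so the coefficient names are plain (the names excess and fit.ew are hypothetical):
library(car)  # provides linearHypothesis() for Wald tests of linear restrictions
# hypothetical refit: excess return as the response, factor columns as predictors
excess <- ewvw[, 3] - factors$RF
fit.ew <- lm(excess ~ Mkt.RF + SMB + HML + MOM, data = factors, na.action = na.exclude)
# joint Wald test of Mkt.RF = 1 and SMB = HML = MOM = 0
linearHypothesis(fit.ew, c("Mkt.RF = 1", "SMB = 0", "HML = 0", "MOM = 0"))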
I would like to ask if it is possible to extract coefficients for individual observations for a given variable in panel data using plm in R, for instance using the example data:
library(plm)  # the Grunfeld example data ships with the package
data("Grunfeld", package = "plm")
wi <- plm(inv ~ value + capital, data = Grunfeld, model = "within", effect = "twoways")
In other words, in this example, a coefficient for each of the firms in the sample.
It is not clear to me what you are looking for:
1) "coefficients for individual observations"
vs.
2) "a coefficient for each of the firms"
These are two different things, and I believe you can have neither.
1) Asking for more than one coefficient for one observation does not work.
2) Asking for one coefficient per firm while trying to estimate two (value and capital) does not seem plausible.
Maybe this answer gives what you are looking for:
Extract all individual slope coefficient from pooled OLS estimation in R
I think you're looking for the function fixef in the plm package, which extracts the "fixed effects" (firm coefficients in this case). In your example:
library(plm)
data("Grunfeld")
wi <- plm(inv ~ value + capital, data = Grunfeld, model = "within", effect = "twoways")
and then run:
> fixef(wi)
1 2 3 4 5 6 7 8 9
-134.227709 72.826531 -269.458508 -38.873866 -139.666304 -31.339066 -82.761099 -66.737194 -104.010153
10
-7.390586
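Since the model was fit with effect = "twoways", note that fixef() also accepts an effect argument (see ?fixef in plm), so the time effects can be pulled out as well; a small sketch:
fixef(wi, effect = "individual")  # firm effects, as shown above
fixef(wi, effect = "time")        # year effects from the twoways model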
So I have a glm that is defined like this
oring.glm = glm(oring.data$Damaged ~ oring.data$Temp, data = oring.data, family=binomial)
The data looks like this
Oring Temp
1 15
0 20
1 30
I want to predict what happens to the Oring at a specific temperature. I've tried doing this:
logodds = predict(oring.glm, list(Temp=31))
But this gives me a list of values, as opposed to a single odds value.
How do I get that?
If I'm correct to assume that your DV is dichotomous, then I'd use the logit link function and access the predicted value of my fit using simple indexing:
g <- glm(y ~ x, family = binomial("logit"))  # fit
predict(g)[1234]  # linear predictor (log-odds) for the 1234th observation of the training data
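For what it's worth, the reason predict() returns a whole vector in the question is that the formula hard-codes oring.data$Temp, so the newdata argument is ignored. A minimal sketch of the usual route, assuming oring.data really has columns Damaged and Temp:
# refit using bare column names so predict() can look Temp up in newdata
oring.glm <- glm(Damaged ~ Temp, data = oring.data, family = binomial)
# single log-odds prediction at Temp = 31
predict(oring.glm, newdata = data.frame(Temp = 31))
# single predicted probability of damage at Temp = 31
predict(oring.glm, newdata = data.frame(Temp = 31), type = "response")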
Let me state my confusion with the help of an example:
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Using the coefficients' values from summary(model), prediction is done for the next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is this: the two sets of predictions for the last 30 values differ, perhaps only slightly, but they do differ. Why is that? Should they not be exactly the same?
The difference is really small, and I think it is just due to the precision of the coefficients you are using (e.g. the stored value of the intercept is -0.17947075338464965610..., not simply -0.17947).
In fact, if you take the coefficients value and apply the formula, the result is equal to predict:
# extract the coefficients at full precision from the fitted model
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
# the manual prediction now matches predict() exactly
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason your predictions differ is that predict() uses the coefficients at full precision, whereas your "manual" calculation uses only five decimal places. The summary() function doesn't display the complete values of the coefficients; it rounds them to make the output more readable.
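If you want to see the stored coefficients closer to full precision, a quick sketch:
# print the fitted coefficients with more significant digits
print(coef(fit), digits = 15)
# or format a single one, e.g. the intercept
sprintf("%.20f", coef(fit)[1])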