I have a categorical independent variable (with values "yes" or "no") that I want to add to my panel linear model. According to the answer here: After generating dummy variables?, the lm function automatically creates dummy variables for categorical variables.
Does this mean that creating dummy variables through, e.g., dummy.data.frame is unnecessary, and that I can just add my variable in the plm function and it will automatically be treated like a dummy variable (even if the data is not numerical)? And is this the same for the plm function?
Also, I don't have much data to begin with. Would it hurt if I manually turned the data into numbers (i.e. "yes"=1, "no"=0) without creating a dummy variable?
It is unnecessary to create dummy variables for use with the lm() function. To illustrate, we'll run a regression model on the mtcars data set, using am (0 = automatic, 1 = manual transmission) as a factor variable.
summary(lm(mpg ~ wt + factor(am),data=mtcars))
...and the output:
> summary(lm(mpg ~ wt + factor(am),data=mtcars))
Call:
lm(formula = mpg ~ wt + factor(am), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5295 -2.3619 -0.1317 1.4025 6.8782
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.32155 3.05464 12.218 5.84e-13 ***
wt -5.35281 0.78824 -6.791 1.87e-07 ***
factor(am)1 -0.02362 1.54565 -0.015 0.988
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.098 on 29 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7358
F-statistic: 44.17 on 2 and 29 DF, p-value: 1.579e-09
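On the follow-up about recoding "yes"/"no" to 1/0 by hand, here is a minimal sketch for lm() only. It assumes a hypothetical am_label column built from mtcars$am to stand in for a yes/no variable. With two levels, the text version and the 0/1 version should give the same estimates, since lm() turns the character column into a factor with "no" as the baseline.

```r
# Hypothetical yes/no column derived from am, purely for illustration
cars <- mtcars
cars$am_label <- ifelse(cars$am == 1, "yes", "no")

# Character column: lm() converts it to a factor and builds the dummy itself
fit_chr <- lm(mpg ~ wt + am_label, data = cars)

# Manual 0/1 coding of the same information
fit_num <- lm(mpg ~ wt + am, data = cars)

# Same estimates; only the coefficient name differs (am_labelyes vs. am)
print(coef(fit_chr))
print(coef(fit_num))
```

The manual recoding is harmless for a two-level variable, but it stops being equivalent once a variable has three or more levels, where a single numeric column would impose an ordering.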
When I try to compute prediction and confidence intervals around a linear regression model with two continuous variables and two categorical variables (that may act as dummy variables), the results for the two intervals are exactly the same. I'm using the predict() function.
I've already tried with other datasets, that have continuous and discrete variables, but not categorical or dichotomous variables, and intervals are different. I tried removing some variables from the regression model, and the intervals are still the same. On the other hand, I've already compared my data.frame with ones that are exemplified on the R documentation, and I think the problem isn't there.
#linear regression model: modeloReducido
summary(modeloReducido)
Call:
lm(formula = V ~ T * W + P * G, data = Datos)
Residuals:
Min 1Q Median 3Q Max
-7.5579 -1.6222 0.3286 1.6175 10.4773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.937674 3.710133 0.253 0.800922
T -12.864441 2.955519 -4.353 2.91e-05 ***
W 0.013926 0.001432 9.722 < 2e-16 ***
P 12.142109 1.431102 8.484 8.14e-14 ***
GBaja 15.953421 4.513963 3.534 0.000588 ***
GMedia 0.597568 4.546935 0.131 0.895669
T:W 0.014283 0.001994 7.162 7.82e-11 ***
P:GBaja -3.249681 2.194803 -1.481 0.141418
P:GMedia -5.093860 2.147673 -2.372 0.019348 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.237 on 116 degrees of freedom
Multiple R-squared: 0.9354, Adjusted R-squared: 0.931
F-statistic: 210 on 8 and 116 DF, p-value: < 2.2e-16
#Prediction Interval
newdata1.2 <- data.frame(T=1,W=1040,P=10000,G="Media")
#EP
opt1.PI <- predict.lm(modeloReducido, newdata1.2,
interval="prediction", level=.95)
#Confidence interval
newdata1.1 <- data.frame(T=1,W=1040,P=10000,G="Media")
#EP
opt1.CI <- predict(modeloReducido, newdata1.1,
interval="confidence", level=.95)
opt1.CI
#fit lwr upr
#1 70500.51 38260.24 102740.8
opt1.PI
# fit lwr upr
# 1 70500.51 38260.24 102740.8
opt1.PI and opt1.CI should be different.
The Excel file I was given is available at the following link:
https://www.filehosting.org/file/details/830581/Datos%20Tarea%204.xlsx
I tried several times to run a regression with both lm and plm, and I get different results.
First, I used lm as follows:
fixed.Region1 <- lm(CapNormChange ~ Policychanges + factor(Region),
data=Panel)
Further I used plm in the following way:
fixed.Region2 <- plm(CapNormChange ~ Policychanges+ factor(Region),
data=Panel, index=c("Region", "Year"), model="within", effect="individual")
I think there is something wrong with plm because I don't see an intercept in the results (see below).
Furthermore, I am not entirely sure whether + factor(Region) is necessary; however, if it is not there, I don't see the coefficients (and significance) for the dummies.
So, my questions are:
Am I using the plm function wrong (and if so, what is wrong)?
If not, how can it be that the results are different?
If somebody could give me a hint, I would really appreciate it.
Results from LM:
Call:
lm(formula = CapNormChange ~ Policychanges + factor(Region),
data = Panel)
Residuals:
Min 1Q Median 3Q Max
-31.141 -4.856 -0.642 1.262 192.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.3488 4.9134 3.531 0.000558 ***
Policychanges 0.6412 0.1215 5.277 4.77e-07 ***
factor(Region)Asia -19.3377 6.7804 -2.852 0.004989 **
factor(Region)C America + Carib 0.1147 6.8049 0.017 0.986578
factor(Region)Eurasia -17.6476 6.8294 -2.584 0.010767 *
factor(Region)Europe -20.7759 8.8993 -2.335 0.020959 *
factor(Region)Middle East -17.3348 6.8285 -2.539 0.012200 *
factor(Region)N America -17.5932 6.8064 -2.585 0.010745 *
factor(Region)Oceania -14.0440 6.8417 -2.053 0.041925 *
factor(Region)S America -14.3580 6.7781 -2.118 0.035878 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.72 on 143 degrees of freedom
Multiple R-squared: 0.3455, Adjusted R-squared: 0.3043
F-statistic: 8.386 on 9 and 143 DF, p-value: 5.444e-10
Results from PLM:
Call:
plm(formula = CapNormChange ~ Policychanges, data = Panel, effect = "individual",
model = "within", index = c("Region", "Year"))
Balanced Panel: n = 9, T = 17, N = 153
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-31.14147 -4.85551 -0.64177 1.26236 192.80277
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Policychanges 0.64118 0.12150 5.277 4.769e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 66459
Residual Sum of Squares: 55627
R-Squared: 0.16299
Adj. R-Squared: 0.11031
F-statistic: 27.8465 on 1 and 143 DF, p-value: 4.7687e-07
You would need to leave out + factor(Region) in your formula for the within model with plm to get what you want.
Within models do not have an intercept, but some software packages (esp. Stata and Gretl) report one. You can estimate it with plm by running within_intercept on your estimated model. The help page has the details about this somewhat artificial intercept.
If you want the individual effects and their significance, use summary(fixef(<your_plm_model>)). Use pFtest to check if the within specification seems worthwhile.
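A minimal sketch of those calls, using the Grunfeld data set that ships with plm as a stand-in for the asker's Panel data (the model formula and the object names fe and pool are only illustrative):

```r
library(plm)

# Example panel shipped with plm: 10 firms observed over 20 years
data("Grunfeld", package = "plm")

# Within (fixed effects) model: no factor(firm) term in the formula
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within", effect = "individual")

# Pooled OLS on the same data, needed as the comparison model for pFtest
pool <- plm(inv ~ value + capital, data = Grunfeld,
            index = c("firm", "year"), model = "pooling")

# The somewhat artificial overall intercept of the within model
print(within_intercept(fe))

# Individual (firm) effects with standard errors and p-values
print(summary(fixef(fe)))

# F test of the within model against pooled OLS
print(pFtest(fe, pool))
```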
The R-squared values differ between the lm model and the plm model. The lm model (when used like this with the dummies, it is usually called the LSDV model, for least squares dummy variables) gives what is sometimes called the overall R-squared, while plm gives the R-squared of the demeaned regression, sometimes called the within R-squared. Stata's documentation has some details about this: https://www.stata.com/manuals/xtxtreg.pdf
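To see where the two R-squared values come from, here is a rough base-R sketch on simulated data (the panel data frame, its column names, and the region effects are all made up for illustration). It demeans y and x within each region by hand instead of calling plm, so the standard errors of the hand-demeaned fit are not degrees-of-freedom corrected; only the slope and the R-squared are meant to be compared.

```r
set.seed(42)

# Simulated balanced panel: 5 regions, 10 periods each (illustrative only)
panel <- data.frame(
  region = factor(rep(paste0("R", 1:5), each = 10)),
  x      = rnorm(50)
)
region_effect <- c(R1 = 0, R2 = 5, R3 = -3, R4 = 8, R5 = 2)
panel$y <- 1 + 0.6 * panel$x +
  unname(region_effect[as.character(panel$region)]) + rnorm(50)

# LSDV: region dummies entered explicitly
lsdv <- lm(y ~ x + region, data = panel)

# Within transformation: subtract region means, then regress without intercept
panel$y_dm <- panel$y - ave(panel$y, panel$region)
panel$x_dm <- panel$x - ave(panel$x, panel$region)
within_fit <- lm(y_dm ~ x_dm - 1, data = panel)

# Identical slope on x
print(c(lsdv = unname(coef(lsdv)["x"]),
        within = unname(coef(within_fit)["x_dm"])))

# Different R-squared: overall (LSDV) vs. within (demeaned)
print(c(overall = summary(lsdv)$r.squared,
        within = summary(within_fit)$r.squared))
```

Both fits leave the same residual sum of squares; the LSDV R-squared divides it by the total variation in y, the within R-squared by the variation left after removing the region means, so the LSDV value is never the smaller of the two.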
I'm an absolute R beginner and need some help with my likelihood ratio tests for my univariate analyses. Here's the code:
#Univariate analysis for conscientiousness (categorical)
fit <- glm(BCS_Bin~Conscientiousness_cat,data=dat,family=binomial)
summary(fit)
#Likelihood ratio test
fit0<-glm(BCS_Bin~1, data=dat, family=binomial)
summary(fit0)
lrtest(fit, fit0)
The results are:
Call:
glm(formula = BCS_Bin ~ Conscientiousness_cat, family = binomial,
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8847 -0.8847 -0.8439 1.5016 1.5527
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.84933 0.03461 -24.541 <2e-16 ***
Conscientiousness_catLow 0.11321 0.05526 2.049 0.0405 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7962.1 on 6439 degrees of freedom
Residual deviance: 7957.9 on 6438 degrees of freedom
(1963 observations deleted due to missingness)
AIC: 7961.9
Number of Fisher Scoring iterations: 4
And:
Call:
glm(formula = BCS_Bin ~ 1, family = binomial, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8524 -0.8524 -0.8524 1.5419 1.5419
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.82535 0.02379 -34.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10251 on 8337 degrees of freedom
Residual deviance: 10251 on 8337 degrees of freedom
(65 observations deleted due to missingness)
AIC: 10253
Number of Fisher Scoring iterations: 4
For my LRT:
Error in lrtest.default(fit, fit0) :
models were not all fitted to the same size of dataset
I understand that this is happening because there are different numbers of observations missing? The data come from a large questionnaire, and many more respondents had dropped out by the question assessing my predictor variable (conscientiousness) than by the one assessing the outcome variable (body condition score/BCS). So I simply have more data for BCS than for conscientiousness (the same error occurs for many of my other variables too).
In order to run the likelihood ratio test, the model with just the intercept has to be fit to the same observations as the model that includes Conscientiousness_cat. So, you need the subset of the data (your data frame is called dat) that has no missing values for Conscientiousness_cat:
dat_subset = dat[complete.cases(dat[,"Conscientiousness_cat"]), ]
You can run both models on this subset of the data and the likelihood ratio test should run without error.
In your case, you could also do:
dat_subset = dat[!is.na(dat$Conscientiousness_cat), ]
However, it's nice to have complete.cases handy when you want a subset of a data frame with no missing values across multiple variables.
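A minimal sketch of the subsetting approach on simulated data (the simulated dat only mimics the column names from the question). To keep it dependency-free, it uses base R's anova(..., test = "Chisq") for the likelihood ratio test; lrtest(fit, fit0) from lmtest should accept the same pair of models once they are fit to the same rows.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the questionnaire data
dat <- data.frame(
  BCS_Bin = rbinom(n, 1, 0.3),
  Conscientiousness_cat = factor(sample(c("High", "Low"), n, replace = TRUE))
)
# More missingness in the predictor than in the outcome, as in the question
dat$Conscientiousness_cat[sample(n, 40)] <- NA
dat$BCS_Bin[sample(n, 5)] <- NA

# Keep only rows where the predictor is observed
dat_subset <- dat[!is.na(dat$Conscientiousness_cat), ]

# Fit both models to the same subset
fit  <- glm(BCS_Bin ~ Conscientiousness_cat, data = dat_subset, family = binomial)
fit0 <- glm(BCS_Bin ~ 1, data = dat_subset, family = binomial)

# Both models now use the same number of observations
print(c(full = nobs(fit), null = nobs(fit0)))

# Likelihood ratio test via the analysis-of-deviance table
lrt <- anova(fit0, fit, test = "Chisq")
print(lrt)
```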
Another option, which is more convenient if you're going to run multiple models but is also more complex, is to first fit whatever model uses the largest number of variables from your data frame (since that model will exclude the largest number of observations due to missingness) and then use the update function to update that model to models with fewer variables. We just need to make sure that update uses the same observations each time, which we do using a wrapper function defined below. Here's an example using the built-in mtcars data frame:
library(lmtest)
dat = mtcars
# Create some missing values in mtcars
dat[1, "wt"] = NA
dat[5, "cyl"] = NA
dat[7, "hp"] = NA
# Wrapper function to ensure the same observations are used for each
# updated model as were used in the first model
# From https://stackoverflow.com/a/37341927/496488
update_nested <- function(object, formula., ..., evaluate = TRUE){
  update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
m1 = lm(mpg ~ wt + cyl + hp, data=dat)
m2 = update_nested(m1, . ~ . - wt) # Remove wt
m3 = update_nested(m1, . ~ . - cyl) # Remove cyl
m4 = update_nested(m1, . ~ . - wt - cyl) # Remove wt and cyl
m5 = update_nested(m1, . ~ . - wt - cyl - hp) # Remove all three variables (i.e., model with intercept only)
lrtest(m5,m4,m3,m2,m1)
The output for the lm model with two categorical variables is:
Call:
lm(formula = exit_irr ~ type_exit + domicile, data = pe1)
Residuals:
Min 1Q Median 3Q Max
-0.73013 -0.17926 -0.05142 0.03945 2.85043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05333 0.22282 0.239 0.81101
type_exitTrade Sale -0.11871 0.05469 -2.171 0.03081
type_exitUnlisted -0.21208 0.07536 -2.814 0.00525
domicileKSA 0.14593 0.22852 0.639 0.52363
domicileKuwait 0.14679 0.22847 0.643 0.52108
domicileOM 0.08708 0.28225 0.309 0.75791
domicileUAE 0.18623 0.22808 0.817 0.41491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3859 on 274 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.04221, Adjusted R-squared: 0.02124
F-statistic: 2.013 on 6 and 274 DF, p-value: 0.06415
How to write equation of linear regression with categorical predictors?
The lm() function in R automatically handles categorical variables. It creates dummy variables from your categorical variables and runs the regression on them. Make sure your categorical variables are of class factor. This can be done as follows:
pe1$type_exit <- as.factor(pe1$type_exit)
pe1$domicile <- as.factor(pe1$domicile)
I have assumed that type_exit and domicile are your categorical columns.
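As a rough illustration of how the dummy columns translate into an equation, here is a sketch on mtcars rather than pe1, which isn't available here (the cars copy and the example car with wt = 3 and 6 cylinders are illustrative choices). Each dummy is 1 when the observation has that level and 0 otherwise, and the level without a coefficient is the baseline absorbed into the intercept.

```r
cars <- mtcars
cars$cyl <- factor(cars$cyl)          # levels "4", "6", "8"; "4" is the baseline

fit <- lm(mpg ~ wt + cyl, data = cars)

# The design matrix lm() actually used: columns (Intercept), wt, cyl6, cyl8
print(head(model.matrix(fit)))

b <- coef(fit)

# Equation: mpg = b0 + b_wt * wt + b_cyl6 * [cyl is 6] + b_cyl8 * [cyl is 8]
# Worked by hand for a 6-cylinder car with wt = 3
by_hand <- b["(Intercept)"] + b["wt"] * 3 + b["cyl6"] * 1 + b["cyl8"] * 0

# The same prediction from predict()
new_car <- data.frame(wt = 3, cyl = factor("6", levels = levels(cars$cyl)))
by_predict <- predict(fit, newdata = new_car)

print(c(by_hand = unname(by_hand), by_predict = unname(by_predict)))
```

Read the same way, the pe1 output above gives exit_irr = 0.05333 - 0.11871*[type_exit is Trade Sale] - 0.21208*[type_exit is Unlisted] + 0.14593*[domicile is KSA] + 0.14679*[domicile is Kuwait] + 0.08708*[domicile is OM] + 0.18623*[domicile is UAE], where each bracket is 1 if true and 0 otherwise.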
The following code fits a polynomial regression (with terms up to the cubic) in R.
lm.out3 = lm(listOfDataFrames1$avgTime ~ listOfDataFrames1$betaexit + I(listOfDataFrames1$betaexit^2) + I(listOfDataFrames1$betaexit^3))
summary(lm.out3)
Call:
lm(formula = listOfDataFrames1$avgTime ~ listOfDataFrames1$betaexit +
I(listOfDataFrames1$betaexit^2) + I(listOfDataFrames1$betaexit^3))
Residuals:
Min 1Q Median 3Q Max
-14.168 -2.923 -1.435 2.459 28.429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 199.41 11.13 17.913 < 2e-16 ***
listOfDataFrames1$betaexit -3982.03 449.49 -8.859 1.14e-12 ***
I(listOfDataFrames1$betaexit^2) 32630.86 5370.27 6.076 7.87e-08 ***
I(listOfDataFrames1$betaexit^3) -93042.90 19521.05 -4.766 1.15e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.254 on 63 degrees of freedom
Multiple R-squared: 0.9302, Adjusted R-squared: 0.9269
F-statistic: 279.8 on 3 and 63 DF, p-value: < 2.2e-16
But I am confused about how to plot the fitted curve on the graph.
To get the scatter plot:
plot(listOfDataFrames1$avgTime~listOfDataFrames1$betaexit)
But how do I add the curve?
Is there any way to do it without manually copying the values?
Like mso suggested though it works.
This should work.
# not tested
lm.out3 = lm(avgTime ~ poly(betaexit,3,raw=TRUE), data=listOfDataFrames1)
plot(avgTime~betaexit, data=listOfDataFrames1)
curve(predict(lm.out3,newdata=data.frame(betaexit=x)),add=TRUE)
Since you didn't provide any data, here is a working example using the built-in mtcars dataset.
fit <- lm(mpg~poly(wt,3,raw=TRUE),mtcars)
plot(mpg~wt,mtcars)
curve(predict(fit,newdata=data.frame(wt=x)),add=T)
Some notes:
(1) It is a really bad idea to reference external data structures in the formula=... argument to lm(...). Instead, reference columns of a data frame passed in the data=... argument, as above and as @mso points out.
(2) You can specify the formula as @mso suggests, or you can use the poly(...) function with raw=TRUE.
(3) The curve(...) function takes an expression as its first argument. This expression has to have a variable x, which will be populated automatically by values from the x-axis of the graph. So in this example, the expression is:
predict(fit,newdata=data.frame(wt=x))
which uses predict(...) on the model with a dataframe having wt (the predictor variable) given by x.
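Two of the notes above can be checked quickly on mtcars. This is only a sketch, and the prediction grid of 200 points is an arbitrary choice. It compares the I(...) form of the formula with poly(..., raw=TRUE), and draws the curve with lines() over a sorted grid as an alternative to curve().

```r
# Two ways of writing the same cubic model
fit_poly <- lm(mpg ~ poly(wt, 3, raw = TRUE), data = mtcars)
fit_I    <- lm(mpg ~ wt + I(wt^2) + I(wt^3), data = mtcars)

# Same coefficients, different names
print(cbind(poly = unname(coef(fit_poly)), I = unname(coef(fit_I))))

# Alternative to curve(): predict over an evenly spaced, sorted grid
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 200))
grid$mpg_hat <- predict(fit_poly, newdata = grid)

plot(mpg ~ wt, data = mtcars)
lines(grid$wt, grid$mpg_hat)
```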
Try with ggplot:
library(ggplot2)
ggplot(listOfDataFrames1, aes(x=betaexit, y=avgTime)) + geom_point()+stat_smooth(se=FALSE)
Using mtcars data:
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()+stat_smooth(se=F, method='lm', formula=y~poly(x,3))
Try:
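# Sort by betaexit first so that lines() draws one smooth curve
ord <- listOfDataFrames1[order(listOfDataFrames1$betaexit), ]
with(ord, plot(betaexit, avgTime))
with(ord, lines(betaexit, 199.41-3982.03*betaexit+32630.86*betaexit^2-93042.90*betaexit^3))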