How does lm() know which predictors are categorical?

Normally, you and I (assuming you're not a bot) can easily tell whether a predictor is categorical or quantitative. For example, gender is obviously categorical, and your last vote can be classified categorically.
Basically, we can identify categorical predictors easily. But what happens when we feed some data into R and its lm() function builds dummy variables for a predictor? How does it do that?
Somewhat related question on StackOverflow.

Look up R's factor() function. Here is a small demo: the first model uses the number of cylinders as a numerical variable, the second uses it as a categorical variable.
> summary(lm(mpg~cyl,mtcars))
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
> summary(lm(mpg~factor(cyl),mtcars))
Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09

Hxd1011 addressed the more difficult case, where a categorical variable is stored as a number and R therefore treats it as numerical by default; if this is not the desired behaviour, we must use the factor() function.
Your example with the predictor ShelveLoc in the Carseats dataset is easier because it is stored as text, and therefore it can only be a categorical variable.
> head(Carseats$ShelveLoc)
[1] Bad Good Medium Medium Bad Bad
Levels: Bad Good Medium

R decides this from the feature's type. You can check it with str(dataset). If a feature is of type factor, lm() will create dummy variables for it.
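As a quick illustration of that point (a minimal sketch using the built-in mtcars data, in the spirit of the demo above), you can inspect the column types with str() and look at the design matrix lm() would build; the factor column gets expanded into dummy columns while the numeric column is left as a single column:
# Check column types: cyl is numeric unless we convert it
str(mtcars[, c("mpg", "cyl")])
# Design matrix for a numeric predictor: a single cyl column
head(model.matrix(mpg ~ cyl, data = mtcars))
# After factor(), cyl is expanded into dummy columns (cyl == 4 is the baseline)
head(model.matrix(mpg ~ factor(cyl), data = mtcars))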

Related

How does R handle NA values vs deleted values in regressions?

Say I have a table, I remove all the inapplicable values, and I run a regression. If I ran the exact same regression on the same table, but this time, instead of removing the inapplicable values, I turned them into NA values, would the regression still give me the same coefficients?
The regression omits any NA values prior to doing the analysis (i.e. it deletes any row that contains an NA in any of the predictor variables or in the outcome variable). You can check this by comparing the degrees of freedom and other statistics for both models.
Here's a toy example:
head(mtcars)
# check the data set size (all non-missings)
dim(mtcars) # has 32 rows
# Introduce some missings
set.seed(5)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 5)] <- NA
head(mtcars)
# Create an alternative where all missings are omitted
mtcars_NA_omit <- na.omit(mtcars)
# Check the data set size again
dim(mtcars_NA_omit) # Now only has 27 rows
# Now compare some simple linear regressions
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars))
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit))
Comparing the two summaries, you can see that they are identical, with the one exception that the first model notes that 5 observations have been deleted due to missingness, which is exactly what we did manually in our mtcars_NA_omit example.
# First, original model
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
# Second model where we dropped missings manually
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
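If you'd rather verify this programmatically than eyeball the two summaries, you can compare the fits directly. A small check continuing the toy example above (the fit names fit_na and fit_omit are just illustrative), relying on lm()'s default na.action of na.omit:
# Both fits end up using the same 27 complete rows
fit_na   <- lm(mpg ~ cyl + hp + am + gear, data = mtcars)
fit_omit <- lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)
all.equal(coef(fit_na), coef(fit_omit))  # TRUE: identical coefficients
nobs(fit_na) == nobs(fit_omit)           # TRUE: both use 27 observations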

Test if intercepts in ancova model are significantly different in R

I ran a model explaining the weight of some plant as a function of time and trying to incorporate the treatment effect.
mod <- lm(weight ~ time + treatment, data = df)
The fitted model (plot not shown) has the following summary:
Call:
lm(formula = weight ~ time + treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-21.952 -7.674 0.770 6.851 21.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790 3.2897 -11.423 < 2e-16 ***
time 4.7478 0.2541 18.688 < 2e-16 ***
treatmentB 8.2000 2.4545 3.341 0.00113 **
treatmentC 5.4633 2.4545 2.226 0.02797 *
treatmentD 20.3533 2.4545 8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
ANOVA table
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
time 1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment 3 6661.9 2220.6 24.574 2.328e-12 ***
Residuals 115 10392.0 90.4
I want to test the H0 that intercept1 = intercept2 = intercept3 = intercept4. Is this done simply by interpreting the t-value and p-value for the intercept (I guess not, because that is the baseline, treatment A in this case)? I'm a bit puzzled by this, as not much attention is paid to differences in intercepts in most sources I looked up.
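One common way to frame this (a sketch, not a definitive answer, assuming the data frame is called df as in the summary above) is as a comparison of nested models: since treatment only shifts the intercept in this specification, dropping it from the model imposes exactly the hypothesis that all four intercepts are equal, and the F-test of that comparison tests it:
# Reduced model: one common intercept for all treatments
mod_reduced <- lm(weight ~ time, data = df)
# Full model: a separate intercept per treatment, common slope
mod_full <- lm(weight ~ time + treatment, data = df)
# Nested F-test of H0: all treatment intercepts are equal
anova(mod_reduced, mod_full)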

Display category labels in regression output in R

Using this R linear modelling tutorial, I'm finding the format of the model output is annoyingly different from that provided in the text, and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial the line for Coefficients has "sexmale" instead of "sex1". What setting do I need to activate to achieve this?
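For what it's worth, labels like sex1 typically appear when sum-to-zero (or other non-treatment) contrasts are in effect; R's default treatment contrasts label each dummy with the level name, giving sexmale. A minimal sketch of checking and restoring the default setting, reusing the code above:
# See which contrasts are currently in effect
options("contrasts")
# Restore R's default treatment contrasts, then refit
options(contrasts = c("contr.treatment", "contr.poly"))
xmdl <- lm(pitch ~ sex, my.df)
summary(xmdl)  # the coefficient should now be labelled "sexmale"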

BigO of Algorithm Using Multiple Variable Regression

For more involved algorithms, determining the time complexity (i.e. Big-O) is a pain. My approach has been to time the execution of the algorithm for various parameters n and k, and fit a function (a time function) of n and k.
My data looks something like the below:
n k executionTime
500 1 0.02
500 2 0.03
500 3 0.05
500 ... ...
500 10 0.18
1000 1 0.08
... ... ...
10000 1 9.8
... ... ...
10000 10 74.57
I've been using the lm() function in the stats R package. I don't know how to interpret the output of the multiple regression to determine a final Big-O. This is my main question: how do you translate the output of a multiple variable regression into a final ruling on the best Big-O time complexity rating?
Here's the output of the lm():
Residuals:
Min 1Q Median 3Q Max
-14.943 -5.325 -1.916 3.681 31.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.130e+01 1.591e+00 -13.39 <2e-16 ***
n 4.080e-03 1.953e-04 20.89 <2e-16 ***
k 2.361e+00 1.960e-01 12.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.962 on 197 degrees of freedom
Multiple R-squared: 0.747, Adjusted R-squared: 0.7444
F-statistic: 290.8 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of log(y) ~ log(n) + log(k):
Residuals:
Min 1Q Median 3Q Max
-0.4445 -0.1136 -0.0253 0.1370 0.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.80405 0.13749 -122.22 <2e-16 ***
log(n) 2.02321 0.01609 125.72 <2e-16 ***
log(k) 1.01216 0.01833 55.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1803 on 197 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
F-statistic: 9428 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of the principal components analysis, showing that both n and k contribute to the spread of the multivariate model:
PC1(This is n) PC2 (this is k) PC3 (noise?)
Standard deviation 1.3654 1.0000 0.36840
Proportion of Variance 0.6214 0.3333 0.04524
Cumulative Proportion 0.6214 0.9548 1.00000
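In case it helps frame the question: the log-log fit shown above is the usual way to read off a power-law complexity, since log(t) ≈ c + a·log(n) + b·log(k) implies t ≈ C·n^a·k^b. A minimal sketch (assuming the timing data is in a data frame called timings with the columns n, k, and executionTime from the table above):
# Fit a power-law model: log(t) = c + a*log(n) + b*log(k)
fit <- lm(log(executionTime) ~ log(n) + log(k), data = timings)
coef(fit)
# Exponents near 2 and 1, as in the output above, suggest roughly O(n^2 * k)
round(coef(fit)[c("log(n)", "log(k)")], 2)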

R- How to save the console data into a row/matrix or data frame for future use?

I'm performing multiple regression to find the best model to predict prices. See the output in the R console below.
I'd like to store the first column (Estimate) in a row/matrix or data frame for future use, such as deploying on the web with R Shiny.
(Price = 698.8 + 0.116*Voltage - 70.72*VendorCHICONY - 36.6*VendorDELTA - 66.8*VendorLITEON - 14.86*H)
Can somebody kindly advise? Thanks in advance.
Call:
lm(formula = Price ~ Voltage + Vendor + H, data = PSU2)
Residuals:
Min 1Q Median 3Q Max
-10.9950 -0.6251 0.0000 3.0134 11.0360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.821309 276.240098 2.530 0.0280 *
Voltage 0.116958 0.005126 22.818 1.29e-10 ***
VendorCHICONY -70.721088 9.308563 -7.597 1.06e-05 ***
VendorDELTA -36.639685 5.866688 -6.245 6.30e-05 ***
VendorLITEON -66.796531 6.120925 -10.913 3.07e-07 ***
H -14.869478 6.897259 -2.156 0.0541 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.307 on 11 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.9799
F-statistic: 156.6 on 5 and 11 DF, p-value: 7.766e-10
Use coef on your lm output.
e.g.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
summary(m)
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Residuals:
# Min 1Q Median 3Q Max
# -1.30711 -0.25713 -0.05325 0.19542 1.41253
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
# Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
# Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
# Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.438 on 146 degrees of freedom
# Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
# F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
coef(m)
# (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
# 2.2513932 0.8035609 1.4587431 1.9468166
See also names(m) which shows you some things you can extract, e.g. m$residuals (or equivalently, resid(m)).
And also methods(class='lm') will show you some other functions that work on a lm.
> methods(class='lm')
[1] add1 alias anova case.names coerce confint cooks.distance deviance dfbeta dfbetas drop1 dummy.coef effects extractAIC family
[16] formula hatvalues influence initialize kappa labels logLik model.frame model.matrix nobs plot predict print proj qr
[31] residuals rstandard rstudent show simulate slotsFromS3 summary variable.names vcov
(oddly, 'coef' is not in there? ah well)
Besides, I'd like to know if there is a command to show the "residual percentage"
= (actual value - fitted value) / actual value; currently the residuals() command
only shows the info below, but I need the percentage instead.
residuals(fit3ab)
1 2 3 4 5 6
-5.625491e-01 -5.625491e-01 7.676578e-15 -8.293815e+00 -5.646900e+00 3.443652e+00
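I'm not aware of a built-in extractor for a "residual percentage", but since actual = fitted + residual you can compute it yourself. A minimal sketch using the iris model m from above:
# Residual percentage = (actual - fitted) / actual, per observation
actual <- fitted(m) + resid(m)   # reconstruct the observed response
res_pct <- resid(m) / actual
head(res_pct)
# Equivalently, divide by the original response directly
head(resid(m) / iris$Sepal.Length)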
