Extract root of dummy variable in model fit summary

In the following example, gender is encoded as dummy variables corresponding to the categories.
fit <- lm(mass ~ height + gender, data=dplyr::starwars)
summary(fit)
# Call:
# lm(formula = mass ~ height + gender, data = dplyr::starwars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -41.908 -6.536 -1.585 1.302 55.481
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -46.69901 12.67896 -3.683 0.000557 ***
# height 0.59177 0.06784 8.723 1.1e-11 ***
# genderhermaphrodite 1301.13951 17.37871 74.870 < 2e-16 ***
# gendermale 22.39565 5.82763 3.843 0.000338 ***
# gendernone 68.34530 17.49287 3.907 0.000276 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.57 on 51 degrees of freedom
# (31 observations deleted due to missingness)
# Multiple R-squared: 0.9915, Adjusted R-squared: 0.9909
# F-statistic: 1496 on 4 and 51 DF, p-value: < 2.2e-16
Is there a way to extract the root of the dummy variable name? For example, for gendernone, gendermale and genderhermaphrodite, the root would be gender, corresponding to the original column name in the dplyr::starwars data.

Get the variable names from the formula and check which one matches the input:
input <- c("gendermale", "height")
v <- all.vars(formula(fit))
v[sapply(input, function(x) which(pmatch(v, x) == 1))]
## [1] "gender" "height"
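
As a further sketch beyond the answer above (an assumption, not part of the original answer): the "assign" attribute of the model matrix maps each coefficient column to the index of its term, so the root can be recovered for every coefficient at once.
# Map each coefficient name back to its term label; in "assign",
# 0 denotes the intercept.
mm <- model.matrix(fit)
term_labels <- attr(terms(fit), "term.labels")
setNames(c("(Intercept)", term_labels)[attr(mm, "assign") + 1], colnames(mm))
# genderhermaphrodite, gendermale and gendernone all map to "gender"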

Related

Linear regression on dynamic groups in R

I have a data.table data_dt on which I want to run linear regression, letting the user choose the number of columns in groups G1 and G2 via the variable n_col. The following code works perfectly, but it is slow due to the extra time spent creating matrices. To improve its performance, is there a way to remove Steps 1, 2, and 3 altogether by tweaking the formula passed to lm and still get the same results?
library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create the response (dependent) variable
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of predictor variables
G1 <- as.matrix(data_dt[, .SD, .SDcols = c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of predictor variables
G2 <- as.matrix(data_dt[, .SD, .SDcols = c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)
Results:
summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
G1LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
G1MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
G1ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
G2LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
G2LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
G2LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
This may be easier by just creating the formula with reformulate:
out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
response = 'SPI'), data = data_dt)
Checking:
> summary(out)
Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col +
2 + n_col)], response = "SPI"), data = data_dt)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
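For reference, a minimal sketch of what reformulate builds from character vectors (shortened to three right-hand-side names for illustration):
reformulate(c("LMI", "MPI", "ALT"), response = "SPI")
## SPI ~ LMI + MPI + ALT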

How to run a model with variables from different dataframes multiple times with lapply in R

I have 2 data frames:
#dummy df for examples:
set.seed(1)
df1 <- data.frame(t = 1:16,
                  A = sample(20, 16),
                  B = sample(30, 16),
                  C = sample(30, 16))
df2 <- data.frame(t = 1:16,
                  A = sample(20, 16),
                  B = sample(30, 16),
                  C = sample(30, 16))
I want to do this for every column in both dataframes (except the t column):
model <- lm(df2$A ~ df1$A, data = NULL)
I have tried something like this:
model <- function(yvar, xvar){
lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
}
lapply(names(data), model)
but it obviously doesn't work. What am I doing wrong?
In the end, what I really want is to get the coefficients and other statistics from the models. What is stopping me is how to run a linear model with variables from different data frames multiple times.
The output I want should, I guess, look something like this:
# [[1]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[2]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[3]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
Since df1 and df2 have the same names, you can do this:
model <- function(var) {
  lm(df1[[var]] ~ df2[[var]])
}
result <- lapply(names(df1)[-1], model)
result
#[[1]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.1504 -0.4763
#[[2]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 3.0227 0.6374
#[[3]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.4240 0.2411
To get summary statistics from the models you can use broom::tidy:
purrr::map_df(result, broom::tidy, .id = 'model_num')
# model_num term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 (Intercept) 15.2 3.03 5.00 0.000194
#2 1 df2[[var]] -0.476 0.248 -1.92 0.0754
#3 2 (Intercept) 3.02 4.09 0.739 0.472
#4 2 df2[[var]] 0.637 0.227 2.81 0.0139
#5 3 (Intercept) 15.4 4.40 3.50 0.00351
#6 3 df2[[var]] 0.241 0.272 0.888 0.390
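An optional refinement (this goes beyond the original answer): the terms print as df2[[var]], so if you want readable names in the coefficient tables, a hypothetical variant fits each pair through a small per-variable data frame.
# Fit via a formula so the slope is labelled "x" instead of "df2[[var]]".
model2 <- function(var) {
  d <- data.frame(y = df1[[var]], x = df2[[var]])
  lm(y ~ x, data = d)
}
result2 <- lapply(names(df1)[-1], model2)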

How does R handle NA values vs deleted values with regressions

Say I have a table, I remove all the inapplicable values, and I run a regression. If I ran the exact same regression on the same table, but this time, instead of removing the inapplicable values, I turned them into NA values, would the regression still give me the same coefficients?
The regression would omit any NA values prior to doing the analysis (i.e. deleting any row that contains a missing value, NA, in any of the predictor variables or the outcome variable). You can check this by comparing the degrees of freedom and other statistics for both models.
Here's a toy example:
head(mtcars)
# check the data set size (all non-missings)
dim(mtcars) # has 32 rows
# Introduce some missings
set.seed(5)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 5)] <- NA
head(mtcars)
# Create an alternative where all missings are omitted
mtcars_NA_omit <- na.omit(mtcars)
# Check the data set size again
dim(mtcars_NA_omit) # Now only has 27 rows
# Now compare some simple linear regressions
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars))
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit))
Comparing the two summaries, you can see that they are identical, with the one exception that for the first model there's a note that 5 cases have been dropped due to missingness, which is exactly what we did manually in our mtcars_NA_omit example.
# First, original model
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
# Second model where we dropped missings manually
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
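A quick sanity check (a sketch, assuming the objects above are still in the workspace): nobs() reports the rows actually used, and na.action() lists the rows lm() silently dropped.
fit_na   <- lm(mpg ~ cyl + hp + am + gear, data = mtcars)
fit_omit <- lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)
nobs(fit_na)       # 27
nobs(fit_omit)     # 27
na.action(fit_na)  # indices of the 5 rows dropped due to missingness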

How does lm() know which predictors are categorical?

Normally, you and I (assuming you're not a bot) can easily identify whether a predictor is categorical or quantitative. For example, gender is obviously categorical, and your last vote can be classified categorically.
Basically, we can identify categorical predictors easily. But what happens when we input some data in R and its lm function makes dummy variables for a predictor? How does it do that?
Somewhat related question on StackOverflow.
Search for the R factor function. Here is a small demo: the first model uses the number of cylinders as a numerical variable; the second model uses it as a categorical variable.
> summary(lm(mpg~cyl,mtcars))
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
> summary(lm(mpg~factor(cyl),mtcars))
Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
Hxd1011 addressed the more difficult case, when a categorical variable is stored as a number, so R by default treats it as a numerical value; if this is not the desired behaviour, we must use the factor function.
Your example with predictor ShelveLoc in dataset Carseats is easier because it's a text (character) variable, and therefore it can only be a categorical variable.
> head(Carseats$ShelveLoc)
[1] Bad Good Medium Medium Bad Bad
Levels: Bad Good Medium
R decides this from the feature's type; you can check that with str(dataset). If the feature is of factor type, lm will create dummies for it.
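A minimal sketch of that check using mtcars (not from the original answer): lm() passes numeric columns through unchanged and expands factor (or character) columns into dummy columns via model.matrix().
str(mtcars$cyl)                                # num: treated as numeric
str(factor(mtcars$cyl))                        # Factor w/ 3 levels: categorical
head(model.matrix(mpg ~ factor(cyl), mtcars))  # dummy columns factor(cyl)6, factor(cyl)8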

R - How to save the console data into a row/matrix or data frame for future use?

I'm performing multiple regression to find the best model to predict prices. See the following output in the R console.
I'd like to store the first column (Estimate) in a row/matrix or data frame for future use, such as deploying on the web with R Shiny.
(Price = 698.8 + 0.116*Voltage - 70.72*VendorCHICONY - 36.6*VendorDELTA - 66.8*VendorLITEON - 14.86*H)
Can somebody kindly advise? Thanks in advance.
Call:
lm(formula = Price ~ Voltage + Vendor + H, data = PSU2)
Residuals:
Min 1Q Median 3Q Max
-10.9950 -0.6251 0.0000 3.0134 11.0360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.821309 276.240098 2.530 0.0280 *
Voltage 0.116958 0.005126 22.818 1.29e-10 ***
VendorCHICONY -70.721088 9.308563 -7.597 1.06e-05 ***
VendorDELTA -36.639685 5.866688 -6.245 6.30e-05 ***
VendorLITEON -66.796531 6.120925 -10.913 3.07e-07 ***
H -14.869478 6.897259 -2.156 0.0541 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.307 on 11 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.9799
F-statistic: 156.6 on 5 and 11 DF, p-value: 7.766e-10
Use coef on your lm output.
e.g.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
summary(m)
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Residuals:
# Min 1Q Median 3Q Max
# -1.30711 -0.25713 -0.05325 0.19542 1.41253
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
# Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
# Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
# Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.438 on 146 degrees of freedom
# Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
# F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
coef(m)
# (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
# 2.2513932 0.8035609 1.4587431 1.9468166
See also names(m) which shows you some things you can extract, e.g. m$residuals (or equivalently, resid(m)).
And also methods(class='lm') will show you some other functions that work on a lm.
> methods(class='lm')
[1] add1 alias anova case.names coerce confint cooks.distance deviance dfbeta dfbetas drop1 dummy.coef effects extractAIC family
[16] formula hatvalues influence initialize kappa labels logLik model.frame model.matrix nobs plot predict print proj qr
[31] residuals rstandard rstudent show simulate slotsFromS3 summary variable.names vcov
(oddly, 'coef' is not in there? ah well)
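To come back to the storage question, one sketch (with a hypothetical file name) that keeps the Estimate column for later use, e.g. in a Shiny app:
# Store the estimates as a one-row data frame and persist them to disk.
est_df <- as.data.frame(t(coef(m)))
saveRDS(est_df, "coefs.rds")  # hypothetical path; readRDS("coefs.rds") restores it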
Besides, I'd like to know if there is a command to show the "residual percentage"
= (actual value - fitted value)/actual value; currently the residuals() command
can only show the info below, but I need the percentage instead.
residuals(fit3ab)
1 2 3 4 5 6
-5.625491e-01 -5.625491e-01 7.676578e-15 -8.293815e+00 -5.646900e+00 3.443652e+00
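As far as I know there is no built-in command for that ratio, but since actual = fitted + residual, it is a one-liner (a sketch, assuming fit3ab is the model fitted above):
# Residual percentage = (actual - fitted) / actual
actual <- fitted(fit3ab) + residuals(fit3ab)
residuals(fit3ab) / actual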
