R equivalent of Stata * in regression

I am looking for an equivalent in R of Stata's * function that I can use when running regressions.
For example, if I have a dataframe like the following:
outcome var1 var2 var3 new
3 2 3 4 3
2 3 2 4 2
4 3 2 1 4
I would like to be able to select all variable names that begin with "var" without typing each one out separately in order to run the following regression more efficiently:
lm(outcome ~ var1 + var2 + var3 + new, data = df)
This question explains how I can select the necessary columns. How can I cleanly incorporate these into a regression?

One technique is to subset the data to the required columns, and then to use the . operator for the formula object to represent the independent variables in lm(). The . operator is interpreted as "all columns not otherwise in the formula".
data <- as.data.frame(matrix(runif(1000),nrow = 100)*100)
colnames(data) <- c("outcome", "x1","x2","x3","x4", "x5","x6", "x7", "var8", "var9")
# select outcome plus columns whose names begin with "var"
desiredCols <- grepl("^var", colnames(data)) | colnames(data) == "outcome"
# use desiredCols to subset data frame argument in lm()
summary(lm(outcome ~ .,data = data[desiredCols]))
...and the output:
> summary(lm(outcome ~ .,data = data[desiredCols]))
Call:
lm(formula = outcome ~ ., data = data[desiredCols])
Residuals:
Min 1Q Median 3Q Max
-57.902 -25.359 2.296 26.213 52.871
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.712722 7.334937 8.005 2.62e-12 ***
var8 0.008617 0.101298 0.085 0.932
var9 -0.154073 0.103438 -1.490 0.140
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29.86 on 97 degrees of freedom
Multiple R-squared: 0.02249, Adjusted R-squared: 0.002331
F-statistic: 1.116 on 2 and 97 DF, p-value: 0.3319
>
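Another option, sketched here on a toy copy of the data frame from the question, is to build the formula from the matched column names with base R's reformulate(). The names predictors and f are illustrative, not part of the original answer.

```r
# Toy data copied from the example table in the question
df <- data.frame(outcome = c(3, 2, 4),
                 var1 = c(2, 3, 3),
                 var2 = c(3, 2, 2),
                 var3 = c(4, 4, 1),
                 new  = c(3, 2, 4))

# Column names that begin with "var"
predictors <- grep("^var", names(df), value = TRUE)

# Build outcome ~ var1 + var2 + var3 without typing each name
f <- reformulate(predictors, response = "outcome")
print(f)

# The formula can be passed to lm() as usual; three rows are far too
# few for meaningful estimates, so this only shows the mechanics
fit <- lm(f, data = df)
```

This keeps the original data frame intact, so other columns (such as new) can still be added to the formula by hand if needed.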

Related

Adjusted mean with emmeans at 3 expressions

I know how to get the adjusted means with emmeans when the variable has 2 levels, such as sex.
sex == 1 : men, sex == 2 : women --> 2 levels.
The associated model with the subsequent adjusted mean (EMM) calculation is:
mean_MF <- lm(LZ~age + SES_3 + sex, data = MF)
summary(mean_MF)
emmeans(mean_MF, ~ sex)
and the output looks like this:
> emmeans(mean_MF, ~ sex)
sex emmean SE df lower.CL upper.CL
1 7.05 0.0193 20894 7.02 7.09
2 6.96 0.0187 20894 6.93 7.00
Results are averaged over the levels of: belastet_SZ, belastet_SNZ, guteSeiten_SZ, guteSeiten_SNZ, SES_3
Confidence level used: 0.95
But if I want to calculate the adjusted means of a variable with 3 levels, I only get a single adjusted mean at one common value, instead of one for each of the 3 levels.
For example, age (Alter) has 3 categories, coded as follows:
18-30 years: 1
31-40 years: 2
41-51 years: 3
What else do I need to add to the emmeans function so that I get the adjusted means of all three levels?
F_Alter <- lm(LZ~ SES_3 + Alter, data = Frauen)
summary(F_Alter)
emmeans(F_Alter, ~ Alter)
The summary of (F_Alter) looks as follows:
> summary(F_Alter)
Call:
lm(formula = LZ ~ SES_3 + Alterfactor, data = Frauen)
Residuals:
Min 1Q Median 3Q Max
-7.2303 -1.1162 0.1951 1.1220 3.8838
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.44956 0.05653 131.777 < 2e-16 ***
SES_3mittel -0.42539 0.04076 -10.437 < 2e-16 ***
SES_3niedrig -1.11411 0.05115 -21.781 < 2e-16 ***
Alterfactor -0.07309 0.02080 -3.513 0.000444 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.889 on 14481 degrees of freedom
(5769 observations deleted due to missingness)
Multiple R-squared: 0.03287, Adjusted R-squared: 0.03267
F-statistic: 164 on 3 and 14481 DF, p-value: < 2.2e-16
In the following output I only get a single value of 1.93 instead of my 3 levels and their respective EMMs.
emmeans(F_Alter, ~ Alter)
Alter emmean SE df lower.CL upper.CL
1.93 6.8 0.0179 14481 6.76 6.83
Results are averaged over the levels of: SES_3
Confidence level used: 0.95
What can I change in the emmeans formula to get the output for my 3 age levels (1, 2, 3)?
The predictor Alter in the original question was not coded as a factor, and so it was being treated as a continuous numeric variable in the model estimation and by emmeans.
The problem is fixed by creating a new factor variable,
Frauen$Alterfactor <- as.factor(Frauen$Alter)
and then using this new variable as the predictor in the model.
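The difference between the two codings can be seen in base R alone. The following is a minimal sketch on simulated data: the data frame below is a hypothetical stand-in for the question's Frauen data, and the values of LZ are made up.

```r
set.seed(1)
# Hypothetical stand-in for the question's data: Alter coded 1, 2, 3
Frauen <- data.frame(Alter = rep(1:3, times = 20))
Frauen$LZ <- 7 - 0.1 * Frauen$Alter + rnorm(60)

# Numeric predictor: the model estimates a single slope for Alter
fit_num <- lm(LZ ~ Alter, data = Frauen)

# Factor predictor: one coefficient per non-reference level
Frauen$Alterfactor <- as.factor(Frauen$Alter)
fit_fac <- lm(LZ ~ Alterfactor, data = Frauen)

names(coef(fit_num))
names(coef(fit_fac))
```

With the factor version of the model, a call such as emmeans(fit_fac, ~ Alterfactor) should report one row per age level rather than a single row at the mean of Alter.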

How is Pr(>|t|) in a linear regression in R calculated?

What formula is used to calculate the value of Pr(>|t|) that is output when linear regression is performed by R?
I understand that the value of Pr(>|t|) is a p-value, but I do not understand how the value is calculated.
For example, the value of Pr(>|t|) for x1 is displayed as 0.021 in the output below, and I want to know how this value was calculated:
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y <- c(100,120,150,180,210,220,250,280,310,330)
summary(lm(y ~ x1+x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6 -2 0 2 6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0000 3.4226 21.621 1.14e-07 ***
x1 1.8000 0.6071 2.965 0.021 *
x2 0.4000 0.3071 1.303 0.234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.781 on 7 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.9963
F-statistic: 1209 on 2 and 7 DF, p-value: 1.291e-09
Basically, the values in the column t-value are obtained by dividing the coefficient estimate (which is in the Estimate column) by the standard error.
For example in your case in the second row we get that:
tval = 1.8000 / 0.6071 = 2.965
The column you are interested in is the p-value. It is the probability that the absolute value of a t-distributed random variable is greater than 2.965. Using the symmetry of the t-distribution, this probability is:
2 * pt(abs(tval), rdf, lower.tail = FALSE)
Here rdf denotes the residual degrees of freedom, which in our case is equal to 7:
rdf = number of observations minus total number of coefficients = 10 - 3 = 7
And a simple check shows that this is indeed what R does:
2 * pt(2.965, 7, lower.tail = FALSE)
[1] 0.02095584
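The same calculation can be applied to every coefficient at once and compared with what summary() reports. This is a sketch using the data from the question; the names est, se, tval and pval are illustrative.

```r
x1 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
x2 <- c(20, 30, 60, 70, 100, 110, 140, 150, 180, 190)
y  <- c(100, 120, 150, 180, 210, 220, 250, 280, 310, 330)
fit <- lm(y ~ x1 + x2)

est  <- coef(fit)               # Estimate column
se   <- sqrt(diag(vcov(fit)))   # Std. Error column
tval <- est / se                # t value column
rdf  <- df.residual(fit)        # residual degrees of freedom

# Two-sided p-value from the t-distribution
pval <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)

# Compare with the column reported by summary()
all.equal(pval, coef(summary(fit))[, "Pr(>|t|)"])
```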

Extract root of dummy variable in model fit summary

In the following example, gender is encoded as dummy variables corresponding to the categories.
fit <- lm(mass ~ height + gender, data=dplyr::starwars)
summary(fit)
# Call:
# lm(formula = mass ~ height + gender, data = dplyr::starwars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -41.908 -6.536 -1.585 1.302 55.481
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -46.69901 12.67896 -3.683 0.000557 ***
# height 0.59177 0.06784 8.723 1.1e-11 ***
# genderhermaphrodite 1301.13951 17.37871 74.870 < 2e-16 ***
# gendermale 22.39565 5.82763 3.843 0.000338 ***
# gendernone 68.34530 17.49287 3.907 0.000276 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.57 on 51 degrees of freedom
# (31 observations deleted due to missingness)
# Multiple R-squared: 0.9915, Adjusted R-squared: 0.9909
# F-statistic: 1496 on 4 and 51 DF, p-value: < 2.2e-16
Is there a way to extract the root of the dummy variable name? For example, for gendernone, gendermale and genderhermaphrodite, the root would be gender, corresponding to the original column name in the dplyr::starwars data.
Get the variable names from the formula and check which one matches the input:
input <- c("gendermale", "height")
v <- all.vars(formula(fit))
v[sapply(input, function(x) which(pmatch(v, x) == 1))]
## [1] "gender" "height"
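A different route, which does not rely on partial string matching, is to use the "assign" attribute of the model matrix, which records the term each coefficient column belongs to. The sketch below uses the built-in iris data rather than dplyr::starwars so that it runs without extra packages; the name roots is illustrative.

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

mm     <- model.matrix(fit)
assign <- attr(mm, "assign")                  # 0 = intercept, 1 = first term, ...
labels <- attr(terms(fit), "term.labels")     # "Sepal.Width", "Species"

# Map every coefficient name to the term it came from
roots <- c("(Intercept)", labels)[assign + 1]
names(roots) <- colnames(mm)
roots
```

This also handles cases where one variable name is a prefix of another, which can trip up matching on the coefficient names alone.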

Dummy Variable in R [closed]

Ciao Everyone,
I would like to create a dummy variable in R. So I have a list of Italian regions, and a variable called mafia. The mafia variable is coded 1 in the regions with high levels of mafia infiltration and 0 in the regions with lower levels of mafia penetration.
Now, I would like to create a dummy that considers only the regions with high levels of mafia. (=1)
If I understand your question correctly, the typical way of adding dummy variables (also called fixed effects) is to use the function factor. Here is an example that creates random data and then uses factor in a linear regression:
set.seed(1)
library(data.table)
# rep_len() repeats the three region labels to the full 100 rows
A = data.table(region = rep_len(LETTERS[1:3], 100), y = runif(100), x = runif(100), mafia = sample(c(0,1), 100, replace = TRUE))
> head(A)
region var mafia
1: A 0.67371223 1
2: B 0.09485786 0
3: C 0.49259612 1
4: A 0.46155184 1
5: B 0.37521653 1
6: C 0.99109922 1
formula = y ~ x + factor(mafia)
reg <- lm(formula, data = A)
> summary(reg)
Call:
lm(formula = formula, data = A)
Residuals:
Min 1Q Median 3Q Max
-0.46965 -0.24828 -0.03362 0.28780 0.51183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.46196 0.07093 6.513 3.28e-09 ***
x 0.06735 0.10521 0.640 0.524
factor(mafia)1 -0.01830 0.06415 -0.285 0.776
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3189 on 97 degrees of freedom
Multiple R-squared: 0.005498, Adjusted R-squared: -0.01501
F-statistic: 0.2681 on 2 and 97 DF, p-value: 0.7654
If you wish to only do a regression on the observations that are coded with 1 in the "mafia" column, this is much easier:
# Note that A is a data.table
A.mafia = A[ mafia == 1 ]
formula = y ~ x
reg <- lm(formula, data = A.mafia)
summary(reg)
Output:
Call:
lm(formula = formula, data = A.mafia)
Residuals:
Min 1Q Median 3Q Max
-0.47163 -0.26063 -0.05724 0.30166 0.52062
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.43334 0.07926 5.467 1.53e-06 ***
x 0.09017 0.14474 0.623 0.536
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3197 on 49 degrees of freedom
Multiple R-squared: 0.007857, Adjusted R-squared: -0.01239
F-statistic: 0.388 on 1 and 49 DF, p-value: 0.5362
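To see which dummy columns factor() actually generates inside a regression, the design matrix can be inspected with model.matrix(). This is a small sketch on made-up data; the data frame d and its values are hypothetical.

```r
# Hypothetical data: three regions, nine rows
d <- data.frame(region = rep_len(c("A", "B", "C"), 9))

# One 0/1 column per non-reference level, plus the intercept
mm <- model.matrix(~ factor(region), data = d)
colnames(mm)
head(mm, 3)
```

The first level ("A" here) becomes the reference category and is absorbed into the intercept, which is why it has no column of its own.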

Multi Collinearity for Categorical Variables

For Numerical/Continuous data, to detect Collinearity between predictor variables we use the Pearson's Correlation Coefficient and make sure that predictors are not correlated among themselves but are correlated with the response variable.
But how can we detect multicollinearity if we have a dataset where the predictors are all categorical? I am sharing one dataset where I am trying to find out whether the predictor variables are correlated or not:
> A(Response Variable) B C D
> Yes Yes Yes Yes
> No Yes Yes Yes
> Yes No No No
How to do the same?
Collinearity can be, but is not always, a property of just a pair of variables, and this is especially true when dealing with categorical variables. So although a high correlation coefficient would be sufficient to establish that collinearity might be a problem, a bunch of pairwise low to medium correlations is not a sufficient test for lack of collinearity. The usual method for continuous, mixed, or categorical collections of variables is to look at the variance inflation factors (which, as the code below shows, are the diagonal elements of the inverse of the correlation matrix of the coefficient estimates). At any rate, this is the code for the vif function in package:rms:
vif <-
function (fit)
{
    v <- vcov(fit, regcoef.only = TRUE)
    nam <- dimnames(v)[[1]]
    ns <- num.intercepts(fit)
    if (ns > 0) {
        v <- v[-(1:ns), -(1:ns), drop = FALSE]
        nam <- nam[-(1:ns)]
    }
    d <- diag(v)^0.5
    v <- diag(solve(v/(d %o% d)))
    names(v) <- nam
    v
}
The reason that categorical variables have a greater tendency to generate collinearity is that the three-way or four-way tabulations often form linear combinations that lead to complete collinearity. Your example case is an extreme case of collinearity but you can also get collinearity with
A B C D
1 1 0 0
1 0 1 0
1 0 0 1
Notice that this is collinear because A == B+C+D in all rows. None of the pairwise correlations would be high, but the system together causes complete collinearity.
After putting your data into an R object and running lm() on it, it becomes apparent that there is another way to determine collinearity with R and that is because lm will drop factor variables from the results when they are "aliased", which is just another term for being completely collinear.
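The aliasing behaviour can be illustrated with a small constructed data set. This is only a sketch: the variables and the response y are made up, and A is defined as B + C + D so that no single pair of predictors is perfectly correlated.

```r
# Full 2 x 2 x 2 layout of three binary predictors (made-up data)
B <- rep(c(0, 1), each = 4)
C <- rep(c(0, 0, 1, 1), times = 2)
D <- rep(c(0, 1), times = 4)
A <- B + C + D                      # exact linear combination
y <- c(1, 3, 2, 5, 4, 6, 8, 7)      # arbitrary response

# Pairwise correlations among the predictors are only moderate
round(cor(cbind(A, B, C, D)), 3)

# lm() cannot estimate all four slopes: one is reported as NA (aliased)
fit <- lm(y ~ A + B + C + D)
coef(fit)
```

The NA coefficient is the signal that the corresponding column is completely determined by the others, even though the correlation matrix alone would not have flagged it.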
Here is an example for @Alex demonstrating highly collinear data and the output of vif in that situation. Generally you hope to see variance inflation factors below 10.
> set.seed(123)
> dat2 <- data.frame(res = rnorm(100), A=sample(1:4, 1000, repl=TRUE)
+ )
> dat2$B<-dat2$A
> head(dat2)
res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3 1.55870831 3 3
4 0.07050839 3 3
5 0.12928774 2 2
6 1.71506499 4 4
> dat2[1,2] <- 2
#change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
Call:
lm(formula = res ~ A + B, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-2.41139 -0.58576 -0.02922 0.60271 2.10760
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10972 0.07053 1.556 0.120
A -0.66270 0.91060 -0.728 0.467
B 0.65520 0.90988 0.720 0.472
Residual standard error: 0.9093 on 997 degrees of freedom
Multiple R-squared: 0.0005982, Adjusted R-squared: -0.001407
F-statistic: 0.2984 on 2 and 997 DF, p-value: 0.7421
> vif ( mod )
A B
1239.335 1239.335
If you make a fourth variable "C" that is independent of the first two predictors (admittedly a bad name for a variable since C is also an R function), you get a more desirable result from vif:
dat2$C <- sample(1:4, 1000, repl=TRUE)
vif ( lm( res ~ A + C, dat2) )
#---------
A C
1.003493 1.003493
Edit: I realized that I had not actually created R-representations of a "categorical variable" despite sampling from 1:4. The same sort of result occurs with factor versions of that "sample":
> dat2 <- data.frame(res = rnorm(100), A=factor( sample(1:4, 1000, repl=TRUE) ) )
> dat2$B<-dat2$A
> head(dat2)
res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3 1.55870831 3 3
4 0.07050839 3 3
5 0.12928774 2 2
6 1.71506499 4 4
> dat2[1,2] <- 2
> #change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
Call:
lm(formula = res ~ A + B, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-2.43375 -0.59278 -0.04761 0.62591 2.12461
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11165 0.05766 1.936 0.0531 .
A2 -0.67213 0.91170 -0.737 0.4612
A3 0.01293 0.08146 0.159 0.8739
A4 -0.04624 0.08196 -0.564 0.5728
B2 0.62320 0.91165 0.684 0.4944
B3 NA NA NA NA
B4 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9099 on 995 degrees of freedom
Multiple R-squared: 0.001426, Adjusted R-squared: -0.002588
F-statistic: 0.3553 on 4 and 995 DF, p-value: 0.8404
Notice that two of the factor levels are omitted from the calculation of coefficients because they are completely collinear with the corresponding A levels. So if you want to see what vif returns for factor variables that are almost collinear, you need to change a few more values:
> dat2[1,2] <- 2
> dat2[2,2] <-2; dat2[3,2]<-2; dat2[4,2]<-4
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
Call:
lm(formula = res ~ A + B, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-2.42819 -0.59241 -0.04483 0.62482 2.12461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11165 0.05768 1.936 0.0532 .
A2 -0.67213 0.91201 -0.737 0.4613
A3 -1.51763 1.17803 -1.288 0.1980
A4 -0.97195 1.17710 -0.826 0.4092
B2 0.62320 0.91196 0.683 0.4945
B3 1.52500 1.17520 1.298 0.1947
B4 0.92448 1.17520 0.787 0.4317
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9102 on 993 degrees of freedom
Multiple R-squared: 0.002753, Adjusted R-squared: -0.003272
F-statistic: 0.4569 on 6 and 993 DF, p-value: 0.8403
#--------------
> library(rms)
> vif(mod)
A2 A3 A4 B2 B3 B4
192.6898 312.4128 308.5177 191.2080 312.5856 307.5242
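For an ordinary linear model with numeric predictors, the variance inflation factor of a predictor can also be written as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below, on simulated data with illustrative names, computes it that way in base R and compares it with the diagonal of the inverse correlation matrix of the predictors; it is not the rms implementation.

```r
set.seed(123)
n <- 200
A <- rnorm(n)
B <- A + rnorm(n, sd = 0.1)   # nearly a copy of A
C <- rnorm(n)                 # unrelated to A and B
X <- cbind(A, B, C)

# VIF from the definition: regress each predictor on the others
vif_r2 <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})

# Same quantity from the inverse of the predictor correlation matrix
vif_inv <- diag(solve(cor(X)))

round(vif_r2, 2)
all.equal(vif_r2, vif_inv)
```

A and B get large values because each is almost fully explained by the other, while C stays close to 1.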
