The ways to construct columns associated with interaction terms in data frame - r

I have a data frame with 6 columns,
dat<-data.frame(x1,x2,x3,x4,x5,x6)
Right now I need to build two extra columns for the interaction terms x1*x2 and x3*x4*x5. How can I do that in R? Are there any special considerations when some of the variables, such as x2, are categorical?

I guess the function model.matrix does exactly what you want.
For instance, you can fit a linear model including the variables and interaction terms you're interested in and then extract the model matrix from that fitted object
model.matrix(lm(drat ~ mpg * cyl + disp * hp * wt, data = mtcars))
Factors need to be explicitly coded as factors. An example:
mtcars$cyl <- factor(mtcars$cyl)
model.matrix(lm(drat ~ mpg * cyl + disp * hp * wt, data = mtcars))
The default contrast coding for factors is treatment coding. You can change it to sum coding (or another coding, see ?contr.sum) with the command below:
contrasts(mtcars$cyl) <- contr.sum
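If the goal is to add the interaction columns to the data frame itself, a model does not have to be fitted first. The following is a minimal sketch, assuming made-up values for x1 to x6 (the question does not show the data) and assuming x2 is a three-level factor. For numeric variables the interaction is just the product; for a categorical variable, model.matrix with a one-sided formula builds one column per level.

```r
# Hypothetical data: the values and the factor levels are assumptions,
# only the column names come from the question.
set.seed(1)
dat <- data.frame(
  x1 = rnorm(6),
  x2 = factor(rep(c("a", "b", "c"), 2)),
  x3 = rnorm(6), x4 = rnorm(6), x5 = rnorm(6), x6 = rnorm(6)
)

# Purely numeric interaction: the elementwise product
dat$x3x4x5 <- with(dat, x3 * x4 * x5)

# Numeric x categorical interaction: one column per level of x2,
# each holding x1 where x2 equals that level and 0 elsewhere
mm <- model.matrix(~ x1:x2 - 1, data = dat)

# Append the interaction columns to the original data frame
dat2 <- cbind(dat, mm)
head(dat2)
```

Because x1 is not in the formula as a main effect, model.matrix codes x2 with full indicator columns here rather than with contrasts, so no level is dropped.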

Related

How to transpose a regression output with modelsummary package?

I JUST found out about this amazing R package, modelsummary.
It doesn't seem like it offers an ability to transpose regression outputs.
I know that you cannot do a transposition within kableExtra, which is my go-to for ordinary table outputs in R. Since modelsummary relies on kableExtra for post-processing, I'm wondering if this is possible. Has anyone else figured it out?
Ideally I'd like to preserve the stars of my regression output.
This is available in STATA (below):
Thanks in advance!
You can flip the order of the terms in the group argument formula. See documentation here and also here for many examples.
library(modelsummary)
mod <- list(
  lm(mpg ~ hp, mtcars),
  lm(mpg ~ hp + drat, mtcars))
modelsummary(mod, group = model ~ term)
           (Intercept)   hp        drat
Model 1    30.099        -0.068
           (1.634)       (0.010)
Model 2    10.790        -0.052    4.698
           (5.078)       (0.009)   (1.192)
The main problem with this strategy is that there is not (yet) an automatic way to append goodness of fit statistics. So you would probably have to rig something up by creating a data.frame and feeding it to the add_columns argument. For example:
N <- sapply(mod, function(x) get_gof(x)$nobs)
N <- data.frame(N = c(N[1], "", N[2], ""))
modelsummary(mod,
             group = model ~ term,
             add_columns = N,
             align = "lcccc")
           (Intercept)   hp        drat      N
Model 1    30.099        -0.068              32
           (1.634)       (0.010)
Model 2    10.790        -0.052    4.698     32
           (5.078)       (0.009)   (1.192)
If you have ideas about the best default behavior for goodness of fit statistics, please file a feature request on Github.

Multiple Linear Regression with character as dependent variable

I'm currently trying to perform a multiple linear regression on the voter turnout per state in the 2020 Presidential Election.
To create this regression model I would like to use the following variables: State, Total_Voters and Population.
When I try to run my linear regression I get the following error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'
The dataset I've gathered is quite large. I have created a new dataframe with the variables which I need as follows:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
After that I replace all NA values with 0:
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
After that I filter through the dataframe once more and filter out all the states which did not report:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
In the end the dataframe looks like this:
With the following summary:
However when I now try to run my multiple linear regression I get the error I have showcased above. The command looks like this:
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
I'm quite new to linear regressions but I'm eager to learn. I have looked through StackOverflow for quite a bit now, and couldn't figure it out.
It would be greatly appreciated if someone here would be able to assist me.
The full script at once:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
# Change all NA to 0
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
summary(Turnout_Rate_2020)
# Select all again and filter out states which did not report. (values that were NA)
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
# Does not work and if I turn variables around I get NaN values.
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
summary(lmTurnoutRate_2020)
# Does not work
ggplot(lmTurnoutRate_2020, aes(x=State_Full,y=Population)) + geom_point() + geom_smooth(method=lm, level=0.95) + labs(x = "State", y = "Voters")
1) The input is missing from the question, so we will use mtcars and make cyl a character column. lm cannot handle a character response, but we can create a 0/1 model matrix from cyl and regress that on the predictors. This fits a separate lm for each level of cyl, so it is only applicable when the dependent variable has a small number of levels, as it does here, either naturally or because it has been cut into a few levels.
(In this case we would probably want logistic regression, as with glm and family = binomial(), ordinal logistic regression, as with polr in MASS or the ordinal package, or multinom in the nnet package. We show it with lm just to show that it can be done, although it probably shouldn't be, because a dependent variable with only two values is not sufficiently gaussian.)
mtcars2 <- transform(mtcars, cyl = as.character(cyl))
lm(model.matrix(~ cyl + 0) ~ hp, mtcars2)
giving:
Call:
lm(formula = model.matrix(~cyl + 0) ~ hp, data = mtcars2)
Coefficients:
             cyl4       cyl6       cyl8
(Intercept)   1.052957   0.390688  -0.443645
hp           -0.004835  -0.001172   0.006007
With polr (which assumes the levels are ordered as they are with cyl):
library(MASS)
polr(cyl ~ hp, transform(mtcars2, cyl = factor(cyl)))
giving:
Call:
polr(formula = cyl ~ hp, data = transform(mtcars2, cyl = factor(cyl)))
Coefficients:
       hp
0.1156849
Intercepts:
     4|6      6|8
12.32592 17.25331
Residual Deviance: 20.35585
AIC: 26.35585
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
2) The other possibility is that your dependent variable just happens to be stored as character because of how it was created, but could be converted to numeric with as.numeric(...). We can't tell without the input. With our example we can do this, although again it is likely inappropriate, because cyl has only 3 values and so does not approximate a gaussian closely enough. Your data may be different, though.
lm(cyl ~ hp, transform(mtcars2, cyl = as.numeric(cyl)))
giving:
Call:
lm(formula = cyl ~ hp, data = transform(mtcars2, cyl = as.numeric(cyl)))
Coefficients:
(Intercept)           hp
    3.00680      0.02168
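The multinom alternative mentioned in the answer is not shown there, so here is a minimal sketch of it, again using mtcars with cyl as the categorical response. The object names mtcars_f, fit and pred are only illustrative, and trace = FALSE is there just to silence the iteration log. Unlike polr, multinom does not assume the levels are ordered.

```r
library(nnet)  # recommended package that ships with standard R installations

# Illustrative copy of mtcars with cyl as an (unordered) factor response
mtcars_f <- transform(mtcars, cyl = factor(cyl))

# Multinomial logistic regression of cyl on hp
fit <- multinom(cyl ~ hp, data = mtcars_f, trace = FALSE)

# One row of coefficients per non-reference level of cyl
coef(fit)

# Predicted class for each car, cross-tabulated against the observed class
pred <- predict(fit, type = "class")
table(observed = mtcars_f$cyl, predicted = pred)
```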

margins.plot: using the 'which' argument to choose which margins to include in plot

I am trying to plot marginal effects in R based on a logistic regression. For example:
data <- mtcars
mod <- glm(am ~ cyl + hp + wt + mpg, family = binomial, data = data)
library(margins)
marg <- margins(mod, atmeans = TRUE)
summary(marg)
I can run the margins plot command:
plot(marg)
which plots marginal effects and confidence intervals for all of the IVs. I only want the plot to include cyl and hp, my explanatory variables of interest. According to the R documentation, this can be accomplished with the 'which' argument, which takes a character vector. However, the documentation doesn't say how to use this argument. Does anyone know how to use the 'which' argument to ask margins.plot to plot only selected marginal effects? Unfortunately, the margins plot help page, linked above, does not have any examples.
Before plotting, we can specify the variables of interest with the variables option of the margins() function.
mod <- glm(am ~ cyl + hp + wt + mpg, family=binomial, data=mtcars)
library(margins)
marg <- margins(mod, variables=c("cyl", "hp"))
plot(marg)
Gives:

r- MICE package getting standardized betas from lm

I have a dataset that is missing values. I imputed using the mice package and ran my linear model using lm and pool for the results. I only get unstandardized beta weights. Is there a way to get standardized beta weights?
There are two ways to do this that I know of (there may be more):
1) First method:
You need to scale your data first. Assuming you have already imputed your data, you can proceed as follows:
A toy example:
mtcars1 <- mtcars[,c("mpg", "disp", "hp", "wt", "qsec", "drat")]
mtcars_scaled <- data.frame(sapply(mtcars1, scale), stringsAsFactors=F) ##scaling for standardization,
model_fit_st <- lm(mpg ~ disp + wt + drat, data=mtcars_scaled)
Here model_fit_st is your standardized result. It does, however, still include an intercept (which is a bit odd; lm fits an intercept by default), but if you compare the slope coefficients with those from the QuantPsyc::lm.beta function, the values will match.
2) Second Method:
Once you have installed the QuantPsyc package, QuantPsyc::lm.beta can be used to generate standardized betas, as below.
QuantPsyc::lm.beta(lm(mpg ~ disp + wt + drat, data=mtcars))
Of course, apart from the intercept (an intercept makes no sense for standardized betas), the two results (via scaling and via QuantPsyc) match here.
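The agreement between the two methods can be checked without any extra package. The sketch below is a sanity check under the same toy setup (mtcars, not imputed data): it rescales each unstandardized slope by sd(x) / sd(y), the usual formula for standardized betas, and compares the result with the slopes from the model fitted on scaled data. The names fit_raw, fit_std and beta_manual are only illustrative.

```r
vars <- c("mpg", "disp", "wt", "drat")

# Unstandardized fit on the raw data
fit_raw <- lm(mpg ~ disp + wt + drat, data = mtcars)

# Fit on z-scored data (scale() centers and divides by the standard deviation)
fit_std <- lm(mpg ~ disp + wt + drat,
              data = as.data.frame(scale(mtcars[vars])))

# Standardized beta = raw slope * sd(predictor) / sd(response)
b <- coef(fit_raw)[-1]                       # drop the intercept
beta_manual <- b * sapply(mtcars[names(b)], sd) / sd(mtcars$mpg)

# The two sets of slopes agree up to numerical tolerance
all.equal(beta_manual, coef(fit_std)[-1])
```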

Is there a way that I can put into a barplot the significant variables from regression?

Take mtcars for example:
> reg <- lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
> sigvar <- data.frame(summary(reg)$coef[summary(reg)$coef[,4] <= .05, 4]) #extracts significant variables with p-values
> rownames <- rownames(sigvar) #extracts the variables only
I hope to put the rownames on the x-axis of a barplot and the height of the barplot would be the average of the said rownames. Thanks.
I'm unclear what you want the heights of the bars to be in the barplot. However:
reg = lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
## extracts significant variables with p-values
## note the -1 means you skip the intercept
sigvar = summary(reg)$coef[-1,4] <= .05
If I understand correctly, you want a barplot in which the height of each bar is the average value of a significant variable. For that you need to match the significant variables to the column names of the data frame
i = match(names(sigvar)[sigvar], names(mtcars))
i now contains the column indices of the original data frame that correspond to the significant variables. Unfortunately, for the mtcars data only one variable is significant, so mtcars[,i] selects a single column. Normally I would do something like
barplot(sapply(mtcars[,i], mean))
but that doesn't do the right thing here because mtcars[,i] returns a vector when only one column is selected. Let's assume for the sake of argument that i = c(5,6); then this will work
i = 5:6
barplot(sapply(mtcars[,i], mean))
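One way to make the same idea work even when only a single variable is significant is to keep the selection as a data frame with drop = FALSE. This is a minimal sketch of that variation, not part of the original answer; the name means is only illustrative, and it assumes at least one variable is significant.

```r
reg <- lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)

# TRUE/FALSE per predictor (the -1 skips the intercept)
sigvar <- summary(reg)$coef[-1, 4] <= .05

# Column positions of the significant predictors in mtcars
i <- match(names(sigvar)[sigvar], names(mtcars))

# drop = FALSE keeps a data frame even for a single column,
# so colMeans returns a named vector in every case
means <- colMeans(mtcars[, i, drop = FALSE])

# One bar per significant variable, labelled with the variable name
barplot(means)
```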
