I have run a binomial logistic regression model in R using the lme4 package. Now, I want to plot the estimated marginal means for the model, so I have installed the sjPlot package and I have used the plot_model() function.
My x axis shows three levels corresponding to three different groups: "L1", "HS", and "L2". I want the three levels in exactly that order. However, when I plot the model, I get "HS" before "L1", because the labels appear in alphabetical order. I would like to swap those two labels. I know how to do that in a data frame, but not when plotting a model with that function. Any ideas on how to reorder my x axis using the sjPlot package?
You can change the order of the coefficients using the order.terms argument. Note that the numbers for this argument correspond to the positions of the terms in the model summary, not counting the intercept. Example:
library(sjPlot)
library(sjlabelled)
data(efc)
efc <- as_factor(efc, c161sex, e42dep, c172code)
m <- lm(neg_c_7 ~ pos_v_4 + c12hour + e42dep + c172code, data = efc)
plot_model(m, auto.label = FALSE)
summary(m)
#>
#> Call:
#> lm(formula = neg_c_7 ~ pos_v_4 + c12hour + e42dep + c172code,
#> data = efc)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -6.5411 -2.0797 -0.5183 1.3256 19.1412
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 17.65938 0.82864 21.311 < 2e-16 ***
#> pos_v_4 -0.66552 0.05163 -12.890 < 2e-16 ***
#> c12hour 0.01134 0.00270 4.201 2.95e-05 ***
#> e42dep2 0.84189 0.47605 1.768 0.077355 .
#> e42dep3 1.73616 0.47118 3.685 0.000244 ***
#> e42dep4 3.10107 0.50470 6.144 1.26e-09 ***
#> c172code2 0.12894 0.28832 0.447 0.654844
#> c172code3 0.69876 0.36649 1.907 0.056922 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.27 on 810 degrees of freedom
#> (90 observations deleted due to missingness)
#> Multiple R-squared: 0.2981, Adjusted R-squared: 0.292
#> F-statistic: 49.15 on 7 and 810 DF, p-value: < 2.2e-16
# according to summary, order of coefficients:
# 1=pos_v_4, 2=c12hour, 3=e42dep2, 4=e42dep3, ...
plot_model(m, auto.label = FALSE, order.terms = c(1, 2, 4, 5, 3, 6, 7))
Created on 2019-05-08 by the reprex package (v0.2.1)
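If the goal is the order of the group levels themselves (L1, HS, L2) rather than the order of coefficients, another option is to set the factor levels before fitting, since R orders character levels alphabetically by default. The following is only a minimal sketch with simulated data: the data frame d and the columns group and correct are made up for illustration, and a plain glm() stands in for the lme4 model.

```r
# Simulated stand-in data; 'group' and 'correct' are hypothetical names
set.seed(42)
d <- data.frame(
  group   = sample(c("L1", "HS", "L2"), 200, replace = TRUE),
  correct = rbinom(200, 1, 0.6)
)

# Set the level order explicitly instead of relying on alphabetical order
d$group <- factor(d$group, levels = c("L1", "HS", "L2"))

# Fit after releveling; L1 is now the reference level
m <- glm(correct ~ group, family = binomial, data = d)

levels(d$group)
names(coef(m))
```

Plots built from a model fitted this way should then follow the level order of the factor, although that depends on the plotting function used.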
I tried to perform a multiple linear regression analysis with code like this one but with no success. I tried to do it with lm() function. I think there is a problem with the 'x1*x2'.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1 + x2 + x1*x2)
summary(model)
plot(model)
It shows me an error.
What should I do?
The error was not caused by your interaction term; when I tested it, that part worked perfectly. You forgot to specify the data. lm() needs the data argument to find x1, x2 and y, because they are columns of your data frame rather than objects in your workspace. In the code below I also shortened the formula, because x1*x2 is already sufficient: R expands it to both main effects plus the interaction, so you don't have to repeat the variable names.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1 * x2, data = data)
summary(model)
#>
#> Call:
#> lm(formula = y ~ x1 * x2, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.21772 -0.77564 0.06347 0.56901 2.15324
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.05853 0.09914 -0.590 0.5564
#> x1 0.17384 0.09466 1.836 0.0694 .
#> x2 -0.02830 0.08646 -0.327 0.7442
#> x1:x2 -0.00836 0.07846 -0.107 0.9154
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9792 on 96 degrees of freedom
#> Multiple R-squared: 0.03423, Adjusted R-squared: 0.004055
#> F-statistic: 1.134 on 3 and 96 DF, p-value: 0.3392
Created on 2023-01-14 with reprex v2.0.2
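To check that the shorter formula really is equivalent, one can compare the terms R builds from each version. This is a small sketch on simulated data; the names dat, m_long and m_short are only illustrative.

```r
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))

# x1 * x2 expands to the two main effects plus their interaction
term_labels <- attr(terms(y ~ x1 * x2), "term.labels")
term_labels

# The long and the short formula produce the same fitted coefficients
m_long  <- lm(y ~ x1 + x2 + x1 * x2, data = dat)
m_short <- lm(y ~ x1 * x2, data = dat)
all.equal(coef(m_long), coef(m_short))
```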
Below you can see my code and the output R gives. My question is: how can I get R to print the arguments of the function as separate terms in the summary of glm()? That is, the intercept, gender_m0, age_centered and gender_m0 * age_centered, instead of the intercept and a single "parameters" term. I hope someone can help me with my little problem. Thank you.
test_reg <- function(parameters){
glm_model2 <- glm(healing ~ parameters, family = "binomial", data = psa_data)
summary(glm_model2)}
test_reg(psa_data$gender_m0 * age_centered)
Call:
glm(formula = healing ~ parameters, family = "binomial", data = psa_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2323 0.4486 0.4486 0.4486 0.6800
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.24590 0.13844 16.223 <2e-16 ***
parameters -0.02505 0.01369 -1.829 0.0674 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 426.99 on 649 degrees of freedom
Residual deviance: 423.79 on 648 degrees of freedom
(78 observations deleted due to missingness)
AIC: 427.79
Number of Fisher Scoring iterations: 5
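The terms inside formulas are never substituted but taken literally, so glm looks for a variable called "parameters". There is no such column in your data frame, so it falls back on the function argument, which by then has been evaluated to a single numeric vector (the product of the two columns). That is why the summary shows one coefficient named "parameters". You will need to capture the parameters from your call, deparse them and construct the formula if you want to call your function this way: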
test_reg <- function(parameters) {
f <- as.formula(paste0("healing ~ ", deparse(match.call()$parameters)))
mod <- glm(f, family = binomial, data = psa_data)
mod$call$formula <- f
summary(mod)
}
Obviously, I don't have your data, but if I create a little sample data frame with the same names, we can see this works as expected:
set.seed(1)
psa_data <- data.frame(healing = rbinom(20, 1, 0.5),
age_centred = sample(21:40),
gender_m0 = rbinom(20, 1, 0.5))
test_reg(age_centred * gender_m0)
#>
#> Call:
#> glm(formula = healing ~ age_centred * gender_m0, family = binomial,
#> data = psa_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.416 -1.281 0.963 1.046 1.379
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.05873 2.99206 0.354 0.723
#> age_centred -0.02443 0.09901 -0.247 0.805
#> gender_m0 -3.51341 5.49542 -0.639 0.523
#> age_centred:gender_m0 0.10107 0.17303 0.584 0.559
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 27.526 on 19 degrees of freedom
#> Residual deviance: 27.027 on 16 degrees of freedom
#> AIC: 35.027
#>
#> Number of Fisher Scoring iterations: 4
Created on 2022-06-29 by the reprex package (v2.0.1)
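If non-standard evaluation feels too fragile, a different design is to pass the right-hand side as a character string and build the formula with reformulate(). This is only a sketch of that alternative, not the approach shown above: the function name test_reg_chr and its data argument are made up, and the sample data frame is the same simulated one used earlier.

```r
# Hypothetical variant: the right-hand side is passed as a string
test_reg_chr <- function(rhs, data) {
  f <- reformulate(rhs, response = "healing")  # builds healing ~ <rhs>
  mod <- glm(f, family = binomial, data = data)
  mod$call$formula <- f                        # so the printed call shows the real formula
  mod
}

set.seed(1)
psa_data <- data.frame(healing     = rbinom(20, 1, 0.5),
                       age_centred = sample(21:40),
                       gender_m0   = rbinom(20, 1, 0.5))

fit <- test_reg_chr("age_centred * gender_m0", psa_data)
names(coef(fit))
```

The trade-off is that the caller has to quote the terms, but nothing depends on how the function was called.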
I want to perform the following task using fastfood dataset from openintro package in R.
a) Create a regression predicting whether or not a restaurant is McDonalds or Subway
based on calories, sodium, and protein. (McDonalds should be 1, Subway 0).
Save the coefficients to Q2.
b) use data from only restaurants with between 50 and 60 items in the
data set. Predict total fat from cholesterol, total carbs, vitamin a, and restaurant.
Remove any nonsignificant predictors and run again.
Assign the strongest standardized regression coefficient to Q5.
Here's my code.
library(tidyverse)
library(openintro)
library(lm.beta)
fastfood <- openintro::fastfood
head(fastfood)
#Solving for part (a)
fit_1 <- lm(I(restaurant %in% c("Subway", "Mcdonalds")) ~ calories + sodium + protein, data = fastfood)
Q2 <- round(summary(fit_1)$coefficients,2)
#Solving for part (b)
newdata <- fastfood[ which(fastfood$item>=50 & fastfood$item <= 60), ]
df = sort(sample(nrow(newdata), nrow(data)*.7))
newdata_train<-data[df,]
newdata_test<-data[-df,]
fit_5 <- lm(I(total_fat) ~ cholesterol + total_carb + vit_a + restaurant, data = newdata)
prediction_5 <- predict(fit_5, newdata = newdata_test)
Q5 <- lm.beta(fit_5)
But I'm not getting the desired results.
Here is the desired output:
output for part (a):
output for part (b):
The first question requires logistic regression rather than linear regression, since the aim is to predict a binary outcome. The most sensible way to do this is, as the question suggests, to remove all the restaurants except McDonald's and Subway, then create a new binary variable to mark which rows are McDonald's and which aren't:
library(dplyr)
fastfood <- openintro::fastfood %>%
filter(restaurant %in% c("Mcdonalds", "Subway")) %>%
mutate(is_mcdonalds = restaurant == "Mcdonalds")
The logistic regression is done like this:
fit_1 <- glm(is_mcdonalds ~ calories + sodium + protein,
family = "binomial", data = fastfood)
And your coefficients are obtained like this:
Q2 <- round(coef(fit_1), 2)
Q2
#> (Intercept) calories sodium protein
#> -1.24 0.00 0.00 0.06
The second question requires that you filter out any restaurants with more than 60 or fewer than 50 items:
fastfood <- openintro::fastfood %>%
group_by(restaurant) %>%
filter(n() >= 50 & n() <= 60)
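The grouped filter above keeps whole restaurants based on how many rows each one has. The same logic can be sketched in base R on a toy data frame; toy and its three made-up restaurants "A", "B" and "C" are purely illustrative.

```r
# Toy data: restaurant A has 55 rows, B has 70, C has 40
toy <- data.frame(restaurant = rep(c("A", "B", "C"), times = c(55, 70, 40)))

# Count rows per restaurant and keep those with 50 to 60 items
n_items <- table(toy$restaurant)
keep    <- names(n_items)[n_items >= 50 & n_items <= 60]
kept    <- toy[toy$restaurant %in% keep, , drop = FALSE]

unique(kept$restaurant)
nrow(kept)
```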
We now fit the described regression and examine it to look for non-significant regressors:
fit_2 <- lm(total_fat ~ cholesterol + vit_a + total_carb + restaurant,
data = fastfood)
summary(fit_2)
#>
#> Call:
#> lm(formula = total_fat ~ cholesterol + vit_a + total_carb + restaurant,
#> data = fastfood)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -24.8280 -2.9417 0.9397 5.1450 21.0494
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.20102 2.08029 -0.577 0.564751
#> cholesterol 0.26932 0.01129 23.853 < 2e-16 ***
#> vit_a 0.01159 0.01655 0.701 0.484895
#> total_carb 0.16327 0.03317 4.922 2.64e-06 ***
#> restaurantMcdonalds -4.90272 1.94071 -2.526 0.012778 *
#> restaurantSonic 6.43353 1.89014 3.404 0.000894 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.611 on 125 degrees of freedom
#> (34 observations deleted due to missingness)
#> Multiple R-squared: 0.8776, Adjusted R-squared: 0.8727
#> F-statistic: 179.2 on 5 and 125 DF, p-value: < 2.2e-16
We note that vit_a is non-significant and drop it from our model:
fit_3 <- update(fit_2, . ~ . - vit_a)
Now we get the standardized coefficients and round them:
coefs <- round(coef(lm.beta::lm.beta(fit_3)), 2)
and Q5 will be the largest of these coefficients (if a negative coefficient could be the strongest, compare absolute values with abs() instead):
Q5 <- coefs[which.max(coefs)]
Q5
#> cholesterol
#> 0.82
Created on 2022-02-26 by the reprex package (v2.0.1)
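For readers unsure what a standardized coefficient is: under the usual definition it is the raw slope rescaled by sd(x) / sd(y), which is the same as refitting the model on z-scored variables. The sketch below illustrates that definition on the built-in mtcars data with base R only. It does not use lm.beta, and I am assuming rather than asserting that lm.beta uses this same definition.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Rescale each raw slope by sd(predictor) / sd(response)
b          <- coef(fit)[-1]
std_manual <- b * sapply(mtcars[names(b)], sd) / sd(mtcars$mpg)

# Same quantities from a model fitted to z-scored variables
fit_z      <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)
std_scaled <- coef(fit_z)[-1]
all.equal(unname(std_manual), unname(std_scaled))

# "Strongest" judged by absolute size, so negative effects are not overlooked
strongest <- std_manual[which.max(abs(std_manual))]
strongest
```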
I want to get the p-value of both ca.po models. Can someone show me how?
library(data.table)
library(urca)
dt_xy = as.data.table(timeSeries::LPP2005REC[, 2:3])
res = urca::ca.po(dt_xy, type = "Pz", demean = "trend", lag = "short")  # "trend" matches the detrending reported in the output below
summary(res)
And the results. I marked the p-values I need in the result.
Model 1 p-value = 0.9841
Model 2 p-value = 0.1363
########################################
# Phillips and Ouliaris Unit Root Test #
########################################
Test of type Pz
detrending of series with constant and linear trend
Response SPI :
Call:
lm(formula = SPI ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.036601 -0.003494 0.000243 0.004139 0.024975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.702e-04 7.954e-04 1.220 0.223
zrSPI -1.185e-02 5.227e-02 -0.227 0.821
zrSII -3.037e-02 1.374e-01 -0.221 0.825
trd -6.961e-07 3.657e-06 -0.190 0.849
Residual standard error: 0.007675 on 372 degrees of freedom
Multiple R-squared: 0.0004236, Adjusted R-squared: -0.007637
F-statistic: 0.05255 on 3 and 372 DF, p-value: 0.9841 **<--- I need this p.value**
Response SII :
Call:
lm(formula = SII ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.0096931 -0.0018105 -0.0002734 0.0017166 0.0115427
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.598e-05 3.012e-04 -0.252 0.8010
zrSPI -1.068e-02 1.979e-02 -0.540 0.5897
zrSII -9.574e-02 5.201e-02 -1.841 0.0664 .
trd 1.891e-06 1.385e-06 1.365 0.1730
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.002906 on 372 degrees of freedom
Multiple R-squared: 0.01476, Adjusted R-squared: 0.006813
F-statistic: 1.858 on 3 and 372 DF, p-value: 0.1363 **<--- I need this p.value**
Value of test-statistic is: 857.4274
Critical values of Pz are:
10pct 5pct 1pct
critical values 71.9586 81.3812 102.0167
You have to dig into the res object to see its attributes and what's available there.
attributes(res)
...
#>
#> $testreg
#> Response SPI :
#>
#> Call:
#> lm(formula = SPI ~ zr + trd)
#>
...
A long list of objects is returned, but under testreg we can see something that looks like the printed summary of an lm fit, and testreg is one of the attributes of res. We can also access attributes of res using attr(res, "name"), so let's look at the components of testreg.
names(attributes(res))
#> [1] "z" "type" "model" "lag" "cval" "res"
#> [7] "teststat" "testreg" "test.name" "class"
names(attr(res, "testreg"))
#> [1] "Response SPI" "Response SII"
As you noted above, you're looking for 2 separate p-values, so it makes sense that we have two separate models. Let's retrieve these and look at what they are.
spi <- attr(res, "testreg")[["Response SPI"]]
sii <- attr(res, "testreg")[["Response SII"]]
class(spi)
#> [1] "summary.lm"
So, each of them is a summary.lm object. There's lots of documentation on how to extract p-values from lm or summary.lm objects. The usual method is to take the F-statistic stored in the summary and compute its upper-tail probability with pf():
get_pval <- function(summary_lm) {
pf(
summary_lm$fstatistic[1L],
summary_lm$fstatistic[2L],
summary_lm$fstatistic[3L],
lower.tail = FALSE
)
}
get_pval(spi)
#> value
#> 0.9840898
get_pval(sii)
#> value
#> 0.1363474
And there you go, those are the two p-values you were interested in!
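The same extraction can be applied to a whole list of summary.lm objects in one go. The sketch below uses two simple models on the built-in mtcars data in place of the ca.po test regressions; the list fits is made up for illustration, and get_pval is rewritten slightly to index the F-statistic by name.

```r
# Overall F-test p-value from a summary.lm object
get_pval <- function(summary_lm) {
  f <- summary_lm$fstatistic
  unname(pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE))
}

# Stand-in for attr(res, "testreg"): a named list of summary.lm objects
fits <- list(
  wt = summary(lm(mpg ~ wt, data = mtcars)),
  hp = summary(lm(mpg ~ hp, data = mtcars))
)

# One p-value per model, returned as a named numeric vector
pvals <- vapply(fits, get_pval, numeric(1))
pvals
```

With the real object, the analogous call would be vapply(attr(res, "testreg"), get_pval, numeric(1)).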
I am interested in calculating the log-odds of the relationship between a continuous predictor and dichotomous outcome for purposes of graphically evaluating the linearity assumption for a logistic regression model. Does anyone know a formula for this? My key issue is I am unsure how to calculate an event rate for each level of the continuous predictor (i.e. number with outcome/total observations at that level).
Thank you!
Let's simulate some data to show how this can be done.
Imagine we are testing a new electrical product, and we test at a variety of temperatures to see whether temperature affects failure rate.
set.seed(69)
df <- data.frame(temperature = seq(0, 100, length.out = 1000),
failed = rbinom(1000, 1, seq(0.1, 0.9, length.out = 1000)))
So we have two columns: the temperature, and a dichotomous column of 1 (failed) and 0 (passed).
We can get a rough measure of the relationship between temperature and failure rate just by cutting our data frame into 5 degree bins:
df$temp_range <- cut(df$temperature, seq(0, 100, 5), include.lowest = TRUE)
We can now plot the proportion of devices that failed within each 5 degree temperature band:
library(ggplot2)
ggplot(df, aes(x = temp_range, y = failed)) + stat_summary()
#> No summary function supplied, defaulting to `mean_se()`
We can see that the probability of failure appears to go up linearly with temperature.
Now, if we get the proportions of failures in each bin, we take these as the estimate of probability of failure. This allows us to calculate the log odds of failure within each bin:
counts <- table(df$temp_range, df$failed)
probs <- counts[,2]/rowSums(counts)
logodds <- log(probs/(1 - probs))
temp_range <- seq(2.5, 97.5, 5)
logit_df <- data.frame(temp_range, probs, logodds)
So now, we can plot the log odds. Here, we will make our x axis continuous by taking the mid point of each bin as the x co-ordinate. We can then draw a linear regression through our points:
p <- ggplot(logit_df, aes(temp_range, logodds)) +
geom_point() +
geom_smooth(method = "lm", colour = "black", linetype = 2, se = FALSE)
p
#> `geom_smooth()` using formula 'y ~ x'
and in fact carry out a linear regression:
summary(lm(logodds ~ temp_range))
#>
#> Call:
#> lm(formula = logodds ~ temp_range)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.70596 -0.20764 -0.06761 0.18100 1.31147
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.160639 0.207276 -10.42 4.70e-09 ***
#> temp_range 0.046025 0.003591 12.82 1.74e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.463 on 18 degrees of freedom
#> Multiple R-squared: 0.9012, Adjusted R-squared: 0.8957
#> F-statistic: 164.2 on 1 and 18 DF, p-value: 1.738e-10
We can see that the linearity assumption is reasonable here.
What we have just done is like a crude form of logistic regression. Let's now do it properly:
model <- glm(failed ~ temperature, data = df, family = binomial())
summary(model)
#>
#> Call:
#> glm(formula = failed ~ temperature, family = binomial(), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1854 -0.8514 0.4672 0.8518 2.0430
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.006197 0.159997 -12.54 <2e-16 ***
#> temperature 0.043064 0.002938 14.66 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1383.4 on 999 degrees of freedom
#> Residual deviance: 1096.0 on 998 degrees of freedom
#> AIC: 1100
#>
#> Number of Fisher Scoring iterations: 3
Notice how close the coefficients are to our hand-crafted model.
Now that we have this model, we can plot its predictions over our crude linear estimate:
mod_df <- data.frame(temp_range = 1:100,
logodds = predict(model, newdata = list(temperature = 1:100)))
p + geom_line(data = mod_df, colour = "red", linetype = 3, size = 2)
#> `geom_smooth()` using formula 'y ~ x'
Pretty close.
Created on 2020-06-19 by the reprex package (v0.3.0)
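One caveat with the binned approach: if a bin contains no events, or only events, the raw log odds are infinite. A common workaround is the empirical logit, which adds 0.5 to both the event and the non-event count in each bin. The sketch below applies it to the same simulated data; the names events, n and emp_logit are only illustrative, and the 0.5 correction is a convention rather than part of the answer above.

```r
set.seed(69)
df <- data.frame(temperature = seq(0, 100, length.out = 1000),
                 failed = rbinom(1000, 1, seq(0.1, 0.9, length.out = 1000)))
df$temp_range <- cut(df$temperature, seq(0, 100, 5), include.lowest = TRUE)

# Event count and total count per temperature bin
events <- tapply(df$failed, df$temp_range, sum)
n      <- tapply(df$failed, df$temp_range, length)

# Empirical logit: stays finite even when events == 0 or events == n
emp_logit <- log((events + 0.5) / (n - events + 0.5))
head(emp_logit)
```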