Can geom_smooth accept logical variables for glm?

I have a tibble with numerical and logical variables, e.g. like this:
       x     f      y
   <dbl> <int>  <dbl>
 1 -2        1 -0.801
 2 -1.96     0 -2.27
 3 -1.92     0 -1.75
 4 -1.88     0 -2.44
 5 -1.84     1 -0.123
...
For reproducibility, it can be generated using:
library(tidyverse)
set.seed(0)
tb1 = tibble(
  x = (-50:50)/25,
  p = plogis(x),
  f = rbinom(p, 1, p),
  y = x + f + rnorm(x, 0, .5)
) %>% select(-p)
I'd like to plot the points and draw regression lines, once taking x as the predictor and f as the outcome (logistic regression), and once taking x and f as predictors and y as the outcome (linear regression). This works well for the logistic regression.
ggplot(tb1, aes(x, f)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))
produces the expected plot with a logistic curve, but:
ggplot(tb1, aes(x, y, colour = f)) +
  geom_point() +
  geom_smooth(method = "lm")
produces a plot which is wrong: f is mapped to a continuous colour scale and only a single regression line is drawn. I want f treated as a factor, producing two regression lines and a discrete rather than continuous colour legend. I can force f manually to a logical value:
tb2 = tb1 %>% mutate(f = f>0)
and obtain the correct linear regression graph:
but now I cannot plot the logistic regression; it fails with:
Warning message:
Computation failed in stat_smooth():
y values must be 0 <= y <= 1
For some reason, both lm() and glm() have no problems:
summary(glm(f ~ x, binomial, tb1))
summary(lm(y ~ x + f, tb1))
summary(glm(f ~ x, binomial, tb2))
summary(lm(y ~ x + f, tb2))
all produce reasonable results, and the results are identical for tb1 and tb2, as they should be. So is there a way of convincing geom_smooth to accept logical variables, or must I keep two redundant variables with identical values but different types, e.g. f.int and f.lgl?
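One way to avoid the duplicated column, sketched here, is to keep f numeric and coerce it to a factor only inside the aesthetic mapping, so both plots can share the same column:

# Linear fit: convert f on the fly; grouping by the factor gives two lines
# and a discrete legend
ggplot(tb1, aes(x, y, colour = factor(f))) +
  geom_point() +
  geom_smooth(method = "lm")

# Logistic fit: the numeric f still works as before
ggplot(tb1, aes(x, f)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))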

Related

How to plot the predicted probabilities for an ordered logit regression?

I want to plot a similar plot to the one at the bottom of this page: ordered logit
They use a variable on the x-axis that is categorical (0-10), and therefore they use seq(0, 10, 1) and hold all other variables constant at their means when they determine the predicted probabilities for the plot.
However, in my data the variable I want to have on the x-axis is a dummy (0-1), and therefore seq() does not work. What can I do instead to make a similar plot? Do you have any examples?
It's difficult to walk you through your own code in the absence of a reproducible example, but let's create one with a dummy variable on the x axis.
Here, we have a binary outcome variable, a dummy predictor variable, and a numeric predictor variable that we want to hold at its mean value for the purposes of the plot:
set.seed(1)
df <- data.frame(dummy = rep(0:1, each = 20),
                 other_var = runif(40),
                 outcome = rbinom(40, 1, rep(c(0.3, 0.7), each = 20)))
We can create a logistic regression like this:
model <- glm(outcome ~ dummy + other_var, family = binomial, data = df)
summary(model)
#> 
#> Call:
#> glm(formula = outcome ~ dummy + other_var, family = binomial, 
#>     data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.4942  -0.9675  -0.5557   0.9465   2.0245  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)  
#> (Intercept) -0.006877   0.831579  -0.008   0.9934  
#> dummy        1.447983   0.713932   2.028   0.0425 *
#> other_var   -2.120011   1.339460  -1.583   0.1135  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 54.548  on 39  degrees of freedom
#> Residual deviance: 46.726  on 37  degrees of freedom
#> AIC: 52.726
#> 
#> Number of Fisher Scoring iterations: 4
To predict the outcome for each value of dummy at the mean of other_var, we create a little prediction data frame with one column holding each unique value of dummy and another column filled with the mean value of other_var:
pred_df <- data.frame(dummy = 0:1, other_var = mean(df$other_var))
Now we feed this to predict():
preds <- predict(model, pred_df, se.fit = TRUE)
Since predict() for a glm returns values on the link (log-odds) scale by default, we need a little function to convert log odds to probabilities:
log_odds_to_probs <- function(x) exp(x) / (1 + exp(x))
Now we can get our predicted values as well as the 95% confidence intervals for the predictions as probabilities:
pred_df$fit <- log_odds_to_probs(preds$fit)
pred_df$lower <- log_odds_to_probs(preds$fit - 1.96 * preds$se.fit)
pred_df$upper <- log_odds_to_probs(preds$fit + 1.96 * preds$se.fit)
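As an aside, base R's plogis() is the logistic CDF and computes exactly this transformation, so it could replace the helper:

# plogis(x) equals exp(x) / (1 + exp(x))
all.equal(log_odds_to_probs(preds$fit), plogis(preds$fit))
#> [1] TRUE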
Finally, we plot the result using whatever style you like:
library(ggplot2)
ggplot(pred_df, aes(factor(dummy), fit)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, size = 1.5,
                color = 'deepskyblue4') +
  geom_point(size = 4) +
  labs(x = 'dummy', y = 'probability') +
  ylim(c(0, 1)) +
  theme_minimal(base_size = 16)
Created on 2022-12-06 with reprex v2.0.2
I want to offer a much easier solution. For comparability, I also start with the model proposed by Allan.
model <- glm(outcome ~ dummy + other_var, family = binomial, data = df)
Now, to calculate the predicted probabilities, use the glm.predict package by Benjamin Schlegel. This package does all the hard stuff for you with just one command. You can even set the method (bootstrap or simulation), and it calculates discrete changes if you like. Also, the package fits within a tidyverse pipeline.
You set your dummy to 0 and 1 and your other variable to its mean. From that you can directly produce the same plot as Allan's. Below you see the code for that.
library(glm.predict)
predicts(model, "0,1;mean") %>%
  ggplot(aes(x = as.factor(dummy), y = mean, ymin = lower, ymax = upper)) +
  geom_pointrange()

Compare treatment effects in three way interaction between two continuous variables and one categorical variable in R

I am trying to run a linear regression model which contains continuous variable A * continuous variable B * a categorical variable (treatment with 4 levels). Data can be downloaded here.
Model <- lm(H2O2 ~ Treatment*(A*B), data = mydata)
Now I want to compare different treatment effects.
I know that lstrends can deal with continuous variable * categorical variable in linear model, but it could not work in my situation. I also tried to divide the data based on different treatment groups and created 4 different linear models to compare, that did not work either.
The equation you're estimating is:

$$\begin{aligned}
H_2O_2 ={} & \beta_0 + \beta_1 HF + \beta_2 LF + \beta_3 MF + \beta_4 A + \beta_5 B + \beta_6 AB \\
& + \beta_7 HF \cdot A + \beta_8 LF \cdot A + \beta_9 MF \cdot A + \beta_{10} HF \cdot B + \beta_{11} LF \cdot B + \beta_{12} MF \cdot B \\
& + \beta_{13} HF \cdot AB + \beta_{14} LF \cdot AB + \beta_{15} MF \cdot AB + \varepsilon
\end{aligned}$$

where $HF$, $LF$, and $MF$ are indicator variables for the treatment levels, with HC as the reference category.
There are six different treatment effects in which you could be interested - they comprise the pairwise differences among treatment categories given fixed values of A and B. Three of these are represented by comparisons of estimated categories versus the reference category. For example, to figure out the effect of HF versus HC (the reference), you would calculate:

$$\hat{\Delta}_{HF-HC}(A, B) = \beta_1 + \beta_7 A + \beta_{10} B + \beta_{13} AB$$
Looking at the coefficients from your model:
b <- coef(Model)
b
    (Intercept)     TreatmentHF     TreatmentLF     TreatmentMF               A               B 
  -1.4318658015    1.5744952961    1.7649475644   -0.6971275663    0.0334782841    0.1528682774 
            A:B   TreatmentHF:A   TreatmentLF:A   TreatmentMF:A   TreatmentHF:B   TreatmentLF:B 
  -0.0022753098   -0.0313728254   -0.0342105088    0.0173173280   -0.1430777577   -0.1214230927 
  TreatmentMF:B TreatmentHF:A:B TreatmentLF:A:B TreatmentMF:A:B 
   0.0212295284    0.0025811227    0.0023565223   -0.0007721532 
In R, that would be something like:
b[2] + b[8]*A + b[11]*B + b[14]*A*B
You would want to substitute in a wide range of combinations of A and B, which you could do by making a sequence of values of each going from the minimum to the maximum, and then crossing them.
a_seq <- seq(min(mydata$A), max(mydata$A), length = 25)
b_seq <- seq(min(mydata$B), max(mydata$B), length = 25)
eg <- expand.grid(A = a_seq, B = b_seq)
head(eg)
#          A    B
# 1  5.03000 4.34
# 2 10.01292 4.34
# 3 14.99583 4.34
# 4 19.97875 4.34
# 5 24.96167 4.34
# 6 29.94458 4.34
You could then make the treatment effect in this dataset.
library(dplyr)
eg <- eg %>% mutate(treat_HC_HF = b[2] + b[8]*A + b[11]*B + b[14]*A*B)
Then, you could plot it using a heatmap or similar.
ggplot(eg, aes(x = A, y = B, fill = treat_HC_HF)) +
  geom_tile() +
  scale_fill_viridis_c() +
  theme_classic() +
  labs(fill = "Treatment\nEffect")
You could do this for the other comparisons as well. There are two things that you don't get from this directly. First, it doesn't tell you anything about where you actually observe A and B. Second, it doesn't tell you whether any of these effects is statistically significant. The first problem you could solve, more or less, by only plotting those hypothetical values of A and B that fall in the convex hull of the observed A and B.
library(geometry)
ch <- convhulln(mydata[, c("A", "B")])
eg <- eg %>%
  mutate(inhull = inhulln(ch, cbind(A, B)))
eg %>%
  filter(inhull) %>%
  ggplot(aes(x = A, y = B, fill = treat_HC_HF)) +
  geom_tile() +
  scale_fill_viridis_c(limits = c(min(eg$treat_HC_HF), max(eg$treat_HC_HF))) +
  theme_classic() +
  labs(fill = "Treatment\nEffect")
To calculate whether or not these are significant, you would have to do a bit more work. First, you'd have to get the standard error of each comparison. What you need is a matrix we'll call M that collects the values you multiply the coefficients by to get the treatment effect. So, in the above example, we would have three pieces of information: the relevant coefficients $b_t = (\beta_1, \beta_7, \beta_{10}, \beta_{13})'$, their variance-covariance matrix $V_t$, and the matrix $M$ whose rows are $(1, A, B, AB)$ for each grid point.
In R, we could get these with:
b_t <- b[c(2,8,11,14)]
V_t <- vcov(Model)[c(2,8,11,14), c(2,8,11,14)]
M <- cbind(1, eg$A, eg$B, eg$A*eg$B)
Then, we could calculate the standard errors of the treatment effects as $\sqrt{\mathrm{diag}(M V_t M')}$.
In R, we could do this and identify which treatment effects are significant (two-tailed 95% test) with:
eg <- eg %>%
  mutate(se = sqrt(diag(M %*% V_t %*% t(M))),
         # note: qt() (the critical value), not pt() (the CDF)
         sig = abs(treat_HC_HF/se) > qt(0.975, Model$df.residual))
Then we could plot only those effects that are in the convex hull and significant:
eg %>%
  filter(inhull, sig) %>%
  ggplot(aes(x = A, y = B, fill = treat_HC_HF)) +
  geom_tile() +
  scale_fill_viridis_c(limits = c(min(eg$treat_HC_HF), max(eg$treat_HC_HF))) +
  theme_classic() +
  labs(fill = "Treatment\nEffect")
You would have to do this for each of the six paired comparisons of treatment levels. This seems like a lot of work, but the model, despite being simple to estimate, is quite complicated to interpret.
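To cut down the repetition for the three versus-reference comparisons, the steps above can be wrapped in a small helper. This is only a sketch: effect_surface is a name chosen here, the coefficient positions (e.g. c(2, 8, 11, 14) for HF versus HC) still have to be supplied by hand, and comparisons between two non-reference levels would instead need the difference of two such coefficient sets.

# Sketch: treatment-effect surface and significance for a versus-reference comparison
effect_surface <- function(Model, idx, grid) {
  b_t <- coef(Model)[idx]                # level main effect + its A, B, A:B interactions
  V_t <- vcov(Model)[idx, idx]           # corresponding variance-covariance block
  M   <- cbind(1, grid$A, grid$B, grid$A * grid$B)
  grid$est <- as.vector(M %*% b_t)       # treatment effect at each (A, B) combination
  grid$se  <- sqrt(diag(M %*% V_t %*% t(M)))  # delta-method standard error
  grid$sig <- abs(grid$est / grid$se) > qt(0.975, Model$df.residual)
  grid
}

# HF vs HC (the reference) uses positions 2, 8, 11, 14 of coef(Model)
hf_vs_hc <- effect_surface(Model, c(2, 8, 11, 14), eg[, c("A", "B")])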

How to run a regression for a given value of a parameter and represent this in a plot and stargazer?

Given the linear model:
data(wage1)
reg_wage1 <- lm(lwage ~ female + educ + (female * educ), data = wage1)
stargazer(reg_wage1, type = "text")
===============================================
                        Dependent variable:    
                    ---------------------------
                               lwage           
-----------------------------------------------
female                        -0.360*          
                              (0.185)          
educ                          0.077***         
                              (0.009)          
female:educ                   -0.0001          
                              (0.015)          
Constant                      0.826***         
                              (0.118)          
-----------------------------------------------
Observations                    526            
R2                             0.300           
Adjusted R2                    0.296           
Residual Std. Error      0.446 (df = 522)      
F Statistic           74.649*** (df = 3; 522)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
Create a line graph that compares each gender's wage growth (female can only be 1 or 0: 1 means the person is female, 0 means male), i.e. illustrate two lines corresponding to different values of $\delta_0$ and $\delta_1$.
My attempt:
plot(NULL, xlim = c(0, 16), ylim = c(-1, 1), ylab = "Log Wage", xlab = "Education",
     main = "Men vs Women's Wages Slope")
abline(a = -0.360, b = .185, col = "blue")
abline(a = 0.077, b = 0.009, col = "red")
legend("bottomright", c("Female", "Education"),
       col = c("blue", "red"), fill = c("blue", "red"), title = "Legend", cex = .75)
Where a (the constant) is the intercept and b (the standard error) is the slope (these values are taken from the stargazer table).
My main problem is that I don't know how to run the model in a way that lets me evaluate female and male wage growth independently, that is, with female = 1 or female = 0. I tried using predict(), but the result it returns does not come in a format from which I can extract the coefficients I need to pass to plot().
The results of this exercise should look something like this:
First of all, please make sure to note where your data comes from (in this case the np package).
Your intuition of using the predict function was one way of doing it but I would recommend using the ggplot2 package to plot your regression results.
The easiest implementation of this uses geom_smooth, which does the modelling automatically for you; because we map female to the colour aesthetic, your interaction is taken into account:
library(np)
library(ggplot2)
data(wage1)
ggplot(wage1, aes(x = educ, y = lwage, color = female)) +
  geom_point() +
  geom_smooth(method = "lm")
You could also use the predict function, as mentioned, but you have to run the model manually before (and adding the confidence intervals is a bit of a pain):
reg_wage1 <- lm(lwage ~ female + educ + (female * educ), data = wage1)
wage1$pred <- predict(reg_wage1)
ggplot(wage1, aes(x = educ, y = lwage, color = female)) +
  geom_point() +
  geom_line(aes(x = educ, y = pred, color = female))
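If you do want the group-specific intercepts and slopes (e.g. for base R's abline(), as in your attempt), build them from the coefficients rather than the standard errors. A sketch, assuming female is coded 0/1 so the terms are named as in the stargazer table:

cf <- coef(reg_wage1)
plot(lwage ~ educ, data = wage1, xlab = "Education", ylab = "Log Wage")
# men (female = 0): baseline intercept and education slope
abline(a = cf[["(Intercept)"]], b = cf[["educ"]], col = "red")
# women (female = 1): intercept shifted by the female term, slope by the interaction
abline(a = cf[["(Intercept)"]] + cf[["female"]],
       b = cf[["educ"]] + cf[["female:educ"]], col = "blue")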
Both lines have different slopes as well, although they are really difficult to see with your model. To reassure you that this is not just an intercept correction but also shows you different slopes, consider another example where the interaction effects are much stronger:
data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
ggplot(mtcars, aes(x = hp, y = mpg, color = vs)) +
  geom_point() +
  geom_smooth(method = "lm")
Hope this helps!

ggplot2: Logistic Regression - plot probabilities and regression line

I have a data.frame containing a continuous predictor and a dichotomous response variable.
> head(df)
  position response
1        0        1
2        3        1
3       -4        0
4       -1        0
5       -2        1
6        0        0
I can easily compute a logistic regression by means of the glm()-function, no problems up to this point.
Next, I want to create a plot with ggplot, that contains both the empiric probabilities for each of the overall 11 predictor values, and the fitted regression line.
I went ahead and computed the probabilities with cast() and saved them in another data.frame
> probs
   position   prob
1        -5 0.0500
2        -4 0.0000
3        -3 0.0000
4        -2 0.2000
5        -1 0.1500
6         0 0.3684
7         1 0.4500
8         2 0.6500
9         3 0.7500
10        4 0.8500
11        5 1.0000
I plotted the probabilities:
p <- ggplot(probs, aes(x=position, y=prob)) + geom_point()
But when I try to add the fitted regression line
p <- p + stat_smooth(method="glm", family="binomial", se=F)
it returns a warning: non-integer #successes in a binomial glm!.
I know that in order to plot the stat_smooth "correctly", I'd have to call it on the original df data with the dichotomous variable. However, if I use the df data in ggplot(), I see no way to plot the probabilities.
How can I combine the probabilities and the regression line in one plot, in the way it's meant to be in ggplot2, i.e. without getting any warning or error messages?
There are basically three solutions:
Merging the data.frames
The easiest, once you have your data in two separate data.frames, would be to merge them by position:
mydf <- merge( mydf, probs, by="position")
Then you can call ggplot on this data.frame without warnings:
ggplot(mydf, aes(x = position, y = prob)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
Avoiding the creation of two data.frames
In the future, you could directly avoid the creation of two separate data.frames which you have to merge later. Personally, I like to use the plyr package for that:
library(plyr)
mydf <- ddply(mydf, "position", mutate, prob = mean(response))
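If you prefer dplyr, plyr's successor, the same per-position probabilities can be computed with a grouped mutate:

library(dplyr)
mydf <- mydf %>%
  group_by(position) %>%
  mutate(prob = mean(response)) %>%  # empirical probability per position
  ungroup()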
Edit: Use different data for each layer
I forgot to mention, that you can use for each layer another data.frame which is a strong advantage of ggplot2:
ggplot(probs, aes(x = position, y = prob)) +
  geom_point() +
  geom_smooth(data = mydf, aes(x = position, y = response),
              method = "glm", method.args = list(family = "binomial"),
              se = FALSE)
As an additional hint: avoid the variable name df, since assigning to it masks the built-in function stats::df.

Fitting a function in R

I have a few datapoints (x and y) that seem to have a logarithmic relationship.
> mydata
    x   y
1   0 123
2   2 116
3   4 113
4  15 100
5  48  87
6  75  84
7 122  77
> qplot(x, y, data=mydata, geom="line")
Now I would like to find an underlying function that fits the graph and allows me to infer other datapoints (i.e. 3 or 82). I read about lm and nls but I'm not getting anywhere really.
At first, I created a function of which I thought it resembled the plot the most:
f <- function(x, a, b) {
  a * exp(b * -x)
}
x <- seq(0:100)
y <- f(seq(0:100), 1, 1)
qplot(x, y, geom = "line")
Afterwards, I tried to generate a fitting model using nls:
> fit <- nls(y ~ f(x, a, b), data=mydata, start=list(a=1, b=1))
Error in numericDeriv(form[[3]], names(ind), env) : 
  Missing value or an Infinity produced when evaluating the model
Can someone point me in the right direction on what to do from here?
Follow up
After reading your comments and googling around a bit further, I extended f with a third parameter c, adjusted the starting values for a, b and c, and then suddenly the model converged:
fit <- nls(y ~ f(x, a, b, c), data = data.frame(mydata), start = list(a = 1, b = 30, c = -0.3))
x <- seq(0, 120)
fitted.data <- data.frame(x = x, y = predict(fit, list(x = x)))
ggplot(mydata, aes(x, y)) +
  geom_point(color = "red", alpha = .5) +
  geom_line(alpha = .5) +
  geom_line(data = fitted.data)
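A hedged alternative for this kind of decay-to-asymptote shape is base R's self-starting model SSasymp, which computes its own starting values and so can sidestep the convergence failure above (whether it fits well on only seven points is not guaranteed):

# SSasymp fits y = Asym + (R0 - Asym) * exp(-exp(lrc) * x) with automatic start values
fit2 <- nls(y ~ SSasymp(x, Asym, R0, lrc), data = mydata)
summary(fit2)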
Maybe using a cubic specification for your model and estimating via lm would give you a good fit.
# Importing your data
dataset <- read.table(text = '
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77', header = TRUE)
# One possible specification: a cubic polynomial estimated with lm
y.hat <- predict(lm(y ~ x + I(x^2) + I(x^3), data = dataset))  # fitted values from the model
qplot(x, y, data = dataset, geom = "line")               # your data, black line
last_plot() + geom_line(aes(x = x, y = y.hat), col = 2)  # fitted values, red line
# It fits well.
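Since the original goal was to infer new data points (e.g. x = 3 or x = 82), you can predict them from the stored model. A minimal sketch (cubic_fit is just a name chosen here):

cubic_fit <- lm(y ~ x + I(x^2) + I(x^3), data = dataset)
# predicted y at the new x values of interest
predict(cubic_fit, newdata = data.frame(x = c(3, 82)))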
Try taking the log of your response variable and then using lm to fit a linear model:
fit <- lm(log(y) ~ x, data=mydata)
The adjusted R-squared is 0.8486, which at face value isn't bad. You can look at the fit using plot, for example:
plot(fit, which=2)
But perhaps it's not such a good fit after all:
last_plot() + geom_line(aes(x=x, y=exp(fit$fitted.values)))
Check this document out: http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
In brief, first you need to decide on the model to fit onto your data (e.g., exponential) and then estimate its parameters.
Here are some widely used distributions:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
