I would like to add two different regression curves, coming from two different models, to the same scatter plot.
Let's use the example below:
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
The scatter plot looks like this:
library(ggplot2)
plot_df <- ggplot(df, aes(x = Weight, y = Increment, color = Id)) + geom_point()
plot_df
I tested a linear and an exponential regression model and was able to extract the per-group results following loki's answer there:
library(dplyr)
library(tidyr)
library(broom)
linear_df <- df %>% group_by(Id) %>% do(model = glance(lm(Increment ~ Weight, data = .))) %>% unnest(model)
exp_df <- df %>% group_by(Id) %>% do(model = glance(lm(log(Increment) ~ Weight, data = .))) %>% unnest(model)
The linear model fits better for the Ga group, the exponential one for the Aa group, and neither model fits the Za group well:
> linear_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.656 0.637 15.1 34.4 1.50e- 5 1 -81.6 169. 172. 4106. 18 20
2 Ga 1.00 1.00 0.243 104113. 6.10e-32 1 1.01 3.98 6.65 0.942 16 18
3 Za 0.0471 -0.0262 26.7 0.642 4.37e- 1 1 -69.5 145. 147. 9283. 13 15
> exp_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.999 0.999 0.0624 24757. 1.05e-29 1 28.2 -50.3 -47.4 0.0700 18 20
2 Ga 0.892 0.885 0.219 132. 3.86e- 9 1 2.87 0.264 2.94 0.766 16 18
3 Za 0.00444 -0.0721 0.941 0.0580 8.14e- 1 1 -19.3 44.6 46.7 11.5 13 15
Now, how can I draw the exponential regression curve for the Aa group, the linear regression line for the Ga group, and no curve for the Za group? There is this, but it applies to different regressions built inside the same model type. How can I combine my different model objects?
The formula shown below gives the same fitted values as three separate fits, one for each Id, so create the lm objects for the two models and then plot the points and the fitted lines for each. The straight solid lines are the linear model and the curved dashed lines are the exponential model.
library(ggplot2)
fm.lin <- lm(Increment ~ Id/Weight + 0, df)
fm.exp <- lm(log(Increment) ~ Id/Weight + 0, df)
df %>%
  ggplot(aes(Weight, Increment, color = Id)) +
  geom_point() +
  geom_line(aes(y = fitted(fm.lin))) +
  geom_line(aes(y = exp(fitted(fm.exp))), lty = 2, lwd = 1)
To show only the exponential fit for the Aa group and the linear fit for the Ga group, NA out the portions not wanted. In this case solid lines are used for both fitted curves.
df %>%
  ggplot(aes(Weight, Increment, color = Id)) +
  geom_point() +
  geom_line(aes(y = ifelse(Id == "Aa", exp(fitted(fm.exp)), NA))) +
  geom_line(aes(y = ifelse(Id == "Ga", fitted(fm.lin), NA)))
Added
Regarding the questions in the comments, the formula used above nests Weight within Id and effectively uses a model matrix which, modulo column order, is a block-diagonal matrix whose blocks are the model matrices of the three individual models. Look at the model matrix to see this:
model.matrix(fm.lin)
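As a quick check (a sketch added here, using the objects defined above and the fact that the rows of df are already grouped by Id), the nested fit reproduces the fitted values of three separate per-group fits:
# fit Increment ~ Weight separately within each Id and collect the fitted values
sep_fit <- unlist(lapply(split(df, df$Id), function(d) fitted(lm(Increment ~ Weight, data = d))))
# compare with the fitted values of the single nested model -- should be TRUE
all.equal(unname(sep_fit), unname(fitted(fm.lin)))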
Since this is a single model rather than three separate models, the summary statistics will be pooled. To get separate summary statistics, use lmList from the nlme package (which ships with R, so it does not need to be installed -- just load it with library()). The statements below give objects of class lmList that can be used in place of the ones above, since they have a fitted method that returns the same fitted values.
library(nlme)
fm.lin2 <- lmList(Increment ~ Weight | Id, df, pool = FALSE)
fm.exp2 <- lmList(log(Increment) ~ Weight | Id, df, pool = FALSE)
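As a quick sanity check (a sketch on my part), the lmList fit should reproduce the fitted values of the nested single-model fit above:
# should be TRUE (both sets of fitted values follow the original row order of df here)
all.equal(as.numeric(fitted(fm.lin2)), as.numeric(fitted(fm.lin)))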
In addition, they can be used to get individual summary statistics. Internally, an lmList object is a list of three lm objects (plus some attributes), so we can extract the summary statistics from each component.
library(broom)
sapply(fm.lin2, glance)
sapply(fm.exp2, glance)
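If a tidy per-group table is preferred instead of the matrix that sapply returns, one variation (my addition, assuming dplyr is loaded) is:
# one row of glance() output per Id
dplyr::bind_rows(lapply(fm.lin2, glance), .id = "Id")
dplyr::bind_rows(lapply(fm.exp2, glance), .id = "Id")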
One caveat is that common statistical tests between models using different dependent variables, Increment vs. log(Increment), are invalid.
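If an informal comparison is still wanted, one option (an addition of mine, not part of the answer above) is to compare both fits on the original Increment scale, e.g. via per-group residual sums of squares after back-transforming the exponential fit:
# per-Id residual sum of squares on the original scale (rough, informal comparison only)
rss <- function(obs, fit) sum((obs - fit)^2)
tapply(seq_len(nrow(df)), df$Id, function(i)
  c(linear = rss(df$Increment[i], fitted(fm.lin)[i]),
    exponential = rss(df$Increment[i], exp(fitted(fm.exp))[i])))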
Possible solution
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
library(tidyverse)
df_model <- df %>%
  group_nest(Id) %>%
  mutate(
    # one model formula per Id group, in group_nest() order (Aa, Ga, Za):
    # exponential fit for Aa, linear fit for Ga, and an intercept-free null fit
    # for Za (its fitted values are all 0, so its line sits on the x-axis)
    formula = c(
      "lm(log(Increment) ~ Weight, data = .x)",
      "lm(Increment ~ Weight, data = .x)",
      "lm(Increment ~ 0, data = .x)"
    ),
    # back-transform the fitted values of the log-scale model for Aa
    transform = c(
      "exp(fitted(.x))",
      "fitted(.x)",
      "fitted(.x)"
    )
  ) %>%
  mutate(model = map2(data, formula, .f = ~ eval(parse(text = .y)))) %>%
  mutate(fit = map2(model, transform, ~ eval(parse(text = .y)))) %>%
  select(Id, data, fit) %>%
  unnest(c(data, fit))
ggplot(df_model) +
  geom_point(aes(Weight, Increment, color = Id)) +
  geom_line(aes(Weight, fit, color = Id))
Created on 2021-10-06 by the reprex package (v2.0.1)
I have split a data set by the column WithSTV, which is discrete, into two datasets. From there I run an lm on each dataset, and from those models I need the predicted probabilities of a single discrete variable, which I then have to plot. I am getting nearly the exact same predicted values for both models, which makes no sense to me. Please find the code and data below.
I have tried multiple packages and avenues including ggplot2, DAMisc, etc.
library(readstata13)
library(haven)
library(sjlabelled)
library(sjmisc)
library(sjstats)
library(ggeffects)
library(sjPlot)
dat <- read.dta13("STV.dta")
dat <- na.omit(dat)
zig<-split(dat, dat$WithSTV) ##split dataframe by WithSTV variable
##this == WithSTV=0
zig5 <- zig[[1]] ## separate by 1st level
blah55 <- lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data = zig5)
summary(blah55)
##this == WithSTV=1
zig10 <- zig[[2]] ## separate by 2nd level
blah10 <- lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data = zig10)
summary(blah10)
##WithSTV==0
d<-zig5
d$nonpartisan <- as.factor(d$nonpartisan)
fit<-lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data=d)
d$predicted <- predict(fit) # Save the predicted values
d$residuals <- residuals(fit)
plot_model(fit, type = "pred", terms = c("nonpartisan"))
##WithSTV==1
d<-zig10
d$nonpartisan <- as.factor(d$nonpartisan)
fit2<-lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data=d)
d$predicted <- predict(fit2) # Save the predicted values
d$residuals <- residuals(fit2)
plot_model(fit2, type = "pred", terms = c("nonpartisan"))
This is a link to the data. This is not a large file.
https://drive.google.com/file/d/1HBssOfb0QX6BTh6ipwlJCRf3ZOT5zKmE/view?usp=sharing
I am expecting the predicted values of nonpartisan from the two models to differ rather than be identical or nearly identical, looking more stacked / stair-stepped. So, for example, if I compare nonpartisan between zig5/fit and zig10/fit2, they should come out around 3 and 5, not both around 8.
Thanks in advance.
Without knowing your exact research question, I would include WithSTV as a predictor in the model rather than modeling the two subsets separately. I imagine the intervention applied to the entire sample, so you are interested in the marginal difference between those exposed and those not exposed. If that is the case, keeping WithSTV in a single linear model is a good approach.
Here is that approach, using the variables from your code:
library(haven)
library(tidyverse)
# Read in the Data
dat <- haven::read_dta("STV.dta")
# Fit the Model including WithSTV as a fixed effect
blah10 <- lm(PercentRunoff1 ~ Statewide + Contested +
               nonpartisan + presidential_election +
               education_level + WithSTV, data = dat)
# Inspect the Coefficients
# install.packages("arm") # great utility package
arm::coefplot(blah10)
arm::display(blah10)
# Build the model matrix of predictor combinations used by the model
# (the prediction intervals themselves are added in the next step)
marginal_combos <- model.matrix(PercentRunoff1 ~ Statewide + Contested +
                                  nonpartisan + presidential_election +
                                  education_level + WithSTV,
                                data = dat)
# Add the Predicted Values to Original Data with Pred Interval
dat_fitted <- dat %>%
bind_cols(pred = predict(blah10, newdata = ., interval="predict") %>%
as_tibble())
# Generate the Marginal Graph
# This draws the fitted lines and shades the prediction intervals in gray
dat_fitted %>%
  mutate(nonpartisan = as_factor(nonpartisan)) %>%
  mutate(WithSTV = as_factor(WithSTV)) %>%
  ggplot(aes(nonpartisan, fit, group = WithSTV)) +
  geom_line(aes(color = WithSTV)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = .1)
This shows that the point estimates differ slightly between those with and without STV, but the prediction intervals largely overlap, so there isn't a huge, distinct difference.
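If you specifically want to test whether the nonpartisan effect differs between the two WithSTV groups, one option (my addition, not part of the approach above) is to add an interaction term and look at its coefficient:
# hypothetical extension: does the effect of nonpartisan depend on WithSTV?
blah_int <- lm(PercentRunoff1 ~ Statewide + Contested + presidential_election +
                 education_level + nonpartisan * WithSTV, data = dat)
summary(blah_int)  # inspect the nonpartisan:WithSTV row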
Hope this helps.
Based on your data, the differences between the two subsets indeed seem to be rather small. This is true both for the predicted values, adjusted for the covariates of your models, and for the raw mean values of your outcome in each subgroup (see the results at the very bottom).
If you expected larger differences, there might be an issue with your data (preparation).
library(sjlabelled)
library(ggeffects)
library(sjPlot)
library(dplyr)
d <- read_stata("D:/Downloads/STV.dta")
dat <- na.omit(d)
zig <- split(dat, dat$WithSTV)
zig5 <- zig[[1]]
zig5$nonpartisan <- as.factor(zig5$nonpartisan)
fit <- lm(
PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level,
data = zig5
)
zig10 <- zig[[2]]
zig10$nonpartisan <- as.factor(zig10$nonpartisan)
fit2 <- lm(
PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level,
data = zig10
)
ggpredict(fit, "nonpartisan")
#>
#> # Predicted values of PercentRunoff1
#> # x = nonpartisan
#>
#> x predicted std.error conf.low conf.high
#> 0 0.095 0.003 0.090 0.100
#> 1 0.198 0.007 0.184 0.212
#>
#> Adjusted for:
#> * Statewide = 0.05
#> * Contested = 0.70
#> * presidential_election = 0.52
#> * education_level = 82.71
ggpredict(fit2, "nonpartisan")
#>
#> # Predicted values of PercentRunoff1
#> # x = nonpartisan
#>
#> x predicted std.error conf.low conf.high
#> 0 0.099 0.004 0.092 0.107
#> 1 0.268 0.007 0.255 0.282
#>
#> Adjusted for:
#> * Statewide = 0.05
#> * Contested = 0.77
#> * presidential_election = 0.43
#> * education_level = 82.56
dat %>%
  group_by(WithSTV, nonpartisan) %>%
  summarize(mean = mean(PercentRunoff1))
#> # A tibble: 4 x 3
#> # Groups: WithSTV [2]
#> WithSTV nonpartisan mean
#> <dbl> <dbl> <dbl>
#> 1 0 0 0.101
#> 2 0 1 0.165
#> 3 1 0 0.114
#> 4 1 1 0.223
Created on 2019-09-01 by the reprex package (v0.3.0)