Plotting categorical variables OLS in R


I am trying to produce a plot with age in the x-axis, expected serum urate in the y-axis and lines for male/white, female/white, male/black, female/black, using the estimates from the lm() function.
goutdata <- read.table("gout.txt", header = TRUE)
goutdata$sex <- factor(goutdata$sex,levels = c("M", "F"))
goutdata$race <- as.factor(goutdata$race)
fm <- lm(su~sex+race+age, data = goutdata)
summary(fm)
ggplot(fm, aes(x= age, y = su))+xlim(30, 70) + geom_jitter(aes(age,su, colour=age)) + facet_grid(sex~race)
I have tried facet_wrap() with ggplot to handle the categorical variables, but I want just one plot. I also tried combining geom_jitter() and geom_smooth(), but I am not sure how to use geom_smooth() with categorical variables. Any help would be appreciated.
Data: https://github.com/gdlc/STT465/blob/master/gout.txt

We can use interaction() to create groupings on the fly and perform the OLS right within geom_smooth(). Here they are grouped on one plot:
ggplot(goutdata, aes(age, su, color = interaction(sex, race))) +
geom_smooth(formula = y~x, method="lm") +
geom_point() +
hrbrthemes::theme_ipsum_rc(grid="XY")
and, spread out into facets:
ggplot(goutdata, aes(age, su, color = interaction(sex, race))) +
geom_smooth(formula = y~x, method="lm") +
geom_point() +
facet_wrap(sex~race) +
hrbrthemes::theme_ipsum_rc(grid="XY")
You've now got a partial answer to #1 of https://github.com/gdlc/STT465/blob/master/HW_4_OLS.md :-)
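Note that geom_smooth() fits a separate regression within each sex/race group, so the slopes can differ, whereas the additive model in the question (su ~ sex + race + age) implies four parallel lines. A minimal sketch of plotting the lm() estimates themselves follows. It assumes a simulated stand-in for gout.txt: the values are made up and only the column names follow the question.

```r
library(ggplot2)

# Simulated stand-in for gout.txt (hypothetical values, same column names)
set.seed(1)
n <- 200
goutdata <- data.frame(
  sex  = factor(sample(c("M", "F"), n, replace = TRUE), levels = c("M", "F")),
  race = factor(sample(c("B", "W"), n, replace = TRUE)),
  age  = round(runif(n, 30, 70))
)
goutdata$su <- 5 + 0.03 * goutdata$age - 1 * (goutdata$sex == "F") +
  0.5 * (goutdata$race == "B") + rnorm(n, sd = 1)

# The additive model from the question
fm <- lm(su ~ sex + race + age, data = goutdata)

# One row per sex/race/age combination to predict on
grid <- expand.grid(
  sex  = levels(goutdata$sex),
  race = levels(goutdata$race),
  age  = seq(30, 70, by = 1)
)
grid$su_hat <- predict(fm, newdata = grid)

# Four parallel lines, one per sex/race combination
p <- ggplot(grid, aes(age, su_hat, colour = interaction(sex, race))) +
  geom_line() +
  labs(x = "Age", y = "Expected serum urate", colour = "Sex.Race")
```

Printing p draws the plot. With the real data, the simulation step would be replaced by the read.table() call from the question.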

You could probably use geom_smooth() to show regression lines?
dat <- read.table("https://raw.githubusercontent.com/gdlc/STT465/master/gout.txt",
header = TRUE, stringsAsFactors = FALSE)
library(tidyverse)
dat %>%
dplyr::mutate(sex = ifelse(sex == "M", "Male", "Female"),
race = ifelse(race == "W", "Caucasian", "African-American"),
group = paste(race, sex, sep = ", ")
) %>%
ggplot(aes(x = age, y = su, colour = group)) +
geom_smooth(method = "lm", se = FALSE, show.legend = FALSE) +
geom_point(show.legend = F, position = "jitter", alpha = .5, pch = 16) +
facet_wrap(~group) +
ggthemes::theme_few() +
labs(x = "Age", y = "Expected serum urate level")

Related

Change key glyphs in ggplot2

I am having two issues with my ggplot. I am trying to plot two continuous variables in a scatterplot, stratified by a categorical variable (4 levels).
The first one is that the plot produces the letter "a" in the legend instead of a line. I know that there is something off with the glyph but I cannot figure it out.
The second issue is that when I use label.y = 10 in the stat_cor() function from the ggpubr package the correlations of the 4 groups collapse all together. label.x works fine.
My code is the following:
library(ggplot2)
library(ggpubr)
df <- data.frame(categories = as.factor(c(1,2,3,3,4,1,4,2,2,1,2,3,4)),
var1 = c(1,11,13,2,5,5,4,10,7,1,2,4,5),
var2 = c(2,10,12,15,14,1,3,7,11,5,6,7,5))
b <- ggplot(df, aes(x = var1, y = var2, colour = categories)) +
geom_point()+
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
scale_color_manual(values = c("#feedde", "#fdbe85", "#fd8d3c", "#d94701")) +
theme_bw() +
ggpubr::stat_cor(aes(color = categories), label.x = 3, label.y = 10) +
ggtitle("Title") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Var1") + ylab("Var2") +
labs(color = "Categories")
b
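One possible explanation, offered as a hedged sketch rather than a confirmed fix: the "a" in the legend is the key that ggplot2 draws for a text layer, and stat_cor() adds one. A single label.y places every group's label at the same height. I believe stat_cor() also accepts show.legend = FALSE and a vector for label.y, but the sketch below does not rely on that. It swaps in plain ggplot2: the correlations are computed in base R and drawn with geom_text(). The helper data frame cors and its label positions are assumptions made for illustration.

```r
library(ggplot2)

df <- data.frame(categories = as.factor(c(1,2,3,3,4,1,4,2,2,1,2,3,4)),
                 var1 = c(1,11,13,2,5,5,4,10,7,1,2,4,5),
                 var2 = c(2,10,12,15,14,1,3,7,11,5,6,7,5))

# Pearson correlation per category, computed in base R
cors <- do.call(rbind, lapply(split(df, df$categories), function(g) {
  data.frame(categories = g$categories[1], r = cor(g$var1, g$var2))
}))
cors$label <- sprintf("R = %.2f", cors$r)

# One y position per group so the labels do not overlap
cors$x <- 3
cors$y <- seq(15, 12, length.out = nrow(cors))

b <- ggplot(df, aes(x = var1, y = var2, colour = categories)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, fullrange = TRUE) +
  # show.legend = FALSE keeps the text layer out of the legend key
  geom_text(data = cors, aes(x = x, y = y, label = label),
            hjust = 0, show.legend = FALSE) +
  theme_bw() +
  labs(x = "Var1", y = "Var2", colour = "Categories")
```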

How to plot regression lines for multiple y

I am trying to plot regression lines for my three y variables against my one x variable.
library(ggplot2)
library(tidyverse)
Sika_deer<-read.csv("C:/Users/Lau/Desktop/Sikadeer.csv", sep = ";",header = T)
Plot<-ggplot(Sika_deer, aes(x=Year)) +
geom_point(aes(y=Females, color="Females")) +
geom_point(aes(y=Young, color="Youngs")) +
geom_point(aes(y=Males, color="Males")
)+ facet_wrap(~District, scales = ("free_y"))+
labs(x = "Number of culled animals", y = "Year)")
I tried using geom_smooth(), but I keep getting this error:
geom_smooth() using formula 'y ~ x'
Error: stat_smooth requires the following missing aesthetics: y
I am not sure what I am doing wrong here...
Thank you all for the attention and help!
p.s sorry if I made some mistakes posting my question, it's my first time asking for help on an online platform.
This is my plot

It would be best to reshape your data into long format to avoid the need for repeated calls to geom_point and geom_smooth, but the following should work for you:
library(ggplot2)
library(tidyverse)
Sika_deer <- read.csv("C:/Users/Lau/Desktop/Sikadeer.csv", sep = ";", header = TRUE)
Plot <- ggplot(Sika_deer, aes(x = Year)) +
geom_point(aes(y = Females, color = "Females")) +
geom_point(aes(y = Young, color = "Youngs")) +
geom_point(aes(y = Males, color = "Males")) +
geom_smooth(aes(y = Females, color = "Females"), se = FALSE) +
geom_smooth(aes(y = Young, color = "Youngs"), se = FALSE) +
geom_smooth(aes(y = Males, color = "Males"), se = FALSE) +
facet_wrap(~District, scales = "free_y") +
labs(x = "Year", y = "Number of culled animals")
If this does not work for you, please edit your question to include a sample of your data by typing dput(Sika_deer) into the console and pasting the result into your question.

I agree about transforming to long data before trying this plot, then you can pass the color variable into aes at the top and the subsequent layers will just inherit it. Since I don't have your data to confirm an answer, I'm showing an example with the iris dataset but it will be the same with yours.
library(tidyverse)
iris %>%
pivot_longer(-c(Species, Sepal.Length), names_to = "attribute") %>%
ggplot(aes(x = Sepal.Length, y = value, color = Species)) +
geom_point() +
geom_smooth() +
facet_wrap(facets = "attribute", scales = "free_y")
With your data I think you could try:
Sika_deer %>%
pivot_longer(-c(District, Year), names_to = "category") %>%
ggplot(aes(x = Year, y = value, color = category)) +
geom_point() +
geom_smooth() +
facet_wrap(facets = "District", scales = "free_y") +
labs(x = "Year", y = "Number of culled animals")
But if you share the output of dput(Sika_deer) in your question, we can be sure.
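For context, the error in the question arises because the top-level aes() only sets x, so geom_smooth() has no y unless it is given one. With long data, y is set once and every layer inherits it. Here is a runnable sketch on simulated data: the counts and the district names "North" and "South" are made up, and only the column names come from the question. It uses a base-R stand-in for pivot_longer() and method = "lm" to get straight regression lines.

```r
library(ggplot2)

# Simulated stand-in for Sikadeer.csv (hypothetical values)
set.seed(2)
Sika_deer <- expand.grid(Year = 2010:2020, District = c("North", "South"))
Sika_deer$Females <- rpois(nrow(Sika_deer), 40)
Sika_deer$Young   <- rpois(nrow(Sika_deer), 25)
Sika_deer$Males   <- rpois(nrow(Sika_deer), 30)

# Base-R equivalent of pivot_longer(): stack the three count columns
cats <- c("Females", "Young", "Males")
long <- data.frame(
  District = rep(Sika_deer$District, times = length(cats)),
  Year     = rep(Sika_deer$Year, times = length(cats)),
  category = rep(cats, each = nrow(Sika_deer)),
  value    = unlist(Sika_deer[cats], use.names = FALSE)
)

# y is mapped once at the top, so geom_smooth() inherits it
p <- ggplot(long, aes(x = Year, y = value, colour = category)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~District, scales = "free_y") +
  labs(x = "Year", y = "Number of culled animals")
```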

How can I add a layer showing the distribution on a conditional variable in a probability plot in R studio?

I am fitting the following regression:
model <- glm(DV ~ conditions + predictor + conditions*predictor, family = binomial(link = "probit"), data = d)
I use 'sjPlot' (and 'ggplot2') to make the following plot:
library("ggplot2")
library("sjPlot")
plot_model(model, type = "pred", terms = c("predictor", "conditions")) +
xlab("Xlab") +
ylab("Ylab") +
theme_minimal() +
ggtitle("Title")
But I can't figure out how to add a layer showing the distribution on the conditioning variable like I can easily do by setting "hist = TRUE" using 'interplot':
library("interplot")
interplot(model, var1 = "conditions", var2 = "predictor", hist = TRUE) +
xlab("Xlab") +
ylab("Ylab") +
theme_minimal() +
ggtitle("Title")
I have tried a bunch of layers using just ggplot as well, with no success:
ggplot(d, aes(x=predictor, y=DV, color=conditions))+
geom_smooth(method = "glm") +
xlab("Xlab") +
ylab("Ylab") +
theme_minimal() +
ggtitle("Title")
I am open to any suggestions!

I've obviously had to try to recreate your data to get this to work, so it won't be faithful to your original, but if we assume your plot is something like this:
p <- plot_model(model, type = "pred", terms = c("predictor [all]", "conditions")) +
xlab("Xlab") +
ylab("Ylab") +
theme_minimal() +
ggtitle("Title")
p
Then we can add a histogram of the predictor variable like this:
p + geom_histogram(data = d, inherit.aes = FALSE,
aes(x = predictor, y = after_stat(count) / 1000), # after_stat() replaces the deprecated ..count.. notation
fill = "gray85", colour = "gray50", alpha = 0.3)
And if you wanted to do the whole thing in ggplot, you need to remember to tell geom_smooth that your glm is a probit model, otherwise it will just fit a normal linear regression. I've copied the color palette over too for this example, though note the smoothing lines for the groups start at their lowest x value rather than extrapolating back to 0.
ggplot(d, aes(x = predictor, y = DV, color = conditions))+
geom_smooth(method = "glm", aes(fill = conditions),
method.args = list(family = binomial(link = "probit")),
alpha = 0.15, size = 0.5) +
xlab("Xlab") +
scale_fill_manual(values = c("#e41a1c", "#377eb8")) +
scale_colour_manual(values = c("#e41a1c", "#377eb8")) +
ylab("Ylab") +
theme_minimal() +
ggtitle("Title") +
geom_histogram(aes(y = after_stat(count) / 1000), # after_stat() replaces the deprecated ..count.. notation
fill = "gray85", colour = "gray50", alpha = 0.3)
Data
set.seed(69)
n_each <- 500
predictor <- rgamma(2 * n_each, 2.5, 3)
predictor <- 1 - predictor/max(predictor)
log_odds <- c((1 - predictor[1:n_each]) * 5 - 3.605,
predictor[n_each + 1:n_each] * 0 + 0.57)
DV <- rbinom(2 * n_each, 1, exp(log_odds)/(1 + exp(log_odds)))
conditions <- factor(rep(c("A", "B"), each = n_each)) # the original level labels were lost in the source; "A" and "B" are placeholders
d <- data.frame(DV, predictor, conditions)
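As a cross-check on what the prediction plot shows, the fitted probabilities can also be computed directly with predict() and drawn as lines. The sketch below is not the sjPlot implementation. It uses its own simulated data with hypothetical condition labels "A" and "B", and it rescales the histogram relative to its tallest bar instead of dividing the counts by 1000, so the bars occupy the bottom quarter of the probability axis.

```r
library(ggplot2)

# Simulated data (hypothetical; the labels "A" and "B" are placeholders)
set.seed(10)
n <- 400
d <- data.frame(
  predictor  = runif(n),
  conditions = factor(rep(c("A", "B"), each = n / 2))
)
# True probit relationship, used only to simulate the outcome
eta <- -1 + 2 * d$predictor * (d$conditions == "A")
d$DV <- rbinom(n, 1, pnorm(eta))

model <- glm(DV ~ conditions * predictor,
             family = binomial(link = "probit"), data = d)

# Predicted probabilities on a grid of predictor values per condition
grid <- expand.grid(predictor  = seq(0, 1, length.out = 50),
                    conditions = levels(d$conditions))
grid$prob <- predict(model, newdata = grid, type = "response")

p <- ggplot(grid, aes(predictor, prob, colour = conditions)) +
  geom_line() +
  # Histogram of the predictor, scaled so the tallest bar reaches 0.25
  geom_histogram(data = d, inherit.aes = FALSE, bins = 30,
                 aes(x = predictor, y = after_stat(0.25 * count / max(count))),
                 fill = "gray85", colour = "gray50", alpha = 0.3)
```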

Adjusting rugplot in ggplot2

Below is the code for a graph I am making for an article I am working on. The plot shows the predicted probabilities along a range of values in my data set. Along the x-axis is a rug plot that shows the distribution of trade share values (code and an image of the graph are provided):
sitc8 <- ggplot() + geom_line(data=plotdat8, aes(x = lagsitc8100, y = PredictedProbabilityMean), size = 2, color="blue") +
geom_ribbon(data=plotdat8, aes(x = lagsitc8100, ymin = lowersd, ymax = uppersd),
fill = "grey50", alpha=.5) +
ylim(c(-0.75, 1.5)) +
geom_hline(yintercept=0) +
geom_rug(data=multi.sanctions.bust8.full@frame, aes(x=lagsitc8100), col="black", size=1.0, sides="b") +
xlab("SITC 8 Trade Share") +
ylab("Probability of Sanctions Busting") +
theme(panel.grid.major = element_line(colour = "gray", linetype = "dotted"), panel.grid.minor =
element_blank(), panel.background = element_blank())
My question is: is it possible to change the color of the lines of the rugplot of trade share in which the event I am modeling occurs? In other words, I would like to add red lines or red dots along those values of trade share when my event = 1.
Is this possible?

Sure. You'd just have to add a color argument within an aes() function call within geom_rug().
Here's some code to create a dummy data frame.
library(tidyverse)
set.seed(42)
dummy_data <- tibble(x_var = rnorm(100),
y_var = abs(rnorm(100)) * x_var) %>%
rownames_to_column(var = "temp_row") %>%
mutate(color_id = if_else(as.numeric(temp_row) <= 50,
"Type A",
"Type B"))
And here's a ggplot call where the color for geom_rug() is mapped to a character column named color_id:
ggplot(data = dummy_data, mapping = aes(x = x_var, y = y_var)) +
geom_smooth(method = "lm") +
geom_rug(mapping = aes(color = color_id), sides = "b")
Update:
Following OP's comment, here's an updated version. If it's a numeric vector of 0s and 1s, you have to tell ggplot to treat it as a dichotomous variable. You can do that by wrapping it in a call to factor() for instance.
We can set the colors manually with scale_color_manual(). The changes to the code are:
color_id is now a vector of 0s and 1s.
The color is now mapped to factor(color_id).
The color scale is set with scale_color_manual().
library(tidyverse)
set.seed(42)
dummy_data <- tibble(x_var = rnorm(100),
y_var = abs(rnorm(100)) * x_var) %>%
rownames_to_column(var = "temp_row") %>%
mutate(color_id = if_else(as.numeric(temp_row) <= 50,
0,
1))
ggplot(data = dummy_data, mapping = aes(x = x_var, y = y_var)) +
geom_smooth(method = "lm") +
geom_rug(mapping = aes(color = factor(color_id)), sides = "b") +
scale_color_manual(values = c("black", "red")) +
labs(color = "This takes two values")
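In the setup from the question, the fitted line and the rug come from two different data frames. A hedged sketch of that layout follows, using hypothetical stand-ins: obs plays the role of the model frame, pred the role of plotdat8, and event is an assumed 0/1 column. The red rug is added last so its ticks are drawn on top of the black ones.

```r
library(ggplot2)

# Hypothetical stand-ins for the model frame (obs) and prediction data (pred)
set.seed(7)
obs <- data.frame(share = runif(150, 0, 100))
obs$event <- rbinom(150, 1, 0.2)
pred <- data.frame(share = seq(0, 100, by = 5))
pred$prob <- plogis(-2 + 0.03 * pred$share)

# Rows where the modeled event occurred
events <- subset(obs, event == 1)

p <- ggplot() +
  geom_line(data = pred, aes(x = share, y = prob), colour = "blue") +
  geom_rug(data = obs, aes(x = share), colour = "black", sides = "b") +
  # Drawn last so the red ticks sit on top of the black ones
  geom_rug(data = events, aes(x = share), colour = "red", sides = "b")
```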

Definitely possible. Here's an example using iris, and a dynamic condition in the rug. You could also do two rugs, if you chose.
library(tidyverse)
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_rug(aes(color = Petal.Length > 3), sides = "b")
# Second example, output not shown
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_rug(data = subset(iris, Petal.Length > 3), color = "black", sides = "b") +
geom_rug(data = subset(iris, Petal.Length <= 3), color = "red", sides = "b")

How to plot three or more variables in a single scatterplot with automated equations?

I want to plot two variables from three different data frames in one scatterplot and also add the equation of each linear relationship automatically. I am using the code below, but I have two problems:
The plot only shows points for the same values, not the whole range (e.g. df1 = 700 values, df2 = 350 values, df3 = 450 values). What is the role of omitting the NAs? I tried it both ways and still get the same plot.
I can only add the equations as text, which means running lm() and then adding the relationship to the plot manually. I need to do that automatically.
The code that I am using is:
ggplot(df1, aes(x=noxppb, y=OX, colour = "red")) +
geom_point(colour = "red", shape=2) + # Use hollow circles
geom_smooth(method=lm, se = FALSE) +
geom_point(data = df1, aes(x=noxppb, y=OX)) +
geom_point(colour = "blue", shape=3) +
geom_smooth(method = lm, se = F, colour = "blue", data = df2, aes(x=noxppb, y=OX)) +
geom_point(colour = "green", shape=4) +
geom_smooth(method = lm, se = F, colour = "green", data = df3, aes(x=noxppb, y=OX))
I get the following image:
However, I need something similar to this:

Try this:
d <- plyr::mdply(data.frame(a=c(1,2,3), b=c(-1,0,1)),
function(a,b) data.frame(x=seq(0,10), y=jitter(a*seq(0,10)+b)))
equationise = function(d, ...){
m = lm(y ~ x, d)
eq <- substitute(italic(y) == a + b %.% italic(x),
list(a = format(coef(m)[1], ...),
b = format(coef(m)[2], ...)))
data.frame(x = Inf, y = d$y[nrow(d)],
label = as.character(as.expression(eq)),
stringsAsFactors = FALSE)
}
eqs <- plyr::ddply(d, "a", equationise, digits = 2)
ggplot(d, aes(x=x, y=y, colour = factor(a))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_label(data=eqs, aes(label = label), parse=TRUE, hjust=1)
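The first problem in the question (three data frames of different lengths) can be handled by stacking them with an identifying column before plotting, so each group keeps all of its rows. A minimal sketch without plyr follows. It assumes simulated data: make_df(), the site column, and all numeric values are hypothetical, and only the column names noxppb and OX and the row counts come from the question. The equations are plain-text labels rather than parsed plotmath expressions.

```r
library(ggplot2)

# Simulated stand-ins for df1, df2, df3 (hypothetical values)
set.seed(3)
make_df <- function(n, slope, intercept) {
  x <- runif(n, 0, 100)
  data.frame(noxppb = x, OX = intercept + slope * x + rnorm(n, sd = 5))
}
df1 <- make_df(700, 0.5, 10)
df2 <- make_df(350, 0.3, 20)
df3 <- make_df(450, 0.8, 5)

# Stack the data frames and record where each row came from
all_df <- rbind(cbind(df1, site = "df1"),
                cbind(df2, site = "df2"),
                cbind(df3, site = "df3"))

# Fit one lm() per group and build an equation label from its coefficients
eqs <- do.call(rbind, lapply(split(all_df, all_df$site), function(g) {
  m <- lm(OX ~ noxppb, data = g)
  data.frame(site  = g$site[1],
             label = sprintf("y = %.2f %+.2f x", coef(m)[1], coef(m)[2]))
}))

# Stack the labels in the top-left corner of the plot
y_rng <- range(all_df$OX)
eqs$x <- min(all_df$noxppb)
eqs$y <- y_rng[2] - (seq_len(nrow(eqs)) - 1) * 0.08 * diff(y_rng)

p <- ggplot(all_df, aes(noxppb, OX, colour = site)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  geom_text(data = eqs, aes(x = x, y = y, label = label),
            hjust = 0, show.legend = FALSE)
```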
