Different objects are not showing up on my ggplot2 - r

I'm studying the returns to college admission for marginal students, and I'm trying to make a ggplot2 plot of the following data: the average salaries of students who did or did not finish their master's in medicine, and the average 'GPA' (foreign equivalent) distance to the 'acceptance score':
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0,2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
I have to do a Regression Discontinuity Design (RDD). As far as I understand it, to run the regression I have to recode DistanceGrades as numeric, so I created a variable z:
z <- -5:4
where 0 is the cutoff (i.e. 0 corresponds to "0.0" in DistanceGrades).
I then make a dataframe
df <- data.frame(z,SalaryAfter)
Now my attempt to create the plot gets a bit messy (I use the package 'fpp3', but I suppose only ggplot2 and maybe dplyr are actually needed):
df %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0))) %>%
ggplot(aes(x = z, y = SalaryAfter, color = D)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
xlim(-6,5) +
xlab("Distance to acceptance score") +
labs(title = "Figur 1.1", subtitle = "Salary for every distance to the acceptance score")
Which plots:
What I'm trying to do is, first, split the data with a dummy variable D = 1 if z >= 0 and D = 0 if z < 0. Then I plot it with a linear regression and a vertical line at z = 0. Lastly, I add the title and subtitle. Now I have two problems:
The x-axis displays -5, -2.5, ..., but I would like it to show all the integers; the non-integer values have no relation to the z variable, which is discrete. I have tried several methods to fix this (theme(panel.grid = ...), scale_x_discrete and many more; I can't remember them all), but the outcomes were all pretty similar: the x-axis labels are removed completely, and sometimes even the axis title disappears.
I would like the regression line and its confidence band for the first part of the data to extend to z = 0.
When I try to solve both of these problems I again get similar results: most of the things I try produce no error message when I run the code, but they either do nothing to my plot or remove some of the existing elements, which leaves me with more questions. I suppose the error is caused by some of the elements not working together, but I have no idea which.

Try this:
library(tidyverse)
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0,2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
z <- -5:4
df <- data.frame(z,SalaryAfter) %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0)))
# Fit a lm model for the left part of the panel
fit_data <- lm(SalaryAfter~z, data = filter(df, z <= -0.1)) %>%
predict(., newdata = data.frame(z = seq(-5, 0, 0.1)), interval = "confidence") %>%
as.data.frame() %>%
mutate(z = seq(-5, 0, 0.1), D = factor(0, levels = c(0, 1)))
# Plot
ggplot(mapping = aes(color = D)) +
geom_ribbon(data = filter(fit_data, z <= 0 & -1 <= z),
aes(x = z, ymin = lwr, ymax = upr),
fill = "grey70", color = "transparent", alpha = 0.5) +
geom_line(data = fit_data, aes(x = z, y = fit), size = 1) +
geom_point(data = df, aes(x = z, y = SalaryAfter), stat = "identity") +
geom_smooth(data = df, aes(x = z, y = SalaryAfter), method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
scale_x_continuous(limits = c(-6, 5), breaks = -6:5) +
xlab("Distance to acceptance score") +
labs(title = "Figure 1.1", subtitle = "Salary for every distance to the acceptance score")
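For the RDD itself, the quantity of interest is usually the jump at the cutoff. As a minimal base R sketch, assuming that a separate linear fit on each side of z = 0 is acceptable for illustration (the names left_fit, right_fit and jump are introduced here and are not part of the answer above), both fits can be evaluated at z = 0:

```r
# Data from the question
SalaryAfter <- c(287.780, 305.181, 323.468, 339.082, 344.738, 370.475, 373.257,
                 372.682, 388.939, 386.994)
z <- -5:4
df <- data.frame(z, SalaryAfter)

# One linear fit per side of the cutoff (illustrative assumption)
left_fit  <- lm(SalaryAfter ~ z, data = subset(df, z < 0))
right_fit <- lm(SalaryAfter ~ z, data = subset(df, z >= 0))

# Evaluate both fits at the cutoff z = 0
at_cutoff  <- data.frame(z = 0)
left_at_0  <- predict(left_fit, newdata = at_cutoff)
right_at_0 <- predict(right_fit, newdata = at_cutoff)

# Estimated discontinuity: right-hand prediction minus left-hand prediction
jump <- unname(right_at_0 - left_at_0)
jump
```

The left-hand prediction is an extrapolation, because the left fit only sees z from -5 to -1. This is the same extension to z = 0 that the plotted line performs.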

Related

How to use position_dodge with 2 levels of groupings in ggplot2

I would like to group a series of lines by 2 factors using group = interaction in ggplot. Here is some sample code:
set.seed(123)
N <- 18
means <- rnorm(N,0,1)
ses <- rexp(N,2)
upper<- means+qnorm(0.975)*ses
lower<- means+qnorm(0.025)*ses
fruit <- rep(c("Apples","Bananas","Pears"), each=6)
size <- rep(rep(c("Small","Medium","Big"), each=2),3)
GMO <- rep(c("Yes","No"), 9)
d<- data.frame(means,upper,lower,fruit,size,GMO)
ggplot(data=d,
aes(x = fruit,y = means, ymin = lower, ymax = upper, col=size,linetype=GMO,group=interaction(GMO, size)))+
geom_hline(aes(fill=size),yintercept =1, linetype=2)+
xlab('labels')+ ylab("Parameter estimates (95% Confidence Interval)")+
geom_pointrange(position=position_dodge(width = 0.6)) +
scale_x_discrete(name="Fruits")+
coord_flip()-> fplot
dev.new()
fplot
Here's a link to the resulting graph: https://i.stack.imgur.com/5YF4F.png
I would like to bring the same-coloured lines for each of the three groups closer together. In other words, I would like the lines to cluster not only by the 'Fruit' variable but also by the 'Size' variable within each fruit. position_dodge seems to work for only one of the interacting groups.
Thanks for your advice.
As far as I know that is not possible with position_dodge: it dodges according to the categories of the group aesthetic, and it does not matter whether you map one variable to group or an interaction of two or more. The groups are simply placed equidistant from one another.
One option to achieve your desired result would be to use the "facets that don't look like facets" trick which means faceting by fruit, mapping size on x and afterwards using theme options to get rid of the facet look plus some tweaking of the x scale:
set.seed(123)
N <- 18
means <- rnorm(N, 0, 1)
ses <- rexp(N, 2)
upper <- means + qnorm(0.975) * ses
lower <- means + qnorm(0.025) * ses
fruit <- rep(c("Apples", "Bananas", "Pears"), each = 6)
size <- rep(rep(c("Small", "Medium", "Big"), each = 2), 3)
GMO <- rep(c("Yes", "No"), 9)
d <- data.frame(means, upper, lower, fruit, size, GMO)
library(ggplot2)
ggplot(data = d, aes(x = size, y = means, ymin = lower, ymax = upper, col = size, linetype = GMO, group = GMO)) +
geom_hline(yintercept = 1, linetype = 2) +
xlab("labels") +
ylab("Parameter estimates (95% Confidence Interval)") +
geom_pointrange(position = position_dodge(width = 0.6)) +
scale_x_discrete(name = "Fruits", breaks = "Medium", labels = NULL, expand = c(0, 1)) +
coord_flip() +
facet_grid(fruit ~ ., switch = "y") +
theme(strip.placement = "outside",
strip.background.y = element_blank(),
strip.text.y.left = element_text(angle = 0),
panel.spacing.y = unit(0, "pt"))
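Another way around the equidistant dodging is to skip position_dodge and compute the positions by hand on a continuous axis. The following is only a sketch: the offset values and the names size_offset, gmo_offset and xpos are assumptions chosen for illustration, not something ggplot2 provides.

```r
library(ggplot2)

# Same simulated data as in the question
set.seed(123)
N <- 18
means <- rnorm(N, 0, 1)
ses <- rexp(N, 2)
upper <- means + qnorm(0.975) * ses
lower <- means + qnorm(0.025) * ses
fruit <- rep(c("Apples", "Bananas", "Pears"), each = 6)
size <- rep(rep(c("Small", "Medium", "Big"), each = 2), 3)
GMO <- rep(c("Yes", "No"), 9)
d <- data.frame(means, upper, lower, fruit, size, GMO)

# Hypothetical offsets: a large step between sizes, a small step between GMO levels
size_offset <- c(Small = -0.25, Medium = 0, Big = 0.25)
gmo_offset  <- c(No = -0.04, Yes = 0.04)

# Numeric position = fruit index + size offset + GMO offset
d$fruit <- factor(d$fruit)
d$xpos <- as.numeric(d$fruit) +
  unname(size_offset[as.character(d$size)]) +
  unname(gmo_offset[as.character(d$GMO)])

p <- ggplot(d, aes(x = xpos, y = means, ymin = lower, ymax = upper,
                   colour = size, linetype = GMO)) +
  geom_hline(yintercept = 1, linetype = 2) +
  geom_pointrange() +
  # Put the fruit names back on the now-continuous axis
  scale_x_continuous(name = "Fruits",
                     breaks = seq_along(levels(d$fruit)),
                     labels = levels(d$fruit)) +
  ylab("Parameter estimates (95% Confidence Interval)") +
  coord_flip()
p
```

Because the spacing is set explicitly, the two GMO lines of each size sit close together while the sizes stay clearly separated within each fruit.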
Maybe you want to facet_wrap your size variable:
set.seed(123)
N <- 18
means <- rnorm(N,0,1)
ses <- rexp(N,2)
upper<- means+qnorm(0.975)*ses
lower<- means+qnorm(0.025)*ses
fruit <- rep(c("Apples","Bananas","Pears"), each=6)
size <- rep(rep(c("Small","Medium","Big"), each=2),3)
GMO <- rep(c("Yes","No"), 9)
d<- data.frame(means,upper,lower,fruit,size,GMO)
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.1.2
ggplot(data=d,
aes(x = fruit,y = means, ymin = lower, ymax = upper, col=size,linetype=GMO,group=interaction(GMO, size)))+
geom_hline(aes(fill=size),yintercept =1, linetype=2)+
xlab('labels')+ ylab("Parameter estimates (95% Confidence Interval)")+
geom_pointrange(position=position_dodge(width = 0.6)) +
scale_x_discrete(name="Fruits")+
coord_flip() +
facet_wrap(~size)-> fplot
#> Warning: geom_hline(): Ignoring `mapping` because `yintercept` was provided.
fplot
Created on 2022-07-13 by the reprex package (v2.0.1)

Fill stat_ecdf with two different colors

For a class, I'm doing hypothesis testing with Bayesian models, and I want to make a nice ggplot graphic showing the two hypothesis regions in two different colours.
[Image: normal distribution]
I would like to fill region H1 with a different colour from region H0.
My code is:
#Param of normal distribution
param1 <- 1.74
param2 <- 0.000617
#Normal simulation
sim_posteriori <- data.frame(rnorm(1000, param1, sqrt(param2)), rep('Posteriori', 1000))
names(sim_posteriori) <- c('Datos', 'Grupo')
#Hypothesis contrast
# P(H0) -> mu <= 1.75
pnorm(1.75, param1, sqrt(param2))
# P(H1) -> mu > 1.75
1 - pnorm(1.75, param1, sqrt(param2))
#Plot
sim_posteriori %>% ggplot(aes(Datos)) +
stat_ecdf(fill = '#F2C14E95', geom = 'density') +
geom_vline(aes(xintercept = 1.75), lty = 2, size = 1) +
labs(title = 'Distribución posteriori y acumulada') +
xlab('Altura(en metros)') +
ylab('Densidad') +
theme_minimal() +
annotate('text', x = 1.735, y = 0.25, label = 'Región H1') +
annotate('text', x = 1.79, y = 0.25, label = 'Región H0')
If you find yourself wondering how to get ggplot to do a complex manipulation of your data with its various stat_ functions, you're probably approaching your problem in the wrong way. These functions exist to make it easy to carry out common simple transformations, but we need to remember that ggplot is a tool for plotting, not for wrangling data, so if the stat_ functions aren't quite what you are looking for, it's normally best to just prepare the data you actually want to plot, then plot it.
In this case it is pretty trivial to create your own ecdf in a data frame outside of ggplot, label which parts of it are above and below your threshold, then use geom_area to plot it:
h <- sort(sim_posteriori$Datos)
df <- data.frame(x = h, y = seq_along(h)/length(h), region = h > 1.75)
ggplot(df, aes(x, y, fill = region)) +
geom_area() +
geom_vline(aes(xintercept = 1.75), lty = 2, size = 1) +
scale_fill_manual(values = c('#F2C14E95', '#C14E4295'), guide = "none") +
labs(title = 'Distribución posteriori y acumulada',
x = 'Altura(en metros)', y = 'Densidad') +
theme_minimal() +
annotate('text', x = 1.735, y = 0.25, label = 'Región H1') +
annotate('text', x = 1.79, y = 0.25, label = 'Región H0')
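As a quick sanity check on the hand-built ecdf, it can be compared with base R's ecdf(), and the share of simulated values at or below the threshold can be compared with pnorm(). This is a sketch only; the seed and the names manual, builtin, p_h0_empirical and p_h0_exact are assumptions added here.

```r
# Parameters from the question; the seed is an assumption for reproducibility
set.seed(42)
param1 <- 1.74
param2 <- 0.000617
datos <- rnorm(1000, param1, sqrt(param2))

# Hand-built ecdf, as in the answer: sorted values and cumulative proportions
h <- sort(datos)
manual <- seq_along(h) / length(h)

# Base R's ecdf evaluated at the same points
builtin <- ecdf(datos)(h)
all.equal(manual, builtin)

# Empirical and exact probability of the region mu <= 1.75
p_h0_empirical <- mean(datos <= 1.75)
p_h0_exact <- pnorm(1.75, param1, sqrt(param2))
c(empirical = p_h0_empirical, exact = p_h0_exact)
```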

How can I add annotation in ggplotly animation?

I am creating an animated plotly graph for my assignment in R, where I compare several models fitted with various numbers of observations. I would like to add an annotation showing the RMSE of the current model, i.e. text that changes together with the slider. Is there an easy way to do that?
My dataset is stored on GitHub: data. It already contains a variable with the RMSE.
The base ggplot graphic is as follows:
library(tidyverse)
library(plotly)
p <- ggplot(values_predictions, aes(x = x)) +
geom_line(aes(y = preds_BLR, frame = n, colour = "BLR")) +
geom_line(aes(y = preds_RLS, frame = n, colour = "RLS")) +
geom_point(aes(x = x, y = target, frame = n, colour = "target"), alpha = 0.3) +
geom_line(aes(x = x, y = sin(2 * pi * x), colour = "sin(2*pi*x)"), alpha = 0.3) +
ggtitle("Comparison of performance") +
labs(y = "predictions and targets", colour = "colours")
This is converted to plotly, and I have added an animation to the Plotly graph:
plot <- ggplotly(p) %>%
animation_opts(easing = "linear",redraw = FALSE)
plot
Thanks!
You can add annotations to a ggplot graph using the annotate function: http://ggplot2.tidyverse.org/reference/annotate.html
df <- data.frame(x = rnorm(100, mean = 10), y = rnorm(100, mean = 10))
# Build model
fit <- lm(x ~ y, data = df)
# function finds RMSE
RMSE <- function(error) { sqrt(mean(error^2)) }
library(ggplot2)
p <- ggplot(df, aes(x, y)) +
geom_point()
p + annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 2,
label = paste("RMSE", RMSE(fit$residuals)) )
There seems to be a bit of a problem converting this annotation between ggplot and plotly. However, the following workaround can be used, adding the annotation on the plotly side instead:
library(plotly)
ggplotly(p) %>%
layout(annotations = list(x = 12, y = 13, text = paste("RMSE",
RMSE(fit$residuals)), showarrow = FALSE))
Here's an example of adding data dependent text using the built in iris dataset with correlation as text to ggplotly.
library(plotly)
library(ggplot2)
library(dplyr)
mydata = iris %>% rename(variable1=Sepal.Length, variable2= Sepal.Width)
shift_right = 0.1 # number from 0-1 where higher = more right
shift_down = 0.02 # number from 0-1 where higher = more down
p = ggplot(mydata, aes(variable1,variable2))+
annotate(geom = "text",
label = paste0("Cor = ",as.character(round(cor.test(mydata$variable1,mydata$variable2)$estimate,2))),
x = min(mydata$variable1)+abs(shift_right*(min(mydata$variable1)-max(mydata$variable1))),
y = max(mydata$variable2)-abs(shift_down*(min(mydata$variable2)-max(mydata$variable2))), size=4)+
geom_point()
ggplotly(p) %>% style(hoverinfo = "none", traces = 1) # remove hover on text
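For text that changes together with the slider, one option worth trying is a geom_text layer that carries the same frame aesthetic as the other layers. The sketch below rests on the assumption that ggplotly() handles frame for a text layer the same way it does for lines and points. The simulated data frame sim stands in for values_predictions, and the names sim, labels and rmse are hypothetical.

```r
library(ggplot2)

set.seed(1)
rmse <- function(error) sqrt(mean(error^2))

# Hypothetical stand-in for values_predictions: one block of rows per value of n
n_values <- c(5, 10, 20)
sim <- do.call(rbind, lapply(n_values, function(n) {
  x <- seq(0, 1, length.out = 25)
  data.frame(n = n, x = x,
             target = sin(2 * pi * x) + rnorm(25, sd = 0.2),
             pred = sin(2 * pi * x) + rnorm(25, sd = 1 / n))
}))

# One label per frame, holding the RMSE of that frame's predictions
labels <- do.call(rbind, lapply(split(sim, sim$n), function(d) {
  data.frame(n = d$n[1],
             label = paste("RMSE:", round(rmse(d$target - d$pred), 3)))
}))

# ggplot2 warns that 'frame' is an unknown aesthetic; ggplotly() is what uses it
p <- ggplot(sim, aes(x = x)) +
  geom_line(aes(y = pred, frame = n)) +
  geom_point(aes(y = target, frame = n), alpha = 0.3) +
  geom_text(data = labels, aes(x = 0.8, y = 1.5, label = label, frame = n))

# Convert only if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  plotly::ggplotly(p)
}
```

The idea is that each frame gets exactly one row in labels, so the text layer is animated along with the lines.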

R ggplot2::geom_density with a constant variable

I recently came across a problem with ggplot2::geom_density that I am not able to solve. I am trying to visualise the density of some variable and compare it to a constant. To plot the density, I am using ggplot2::geom_density. The variable for which I am plotting the density, however, happens to be a constant (this time):
df <- data.frame(matrix(1,ncol = 1, nrow = 100))
colnames(df) <- "dummy"
dfV <- data.frame(matrix(5,ncol = 1, nrow = 1))
colnames(dfV) <- "latent"
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.2, position = "identity") +
geom_vline(data = dfV, aes(xintercept = latent, color = 'ls'), size = 2)
This is OK and something I would expect. But, when I shift this distribution to the far right, I get a plot like this:
df <- data.frame(matrix(71,ncol = 1, nrow = 100))
colnames(df) <- "dummy"
dfV <- data.frame(matrix(75,ncol = 1, nrow = 1))
colnames(dfV) <- "latent"
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.2, position = "identity") +
geom_vline(data = dfV, aes(xintercept = latent, color = 'ls'), size = 2)
which probably means that the kernel estimation is still taking 0 as the centre of the distribution (right?).
Is there any way to circumvent this? I would like to see a plot like the one above, only with the centre of the kernel density at 71 and the vline at 75.
Thanks
Well I am not sure what the code does, but I suspect the geom_density primitive was not designed for a case where the values are all the same, and it is making some assumptions about the distribution that are not what you expect. Here is some code and a plot that sheds some light:
# Generate 10 data sets with 100 constant values from 0 to 90
# and then merge them into a single dataframe
dfs <- list()
for (i in 1:10){
v <- 10*(i-1)
dfs[[i]] <- data.frame(dummy=rep(v,100),facet=v)
}
df <- do.call(rbind,dfs)
# facet plot them
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.5, position = "identity") +
facet_wrap( ~ facet,ncol=5 )
Yielding:
So it is not doing what you thought it was, but it is also probably not doing what you want. You could of course make it "translation-invariant" (almost) by adding some noise like this for example:
set.seed(1234)
noise <- rnorm(100, 0, 1e-3)
dfs <- list()
for (i in 1:10){
v <- 10*(i-1)
dfs[[i]] <- data.frame(dummy=rep(v,100)+noise,facet=v)
}
df <- do.call(rbind,dfs)
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.5, position = "identity") +
facet_wrap( ~ facet,ncol=5 )
Yielding:
Note that the estimated density depends on the random noise that was added, so it will look a bit different on each run unless the seed is fixed with set.seed() before the noise is generated, as above. The density estimate itself is deterministic for a given data set.
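A likely explanation for the wide curve at 71 is the default bandwidth rule. geom_density uses bw = "nrd0" by default, and stats::bw.nrd0() falls back to abs(x[1]) as its scale when both the standard deviation and the IQR are zero, so for constant data the bandwidth grows with the value itself. A small sketch follows; the names bw_at_1 and bw_at_71 and the choice of reusing the bandwidth from the first plot are assumptions made for illustration.

```r
# Default bandwidth for constant data: proportional to the constant itself
bw_at_1  <- bw.nrd0(rep(1, 100))
bw_at_71 <- bw.nrd0(rep(71, 100))
c(bw_at_1 = bw_at_1, bw_at_71 = bw_at_71)
bw_at_71 / bw_at_1  # the ratio is 71, up to floating point error

library(ggplot2)

# Passing a fixed bandwidth keeps the kernel the same width wherever the data sit
df <- data.frame(dummy = rep(71, 100))
p <- ggplot(df, aes(x = dummy)) +
  geom_density(bw = bw_at_1, fill = "#FF6666", alpha = 0.2) +
  geom_vline(xintercept = 75) +
  expand_limits(x = 67)  # widen the x range to the left of 71
p
```

With an explicit bw the estimate no longer depends on the location of the constant, which avoids adding artificial noise.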

Bar plot of group means with lines of individual results overlaid

This is my first Stack Overflow post and I am a relatively new R user, so please go gently!
I have a data frame with three columns, a participant identifier, a condition (factor with 2 levels either Placebo or Experimental), and an outcome score.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
I would like to construct a bar plot with two bars showing the mean outcome score for each condition, with the standard deviation as an error bar. I would then like to overlay lines connecting the points for each participant's score in each condition, so that the plot displays the individual responses as well as the group means. If possible, I would also like to include an axis break.
I don't seem to be able to find any advice in other threads, apologies if I am repeating a question.
Many Thanks.
P.S. I realise that presenting data in this way will not be to everyone's taste. It is for a specific requirement!
This ought to work:
library(ggplot2)
library(dplyr)
dat.summ <- dat %>% group_by(Condition) %>%
summarize(mean.outcome = mean(Outcome),
sd.outcome = sd(Outcome))
ggplot(dat.summ, aes(x = Condition, y = mean.outcome)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean.outcome - sd.outcome,
ymax = mean.outcome + sd.outcome),
color = "dodgerblue", width = 0.3) +
geom_point(data = dat, aes(x = Condition, y = Outcome),
color = "firebrick", size = 1.2) +
geom_line(data = dat, aes(x = Condition, y = Outcome, group = ID),
color = "firebrick", size = 1.2, alpha = 0.5) +
scale_y_continuous(limits = c(0, max(dat$Outcome)))
Some people are better with ggplot's stat functions and arguments than I am and might do it differently. I prefer to just transform my data first.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
dat.w <- reshape(dat, direction = 'wide', idvar = 'ID', timevar = 'Condition')
means <- colMeans(dat.w[, 2:3])
sds <- apply(dat.w[, 2:3], 2, sd)
ci.l <- means - sds
ci.u <- means + sds
ci.width <- .25
bp <- barplot(means, ylim = c(0,20))
segments(bp, ci.l, bp, ci.u)
segments(bp - ci.width, ci.u, bp + ci.width, ci.u)
segments(bp - ci.width, ci.l, bp + ci.width, ci.l)
segments(x0 = bp[1], x1 = bp[2], y0 = dat.w[, 2], y1 = dat.w[, 3], col = 1:10)
points(c(rep(bp[1], 10), rep(bp[2], 10)), dat$Outcome, col = 1:10, pch = 19)
Here is a method using the transformations inside ggplot2 (fun replaces the deprecated fun.y argument in ggplot2 3.3.0 and later):
ggplot(dat) +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun="mean", geom="bar") +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun.data="mean_se", geom="errorbar", col="green", width=.8, size=2) +
geom_line(aes(x=Condition, y=Outcome, group=ID), col="red")
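The question asks for the standard deviation as the error bar, while mean_se draws the standard error. A possible variant, sketched here with a small helper named mean_sd (a name introduced for this example, not a ggplot2 function) and assuming ggplot2 3.3.0 or later for the fun argument:

```r
library(ggplot2)

# Data from the question
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo", 10), rep("Experimental", 10)),
                  Outcome = rnorm(20, 15, 2),
                  ID = factor(rep(1:10, 2)))

# Helper returning the mean and mean +/- one standard deviation,
# in the y / ymin / ymax layout that stat_summary(fun.data = ...) expects
mean_sd <- function(x) {
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

p <- ggplot(dat, aes(x = Condition, y = Outcome)) +
  stat_summary(fun = mean, geom = "bar", fill = "grey70") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 0.3) +
  geom_line(aes(group = ID), colour = "red") +
  geom_point(colour = "red")
p
```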

Resources