Is there a better representation which shows multiple lines in ggplot [closed] - r

Closed. This question is opinion-based. It is not currently accepting answers.
Closed last month.
I have plotted 100 lines, each colored by its score. It is very difficult to see any pattern in the figure because the lines overlap.
Is there a better representation that shows how the lines and scores are linked to one another?
I believe some kind of density plot could show the pattern.
library(tidyverse)
x <- rep(seq(0, 3.2, 0.01), times = 100)
score <- rep(1:100, each = 321)
y <- runif(32100) * score * 0.01 # one value per row: 321 x-values * 100 scores
df <- tibble(x = x,
score = score,
y = y)
ggplot(data = df,
aes(x = x,
y = y,
group = score,
color = score)) +
geom_line(linewidth = 0.15) + # `size` for lines is deprecated since ggplot2 3.4.0
theme_bw() +
theme(aspect.ratio = 0.5) +
# legend.position="none") +
scale_color_gradient(low = 'blue', high = 'yellow')

The sample data is simply too messy and complex to show in an unfiltered line plot. One option is to show a summary of each line via geom_smooth. Although you lose details in the data, it allows you to convey the message that you want the plot to show.
library(tidyverse)
x <- rep(seq(0, 3.2, 0.01), times = 100)
score <- rep(1:100, each = 321)
y = runif(32100) * score * 0.01
df <- tibble(x = x,
score = score,
y = y)
ggplot(data = df,
aes(x = x,
y = y,
group = score,
color = score)) +
geom_smooth(linewidth = 0.5, se = FALSE) +
theme_bw() +
theme(aspect.ratio = 0.5) +
scale_color_gradient(low = 'blue', high = 'yellow')

What about a heat map? You could make one by binning both x and y and then taking the average score in each x-y bin.
library(tidyverse)
x <- rep(seq(0, 3.2, 0.01), times = 100)
score <- rep(1:100, each = 321)
y = runif(32100) * score * 0.01
df <- tibble(x = x,
score = score,
y = y) %>%
mutate(x_cat = cut(x, breaks=11),
y_cat = cut(y, breaks=11)) %>%
group_by(x_cat, y_cat) %>%
summarise(score = mean(score),
x = median(range(x)),
y=median(range(y)))
#> `summarise()` has grouped output by 'x_cat'. You can override using the
#> `.groups` argument.
ggplot(df, aes(x=x_cat, y=y_cat, fill=score)) +
geom_tile() +
scale_fill_gradient(low = 'blue', high = 'yellow') +
scale_x_discrete(labels=sprintf("%.2f", sort(unique(df$x)))) +
scale_y_discrete(labels=sprintf("%.6f", sort(unique(df$y)))) +
theme_classic() +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(x="X", y="Y", fill="Average\nScore")
Created on 2023-01-19 by the reprex package (v2.0.1)

If the x pattern isn't too important, we could just focus on score and the average y for each. This shows that relationship more clearly.
library(dplyr)
library(ggplot2)
# df is the data frame built in the question
df |>
group_by(score) |>
summarize(avg_y = mean(y)) |>
ggplot(aes(score, avg_y)) +
geom_point()
Or perhaps there's another salient feature of each score line, like "average slope" or "spikiness" or "variability vs. linear regression." You could code that to color in this plot. More ideas for time series features here: https://github.com/tidyverts/feasts
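As a rough sketch of that last idea (not the answerer's own code), the per-line features could be computed with one small regression per score and then mapped to colour. The column names `slope` and `resid_sd` are made up for this illustration, and the seed is only there to make the run repeatable.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable
df <- tibble(
  x = rep(seq(0, 3.2, 0.01), times = 100),
  score = rep(1:100, each = 321),
  y = runif(32100) * score * 0.01
)

# One row per score line: mean level, fitted slope, and scatter around the fit
features <- df |>
  group_by(score) |>
  summarize(
    avg_y = mean(y),
    slope = coef(lm(y ~ x))[["x"]],
    resid_sd = sd(resid(lm(y ~ x))),
    .groups = "drop"
  )

# Same score-vs-average plot, coloured by the variability feature
p <- ggplot(features, aes(score, avg_y, color = resid_sd)) +
  geom_point()
p
```

Any other per-line summary can be swapped in for `resid_sd` in the same `summarize()` call.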

Related

Different objects are not showing up on my ggplot2

I'm studying the returns to college admission for marginal students, and I'm trying to make a ggplot2 plot of the following data: the average salaries of students who did or didn't finish their master's in medicine, and the average 'GPA' (foreign equivalent) distance to the 'acceptance score':
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0.2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
I have to do a Regression Discontinuity Design (RDD). To run the regression, as far as I understand it, I have to convert DistanceGrades to numeric, so I just created a variable z
z <- -5:4
where 0 is the cutoff (ie. 0 is equal to "0.0" in DistanceGrades).
I then make a dataframe
df <- data.frame(z,SalaryAfter)
Now my attempt to create the plot gets a bit messy (I use the package 'fpp3', but I suppose only ggplot2 and maybe dplyr are actually needed)
df %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0))) %>%
ggplot(aes(x = z, y = SalaryAfter, color = D)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
xlim(-6,5) +
xlab("Distance to acceptance score") +
labs(title = "Figur 1.1", subtitle = "Salary for every distance to the acceptance score")
Which plots:
What I'm trying to do is, first, split the data with a dummy variable, D=1 if z>=0 and D=0 if z<0. Then I plot it with a linear regression and a vertical line at z=0. Lastly, I add the title and subtitle. Now I have two problems:
1. The x axis displays -5, -2.5, ..., but I would like it to show all the integers; the non-integer values have no relation to the z variable, which is discrete. I have tried to fix this with several different methods (theme(panel.grid...), scale_x_discrete and many more; I can't remember them all), but the outcomes have all been pretty similar: the x-axis labels are removed completely so that no numbers are shown, and sometimes even the axis title disappears.
2. I would like the regression line and its confidence band for the first part of the data to extend to z=0.
When I try to solve both of these problems I again get similar results: most of the things I try produce no error message when I run the code, but they either do nothing to my plot or remove some of the existing elements, which leaves me full of questions. I suppose the error is caused by some of the elements not working together, but I have no idea which.
Try this:
library(tidyverse)
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0.2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
z <- -5:4
df <- data.frame(z,SalaryAfter) %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0)))
# Fit a lm model for the left part of the panel
fit_data <- lm(SalaryAfter~z, data = filter(df, z <= -0.1)) %>%
predict(., newdata = data.frame(z = seq(-5, 0, 0.1)), interval = "confidence") %>%
as.data.frame() %>%
mutate(z = seq(-5, 0, 0.1), D = factor(0, levels = c(0, 1)))
# Plot
ggplot(mapping = aes(color = D)) +
geom_ribbon(data = filter(fit_data, z <= 0 & -1 <= z),
aes(x = z, ymin = lwr, ymax = upr),
fill = "grey70", color = "transparent", alpha = 0.5) +
geom_line(data = fit_data, aes(x = z, y = fit), size = 1) +
geom_point(data = df, aes(x = z, y = SalaryAfter), stat = "identity") +
geom_smooth(data = df, aes(x = z, y = SalaryAfter), method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
scale_x_continuous(limits = c(-6, 5), breaks = -6:5) +
xlab("Distance to acceptance score") +
labs(title = "Figure 1.1", subtitle = "Salary for every distance to the acceptance score")
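The key step in the code above is fitting the model on the left-hand side only and then predicting on a grid that runs all the way to the cutoff. A minimal base-R sketch of just that step, using the data from the question (the names `left_fit`, `grid` and `pred` are chosen here for illustration), might look like this:

```r
z <- -5:4
SalaryAfter <- c(287.780, 305.181, 323.468, 339.082, 344.738, 370.475, 373.257,
                 372.682, 388.939, 386.994)
df <- data.frame(z, SalaryAfter)

# Fit only on observations left of the cutoff (z < 0)
left_fit <- lm(SalaryAfter ~ z, data = subset(df, z < 0))

# Predict on a grid that extends up to the cutoff itself
grid <- data.frame(z = seq(-5, 0, by = 0.1))
pred <- cbind(grid, predict(left_fit, newdata = grid, interval = "confidence"))

# Last row: fitted value and confidence band at the cutoff z = 0
tail(pred, 1)
```

The columns `fit`, `lwr` and `upr` are what `geom_line()` and `geom_ribbon()` consume in the answer's plot.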

Displaying geom_smooth() trend line from a specified x value

Suppose a dataset containing count data per multiple time periods and per multiple groups in the following format:
set.seed(123)
df <- data.frame(group = as.factor(rep(1:3, each = 50)),
week = rep(1:50, 3),
rate = c(round(700 - rnorm(50, 100, 10) - 1:50 * 2, 0),
round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0),
round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0)))
   group week rate
1      1    1  604
2      1    2  598
3      1    3  578
4      1    4  591
5      1    5  589
6      1    6  571
7      1    7  581
8      1    8  597
9      1    9  589
10     1   10  584
I'm interested in fitting a model-based trend line per group; however, I want this trend line to be displayed only from a certain x value. To visualize the trend line using all data points (requires ggplot2):
df %>%
ggplot(aes(x = week,
y = rate,
group = group,
lty = group)) +
geom_line() +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "quasipoisson"),
se = FALSE)
Or to fit a model based on a specific range of values (requires ggplot2 and dplyr):
df %>%
group_by(group) %>%
mutate(rate2 = ifelse(week < 35, NA, rate)) %>%
ggplot(aes(x = week,
y = rate,
group = group,
lty = group)) +
geom_line() +
geom_point() +
geom_smooth(aes(y = rate2),
method = "glm",
method.args = list(family = "quasipoisson"),
se = FALSE)
However, I cannot find a way to fit the models using all data but display the trend line only from a specific x value (let's say 35+). Thus, I essentially want the trend line as computed for plot one, but displayed according to the second plot, using ggplot2 and ideally only one pipeline.
I went to look at the after_stat function mentioned by @tjebo. See if the following works for you?
df %>%
ggplot(aes(x = week,
y = rate,
lty = group)) +
geom_line() +
geom_point() +
geom_smooth(method = "glm",
aes(group = after_stat(interaction(group, x > 35)),
colour = after_scale(alpha(colour, as.numeric(x > 35)))),
method.args = list(family = "quasipoisson"),
se = FALSE)
This works by splitting the points of each fitted line into two groups, those in the x <= 35 region and those in the x > 35 region (a single line cannot vary in colour along its length), and giving each new group its own colour transparency. As a result, only the line segments in the x > 35 region are visible.
When used, the code triggers a warning that the after_scale modification isn't applied to the legend. I don't think that's a problem though, since we don't need it to appear in the legend anyway.
If you can tolerate a warning, you can solve this with 1 line difference from the example code using stage().
library(tidyverse)
set.seed(123)
df <- data.frame(group = as.factor(rep(1:3, each = 50)),
week = rep(1:50, 3),
rate = c(round(700 - rnorm(50, 100, 10) - 1:50 * 2, 0),
round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0),
round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0)))
df %>%
ggplot(aes(x = week,
y = rate,
group = group,
lty = group)) +
geom_line() +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "quasipoisson"),
aes(x = stage(week, after_stat = ifelse(x > 35, x, NA))),
se = FALSE)
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 165 rows containing missing values (geom_smooth).
One way to do this is to construct the fitted values outside of ggplot so you have control over them:
df$fit <- glm(rate ~ week + group, data = df, family = "quasipoisson")$fitted.values
library(dplyr)
library(ggplot2)
ggplot(df, aes(x = week, group = group, lty = group)) +
geom_line(aes(y = rate)) +
geom_point(aes(y = rate)) +
geom_line(data = df %>% filter(week >= 35), aes(y = fit), color = "blue", size = 1.25)
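Note that the single model `rate ~ week + group` above is not quite the same as what `geom_smooth()` does, which is to fit one model per group. A hedged sketch of the per-group variant, assuming the same simulated data (the names `fits` and `trend` are introduced here for illustration):

```r
library(dplyr)
library(ggplot2)

set.seed(123)
df <- data.frame(group = as.factor(rep(1:3, each = 50)),
                 week = rep(1:50, 3),
                 rate = c(round(700 - rnorm(50, 100, 10) - 1:50 * 2, 0),
                          round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0),
                          round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0)))

# Fit one quasi-Poisson GLM per group on ALL weeks, keep fitted values
fits <- df %>%
  group_by(group) %>%
  mutate(fit = fitted(glm(rate ~ week, family = quasipoisson()))) %>%
  ungroup()

# Only the part of the trend that should be displayed
trend <- filter(fits, week >= 35)

p <- ggplot(fits, aes(x = week, group = group, lty = group)) +
  geom_line(aes(y = rate)) +
  geom_point(aes(y = rate)) +
  geom_line(data = trend, aes(y = fit), color = "blue")
p
```

Because the fitting happens before plotting, the displayed range can be changed by editing the `filter()` condition alone.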
I am not sure it is generally correct to use a linear model on a time series. The whole point of time series is that they require specific statistics because of their expected autocorrelation. You might want something like a rolling average model instead.
I also suspect that your visualisation might be quite confusing and, more dangerously, misleading.
That said, it is an interesting problem. I thought the new after_stat might somehow help, but I couldn't get it working.
So, here is a quick hack: change the order of your geoms and draw a rectangle in between. I am cheekily using a different theme, but if you really want to use theme_grey(), you can fake the axis lines as well.
library(tidyverse)
set.seed(123)
df <- data.frame(group = as.factor(rep(1:3, each = 50)),
week = rep(1:50, 3),
rate = c(round(700 - rnorm(50, 100, 10) - 1:50 * 2, 0),
round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0),
round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0)))
df %>%
ggplot(aes(x = week, y = rate, group = group, lty = group)) +
stat_smooth(se = FALSE) +
geom_rect(xmin = -Inf, xmax = 35, ymin = -Inf, ymax = Inf,
fill = "white") +
geom_line() +
geom_point() +
theme_classic()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2021-02-09 by the reprex package (v1.0.0)
P.S. I've removed a few of the unnecessary bits in the code to reproduce this, like the model specs.
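The rolling-average idea raised at the start of this answer could look roughly like the following in base R. This is only a sketch: the window width `k = 5` and the column name `roll` are arbitrary choices made here, not something from the original answer.

```r
set.seed(123)
df <- data.frame(group = as.factor(rep(1:3, each = 50)),
                 week = rep(1:50, 3),
                 rate = c(round(700 - rnorm(50, 100, 10) - 1:50 * 2, 0),
                          round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0),
                          round(1000 - rnorm(50, 200, 10) - 1:50 * 2, 0)))

k <- 5  # assumed window width (odd, so the window is centred)

# Centred moving average computed separately within each group;
# stats::filter is spelled out because dplyr masks filter() when loaded
df$roll <- ave(df$rate, df$group, FUN = function(v) {
  as.numeric(stats::filter(v, rep(1 / k, k), sides = 2))
})

head(df, 6)
```

The first and last two weeks of each group are `NA` because a centred window of width 5 is incomplete there.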
You could use ggplot_build to get the structure of the plot:
p <- ggplot(df, aes(x = week,
y = rate,
group = group,
lty = group)) +
geom_line() +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "quasipoisson"),
se = FALSE)
p_build <- ggplot_build(p)
You could then modify the internal data, here the third element of the data list (geom_smooth):
p_build$data[[3]]$x <- sapply(p_build$data[[3]]$x,function(x) {ifelse(x<35,NA,x)})
and use ggplot_gtable to regenerate the plot (the glm fit is still computed on the whole dataset):
plot(ggplot_gtable(p_build))

Comparing mean values by group between several variables

I am trying to reproduce a graph from Stata in R. I have several variables and want to display their mean in each treatment group of which there are two. The Stata graph is as follows:
This coefficient plot is not actually a plot of coefficients, but of the mean values by treatment for each separate variable. The df basically looks something like this:
workable data
It is difficult to answer your question without reproducible data.
However, this might get what you desire just with mean:
library(dplyr)
library(ggplot2) # provides ggplot() and the mpg dataset
mpg %>%
select(manufacturer, cty, trans) %>%
group_by(manufacturer, trans) %>%
summarize(cty_mean = mean(cty)) %>%
ggplot(aes(x=cty_mean, y=reorder(manufacturer, cty_mean), color=trans)) +
geom_point()
If you also wish to include the coefficients or standard errors, you could do so by computing them with an additional function inside summarize().
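For instance, a standard error could be computed alongside the mean and drawn as a range. This is a sketch only: `std_error` is a helper defined here (not a dplyr or ggplot2 function), and groups with a single observation are dropped because their standard error is undefined.

```r
library(dplyr)
library(ggplot2)  # also provides the mpg dataset

# Hypothetical helper: standard error of the mean
std_error <- function(x) sd(x) / sqrt(length(x))

mpg_summ <- mpg %>%
  group_by(manufacturer, trans) %>%
  summarize(cty_mean = mean(cty),
            cty_se = std_error(cty),
            n = n(),
            .groups = "drop") %>%
  filter(n > 1)  # sd() is NA for a single observation

p <- ggplot(mpg_summ,
            aes(x = reorder(manufacturer, cty_mean), y = cty_mean, color = trans)) +
  geom_pointrange(aes(ymin = cty_mean - cty_se, ymax = cty_mean + cty_se),
                  position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(x = "manufacturer")
p
```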
I figured out geom_pointrange() is probably what you are looking for:
library("ggplot2")
set.seed(111018)
interval1 <- -qnorm((1-0.9)/2)
means_treatment_1 <- rnorm(2)
se_treatment_1 <- abs(rnorm(2)) # standard errors must be non-negative
df_treatment_1 <- data.frame("Mean" = means_treatment_1,
"lower" = means_treatment_1 - se_treatment_1*interval1,
"upper" = means_treatment_1 + se_treatment_1*interval1,
"Variable" = c("medicare_spending_dummy",
"job_training_dummy"),
"Treatment" = "a")
means_treatment_2 <- rnorm(2)
se_treatment_2 <- abs(rnorm(2)) # standard errors must be non-negative
df_treatment_2 <- data.frame("Mean" = means_treatment_2,
"lower" = means_treatment_2 - se_treatment_2*interval1,
"upper" = means_treatment_2 + se_treatment_2*interval1,
"Variable" = c("medicare_spending_dummy",
"job_training_dummy"),
"Treatment" = "b")
df_tot<-rbind(df_treatment_1, df_treatment_2)
# Plot
ggplot(df_tot, aes(colour = Treatment)) +
geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
geom_pointrange(aes(x = Variable, y = Mean, ymin = lower, ymax = upper ),lwd = 1, position = position_dodge(width = 1/2)) +
coord_flip() +
theme_bw()

How can I add annotation in ggplotly animation?

I am creating an animated plotly graph for an assignment in R, in which I compare several models fitted with various numbers of observations. I would like to add an annotation showing the RMSE of the current model, that is, text that changes together with the slider. Is there an easy way to do that?
Here is my dataset stored on GitHub. It already contains a variable with the RMSE: data
The base ggplot graphic is as follows:
library(tidyverse)
library(plotly)
p <- ggplot(values_predictions, aes(x = x)) +
geom_line(aes(y = preds_BLR, frame = n, colour = "BLR")) +
geom_line(aes(y = preds_RLS, frame = n, colour = "RLS")) +
geom_point(aes(x = x, y = target, frame = n, colour = "target"), alpha = 0.3) +
geom_line(aes(x = x, y = sin(2 * pi * x), colour = "sin(2*pi*x)"), alpha = 0.3) +
ggtitle("Comparison of performance") +
labs(y = "predictions and targets", colour = "colours")
This is converted to plotly, and I have added an animation to the Plotly graph:
plot <- ggplotly(p) %>%
animation_opts(easing = "linear",redraw = FALSE)
plot
Thanks!
You can add annotations to a ggplot graph using the annotate function: http://ggplot2.tidyverse.org/reference/annotate.html
df <- data.frame(x = rnorm(100, mean = 10), y = rnorm(100, mean = 10))
# Build model
fit <- lm(x ~ y, data = df)
# function finds RMSE
RMSE <- function(error) { sqrt(mean(error^2)) }
library(ggplot2)
library(plotly)
# Store the plot so it can be passed to ggplotly() later
p <- ggplot(df, aes(x, y)) +
geom_point() +
annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 2,
label = paste("RMSE", RMSE(fit$residuals)) )
p
There seems to be a bit of a problem converting this annotation between ggplot and plotly. However, the following workaround can be used:
ggplotly(p) %>%
layout(annotations = list(x = 12, y = 13, text = paste("RMSE",
RMSE(fit$residuals)), showarrow = FALSE))
Here's an example of adding data dependent text using the built in iris dataset with correlation as text to ggplotly.
library(plotly)
library(ggplot2)
library(dplyr)
mydata = iris %>% rename(variable1=Sepal.Length, variable2= Sepal.Width)
shift_right = 0.1 # number from 0-1 where higher = more right
shift_down = 0.02 # number from 0-1 where higher = more down
p = ggplot(mydata, aes(variable1,variable2))+
annotate(geom = "text",
label = paste0("Cor = ",as.character(round(cor.test(mydata$variable1,mydata$variable2)$estimate,2))),
x = min(mydata$variable1)+abs(shift_right*(min(mydata$variable1)-max(mydata$variable1))),
y = max(mydata$variable2)-abs(shift_down*(min(mydata$variable2)-max(mydata$variable2))), size=4)+
geom_point()
ggplotly(p) %>% style(hoverinfo = "none", traces = 1) # remove hover on text
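None of the snippets above ties the text to the animation slider. One possible approach, sketched here on simulated data because the original `values_predictions` is not shown, is to put one label per frame in its own data frame and give the text layer the same `frame` aesthetic as the other layers. The data frame `sim`, the label positions, and the assumption that `ggplotly()` animates a `geom_text()` layer this way are all illustrative; ggplot2 will warn that `frame` is an unknown aesthetic, which is expected.

```r
library(ggplot2)
library(plotly)

set.seed(42)
# Hypothetical stand-in for values_predictions: one block of rows per frame n
sim <- do.call(rbind, lapply(c(10, 20, 40), function(n) {
  x <- seq(0, 1, length.out = 25)
  data.frame(n = n, x = x,
             target = sin(2 * pi * x) + rnorm(25, sd = 0.3),
             pred = sin(2 * pi * x) + rnorm(25, sd = 1 / n))
}))

rmse <- function(error) sqrt(mean(error^2))

# One label row per frame, carrying that frame's RMSE
labels <- do.call(rbind, lapply(split(sim, sim$n), function(d) {
  data.frame(n = d$n[1], x = 0.8, y = 1.5,
             label = sprintf("RMSE = %.3f", rmse(d$target - d$pred)))
}))

p <- ggplot(sim, aes(x = x)) +
  geom_point(aes(y = target, frame = n), alpha = 0.3) +
  geom_line(aes(y = pred, frame = n)) +
  geom_text(data = labels, aes(x = x, y = y, label = label, frame = n))

fig <- ggplotly(p) %>%
  animation_opts(easing = "linear", redraw = FALSE)
fig
```

With the real data, `labels` would be built from the existing RMSE column rather than recomputed.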

Bar plot of group means with lines of individual results overlaid

This is my first Stack Overflow post and I am a relatively new R user, so please go gently!
I have a data frame with three columns, a participant identifier, a condition (factor with 2 levels either Placebo or Experimental), and an outcome score.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
I would like to construct a bar plot with two bars showing the mean outcome score for each condition, with the standard deviation as an error bar. I would then like to overlay lines connecting the points for each participant's score in each condition, so that the plot displays the individual responses as well as the group means. If possible, I would also like to include an axis break.
I don't seem to be able to find any advice in other threads, apologies if I am repeating a question.
Many Thanks.
P.S. I realise that presenting data in this way will not be to everyone's taste. It is for a specific requirement!
This ought to work:
library(ggplot2)
library(dplyr)
dat.summ <- dat %>% group_by(Condition) %>%
summarize(mean.outcome = mean(Outcome),
sd.outcome = sd(Outcome))
ggplot(dat.summ, aes(x = Condition, y = mean.outcome)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean.outcome - sd.outcome,
ymax = mean.outcome + sd.outcome),
color = "dodgerblue", width = 0.3) +
geom_point(data = dat, aes(x = Condition, y = Outcome),
color = "firebrick", size = 1.2) +
geom_line(data = dat, aes(x = Condition, y = Outcome, group = ID),
color = "firebrick", size = 1.2, alpha = 0.5) +
scale_y_continuous(limits = c(0, max(dat$Outcome)))
Some people are better with ggplot's stat functions and arguments than I am and might do it differently. I prefer to just transform my data first.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
dat.w <- reshape(dat, direction = 'wide', idvar = 'ID', timevar = 'Condition')
means <- colMeans(dat.w[, 2:3])
sds <- apply(dat.w[, 2:3], 2, sd)
ci.l <- means - sds
ci.u <- means + sds
ci.width <- .25
bp <- barplot(means, ylim = c(0,20))
segments(bp, ci.l, bp, ci.u)
segments(bp - ci.width, ci.u, bp + ci.width, ci.u)
segments(bp - ci.width, ci.l, bp + ci.width, ci.l)
segments(x0 = bp[1], x1 = bp[2], y0 = dat.w[, 2], y1 = dat.w[, 3], col = 1:10)
points(c(rep(bp[1], 10), rep(bp[2], 10)), dat$Outcome, col = 1:10, pch = 19)
Here is a method using the transformations inside ggplot2 (`fun` replaces the deprecated `fun.y` from ggplot2 3.3.0 onwards):
ggplot(dat) +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun="mean", geom="bar") +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun.data="mean_se", geom="errorbar", col="green", width=.8, size=2) +
geom_line(aes(x=Condition, y=Outcome, group=ID), col="red")
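The snippet above draws standard-error bars, whereas the question asked for the standard deviation. A small sketch of how that could be done with a custom summary function follows; `mean_sd` is a helper defined here for illustration, not a ggplot2 function, and `fun` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo", 10), rep("Experimental", 10)),
                  Outcome = rnorm(20, 15, 2),
                  ID = factor(rep(1:10, 2)))

# Hypothetical helper: returns the y/ymin/ymax columns that
# stat_summary() expects from a fun.data function
mean_sd <- function(x) {
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

p <- ggplot(dat, aes(x = Condition, y = Outcome)) +
  stat_summary(fun = mean, geom = "bar", fill = "grey80") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 0.3) +
  geom_line(aes(group = ID), colour = "red", alpha = 0.5) +
  geom_point(colour = "red")
p
```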
