Stack different variables on one graph using stat_summary (ggplot) - r

Context: I want to compare graphically the evolution of workload and trust over time during an experiment. Time is represented by 2 blocks.
Issue: I'm trying to plot different variables with different units on the same graph to compare the evolution. I only found that it works with geom_line, but it doesn't for stat_summary.
Data: x is "Block" (2 blocks) representing time. Variables used for y are "Workload" and "Trust" (both from 1 to 5, obtained by asking the subject).
To give some data:
data = data.frame("Subject" = c(1,1,2,2,3,3), "Block" = c(1,2,1,2,1,2), "Workload" = c(1,5,2,4,3,3), "Trust" = c(4,1,3,2,2,1))
I tried this, it works:
ggplot(data, aes(Block)) + geom_line(aes(y = Trust)) + geom_line(aes(y = Workload))
However, it does not produce a convincing result: since I have multiple points per block, it connects all of them at each value of Block, so I only get vertical lines. That is perfectly normal considering what geom_line is supposed to do.
So I can still compute the mean for each block and each variable, however I was wondering if it is possible to obtain a direct result with stat_summary, using something like:
ggplot(data, aes(Block)) + stat_summary(fun = mean, geom = "line", aes(y = Trust)) + stat_summary(fun = mean, geom = "line", aes(y = Workload))
Thank you to anyone dedicating even a little of their time to answering this.
Have a nice day!
Pyxel

I'd recommend summarizing your data before plotting. Consider here:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df <- tibble("Subject" = c(1,1,2,2,3,3),
             "Block" = c(1,2,1,2,1,2),
             "Workload" = c(1,5,2,4,3,3),
             "Trust" = c(4,1,3,2,2,1))
# Mean of each variable per block
grouped <-
  df %>%
  group_by(Block) %>%
  summarise(trust = mean(Trust),
            workload = mean(Workload))
ggplot(grouped, aes(x = Block)) +
  geom_line(aes(y = trust)) +
  geom_line(aes(y = workload))
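For what it's worth, stat_summary can also draw these lines directly, as the question hoped. The sketch below assumes ggplot2 3.3.0 or later, where the argument is called fun (older versions used fun.y), and it passes the geom as the string "line". The colour labels and the axis title are illustrative additions, not part of the original question.

```r
library(ggplot2)

dat <- data.frame(Subject  = c(1, 1, 2, 2, 3, 3),
                  Block    = c(1, 2, 1, 2, 1, 2),
                  Workload = c(1, 5, 2, 4, 3, 3),
                  Trust    = c(4, 1, 3, 2, 2, 1))

# One stat_summary layer per variable; each computes the mean of y per Block.
# The constant colour labels give each line its own legend entry.
p <- ggplot(dat, aes(x = Block)) +
  stat_summary(aes(y = Trust, colour = "Trust"),
               fun = mean, geom = "line") +
  stat_summary(aes(y = Workload, colour = "Workload"),
               fun = mean, geom = "line") +
  labs(y = "Mean rating (1 to 5)", colour = "Variable")
p
```

Both approaches should give the same lines; summarising first just makes the plotted values easier to inspect.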


How to calculate and label peak value of distribution by multiple conditions/facets in R ggplot?

While the question appears similar to others, there's a key difference in my mind.
I want to be able to calculate and/or print (graphing it would be the ultimate goal, but calculating it in the data frame is the primary goal) the peak value of a density curve of EACH SUB-CONDITION BY FACET. The density graph looks like this:
So, ideally, I would be able to know the intensity (x-axis value) corresponding to the highest peak of the density curves for each condition.
Here's some dummy data:
set.seed(1234)
library(tidyverse)
library(fs)
n <- 100000
# Cycle through the levels so every silence x treat combination occurs
silence <- factor(rep(c("sil1", "sil2", "sil3", "sil4", "sil5"), length.out = n))
treat <- factor(rep(c("con", "uos", "uos+wnt5a", "wnt5a"), length.out = n))
# replace = TRUE is required: n exceeds the 6001 values in 4000:10000
intensity <- sample(4000:10000, n, replace = TRUE)
df <- data.frame(silence, treat, intensity)
What I've tried:
Subsetting the primary DF and going through and calculating the density of each condition, but this could take days
Something close to this answer: Calculating peaks in histograms or density functions but not quite. I think the data look better as a histogram personally, but that constructs an arbitrary number of bins for intensity data (a continuous measure). The histogram looks like this:
Again, it would be sufficient to get the peak values for each of these groups (i.e., treatments by silencing subdistributions) just in the console, but adding them as a vertical line in the graphs would be a sweet cherry on top (it could also make it hella busy, so I will see about that piece later)
Thank you!!
Depending on the way you're producing the density plots, there may be a more direct way to recreate the density calculation before it goes into ggplot. That'll be the easiest way to get the peak values and keep them in the format of your data.
Without that, here's a hack that should work in general, but requires some kludging to fit the extracted points back into the form of your original data.
Here's a plot like yours:
mtcars %>%
  mutate(gear = as.character(gear)) %>%
  ggplot(aes(wt, fill = gear, group = gear)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~am) -> my_plot
Here are the components that make up that plot:
ggplot_build(my_plot) -> my_plot_innards
With some ugly hacking we can extract the points that make up the curves and make them look kind of like our original data. Some info is destroyed, e.g. the gear values 3/4/5 become group 1/2/3. There might be a cool way to convert back, but I don't know it yet.
extracted_points <- tibble(
  wt = my_plot_innards[["data"]][[1]][["x"]],
  y = my_plot_innards[["data"]][[1]][["y"]],
  gear = (my_plot_innards[["data"]][[1]][["group"]] + 2) %>% as.character, # HACK
  am = (my_plot_innards[["data"]][[1]][["PANEL"]] %>% as.numeric) - 1 # HACK
)
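One possible way to avoid the two hard-coded offsets is to look the original values up instead. This is only a sketch, and it rests on an assumption: that ggplot2 numbers groups and panels in the sorted order of the underlying values, which holds for this plot but is worth checking on your own data. The names plot_data, gear_lookup, am_lookup and curve_points are made up for the illustration.

```r
library(dplyr)
library(ggplot2)

plot_data <- mtcars %>% mutate(gear = as.character(gear))

my_plot <- ggplot(plot_data, aes(wt, fill = gear, group = gear)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~am)

# Assumption: group and panel numbers follow the sorted unique values,
# so the sorted values can serve as lookup tables.
gear_lookup <- sort(unique(plot_data$gear))
am_lookup   <- sort(unique(plot_data$am))

# layer_data() returns the computed data of one layer of the built plot
curve_points <- layer_data(my_plot, 1)

extracted_points <- tibble(
  wt   = curve_points$x,
  y    = curve_points$y,
  gear = gear_lookup[curve_points$group],
  am   = am_lookup[as.integer(curve_points$PANEL)]
)
```

For this plot the result should match the hacked version, but it no longer depends on gear starting at 3 or am starting at 0.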
# color (not fill) is what the default point shape responds to
ggplot(extracted_points, aes(wt, y, color = gear)) +
  geom_point(size = 0.3) +
  facet_wrap(~am)
# Keep the highest point of each curve
extracted_points_notes <- extracted_points %>%
  group_by(gear, am) %>%
  slice_max(y)
my_plot +
  geom_point(data = extracted_points_notes,
             aes(y = y), color = "red", size = 3, show.legend = FALSE) +
  geom_text(data = extracted_points_notes, hjust = -0.5,
            aes(y = y, label = scales::comma(y)), color = "red", size = 3, show.legend = FALSE)
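The more direct route mentioned at the start of this answer could look roughly like the sketch below: compute the kernel density per group yourself and keep the x value where it is highest. The helper peak_location is hypothetical, and it uses the default bandwidth of stats::density(), so the peaks may differ slightly from the curves geom_density draws.

```r
library(dplyr)

# Hypothetical helper: x position of the highest point of a kernel
# density estimate, using the defaults of stats::density()
peak_location <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

# One row per facet (am) and sub-condition (gear)
peaks <- mtcars %>%
  group_by(am, gear) %>%
  summarise(peak_wt = peak_location(wt), n = n(), .groups = "drop")
peaks
```

With the question's data, the same pattern would be group_by(silence, treat) followed by summarise(peak_intensity = peak_location(intensity)). The resulting table could then be passed to geom_vline(aes(xintercept = ...)) to mark the peaks.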

R ggplot2 Specify separate color gradients by group

I'm trying to make separate color gradients for grouped data that is displayed on the same scatterplot. I've included sample data below. User is unique user IDs, task is unique task IDs, days_completion is the time in days when the task was completed, task_group is the group indicator that the tasks are grouped into, and task_order is the order in which the tasks were made available for users to complete. Each row represents the time that the user completed a specific task. The task_order may not logically follow this organization as it was randomly generated, but it should suffice for demonstration.
The resulting plot would have days_completion on the x axis and user on the y axis; each point from geom_point would represent the time in days at which the user completed their task. The task groups would each have their own color in a gradient from dark to light by task_order. For example, task group 1 would be dark red at task order == 1 and light red at task order == 7.
Sample code is below:
library(dplyr)
library(forcats)
library(ggplot2)
test_data <- tibble(user = rep(seq(1:50), 10) %>%
                      as_factor(),
                    task = sample(1:10, 500, replace = TRUE) %>%
                      as_factor(),
                    days_completion = sample(1:500, 500, replace = FALSE),
                    task_group = sample(1:3, 500, replace = TRUE) %>%
                      as_factor(),
                    task_order = sample(1:7, 500, replace = TRUE, prob = c(rep(.25,3),.2,.2,.1,.1)) %>%
                      as_factor()) %>%
  arrange(days_completion)
#Sample plotting approach; does not work
test_plot <- test_data %>%
ggplot(aes(x = days_completion, y = user, color = task)) +
geom_point() +
#This seems to be what I need, but I can't figure out how to specify multiple gradients by task_group
scale_color_gradient()
I know I could manually order the factors and map colors with hex codes, but I'd like something that can scale and avoid the manual process. Also, if anyone has any suggestions for how to display this plot other than a scatterplot, I'm open to suggestions. The main idea is to detect patterns in completion time in trends displayed by the color. The trends may not show due to it being randomly generated data, but that's okay.
My coworker found a solution in another post that requires an additional package called ggnewscale. I still don't know if this can be done only with ggplot2, but this works. I'm still open to alternative plotting suggestions though. The purpose is to detect any trends in day of completion across and within users. Across users is where I expect to see more of a trend, but within could be informative too.
How merge two different scale color gradient with ggplot
library(ggnewscale)
# task_order is a factor in test_data; gradient scales need a numeric variable
test_num <- test_data %>% mutate(task_order = as.integer(as.character(task_order)))
dat1 <- test_num %>% filter(task_group == 1)
dat2 <- test_num %>% filter(task_group == 2)
dat3 <- test_num %>% filter(task_group == 3)
ggplot(mapping = aes(x = days_completion, y = user)) +
  geom_point(data = dat1, aes(color = task_order)) +
  scale_color_gradientn(colors = c('#99000d', '#fee5d9')) +
  new_scale_color() +
  geom_point(data = dat2, aes(color = task_order)) +
  scale_color_gradientn(colors = c('#084594', '#4292c6')) +
  new_scale_color() +
  geom_point(data = dat3, aes(color = task_order)) +
  # a second, lighter green is added here so this scale is a gradient too
  scale_color_gradientn(colors = c('#238b45', '#c7e9c0'))
You can generate your own color scale with colorRampPalette (from grDevices, which ships with base R) and pass it to scale_color_manual:
colo <- colorRampPalette(c("darkred", "orangered"))(10)
library(ggplot2)
ggplot(test_data, aes(x = days_completion, y = user)) +
  geom_point(aes(color = task)) +
  scale_color_manual(values = colo)
As for representations other than a scatterplot, it is difficult to propose something else. It will depend on your original data and the question you are trying to answer. Do you need to see the pattern per user? Or are your 50 users just replicates of your experiment? In those cases, maybe some geom_density could be helpful. Otherwise, maybe you can take a look at the stat_contour function.
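Building on the colorRampPalette idea, one ggplot2-only option is to compute a hex color for every row (one ramp per task group, indexed by task order) and plot it with scale_color_identity. This is a sketch under assumptions: task_group and task_order are stored as integers rather than factors, the data are freshly simulated, and the ramp endpoints in ramps are arbitrary illustrative choices.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
test_data <- tibble(
  user            = factor(rep(1:50, 10)),
  days_completion = sample(1:500, 500),
  task_group      = sample(1:3, 500, replace = TRUE),
  task_order      = sample(1:7, 500, replace = TRUE)
)

# Illustrative dark-to-light endpoints, one pair per task group
ramps <- list(c("#99000d", "#fee5d9"),
              c("#084594", "#c6dbef"),
              c("#005a32", "#c7e9c0"))

# Seven shades per group, one for each possible task_order
palettes <- lapply(ramps, function(ends) colorRampPalette(ends)(7))

# Look up the shade for each row: group picks the ramp, order picks the shade
test_data <- test_data %>%
  mutate(point_color = mapply(function(g, o) palettes[[g]][o],
                              task_group, task_order))

p <- ggplot(test_data, aes(days_completion, user, color = point_color)) +
  geom_point() +
  scale_color_identity()
p
```

This scales to any number of groups by extending ramps, but scale_color_identity draws no legend by default, so the meaning of the shades would have to be explained separately.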

Time series data using ggplot: how to use a different color for each time point and also connect with lines the data belonging to each subject?

I have data from several cells which I tested in several conditions: a few times before and also a few times after treatment. In ggplot, I use color to indicate different times of testing.
Additionally, I would like to connect with lines all data points which belong to the same cell. Is that possible?...
Here is my example data (https://www.dropbox.com/s/eqvgm4yu6epijgm/df.csv?dl=0) and a simplified code for the plot:
df$condition = as.factor(df$condition)
df$cell = as.factor(df$cell)
df$condition <- factor(df$condition, levels = c("before1", "before2", "after1", "after2", "after3"))
windows(width=8,height=5)
ggplot(df, aes(x=condition, y=test_variable, color=condition)) +
labs(title="", x = "Condition", y = "test_variable", color="Condition") +
geom_point(aes(color=condition),size=2,shape=17, position = position_jitter(w = 0.1, h = 0))
I think your code is heading in the wrong direction: you should instead group the points by the column Cell, so that the lines connect the points of each cell. Then, if I understand correctly, you want to see how the variable evolves for each cell before and after treatment, so you can order the x variable using scale_x_discrete.
Altogether, you can do something like that:
library(ggplot2)
ggplot(df, aes(x = condition, y = variable, group = Cell)) +
geom_point(aes(color = condition))+
geom_line(aes(color = condition))+
scale_x_discrete(limits = c("before1","before2","after1","after2","after3"))
Does it look like what you are expecting?
Data
df = data.frame(Cell = c(rep("13a",5),rep("1b",5)),
condition = rep(c("before1","before2","after1","after2","after3"),2),
variable = c(58,55,36,29,53,57,53,54,52,52))

Showing number of values outside axis range in boxplot (using ggplot2 in R)

Sometimes you want to limit the axis range of a plot to a region of interest so that certain features (e.g. location of the median & quartiles) are emphasized. Nevertheless, it may be of interest to make it clear how many/what proportion of values lie outside the (truncated) axis range.
I am trying to show this when using ggplot2 in R and am wondering whether there is some built-in way of doing this in ggplot2 (or alternatively some sensible solution some of you may have used). I am not particularly wedded to any one way of displaying this (e.g. jittered points with a different symbol at the edge of the plot, a little bar outside the plot whose fill level shows the proportion outside the range, or some other display that conveys the information).
Below is some example code that creates some mock data and the kind of plot I have in mind (shown below the code), but without any clear indication exactly how much data is outside the y-axis range.
library(ggplot2)
set.seed(seed=123)
group <- rep(c(0,1),each=500)
y <- rcauchy(1000, group, 10)
mockdata <- data.frame(group,y)
ggplot(mockdata, aes(factor(group),y)) + geom_boxplot(aes(fill = factor(group))) + coord_cartesian(ylim = c(-40,40))
You may compute these values in advance and display them via e.g. geom_text:
library(dplyr)
upper_lim <- 40
lower_lim <- -40
mockdata$upper_cut <- mockdata$y > upper_lim
mockdata$lower_cut <- mockdata$y < lower_lim
mockdata$group <- as.factor(mockdata$group)
mockpts <- mockdata %>%
  group_by(group) %>%
  summarise(upper_count = sum(upper_cut),
            lower_count = sum(lower_cut))
ggplot(mockdata, aes(group, y)) +
  geom_boxplot(aes(fill = group)) +
  coord_cartesian(ylim = c(lower_lim, upper_lim)) +
  geom_text(y = lower_lim, data = mockpts,
            aes(label = lower_count, x = group), hjust = 1.5) +
  geom_text(y = upper_lim, data = mockpts,
            aes(label = upper_count, x = group), hjust = 1.5)
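Since the question also asks about proportions, the same idea can be adapted to label the share of values beyond each limit. The following is a sketch rather than a polished solution: the column names (prop_above, label_above and so on) and the vjust offsets are illustrative choices.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
group <- rep(c(0, 1), each = 500)
mockdata <- data.frame(group = factor(group), y = rcauchy(1000, group, 10))

lower_lim <- -40
upper_lim <- 40

# Share of values beyond each limit, per group, plus a formatted label
outside <- mockdata %>%
  group_by(group) %>%
  summarise(prop_above = mean(y > upper_lim),
            prop_below = mean(y < lower_lim)) %>%
  mutate(label_above = scales::percent(prop_above, accuracy = 0.1),
         label_below = scales::percent(prop_below, accuracy = 0.1))

p <- ggplot(mockdata, aes(group, y)) +
  geom_boxplot(aes(fill = group)) +
  coord_cartesian(ylim = c(lower_lim, upper_lim)) +
  # Labels sit just inside the top and bottom edges of the visible range
  geom_text(data = outside, inherit.aes = FALSE, vjust = 1.5,
            aes(x = group, y = upper_lim, label = label_above)) +
  geom_text(data = outside, inherit.aes = FALSE, vjust = -0.5,
            aes(x = group, y = lower_lim, label = label_below))
p
```

Using coord_cartesian rather than setting limits on the y scale matters here: it zooms without dropping the out-of-range values, so the boxplot statistics are still computed on all the data.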

How to plot the mean of a single factor in a barplot with

I'm having trouble creating a figure with ggplot2.
In this plot, I'm using geom_bar to plot three factors. I mean, for each "time" and "dose" I'm plotting two bars (two genotypes).
To be more specific, this is what I mean:
This is my code so far (I actually changed some settings, but I'm presenting just what is needed):
ggplot(data=data, aes(x=interaction(dose,time), y=b, fill=factor(genotype)))+
geom_bar(stat="identity", position="dodge")+
scale_fill_grey(start=0.3, end=0.6, name="Genotype")
Question: I want to add the mean for each time as points, placed in the middle of the bars belonging to that time. How can I proceed?
I tried to add these points using geom_dotplot and geom_point but I did not succeed.
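library(dplyr)
# Name the summary column so it can be referred to after the join
time_data <- data %>% group_by(time) %>% summarize(mean_b = mean(b))
data <- inner_join(data, time_data, by = "time")
This gives you data with the means attached. Now make the plot:
ggplot(data = data, aes(x = interaction(dose, time), y = b, fill = factor(genotype))) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_grey(start = 0.3, end = 0.6, name = "Genotype") +
  # geom_text needs a label aesthetic; here the rounded mean, drawn at the mean's height
  geom_text(aes(y = mean_b, label = round(mean_b, 1)), vjust = 0)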
You might need to fiddle around with the argument hjust and vjust in the geom_text statement. Maybe the aes one too, I didn't run the program so I don't know.
It generally helps if you can give a reproducible example. Here, I made some of my own data.
sampleData <-
data.frame(
dose = 1:3
, time = rep(1:3, each = 3)
, genotype = rep(c("AA","aa"), each = 9)
, b = rnorm(18, 20, 5)
)
You need to calculate the means somewhere, and I chose to do that on the fly. Note that, instead of using points, I used a line to show that the mean is for all of those values. I also sorted somewhat differently, and used facet_wrap to cluster things together. Points would be a fair bit harder to place, particularly when using position_dodge, but you could likely modify this code to accomplish that.
ggplot(
sampleData
, aes(x = dose
, y = b
, fill = genotype)
) +
geom_bar(position = "dodge", stat = "identity") +
geom_hline(data =
sampleData %>%
group_by(time) %>%
summarise(meanB = mean(b)
, dose = NA, genotype = NA)
, aes(yintercept = meanB)
, col = "black"
) +
facet_wrap(~time)
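If points are still preferred over a line, one way to place them with this faceted layout is to put a single point per facet at the middle dose. This is a sketch on freshly simulated data: timeMeans and the seed are illustrative, geom_col is used as the equivalent of geom_bar(stat = "identity"), and fill is mapped only in the bar layer so the point layer does not need a genotype column.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
sampleData <- data.frame(
  dose     = rep(1:3, times = 6),
  time     = rep(rep(1:3, each = 3), times = 2),
  genotype = rep(c("AA", "aa"), each = 9),
  b        = rnorm(18, 20, 5)
)

# One row per time: the mean of b, positioned at the centre of the dose axis
timeMeans <- sampleData %>%
  group_by(time) %>%
  summarise(meanB = mean(b), dose = mean(dose))

p <- ggplot(sampleData, aes(x = dose, y = b)) +
  geom_col(aes(fill = genotype), position = "dodge") +
  geom_point(data = timeMeans, aes(y = meanB), size = 3) +
  facet_wrap(~time)
p
```

Because each facet holds exactly one time, the point sits in the middle of that time's bars without any manual dodging.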
