I need help with a ggplot that plots the average of y for each x value and draws a line plot with points, plus a text label at each point (using ggplot functionality), all color coded by the "color" aesthetic. As far as possible, I don't want to create an intermediate data frame from the original data just to summarise the y means. I tried using fun.y, as shown in the code snippet. An Excel chart of the desired output is also attached.
Sample data
set.seed(1)
age_range = sample(c("ar2-15", "ar16-29", "ar30-44"), 20, replace = TRUE)
gender = sample(c("M", "F"), 20, replace = TRUE)
region = sample(c("A", "B", "C"), 20, replace = TRUE)
physi = sample(c("Poor", "Average", "Good"), 20, replace = TRUE)
height = sample(c(4,5,6), 20, replace = TRUE)
survey = data.frame(age_range, gender, region,physi,height)
ggplot code I tried
ggplot(survey, aes(x=age_range, y=height, color=gender)) + stat_summary(fun.y=mean, geom = "point")+geom_line()
Output I am getting
Output I am looking for
Following up on @Sandy's comment, you can add the labels in a similar fashion, though here I am using the ggrepel package to make sure they don't overlap (without having to code the locations manually). For the location, the result of the call to mean is returned as y, which you can read inside the aesthetics with after_stat(y). Before ggplot2 3.3.0 this was written ..y.., and the fun argument was called fun.y; the older spellings still run but are deprecated.
library(ggplot2)
library(ggrepel)
ggplot(survey, aes(x = age_range, y = height, color = gender, group = gender)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(aes(label = round(after_stat(y), 2)), fun = mean, geom = "label_repel", segment.size = 0)
Gives
(Note that segment.size = 0 is to ensure that there is not an additional line drawn from the point to the label.)
When this was written, ggrepel did not appear to offer text displacement along only one axis (see here), so you may have to position the labels manually if you want more precision. Newer ggrepel versions accept a direction argument ("x", "y" or "both") for this.
If you want to set the label locations manually, here is an approach that uses dplyr and the %>% pipe to avoid having to save any intermediate data.frames
The basic idea is described here. To see the result after any step, just highlight up to just before the %>% at the end of a line and run. First, group_by the x location and the grouping that you want to plot. Get the average of each using summarise. The data are still grouped by age_range afterwards (summarise only rolls up one grouping level at a time), so you can determine which of the groups has the higher mean at that point by subtracting the mean of the group means. I used sign just to pull out whether the difference was positive or negative, then multiplied/divided by a factor to get the spacing I wanted (in this case, divided by ten to get a spacing of 0.1). Add that adjustment to the mean to set where you want the label to land. Then, pass all of that into ggplot and proceed as you would with any other data.frame.
survey %>%
group_by(age_range, gender) %>%
summarise(height = mean(height)) %>%
mutate(myAdj = sign(height - mean(height)) / 10
, labelLoc = height + myAdj) %>%
ungroup() %>%
ggplot(aes(x = age_range
, y = height
, label = round(height, 2)
, color = gender
, group = gender
)) +
geom_point() +
geom_line() +
geom_label(aes(y = labelLoc)
, show.legend = FALSE)
Gives:
Which seems to accomplish your base goals, though you may want to play around with spacing etc. for your actual use case.
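To make the sign-based offset above concrete, here is a minimal base-R sketch of the same arithmetic. The group means are made-up illustrative numbers, not values computed from the survey data, and ave() stands in for the grouped mean(height) inside mutate.

```r
# Hypothetical group means for two age ranges (illustrative values only)
means <- data.frame(
  age_range = c("ar16-29", "ar16-29", "ar30-44", "ar30-44"),
  gender    = c("F", "M", "F", "M"),
  height    = c(4.8, 5.2, 5.5, 4.5)
)

# Mean of the gender means within each age range
# (the role mean(height) plays inside the grouped mutate)
range_mean <- ave(means$height, means$age_range)

# +0.1 for the higher series at each x position, -0.1 for the lower one
means$myAdj    <- sign(means$height - range_mean) / 10
means$labelLoc <- means$height + means$myAdj

print(means)
```

If two series tie at an x position, sign() returns 0 for both, so their labels would land on the same spot and need a manual nudge.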
So I have the following code which produces:
The issue here is twofold:
1. The grouped bar chart automatically places the highest value on top (i.e. for avenue 4, CTP is on top), whereas I would always want FTP to be shown first and CTP after it (so always the blue bar, then the red bar).
2. I need all of the values to scale to 100 or 100% for their respective group (so for CTP, avenue 4 would have a huge bar but the other avenues should be extremely tiny).
I am new to 'R'/Stack overflow so sorry if anything is wrong/you need more but any help is greatly appreciated.
library(ggplot2)
library(tidyverse)
library(magrittr)
# function to specify decimals
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
# sample data
avenues <- c("Avenue1", "Avenue2", "Avenue3", "Avenue4")
flytip_amount <- c(1000, 2000, 1500, 250)
collection_amount <- c(5, 15, 10, 2000)
# create data frame from the sample data
df <- data.frame(avenues, flytip_amount, collection_amount)
# got it working - now to test
df3 <- df
SumFA <- sum(df3$flytip_amount)
df3$FTP <- (df3$flytip_amount/SumFA)*100
df3$FTP <- specify_decimal(df3$FTP, 1)
SumCA <- sum(df3$collection_amount)
df3$CTP <- (df3$collection_amount/SumCA)*100
df3$CTP <- specify_decimal(df3$CTP, 1)
# Now we have percentages remove whole values
df2 <- df3[,c(1,4,5)]
df2 <- df2 %>% pivot_longer(-avenues)
FTGraphPos <- df2$name
ggplot(df2, aes(x = avenues, fill = as.factor(name), y = value)) +
geom_col(position = "dodge", width = 0.75) + coord_flip() +
labs(title = "Flytipping & Collection %", x = "ward_name", y = "Percentageperward") +
geom_text(aes(x= avenues, label = value), vjust = -0.1, position = "identity", size = 5)
I have tried the above and I have looked at lots of tutorials but nothing is exactly precise to what I need of ensuring the group bar charts puts the layers in the same order despite amount and scaling to 100/100%
As Camille notes, to handle ordering of the categories in a plot, you need to set them as factors, and then use functions from the forcats package to handle the order. Here I am using fct_relevel() (note that it will automatically convert character variables to factors).
Your numeric values are in fact set to character, so they need to be set to numeric for the chart to make sense.
To cover point #2, I'm using group_by() to calculate percentages within each name.
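I have also fixed the labels so that they are properly dodged along with the bars. Also, note that you don't need to load ggplot2 separately if you are loading tidyverse, since it is attached along with it. magrittr is not attached by tidyverse, but the %>% pipe is re-exported by dplyr and the other core packages, so library(magrittr) is only needed for its other operators.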
df_plot <- df2 |>
mutate(name = fct_relevel(name, "CTP"),
value = as.numeric(value)) |>
group_by(name) |>
mutate(perc = value / sum(value)) |>
ungroup()
ggplot(df_plot, aes(x = value, y = avenues, fill = name)) +
geom_col(position = "dodge", width = 0.75) +
geom_text(aes(label = value), position = position_dodge(width = 0.75), size = 5) +
labs(title = "Flytipping & Collection %", x = "Percentageperward", y = "ward_name") +
guides(fill = guide_legend(reverse = TRUE))
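As a small illustration of the point about the values being stored as character, this sketch reuses the question's specify_decimal() helper and flytip_amount vector; the as_text and as_number names are only for the example. It shows why the column has to be converted back before plotting: ggplot2 treats a character column as discrete.

```r
# Same helper and data as in the question
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall = k))
flytip_amount <- c(1000, 2000, 1500, 250)

pct <- flytip_amount / sum(flytip_amount) * 100

# format() returns text, so the helper's result is a character vector
as_text <- specify_decimal(pct, 1)
class(as_text)    # "character"

# Converting back restores numeric behaviour (ordering, sums, axis scaling)
as_number <- as.numeric(as_text)
class(as_number)  # "numeric"

# Character values sort alphabetically, numbers sort by size
sort(as_text)
sort(as_number)
```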
I've tried everywhere to find the answer to this question but I am still stuck, so here it is:
I have a data frame data_1 that contains data from an ongoing latent profile analysis. The variables of interest for this question are profiles and gender.
I would like to plot gender distribution by profile, but within each profile show what % of each gender we have compared to the entire sample of this gender. For example, if we have 10 women and 5 in Profile 1, I want the text on top of the women bar for Profile 1 to show 50%.
Right now I am using the following code but it is giving me the percentage for the entire population, while I just want the percentage compared to the total number of women.
ggplot(data = subset(data_1, !is.na(gender)),
aes(x = gender, fill = gender)) + geom_bar() +
facet_grid(cols=vars(profiles)) + theme_minimal() +
scale_fill_brewer(palette = 'Accent', name = "Gender",
labels = c("Non-binary", "Man", "Woman")) +
labs(x = "Gender", title = "Gender distribution per LPA profile") +
geom_text(aes(y = ((..count..)/sum(..count..)),
label = scales::percent((..count..)/sum(..count..))),
stat = "count", vjust = -28)
Thanks in advance for your help!
I tried multiple alternatives including creating the variable within the dataset using summarize and mutate but with no success unfortunately.
As untidy as it seems, it's likely the best approach to summarise outside of the ggplot2 call, which can be done like this:
library(tidyverse)
data1 <- tibble(gender = sample(c("male", "female"), 100, replace = TRUE),
profile = sample(c("profile1", "profile2"), 100, replace = TRUE))
data1 |>
count(gender, profile) |>
group_by(gender) |>
mutate(perc = n / sum(n)) |>
ggplot(aes(x = gender, y = n, fill = gender)) +
geom_col() +
facet_grid(~profile) +
geom_text(aes(y = n + 3, label = scales::percent(perc)))
The facet_grid is essentially grouping the dataset by profile before doing any calculations of values, so in essence it's blind to the data in the other facet. I think the only approach is thus summarising before the call and using geom_col (which defaults to stat = "identity") to make the plots. Note that the y value for the labels is calculated from the count variable, so R will position the text relative to the counted heights of the bars.
Edit - actually no, there's a "simpler" way
I tell a lie, you can actually do it in the ggplot2 call, but it's a little messier:
data1 |>
ggplot(aes(x = gender, fill = gender)) +
geom_bar() +
facet_grid(~ profile) +
stat_count(aes(y = after_stat(count) + 2,
label = scales::percent(after_stat(count) /
tapply(after_stat(count),
after_stat(group),
sum)[after_stat(group)]
)),
geom = "text")
Code borrowed from here. The after_stat(group) part is accessing the grouped gender count across both facets. Today I learned something!
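To see what the tapply(...)[...] expression is doing, here is a minimal sketch with plain vectors standing in for after_stat(count) and after_stat(group). The counts are hypothetical, chosen to match the question's example of 10 women with 5 in Profile 1, and the group numbering (1 = women, 2 = men) is an assumption for illustration.

```r
# Hypothetical bar counts in the order:
# profile 1 women, profile 1 men, profile 2 women, profile 2 men
count <- c(5, 3, 5, 9)
# Group index of each bar (assumed: 1 = women, 2 = men)
group <- c(1, 2, 1, 2)

# Total per group, summed across both profiles
totals <- tapply(count, group, sum)

# Indexing the totals by group repeats the right total for every bar
share <- as.vector(count / totals[group])

# Plain-text percentage labels (scales::percent() would do the same job)
labels <- paste0(round(100 * share), "%")
print(labels)
```

Indexing with [group] works here because the group ids are consecutive integers starting at 1, so they double as positions in the totals vector.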
I'm making a plot where several data points have the same coordinates. By default, the labels all overlap, but using geom_text_repel with direction = "y", I can vertically space them out.
However, every time I generate the plot, it chooses a new order for the labels. I would like them to be ordered based on a value.
I have tried:
using "arrange" to order the dataframe in the order that I want to see the labels (this seems to have no effect)
Trying to use "nudge_y" to re-arrange the labels in the order I want them. This seems to change the plot - it does "nudge" them - but it does NOT nudge them into the correct order!
Here is sample code to recreate the problem. Basically, I want the final plot to be ordered by the "order" value - so, for the three datapoints on "10", the order should be Ayala, Zoe, JL, and for the two datapoints on "5", the order should be Raph, Oona.
I've color-coded the plot to make it obvious what order they should be in - for each value, the lightest blue should be on top, and the darkest should be on the bottom.
library(tidyverse)
library(ggrepel)
name <- c("Oona","Sam","Raph", "JL", "Zoe","Ayala")
year <- rep(c("2016"),6)
value <- c(5,15,5,10,10,10) #The value I'm plotting
order <- c(5,-10,10,-5,0,5) #The value I want to order the labels by
test_df <- bind_cols(name = name, year = year, value = value, order = order) %>%
arrange(-value, -order) #arranging the df doesn't seem to affect the order on the plot at all, I just do it so I can easily preview the df in the correct order
ggplot(data = test_df, aes(x = year, y = value, group = name)) +
geom_point(aes(color = order)) +
geom_text_repel(data = test_df,
aes(label = name, color = order),
hjust = "left",
nudge_y = order, #This is where I'm trying to "nudge" them into the right order
nudge_x = -.45,
direction = "y")
I think the values in your order column were too big for the y-axis scale provided, so geom_text_repel was doing behind-the-scenes work to make it all actually fit, and changed the order of the labels in the process. When I scaled the order column down to one-fifth the sizes you had originally, it worked perfectly.
test_df$order <- test_df$order*1/5
ggplot(data = test_df, aes(x = year, y = value, group = name)) +
geom_point(aes(color = order)) +
geom_text_repel(data = test_df,
aes(label = name, color = order),
hjust = "left",
nudge_y = test_df$order,
nudge_x = -.45,
direction = "y"
)
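One more thing that may be contributing, offered as a possibility rather than a confirmed diagnosis: in the question's code, nudge_y = order sits outside aes(), so it refers to the standalone order vector, which is still in its original order, while test_df has been re-sorted by arrange(). The answer's nudge_y = test_df$order uses the sorted column instead. The sketch below rebuilds the data in base R (base::order() stands in for arrange()) and compares the two.

```r
name  <- c("Oona", "Sam", "Raph", "JL", "Zoe", "Ayala")
value <- c(5, 15, 5, 10, 10, 10)
order <- c(5, -10, 10, -5, 0, 5)

test_df <- data.frame(name = name, value = value, order = order)
# Base-R equivalent of arrange(-value, -order)
test_df <- test_df[base::order(-test_df$value, -test_df$order), ]

# The standalone vector no longer lines up with the rows of test_df
comparison <- data.frame(name       = test_df$name,
                         column     = test_df$order,
                         standalone = order)
print(comparison)
```

Rows where the two columns differ are labels that would receive another name's nudge.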
I am trying to connect jittered points between measurements from two different methods (measure) on the x-axis. These measurements are linked to one another by the probands (a), which can be separated into two main groups, patients (pat) and controls (ctr).
My df is like that:
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
value = sample(1:10, 40, rep = TRUE),
measure = rep(c("a", "b"), 20), group = rep(c("pat", "ctr"), each = 2,10))
I tried
library(ggplot2)
ggplot(df,aes(measure, value, fill = group)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0.1,
dodge.width = 0.75), shape = 1) +
geom_line(aes(group = a), position = position_dodge(0.75))
Created on 2020-01-13 by the reprex package (v0.3.0)
I used the fill aesthetic in order to separate the jittered dots from both groups (pat and ctr). I realised that when I put the group = a aesthetics into the ggplot main call, then it doesn't separate as nicely, but seems to link better to the points.
My question: Is there a way to better connect the lines to the (jittered) points, but keeping the separation of the two main groups, ctr and pat?
Thanks a lot.
The big issue you are having is that you are dodging the points by only group but the lines are being dodged by a, as well.
To keep your lines with the axes as is, one option is to manually dodge your data. This takes advantage of factors being integers under the hood, moving one level of group to the right and the other to the left.
# factor() is needed on R >= 4.0, where data.frame() keeps strings as character
df = transform(df, dmeasure = ifelse(group == "ctr",
as.numeric(factor(measure)) - .25,
as.numeric(factor(measure)) + .25 ) )
You can then make a plot with measure as the x axis but then use the "dodged" variable as the x axis variable in geom_point and geom_line.
ggplot(df, aes(x = measure, y = value) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
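The manual dodge relies on as.numeric() returning the integer level codes of a factor. A minimal sketch of that behaviour, using short stand-in vectors rather than the full df: note that since R 4.0 data.frame() keeps strings as character by default, and as.numeric() on a character vector gives NA, so the column may need wrapping in factor() first.

```r
# Stand-in vectors for the measure and group columns
measure <- factor(c("a", "b", "a", "b"))
group   <- c("pat", "pat", "ctr", "ctr")

# Integer codes of the factor levels: "a" is 1, "b" is 2
codes <- as.numeric(measure)

# Shift controls left and patients right of each x position
dmeasure <- ifelse(group == "ctr", codes - .25, codes + .25)
print(dmeasure)

# A character vector has no codes to fall back on
no_codes <- suppressWarnings(as.numeric(c("a", "b")))
print(no_codes)
```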
If you also want jittering, that can also be added manually to both your x and y variables.
# factor() is needed on R >= 4.0, where data.frame() keeps strings as character
df = transform(df, dmeasure = ifelse(group == "ctr",
jitter(as.numeric(factor(measure)) - .25, .1),
jitter(as.numeric(factor(measure)) + .25, .1) ),
jvalue = jitter(value, amount = .1) )
ggplot(df, aes(x = measure, y = jvalue) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
This turned out to be an astonishingly common question and I'd like to add an answer/comment to myself with a suggestion of a - what I now think - much, much better visualisation:
The scatter plot.
I originally intended to show paired data and visually guide the eye between the two comparisons. The problem with this visualisation is evident: Every subject is visualised twice. This leads to a quite crowded graphic. Also, the two dimensions of the data (measurement before, and after) are forced into one dimension (y), and the connection by ID is awkwardly forced onto your x axis.
Plot 1: The scatter plot naturally represents the ID by showing only one point per subject, while showing both dimensions more naturally on x and y. The only step needed is to make your data wider (yes, this is also sometimes necessary; ggplot does not always require long data).
The box plot
Plot 2: As rightly pointed out by user AllanCameron, another option would be to plot the difference of the paired values directly, for example as a boxplot. This is a nice visualisation of the appropriate paired t-test where the mean of the differences is tested against 0. It will require the same data shaping to "wide format". I personally like to show the actual values as well (if there are not too many).
library(tidyr)
library(dplyr)
library(ggplot2)
## first reshape the data wider (one column for each measurement)
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
## now use the new columns for your scatter plot
ggplot() +
geom_point(aes(time_a, time_b, color = group)) +
## you can add a line of equality to make it even more intuitive
geom_abline(intercept = 0, slope = 1, lty = 2, linewidth = .2) +
coord_equal()
Box plot to show differences of paired values
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
ggplot(aes(x = "", y = time_a - time_b)) +
geom_boxplot() +
# optional, if you want to show the actual values
geom_point(position = position_jitter(width = .1))
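The remark above that the box plot of differences mirrors the paired t-test can be checked directly. This is a minimal sketch using the question's simulated data and base R's t.test(); the time_a and time_b vectors are built by subsetting instead of pivot_wider, which works here because each id has exactly one "a" row followed by one "b" row.

```r
set.seed(1)
df <- data.frame(a = rep(paste0("id_", 1:20), each = 2),
                 value = sample(1:10, 40, replace = TRUE),
                 measure = rep(c("a", "b"), 20))

# One value per subject and measurement, in matching subject order
time_a <- df$value[df$measure == "a"]
time_b <- df$value[df$measure == "b"]

# Paired test on the two measurements
paired <- t.test(time_a, time_b, paired = TRUE)
# One-sample test of the differences against 0
one_sample <- t.test(time_a - time_b, mu = 0)

# Both give the same t statistic and p-value
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
c(paired = paired$p.value, one_sample = one_sample$p.value)
```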
I'm having trouble creating a figure with ggplot2.
In this plot, I'm using geom_bar to plot three factors. I mean, for each "time" and "dose" I'm plotting two bars (two genotypes).
To be more specific, this is what I mean:
This is my code so far (I actually changed some settings, but I'm presenting just what is needed):
ggplot(data=data, aes(x=interaction(dose,time), y=b, fill=factor(genotype)))+
geom_bar(stat="identity", position="dodge")+
scale_fill_grey(start=0.3, end=0.6, name="Genotype")
Question: I intend to add the mean of each time using points and that these points are just in the middle of the bars of a certain time. How can I proceed?
I tried to add these points using geom_dotplot and geom_point but I did not succeed.
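library(dplyr)
library(ggplot2)
# name the summary column so it can be referred to after the join
time_data <- data %>% group_by(time) %>% summarize(mean_b = mean(b))
data <- inner_join(data, time_data, by = "time")
This gives you data with the per-time means attached as mean_b. Now make the plot:
ggplot(data = data, aes(x = interaction(dose, time), y = b, fill = factor(genotype))) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_grey(start = 0.3, end = 0.6, name = "Genotype") +
  # geom_text needs a label aesthetic; here it is placed at the height of the mean
  geom_text(aes(y = mean_b, label = round(mean_b, 1)), vjust = 0)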
You might need to fiddle around with the argument hjust and vjust in the geom_text statement. Maybe the aes one too, I didn't run the program so I don't know.
It generally helps if you can give a reproducible example. Here, I made some of my own data.
sampleData <-
data.frame(
dose = 1:3
, time = rep(1:3, each = 3)
, genotype = rep(c("AA","aa"), each = 9)
, b = rnorm(18, 20, 5)
)
You need to calculate the means somewhere, and I chose to do that on the fly. Note that, instead of using points, I used a line to show that the mean is for all of those values. I also sorted somewhat differently, and used facet_wrap to cluster things together. Points would be a fair bit harder to place, particularly when using position_dodge, but you could likely modify this code to accomplish that.
ggplot(
sampleData
, aes(x = dose
, y = b
, fill = genotype)
) +
geom_bar(position = "dodge", stat = "identity") +
geom_hline(data =
sampleData %>%
group_by(time) %>%
summarise(meanB = mean(b)
, dose = NA, genotype = NA)
, aes(yintercept = meanB)
, col = "black"
) +
facet_wrap(~time)