I want to make a funnel chart in R with ggplot2 like the following:
https://chartio.com/assets/c15a30/tutorials/charts/funnel-charts/c7cd4465bc714689646515692b6dbe7c74ae7550a265cd2d6a530f1f34d68ae1/funnel-chart-example.png
My code looks like this, but I don't know how to do the light blue fills between the bars (maybe with a polygon?).
library(ggplot2)
library(reshape2) # for melt()
library(dplyr)
# get data
dat <- read.table(text=
"steps numbers rate
clicks 332835 100.000000
signup 157697 47.379933
cart 29866 8.973215
buys 17012 5.111241",
header = T)
barWidth <- 0.9
# add spacing, melt, sort
total <- subset(dat, rate==100)$numbers
dat$padding <- (total - dat$numbers) / 2
molten <- melt(dat[, -3], id.var='steps')
molten <- molten[order(molten$variable, decreasing = T), ]
molten$steps <- factor(molten$steps, levels = rev(dat$steps))
ggplot(molten, aes(x=steps)) +
geom_bar(aes(y = value, fill = variable),
stat='identity', position='stack') +
geom_text(data=dat,
aes(y=total/2, label= paste(round(rate), '%')),
color='white') +
scale_fill_manual(values = c('grey40', NA) ) +
coord_flip() +
theme(legend.position = 'none') +
labs(x='steps', y='volume')
I needed the same thing but couldn't find one, so I wrote a function to do it. It might need some improvements, but it works well. The example below shows only numbers, but you can also add text.
x <- c(86307,
34494,
28127,
17796,
12488,
11233
)
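# source() on the gist page URL fetches HTML rather than R code;
# devtools::source_gist() loads the gist by its ID instead
devtools::source_gist("fd14b58becab4924befef5be239c6011")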
gg_funnel(x, color = viridisLite::plasma(6))
This should be just a comment, since you explicitly asked for a ggplot solution, which this is not - I posted it as an answer purely for reasons of code formatting.
You could consider plotly, which has a funnel type. Something like
library(plotly)
dat %>% mutate(steps=factor(steps, unique(steps)),
rate=sprintf("%.2f%%", rate)) %>%
plot_ly(
type = "funnel",
y = ~steps,
text= ~rate,
x = ~numbers)
could get you started; I do not really grasp the padding you have in your data, so this might not be exactly what you want.
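For the light blue fills asked about in the question, one possible ggplot2-only route is the polygon idea the question itself hints at. The sketch below is only an illustration under assumptions: it drops coord_flip() and the padding trick, draws the bars with geom_rect() centred on zero, and builds one trapezoid per pair of consecutive steps for geom_polygon(). The names bar_height, half and trapezoids are made up for this example.

```r
library(ggplot2)

dat <- data.frame(
  steps   = c("clicks", "signup", "cart", "buys"),
  numbers = c(332835, 157697, 29866, 17012)
)

n <- nrow(dat)
bar_height <- 0.6             # assumption: bar thickness in y units
dat$y    <- rev(seq_len(n))   # first step is drawn at the top
dat$half <- dat$numbers / 2   # half-width, so every bar is centred on x = 0

# One trapezoid per pair of consecutive steps: it joins the lower edge of
# bar i to the upper edge of bar i + 1 (corners listed clockwise).
trapezoids <- do.call(rbind, lapply(seq_len(n - 1), function(i) {
  data.frame(
    id = i,
    x  = c(-dat$half[i], dat$half[i], dat$half[i + 1], -dat$half[i + 1]),
    y  = c(rep(dat$y[i] - bar_height / 2, 2),
           rep(dat$y[i + 1] + bar_height / 2, 2))
  )
}))

p <- ggplot() +
  geom_polygon(data = trapezoids, aes(x = x, y = y, group = id),
               fill = "lightblue") +
  geom_rect(data = dat,
            aes(xmin = -half, xmax = half,
                ymin = y - bar_height / 2, ymax = y + bar_height / 2),
            fill = "grey40") +
  scale_y_continuous(breaks = dat$y, labels = dat$steps) +
  labs(x = "volume", y = "steps")
```

Printing p draws the funnel. Labels could be added with geom_text() at x = 0, as in the question's code.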
I am using some code I got from an answer to a previous question, but I ran into a funny problem and I'd like some expert insight into what is going on. I am trying to plot monthly deviations from an annual mean using bar charts. Specifically, I am coloring the bars differently depending on whether the monthly mean is above or below the annual mean. I am using the txhousing dataset, which is included with the ggplot2 package.
I thought I could use a factor to denote whether or not this is the case. The months are correctly ordered when I only plot a subset of the data (the "lower" values), but when I add another plot, ggplot rearranges all of the months to be alphabetical. Does anyone know why this happens, and what a workaround would be?
Thank you so much for any input! Criticism of my code is welcome :)
Reproducible Examples
1. Using just one plot
library(tidyverse)
# subset txhousing to just the year 2014, and calculate nested means and dates
housing_df <- filter(txhousing, year == 2014) %>%
group_by(year, month) %>%
summarise(monthly_mean = mean(sales, na.rm = TRUE),
date = first(date)) %>%
mutate(month = factor(month.abb[month], levels = month.abb, ordered = TRUE),
salesdiff = monthly_mean - mean(monthly_mean), # monthly deviation
higherlower = case_when(salesdiff >= 0 ~ "higher",
salesdiff < 0 ~ "lower"))
ggplot(data = housing_df, aes(x = month, y = salesdiff, higherlower)) +
geom_col(data = filter(housing_df, higherlower == "higher"), aes(y = salesdiff, fill = higherlower)) +
scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
theme_bw() +
theme(legend.position = "none") # remove legend
2. Using two plots with all of the data:
ggplot(data = housing_df, aes(x = month, y = salesdiff, higherlower)) +
geom_col(data = filter(housing_df, higherlower == "higher"), aes(y = salesdiff, fill = higherlower)) +
geom_col(data = filter(housing_df, higherlower == "lower"), aes(y = salesdiff, fill = higherlower)) +
scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
theme_bw() +
theme(legend.position = "none") # remove legend
There are multiple ways to do this, but I find it a bit of trial and error. You are already doing the most common fix, which is to convert month into a factor, and that's why the first plot works. Why it does not work in the second case is a bit of a mystery, but try adding + scale_x_discrete(limits = housing_df$month) to override the x-axis order and see if that works.
I agree with the other comments that the best way would be not to use the extra layer at all, as it's not needed in this specific case, but the above solution works even when there are multiple layers.
Additionally, + scale_x_discrete(drop = FALSE) also overrides potentially different factor levels from different data sources in your ggplot.
This topic is also addressed here: https://github.com/tidyverse/ggplot2/issues/577
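The single-layer variant suggested in the answers above could look roughly like this. It is a sketch, not a tested drop-in replacement: it maps fill to higherlower in one geom_col() instead of filtering the data into two layers, and it keeps scale_x_discrete(limits = month.abb) as an explicit safeguard for the month order. Using if_else() instead of case_when() is just a simplification made here.

```r
library(ggplot2)
library(dplyr)

# Monthly mean sales for 2014 and each month's deviation from the annual mean
housing_df <- txhousing %>%
  filter(year == 2014) %>%
  group_by(month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE), .groups = "drop") %>%
  mutate(month = factor(month.abb[month], levels = month.abb),
         salesdiff = monthly_mean - mean(monthly_mean),
         higherlower = if_else(salesdiff >= 0, "higher", "lower"))

# One layer only: fill is mapped to higherlower, so no filtered copies of the
# data are needed and the factor order of month is kept
p <- ggplot(housing_df, aes(x = month, y = salesdiff, fill = higherlower)) +
  geom_col() +
  scale_fill_manual(values = c(higher = "blue", lower = "red")) +
  scale_x_discrete(limits = month.abb) +
  theme_bw() +
  theme(legend.position = "none")
```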
I am trying to create a picture that summarises my data. The data is about the prevalence of drug use, obtained from different practices in different countries. Each practice has contributed a different amount of data, and I want to show all of this in my picture.
Here is a subset of the data to work on:
gr<-data.frame(matrix(0,36))
gr$drug<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b")
gr$practice<-c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r")
gr$country<-c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3")
gr$prevalence<-c(9.14,5.53,16.74,1.93,8.51,14.96,18.90,11.18,15.00,20.10,24.56,22.29,19.41,20.25,25.01,25.87,29.33,20.76,18.94,24.60,26.51,13.37,23.84,21.82,23.69,20.56,30.53,16.66,28.71,23.83,21.16,24.66,26.42,27.38,32.46,25.34)
gr$prop<-c(0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406,0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406)
gr$low.CI<-c(8.27,4.80,12.35,1.83,7.22,14.53,18.25,10.56,14.28,18.76,24.25,21.72,18.62,19.83,24.36,25.22,28.80,20.20,17.73,23.15,21.06,13.12,21.79,21.32,22.99,19.76,29.60,15.41,28.39,23.25,20.34,24.20,25.76,26.72,31.92,24.73)
gr$high.CI<-c(10.10,6.37,22.31,2.04,10.00,15.40,19.56,11.83,15.74,21.52,24.87,22.86,20.23,20.68,25.67,26.53,29.86,21.34,20.21,26.10,32.79,13.63,26.02,22.33,24.41,21.39,31.48,17.98,29.04,24.43,22.01,25.12,27.09,28.05,33.01,25.95)
The code I wrote is this
p<-ggplot(data=gr, aes(x=factor(drug), y=as.numeric(gr$prevalence), ymax=max(high.CI),position="dodge",fill=practice,width=prop))
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
p + theme_bw()+
geom_bar(stat="identity",position = position_dodge(0.9)) +
labs(x="Drug",y="Prevalence") +
geom_errorbar(ymax=gr$high.CI,ymin=gr$low.CI,position=position_dodge(0.9),width=0.25,size=0.25,colour="black",aes(x=factor(drug), y=as.numeric(gr$prevalence), fill=practice)) +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The figure I obtain is this one, where the bars are all on top of each other, while I want them dodged.
I also obtain the following warning:
ymax not defined: adjusting position using y instead
Warning message:
position_dodge requires non-overlapping x intervals
Ideally I would get each bar near one another, with their error bars in the middle of its bar, all organised by country.
Also should I be concerned about the warning (which I clearly do not fully understand)?
I hope this makes sense. I hope I am close enough, but I don't seem to be going anywhere, some help would be greatly appreciated.
Thank you
ggplot's geom_bar() accepts the width parameter, but doesn't line them up neatly against one another in dodged position by default. The following workaround references the solution here:
library(dplyr)
# calculate x-axis position for bars of varying width
gr <- gr %>%
group_by(drug) %>%
arrange(practice) %>%
mutate(pos = 0.5 * (cumsum(prop) + cumsum(c(0, prop[-length(prop)])))) %>%
ungroup()
x.labels <- gr$practice[gr$drug == "a"]
x.pos <- gr$pos[gr$drug == "a"]
ggplot(gr,
aes(x = pos, y = prevalence,
fill = country, width = prop,
ymin = low.CI, ymax = high.CI)) +
geom_col(col = "black") +
geom_errorbar(size = 0.25, colour = "black") +
facet_wrap(~drug) +
scale_fill_manual(values = c("c1" = "gray79",
"c2" = "gray60",
"c3" = "gray39"),
guide = "none") +
scale_x_continuous(name = "Drug",
labels = x.labels,
breaks = x.pos) +
labs(title = "Drug usage by country and practice", y = "Prevalence") +
theme_classic()
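To see what the pos calculation in the answer above is doing: each bar's centre is the midpoint between its left edge (the cumulative width of all earlier bars) and its right edge (the cumulative width including itself). A small worked example with made-up widths, where left_edge and right_edge are names introduced only for illustration:

```r
prop <- c(0.2, 0.5, 0.3)   # hypothetical bar widths

right_edge <- cumsum(prop)                       # 0.2, 0.7, 1.0
left_edge  <- cumsum(c(0, prop[-length(prop)]))  # 0.0, 0.2, 0.7

# Same formula as in the mutate() call: midpoint of each bar
pos <- 0.5 * (right_edge + left_edge)
pos
```

With these widths the centres come out at 0.10, 0.45 and 0.85, so adjacent bars touch without overlapping.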
There is a lot of information you are trying to convey here - to contrast drug A and drug B across countries using the barplots and accounting for proportions, you might use the facet_grid function. Try this:
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
gr$drug <- paste("Drug", gr$drug)
p<-ggplot(data=gr, aes(x=factor(practice), y=as.numeric(prevalence),
ymax=high.CI,ymin = low.CI,
position="dodge",fill=practice, width=prop))
p + theme_bw()+ facet_grid(drug~country, scales="free") +
geom_bar(stat="identity") +
labs(x="Practice",y="Prevalence") +
geom_errorbar(position=position_dodge(0.9), width=0.25,size=0.25,colour="black") +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The width is too small in the C1 country and as you indicated the one clinic is quite influential.
Also, you can specify your aesthetics once in ggplot(aes(...)) rather than resetting them in each layer, and there is no need to include the data frame's name inside aes() within the ggplot call.
I'm having trouble creating a figure with ggplot2.
In this plot, I'm using geom_bar to plot three factors. I mean, for each "time" and "dose" I'm plotting two bars (two genotypes).
To be more specific, this is what I mean:
This is my code so far (I actually changed some settings, but I'm presenting just what is needed):
ggplot(data=data, aes(x=interaction(dose,time), y=b, fill=factor(genotype)))+
geom_bar(stat="identity", position="dodge")+
scale_fill_grey(start=0.3, end=0.6, name="Genotype")
Question: I would like to add the mean for each time as points, placed in the middle of the bars belonging to that time. How can I do this?
I tried to add these points using geom_dotplot and geom_point but I did not succeed.
library(dplyr)
# name the summary column so it can be referenced after the join
time_data <- data %>% group_by(time) %>% summarize(mean_b = mean(b))
data <- inner_join(data, time_data, by = "time")
This gives you data with the means attached. Now make the plot:
ggplot(data=data, aes(x=interaction(dose,time), y=b, fill=factor(genotype)))+
geom_bar(stat="identity", position="dodge")+
scale_fill_grey(start=0.3, end=0.6, name="Genotype")+
geom_text(aes(label = round(mean_b, 1)), vjust = 0)
You might need to fiddle around with the argument hjust and vjust in the geom_text statement. Maybe the aes one too, I didn't run the program so I don't know.
It generally helps if you can give a reproducible example. Here, I made some of my own data.
sampleData <-
data.frame(
dose = 1:3
, time = rep(1:3, each = 3)
, genotype = rep(c("AA","aa"), each = 9)
, b = rnorm(18, 20, 5)
)
You need to calculate the means somewhere, and I chose to do that on the fly. Note that, instead of using points, I used a line to show that the mean is for all of those values. I also sorted somewhat differently, and used facet_wrap to cluster things together. Points would be a fair bit harder to place, particularly when using position_dodge, but you could likely modify this code to accomplish that.
ggplot(
sampleData
, aes(x = dose
, y = b
, fill = genotype)
) +
geom_bar(position = "dodge", stat = "identity") +
geom_hline(data =
sampleData %>%
group_by(time) %>%
summarise(meanB = mean(b)
, dose = NA, genotype = NA)
, aes(yintercept = meanB)
, col = "black"
) +
facet_wrap(~time)
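If points are really what is wanted, one way to adapt the faceted plot above is to compute one row per time holding the mean and the x position at the centre of the dose axis, then draw that with geom_point(). This is only a sketch under assumptions: the data is simulated again (with a fixed seed and the recycling written out explicitly), fill is mapped inside geom_col() so the point layer does not need a genotype column, and timeMeans and midDose are names invented for the example.

```r
library(ggplot2)
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
sampleData <- data.frame(
  dose     = rep(1:3, times = 6),
  time     = rep(rep(1:3, each = 3), times = 2),
  genotype = rep(c("AA", "aa"), each = 9),
  b        = rnorm(18, 20, 5)
)

# One row per time: the mean of b and the middle of the dose axis
timeMeans <- sampleData %>%
  group_by(time) %>%
  summarise(meanB = mean(b), midDose = mean(range(dose)))

p <- ggplot(sampleData, aes(x = dose, y = b)) +
  geom_col(aes(fill = genotype), position = "dodge") +
  geom_point(data = timeMeans, aes(x = midDose, y = meanB), size = 3) +
  facet_wrap(~time)
```

Each facet then shows a single point at the mean for that time, centred horizontally over its group of bars.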
Is there a way to jitter the lines in geom_line()? I know it kinda defies the purpose of this plot, but if you have a plot with few lines and would like them all to show it could be handy. Maybe some other solution to this visibility problem.
Please see below for code,
A <- c(1,2,3,5,1)
B <- c(3,4,1,2,3)
id <- 1:5
df <- data.frame(id, A, B)
# install.packages("reshape2")
require(reshape2) # for melt
dfm <- melt(df, id=c("id"))
# install.packages("ggplot2")
require(ggplot2)
p1 <- ggplot(data = dfm, aes(x = variable, y = value, group = id,
                             color = as.factor(id))) +
  geom_line() +
  labs(x = "id # 1 is hardly visible as it is covered by id # 5") +
  scale_colour_manual(values = c('red', 'blue', 'green', 'yellow', 'black'))
p2 <- ggplot(subset(dfm, id != 5), aes(x = variable, y = value,
                                       group = id, color = as.factor(id))) +
  geom_line() +
  labs(x = "id # 5 removed, id # 1 is visible") +
  scale_colour_manual(values = c('red', 'blue', 'green', 'yellow', 'black'))
# install.packages("gridExtra")
require(gridExtra) # for grid.arrange
grid.arrange(p1, p2)
You can try
geom_line(position = position_jitter(width = 0.02, height = 0))
and see if that works well.
If you just want to prevent two lines from overlapping exactly, there is now a better way: position_dodge(), which "adjusts position by dodging overlaps to the side". This is nicer than adding jitter to any line, even when it's not needed.
Avoid ggplot2 lines overlapping exactly using position_dodge()
Code example:
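library(ggplot2)
library(reshape2) # for melt()
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df.m <- melt(df, id.vars = "x")
# the + must end the line; a line starting with + is parsed as a separate expression
ggplot(df.m, aes(x = x, y = value, group = variable, colour = variable)) +
  geom_line(position = position_dodge(width = 0.2))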
Thanks to position_dodge(), we can now see that there are two lines in the plot, which just happen to co-incide exactly:
I tend to use different linestyles, so that, say, a solid blue line "peeks through" a dashed red line on top of it.
Then again, it does depend on what you want to impart to the reader. Keep in mind first and foremost that data should be points and theory lines unless this makes things cluttered. Unless the y and x values are identical, it'll be easier to see the points. (or you could apply the existing jitter function to the x-values)
Next, if you just want to show which runs are in the "bundle" and which are outliers, overlap doesn't matter because it's very unlikely that two outliers will be near-equal.
If you want to show a bunch of near-equal runs, you may prefer (which is to say, your readers will understand better) to plot the deltas against a mean rather than the actual values.
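As a rough illustration of the "deltas against a mean" idea, using the small data frame from the question: subtract each variable's mean from its values and plot the differences around a zero line. The data frame name deltas and the dashed reference line are choices made for this sketch.

```r
library(ggplot2)

df <- data.frame(id = 1:5, A = c(1, 2, 3, 5, 1), B = c(3, 4, 1, 2, 3))

# Deviation of every run from the mean of its variable
deltas <- data.frame(
  id       = factor(rep(df$id, times = 2)),
  variable = rep(c("A", "B"), each = nrow(df)),
  delta    = c(df$A - mean(df$A), df$B - mean(df$B))
)

p <- ggplot(deltas, aes(x = variable, y = delta, group = id, colour = id)) +
  geom_hline(yintercept = 0, linetype = "dashed") +  # the mean itself
  geom_line()
```

Note that runs with identical values (ids 1 and 5 here) still coincide after this transformation, so this helps with near-equal runs rather than exact duplicates.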
I would like to suggest a solution to a different problem than described, in which the Y axis is a factor, so position_dodge does nothing.
code:
library(tidyverse)
time_raw <- tibble(year=1900:1909,
person_A=c(rep("Rome",2),rep("Jerusalem",8)),
person_B=c(rep("Jerusalem",5),rep("Rome",5)))
achievements <- tribble(~year,~who,~what,
1900,"person_A","born",
1900,"person_B","born",
1909,"person_A","died",
1909,"person_B","died",
1905,"person_A","super star",
1905,"person_B","super star")
SCALE=0.5
jitter_locations <- time_raw %>%
pivot_longer(-year,names_to="who",values_to="place") %>%
distinct(place)%>%
filter(!is.na(place)) %>%
mutate(y_place=seq_along(place))
jitter_lines <- time_raw %>%
pivot_longer(-year,names_to="who",values_to="place") %>%
distinct(who) %>%
mutate(y_jitter=as.numeric(scale(seq_along(who)))*0.015) # scale() returns a matrix; keep a plain numeric column
data_for_plot <- time_raw %>%
pivot_longer(-year,names_to="who",values_to="place") %>%
filter(!is.na(place)) %>%
left_join(achievements) %>%
left_join(jitter_locations) %>%
left_join(jitter_lines)
data_for_plot %>%
ggplot(aes(x=year,y=y_place+y_jitter,color=who,group=who))+
geom_line(size=2)+
geom_hline(aes(yintercept=y_place),size=50,alpha=0.1)+
geom_point(data = . %>% filter(!is.na(what)),size=5)+
geom_label(aes(label=what),size=3,nudge_y = -0.025)+
theme_bw()+
coord_cartesian(ylim = c(min(jitter_locations$y_place)-0.5*SCALE,
max(jitter_locations$y_place)+0.5*SCALE))+
scale_y_continuous(breaks =
min(jitter_locations$y_place):max(jitter_locations$y_place),
labels = jitter_locations$place)+
scale_x_continuous(breaks =
min(data_for_plot$year):max(data_for_plot$year))+
ylab("Place")