Linking legend to plot with a line or an arrow - r

Context: when you have "many" categories it can become hard to distinguish them in a bar plot. I found the plot below dealing with this situation quite nicely by linking the legend with categories in the plot.
Question: is it possible to do something similar with ggplot2?
With ggplot2 it is straightforward to get this:
But I really do not know where to start to achieve the result shown in the first plot.
Here is some code to sort it out:
library(ggplot2)
ggplot(data = mtcars, aes(x = vs, y = disp, fill = factor(carb))) +
geom_bar(stat = "identity")
Expected output (not as nice as the one presented above but it shows the idea)

There is no proper legend on the axes in any of the plots, but my guess is that the desired chart is based on relative frequencies, while your plot seems to show absolute frequencies, though I'm not sure about that.
Assuming that you want to produce a stacked bar chart giving the (relative) number of observations of a categorical variable in two groups, there are two ways to get the two stacked bars to be of the same height:
There needs to be exactly the same number of observations in both of them. Then you can use absolute frequencies.
The absolute frequencies need to be transformed to relative frequencies (or percent) by dividing them by the total number of observations in each group.
You can calculate the relative frequencies yourself and use them as the y-values.
Or refer to this post, as it seems to describe exactly what you want using ggplot2.
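
A minimal sketch of both routes, assuming the bars are meant to show the share of observations per carb level within each vs group (the question's own code sums disp instead, so this is an interpretation, not the asker's exact plot). The object names counts, shares, p_manual and p_fill are made up for illustration:

```r
library(ggplot2)

# Route 1: compute the within-group shares by hand.
# margin = 1 divides each row (one vs group) by its row total.
counts <- table(vs = mtcars$vs, carb = mtcars$carb)
shares <- as.data.frame(prop.table(counts, margin = 1))

# Freq now holds relative frequencies, so both stacks reach a height of 1
p_manual <- ggplot(shares, aes(x = vs, y = Freq, fill = carb)) +
  geom_col()

# Route 2: let ggplot2 rescale each stack to height 1
p_fill <- ggplot(mtcars, aes(x = factor(vs), fill = factor(carb))) +
  geom_bar(position = "fill")
```

Printing p_manual or p_fill should give two bars of equal height; the second route avoids the intermediate table.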

Related

specify order of variables in position dodge

I honestly don't know why this is so hard.
I'm creating a simple scatter plot. The x axis is a continuous variable, and at every tick in x I need to plot four points with error bars. I'm using position dodge and everything works fine.
Each point has a different color, size and shape as governed by three further variables: color and shape are governed by factors, size by a continuous variable.
By default, the four points reflect the order of the levels in the color variable (red always left, then green, then blue) but I would like them to reflect the order of the size variable (the continuous one), smallest left and largest right. How do I specify that size should be prioritised when ordering points in position dodge? I tried using reverse ordering but then the points are ordered first according to the shape legend.
I could change the mapping between variable and aesthetics (all variables are fundamentally continuous and could be used with size) but I think it'd be useful to know how to specify the order in which multiple variables should be considered when dodging points.
The question is somewhat unclear unfortunately. You don't show "a simple scatter plot". You are showing some statistics (mean with error band??) for specific x values - although this is seemingly continuous, this looks as if you have categorised it beforehand - resulting in some summary statistics which you are plotting.
Also, it is not easy (impossible) to fully help you without knowing what you have done until now to come to where you are.
I have tried to reproduce a similar looking plot with mtcars.
Dodging is only possible by one group (but one group can contain more than one variable). To specify how to group, add group = ... to your aesthetics.
Like so:
library(tidyverse)
ggplot(filter(mtcars, carb %in% 1:4)) +
geom_point(aes(carb, mpg, size= gear, group = gear, shape = as.character(vs), color = as.factor(cyl)),
position = position_dodge(width = .5))
This is now dodged by gear, which is also used as size aesthetic.
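
If the dodge should depend on more than one variable, one option is to build the group from several columns with interaction(). This is only a sketch under the assumption that the left-to-right order follows the level order of the resulting factor, which is my understanding of how ggplot2 numbers groups; cars_sub and p_dodge are illustrative names:

```r
library(ggplot2)

# Same subset as above, using base R so only ggplot2 is needed
cars_sub <- subset(mtcars, carb %in% 1:4)

# One dodge group per gear/cyl combination
p_dodge <- ggplot(cars_sub,
                  aes(carb, mpg,
                      size = gear,
                      shape = factor(vs),
                      colour = factor(cyl),
                      group = interaction(gear, cyl))) +
  geom_point(position = position_dodge(width = 0.5))
```

Swapping the arguments of interaction() changes the level order, so it is worth trying both to see which one gives the ordering you want.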

Reordering data based on a column in [r] to order x-value items from lowest to highest y-values in ggplot

I have a dataframe that I want to reorder to make a ggplot so I can easily see which items have the highest and lowest values in them. In my case, I've grouped the data into two groups, and it'd be nice to have a visual representation of which group tends to score higher. Based on this question I came up with:
library(ggplot2)
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- line that doesn't seem to be working
ggplot(cor.data.sorted,aes(x=pic,y=r.val,size=df.val,color=exp)) + geom_point()
which produces this:
I've tried quite a few variants to reorder the data, and I feel like this should be pretty simple to achieve. To clarify, if I had successfully reorganised the data then the y-values would go up as the plot moves along the x-axis. So maybe I'm focusing on the wrong part of the code to achieve this in a ggplot figure?
You could do something like this?
library(tidyverse);
cor.data %>%
mutate(pic = factor(pic, levels = as.character(pic)[order(r.val)])) %>%
ggplot(aes(x = pic, y = r.val, size = df.val, color = exp)) + geom_point()
This obviously still needs some polishing to deal with the x axis label clutter etc.
Rather than trying to order the data before creating the plot, I can reorder it inside the plot call:
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- this line only controls the order in which points are drawn (for a slightly more readable plot)
ggplot(cor.data.sorted,aes(x=reorder(pic,r.val),y=r.val,size=df.val,color=exp)) + geom_point()
to create
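
To see why sorting the rows alone does not change the x-axis, here is a small sketch with made-up data (toy is a stand-in I invented, not the original cor.data). A character column is plotted in alphabetical order regardless of row order, whereas reorder() returns a factor whose levels follow r.val:

```r
# Hypothetical toy data standing in for cor.data
toy <- data.frame(
  pic   = c("a", "b", "c", "d"),
  r.val = c(0.40, 0.10, 0.30, 0.20),
  stringsAsFactors = FALSE
)

# Sorting the rows: the implied factor levels are still alphabetical
toy_sorted <- toy[order(toy$r.val), ]
levels_after_sort <- levels(factor(toy_sorted$pic))

# reorder(): levels are ordered by the r.val of each pic
pic_by_r <- reorder(toy$pic, toy$r.val)
levels_after_reorder <- levels(pic_by_r)
```

Here levels_after_sort stays in alphabetical order while levels_after_reorder runs from the lowest to the highest r.val, which is the order ggplot2 uses for a discrete x-axis.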

Adding group mean lines to geom_bar plot and including in legend

I want to create a bar graph which also shows the mean value for the bars in each group, and shows the mean line in the legend.
I have been able to get this graph (bar chart with means) using the code below, which is fine, but I would like to be able to see the mean lines in the legend.
## The data to be graphed is the proportion of persons receiving a treatment
## (num = numerator) in each population (denom = denominator). The population is
## grouped by two age groups (Age) and further divided by a categorical
## variable V1
###SET UP DATAFRAME###
require(ggplot2)
df <- data.frame(V1 = c(rep(c("S1","S2","S3","S4","S5"),2)),
Age= c(rep(70,5),rep(80,5)),
num=c(5280,6570,5307,4894,4119,3377,4244,2999,2971,2322),
denom=c(9984,12600,9425,8206,7227,7290,8808,6386,6206,5227))
df$prop<-df$num/df$denom*100
PopMean<-sum(df$num)/sum(df$denom)*100
df70<-df[df$Age==70,]
group70mean<-sum(df70$num)/sum(df70$denom)*100
df80<-df[df$Age==80,]
group80mean<-sum(df80$num)/sum(df80$denom)*100
df$PopMean<-c(rep(PopMean,10))
df$groupmeans<-c(rep(group70mean,5),rep(group80mean,5))
I want the plot to look like this, but want the lines in the legend too, to be labelled as 'mean of group' or similar.
#basic plot
P<-ggplot(df, aes(x=factor(Age), y=prop, fill=factor(V1))) +
geom_bar(position=position_dodge(), colour='black',stat="identity")
P
####add mean lines
P+geom_errorbar(aes(y=df$groupmeans, ymax=df$groupmeans,
ymin=df$groupmeans), col="red", lwd=2)
Adding show.legend=TRUE overlays the error bars onto the factor legend, rather than separately. If there is a way of showing geom_errorbar separately in the legend this is probably the simplest solution.
I have also tried various things with geom_line
The syntax below produces a line for the population mean value, but running from the centre of each point rather than covering the width of the bars
This produces a line for the population mean and it does produce a legend but one showing a bar of colour rather than a line.
P+geom_line(aes(y=df$PopMean, group=df$PopMean, color=df$PopMean),lwd=1)
If I try to draw lines for the group means, the lines are not visible (because they are only single points).
P+geom_line(aes(y=df$groupmeans, group=df$groupmeans, color=df$groupmeans))
I also tried to get round this with facet plot, although this requires me to pretend my categorical variable is numeric to get it to work.
###set up new df
df2<-df
df2$V1<-c(rep(c(1,2,3,4,5),2))
P<-ggplot(df2, aes(x=factor(V1), y=prop, fill=factor(V1))) +
geom_bar(position=position_dodge(),
colour='black',stat="identity",width=1)
P+facet_grid(.~factor(df2$Age))
P+facet_grid(.~factor(df2$Age))+geom_line(aes(y=df$groupmeans,
group=df$groupmeans, color=df$groupmeans))
Facetplot
This allows me to show the mean lines, using geom_line, so a legend does appear (although it doesn't look right, showing a colour gradient rather than coloured lines!). However, the lines still do not go the full width of the bars. Also my x-axis now needs relabelling to show S1, S2 etc rather than numeric 1,2,3
To sum up: is there a way of showing error bar lines separately in the legend?
If not, and I use faceting, how do I correct the legend appearance, relabel the axes with my categorical variable, and is it possible to get the line to go the full width of the plot?
Or is there an alternate solution that I am missing!?
Thanks
To get a legend for geom_errorbar you need to map the colour aesthetic inside aes().
As you want only one category (here red), I've created a dummy variable first:
df$mean <- "Mean"
ggplot(df, aes(x=factor(Age), y=prop, fill=factor(V1))) +
geom_bar(position=position_dodge(), colour='black',stat="identity") +
geom_errorbar(aes (ymax=groupmeans,
ymin=groupmeans, colour=mean), lwd=2) +
scale_colour_manual(name="",values = "#ff0000")

R graphic: Shifting values of different series so that error bars do not overlap

Here is some code:
set.seed (12)
library(ggplot2)
dat = data.frame(a=runif(40,0,1),b=c('a','b','c','d','e'),c=c('Hi','Hello'))
ggplot(dat,aes(x=b,y=a,shape=factor(c))) + stat_summary(fun.data=mean_cl_normal)
The graph it creates has error bars that overlap, so that it is hard to distinguish the limits. I've often seen graphs where the different series (given by the factor c) are slightly shifted horizontally so that the error bars do not overlap. Is there a way to achieve this in R when using a categorical variable on x?
Thank you
You can use something like position_dodge():
ggplot(dat,aes(x=b,y=a,shape=factor(c))) +
stat_summary(fun.data=mean_cl_normal, position=position_dodge(width=0.2))
Example plot:

ggplot geom_bar vs geom_histogram

What is the difference (if any) between geom_bar and geom_histogram in ggplot? They seem to produce the same plot and take the same parameters.
Bar charts provide a visual presentation of categorical data. Examples:
The number of people with red, black and brown hair
Look at the geom_bar help file. The examples are all counts.
Wikipedia page
Histograms are used to plot density of interval (usually numeric) data. Examples,
Distributions of age and height
geom_histogram help file. The examples are distributions of movie ratings.
ggplot2
After a bit more investigating, I think in ggplot2 there is no difference between geom_bar and geom_histogram. From the docs:
geom_histogram(mapping = NULL, data = NULL, stat = "bin",
position = "stack", ...)
geom_bar(mapping = NULL, data = NULL, stat = "bin",
position = "stack", ...)
I realise that in the geom_histogram docs it states:
geom_histogram is an alias for geom_bar plus stat_bin
but to be honest, I'm not really sure what this means, since my understanding of ggplot2 is that both stat_bin and geom_bar are layers (with a slightly different emphasis).
The default behavior is the same for both geom_bar and geom_histogram. This is because (as @csgillespie mentioned) there is an implied stat_bin when you call geom_histogram (understandable), and it is also the default statistical transformation applied to geom_bar (arguable behavior IMO). That's why you need to specify stat='identity' when you want to plot the data as is.
The stat='bin' or stat_bin() is a statistical transformation that ggplot does for you. It provides as output the variables surrounded by two dots (..count.. and ..density..). If you don't specify stat='bin' you won't get those variables.
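
A hedged side note for readers on newer ggplot2 versions: the answers above describe an older release in which geom_bar defaulted to stat = "bin". In more recent releases, as far as I know, geom_bar() defaults to stat = "count", geom_col() is the shortcut for stat = "identity", and geom_histogram() still uses stat = "bin". A minimal sketch with mtcars (my choice of data, not from the question):

```r
library(ggplot2)

# geom_bar(): counts the rows for each discrete x value
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# geom_col(): plots the y values as given, like geom_bar(stat = "identity")
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))
p_col <- ggplot(cyl_counts, aes(x = cyl, y = Freq)) + geom_col()

# geom_histogram(): bins a continuous x first, then counts per bin
p_hist <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# The values computed by each stat can be inspected without drawing
bar_counts  <- layer_data(p_bar)$count
hist_counts <- layer_data(p_hist)$count
```

bar_counts holds one count per cylinder class and hist_counts one count per bin, which makes the difference between the two stats easier to see than the drawn plots do.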
geom_bar() is for a categorical x variable -- so there are spaces between the bars, as the x-values are a factor with distinct levels, and the y-axis shows the count for each level.
geom_histogram() is for a continuous x variable. The continuous data goes on the x-axis, where it is cut into bins (so the bars touch each other, as the bins are contiguous), and the y-axis shows the count in each bin.
There is another plot we can use to show the above situation (1 categorical 1 continuous) -- geom_boxplot(). Usually we use y-axis to represent the continuous data as it's going to be a vertical box-and-whisker.

Resources