R; ggplot2: Overlaying one plot with another

I have two ggplots. The first one looks like this:
ggplot(nurse, aes(x = nurse$z2.bk, y = nurse$z1.bk, color = nurse$phoneme)) +
geom_point() +
scale_x_reverse() + scale_y_reverse() +
scale_color_discrete() +
theme_classic()
I then created a summary table holding the z1.bk and z2.bk averages for each of the phoneme categories.
mean_F1 = the z1.bk average and mean_F2 = the z2.bk average.
  vowel  mean_F1 mean_F2
  <fct>    <dbl>   <dbl>
1 Er     0.00830  0.612
2 Ir    -0.0433   0.0456
3 Vr     0.0365  -0.576
I then created another ggplot (below) for these values and labelled them according to the nurse$phoneme values. I just renamed them here to vowels to keep everything a bit cleaner.
ggplot(means, aes(x = mean_F2, y = mean_F1, label = vowel)) +
geom_label() +
scale_x_reverse() + scale_y_reverse() +
theme_classic()
I now want to overlay them, so that the labels are displayed on top of the other points in the corresponding colour, e.g. Er in red. I tried the following but got an error message.
ggplot(nurse, aes(x = nurse$z2.bk, y = nurse$z1.bk, color = nurse$phoneme, label = means$vowel)) +
geom_point() +
geom_label(data = means, aes(x = mean_F2, y = mean_F1)) +
scale_x_reverse() + scale_y_reverse() +
theme_classic()
Error: Aesthetics must be either length 1 or the same as the data (563): label
If I change 'label = means$vowel' to just 'vowel', I get another error message saying the object can't be found. If I change it to nurse$phoneme, I get this error message Error: Aesthetics must be either length 1 or the same as the data (3): colour, label.
How do I combine them properly? If I need to supply you with more data, just let me know. And thanks in advance!

First, it's a bit of bad form to use the $ convention to call columns in ggplot2, where you should simply give the name of the column in the dataset: thus nurse$z2.bk becomes simply z2.bk in the aes() call. With that being said, you can use it and it should still work... it's just frowned upon. :)
Now, for the error message you are receiving: aesthetics set in the ggplot() call are inherited by every layer, so geom_point() also tries to use label = means$vowel. That vector has 3 values, but the nurse dataset has 563 observations, hence the length mismatch. Since you have two datasets being applied separately to your point and label geoms, I would state the label aesthetic within the aes() of the geom that uses it.
Without your full dataset, I can't confirm, but the code below should work. Note also that I'm setting a title for the color legend, because mapping color to two differently named columns (phoneme and vowel) in the separate datasets could split the legend. Setting the name of the legend to be the same (and having the same labels in each) should keep the two color legends together.
ggplot(nurse, aes(x = z2.bk, y = z1.bk, color = phoneme)) +
geom_point() +
geom_label(data = means, aes(x = mean_F2, y = mean_F1, label=vowel, color=vowel)) +
scale_x_reverse() + scale_y_reverse() +
labs(color='The colors') +
theme_classic()
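
An alternative to renaming the legend, sketched here under stated assumptions: if the summary table keeps the same column names as the raw data, the label layer can simply inherit x, y and color from ggplot(), and only label needs to be added. The nurse data below is simulated (the real data were not posted), and computing the means with aggregate() is my own choice, not necessarily how the asker built means.

```r
library(ggplot2)

# Simulated stand-in for the asker's `nurse` data (an assumption)
set.seed(1)
nurse <- data.frame(
  phoneme = factor(rep(c("Er", "Ir", "Vr"), each = 20)),
  z1.bk   = rnorm(60),
  z2.bk   = rnorm(60)
)

# Per-phoneme means; the columns keep the names phoneme, z1.bk and z2.bk
means <- aggregate(cbind(z1.bk, z2.bk) ~ phoneme, data = nurse, FUN = mean)

p <- ggplot(nurse, aes(x = z2.bk, y = z1.bk, color = phoneme)) +
  geom_point() +
  # x, y and color are inherited; only the label mapping is new
  geom_label(data = means, aes(label = phoneme), show.legend = FALSE) +
  scale_x_reverse() + scale_y_reverse() +
  theme_classic()

p
```

Because both layers map color to a column called phoneme, there is only one color legend and no title to reconcile.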

Related

Percentages in faceted histogram with scale_y_continuous()

I am trying to use scale_y_continuous() with a faceted histogram and running into an issue. I am hoping to get each count to be a percentage instead. My code is:
ggplot(d, aes(x = likely_att)) +
geom_histogram(binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())
It looks like the distributions themselves are accurate, but the scaling is off: the percentages are "200 000%", "5 000%", etc. and that seems wrong, but I'm not quite sure why it's happening.
There are many more "yes" than "no" or "separated" married values in my dataset, which is why I use scales = "free_y" and why I'm hoping to just have percentages shown and only need one axis value shown.
I can't share this exact data for privacy reasons, but the likely_att variable is just a 1-5 numeric var, and married is a character var with 3 values: yes, no, separated.
In case it's helpful, I basically want it to look just like this image, but with percentages instead of counts, so I can just have one single y axis on the far left with 0 - 100 %
The problem is that the percent_format() function only changes the way the labels are printed; it doesn't actually rescale the numbers, so a count of 2000 is shown as "200 000%". To rescale, you could use the computed variable density and multiply it by the bin width, which gives the proportion of observations in each bin, then apply the percent formatting.
ggplot(d, aes(x = likely_att)) +
stat_bin(aes(y=..density..*.5, group = married),
binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())
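
The same idea can be written with after_stat(), which newer ggplot2 releases (3.3.0 and later, as far as I know) prefer over the ..density.. notation. This is a minimal sketch on simulated data, since the real d was not shared; the computed variable width is used in place of the hard-coded 0.5 so the two cannot drift apart.

```r
library(ggplot2)
library(scales)

# Simulated stand-in for the asker's data (an assumption)
set.seed(42)
d <- data.frame(
  likely_att = sample(1:5, 300, replace = TRUE),
  married    = sample(c("yes", "no", "separated"), 300,
                      replace = TRUE, prob = c(0.7, 0.2, 0.1))
)

p <- ggplot(d, aes(x = likely_att)) +
  # density * bin width = share of the panel's observations in each bin
  geom_histogram(aes(y = after_stat(density * width)),
                 binwidth = 0.5, color = "black") +
  facet_wrap(~married) +
  theme_classic() +
  scale_y_continuous(labels = percent_format())

p
```

Since every panel is now on the same 0 to 100 % scale, scales = "free_y" is no longer needed and a single y axis on the left suffices.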

Selecting color to mark means on boxplot with stat_summary

So there's several useful pages up about marking means on boxplots with multiple series; but even with those I'm having an issue where I can't select a color for the points and still show the two different means. I can do this:
library(ggplot2)
d <- subset(mpg,class=="compact"|class=="midsize")
ggplot(d,aes(drv,hwy,color=class)) + geom_boxplot() + scale_color_manual(values=c("blue","orange")) +
stat_summary(fun=mean,size=.5,shape=5,position=position_dodge(width=.75))
And that gives me the two different means, but they're the same color as the boxplots themselves and so not the best to look at.
So I add a color specification into the code:
ggplot(d,aes(drv,hwy,color=class)) + geom_boxplot() + scale_color_manual(values=c("blue","orange")) +
stat_summary(fun=mean,size=.5,color="black",shape=5,position=position_dodge(width=.75))
But then it's only showing the one mean.
So what am I missing here to get both a specified color and the multiple means being marked?
When you overwrite the colour aesthetic in stat_summary() you also lose the grouping information. You need to bring it back explicitly with aes(group = class):
library(ggplot2)
d <- subset(mpg, class == "compact" | class == "midsize")
ggplot(d, aes(drv, hwy, color = class)) +
geom_boxplot() +
stat_summary(
aes(group = class),
colour = "black",
fun = mean,
size = .5,
shape = 5,
position = position_dodge(width = .75)
)
#> Warning: Removed 4 rows containing missing values (geom_segment).
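Alternatively, map class to fill for the boxes and set a fixed color in stat_summary(). The grouping then comes from the fill mapping, so both means are still drawn and you get the desired output.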
ggplot(d,aes(drv,hwy, fill=class)) + geom_boxplot() + scale_fill_manual(values=c("cyan","orange")) +
stat_summary(fun=mean,size=.5, color="red",
shape=5,position=position_dodge(width=.75))
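
The "Removed 4 rows" warning in the first answer appears to come from stat_summary()'s default pointrange geom, which has no ymin/ymax when only fun is supplied. A minimal sketch, assuming ggplot2 3.3.0 or later (for the fun argument), that draws the means as plain points instead; the size value is my own choice for illustration:

```r
library(ggplot2)

d <- subset(mpg, class %in% c("compact", "midsize"))

p <- ggplot(d, aes(drv, hwy, color = class)) +
  geom_boxplot() +
  scale_color_manual(values = c("blue", "orange")) +
  stat_summary(
    aes(group = class),      # restore the grouping lost by fixing colour
    fun = mean,
    geom = "point",          # points only, so no missing range to drop
    colour = "black",
    shape = 5,
    size = 2,
    position = position_dodge(width = 0.75)
  )

# One summary row per drv/class combination
means <- layer_data(p, 2)

p
```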

How to plot multiple boxplots with numeric x values properly in ggplot2?

I am trying to get a boxplot with 3 different tools in each dataset size like the one below:
ggplot(data1, aes(x = dataset, y = time, color = tool)) + geom_boxplot() +
labs(x = 'Datasets', y = 'Seconds', title = 'Time') +
scale_y_log10() + theme_bw()
But I need to transform x-axis to log scale. For that, I need to numericize each dataset to be able to transform them to log scale. Even without transforming them, they look like the one below:
ggplot(data2, aes(x = dataset, y = time, color = tool)) + geom_boxplot() +
labs(x = 'Datasets', y = 'Seconds', title = 'Time') +
scale_y_log10() + theme_bw()
I checked boxplot parameters and grouping parameters of aes, but could not resolve my problem. At first, I thought this problem is caused by scaling to log, but removing those elements did not resolve the problem.
What am I missing exactly? Thanks...
Files are in this link. "data2" is the numericized version of "data1".
Your question was a tough cookie, but I learned something new from it!
Just using group = dataset is not sufficient because you also have the tool variable to look out for. After digging around a bit, I found this post which made use of the interaction() function.
This is the trick that was missing. You want to use group because you are not using a factor for the x values, but you need to include tool in the separation of your data (hence using interaction() which will compute the possible crosses between the 2 variables).
# This is for pretty-printing the axis labels
my_labs <- function(x){
paste0(x/1000, "k")
}
levs <- unique(data2$dataset)
ggplot(data2, aes(x = dataset, y = time, color = tool,
group = interaction(dataset, tool))) +
geom_boxplot() + labs(x = 'Datasets', y = 'Seconds', title = 'Time') +
scale_x_log10(breaks = levs, labels = my_labs) + # define a log scale with your axis ticks
scale_y_log10() + theme_bw()
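
Since the linked files may not be available, here is a self-contained sketch of the same grouping trick on made-up data. The dataset sizes, tool names and timings are all hypothetical; only the group = interaction(dataset, tool) mapping is taken from the answer above.

```r
library(ggplot2)

# Made-up data: 3 dataset sizes x 3 tools x 10 repeated timings
set.seed(7)
data2 <- expand.grid(dataset = c(1000, 10000, 100000),
                     tool    = c("A", "B", "C"),
                     rep     = 1:10)
data2$time <- rlnorm(nrow(data2), meanlog = log(data2$dataset / 1000))

p <- ggplot(data2, aes(x = dataset, y = time, color = tool,
                       group = interaction(dataset, tool))) +
  geom_boxplot() +
  scale_x_log10(breaks = unique(data2$dataset)) +
  scale_y_log10() +
  theme_bw()

# One box per dataset/tool combination
boxes <- layer_data(p, 1)

p
```

With group = dataset alone there would be only three boxes, one per dataset size, with the tools pooled together.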

Best way to calculate number of facets in geom_hline/_vline

When I combine geom_vline() with facet_grid() like so:
DATA <- data.frame(x = 1:6,y = 1:6, f = rep(letters[1:2],3))
ggplot(DATA,aes(x = x,y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(xintercept = 2:3,
colour =c("goldenrod3","dodgerblue3"))
I get an error message stating Error: Aesthetics must be either length 1 or the same as the data (4): colour because there are two lines in each facet and there are two facets. One way to get around this is to use rep(c("goldenrod3","dodgerblue3"),2), but this requires that every time I change the faceting variables, I also have to calculate the number of facets and replace the magic number (2) in the call to rep(), which makes re-using ggplot code so much less nimble.
Is there a way to get the number of facets directly from ggplot for use in this situation?
You could put the xintercept and colour info into a data.frame to pass to geom_vline and then use scale_color_identity.
ggplot(DATA, aes(x = x, y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(data = data.frame(xintercept = 2:3,
colour = c("goldenrod3","dodgerblue3") ),
aes(xintercept = xintercept, color = colour) ) +
scale_color_identity()
This side-steps the issue of figuring out the number of facets, although that could be done by pulling out the number of unique values in the faceting variable with something like length(unique(DATA$f)).
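
To see why the data-frame approach needs no facet count, one can inspect the built layer. This is a small sketch using layer_data(); the name vlines is my own. Because the line data has no f column, ggplot2 repeats it in every panel.

```r
library(ggplot2)

DATA <- data.frame(x = 1:6, y = 1:6, f = rep(letters[1:2], 3))

# Line positions and colours kept together; no facet variable in here
vlines <- data.frame(xintercept = 2:3,
                     colour = c("goldenrod3", "dodgerblue3"),
                     stringsAsFactors = FALSE)

p <- ggplot(DATA, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(f ~ .) +
  geom_vline(data = vlines, aes(xintercept = xintercept, color = colour)) +
  scale_color_identity()

# Two lines in each of the two panels
built <- layer_data(p, 2)
table(built$PANEL)

p
```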

Omitting a Missing x-axis Value in ggplot2 (Convert range to categorical variable)

I am using ggplot to generate a chart that summarises a race made up of several laps. There are 24 participants in the race, numbered 1-12 and 14-25. I am plotting a summary measure for each participant, but ggplot assumes I want the number range 1-25, rather than the categories 1-12, 14-25.
What's the fix for this? Here's the code I am using (the data is sourced from a Google spreadsheet).
sskey='0AmbQbL4Lrd61dHlibmxYa2JyT05Na2pGVUxLWVJYRWc'
library("ggplot2")
require(RCurl)
gsqAPI = function(key,query,gid){ return( read.csv( paste( sep="", 'http://spreadsheets.google.com/tq?', 'tqx=out:csv', '&tq=', curlEscape(query), '&key=', key, '&gid=', curlEscape(gid) ) ) ) }
sin2011racestatsX=gsqAPI(sskey,'select A,B,G',gid='13')
sin2011proximity=gsqAPI(sskey,'select A,B,C',gid='12')
h=sin2011proximity
k=sin2011racestatsX
l=subset(h,lap==1)
ggplot() +
geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
scale_x_discrete(limits =c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL','','SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))+
xlab(NULL) + opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) +
geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
ylab("Position")+
scale_y_discrete(breaks=1:24,limits=1:24)+ opts(legend.position = "none")
Expanding on my cryptic comment, try this:
#Convert these to factors with the appropriate labels
# Note that I removed the ''
h$car <- factor(h$car,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
k$driverNum <- factor(k$driverNum,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
l=subset(h,lap==1)
ggplot() +
geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
ylab("Position") +
scale_y_discrete(breaks=1:24,limits=1:24) + opts(legend.position = "none") +
opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) + xlab(NULL)
Calling scale_x_discrete is no longer necessary. And stylistically, I prefer putting opts and xlab stuff at the end.
Edit
A few notes in response to your comment. Many of your difficulties can be eased by a more streamlined use of ggplot. Your data is in an awkward format:
#ddply() comes from plyr; melt() is assumed to come from reshape2
library(plyr)
library(reshape2)
#Summarise so we can use geom_linerange rather than geom_step
d1 <- ddply(h,.(car),summarise,ymin = min(pos),ymax = max(pos))
#R has a special value for missing data; use it!
k$classification[k$classification == 'null'] <- NA
k$classification <- as.integer(k$classification)
#The other two data sets should be merged and converted to long format
d2 <- merge(l,k,by.x = "car",by.y = "driverNum")
colnames(d2)[3:5] <- c('End of Lap 1','Final Position','Grid Position')
d2 <- melt(d2,id.vars = 1:2)
#Now the plotting call is much shorter
ggplot() +
geom_linerange(data = d1,aes(x= car, ymin = ymin,ymax = ymax)) +
geom_point(data = d2,aes(x= car, y= value,shape = variable),size = 2) +
opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) +
labs(x = NULL, y = "Position", shape = "")
A few notes. You were setting aesthetics to fixed values (size = 2) which should be done outside of aes(). aes() is for mapping variables (i.e. columns) to aesthetics (color, shape, size, etc.). This allows ggplot to intelligently create the legend for you.
Merging the second two data sets and then melting it creates a grouping variable for ggplot to use in the legend. I used the shape aesthetic since a few values overlap; using color may make that hard to spot. In general, ggplot will resist mixing aesthetics into a single legend. If you want to use shape, color and size you'll get three legends.
I prefer setting labels using labs, since you can do them all in one spot. Note that setting the aesthetic label to "" removes the legend title.
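
In current ggplot2 releases opts() and theme_text() are no longer available; theme() and element_text() take their place, and the title can go into labs(). Below is a minimal sketch of the same ideas (factor x axis, fixed size outside aes(), one labs() call) on made-up data, since the spreadsheet may no longer be reachable. The positions are random and the column names grid and final are hypothetical.

```r
library(ggplot2)

# Made-up data: car numbers 1-12 and 14-25 (there is no car 13)
set.seed(3)
d <- data.frame(car   = c(1:12, 14:25),
                grid  = sample(1:24),
                final = sample(1:24))

# A factor has exactly 24 levels, so the axis has no empty slot for 13
d$car <- factor(d$car)

p <- ggplot(d, aes(x = car)) +
  geom_linerange(aes(ymin = pmin(grid, final), ymax = pmax(grid, final))) +
  # size is a fixed value, so it sits outside aes()
  geom_point(aes(y = grid,  shape = "Grid Position"),  size = 2) +
  geom_point(aes(y = final, shape = "Final Position"), size = 2) +
  labs(title = "Race Summary Chart (made-up data)",
       x = NULL, y = "Position", shape = "") +
  theme(axis.text.x = element_text(angle = -90, hjust = 0))

p
```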
