I am trying to use scale_y_continuous() with a faceted histogram and running into an issue. I am hoping to get each count to be a percentage instead. My code is:
ggplot(d, aes(x = likely_att)) +
geom_histogram(binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())
It looks like the distributions themselves are accurate, but the scaling is off: the percentages are "200 000%", "5 000%", etc. and that seems wrong, but I'm not quite sure why it's happening.
There are many more "yes" than "no" or "separated" married values in my dataset, which is why I use scales = "free_y" and why I'm hoping to just have percentages shown and only need one axis value shown.
I can't share this exact data for privacy reasons, but the likely_att variable is just a 1-5 numeric var, and married is a character var with 3 values: yes, no, separated.
In case it's helpful, I basically want it to look just like this image, but with percentages instead of counts, so I can just have one single y axis on the far left with 0 - 100 %
The problem is that using the percentage_format() function changes the way the labels are printed, but it doesn't actually rescale the numbers. To do that, you could use the density constructed variable and multiply it by the bin-width, then use the percent formatting.
ggplot(d, aes(x = likely_att)) +
stat_bin(aes(y=..density..*.5, group = married),
binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())
So there's several useful pages up about marking means on boxplots with multiple series; but even with those I'm having an issue where I can't select a color for the points and still show the two different means. I can do this:
library(ggplot2)
d <- subset(mpg,class=="compact"|class=="midsize")
ggplot(d,aes(drv,hwy,color=class)) + geom_boxplot() + scale_color_manual(values=c("blue","orange")) +
stat_summary(fun=mean,size=.5,shape=5,position=position_dodge(width=.75))
And that gives me the two different means, but they're the same color as the boxplots themselves and so not the best to look at.
So I add a color specification into the code:
ggplot(d,aes(drv,hwy,color=class)) + geom_boxplot() + scale_color_manual(values=c("blue","orange")) +
stat_summary(fun=mean,size=.5,color="black",shape=5,position=position_dodge(width=.75))
But then it's only showing the one mean.
So what am I missing here to get both a specified color and the multiple means being marked?
When you overwrite the colour aesthetic in stat_summary() you also lose
the grouping information. You need to bring it back explicitly with aes(group = class):
library(ggplot2)
d <- subset(mpg, class == "compact" | class == "midsize")
ggplot(d, aes(drv, hwy, color = class)) +
geom_boxplot() +
stat_summary(
aes(group = class),
colour = "black",
fun = mean,
size = .5,
shape = 5,
position = position_dodge(width = .75)
)
#> Warning: Removed 4 rows containing missing values (geom_segment).
Using fill to color the box, and color for stat_summary you get the desired output.
ggplot(d,aes(drv,hwy, fill=class)) + geom_boxplot() + scale_fill_manual(values=c("cyan","orange")) +
stat_summary(fun=mean,size=.5, color="red",
shape=5,position=position_dodge(width=.75))
I am trying to get a boxplot with 3 different tools in each dataset size like the one below:
ggplot(data1, aes(x = dataset, y = time, color = tool)) + geom_boxplot() +
labs(x = 'Datasets', y = 'Seconds', title = 'Time') +
scale_y_log10() + theme_bw()
But I need to transform x-axis to log scale. For that, I need to numericize each dataset to be able to transform them to log scale. Even without transforming them, they look like the one below:
ggplot(data2, aes(x = dataset, y = time, color = tool)) + geom_boxplot() +
labs(x = 'Datasets', y = 'Seconds', title = 'Time') +
scale_y_log10() + theme_bw()
I checked boxplot parameters and grouping parameters of aes, but could not resolve my problem. At first, I thought this problem is caused by scaling to log, but removing those elements did not resolve the problem.
What am I missing exactly? Thanks...
Files are in this link. "data2" is the numericized version of "data1".
Your question was a tough cookie, but I learned something new from it!
Just using group = dataset is not sufficient because you also have the tool variable to look out for. After digging around a bit, I found this post which made use of the interaction() function.
This is the trick that was missing. You want to use group because you are not using a factor for the x values, but you need to include tool in the separation of your data (hence using interaction() which will compute the possible crosses between the 2 variables).
# This is for pretty-printing the axis labels
my_labs <- function(x){
paste0(x/1000, "k")
}
levs <- unique(data2$dataset)
ggplot(data2, aes(x = dataset, y = time, color = tool,
group = interaction(dataset, tool))) +
geom_boxplot() + labs(x = 'Datasets', y = 'Seconds', title = 'Time') +
scale_x_log10(breaks = levs, labels = my_labs) + # define a log scale with your axis ticks
scale_y_log10() + theme_bw()
This plots
When I combine geom_vline() with facet_grid() like so:
DATA <- data.frame(x = 1:6,y = 1:6, f = rep(letters[1:2],3))
ggplot(DATA,aes(x = x,y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(xintercept = 2:3,
colour =c("goldenrod3","dodgerblue3"))
I get an error message stating Error: Aesthetics must be either length 1 or the same as the data (4): colour because there are two lines in each facet and there are two facets. One way to get around this is to use rep(c("goldenrod3","dodgerblue3"),2), but this requires that every time I change the faceting variables, I also have to calculate the number of facets and replace the magic number (2) in the call to rep(), which makes re-using ggplot code so much less nimble.
Is there a way to get the number of facets directly from ggplot for use in this situation?
You could put the xintercept and colour info into a data.frame to pass to geom_vline and then use scale_color_identity.
ggplot(DATA, aes(x = x, y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(data = data.frame(xintercept = 2:3,
colour = c("goldenrod3","dodgerblue3") ),
aes(xintercept = xintercept, color = colour) ) +
scale_color_identity()
This side-steps the issue of figuring out the number of facets, although that could be done by pulling out the number of unique values in the faceting variable with something like length(unique(DATA$f)).
I am using ggplot to generate a chart that summarises a race made up from several laps. There are 24 participants in the race,numbered 1-12, 14-25; I am plotting out a summary measure for each participant using ggplot, but ggplot assumes I want the number range 1-25, rather than categories 1-12, 14-25.
What's the fix for this? Here's the code I am using (the data is sourced from a Google spreadsheet).
sskey='0AmbQbL4Lrd61dHlibmxYa2JyT05Na2pGVUxLWVJYRWc'
library("ggplot2")
require(RCurl)
gsqAPI = function(key,query,gid){ return( read.csv( paste( sep="", 'http://spreadsheets.google.com/tq?', 'tqx=out:csv', '&tq=', curlEscape(query), '&key=', key, '&gid=', curlEscape(gid) ) ) ) }
sin2011racestatsX=gsqAPI(sskey,'select A,B,G',gid='13')
sin2011proximity=gsqAPI(sskey,'select A,B,C',gid='12')
h=sin2011proximity
k=sin2011racestatsX
l=subset(h,lap==1)
ggplot() +
geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
scale_x_discrete(limits =c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL','','SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))+
xlab(NULL) + opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) +
geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
ylab("Position")+
scale_y_discrete(breaks=1:24,limits=1:24)+ opts(legend.position = "none")
Expanding on my cryptic comment, try this:
#Convert these to factors with the appropriate labels
# Note that I removed the ''
h$car <- factor(h$car,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
k$driverNum <- factor(k$driverNum,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
l=subset(h,lap==1)
ggplot() +
geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
ylab("Position") +
scale_y_discrete(breaks=1:24,limits=1:24) + opts(legend.position = "none") +
opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) + xlab(NULL)
Calling scale_x_discrete is no longer necessary. And stylistically, I prefer putting opts and xlab stuff at the end.
Edit
A few notes in response to your comment. Many of your difficulties can be eased by a more streamlined use of ggplot. Your data is in an awkward format:
#Summarise so we can use geom_linerange rather than geom_step
d1 <- ddply(h,.(car),summarise,ymin = min(pos),ymax = max(pos))
#R has a special value for missing data; use it!
k$classification[k$classification == 'null'] <- NA
k$classification <- as.integer(k$classification)
#The other two data sets should be merged and converted to long format
d2 <- merge(l,k,by.x = "car",by.y = "driverNum")
colnames(d2)[3:5] <- c('End of Lap 1','Final Position','Grid Position')
d2 <- melt(d2,id.vars = 1:2)
#Now the plotting call is much shorter
ggplot() +
geom_linerange(data = d1,aes(x= car, ymin = ymin,ymax = ymax)) +
geom_point(data = d2,aes(x= car, y= value,shape = variable),size = 2) +
opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) +
labs(x = NULL, y = "Position", shape = "")
A few notes. You were setting aesthetics to fixed values (size = 2) which should be done outside of aes(). aes() is for mapping variables (i.e. columns) to aesthetics (color, shape, size, etc.). This allows ggplot to intelligently create the legend for you.
Merging the second two data sets and then melting it creates a grouping variable for ggplot to use in the legend. I used the shape aesthetic since a few values overlap; using color may make that hard to spot. In general, ggplot will resist mixing aesthetics into a single legend. If you want to use shape, color and size you'll get three legends.
I prefer setting labels using labs, since you can do them all in one spot. Note that setting the aesthetic label to "" removes the legend title.