ggplot2 scale_x_discrete value causing uneven axis spacing - r

I have a data frame whose rows contain journal titles, values, and an indicator marking each row as a normal or a highlight data point. I want the plot to preserve the order of the data frame. The following code produces an unevenly spaced y-axis.
require(ggplot2)
title <- c("COGNITION","MUTAT RES-DNA REPAIR","AM J PHYSIOL-CELL PH","AM J PHYSIOL-CELL PH","BLOOD",
"PNAS","BIOCHEM BIOPH RES CO","CLIN CANCER RES","BIOCHEM BIOPH RES CO","MOL THER" )
value <- c(-0.428, -0.637, -0.740, -0.782, -0.880, -1.974, -1.988, -2.029, -2.217, -2.242)
indicator <- c(rep("highlight",5), rep("normal",5))
df <- data.frame(title, value, indicator)
mycolors <- c("highlight" = "blue", "normal" = "red")
x_axis_range <- c((min(df$value)), (max(df$value)))
p <- ggplot(df, aes(x = title, y = value)) +
  geom_point(aes(color = indicator), size = 3) +
  coord_flip() +
  scale_color_manual(values = mycolors) +
  scale_y_continuous(limits = x_axis_range) +
  # produces uneven spacing
  scale_x_discrete(limits = df$title) +
  theme(legend.position = "none")
print(p)
I don't know why ggplot is adding extra space between the MOL THER and CLIN CANCER RES and between the BLOOD and AM J PHYSIOL-CELL PH data points. When I change the scale_x_discrete() line to:
scale_x_discrete(limits=df$title.1) +
This makes the spacing even, but the data are then ordered alphabetically by title from bottom to top.
Why does adding the .1 to the end of limits=df$title even out the spacing? How can I keep the spacing even and still control the order of the data along the y-axis with the order() function?

You get uneven spacing on the discrete scale because df$title supplies 10 values while the plot has only 8 unique titles, so the axis reserves two extra positions for titles that have already been used.
When you provide scale_x_discrete(limits=df$title.1), the limits are effectively ignored: there is no title.1 column in your data, so df$title.1 is NULL.
To get the order you want, provide the unique() values of df$title, converted to character (to keep the original order):
ggplot(df, aes(x = title, y = value)) +
  geom_point(aes(color = indicator), size = 3) +
  coord_flip() +
  scale_color_manual(values = mycolors) +
  scale_y_continuous(limits = x_axis_range) +
  # unique() removes the repeated titles but keeps first-appearance order
  scale_x_discrete(limits = unique(as.character(df$title))) +
  theme(legend.position = "none")
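Both points can be checked without drawing anything, by inspecting what is actually passed to limits. This is only a small sketch that reuses the title vector from the question; the variable names n_values, n_unique, repeated and missing_column are made up for illustration.

```r
# Titles copied from the question
title <- c("COGNITION", "MUTAT RES-DNA REPAIR", "AM J PHYSIOL-CELL PH",
           "AM J PHYSIOL-CELL PH", "BLOOD", "PNAS", "BIOCHEM BIOPH RES CO",
           "CLIN CANCER RES", "BIOCHEM BIOPH RES CO", "MOL THER")
df <- data.frame(title)

n_values <- length(df$title)                        # entries passed as limits
n_unique <- length(unique(as.character(df$title)))  # distinct axis positions
repeated <- unique(as.character(df$title[duplicated(df$title)]))

# There is no column called title.1, so `$` returns NULL
missing_column <- df$title.1

print(c(values = n_values, unique = n_unique))
print(repeated)                 # the titles that appear more than once
print(is.null(missing_column))
```

The two repeated titles are the ones that sit next to the extra gaps described in the question.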

Related

create a scatterplot with x axis coming from one dataset and y axis coming from a second dataset and colour the points according to the dataframe

I have two data frames containing results from an epigenetic analysis.
The column from df1 that matters for the plot is labelled beta_ADHD.
The column from df2 that matters for the plot is also labelled beta_ADHD.
I would like to put the column from df1 on the x-axis and the column from df2 on the y-axis.
I would also like to label the points on the graph according to the data set they come from.
This is what I've tried so far, but nothing has worked yet:
ggp <- ggplot(NULL, aes(Beta_ADHD, Beta_ADHD)) + # Draw ggplot2 plot based on two data frames
geom_point(data = df1, col = "red") +
geom_point(data = df2, col = "blue")
ggp # Draw plot
and i also tried this:
ggplot(data=data.frame(x=df1$Beta_ADHD, y=df2$Beta_ADHD), aes(x=x, y=y)) + geom_point()
I'm at a complete loss here and any help would be greatly appreciated.
I think you need to combine the inputs into a single data frame in order to use them as co-ordinates for a scatter plot. (Also, the 2 data sets must have the same number of values.)
I don't believe it makes sense to label or colour the points according to which data set they are from. As we are taking the x-coordinate from df1 and the y-coordinate from df2, that means that every point comes from both data sets. It is the labels on the x-axis beta_ADHD1 and y-axis beta_ADHD2 that show which data set the value came from. You can change the text and color of the axis titles using xlab(), ylab() and theme().
# create some sample data
df1 <- data.frame(beta_ADHD=runif(100,0,10))
df2 <- data.frame(beta_ADHD=rnorm(100,0,10))
# create a new data frame containing the required co-ordinates
# the values from df1 are named beta_ADHD1 and the values from df2 are named beta_ADHD2
new_df <- data.frame(beta_ADHD1 = df1$beta_ADHD, beta_ADHD2 = df2$beta_ADHD)
# plot this data using ggplot
ggplot(new_df, aes(x = beta_ADHD1, y = beta_ADHD2)) + geom_point() +
xlab('beta_ADHD from df1') + ylab('beta_ADHD from df2') +
theme(axis.title.x = element_text(color ='red'), axis.title.y = element_text(color = 'blue'))
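If the two data sets do not have the same number of rows, or the rows are not in the same order, pairing values by position would be misleading. The sketch below assumes both data frames carry a shared identifier column, here called probe_id; that column and the values are hypothetical and do not come from the question. It uses merge() to pair the values by key instead of by row position.

```r
# Toy data: probe_id and the beta values are invented for illustration
df1 <- data.frame(probe_id  = c("cg01", "cg02", "cg03", "cg04"),
                  beta_ADHD = c(0.10, 0.25, 0.40, 0.55))
df2 <- data.frame(probe_id  = c("cg03", "cg01", "cg04", "cg05"),
                  beta_ADHD = c(0.38, 0.12, 0.60, 0.20))

# Keep only identifiers present in both data sets; the suffixes
# distinguish the two beta_ADHD columns in the result
both <- merge(df1, df2, by = "probe_id", suffixes = c("_df1", "_df2"))
print(both)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(both, aes(x = beta_ADHD_df1, y = beta_ADHD_df2)) +
    geom_point() +
    xlab("beta_ADHD from df1") + ylab("beta_ADHD from df2")
}
```

Identifiers that occur in only one data set (cg02 and cg05 here) are dropped, so every plotted point has both an x and a y value.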

Spacing of discrete axis by a categorical variable

I have a categorical axis where I'd like to visually separate groups within that categorical variable. I don't want to facet because it takes up too much space and is visually not as clean.
Here's a visual example of what I want that involves some tedious hacking (setting alpha to 0 for non-data entries used for spacing).
library(ggplot2)
dd <- data.frame(x=factor(c(1,-1,2:10),levels=c(1,-1,2:10)), y=c(1,2,2:10), hidden=as.factor(c(0,1,rep(0,9))))
ggplot(data=dd,aes(x=x,y=y,alpha=hidden)) +
geom_point() + scale_alpha_manual(values=c("1"=0,"0"=1)) +
scale_x_discrete(breaks=c(1:10))
I'd like to be able to create this plot without having to hack an extra category in (which wouldn't be feasible with the amount of data/number of groups I'm trying to plot), using the following data structure (where the variable "groups" determines where the spacing occurs):
dd2 <- data.frame(x=factor(1:10), y=c(1:10), groups=c("A",rep("B",9)))
You can get the result you are looking for via the breaks and limits arguments to scale_x_discrete. Set the breaks to the levels of the factor on the x-axis and the limits to the factor levels with spacers where you want/need them.
Here is an example:
library(ggplot2)
dd <- data.frame(x = factor(letters[1:10]), y = 1:10)
ggplot(dd) +
aes(x = x, y = y) +
geom_point() +
scale_x_discrete(breaks = levels(dd$x),
limits = c(levels(dd$x)[1], "skip", levels(dd$x)[-1]))
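With many groups, writing the limits by hand becomes tedious. One possible way to build them from the groups column of dd2 is sketched below. The helper limits_with_spacers() is hypothetical (it is not part of ggplot2), and it assumes each x value belongs to exactly one group.

```r
dd2 <- data.frame(x = factor(1:10), y = 1:10, groups = c("A", rep("B", 9)))

# Hypothetical helper: return the x values in group order, with one
# uniquely named spacer entry between consecutive groups
limits_with_spacers <- function(x, groups) {
  x <- as.character(x)
  groups <- as.character(groups)
  # split by group, keeping groups in order of first appearance
  pieces <- split(x, factor(groups, levels = unique(groups)))
  out <- character(0)
  for (i in seq_along(pieces)) {
    if (i > 1) out <- c(out, paste0("spacer_", i - 1))
    out <- c(out, unique(pieces[[i]]))
  }
  out
}

x_limits <- limits_with_spacers(dd2$x, dd2$groups)
print(x_limits)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dd2, aes(x = x, y = y)) +
    geom_point() +
    # breaks lists only the real levels, so spacers get no tick or label
    scale_x_discrete(breaks = levels(dd2$x), limits = x_limits)
}
```

The spacer names only need to differ from the real levels and from each other; because they are left out of breaks, they never show up on the axis.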

How to adjust the ordering of labels in the default legend in ggplot2 so that it corresponds to the order in the data

I am plotting a forest plot in ggplot2 and am having issues with the ordering of the labels in the legend matching the order of the labels in the data set. Here is my code below.
data code
d<-data.frame(x=c("Co-K(W) N=720", "IH-K(W) N=67", "IF-K(W) N=198", "CO-K(B)N=78", "IH-K(B) N=13", "CO=A(W) N=874","D-Sco Ad(W) N=346","DR-Ad (W) N=892","CE_A(W) N=274","CO-Ad(B) N=66","D-So Ad(B) N=215","DR-Ad(B) N=123","CE-Ad(B) N=79"),
y = rnorm(13, 0, 0.1))
d <- transform(d, ylo = y-1/13, yhi=y+1/13)
d$x <- factor(d$x, levels=rev(d$x)) # reverse ordering
forest plot code
credplot.gg <- function(d){
  # d is a data frame with 4 columns
  # d$x gives variable names
  # d$y gives center point
  # d$ylo gives lower limits
  # d$yhi gives upper limits
  require(ggplot2)
  p <- ggplot(d, aes(x=x, y=y, ymin=ylo, ymax=yhi, group=x, colour=x)) +
    geom_pointrange(size=1) +
    theme_bw() +
    scale_color_discrete(name="Sample") +
    coord_flip() +
    theme(legend.key=element_rect(fill='cornsilk2')) +
    guides(colour = guide_legend(override.aes = list(size=0.5))) +
    # reference line at y = 0 (drawn vertically because of coord_flip)
    geom_hline(yintercept=0, colour='red', lty=2) +
    xlab('Cohort') + ylab('CI') + ggtitle('Forest Plot')
  return(p)
}
credplot.gg(d)
This is what I get. As you can see, the labels on the y-axis match the order of the labels in the data. However, the legend does not follow the same order, and I'm not sure how to correct this. This is my first time creating a plot in ggplot2. Any feedback is much appreciated. Thanks in advance.
Nice plot, especially for a first ggplot! I've not tested, but I think all you need is to add reverse=TRUE inside your colour's guide_legend() (found this in the Cookbook for R).
If I were to make one more comment, I'd say that ordering your vertical factor by numeric value often makes comparisons easier when alphabetical order isn't particularly meaningful. (Though maybe your alpha order is meaningful.)
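Both suggestions can be sketched on a cut-down version of the data. The three labels are taken from the question, but the y values are made up, and the column name x_by_value is only illustrative. The legend follows the factor levels, which the question reversed, so reverse = TRUE in guide_legend() flips it back; reorder() is one way to order the factor by numeric value instead.

```r
labels <- c("Co-K(W) N=720", "IH-K(W) N=67", "IF-K(W) N=198")
d <- data.frame(x = labels, y = c(0.05, -0.10, 0.02))  # made-up y values
d <- transform(d, ylo = y - 1/13, yhi = y + 1/13)
d$x <- factor(d$x, levels = rev(d$x))  # same reversal as in the question

# The legend lists the levels in this (reversed) order by default
print(levels(d$x))

# Alternative: order the factor by its numeric value rather than by row
d$x_by_value <- reorder(d$x, d$y)
print(levels(d$x_by_value))

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x = x, y = y, ymin = ylo, ymax = yhi, colour = x)) +
    geom_pointrange() +
    coord_flip() +
    # reverse = TRUE flips the legend so it reads in data order
    guides(colour = guide_legend(reverse = TRUE,
                                 override.aes = list(size = 0.5)))
}
```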

ggplot 2 - Change legend categories with numeric values (no factors)

Suppose I work with the mtcars data set. I would like to set the size of the points according to the weight (wt). If I do that as shown below, R/ggplot2 will give me a legend with 4 categories (2,3,4,5).
library(ggplot2)
mtc <- mtcars
p1 <- ggplot(mtc, aes(x = hp, y = mpg))
p1 <- p1 + geom_point(aes(size = wt))
print(p1)
How can I change the scale/names/categories of the legend? I found information on how to do that when the "categories" are factors, but I don't know how to do it with numeric values. I need to keep them numeric; otherwise the point sizes no longer work.
My real data set has about 100 values for wt (everything from 1-150) and I want to keep 5 values. (ggplot 2 gives me 2 -> 50 and 100)
1) How can I change the scale of that legend? In the mtc example for example I just want 2 points of size 2 and 5
2) I was thinking about making categories such as:
mtc$wtCat[which(mtc$wt<=2)]=1
mtc$wtCat[which(mtc$wt>2 & mtc$wt<=3)]=2
mtc$wtCat[which(mtc$wt>3)]=3
p1 <- ggplot(mtc, aes(x = hp, y = mpg))
p2 <- p1 + geom_point(aes(size = wtCat), stat="identity")
print(p2)
and then just rename 1,2,3 in the legend to <=2, 2-3 and >3, but I couldn't figure out how to do that either.
Thank you so much.
You can use scale_size_continuous(): the breaks= argument sets the values you want to see in the legend, and the labels= argument changes how the legend entries are labelled.
ggplot(mtcars,aes(hp,mpg,size=wt))+geom_point()+
scale_size_continuous(breaks=c(2,5),labels=c("<=2",">2"))
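For the second part of the question (binning wt into categories with readable legend labels), one possible approach is cut() together with scale_size_manual(). This is a sketch rather than part of the original answer; the point sizes 1, 3 and 5 are arbitrary choices.

```r
mtc <- mtcars

# Bin wt into the three categories from the question:
# (-Inf, 2], (2, 3], (3, Inf)
mtc$wtCat <- cut(mtc$wt,
                 breaks = c(-Inf, 2, 3, Inf),
                 labels = c("<=2", "2-3", ">3"))
print(table(mtc$wtCat))

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtc, aes(x = hp, y = mpg, size = wtCat)) +
    geom_point() +
    # one explicit point size per category; the values are arbitrary
    scale_size_manual(values = c("<=2" = 1, "2-3" = 3, ">3" = 5),
                      name = "wt")
}
```

Because wtCat is a factor, the legend shows the three labels directly and nothing has to be renamed afterwards.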

Omitting a Missing x-axis Value in ggplot2 (Convert range to categorical variable)

I am using ggplot to generate a chart that summarises a race made up of several laps. There are 24 participants in the race, numbered 1-12 and 14-25. I am plotting a summary measure for each participant using ggplot, but ggplot assumes I want the number range 1-25 rather than the categories 1-12, 14-25.
What's the fix for this? Here's the code I am using (the data is sourced from a Google spreadsheet).
sskey='0AmbQbL4Lrd61dHlibmxYa2JyT05Na2pGVUxLWVJYRWc'
library("ggplot2")
require(RCurl)
gsqAPI = function(key,query,gid){ return( read.csv( paste( sep="", 'http://spreadsheets.google.com/tq?', 'tqx=out:csv', '&tq=', curlEscape(query), '&key=', key, '&gid=', curlEscape(gid) ) ) ) }
sin2011racestatsX=gsqAPI(sskey,'select A,B,G',gid='13')
sin2011proximity=gsqAPI(sskey,'select A,B,C',gid='12')
h=sin2011proximity
k=sin2011racestatsX
l=subset(h,lap==1)
ggplot() +
geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
scale_x_discrete(limits =c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL','','SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))+
xlab(NULL) + ggtitle("F1 2011 Korea \nRace Summary Chart") + theme(axis.text.x=element_text(angle=-90, hjust=0)) +
geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
ylab("Position")+
scale_y_discrete(breaks=1:24,limits=1:24)+ theme(legend.position = "none")
Expanding on my cryptic comment, try this:
#Convert these to factors with the appropriate labels
# Note that I removed the ''
h$car <- factor(h$car,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
k$driverNum <- factor(k$driverNum,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
l=subset(h,lap==1)
ggplot() +
  geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
  geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
  geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
  geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
  ylab("Position") +
  scale_y_discrete(breaks=1:24,limits=1:24) + theme(legend.position = "none") +
  # theme(), ggtitle() and element_text() replace the defunct opts() and theme_text()
  ggtitle("F1 2011 Korea \nRace Summary Chart") +
  theme(axis.text.x=element_text(angle=-90, hjust=0)) + xlab(NULL)
Calling scale_x_discrete is no longer necessary. And stylistically, I prefer putting the theme and xlab stuff at the end.
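The reason the factor conversion removes the gap can be seen without the race data. The sketch below uses only the car numbers described in the question (1-12 and 14-25); the variable names are made up. A factor has one level per value that actually occurs, so 13 never gets a position on a discrete axis, and labels= renames the levels in sorted order of the underlying values.

```r
# Car numbers as described in the question: 1-12 and 14-25, no 13
car_numbers <- c(1:12, 14:25)
car <- factor(car_numbers)

print(nlevels(car))             # one level per car that exists
print("13" %in% levels(car))    # the missing number is not a level

# labels= is matched to the sorted unique values: 1 -> VET, 2 -> WEB, 14 -> SUT
small <- factor(c(14, 1, 2, 1), labels = c("VET", "WEB", "SUT"))
print(as.character(small))
```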
Edit
A few notes in response to your comment. Many of your difficulties can be eased by a more streamlined use of ggplot. Your data is in an awkward format:
library(plyr)      # ddply()
library(reshape2)  # melt(); package assumed, the answer does not name it
#Summarise so we can use geom_linerange rather than geom_step
d1 <- ddply(h,.(car),summarise,ymin = min(pos),ymax = max(pos))
#R has a special value for missing data; use it!
k$classification[k$classification == 'null'] <- NA
#as.character() first, so a factor is converted by value and not by level code
k$classification <- as.integer(as.character(k$classification))
#The other two data sets should be merged and converted to long format
d2 <- merge(l,k,by.x = "car",by.y = "driverNum")
colnames(d2)[3:5] <- c('End of Lap 1','Final Position','Grid Position')
d2 <- melt(d2,id.vars = 1:2)
#Now the plotting call is much shorter
ggplot() +
  geom_linerange(data = d1,aes(x= car, ymin = ymin,ymax = ymax)) +
  geom_point(data = d2,aes(x= car, y= value,shape = variable),size = 2) +
  ggtitle("F1 2011 Korea \nRace Summary Chart") +
  theme(axis.text.x=element_text(angle=-90, hjust=0)) +
  labs(x = NULL, y = "Position", shape = "")
A few notes. You were setting aesthetics to fixed values (size = 2) which should be done outside of aes(). aes() is for mapping variables (i.e. columns) to aesthetics (color, shape, size, etc.). This allows ggplot to intelligently create the legend for you.
Merging the second two data sets and then melting it creates a grouping variable for ggplot to use in the legend. I used the shape aesthetic since a few values overlap; using color may make that hard to spot. In general, ggplot will resist mixing aesthetics into a single legend. If you want to use shape, color and size you'll get three legends.
I prefer setting labels using labs, since you can do them all in one spot. Note that setting the aesthetic label to "" removes the legend title.
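The difference between mapping and setting, and the effect of the long format on the legend, can be sketched on a tiny invented data set. The car codes come from the question, but the positions are toy values and not real race results; the base R construction of the long data frame stands in for melt() so that the sketch needs no extra packages.

```r
# Toy wide data: one row per car, one column per measurement (invented values)
wide <- data.frame(car   = c("VET", "WEB", "HAM"),
                   grid  = c(1, 2, 3),
                   final = c(1, 3, 2))

# Long format: one row per car and measurement, with a grouping column
long <- data.frame(
  car      = rep(wide$car, times = 2),
  variable = rep(c("Grid Position", "Final Position"), each = nrow(wide)),
  value    = c(wide$grid, wide$final)
)
print(long)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = car, y = value, shape = variable)) +  # mapped: gets a legend
    geom_point(size = 2) +                                        # set: fixed, no legend
    labs(x = NULL, y = "Position", shape = "")
}
```

Here shape is mapped to a column inside aes() and therefore produces a legend, while size = 2 is a fixed value set outside aes() and does not.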

Resources