How can I add text labels to points rendered with geom_jitter? geom_text will not work because I don't know the coordinates of the jittered dots. Is it possible to capture the positions of the jittered points so I can pass them to geom_text?
My practical use case is to plot a boxplot with geom_jitter over it to show the data distribution, and I would like to label the outlier dots or the ones that match a certain condition (for example, the lowest 10% of the values used to color the points).
One solution would be to capture the xy positions of the jittered points and use them later in another layer. Is that possible?
[update]
From Joran's answer, a solution is to calculate the jittered values with the jitter function from the base package, add them to a data frame, and use them with geom_point. For filtering, he used ddply to create a filter column (a logical vector) and used it to subset the data in geom_text.
He asked for a minimal dataset. I just modified his example (adding a unique identifier in the label column):
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''))
This is the result of Joran's example with my data, with the displayed ids restricted to the lowest 1%.
And this is a modification of the code to have colors by another variable and displaying some values of this variable (the lowest 1% for each group):
library("ggplot2")
library("plyr") # provides ddply() and .()
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''),quality= rnorm(300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create indicator variables that pick out, within each level of x,
# the obs in the lowest 1% of y and of quality
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.01);
g$top_q <- g$quality <= quantile(g$quality,0.01);
g})
#Create a boxplot, overlay the jittered points and
# label the bottom 1% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj,colour=quality)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab)) +
geom_text(data=subset(datJit,top_q),aes(x=xj,label=sprintf("%0.2f",quality)))
Your question isn't completely clear; for example, you mention labeling points in one place and coloring points in another, so I'm not sure which you really mean, or perhaps both. A reproducible example would be very helpful. But using a little guesswork on my part, the following code does what I think you're describing:
library(ggplot2)
library(plyr) # provides ddply() and .()
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=rep('label',300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create an indicator variable that picks out those
# obs that are in the lowest 10% of y within each level of x
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.1); g})
#Create a boxplot, overlay the jittered points and
# label the bottom 10% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab))
Just an addition to Joran's wonderful solution:
I ran into trouble with the x-axis positioning when I tried to use it in a faceted plot with facet_wrap(). The problem is that ggplot2 uses 1 as the x-value on every facet. The solution is to create a vector of jittered 1s:
datJit$xj <- jitter(rep(1,length(dat$x)),amount=0.1)
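An alternative that avoids computing the jitter by hand is to let ggplot2 jitter both layers identically. The sketch below assumes a ggplot2 version in which position_jitter() accepts a seed argument; the column names low and lab_shown are made up for this example. Both layers must see the same rows, so unwanted labels are blanked rather than filtered out.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rep(letters[1:3], times = 100),
                  y = runif(300),
                  lab = paste0("id_", 1:300))

# Flag the lowest 10% of y within each level of x (base R, no plyr needed)
dat$low <- dat$y <= ave(dat$y, dat$x, FUN = function(v) quantile(v, 0.1))

# Keep every row but blank the labels we do not want to show,
# so that points and labels receive the same jitter offsets
dat$lab_shown <- ifelse(dat$low, dat$lab, "")

# One position object with a fixed seed, shared by both layers
pos <- position_jitter(width = 0.2, height = 0, seed = 42)

p <- ggplot(dat, aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = pos) +
  geom_text(aes(label = lab_shown), position = pos, vjust = -0.8, size = 3)
```

Printing p draws the plot. Subsetting the data in geom_text instead of blanking labels would change the number of rows in that layer, and the jitter would then no longer match the points.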
I've done the rounds here and via Google without finding a solution, so please help if you can.
I'm looking to create something like this using ggplot2: [image: painSensitivityHeatMap]
I can create something similar using geom_tile, but without the smoothing between data points. The only solution I have found requires a lot of code and data interpolation, which is not very elegant. [image: uglySolutionUsingTile]
So I'm thinking, I could coerce the density2d plots to my purposes instead by having the plot use fixed values rather than a calculated data-point density -- much in the same way that stat='identity' can be used in histograms to make them represent data values, rather than data counts.
So a minimal working example:
df <- expand.grid(letters[1:5], LETTERS[1:5])
df$value <- sample(1:4, 25, replace=TRUE)
# A not so pretty, non-smooth tile plot
ggplot(df, aes(x=Var1, y=Var2, fill=value)) + geom_tile()
# A potentially beautiful density2d plot, except it fails :-(
ggplot(df, aes(x=Var1, y=Var2)) + geom_density2d(aes(color=..value..))
This took me a little while, but here is a solution for future reference.
It uses idw from the gstat package and spsample from the sp package.
I've written a function which takes a data frame, the number of blocks (tiles), and a lower and upper anchor for the colour scale.
The function creates a polygon (a simple 5x5 square) and from that creates a grid of that shape.
In my data, the location variables are ordered factors, therefore I unclass them into numbers (1 to 5, corresponding to the polygon grid) and convert them to coordinates, thus converting tmpDF from a data frame to a spatial data frame. Note: there are no overlapping or duplicate locations, i.e. there are 25 observations corresponding to the 5x5 grid.
The idw function fills in the polygon-grid (newdata) with inverse-distance weighted values ... in other words, it interpolates my data to the full polygon grid of a given number of tiles ('blocks').
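Finally, I create a ggplot based on a color gradient from the colorRamps package:
library(sp)         # Polygon, spsample, coordinates
library(gstat)      # idw
library(dplyr)      # %>%, mutate, filter
library(ggplot2)
library(colorRamps) # matlab.like2
painMapLumbar <- function(tmpDF, blocks=2500, lowLimit=min(tmpDF$value), highLimit=max(tmpDF$value)) {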
# Create polygon to represent the lower back (lumbar)
poly <- Polygon(matrix(c(0.5, 0.5,0.5, 5.5,5.5, 5.5,5.5, 0.5,0.5, 0.5), ncol=2, byrow=TRUE))
# Create a grid of datapoints from the polygon
polyGrid <- spsample(poly, n=blocks, type="regular")
# Filter out the data for the figure we want
tmpDF <- tmpDF %>% mutate(x=unclass(x)) %>% mutate(y=unclass(y))
tmpDF <- tmpDF %>% filter(y<6) # Lumbar region only
coordinates(tmpDF) <- ~x+y
# Interpolate the data as Inverse Distance Weighted
invDistanceWeigthed <- as.data.frame(idw(formula = value ~ 1, locations = tmpDF, newdata = polyGrid))
p <- ggplot(invDistanceWeigthed, aes(x=x1, y=x2, fill=var1.pred)) + geom_tile() + scale_fill_gradientn(colours=matlab.like2(100), limits=c(lowLimit,highLimit))
return(p)
}
I hope this is useful to someone ... thanks for the replies above ... they helped me move on.
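To make the interpolation step concrete, here is a minimal base-R sketch of inverse-distance weighting on a regular grid. It is not the gstat implementation: idw_sketch is a hypothetical helper, the power of 2 is an assumption, and the 5x5 observations are random stand-ins for the real data.

```r
# Hypothetical helper: inverse-distance-weighted prediction at grid points.
# obs needs columns x, y, value; grid needs columns x, y.
idw_sketch <- function(obs, grid, power = 2) {
  # Distance matrix: one row per grid point, one column per observation
  d <- sqrt(outer(grid$x, obs$x, "-")^2 + outer(grid$y, obs$y, "-")^2)
  # Weights fall off with distance; pmax() guards against division by zero
  w <- 1 / pmax(d, 1e-12)^power
  # Weighted mean of the observed values for every grid point
  as.vector((w %*% obs$value) / rowSums(w))
}

set.seed(1)
obs <- expand.grid(x = 1:5, y = 1:5)
obs$value <- as.numeric(sample(1:4, 25, replace = TRUE))

# 50 x 50 = 2500 tiles covering the same square as the polygon above
grid <- expand.grid(x = seq(0.5, 5.5, length.out = 50),
                    y = seq(0.5, 5.5, length.out = 50))
grid$pred <- idw_sketch(obs, grid)
```

The resulting grid data frame can be passed to ggplot with aes(x=x, y=y, fill=pred) and geom_tile(), in the same way as the idw output in the function above.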
I'm using R to read and plot data from NetCDF files (ncdf4). I only started using R recently, so please bear with me if I'm confused.
Let's say that from the files I obtain N 2-D matrices of numerical values, each with different dimensions and many NA values.
I need to plot histograms of these values in the same plot, with bins of a given width and within given limits, the same for every matrix.
For just one matrix, I can do this:
library(ncdf4)
library(ggplot2)
file0 <- nc_open("test.nc")
#Read a variable
prec0 <- ncvar_get(file0,"pr")
#Some settings
min_plot=0
max_plot=30
bin_width=2
xlabel="mm/day"
ylabel="PDF"
title="Precipitation"
#Get maximum of array, exclude NAs
maximum_prec0=max(prec0, na.rm=TRUE)
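#Store the histogram (breaks extend to the first multiple of bin_width at or above the maximum,
# otherwise hist() fails when the maximum lies beyond the last break)
histo_prec0 <- hist(prec0, xlim=c(min_plot,max_plot), right=FALSE, breaks=seq(0,ceiling(maximum_prec0/bin_width)*bin_width,by=bin_width))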
#Plot the histogram densities using points instead of bars, which is what we want
qplot(histo_prec0$mids, histo_prec0$density, xlim=c(min_plot,max_plot), color=I("yellow"), xlab=xlabel, ylab=ylabel, main=title, log="y")
#If necessary, can transform matrix to vector using
#vector_prec0 <- c(prec0)
However, it occurs to me that it would be best to use a data frame for plotting multiple matrices. I'm not certain of that, nor of how to do it. This would also allow for automatic legends and all the other advantages that come from using data frames with ggplot2.
What I want to achieve is something akin to this:
https://copy.com/thumbs_public/j86WLyOWRs4N1VTi/scatter_histo.jpg?size=1024
Where on Y we have the Density and on X the bins.
Thanks in advance.
To be honest, it is unclear what you are after (scatter plot or histogram of data with values as points?).
Here are a couple of examples using ggplot which might fit your goals (based on your last sentence: "Where on Y we have the Density and on X the bins"):
library(ggplot2)
# some data
nsample<- 200
d1<- rnorm(nsample,1,0.5)
d2<- rnorm(nsample,2,0.6)
#transformed into histogram bins and collected in a data frame
# (plot=FALSE skips the base-graphics plot that hist() draws by default)
hist.d1<- hist(d1, plot=FALSE)
hist.d2<- hist(d2, plot=FALSE)
data.d1<- data.frame(hist.d1$mids, hist.d1$density, rep(1,length(hist.d1$density)))
data.d2<- data.frame(hist.d2$mids, hist.d2$density, rep(2,length(hist.d2$density)))
colnames(data.d1)<- c("bin","den","group")
colnames(data.d2)<- c("bin","den","group")
ddata<- rbind(data.d1,data.d2)
ddata$group<- factor(ddata$group)
# plot
plots<- ggplot(data=ddata, aes(x=bin, y=den, group=group)) +
geom_point(aes(color=group)) +
geom_line(aes(color=group)) #optional
print(plots)
However, you could also produce smooth density plots (or histograms) directly in ggplot:
ddata2<- cbind(c(rep(1,nsample),rep(2,nsample)),c(d1,d2))
ddata2<- as.data.frame(ddata2)
colnames(ddata2)<- c("group","value")
ddata2$group<- factor(ddata2$group)
plots2<- ggplot(data=ddata2, aes(x=value, group=group)) +
geom_density(aes(color=group))
# geom_histogram(aes(color=group, fill=group)) # for histogram instead
dev.new() # opens a new graphics device on any platform; windows() exists only on Windows
print(plots2)
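For the original case of several matrices with NA values and one fixed set of bins, the same idea can be sketched as follows. The matrices here are random stand-ins for what ncvar_get() would return, and the names mats, breaks and dens are made up for the example. The limits (0 to 30) and bin width (2) are taken from the question.

```r
# Stand-ins for matrices read from NetCDF files: different sizes, some NAs
set.seed(1)
mats <- list(a = matrix(rexp(200, rate = 0.2), nrow = 10),
             b = matrix(rexp(600, rate = 0.3), nrow = 20))
mats$a[sample(200, 20)] <- NA

# One set of breaks shared by every matrix
breaks <- seq(0, 30, by = 2)

# One histogram per matrix, stacked into a single long data frame
dens <- do.call(rbind, lapply(names(mats), function(nm) {
  v <- as.vector(mats[[nm]])               # matrix -> vector
  v <- v[!is.na(v) & v >= 0 & v < 30]      # drop NAs and values outside the limits
  h <- hist(v, breaks = breaks, right = FALSE, plot = FALSE)
  data.frame(source = nm, bin = h$mids, den = h$density)
}))
```

With ggplot2 loaded, ggplot(dens, aes(x=bin, y=den, colour=source)) + geom_point() then draws all matrices in one plot with an automatic legend. Note that the densities are normalised over the values that fall inside the limits, which may or may not be what you want.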
I have a data frame with 3 columns (Id, Lat, Long). You can construct a small section of it with the following data:
df <- data.frame(
Id=c(1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Lat=c(58.12550, 58.17426, 58.46461, 58.45812, 58.45207, 58.44512, 58.43358, 58.42727, 57.77700, 57.76034, 57.73614, 57.72411, 57.70498, 57.68453),
Long=c(-5.098068, -5.314452, -4.914108, -4.899922, -4.887067, -4.873312, -4.852384, -4.840817, -5.666568, -5.648711, -5.617588, -5.594681, -5.557740, -5.509405))
The Id column is an index column: all rows with the same Id hold the coordinates of a single line. In my data frame the Id runs from 1 to 7696, so I have 7696 lines to plot.
Each Id relates to an individual, separate line of Lat and Long coordinates. What I want to do is overlay all 7696 of these lines onto an existing plot.
The example data above contains the Lat and Long coordinates for lines 1, 2 and 3.
What is the best way to overlay all these lines onto an existing plot? I was thinking maybe some kind of loop.
Using ggplot2:
#dummy data
df <- data.frame(
Id=c(1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Lat=c(58.12550, 58.17426, 58.46461, 58.45812, 58.45207, 58.44512, 58.43358, 58.42727, 57.77700, 57.76034, 57.73614, 57.72411, 57.70498, 57.68453),
Long=c(-5.098068, -5.314452, -4.914108, -4.899922, -4.887067, -4.873312, -4.852384, -4.840817, -5.666568, -5.648711, -5.617588, -5.594681, -5.557740, -5.509405))
library(ggplot2)
#plot
ggplot(data=df,aes(Lat,Long,colour=as.factor(Id))) +
geom_line()
Using base R:
#plot blank
with(df,plot(Lat,Long,type="n"))
#plot lines
for(i in unique(df$Id))
with(df[ df$Id==i,],lines(Lat,Long,col=i))
To be honest, I think that any approach you take is going to result in a very cluttered plot since you have so many Ids (unless their lines do not overlap much). Either way, I would probably use ggplot2 for this.
##
if( !("ggplot2" %in% installed.packages()[,1]) ){
install.packages("ggplot2",dependencies=TRUE)
}
library(ggplot2)
##
## built from the data frame df defined in the question
D <- data.frame(
Id=df$Id,
Lat=df$Lat,
Long=df$Long
)
##
ggplot(data=D,aes(x=Lat,y=Long,group=Id,color=Id))+
geom_point()+ ## you might want to omit geom_point() in your plot
geom_line()
##
The reason I used group=Id, color=Id in aes(), rather than converting Id to a factor and just using color=Id, is that the factor version produces a legend containing 7000+ factor levels (the majority of which will not be visible in the plot area).
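A variation worth considering, sketched here on a subset of the question's data: geom_line() connects points in order of their x value, whereas geom_path() connects them in the order they appear in the data, which is usually what coordinate tracks need. Putting Long on x and Lat on y and dropping the colour mapping are my own choices for this example.

```r
library(ggplot2)

# A subset of the example rows from the question
df <- data.frame(
  Id   = c(1, 1, 2, 2, 2, 3, 3, 3),
  Lat  = c(58.12550, 58.17426, 58.46461, 58.45812, 58.45207,
           57.77700, 57.76034, 57.73614),
  Long = c(-5.098068, -5.314452, -4.914108, -4.899922, -4.887067,
           -5.666568, -5.648711, -5.617588)
)

# group = Id draws one separate path per Id without creating a legend;
# a low alpha keeps thousands of overlapping lines readable
p <- ggplot(df, aes(x = Long, y = Lat, group = Id)) +
  geom_path(alpha = 0.4)
```

No loop is needed: the group aesthetic splits the data into one path per Id in a single layer.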
I have a large set of data that consists of coordinates (x,y) and a numeric z value that is similar to density. I'm interested in binning the data, performing summary statistics (median, length, etc.) and plotting the binned values as points with the statistics mapped to ggplot aesthetics.
I've tried using stat_summary2d and extracting the results manually (based on this answer: https://stackoverflow.com/a/22013347/2832911). However, the problem I'm running into is that the bin placements are based on the range of the data, which in my case varies by data set. Thus between two plots the bins are not covering the same area.
My question is how to either manually set bins using stat_summary2d, or at least set them to be consistent regardless of the data.
Here is a basic example which demonstrates the approach and how the bins don't line up:
library(ggplot2)
set.seed(2)
df1 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
df2 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
g1 <- ggplot(df1, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df1.binned <-
data.frame(with(ggplot_build(g1)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=1)))
g2 <- ggplot(df2, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df2.binned <-
data.frame(with(ggplot_build(g2)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=2)))
df.binned <- rbind(df1.binned, df2.binned)
ggplot(df.binned, aes(x,y, size=z, color=factor(df)))+geom_point(alpha=.5)
Which generates a plot where the bins of the two data sets do not line up.
In reality I will use stat_summary2d several times to get, for instance, the number of points in the bin, and the median and then use aes(size=bin.length, colour=bin.median).
Any tips on how to accomplish this using my proposed approach, or an alternative approach would be welcome.
You can manually set breaks with stat_summary2d. If you want 10 levels from -1 to 1 you can do
bb<-seq(-1,1,length.out=10+1)
breaks<-list(x=bb, y=bb)
And then use the breaks variable when you call your plots
g1 <- ggplot(df1, aes(x,y))+
stat_summary2d(fun=mean, breaks=breaks, aes(z=z))+
geom_point()
It's a shame you can't change the geom of stat_summary2d to "point" so you could make this in one go, but it doesn't look as though stat_summary2d calculates the proper x and y values for that.
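If the goal is points at the bin centres carrying several statistics, one option is to skip stat_summary2d and bin the data directly. The following is a base-R sketch under that assumption; bin_summary is a hypothetical helper, and the column names bin.median and bin.length are borrowed from the question.

```r
# Hypothetical helper: bin x and y on fixed breaks and summarise z per bin
bin_summary <- function(d, breaks) {
  mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
  # Integer bin index per point; points outside the breaks become NA
  ix <- cut(d$x, breaks, include.lowest = TRUE, labels = FALSE)
  iy <- cut(d$y, breaks, include.lowest = TRUE, labels = FALSE)
  # Several statistics per bin in one pass (empty bins and NA indices are dropped)
  out <- aggregate(d$z, by = list(ix = ix, iy = iy),
                   FUN = function(v) c(median = median(v), n = length(v)))
  data.frame(x = mids[out$ix], y = mids[out$iy],
             bin.median = out$x[, "median"],
             bin.length = out$x[, "n"])
}

set.seed(2)
df1 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))
df2 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))

bb <- seq(-1, 1, length.out = 10 + 1)   # same breaks for every data set
df.binned <- rbind(cbind(bin_summary(df1, bb), df = 1),
                   cbind(bin_summary(df2, bb), df = 2))
```

With ggplot2 loaded, ggplot(df.binned, aes(x, y, size=bin.length, colour=bin.median)) + geom_point() + facet_wrap(~df) then plots both data sets on identical bins.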
I am visualizing a panel dataset with geom_point where y = var1, x = year, and color = var2. The problem is that there are many overlapping points, even with horizontal jitter.
Reducing the point size or setting a low alpha value is undesirable because both reduce the visual impact of the second variable, which has a very long right skew. I would like ggplot to place the points with the highest values of var2 on top of all other overlapping points.
Reproducible example:
df <- data.frame(diamonds)
ggplot(data = df,aes(x=factor(cut),y=carat,colour=price)) +
geom_point(position=position_jitter(width=.4))+
scale_colour_gradientn(colours=c("grey20","orange","orange3"))
How does one place the points with highest values in df$price on top of an overlapping stack of points?
It looks as though grid plots in the order of the data,
library(grid)
d <- data.frame(x=c(0.5,0.52),y=c(0.6,0.6), fill=c("blue","red"),
stringsAsFactors=FALSE)
grid.newpage()
with(d,grid.points(x,y,default.units='npc', pch=21,gp=gpar(cex=5, fill=fill)))
with(d[c(2,1),], grid.points(x,y-0.2,default.units='npc', pch=21,
gp=gpar(cex=5, fill=fill)))
so I would suggest you first reorder your data.frame, and pray that ggplot2 won't mess with it :)
library(ggplot2)
library(plyr)
# sort ascending so that the highest prices are drawn last, i.e. on top
df <- diamonds[order(diamonds$price), ]
# alternative with plyr
df <- arrange(diamonds, price)
last_plot() %+% df
In ggplot2, you can use the order aesthetic to specify the order in which points are plotted. The last ones plotted will appear on top. To apply this, create a variable holding the order in which you'd like points to be drawn; in your case you might be able to specify rank(var2).
For the reproducible example, to put the points with the highest df$price on top:
df <- data.frame(diamonds)
df$orderrank <- rank(df$price,ties.method="first")
ggplot(data = df,aes(x=factor(cut),y=carat,colour=price, order=orderrank)) +
geom_point(position=position_jitter(width=.4))+
scale_colour_gradientn(colours=c("grey20","orange","orange3"))
Here is the difference in outputs between the example in the question and the chart with specified plotting order by price:
(The jittering makes the comparison a little less clear but the difference still comes across.)
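As far as I know, the order aesthetic was deprecated in ggplot2 2.0.0, so on current versions it is safer to rely on row order instead. A minimal sketch of the same plot, assuming only that points are drawn in the order of the rows:

```r
library(ggplot2)

# Sort ascending by price: rows drawn last end up on top,
# so the most expensive diamonds cover the cheaper ones
df <- diamonds[order(diamonds$price), ]

p <- ggplot(data = df, aes(x = cut, y = carat, colour = price)) +
  geom_point(position = position_jitter(width = 0.4)) +
  scale_colour_gradientn(colours = c("grey20", "orange", "orange3"))
```

Printing p draws the plot. Sorting in descending order instead would bury the high-price points under the low-price ones.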