Most frequent bin in a ggplot2 histogram in R

I am using ggplot2 to draw a histogram of a sample of size 1000 taken from a normal distribution. I need to place the letter 'A' at the center of the histogram, and I am doing that with the annotate function.
Since the sample is random, the "center" of the drawing changes a little every time I run the code, so I need a way to place the 'A' according to that specific sample. For the x axis I took the median of the sample; for the y axis I was thinking of taking the frequency of the most frequent bin and dividing it by 2.
Does anybody know of a function that gives the frequency of each bin?
Here is a reproducible example:
library(ggplot2)
set.seed(123)
x <- rnorm(1000)
qplot(x, geom="histogram")

Here is a way to get the coordinates of the output plot (on a reproducible example):
library(ggplot2)
x <- runif(10)
h <- qplot(x, geom="histogram")
ggplot_build(h)$data
This will give you all sorts of information on the histogram.
So to get the height of the most frequent class and divide by two, you just need to do
height <- max(ggplot_build(h)$data[[1]]$count) / 2
Using the same kind of information, you can also put the text always right in the middle of the plot:
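# Where the panel ranges live depends on the ggplot2 version:
# $panel$ranges in old releases, $layout$panel_params in ggplot2 >= 3.0
ranges <- ggplot_build(h)$layout$panel_params
xtext <- mean(ranges[[1]]$x.range)
ytext <- mean(ranges[[1]]$y.range)
h + annotate("text", x=xtext, y=ytext,
label="A", size=30, color="blue", alpha=0.5)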
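For the placement the question actually asks for (median of the sample on x, half the height of the tallest bin on y), a minimal sketch could look like the following. It assumes a reasonably recent ggplot2, uses ggplot() rather than qplot(), and reads the bin counts with layer_data(), which returns the same per-layer table as ggplot_build(h)$data[[1]]. The choice of bins = 30 is only there to silence the default-bins message.

```r
library(ggplot2)

set.seed(123)
x <- rnorm(1000)

# Build the histogram as an object so its computed bins can be inspected
h <- ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(bins = 30)

# One row per bin; the 'count' column holds the bin frequencies
bins <- layer_data(h, 1)

xtext <- median(x)           # horizontal position: sample median
ytext <- max(bins$count) / 2 # vertical position: half the tallest bin

p <- h + annotate("text", x = xtext, y = ytext,
                  label = "A", size = 30, colour = "blue", alpha = 0.5)
print(p)
```

Because both coordinates are computed from the sample itself, the letter follows the histogram when the seed changes.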

Related

Align X axis of scatterplot and boxplot

I'm superimposing two images in R. One image is a boxplot (using boxplot()), the other a scatterplot (using scatterplot()). I noticed a discrepancy in the scale along the x-axis. (A) is the boxplot scale. (B) is for the scatterplot.
What I've been trying to do is re-scale (B) to suit (A). I note there is a condition called xlim in scatterplot. Tried it, didn't work. I've also noted this example came up as I was typing out the question: Change Axis Label - R scatterplot.
Tried it, didn't work.
How can I modify the x-axis to change the scale from 1.0, 1.5, 2.0, 2.5, 3.0 to simply 1,2,3.
In Stata, I'm aware you can specify the x-axis range, and then indicate the step-ups between. For example, the range may be 0-100, and each measurable point would be set to 10. So you'd end up with 10, 20,....,100.
My R code, as it stands, looks something like this:
library(car)
boxplot(a,b,c)
par(new=T)
scatterplot(x, y, smooth=TRUE, boxplots=FALSE)
I've tried modifying scatterplot as such without any success:
scatterplot(x, y, smooth=TRUE, boxplots=FALSE, xlim=c(1,3))

As mentioned in the comments, use as.factor so that the x axes align. Here is a ggplot solution:
#dummy data
dat1 <- data.frame(group=as.factor(rep(1:3,4)),
var=c(runif(12)))
dat2 <- data.frame(x=as.factor(1:3),y=runif(3))
library(ggplot2)
library(grid)
library(gridExtra)
#plot points on top of boxplot
ggplot(dat1,aes(group,var)) +
geom_boxplot() +
geom_point(aes(x,y),dat2)
Plot as separate plots
gg_boxplot <-
ggplot(dat1,aes(group,var)) +
geom_boxplot()
gg_point <-
ggplot(dat2,aes(x,y)) +
geom_point()
# gridExtra >= 2.0 uses 'top' for the title (older versions used 'main')
grid.arrange(gg_boxplot,gg_point,
ncol=1,
top="Plotting is easier with ggplot")
EDIT
Using xlim as suggested by @RuthgerRighart
#dummy data - no factors
dat1 <- data.frame(group=rep(1:3,4),
var=c(runif(12)))
dat2 <- data.frame(x=1:3,y=runif(3))
par(mfrow=c(2,1))
boxplot(var~group,dat1,xlim=c(1,3))
plot(dat2$x,dat2$y,xlim=c(1,3))
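The question also asks how to get ticks at 1, 2, 3 instead of 1.0, 1.5, ..., 3.0. In base graphics one way is to suppress the default axis and draw it by hand. The sketch below uses made-up data with the same shape as the dummy data above, and widens xlim to c(0.5, 3.5) (my choice) so the outer boxes are not clipped. boxplot() draws its boxes at x = 1, 2, 3, so points with those x values line up with them.

```r
set.seed(1)
# Dummy data, numeric groups (no factors)
dat1 <- data.frame(group = rep(1:3, 4), var = runif(12))
dat2 <- data.frame(x = 1:3, y = runif(3))

# Same x limits for both panels so the axes line up
xl <- c(0.5, 3.5)

par(mfrow = c(2, 1))
bp <- boxplot(var ~ group, data = dat1, xlim = xl)

# xaxt = "n" suppresses the default axis; axis() then draws ticks at 1, 2, 3 only
plot(dat2$x, dat2$y, xlim = xl, xaxt = "n", xlab = "group", ylab = "y")
axis(1, at = 1:3)
```

To overlay instead of stacking, points(dat2$x, dat2$y) after the boxplot() call adds the scatter points to the same panel without par(new=T).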

R - How to histogram multiple matrixes using qplot/ggplot2

I'm using R to read and plot data from NetCDF files (ncdf4). I started using R only recently, so please bear with me.
Let's say that from the files I obtain N 2-D matrices of numerical values, each with different dimensions and many NA values.
I have to histogram these values in the same plot, with bins of a given width and within given limits, the same for every matrix.
For just one matrix, I can do this:
library(ncdf4)
library(ggplot2)
file0 <- nc_open("test.nc")
#Read a variable
prec0 <- ncvar_get(file0,"pr")
#Some settings
min_plot=0
max_plot=30
bin_width=2
xlabel="mm/day"
ylabel="PDF"
title="Precipitation"
#Get maximum of array, exclude NAs
maximum_prec0=max(prec0, na.rm=TRUE)
#Store the histogram
histo_prec0 <- hist(prec0, xlim=c(min_plot,max_plot), right=FALSE, breaks=seq(0,ceiling(maximum_prec0),by=bin_width))
#Plot the histogram densities using points instead of bars, which is what we want
qplot(histo_prec0$mids, histo_prec0$density, xlim=c(min_plot,max_plot), color=I("yellow"), xlab=xlabel, ylab=ylabel, main=title, log="y")
#If necessary, can transform matrix to vector using
#vector_prec0 <- c(prec0)
However, it occurs to me that it would be best to use a data frame for plotting multiple matrices. I'm not certain of that, nor of how to do it. This would also allow for automatic legends and all the other advantages of using data frames with ggplot2.
What I want to achieve is something akin to this:
https://copy.com/thumbs_public/j86WLyOWRs4N1VTi/scatter_histo.jpg?size=1024
Where on Y we have the Density and on X the bins.
Thanks in advance.

To be honest, it is unclear what you are after (scatter plot or histogram of data with values as points?).
Here are a couple of examples using ggplot which might fit your goals (based on your last sentence: "Where on Y we have the Density and on X the bins"):
# some data
nsample<- 200
d1<- rnorm(nsample,1,0.5)
d2<- rnorm(nsample,2,0.6)
#transformed into histogram bins and collected in a data frame
hist.d1<- hist(d1)
hist.d2<- hist(d2)
data.d1<- data.frame(hist.d1$mids, hist.d1$density, rep(1,length(hist.d1$density)))
data.d2<- data.frame(hist.d2$mids, hist.d2$density, rep(2,length(hist.d2$density)))
colnames(data.d1)<- c("bin","den","group")
colnames(data.d2)<- c("bin","den","group")
ddata<- rbind(data.d1,data.d2)
ddata$group<- factor(ddata$group)
# plot
plots<- ggplot(data=ddata, aes(x=bin, y=den, group=group)) +
geom_point(aes(color=group)) +
geom_line(aes(color=group)) #optional
print(plots)
However, you could also produce smooth density plots (or histograms) directly in ggplot:
ddata2<- cbind(c(rep(1,nsample),rep(2,nsample)),c(d1,d2))
ddata2<- as.data.frame(ddata2)
colnames(ddata2)<- c("group","value")
ddata2$group<- factor(ddata2$group)
plots2<- ggplot(data=ddata2, aes(x=value, group=group)) +
geom_density(aes(color=group))
# geom_histogram(aes(color=group, fill=group)) # for histogram instead
dev.new() # opens a new graphics device on any platform (windows() works only on Windows)
print(plots2)
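To connect this back to the original setting (several matrices of different sizes, with NAs, binned with one common set of breaks), here is a rough sketch. The matrices are simulated stand-ins for what ncvar_get() would return, and the names model_a and model_b are hypothetical. Note one assumption that differs from the question's code: values outside the plotting limits are dropped before binning, so each density is normalised over the 0 to 30 range only.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-ins for matrices read with ncvar_get(); sizes differ
mats <- list(
  model_a = matrix(rexp(60, rate = 1/5), nrow = 6),
  model_b = matrix(rexp(35, rate = 1/8), nrow = 5)
)
mats$model_a[sample(60, 5)] <- NA  # sprinkle in some missing values

# One set of breaks shared by every matrix
breaks <- seq(0, 30, by = 2)

# Bin each matrix with hist() and stack the results in long format
binned <- do.call(rbind, lapply(names(mats), function(nm) {
  v <- as.vector(mats[[nm]])
  v <- v[!is.na(v) & v >= min(breaks) & v < max(breaks)]
  h <- hist(v, breaks = breaks, right = FALSE, plot = FALSE)
  data.frame(source = nm, bin = h$mids, density = h$density)
}))

# Empty bins cannot be shown on a log axis
binned <- binned[binned$density > 0, ]

p <- ggplot(binned, aes(bin, density, colour = source)) +
  geom_point() +
  scale_y_log10() +
  labs(x = "mm/day", y = "PDF", title = "Precipitation")
print(p)
```

The source column gives the legend for free, and adding another matrix is just another named entry in the list.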

Manually specifying bins with stat_summary2d

I have a large set of data that consists of coordinates (x,y) and a numeric z value that is similar to density. I'm interested in binning the data, performing summary statistics (median, length, etc.) and plotting the binned values as points with the statistics mapped to ggplot aesthetics.
I've tried using stat_summary2d and extracting the results manually (based on this answer: https://stackoverflow.com/a/22013347/2832911). However, the problem I'm running into is that the bin placements are based on the range of the data, which in my case varies by data set. Thus between two plots the bins are not covering the same area.
My question is how to either manually set bins using stat_summary2d, or at least set them to be consistent regardless of the data.
Here is a basic example which demonstrates the approach and how the bins don't line up:
library(ggplot2)
set.seed(2)
df1 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
df2 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
g1 <- ggplot(df1, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df1.binned <-
data.frame(with(ggplot_build(g1)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=1)))
g2 <- ggplot(df2, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df2.binned <-
data.frame(with(ggplot_build(g2)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=2)))
df.binned <- rbind(df1.binned, df2.binned)
ggplot(df.binned, aes(x,y, size=z, color=factor(df)))+geom_point(alpha=.5)
This generates a plot in which the binned points of the two data sets do not line up.
In reality I will use stat_summary2d several times to get, for instance, the number of points in the bin, and the median and then use aes(size=bin.length, colour=bin.median).
Any tips on how to accomplish this using my proposed approach, or an alternative approach would be welcome.

You can manually set breaks with stat_summary2d. If you want 10 levels from -1 to 1 you can do
bb<-seq(-1,1,length.out=10+1)
breaks<-list(x=bb, y=bb)
And then use the breaks variable when you call your plots
g1 <- ggplot(df1, aes(x,y))+
stat_summary2d(fun=mean, breaks=breaks, aes(z=z))+
geom_point()
It's a shame you can't change the geom of stat_summary2d to "point" so you could make this in one go, but it doesn't look as though stat_summary2d calculates the proper x and y values for that.
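An alternative that avoids extracting values from ggplot_build() is to do the binning yourself with cut() and the same breaks, then plot the summaries as points. This is only a sketch of that idea; the helper name bin_summary and the output column names (bin.median, bin.length, taken from the question's wording) are my own choices.

```r
library(ggplot2)

set.seed(2)
df1 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))
df2 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))

# Fixed breaks shared by every data set, and the matching bin centres
bb <- seq(-1, 1, length.out = 10 + 1)
mids <- head(bb, -1) + diff(bb) / 2

# Hypothetical helper: median and count of z per (x, y) cell
bin_summary <- function(df, id) {
  cells <- data.frame(
    ix = cut(df$x, bb, include.lowest = TRUE, labels = FALSE),
    iy = cut(df$y, bb, include.lowest = TRUE, labels = FALSE),
    z = df$z
  )
  med <- aggregate(z ~ ix + iy, data = cells, FUN = median)
  n <- aggregate(z ~ ix + iy, data = cells, FUN = length)
  data.frame(x = mids[med$ix], y = mids[med$iy],
             bin.median = med$z, bin.length = n$z, df = id)
}

df.binned <- rbind(bin_summary(df1, 1), bin_summary(df2, 2))

p <- ggplot(df.binned, aes(x, y, size = bin.length, colour = bin.median)) +
  geom_point(alpha = .5) +
  facet_wrap(~ df)
print(p)
```

Since the breaks do not depend on the data, the bin centres coincide across data sets, and several statistics per bin come out of one pass.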

adding text to ggplot geom_jitter points that match a condition

How can I add text to points rendered with geom_jitter to label them? geom_text will not work because I don't know the coordinates of the jittered dots. Is there a way to capture the positions of the jittered points so I can pass them to geom_text?
My practical use case is a boxplot with geom_jitter over it to show the data distribution; I would like to label the outlier dots or the ones that match a certain condition (for example, the lowest 10% of the values used to color the points).
One solution would be to capture the xy positions of the jittered points and use them later in another layer. Is that possible?
[update]
Following Joran's answer, a solution is to calculate the jittered values with the jitter function from the base package, add them to a data frame, and use them with geom_point. For filtering, he used ddply to create a filter column (a logical vector) and used it to subset the data in geom_text.
He asked for a minimal dataset. I just modified his example (adding a unique identifier in the label column):
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''))
This is the result of Joran's example with my data, restricting the displayed ids to the lowest 1%.
And this is a modification of the code to color the points by another variable and display some values of this variable (the lowest 1% for each group):
library("plyr") # provides ddply
library("ggplot2")
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''),quality= rnorm(300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create indicator variables that pick out, within each level of x,
# the obs in the lowest 1% of y (grp) and of quality (top_q)
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.01);
g$top_q <- g$quality <= quantile(g$quality,0.01);
g})
#Create a boxplot, overlay the jittered points and
# label the bottom 1% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj,colour=quality)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab)) +
geom_text(data=subset(datJit,top_q),aes(x=xj,label=sprintf("%0.2f",quality)))

Your question isn't completely clear; for example, you mention labeling points at one point but also mention coloring points, so I'm not sure which you really mean, or perhaps both. A reproducible example would be very helpful. But using a little guesswork on my part, the following code does what I think you're describing:
library(ggplot2)
library(plyr) # provides ddply
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=rep('label',300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create an indicator variable that picks out those
# obs that are in the lowest 10% of y within each level of x
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.1); g})
#Create a boxplot, overlay the jittered points and
# label the bottom 10% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab))

Just an addition to Joran's wonderful solution:
I ran into trouble with the x-axis positioning when I tried to use it in a faceted plot with facet_wrap(). The problem is that ggplot2 uses 1 as the x-value in every facet. The solution is to create a vector of jittered 1s:
datJit$xj <- jitter(rep(1,length(dat$x)),amount=0.1)
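With newer ggplot2 releases (3.0.0 and later, as far as I know) there is another route: position_jitter() takes a seed argument, so the point layer and the text layer can share exactly the same jitter. The sketch below assumes such a version. The label layer deliberately uses all rows with blank labels for the unflagged ones, because the jitter only matches when both layers see the same rows in the same order. ave() is used here in place of ddply to flag the lowest 10% per group.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rep(letters[1:3], times = 100),
                  y = runif(300),
                  lab = paste0("id_", 1:300))

# Flag the lowest 10% of y within each level of x (base R, no plyr needed)
dat$low <- dat$y <= ave(dat$y, dat$x, FUN = function(v) quantile(v, 0.1))

# A seeded position object: reusing it reproduces the same offsets in both layers
pos <- position_jitter(width = 0.2, height = 0, seed = 42)

p <- ggplot(dat, aes(x, y)) +
  geom_boxplot(outlier.shape = NA) +  # outliers are already drawn as jittered points
  geom_point(position = pos) +
  geom_text(aes(label = ifelse(low, lab, "")), position = pos, vjust = -0.6)
print(p)
```

This also sidesteps the facet problem described above, since the jitter is applied by ggplot2 after the x positions have been resolved within each panel.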

Creating a facet_wrap plot with ggplot2 with different annotations in each plot

I am using ggplot2 to explore the result of some testing on an agent-based model. The model can end in one of three rounds per realization, and as such I am interested in how player utilities differ in terms of what round the game ends and their relative position in 2D space.
All this is to say that I have generated a facet_wrap plot to show this for each round, but I would also like to annotate each plot with the cor(x,y) for the subset of data represented in each facet. Is there a way to tell ggplot2 that I would like the annotation to use the subset of data generated by facet_wrap? Here is the code I have so far, and what it is producing
library(ggplot2)
# Load data
abm.data<-read.csv("ABM_results.csv")
# Create new column for area of Pareto set
attach(abm.data)
area<-abs(((x3*(y2-y1))+(x2*(y1-y3))+(x1*(y3-y2)))/2)
abm.data<-transform(abm.data,area=area)
detach(abm.data)
# Compare area of Pareto set with player utility
png("area_p1.png",res=100,pointsize=20,height=500,width=1600)
area.p1<-ggplot(abm.data,aes(x=area))+geom_point(aes(y=U1_2,colour="Player 1",alpha=0.4))+facet_wrap(~round,ncol=3)+
annotate("text",0.375,-1.25,label=paste("rho=",round(cor(abm.data$area,abm.data$U1_2),2)), parse=TRUE)+
scale_colour_manual(values=c("Player 1"="red"))
# opts() has been removed from ggplot2; ggtitle() and theme() replace it
area.p1+xlab("Area of Pareto Set")+ylab("Player Utility at Game End")+
ggtitle("Final Player 1 Utility by Pareto Set Size and Round Game Ends")+
theme(legend.position="none")
dev.off()
(source: drewconway.com)
As you can see, there are two problems:
The \rho value is of the full dataset, rather than the subsets by 'round'. Is there a way to get the cor(x,y) to print based on only the data shown in each plot?
The annotation should read "\rho=some_value" but instead I get "=(\rho,value);" is there a way to fix this?

To fix the second problem use
annotate("text", 0.375, -1.25,
label=paste("rho==", round(cor(abm.data$area, abm.data$U1_2), 2)),
parse=TRUE)
i.e. "rho==".
Edit: Here is a solution to solve the first problem
library("plyr")
library("ggplot2")
set.seed(1)
df <- data.frame(x=rnorm(300), y=rnorm(300), cl=gl(3,100)) # create test data
df.cor <- ddply(df, .(cl), function(val) sprintf("rho==%.2f", cor(val$x, val$y)))
p1 <- ggplot(data=df, aes(x=x)) +
geom_point(aes(y=y, colour="col1", alpha=0.4)) +
facet_wrap(~ cl, ncol=3) +
geom_text(data=df.cor, aes(x=0, y=3, label=V1), parse=TRUE) +
scale_colour_manual(values=c("col1"="red")) +
theme(legend.position="none") # opts() has been removed from ggplot2
print(p1)

The same question can be asked about adding segments to each facet. These problems can be solved in general by using geom_segment instead of annotate("segment", ...): for any geom_foo, we can define a data.frame that holds the per-facet data for that geom.
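As an illustration of that idea, the sketch below draws one horizontal segment per facet at the facet's mean of y. The test data mirrors the answer above. The segment endpoints at x = -2 and x = 2 are arbitrary choices of mine. The important part is that the segment data frame carries the faceting variable cl, so each row lands in its own panel.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(300), y = rnorm(300), cl = gl(3, 100))

# One row per facet: the mean of y within each level of cl
means <- aggregate(y ~ cl, data = df, FUN = mean)
seg <- data.frame(cl = means$cl,
                  x = -2, xend = 2,            # arbitrary horizontal extent
                  y = means$y, yend = means$y)

p <- ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~ cl, ncol = 3) +
  # The layer's own data has a 'cl' column, so it is split across facets too
  geom_segment(data = seg, aes(x = x, xend = xend, y = y, yend = yend),
               colour = "red")
print(p)
```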
