R Histogram: Binning by one characteristic, Density by another

I have a data frame with two variables, x and y, in R. What I want to do is bin each entry by its value of x, but then display the density of y for the entries in each bin. More specifically, for each interval of x, I want to plot (sum of y over the entries whose x falls in that interval) / (sum of y over all entries). I know how to do this manually via vector manipulation, but I have to make a lot of these plots and wanted to know if there is a quicker way to do this, maybe via some advanced use of hist.

You could generate the groupings using cut and then use a facet_grid to display the multiple histograms:
# Sample data with y depending on x
set.seed(144)
dat <- data.frame(x=rnorm(1000))
dat$y <- dat$x + rnorm(1000)
# Generate bins of x values
dat$grp <- cut(dat$x, breaks=2)
# Plot
library(ggplot2)
ggplot(dat, aes(x=y)) + geom_histogram() + facet_grid(grp~.)
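The faceted plot above shows the distribution of y within each group of x. If the goal is literally the quantity described in the question (the sum of y in each x interval divided by the total sum of y), a minimal base R sketch could look like the following. The sample data, the bin width of 0.5, and the names `breaks`, `bin`, and `share` are assumptions for illustration; y is made positive here only so that the shares are easy to read.

```r
# Hypothetical sample data; y is kept positive so shares lie in [0, 1]
set.seed(144)
dat <- data.frame(x = rnorm(1000))
dat$y <- abs(dat$x + rnorm(1000))

# Bin x into intervals of width 0.5 (the width is an assumption)
breaks <- seq(floor(min(dat$x)), ceiling(max(dat$x)), by = 0.5)
dat$bin <- cut(dat$x, breaks = breaks, include.lowest = TRUE)

# Share of the total y that falls into each x bin
share <- tapply(dat$y, dat$bin, sum) / sum(dat$y)
share[is.na(share)] <- 0  # empty bins contribute nothing

# One bar per x interval, height = sum(y in bin) / sum(all y)
barplot(share, las = 2, ylab = "share of total y")
```

With ggplot2, a similar picture should be obtainable by passing the normalised values as weights, for example aes(x = x, weight = y / sum(y)) together with geom_histogram(binwidth = 0.5).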

Related

Most frequent bin in a ggplot2 histogram in R

I am using ggplot2 to draw a histogram of a sample of size 1000 taken from a normal distribution. I need to place the letter 'A' at the center of the histogram, and I am doing that with the function annotate.
Since this vector is random, the "center" of the drawing will change a little bit every time I run the code, so I need a way for the function to place the 'A' according to that specific sample. For the x axis I took the median of the sample; for the y axis I was thinking of taking the frequency of the most frequent bin and dividing it by 2.
Does anybody know if there is a function that gives you the frequency of each bin?
Here is a reproducible example:
library(ggplot2)
set.seed(123)
x <- rnorm(1000)
qplot(x, geom="histogram")
Here is a way to get the coordinates of the output plot (on a reproducible example):
library(ggplot2)
x <- runif(10)
h <- qplot(x, geom="histogram")
ggplot_build(h)$data
This will give you all sorts of information on the histogram.
So to get the height of the most frequent class and divide by two, you just need to do
height <- max(ggplot_build(h)$data[[1]]$count) / 2
Using the same kind of information, you can also put the text always right in the middle of the plot:
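# In ggplot2 >= 3.0 the panel ranges are stored in $layout$panel_params;
# older versions exposed them as ggplot_build(h)$panel$ranges
ranges <- ggplot_build(h)$layout$panel_params
xtext <- mean(ranges[[1]]$x.range)
ytext <- mean(ranges[[1]]$y.range)
h + annotate("text", x=xtext, y=ytext,
label="A", size=30, color="blue", alpha=0.5)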
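As a rough alternative that does not depend on the internals of the built plot, the bin frequencies can also be computed with base R's hist(). This is only a sketch: the value breaks = 30 is an assumption, hist() treats it as a suggestion, and the resulting bins will generally not coincide exactly with the ones ggplot2 draws.

```r
set.seed(123)
x <- rnorm(1000)

# Compute the bin counts without drawing anything
h0 <- hist(x, breaks = 30, plot = FALSE)

# Coordinates for the label: median of the sample on x,
# half the height of the most frequent bin on y
xtext <- median(x)
ytext <- max(h0$counts) / 2
```

The values xtext and ytext can then be passed to annotate("text", x = xtext, y = ytext, label = "A").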

R - How to histogram multiple matrixes using qplot/ggplot2

I'm using R to read and plot data from NetCDF files (ncdf4). I've only recently started using R, so I'm quite confused; I beg your pardon.
Let's say that from the files I obtain N 2-D matrices of numerical values, each with different dimensions and many NA values.
I have to histogram these values in the same plot, with bins of given width and within given limits, the same for every matrix.
For just one matrix, I can do this:
library(ncdf4)
library(ggplot2)
file0 <- nc_open("test.nc")
#Read a variable
prec0 <- ncvar_get(file0,"pr")
#Some settings
min_plot=0
max_plot=30
bin_width=2
xlabel="mm/day"
ylabel="PDF"
title="Precipitation"
#Get maximum of array, exclude NAs
maximum_prec0=max(prec0, na.rm=TRUE)
#Store the histogram
histo_prec0 <- hist(prec0, xlim=c(min_plot,max_plot), right=FALSE, breaks=seq(0,ceiling(maximum_prec0),by=bin_width))
#Plot the histogram densities using points instead of bars, which is what we want
qplot(histo_prec0$mids, histo_prec0$density, xlim=c(min_plot,max_plot), color=I("yellow"), xlab=xlabel, ylab=ylabel, main=title, log="y")
#If necessary, can transform matrix to vector using
#vector_prec0 <- c(prec0)
However, it occurs to me that it would be best to use a data frame for plotting multiple matrices. I'm not certain of that, nor of how to do it. This would also allow for automatic legends and all the advantages that come from using data frames with ggplot2.
What I want to achieve is something akin to this:
https://copy.com/thumbs_public/j86WLyOWRs4N1VTi/scatter_histo.jpg?size=1024
Where on Y we have the Density and on X the bins.
Thanks in advance.
To be honest, it is unclear what you are after (scatter plot or histogram of data with values as points?).
Here are a couple of examples using ggplot which might fit your goals (based on your last sentence: "Where on Y we have the Density and on X the bins"):
# some data
nsample<- 200
d1<- rnorm(nsample,1,0.5)
d2<- rnorm(nsample,2,0.6)
#transformed into histogram bins and collected in a data frame
hist.d1<- hist(d1)
hist.d2<- hist(d2)
data.d1<- data.frame(hist.d1$mids, hist.d1$density, rep(1,length(hist.d1$density)))
data.d2<- data.frame(hist.d2$mids, hist.d2$density, rep(2,length(hist.d2$density)))
colnames(data.d1)<- c("bin","den","group")
colnames(data.d2)<- c("bin","den","group")
ddata<- rbind(data.d1,data.d2)
ddata$group<- factor(ddata$group)
# plot
plots<- ggplot(data=ddata, aes(x=bin, y=den, group=group)) +
geom_point(aes(color=group)) +
geom_line(aes(color=group)) #optional
print(plots)
However, you could also produce smooth density plots (or histograms) directly in ggplot:
ddata2<- cbind(c(rep(1,nsample),rep(2,nsample)),c(d1,d2))
ddata2<- as.data.frame(ddata2)
colnames(ddata2)<- c("group","value")
ddata2$group<- factor(ddata2$group)
plots2<- ggplot(data=ddata2, aes(x=value, group=group)) +
geom_density(aes(color=group))
# geom_histogram(aes(color=group, fill=group)) # for histogram instead
dev.new() # opens a new graphics device on any platform (windows() is Windows-only)
print(plots2)
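To connect this back to the question, where several matrices of different dimensions and with NA values have to share the same bins, one possible sketch is to flatten each matrix, drop the NAs, and bin everything against a single set of breaks. The list `mats` below is a hypothetical stand-in for the matrices read with ncvar_get; the settings min_plot, max_plot, and bin_width follow the question, and the densities are normalised over the values that fall inside the plotting limits.

```r
# Hypothetical stand-ins for matrices read from NetCDF files
set.seed(1)
mats <- list(a = matrix(rexp(20, 0.2), 4, 5),
             b = matrix(rexp(30, 0.1), 5, 6))
mats$a[c(2, 7)] <- NA

# Settings as in the question
min_plot <- 0
max_plot <- 30
bin_width <- 2
breaks <- seq(min_plot, max_plot, by = bin_width)

# One row per (matrix, bin), all matrices binned with the same breaks
dens <- do.call(rbind, lapply(names(mats), function(nm) {
  v <- c(mats[[nm]])                                    # matrix -> vector
  v <- v[!is.na(v) & v >= min_plot & v < max_plot]      # drop NAs and out-of-range values
  h <- hist(v, breaks = breaks, right = FALSE, plot = FALSE)
  data.frame(source = nm, bin = h$mids, den = h$density)
}))
```

The resulting data frame can be plotted like ddata above, for instance with aes(x = bin, y = den, color = source) and geom_point(), which also produces the legend automatically.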

plotting multiple plots in ggplot2 on same graph that are unrelated

How would one use the smooth.spline() method in a ggplot2 scatterplot?
Suppose my data is in a data frame called data, with two columns, x and y.
The smooth.spline would be sm <- smooth.spline(data$x, data$y). I believe I should use geom_line(), with sm$x and sm$y as the xy coordinates. However, how would one plot a scatterplot and a lineplot on the same graph that are completely unrelated? I suspect it has something to do with the aes() but I am getting a little confused.
You can use different data frames in different geoms and call the relevant variables using aes, or you could combine the relevant variables from the output of smooth.spline:
# example data
set.seed(1)
dat <- data.frame(x = rnorm(20, 10,2))
dat$y <- dat$x^2 - 20*dat$x + rnorm(20,10,2)
# spline
s <- smooth.spline(dat)
# plot - combine the original x & y and the fitted values returned by
# smooth.spline into a data.frame
library(ggplot2)
ggplot(data.frame(x=s$data$x, y=s$data$y, xfit=s$x, yfit=s$y)) +
geom_point(aes(x,y)) + geom_line(aes(xfit, yfit))
# or you could use geom_smooth
ggplot(dat, aes(x , y)) + geom_point() + geom_smooth()
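The first option mentioned above, giving each geom its own data frame, can be sketched as follows. The fitted curve is evaluated on a regular grid with predict(), so the line layer has a different number of rows than the scatter layer; the names `grid` and `fit` and the grid size of 200 are assumptions for illustration.

```r
library(ggplot2)

# Example data, as above
set.seed(1)
dat <- data.frame(x = rnorm(20, 10, 2))
dat$y <- dat$x^2 - 20 * dat$x + rnorm(20, 10, 2)

# Fit the spline and evaluate it on a regular grid
s <- smooth.spline(dat$x, dat$y)
grid <- seq(min(dat$x), max(dat$x), length.out = 200)
fit <- as.data.frame(predict(s, grid))  # columns x and y

# Points come from dat, the line from fit; both layers reuse aes(x, y)
p <- ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_line(data = fit, colour = "red")
print(p)
```

Because the data argument is set per layer, the two data frames do not need to have the same length or come from the same source.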

Manually specifying bins with stat_summary2d

I have a large set of data that consists of coordinates (x,y) and a numeric z value that is similar to density. I'm interested in binning the data, performing summary statistics (median, length, etc.) and plotting the binned values as points with the statistics mapped to ggplot aesthetics.
I've tried using stat_summary2d and extracting the results manually (based on this answer: https://stackoverflow.com/a/22013347/2832911). However, the problem I'm running into is that the bin placements are based on the range of the data, which in my case varies by data set. Thus between two plots the bins are not covering the same area.
My question is how to either manually set bins using stat_summary2d, or at least set them to be consistent regardless of the data.
Here is a basic example which demonstrates the approach and how the bins don't line up:
library(ggplot2)
set.seed(2)
df1 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
df2 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
g1 <- ggplot(df1, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df1.binned <-
data.frame(with(ggplot_build(g1)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=1)))
g2 <- ggplot(df2, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df2.binned <-
data.frame(with(ggplot_build(g2)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=2)))
df.binned <- rbind(df1.binned, df2.binned)
ggplot(df.binned, aes(x,y, size=z, color=factor(df)))+geom_point(alpha=.5)
This generates a plot (image omitted) in which the bins of the two data sets do not line up.
In reality I will use stat_summary2d several times to get, for instance, the number of points in the bin, and the median and then use aes(size=bin.length, colour=bin.median).
Any tips on how to accomplish this using my proposed approach, or an alternative approach would be welcome.
You can manually set breaks with stat_summary2d. If you want 10 levels from -1 to 1 you can do
bb<-seq(-1,1,length.out=10+1)
breaks<-list(x=bb, y=bb)
And then use the breaks variable when you call your plots
g1 <- ggplot(df1, aes(x,y))+
stat_summary2d(fun=mean, breaks=breaks, aes(z=z))+
geom_point()
It's a shame you can't change the geom of stat_summary2d to "point" so you could make this in one go, but it doesn't look as though stat_summary2d calculates the proper x and y values for that.
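An alternative that avoids extracting values from the built plot is to do the binning by hand with cut() and aggregate(), using the same breaks for every data set. The sketch below is only one way to do it; the helper `bin_summary` and the column names bin.median and bin.length are hypothetical, chosen to match the aesthetics mentioned in the question.

```r
set.seed(2)
df1 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))

# Shared breaks and the corresponding bin centres
bb <- seq(-1, 1, length.out = 10 + 1)
mids <- (head(bb, -1) + tail(bb, -1)) / 2

# Hypothetical helper: median and count of z in each (x, y) cell
bin_summary <- function(df, bb, mids) {
  ix <- cut(df$x, bb, include.lowest = TRUE, labels = FALSE)
  iy <- cut(df$y, bb, include.lowest = TRUE, labels = FALSE)
  out <- aggregate(df$z, list(ix = ix, iy = iy),
                   function(v) c(median = median(v), n = length(v)))
  data.frame(x = mids[out$ix], y = mids[out$iy],
             bin.median = out$x[, "median"],
             bin.length = out$x[, "n"])
}

b1 <- bin_summary(df1, bb, mids)
```

The result can be plotted directly as points, for example with aes(x, y, size = bin.length, colour = bin.median) and geom_point(), and calling the helper on a second data frame with the same bb yields bin centres that coincide.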

Combine continuous and discrete color scale in ggplot2?

I am a ggplot2 newbie. I am making a scatter plot where the points are colored based on a third continuous variable. However, for some of the points, that continuous variable has either an Inf value or a NaN. How can I generate a continuous scale that has a special, separate color for Inf and another separate color for NaN?
One way to get this behavior is to subset the data, and make a separate layer for the special points, where the color is set. But I'd like the special colors to enter the legend as well, and think it would be cleaner to eliminate the need to subset the data.
Thanks!
Uri
I'm sure this can be made more efficient, but here's one approach. Essentially, we follow your advice of subsetting the data into the different parts, divide the continuous data into discrete bins, then patch everything back together and use a scale of our own choosing.
library(ggplot2)
library(RColorBrewer)
#Sample data
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
dat[sample(nrow(dat), 5), 3] <- NA
dat[sample(nrow(dat), 5), 3] <- Inf
#Subset out the real values
dat.good <- dat[!(is.na(dat$z)) & is.finite(dat$z) ,]
#Create 6 breaks for them
dat.good$col <- cut(dat.good$z, 6)
#Grab the bad ones
dat.bad <- dat[is.na(dat$z) | is.infinite(dat$z) ,]
#Label them explicitly: as.character(NA) is a true NA, not the string "NA"
dat.bad$col <- ifelse(is.na(dat.bad$z), "NA", "Inf")
#Rbind them back together
dat.plot <- rbind(dat.good, dat.bad)
#Make your own scale with RColorBrewer
yourScale <- c(brewer.pal(6, "Blues"), "red","green")
ggplot(dat.plot, aes(x,y, colour = col)) +
geom_point() +
scale_colour_manual("Intensity", values = yourScale)
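One detail that may be worth tightening in an approach like this is the order of the colours: with an unnamed vector, which special value receives which colour depends on the order of the factor levels. A possible sketch, assuming the special values are NaN and Inf as in the question, is to build the factor with explicit levels and a named palette. The names `lab`, `lev`, and `pal`, and the use of colorRampPalette instead of RColorBrewer, are assumptions for illustration.

```r
set.seed(42)
z <- rnorm(100)
z[1:5] <- NaN
z[6:10] <- Inf

# Bin the finite values and label the special ones explicitly
finite <- is.finite(z)
bins <- cut(z[finite], 6)
lab <- rep(NA_character_, length(z))
lab[finite] <- as.character(bins)
lab[is.nan(z)] <- "NaN"
lab[is.infinite(z)] <- "Inf"

# Fixed level order: the six bins first, then the special values
lev <- c(levels(bins), "Inf", "NaN")
col <- factor(lab, levels = lev)

# Named palette, so each level is tied to a specific colour
pal <- setNames(c(colorRampPalette(c("lightblue", "darkblue"))(6), "red", "green"), lev)
```

Passing values = pal to scale_colour_manual matches colours to levels by name, so "Inf" is drawn in red and "NaN" in green regardless of the row order in the data.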
