R - How to histogram multiple matrixes using qplot/ggplot2 - r

I'm using R to read and plot data from NetCDF files (ncdf4). I've only recently started using R, so please bear with me.
Say I obtain from the files N 2-D matrices of numerical values, each with different dimensions and many NA values.
I have to histogram these values in the same plot, with bins of a given width and within given limits, the same for every matrix.
For just one matrix, I can do this:
library(ncdf4)
library(ggplot2)
file0 <- nc_open("test.nc")
#Read a variable
prec0 <- ncvar_get(file0,"pr")
#Some settings
min_plot=0
max_plot=30
bin_width=2
xlabel="mm/day"
ylabel="PDF"
title="Precipitation"
#Get maximum of array, exclude NAs
maximum_prec0=max(prec0, na.rm=TRUE)
#Store the histogram
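#Extend the last break to a multiple of bin_width so that it always covers the maximum
histo_prec0 <- hist(prec0, xlim=c(min_plot,max_plot), right=FALSE, breaks=seq(0,ceiling(maximum_prec0/bin_width)*bin_width,by=bin_width))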
#Plot the histogram densities using points instead of bars, which is what we want
qplot(histo_prec0$mids, histo_prec0$density, xlim=c(min_plot,max_plot), color=I("yellow"), xlab=xlabel, ylab=ylabel, main=title, log="y")
#If necessary, can transform matrix to vector using
#vector_prec0 <- c(prec0)
However, it occurs to me that it would be best to use a data frame for plotting multiple matrices. I'm not certain of that, nor of how to do it. A data frame would also allow for automatic legends and all the other advantages of using data frames with ggplot2.
What I want to achieve is something akin to this:
https://copy.com/thumbs_public/j86WLyOWRs4N1VTi/scatter_histo.jpg?size=1024
Where on Y we have the Density and on X the bins.
Thanks in advance.

To be honest, it is unclear what you are after (scatter plot or histogram of data with values as points?).
Here are a couple of examples using ggplot which might fit your goals (based on your last sentence: "Where on Y we have the Density and on X the bins"):
# some data
nsample<- 200
d1<- rnorm(nsample,1,0.5)
d2<- rnorm(nsample,2,0.6)
#transformed into histogram bins and collected in a data frame
hist.d1<- hist(d1)
hist.d2<- hist(d2)
data.d1<- data.frame(hist.d1$mids, hist.d1$density, rep(1,length(hist.d1$density)))
data.d2<- data.frame(hist.d2$mids, hist.d2$density, rep(2,length(hist.d2$density)))
colnames(data.d1)<- c("bin","den","group")
colnames(data.d2)<- c("bin","den","group")
ddata<- rbind(data.d1,data.d2)
ddata$group<- factor(ddata$group)
# plot
plots<- ggplot(data=ddata, aes(x=bin, y=den, group=group)) +
geom_point(aes(color=group)) +
geom_line(aes(color=group)) #optional
print(plots)
However, you could also produce smooth density plots (or histograms) directly in ggplot:
ddata2<- cbind(c(rep(1,nsample),rep(2,nsample)),c(d1,d2))
ddata2<- as.data.frame(ddata2)
colnames(ddata2)<- c("group","value")
ddata2$group<- factor(ddata2$group)
plots2<- ggplot(data=ddata2, aes(x=value, group=group)) +
geom_density(aes(color=group))
# geom_histogram(aes(color=group, fill=group)) # for histogram instead
dev.new() # opens a new graphics device on any platform (windows() exists only on Windows)
print(plots2)
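To connect this back to the question, here is a minimal sketch of the same idea for several matrices that contain NAs and must share one set of bins. The list `mats`, its names `model_a` and `model_b`, and the simulated values are assumptions standing in for matrices read with ncvar_get(); the bin width and axis labels are taken from the question.

```r
library(ggplot2)

# Hypothetical stand-ins for matrices read with ncvar_get(); sizes and NA counts are made up
set.seed(1)
mats <- list(
  model_a = matrix(rexp(600, rate = 0.2), nrow = 20),
  model_b = matrix(rexp(800, rate = 0.3), nrow = 40)
)
mats$model_a[sample(600, 50)] <- NA
mats$model_b[sample(800, 80)] <- NA

bin_width <- 2
# One set of breaks for every matrix, extended so that it covers the overall maximum
overall_max <- max(unlist(mats), na.rm = TRUE)
breaks <- seq(0, ceiling(overall_max / bin_width) * bin_width, by = bin_width)

# One row per bin and matrix: bin midpoint, density, and a label for the legend
binned <- do.call(rbind, lapply(names(mats), function(nm) {
  h <- hist(mats[[nm]], breaks = breaks, right = FALSE, plot = FALSE)
  data.frame(bin = h$mids, den = h$density, group = nm)
}))
# Empty bins have zero density, which cannot be drawn on a log axis
binned <- binned[binned$den > 0, ]

p <- ggplot(binned, aes(x = bin, y = den, colour = group)) +
  geom_point() +
  scale_y_log10() +
  coord_cartesian(xlim = c(0, 30)) +
  labs(x = "mm/day", y = "PDF", title = "Precipitation")
print(p)
```

hist() drops NA values on its own, so the matrices can be passed in directly. Because the group label is an ordinary column, ggplot builds the legend automatically.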

Related

Plotting a subset of envfit results onto an ordination

I'm working on a figure for a publication where we are looking at a combination of plant coverage and other environmental data on differing communities. I am trying to make a multi-panel figure, with one panel that displays all the envfit results, one that displays only the plants, and one that displays only the other environmental variables. Because of the complexity of the figure, it's actually a little easier to construct with the base plot function than with ggvegan.
My challenge is figuring out how to subset the results of the envfit analysis object for the different panels. A simplified example would be:
library(vegan)
data("mite")
data("mite.env")
set.seed(55)
nmds<-metaMDS(mite)
set.seed(55)
ef<-envfit(nmds, mite.env, permu=999)
plot(ef, p.max = .05)
which produces this figure
For the sake of the example, does anyone have suggestions on how I could create two separate figures, one with only the WatrCont vector and one with only the SubsDens vector? I'm sure there is a way to pull specific results out of the ef object, but my coding is not savvy enough to understand how.
Additionally, is there a way to have the jumble of text at the center not overlap, similar to jitter in ggplot?
Thank y'all for all of your help!
I would suggest extracting the data from nmds and ef and using ggplot to add the required elements to your plots.
Here is an example:
library(vegan)
library(ggplot2)
data("mite")
data("mite.env")
set.seed(55)
nmds<-metaMDS(mite)
set.seed(55)
ef<-envfit(nmds, mite.env, permu=999)
# Get the NMDS scores
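nmds_values <- as.data.frame(scores(nmds, display = "sites")) # site scores only; newer vegan versions return sites and species together by default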
# Get the coordinates of the vectors produced for continuous predictors in your envfit
vector_coordinates <- as.data.frame(scores(ef, "vectors")) * ordiArrowMul(ef)
# Plot the vectors separately
ggplot(nmds_values,
aes(x=NMDS1, y = NMDS2)) +
geom_point() +
geom_segment(aes(x=0, y=0, xend=NMDS1, yend=NMDS2),
vector_coordinates[1,]) +
geom_text(aes(x=NMDS1,y=NMDS2),
vector_coordinates[1,],
label=row.names(vector_coordinates[1,]))
ggplot(nmds_values,
aes(x=NMDS1, y = NMDS2)) +
geom_point() +
geom_segment(aes(x=0, y=0, xend=NMDS1, yend=NMDS2),
vector_coordinates[2,]) +
geom_text(aes(x=NMDS1,y=NMDS2),
vector_coordinates[2,],
label=row.names(vector_coordinates[2,]))
You can play around with the colours and sizes of the different elements as you see fit. Coordinates for categorical predictors can be extracted in a similar manner.
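Selecting vectors by name rather than by row index may be easier to read and less fragile. The following is a sketch under some assumptions: `plot_vectors` is a hypothetical helper, not part of vegan, and it assumes the p-values in `ef$vectors$pvals` are stored in the same order as the rows returned by scores(ef, display = "vectors"). The arrows are drawn in envfit's own units without ordiArrowMul(), so you may want to multiply them by a constant to fill the panel.

```r
library(vegan)
library(ggplot2)

data("mite")
data("mite.env")
set.seed(55)
nmds <- metaMDS(mite, trace = FALSE)
set.seed(55)
ef <- envfit(nmds, mite.env, permutations = 999)

# Site scores and vector end points as plain data frames
site_scores <- as.data.frame(scores(nmds, display = "sites"))
vectors <- as.data.frame(scores(ef, display = "vectors"))
vectors$variable <- rownames(vectors)
# Assumption: pvals are in the same order as the vector rows
vectors$pval <- as.numeric(ef$vectors$pvals)

# Hypothetical helper: draw the ordination with only the requested vectors
plot_vectors <- function(keep, p_max = 0.05) {
  sel <- vectors[vectors$variable %in% keep & vectors$pval <= p_max, ]
  ggplot(site_scores, aes(x = NMDS1, y = NMDS2)) +
    geom_point() +
    geom_segment(data = sel, aes(x = 0, y = 0, xend = NMDS1, yend = NMDS2),
                 arrow = arrow(length = unit(0.2, "cm"))) +
    geom_text(data = sel, aes(label = variable), vjust = -0.5)
}

print(plot_vectors("WatrCont"))
print(plot_vectors("SubsDens"))
```

Passing several names, for example plot_vectors(c("WatrCont", "SubsDens")), draws them in one panel, and the `p_max` argument mirrors the `p.max` filter of the base plot method.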

Sorting data vector for a histogram using ggplot and R

So I have 10,000 values in a vector from a Monte Carlo simulation. I want to plot this data as a histogram and a density plot. Doing this with the hist() function is easy, and it will calculate the frequency of the different values automatically. However, I would like to do this in ggplot.
My biggest problem right now is how to transform the data so ggplot can handle it. I would like my x-axis to show the "price" while the y-axis shows the frequency or density. My data has a lot of decimals, as shown in the example data below.
myData <- c(266.8997, 271.5137, 225.4786, 223.3533, 258.1245, 199.5601, 234.2341, 231.7850, 260.2091, 184.5102, 272.8287, 203.7482, 212.5140, 220.9094, 221.2627, 236.3224)
My current code using the hist()-function, and the plot is shown below.
hist(myData,
xlab ="Price",
prob=TRUE)
lines(density(myData))
Histogram for the data vector containing 10000 values
How would you sort the data, and how would you do this with ggplot? Should I round the numbers as well?
Hard to say exactly without seeing a sample of your data, but ggplot needs a data frame rather than a bare vector, so have you tried:
ggplot(data.frame(Price = myData), aes(Price)) + geom_histogram()
or:
ggplot(data.frame(Price = myData), aes(Price)) + geom_density()
Just try this:
ggplot() +
geom_bar(aes(myData)) +
geom_density(aes(myData))
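To reproduce the hist(..., prob=TRUE) plus lines(density(...)) figure more closely, the histogram can be put on the density scale so that both layers share one y-axis. This is a sketch only: the simulated `myData` is a stand-in for the real Monte Carlo output, `price_df` is a name made up here, and after_stat() assumes ggplot2 3.3.0 or later. No sorting or rounding is needed, since the binning is done by geom_histogram().

```r
library(ggplot2)

# Hypothetical stand-in for the 10,000 simulated prices
set.seed(42)
myData <- rnorm(10000, mean = 230, sd = 25)

# ggplot works on data frames, so wrap the vector in one
price_df <- data.frame(Price = myData)

p <- ggplot(price_df, aes(x = Price)) +
  # Bar heights are densities instead of counts, matching the density curve
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey80", colour = "grey40") +
  geom_density() +
  labs(x = "Price", y = "Density")
print(p)
```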

How to do a 2d heatmap with color smoothing ... or a density plot from absolute values?

I've done the rounds here and via google without a solution, so please help if you can.
I'm looking to create something like this (painSensitivityHeatMap) using ggplot2.
I can create something kinda similar using geom_tile (uglySolutionUsingTile), but without the smoothing between data points ... the only solution I have found requires a lot of code and data interpolation. Not very elegant, methinks.
So I'm thinking, I could coerce the density2d plots to my purposes instead by having the plot use fixed values rather than a calculated data-point density -- much in the same way that stat='identity' can be used in histograms to make them represent data values, rather than data counts.
So a minimal working example:
df <- expand.grid(letters[1:5], LETTERS[1:5])
df$value <- sample(1:4, 25, replace=TRUE)
# A not so pretty, non-smooth tile plot
ggplot(df, aes(x=Var1, y=Var2, fill=value)) + geom_tile()
# A potentially beautiful density2d plot, except it fails :-(
ggplot(df, aes(x=Var1, y=Var2)) + geom_density2d(aes(color=..value..))
This took me a little while, but here is a solution for future reference
A solution using idw from the gstat package and spsample from the sp package.
I've written a function which takes a dataframe, number of blocks (tiles) and a low and upper anchor for the colour scale.
The function creates a polygon (a simple quadrant of 5x5) and from that creates a grid of that shape.
In my data, the location variables are ordered factors, so I unclass them into numbers (1 to 5, corresponding to the polygon grid) and convert them to coordinates, thus converting tmpDF from a data frame to a spatial data frame. Note: there are no overlapping/duplicate locations, i.e. 25 observations corresponding to the 5x5 grid.
The idw function fills in the polygon-grid (newdata) with inverse-distance weighted values ... in other words, it interpolates my data to the full polygon grid of a given number of tiles ('blocks').
Finally I create a ggplot based on a color gradient from the colorRamps package
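library(sp) # Polygon, spsample, coordinates
library(gstat) # idw
library(dplyr) # %>%, mutate, filter
library(ggplot2)
library(colorRamps) # matlab.like2
painMapLumbar <- function(tmpDF, blocks=2500, lowLimit=min(tmpDF$value), highLimit=max(tmpDF$value)) {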
# Create polygon to represent the lower back (lumbar)
poly <- Polygon(matrix(c(0.5, 0.5,0.5, 5.5,5.5, 5.5,5.5, 0.5,0.5, 0.5), ncol=2, byrow=TRUE))
# Create a grid of datapoints from the polygon
polyGrid <- spsample(poly, n=blocks, type="regular")
# Filter out the data for the figure we want
tmpDF <- tmpDF %>% mutate(x=unclass(x)) %>% mutate(y=unclass(y))
tmpDF <- tmpDF %>% filter(y<6) # Lumbar region only
coordinates(tmpDF) <- ~x+y
# Interpolate the data as Inverse Distance Weighted
invDistanceWeigthed <- as.data.frame(idw(formula = value ~ 1, locations = tmpDF, newdata = polyGrid))
p <- ggplot(invDistanceWeigthed, aes(x=x1, y=x2, fill=var1.pred)) + geom_tile() + scale_fill_gradientn(colours=matlab.like2(100), limits=c(lowLimit,highLimit))
return(p)
}
I hope this is useful to someone ... thanks for the replies above ... they helped me move on.
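For comparison, a lighter-weight route that may be enough in some cases is geom_raster(interpolate = TRUE). This is a different technique from the IDW approach above: the smoothing is a simple interpolation done by the graphics device when it draws the raster, so the result can vary between devices and there is no control over the weighting. The sketch below reuses the question's example data; the `x` and `y` columns are names introduced here.

```r
library(ggplot2)

set.seed(1)
df <- expand.grid(Var1 = letters[1:5], Var2 = LETTERS[1:5])
df$value <- sample(1:4, 25, replace = TRUE)

# Raster interpolation needs numeric, evenly spaced positions,
# so use the integer codes of the factors
df$x <- as.integer(df$Var1)
df$y <- as.integer(df$Var2)

p <- ggplot(df, aes(x = x, y = y, fill = value)) +
  geom_raster(interpolate = TRUE) +
  # Put the original factor levels back on the axes
  scale_x_continuous(breaks = 1:5, labels = levels(df$Var1)) +
  scale_y_continuous(breaks = 1:5, labels = levels(df$Var2))
print(p)
```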

Manually specifying bins with stat_summary2d

I have a large set of data that consists of coordinates (x,y) and a numeric z value that is similar to density. I'm interested in binning the data, performing summary statistics (median, length, etc.) and plotting the binned values as points with the statistics mapped to ggplot aesthetics.
I've tried using stat_summary2d and extracting the results manually (based on this answer: https://stackoverflow.com/a/22013347/2832911). However, the problem I'm running into is that the bin placements are based on the range of the data, which in my case varies by data set. Thus between two plots the bins are not covering the same area.
My question is how to either manually set bins using stat_summary2d, or at least set them to be consistent regardless of the data.
Here is a basic example which demonstrates the approach and how the bins don't line up:
library(ggplot2)
set.seed(2)
df1 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
df2 <- data.frame(x=runif(100, -1,1), y=runif(100, -1,1), z=rnorm(100))
g1 <- ggplot(df1, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df1.binned <-
data.frame(with(ggplot_build(g1)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=1)))
g2 <- ggplot(df2, aes(x,y))+stat_summary2d(fun=mean, bins=10, aes(z=z))+geom_point()
df2.binned <-
data.frame(with(ggplot_build(g2)$data[[1]],
cbind(x=(xmax+xmin)/2, y=(ymax+ymin)/2, z=value, df=2)))
df.binned <- rbind(df1.binned, df2.binned)
ggplot(df.binned, aes(x,y, size=z, color=factor(df)))+geom_point(alpha=.5)
Which generates
In reality I will use stat_summary2d several times to get, for instance, the number of points in the bin, and the median and then use aes(size=bin.length, colour=bin.median).
Any tips on how to accomplish this using my proposed approach, or an alternative approach would be welcome.
You can manually set breaks with stat_summary2d. If you want 10 levels from -1 to 1 you can do
bb<-seq(-1,1,length.out=10+1)
breaks<-list(x=bb, y=bb)
And then use the breaks variable when you call your plots
g1 <- ggplot(df1, aes(x,y))+
stat_summary2d(fun=mean, breaks=breaks, aes(z=z))+
geom_point()
It's a shame you can't change the geom of stat_summary2d to "point" so you could make this in one go, but it doesn't look as though stat_summary2d calculates the proper x and y values for that.
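As an alternative that avoids ggplot_build() altogether, the binning can be done up front with cut() and aggregate(), which keeps the bins identical across data sets and returns several statistics at once. This is a sketch, not the approach from the answer above: `bin_summary` is a hypothetical helper, and the column names `bin.median` and `bin.length` are borrowed from the question's intended aesthetics.

```r
library(ggplot2)

set.seed(2)
df1 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))
df2 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))

# Shared breaks and the corresponding bin midpoints
bb <- seq(-1, 1, length.out = 10 + 1)
mids <- (head(bb, -1) + tail(bb, -1)) / 2

# Hypothetical helper: summarise z per (x, y) bin using the shared breaks
bin_summary <- function(d, label) {
  ix <- cut(d$x, bb, include.lowest = TRUE, labels = FALSE)
  iy <- cut(d$y, bb, include.lowest = TRUE, labels = FALSE)
  # FUN returns two values, so the result column `x` is a two-column matrix
  out <- aggregate(d$z, by = list(ix = ix, iy = iy),
                   FUN = function(v) c(median = median(v), n = length(v)))
  data.frame(x = mids[out$ix], y = mids[out$iy],
             bin.median = out$x[, "median"], bin.length = out$x[, "n"],
             df = label)
}

df.binned <- rbind(bin_summary(df1, "1"), bin_summary(df2, "2"))

p <- ggplot(df.binned, aes(x, y, size = bin.length, colour = bin.median)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ df)
print(p)
```

Only non-empty bins appear in the result, and points outside the range of `bb` would be dropped, so the breaks should span all data sets.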

adding text to ggplot geom_jitter points that match a condition

How can I add text to points rendered with geom_jitter to label them? geom_text will not work because I don't know the coordinates of the jittered dots. Can the positions of the jittered points be captured so that I can pass them to geom_text?
My practical usage would be to plot a boxplot with the geom_jitter over it to show the data distribution and I would like to label the outliers dots or the ones that match certain condition (for example the lower 10% for the values used for color the plots).
One solution would be to capture the xy positions of the jittered plots and use it later in another layer, is that possible?
[update]
From Joran's answer, a solution is to calculate the jittered values with the jitter function from the base package, add them to a data frame and use them with geom_point. For filtering, he used ddply to create a filter column (a logical vector) and used it to subset the data in geom_text.
He asked for a minimal dataset. I just modified his example (with a unique identifier in the label column):
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''))
This is the result of Joran's example with my data, with the displayed ids limited to the lowest 1%.
And this is a modification of the code to have colors by another variable and displaying some values of this variable (the lowest 1% for each group):
library("ggplot2")
library("plyr") # provides ddply()
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''),quality= rnorm(300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create indicator variables that pick out those
# obs that are in the lowest 1% of y and of quality within each x group
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.01);
g$top_q <- g$quality <= quantile(g$quality,0.01);
g})
#Create a boxplot, overlay the jittered points and
# label the bottom 1% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj,colour=quality)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab)) +
geom_text(data=subset(datJit,top_q),aes(x=xj,label=sprintf("%0.2f",quality)))
Your question isn't completely clear; for example, you mention labeling points at one point but also mention coloring points, so I'm not sure which you really mean, or perhaps both. A reproducible example would be very helpful. But using a little guesswork on my part, the following code does what I think you're describing:
library("ggplot2")
library("plyr") # provides ddply()
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=rep('label',300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create an indicator variable that picks out those
# obs that are in lowest 10% by x
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.1); g})
#Create a boxplot, overlay the jittered points and
# label the bottom 10% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab))
Just an addition to Joran's wonderful solution:
I ran into trouble with the x-axis positioning when I tried to use it in a faceted plot using facet_wrap(). The problem is that ggplot2 uses 1 as the x-value on every facet. The solution is to create a vector of jittered 1s:
datJit$xj <- jitter(rep(1,length(dat$x)),amount=0.1)
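Another option that may avoid precomputing the jitter is to give both layers the same position_jitter() with a fixed seed, which assumes ggplot2 3.0.0 or later. The sketch below relies on both layers using the same rows in the same order, so instead of subsetting the data for geom_text, the unwanted labels are set to empty strings. The column `shown` and the use of ave() in place of ddply are choices made here, not part of the answers above.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rep(letters[1:3], times = 100), y = runif(300),
                  lab = paste0("id_", 1:300))

# Label only the lowest 10% of y within each x group; other labels stay empty
cutoff <- ave(dat$y, dat$x, FUN = function(v) quantile(v, 0.1))
dat$shown <- ifelse(dat$y <= cutoff, dat$lab, "")

# The same seed and width in both layers should give identical offsets
jit <- position_jitter(width = 0.2, height = 0, seed = 123)

p <- ggplot(dat, aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = jit) +
  geom_text(aes(label = shown), position = jit, vjust = -0.6, size = 3)
print(p)
```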
