Using hex binning for downsampling QQ plots - r

My datasets are pretty large, and rendering the generated QQ plots is slow and sometimes even freezes my browser. I know that one option is simply to downsample the data vector. However, I wanted to try hex binning instead of downsampling. Unfortunately, I couldn't make it work (two of my several attempts are shown below). If downsampling can be achieved with hex binning (which I suspect it can, as it's similar to histograms), I'd appreciate it if someone could show me how to do it. I use ggplot2. Thanks!
g <- ggplot(df, aes(x=var)) + stat_qq(aes(x = var), geom = "hex")
g <- ggplot(df, aes(x = var, y = ..density..)) +
geom_hex(aes(sample = var), stat = "qq")
print (g)
The first call results in the following error message:
Error: stat_qq requires the following missing aesthetics: sample
The second call results in this message:
Error in eval(expr, envir, enclos) : object 'density' not found
UPDATE: I think a more correct variant is this, but I'm not sure what the arguments should be:
g <- ggplot(df, aes(??, ??)) + stat_binhex()

Not sure if this is what you are looking for exactly, but I offer a couple ways to do hexagonal binning. First with ggplot as you are trying to work with and the second with the package hexbin which seems to look better to me, but just my preference.
library(ggplot2)
x <- rgamma(1000,8,2)
y <- rnorm(1000,4,1.5)
binFrame <- data.frame(x,y)
qplot(x,y,data=binFrame, geom='bin2d') # with ggplot...rectangular binning actually
library(hexbin)
hexbinplot(y~x, data=binFrame) # with hexbin...actually hexagonal binning
Edit:
So I was thinking a bit about this at lunch, and I think the fundamental issue is that hexbinning is a multidimensional data reduction technique, while it seems like you are trying to do univariate QQ plots on a really large sample, but with hexbins in ggplot. At any rate, I can think of a way to do hexbin plots with ggplot, but the best I came up with is to start from scratch and manually construct both the theoretical quantiles (x) and the sample quantiles (y). So here is what I came up with.
Basic QQ-Plot Manually
# Set up a manual QQ plot, used below both with and without hexbins
xSamp <- rgamma(1000,8,.5) # sample data
len <- length(xSamp) # number of observations
i <- seq(1,len,by=1)
probSeq <- (i-.5)/len # probability grid
invCDF <- qnorm(probSeq,0,1) # theoretical quantiles for the standard normal; you could compare your sample to any distribution
orderGam <- sort(xSamp) # ordered sample
df <- data.frame(invCDF,orderGam)
plot(invCDF,orderGam,xlab="Standard Normal Theoretical Quantiles",ylab="Sample Quantiles",main="QQ-Plot")
abline(lm(orderGam~invCDF),col="red",lwd=2)
QQ Plot With Hexbins in ggplot:
ggplot(df, aes(invCDF, orderGam)) + stat_binhex() + geom_smooth(method="lm")
[Figure: QQ plot with ggplot]
So at the end of the day this might not scale up readily, but if you are looking to do true multidimensional tests of normality you might think about chi-square plots for multivariate normality. cheers
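If the goal is simply fewer points on screen, the plain downsampling route mentioned in the question can be combined with the manual construction above. The following is only a minimal sketch under assumptions that are not in the answer: a standard normal reference distribution, a simulated sample standing in for the real data, and an arbitrary choice of 1000 plotted points (`n_points` is a made-up name). It evaluates the sample quantiles on a coarse probability grid instead of plotting every observation.

```r
set.seed(1)
x <- rgamma(1e6, 8, 0.5)   # stand-in for a large sample (assumption)

n_points <- 1000           # number of points to draw (arbitrary choice)
probs <- (seq_len(n_points) - 0.5) / n_points   # coarse probability grid

qq_small <- data.frame(
  theoretical = qnorm(probs),                      # reference quantiles
  sample      = quantile(x, probs, names = FALSE)  # matching sample quantiles
)

# Only n_points points are drawn, regardless of length(x)
plot(qq_small$theoretical, qq_small$sample,
     xlab = "Standard Normal Theoretical Quantiles",
     ylab = "Sample Quantiles", main = "Downsampled QQ-Plot")
```

Note that an evenly spaced grid thins the tails as well, so the most extreme observations are not shown; whether that matters depends on what the plot is used for.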

Related

Plotting a subset of envfit results onto an ordination

I'm working on a figure for a publication where we are looking at a combination of plant coverage and other environmental data on differing communities. I am trying to make a multi-panel figure, with one panel that displays all the envfit results, one that displays only the plants, and one that displays only the other environmental variables. Because of the complexity of the figure, it's actually a little easier to construct with base plot functions than with ggvegan.
My challenge is figuring out how to subset the results of the envfit analysis object for the different panels. A simplified example would be:
library(vegan)
data("mite")
data("mite.env")
set.seed(55)
nmds<-metaMDS(mite)
set.seed(55)
ef<-envfit(nmds, mite.env, permu=999)
plot(ef, p.max = .05)
which produces this figure
For sake of the example, does anyone have suggestions on a way I could create two separate figures, one with only the WatrCont vector and one with only the SubsDens vector? I'm sure there is a way to pull specific results out of the ef object, but my coding is not savvy enough to understand how.
Additionally, is there a way to have the jumble of text at the center not overlap, similar to jitter in ggplot?
Thank y'all for all of your help!
I would suggest extracting the data from nmds and ef and using ggplot to add the required elements to your plots.
Here is an example:
library(vegan)
library(ggplot2)
data("mite")
data("mite.env")
set.seed(55)
nmds<-metaMDS(mite)
set.seed(55)
ef<-envfit(nmds, mite.env, permu=999)
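# Get the NMDS site scores (ask for "sites" explicitly; newer vegan versions can return sites and species together)
nmds_values <- as.data.frame(scores(nmds, display = "sites"))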
# Get the coordinates of the vectors produced for continuous predictors in your envfit
vector_coordinates <- as.data.frame(scores(ef, "vectors")) * ordiArrowMul(ef)
# Plot the vectors separately
ggplot(nmds_values,
aes(x=NMDS1, y = NMDS2)) +
geom_point() +
geom_segment(aes(x=0, y=0, xend=NMDS1, yend=NMDS2),
vector_coordinates[1,]) +
geom_text(aes(x=NMDS1,y=NMDS2),
vector_coordinates[1,],
label=row.names(vector_coordinates[1,]))
ggplot(nmds_values,
aes(x=NMDS1, y = NMDS2)) +
geom_point() +
geom_segment(aes(x=0, y=0, xend=NMDS1, yend=NMDS2),
vector_coordinates[2,]) +
geom_text(aes(x=NMDS1,y=NMDS2),
vector_coordinates[2,],
label=row.names(vector_coordinates[2,]))
You can play around with the colours, size of the different elements as you see fit. Coordinates for categorical predictors can be extracted in a similar manner.
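As a rough sketch of the extraction step, the rows of the vector scores are named after the variables, so single vectors can be picked by name rather than by position, and factor centroids come from the same `scores()` call with `display = "factors"`. The object names `vectors`, `watrcont`, `subsdens` and `centroids` are made up for this illustration, and the arrow scaling (`ordiArrowMul`) is left out to keep the example short.

```r
library(vegan)
data("mite")
data("mite.env")

set.seed(55)
nmds <- metaMDS(mite, trace = FALSE)
set.seed(55)
ef <- envfit(nmds, mite.env, permutations = 999)

# Unscaled arrow directions for the continuous predictors
vectors <- as.data.frame(scores(ef, display = "vectors"))

# Pick individual vectors by variable name instead of row index
watrcont <- vectors["WatrCont", , drop = FALSE]
subsdens <- vectors["SubsDens", , drop = FALSE]

# Centroids for the levels of the categorical predictors
centroids <- as.data.frame(scores(ef, display = "factors"))

print(watrcont)
print(head(centroids))
```

Each of these small data frames can then be passed as the `data` argument of `geom_segment()` or `geom_text()` in the plots above.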

how to plot only most abundant species in NMDS?

I need to plot an ordination plot showing only, let's say, the 20 most abundant species.
I tried to compute the sum of each species column and then select only the species above a certain sum:
abu <- colSums(dune)
abu
sol <- metaMDS(dune)
sol
plot(sol, type="text", display="species", select = abu > 40)
I get this error: select is not a graphical parameter
I would expect to see only a small number of species, but that does not happen.
How do you show only a small number of species in the NMDS plot?
This is not straightforward. You are getting an error because select is not a parameter of this plot method. Unfortunately, the result of the analysis is not a data.frame that could be handled easily (e.g. with tidyverse), and even more unfortunately, the plot() function called is not your standard plot, but a method defined specifically for objects of this class. The authors of this method did not foresee your need, and therefore we must make the plot manually. But to do that, we need to understand what is being plotted and how.
Let us find out more about the object sol:
class(sol)
# [1] "metaMDS" "monoMDS"
methods(class="metaMDS")
# [1] goodness nobs plot points print scores sppscores<- text
Oh good, we have a plot method. After a moment of digging, we find it in the vegan package (not exported, so we need to access it via vegan:::plot.metaMDS). It appears to be a wrapper around a function called ordiplot. We edit the function with edit() to figure out what it is doing. Essentially, it boils down to the following (with loads of unnecessary code):
Y <- scores(sol, display="species")
plot(Y, type="n")
text(Y[,1], Y[,2], rownames(Y), col="red")
This is, more or less, your plot. Choosing the species to show is now trivial, but first we must make sure that rows of Y are in the same order as columns of dune:
all(colnames(dune) == rownames(Y))
Y.sel <- Y[colSums(dune) > 40, ]
plot(Y.sel[,1], Y.sel[,2], type="n", xlim=c(-.8, .8), ylim=c(-.4, .4))
text(Y.sel[,1], Y.sel[,2], rownames(Y.sel), col="red")
We can of course make a much nicer plot, for example with ggplot (it is definitely possible to make a much nicer plot with base R as well). We could actually show the abundance of the plants using the size aesthetic:
library(ggplot2)
library(ggrepel)
Y <- data.frame(Y)
Y$abundance <- colSums(dune)
Y$labels <- rownames(Y)
ggplot(Y, aes(x=NMDS1, y=NMDS2, size=abundance)) +
geom_point() + geom_text_repel(aes(label=labels)) +
theme_minimal()
To filter the species by abundance, we now can do the following:
library(tidyverse)
Y %>% filter(abundance > 40) %>%
ggplot(aes(x=NMDS1, y=NMDS2, size=abundance)) + # the filtered data arrives through the pipe
geom_point() + geom_text_repel(aes(label=labels)) +
theme_minimal()
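The question asked for the 20 most abundant species rather than for a fixed threshold. A minimal sketch of that variant, assuming abundance is measured as the column sum of dune (as in the question) and using made-up names `n_top`, `top` and `Y.top`: rank the species by total abundance and index the species scores by name, which also avoids relying on the row order of the scores.

```r
library(vegan)
data(dune)

set.seed(1)
sol <- metaMDS(dune, trace = FALSE)
Y <- scores(sol, display = "species")

n_top <- 20                      # how many species to keep
abu <- colSums(dune)             # total abundance per species
top <- names(sort(abu, decreasing = TRUE))[seq_len(n_top)]

# Select rows by species name, so the order of Y does not matter
Y.top <- Y[top, , drop = FALSE]

plot(Y.top[, 1], Y.top[, 2], type = "n", xlab = "NMDS1", ylab = "NMDS2")
text(Y.top[, 1], Y.top[, 2], rownames(Y.top), col = "red")
```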

Combining output from smatr with ggplot2

I have a dataset of leaf trait measurements made at multiple sites at two contrasting seasons. I am interested to explore the association/line fit between a pair of traits and to differentiate the seasons at each site.
Rather than a linear regression, I would prefer to use the Standardised Major Axis approach within the smatr package:
e.g. sma.site1 <- sma(TraitA ~ TraitB * Visit, data=subset(myfile, Site=="Site1")) # testing the null hypothesis of common slopes for the two Visits (Seasons) at a given Site.
I can produce a handy lattice plot in ggplot2 with a separate panel for each Site and the points differentiated by Visit:
e.g. qplot(TraitB, TraitA, data=myfile, colour=Visit) + facet_wrap(~Site, ncol=2)
However, if I add trend lines fitted with the additional argument in ggplot2:
+ geom_smooth(aes(group=Visit), method="lm", se=F)
……, those lines are not a good match for the sma coefficients.
What I would like to do is fit the lines suggested by the sma test onto the ggplot lattice. Is there an easy, or efficient, way to do that?
I know that I can subset the data, produce a plot for each site, add the relevant lines with + geom_abline() and then stitch the separate plots up together with grid.arrange(). But that feels very long-winded.
I would be grateful for any pointers.
I don't know anything about the smatr package but you should be able to tweak this to get the right values. Since you provided no data I used the leaf data from the example in the pkg. The basic idea is to pull out the slope & intercept from the returned sma object and then facet the geom_abline. I may be misinterpreting the object, though.
library(smatr)
library(ggplot2)
data(leaflife)
do.call(rbind, lapply(unique(leaflife$site), function(x) {
obj <- sma(longev~lma*rain, data=subset(leaflife, site=x))
data.frame(site=x,
intercept=obj$coef[[1]][1, 1],
slope=obj$coef[[1]][2, 1])
})) -> fits
gg <- ggplot(leaflife)
gg <- gg + geom_point(aes(x=lma, y=longev, color=soilp))
gg <- gg + geom_abline(data=fits, aes(slope=slope, intercept=intercept))
gg <- gg + facet_wrap(~site, ncol=2)
gg
I just saw this question and am not sure if you are still interested in it. I ran the code by hrbrmstr and found that the only thing you actually need to change is:
obj <- sma(longev~lma*rain, data=subset(leaflife, site == x))
then you can get the plot with four lines for each group.

Correlation matrix plot with ggplot2

I want to create a correlation matrix plot, i.e. a plot where each variable is plotted in a scatterplot against each other variable like with pairs() or splom(). I want to do this with ggplot2. See here for examples. The link mentions some code someone wrote for doing this in ggplot2, however, it is outdated and no longer works (even after you swap out the deprecated parts).
One could do this with a loop in a loop and then multiplot(), but there must be a better way. I tried melting the dataset to long, and copying the value and variable variables and then using facets. This almost gives you something correct.
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
library(reshape2)
d = melt(d)
d$value2 = d$value
d$variable2 = d$variable
library(ggplot2)
ggplot(data=d, aes(x=value, y=value2)) +
geom_point() +
facet_grid(variable ~ variable2)
This gets the general structure right, but it only plots each variable against itself. Is there some more clever way of doing this without resorting to two loops?
library(GGally)
set.seed(42)
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
# estimated density in diagonal
ggpairs(d)
# blank
ggpairs(d, diag = list("continuous"="blank"))
Using the PerformanceAnalytics library:
library("PerformanceAnalytics")
chart.Correlation(d, histogram = TRUE, pch = 19) # d is the wide data frame with x1..x5
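For the facet-based attempt in the question, the missing piece is that every x variable has to be paired with every y variable, not just with itself. A minimal sketch of one way to build that long format in base R, with made-up names `pairs_grid` and `long` and only three variables to keep it short (the inner `lapply` stands in for the two explicit loops):

```r
library(ggplot2)

set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
vars <- names(d)

# Every (x variable, y variable) combination, including the diagonal
pairs_grid <- expand.grid(xvar = vars, yvar = vars, stringsAsFactors = FALSE)

# Stack one block of rows per combination
long <- do.call(rbind, lapply(seq_len(nrow(pairs_grid)), function(i) {
  data.frame(xvar = pairs_grid$xvar[i],
             yvar = pairs_grid$yvar[i],
             x = d[[pairs_grid$xvar[i]]],
             y = d[[pairs_grid$yvar[i]]])
}))

p <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(yvar ~ xvar)
print(p)
```

This gives a plain scatterplot matrix only; for densities or correlations in the panels, ggpairs() above is the more complete tool.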

Creating a facet_wrap plot with ggplot2 with different annotations in each plot

I am using ggplot2 to explore the result of some testing on an agent-based model. The model can end in one of three rounds per realization, and as such I am interested in how player utilities differ in terms of what round the game ends and their relative position in 2D space.
All this is to say that I have generated a facet_wrap plot to show this for each round, but I would also like to annotate each plot with the cor(x,y) for the subset of data represented in each facet. Is there a way to tell ggplot2 that I would like the annotation to use the subset of data generated by facet_wrap? Here is the code I have so far, and what it is producing
library(ggplot2)
# Load data
abm.data<-read.csv("ABM_results.csv")
# Create new column for area of Pareto set
attach(abm.data)
area<-abs(((x3*(y2-y1))+(x2*(y1-y3))+(x1*(y3-y2)))/2)
abm.data<-transform(abm.data,area=area)
detach(abm.data)
# Compare area of Pareto set with player utility
png("area_p1.png",res=100,pointsize=20,height=500,width=1600)
area.p1<-ggplot(abm.data,aes(x=area))+geom_point(aes(y=U1_2,colour="Player 1",alpha=0.4))+facet_wrap(~round,ncol=3)+
annotate("text",0.375,-1.25,label=paste("rho=",round(cor(abm.data$area,abm.data$U1_2),2)), parse=TRUE)+
scale_colour_manual(values=c("Player 1"="red"))
area.p1+xlab("Area of Pareto Set")+ylab("Player Utility at Game End")+
labs(title="Final Player 1 Utility by Pareto Set Size and Round Game Ends")+theme(legend.position="none") # opts() is no longer available in current ggplot2
dev.off()
(source: drewconway.com)
As you can see, there are two problems:
The \rho value is of the full dataset, rather than the subsets by 'round'. Is there a way to get the cor(x,y) to print based on only the data shown in each plot?
The annotation should read "\rho=some_value" but instead I get "=(\rho,value);" is there a way to fix this?
To fix the second problem use
annotate("text", 0.375, -1.25,
label=paste("rho==", round(cor(abm.data$area, abm.data$U1_2), 2)),
parse=TRUE)
i.e. "rho==".
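A small sketch of why the extra `=` matters: with parse=TRUE the label is parsed as an R expression, so "rho= 0.12" is read as an assignment, whereas "rho== 0.12" is read as the plotmath equality operator. The value 0.12 is just a placeholder for the correlation.

```r
# Parse both label variants the way parse = TRUE would
bad  <- parse(text = paste("rho=", 0.12))    # text is "rho= 0.12"
good <- parse(text = paste("rho==", 0.12))   # text is "rho== 0.12"

# The function at the head of each parsed call
op_bad  <- as.character(bad[[1]][[1]])   # assignment operator
op_good <- as.character(good[[1]][[1]])  # equality operator, which plotmath draws as "="

print(c(op_bad, op_good))
```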
Edit: Here is a solution to solve the first problem
library("plyr")
library("ggplot2")
set.seed(1)
df <- data.frame(x=rnorm(300), y=rnorm(300), cl=gl(3,100)) # create test data
df.cor <- ddply(df, .(cl), function(val) sprintf("rho==%.2f", cor(val$x, val$y)))
p1 <- ggplot(data=df, aes(x=x)) +
geom_point(aes(y=y, colour="col1", alpha=0.4)) +
facet_wrap(~ cl, ncol=3) +
geom_text(data=df.cor, aes(x=0, y=3, label=V1), parse=TRUE) +
scale_colour_manual(values=c("col1"="red")) +
theme(legend.position="none") # theme() replaces the old opts()
print(p1)
The same question comes up for adding segments to each facet. These problems can be solved in general by using a geom (e.g. geom_segment) instead of annotate("segment", ...): for any geom_foo, define a data.frame that holds the per-facet data, including the faceting variable, and pass it to the geom via its data argument.
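A minimal sketch of the segment case, reusing the simulated test data from the answer above. The choice of segment (a horizontal line at each facet's mean of y, spanning that facet's x range) and the name `seg` are assumptions for illustration only.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(300), y = rnorm(300), cl = gl(3, 100))

# One row per facet; the faceting variable cl must be present
seg <- do.call(rbind, lapply(split(df, df$cl), function(g) {
  data.frame(cl   = g$cl[1],
             x    = min(g$x),
             xend = max(g$x),
             y    = mean(g$y),
             yend = mean(g$y))
}))

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~ cl, ncol = 3) +
  geom_segment(data = seg,
               aes(x = x, y = y, xend = xend, yend = yend),
               colour = "blue")
print(p)
```

Because `seg` carries the `cl` column, ggplot2 places each row in the matching facet, which is what annotate() cannot do.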
