Plot bin-averaged values with error bars in R

I have a dataframe with three columns: "DateTime", "T_ET", and "LAI". I want to plot T_ET (on the y-axis) against LAI (on the x-axis), along with 0.1-bin LAI-averaged values of T_ET on the same plot, something like the figure below (Wei et al., 2017):
In the figure above, the y-axis is T_ET, i.e. T/(E+T), and the x-axis is LAI. The red open diamonds with error bars are the 0.1-bin LAI averages of the black points and their standard deviations, the solid line is a regression of the data points (estimated from the bin averages), and n is the number of available data points. The dashed lines are the 95% confidence bounds.
How can I obtain a plot similar to the one above? Please find the sample data using the following link: file
or use the following sample data:
or use following sample data:
df <- structure(list(DateTime = structure(c(1478088000, 1478347200, 1478692800, 1478779200, 1478865600, 1478952000, 1479124800, 1479211200, 1479297600, 1479470400), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
T_ET = c(0.996408350852751, 0.904748351479432, 0.28771236118773, 0.364402232484906, 0.452348409759872, 0.415408041501318, 0.629291202120187, 0.812083112145703, 0.992414777441755, 0.818032913071265),
LAI = c(1.3434, 1.4669, 1.6316, 1.6727, 1.8476, 2.0225, 2.3723, 2.5472, 2.7221, 3.0719)),
row.names = c(NA, 10L),
class = "data.frame")

You can do this directly while plotting via stat_summary_bin(). By default, the associated geom is pointrange and the summary function is mean_se(). bins= controls the number of bins, but you can also supply binwidth=. Note that with the pointrange geom, fatten controls the size of the central point:
ggplot(df, aes(LAI, T_ET)) + geom_point() + theme_classic() +
  stat_summary_bin(bins=3, color='red', shape=5, fatten=5)
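Applied to the data in the question, you can ask for the 0.1-LAI bins and the mean ± one standard deviation shown in the referenced figure (a sketch; mean_sdl() needs the Hmisc package installed, and mult=1 switches it from its default of two standard deviations to one):
library(ggplot2)
ggplot(df, aes(LAI, T_ET)) + geom_point() + theme_classic() +
  stat_summary_bin(binwidth=0.1, fun.data=mean_sdl, fun.args=list(mult=1),
                   color='red', shape=5, fatten=5)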
Your sample data is a little light, so here's another example using the diamonds dataset. Here I'm reconstructing the look of the example plot you show by combining the errorbar and point geoms. Please note that setting the width of the errorbar doesn't appear to work correctly with stat_summary_bin().
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
  stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
  stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
  theme_classic()
EDIT: Showing Regression for Binned Data
As indicated in the comments, drawing a regression line based on the binned data rather than the original data is possible, but not through stat_summary_bin() unless you are okay with using loess. If you want a linear regression, you'll need to bin the data outside of ggplot and then plot the regression on the binned data.
This is probably by design: it's inherently not a good idea to fit a regression line (a way of summarizing data) to data that have already been summarized. Regardless, here's one way to do it with the diamonds dataset. We can use the cut() function to split the data into separate bins, then summarize the data on those binned values. Because of the way cut() labels its output, we have to create our own labels. Since we're cutting into 12 equal-width pieces in this example, I'm creating 12 evenly spaced bin midpoints on the x axis for the binned values to sit at. This may be different in your case; just take care to label according to what the data represent and what makes the most statistical sense.
df <- diamonds
# setting interval labels: midpoints of the 12 equal-width bins
bin_width <- diff(range(df$carat)) / 12
bin_labels <- range(df$carat)[1] + bin_width/2 + (0:11) * bin_width
# cutting the data
df$bins <- cut(df$carat, breaks=12, labels=bin_labels)
df$bins <- as.numeric(levels(df$bins))[df$bins] # convert factor labels back to numeric
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
  stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
  stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
  geom_smooth(data=df, aes(x=bins), method='lm', color='blue') +
  theme_classic()
Note that the regression line above weights all binned values equally. This is generally not a good idea unless your data are spread evenly across the range. If you're going to draw a regression line, I'd still recommend linking it to the original data, which is much more representative of the reality within your data. That would look like this:
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
  stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
  stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
  geom_smooth(method='lm', color='green') +
  theme_classic()
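If you really do want the line fitted to the bin means, a middle ground (a sketch reusing bin_labels from above; the count-based weights are my addition, not part of the answer itself) is to weight each bin mean by the number of points it summarizes, so sparsely populated bins pull the line less:
library(dplyr)
library(ggplot2)
binned <- diamonds %>%
  mutate(bins = as.numeric(as.character(cut(carat, breaks=12, labels=bin_labels)))) %>%
  group_by(bins) %>%
  summarise(price = mean(price), n = n())
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
  stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
  stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
  geom_smooth(data=binned, aes(x=bins, weight=n), method='lm', color='blue') +
  theme_classic()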
When it comes down to it, drawing a regression line for binned data is summarizing the summarized data rather than summarizing your original data. It's statistical heresy, so use it at your own risk. But if you simply must for whatever strange reason... I can't stop you. ;)

Related

Probability density Matrix Subtraction for heatmaps

Pretty new to R and stuck. I am attempting to normalize the 2D probability density of a heat map by subtracting the 2D probability density of another data set. I am looking at where behaviors occur in space; however, to do this I want to subtract out where the subjects simply spend most of their time from where the behaviors are occurring, to get an idea of the relative density of just the behaviors. To do this I am trying to find the probability density matrices used to plot a heatmap with the following code:
ctrlplot <- ctrl %>% ggplot(aes(x=x, y=y)) +
  stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE) +
  scale_fill_gradientn(colours=matlab.like(15), na.value = "gray",  # matlab.like() is from the colorRamps package
                       limit=c(0,1.3e-05)) # sets the static limit of the probabilities
This works to make the heat map for either data set; however, I cannot find where ggplot or stat_density_2d stores the density data so that I can subtract the two.
Alternatively I have tried to get just the densities for both data sets using the following code and storing it as the variable dens:
library(MASS)  # bandwidth.nrd() and kde2d()
n <- 100
h <- c(bandwidth.nrd(ctrl$x), bandwidth.nrd(ctrl$y))
dens <- kde2d(ctrl$x, ctrl$y, n=n, h=h)
Now I am not sure how to subtract the resulting z values and get the result back into a heat plot. I know there is likely an easy solution for this, but I am definitely stuck. Any advice on how to do this more easily, or other suggestions on how to subtract the densities from one another, would be greatly appreciated.
UPDATE:
I found a way to pull the density data from ggplot. I was able to pull the density data from the two data sets, subtract the vectors, and place the differences back into the original data frame using the following code:
ctrlplot <- ctrl %>% ggplot(aes(x=x, y=y)) +
  stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE) +
  scale_fill_gradientn(colours=matlab.like(15), na.value = "gray")
ctxplot <- ctx %>% ggplot(aes(x=x, y=y)) +
  stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE) +
  scale_fill_gradientn(colours=matlab.like(15), na.value = "gray")
ctrlplot2<-ggplot_build(ctrlplot)
gbctrl<-ctrlplot2$data[[1]]
densctrl<-gbctrl$density
gbctx<-ggplot_build(ctxplot)
gbctx<-gbctx$data[[1]]
densctx<-gbctx$density
diff_ctrl_ctx<-densctrl-densctx
gbctrl$density<-diff_ctrl_ctx
ctrlplot2$data[[1]]<-gbctrl
ctrlplot2
ctrlplot
However, the last two plots, ctrlplot (original) and ctrlplot2 (subtracted densities), give the same plot. I am not sure whether I am replacing the correct parts of the built object so that the change carries through to the rendered graph, since there are several lists in the original ggplot_build output.
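One way to sidestep ggplot_build() entirely (a sketch, assuming ctrl and ctx both have numeric x and y columns) is to evaluate both densities on the same grid with kde2d(), subtract the z matrices, and plot the difference with geom_raster():
library(MASS)    # kde2d(), bandwidth.nrd()
library(ggplot2)
n <- 100
lims <- c(range(c(ctrl$x, ctx$x)), range(c(ctrl$y, ctx$y)))  # shared grid for both densities
h <- c(bandwidth.nrd(ctrl$x), bandwidth.nrd(ctrl$y))
dens_ctrl <- kde2d(ctrl$x, ctrl$y, n=n, h=h, lims=lims)
dens_ctx  <- kde2d(ctx$x, ctx$y, n=n, h=h, lims=lims)
diff_df <- expand.grid(x = dens_ctrl$x, y = dens_ctrl$y)
diff_df$z <- as.vector(dens_ctrl$z - dens_ctx$z)  # column-major order matches expand.grid
ggplot(diff_df, aes(x=x, y=y, fill=z)) +
  geom_raster() +
  scale_fill_gradient2()  # diverging scale centred at zero for the difference
Positive cells are where ctrl is relatively denser than ctx; swap the subtraction if you want it the other way around.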

How do I create a barplot in R with a cumulative standard deviation?

I want to make a plot similar to the one attached by Lindfield et al. 2016. I'm familiar with the ggplot command in R with the format:
ggplot(dataframe, aes(x, y)) + geom_bar(stat = 'identity')
However, I don't know how to make cumulative SE error bars for a stacked barplot; I only know how to do it for plots that use position_dodge.
I know that there are disadvantages to using stacked bars with SE error bars, but for my data set it is more presentable than unstacked barplots.
Thanks.
I don't know how you would get the cumulative standard errors in an appropriate way (I guess it depends on how your values are generated), but I think you need to calculate them and store them in a second data frame. For example, if you have an initial data.frame created like this:
DF <- data.frame( x=c("a","a","b","b"),
sp=c("shark","cod","shark","cod"),
y=c(10,5,15,7),
stringsAsFactors=FALSE )
where y is the value associated with each species at each x point. Then you'd create a second data frame containing the lower and upper limits of your s.e. for each x value, e.g.
seDF <- data.frame( x=c('a','b'),
yl=c(12,18),
yu=c(17,24),
stringsAsFactors=FALSE )
Then you can create your plot with:
ggplot() +
  geom_bar(data=DF, mapping=aes(x=x, y=y, fill=sp),
           position="stack", stat="identity") +
  geom_linerange(data=seDF, mapping=aes(x=x, ymin=yl, ymax=yu))
I used geom_linerange rather than geom_errorbar, as it doesn't create crossbars at either end.
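If each species' value comes with its own standard error, one way to build seDF from DF (a sketch; the se column and the sum-in-quadrature rule for independent errors are assumptions, not something given in the question) would be:
library(dplyr)
DF$se <- c(1.2, 0.8, 1.5, 1.1)  # hypothetical per-species standard errors
seDF <- DF %>%
  group_by(x) %>%
  summarise(total = sum(y),
            se = sqrt(sum(se^2))) %>%  # combine independent SEs in quadrature
  transmute(x, yl = total - se, yu = total + se)
The resulting seDF drops straight into the geom_linerange() call above.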

Displaying smoothed (convolved) densities with ggplot2

I'm trying to display some frequencies convolved with a Gaussian kernel in ggplot2. I tried smoothing the lines with:
+ stat_smooth(se = F,method = "lm", formula = y ~ poly(x, 24))
Without success.
I read an article suggesting that the frequencies should be convolved with a Gaussian kernel, which ggplot2's stat_density function (http://docs.ggplot2.org/current/stat_density.html) seems to be able to produce.
However, I can't seem to be able to replace my geometry with stat_density. Is there anything wrong with my code?
require(reshape2)
library(ggplot2)
library(RColorBrewer)
fileName = "/1.csv" # downloadable there: https://www.dropbox.com/s/l5j7ckmm5s9lo8j/1.csv?dl=0
mydata = read.csv(fileName,sep=",", header=TRUE)
dataM = melt(mydata,c("bins"))
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data=dataM,
aes(x=bins, y=value, colour=variable)) +
geom_line() + scale_x_continuous(limits = c(0, 2))
This code produces the following plot:
I'm looking to smooth the lines a little bit, so they look more like this:
(from http://journal.frontiersin.org/Journal/10.3389/fncom.2013.00189/full)
Since my comments solved your problem, I'll convert them to an answer:
The density function takes individual measurements and calculates a kernel density distribution by convolution (gaussian is the default kernel). For example, plot(density(rnorm(1000))). You can control the smoothness with the bw (bandwidth) parameter. For example, plot(density(rnorm(1000), bw=0.01)).
But your data frame is already a density distribution (analogous to the output of the density function). To generate a smoother density estimate, you need to start with the underlying data and run density on it, adjusting bw to get the smoothness where you want it.
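For instance, with simulated raw measurements (not the questioner's data), a minimal sketch of that route in ggplot2 would be:
library(ggplot2)
set.seed(1)
raw <- data.frame(x = c(rnorm(500, 0), rnorm(500, 2)),
                  group = rep(c("a", "b"), each = 500))
ggplot(raw, aes(x, colour = group)) +
  geom_density(bw = 0.1)  # increase bw for a smoother curve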
If you don't have access to the underlying data, you can smooth out your existing density distributions as follows:
ggplot(data=dataM, aes(x=bins, y=value, colour=variable)) +
  geom_smooth(se=FALSE, span=0.3) +
  scale_x_continuous(limits = c(0, 2))
Play around with the span parameter to get the smoothness you want.

5 dimensional plot in r

I am trying to make a 5-dimensional plot in R. I am currently using the rgl package to plot my data in 4 dimensions, using three variables as the x, y, and z coordinates and another variable as the color. I am wondering if I can add a fifth variable with this package, for example as the size or shape of the points. Here's an example of my data and my current code:
set.seed(1)
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
colnames(df) <- c("var1","var2","var3","var4","var5")
require(rgl)
plot3d(df$var1, df$var2, df$var3, col=as.numeric(df$var4), size=0.5, type='s',xlab="var1",ylab="var2",zlab="var3")
I hope it is possible to add the 5th dimension.
Many thanks,
Here is a ggplot2 option. I usually shy away from 3D plots, as they are hard to interpret properly. I also almost never put 5 continuous variables in the same plot, as I have here...
library(ggplot2)
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
  geom_point(shape=21) +
  scale_color_gradient(low="red", high="green") +
  scale_size_continuous(range=c(1,12))
While this is a bit messy, you can actually reasonably read all 5 dimensions for most points.
A better approach to multi-dimensional plotting opens up if some of your variables are categorical. If all your variables are continuous, you can turn some of them into categorical variables with cut and then use facet_wrap or facet_grid to plot those.
For example, here I break up var3 and var4 into quintiles and use facet_grid on them. Note that I also keep the color aesthetics, to highlight that turning a continuous variable into a categorical one is usually good enough in high-dimensional plots to get the key points across (here you'll notice that the fill and border colors are fairly uniform within any given grid cell):
df$var4.cat <- cut(df$var4, quantile(df$var4, (0:5)/5), include.lowest=T)
df$var3.cat <- cut(df$var3, quantile(df$var3, (0:5)/5), include.lowest=T)
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
  geom_point(shape=21) +
  scale_color_gradient(low="red", high="green") +
  scale_size_continuous(range=c(1,12)) +
  facet_grid(var3.cat ~ var4.cat)
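For completeness, staying within rgl as the question originally asked, one possibility (a sketch; mapping var5 to the sphere radius is my interpretation of using size as a fifth dimension) is:
library(rgl)
plot3d(df$var1, df$var2, df$var3, col=as.numeric(df$var4), type='s',
       radius=3 + 5*df$var5,  # fifth variable mapped to sphere size
       xlab="var1", ylab="var2", zlab="var3")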

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location),
                 f2=factor(data$Station), data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2)
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) +
  geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
                                position = position_dodge(height = 0, width = 0.75), size = 3)
plot1 + xlab("MPS Station") + ylab("Depth(m)") +
  theme(legend.title=element_blank()) + scale_y_reverse() +
  coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2 + scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
  xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to to combine those two plots, by adding "plot2" (Station depth scatterplot) to "plot1" boxplots (Detection depths). They are both looking at the same variables (depth and station), and must be the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the col=f1 argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different data sets by referencing them from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot() +
  geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) +
  geom_point(data=df2, aes(x=f2, y=depth), colour="blue") +
  scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded it into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.
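For reference, dummy data of the kind described above (hypothetical station and site values, just to make the snippet self-contained) could be generated like this:
library(ggplot2)
set.seed(42)
df <- data.frame(f1 = rep(c("siteA", "siteB"), 60),
                 f2 = factor(rep(1:12, each = 10)),
                 depth = runif(120, 0, 150))
df2 <- data.frame(f2 = factor(1:12), depth = runif(12, 20, 140))
ggplot() +
  geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) +
  geom_point(data=df2, aes(x=f2, y=depth), colour="blue") +
  scale_y_reverse()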
