Boxplot in R showing the mean (again)

I saw "Boxplot in R showing the mean".
I'm interested in the ggplot solution, but what I'm plotting are already averages, so I don't want to take an average of an average. I do have the true mean stored in TrueAvgCPC.
Here is what I tried, but it's not working:
p <- qplot(Mydf$Network,Mydf$Avg.CPC,data=Mydf,geom='boxplot')
p <- p+stat_summary(TrueAvgCPC,shape=1,col='red',geom='point')
print(p)
Thanks!

As far as I can see, you just want to add a true mean (or several?) to the box plot. If you already have the value(s), why use stat_summary instead of just plotting the points?
# sample data
library(ggplot2)
x <- rnorm(30)
y <- rep(letters[1:3], 10)
TrueAVGCPC <- c(0.34, 0.1, 0.44)
# put the true means in a small data frame so ggplot can map them to the groups
true_means <- data.frame(grp = letters[1:3], avg = TrueAVGCPC)
# plot: boxplots with the true means overlaid as red points
p <- qplot(y, x, geom = 'boxplot')
p <- p + geom_point(data = true_means, aes(x = grp, y = avg), col = "red")
print(p)
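Applied to the original data, a minimal sketch would look like the following. I'm assuming TrueAvgCPC is (or can be turned into) a data frame with one row per network; the stand-in network names and the TrueCPC column name are made up for illustration:
library(ggplot2)
# stand-ins for the real objects; Network and Avg.CPC follow the question,
# but the values and the TrueCPC column name are hypothetical
Mydf <- data.frame(Network = rep(c("Search", "Display"), each = 10),
                   Avg.CPC = runif(20, 0.5, 2))
TrueAvgCPC <- data.frame(Network = c("Search", "Display"),
                         TrueCPC = c(1.1, 0.9))
ggplot(Mydf, aes(x = Network, y = Avg.CPC)) +
  geom_boxplot() +
  geom_point(data = TrueAvgCPC, aes(x = Network, y = TrueCPC),
             shape = 1, colour = "red")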

geom_bar messing with y_axis scale

I have some elevation data that I would like to associate with climatic categories in a dataset. When I try to plot it as a barplot to see the distribution of the categories along the elevation, something in ggplot's geom_bar changes the y-axis scale to unexpected values.
Here is the example:
# Example dataset
library(ggplot2)
data_mountain_A <- data.frame(elevation = c(0, 500, 1000, 1500, 2000),
                              temperature = c(20, 16, 12, 8, 5),
                              name = "A")
data_mountain_B <- data.frame(elevation = c(0, 500, 1000, 1500, 2000, 2500, 3000),
                              temperature = c(20, 16, 12, 8, 5, 0, -5),
                              name = "B")
data_merge <- rbind(data_mountain_A, data_mountain_B)
# Create the temperature intervals
data_merge$temperature_intervals <- cut(data_merge$temperature, seq(-5, 20, 5))
# Fancy colours
colfunc <- colorRampPalette(c("white", "lightblue", "darkgreen"))
# Plot
ggplot(data = data_merge, aes(fill = temperature_intervals, y = elevation, x = name)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colfunc(5))
And here is the output I get (image not included).
Any hints on what I am doing wrong?
Thanks!
EDIT: I've found the issue; see the answer below.
I've found the issue: geom_bar with stat="identity" stacks the values, so the absolute elevations were being summed instead of treated as single measurements. I fixed it by replacing the absolute elevation values with the length of the elevation intervals (500 for every band):
# Example dataset
data_mountain_A <- data.frame(elevation = c(500, 500, 500, 500, 500),
                              temperature = c(20, 16, 12, 8, 5),
                              name = "A")
data_mountain_B <- data.frame(elevation = c(500, 500, 500, 500, 500, 500, 500),
                              temperature = c(20, 16, 12, 8, 5, 0, -5),
                              name = "B")
data_merge <- rbind(data_mountain_A, data_mountain_B)
# Create the temperature intervals
data_merge$temperature_intervals <- cut(data_merge$temperature, seq(-5, 20, 5))
# Fancy colours
colfunc <- colorRampPalette(c("white", "lightblue", "darkgreen"))
# Plot
ggplot(data = data_merge, aes(fill = temperature_intervals, y = elevation, x = name)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_manual(values = colfunc(5))
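If you would rather not overwrite the elevation column, a minimal sketch of the same idea is to stack a constant band height instead, assuming (as in the question) one measurement every 500 elevation units; geom_col() is just shorthand for geom_bar(stat = "identity"):
# assumes the original data_merge from the question (absolute elevations);
# each measurement represents one 500-unit band, so stack a constant height
band_width <- 500
ggplot(data_merge, aes(x = name, y = band_width, fill = temperature_intervals)) +
  geom_col() +
  scale_fill_manual(values = colfunc(5)) +
  labs(y = "elevation")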

Most frequent bin in a ggplot2 histogram in R

I am using ggplot2 to draw a histogram of a sample of size 1000 taken from a normal distribution. I need to place the letter 'A' at the center of the histogram, and I am doing that with the annotate function.
Since the vector is random, the "center" of the drawing changes a little every time I run the code, so I need a way for the function to place the 'A' according to that specific sample. For the x-axis I took the median of the sample; for the y-axis I was thinking of taking the frequency of the most frequent bin and dividing it by 2.
Does anybody know if there is a function that gives you the frequency of each bin?
Here is a reproducible example:
library(ggplot2)
set.seed(123)
x <- rnorm(1000)
qplot(x, geom="histogram")
Here is a way to get the coordinates of the output plot (on a reproducible example):
library(ggplot2)
x <- runif(10)
h <- qplot(x, geom="histogram")
ggplot_build(h)$data
This will give you all sorts of information on the histogram.
So to get the height of the most frequent class and divide by two, you just need to do
height <- max(ggplot_build(h)$data[[1]]$count) / 2
Using the same kind of information, you can also put the text always right in the middle of the plot:
# note: these are ggplot2 internals; in recent ggplot2 releases the panel ranges
# have moved (e.g. to ggplot_build(h)$layout$panel_params), so the exact path
# below depends on your ggplot2 version
ranges <- ggplot_build(h)$panel$ranges
xtext <- mean(ranges[[1]]$x.range)
ytext <- mean(ranges[[1]]$y.range)
h + annotate("text", xtext, ytext,
             label = "A", size = 30, color = "blue", alpha = 0.5)

Creating barplot with standard errors plotted in R

I am trying to find the best way to create barplots in R with standard errors displayed. I have seen other articles, but I cannot figure out the code to use with my own data (I have not used ggplot before, although it seems to be the most common approach, and barplot() does not cooperate with data frames). I need this for two cases, for which I have created two example data frames:
Plot df1 so that the x-axis has sites a-c, with the y-axis displaying the mean value of V1 and the standard errors highlighted, similar to this example, with a grey colour. Here, plant biomass should be the mean V1 value and the treatments should be each of my sites.
Plot df2 in the same way, but so that Before and After are located next to each other, in a similar way to this, so pre-test and post-test equate to Before and After in my example.
x <- factor(LETTERS[1:3])
site <- rep(x, each = 8)
values <- as.data.frame(matrix(sample(0:10, 3*8, replace=TRUE), ncol=1))
df1 <- cbind(site,values)
z <- factor(c("Before","After"))
when <- rep(z, each = 4)
df2 <- data.frame(when,df1)
Apologies for the simplicity for more experienced R users, particularly those that use ggplot, but I cannot apply snippets of code that I have found elsewhere to my data. I cannot even get enough code together to produce the start of a graph, so I hope my descriptions are sufficient. Thank you in advance.
Something like this?
library(ggplot2)
# standard error of the mean, returned as the ymin/ymax pair the errorbar geom expects
get.se <- function(y) {
  se <- sd(y) / sqrt(length(y))
  mu <- mean(y)
  c(ymin = mu - se, ymax = mu + se)
}
# df1: one bar per site, with +/- 1 SE error bars
# (ggplot2 >= 3.3 uses fun = ...; older versions call this argument fun.y = ...)
ggplot(df1, aes(x = site, y = V1)) +
  stat_summary(fun = mean, geom = "bar", fill = "lightgreen", color = "grey70") +
  stat_summary(fun.data = get.se, geom = "errorbar", width = 0.1)
# df2: bars dodged by when (Before/After)
ggplot(df2, aes(x = site, y = V1, fill = when)) +
  stat_summary(fun = mean, geom = "bar", position = "dodge", color = "grey70") +
  stat_summary(fun.data = get.se, geom = "errorbar", width = 0.1,
               position = position_dodge(width = 0.9))
This takes advantage of the stat_summary(...) function in ggplot to, first, summarize y for a given x using mean(...) (for the bars), and then to summarize y for a given x using the get.se(...) function (for the error bars). Another option would be to summarize your data prior to using ggplot, and then use geom_bar(...) and geom_errorbar(...).
Also, plotting +/- 1 SE is not great practice (although it is done often enough). You would be better served plotting legitimate confidence limits, which you could do, for instance, using the built-in mean_cl_normal function instead of the contrived get.se(...). mean_cl_normal returns the 95% confidence limits based on the assumption that the data are normally distributed (or you can set the confidence level to something else; read the documentation).
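For example, mean_cl_normal is a drop-in replacement for get.se(...) in the df1 plot (it wraps Hmisc::smean.cl.normal, so the Hmisc package must be installed):
library(ggplot2)
# 95% normal-theory confidence limits instead of +/- 1 SE
ggplot(df1, aes(x = site, y = V1)) +
  stat_summary(fun = mean, geom = "bar", fill = "lightgreen", color = "grey70") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1)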
I used the group_by and summarise_each functions from dplyr for this, and the std.error function from the plotrix package.
library(plotrix)  # for the std.error function
library(dplyr)    # for group_by and summarise_each
library(ggplot2)  # for building the plot
For the df1 plot:
# Group the data by site
grouped_df1 <- group_by(df1, site)
# Summarise the grouped data: mean and standard error (std.error is from plotrix)
summarised_df1 <- summarise_each(grouped_df1, funs(mean = mean, std_error = std.error))
# Define the top and bottom of the error bars
limits <- aes(ymax = mean + std_error, ymin = mean - std_error)
# Plot site vs mean
g <- ggplot(summarised_df1, aes(site, mean))
# Bars of the means
g <- g + geom_bar(stat = "identity", position = position_dodge())
# Error bars
g <- g + geom_errorbar(limits, width = 0.25, position = position_dodge(width = 0.9))
# Print the graph
g
For the df2 plot:
# Group the data by when and site
grouped_df2 <- group_by(df2, when, site)
# Summarise the grouped data: mean and standard error
summarised_df2 <- summarise_each(grouped_df2, funs(mean = mean, std_error = std.error))
# Define the top and bottom of the error bars
limits <- aes(ymax = mean + std_error, ymin = mean - std_error)
# Plot site vs mean, filled by the factor variable when;
# position_dodge() places the Before/After bars side by side
g <- ggplot(summarised_df2, aes(site, mean, fill = when))
g <- g + geom_bar(stat = "identity", position = position_dodge())
g <- g + geom_errorbar(limits, width = 0.25, position = position_dodge(width = 0.9))
# Print the graph
g
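Note that funs() and summarise_each() are superseded in current dplyr releases; a minimal sketch of the same df2 summary written with summarise() (and geom_col(), the shorthand for geom_bar(stat = "identity")) would be:
library(plotrix)
library(dplyr)
library(ggplot2)
# mean and standard error of V1 for every when/site combination
summarised_df2 <- df2 %>%
  group_by(when, site) %>%
  summarise(mean = mean(V1), std_error = std.error(V1), .groups = "drop")
ggplot(summarised_df2, aes(site, mean, fill = when)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(aes(ymin = mean - std_error, ymax = mean + std_error),
                width = 0.25, position = position_dodge(width = 0.9))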

R - How to histogram multiple matrices using qplot/ggplot2

I'm using R to read and plot data from NetCDF files (ncdf4). I've started using R only recently, so I'm quite confused; I beg your pardon.
Let's say that from the files I obtain N 2-D matrices of numerical values, each with different dimensions and many NA values.
I have to histogram these values in the same plot, with bins of a given width and within given limits, the same for every matrix.
For just one matrix, I can do this:
library(ncdf4)
library(ggplot2)
file0 <- nc_open("test.nc")
# Read a variable
prec0 <- ncvar_get(file0, "pr")
# Some settings
min_plot <- 0
max_plot <- 30
bin_width <- 2
xlabel <- "mm/day"
ylabel <- "PDF"
title <- "Precipitation"
# Get the maximum of the array, excluding NAs
maximum_prec0 <- max(prec0, na.rm = TRUE)
# Store the histogram
histo_prec0 <- hist(prec0, xlim = c(min_plot, max_plot), right = FALSE,
                    breaks = seq(0, ceiling(maximum_prec0), by = bin_width))
# Plot the histogram densities using points instead of bars, which is what we want
qplot(histo_prec0$mids, histo_prec0$density, xlim = c(min_plot, max_plot),
      color = I("yellow"), xlab = xlabel, ylab = ylabel, main = title, log = "y")
# If necessary, a matrix can be transformed into a vector using
# vector_prec0 <- c(prec0)
However, it occurs to me that it would be best to use a data frame for plotting multiple matrices. I'm not certain of that, nor of how to do it. This would also allow for automatic legends and all the other advantages that come from using data frames with ggplot2.
What I want to achieve is something akin to this:
https://copy.com/thumbs_public/j86WLyOWRs4N1VTi/scatter_histo.jpg?size=1024
Where on Y we have the Density and on X the bins.
Thanks in advance.
To be honest, it is unclear what you are after (a scatter plot, or a histogram of the data drawn as points?).
Here are a couple of examples using ggplot which might fit your goal (based on your last sentence: "Where on Y we have the Density and on X the bins"):
library(ggplot2)
# some data
nsample <- 200
d1 <- rnorm(nsample, 1, 0.5)
d2 <- rnorm(nsample, 2, 0.6)
# transformed into histogram bins and collected in a data frame
hist.d1 <- hist(d1, plot = FALSE)
hist.d2 <- hist(d2, plot = FALSE)
data.d1 <- data.frame(bin = hist.d1$mids, den = hist.d1$density, group = 1)
data.d2 <- data.frame(bin = hist.d2$mids, den = hist.d2$density, group = 2)
ddata <- rbind(data.d1, data.d2)
ddata$group <- factor(ddata$group)
# plot: one point per bin, coloured by group
plots <- ggplot(data = ddata, aes(x = bin, y = den, group = group)) +
  geom_point(aes(color = group)) +
  geom_line(aes(color = group))  # optional
print(plots)
However, you could also produce smooth density plots (or histograms) directly in ggplot:
ddata2 <- data.frame(group = factor(c(rep(1, nsample), rep(2, nsample))),
                     value = c(d1, d2))
plots2 <- ggplot(data = ddata2, aes(x = value, group = group)) +
  geom_density(aes(color = group))
  # geom_histogram(aes(color = group, fill = group))  # for a histogram instead
windows()  # Windows-only device; use dev.new() (or nothing) on other platforms
print(plots2)
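For the original use case (several matrices, a fixed bin width and fixed limits), the same idea scales up by flattening each matrix with c() and binding the per-matrix histograms into one long data frame. A minimal sketch with made-up matrices standing in for the NetCDF variables (the names prec_list, model_A and model_B are hypothetical):
library(ggplot2)
# stand-ins for the matrices that ncvar_get() would return
prec_list <- list(model_A = matrix(rexp(200, 1/3), nrow = 10),
                  model_B = matrix(rexp(300, 1/4), nrow = 15))
min_plot <- 0
max_plot <- 30
bin_width <- 2
breaks <- seq(min_plot, max_plot, by = bin_width)
# one row per (dataset, bin): flatten each matrix, drop NAs and out-of-range values,
# then histogram every dataset with the same breaks
histo_df <- do.call(rbind, lapply(names(prec_list), function(nm) {
  v <- c(prec_list[[nm]])
  v <- v[!is.na(v) & v >= min_plot & v < max_plot]
  h <- hist(v, breaks = breaks, right = FALSE, plot = FALSE)
  data.frame(dataset = nm, mids = h$mids, density = h$density)
}))
# drop empty bins so the log scale does not complain about log10(0)
ggplot(subset(histo_df, density > 0), aes(x = mids, y = density, colour = dataset)) +
  geom_point() +
  scale_y_log10() +
  labs(x = "mm/day", y = "PDF", title = "Precipitation")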

coloured vlines on a POSIXct axis with a different dataset in ggplot2

I'd like to annotate my plot of bugs with the release date, which is in another data.frame, but I'd like the colour of each vline to match that of the corresponding original trace.
# first create some dummy data
set.seed(123)
N <- 100
adf <- data.frame(version = sample(c('A', 'B', 'C'), N, replace = TRUE),
                  cs = as.POSIXct('2011-06-01 00:00') + rnorm(N, 20, 70) * 86400)
# let's just shift things slightly, depending on version
adf$cs <- adf$cs + (as.integer(adf$version) - 1) * 5e6
adf <- adf[order(adf$cs), ]
library(plyr)
adf <- ddply(adf, .(version), function(bdf) { cbind(bdf, bugno = 1:nrow(bdf)) })
# now let's plot these bug curves by version
library(ggplot2)
q <- qplot(cs, bugno, data = adf, geom = 'line', colour = version,
           xlab = '', ylab = 'Number of Bugs')
print(q)
# however I'd like to annotate these plots by adding the
# dates of "release", with the colour matching that of release
# in the plot q, so no further annotation necessary (hopefully!)
g.res <- data.frame(version = c('A', 'B', 'C'),
                    releasedate = c(as.Date('2011-06-01'), as.Date('2011-10-01'),
                                    as.Date('2012-01-01')))
# works... but only in blue...
q + geom_vline(data = g.res, aes(xintercept = as.POSIXct(releasedate)), col = "blue")
I am aware of "Axis breaks at noon each day of ggplot2 chart" and "How to get a vertical geom_vline to an x-axis of class date?".
And since I've put all this work into the question, I've just realised the answer... the colour must be part of the aes! I still don't have a proper understanding of how aes works; I'll have to read the book again! :-)
q + geom_vline(data = g.res, aes(xintercept = as.POSIXct(releasedate), col = version))
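A minimal, self-contained illustration of the rule at work (aesthetics inside aes() are mapped from the layer's data and share the plot's colour scale, while aesthetics outside aes() are fixed constants); the dummy data frame here is made up:
library(ggplot2)
dummy <- data.frame(version = c("A", "B"),
                    release = as.POSIXct(c("2011-06-01", "2011-10-01")))
base <- ggplot(dummy, aes(x = release, y = 0, colour = version)) + geom_point()
# fixed constant: every line is blue and nothing is added to the legend
base + geom_vline(data = dummy, aes(xintercept = release), colour = "blue")
# mapped aesthetic: the line colours come from the shared 'version' scale
base + geom_vline(data = dummy, aes(xintercept = release, colour = version))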
