I can't find a description of what the end points of the lines of a boxplot represent.
For example, here are point values above and below where the lines end.
(I realize that the bottom and top of the box are the 25th and 75th percentiles, and the centerline is the 50th.) Since there are points above and below the lines, I assume the line ends do not represent the max/min values.
The "dots" at the end of the boxplot represent outliers. There are a number of different rules for determining if a point is an outlier, but the method that R and ggplot use is the "1.5 rule". If a data point is:
less than Q1 - 1.5*IQR
greater than Q3 + 1.5*IQR
then that point is classed as an "outlier". The whiskers are defined as:
upper whisker = min(max(x), Q_3 + 1.5 * IQR)
lower whisker = max(min(x), Q_1 - 1.5 * IQR)
where IQR = Q_3 - Q_1, the box length. So the upper whisker is located at the smaller of the maximum x value and Q_3 + 1.5 * IQR,
whereas the lower whisker is located at the larger of the smallest x value and Q_1 - 1.5 * IQR.
Additional information
See the wikipedia boxplot page for alternative outlier rules.
There are actually a variety of ways of calculating quantiles. Have a look at `?quantile` for the description of the nine different methods.
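As a small illustration of how much the quantile method can matter, the sketch below computes the lower quartile of a made-up vector (hypothetical values, chosen only for illustration) with three of the `type` options of quantile() and with fivenum(), whose hinges are what boxplot.stats() uses for the box.

```r
# Hypothetical data, chosen only to make the differences visible
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q_type7 <- unname(quantile(x, 0.25, type = 7))  # R's default method
q_type6 <- unname(quantile(x, 0.25, type = 6))
q_type1 <- unname(quantile(x, 0.25, type = 1))  # inverse of the empirical CDF
hinge   <- fivenum(x)[2]                        # lower hinge

c(type7 = q_type7, type6 = q_type6, type1 = q_type1, hinge = hinge)
```

For this vector the four values are 3.25, 2.75, 3 and 3, so the position of the box edge (and therefore of the 1.5 * IQR cut-off) depends on the method chosen.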
Example
Consider the following example
> set.seed(1)
> x = rlnorm(20, 1/2)#skewed data
> par(mfrow=c(1,3))
> boxplot(x, range=1.7, main="range=1.7")
> boxplot(x, range=1.5, main="range=1.5")#default
> boxplot(x, range=0, main="range=0")#The same as range="Very big number"
This gives the following plot:
As we decrease range from 1.7 to 1.5 we reduce the length of the whisker. However, range=0 is a special case: it is equivalent to "range=infinity", so the whiskers extend to the data extremes.
I think ggplot uses the standard defaults, the same as boxplot: "the whiskers extend to the most extreme data point which is no more than [1.5] times the length of the box away from the box"
See: boxplot.stats
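To see that rule in numbers, here is a minimal sketch using a small made-up vector (hypothetical data, not from the question). It computes the whisker ends by hand and compares them with what boxplot.stats() reports.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # hypothetical data

h <- fivenum(x)                  # min, lower hinge, median, upper hinge, max
box_len <- h[4] - h[2]           # length of the box
lower_fence <- h[2] - 1.5 * box_len
upper_fence <- h[4] + 1.5 * box_len

# Whiskers end at the most extreme data points that lie inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

s <- boxplot.stats(x, coef = 1.5)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # points that are drawn individually
```

Here the upper fence is 15.5, but the upper whisker stops at 9, the largest observation inside the fence, and 30 is drawn as a separate point.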
P1IMSA Tutorial 8 - Understanding Box and Whisker Plots video offers a visual step-by-step explanation of (Tukey) box and whisker plots.
At 4m 23s I explain the meaning of the whisker ends and its relationship to the 1.5*IQR.
Although the chart shown in the video was rendered using D3.js rather than R, its explanations jibe with the R implementations of boxplots mentioned.
As highlighted by @TemplateRex in a comment, ggplot doesn't draw the whiskers at the upper/lower quartile plus/minus 1.5 times the IQR. It actually draws them at max(x[x < Q3 + 1.5 * IQR]) and min(x[x > Q1 - 1.5 * IQR]). For example, here is a plot drawn using geom_boxplot where I've added a dashed line at the value Q1 - 1.5*IQR:
Q1 = 52
Q3 = 65
Q1 - 1.5 * IQR = 52 - 13*1.5 = 32.5 (dashed line)
Lower whisker = min(x[x > Q1 - 1.5 * IQR]) = 35 (where x is the data used to create the boxplot, outlier is at x = 27).
MWE
Note this isn't the exact code I used to produce the image above, but it gets the point across.
library("ggplot2") # For ggplot() and geom_boxplot()
library("mosaic") # For favstats()
df <- c(54, 41, 55, 66, 71, 50, 65, 54, 72, 46, 36, 64, 49, 64, 73,
52, 53, 66, 49, 64, 44, 56, 49, 54, 61, 55, 52, 64, 60, 54, 59,
67, 58, 51, 63, 55, 67, 68, 54, 53, 58, 26, 53, 56, 61, 51, 51,
50, 51, 68, 60, 67, 66, 51, 60, 52, 79, 62, 55, 74, 62, 59, 35,
67, 58, 74, 48, 53, 40, 62, 67, 57, 68, 56, 75, 55, 41, 50, 73,
57, 62, 61, 48, 60, 64, 53, 53, 66, 58, 51, 68, 69, 69, 58, 54,
57, 65, 78, 70, 52, 59, 52, 65, 70, 53, 57, 72, 47, 50, 70, 41,
64, 59, 58, 65, 57, 60, 70, 46, 40, 76, 60, 64, 51, 38, 67, 57,
64, 51)
df <- as.data.frame(df)
Q1 <- favstats(df)$Q1
Q3 <- favstats(df)$Q3
IQR <- Q3 - Q1
lowerlim <- Q1 - 1.5*IQR
upperlim <- Q3 + 1.5* IQR
boxplot_Tukey_lower <- min(df[df > lowerlim])
boxplot_Tukey_upper <- max(df[df < upperlim])
ggplot(df, aes(x = "", y = df)) +
stat_boxplot(geom ='errorbar', width = 0.5) +
geom_boxplot() +
geom_hline(yintercept = lowerlim, linetype = "dashed") +
geom_hline(yintercept = upperlim, linetype = "dashed")
Related
I'm trying to figure out how to calculate the area of a river cross section.
For the cross section I have the depth at every 25 cm over the 5 m wide river.
x_profile <- seq(0, 500, 25)
y_profile = c(50, 73, 64, 59, 60, 64, 82, 78, 79, 76, 72, 68, 63, 65, 62, 61, 56, 50, 44, 39, 25)
If anyone has suggestions for how this could be done in R, it would be highly appreciated.
We can use the sf package to create a polygon showing the cross-section and then calculate the area. Notice that to create a polygon, it is necessary to provide three more points as c(0, 0), c(500, 0), and c(0, 0) when creating the matrix m.
x_profile <- seq(0, 500, 25)
y_profile <- c(50, 73, 64, 59, 60, 64, 82, 78, 79, 76, 72,
68, 63, 65, 62, 61, 56, 50, 44, 39, 25)
library(sf)
# Create matrix with coordinates
m <- matrix(c(0, x_profile, 500, 0, 0, -y_profile, 0, 0),
byrow = FALSE, ncol = 2)
# Create a polygon
poly <- st_polygon(list(m))
# View the polygon
plot(poly)
# Calculate the area
st_area(poly)
31312.5
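As a cross-check that needs no packages, the same area can be computed with the trapezoidal rule, since the polygon simply joins neighbouring depth measurements with straight lines. This is only a sketch; the result is in the product of the x and depth units (cm^2 if the depths are also in cm, which the question does not state).

```r
x_profile <- seq(0, 500, 25)
y_profile <- c(50, 73, 64, 59, 60, 64, 82, 78, 79, 76, 72,
               68, 63, 65, 62, 61, 56, 50, 44, 39, 25)

# Trapezoidal rule: width of each strip times the mean of its two depths
widths      <- diff(x_profile)
mean_depths <- (head(y_profile, -1) + tail(y_profile, -1)) / 2
area        <- sum(widths * mean_depths)
area
```

This gives 31312.5, matching the st_area() result.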
I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
I have the data, and I know that the growth follows a Gompertz curve with formula:
[y = a*exp(-exp(-(x-x0)/b))], with a = 88.8, b = 11.7, and x0 = 15.1.
The data I have is
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165).
Any help would be appreciated, thank you.
Will edit when more information is given.
I am a little confused by your question. I have compiled what you have written into R. Please elaborate for me so that I can help you:
gompertz <- function(x, x0, a, b){
a*exp(-exp(-(x-x0)/b))
}
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165) # means of 30 samples of grape cluster weights?
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112) # ?
#??
gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)
gompertz(y, x0 = 15.1, a = 88.8, b = 11.7)
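If the goal is to simulate a data set with the same x values, one possible reading of the question is sketched below: evaluate the Gompertz curve at the observed x and add random noise. The Gaussian error model and the value of `sigma` are assumptions made purely for illustration; neither is given in the question.

```r
gompertz <- function(x, x0, a, b) {
  a * exp(-exp(-(x - x0) / b))
}

x <- c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)

set.seed(42)   # arbitrary seed, for reproducibility only
sigma <- 5     # hypothetical noise standard deviation (an assumption)

mu    <- gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)  # curve at the observed x
y_sim <- mu + rnorm(length(x), mean = 0, sd = sigma) # one simulated data set

data.frame(x = x, mu = round(mu, 1), y_sim = round(y_sim, 1))
```

Note that with a = 88.8 the curve levels off below 88.8, whereas the observed y values go up to 177, so the stated parameters may refer to a different scale or data set; that would be worth clarifying before simulating.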
I have constructed models in glmer and would like to predict these on a rasterStack representing the fixed effects in my model. My glmer model is in the form of:
m1<-glmer(Severity ~ x1 + x2 + x3 + (1 | Year) + (1 | Ecoregion), family=binomial( logit ))
As you can see, I have random effects which I don't have as spatial layers, for example 'Year'. The problem is therefore how to predict from a glmer model on a rasterStack when there are no raster layers for the random effects. If I use it out of the box without adding my random effects, I get an error:
m1.predict=predict(object=all.var, model=m1, type='response', progress="text", format="GTiff")
Error in predict.averaging(model, blockvals, ...) :
Your question is very brief, and does not indicate what, if any, trouble you have encountered. This seems to work 'out of the box', but perhaps not in your case. See ?raster::predict for options.
library(raster)
# example data. See ?raster::predict
logo <- brick(system.file("external/rlogo.grd", package="raster"))
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
xy <- rbind(cbind(1, p), cbind(0, a))
v <- data.frame(cbind(pa=xy[,1], extract(logo, xy[,2:3])))
v$Year <- sample(2000:2001, nrow(v), replace=TRUE)
library(lme4)
m <- lmer(pa ~ red + blue + (1 | Year), data=v)
# here adding Year as a constant, as it is not a variable (RasterLayer) in the RasterStack object
x <- predict(logo, m, const=(data.frame(Year=2000)))
If you don't have the random effects, just use re.form=~0 in your predict call to predict at the population level:
x <- predict(logo, m, re.form=~0)
This works without complaint for me with @RobertH's example (although I don't know if correctly).
I am attempting to use the following dataset to reproduce a histogram in R:
ages <- c(26, 31, 35, 37, 43, 43, 43, 44, 45, 47, 48, 48, 49, 50, 51, 51, 51, 51, 52, 54, 54, 54, 54,
55, 55, 55, 56, 57, 57, 57, 58, 58, 58, 58, 59, 59, 62, 62, 63, 64, 65, 65, 65, 66, 66, 67,
67, 72, 86)
I would like to get a histogram that looks as close as possible to this:
However, I am having three problems:
I am unable to get my frequency count on the y-axis to reach 18
I haven't been able to get the squiggly break symbol on the x-axis
My breaks don't seem to be properly setting to the vector I entered in my code
I read over ?hist and thought the first two issues could be accomplished by setting xlim and ylim, but that doesn't seem to be working.
I'm at a loss for the third issue since I thought it could be accomplished by including breaks = c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5).
Here's my code so far:
hist(ages, breaks = c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5),
freq=TRUE, col = "lightblue", xlim = c(25.5, 88.5), ylim = c(0,18),
xlab = "Age", ylab = "Frequency")
Followed by my corresponding histogram:
Any bump in the right direction is appreciated.
1. Reaching 18.
It appears that in your data you have at most 17 numbers in the category between 52.5 and 61.5, and that is even with a closed interval on both sides:
ages <- c(26, 31, 35, 37, 43, 43, 43, 44, 45, 47, 48, 48, 49, 50, 51, 51, 51,
51, 52, 54, 54, 54, 54, 55, 55, 55, 56, 57, 57, 57, 58, 58, 58, 58,
59, 59, 62, 62, 63, 64, 65, 65, 65, 66, 66, 67, 67, 72, 86
)
sum(ages >= 52.5 & ages <= 61.5)
[1] 17
So your histogram only reflects that.
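One way to check all bins at once is to ask hist() for the counts without drawing anything. A small sketch using the same breaks as in the question:

```r
ages <- c(26, 31, 35, 37, 43, 43, 43, 44, 45, 47, 48, 48, 49, 50, 51, 51, 51,
          51, 52, 54, 54, 54, 54, 55, 55, 55, 56, 57, 57, 57, 58, 58, 58, 58,
          59, 59, 62, 62, 63, 64, 65, 65, 65, 66, 66, 67, 67, 72, 86)
breaks <- c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5)

h <- hist(ages, breaks = breaks, plot = FALSE)
h$counts        # number of observations in each bin
max(h$counts)   # height of the tallest bar
```

The tallest bar has a count of 17, so ylim = c(0, 18) only extends the axis range; no bar can reach 18 with this data.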
2. Break symbol.
For that you might be interested in THIS SO ANSWER
3. Breaks.
If you read help(hist) you will see that breaks specify the points at which the groups are formed:
... * a vector giving the breakpoints between histogram cells
So your breaks work as intended. The problem you have is with showing the same numbers on the x-axis. Here ANOTHER SO ANSWER might help you.
Example
Here is how you could go about reproducing the plot.
library(plotrix) # for the break on x axis
library(shape) # for styled arrow heads
# manually select axis ticks and colors
xticks <- c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5)
yticks <- seq(2, 18, 2)
bgcolor <- "#F2ECE4" # color for the background
barcolor <- "#95CEEF" # color for the histogram bars
# top level parameters - background color and font type
par(bg=bgcolor, family="serif")
# establish a new plotting window with a coordinate system
plot.new()
plot.window(xlim=c(23, 90), ylim=c(0, 20), yaxs="i")
# add horizontal background lines
abline(h=yticks, col="darkgrey")
# add a histogram using our selected break points
hist(ages, breaks=xticks, freq=TRUE, col=barcolor, xaxt='n', yaxt='n', add=TRUE)
# L-shaped bounding box for the plot
box(bty="L")
# add x and y axis
axis(side=1, at=xticks)
axis(side=2, at=yticks, labels=NA, las=1, tcl=0.5) # for inward ticks
axis(side=2, at=yticks, las=1)
axis.break(1, 23, style="zigzag", bgcol=bgcolor, brw=0.05, pos=0)
# add labels
mtext("Age", 1, line=2.5, cex=1.2)
mtext("Frequency", 2, line=2.5, cex=1.2)
# add arrows
u <- par("usr")
Arrows(88, 0, u[2], 0, code = 2, xpd = TRUE, arr.length=0.25)
Arrows(u[1], 18, u[1], u[4], code = 2, xpd = TRUE, arr.length=0.25)
And the picture:
I am trying to forecast using R's arima from Java using Eclipse. I am using Rserve. I need to output the forecast and the intervals in array format. I can print out the forecast in the Eclipse console as an output. How do I retrieve the point forecast and the confidence interval as arrays? Here is my code.
RConnection c = null;
int[] kings = { 60, 43, 67, 50, 56, 42, 50, 65, 68, 43, 65, 34, 47, 34,
49, 41, 13, 35, 53, 56, 16, 43, 69, 59, 48, 59, 86, 55, 68, 51,
33, 49, 67, 77, 81, 67, 71, 81, 68, 70, 77, 56 };
try {
c = new RConnection();
System.out.println("INFO : The Server version is :-- " + c.getServerVersion());
c.eval("library(\"forecast\")");
c.assign("kings", kings);
c.eval("datats<-data;");
c.eval("kingsts<-ts(kings);");
c.eval("arima<-auto.arima(kingsts);");
c.eval("fcast<-forecast(arima, h=12);");
String f = c.eval("paste(capture.output(print(fcast)),collapse='\\n')").asString();
System.out.println(f);
//Codes online suggest I do the following but this does not work
REXP fs = re.eval("summary(fcast);");
double[] forecast = fs.asDoubleArray();
for(int i=0; i
I figured out how to retrieve the values. I converted the forecast into a data frame.
c.eval("ds=as.data.frame(fcast);");
//get the forecast values
c.eval("names(ds)[1]<-paste(\"actual\");");
REXP actual=c.eval("(ds$\"actual\")");
double[] forecast = actual.asDoubles();
for(int i=0; i<forecast.length; i++)
System.out.println("forecast values are: "+forecast[i]);
I still need to figure out how to attach the time stamp to the data frame.
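On the R side, it can help to reduce the forecast to plain numeric vectors before fetching them over Rserve, because numeric vectors map directly onto double[]. The sketch below is an illustration under several assumptions: it swaps in base R's arima() and predict() with a hand-picked order of c(0, 1, 1) instead of auto.arima() and forecast(), so that it runs without extra packages, and it builds an approximate 95% interval as the prediction plus or minus 1.96 standard errors. The names `point`, `lower`, `upper` and `when` are hypothetical.

```r
kings <- c(60, 43, 67, 50, 56, 42, 50, 65, 68, 43, 65, 34, 47, 34,
           49, 41, 13, 35, 53, 56, 16, 43, 69, 59, 48, 59, 86, 55, 68, 51,
           33, 49, 67, 77, 81, 67, 71, 81, 68, 70, 77, 56)
kingsts <- ts(kings)

fit <- arima(kingsts, order = c(0, 1, 1))  # hand-picked order, not auto.arima()
p   <- predict(fit, n.ahead = 12)          # list with $pred and $se

point <- as.numeric(p$pred)                # point forecasts
lower <- point - 1.96 * as.numeric(p$se)   # approximate 95% lower bound
upper <- point + 1.96 * as.numeric(p$se)   # approximate 95% upper bound
when  <- as.numeric(time(p$pred))          # time index of each forecast

cbind(when, point, lower, upper)
```

Each vector could then be fetched from Java in the same way as above, for example with c.eval("point").asDoubles(), and `when` gives one possible way to carry the time index alongside the values.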