Why can't I see boxes when I plot boxplots in R?

I am trying to plot a boxplot in R, but I can't see the boxes. Below is what I see:
And here is the code I am writing:
ggplot(data=FIR,
aes(x=as.factor(FIR$Revised.Status),y=as.numeric(FIR$Diff_Date_Requested)))+
geom_boxplot(fill="blue", alpha=0.2)+xlab("Status")
Revised.Status is the result of the request; it is either A or C.
Diff_Date_Requested is the difference between two dates.

From ?geom_boxplot:
The lower and upper hinges correspond to the first and third quartiles
(the 25th and 75th percentiles).
If your 25th percentile and 75th percentile are the same value, or are really close together, you won't see the box, just a single line.
It looks like you might be working with count data or other integer values. If more than half your y-values are 1 (i.e., whatever it is happens the next day, so difference = 1), that might explain it.
You might want to think about other ways of visualizing count data, like geom_bar().
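As a quick check of the explanation above, here is a minimal sketch using made-up integer data (the vector y is hypothetical, not the asker's FIR data). When most values are identical, both hinges coincide with the median, so the box collapses to a single line.

```r
# Hypothetical count-like data: 8 of the 10 values are 1
y <- c(rep(1, 8), 2, 5)

# fivenum() returns min, lower hinge, median, upper hinge, max
five <- fivenum(y)
lower_hinge <- five[2]
upper_hinge <- five[4]

# The height of the box is the distance between the hinges
box_height <- upper_hinge - lower_hinge
print(five)
print(box_height)

# Tabulating the values shows how concentrated the data are,
# which is what a bar chart of counts would display
print(table(y))
```

If box_height is zero (or tiny relative to the axis range) for your own data, the missing box is expected behaviour rather than a plotting bug.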

Related

How to plot minimum, maximum, and mean in r

I've been reading how to plot points in R, but can't find anything that matches my problem. My data is a matrix; the rows start with a column called 'site' and it is followed by three columns containing the parameters: minimum, mean, and maximum. There are four rows in the matrix, corresponding to 4 sites.
What I want is a graph that has the 4 sites on the x-axis and the three data points (min, mean, max) above each site, connected by a line. The mean would be represented by a circle, while the min and max by a cross bar. Each of the means would be connected by a line. My output would look like a boxplot without the boxes and with a line connecting the means.
Can anyone help me? It seems like a simple problem but I'm stumped.
Define a random matrix:
set.seed(1)
n_sites <- 4
myMatrix <- cbind(t(replicate(n_sites,sort(rnorm(3)))),1:n_sites)
dimnames(myMatrix) <- list(paste("Site",1:n_sites),c("Min","Mean","Max","n"))
Plot:
plot(c(1,n_sites),range(myMatrix),type="n",xlab="",ylab="",xaxt="n",las=1)
axis(1,1:n_sites,rownames(myMatrix))
arrows(x0=1:n_sites,y0=myMatrix[,"Min"],x1=1:n_sites,y1=myMatrix[,"Max"],angle=90,code=3,length=0.1)
points(1:n_sites,myMatrix[,"Mean"],bg="white",pch=21,type="o")
text(1:n_sites,myMatrix[,"Max"],myMatrix[,"n"],pos=3)
I like using arrows() in cases like this.

Interpretation of a graph created by the R package seas

I am relatively new to RStudio and R in general, and I am not even sure if this is the right place to ask this question. I was instructed to draw a graph showing seasonality using daily rainfall over a number of years. I need help more with interpreting the graph than with plotting it.
There is an example already in R using mscdata that I was able to replicate using my own data; the code for the example is below. Any help with what this graph means or explains will be greatly appreciated. Thank you.
install.packages("seas")
library(seas)
data(mscdata)
dat <- mksub(mscdata, id=1108447)
dat.ss <- seas.sum(dat, width="mon")
x<-mscdata
# Structure in R
str(dat.ss)
tail(mscdata)
# Annual data
dat.ss$ann
# Demonstrate how to slice through a cubic array
dat.ss$seas["1990",,]
dat.ss$seas[,2,] # or "Feb", if using English locale
dat.ss$seas[,,"precip"]
# Simple calculation on an array
(monthly.mean <- apply(dat.ss$seas[,,"precip"], 2, mean,na.rm=TRUE))
barplot(monthly.mean, ylab="Mean monthly total (mm/month)",
        main="Un-normalized mean precipitation in Vancouver, BC")
text(6.5, 150, paste("Un-normalized rates given 'per month' should be",
                     "avoided since ~3-9% error is introduced",
                     "to the analysis between months", sep="\n"))
# Normalized precip
norm.monthly <- dat.ss$seas[,,"precip"] / dat.ss$days
norm.monthly.mean <- apply(norm.monthly, 2, mean,na.rm=TRUE)
print(round(norm.monthly, 2))
print(round(norm.monthly.mean, 2))
barplot(norm.monthly.mean,
        ylab="Normalized mean monthly total (mm/day)",
        main="Normalized mean precipitation in Vancouver, BC")
# Better graphics of data
dat.ss <- seas.sum(dat, width=11)
image(dat.ss)
This code gives a graph showing sample quartiles and annual rainfall, but I don't really know what it means. Any help whatsoever will be appreciated.
The graph produced by the seas package is shown below:
[Plot]
I'll start with the top left graph:
You've probably guessed that each row is a year (as shown by the Y-axis) while day groups/months of the year are X-axis. The color of each box of the heatmap is proportionally darker according to the mm's worth of rain in that day group, with the scale being displayed on the far right. I assume the red X's mean missing values.
Top right is like a barplot with the sum of rainfall each year (row), just continuously plotted. The red bar should be the average precipitation overall (not sure about the orange one).
Bottom left is a bit more tricky. Think of it as if you reordered the rows in each column to have the heaviest rainfall of the day group at the top (forgetting about the year info here). The Y-axis shows the quantiles. The quantiles' respective values change for each day group, so the lines you see on top of the plot indicate key rainfall values in mm (4,6,8,10,12). Indeed, if you look at the 2mm line (lowest one), you'll see that in January, about 20% of rainfalls (across all years) are below this threshold, while at the end of July, over 80% are below 2mm (expect less rainfall in the summer).
Lastly, bottom right is similar to the one above it. It's the sum of all rows, referring to the quantiles rather than years this time, resulting in the staircase pattern.
You'll notice that since the scale of the plot is the same as the one showing the average per year, the top of the staircase is outside of the plot...
Hope I made that clear enough.

R - TablePlot() - Clarifications

I'm just trying to understand how to read table plots. I don't understand what the dividing line in a numerical column/variable represents. For example, the dividing black line in P1/2/3/4/5 here:
https://steemitimages.com/DQmeEJ8RyPkdRhdqX6CwNsUTzXfGWt36RwyFrixt6NNbPTw/tabplot.PNG
Also, I understand the Y Axis represents proportions (0% to 100%). Does the X axis for each variable represent proportions too or is that just regular values for the data?
Thanks!
It's hard to be sure, but those look like boxplots (they're not; see below). A classic boxplot has a central marker for the median, and the box ends are at points called hinges, which are set by the first and third quartiles. You can read up about it at ?boxplot.stats. It's also possible that someone chose a different statistic to form the x-bounds for those boxes, but we can be certain that they are not proportions since some of them are negative.
The proportions in the "y-axis" are for the various regions. You will need to find documentation to determine the sequence of the regions. I suspect they are viticulture regions in Italy.
Here is a copy of part of the help page ?tableplot:
numMode
character value that determines how numeric values are plotted. The value consists of the following building blocks, which are concatenated with the "-" symbol. The default value is "mb-sdb-sdl". Prior to version 1.2, "MB-ML" was the default value.
sdb: sd bars between mean-sd to mean+sd are shown
sdl: sd lines at mean-sd and mean+sd are shown
mb: mean bars are shown
MB: mean bars are shown, where the color of the bar indicate completeness where positive mean values are blue and negative orange
ml: mean lines are shown
ML: mean lines are shown, where positive mean values are blue and negative orange
mean2: mean values are shown
The default value for numMode has evidently changed, since most of the examples in the documentation show only the mean value.

Method of Outlier Removal for Boxplots

In R, what method is used in boxplot() to remove outliers? In other words, what determines if a given value is an outlier?
Please note, this question is not asking how to remove outliers.
EDIT: Why is this question being downvoted? Please provide comments. The method for outlier removal is not available in any documentation I have come across.
tl;dr outliers are points that are beyond approximately twice the interquartile range away from the median (in a symmetric case). More precisely, they are points beyond a cutoff equal to the 'hinges' (approx. 1st and 3rd quartiles) +/- 1.5 times the interquartile range.
R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is, implicitly, contained in the description of the range parameter:
range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)’ [... see ?boxplot.stats for slightly more detail ...]
The reason for all the complexity/"approximately equal to the quartile" language is that the developers of the boxplot wanted to make sure that the hinges and whiskers were always drawn at points representing actual observations in the data set (whereas the quartiles can be located between observed points, e.g. in the case of data sets with odd numbers of observations).
Example:
set.seed(101)
z <- rnorm(100000)
boxplot(z)
## theoretical quartiles of the standard normal stand in for the hinges
hinges <- qnorm(c(0.25,0.75))
iqr <- diff(hinges) ## named iqr to avoid masking stats::IQR()
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*iqr,col=2) ## outlier cutoffs
## in this case hinges = +/- IQR/2, so whiskers ~ +/- 2*IQR
abline(h=c(-1,1)*iqr*2,lty=2,col="purple")
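The cutoff rule described above can also be checked numerically rather than visually. The following is a minimal sketch, assuming the default range/coef of 1.5; the variable names (h, box_length, lower_cut, upper_cut, manual_out) are my own and not part of any R API. It computes the hinges with fivenum(), builds the cutoffs by hand, and compares the flagged points with what boxplot.stats() reports.

```r
set.seed(101)
z <- rnorm(1000)

# Hinges (approximately the 1st and 3rd quartiles) from the five-number summary
h <- fivenum(z)[c(2, 4)]
box_length <- diff(h)

# Cutoffs: 1.5 box lengths beyond each hinge (the default range/coef of 1.5)
lower_cut <- h[1] - 1.5 * box_length
upper_cut <- h[2] + 1.5 * box_length

# Points beyond the cutoffs are the ones drawn individually by boxplot()
manual_out <- z[z < lower_cut | z > upper_cut]

# Compare with the points that boxplot.stats() flags
same <- identical(manual_out, boxplot.stats(z)$out)
print(same)
```

Note that fivenum() is used instead of quantile() because the hinges are only approximately equal to the quartiles.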

What method does outline=FALSE use to determine outliers? [duplicate]

In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the median, typically approximately 4 times the difference between the median and the 0.25 (lower) or 0.75 (upper) quantile. To get there, see ?boxplot.stats: first, look at the definition of out in the output
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which are based on the coef parameter, which is 1.5 by default:
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that are farther from the median than 4 times the distance between the median and the relevant quartile: the box end lies one such distance from the median, and the whisker can reach a further 1.5 box lengths (3 such distances) beyond it, assuming a roughly symmetric box. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
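To put a rough number on the "lots of outliers in big data sets" point, here is a small sketch that assumes the data are exactly normally distributed (an idealisation chosen purely for illustration) and uses the default cutoff of 1.5 box lengths. The names q, box_length, p_out and expected_flagged are my own.

```r
# Theoretical quartiles and box length for a standard normal distribution
q <- qnorm(c(0.25, 0.75))
box_length <- diff(q)

# Probability of falling beyond either cutoff (1.5 box lengths past the quartiles)
p_out <- pnorm(q[1] - 1.5 * box_length) + (1 - pnorm(q[2] + 1.5 * box_length))
print(p_out)

# The expected number of flagged points grows in proportion to the sample size
n <- c(100, 10000, 1000000)
expected_flagged <- n * p_out
print(expected_flagged)
```

Under this assumption roughly 0.7% of points fall outside the whiskers, so a sample of a million well-behaved observations would still show thousands of individually plotted points.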
For boxplot, outliers are the points that lie above or below the "whiskers". By default, the whiskers extend to the most extreme data points that are no more than the range argument times the interquartile range away from the box. The default range value is 1.5, but you can change it, which also changes which points are listed as outliers.
You can also see that with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector :
v <- c(runif(10), -0.5, -1)
boxplot(v)
By default, only the -1 value is considered as an outlier. You can see it with boxplot.stats :
boxplot.stats(v)$out
[1] -1
But if you change the range argument (or the coef one for boxplot.stats), then -1 is no longer considered an outlier:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
This is admittedly not immediately evident from boxplot(). Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
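To make the role of range concrete, including the special case of zero mentioned in the help text, here is a small sketch with a fixed, made-up vector (v is hypothetical). It uses boxplot.stats(), whose coef argument plays the same role as range in boxplot().

```r
# Hypothetical data: ten evenly spaced values plus two low values
v <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, -0.5, -1)

# Default coef = 1.5: points beyond 1.5 box lengths from the box are flagged
out_default <- boxplot.stats(v)$out

# A larger coef moves the cutoffs outward, so fewer points are flagged
out_wide <- boxplot.stats(v, coef = 2)$out

# coef = 0: whiskers go to the data extremes and nothing is flagged
s0 <- boxplot.stats(v, coef = 0)
whisker_ends <- s0$stats[c(1, 5)]

print(out_default)
print(out_wide)
print(whisker_ends)
```

The same values can be passed as range= to boxplot() to see the effect in the plot itself.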
I'll be the first to agree that this definition is unintuitive. Sadly enough, it is established by now.
