Adding Different Percentiles in boxplots in R - r

I am fairly new to R and recently used it to make some boxplots. I also added the mean and standard deviation to my boxplot. I was wondering if I could add some kind of tick mark or circle at different percentiles as well. Let's say I want to mark the 85th and 90th percentiles in each HOUR boxplot; is there a way to do this? My data consist of a year's worth of loads in MW for each hour, and my output consists of 24 boxplots (one per hour) for each month. I am doing one month at a time because I am not sure if there is a way to run all 96 (each month, weekday/weekend, for 4 different zones) boxplots at once. Thanks in advance for any help.
JANWD <-read.csv("C:\\My Directory\\MWBox2.csv")
JANWD.df<-data.frame(JANWD)
JANWD.sub <-subset(JANWD.df, MONTH < 2 & weekend == "NO")
KeepCols <-c("Hour" , "Houston_Load")
HWD <- JANWD.sub[ ,KeepCols]
sd <-tapply(HWD$Houston_Load, HWD$Hour, sd)
means <-tapply(HWD$Houston_Load, HWD$Hour, mean)
boxplot(Houston_Load ~ Hour, data=HWD, xlab="WEEKDAY HOURS", ylab="MW Differnce", ylim= c(-10, 20), smooth=TRUE ,col ="bisque", range=0)
points(sd, pch = 22, col= "blue")
points(means, pch=23, col ="red")
#Output of the subset of data used to run boxplot for month january in Houston
str(HWD)
'data.frame': 504 obs. of 2 variables:
$ Hour : int 1 2 3 4 5 6 7 8 9 10 ...
$ Houston_Load: num 1.922 2.747 -2.389 0.515 1.922 ...
#OUTPUT of the original data
str(JANWD)
'data.frame': 8783 obs. of 9 variables:
$ Date : Factor w/ 366 levels "1/1/2012","1/10/2012",..: 306 306 306 306 306 306 306 306 306 306 ...
$ Hour : int 1 2 3 4 5 6 7 8 9 10 ...
$ MONTH : int 8 8 8 8 8 8 8 8 8 8 ...
$ weekend : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
$ TOTAL_LOAD : num 0.607 5.111 6.252 7.607 0.607 ...
$ Houston_Load: num -2.389 0.515 1.922 2.747 -2.389 ...
$ North_Load : num 2.95 4.14 3.55 3.91 2.95 ...
$ South_Load : num -0.108 0.267 0.54 0.638 -0.108 ...
$ West_Load : num 0.154 0.193 0.236 0.311 0.154 ...

Here is one way, using quantile() to compute the relevant percentiles for you. I add the marks using rug().
set.seed(1)
X <- rnorm(200)
boxplot(X, yaxt = "n")
## compute the required quantiles
qntl <- quantile(X, probs = c(0.85, 0.90))
## add them as a rug plot to the left hand side
rug(qntl, side = 2, col = "blue", lwd = 2)
## add the box and axes
axis(2)
box()
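Since the question also mentions circles, here is a minimal sketch of the same idea with points() instead of rug(). It assumes a single boxplot, which base graphics draws at x = 1; the colours and plotting symbol are arbitrary choices, not something from the question.

```r
set.seed(1)
X <- rnorm(200)
## compute the required quantiles
qntl <- quantile(X, probs = c(0.85, 0.90))
boxplot(X)
## a single boxplot sits at x = 1, so both circles are drawn there
points(rep(1, length(qntl)), qntl, pch = 21, bg = c("red", "blue"))
```

With several boxes in one plot, the x positions would be 1, 2, 3, ... in the order the boxes are drawn.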
Update: In response to the OP providing str() output, here is an example similar to the data that the OP has to hand:
set.seed(1) ## make reproducible
HWD <- data.frame(Hour = rep(0:23, 10),
Houston_Load = rnorm(24*10))
I presume you want ticks at the 85th and 90th percentiles for each Hour? If so, we need to split the data by Hour and compute the quantiles via quantile(), as I showed earlier:
quants <- sapply(split(HWD$Houston_Load, list(HWD$Hour)),
quantile, probs = c(0.85, 0.9))
which gives:
R> quants <- sapply(split(HWD$Houston_Load, list(HWD$Hour)),
+ quantile, probs = c(0.85, 0.9))
R> quants
0 1 2 3 4 5 6
85% 0.3576510 0.8633506 1.581443 0.2264709 0.4164411 0.2864026 1.053742
90% 0.6116363 0.9273008 2.109248 0.4218297 0.5554147 0.4474140 1.366114
7 8 9 10 11 12 13 14
85% 0.5352211 0.5175485 1.790593 1.394988 0.7280584 0.8578999 1.437778 1.087101
90% 0.8625322 0.5969672 1.830352 1.519262 0.9399476 1.1401877 1.763725 1.102516
15 16 17 18 19 20 21
85% 0.6855288 0.4874499 0.5493679 0.9754414 1.095362 0.7936225 1.824002
90% 0.8737872 0.6121487 0.6078405 1.0990935 1.233637 0.9431199 2.175961
22 23
85% 1.058648 0.6950166
90% 1.145783 0.8436541
Now we can draw marks at the x locations of the boxplots
boxplot(Houston_Load ~ Hour, data = HWD, axes = FALSE)
xlocs <- 1:24 ## where to draw marks
tickl <- 0.15 ## length of marks used
for(i in seq_len(ncol(quants))) {
segments(x0 = rep(xlocs[i] - tickl, 2), y0 = quants[, i],
x1 = rep(xlocs[i] + tickl, 2), y1 = quants[, i],
col = c("red", "blue"), lwd = 2)
}
title(xlab = "Hour", ylab = "Houston Load")
axis(1, at = xlocs, labels = xlocs - 1)
axis(2)
box()
legend("bottomleft", legend = paste(c("0.85", "0.90"), "quantile"),
bty = "n", lty = "solid", lwd = 2, col = c("red", "blue"))
The resulting figure should look like this:
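On the second part of the question (running every month, weekday/weekend and zone combination in one go), one option is to loop over the combinations and write each plot to its own page of a PDF. This is only a sketch under assumptions: the data frame dat below is synthetic, with column names copied from the str() output in the question, only two of the four zones are included, and the output file name is made up.

```r
set.seed(1)
## synthetic stand-in for the real data; column names follow the question
n <- 24 * 60
dat <- data.frame(Hour = rep(1:24, times = 60),
                  MONTH = sample(1:12, n, replace = TRUE),
                  weekend = sample(c("NO", "YES"), n, replace = TRUE),
                  Houston_Load = rnorm(n),
                  North_Load = rnorm(n))
zones <- c("Houston_Load", "North_Load")

out_file <- file.path(tempdir(), "boxplots_all.pdf")  ## hypothetical file name
pdf(out_file)                                         ## one page per plot
for (m in sort(unique(dat$MONTH))) {
  for (w in c("NO", "YES")) {
    sub <- dat[dat$MONTH == m & dat$weekend == w, ]
    if (nrow(sub) == 0) next                          ## skip empty combinations
    for (z in zones) {
      boxplot(sub[[z]] ~ sub$Hour, range = 0, col = "bisque",
              xlab = "Hour", ylab = z,
              main = paste("Month", m, "- weekend:", w))
      ## per-hour percentiles; columns come out in the same order as the boxes
      q <- sapply(split(sub[[z]], sub$Hour), quantile, probs = c(0.85, 0.90))
      xlocs <- seq_len(ncol(q))
      points(xlocs, q["85%", ], pch = 21, bg = "red")
      points(xlocs, q["90%", ], pch = 21, bg = "blue")
    }
  }
}
dev.off()
```

Both split() and the formula interface drop hours that have no observations in a subset, so the marker positions stay aligned with the boxes.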

Related

Select observations near to their neighbor in PCA cloud

I have a dataset ind with two populations, fr2100 and nr, where each individual has a unique number. Each individual has coordinates, a Dim.1 and a Dim.2 value, as you can see here:
> ind <- get_pca_ind(res_acp)
> ind
Principal Component Analysis Results for individuals
===================================================
Name Description
1 "$coord" "Coordinates for the individuals"
2 "$cos2" "Cos2 for the individuals"
3 "$contrib" "contributions of the individuals"
# isolate the population 'fr2100'
> fr2100 <- ind$coord[substr(rownames(ind$coord), 1, 7) == 'fr2100_', ]
> str(fr2100)
'data.frame': 6873 obs. of 3 variables:
$ rowname: chr "fr2100_72" "fr2100_73" "fr2100_74" "fr2100_75" ...
$ Dim.1 : num 1.37 1.3 1.25 1.25 1.18 ...
$ Dim.2 : num -1.249 -1.028 -0.835 -0.624 -0.483 ...
# isolate the population 'nr'
> nr <- ind$coord[substr(rownames(ind$coord), 1, 3) == 'nr_', ]
> str(nr)
'data.frame': 4897 obs. of 3 variables:
$ rowname: chr "nr_174" "nr_175" "nr_176" "nr_177" ...
$ Dim.1 : num -3.74 -3.44 -3.26 -2.97 -3.88 ...
$ Dim.2 : num 1.26 1.55 1.7 1.91 1.3 ...
My question: among the 6873 individuals of fr2100, how can I select only those whose Dim.1 AND Dim.2 values lie within a distance of about 0.01 of one of the 4897 nr individuals, represented in this cloud of points:
In other words, each fr2100 individual that falls within the perimeter (radius 0.01) of an nr individual, as theoretically represented here:
I'm interested in any answers. I can provide more information if needed. Thank you in advance.
I guess distance_semi_join() from the fuzzyjoin package would be a rather straightforward and compact way to filter by Euclidean distance. Other variants like distance_left_join() are also worth considering, as those can add an optional distance variable to the resulting data frame.
library(fuzzyjoin)
library(ggplot2)
# example datasets
set.seed(1)
nr <- data.frame(rowname = paste0("nr_", 1:100), Dim.1 = rnorm(100, -0.05, 0.03), Dim.2 = rnorm(100, 0, 0.02))
fr <- data.frame(rowname = paste0("fr_", 1:100), Dim.1 = rnorm(100, 0.05, 0.03), Dim.2 = rnorm(100, 0, 0.02))
# fr points within distance of closest nr point:
fr_in_dist <- distance_semi_join(fr, nr,
by = c("Dim.1","Dim.2"),
max_dist=0.01)
fr_in_dist
#> rowname Dim.1 Dim.2
#> 5 fr_5 -0.018557066 3.308291e-02
#> 14 fr_14 0.008893764 -1.311564e-02
#> 18 fr_18 0.012401307 -2.420202e-03
#> 25 fr_25 0.015302829 9.640590e-03
#> 28 fr_28 0.001834598 3.409789e-03
#> 32 fr_32 -0.036667620 -3.138164e-02
#> 38 fr_38 0.014406241 8.797409e-05
#> 46 fr_46 -0.010004948 -2.817701e-02
#> 57 fr_57 -0.022092886 -2.347154e-02
#> 68 fr_68 0.014326601 1.135904e-02
#> 77 fr_77 -0.018673719 2.577108e-03
#> 79 fr_79 0.010512645 -3.278219e-03
#> 84 fr_84 0.028963050 3.286837e-03
#> 86 fr_86 0.019967835 -1.130428e-03
#> 94 fr_94 0.007212280 6.132097e-03
ggplot() +
geom_point(data = nr, aes(x = Dim.1, y = Dim.2, color = "nr"))+
geom_point(data = fr, aes(x = Dim.1, y = Dim.2, color = "fr"))+
geom_point(data = fr_in_dist, aes(x = Dim.1, y = Dim.2), shape = 1, size = 5 )+
coord_fixed() +
theme_bw()
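As a cross-check that needs no extra packages, the same filter can be sketched in base R by building the full fr-by-nr distance matrix and keeping the fr rows whose nearest nr point is within the threshold. The object names d2, min_dist and fr_in_dist_base are made up for this illustration. Note that with the real data (6873 x 4897 points) each intermediate matrix holds about 33.7 million numbers, so this trades memory for simplicity.

```r
set.seed(1)
nr <- data.frame(rowname = paste0("nr_", 1:100), Dim.1 = rnorm(100, -0.05, 0.03), Dim.2 = rnorm(100, 0, 0.02))
fr <- data.frame(rowname = paste0("fr_", 1:100), Dim.1 = rnorm(100, 0.05, 0.03), Dim.2 = rnorm(100, 0, 0.02))

# squared Euclidean distances: rows are fr points, columns are nr points
d2 <- outer(fr$Dim.1, nr$Dim.1, "-")^2 + outer(fr$Dim.2, nr$Dim.2, "-")^2
# distance from each fr point to its closest nr point
min_dist <- sqrt(apply(d2, 1, min))
# keep fr points that have at least one nr point within 0.01
fr_in_dist_base <- fr[min_dist <= 0.01, ]
```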
The original answer was about a single reference point vs. a point cloud; in this case dist() from base R is also quite straightforward:
library(ggplot2)
# sample data, add point fr2100_xx that would fall outside of the perimeter
df <- read.csv(text = "rowname, Dim.1, Dim.2
fr2100_72, 0.003810163, 0.006935450
fr2100_73, 0.003433946, 0.004698691
fr2100_74, 0.003168248, 0.003097222
fr2100_xx, 0.015, 0.015")
# nr and threshold distance
nr <- c(0.0035, 0.005)
thr_dist <- 0.01
# insert nr point to first position to use it in distance matrix calculation
dist_m <- rbind(c(0.0035, 0.005),df[,c("Dim.1", "Dim.2")]) |> dist() |> as.matrix()
# distances:
as.dist(dist_m)
#> 1 2 3 4
#> 2 0.0019601448
#> 3 0.0003084643 0.0022681777
#> 4 0.0019314822 0.0038915356 0.0016233602
#> 5 0.0152397507 0.0137930932 0.0154884012 0.0167829223
# extract first column, distances from point "nr" ([1,1] = 0)
df$dist <- dist_m[-1,1]
# flag points that fall within the perimeter
df$in_dist <- df$dist <= thr_dist
df
#> rowname Dim.1 Dim.2 dist in_dist
#> 1 fr2100_72 0.003810163 0.006935450 0.0019601448 TRUE
#> 2 fr2100_73 0.003433946 0.004698691 0.0003084643 TRUE
#> 3 fr2100_74 0.003168248 0.003097222 0.0019314822 TRUE
#> 4 fr2100_xx 0.015000000 0.015000000 0.0152397507 FALSE
Viz - https://i.imgur.com/jiqHmXn.png

Formula notation for scatterplot producing unexpected results

I am working on a map, where the color of each point is proportional to one response variable, and the size of the point is proportional to another. I've noticed that when I try to plot the points using formula notation things go haywire, while default notation performs as expected. I have used formula notation to plot maps many times before, and thought that the notations were nearly interchangeable. Why would these produce different results? I have read through the plot.formula and plot.default documentation and haven't been able to figure it out. Based on this I am wondering if it has to do with the columns of dat being coerced to factors, but I'm not sure why that would be happening. Any ideas?
Consider the following example data frame, dat:
latitude <- c(runif(10, min = 45, max = 48))
latitude[9] <- NA
longitude <- c(runif(10, min = -124.5, max = -122.5))
longitude[9] <- NA
color <- c("#00FFCCCC", "#99FF00CC", "#FF0000CC", "#3300FFCC", "#00FFCCCC",
"#00FFCCCC", "#3300FFCC", "#00FFCCCC", NA, "#3300FFCC")
size <- c(4.916667, 5.750000, 7.000000, 2.000000, 5.750000,
4.500000, 2.000000, 4.500000, NA, 2.000000)
dat <- as.data.frame(cbind(longitude, latitude, color, size))
Plotting according to formula notation
plot(latitude ~ longitude, data = dat, type = "p", pch = 21, col = 1, bg = color, cex = size)
produces
this mess and the following error: graphical parameter "type" is obsolete.
Plotting according to the default notation
plot(longitude, latitude, type = "p", pch = 21, col = 1, bg = color, cex = size)
works as expected, though with the same error.
There are a couple of problems with this. First is that your use of cbind is turning this into a matrix, albeit temporarily, which is converting your numbers to character. See:
dat <- as.data.frame(cbind(longitude, latitude, color, size))
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: Factor w/ 9 levels "-122.855375511572",..: 6 8 9 1 4 3 2 7 NA 5
# $ latitude : Factor w/ 9 levels "45.5418886151165",..: 6 2 4 1 3 7 5 9 NA 8
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : Factor w/ 5 levels "2","4.5","4.916667",..: 3 4 5 1 4 2 1 2 NA 1
If instead you just use data.frame, you'll get:
dat <- data.frame(longitude, latitude, color, size)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
But now the colors are all dorked. Okay, the problem is likely that your $color is a factor, which is being interpreted internally as integers. Try stringsAsFactors = FALSE:
dat <- data.frame(longitude, latitude, color, size, stringsAsFactors=FALSE)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : chr "#00FFCCCC" "#99FF00CC" "#FF0000CC" "#3300FFCC" ...
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
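To see the coercion in isolation, here is a small sketch with made-up vectors (num and txt are not from the question). A matrix can hold only one type, so cbind() on a numeric and a character vector yields a character matrix, and the numbers never come back as numeric. Note also that from R 4.0.0 onwards stringsAsFactors defaults to FALSE, so there the last step above is already the default behaviour.

```r
num <- c(1.5, 2.5, 3.5)
txt <- c("a", "b", "c")

m <- cbind(num, txt)
typeof(m)                # "character": the numbers were converted to strings

d_bad <- as.data.frame(m)
is.numeric(d_bad$num)    # FALSE (factor in R < 4.0.0, character afterwards)

d_good <- data.frame(num, txt, stringsAsFactors = FALSE)
is.numeric(d_good$num)   # TRUE
```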

ggplot place facet between two rows of facets

I have 9 plots with 3 time series in each plot, one of these plots contains only one curve and it's the reference plot which I would like to place in between the two rows that contain the other 8 plots. Is there an easy way to do so?
I use facet_wrap(~density,nrow=2) but I get one row with 5 and another with 4 plots. I am sure other people had this problem, is there an easy way around to organize the position of this reference plot, or do I have to create two separate plots and overlay them? Otherwise I might have to move this reference plot in all the other plots but it seems redundant information.
This is my current result, but as you can see it's not very well laid out.
The graphic you are looking for can be generated with grid.arrange from the
gridExtra package. Here is an example using the storms data set from the
dplyr package.
library(ggplot2)
library(gridExtra)
library(dplyr)
data(storms, package = 'dplyr')
str(storms)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10010 obs. of 13 variables:
## $ name : chr "Amy" "Amy" "Amy" "Amy" ...
## $ year : num 1975 1975 1975 1975 1975 ...
## $ month : num 6 6 6 6 6 6 6 6 6 6 ...
## $ day : int 27 27 27 27 28 28 28 28 29 29 ...
## $ hour : num 0 6 12 18 0 6 12 18 0 6 ...
## $ lat : num 27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
## $ long : num -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
## $ status : chr "tropical depression" "tropical depression" "tropical depression" "tropical depression" ...
## $ category : Ord.factor w/ 7 levels "-1"<"0"<"1"<"2"<..: 1 1 1 1 1 1 1 1 2 2 ...
## $ wind : int 25 25 25 25 25 25 25 30 35 40 ...
## $ pressure : int 1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
## $ ts_diameter: num NA NA NA NA NA NA NA NA NA NA ...
## $ hu_diameter: num NA NA NA NA NA NA NA NA NA NA ...
Let's create two graphics. The first graphic will be only for category == -1
storms (this would be the control group in your question). The second
graphic will be a faceted graphic for the category > -1 storms.
First, we'll build a generic ggplot object for the graphics.
graphic <-
ggplot() +
aes(x = long, y = lat, color = category) +
geom_point() +
facet_wrap( ~ category) +
scale_color_hue(breaks = levels(storms$category),
labels = levels(storms$category),
drop = FALSE)
Next we build the two graphics as needed.
g1 <- graphic %+% dplyr::filter(storms, category == -1) + theme(legend.position = "none")
g2 <- graphic %+% dplyr::filter(storms, category != -1)
gridExtra::grid.arrange can take a layout matrix where the numbers 1 and 2
denote the first and second graphics passed to the function. (This works for
a lot more than just two graphics, by the way.) By repeating the values of 1
and 2 in the matrix we can control the relative size of the two graphics in
the graphics device.
gridExtra::grid.arrange(g1, g2,
layout_matrix =
matrix(c(1, 1, 1, 2, 2, 2, 2, 2,
1, 1, 1, 2, 2, 2, 2, 2,
1, 1, 1, 2, 2, 2, 2, 2),
byrow = TRUE, nrow = 3)
)
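The same layout-matrix idea can be sketched for the layout asked about in the question: eight panels in two rows with the reference panel centred between them. Everything here is an assumption for illustration: the data are random, make_plot is a made-up helper, and the sketch relies on NA entries in layout_matrix leaving cells empty.

```r
library(ggplot2)
library(gridExtra)

set.seed(1)
# made-up helper: one small line plot per panel
make_plot <- function(label) {
  d <- data.frame(x = 1:10, y = rnorm(10))
  ggplot(d, aes(x, y)) + geom_line() + ggtitle(label)
}
plots <- lapply(paste("density", 1:8), make_plot)
ref <- make_plot("reference")

# rows 1 and 3 hold the eight panels, two columns each;
# the reference (grob 9) sits in the middle of row 2, NA cells stay empty
lay <- rbind(c(1, 1, 2, 2, 3, 3, 4, 4),
             c(NA, NA, NA, 9, 9, NA, NA, NA),
             c(5, 5, 6, 6, 7, 7, 8, 8))
grid.arrange(grobs = c(plots, list(ref)), layout_matrix = lay)
```

Widening the block of 9s in the middle row makes the reference panel larger relative to the others.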
If I understand the question correctly, you could reformat your data with appropriate facetting variables to introduce a new row of reference panels:
library(ggplot2)
d <- data.frame(x=rep(1:10, 8), y = rnorm(80),
f=gl(8,10, ordered = TRUE))
d$f1 <- factor(d$f <= 4, labels=c(1,3))
d$f2 <- as.numeric(d$f) %% 4
d2 <- data.frame(x=1:10, y=0, f1 = 2)
ggplot(d, aes(x,y)) +
geom_point(aes(colour=f)) +
geom_point(data=d2, colour="black") +
facet_grid(f1~f2)

Error in stat_summary(fun.y) when plotting outliers in a modified ggplot-boxplot

I want to plot boxplots showing the 95 percentile instead of the IQR, including outliers as defined by exceeding the 95% criterion.
This code is working fine, and based on several answers found here and on the web:
f1 <- function(x) {
subset(x, x < quantile(x, probs=0.025)) # only for low outliers
}
f2 <- function(x) {
r <- quantile(x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
r
}
d <- data.frame(x=gl(2,50), y=rnorm(100))
library(ggplot2)
p0 <- ggplot(d, aes(x,y)) +
stat_summary(fun.data = f2, geom="boxplot") + coord_flip()
p1 <- p0 + stat_summary(fun.y = f1, geom="point")
The structure of d is:
'data.frame': 100 obs. of 2 variables:
$ x: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ y: num 2.275 0.659 -0.821 -0.129 1.997 ...
Now, coming to my real data, which is structured essentially the same:
str(test)
'data.frame': 11830917 obs. of 2 variables:
$ x: Ord.factor w/ 34 levels "SG26"<"SG22"<..: 18 18 18 18 18 18 18 18 18 18 ...
$ y: num 84.6 84.1 93.3 84 93.2 94.3 83.3 92.5 94.5 98.8 ...
Now, if I apply the same plot command, I get:
p0 <- ggplot(test, aes(x,y)) + stat_summary(fun.data = f2, geom="boxplot") + coord_flip()
p1 <- p0 + stat_summary(fun.y = f1, geom="point")
p1
Warning message:
Computation failed in `stat_summary()`:
Argumente implizieren unterschiedliche Anzahl Zeilen: 1, 0
The final line is the German version of "arguments imply differing number of rows: 1, 0". p0 is produced just fine.
What could be the difference between the two datasets?
The problem, as identified by @Heroka and @bdemarest, arose from one factor level having only one value.
My workaround is to skip those factors:
f1 <- function(x) {
if (length(x) > 7) {
return(subset(x, x < quantile(x, probs=0.025))) # only for low outliers
} else {
return(NA)
}
}
For unknown reasons, the problem persisted until there were at least 7 values per factor level.
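A possible variation, offered as a sketch rather than a diagnosis: instead of guarding on the number of input values, guard on whether the filter actually returned anything. With a single value, or with ties at the minimum, nothing lies strictly below the 2.5% quantile, so the original f1 returns a zero-length vector. That may be why a fixed cut-off on length(x) had to be raised before the error went away, but this is a guess.

```r
f1 <- function(x) {
  out <- x[x < quantile(x, probs = 0.025)]   # only low outliers
  # return NA instead of a zero-length vector when nothing qualifies
  if (length(out) == 0) NA_real_ else out
}

f1(5)                    # single value: nothing lies below it, so NA
f1(c(1, 1, 1, 2))        # ties at the minimum: also NA
f1(c(-10, rep(0, 99)))   # one clear low outlier: -10
```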

Trouble plotting with dates in R

I am relatively new to R and am having trouble plotting grouped data against date. I have count data grouped by month over 4 years. I don't want May of 2008 grouped with May 2009 but rather points for each month of each year with standard errors. Here is my code so far but I get a blank graph with no points. I can get rid of the axis.POSIXct line and I get a graph with points and error bars. The problem seems to be around the scaling or data format of the plot vs. the axis. Can anyone help me here?
> r <- as.POSIXct(range(refmCount$mo.yr), "month")
>
> ############# can get plot and points to line up on the x-axis##########################
> plot(refmCount$mo.yr, refmCount$count, type = "n", xaxt = "n",
+ xlab = "Date",
+ ylab = "Mean number of salamanders per night",
+ xlim = c(r[1], r[2]))
> axis.POSIXct(1, at = seq(r[1], r[2], by = "month"), format = "%b")
> points(refmCount$mo.yr, refmCount$count, type = "p", pch = 19)
points(depmCount$mo.yr, depmCount$count, type = "p", pch = 24)
> arrows(refmCount$mo.yr, refmCount$count+mCount$se, refmCount$mo.yr, refmCount$count- refmCount$se, angle=90, code=3, length=0)
>
> str(refmCount)
'data.frame': 19 obs. of 7 variables:
$ mo.yr:Class 'Date' num [1:19] 14000 14031 14061 14092 14123 ...
$ trt : Factor w/ 2 levels "Depletion","Reference": 2 2 2 2 2 2 2 2 2 2 ...
$ N : num 75 110 15 10 34 20 20 10 40 15 ...
$ count: num 3.6 5.95 3.47 6.7 11.12 ...
$ sd : num 8.58 8.4 4.42 3.47 11.88 ...
$ se : num 0.99 0.801 1.142 1.096 2.037 ...
$ ci : num 1.97 1.59 2.45 2.48 4.14 ...
> r
[1] "2008-04-30 20:00:00 EDT" "2011-05-31 20:00:00 EDT"
>
You have two choices. Install package "zoo" and use the yearmon class, or calculate numeric months so that May 2005 is 2005.4167. You can create prettier labels with paste(month.abb[month], year).
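A minimal sketch of the second option, using made-up dates and counts (not the salamander data) and the year + month/12 convention from above, so that May 2008 becomes 2008.4167:

```r
# made-up example dates and counts
d <- as.Date(c("2008-05-01", "2008-06-01", "2009-05-01"))
count <- c(3.6, 5.95, 6.7)

lt <- as.POSIXlt(d)
year <- lt$year + 1900          # POSIXlt years count from 1900
month <- lt$mon + 1             # POSIXlt months run from 0 to 11
num_month <- year + month / 12  # numeric month, e.g. May 2008 -> 2008.4167

labs <- paste(month.abb[month], year)

plot(num_month, count, xaxt = "n", pch = 19,
     xlab = "Date", ylab = "count")
axis(1, at = num_month, labels = labs)
```

Because the x values are plain numbers, the points and the custom axis share one scale, which avoids the Date vs. POSIXct mismatch in the question.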
