Need an explanation for this weird looking boxplot notch.
I've provided the data and the code before plotting (using ggplot2)
The notch looks inverted on top. How can this be explained?
I've never encountered a notch like this.
df<-read.table(text = ' SU AGC.low AGB
1 1 22.12 48.09
2 2 10.14 22.04
3 3 18.23 39.63
4 4 36.14 78.57
5 5 47.56 103.39
6 6 38.98 84.74
7 7 47.74 103.78
8 8 15.17 32.98
9 10 30.24 65.74
10 11 33.28 72.35
11 15 40.27 87.54', header=TRUE, sep="")
library(reshape2)  # provides melt()
df = subset(df, select = -c(AGC.low))
dfm <- melt(df[,c('SU','AGB')],id.vars = 1)
str(dfm)
dfm$SU<-as.factor(dfm$SU) # this is also needed to collapse the x-axis
View(dfm)
# Make a boxplot for AGB
# Trim data frame
# Remove SU column
box_dfm = subset(dfm, select = -c(SU))
names(box_dfm)
names(box_dfm)[2] <- "AGB"
#names(box_dfm)[1] <- "AGB"
library(ggplot2)
# Change outlier, color, shape and size
p<-ggplot(box_dfm, aes(x=variable, y=AGB, color=variable)) +
geom_boxplot(outlier.colour="black", outlier.shape=20,outlier.size=2,notch=TRUE)+
scale_y_continuous(breaks=seq(0,160,20))+
ggtitle("Plot scale Biomass") +
xlab("Variable") + ylab("Biomass Mg B/ ha")+
theme(legend.position="none")
This boxplot is just a summary of the data. Reading the documentation for geom_boxplot:
The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles).
Computing the upper hinge this gives
quantile(box_dfm$AGB, .75)
#> 75%
#> 86.14
The notched box is defined as:
In a notched box plot, the notches extend 1.58 * IQR / sqrt(n).
n <- nrow(box_dfm)
median(box_dfm$AGB) + (IQR(box_dfm$AGB) * 1.58 / sqrt(n))
#> 92.49168
With more data the notch would be narrower. Here the upper notch limit (92.49) lies above the upper hinge (86.14), so the notch extends past the box and is drawn folded back, which is why it looks inverted.
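As a quick check, the hinges and both notch limits can be computed together in base R. This is only a sketch of the arithmetic quoted from the documentation, with the AGB values typed in from the question; the names agb, hinges, half_width and notch are illustrative.

```r
# AGB values from the question's data frame
agb <- c(48.09, 22.04, 39.63, 78.57, 103.39, 84.74,
         103.78, 32.98, 65.74, 72.35, 87.54)

# Lower and upper hinges (25th and 75th percentiles)
hinges <- quantile(agb, c(0.25, 0.75), names = FALSE)

# Notch limits: median +/- 1.58 * IQR / sqrt(n)
half_width <- 1.58 * IQR(agb) / sqrt(length(agb))
notch <- median(agb) + c(-1, 1) * half_width

# TRUE where a notch limit falls outside the box
notch_outside <- c(lower = notch[1] < hinges[1],
                   upper = notch[2] > hinges[2])
print(hinges)
print(notch)
print(notch_outside)
```

Only the upper limit falls outside the box, which matches the plot: the notch looks inverted at the top but normal at the bottom.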
I have several data sets, shown below.
1-1.Data Set(cidf_min.csv)
name   number  value     samples  conf  lower     upper     level
apple  1       0.056008  100      0.95  0.05458   0.059141  2
apple  2       0.048256  100      0.95  0.046363  0.059142  2
apple  3       0.042819  100      0.95  0.040164  0.059143  2
apple  4       0.038663  100      0.95  0.035155  0.059144  2
apple  5       0.035325  100      0.95  0.030146  0.059145  2
1-2.Data Set(newdf_min.csv)
name   number  value    samples  conf  lower     upper     level  max
apple  2       0.01854  100      0.95  -0.06963  0.045235  2      2
'''code'''
cidf<-read.csv("D:/cidf_min.csv")
newdf<-read.csv("D:/newdf_min.csv")
p_min <- ggplot(cidf, aes(x=number, y=value, group=name)) +
  geom_line(aes(color=level)) +
  geom_ribbon(aes(ymin=lower, ymax=upper, fill=level, group=name), alpha=0.3) +
  geom_text(data=newdf, aes(label=name, color=level), hjust=-0.2, vjust=0.5, size=3, show.legend=F) +
  coord_cartesian(xlim=c(0,max(cidf$number)*1.2)) +
  xlab(~"Con (\u00D7"~C[max]*")") +
  ylab(~"score ("*mu*"C/"*mu*"F)") +
  theme_bw()
2-1.Data Set(cidf_max.csv)
name   number  value     samples  conf  lower     upper     level
apple  1       0.068832  100      0.95  0.061945  0.069416  2
apple  2       0.065256  100      0.95  0.053687  0.065841  2
apple  3       0.060492  100      0.95  0.046201  0.06155   2
apple  4       0.05585   100      0.95  0.039848  0.058739  2
apple  5       0.047585  100      0.95  0.033555  0.056066  2
2-2.Data Set(newdf_max.csv)
name   number  value     samples  conf  lower     upper     level  max
apple  2       0.024221  100      0.95  -0.04546  0.076362  2      2
'''code'''
cidf<-read.csv("D:/cidf_max.csv")
newdf<-read.csv("D:/newdf_max.csv")
p_max <- ggplot(cidf, aes(x=number, y=value, group=name)) +
  geom_line(aes(color=level)) +
  geom_ribbon(aes(ymin=lower, ymax=upper, fill=level, group=name), alpha=0.3) +
  geom_text(data=newdf, aes(label=name, color=level), hjust=-0.2, vjust=0.5, size=3, show.legend=F) +
  coord_cartesian(xlim=c(0,max(cidf$number)*1.2)) +
  xlab(~"Con (\u00D7"~C[max]*")") +
  ylab(~"score ("*mu*"C/"*mu*"F)") +
  theme_bw()
3-1.Data Set(cidf_mean.csv)
name   number  value     samples  conf  lower     upper     level
apple  1       0.069673  100      0.95  0.069673  0.069673  2
apple  2       0.06133   100      0.95  0.057955  0.062792  2
apple  3       0.060497  100      0.95  0.046201  0.06155   2
apple  4       0.054623  100      0.95  0.044241  0.058739  2
apple  5       0.039852  100      0.95  0.031906  0.043719  2
3-2.Data Set(newdf_mean.csv)
name   number  value     samples  conf  lower     upper     level  max
apple  2       0.014323  100      0.95  -0.06793  0.045717  2      2
'''code'''
cidf<-read.csv("D:/cidf_mean.csv")
newdf<-read.csv("D:/newdf_mean.csv")
p_mean <- ggplot(cidf, aes(x=number, y=value, group=name)) +
  geom_line(aes(color=level)) +
  geom_ribbon(aes(ymin=lower, ymax=upper, fill=level, group=name), alpha=0.3) +
  geom_text(data=newdf, aes(label=name, color=level), hjust=-0.2, vjust=0.5, size=3, show.legend=F) +
  coord_cartesian(xlim=c(0,max(cidf$number)*1.2)) +
  xlab(~"Con (\u00D7"~C[max]*")") +
  ylab(~"score ("*mu*"C/"*mu*"F)") +
  theme_bw()
I have already drawn 3 plots using ggplot with geom_line, geom_ribbon, etc.
I want to merge the plots p_min, p_max and p_mean.
p_min, p_max and p_mean should all be shown on the y axis.
The x axis is number (1,2,3,4,5).
Please let me know how to draw several y variables like these in one layout.
What I did: I first renamed the datasets to cidf1, cidf2 and cidf3 so that they don't get mixed up,
and did the same for newdf. I added a column, type, that records which plot each row belongs to. I was not sure whether you know the tidyverse, so I used base R to add the column with the $ operator.
#Set 1
cidf1 <- read.csv("cidf_min.csv")
cidf1$type ="P_min"
newdf1<-read.csv("newdf_min.csv")
newdf1$type ="P_min"
#Set 2
cidf2<-read.csv("cidf_max.csv")
cidf2$type ="P_max"
newdf2<-read.csv("newdf_max.csv")
newdf2$type ="P_max"
#Set 3
cidf3<-read.csv("cidf_mean.csv")
cidf3$type ="P_mean"
newdf3<-read.csv("newdf_mean.csv")
newdf3$type ="P_mean"
then I combined them together:
cidf = rbind(cidf1,cidf2,cidf3)
newdf =rbind(newdf1,newdf2,newdf3)
And plotted them, setting color=type to color each line according to the dataset. I removed other things that you had in your ggplot as they were not related to your question asked.
ggplot(cidf, aes(x=number, y=value))+geom_line(aes(color=type)) +
geom_ribbon(aes(ymin=lower, ymax=upper, fill=type), alpha=0.3) +
xlab(~"Con (\u00D7"~C[max]*")")+ylab(~"score ("*mu*"C/"*mu*"F)")+theme_bw()
So it is very close to what you are looking for. Let me know if I misinterpret what you want to do and I will update my code
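For reference, here is a self-contained sketch of the same idea, with the three data sets typed in from the question instead of read from CSV and only the columns the plot needs. The facet_wrap() variant at the end is an assumption about what "multiple y axes in a layout" might mean (one stacked panel per data set); the names p_overlay and p_panels are illustrative.

```r
library(ggplot2)

number <- 1:5

# Values copied from cidf_min.csv, cidf_max.csv and cidf_mean.csv in the question
cidf1 <- data.frame(number,
  value = c(0.056008, 0.048256, 0.042819, 0.038663, 0.035325),
  lower = c(0.05458, 0.046363, 0.040164, 0.035155, 0.030146),
  upper = c(0.059141, 0.059142, 0.059143, 0.059144, 0.059145),
  type  = "P_min")
cidf2 <- data.frame(number,
  value = c(0.068832, 0.065256, 0.060492, 0.05585, 0.047585),
  lower = c(0.061945, 0.053687, 0.046201, 0.039848, 0.033555),
  upper = c(0.069416, 0.065841, 0.06155, 0.058739, 0.056066),
  type  = "P_max")
cidf3 <- data.frame(number,
  value = c(0.069673, 0.06133, 0.060497, 0.054623, 0.039852),
  lower = c(0.069673, 0.057955, 0.046201, 0.044241, 0.031906),
  upper = c(0.069673, 0.062792, 0.06155, 0.058739, 0.043719),
  type  = "P_mean")

# Stack the three sets; type records which set each row came from
cidf <- rbind(cidf1, cidf2, cidf3)

# All three lines and ribbons share one panel
p_overlay <- ggplot(cidf, aes(x = number, y = value)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = type), alpha = 0.3) +
  geom_line(aes(color = type)) +
  theme_bw()

# Variant: one panel per set, each with its own y scale
p_panels <- p_overlay + facet_wrap(~ type, ncol = 1, scales = "free_y")

print(p_overlay)
print(p_panels)
```

The overlay makes the three series directly comparable, while the faceted version gives each its own y axis at the cost of that direct comparison.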
I need some help. Here is the data I want to plot. I want to keep Path.ID on the y axis, with the numeric values of all the other columns added stepwise. This is a subset of a very large dataset, so I want the Path.ID labels attached to each line and, if possible, the values of the other columns shown at each point.
head(table)
Path.ID sc st rc rt
<chr> <dbl> <dbl> <dbl> <dbl>
1 map00230 1 12 5 52
2 map00940 1 20 10 43
3 map01130 NA 15 8 34
4 map00983 NA 14 5 28
5 map00730 NA 5 3 26
6 map00982 NA 16 2 24
somewhat like this
Thank you
Here is the pseudo code.
library(tidyr)
library(dplyr)
library(ggplot2)
# convert your table into a long format - sorry I am more used to this type of data
table_long <- table %>% gather(x_axis, value, sc:rt)
# Plot with ggplot2
ggplot() +
# draw line
geom_line(data=table_long, aes(x=x_axis, y=value, group=Path.ID, color=Path.ID)) +
# draw label at the last x_axis, which in this case is rt
geom_label(data=table_long %>% filter(x_axis=="rt"),
aes(x=x_axis, y=value, label=Path.ID, fill=Path.ID),
color="#FFFFFF")
Note that with this code if a Path.ID doesn't have the rt value then it will not have any label
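One caveat: x_axis is a character column after reshaping, so ggplot2 orders it alphabetically (rc, rt, sc, st) and rt is not the last position on the axis. A possible variation, sketched here with the six rows from head(table), uses pivot_longer() and fixes the level order explicitly. The names tbl, col_order and p_ordered are illustrative, and labelling each point with its value is one way to read the request in the question.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# The six rows shown by head(table) in the question
tbl <- data.frame(
  Path.ID = c("map00230", "map00940", "map01130",
              "map00983", "map00730", "map00982"),
  sc = c(1, 1, NA, NA, NA, NA),
  st = c(12, 20, 15, 14, 5, 16),
  rc = c(5, 10, 8, 5, 3, 2),
  rt = c(52, 43, 34, 28, 26, 24)
)

# Keep the x positions in column order instead of alphabetical order
col_order <- c("sc", "st", "rc", "rt")
table_long <- tbl %>%
  pivot_longer(cols = all_of(col_order),
               names_to = "x_axis", values_to = "value") %>%
  mutate(x_axis = factor(x_axis, levels = col_order))

p_ordered <- ggplot(table_long,
                    aes(x = x_axis, y = value,
                        group = Path.ID, color = Path.ID)) +
  geom_line() +
  # print each value next to its point
  geom_text(aes(label = value), vjust = -0.5, size = 3,
            show.legend = FALSE) +
  # put the Path.ID label at rt, now the right-most position
  geom_label(data = filter(table_long, x_axis == "rt"),
             aes(label = Path.ID), hjust = 0, show.legend = FALSE)

print(p_ordered)  # rows with NA values are dropped with a warning
```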
p<-ggplot() +
# draw line
geom_line(data=table_long, aes(x=x_axis, y=value, group=Path.ID, color=Path.ID)) +
geom_text(data=table_long %>% filter(x_axis=="rt"),
aes(x=x_axis, y=value, label=Path.ID),
color= "#050505", size = 3, check_overlap = TRUE)
p +labs(title= "title",x = "x-label", y="y-label")
I had to use geom_text because I have a large dataset, and it gave me a somewhat clearer graph.
Thank you #sinh, it helped a lot.
I am using the ..count.. transformation in geom_bar and get the warning
position_stack requires non-overlapping x intervals when some of my categories have few counts.
This is best explained using some mock data (my data involves direction and windspeed and I retain names relating to that)
#make data
set.seed(12345)
FF=rweibull(100,1.7,1)*20 #mock speeds
FF[FF>60]=59
dir=sample.int(10,size=100,replace=TRUE) # mock directions
#group into speed classes
FFcut=cut(FF,breaks=seq(0,60,by=20),ordered_result=TRUE,right=FALSE,drop=FALSE)
# stuff into data frame & plot
df=data.frame(dir=dir,grp=FFcut)
ggplot(data=df,aes(x=dir,y=(..count..)/sum(..count..),fill=grp)) + geom_bar()
This works fine, and the resulting plot shows the frequency of directions grouped according to speed. It is of relevance that the velocity class with the fewest counts (here "[40,60)") will have 5 counts.
However more velocity classes leads to a warning. For instance, with
FFcut=cut(FF,breaks=seq(0,60,by=15),ordered_result=TRUE,right=FALSE,drop=FALSE)
the velocity class with the fewest counts (now "[45,60)") will have only 3 counts and ggplot2 will warn that
position_stack requires non-overlapping x intervals
and the plot will show data in this category spread out along the x axis.
It seems that 5 is the minimum size for a group to have for this to work correctly.
I would appreciate knowing if this is a feature or a bug in stat_bin (which geom_bar is using) or if I am simply abusing geom_bar.
Also, any suggestions how to get around this would be appreciated.
Sincerely
This occurs because df$dir is numeric, so the ggplot object assumes a continuous x-axis, and aesthetic parameter group is based on the only known discrete variable (fill = grp).
As a result, when there simply aren't that many dir values in grp = [45,60), ggplot gets confused over how wide each bar should be. This becomes more visually obvious if we split the plot into different facets:
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar() +
facet_wrap(~ grp)
> for(l in levels(df$grp)) print(sort(unique(df$dir[df$grp == l])))
[1] 1 2 3 4 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 3 4 5 7 9 10
[1] 2 4 7
We can also check manually that the minimum difference between sorted df$dir values is 1 for the first three grp values, but 2 for the last one. The default bar width is thus wider.
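The manual check can be written out in base R. This sketch types in the sorted unique dir values printed above for each group, so it only verifies the spacing claim and does not query ggplot2 for the width it actually uses. The names dirs_by_grp and min_gap are illustrative.

```r
# Sorted unique dir values per speed class, copied from the output above
dirs_by_grp <- list(
  "[0,15)"  = c(1, 2, 3, 4, 6, 7, 8, 9, 10),
  "[15,30)" = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  "[30,45)" = c(2, 3, 4, 5, 7, 9, 10),
  "[45,60)" = c(2, 4, 7)
)

# Smallest distance between neighbouring x values within each group
min_gap <- sapply(dirs_by_grp, function(d) min(diff(sort(unique(d)))))
print(min_gap)
```

Only the sparsest group has a minimum gap of 2 rather than 1, which is the difference the paragraph above describes.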
The following solutions should all achieve the same result:
1. Explicitly specify the same bar width for all groups in geom_bar():
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar(width = 0.9)
2. Convert dir to a categorical variable before passing it to aes(x = ...):
ggplot(data=df,
aes(x=factor(dir), y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar()
3. Specify that the group parameter should be based on both df$dir & df$grp:
ggplot(data=df,
aes(x=dir,
y=(..count..)/sum(..count..),
group = interaction(dir, grp),
fill = grp)) +
geom_bar()
This doesn't directly solve the issue, because I also don't get what's going on with the overlapping values, but it's a dplyr-powered workaround, and might turn out to be more flexible anyway.
Instead of relying on geom_bar to take the cut factor and give you shares via ..count../sum(..count..), you can easily enough just calculate those shares yourself up front, and then plot your bars. I personally like having this type of control over my data and exactly what I'm plotting.
First, I put dir and FF into a data frame/tbl_df, and cut FF. Then count lets me group the data by dir and grp and count up the number of observations for each combination of those two variables, then calculate the share of each n over the sum of n. I'm using geom_col, which is like geom_bar but for when you already have a y value in your aes.
library(tidyverse)
set.seed(12345)
FF <- rweibull(100,1.7,1) * 20 #mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE) # mock directions
shares <- tibble(dir = dir, FF = FF) %>%
mutate(grp = cut(FF, breaks = seq(0, 60, by = 15), ordered_result = T, right = F, drop = F)) %>%
count(dir, grp) %>%
mutate(share = n / sum(n))
shares
#> # A tibble: 29 x 4
#> dir grp n share
#> <int> <ord> <int> <dbl>
#> 1 1 [0,15) 3 0.03
#> 2 1 [15,30) 2 0.02
#> 3 2 [0,15) 4 0.04
#> 4 2 [15,30) 3 0.03
#> 5 2 [30,45) 1 0.01
#> 6 2 [45,60) 1 0.01
#> 7 3 [0,15) 6 0.06
#> 8 3 [15,30) 1 0.01
#> 9 3 [30,45) 2 0.02
#> 10 4 [0,15) 6 0.06
#> # ... with 19 more rows
ggplot(shares, aes(x = dir, y = share, fill = grp)) +
geom_col()
I am new to R and ggplot2. Any help is much appreciated! I have here a data set, I am trying to graph
Weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42
In this data set, I am trying to plot a bar graph of mean_1 and mean_2 side by side for each Weight_band (1-4), with error bars for min (1 and 2 respectively) and max (1 and 2 respectively). A "." denotes missing data.
I have browsed Stack Overflow and other websites, but haven't found the solution I am looking for.
the code I have is as follows:
sk1 <- read.csv(file="analysis.csv")
library(reshape2)
library(ggplot2)
sk2 <- melt(sk1,id.vars = "Weight_band")
c <- ggplot(sk2, aes(x = Weight_band, y = value, fill = variable))
c + geom_bar(stat = "identity", position="dodge")
However, this method does not limit the graph to plotting only the means. Is there code to do so? Furthermore, is there a way to apply min and max as error bars to their respective means? Thanks in advance; this would greatly help my understanding of ggplot2.
This should get you close, we need to do a little more data cleaning and reshaping to make ggplot happy :)
library(reshape2)
library(ggplot2)
df <- read.table(text = "weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42", header = T)
sk2 <- melt(df,id.vars = "weight_band")
## Clean
sk2$group <- gsub(".*_(\\d)", "\\1", sk2$variable)
# new column used for color or fill aes and position dodging
sk2$variable <- gsub("_.*", "", sk2$variable)
# make these variables universal not group specific
## Reshape again
sk3 <- dcast(sk2, weight_band + group ~ variable)
# spread it back to kinda wide format
sk3 <- dplyr::mutate_if(sk3, is.character, as.numeric)
# convert every column to numeric if character now
# plot values seem a little wonky but the plot is coming together
ggplot(sk3, aes(x = as.factor(weight_band), y = mean, color = as.factor(group))) +
geom_bar(position = "dodge", stat = "identity") +
geom_errorbar(aes(ymax = max, ymin = min), position = "dodge")
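A possible alternative, sketched here as a swapped-in technique rather than the approach above, does the reshaping in one step with tidyr. It reads "." as NA via na.strings, and pivot_longer() with the ".value" sentinel splits names such as mean_1 into a measure (mean) and a group (1). The names sk_long, dodge and p_err are illustrative.

```r
library(tidyr)
library(ggplot2)

# Same data as above; "." is read as NA so every column stays numeric
sk1 <- read.table(text = "weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42", header = TRUE, na.strings = ".")

# One row per weight_band and group, with columns mean, SD, min, max
sk_long <- pivot_longer(sk1, cols = -weight_band,
                        names_to = c(".value", "group"),
                        names_sep = "_")

# Use the same dodge for bars and error bars so they line up
dodge <- position_dodge(width = 0.9)
p_err <- ggplot(sk_long,
                aes(x = factor(weight_band), y = mean, fill = group)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = min, ymax = max),
                position = dodge, width = 0.2)

print(p_err)  # rows with missing values are dropped with a warning
```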
I would like to create a multivariate boxplot time series with ggplot2 and I need to have an x axis that positions the boxplots based on their associated dates.
I found two posts about this question: one is Time series plot with groups using ggplot2, but the x axis is not a scale_x_axis, so the graph is biased in my case. The other one is ggplot2 : multiple factors boxplot with scale_x_date axis in R, but the person uses an interaction function which I don't use in my case.
Here is an example file and my code:
dtm <- read.table(text="date ruche mortes trmt
03.10.2013 1 8 P+
04.10.2013 1 7 P+
07.10.2013 1 34 P+
03.10.2013 7 16 P+
04.10.2013 7 68 P+
07.10.2013 7 170 P+
03.10.2013 2 7 P-
04.10.2013 2 7 P-
07.10.2013 2 21 P-
03.10.2013 5 8 P-
04.10.2013 5 27 P-
07.10.2013 5 24 P-
03.10.2013 3 15 T
04.10.2013 3 6 T
07.10.2013 3 13 T
03.10.2013 4 6 T
04.10.2013 4 18 T
07.10.2013 4 19 T ", h=T)
require(ggplot2)
require(visreg)
require(MASS)
require(reshape2)
library(scales)
dtm$asDate = as.Date(dtm[,1], "%d.%m.%Y")
## Plot 1: Nearly what I want but is biased by the x-axis format where date should not be a factor##
p2<-ggplot(data = dtm, aes(x = factor(asDate), y = mortes))
p2 + geom_boxplot(aes(fill = factor(trmt)))
## Plot 2: Doesn't show me what I need, ggplot apparently needs a factor as x##
p<-ggplot(data = dtm, aes(x = asDate, y = mortes))
p + geom_boxplot(aes( group = asDate, fill=trmt) )
Can anyone help me with this issue, please?
Is this what you want?
Code:
p <- ggplot(data = dtm, aes(x = asDate, y = mortes, group=interaction(date, trmt)))
p + geom_boxplot(aes(fill = factor(trmt)))
The key is to group by interaction(date, trmt) so that you get all of the boxes, and not cast asDate to a factor, so that ggplot treats it as a date. If you want to add anything more to the x axis, be sure to do it with + scale_x_date().
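As an illustration of the scale_x_date() remark, here is a self-contained sketch with the question's data rebuilt inline. The daily breaks and the "%d.%m" label format are assumptions chosen for this example, not something the question asked for, and p_dates is an illustrative name.

```r
library(ggplot2)

# Data from the question, rebuilt without read.table
dtm <- data.frame(
  date   = rep(c("03.10.2013", "04.10.2013", "07.10.2013"), times = 6),
  ruche  = rep(c(1, 7, 2, 5, 3, 4), each = 3),
  mortes = c(8, 7, 34, 16, 68, 170, 7, 7, 21,
             8, 27, 24, 15, 6, 13, 6, 18, 19),
  trmt   = rep(c("P+", "P-", "T"), each = 6)
)
dtm$asDate <- as.Date(dtm$date, "%d.%m.%Y")

# One box per date and treatment, positioned on a true date axis
p_dates <- ggplot(dtm, aes(x = asDate, y = mortes,
                           group = interaction(asDate, trmt),
                           fill = trmt)) +
  geom_boxplot() +
  scale_x_date(date_breaks = "1 day", date_labels = "%d.%m")

print(p_dates)
```

Because the x axis is continuous, the gap between 04.10 and 07.10 is drawn three times as wide as the gap between 03.10 and 04.10.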