Box plot with bands with standard deviation and range - r

I am new to R and ggplot2. Any help is much appreciated! I have a data set that I am trying to graph:
weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42
In this data set, I am trying to plot a bar graph of mean_1 and mean_2 side by side for each weight_band (1-4), with error bars for the corresponding min (min_1 and min_2) and max (max_1 and max_2). A "." denotes missing data.
I have browsed through Stack Overflow and other websites, but haven't found the solution I am looking for.
The code I have is as follows:
sk1 <- read.csv(file="analysis.csv")
library(reshape2)
sk2 <- melt(sk1,id.vars = "Weight_band")
c <- ggplot(sk2, aes(x = Weight_band, y = value, fill = variable))
c + geom_bar(stat = "identity", position="dodge")
However, this method does not limit the graph to plotting only the means. Is there a way to do so? Furthermore, is there a way to apply min and max as error bars to their respective means? Thank you all in advance. This would greatly help me advance my understanding of R's ggplot2 package.

This should get you close, we need to do a little more data cleaning and reshaping to make ggplot happy :)
library(reshape2)
library(ggplot2)
df <- read.table(text = "weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42", header = TRUE, na.strings = ".")
sk2 <- melt(df,id.vars = "weight_band")
## Clean
sk2$group <- gsub(".*_(\\d)", "\\1", sk2$variable)
# new column used for color or fill aes and position dodging
sk2$variable <- gsub("_.*", "", sk2$variable)
# make these variables universal not group specific
## Reshape again
sk3 <- dcast(sk2, weight_band + group ~ variable)
# spread it back to kinda wide format
sk3 <- dplyr::mutate_if(sk3, is.character, as.numeric)
# convert every column to numeric if character now
# plot values seem a little wonky but the plot is coming together
ggplot(sk3, aes(x = as.factor(weight_band), y = mean, color = as.factor(group))) +
geom_bar(position = "dodge", stat = "identity") +
geom_errorbar(aes(ymax = max, ymin = min), position = "dodge")
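For comparison, here is a minimal sketch of the same plot built from a hand-entered long data frame, so the reshaping step can be checked separately. It assumes the values in the question's table, uses fill rather than color so the bars are filled, and gives the bars and error bars the same position_dodge() width so they line up. The name dodge and the error bar width of 0.25 are illustrative choices, not part of the original answer.

```r
library(ggplot2)

# Long format: one row per weight_band and group, values copied from the question
sk3 <- data.frame(
  weight_band = rep(1:4, each = 2),
  group       = rep(c("1", "2"), times = 4),
  mean        = c(5, NA, 6, NA, 8, 8, 8, 9),
  min         = c(0.17, NA, 1.1, NA, 1, 1.749, 2.3, 1.402),
  max         = c(27, NA, 23, NA, 27, 27, 13, 42)
)

# Drop the combinations that have no data (the "." entries)
sk3 <- sk3[!is.na(sk3$mean), ]

# Use one dodge specification for both layers so bars and error bars align
dodge <- position_dodge(width = 0.9)

p <- ggplot(sk3, aes(x = factor(weight_band), y = mean, fill = group)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = min, ymax = max), position = dodge, width = 0.25) +
  labs(x = "weight band", fill = "group")

print(p)
```

Bands 1 and 2 only have data for group 1, so their single bar takes the full dodge width.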

Related

no. of geom_point matches the value

I have an existing ggplot with geom_col and some observations from a dataframe. The dataframe looks something like :
over runs wickets
1 12 0
2 8 0
3 9 2
4 3 1
5 6 0
The geom_col represents the runs data column and now I want to represent the wickets column using geom_point in a way that the number of points represents the wickets.
I want my graph to look something like this :
As far as I know, we'll need to transform your data to have one row per point. This method requires dplyr version 1.0.0 or later, which allows summarize to return more than one row per group (in dplyr 1.1.0 and later, reframe() is the recommended function for this).
You can adjust the spacing of the wickets by multiplying seq(wickets), though with your sample data a spacing of 1 unit looks pretty good to me.
library(dplyr)
library(ggplot2)
wicket_data = dd %>%
filter(wickets > 0) %>%
group_by(over) %>%
summarize(wicket_y = runs + seq(wickets))
ggplot(dd, aes(x = over)) +
geom_col(aes(y = runs), fill = "#A6C6FF") +
geom_point(data = wicket_data, aes(y = wicket_y), color = "firebrick4") +
theme_bw()
Using this sample data:
dd = read.table(text = "over runs wickets
1 12 0
2 8 0
3 9 2
4 3 1
5 6 0", header = T)
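If the dplyr version is a concern, the same one-row-per-wicket expansion can be sketched in base R. This is only an illustration under the assumption that wickets holds non-negative whole numbers; it reuses the names dd and wicket_data from the answer above.

```r
# Sample data from the question
dd <- data.frame(
  over    = 1:5,
  runs    = c(12, 8, 9, 3, 6),
  wickets = c(0, 0, 2, 1, 0)
)

# Repeat each over once per wicket, then stack the points
# 1, 2, ... units above the top of that over's bar
wicket_data <- data.frame(
  over     = rep(dd$over, times = dd$wickets),
  wicket_y = rep(dd$runs, times = dd$wickets) + sequence(dd$wickets)
)

wicket_data
```

Overs with zero wickets drop out automatically because rep() repeats them zero times, so no separate filter step is needed.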

Is this boxplot notch correct?

Need an explanation for this weird looking boxplot notch.
I've provided the data and the code before plotting (using ggplot2)
The notch looks inverted on top. How can this be explained?
I've never encountered a notch like this.
df<-read.table(text = ' SU AGC.low AGB
1 1 22.12 48.09
2 2 10.14 22.04
3 3 18.23 39.63
4 4 36.14 78.57
5 5 47.56 103.39
6 6 38.98 84.74
7 7 47.74 103.78
8 8 15.17 32.98
9 10 30.24 65.74
10 11 33.28 72.35
11 15 40.27 87.54', header=TRUE, sep="")
df = subset(df, select = -c(AGC.low))
library(reshape2)
dfm <- melt(df[,c('SU','AGB')],id.vars = 1)
str(dfm)
dfm$SU<-as.factor(dfm$SU) # this is also needed to collapse the x-axis
View(dfm)
# Make a boxplot for AGB
# Trim data frame
# Remove SU column
box_dfm = subset(dfm, select = -c(SU))
names(box_dfm)
names(box_dfm)[2] <- "AGB"
#names(box_dfm)[1] <- "AGB"
library(ggplot2)
# Change outlier, color, shape and size
p<-ggplot(box_dfm, aes(x=variable, y=AGB, color=variable)) +
geom_boxplot(outlier.colour="black", outlier.shape=20,outlier.size=2,notch=TRUE)+
scale_y_continuous(breaks=seq(0,160,20))+
ggtitle("Plot scale Biomass") +
xlab("Variable") + ylab("Biomass Mg B/ ha")+
theme(legend.position="none")
This boxplot is just a summary of the data. Reading the documentation for geom_boxplot:
The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles).
Computing the upper hinge this gives
quantile(box_dfm$AGB, .75)
#> 75%
#> 86.14
The notched box is defined as:
In a notched box plot, the notches extend 1.58 * IQR / sqrt(n).
n <- nrow(box_dfm)
median(box_dfm$AGB) + (IQR(box_dfm$AGB) * 1.58 / sqrt(n))
#> 92.49168
If you had more data, the notch would be narrower. Here the upper end of the notch (92.49) lies above the upper hinge (86.14), so the notch extends past the box and is drawn as an inverted notch.
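The comparison can be reproduced in base R without drawing anything. This is a small sketch that assumes the AGB values from the question and the notch formula quoted above; the variable names are made up for illustration.

```r
# AGB values from the question
agb <- c(48.09, 22.04, 39.63, 78.57, 103.39, 84.74,
         103.78, 32.98, 65.74, 72.35, 87.54)

# Hinges as quoted from the documentation: first and third quartiles
hinges <- quantile(agb, c(0.25, 0.75))

# Notch limits: median +/- 1.58 * IQR / sqrt(n)
half_width <- 1.58 * IQR(agb) / sqrt(length(agb))
notch <- median(agb) + c(lower = -1, upper = 1) * half_width

# TRUE when the notch reaches beyond the box on that side
upper_outside <- notch[["upper"]] > hinges[["75%"]]
lower_outside <- notch[["lower"]] < hinges[["25%"]]

hinges
notch
c(upper_outside = upper_outside, lower_outside = lower_outside)
```

With these 11 values only the upper side overshoots, which matches the plot where the notch looks inverted at the top only.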

ggplot2 geom_bar position failure

I am using the ..count.. transformation in geom_bar and get the warning
position_stack requires non-overlapping x intervals when some of my categories have few counts.
This is best explained using some mock data (my data involves direction and windspeed and I retain names relating to that)
#make data
set.seed(12345)
FF=rweibull(100,1.7,1)*20 #mock speeds
FF[FF>60]=59
dir=sample.int(10,size=100,replace=TRUE) # mock directions
#group into speed classes
FFcut=cut(FF,breaks=seq(0,60,by=20),ordered_result=TRUE,right=FALSE,drop=FALSE)
# stuff into data frame & plot
df=data.frame(dir=dir,grp=FFcut)
ggplot(data=df,aes(x=dir,y=(..count..)/sum(..count..),fill=grp)) + geom_bar()
This works fine, and the resulting plot shows the frequency of directions grouped according to speed. It is of relevance that the velocity class with the fewest counts (here "[40,60)") will have 5 counts.
However, using more velocity classes leads to a warning. For instance, with
FFcut=cut(FF,breaks=seq(0,60,by=15),ordered_result=TRUE,right=FALSE,drop=FALSE)
the velocity class with the fewest counts (now "[45,60)") will have only 3 counts and ggplot2 will warn that
position_stack requires non-overlapping x intervals
and the plot will show data in this category spread out along the x axis.
It seems that 5 is the minimum size for a group to have for this to work correctly.
I would appreciate knowing if this is a feature or a bug in stat_bin (which geom_bar is using) or if I am simply abusing geom_bar.
Also, any suggestions how to get around this would be appreciated.
Sincerely
This occurs because df$dir is numeric, so the ggplot object assumes a continuous x-axis, and aesthetic parameter group is based on the only known discrete variable (fill = grp).
As a result, when there simply aren't that many dir values in grp = [45,60), ggplot gets confused over how wide each bar should be. This becomes more visually obvious if we split the plot into different facets:
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar() +
facet_wrap(~ grp)
> for(l in levels(df$grp)) print(sort(unique(df$dir[df$grp == l])))
[1] 1 2 3 4 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 3 4 5 7 9 10
[1] 2 4 7
We can also check manually that the minimum difference between sorted df$dir values is 1 for the first three grp values, but 2 for the last one. The default bar width for that last group is thus wider, which makes its bars overlap their neighbours.
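The manual check can be sketched in base R. To avoid depending on the random seed, this illustration hard-codes the unique dir values printed above for each speed class; the names dirs_by_grp and min_gap are hypothetical. As far as I know, ggplot2 derives the default bar width from this smallest gap within each group.

```r
# Unique directions observed in each speed class, copied from the output above
dirs_by_grp <- list(
  "[0,15)"  = c(1, 2, 3, 4, 6, 7, 8, 9, 10),
  "[15,30)" = 1:10,
  "[30,45)" = c(2, 3, 4, 5, 7, 9, 10),
  "[45,60)" = c(2, 4, 7)
)

# Smallest distance between neighbouring x values within each group
min_gap <- sapply(dirs_by_grp, function(x) min(diff(sort(unique(x)))))

min_gap
```

On the real data frame the equivalent call would be tapply(df$dir, df$grp, function(x) min(diff(sort(unique(x))))).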
The following solutions should all achieve the same result:
1. Explicitly specify the same bar width for all groups in geom_bar():
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar(width = 0.9)
2. Convert dir to a categorical variable before passing it to aes(x = ...):
ggplot(data=df,
aes(x=factor(dir), y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar()
3. Specify that the group parameter should be based on both df$dir & df$grp:
ggplot(data=df,
aes(x=dir,
y=(..count..)/sum(..count..),
group = interaction(dir, grp),
fill = grp)) +
geom_bar()
This doesn't directly solve the issue, because I also don't get what's going on with the overlapping values, but it's a dplyr-powered workaround, and might turn out to be more flexible anyway.
Instead of relying on geom_bar to take the cut factor and give you shares via ..count../sum(..count..), you can easily enough just calculate those shares yourself up front, and then plot your bars. I personally like having this type of control over my data and exactly what I'm plotting.
First, I put dir and FF into a data frame/tbl_df and cut FF. Then count groups the data by dir and grp and counts the observations for each combination of those two variables, and mutate calculates the share of each n over the sum of n. I'm using geom_col, which is like geom_bar but for when you already have a y value in your aes.
library(tidyverse)
set.seed(12345)
FF <- rweibull(100,1.7,1) * 20 #mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE) # mock directions
shares <- tibble(dir = dir, FF = FF) %>%
mutate(grp = cut(FF, breaks = seq(0, 60, by = 15), ordered_result = T, right = F, drop = F)) %>%
count(dir, grp) %>%
mutate(share = n / sum(n))
shares
#> # A tibble: 29 x 4
#> dir grp n share
#> <int> <ord> <int> <dbl>
#> 1 1 [0,15) 3 0.03
#> 2 1 [15,30) 2 0.02
#> 3 2 [0,15) 4 0.04
#> 4 2 [15,30) 3 0.03
#> 5 2 [30,45) 1 0.01
#> 6 2 [45,60) 1 0.01
#> 7 3 [0,15) 6 0.06
#> 8 3 [15,30) 1 0.01
#> 9 3 [30,45) 2 0.02
#> 10 4 [0,15) 6 0.06
#> # ... with 19 more rows
ggplot(shares, aes(x = dir, y = share, fill = grp)) +
geom_col()

Plot a Matrix with ggplot

I want to display this matrix with ggplot as lines:
Example: on the x-axis, the portions from 1 to 12; on the y-axis, 5 lines (categories) in different colors, with their corresponding values.
Example: first point x=1 and y=12.25 in red
Second point x=2 and y=0.9423 in green
DF <- read.table(text = "
Portion 1 2 3 4 5
1 1 12.250000 0.9423077 33.92308 0.0000000 1.8846154
2 2 6.236364 1.7818182 38.30909 0.8909091 1.7818182
3 3 9.333333 1.8666667 28.00000 0.0000000 2.8000000
4 4 9.454545 2.8363636 34.03636 4.7272727 0.9454545
5 5 27.818182 0.0000000 19.47273 2.7818182 0.9272727
6 6 19.771930 2.5789474 19.77193 0.8596491 6.0175439
7 7 22.350877 1.7192982 22.35088 0.8596491 1.7192982
8 8 17.769231 4.0384615 15.34615 0.8076923 4.0384615
9 9 16.925373 8.8656716 23.37313 2.4179104 2.4179104
10 10 10.036364 8.3636364 25.09091 0.8363636 1.6727273
11 11 8.937500 8.9375000 8.12500 0.0000000 0.0000000
12 12 12.157895 5.2105263 14.76316 0.8684211 0.0000000", header = TRUE)
newResults <- as.data.frame(DF)
library(reshape2)
R = data.frame(Portion = c('1','2','3','4','5','6','7','8','9','10','11','12'), newResults[,1], newResults[,2], newResults[,3], newResults[,4], newResults[,5])
meltR = melt(R, id = "Portion")
ggplot(meltR, aes(reorder(Portion, -value), y = value, group = variable, colour = variable)) + geom_line()
Why are my X values not ordered? And is this the best way to do this?
Thanks a lot.
Try:
meltR = melt(DF, id = "Portion")
ggplot(meltR, aes(x=Portion, y = value, group = variable, colour = variable)) + geom_line()
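In this case there is no need to reorder anything in the aesthetic for ggplot: reorder(Portion, -value) sorts the x-axis by value rather than by portion number, which is why the axis looked unordered. This will give you the following graph: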
You may want to change the names of the variables, either by renaming them in the first step, or by providing custom labels to ggplot.
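Both renaming options can be sketched as follows. This is only an illustration: it uses the first three portions of the matrix to stay short, and the label "Category" is a made-up name, since the question does not say what the five columns represent.

```r
library(reshape2)
library(ggplot2)

# First three portions of the matrix from the question
DF <- data.frame(
  Portion = 1:3,
  X1 = c(12.250000, 6.236364, 9.333333),
  X2 = c(0.9423077, 1.7818182, 1.8666667),
  X3 = c(33.92308, 38.30909, 28.00000),
  X4 = c(0.0000000, 0.8909091, 0.0000000),
  X5 = c(1.8846154, 1.7818182, 2.8000000)
)

# Option 1: rename the value columns before melting
names(DF)[-1] <- paste("Category", 1:5)
meltR <- melt(DF, id.vars = "Portion")

# Option 2: set the legend title and axis breaks in ggplot itself
p <- ggplot(meltR, aes(x = Portion, y = value, colour = variable)) +
  geom_line() +
  labs(colour = "Category") +
  scale_x_continuous(breaks = DF$Portion)

print(p)
```

Keeping Portion numeric is what lets ggplot order the x-axis from 1 upward without any reorder() call.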

how to put percentage label in ggplot when geom_text is not suitable?

Here is my simplified data :
company <-c(rep(c(rep("company1",4),rep("company2",4),rep("company3",4)),3))
product<-c(rep(c(rep(c("product1","product2","product3","product4"),3)),3))
week<-c( c(rep("w1",12),rep("w2",12),rep("w3",12)))
mydata<-data.frame(company=company,product=product,week=week)
mydata$rank<-c(rep(c(1,3,2,3,2,1,3,2,3,2,1,1),3))
mydata=mydata[mydata$company=="company1",]
And, R code I used :
ggplot(mydata,aes(x = week,fill = as.factor(rank))) +
geom_bar(position = "fill")+
scale_y_continuous(labels = percent_format())
In the bar plot, I want to label the percentage of each rank within each week.
The problem is that the data doesn't contain the percentage of each rank, and its structure is not suited to holding one.
(Of course, the original data has many more observations than this example.)
Can anyone show me how to label the percentages in this graph?
I'm not sure I understand why geom_text is not suitable. Here is an answer using it, but if you specify why it is not suitable, perhaps someone might come up with the answer you are looking for.
library(ggplot2)
library(plyr)
library(reshape2) # provides melt()
mydata = mydata[,c(3,4)] #drop unnecessary variables
data.m = melt(table(mydata)) #get counts and melt it
#calculate percentage:
m1 = ddply(data.m, .(week), summarize, ratio=value/sum(value))
#order data frame (needed to comply with percentage column):
m2 = data.m[order(data.m$week),]
#combine them:
mydf = data.frame(m2,ratio=m1$ratio)
Which gives us the following data structure. The ratio column contains the relative frequency of given rank within specified week (so one can see that rank == 3 is twice as abundant as the other two).
> mydf
week rank value ratio
1 w1 1 1 0.25
4 w1 2 1 0.25
7 w1 3 2 0.50
2 w2 1 1 0.25
5 w2 2 1 0.25
8 w2 3 2 0.50
3 w3 1 1 0.25
6 w3 2 1 0.25
9 w3 3 2 0.50
Next, we have to calculate the position of the percentage labels and plot it.
#get positions of percentage labels:
mydf = ddply(mydf, .(week), transform, position = cumsum(value) - 0.5*value)
#make plot
p =
ggplot(mydf,aes(x = week, y = value, fill = as.factor(rank))) +
geom_bar(stat = "identity")
#add percentage labels using positions defined previously
p + geom_text(aes(label = sprintf("%1.2f%%", 100*ratio), y = position))
Is this what you wanted?
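If the bars should stay normalised to 100% as in the question, a possible variation is to compute the shares with base R and let position_fill() place the labels. This is a sketch under the assumption that only the company1 rows are plotted; counts and ratio are illustrative names.

```r
library(ggplot2)

# company1 rows from the question: ranks 1, 3, 2, 3 in every week
mydata <- data.frame(
  week = rep(c("w1", "w2", "w3"), each = 4),
  rank = rep(c(1, 3, 2, 3), times = 3)
)

# Count each week/rank combination, then convert to a share within the week
counts <- as.data.frame(table(week = mydata$week, rank = mydata$rank))
counts$ratio <- ave(as.numeric(counts$Freq), counts$week,
                    FUN = function(x) x / sum(x))

p <- ggplot(counts, aes(x = week, y = Freq, fill = rank)) +
  geom_col(position = "fill") +
  # vjust = 0.5 centres each label inside its bar segment
  geom_text(aes(label = sprintf("%.0f%%", 100 * ratio)),
            position = position_fill(vjust = 0.5)) +
  scale_y_continuous(labels = function(x) paste0(100 * x, "%"))

print(p)
```

Because the labels use the same fill position as the bars, no manual cumulative-sum positions are needed.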
