Complex barplots with multiple grouping levels - r

I am trying to build a complex bar plot with several grouping levels. Here is the data frame:
Treatment DCA.f Megalorchestia Talitridae Traskorchestia
1 A (-Inf,0] 8.000000 4843.6667 1394.0000
2 U (-Inf,0] 21.000000 2905.3333 483.6667
3 A (0,0.1] 25.000000 254.8571 41.0000
4 U (0,0.1] 30.714286 691.0000 360.1429
5 A (0.1,0.2] 35.400000 1355.2000 127.4000
6 U (0.1,0.2] 104.400000 705.4000 50.2000
7 A (0.2,0.3] 3.857143 649.7143 633.4286
8 U (0.2,0.3] 10.857143 510.4286 268.7143
9 A (0.3,0.4] 13.444444 981.5556 207.5556
10 U (0.3,0.4] 10.666667 1567.5556 417.5556
11 A (0.4, Inf] 0.000000 3.0000 1.2000
12 U (0.4, Inf] 0.000000 3.8000 0.0000
I want a barplot that, for each DCA.f group, shows six values: the three organism categories (the three rightmost columns), each split by treatment (A vs. U). Reading along the bottom of the plot, there would be one large category per DCA.f level, and within each category six bars: two per genus, color coded by treatment. This would repeat for every DCA.f level. I have looked through many of the other barplot posts and they have not gotten me anywhere.
Any help?

Here is one possibility. When using barplot with beside = TRUE, each column of the input matrix corresponds to a group of bars, and each row to a bar within each group. Thus, we need to reshape the data so that the columns represent the levels of 'DCA.f':
library(reshape2)
library(RColorBrewer)
# reshape data
df2 <- melt(df, id.vars = c("Treatment", "DCA.f"))
df3 <- dcast(df2, Treatment + variable ~ DCA.f)
# create color palette
ncols <- length(unique(df3$variable))
cols <- c(brewer.pal(ncols, "Greens"), brewer.pal(ncols, "Reds"))
# plot
barplot(as.matrix(df3[ , -c(1, 2)]),
        beside = TRUE,
        col = cols,
        cex.names = 0.7)
# add legend
legend(x = 15, y = 4000, legend = paste(df3$Treatment, df3$variable), fill = cols)
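The answer's code assumes a data frame `df` that is never defined. As a minimal sketch, the block below rebuilds the question's data from the printed table and constructs the same kind of 6 x 6 matrix with base R only, as an alternative to reshape2. The helper names `levs`, `taxa` and `m` and the colour choices are illustrative assumptions, not part of the original answer.

```r
# Rebuild the question's data frame (values copied from the printed table)
levs <- c("(-Inf,0]", "(0,0.1]", "(0.1,0.2]", "(0.2,0.3]", "(0.3,0.4]", "(0.4, Inf]")
df <- data.frame(
  Treatment = rep(c("A", "U"), times = 6),
  DCA.f = factor(rep(levs, each = 2), levels = levs),
  Megalorchestia = c(8, 21, 25, 30.714286, 35.4, 104.4,
                     3.857143, 10.857143, 13.444444, 10.666667, 0, 0),
  Talitridae = c(4843.6667, 2905.3333, 254.8571, 691, 1355.2, 705.4,
                 649.7143, 510.4286, 981.5556, 1567.5556, 3, 3.8),
  Traskorchestia = c(1394, 483.6667, 41, 360.1429, 127.4, 50.2,
                     633.4286, 268.7143, 207.5556, 417.5556, 1.2, 0)
)
taxa <- c("Megalorchestia", "Talitridae", "Traskorchestia")

# One column per DCA.f level; within a column: A x 3 taxa, then U x 3 taxa
m <- sapply(split(df, df$DCA.f), function(d) {
  d <- d[order(d$Treatment), ]
  c(t(as.matrix(d[, taxa])))
})
rownames(m) <- paste(rep(c("A", "U"), each = 3), taxa)

# Greens for treatment A, reds for treatment U (built-in colour names, an arbitrary choice)
cols <- c("darkseagreen1", "seagreen3", "darkgreen", "mistyrose", "tomato", "darkred")
barplot(m, beside = TRUE, col = cols, cex.names = 0.7)
legend("topright", legend = rownames(m), fill = cols)
```

Each column of `m` becomes one cluster of six bars, which is why the DCA.f levels have to end up as columns rather than rows.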

Related

I can't fill a histogram in R using ggplot

I have a dataframe called "employee_attrition". There are two variables of my interest, the first one is called "MonthlyIncome" (with continuous data of salary) and the second one is "PerformanceRating" which takes discrete values (1,2,3 or 4). My intention is to create a histogram for the MonthlyIncome, and show the PerformanceRating in the same plot. I have this:
ggplot(data = employee_attrition, aes(x=MonthlyIncome, fill=PerformanceRating))+
geom_histogram(aes(y=..count..))+
xlab("Salario mensual (MonthlyIncome)")+
ylab("Frecuencia")+
ggtitle("Histograma: MonthlyIncome y Attrition")+
theme_minimal()
The problem is that the plot does not show the "PerformanceRating" associated with each bar of the histogram.
My data frame is something like this:
MonthlyIncome PerformanceRating
1 5993 1
2 5130 1
3 2090 4
4 2909 3
5 3468 4
6 3068 3
I want a histogram that shows the frequency of MonthlyIncome, with each bar split into four colours, one per PerformanceRating value.
Something like this, but with 4 colours (PerformanceRating Values)
For the fill aesthetic to work, first convert the grouping variable to a factor.
library(tibble)
library(tidyverse)
##---------------------------------------------------
##Creating a sample dataset simulating your dataset
##---------------------------------------------------
employee_attrition <- tibble(
MonthlyIncome = sample(3000:5993, 1000, replace = FALSE),
PerformanceRating = sample(1:4, 1000, replace = TRUE)
)
##------------------------------------
## Plot - also changing the format of
## PerformanceRating to "factor"
##-----------------------------------
employee_attrition %>%
mutate(PerformanceRating = as.factor(PerformanceRating)) %>%
ggplot(aes(x=MonthlyIncome, fill=PerformanceRating))+
geom_histogram(aes(y=..count..), bins = 20) +
xlab("Salario mensual (MonthlyIncome)")+
ylab("Frecuencia")+
ggtitle("Histograma: MonthlyIncome y Attrition")+
theme_minimal()
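To make the difference concrete, here is a minimal sketch that builds the same histogram twice on mock data, once with the numeric rating and once with `factor()`, and then counts the fill colours that ggplot2 actually draws in the factor version. The data frame `mock`, the seed and the object names are assumptions made for illustration only.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the mock data is reproducible
mock <- data.frame(
  MonthlyIncome = sample(3000:5993, 1000),
  PerformanceRating = sample(1:4, 1000, replace = TRUE)
)

# Numeric fill: treated as a continuous variable, so the bars are not split by rating
p_numeric <- ggplot(mock, aes(x = MonthlyIncome, fill = PerformanceRating)) +
  geom_histogram(bins = 20)

# Factor fill: one stacked segment (and one legend key) per rating
p_factor <- ggplot(mock, aes(x = MonthlyIncome, fill = factor(PerformanceRating))) +
  geom_histogram(bins = 20) +
  labs(fill = "PerformanceRating")

# Number of distinct fill colours in the built factor plot
n_fill_factor <- length(unique(ggplot_build(p_factor)$data[[1]]$fill))
```

Wrapping the variable in `factor()` inside `aes()` has the same effect as the `mutate()` step above and leaves the original column untouched.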

ggplot2 geom_bar position failure

I am using the ..count.. transformation in geom_bar and get the warning
position_stack requires non-overlapping x intervals when some of my categories have few counts.
This is best explained using some mock data (my data involves direction and windspeed and I retain names relating to that)
#make data
set.seed(12345)
FF=rweibull(100,1.7,1)*20 #mock speeds
FF[FF>60]=59
dir=sample.int(10,size=100,replace=TRUE) # mock directions
#group into speed classes
FFcut=cut(FF,breaks=seq(0,60,by=20),ordered_result=TRUE,right=FALSE,drop=FALSE)
# stuff into data frame & plot
df=data.frame(dir=dir,grp=FFcut)
ggplot(data=df,aes(x=dir,y=(..count..)/sum(..count..),fill=grp)) + geom_bar()
This works fine, and the resulting plot shows the frequency of directions grouped according to speed. It is of relevance that the velocity class with the fewest counts (here "[40,60)") will have 5 counts.
However, using more velocity classes leads to a warning. For instance, with
FFcut=cut(FF,breaks=seq(0,60,by=15),ordered_result=TRUE,right=FALSE,drop=FALSE)
the velocity class with the fewest counts (now "[45,60)") will have only 3 counts and ggplot2 will warn that
position_stack requires non-overlapping x intervals
and the plot will show data in this category spread out along the x axis.
It seems that 5 is the minimum size for a group to have for this to work correctly.
I would appreciate knowing if this is a feature or a bug in stat_bin (which geom_bar is using) or if I am simply abusing geom_bar.
Also, any suggestions how to get around this would be appreciated.
Sincerely
This occurs because df$dir is numeric, so ggplot assumes a continuous x-axis, and the group aesthetic defaults to the only discrete variable in the plot (fill = grp).
As a result, when grp = [45,60) contains only a few distinct dir values, ggplot works out a different default bar width for that group than for the others. This becomes more visually obvious if we split the plot into different facets:
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar() +
facet_wrap(~ grp)
> for(l in levels(df$grp)) print(sort(unique(df$dir[df$grp == l])))
[1] 1 2 3 4 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 3 4 5 7 9 10
[1] 2 4 7
We can also check manually that the minimum difference between sorted df$dir values is 1 for the first three grp values, but 2 for the last one. The default bar width is thus wider.
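That manual check can be sketched in a few lines of base R. The four vectors below are copied from the printed output above, on the assumption that the printed lines follow the order of the levels of grp; the names `dirs_by_grp` and `min_gap` are illustrative.

```r
# Unique dir values per speed class, copied from the printed output above
dirs_by_grp <- list(
  "[0,15)"  = c(1, 2, 3, 4, 6, 7, 8, 9, 10),
  "[15,30)" = 1:10,
  "[30,45)" = c(2, 3, 4, 5, 7, 9, 10),
  "[45,60)" = c(2, 4, 7)
)

# Smallest gap between neighbouring x values within each group
min_gap <- sapply(dirs_by_grp, function(x) min(diff(sort(unique(x)))))
print(min_gap)
```

As far as I understand the geom_bar documentation, the default width is 90% of the resolution of the data, so a group whose smallest gap is 2 gets bars about twice as wide as a group whose smallest gap is 1.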
The following solutions should all achieve the same result:
1. Explicitly specify the same bar width for all groups in geom_bar():
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar(width = 0.9)
2. Convert dir to a categorical variable before passing it to aes(x = ...):
ggplot(data=df,
aes(x=factor(dir), y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar()
3. Specify that the group parameter should be based on both df$dir & df$grp:
ggplot(data=df,
aes(x=dir,
y=(..count..)/sum(..count..),
group = interaction(dir, grp),
fill = grp)) +
geom_bar()
This doesn't directly solve the issue, because I also don't get what's going on with the overlapping values, but it's a dplyr-powered workaround, and might turn out to be more flexible anyway.
Instead of relying on geom_bar to take the cut factor and give you shares via ..count../sum(..count..), you can easily enough just calculate those shares yourself up front, and then plot your bars. I personally like having this type of control over my data and exactly what I'm plotting.
First, I put dir and FF into a data frame/tbl_df, and cut FF. Then count lets me group the data by dir and grp and count the number of observations for each combination of those two variables, after which I calculate the share of each n over the sum of n. I'm using geom_col, which is like geom_bar but for the case where you already have a y value in your aes.
library(tidyverse)
set.seed(12345)
FF <- rweibull(100,1.7,1) * 20 #mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE) # mock directions
shares <- tibble(dir = dir, FF = FF) %>%
mutate(grp = cut(FF, breaks = seq(0, 60, by = 15), ordered_result = TRUE, right = FALSE, drop = FALSE)) %>%
count(dir, grp) %>%
mutate(share = n / sum(n))
shares
#> # A tibble: 29 x 4
#> dir grp n share
#> <int> <ord> <int> <dbl>
#> 1 1 [0,15) 3 0.03
#> 2 1 [15,30) 2 0.02
#> 3 2 [0,15) 4 0.04
#> 4 2 [15,30) 3 0.03
#> 5 2 [30,45) 1 0.01
#> 6 2 [45,60) 1 0.01
#> 7 3 [0,15) 6 0.06
#> 8 3 [15,30) 1 0.01
#> 9 3 [30,45) 2 0.02
#> 10 4 [0,15) 6 0.06
#> # ... with 19 more rows
ggplot(shares, aes(x = dir, y = share, fill = grp)) +
geom_col()
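For readers who prefer not to load dplyr, the same shares can be computed with base R. This is only a sketch of an equivalent approach under the same mock data, not part of the original answer; note that `dir` comes back as a factor here.

```r
set.seed(12345)
FF <- rweibull(100, 1.7, 1) * 20   # mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE)  # mock directions
grp <- cut(FF, breaks = seq(0, 60, by = 15), ordered_result = TRUE, right = FALSE)

# Cross-tabulate dir x grp, convert the counts to shares of the grand total,
# and flatten the table to one row per combination
shares <- as.data.frame(prop.table(table(dir = dir, grp = grp)),
                        responseName = "share")

# Drop combinations that never occur, mirroring what count() returns
shares <- shares[shares$share > 0, ]
head(shares)
```

The resulting data frame can be passed to `ggplot(shares, aes(x = dir, y = share, fill = grp)) + geom_col()` in the same way; because `dir` is a factor, the x-axis is discrete.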

Overlaying unique column values as geom_point in ggplot2

Here is an excerpt of the dataset I am working on.
Name Value ID Total
A 10 1 3
A 11 2 3
A 10 3 3
B 10 1 4
B 11 2 4
B 11 3 4
B 11 4 4
What I want to do is plot Name on the x-axis and ID on the y-axis for all rows where Value is 11. On top of that I want to overlay Total, so that the plot also shows the number of items in each Name group. This might be achieved using the length of each group in the Name variable or using Total. Here is what I did, followed by a sample of the desired output.
mydf <- read.csv("./test1.csv", header = TRUE)
x <- ggplot(mydf, aes(Name, ID)) +
  geom_point(data = subset(mydf, Value == 11), size = 3, colour = "tomato3") +
  scale_y_continuous(name = "Class ID", limits = c(1, 4), breaks = seq(1, 4, by = 1))
y <- x + xlab("Class") + theme_bw()
z <- y + scale_x_discrete(limits = c("A", "B", "C"))
The three orange asterisks at (A,3) and (B,4) are manual text annotation that I want to replace with either a short line or a circle to indicate the total number of items.
Thank you for your help.
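No answer is recorded for this question. One possible approach, offered only as a sketch, is to reduce the data to one row per Name and draw Total as a second point layer with an open circle. The data frame below is rebuilt from the excerpt, and the `totals` helper plus the shape and size choices are assumptions.

```r
library(ggplot2)

# The excerpt from the question, rebuilt inline
mydf <- data.frame(
  Name  = c("A", "A", "A", "B", "B", "B", "B"),
  Value = c(10, 11, 10, 10, 11, 11, 11),
  ID    = c(1, 2, 3, 1, 2, 3, 4),
  Total = c(3, 3, 3, 4, 4, 4, 4)
)

# One row per Name, carrying the group total
totals <- unique(mydf[, c("Name", "Total")])

p <- ggplot(mydf, aes(Name, ID)) +
  # filled points: the rows where Value is 11
  geom_point(data = subset(mydf, Value == 11), size = 3, colour = "tomato3") +
  # open circles: the total number of items per Name
  geom_point(data = totals, aes(y = Total), shape = 1, size = 6) +
  scale_y_continuous(name = "Class ID", limits = c(1, 4), breaks = 1:4) +
  xlab("Class") +
  theme_bw()
```

If Total were not available as a column, `table(mydf$Name)` would give the same per-group counts.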

Plot a Matrix with ggplot

I want to display this matrix as a line plot with ggplot.
The x-axis should show the portion, from 1 to 12, and the y-axis should show five lines (one per category), each in a different color, with their corresponding values.
For example, the first point of the red line would be at x = 1, y = 12.25,
and the first point of the green line would be at x = 1, y = 0.9423.
DF <- read.table(text = "
Portion 1 2 3 4 5
1 1 12.250000 0.9423077 33.92308 0.0000000 1.8846154
2 2 6.236364 1.7818182 38.30909 0.8909091 1.7818182
3 3 9.333333 1.8666667 28.00000 0.0000000 2.8000000
4 4 9.454545 2.8363636 34.03636 4.7272727 0.9454545
5 5 27.818182 0.0000000 19.47273 2.7818182 0.9272727
6 6 19.771930 2.5789474 19.77193 0.8596491 6.0175439
7 7 22.350877 1.7192982 22.35088 0.8596491 1.7192982
8 8 17.769231 4.0384615 15.34615 0.8076923 4.0384615
9 9 16.925373 8.8656716 23.37313 2.4179104 2.4179104
10 10 10.036364 8.3636364 25.09091 0.8363636 1.6727273
11 11 8.937500 8.9375000 8.12500 0.0000000 0.0000000
12 12 12.157895 5.2105263 14.76316 0.8684211 0.0000000", header = TRUE)
newResults <- as.data.frame(DF)
library(reshape2)
R = data.frame(Portion = c('1','2','3','4','5','6','7','8','9','10','11','12'), newResults[,1], newResults[,2], newResults[,3], newResults[,4], newResults[,5])
meltR = melt(R, id = "Portion")
ggplot(meltR, aes(reorder(Portion, -value), y = value, group = variable, colour = variable)) + geom_line()
Why are my x values not in order? And is this the best way to do it?
Thanks a lot.
Try:
meltR = melt(DF, id = "Portion")
ggplot(meltR, aes(x=Portion, y = value, group = variable, colour = variable)) + geom_line()
In this case there is no need to reorder anything in the aesthetic for ggplot. This will give you the following graph:
You may want to change the names of the variables, either by renaming them in the first step, or by providing custom labels to ggplot.
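As a hedged illustration of the renaming option, the sketch below uses the first three rows of the question's matrix, renames the value columns before reshaping, and reshapes with base R `stack()` instead of reshape2. The "Category n" labels and the object name `long` are made up for the example; the X1 to X5 column names are what read.table would normally produce from the numeric headers.

```r
library(ggplot2)

# First three rows of the question's matrix, for brevity
DF <- data.frame(
  Portion = 1:3,
  X1 = c(12.250000, 6.236364, 9.333333),
  X2 = c(0.9423077, 1.7818182, 1.8666667),
  X3 = c(33.92308, 38.30909, 28.00000),
  X4 = c(0, 0.8909091, 0),
  X5 = c(1.8846154, 1.7818182, 2.8)
)

# Rename the value columns before reshaping ("Category n" is a made-up label)
names(DF)[-1] <- paste("Category", 1:5)

# Base-R counterpart of melt(DF, id = "Portion"): one row per Portion x category
long <- data.frame(Portion = rep(DF$Portion, times = 5), stack(DF[-1]))
names(long) <- c("Portion", "value", "variable")

p <- ggplot(long, aes(x = Portion, y = value, group = variable, colour = variable)) +
  geom_line() +
  labs(colour = "Category")  # legend title only; the entries come from the column names
```

The alternative mentioned above, leaving the data alone and relabelling in the plot, would use the `labels` argument of `scale_colour_discrete()` instead.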

ggplot2: arranging multiple boxplots as a time series

I would like to create a multivariate boxplot time series with ggplot2 and I need to have an x axis that positions the boxplots based on their associated dates.
I found two posts about this question. One is "Time series plot with groups using ggplot2", but there the x axis is not a scale_x_axis, so the graph is biased in my case. The other is "ggplot2 : multiple factors boxplot with scale_x_date axis in R", but that person uses an interaction function, which I don't use in my case.
Here is an example file and my code:
dtm <- read.table(text="date ruche mortes trmt
03.10.2013 1 8 P+
04.10.2013 1 7 P+
07.10.2013 1 34 P+
03.10.2013 7 16 P+
04.10.2013 7 68 P+
07.10.2013 7 170 P+
03.10.2013 2 7 P-
04.10.2013 2 7 P-
07.10.2013 2 21 P-
03.10.2013 5 8 P-
04.10.2013 5 27 P-
07.10.2013 5 24 P-
03.10.2013 3 15 T
04.10.2013 3 6 T
07.10.2013 3 13 T
03.10.2013 4 6 T
04.10.2013 4 18 T
07.10.2013 4 19 T ", h=T)
require(ggplot2)
require(visreg)
require(MASS)
require(reshape2)
library(scales)
dtm$asDate = as.Date(dtm[,1], "%d.%m.%Y")
## Plot 1: Nearly what I want but is biased by the x-axis format where date should not be a factor##
p2<-ggplot(data = dtm, aes(x = factor(asDate), y = mortes))
p2 + geom_boxplot(aes(fill = factor(dtm$trmt)))
## Plot 2: Doesn't show me what I need, ggplot apparently needs a factor as x##
p<-ggplot(data = dtm, aes(x = asDate, y = mortes))
p + geom_boxplot(aes(group = asDate, fill = trmt))
Can anyone help me with this issue, please?
Is this what you want?
Code:
p <- ggplot(data = dtm, aes(x = asDate, y = mortes, group = interaction(date, trmt)))
p + geom_boxplot(aes(fill = trmt))
The key is to group by interaction(date, trmt) so that you get all of the boxes, and not cast asDate to a factor, so that ggplot treats it as a date. If you want to add anything more to the x axis, be sure to do it with + scale_x_date().
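A self-contained sketch of that advice follows, with the question's data rebuilt inline so it can run on its own. The break spacing and label format passed to `scale_x_date()` are illustrative choices, not something the answer specifies.

```r
library(ggplot2)

# The question's data, rebuilt inline
dtm <- data.frame(
  date   = rep(c("03.10.2013", "04.10.2013", "07.10.2013"), times = 6),
  ruche  = rep(c(1, 7, 2, 5, 3, 4), each = 3),
  mortes = c(8, 7, 34, 16, 68, 170, 7, 7, 21, 8, 27, 24, 15, 6, 13, 6, 18, 19),
  trmt   = rep(c("P+", "P-", "T"), each = 6)
)
dtm$asDate <- as.Date(dtm$date, "%d.%m.%Y")

# One box per date x treatment combination, positioned on a true date axis
p <- ggplot(dtm, aes(x = asDate, y = mortes, group = interaction(date, trmt))) +
  geom_boxplot(aes(fill = trmt)) +
  scale_x_date(date_breaks = "1 day", date_labels = "%d.%m")
```

Because the x aesthetic stays a Date, the gap between 04.10 and 07.10 is drawn three times as wide as the gap between 03.10 and 04.10, which is what the factor-based plot was hiding.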
