ggplot differences between groups - r

I have a df with groups in different trials, and I want to make a bar graph of just deltas between trials in ggplot. Having a hard time getting ggplot to understand I want the differences in one df. Also, some of the treatments aren't represented in the second trial, so I want to just count that as 0 (i.e. delta would be = trial 1 - 0).
set.seed(1)
df <- data.frame((matrix(nrow=175,ncol=4)))
colnames(df) <- c("group","trial","count","hour")
df$group <- rep(c("A","B","C","D","A","B","D"),each=25)
df$trial <- rep(c(rep(1,times=100),rep(2,times=75)))
df$count <- runif(175,0,50)
df$hour <- rep(1:25,times=7)
df2 <- aggregate(df[,3:4],list(df$group,df$trial),mean)
colnames(df2)[1:2] <- c("group","trial")
That's where I've gotten to. I have plotted with individual bars for (group*trial), but I can't figure out how to subtract them. I want a plot of x=group and y= delta(trial).
I tried this:
ggplot(df2 %>% group_by(group) %>% delta=diff(count),
aes(x=group,y=delta)) + geom_bar()
from a similar posting I came across, but no luck.

this should do the trick:
ggplot(df2 %>% group_by(group) %>% summarise(delta=ifelse(nrow(.)>1,diff(count),0)),
aes(x=group,y=delta)) + geom_col()#geom_bar(stat="identity")
The problems are, that "diff" returns not the value 0 but a vector of length 0, when there is only one input value. Also instead of using geom_bar, I recommend geom_col. Another thing, you should think about, is that the diff result is depending on the order of your data frame. As such I would recommend to use
ggplot(df2 %>% group_by(group) %>% summarise(delta_trial_1_trial_2=
ifelse(length(trial)>1,
count[trial==2]-count[trial==1],0)),
aes(x=group,y=delta_trial_1_trial_2)) + geom_col()

Related

Fill geom_tile by mode of a factor variable or other ways to create a heat map in R

I am trying to create a heat map in R using three factors. I would like to be able to fill the colour using the modal category of one of the factors but I have not been able to find out how to do this.
When I try ggplot with geom_tile, it does produce the heatmap, however, I am not sure how it chooses the value of the fill variable. It certainly isn't the mode because I've checked this.
For instance, using the inbuilt dataset ChickWeight, I would like the fill to be based on the modal (most frequent) category of a variable "weight_group" I created.
data(ChickWeight)
glimpse(ChickWeight)
ChickWeight$Time <- ifelse(ChickWeight$Time >= 10,1,0)
ChickWeight <- ChickWeight %>% mutate(weight_group = ntile(weight, 3))
ChickWeight$Diet <- as.factor(ChickWeight$Diet)
ChickWeight$Time <- as.factor(ChickWeight$Time)
ChickWeight$weight_group <- as.factor(ChickWeight$weight_group)
table(ChickWeight$Diet, ChickWeight$Time, ChickWeight$weight_group)
ggplot(data = ChickWeight, aes(x=Time, y=Diet, fill=weight_group)) +
geom_tile()
Based on the three-way table, the bottom right block should be pink (corresponding to weight_group==1) rather than green as the modal category of weight_group when Diet==1 & Time==1 is weight_group==1 (11 counts).
Any help on this would be greatly appreciated.
Thank you!
You can define a function getMode that calculates the mode of a vector using plyr's count function to create a data frame of the counts for each class. Then sort the data frame and get the top value.
library(plyr)
getMode <- function(vec){
df <- plyr::count(vec) %>%
arrange(-freq)
return(df[1,"x"])
}
From here group by time and diet so you can find the mode for each combination of these groups and then use this as the fill for ggplot.
ChickWeight %>%
group_by(Time, Diet) %>%
summarize(modeWeightGroup = getMode(weight_group)) %>%
ggplot(aes(x=Time, y=Diet, fill= modeWeightGroup)) +
geom_tile()
I also don't think that the bottom right square should be weight_group 1 because it looks like the three way table is already sorted based on weight_group so that square is saying that of chicks in weight_group 1, their modal time, diet combination is (1,1).
Using dplyr to count the most frequent category of weight_group for each combination of Time and Diet :
ChickWeight %>%
group_by(Time, Diet) %>%
count(weight_group) %>%
filter(n == max(n)) %>%
ggplot(
aes(x = Time,
y = Diet,
fill = weight_group)
) +
geom_tile()
By the way, since you already know dplyr::mutate, you should know you can do all the pre-processing you are doing here inside a single mutate.
That means instead of :
ChickWeight$Time <- ifelse(ChickWeight$Time >= 10,1,0)
ChickWeight <- ChickWeight %>% mutate(weight_group = ntile(weight, 3))
ChickWeight$Diet <- as.factor(ChickWeight$Diet)
ChickWeight$Time <- as.factor(ChickWeight$Time)
ChickWeight$weight_group <- as.factor(ChickWeight$weight_group)
you can simply type :
ChickWeight <-
ChickWeight %>%
mutate(
Time = as.factor(ifelse(Time>=10, 1 ,0)),
Diet = as.factor(Diet),
weight_group = as.factor(ntile(weight, 3))
)

How to plot a large number of density plots with different categorical variables

I have a dataset in which I have one numeric variable and many categorical variables. I would like to make a grid of density plots, each showing the distribution of the numeric variable for different categorical variables, with the fill corresponding to subgroups of each categorical variable. For example:
library(tidyverse)
library(nycflights13)
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
plot_1 <- dat %>%
ggplot(aes(x = distance, fill = carrier)) +
geom_density()
plot_1
plot_2 <- dat %>%
ggplot(aes(x = distance, fill = origin)) +
geom_density()
plot_2
I would like to find a way to quickly make these two plots. Right now, the only way I know how to do this is to create each plot individually, and then use grid_arrange to put them together. However, my real dataset has something like 15 categorical variables, so this would be very time intensive!
Is there a quicker and easier way to do this? I believe that the hardest part about this is that each plot has its own legend, so I'm not sure how to get around that stumbling block.
This solutions gives all the plots in a list. Here we make a single function that accepts a variable that you want to plot, and then use lapply with a vector of all the variables you want to plot.
fill_variables <- vars(carrier, origin)
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!fill_variable)) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
If you have no idea of what those !! mean, I recommend watching this 5 minute video that introduces the key concepts of tidy evaluation. This is what you want to use when you want to create this sorts of wrapper functions to do stuff programmatically. I hope this helps!
Edit: If you want to feed an array of strings instead of a quosure, you can change !!fill_variable for !!sym(fill_variable) as follows:
fill_variables <- c('carrier', 'origin')
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!sym(fill_variable))) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
Alternative solution
As #djc wrote in the comments, I'm having trouble passing the column names into 'fill_variables'. Right now I am extracting column names using the following code...
You can separate the categorical and numerical variables like; cat_vars <- flights[, sapply(flights, is.character)] for categorical variables and cat_vars <- flights[, sapply(flights, !is.character)] for continuous variables and then pass these vectors into the wrapper function given by mgiormenti
Full code is given below;
library(tidyverse)
library(nycflights13)
cat_vars <- flights[, sapply(flights, is.character)]
cont_vars<- flights[, !sapply(flights, is.character)]
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
func_plot_cat <- function(cat_vars) {
dat %>%
ggplot(aes(x = distance, fill = !!cat_vars)) +
geom_density()
}
func_plot_cont <- function(cont_vars) {
dat %>%
ggplot(aes(x = distance, fill = !!cont_vars)) +
geom_point()
}
plotlist_cat_vars <- lapply(cat_vars, func_plot_cat)
plotlist_cont_vars<- lapply(cont_vars, func_plot_cont)
print(plotlist_cat_vars)
print(plotlist_cont_vars)

Show outliers in an efficient manner using ggplot

The actual data (and aim) I have is different but for reproducing purposes I used the Titanic dataset. My aim is create a plot of the age outliers (1 time SD) per class and sex.
Therefore the first thing I did is calculating the sd values and ranges:
library(dplyr)
library(ggplot2)
#Load titanic set
titanic <- read.csv("titanic_total.csv")
group <- group_by(titanic, Pclass, Sex)
#Create outlier ranges
summarise <- summarise(group, mean=mean(Age), sd=sd(Age))
summarise <- as.data.frame(summarise)
summarise$outlier_max <- summarise$mean + summarise$sd
summarise$outlier_min <- summarise$mean - summarise$sd
#Create a key
summarise$key <- paste0(summarise$Pclass, summarise$Sex)
#Create a key for the base set
titanic$key <- paste0(titanic$Pclass, titanic$Sex)
total_data <- left_join(titanic, summarise, by = "key")
total_data$outlier <- 0
Next, using a loop I determine whether the age is inside or outside the range
for (row in 1:nrow(total_data)){
if((total_data$Age[row]) > (total_data$outlier_max[row])){
total_data$outlier[row] <- 1
} else if ((total_data$Age[row]) < (total_data$outlier_min[row])){
total_data$outlier[row] <- 1
} else {
total_data$outlier[row] <- 0
}
}
Do some data cleaning ...
total_data$Pclass.x <- as.factor(total_data$Pclass.x)
total_data$outlier <- as.factor(total_data$outlier)
Now this code gives me the plot I am looking for.
ggplot(total_data, aes(x = Age, y = Pclass.x, colour = outlier)) + geom_point() +
facet_grid(. ~Sex.x)
However, this not really seems like the easiest way to crack this problem. Any thoughts on how I can include best practises to make this more efficients.
One way to reduce your code and make it less repetitive is to get it all into one procedure thanks to the pipe. Instead of creating a summary with the values, re-join this with the data, you could basically do this within one mutate step:
titanic %>%
mutate(Pclass = as.factor(Pclass)) %>%
group_by(Pclass, Sex) %>%
mutate(Age.mean = mean(Age),
Age.sd = sd(Age),
outlier.max = Age.mean + Age.sd,
outlier.min = Age.mean - Age.sd,
outlier = as.factor(ifelse(Age > outlier.max, 1,
ifelse(Age < outlier.min, 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, Pclass, colour = outlier)) +
facet_grid(.~Sex)
Pclass is mutated to a factor in advance, as it is a grouping factor. Then, the steps are done within the original dataframe, instead of creating two new ones. No changes are made to the original dataset however! If you would want this, just reassign the results to titanic or another data frame, and execute the ggplot-part as next step. Else you would assign the result of the figure to your data.
For the identification of outliers, one way is to work with the ifelse. Alternatively, dplyr offers the nice between function, however, for this, you would need to add rowwise, i.e. after creating the min and max thresholds for outliers:
...
rowwise() %>%
mutate(outlier = as.factor(as.numeric(between(Age, outlier.min, outlier.max)))) %>% ...
Plus:
Additionally, you could even reduce your code further, depends on which variables you want to keep in which way:
titanic %>%
group_by(Pclass, Sex) %>%
mutate(outlier = as.factor(ifelse(Age > (mean(Age) + sd(Age)), 1,
ifelse(Age < (mean(Age) - sd(Age)), 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, as.factor(Pclass), colour = outlier)) +
facet_grid(.~Sex)

boxplot: order groups by the mean of a subset of each group

Let's consider this data:
df = data.frame('score'=round(runif(15, 1, 10)),
'group'=paste0("a",rep(c(1,2,3),each=5)),
'category'=rep(c("big", "big", "big", "big", "small"), 3))
I would like to plot boxplots of this data with ggplot2. What i want is: boxplot(score~group), but with the boxplots arranged according to the mean of the "big" individuals of each group.
I can't figure it out in a simple way, without creating new variables. OK to use Dplyr. Thanks.
I don't know if this qualifies as a simple way, I personally find it simple, but I use dplyr to find the means:
#find the means for each group
library(dplyr)
means <-
df %>%
#filter out small since you only need category equal to 'big'
filter(category=='big') %>%
#use the same groups as in the ggplot
group_by(group) %>%
#calculate the means
summarise(mean = mean(score))
#order the groups according to the order of the means
myorder <- means$group[order(means$mean)]
In this case the order is:
> myorder
[1] a1 a2 a3
In order to arrange the order of the boxplots according to the above you just need to do:
library(ggplot2)
ggplot(df, aes(group, score)) +
geom_boxplot() +
#you just need to use scale_x_discrete with the limits argument
#to pass in details of the order of appearance for the boxplots
#in this case the order is the myorders vector
scale_x_discrete(limits=myorder)
And that's it.

Plotting Average/Median of each column in data frame grouped by factors

I am trying to make a grouped barplot and I am running into trouble. For example, if I was using the mtcars dataset and I wanted to group everything by the 'vs' column (col #8), find the average of all remaining columns, and then plot them by group.
Below is a very poor example of what I am trying to do and I know it is incorrect.
Ideally, mpg for vs=1 & vs=0 would be side by side, followed by cyl's means side by side, etc. I don't care if aggregate is skipped for dyplr or if ggplot is used or even if the aggregate step is not needed...just looking for a way to do this since it is driving me crazy.
df = mtcars
agg = aggregate(df[,-8], by=list(df$vs), FUN=mean)
agg
barplot(t(agg), beside=TRUE, col=df$vs))
Try
library(ggplot2)
library(dplyr)
library(tidyr)
df %>%
group_by(vs=factor(vs)) %>%
summarise_each(funs(mean)) %>%
gather(Var, Val, -vs) %>%
ggplot(., aes(x=Var, y=Val, fill=vs))+
geom_bar(stat='identity', position='dodge')
Or using base R
m1 <- as.matrix(agg[-1])
row.names(m1) <- agg[,1]
barplot(m1, beside=TRUE, col=c('red', 'blue'), legend=row.names(m1))

Resources