ggplot2() bar chart and dplyr() grouped and overall data in R - r

I'd like to make a stacked proportional bar chart representing the prevalence of diabetes in a cohort of individuals residing in towns A, B, and C. I'd also like the plot to feature a bar representing the entire cohort.
I'm happy with the plot below, but I'd like to know if there is a way of folding the pre-processing step into the processing step, i.e. piping it with dplyr?
Thanks!
Starting point (df):
dfa <- data.frame(town=c("A","A","A","B","B","C","C","C","C","C"),diabetes=c("y","y","n","n","y","n","y","n","n","y"),heartdisease=c("n","y","y","n","y","y","n","n","n","y"))
Pre-processing:
dfb <- rbind(dfa, transform(dfa, town = "ALL"))
Processing and plot:
library(dplyr)
library(ggplot2)
dfc <- dfb %>%
group_by(town) %>%
count(diabetes) %>%
mutate(prop = n / sum(n))
ggplot(dfc, aes(x = town, y = prop, fill = diabetes)) +
geom_bar(stat = "identity") +
coord_flip()

Like this:
dfc <- dfa %>%
bind_rows(dfa %>%
mutate(town = "ALL")) %>%
group_by(town) %>%
count(diabetes) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = town, y = prop, fill = diabetes)) +
geom_bar(stat = "identity") +
coord_flip()
EDIT: added pre-processing into pipeline using bind_rows and mutate instead of rbind and transform
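As an alternative to computing prop by hand, here is a minimal sketch that lets ggplot2 do the rescaling with position = "fill". It reuses dfa from the question (without the heartdisease column); the names dfb and p are just illustrative, and this is a swapped-in technique rather than the approach in the answer above.

```r
library(dplyr)
library(ggplot2)

dfa <- data.frame(
  town = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
  diabetes = c("y", "y", "n", "n", "y", "n", "y", "n", "n", "y")
)

# Append a copy of every row labelled "ALL" so the whole cohort gets its own bar
dfb <- bind_rows(dfa, mutate(dfa, town = "ALL"))

# position = "fill" rescales each bar to a total of 1,
# so no count() / mutate(prop = ...) step is needed
p <- ggplot(dfb, aes(x = town, fill = diabetes)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(y = "prop")

p
```

Note that in the piped version above, dfc ends up holding the plot object rather than the summarised data frame, because the assignment captures the whole pipeline including the ggplot() call.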

Related

bicolor heatmap with factor levels

I have this dataframe:
set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace=TRUE), levels=1:100),
year = factor(sample(1950:2019, 10000, replace=TRUE), levels=1950:2019)) %>% unique() %>% arrange(id, year)
And I'm looking to plot a heatmap where the ids are on the X-axis, the years are on the Y-axis, and the tile is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to change the fill argument to get the two colors:
ggplot(df, aes(id, year, fill= year)) +
geom_tile()
The objective to plot both variables as factors is to plot them even when some year doesn't have any id (and plotting its whole row as red).
EDIT:
Two things I forgot to add (hope it's not too late):
How to add alpha transparency to geom_tile() without messing it?
I need to sort the ids from maximum missings to minimum missings.
The complete() function from the tidyr package is useful for filling in missing combinations. First, set a flag variable to indicate that the data is present, then expand the data frame with the missing combinations and fill the flag with FALSE for those new rows:
df <- df %>%
mutate(flag = TRUE) %>%
complete(id, year, fill = list(flag = FALSE))
ggplot(df, aes(id, year, fill = flag)) +
geom_tile()
EDIT1: To add transparency, add alpha = 0.x within geom_tile(), where x is a value indicating the transparency. The lower the value, the more transparent.
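To get the blue/red coloring the question asks for, one option is scale_fill_manual() keyed on the logical flag. The sketch below is a hedged illustration on a smaller toy data set (10 ids, 10 years); the names obs, full and p, and the alpha value 0.7, are assumptions for the example only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
# Toy data: a subset of the 10 x 10 id/year combinations is observed
obs <- data.frame(
  id = factor(sample(1:10, 60, replace = TRUE), levels = 1:10),
  year = factor(sample(2000:2009, 60, replace = TRUE), levels = 2000:2009)
) %>%
  unique()

# Flag observed rows, then add every missing id/year combination as FALSE
full <- obs %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

# Blue where data exists, red where it does not; alpha sets tile transparency
p <- ggplot(full, aes(id, year, fill = flag)) +
  geom_tile(alpha = 0.7) +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"))

p
```

Because id and year are factors, complete() expands over all factor levels, so a year with no ids at all still gets a full row of red tiles.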
EDIT2: To sort by missingness add the following code prior to the ggplot code:
# Determine the order of the IDs: fewest observed years (most missing) first
df_order <- df %>%
group_by(id) %>%
summarize(sum = sum(flag)) %>%
arrange(sum) %>%
mutate(order = row_number()) %>%
select(id, order)
# Set the IDs in order on the chart (fct_reorder comes from forcats)
df <- df %>%
left_join(df_order, by = "id") %>%
mutate(id = forcats::fct_reorder(id, order))
I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) which denotes data is present for that id and year. Then use complete to fill the missing years for each id and plot it.
library(tidyverse)
df %>%
mutate_all(~as.integer(as.character(.))) %>%
mutate(data_exist = 1) %>%
complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
mutate(data_exist = factor(data_exist)) %>%
ggplot() + aes(id, year, fill= data_exist) + geom_tile()
With expand.grid you can create a data frame with all combinations of ids and years, then left join df onto these combinations to see which ones were present in df:
all <- expand.grid(id=levels(df$id),year=levels(df$year)) %>%
left_join(df %>% mutate(present=TRUE), by=c("id","year")) %>% # mark observed rows before joining
mutate(present=ifelse(is.na(present),'0','1'))
ggplot(all, aes(as.numeric(id), as.numeric(year), fill= present)) +
geom_tile() +
scale_fill_manual(values=c('0'='red','1'='blue')) + # change default colors
theme(legend.position="None") # hide legend

bar chart of row freq ggplot2

I have the following data:
dataf <- read.table(text = "index,group,taxa1,taxa2,taxa3,total
s1,g1,2,5,3,10
s2,g1,3,4,3,10
s3,g2,1,2,7,10
s4,g2,0,4,6,10", header = T, sep = ",")
I'm trying to make a stacked bar plot of the frequencies of the data so that it counts across the row (not down a column) for each index (s1,s2,s3,s4) and then for each group (g1,g2) of each taxa. I'm only able to figure out how to graph the species of one taxa but not all three stacked on each other.
Here are some examples of what I'm trying to make:
These were made in Google Sheets so they don't look like ggplot, but it would be easier to make them in R with ggplot2 because the real data set is larger.
You would need to reshape the data.
Here is my solution (broken down by plot)
For first plot
library(tidyverse)
##For first plot
prepare_data_1 <- dataf %>% select(index, taxa1:taxa3) %>%
gather(taxa,value, -index) %>%
mutate(index = str_trim(index)) %>%
group_by(index) %>% mutate(prop = value/sum(value))
##Plot 1
prepare_data_1 %>%
ggplot(aes(x = index, y = prop, fill = fct_rev(taxa))) + geom_col()
For second plot
##For second plot
prepare_data_2 <- dataf %>% select(group, taxa1:taxa3) %>%
gather(taxa,value, -group) %>%
mutate(group = str_trim(group)) %>%
group_by(group) %>% mutate(prop = value/sum(value))
##Plot 2
prepare_data_2 %>%
ggplot(aes(x = group, y = prop, fill = fct_rev(taxa))) + geom_col()
##You need to reshape data before doing that.
library(reshape2) # provides melt()
library(ggplot2)
dfm = melt(dataf, id.vars=c("index","group"),
measure.vars=c("taxa1","taxa2","taxa3"),
variable.name="variable", value.name="values")
# theme_gray() goes first so it does not reset the axis text settings in theme()
ggplot(dfm, aes(x = index, y = values, group = variable)) +
geom_col(aes(fill=variable)) +
theme_gray() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.25)) +
geom_text(aes(label = values), position = position_stack(vjust = .5), size = 3)
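Both answers reshape the data from wide to long before plotting. As a hedged sketch of the same idea with tidyr's pivot_longer() (a swapped-in alternative to gather() and melt(), not what either answer used), the names long, p_index and p_group below are illustrative only:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

dataf <- read.table(text = "index,group,taxa1,taxa2,taxa3,total
s1,g1,2,5,3,10
s2,g1,3,4,3,10
s3,g2,1,2,7,10
s4,g2,0,4,6,10", header = TRUE, sep = ",")

# One row per index/taxa pair, with the within-index proportion
long <- dataf %>%
  pivot_longer(taxa1:taxa3, names_to = "taxa", values_to = "value") %>%
  group_by(index) %>%
  mutate(prop = value / sum(value)) %>%
  ungroup()

# Stacked proportions per index
p_index <- ggplot(long, aes(x = index, y = prop, fill = taxa)) +
  geom_col()

# Stacked proportions per group; position = "fill" rescales each bar to 1
p_group <- ggplot(long, aes(x = group, y = value, fill = taxa)) +
  geom_col(position = "fill")
```

Printing p_index or p_group draws the corresponding chart.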

ggalluvial: How do I plot an alluvial diagram when I have a dataframe with links and nodes?

I have this dataframe with timepoints (a, b and c), labels (l1, l2, l3) and frequencies that are distributed over the timepoints and labels.
I want to create a sankey diagram with the ggalluvial package in R.
Here's some code:
library(tidyverse)
library(forcats)
library(ggalluvial)
library(magrittr)
plotAlluvial <- function(.df,name=freq) {
y_name <- enquo(name)
ggplot(.df,
aes(
x = tp,
stratum = lbl,
alluvium = id,
label=lbl,
fill = lbl,
y=!!y_name
)
) +
geom_stratum() +
geom_flow(stat = "flow", color = "darkgray") +
geom_text(stat = "stratum") +
scale_fill_brewer(type = "qual", palette = "Set2")
}
x1=c(6,0,0,5,5,4,2,0,3)
x2=c(5,5,3,0,0,5,0,7,0)
df=tibble(tp1=rep(c('a','b'),each=9),
lbl1=c(rep(c('l1','l2','l3'),2,each=3)),
tp2=rep(c('b','c'),each=9),
lbl2=c(rep(c('l1','l2','l3'),6)),
freq=c(x1,x2)
)
df2=df %>%
mutate(id=row_number()) %>%
unite(un1,c(tp1,lbl1)) %>%
unite(un2,c(tp2,lbl2)) %>%
tidyr::gather(key,value,-c(freq,id)) %>%
separate('value',c('tp','lbl'))
df2.left= df2 %>%
dplyr::filter(!(key=='un1' & tp=='b'))
df2.right= df2 %>%
dplyr::filter(!(key=='un2' & tp=='b'))
I can plot the left side and plot the right side of the diagram I want:
plotAlluvial(df2.left)
plotAlluvial(df2.right)
But if I try to plot the left and right side at the same time I get this plot:
plotAlluvial(df2)
When I use the code above, the stratum at timepoint b is too tall because the frequencies are counted twice there. It should be the same height as the other two strata, i.e. 25.
What am I doing wrong? How can I create a diagram that combines the first two plots?
EDIT:
After a comment I added a proportion of the frequencies variable. Now the stratum b is of the correct height but the incoming and outgoing flows still only occupy 50% of each condition in timepoint b.
df2 %<>% group_by(tp) %>% mutate(prop = freq / sum(freq)) %>%
ungroup()
plotAlluvial(df2,prop)

dplyr() and ggplot2()::geom_tile, filtering a group of summary statistics

I've got a data frame (df) with three categorical variables called site, purchase, and happycustomer.
I'd like to use ggplot2's geom_tile function to create a heat map of customer experience. I'd like site on the x-axis, purchase on the y-axis, and happycustomer as the fill. I'd like the heat map to feature the percentages for the happy customers grouped by site and purchase (i.e. the ones for which the value of happycustomer is y).
My problem is that at the moment the plot features both the happy and the unhappy customers.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("GA","NY","BO","NY","BO","NY","BO","NY","BO","GA","NY","GA","NY","NY","NY"),purchase=c("a1","a2","a1","a1","a3","a1","a1","a3","a1","a2","a1","a2","a1","a2","a1"),happycustomer=c("n","y","n","y","y","y","n","y","n","y","y","y","n","y","n"))
Current code:
library(ggplot2)
library(dplyr)
df %>%
group_by(site, purchase,happycustomer) %>%
summarize(bin = sum(happycustomer==happycustomer)) %>%
group_by(site,happycustomer) %>%
mutate(bin_per = (bin/sum(bin)*100)) %>%
ggplot(aes(site,purchase)) + geom_tile(aes(fill = bin_per),colour = "white") + geom_text(aes(label = round(bin_per, 1))) +
scale_fill_gradient(low = "blue", high = "red")
Here is the solution with two data frames.
happyDF <- df %>%
filter(happycustomer == "y") %>%
group_by(site, purchase) %>%
summarise( n = n() )
totalDF <- df %>%
group_by(site, purchase) %>%
summarise( n = n() )
And the ggplot code:
merge(happyDF, totalDF, by=c("site", "purchase") ) %>%
mutate(prop = 100 * (n.x / n.y) ) %>%
ggplot(., aes(site, purchase)) +
geom_tile(aes(fill = prop),colour = "white") +
geom_text(aes(label = round(prop, 1))) +
scale_fill_gradient(low = "blue", high = "red")
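The same percentage can also be computed in a single pipeline by taking the mean of a logical test within each site/purchase cell. This is a minimal sketch of an alternative, not the answer's method; the name happy is illustrative. Unlike the merge() approach, it keeps cells with no happy customers and shows them as 0.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  site = c("GA","NY","BO","NY","BO","NY","BO","NY","BO","GA","NY","GA","NY","NY","NY"),
  purchase = c("a1","a2","a1","a1","a3","a1","a1","a3","a1","a2","a1","a2","a1","a2","a1"),
  happycustomer = c("n","y","n","y","y","y","n","y","n","y","y","y","n","y","n")
)

# mean() of a logical vector is the share of TRUE values,
# so this is the percentage of happy customers per cell
happy <- df %>%
  group_by(site, purchase) %>%
  summarise(prop = 100 * mean(happycustomer == "y"), .groups = "drop")

p <- ggplot(happy, aes(site, purchase)) +
  geom_tile(aes(fill = prop), colour = "white") +
  geom_text(aes(label = round(prop, 1))) +
  scale_fill_gradient(low = "blue", high = "red")

p
```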

ggplot fill does not work - no errors [MRE]

The ggplot analysis below is intended to show the number of survey responses by date. I'd like to color the bars by the three survey administrations (the Admini variable). While there are no errors thrown, the bars are not colored.
Can anyone point out how/why my bars are not color-coded? THANKS!
library(ggplot2)
library(dplyr)
library(RCurl)
OSTadminDates2<-getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/OSTadminDates.csv")
OSTadminDates<-read.csv(text=OSTadminDates2)
ndate1<-as.Date(OSTadminDates$Date,"%m/%d/%y");ndate1
SurvAdmin<-as.factor(OSTadminDates$Admini)
R<-ggplot(data=OSTadminDates,aes(x=ndate1),fill=Admini,group=1) +
geom_bar(stat = "count",width = .5 )
R
Here's a work-around you could use:
library(ggplot2)
library(dplyr)
library(RCurl)
OSTadminDates2<-getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/OSTadminDates.csv")
OSTadminDates<-read.csv(text=OSTadminDates2)
OSTadminDates$Date<-as.Date(OSTadminDates$Date,"%m/%d/%y")
OSTadminDates$Admini <- factor(OSTadminDates$Admini)
df <- OSTadminDates %>%
group_by(Date, Admini) %>%
summarise(n = n())
ggplot(data = df) +
geom_bar(aes(x = Date, y = n, fill = Admini), stat = "identity")
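The likely reason the original call did not color the bars is that fill=Admini was passed to ggplot() outside aes(), so it was never mapped to the data. A minimal sketch of the direct fix, using made-up toy data in place of the CSV (the data frame toy and its values are hypothetical):

```r
library(ggplot2)

# Hypothetical stand-in for OSTadminDates
toy <- data.frame(
  Date = as.Date(c("2017-01-05", "2017-01-05", "2017-01-06",
                   "2017-02-10", "2017-02-10")),
  Admini = factor(c(1, 1, 1, 2, 2))
)

# fill sits inside aes(), so it is mapped to the Admini column;
# geom_bar() then counts rows per date without a separate summarise() step
p <- ggplot(toy, aes(x = Date, fill = Admini)) +
  geom_bar(stat = "count", width = 0.5)

p
```

Referring to columns of the data frame by name inside aes(), rather than to free-standing vectors such as ndate1, also keeps the mapping tied to the data passed to ggplot().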
