Overlay a frequency polygon over a bar plot with non count stat - r

I would like to overlay a frequency polygon over a bar plot where stat = 'identity' and not count. This answer works with count data but not when you are using summarised data. Take this example below:
Data
data <- tibble(my_factors = c(1,1,1,2,2,2,3,3,3,4,4,4),
total = c(10,20,30,40,50,60,70,80,90,100,110,120))
Group by factors and plot total
data %>%
group_by(my_factors) %>%
summarise(total = sum(total)) %>%
ungroup() %>%
ggplot(aes(my_factors, total)) +
geom_bar(stat = 'identity')
Desired output
In this case it's a fairly linear line but would a 'smooth' line also be possible?
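One possible approach, sketched here under the assumption that the "polygon" is just a line through the tops of the bars: summarise first, then add geom_line() and geom_point() on the same x/y mapping. For a smoothed version, geom_smooth() can be layered on the same data. The quadratic lm fit below is an illustrative choice rather than the only option, and the name summed is made up for this example.

```r
library(dplyr)
library(ggplot2)

data <- tibble(my_factors = c(1,1,1,2,2,2,3,3,3,4,4,4),
               total = c(10,20,30,40,50,60,70,80,90,100,110,120))

# Summarise once so that every layer shares the same pre-computed totals
summed <- data %>%
  group_by(my_factors) %>%
  summarise(total = sum(total)) %>%
  ungroup()

p <- ggplot(summed, aes(my_factors, total)) +
  geom_bar(stat = 'identity') +
  # straight segments joining the bar tops
  geom_line(colour = "red") +
  geom_point(colour = "red") +
  # smoothed alternative (assumed choice: quadratic fit, no confidence band)
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, colour = "blue")
p
```

Because the bars use stat = 'identity', the line layers need no special stat of their own; they simply reuse the summarised values.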

Related

bicolor heatmap with factor levels

I have this dataframe:
set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace=TRUE), levels=1:100),
year = factor(sample(1950:2019, 10000, replace=TRUE), levels=1950:2019)) %>% unique() %>% arrange(id, year)
And I'm looking to plot a heatmap where the ids are on the X-axis, the years are on the Y-axis, and the color is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to change the fill argument to get the two colors:
ggplot(df, aes(id, year, fill= year)) +
geom_tile()
The reason for plotting both variables as factors is to show every level even when some year doesn't have any id (in which case its whole row should be red).
EDIT:
Two things I forgot to add (hope it's not too late):
How to add alpha transparency to geom_tile() without messing it up?
I need to sort the ids from maximum missings to minimum missings.
The complete() function from the tidyr package is useful for filling in missing combinations. First, set a flag variable to indicate that the data is present, then expand the data frame with the missing combinations and fill the flag variable with FALSE for those rows:
df <- df %>%
mutate(flag = TRUE) %>%
complete(id, year, fill = list(flag = FALSE))
ggplot(df, aes(id, year, fill = flag)) +
geom_tile()
EDIT1: To add transparency, add alpha = 0.x within geom_tile(), where x is a value indicating the transparency. The lower the value, the more transparent.
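As a minimal sketch of the transparency edit, assuming the flag column is built with the complete() step above (the value 0.5 and the blue/red scale are illustrative choices, not part of the original code):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace = TRUE), levels = 1:100),
                 year = factor(sample(1950:2019, 10000, replace = TRUE), levels = 1950:2019)) %>%
  unique() %>%
  mutate(flag = TRUE) %>%
  # add every missing id/year combination and mark it as absent
  complete(id, year, fill = list(flag = FALSE))

p <- ggplot(df, aes(id, year, fill = flag)) +
  # alpha is set outside aes(), so it applies uniformly to all tiles
  geom_tile(alpha = 0.5) +
  scale_fill_manual(values = c(`TRUE` = "blue", `FALSE` = "red"))
p
```

Setting alpha as a fixed parameter rather than mapping it inside aes() keeps the legend and the fill colors unchanged.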
EDIT2: To sort by missingness add the following code prior to the ggplot code:
# Determine the order of the IDs
df_order <- df %>%
group_by(id) %>%
summarize(sum = sum(flag)) %>% # number of years present for each id
arrange(sum) %>% # fewest present (most missing) first
mutate(order = row_number()) %>%
select(id, order)
# Set the IDs in order on the chart
df <- df %>%
left_join(df_order, by = "id") %>%
mutate(id = fct_reorder(id, order))
I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) which denotes data is present for that id and year. Then use complete to fill the missing years for each id and plot it.
library(tidyverse)
df %>%
mutate_all(~as.integer(as.character(.))) %>%
mutate(data_exist = 1) %>%
complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
mutate(data_exist = factor(data_exist)) %>%
ggplot() + aes(id, year, fill= data_exist) + geom_tile()
With expand.grid you can create a data frame with all combinations of ids and years, then left join df onto these combinations to see which ones were present in df:
all <- expand.grid(id=levels(df$id),year=levels(df$year)) %>%
left_join(mutate(df, present = TRUE), by = c("id", "year")) %>% # mark the combinations that exist in df
mutate(present=ifelse(is.na(present),'0','1'))
ggplot(all, aes(as.numeric(id), as.numeric(year), fill= present)) +
geom_tile() +
scale_fill_manual(values=c('0'='red','1'='blue')) + # change default colors
theme(legend.position="None") # hide legend

I wish to change the variable on the y-axis of my bar graph

I am trying to plot proportions on the y-axis of my bar graph, as opposed to the usual count.
I am doing something like the following:
ggplot(data=mpg, aes(model))+geom_bar(aes(y=stat(count/sum(count)))
I am getting a blank plot.
Are you talking about coord_flip()? That will rotate your chart 90 degrees.
EDITED, Added below:
Try this below
ggplot(data=mpg)+
geom_bar(mapping=aes(x=model, y=..prop.., group=1))
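In ggplot2 3.4.0 and later the ..prop.. notation is deprecated in favour of after_stat(). A minimal sketch of the same plot in that notation, assuming a reasonably recent ggplot2:

```r
library(ggplot2)

# group = 1 makes the proportions sum to 1 over all models,
# rather than each bar being 100% of its own group
p <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = model, y = after_stat(prop), group = 1))
p
```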
You need to create a data frame containing the proportions first, then use stat = "identity".
library(tidyverse)
mtcars %>%
as_tibble %>%
group_by(cyl) %>%
summarize(prop = n()/nrow(.)) %>%
ggplot() +
geom_bar(aes(cyl, prop), stat = "identity")

ggalluvial: How do I plot an alluvial diagram when I have a dataframe with links and nodes?

I have this dataframe with timepoints (a, b and c), labels (l1, l2, l3) and frequencies that are distributed over the timepoints and labels.
I want to create a sankey diagram with the ggalluvial package in R.
Here's some code:
library(tidyverse)
library(forcats)
library(ggalluvial)
library(magrittr)
plotAlluvial <- function(.df,name=freq) {
y_name <- enquo(name)
ggplot(.df,
aes(
x = tp,
stratum = lbl,
alluvium = id,
label=lbl,
fill = lbl,
y=!!y_name
)
) +
geom_stratum() +
geom_flow(stat = "flow", color = "darkgray") +
geom_text(stat = "stratum") +
scale_fill_brewer(type = "qual", palette = "Set2")
}
x1=c(6,0,0,5,5,4,2,0,3)
x2=c(5,5,3,0,0,5,0,7,0)
df=tibble(tp1=rep(c('a','b'),each=9),
lbl1=c(rep(c('l1','l2','l3'),2,each=3)),
tp2=rep(c('b','c'),each=9),
lbl2=c(rep(c('l1','l2','l3'),6)),
freq=c(x1,x2)
)
df2=df %>%
mutate(id=row_number()) %>%
unite(un1,c(tp1,lbl1)) %>%
unite(un2,c(tp2,lbl2)) %>%
tidyr::gather(key,value,-c(freq,id)) %>%
separate('value',c('tp','lbl'))
df2.left= df2 %>%
dplyr::filter(!(key=='un1' & tp=='b'))
df2.right= df2 %>%
dplyr::filter(!(key=='un2' & tp=='b'))
I can plot the left side and plot the right side of the diagram I want:
plotAlluvial(df2.left)
plotAlluvial(df2.right)
But if I try to plot the left and right side at the same time I get this plot:
plotAlluvial(df2)
When I use the code above, the diagram shows too many frequencies at timepoint b. Its stratum should be as high as the other two strata, i.e. have a height of 25.
What am I doing wrong? How can I create a diagram that combines the first two plots?
EDIT:
Following a comment, I added a variable holding the proportion of the frequencies. Now stratum b has the correct height, but the incoming and outgoing flows still only occupy 50% of each condition at timepoint b.
df2 %<>% group_by(tp) %>% mutate(prop = freq / sum(freq)) %>%
ungroup()
plotAlluvial(df2,prop)

ggplot2 bar plot by two groups and mean of y variable

I'm trying to create a bar plot for which I have two groups and the y variable is the mean of one of those groups.
Sample Bar Graph
Looking at the bar graph in the photo above, the bars are grouped by country and prosocial, and the y-axis shows the fraction of prosocial individuals. However, I am only able to create a bar plot that takes the mean of prosocial and groups it by country. Basically, it's just one bar per country, which is not exactly what I'm looking for. This is the code I've been using so far to group the data for the bar plot, without much success:
plotData <- myData2[!is.na(myData2$prosocial),]
plotData <- plotData %>%
mutate(mean_prosocial = mean(prosocial)) %>%
group_by(country) %>%
summarise(mean_prosocial = mean(prosocial),se = sd(prosocial) / sqrt(n()))
This only groups by country and if I want to group by prosocial as well, I obviously just get NAs for the mean variable. Below is a link to the working data:
workable data.
Thanks.
Say you want to find the fraction of prosocial/non-prosocial across countries:
require(dplyr)
require(ggplot2)
First, find how many observations there are in each country. This is used later to calculate the fractions.
count_country <- myData2 %>%
filter(!is.na(prosocial)) %>%
group_by(country) %>%
summarise(n = length(country)) %>%
ungroup
Next, count the prosocial and non-prosocial observations in each country.
count_prosocial <- myData2 %>%
filter(!is.na(prosocial)) %>%
group_by(country, prosocial) %>%
summarise(n = length(prosocial)) %>%
mutate(prosocial = as.factor(prosocial))
Merge two dataframes by country name and find the fractions:
df <- count_prosocial %>%
left_join(count_country, by = "country") %>%
mutate(frac = round(n.x / n.y, 2))
Display fractions across different countries using facet_wrap:
ggplot(data=df, aes(x=prosocial, y=frac, fill=prosocial)) +
geom_bar(stat = "identity")+
geom_text(aes(x=prosocial, y=frac, label = frac),
position = position_dodge(width = 1),
vjust = 2, size = 3, color = "white", fontface = "bold")+
facet_wrap(~country)+
labs(y = "Fraction of prosocial/non-prosocial") +
scale_fill_discrete(labels=c("Prosocial", "Individualist"))+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())

ggplot2() bar chart and dplyr() grouped and overall data in R

I'd like to make a stacked proportional bar chart representing the prevalence of diabetes in a cohort of individuals residing in towns A, B, and C. I'd also like the plot to feature a bar representing the entire cohort.
I'm happy with the plot below, but I'd like to know if there is a way of incorporating the pre-processing step into the processing step, i.e. piping it with dplyr?
Thanks!
Starting point (df):
dfa <- data.frame(town=c("A","A","A","B","B","C","C","C","C","C"),diabetes=c("y","y","n","n","y","n","y","n","n","y"),heartdisease=c("n","y","y","n","y","y","n","n","n","y"))
Pre-processing:
dfb <- rbind(dfa, transform(dfa, town = "ALL"))
Processing and plot:
library(dplyr)
library(ggplot2)
dfc <- dfb %>%
group_by(town) %>%
count(diabetes) %>%
mutate(prop = n / sum(n))
ggplot(dfc, aes(x = town, y = prop, fill = diabetes)) +
geom_bar(stat = "identity") +
coord_flip()
Like this:
dfc <- dfa %>%
bind_rows(dfa %>%
mutate(town = "ALL")) %>%
group_by(town) %>%
count(diabetes) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = town, y = prop, fill = diabetes)) +
geom_bar(stat = "identity") +
coord_flip()
EDIT: added pre-processing into pipeline using bind_rows and mutate instead of rbind and transform

Resources