Ggplot - How to present the mean of a third varience? - r

Let's say I have this data frame:
The data frame
I want to make a graph which presents for each SES (Social Economy Status) what is the mean income for females and what is the mean income for males.
I have so far this code:
ggplot(incomeSorted, aes(GENDER)) +
scale_y_continuous("Mean")+
geom_bar(position = "dodge")+
facet_wrap("SES")
and this is the output:
How do I make the graph to present the mean of income instead of counting the number of females and males at each category?
Thanks ahead!

If you want to display mean income, you have to compute it. You can use dplyr and group_by() with summarise() to obtain the key variable and then plot. Here a code for the task:
library(ggplot2)
library(dplyr)
#Data
df <- data.frame(id=1:8,Gender=c(rep('Female',4),rep(c('Male','Female'),2)),
income=c(73,150,220.18,234,314.16,983.1,1001,1012),
SES=c('Bottom','Bottom','Middle','Middle','Middle',
'Upper','Upper','Upper'),
stringsAsFactors = F)
#Compute and plot
df %>% group_by(SES,Gender) %>%
summarise(MeanIncome=mean(income,na.rm=T)) %>%
ggplot(aes(x=Gender,y=MeanIncome)) +
scale_y_continuous("Mean")+
geom_bar(stat = 'identity')+
facet_wrap(.~SES)
Output:
Or you can avoid facets and displaying the plot with a fill variable like this:
#Code 2
df %>% group_by(SES,Gender) %>%
summarise(MeanIncome=mean(income,na.rm=T)) %>%
ggplot(aes(x=Gender,y=MeanIncome,fill=SES)) +
scale_y_continuous("Mean")+
geom_bar(stat = 'identity',position = position_dodge2(0.9,preserve = 'single'))
Output:

Related

Plot based on descending value of a variable

I want to create a plot that shows the relationship between countries (categorical), their government type (4 categories, including NA), and the proportion of covid deaths to population. I want to show the 30 countries with the highest death proportion and if there is a relationship with the government type.
Right now the countries are plotted in alphabetical order, but I would like to plot the death proportion in descending order. I can't seem to figure out how to do this. Thanks!
library(tidyverse)
library(lubridate)
library(readr)
Governmental System, Country, Proportion of Deaths to Population
covid_data <- read_csv(here::here("data/covid_data.csv"))
covid_data <- covid_data %>%
mutate(death_proportion = total_deaths / population)
covid_data[with(covid_data, order(-death_proportion)), ] %>%
head(30) %>%
ggplot(aes(x = death_proportion,
y = country,
color = government)) +
geom_point()
I think you just need to use forcats::fct_reorder to set the order of you countries by the plotting variable.
Check this example:
library(tidyverse)
mtcars %>%
rownames_to_column(var = "car_name") %>%
mutate(car_name = fct_reorder(car_name, desc(mpg))) %>%
ggplot(aes(x = mpg,
y = car_name,
color = factor(cyl))) +
geom_point()
Created on 2021-03-16 by the reprex package (v1.0.0)

Using geom_smooth for fitting a glm to fractions

This post is somewhat related to this post.
Here I have xy grouped data where y are fractions:
library(dplyr)
library(ggplot2)
library(ggpmisc)
set.seed(1)
df1 <- data.frame(value = c(0.8,0.5,0.4,0.2,0.5,0.6,0.5,0.48,0.52),
age = rep(c("d2","d4","d45"),3),
group = c("A","A","A","B","B","B","C","C","C")) %>%
dplyr::mutate(time = as.integer(age)) %>%
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df1$group_age <- factor(df1$group_age,levels=unique(df1$group_age))
What I'm trying to achieve is to plot df1 as a bar plot, like this:
ggplot(df1,aes(x=group_age,y=value,fill=age)) +
geom_bar(stat='identity')
But I want to fit to each group a binomial glm with a logit link function, which estimates how these fractions are affected by time.
Let's say I have 100 observations per each age (time) in each group:
df2 <- do.call(rbind,lapply(1:nrow(df1),function(i){
data.frame(age=df1$age[i],group=df1$group[i],time=df1$time[i],group_age=df1$group_age[i],value=c(rep(T,100*df1$value[i]),rep(F,100*(1-df1$value[i]))))
}))
Then the glm for each group (e.g., group A) is:
glm(value ~ time, dplyr::filter(df2, group == "A"), family = binomial(link='logit'))
So I would like to add to the plot above the estimated regression slopes for each group along with their corresponding p-values (similar to what I'm doing for the continuous df$value in this post).
I thought that using:
ggplot(df1,aes(x=group_age,y=value,fill=age)) +
geom_bar(stat='identity') +
geom_smooth(data=df2,mapping=aes(x=group_age,y=value,group=group),color="black",method='glm',method.args=list(family=binomial(link='logit')),size=1,se=T) +
stat_poly_eq(aes(label=stat(p.value.label)),formula=my_formula,parse=T,npcx="center",npcy="bottom") +
scale_x_log10(name="Age",labels=levels(df$age),breaks=1:length(levels(df$age))) +
facet_wrap(~group) + theme_minimal()
Would work but I get the error:
Error in Math.factor(x, base) : ‘log’ not meaningful for factors
Any idea how to get it right?
I believe this could help:
library(tidyverse)
library(broom)
df2$value <- as.numeric(df2$value)
#Estimate coefs
dfmodel <- df2 %>% group_by(group) %>%
do(fitmodel = glm(value ~ time, data = .,family = binomial(link='logit')))
#Extract coeffs
dfCoef = tidy(dfmodel, fitmodel)
#Create labels
dfCoef %>% filter(term=='(Intercept)') %>% mutate(Label=paste0(round(estimate,3),'(p=',round(p.value,3),')'),
group_age=paste0(group,'_','d4')) %>%
select(c(group,Label,group_age)) -> Labels
#Values
df2 %>% group_by(group,group_age) %>% summarise(value=sum(value)) %>% ungroup() %>%
group_by(group) %>% filter(value==max(value)) %>% select(-group_age) -> values
#Combine
Labels %>% left_join(values) -> Labels
Labels %>% mutate(age=NA) -> Labels
#Plot
ggplot(df2,aes(x=group_age,y=value,fill=age)) +
geom_text(data=Labels,aes(x=group_age,y=value,label=Label),fontface='bold')+
geom_bar(stat='identity')+
facet_wrap(.~group,scales='free')
Thanks to Pedro Aphalo this is nearly a complete solution:
Generate the data.frame with the fractions (here use time as an integer by deleting "d" in age rather than using time as the levels of age):
library(dplyr)
library(ggplot2)
library(ggpmisc)
set.seed(1)
df1 <- data.frame(value = c(0.8,0.5,0.4,0.2,0.5,0.6,0.5,0.48,0.52),
age = rep(c("d2","d4","d45"),3),
group = c("A","A","A","B","B","B","C","C","C")) %>%
dplyr::mutate(time = as.integer(gsub("d","",age))) %>%
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df1$group_age <- factor(df1$group_age,levels=unique(df1$group_age))
Inflate df1 to 100 observations per each age in each group but specify value as an integer rather than a binary:
df2 <- do.call(rbind,lapply(1:nrow(df1),function(i){
data.frame(age=df1$age[i],group=df1$group[i],time=df1$time[i],group_age=df1$group_age[i],value=c(rep(1,100*df1$value[i]),rep(0,100*(1-df1$value[i]))))
}))
And now plot it using geom_smooth and stat_fit_tidy:
ggplot(df1,aes(x=time,y=value,group=group,fill=age)) +
geom_bar(stat='identity') +
geom_smooth(data=df2,mapping=aes(x=time,y=value,group=group),color="black",method='glm',method.args=list(family=binomial(link='logit'))) +
stat_fit_tidy(data=df2,mapping=aes(x=time,y=value,group=group,label=sprintf("P = %.3g",stat(x_p.value))),method='glm',method.args=list(formula=y~x,family=binomial(link='logit')),parse=T,label.x="center",label.y="top") +
scale_x_log10(name="Age",labels=levels(df2$age),breaks=unique(df2$time)) +
facet_wrap(~group) + theme_minimal()
Which gives (note that the scale_x_log10 is mainly a cosmetic approach to presenting the x-axis as time rather than levels of age):
The only imperfection is that the p-values seem to appear messed up.

ggplot2 bar plot by two groups and mean of y variable

I'm trying to create a bar plot for which I have two groups and the y variable is the mean of one of those groups.
Sample Bar Graph
So looking at the above bar graph in the photo, I have bars grouped by country and prosocial, and on the y-axis I have taken the fraction of prosocial individuals. I am only able, however, to create a bar plot that only takes the mean of prosocial and groups it by country. Basically, it's just one bar per county. Which is not exactly what I'm looking for. So far this is the code I've been using to group the data for the bar plot, which has been somewhat unsuccessful.
plotData <- myData2[!is.na(myData2$prosocial),]
plotData <- plotData %>%
mutate(mean_prosocial = mean(prosocial)) %>%
group_by(country) %>%
summarise(mean_prosocial = mean(prosocial),se = sd(prosocial) / sqrt(n()))
This only groups by country and if I want to group by prosocial as well, I obviously just get NAs for the mean variable. Below is a link to the working data:
workable data.
Thanks.
Say you want to find the fraction of prosocial/non-prosocial across countries:
require(dplyr)
require(ggplot2)
First find how many observations in each country. Later it will be used in fraction calculation.
count_country <- myData2 %>%
filter(!is.na(prosocial)) %>%
group_by(country) %>%
summarise(n = length(country)) %>%
ungroup
Next find the number of prosocial/non-prosocial count across countries.
count_prosocial <- myData2 %>%
filter(!is.na(prosocial)) %>%
group_by(country, prosocial) %>%
summarise(n = length(prosocial)) %>%
mutate(prosocial = as.factor(prosocial))
Merge two dataframes by country name and find the fractions:
df <- count_prosocial %>%
left_join(count_country, by = "country") %>%
mutate(frac = round(n.x / n.y, 2))
Display fractions across different countries using facet_wrap:
ggplot(data=df, aes(x=prosocial, y=frac, fill=prosocial)) +
geom_bar(stat = "identity")+
geom_text(aes(x=prosocial, y=frac, label = frac),
position = position_dodge(width = 1),
vjust = 2, size = 3, color = "white", fontface = "bold")+
facet_wrap(~country)+
labs(y = "Fraction of prosocial/non-prosocial") +
scale_fill_discrete(labels=c("Prosocial", "Individualist"))+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())

ggplot2() bar chart and dplyr() grouped and overall data in R

I'd like to make a stacked proportional bar chart representing the prevalence of diabetes in a cohort of individuals residing in towns A, B, and C. I'd also like the plot to feature a bar representing the entire cohort.
I'm happy with the below plot, but I'd like to know if there is a way of incorporating the pre-processing step into the processing step, ie piping it with dplyr()?
Thanks!
Starting point (df):
dfa <- data.frame(town=c("A","A","A","B","B","C","C","C","C","C"),diabetes=c("y","y","n","n","y","n","y","n","n","y"),heartdisease=c("n","y","y","n","y","y","n","n","n","y"))
Pre-processing:
dfb <- rbind(dfa, transform(dfa, town = "ALL"))
Processing and plot:
library(dplyr)
library(ggplot)
dfc <- dfb %>%
group_by(town) %>%
count(diabetes) %>%
mutate(prop = n / sum(n))
ggplot(dfc, aes(x = town, y = prop, fill = diabetes)) +
geom_bar(stat = "identity") +
coord_flip()
Like this:
dfc <- dfa %>%
bind_rows(dfa %>%
mutate(town = "ALL")) %>%
group_by(town) %>%
count(diabetes) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = town, y = prop, fill = diabetes)) +
geom_bar(stat = "identity") +
coord_flip()
EDIT: added pre-processing into pipeline using bind_rows and mutate instead of rbind and transform

Plotting grouped probabilities in R

I'm new to R and I'm trying to graph probability of flight delays by hour of day. Probability of flight delays would be calculated using a "Delays" column of 1's and 0's.
Here's what I have. I was trying to put a custom function into fun.y, but it doesn't seem like it's allowed.
library(ggplot2)
ggplot(data = flights, aes(flights$HourOfDay, flights$ArrDelay)) +
stat_summary(fun.y = (sum(flights$Delay)/no_na_flights), geom = "bar") +
scale_x_discrete(limits=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25)) +
ylim(0,500)
What's the best way to do this?
Thanks in advance.
I am not sure if that is what you wanted, but I did it in the following way:
library(ggplot2)
library(dplyr)
library(nycflights13)
probs <- flights %>%
# Testing whether a delay occurred for departure or arrival
mutate(Delay = dep_delay > 0 | arr_delay > 0) %>%
# Grouping the data by hour
group_by(hour) %>%
# Calculating the proportion of delays for each hour
summarize(Prob_Delay = sum(Delay, na.rm = TRUE) / n()) %>%
ungroup()
theme_set(theme_bw())
ggplot(probs) +
aes(x = hour,
y = Prob_Delay) +
geom_bar(stat = "identity") +
scale_x_continuous(breaks = 0:24)
Which gives the following plot:
I think it is always better to do data manipulation outside ggplot, using for instance dplyr.

Resources