R ggplot: I am having a problem drawing an R histogram

I am working with the built-in esoph dataset. My task is to draw a histogram of the "ncontrols" variable for each age group in the dataset.
Here is the code I wrote. First, I group_by on agegp (age groups), calculate the total ncontrols (number of control cases) for each age group, and rename both agegp and ncontrols to something more readable:
library(tidyverse)
library(datasets)
library(ggplot2)
data_esoph <- esoph %>% group_by(agegp) %>%
summarise(Total_number_of_control_case = sum(ncontrols)) %>%
rename(Age_group = agegp)
Then I try to draw a histogram using geom_histogram
plot_histogram <- ggplot(data_esoph, aes(x = Age_group)) +
geom_histogram(color = 'black', fill = 'grey70') +
labs(title ="Number of control cases by age group",x = "Age group", y = "Cases")+
theme(axis.title= element_text(size = 12), plot.title = element_text(size = 16))
I run into an error that says
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
I know this error occurs because agegp (Age_group) is a discrete variable. I tried to convert it to numeric, but to no avail. Does anyone have an idea how I can fix this problem and draw a histogram?

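A histogram bins a continuous variable, but here you already have one total per age group, so a bar chart is what you want. You can set stat = "identity" in geom_bar like this: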
library(tidyverse)
df %>%
ggplot(aes(x = Age_group, y = Total_number_of_control_case)) +
geom_bar(stat = "identity") +
labs(title ="Number of control cases by age group",x = "Age group", y = "Cases") +
theme(axis.title= element_text(size = 12), plot.title = element_text(size = 16))
Data
df <- data.frame(Age_group = c("25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
Total_number_of_control_case = c(115,190,167,166,106,31))
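For reference, the same chart can be built straight from esoph without typing the totals by hand. This is only a sketch: it assumes dplyr and ggplot2 are installed, uses geom_col() (shorthand for geom_bar(stat = "identity")), and the names age_totals and total_controls are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# One row per age group, with the summed number of controls
age_totals <- esoph %>%
  group_by(agegp) %>%
  summarise(total_controls = sum(ncontrols), .groups = "drop")

# geom_col() draws bar heights taken directly from the y column
p <- ggplot(age_totals, aes(x = agegp, y = total_controls)) +
  geom_col(colour = "black", fill = "grey70") +
  labs(title = "Number of control cases by age group",
       x = "Age group", y = "Cases")

print(p)
```

Because the x variable stays a factor, no conversion to numeric is needed.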

Related

Adding cumulative quantities to a geom_bar plots drawn with facet_wrap

Newbie here! After a long search I still could not find a satisfying solution to my problem. I have a dataset of heart failure rates (https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records) and I would like to display a series of bar plots where "Survived" and "Dead" are counted per category (i.e. sex, smoking and so on).
I think I have done a decent job of preparing the plots, and they look right to me. The problem is that it is difficult to see the ratio between surviving and dying patients with different characteristics.
I have two ideas, but both of them elude me:
Put a count on top of every bar so that the ratio becomes obvious
Directly show the ratio on every characteristic.
Here is the code I wrote.
library(ggplot2)
library(dplyr) # for %>% and select()
library(tidyr) # for gather()
heart_faliure_data <- read.csv(file = "heart_failure_clinical_records_dataset.csv", header = FALSE, skip=1)
#Prepare Column Names
c_names <- c("Age",
"Anaemia",
"creatinine_phosphokinase",
"diabetes",
"ejection_fraction",
"high_blood_pressure",
"platelets",
"serum_creatinine",
"serum_sodium",
"sex",
"smoking",
"time",
"DEATH_EVENT")
#Apply column names to the dataframe
colnames(heart_faliure_data) <- c_names
# Some Classes like sex, Anaemia, diabetes, high_blood_pressure smoking and DEATH_EVENT are booleans
# (see description of Dataset) and should be transformed into factors
heart_faliure_data$sex <- factor(heart_faliure_data$sex,
levels=c(0,1),
labels=c("Female","Male"))
heart_faliure_data$smoking <- factor(heart_faliure_data$smoking,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$DEATH_EVENT <- factor(heart_faliure_data$DEATH_EVENT,
levels=c(0,1),
labels=c("Survived","Died"))
heart_faliure_data$high_blood_pressure <- factor(heart_faliure_data$high_blood_pressure,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$Anaemia <- factor(heart_faliure_data$Anaemia,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$diabetes <- factor(heart_faliure_data$diabetes,
levels=c(0,1),
labels=c("No","Yes"))
# Adjust Age to a int value
heart_faliure_data$Age <- as.integer(heart_faliure_data$Age)
# selecting the categorical variables and study the effect of each variable on death-event
categorical.heart_failure <- heart_faliure_data %>%
select(Anaemia,
diabetes,
high_blood_pressure,
sex,
smoking,
DEATH_EVENT) %>%
gather(key = "key", value = "value", -DEATH_EVENT)
#Visualizing this effect with a grouped barplot
categorical.heart_failure %>%
ggplot(aes(value)) +
geom_bar(aes(x = value,
fill = DEATH_EVENT),
alpha = .2,
position = "dodge",
color = "black",
width = .7,
stat = "count") +
labs(x = "",
y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key,
scales = "free",
nrow = 4) +
scale_fill_manual(values = c("#FFA500", "#0000FF"),
name = "Death Event",
labels = c("Survived", "Dead"))
And here is a (not so bad) image of the result:
The goal would be to have some numerical value on top of the bars, or even just a y-axis indication...
I would be glad about any help you can give me!
What about something like this? To make it work, I aggregated the data first:
tmp <- categorical.heart_failure %>%
group_by(DEATH_EVENT, key, value) %>%
summarise(n = n())
#Visualizing this effect with a grouped barplot
tmp %>%
ggplot(aes(x = value, y=n)) +
geom_bar(aes(fill = DEATH_EVENT),
alpha = .2,
position = position_dodge(width=1),
color = "black",
width = .7,
stat = "identity") +
geom_text(aes(x=value, y=n*1.1, label = n, group=DEATH_EVENT), position = position_dodge(width=1), vjust=0) +
labs(x = "",
y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key,
scales = "free",
nrow = 4) +
scale_fill_manual(values = c("#FFA500", "#0000FF"),
name = "Death Event",
labels = c("Survived", "Dead")) +
coord_cartesian(ylim=c(0, max(tmp$n)*1.25))
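The second idea from the question, showing the ratio directly, can be handled the same way: compute the share within each category before plotting, then put it in the label. The sketch below is an assumption-laden illustration, not the original answer's code. The toy table and its counts are invented stand-ins for tmp, and share and with_share are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Invented counts standing in for the aggregated `tmp` table
toy <- data.frame(
  DEATH_EVENT = rep(c("Survived", "Died"), times = 4),
  key   = rep(c("sex", "smoking"), each = 4),
  value = c("Female", "Female", "Male", "Male", "No", "No", "Yes", "Yes"),
  n     = c(30, 10, 40, 20, 60, 20, 12, 8)
)

# Share of each outcome within one level of one characteristic
with_share <- toy %>%
  group_by(key, value) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

p <- ggplot(with_share, aes(x = value, y = n, fill = DEATH_EVENT)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(aes(label = sprintf("%.0f (%.0f%%)", n, 100 * share),
                group = DEATH_EVENT),
            position = position_dodge(width = 0.9), vjust = -0.3) +
  facet_wrap(~ key, scales = "free_x")

print(p)
```

The explicit group aesthetic in geom_text keeps each label dodged to the same position as its bar.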

Normal curves on multiple histograms on a same plot

My example dataframe:
sample1 <- seq(100,157, length.out = 50)
sample2 <- seq(113, 167, length.out = 50)
sample3 <- seq(95,160, length.out = 50)
sample4 <-seq(88, 110, length.out = 50)
df <- as.data.frame(cbind(sample1, sample2, sample3, sample4))
I have managed to create histograms for these four variables, which share the same y-axis. Now I need an overlay normal curve. Based on previous posts, I've managed a density curve, but this is not what I want. This comes close, but I'd like a smooth line...
This is my current code for plotting:
df <- as.data.table(df)
new.df<-melt(df,id.vars="sample")
names(new.df)=c("sample","type","value")
cdat <- ddply(new.df, "type", summarise, value.mean=mean(value))
ggplot(data = new.df,aes(x=value)) +
geom_histogram(aes(x = value), bins = 15, colour = "black", fill = "gray") +
facet_wrap(~ type) + geom_density(aes(x = value),alpha=.2, fill="#FF6666") +
geom_vline(data=cdat, aes(xintercept=value.mean),
linetype="dashed", size=1, colour="black") +
theme_classic() +
theme(text = element_text(size = 15), element_line(size = 0.5),aspect.ratio = 0.75 )
And I found the following code, which I hoped would do the trick, but this gives me nothing:
stat_function(fun = dnorm, args = list(mean = mean(df$value), sd = sd(df$value)))
Unfortunately, stat_function doesn't play nicely with facets: it overlays the same function on each facet without taking account of the faceting variable.
One of the most common reasons I see for people posting ggplot questions on Stack Overflow is that they get lost while trying to coerce ggplot to do too much of their data manipulation. Functions like geom_smooth and geom_function are useful helpers for common tasks, but if you want to do something that is complex or uncommon, it is best to produce the data you want to plot, then plot it.
In fact, the main author of ggplot2 recommends this approach for a very similar problem to yours in this thread, saying:
I think you are better off generating the data outside of ggplot2 and then plotting it. See https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse to get started.
Hadley Wickham, 26 April 2018
So here's one way of doing that using tidyverse. You create a data frame of the dnorm for each sample and plot these using plain old geom_line.
Note that your histograms are counts, so you either need to change them to density, or multiply the dnorm output by the number of observations * the binwidth, otherwise you will just get an apparently "flat" line on the x axis, since the dnorm values will all be so small in relation to the counts:
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
dfn <- df %>%
pivot_longer(everything()) %>%
ddply("name", function(x) {
xvar <- seq(min(x$value), max(x$value), length.out = 100)
# scale the density to counts: binwidth (5, as in geom_histogram) * n * density
data.frame(value = xvar,
y = 5 * nrow(x) * dnorm(xvar, mean(x$value), sd(x$value)))
})
df %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mutate(mean = mean(value), sd = sd(value)) %>%
ggplot(aes(value)) +
geom_histogram(aes(x = value), binwidth = 5,
colour = "black", fill = "gray") +
facet_wrap(~ name) +
geom_vline(aes(xintercept = mean),
linetype = "dashed", size=1, colour="black") +
geom_line(data = dfn, aes(y = y)) +
theme_classic() +
theme(text = element_text(size = 15), element_line(size = 0.5),
aspect.ratio = 0.75 )
Created on 2020-12-07 by the reprex package (v0.3.0)
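The scaling rule (count is roughly n * binwidth * density) can be checked numerically in base R. This is a minimal sketch on simulated data; the sample x, the seed, and the variable names are assumptions made for the illustration and are not part of the question's data.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 100, sd = 15)
binwidth <- 5

# Bin edges that cover the whole sample in steps of `binwidth`
breaks <- seq(floor(min(x) / binwidth) * binwidth,
              ceiling(max(x) / binwidth) * binwidth,
              by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# Normal curve rescaled from density units to count units
expected <- length(x) * binwidth * dnorm(h$mids, mean(x), sd(x))

# Side-by-side comparison of observed and expected counts per bin
comparison <- data.frame(mid = h$mids,
                         observed = h$counts,
                         expected = round(expected, 1))
print(comparison)
```

Without the n * binwidth factor, the expected column would be the raw density values, which are tiny next to the counts. That is why the unscaled curve looks flat.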

How to annotate inside the plot when using datetime on the X axis with ggplot2?

I have successfully created a line graph in R using ggplot2 with percentage on the Y axis and date/time on the X axis, but I am unsure how to annotate inside the graph for specific date/time points when there is a high/low peak.
The examples I identified (on R-bloggers & RPubs) are annotated without using date/time, and I have made attempts to annotate it (with ggtext and annotate functions, etc), but got nowhere. Please can you show me an example of how to do this using ggplot2 in R?
The current R code below creates the line graph, but can you help me extend the code to annotate inside of the graph?
sentimentdata <- read.csv("sentimentData-problem.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
sentimentTime <- sentimentdata %>%
filter(between(Hour, 11, 23))
sentimentTime$Datetime <- ymd_hm(sentimentTime$Datetime)
library(zoo)
sentimentTime %>%
filter(Cat %in% c("Negative", "Neutral", "Positive")) %>%
ggplot(aes(x = Datetime, y = Percent, group = Cat, colour = Cat)) +
geom_line() +
scale_x_datetime(breaks = date_breaks("1 hours"), labels = date_format("%H:00")) +
labs(title="Peak time on day of event", colour = "Sentiment Category") +
xlab("By Hour") +
ylab("Percentage of messages")
Data source available via GitHub:
Since you have multiple lines and you want two labels on each line according to the maxima and minima, you could create two small dataframes to pass to geom_text calls.
First we ensure the necessary packages and the data are loaded:
library(lubridate)
library(ggplot2)
library(scales)
library(dplyr)
url <- paste0("https://raw.githubusercontent.com/jcool12/",
"datasets/master/sentimentData-problem.csv")
sentimentdata <- read.csv(url, stringsAsFactors = FALSE)
sentimentdata$Datetime <- dmy_hm(sentimentdata$Datetime)
sentimentTime <- filter(sentimentdata, between(Hour, 11, 23))
Now we can create a max_table and min_table that hold the x and y co-ordinates and the labels for our maxima and minima:
max_table <- sentimentTime %>%
group_by(Cat) %>%
# build the label before Percent is overwritten, so it shows the true maximum
summarise(Datetime = Datetime[which.max(Percent)],
label = paste(trunc(max(Percent), 3), "%"),
Percent = max(Percent) + 3)
min_table <- sentimentTime %>%
group_by(Cat) %>%
# same ordering here: label first, then shift Percent to place the text
summarise(Datetime = Datetime[which.min(Percent)],
label = paste(trunc(min(Percent), 3), "%"),
Percent = min(Percent) - 3)
Which allows us to create our plot without much trouble:
sentimentTime %>%
filter(Cat %in% c("Negative", "Neutral", "Positive")) %>%
ggplot(aes(x = Datetime, y = Percent, group = Cat, colour = Cat)) +
geom_line() +
geom_text(data = min_table, aes(label = label)) + # minimum labels
geom_text(data = max_table, aes(label = label)) + # maximum labels
scale_x_datetime(breaks = date_breaks("1 hours"),
labels = date_format("%H:00")) +
labs(title="Peak time on day of event", colour = "Sentiment Category") +
xlab("By Hour") +
ylab("Percentage of messages")
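If only a single point needs a label, annotate() also works on a datetime axis, as long as the x value is given as a POSIXct value rather than a string. A minimal sketch with invented data follows; the toy data frame, its values, and the UTC time zone are assumptions for illustration.

```r
library(ggplot2)

# Invented hourly series, starting at 11:00
toy <- data.frame(
  Datetime = as.POSIXct("2020-01-01 11:00", tz = "UTC") + 3600 * 0:5,
  Percent  = c(10, 25, 40, 30, 15, 20)
)

# Row holding the highest value
peak <- toy[which.max(toy$Percent), ]

p <- ggplot(toy, aes(x = Datetime, y = Percent)) +
  geom_line() +
  # x must have the same class as the axis variable (POSIXct here)
  annotate("text", x = peak$Datetime, y = peak$Percent + 2,
           label = paste0(peak$Percent, "%"))

print(p)
```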

Plot two variables in bar plot side by side using ggplot2

I think I need to use the melt function, but I'm not sure how to do so. Sample data, code, and resulting graph below. Basically, the "cnt" column is made up of the "registered" plus the "casual" for each row. I want to display the total "registered" vs the total "casual" per month, instead of the overall total "cnt".
example data
#Bar Chart
bar <- ggplot(data=subset(bikesharedailydata, !is.na(mnth)), aes(x=mnth, y=cnt)) +
geom_bar(stat="identity", position="dodge") +
coord_flip() +
labs(title="My Bar Chart", subtitle = "Total Renters per Month", caption = "Caption", x = "Month", y = "Total Renters") +
mychartattributes
To "melt" your data use reshape2::melt:
library(ggplot2)
library(reshape2)
# Subset your data
d <- subset(bikesharedailydata, !is.na(mnth))
# Select columns that you will plot
d <- d[, c("mnth", "registered", "casual")]
# Melt according to month
d <- melt(d, "mnth")
# Set fill by variable (registered or casual)
ggplot(d, aes(mnth, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(title="My Bar Chart", subtitle = "Total Renters per Month",
caption = "Caption",
x = "Month", y = "Total Renters")
Use tidyr and dplyr:
set.seed(1000)
library(dplyr)
library(tidyr)
library(ggplot2)
bikesharedailydata <- data.frame(month = month.abb, registered = rpois(n = 12, lambda = 2), casual =rpois(12, lambda = 1))
bikesharedailydata %>% gather(key="type", value = "count", -month) %>%
ggplot(aes(x=month, y=count, fill = type))+geom_bar(stat = "identity", position = "dodge")
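One caveat when the data has several rows per month (daily records, for example): with stat = "identity" and position = "dodge", the bars for the same month and type may be drawn on top of each other instead of being added up. Summing before plotting avoids that. The sketch below assumes tidyr's pivot_longer() (the newer replacement for gather()) and uses an invented daily table; riders, total, and monthly are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Invented daily records: two days in each of two months
daily <- data.frame(
  mnth       = c(1, 1, 2, 2),
  registered = c(100, 150, 200, 250),
  casual     = c(10, 20, 30, 40)
)

# Reshape to long form, then total the riders per month and type
monthly <- daily %>%
  pivot_longer(c(registered, casual),
               names_to = "type", values_to = "riders") %>%
  group_by(mnth, type) %>%
  summarise(total = sum(riders), .groups = "drop")

p <- ggplot(monthly, aes(x = factor(mnth), y = total, fill = type)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "Month", y = "Total Renters")

print(p)
```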

Set the Axis values (in an animation)

How do I stop the Y-axis changing during an animation?
The graph I made is at http://i.imgur.com/EKx6Tw8.gif
The idea is to make an animated heatmap of population and income each year. The problem is that the y axis sometimes jumps to include 0 or to exclude the highest value. How do you fix the axis values? I know this must be a common issue, but I can't find the answer.
The code to recreate it is
library(gapminder)
library(ggplot2)
library(devtools)
install_github("dgrtwo/gganimate")
library(gganimate)
library(animation) # provides saveGIF()
library(dplyr)
mydata <- dplyr::select(gapminder, country,continent,year,lifeExp,pop,gdpPercap)
#bin life expectancy into 5 year bins
mydata$lifeExp2 <- as.integer(round((mydata$lifeExp-2)/5)*5)
mydata$income <- cut(mydata$gdpPercap, breaks=c(0,250,500,750,1000,1500,2000,2500,3000,3500,4500,5500,6500,7500,9000,11000,21000,31000,41000, 191000),
labels=c(0,250,500,750,1000,1500,2000,2500,3000,3500,4500,5500,6500,7500,9000,11000,21000,31000,41000))
sizePer <- mydata%>%
group_by(lifeExp2, income, year)%>%
mutate(popLikeThis = sum(pop))%>%
group_by(year)%>%
mutate(totalPop = sum(as.numeric(pop)))%>%
mutate(per = (popLikeThis/totalPop)*100)
sizePer$percent <- cut(sizePer$per, breaks=c(0,.1,.3,1,2,3,5,10,20,Inf),
labels=c(0,.1,.3,1,2.0,3,5,10,20))
saveGIF({
for(i in c(1997,2002,2007)){
print(ggplot(sizePer %>% filter(year == i),
aes(x = lifeExp2, y = income)) +
geom_tile(aes(fill = percent)) +
theme_bw()+
theme(legend.position="top", plot.title = element_text(size=30, face="bold",hjust = 0.5))+
coord_cartesian(xlim = c(20,85), ylim = c(0,21)) +
scale_fill_manual("%",values = c("#ffffcc","#ffeda0","#fed976","#feb24c","#fd8d3c","#fc4e2a","#e31a1c","#bd0026","#800026"),drop=FALSE)+
annotate(x=80, y=3, geom="text", label=i, size = 6) +
annotate(x=80, y=1, geom="text", label="#iamreddave", size = 5) +
ylab("Income") + # y-axis label
xlab("Life Expectancy")+
ggtitle("Worldwide Life Expectancy and Income")
)}
}, interval=0.7,ani.width = 900, ani.height = 600)
Solution:
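Add scale_y_discrete(drop = FALSE) to the ggplot call. The y variable is a factor, and by default ggplot2 drops factor levels that do not occur in the data for a given frame, which is why the axis changes between frames. Answered by @bdemarest in the comments.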
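A small sketch of why this helps, using invented data: subsetting one year keeps all factor levels in the data, but the default discrete scale would only show the levels that actually occur in that subset. The names all_data and frame_data and their values are assumptions for illustration.

```r
library(ggplot2)

# Invented data: 'income' has three levels, but the 1997 frame uses only two
all_data <- data.frame(
  lifeExp2 = c(50, 60, 70),
  income   = factor(c("low", "mid", "high"),
                    levels = c("low", "mid", "high")),
  year     = c(1997, 1997, 2002)
)
frame_data <- subset(all_data, year == 1997)

# drop = FALSE keeps the unused level "high" on the y axis,
# so every frame is drawn with the same set of y positions
p <- ggplot(frame_data, aes(x = lifeExp2, y = income)) +
  geom_tile() +
  scale_y_discrete(drop = FALSE)

print(p)
```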
