Plotting grouped probabilities in R - r

I'm new to R and I'm trying to graph probability of flight delays by hour of day. Probability of flight delays would be calculated using a "Delays" column of 1's and 0's.
Here's what I have. I was trying to put a custom function into fun.y, but it doesn't seem like it's allowed.
library(ggplot2)
ggplot(data = flights, aes(flights$HourOfDay, flights$ArrDelay)) +
stat_summary(fun.y = (sum(flights$Delay)/no_na_flights), geom = "bar") +
scale_x_discrete(limits=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25)) +
ylim(0,500)
What's the best way to do this?
Thanks in advance.

I am not sure if that is what you wanted, but I did it in the following way:
library(ggplot2)
library(dplyr)
library(nycflights13)
probs <- flights %>%
# Testing whether a delay occurred for departure or arrival
mutate(Delay = dep_delay > 0 | arr_delay > 0) %>%
# Grouping the data by hour
group_by(hour) %>%
# Calculating the proportion of delays for each hour
summarize(Prob_Delay = sum(Delay, na.rm = TRUE) / n()) %>%
ungroup()
theme_set(theme_bw())
ggplot(probs) +
aes(x = hour,
y = Prob_Delay) +
geom_bar(stat = "identity") +
scale_x_continuous(breaks = 0:24)
Which gives the following plot:
I think it is always better to do data manipulation outside ggplot, using for instance dplyr.
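For the original idea of letting ggplot compute the summary itself, here is a minimal sketch. It assumes ggplot2 3.3.0 or later, where the fun argument of stat_summary replaces the deprecated fun.y, and it uses a small made-up data frame (toy, with hypothetical HourOfDay and Delay values) rather than the real flights data. Because Delay is coded as 0/1, its mean within each hour is the proportion of delayed flights.

```r
library(ggplot2)

# Hypothetical toy data: one row per flight, Delay is 1 (delayed) or 0 (on time)
toy <- data.frame(
  HourOfDay = c(6, 6, 6, 7, 7, 8, 8, 8, 8),
  Delay     = c(1, 0, 0, 1, 1, 0, 0, 1, 0)
)

# stat_summary() applies `fun` to the y values found at each distinct x,
# so the mean of a 0/1 column gives the share of delayed flights per hour
p <- ggplot(toy, aes(x = HourOfDay, y = Delay)) +
  stat_summary(fun = mean, geom = "bar") +
  scale_x_continuous(breaks = 6:8) +
  labs(x = "Hour of day", y = "Probability of delay")

# Pull out the values ggplot computed for the bars
bars <- layer_data(p)
```

Printing p draws the chart. The function passed to fun has to take a vector of y values and return a single number, which is why an already evaluated expression such as sum(...)/n does not work there.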

Related

How to normalize ggplot geom_bin2d() for each group on x axis?

Consider the following example:
library(ggplot2)
library(dplyr)
set.seed(1e3)
n <- 1e3
dt <- data.frame(
age = 50*rbeta(n, 5, 1),
value = 1000*rbeta(n, 1, 3)
)
And let's assume that you are interested in the relative behavior of value within each band of age.
dt %>% ggplot(aes(x = age, y = value)) + geom_bin2d() provides an "absolute" map of the data (even with geom_bin2d(aes(fill = ..density..)), which divides each bin by the total count over the whole data set). Is there a way to rescale the counts within each "column" (each age group created by geom_bin2d()), so that comparisons are not biased by the sample size of each group?
I would like to stick with "maps", since they are quite relevant when there is a lot of underlying data, but other approaches are welcome.
When you are trying to do something a bit different from what the standard ggplot summary functions are used for, you often find it is easier to just manipulate the data yourself. For example, you can easily bin the data yourself using findInterval, then normalize each age band using standard dplyr functions. Then you are free to plot however you like, using a plain geom_tile without trying to coax a more complex calculation out of ggplot.
library(ggplot2)
library(dplyr)
dt %>%
mutate(age = seq(10, 50, 2)[findInterval(age, seq(10, 50, 2))]) %>%
mutate(value = seq(0, 1000, 45)[findInterval(value, seq(0, 1000, 45))]) %>%
count(age, value) %>%
group_by(age) %>%
mutate(n = n/sum(n)) %>%
ggplot(aes(age, value, fill = n)) +
geom_tile() +
scale_fill_viridis_c(name = 'normalized counts\nby age band') +
theme_minimal(base_size = 16)
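To check that this kind of per-band normalization does what is intended, a small sketch can help. It is only an illustration under assumptions: the data frame toy is made up, and the bins are built with cut() and coarser, hypothetical break points instead of findInterval. After normalizing, the shares within each age band should add up to 1.

```r
library(dplyr)

# Hypothetical toy data, not the dt from the question
set.seed(1)
toy <- data.frame(age = runif(200, 10, 50), value = runif(200, 0, 1000))

normalized <- toy %>%
  # Bin both variables by hand (break points are arbitrary choices)
  mutate(age_band   = cut(age,   breaks = seq(10, 50, 10),    include.lowest = TRUE),
         value_band = cut(value, breaks = seq(0, 1000, 250), include.lowest = TRUE)) %>%
  # Count observations per cell, then rescale within each age band
  count(age_band, value_band) %>%
  group_by(age_band) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Sanity check: shares should sum to 1 in every age band
check <- normalized %>%
  group_by(age_band) %>%
  summarise(total = sum(share))
```

The normalized table can then be passed to ggplot with geom_tile and fill = share, in the same way as above.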

tile plot for continuous variable in r

I want to see the average departure delay in the flights dataset from nycflights13 by distance and month with a tile plot. I plotted it and I got this:
How can I see it better? I can't understand anything.
This is because the distance column is continuous. A tile plot reads best when both axes are categorical, so you first need to categorise the distance column; one way to do this is with cut_number from ggplot2.
library(ggplot2)
ggplot(nycflights13::flights,
aes(x = cut_number(distance, n = 5),
y = factor(month))) +
geom_tile(aes(fill = dep_delay))
(A tip: next time you ask a question, it is helpful for us to see the code you have written - otherwise it is more difficult to help you. I needed to check which package the flights dataset was from, and what its variables were called).
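Since the goal is the average delay per cell, it may also help to compute that average explicitly before plotting, so each tile is filled with one summary value. The following is a sketch under assumptions: it uses a made-up data frame (toy) with the same column names as flights so that it runs without nycflights13, and the number of distance bands (5) is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for nycflights13::flights (same column names)
set.seed(42)
toy <- data.frame(
  month     = sample(1:3, 300, replace = TRUE),
  distance  = runif(300, 100, 2500),
  dep_delay = c(rnorm(295, mean = 10, sd = 20), rep(NA, 5))
)

avg <- toy %>%
  # Bin the continuous distance into 5 bands with roughly equal counts
  mutate(distance_band = cut_number(distance, n = 5)) %>%
  group_by(distance_band, month) %>%
  # One average per (distance band, month) cell, ignoring missing delays
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

p <- ggplot(avg, aes(x = distance_band, y = factor(month), fill = avg_delay)) +
  geom_tile() +
  labs(x = "Distance band", y = "Month", fill = "Average delay")
```

With the real data, replacing toy by nycflights13::flights should be enough, since the column names match.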
Maybe you want something like this. I divide 'average_delay' into 5 categories so that you get more distinct colors. You can use this code:
library(nycflights13)
library(dplyr)
library(ggplot2)
flights %>%
group_by(month) %>%
mutate(average_delay = mean(dep_delay, na.rm=TRUE)) %>%
ggplot(aes(x = distance, y = month)) +
geom_tile(aes(fill = cut_number(average_delay, n = 5))) +
scale_colour_gradientn(colours = terrain.colors(10)) +
scale_fill_discrete(name = "Average delay")
Output:

Plotting a line graph by datetime with a histogram/bar graph by date

I'm relatively new to R and could really use some help with some pretty basic ggplot2 work.
I'm trying to visualize total number of submissions on a graph, showing the overall total in a line graph and the daily total in a histogram (or bar graph) on top of it. I'm not sure how to add breaks or bins to the histogram so that it takes the submission datetime column and makes each bar the daily total.
I tried adding a column that converts the datetime into just date and plots based on that, but I'd really like the line graph to include the time.
Here's what I have so far:
df <- df %>%
mutate(datetime = lubridate::mdy_hm(datetime))%>%
mutate(date = lubridate::as_date(datetime))
#sort by datetime
df <- df %>%
arrange(datetime)
#add total number of submissions
df <- df %>%
mutate(total = row_number())
#ggplot
line_plus_histo <- df%>%
ggplot() +
geom_histogram(data = df, aes(x=datetime)) +
geom_line(data = df, aes(x=datetime, y=total), col = "red") +
stat_bin(data = df, aes(x=date), geom = "bar") +
labs(
title="Submissions by Day",
x="Date",
y="Submissions",
legend=NULL)
line_plus_histo
As you can see, I'm also calculating the total number of submissions by sorting by time and then adding a column with the row number. So if you can help me use a better method I'd really appreciate it.
Please, find below the line plus histogram of time v. submissions:
Here's the pastebin link with my data
You can extend your data manipulation by:
library(dplyr)
library(lubridate)
df <- df |>
mutate(datetime = lubridate::mdy_hm(datetime)) |>
arrange(datetime) |>
mutate(midday = as_datetime(floor_date(as_date(datetime), unit = "day") + 0.5)) |>
mutate(totals = row_number()) |>
group_by(midday) |>
mutate(N = n())|>
ungroup()
Then use midday for the bars and datetime for the line:
df%>%
ggplot() +
geom_bar(data = df, aes(x = midday)) +
geom_line(data = df, aes(x=datetime, y=totals), col = "red") +
labs(
title="Submissions by Day",
x="Date",
y="Submissions",
legend=NULL)
PS. Sorry for Polish locales on X axis.
PS2. With geom_bar it looks much better
Created on 2022-02-03 by the reprex package (v2.0.1)
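As an alternative sketch of the same idea, the daily totals can be kept in their own small table and drawn with geom_col, while the running total stays on the full datetime data. This is only an illustration: the six timestamps below are made up (the real data is in the pastebin), and the names subs, daily and submissions are hypothetical.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical submissions in the same month/day/year hour:minute format
toy <- data.frame(datetime = c(
  "01/03/2022 09:15", "01/03/2022 17:40",
  "01/04/2022 08:05", "01/04/2022 12:30", "01/04/2022 22:10",
  "01/06/2022 11:00"
))

# Running total: sort by time, then number the rows
subs <- toy %>%
  mutate(datetime = mdy_hm(datetime)) %>%
  arrange(datetime) %>%
  mutate(total = row_number(),
         date  = as_date(datetime))

# Daily totals: one row per calendar day, placed at midday for plotting
daily <- subs %>%
  count(date, name = "submissions") %>%
  mutate(midday = as_datetime(date) + hours(12))

p <- ggplot() +
  geom_col(data = daily, aes(x = midday, y = submissions)) +
  geom_line(data = subs, aes(x = datetime, y = total), colour = "red") +
  labs(title = "Submissions by Day", x = "Date", y = "Submissions")
```

Keeping two tables avoids repeating the daily count on every row, at the cost of passing data to each layer separately.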

R: using ggplot2 with a group_by data set

I can't quite figure this out. A CSV of 200+ rows assigned to data like so:
gid,bh,p1_id,p1_x,p1_y
90467,R,543333,80.184,98.824
90467,L,408045,74.086,90.923
90467,R,543333,57.629,103.797
90467,L,408045,58.589,95.937
Trying to group by p1_id and plot the mean values for p1_x and p1_y:
grp <- data %>% group_by(p1_id)
Trying to plot geom_point objects like so:
geom_point(aes(mean(grp$p1_x), mean(grp$p1_y), color=grp$p1_id))
But that isn't showing unique plot points per distinct p1_id values.
What's the missing step here?
Why not calculate the mean first?
library(dplyr)
grp <- data %>%
group_by(p1_id) %>%
summarise(mean_p1x = mean(p1_x),
mean_p1y = mean(p1_y))
Then plot:
library(ggplot2)
ggplot(grp, aes(x = mean_p1x, y = mean_p1y)) +
geom_point(aes(color = as.factor(p1_id)))
Edit: As per @eipi10, you can also pipe directly into ggplot:
data %>%
group_by(p1_id) %>%
summarise(mean_p1x = mean(p1_x),
mean_p1y = mean(p1_y)) %>%
ggplot(aes(x = mean_p1x, y = mean_p1y)) +
geom_point(aes(color = as.factor(p1_id)))
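To see why the original attempt produced a single point, the sketch below uses only the four sample rows from the question. Pulling a column out with $ ignores the grouping, so mean(grp$p1_x) is one number for the whole table, whereas summarise() returns one row per p1_id. The names overall and per_id are just for illustration.

```r
library(dplyr)

# The four sample rows shown in the question
data <- read.csv(text = "gid,bh,p1_id,p1_x,p1_y
90467,R,543333,80.184,98.824
90467,L,408045,74.086,90.923
90467,R,543333,57.629,103.797
90467,L,408045,58.589,95.937")

grp <- data %>% group_by(p1_id)

# `$` extracts the plain column, so the grouping is not used here
overall <- mean(grp$p1_x)

# summarise() respects the grouping: one row per p1_id
per_id <- grp %>%
  summarise(mean_p1x = mean(p1_x),
            mean_p1y = mean(p1_y))
```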

ggplot fill does not work - no errors [MRE]

The ggplot analysis below is intended to show the number of survey responses by date. I'd like to color the bars by the three survey administrations (the Admini variable). While no errors are thrown, the bars are not colored.
Can anyone point out how/why my bars are not color-coded? THANKS!
library(ggplot2)
library(dplyr)
library(RCurl)
OSTadminDates2<-getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/OSTadminDates.csv")
OSTadminDates<-read.csv(text=OSTadminDates2)
ndate1<-as.Date(OSTadminDates$Date,"%m/%d/%y");ndate1
SurvAdmin<-as.factor(OSTadminDates$Admini)
R<-ggplot(data=OSTadminDates,aes(x=ndate1),fill=Admini,group=1) +
geom_bar(stat = "count",width = .5 )
R
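Here's a work-around you could use. In your code, fill=Admini is passed to ggplot() outside aes(), so it is never mapped to the bars; the version below counts the responses first and maps fill inside aes():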
library(ggplot2)
library(dplyr)
library(RCurl)
OSTadminDates2<-getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/OSTadminDates.csv")
OSTadminDates<-read.csv(text=OSTadminDates2)
OSTadminDates$Date<-as.Date(OSTadminDates$Date,"%m/%d/%y")
OSTadminDates$Admini <- factor(OSTadminDates$Admini)
df <- OSTadminDates %>%
group_by(Date, Admini) %>%
summarise(n = n())
ggplot(data = df) +
geom_bar(aes(x = Date, y = n, fill = Admini), stat = "identity")
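If the pre-counting step is not needed, a smaller change may be enough: keep geom_bar's default counting and move fill inside aes(). The sketch below assumes a made-up data frame (toy, with hypothetical dates and three Admini levels) so that it runs without downloading the CSV.

```r
library(ggplot2)

# Hypothetical survey responses: one row per response
toy <- data.frame(
  Date = as.Date(c("2019-01-07", "2019-01-07", "2019-01-08",
                   "2019-03-04", "2019-03-04", "2019-05-06")),
  Admini = factor(c(1, 1, 1, 2, 2, 3))
)

# fill is an aesthetic mapping, so it goes inside aes();
# geom_bar() counts the rows per date by default
p <- ggplot(toy, aes(x = Date, fill = Admini)) +
  geom_bar(width = 0.5)

# Inspect the computed bars and their fill colors
built <- layer_data(p)
```

Using columns of the data frame inside aes(), rather than separate vectors such as ndate1, also keeps the mapping and the data in sync.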
