Tile plot for a continuous variable in R

I want to see the average departure delay in the flights dataset from nycflights13 by distance and month with a tile plot. I plotted it and I got this:
How can I see it better? I can't understand anything.

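This is because the distance column is continuous. A tile plot is only readable when both axes have a small number of distinct values, so in practice you want them to be categorical. With a continuous distance, each tile is a very thin strip. So you first need to categorise the distance column; one way to do this is with cut_number from ggplot2, which splits a numeric vector into bins holding roughly equal numbers of observations.
library(ggplot2)
ggplot(nycflights13::flights,
       aes(x = cut_number(distance, n = 5),
           y = factor(month))) +
  geom_tile(aes(fill = dep_delay))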
(A tip: next time you ask a question, it is helpful for us to see the code you have written; otherwise it is more difficult to help you. I needed to check which package the flights dataset was from, and what its variables were called.)
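One caveat worth spelling out: the snippet above maps fill to dep_delay for every individual flight, so many tiles are drawn on top of each other in each cell and the colour you see is not an average. A minimal sketch that averages first, assuming dplyr is available (the names avg_delay, distance_bin and avg_dep_delay are made up for illustration):

```r
library(dplyr)
library(ggplot2)

# Bin distance into 5 groups of roughly equal size, then average the
# departure delay within each (distance bin, month) cell.
avg_delay <- nycflights13::flights %>%
  mutate(distance_bin = cut_number(distance, n = 5)) %>%
  group_by(distance_bin, month) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

# One tile per cell, coloured by the mean delay in that cell.
p <- ggplot(avg_delay, aes(x = distance_bin, y = factor(month))) +
  geom_tile(aes(fill = avg_dep_delay)) +
  labs(x = "Distance (miles, binned)", y = "Month",
       fill = "Mean departure delay (min)")
p
```

Because the summary table has at most one row per cell, there is no overplotting, and the fill legend reads directly as the mean delay.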

Maybe you want something like this. I divide 'average_delay' into 5 categories so that you get more distinct colors. You can use this code:
library(nycflights13)
library(dplyr)
library(ggplot2)

flights %>%
  group_by(month) %>%
  # one average per month, repeated on every flight of that month
  mutate(average_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = distance, y = month)) +
  geom_tile(aes(fill = cut_number(average_delay, n = 5))) +
  scale_fill_discrete(name = "Average delay")
Output:

Related

Combine scale_x_upset with scale_y_break

I made an upset plot using the ggupset package and added a break to the y axis with scale_y_break from the ggbreak package.
However, when I add scale_y_break, the combination matrix under the bar plot disappears.
Is there a way to combine the combination matrix of the plot made without scale_y_break with the bar plot portion of a plot made with scale_y_break? I can't seem to be able to access the grobs of these plots or use any other workaround. If anyone could help, I would greatly appreciate it!
Example with scale_x_upset and scale_y_break:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
I would like to combine the barplot portion of the plot created with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
with the combination matrix portion of the plot made with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)
Thanks!

How to put the y-axis values in order when using ggplot2 in R

I was trying to use ggplot2 to plot multiple lines in one window. This is my code:
ggplot(type1_age, aes(x = Year, y = Value, group = Age, color = Age)) + geom_line()+ggtitle('The percentage distribution of type1 diabetes patients in different age groups')+ylab("percentage (%)")
The type1_age file looks like this:
The resulting figure is this:
The problem is that the y-axis in the resulting figure is not in order. Can you please help me figure out why? Thanks!
You should try this (with the tidyverse):
library(tidyverse)

type1_age %>%
  # go through character first so that a factor yields its labels, not its level codes
  mutate(Value = as.numeric(as.character(Value))) %>%
  ggplot(aes(x = Year, y = Value, group = Age, color = Age)) +
  geom_line() +
  ggtitle('The percentage distribution of type1 diabetes patients in different age groups') +
  ylab("percentage (%)")
I suspect your Value variable is a character vector (or a factor) rather than a numeric one, in which case ggplot2 treats it as discrete and orders the axis alphabetically instead of numerically. The mutate() step should convert it to numeric.
Tell me if it works!
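To illustrate why the conversion goes through as.character(), here is a small sketch. It assumes Value might be stored as a factor, which the question does not confirm; the vector value below is made up.

```r
# Hypothetical stand-in for the Value column, stored as a factor
value <- factor(c("10.5", "2.3", "7.1"))

# Direct conversion returns the internal level codes, not the numbers
codes <- as.numeric(value)

# Converting to character first recovers the numbers that were typed in
numbers <- as.numeric(as.character(value))

print(codes)
print(numbers)
```

For a plain character vector the extra as.character() call is harmless, so the same line works in both cases.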

How to calculate and label peak value of distribution by multiple conditions/facets in R ggplot?

While the question appears similar to others, there's a key difference in my mind.
I want to be able to calculate and/or print the peak value of a density curve for EACH SUB-CONDITION BY FACET. Calculating it in the data frame is the primary goal; graphing it would be the ultimate goal. The density graph looks like this:
So, ideally, I would be able to know the intensity (x-axis value) corresponding to the highest peak of the density curves for each condition.
Here's some dummy data:
set.seed(1234)
library(tidyverse)
n <- 100000
silence <- factor(c("sil1", "sil2", "sil3", "sil4", "sil5"))
treat <- factor(c("con", "uos", "uos+wnt5a", "wnt5a"))
# recycle the condition labels to n rows; sample intensities with replacement,
# since n is larger than the number of values in 4000:10000
df <- data.frame(
  silence = rep(silence, length.out = n),
  treat = rep(treat, length.out = n),
  intensity = sample(4000:10000, n, replace = TRUE)
)
What I've tried:
Subsetting the primary DF and going through and calculating the density of each condition, but this could take days
Something close to this answer: Calculating peaks in histograms or density functions but not quite. I think the data look better as a histogram personally, but that constructs an arbitrary number of bins for intensity data (a continuous measure). The histogram looks like this:
Again, it would be sufficient to get the peak values for each of these groups (i.e., treatments by silencing subdistributions) just in the console, but adding them as a vertical line in the graphs would be a sweet cherry on top (it could also make it hella busy, so I will see about that piece later)
Thank you!!
Depending on the way you're producing the density plots, there may be a more direct way to recreate the density calculation before it goes into ggplot. That'll be the easiest way to get the peak values and keep them in the format of your data.
Without that, here's a hack that should work in general, but requires some kludging to fit the extracted points back into the form of your original data.
Here's a plot like yours:
mtcars %>%
  mutate(gear = as.character(gear)) %>%
  ggplot(aes(wt, fill = gear, group = gear)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~am) -> my_plot
Here are the components that make up that plot:
ggplot_build(my_plot) -> my_plot_innards
With some ugly hacking we can extract the points that make up the curves and make them look kind of like our original data. Some info is destroyed, e.g. the gear values 3/4/5 become group 1/2/3. There might be a cool way to convert back, but I don't know it yet.
extracted_points <- tibble(
  wt = my_plot_innards[["data"]][[1]][["x"]],
  y = my_plot_innards[["data"]][[1]][["y"]],
  gear = (my_plot_innards[["data"]][[1]][["group"]] + 2) %>% as.character, # HACK
  am = (my_plot_innards[["data"]][[1]][["PANEL"]] %>% as.numeric) - 1 # HACK
)
ggplot(extracted_points, aes(wt, y, fill = gear)) +
  geom_point(size = 0.3) +
  facet_wrap(~am)
extracted_points_notes <- extracted_points %>%
  group_by(gear, am) %>%
  slice_max(y)
my_plot +
  geom_point(data = extracted_points_notes,
             aes(y = y), color = "red", size = 3, show.legend = FALSE) +
  geom_text(data = extracted_points_notes, hjust = -0.5,
            aes(y = y, label = scales::comma(y)), color = "red", size = 3, show.legend = FALSE)
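The more direct route mentioned at the start of this answer, recreating the density calculation outside ggplot, might look roughly like the sketch below. It assumes the default settings of stats::density() are close enough to what geom_density() uses. The helper peak_x is hypothetical and not part of any package, and the peaks it returns should land near, though not necessarily exactly on, the ones extracted from the plot.

```r
library(dplyr)

# Hypothetical helper: x position at which a kernel density estimate is highest
peak_x <- function(x) {
  d <- stats::density(x)   # default Gaussian kernel and bandwidth
  d$x[which.max(d$y)]
}

# One peak per facet (am) and sub-condition (gear), kept as a tidy data frame
peaks <- mtcars %>%
  group_by(am, gear) %>%
  summarise(peak_wt = peak_x(wt), .groups = "drop")

print(peaks)
```

With the data from the question, the same pattern would group by silence and treat and pass intensity to peak_x. The resulting table could then feed geom_vline(data = peaks, aes(xintercept = peak_wt)) to mark the peaks on the plot.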

Percentage of bin total on y-axis with facet_wrap and time series on x-axis

I am investigating a dataset with loan information from Prosper, specifically investor behavior.
The plot I would like to create would show investors on the y axis, and time on the x axis, binned to the average month. This would also be faceted by a Credit Grade. Ultimately, I would like each bin to show what percentage of total investors were allocated to each Credit Grade (the facet variable), per calculated month (or actual month, but calculated seems easier for binning).
I have tried ..density.., ..count../sum(..count..), geom_density, etc and seen plenty of posts that will sum each facet to 1 or the entire plot to 1. To re-iterate I am trying to sum each bin, among all the facets, to 1. I was also hoping to do this directly in ggplot, rather than alter the dataframe, but I'll take what I can get.
The following code shows two ways to display the investor counts (count per bin and percentage of entire plot per bin):
t1 <- ggplot(data = loans, aes(x=as.POSIXct(strptime(LoanOriginationDate, '%Y-%m-%d %H:%M:%S')))) +
geom_histogram(binwidth = 60*60*24*30.4375, aes(y = ..count../sum(..count..), group = Investors)) +
facet_wrap(~ProsperCreditGrade) +
scale_y_continuous()
t2 <- ggplot(loans,aes(x=as.POSIXct(strptime(LoanOriginationDate, '%Y-%m-%d %H:%M:%S')),fill=ProsperCreditGrade))+
geom_histogram(aes(y=2629800* ..count../sum(..count..)),
alpha=1,position='identity',binwidth=2629800) +
facet_wrap(~ProsperCreditGrade) +
stat_bin(aes(y = ..density..))
grid.arrange(t1,t2,ncol=1)
As you can see in the plot, total investors went up quite a bit toward the end of the time covered in the dataset. This does not show relative investment behavior over a given time, which is what I am trying to investigate.
What else can I try?
With help from Stephen of Udacity.com and dplyr, the final code is as follows:
loans$month <- month(as.POSIXct((round(as.numeric(as.POSIXct(loans$LoanOriginationDate))/2629800)*2629800), origin = "1969-12-31 19:00:00"))
loans$year <- year(as.POSIXct((round(as.numeric(as.POSIXct(loans$LoanOriginationDate))/2629800)*2629800), origin = "1969-12-31 19:00:00"))
loans$calculatedMonth <- ((loans$year-2005)*12)+loans$month
loanInvestors <- loans %>% group_by(calculatedMonth, ProsperCreditGrade) %>% summarise (n = n()) %>% mutate(proportion = n / sum(n))
ggplot(data = loanInvestors, aes(x = calculatedMonth, y = proportion, fill = proportion, width = 3)) +
geom_bar(stat = "identity") + facet_wrap(~ProsperCreditGrade) +
scale_y_sqrt() + geom_smooth(color = "red") +
scale_fill_gradient()
Investors per quarter by Credit Grade
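The key step in the code above is that the proportion is computed within each calculatedMonth, across credit grades, so the bars for one month sum to 1 over all facets. A minimal sketch of that step on made-up data (the toy data frame is purely illustrative, not the Prosper data):

```r
library(dplyr)

# Hypothetical toy data standing in for the loans table
toy <- data.frame(
  calculatedMonth    = c(1, 1, 1, 2, 2, 2, 2),
  ProsperCreditGrade = c("A", "A", "B", "A", "B", "B", "B")
)

shares <- toy %>%
  count(calculatedMonth, ProsperCreditGrade) %>%  # rows per month and grade
  group_by(calculatedMonth) %>%                   # normalise within each month
  mutate(proportion = n / sum(n)) %>%
  ungroup()

print(shares)
```

Grouping explicitly by calculatedMonth before the mutate() makes the denominator visible, rather than relying on summarise() dropping the last grouping variable.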

Plotting grouped probabilities in R

I'm new to R and I'm trying to graph probability of flight delays by hour of day. Probability of flight delays would be calculated using a "Delays" column of 1's and 0's.
Here's what I have. I was trying to put a custom function into fun.y, but it doesn't seem like it's allowed.
library(ggplot2)
ggplot(data = flights, aes(flights$HourOfDay, flights$ArrDelay)) +
stat_summary(fun.y = (sum(flights$Delay)/no_na_flights), geom = "bar") +
scale_x_discrete(limits=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25)) +
ylim(0,500)
What's the best way to do this?
Thanks in advance.
I am not sure if that is what you wanted, but I did it in the following way:
library(ggplot2)
library(dplyr)
library(nycflights13)
probs <- flights %>%
  # Testing whether a delay occurred for departure or arrival
  mutate(Delay = dep_delay > 0 | arr_delay > 0) %>%
  # Grouping the data by hour
  group_by(hour) %>%
  # Calculating the proportion of delays for each hour
  summarize(Prob_Delay = sum(Delay, na.rm = TRUE) / n()) %>%
  ungroup()
theme_set(theme_bw())
ggplot(probs) +
  aes(x = hour,
      y = Prob_Delay) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = 0:24)
Which gives the following plot:
I think it is always better to do data manipulation outside ggplot, using for instance dplyr.
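If you would still rather let ggplot2 do the summary, as attempted in the question, stat_summary() takes a function rather than a precomputed value, and the mean of a 0/1 column is the proportion of 1s. A sketch under those assumptions, using nycflights13 in place of the asker's data and a made-up Delay column built from dep_delay only (in current ggplot2 the argument is fun; fun.y is the older, deprecated name):

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

# Hypothetical 0/1 indicator: 1 if the flight departed late
delays <- flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(Delay = as.numeric(dep_delay > 0))

# mean() of a 0/1 column is the share of delayed flights at each hour
p <- ggplot(delays, aes(x = hour, y = Delay)) +
  stat_summary(fun = mean, geom = "bar") +
  scale_x_continuous(breaks = 0:24) +
  ylab("Probability of delay")
p
```

Unlike the answer above, this version drops flights with a missing dep_delay before averaging, so the two plots can differ slightly.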
