I am trying to calculate a median as one number, and then use that number as a value in the aesthetic of ggplot.
I first try to obtain the median as a value:
mean_delay_median <- nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
median(mean_delay)
This yields the error message:
Error in median.default(., mean_delay) : need numeric data
How can I fix this?
Once I get this working, my second step would be to color a map based on values above and below this median "mean_delay_median", with something like this:
nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(nycflights13::airports, c('dest' = 'faa')) %>%
ggplot(aes(lon, lat, color=mean_delay>mean_delay_median)) +
borders("state") +
geom_point() +
coord_quickmap()
In general, I seek guidance on using prior statistics in subsequent code.
Thanks!
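You are just missing a second summarise step: summarise(median_all_delay = median(mean_delay, na.rm = TRUE)). Piping the summarised data frame straight into median() passes the whole data frame as its first argument, which is why it complains about needing numeric data.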
try this:
mean_delay_median <- nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
summarise(median_all_delay = median(mean_delay, na.rm = TRUE)) %>%
unlist()
nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(nycflights13::airports, c('dest' = 'faa')) %>%
ggplot(aes(lon, lat, color=mean_delay>mean_delay_median)) +
borders("state") +
geom_point() +
coord_quickmap()
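As a general pattern for reusing a statistic later, it can help to pull the column out as a plain vector and reduce it to a single number. The sketch below uses a small made-up tibble (delays, with hypothetical values) instead of the flights data so it runs without nycflights13; pull() is used here as an alternative to unlist(), not as the only way to do it.

```r
library(dplyr)

# Toy stand-in for the flights data (values are made up for illustration)
delays <- tibble(
  dest      = c("A", "A", "B", "B", "C"),
  arr_delay = c(10, 20, -5, NA, NA)
)

# Step 1: one mean per destination (dest "C" has only NA, so its mean is NaN)
per_dest <- delays %>%
  group_by(dest) %>%
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE))

# Step 2: extract the column as a vector, then reduce it to one number
mean_delay_median <- per_dest %>%
  pull(mean_delay) %>%
  median(na.rm = TRUE)

# Step 3: the scalar can now be used in any later expression
per_dest_flagged <- per_dest %>%
  mutate(above = mean_delay > mean_delay_median)
```

Because mean_delay_median is an ordinary length-one numeric, it can be referenced inside aes() or mutate() like any other variable in the calling environment.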
You should note that one destination(dest) has every arr_delay observation missing.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(dest) %>%
filter(all(is.na(arr_delay))) %>%
select(dest, arr_delay)
#> # A tibble: 1 x 2
#> # Groups: dest [1]
#> dest arr_delay
#> <chr> <dbl>
#> 1 LGA NA
This leads to NaN, not zero.
mean(c(NA), na.rm = TRUE)
#> [1] NaN
In other words, you need na.rm = TRUE a second time, in the median() call, because the per-destination means now contain a NaN.
flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
mutate(arrival = ifelse(mean_delay > median(mean_delay, na.rm = TRUE), "late", "okay")) %>% # na.rm option to median
inner_join(airports, by = c("dest" = "faa")) %>%
ggplot() +
aes(lon, lat, colour = arrival) +
borders("state") +
geom_point() +
coord_quickmap()
Since the mean for LGA is NaN, the comparison for that row is NA, so its arrival label will also be NA.
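The NaN propagation described above can be checked in base R. This is a minimal sketch with a made-up vector of three per-destination means, the last one standing in for a destination whose delays are all missing.

```r
# Hypothetical per-destination means; the third is NaN (all delays missing)
mean_delay <- c(15, -5, NaN)

# Without na.rm the median itself is missing
median_is_missing <- is.na(median(mean_delay))

# With na.rm the NaN is dropped before the median is taken
med <- median(mean_delay, na.rm = TRUE)

# Comparing NaN with a number gives NA, and ifelse() passes that NA through
arrival <- ifelse(mean_delay > med, "late", "okay")
```

If an NA label is not wanted on the plot, one option is to filter out rows with a missing mean_delay before plotting.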
I am creating a grouped bar chart like so:
library(tidyverse)
library(echarts4r)
data("starwars")
starwars %>%
group_by(sex, eye_color) %>%
summarise(height = mean(height, na.rm=TRUE)) %>%
group_by(sex) %>%
e_charts(x = eye_color, timeline = TRUE) %>%
e_bar(height, legend = FALSE)
How do I set the range of the y axis (height) to be the same across groups (sex)?
You could set a common maximum for the y axis with e_y_axis(max = XXX). In the code below I compute that maximum from the largest mean height, rounded up to the next multiple of 50.
library(tidyverse)
library(echarts4r)
data("starwars")
dat <- starwars %>%
group_by(sex, eye_color) %>%
summarise(height = mean(height, na.rm=TRUE), .groups = "drop")
ymax <- 50 * ceiling(max(dat$height, na.rm = TRUE) / 50)
dat %>%
group_by(sex) %>%
e_charts(x = eye_color, timeline = TRUE) %>%
e_bar(height, legend = FALSE) %>%
e_y_axis(max = ymax)
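The rounding step can be isolated into a small helper. This is only a sketch: round_up_to is a hypothetical name, and the heights vector is made up for illustration.

```r
# Round the largest value in x up to the next multiple of `step`
round_up_to <- function(x, step = 50) {
  step * ceiling(max(x, na.rm = TRUE) / step)
}

# Made-up mean heights, including a missing value
heights <- c(172.3, 231, NA, 96)
ymax <- round_up_to(heights)
```

Computing the limit once from the full summarised data, rather than per group, is what keeps the axis identical across the timeline steps.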
I am using the 'flights' data set from the 'nycflights13' package. I was asked to rewrite the original code below so that it uses 'stat_summary_2d':
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
group_by(carrier, origin) %>%
summarise(cancel = 100*mean(cancel, na.rm = T)) %>%
ggplot() +
geom_tile(aes(origin, carrier, fill = cancel)) +
geom_text(aes(origin, carrier, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
In the data set the missing value of the 'dep_time' variable means a cancelled flight, and the variable 'cancel' is created by calculating the proportion of cancelled flights over the scheduled flights.
Below is how I apply 'stat_summary_2d' to convert the original code:
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
ggplot() +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flights %>% group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = T)) %>% ungroup,
aes(factor(carrier), origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
When I execute my code, the error is
> Error in `summarize()`: ! Problem while computing `cancel = 100 *
> mean(cancel, na.rm = T)`. i The error occurred in group 1: carrier =
> "9E", origin = "EWR".
Could anyone tell me how to fix this problem? Thank you so much!
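The original data is not updated with the new column: the result of mutate() is never assigned, so the flights used inside geom_text() has no cancel column. You could assign back with %<>% instead of %>%, but it may be easier to create two objects: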
library(dplyr)
library(ggplot2)
flight1 <- flights %>%
mutate(cancel = 1*(dep_time %>% is.na))
flight2 <- flight1 %>%
group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = TRUE), .groups = 'drop')
ggplot(flight1) +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flight2, aes(factor(carrier),
origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
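To see why the column was missing, here is a minimal sketch on a made-up three-row tibble (toy, with hypothetical values) showing that a pipeline leaves its input untouched unless the result is assigned.

```r
library(dplyr)

# Toy stand-in for flights: NA in dep_time marks a cancelled flight
toy <- tibble(
  carrier  = c("AA", "AA", "UA"),
  dep_time = c(900, NA, NA)
)

# The pipeline returns a new data frame; `toy` itself is unchanged
invisible(toy %>% mutate(cancel = as.integer(is.na(dep_time))))
has_cancel_before <- "cancel" %in% names(toy)

# Assigning the result keeps the new column around for later steps
toy_flagged <- toy %>%
  mutate(cancel = as.integer(is.na(dep_time)))

# Percentage of cancelled flights per carrier
rates <- toy_flagged %>%
  group_by(carrier) %>%
  summarise(cancel = 100 * mean(cancel), .groups = "drop")
```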
I am trying to reproduce a map I found here: http://zevross.com/blog/2018/10/02/creating-beautiful-demographic-maps-in-r-with-the-tidycensus-and-tmap-packages/
I am using RStudio and am running the following code:
library(ggplot2) # For plotting
library(tidycensus) # For downloading Census data
library(tmap) # For creating tmap
library(tmaptools) # For reading and processing spatial data related to tmap
library(dplyr) # For data wrangling
library(sf) # For reading, writing and working with spatial objects
census_api_key("enter your API key here", overwrite = TRUE)
dat12 <- get_acs("county", table = "B27001", year = 2012,
output = "tidy", state = NULL, geometry = FALSE) %>%
rename(`2012` = estimate) %>%
select(-NAME, -moe)
dat16 <- get_acs("county", table = "B27001", year = 2016,
output = "tidy", state = NULL, geometry = TRUE, shift_geo = TRUE) %>%
rename(`2016` = estimate) %>%
select(-moe)
dat <- left_join(dat16, dat12, by = c("GEOID", "variable"))
st_geometry(dat) <- NULL # This drops the geometry and leaves a table
head(dat)
dat <- mutate(dat,
cat = case_when(
variable %in% paste0("B27001_0",
c("09","12","37","40")) ~ "pop1834",
variable %in% paste0("B27001_0",
c("11","14","39","42")) ~ "pop1834ni")) %>%
filter(!is.na(cat))
# Create long version
dat <- tidyr::gather(dat, year, estimate, c(`2012`, `2016`))
# Group the data by our new categories and sum
dat <- group_by(dat, GEOID, NAME, year, cat) %>%
summarize(estimate = sum(estimate)) %>%
ungroup() %>%
tidyr::spread(cat, estimate)
dat <- mutate(dat, est = (pop1834ni/pop1834) * 100) %>%
select(-c(pop1834, pop1834ni)) %>%
tidyr::spread(year, est) %>%
mutate(diff = `2016`-`2012`)
head(dat)
datlong <- select(dat, -diff) %>%
tidyr::gather(year, estimate, c(`2012`, `2016`)) %>%
group_by(year) %>%
mutate(med = round(median(estimate, na.rm = TRUE), 1))
ggplot(datlong, aes(estimate)) +
geom_histogram(fill = "firebrick2",
color = "white", bins = 60) +
xlab("Uninsured adults ages 18-34 by county (%)") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~year, ncol = 1) +
geom_vline(aes(xintercept = med,
group = year), lty = "dashed") +
geom_text(aes(label = paste("Median = ", med), x = med, y = 55))
d10 <- top_n(dat, 10, diff) %>%
mutate(type = "Insured population decreased",
difftemp = diff)
i10 <- top_n(dat, -10, diff) %>%
mutate(type = "Insured population increased",
difftemp = abs(diff))
id10 <- bind_rows(list(i10, d10)) %>%
arrange(desc(difftemp))
ggplot(id10) +
geom_col(aes(x = forcats::fct_reorder(NAME, difftemp),
y = difftemp, fill = type)) +
coord_flip() +
scale_fill_manual(values = c("firebrick2", "cyan4")) +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.title = element_blank()) +
ggtitle("Counties with the greatest change (+/-) in
insured population, ages 18-34, 2012-2016") +
ylab("Difference in % insured (2016 - 2012)") +
xlab("")
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
Up until the last bit of code, the one that begins with shp, everything runs perfectly. Once
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
is run, I get the following error:
Error in select(GEOID, NAME) : object 'GEOID' not found
I have checked dat16 and dat. GEOID and NAME are present there. I am not sure what is wrong with the SELECT function as I have not loaded another library which may interfere with it. Any help would be appreciated.
I see now what was missing, a %>% (pipe) following the 'filter':
shp <- dat16 %>%
filter(variable == "B27001_001") %>% # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(
uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff
)
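The reason for the error message is that, without the trailing %>%, the pipeline ends at filter() and select(GEOID, NAME) runs as a separate statement, with GEOID taken as the data argument. A minimal sketch on a made-up two-row tibble (counties, with hypothetical values):

```r
library(dplyr)

# Made-up stand-in for dat16
counties <- tibble(
  GEOID    = c("01001", "01003"),
  NAME     = c("County A", "County B"),
  variable = c("B27001_001", "B27001_002")
)

# Standalone call: GEOID is looked up as an object, not as a column,
# so this fails; the error is caught here to keep the script running
broken <- tryCatch(
  select(GEOID, NAME),
  error = function(e) "error"
)

# Piped call: the data frame is supplied as the first argument
fixed <- counties %>%
  filter(variable == "B27001_001") %>%
  select(GEOID, NAME)
```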
I'm having trouble getting a geom_rect to display a shaded area when using facet_wrap and the dplyr do(...) to generate the plots.
NOTE: The issue here may be related to a data structure issue. See this SO question for the current state of play.
The following minimal example uses the ggplot2 packages economics data and the NBER recession dates from the tis package.
Appreciate hints tips and incantations.
library(tis)
library(ggplot2)
library(dplyr) # for %>%, group_by, mutate, full_join
library(tidyr) # for gather
# Prepare NBER recession start end dates.
start <- data.frame(date = as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"),
start= as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"))
end <- data.frame(date = as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"),
end= as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"))
dl <- economics %>%
gather(metric, value, pce:unemploy ) %>%
group_by(metric) %>%
mutate(diff = value - lag(value, default=first(value))) %>%
mutate(pct = diff/value) %>%
gather(transform, value, value:pct ) %>%
full_join(x=., y=start, by=c('date' = 'date')) %>%
full_join(x=., y=end, by=c('date' = 'date')) %>%
mutate(ymin = 0) %>%
mutate(ymax = Inf)
# Check the start end dates are present
dl %>% group_by(metric,transform, start) %>% summarise( count=n())
pl <- dl %>%
do(
plots = ggplot(data=., aes(x = date, y = value)) +
geom_point() +
geom_rect(aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
stat_smooth(method="auto",size=1.5) +
facet_wrap(~transform, scales="free_y")
)
pl[[1,2]]
I have checked that the minimum and maximum dates for each group are the same (NA group is not plotted):
dl %>%
group_by(transform) %>%
summarise(min= min(start, na.rm =TRUE), max = max(start, na.rm =TRUE))
# A tibble: 4 x 3
transform min max
<chr> <date> <date>
1 diff 1970-01-01 2008-01-01
2 pct 1970-01-01 2008-01-01
3 value 1970-01-01 2008-01-01
4 NA 1857-07-01 1960-05-01
Even if it is not the optimal solution, you can hard-code both dates and use annotate. This avoids the stacked opacity you get with geom_rect, which draws one rectangle per row of the data. I added alpha = 0.5 for transparency.
pl <- dl %>%
do(
plots = ggplot(data=., aes(x = date, y = value)) +
geom_point() +
annotate('rect', xmin = as.Date("1970-01-01"), xmax = as.Date("2008-01-01"),
ymin = -Inf, ymax = Inf, alpha = 0.5) +
stat_smooth(method="auto",size=1.5) +
facet_wrap(~transform, scales="free_y")
)
pl[[1,2]]
Okay, the issue here is that constructing the data frame is non-trivial. Two outer joins do not provide the required structure.
# Prepare NBER recession start end dates.
recessions <- data.frame(start = as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"),
end= as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"))
# Create the long format data frame
dl <- economics %>%
gather(metric, value, pce:unemploy ) %>%
group_by(metric) %>%
mutate(diff = value - lag(value, default=first(value))) %>%
mutate(pct = diff/value) %>%
gather(transform, value, value:pct ) #%>%
# Build the data frame with start and end dates given in recessions
df1 <- dl %>%
mutate(dummy=TRUE) %>%
left_join(recessions %>% mutate(dummy=TRUE)) %>%
filter(date >= start & date <= end) %>%
select(-dummy)
# Build data frame of all other dates with start=NA and end=NA
df2 <- dl %>%
mutate(dummy=TRUE) %>%
left_join(recessions %>% mutate(dummy=TRUE)) %>%
mutate(start=NA, end=NA) %>%
unique() %>%
select(-dummy)
# Now merge the two. Overwrite NA values with start and end dates
dl <- df2 %>%
left_join(x=., y=df1, by="date") %>%
mutate(date, start = ifelse(is.na(start.y), as.character(start.x), as.character(start.y)),end = ifelse(is.na(end.y), as.character(end.x), as.character(end.y))) %>%
mutate(start=as.Date(start), end=as.Date(end) ) %>%
select(-starts_with("start."),-starts_with("end."),-ends_with(".y")) %>%
setNames(sub("\\.x$", "", names(.))) %>%
mutate(ymin = -Inf) %>% #min(value)) %>%
mutate(ymax = Inf) #max(value)) #%>%
# Check the start end dates are present
dl %>% group_by(metric,transform, start, end) %>% summarise( count = n() ) %>% print(n=180)
pl <- dl %>%
group_by(metric) %>%
do(
plots = ggplot(data=., aes(x = date, y = value)) +
geom_point() +
# annotate('rect', xmin = start, xmax = end,
# ymin = ymin, ymax = ymax, alpha = 0.5) +
geom_rect(aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax), na.rm=TRUE) +
stat_smooth(method="auto",size=1.5) +
facet_wrap(~transform, scales="free_y")
)
grid.draw(pl[[1,2]])
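A different approach, not the one used in the answers above, is to keep the shading intervals in their own small data frame and hand that to geom_rect directly, which avoids joining them onto the plotted data at all. The sketch below assumes only ggplot2 and its economics data; the shade data frame and its two date ranges are illustrative placeholders, not the NBER dates from tis.

```r
library(ggplot2)

# Illustrative shading windows (placeholder dates, one row per rectangle)
shade <- data.frame(
  start = as.Date(c("1990-08-01", "2008-01-01")),
  end   = as.Date(c("1991-03-01", "2009-06-01"))
)

p <- ggplot(economics, aes(x = date, y = unemploy)) +
  # The layer gets its own data, so exactly one rectangle is drawn per row;
  # inherit.aes = FALSE stops it from looking for `date` and `unemploy` in shade
  geom_rect(
    data = shade,
    aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf),
    inherit.aes = FALSE, alpha = 0.3
  ) +
  geom_line()
```

Because shade has no faceting variable, the same rectangles would be repeated in every panel if facet_wrap were added.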
Consider this simple example
library(dplyr)
library(ggplot2)
dataframe <- data_frame(id = c(1,2,3,4),
group = c('a','b','c','c'),
value = c(200,400,120,300))
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
Here I want to write a function that takes the dataframe and the grouping variable as input. Ideally, after grouping and aggregating I would like to print a ggplot chart.
This works:
get_charts2 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
> get_charts2(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
Unfortunately, adding ggplot into the function above FAILS
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean, color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
> get_charts1(dataframe, group)
Error in !quo_var : invalid argument type
I don't understand what is wrong here. Any ideas?
Thanks!
EDIT: interesting follow-up here how to create factor variables from quosures in functions using ggplot and dplyr?
At the time of this answer, ggplot did not yet support tidy eval syntax (you couldn't use !! inside aes()). You needed more traditional standard evaluation calls instead, and aes_q in ggplot helps with this.
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes_q(x = quote(count), y = quote(mean), color = quo_var, group = quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)
ggplot2 v3.0.0 released in July 2018 supports !! (bang bang), !!!, and :=. aes_()/aes_q() and aes_string() are soft-deprecated.
OP's original code should work
library(tidyverse)
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)
Edit: using the tidy evaluation pronoun .data with [[ ]] to pick the chosen variable from the data frame also works, with the column name passed as a string
get_charts2 <- function(data, mygroup){
df_agg <- data %>%
group_by(.data[[mygroup]]) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = .data[[mygroup]], group = .data[[mygroup]])) +
geom_point() +
geom_line()
}
get_charts2(dataframe, "group")
Created on 2018-04-04 by the reprex package (v0.2.0).
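With more recent versions of rlang (0.4.0 and later) the enquo() and !! pair can also be written with the embrace operator {{ }}. The sketch below is a variant of the function above under that assumption; get_charts3 is a hypothetical name, and the data is the same four-row example built with tibble().

```r
library(dplyr)
library(ggplot2)

dataframe <- tibble(
  id    = c(1, 2, 3, 4),
  group = c("a", "b", "c", "c"),
  value = c(200, 400, 120, 300)
)

get_charts3 <- function(data, mygroup) {
  # {{ mygroup }} captures the bare column name and injects it in one step
  df_agg <- data %>%
    group_by({{ mygroup }}) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n(),
              .groups = "drop")
  ggplot(df_agg, aes(x = count, y = mean,
                     color = {{ mygroup }}, group = {{ mygroup }})) +
    geom_point() +
    geom_line()
}

p <- get_charts3(dataframe, group)
```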