I am working with this time series and plotting a smoothed mean, but for some reason I cannot get the confidence band to appear. I tried setting level=0.95 in geom_smooth(), but still nothing happens.
data=https://github.com/gonzalodqa/timeseries
months_order <- c(7:12,1:6)
dates <- make_datetime(c(rep(3,6), rep(4,6)), months_order)
t %>%
mutate(datetime = make_datetime(year, month, day, hour, minute, second)) %>%
filter(datetime >= make_datetime(2018,7), datetime < make_datetime(2020,7)) %>%
group_by(year, month) %>%
mutate(dummy = month(datetime) == 7 & datetime == min(datetime)) %>%
ungroup() %>%
mutate(dummy = cumsum(dummy)) %>%
group_by(dummy) %>%
mutate(datetime = datetime - years(year - 4) - years(month>=7),
years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = years)) +
scale_x_datetime(breaks = dates, labels = month.abb[months_order]) +
labs(colour = "Year")+geom_smooth(aes(x=datetime,y=T42),level=0.95,color="black")+theme_light()+
xlab("Time (Months)")+ylab("Temperature (°C)")+geom_hline(yintercept=5, linetype="dashed", color = "black",lwd=1)+
scale_color_manual(values=c("grey","grey","red"))
I have not specified any formula in geom_smooth(). I tried to google the answer and also searched here, but I cannot seem to find a solution.
Thank you for any input.
I think it's because of the colour, and because the CI is extremely narrow. After adding lwd = 0.5, fill = "steelblue", I can just barely see the CI. Look very carefully and you may see something blue.
t %>%
mutate(datetime = make_datetime(year, month, day, hour, minute, second)) %>%
filter(datetime >= make_datetime(2018,7), datetime < make_datetime(2020,7)) %>%
group_by(year, month) %>%
mutate(dummy = month(datetime) == 7 & datetime == min(datetime)) %>%
ungroup() %>%
mutate(dummy = cumsum(dummy)) %>%
group_by(dummy) %>%
mutate(datetime = datetime - years(year - 4) - years(month>=7),
years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = years)) +
scale_x_datetime(breaks = dates, labels = month.abb[months_order]) +
labs(colour = "Year")+geom_smooth(aes(x=datetime,y=T42),level=0.95,color="black", lwd = 0.5, fill = "steelblue")+theme_light()+
xlab("Time (Months)")+ylab("Temperature (°C)")+geom_hline(yintercept=5, linetype="dashed", color
= "black",lwd=1)+
scale_color_manual(values=c("grey","grey","red"))
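To check the "narrow CI" explanation without squinting at the plot, you can pull the computed ribbon out of the plot object. This is only a sketch on simulated data (the data frame sim and its columns are made up here, not the temperature series), and it uses method = "lm" with a cubic polynomial instead of the default smoother so that it runs without extra packages:

```r
library(ggplot2)

# simulated noisy series with many points (hypothetical stand-in data)
set.seed(1)
n <- 5000
sim <- data.frame(x = seq(0, 10, length.out = n))
sim$y <- sin(sim$x) + rnorm(n, sd = 2)

p <- ggplot(sim, aes(x, y)) +
  geom_line(colour = "grey") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3),
              level = 0.95, colour = "black", fill = "steelblue")

# layer 2 is the smoother; its data holds the ribbon limits ymin / ymax
band <- ggplot_build(p)$data[[2]]
band_width <- mean(band$ymax - band$ymin)
data_range <- diff(range(sim$y))

# share of the y range covered by the confidence band
band_width / data_range
```

With thousands of observations the band covers only a small share of the y range, so it is drawn but hidden under a thick black line.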
I am using the 'flights' data set from the 'nycflights13' package. I need to convert the original code below into code that uses 'stat_summary_2d':
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
group_by(carrier, origin) %>%
summarise(cancel = 100*mean(cancel, na.rm = T)) %>%
ggplot() +
geom_tile(aes(origin, carrier, fill = cancel)) +
geom_text(aes(origin, carrier, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
In the data set the missing value of the 'dep_time' variable means a cancelled flight, and the variable 'cancel' is created by calculating the proportion of cancelled flights over the scheduled flights.
Below is how I apply 'stat_summary_2d' to convert the original code:
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
ggplot() +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flights %>% group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = T)) %>% ungroup,
aes(factor(carrier), origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
When I execute my code, the error is
> Error in `summarize()`: ! Problem while computing `cancel = 100 *
> mean(cancel, na.rm = T)`. i The error occurred in group 1: carrier =
> "9E", origin = "EWR".
Could anyone tell me how to fix this problem? Thank you so much!
The original data is not updated with the new column unless we use %<>% (from magrittr) instead of %>%, so flights inside geom_text() still has no cancel column. It may be easier to create two objects:
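library(dplyr)
library(ggplot2)
library(nycflights13) # provides the flights data set
# add the cancel indicator and store the result
flight1 <- flights %>%
mutate(cancel = 1*(dep_time %>% is.na))
# summarise from the stored object, which has the cancel column
flight2 <- flight1 %>%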
group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = TRUE), .groups = 'drop')
ggplot(flight1) +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flight2, aes(factor(carrier),
origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
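The reason the original attempt fails can be seen on a toy data frame (df and its column a are made up for illustration): a pipe returns a new object and leaves the input untouched, so a column created with mutate() only exists if you store the result.

```r
library(dplyr)

df <- data.frame(a = c(1, NA, 3))

# the result of the pipe is printed, but df itself is not changed
df %>% mutate(missing = is.na(a))
"missing" %in% names(df)

# assigning the result keeps the new column
df2 <- df %>% mutate(missing = is.na(a))
"missing" %in% names(df2)
```

The same thing happens with flights: the mutate() in the main pipe does not add cancel to the flights object that geom_text() reads.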
I am trying to compare different years' variables but I am having trouble plotting them together.
The time series is a temperature series which can be found in https://github.com/gonzalodqa/timeseries as temp.csv
I would like to plot something like the image but I find it difficult to subset the months between the years and then combine the lines in the same plot under the same months
If someone can give some advice or point me in the right direction I would really appreciate it
You can try this way.
The first chart shows all the available temperatures, the second chart is aggregated by month.
In the first chart, we force the same year so that ggplot will plot them aligned, but we separate the lines by colour.
For the second one, we just use month as x variable and year as colour variable.
Note that:
with scale_x_datetime we can hide the year so that no one can see that we forced the year 2020 to every observation
with scale_x_continuous we can show the name of the months instead of the numbers
[just try to run the charts with and without scale_x_... to understand what I'm talking about]
month.abb is a useful default variable for months names.
# read data
df <- readr::read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
# libraries
library(ggplot2)
library(dplyr)
# line chart by datetime
df %>%
# make datetime: force unique year
mutate(datetime = lubridate::make_datetime(2020, month, day, hour, minute, second)) %>%
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = factor(year))) +
scale_x_datetime(breaks = lubridate::make_datetime(2020,1:12), labels = month.abb) +
labs(title = "Temperature by Datetime", colour = "Year")
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
ggplot() +
geom_line(aes(x = month, y = T42, colour = factor(year))) +
scale_x_continuous(breaks = 1:12, labels = month.abb, minor_breaks = NULL) +
labs(title = "Average Temperature by Month", colour = "Year")
In case you want your chart to start from July, you can use this code instead:
months_order <- c(7:12,1:6)
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
# create new groups starting from each July
group_by(neworder = cumsum(month == 7)) %>%
# keep only complete years
filter(n() == 12) %>%
# give new names to groups
mutate(years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# reorder months
mutate(month = factor(month, levels = months_order, labels = month.abb[months_order], ordered = TRUE)) %>%
# plot
ggplot() +
geom_line(aes(x = month, y = T42, colour = years, group = years)) +
labs(title = "Average Temperature by Month", colour = "Year")
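If the cumsum(month == 7) step is not obvious, here is a small sketch on a made-up table (monthly is hypothetical, it only holds year and month): the counter goes up by one at every July, so each July-to-June run gets its own group, and filter(n() == 12) drops the incomplete runs.

```r
library(dplyr)

# hypothetical year/month table: May 2018 to June 2020
monthly <- data.frame(
  year  = c(rep(2018, 8), rep(2019, 12), rep(2020, 6)),
  month = c(5:12, 1:12, 1:6)
)

out <- monthly %>%
  # counter increases at every July, giving one group per July-June run
  group_by(neworder = cumsum(month == 7)) %>%
  # May and June 2018 form an incomplete group and are dropped
  filter(n() == 12) %>%
  # label each group with the calendar years it spans
  mutate(years = paste(unique(year), collapse = " / ")) %>%
  ungroup()

unique(out$years)
```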
EDIT
To have something similar to the first plot but starting from July, you could use the following code:
# libraries
library(ggplot2)
library(dplyr)
library(lubridate)
# custom months order
months_order <- c(7:12,1:6)
# fake dates for plot
# note: choose 4 to include 29 Feb which exist only in leap years
dates <- make_datetime(c(rep(3,6), rep(4,6)), months_order)
# line chart by datetime
df %>%
# create date time
mutate(datetime = make_datetime(year, month, day, hour, minute, second)) %>%
# filter years of interest
filter(datetime >= make_datetime(2018,7), datetime < make_datetime(2020,7)) %>%
# create increasing group after each july
group_by(year, month) %>%
mutate(dummy = month(datetime) == 7 & datetime == min(datetime)) %>%
ungroup() %>%
mutate(dummy = cumsum(dummy)) %>%
# force unique years and create custom name
group_by(dummy) %>%
mutate(datetime = datetime - years(year - 4) - years(month>=7),
years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# plot
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = years)) +
scale_x_datetime(breaks = dates, labels = month.abb[months_order]) +
labs(title = "Temperature by Datetime", colour = "Year")
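The date shifting in the mutate() step can be tried on a few hand-picked dates. This is just a sketch: the vector x is made up, and the logical is wrapped in as.integer() here to be explicit about what is passed to years(). Months from July onwards land in year 3 and months up to June in year 4, so every July-to-June season sits on one common axis.

```r
library(lubridate)

# four sample timestamps (hypothetical): Jul 2018, Jan 2019, Jul 2019, Jun 2020
x <- make_datetime(c(2018, 2019, 2019, 2020), c(7, 1, 7, 6), 15)

# move every date to year 4, then push July-December one year further back
shifted <- x - years(year(x) - 4) - years(as.integer(month(x) >= 7))

year(shifted)
month(shifted)
```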
To order the months differently and aggregate the values over pairs of years, you have to rework your data a bit before plotting:
library(dplyr) # work data
library(ggplot2) # plots
library(lubridate) # date
library(readr) # fetch data
# your data
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df %>%
mutate(date = make_date(year, month,day)) %>%
# reorder month
group_by(month_2 = factor(as.character(month(date, label = T, locale = Sys.setlocale("LC_TIME", "English"))),
levels = c('Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar','Apr','May','Jun')),
# group years as you like
year_2 = ifelse( year(date) %in% (2018:2019), '2018/2019', '2020/2021')) %>%
# you can put whatever aggregation function you need
summarise(val = mean(T42, na.rm = T)) %>%
# plot it!
ggplot(aes(x = month_2, y = val, color = year_2, group = year_2)) +
geom_line() +
ylab('T42') +
xlab('month') +
theme_light()
A slightly different solution that avoids the trick of forcing all dates to 2020.
library(tidyverse)
library(lubridate)
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df <- df |>
filter(year %in% c(2018, 2019, 2020)) |>
mutate(year = factor(year),
# zero-pad to two digits so the labels sort chronologically
month = sprintf("%02d", month),
day = sprintf("%02d", day),
month_day = paste0(month, "-", day))
df |> ggplot(aes(x=month_day, y=T42, group=year, col=year)) +
geom_line() +
scale_x_discrete(breaks = c("01-01", "02-01", "03-01", "04-01", "05-01", "06-01", "07-01", "08-01", "09-01", "10-01", "11-01", "12-01"))
I am trying to reproduce a map I found here: http://zevross.com/blog/2018/10/02/creating-beautiful-demographic-maps-in-r-with-the-tidycensus-and-tmap-packages/
I am using RStudio and am running the following code:
library(ggplot2) # For plotting
library(tidycensus) # For downloading Census data
library(tmap) # For creating tmap
library(tmaptools) # For reading and processing spatial data related to tmap
library(dplyr) # For data wrangling
library(sf) # For reading, writing and working with spatial objects
census_api_key("enter your API key here", overwrite = TRUE)
dat12 <- get_acs("county", table = "B27001", year = 2012,
output = "tidy", state = NULL, geometry = FALSE) %>%
rename(`2012` = estimate) %>%
select(-NAME, -moe)
dat16 <- get_acs("county", table = "B27001", year = 2016,
output = "tidy", state = NULL, geometry = TRUE, shift_geo = TRUE) %>%
rename(`2016` = estimate) %>%
select(-moe)
dat <- left_join(dat16, dat12, by = c("GEOID", "variable"))
st_geometry(dat) <- NULL # This drops the geometry and leaves a table
head(dat)
dat <- mutate(dat,
cat = case_when(
variable %in% paste0("B27001_0",
c("09","12","37","40")) ~ "pop1834",
variable %in% paste0("B27001_0",
c("11","14","39","42")) ~ "pop1834ni")) %>%
filter(!is.na(cat))
# Create long version
dat <- tidyr::gather(dat, year, estimate, c(`2012`, `2016`))
# Group the data by our new categories and sum
dat <- group_by(dat, GEOID, NAME, year, cat) %>%
summarize(estimate = sum(estimate)) %>%
ungroup() %>%
tidyr::spread(cat, estimate)
dat <- mutate(dat, est = (pop1834ni/pop1834) * 100) %>%
select(-c(pop1834, pop1834ni)) %>%
tidyr::spread(year, est) %>%
mutate(diff = `2016`-`2012`)
head(dat)
datlong <- select(dat, -diff) %>%
tidyr::gather(year, estimate, c(`2012`, `2016`)) %>%
group_by(year) %>%
mutate(med = round(median(estimate, na.rm = TRUE), 1))
ggplot(datlong, aes(estimate)) +
geom_histogram(fill = "firebrick2",
color = "white", bins = 60) +
xlab("Uninsured adults ages 18-34 by county (%)") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~year, ncol = 1) +
geom_vline(aes(xintercept = med,
group = year), lty = "dashed") +
geom_text(aes(label = paste("Median = ", med), x = med, y = 55))
d10 <- top_n(dat, 10, diff) %>%
mutate(type = "Insured population decreased",
difftemp = diff)
i10 <- top_n(dat, -10, diff) %>%
mutate(type = "Insured population increased",
difftemp = abs(diff))
id10 <- bind_rows(list(i10, d10)) %>%
arrange(desc(difftemp))
ggplot(id10) +
geom_col(aes(x = forcats::fct_reorder(NAME, difftemp),
y = difftemp, fill = type)) +
coord_flip() +
scale_fill_manual(values = c("firebrick2", "cyan4")) +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.title = element_blank()) +
ggtitle("Counties with the greatest change (+/-) in
insured population, ages 18-34, 2012-2016") +
ylab("Difference in % insured (2016 - 2012)") +
xlab("")
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
Up until the last bit of code, the one that begins with shp, everything runs perfectly. Once
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
is run, I get the following error:
Error in select(GEOID, NAME) : object 'GEOID' not found
I have checked dat16 and dat. GEOID and NAME are present there. I am not sure what is wrong with the SELECT function as I have not loaded another library which may interfere with it. Any help would be appreciated.
I see now what was missing, a %>% (pipe) following the 'filter':
shp <- dat16 %>%
filter(variable == "B27001_001") %>% # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(
uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff
)
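For anyone hitting the same message, here is a small reproduction on a toy data frame (toy and its values are made up): without the pipe, the statement ends after filter(), and select(GEOID, NAME) runs on its own with GEOID taken as the data argument, which R cannot find.

```r
library(dplyr)

toy <- data.frame(GEOID = c("01", "02"),
                  NAME = c("A", "B"),
                  variable = c("v1", "v2"))

# without a trailing %>% the statement ends here
step1 <- toy %>% filter(variable == "v1")

# the next line is then a standalone call; capture its error message
err <- tryCatch(select(GEOID, NAME), error = function(e) conditionMessage(e))
err

# with the pipe in place, select() receives the filtered data
fixed <- toy %>% filter(variable == "v1") %>% select(GEOID, NAME)
fixed
```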
I am trying to plot total COVID-19 cases at the country level together with a histogram of daily new cases, to show whether a sustained drop in new cases leads to a 'flattening of the curve' (assuming that is the case).
library(tidyverse)
#clean raw data source
c19 = read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/time-series-19-covid-combined.csv") %>%
mutate(Cases = Confirmed) %>%
mutate(Country = `Country/Region`) %>%
select(Date, Country, Cases, Deaths) %>%
group_by(Date, Country) %>%
summarise(Cases = sum(Cases),
Deaths = sum(Deaths)) %>%
ungroup() %>%
group_by(Country) %>%
mutate(Lagged_Cases = ifelse(is.na(lag(Cases)), 0, lag(Cases))) %>%
mutate(NewCases = Cases - Lagged_Cases) %>%
mutate(IndexDate = ifelse(Lagged_Cases == 0 & Cases > 0, 1, ifelse(Lagged_Cases > 0, 2, 0))) %>%
filter(IndexDate > 0) %>%
mutate(Index = row_number()) %>%
ungroup() %>%
select(-IndexDate) %>%
filter(Country %in% c("US","Korea, South","Sweden")) %>%
inner_join(data.frame(Country = c("US","Korea, South","Sweden"),
Pop = c(328000000,51245707,10230000)))
c19 %>%
ggplot() +
geom_line(aes(x=Index, y=Cases/1000, color=Country), size=2) +
geom_histogram(aes(x=Index, y=NewCases/75, group=Country), stat="identity", alpha=.4) +
#scale_y_continuous(sec.axis = sec_axis(~./data$Cases)) +
facet_wrap(vars(Country), scales="free_y") +
ggtitle("Flattening The Curve?") +
xlab("Days Since First Case") +
ylab("Total Cases (thousands) - Daily New Cases (not to scale)")
I'm having trouble getting a geom_rect to display a shaded area when using facet_wrap and the dplyr do(...) to generate the plots.
NOTE: The issue here may be related to a data structure issue. See this SO question for the current state of play.
The following minimal example uses the ggplot2 packages economics data and the NBER recession dates from the tis package.
Appreciate hints tips and incantations.
library(tis)
library(ggplot2)
library(dplyr) # %>%, group_by, mutate, full_join, do
library(tidyr) # gather
# Prepare NBER recession start end dates.
start <- data.frame(date = as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"),
start= as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"))
end <- data.frame(date = as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"),
end= as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"))
dl <- economics %>%
gather(metric, value, pce:unemploy ) %>%
group_by(metric) %>%
mutate(diff = value - lag(value, default=first(value))) %>%
mutate(pct = diff/value) %>%
gather(transform, value, value:pct ) %>%
full_join(x=., y=start, by=c('date' = 'date')) %>%
full_join(x=., y=end, by=c('date' = 'date')) %>%
mutate(ymin = 0) %>%
mutate(ymax = Inf)
# Check the start end dates are present
dl %>% group_by(metric,transform, start) %>% summarise( count=n())
pl <- dl %>%
do(
plots = ggplot(data=., aes(x = date, y = value)) +
geom_point() +
geom_rect(aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
stat_smooth(method="auto",size=1.5) +
facet_wrap(~transform, scales="free_y")
)
pl[[1,2]]
I have checked that the minimum and maximum dates for each group are the same (NA group is not plotted):
dl %>%
group_by(transform) %>%
summarise(min= min(start, na.rm =TRUE), max = max(start, na.rm =TRUE))
# A tibble: 4 x 3
transform min max
<chr> <date> <date>
1 diff 1970-01-01 2008-01-01
2 pct 1970-01-01 2008-01-01
3 value 1970-01-01 2008-01-01
4 NA 1857-07-01 1960-05-01
Even if it is not the optimal solution, you can hard-code both dates and use annotate() so that the rectangle is drawn only once. geom_rect() draws one rectangle per row of the data, and the stacked rectangles end up opaque. I added alpha = 0.5 for transparency.
pl <- dl %>%
do(
plots = ggplot(data=., aes(x = date, y = value)) +
geom_point() +
annotate('rect', xmin = as.Date("1970-01-01"), xmax = as.Date("2008-01-01"),
ymin = -Inf, ymax = Inf, alpha = 0.5) +
stat_smooth(method="auto",size=1.5) +
facet_wrap(~transform, scales="free_y")
)
pl[[1,2]]
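You can count the rectangles each approach draws by inspecting the built plot. A minimal sketch on made-up data (d, its dates and the shaded interval are placeholders, not the NBER dates):

```r
library(ggplot2)

# hypothetical data: 100 daily rows, each carrying the same start / end
d <- data.frame(x = as.Date("2000-01-01") + 0:99, y = seq_len(100))
d$start <- as.Date("2000-02-01")
d$end   <- as.Date("2000-03-01")

# geom_rect() mapped to the data: one rectangle per row
p_rect <- ggplot(d) +
  geom_rect(aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf),
            alpha = 0.5)

# annotate(): a single rectangle, independent of the data
p_ann <- ggplot(d) +
  annotate("rect", xmin = as.Date("2000-02-01"), xmax = as.Date("2000-03-01"),
           ymin = -Inf, ymax = Inf, alpha = 0.5)

n_rect <- nrow(ggplot_build(p_rect)$data[[1]])
n_ann  <- nrow(ggplot_build(p_ann)$data[[1]])
c(geom_rect = n_rect, annotate = n_ann)
```

One hundred half-transparent rectangles on top of each other look solid, which is why alpha seems to have no effect with geom_rect() here.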
Okay, the issue here is that constructing the data frame is non-trivial. Two outer joins do not provide the required structure.
# Prepare NBER recession start end dates.
recessions <- data.frame(start = as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"),
end= as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"))
# Create the long format data frame
dl <- economics %>%
gather(metric, value, pce:unemploy ) %>%
group_by(metric) %>%
mutate(diff = value - lag(value, default=first(value))) %>%
mutate(pct = diff/value) %>%
gather(transform, value, value:pct ) #%>%
# Build the data frame with start and end dates given in recessions
df1 <- dl %>%
mutate(dummy=TRUE) %>%
left_join(recessions %>% mutate(dummy=TRUE)) %>%
filter(date >= start & date <= end) %>%
select(-dummy)
# Build data frame of all other dates with start=NA and end=NA
df2 <- dl %>%
mutate(dummy=TRUE) %>%
left_join(recessions %>% mutate(dummy=TRUE)) %>%
mutate(start=NA, end=NA) %>%
unique() %>%
select(-dummy)
# Now merge the two. Overwrite NA values with start and end dates
dl <- df2 %>%
left_join(x=., y=df1, by="date") %>%
mutate(date, start = ifelse(is.na(start.y), as.character(start.x), as.character(start.y)),end = ifelse(is.na(end.y), as.character(end.x), as.character(end.y))) %>%
mutate(start=as.Date(start), end=as.Date(end) ) %>%
select(-starts_with("start."),-starts_with("end."),-ends_with(".y")) %>%
setNames(sub("\\.x$", "", names(.))) %>% # strip only a literal ".x" suffix
mutate(ymin = -Inf) %>% #min(value)) %>%
mutate(ymax = Inf) #max(value)) #%>%
# Check the start end dates are present
dl %>% group_by(metric,transform, start, end) %>% summarise( count = n() ) %>% print(n=180)
pl <- dl %>%
group_by(metric) %>%
do(
plots = ggplot(data=., aes(x = date, y = value)) +
geom_point() +
# annotate('rect', xmin = start, xmax = end,
# ymin = ymin, ymax = ymax, alpha = 0.5) +
geom_rect(aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax), na.rm=TRUE) +
stat_smooth(method="auto",size=1.5) +
facet_wrap(~transform, scales="free_y")
)
grid::grid.draw(pl[[1,2]]) # grid.draw() comes from the grid package