How to convert to stat_summary_2d? - r

I am using 'flights' data set from 'nycflights13' package. I was required to convert to code containing 'stat_summary_2d' from the original code:
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
group_by(carrier, origin) %>%
summarise(cancel = 100*mean(cancel, na.rm = T)) %>%
ggplot() +
geom_tile(aes(origin, carrier, fill = cancel)) +
geom_text(aes(origin, carrier, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
In the data set the missing value of the 'dep_time' variable means a cancelled flight, and the variable 'cancel' is created by calculating the proportion of cancelled flights over the scheduled flights.
Below is how I apply 'stat_summary_2d' to convert the original code:
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
ggplot() +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flights %>% group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = T)) %>% ungroup,
aes(factor(carrier), origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
When I execute my code, the error is
> Error in `summarize()`: ! Problem while computing `cancel = 100 *
> mean(cancel, na.rm = T)`. i The error occurred in group 1: carrier =
> "9E", origin = "EWR".
Could anyone tell me how to fix this problem? Thank you so much!

The original data is not updated with the new column unless we use %<>% (from magrittr) instead of %>%, so 'cancel' does not exist in 'flights' when the data for geom_text() is summarised. It may be easier to create two objects:
library(nycflights13)
library(dplyr)
library(ggplot2)
flight1 <- flights %>%
mutate(cancel = 1*(dep_time %>% is.na))
flight2 <- flight1 %>%
group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = TRUE), .groups = 'drop')
ggplot(flight1) +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flight2, aes(factor(carrier),
origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
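One thing to watch with the code above: stat_summary_2d() takes the mean of z by default, so the tile fill ends up on a 0 to 1 scale while the text labels are percentages. Below is a minimal sketch that passes a custom summary through the fun argument so that both use percentages. The small 'toy' data frame and the 'labels' object are made up for illustration and stand in for 'flight1' and 'flight2'.

```r
library(ggplot2)
library(dplyr)

# Made-up stand-in for the flights data: 1 = cancelled, 0 = departed
toy <- data.frame(
  carrier = c("AA", "AA", "AA", "AA", "AA", "AA", "UA", "UA"),
  origin  = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK", "EWR", "EWR"),
  cancel  = c(1, 0, 0, 0, 0, 0, 1, 1)
)

# Percentages per carrier/origin pair, used for the text labels
labels <- toy %>%
  group_by(carrier, origin) %>%
  summarise(cancel = 100 * mean(cancel), .groups = "drop")

p <- ggplot(toy) +
  # fun replaces the default mean so the fill is also a percentage
  stat_summary_2d(aes(carrier, origin, z = cancel),
                  fun = function(z) 100 * mean(z)) +
  geom_text(data = labels,
            aes(carrier, origin, label = round(cancel, 1)),
            col = "blue", size = 5) +
  scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
  theme_bw()
```

Printing p draws the tiles; the same fun argument should work unchanged with the real 'flight1' data.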

Related

Logit-Normal distribution in R

I am trying to replicate the published code for ML-NMR from the multinma package, which is published here:
https://cran.r-project.org/web/packages/multinma/vignettes/example_plaque_psoriasis.html#ref-methods_paperl.
When I get to the following steps, I run into problems applying the dlogitnorm function from the logitnorm package. I assume this is due to package dependencies. I've replaced the last line of the code, since the function takes 'mu' and 'sigma' as arguments instead of 'mean' and 'sd'. However, when I produce the histogram, it doesn't match the published one. Any ideas on how to match the published histogram?
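library(dplyr) # for group_by(), mutate(), recode()
library(tidyr) # for pivot_longer(), unnest()
library(ggplot2) # for plotting
library(multinma)
library(logitnorm) # required to use logit-Normal distribution
# Get mean and sd of covariates in each study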
ipd_summary <- pso_ipd %>%
group_by(studyc) %>%
summarise_at(vars(weight, durnpso, bsa), list(mean = mean, sd = sd, min = min, max = max)) %>%
pivot_longer(weight_mean:bsa_max, names_sep = "_", names_to = c("covariate", ".value")) %>%
# Assign distributions
mutate(dist = recode(covariate,
bsa = "dlogitnorm",
durnpso = "dgamma",
weight = "dgamma")) %>%
# Compute density curves
group_by(studyc, covariate) %>%
mutate(value = if_else(dist == "dlogitnorm",
list(seq(0, 1, length.out = 101)),
list(seq(min*0.8, max*1.2, length.out = 101)))) %>%
unnest(cols = value) %>%
# Note: this line was edited from the original code to work around an error caused by dlogitnorm(), which takes 'mu' and 'sigma' as arguments
# mutate(dens = eval(call(first(dist), x = value, mean = first(mean), sd = first(sd))))
mutate(dens = ifelse(dist != "dlogitnorm", eval(call(first(dist), x = value, mean = first(mean), sd = first(sd))), NA),
dens = ifelse(dist == "dlogitnorm", eval(call(first(dist), x = value, mu = first(mean), sigma = first(sd), log = FALSE)), dens))
# Plot histograms and assumed densities
pso_ipd %>%
pivot_longer(c(weight, durnpso, bsa), names_to = "covariate", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(aes(y = stat(density)),
binwidth = function(x) diff(range(x)) / nclass.Sturges(x),
boundary = 0,
fill = "grey50") +
geom_line(aes(y = dens), data = ipd_summary,
colour = "darkred", size = 0.5) +
facet_wrap(~studyc + covariate, scales = "free", ncol = 3) +
theme_multinma()

Creating a geographic file for use with tmap and coming up with error when coding shapefile

I am trying to reproduce a map I found here: http://zevross.com/blog/2018/10/02/creating-beautiful-demographic-maps-in-r-with-the-tidycensus-and-tmap-packages/
I am using RStudio and am running the following code:
library(ggplot2) # For plotting
library(tidycensus) # For downloading Census data
library(tmap) # For creating tmap
library(tmaptools) # For reading and processing spatial data related to tmap
library(dplyr) # For data wrangling
library(sf) # For reading, writing and working with spatial objects
census_api_key("enter your API key here", overwrite = TRUE)
dat12 <- get_acs("county", table = "B27001", year = 2012,
output = "tidy", state = NULL, geometry = FALSE) %>%
rename(`2012` = estimate) %>%
select(-NAME, -moe)
dat16 <- get_acs("county", table = "B27001", year = 2016,
output = "tidy", state = NULL, geometry = TRUE, shift_geo = TRUE) %>%
rename(`2016` = estimate) %>%
select(-moe)
dat <- left_join(dat16, dat12, by = c("GEOID", "variable"))
st_geometry(dat) <- NULL # This drops the geometry and leaves a table
head(dat)
dat <- mutate(dat,
cat = case_when(
variable %in% paste0("B27001_0",
c("09","12","37","40")) ~ "pop1834",
variable %in% paste0("B27001_0",
c("11","14","39","42")) ~ "pop1834ni")) %>%
filter(!is.na(cat))
# Create long version
dat <- tidyr::gather(dat, year, estimate, c(`2012`, `2016`))
# Group the data by our new categories and sum
dat <- group_by(dat, GEOID, NAME, year, cat) %>%
summarize(estimate = sum(estimate)) %>%
ungroup() %>%
tidyr::spread(cat, estimate)
dat <- mutate(dat, est = (pop1834ni/pop1834) * 100) %>%
select(-c(pop1834, pop1834ni)) %>%
tidyr::spread(year, est) %>%
mutate(diff = `2016`-`2012`)
head(dat)
datlong <- select(dat, -diff) %>%
tidyr::gather(year, estimate, c(`2012`, `2016`)) %>%
group_by(year) %>%
mutate(med = round(median(estimate, na.rm = TRUE), 1))
ggplot(datlong, aes(estimate)) +
geom_histogram(fill = "firebrick2",
color = "white", bins = 60) +
xlab("Uninsured adults ages 18-34 by county (%)") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~year, ncol = 1) +
geom_vline(aes(xintercept = med,
group = year), lty = "dashed") +
geom_text(aes(label = paste("Median = ", med), x = med, y = 55))
d10 <- top_n(dat, 10, diff) %>%
mutate(type = "Insured population decreased",
difftemp = diff)
i10 <- top_n(dat, -10, diff) %>%
mutate(type = "Insured population increased",
difftemp = abs(diff))
id10 <- bind_rows(list(i10, d10)) %>%
arrange(desc(difftemp))
ggplot(id10) +
geom_col(aes(x = forcats::fct_reorder(NAME, difftemp),
y = difftemp, fill = type)) +
coord_flip() +
scale_fill_manual(values = c("firebrick2", "cyan4")) +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.title = element_blank()) +
ggtitle("Counties with the greatest change (+/-) in
insured population, ages 18-34, 2012-2016") +
ylab("Difference in % insured (2016 - 2012)") +
xlab("")
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
Up until the last bit of code, the one that begins with shp, everything runs perfectly. Once
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
is run, I get the following error:
Error in select(GEOID, NAME) : object 'GEOID' not found
I have checked dat16 and dat. GEOID and NAME are present in both. I am not sure what is wrong with the select() function, as I have not loaded another library that might interfere with it. Any help would be appreciated.
I see now what was missing, a %>% (pipe) following the 'filter':
shp <- dat16 %>%
filter(variable == "B27001_001") %>% # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(
uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff
)
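To see why a missing pipe produces this particular error, here is a minimal sketch with a small made-up data frame ('toy' is hypothetical and only stands in for dat16). Without the pipe, select(GEOID, NAME) is evaluated as a statement of its own, so R looks for an object called GEOID to use as the data argument.

```r
library(dplyr)

# Hypothetical stand-in for dat16
toy <- data.frame(
  GEOID    = c("01001", "01003"),
  NAME     = c("County A", "County B"),
  variable = c("B27001_001", "B27001_002")
)

# No pipe in front: select() receives GEOID as its data argument,
# and no object with that name exists in the workspace
msg <- tryCatch(select(GEOID, NAME),
                error = function(e) conditionMessage(e))
msg

# With the pipe, the filtered data frame is passed in as the first argument
ok <- toy %>%
  filter(variable == "B27001_001") %>%
  select(GEOID, NAME)
ok
```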

Plotting multiple plots with two discrete variables - how to include all discrete variables in both axes

I have a dataset that looks like this:
test<-data.frame("M"=c("a","b","c","a","b","b","c","a","b","c"),
"N"=c(1,3,4,6,6,7,7,8,8,8),
"X"=c(0,1,0,1,1,0,1,0,1,1),
"Y"=c(1,1,0,0,1,0,1,1,1,0))
I'm making a simple plot with X and Y on the y axis and M on the x axis, where each cell is colored if the value of X or Y is 1 and left empty if it is 0. I repeat this for each category of N (the categories are 1 to 5, 6, 7 and 8), then stack all the plots together. Right now, I'm doing this with the following code.
test <- test[order(test$N),]
test1 <- test[c(1:3),]
test2 <- test[c(4:5),]
test3 <- test[c(6:7),]
test4 <- test[c(8:10),] # I'm doing this to "separate" categories of `N` manually
p1 <- test1[,c(1,3:4)] %>%
gather(col_name, value, -M) %>%
ggplot(aes(factor(M), col_name, fill = value == 1))+
geom_tile(colour = 'black')+
scale_fill_manual(values = c('FALSE' = 'white', 'TRUE' = 'red'))
p2 <- test2[,c(1,3:4)] %>%
gather(col_name, value, -M) %>%
ggplot(aes(factor(M), col_name, fill = value == 1))+
geom_tile(colour = 'black')+
scale_fill_manual(values = c('FALSE' = 'white', 'TRUE' = 'yellow'))
p3 <- test3[,c(1,3:4)] %>%
gather(col_name, value, -M) %>%
ggplot(aes(factor(M), col_name, fill = value == 1))+
geom_tile(colour = 'black')+
scale_fill_manual(values = c('FALSE' = 'white', 'TRUE' = 'green'))
p4 <- test4[,c(1,3:4)] %>%
gather(col_name, value, -M) %>%
ggplot(aes(factor(M), col_name, fill = value == 1))+
geom_tile(colour = 'black')+
scale_fill_manual(values = c('FALSE' = 'white', 'TRUE' = 'blue'))
grid.arrange(p1, p2, p3, p4, ncol = 1)
I'm attaching an image of what I have right now. I want to fix these plots so that all four have the same levels of M on the x axis. Right now only p1 and p4 show all three levels (a, b and c); I want to add c to p2 and a to p3 so that all x axes are identical. Can anyone give me suggestions on how to do this?
(Also, I suspect that the way I'm plotting things is not the quickest or easiest way to go; if anyone has suggestions on how to improve it, that would be really helpful!)
To continue using grid.arrange(), instead of facet_wrap(), do the following:
Make M a factor:
test$M <- factor(test$M)
Add the following to each of your plots:
scale_x_discrete(limits = levels(test$M))
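A minimal sketch of that suggestion for a single subset, using the 'test' data frame from the question. Selecting the rows with N == 6 and the name 'sub' are choices made here for illustration; that subset only contains M = "a" and "b", so it shows the effect of the limits.

```r
library(ggplot2)
library(tidyr)

test <- data.frame(M = c("a", "b", "c", "a", "b", "b", "c", "a", "b", "c"),
                   N = c(1, 3, 4, 6, 6, 7, 7, 8, 8, 8),
                   X = c(0, 1, 0, 1, 1, 0, 1, 0, 1, 1),
                   Y = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0))
test$M <- factor(test$M)

# Rows with N == 6 only contain M = "a" and "b"
sub <- pivot_longer(test[test$N == 6, c("M", "X", "Y")], c(X, Y),
                    names_to = "col_name", values_to = "value")

p2 <- ggplot(sub, aes(M, col_name, fill = value == 1)) +
  geom_tile(colour = "black") +
  scale_fill_manual(values = c("FALSE" = "white", "TRUE" = "yellow")) +
  # Force all three levels onto the x axis, even those absent from this subset
  scale_x_discrete(limits = levels(test$M))
```

The column for "c" is then drawn as an empty slot, which keeps the x axes aligned when the plots are stacked with grid.arrange().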
One approach is to use facets, after grouping the values with a small trick so that the data does not have to be split into separate data frames. Here is the code (the colors will be the same across facets, based on the TRUE/FALSE values):
library(tidyverse)
#Code
test %>% mutate(Var=lead(N)) %>%
mutate(Diff=Var-N,Diff=ifelse(row_number()==1,0,Diff)) %>%
mutate(Group=ifelse(Diff==0,N,NA)) %>%
fill(Group) %>% select(-c(N,Var,Diff)) %>%
group_by(Group) %>% mutate(NG=paste0('p',cur_group_id())) %>% ungroup() %>%
select(-Group) %>%
pivot_longer(cols = -c(NG,M)) %>%
ggplot(aes(factor(M), name, fill = value == 1,group=value))+
geom_tile(colour = 'black')+
facet_wrap(.~NG,ncol = 1)+
scale_fill_manual('value',values=c('tomato','cyan3'))+
xlab('M')
The other option would be patchwork with a customized function:
library(tidyverse)
library(patchwork)
#Code
data <- test %>% mutate(Var=lead(N)) %>%
mutate(Diff=Var-N,Diff=ifelse(row_number()==1,0,Diff)) %>%
mutate(Group=ifelse(Diff==0,N,NA)) %>%
fill(Group) %>% select(-c(N,Var,Diff)) %>%
group_by(Group) %>% mutate(NG=paste0('p',cur_group_id())) %>% ungroup() %>%
select(-Group) %>%
mutate(M=factor(M,levels = unique(M),ordered = T)) %>%
pivot_longer(cols = -c(NG,M))
#List
List <- split(data,data$NG)
#Function
myfun <- function(x)
{
#Test for color
val <- unique(x$NG)
#Conditioning for color
if(val=='p1') {vcolor=c('FALSE' = 'white', 'TRUE' = 'red')} else
if(val=='p2') {vcolor=c('FALSE' = 'white', 'TRUE' = 'yellow')} else
if(val=='p3') {vcolor=c('FALSE' = 'white', 'TRUE' = 'green')} else
{vcolor=c('FALSE' = 'white', 'TRUE' = 'blue')}
#Update data
x <- x %>% mutate(M=factor(M,levels = c('a','b','c'),ordered = T)) %>% complete(M=M)
#Plot
G <- ggplot(x,aes(factor(M), name, fill = (value == 1 & !is.na(value))))+
geom_tile(colour = 'black')+
scale_fill_manual('value',values=vcolor)+
xlab('M')+
scale_y_discrete(limits=c('X','Y'))+
theme_bw()+
ggtitle(val)
return(G)
}
#Apply
Lplot <- lapply(List,myfun)
#Wrap
GF <- wrap_plots(Lplot,ncol = 1)
Something like this?
test<-data.frame("M"=c("a","b","c","a","b","b","c","a","b","c"),
"N"=c(1,3,4,6,6,7,7,8,8,8),
"X"=c(0,1,0,1,1,0,1,0,1,1),
"Y"=c(1,1,0,0,1,0,1,1,1,0))
library(tidyverse)
test = mutate(test, N2 = cut(N, breaks = c(0,5:100)))
m = pivot_longer(test, c(X, Y))
ggplot(m, aes(M, name,fill=factor(value))) +
geom_tile(colour = 'black') +
facet_wrap(~N2, scales = 'free') +
scale_fill_manual(values = c(`0` = 'white', `1` = 'red'))

ggplot error using rmarkdown - object ´percent´ not found

So I have been making this scatterplot using ggplot in R.
By using this code as listed below in a regular r script in RStudio I am able to produce the plot that I want to without any errors.
The problem is when I am trying to use the same code in a chunk using rmarkdown to knit to PDF.
I get an error saying: Error in check_breaks_labels(breaks, labels): object percent not found.
Any suggestions? I hope the reproducible example is OK.
library(tidyquant)
library(timetk)
library(ggplot2)
SPY <- tq_get("SPY", from = '2010-01-01',
to = "2020-04-04",
get = "stock.prices")
FXI <- tq_get("FXI", from = '2010-01-01',
to = "2020-04-04",
get = "stock.prices")
QQQ <- tq_get("QQQ", from = '2010-01-01',
to = "2020-04-04",
get = "stock.prices")
SPY_monthly_returns <- SPY %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
col_rename = "SPY_ret")
FXI_monthly_returns <- FXI %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
col_rename = "FXI_ret")
QQQ_monthly_returns <- QQQ %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
col_rename = "QQQ_ret")
SPY_monthly_mean_ret <- SPY_monthly_returns %>%
select(SPY_ret) %>%
.[[1]] %>%
mean(na.rm = TRUE)
FXI_monthly_mean_ret <- FXI_monthly_returns %>%
select(FXI_ret) %>%
.[[1]] %>%
mean(na.rm = TRUE)
QQQ_monthly_mean_ret <- QQQ_monthly_returns %>%
select(QQQ_ret) %>%
.[[1]] %>%
mean(na.rm = TRUE)
SPY_monthly_sd_ret <- SPY_monthly_returns %>%
select(SPY_ret) %>%
.[[1]] %>%
sd()
FXI_monthly_sd_ret <- FXI_monthly_returns %>%
select(FXI_ret) %>%
.[[1]] %>%
sd()
QQQ_monthly_sd_ret <- QQQ_monthly_returns %>%
select(QQQ_ret) %>%
.[[1]] %>%
sd()
d <- data.frame(meanret = c(SPY_monthly_mean_ret,FXI_monthly_mean_ret,QQQ_monthly_mean_ret), sd = c(SPY_monthly_sd_ret,FXI_monthly_sd_ret,QQQ_monthly_sd_ret), names = c("SPY","FXI","QQQ"))
ggplot(d, aes(sd,meanret, color= ticker)) +
geom_point(size=1) + geom_text(aes(label=names)) +
ggtitle("Monthly Risk-Return Plot") + xlab("Volatility") +
ylab("Mean Return") + theme_bw() +
scale_y_continuous(label = percent, limits = c(0, 0.02)) +
scale_x_continuous(label = percent, limits = c(0, 0.08))
The solution was to require(scales) and after that specify "ticker" as I had not done that in my code chunk.
Both were suggested really quickly which is very helpful to me who is learning while writing my bachelor thesis. Thanks a lot!
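The fix described above can be sketched as follows. The numbers in d are made up so that the example runs without downloading prices, and color is mapped to the existing names column rather than to a separate ticker variable; both are assumptions for illustration. Writing scales::percent avoids depending on scales being attached in the knitr session.

```r
library(ggplot2)

# Made-up values standing in for the computed means and standard deviations
d <- data.frame(
  meanret = c(0.010, 0.004, 0.015),
  sd      = c(0.040, 0.060, 0.050),
  names   = c("SPY", "FXI", "QQQ")
)

p <- ggplot(d, aes(sd, meanret, color = names)) +
  geom_point(size = 1) +
  geom_text(aes(label = names)) +
  ggtitle("Monthly Risk-Return Plot") +
  xlab("Volatility") +
  ylab("Mean Return") +
  theme_bw() +
  # percent is a function from scales, referenced here with its namespace
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.02)) +
  scale_x_continuous(labels = scales::percent, limits = c(0, 0.08))
```

Calling library(scales) at the top of the chunk and writing labels = percent should work the same way.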
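percent is a function from the scales package, so quoting it does not help: a quoted "percent" is treated as a literal label rather than as a formatter. Attach scales in the chunk, or refer to the function with its namespace. Also note that d has no ticker column; the column holding the tickers is names.
ggplot(d, aes(sd,meanret, color= names)) +
geom_point(size=1) + geom_text(aes(label=names)) +
ggtitle("Monthly Risk-Return Plot") + xlab("Volatility") +
ylab("Mean Return") + theme_bw() +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.02)) +
scale_x_continuous(labels = scales::percent, limits = c(0, 0.08))
When the name is not quoted, R looks for an object named percent, which only exists once scales is attached. The labels argument also accepts a character vector with one label per break, so you could also do
p <- ggplot(...)
mylabels <- c("0%", "4%", "8%")
p + scale_x_continuous(breaks = c(0, 0.04, 0.08), labels = mylabels, limits = c(0, 0.08))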

How can we wrangle data to obtain the ratio/proportion chart shown

The goal is to produce a visualization showing ratios.
How can we produce such a ratio chart (highlighted) in R?
library(tidyverse)
# Dataset creation
df <- data.frame(cls = c(rep("A",4),rep("B",4)),
grd = c("A1",rep("A2",3),rep(c("B1","B2"), 2)),
typ = c(rep("m",2),rep("o",2),"m","n",rep("p",2)),
pnts = c(rep(1:4,2)))
df
#### Data wrangling
df1 <- df %>%
group_by(cls) %>%
summarise(cls_pct = sum(pnts))
df1
df2 <- df %>%
group_by(cls,grd) %>%
summarize(grd_pct = sum(pnts))
df2
df3 <- df %>%
group_by(cls,grd,typ) %>%
summarise(typ_pct = sum(pnts))
df3
#### Attempt to combine all df1,df2,df3
# but mutate and summarise are mixing up leading to wrong results
df3 %>%
group_by(cls,grd) %>%
mutate(grd_pct = sum(typ_pct)) %>%
group_by(cls) %>%
mutate(cls_pct = sum(grd_pct))
Attempt to visualize all the ratios in 1 chart
data %>%
pivot_longer(cols = -c(cls:pnts),
names_to = "per_cat",
values_to = "percent") %>%
ggplot(aes(cls,percent, col = typ, fill = grd)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw()
plot of the same.
EDIT -- added formula version with more useful output for visualization.
ORIG: At this point it may be worth making a function to reduce copying and pasting, but this may get you what you need:
library(tidyverse)
df %>%
group_by(cls) %>%
mutate(per1 = sum(pnts),
per1_pct = per1 / sum(per1)) %>%
group_by(cls, grd) %>%
mutate(per2 = sum(pnts),
per2_pct = per2 / sum(per2)) %>%
group_by(cls, grd, typ) %>%
mutate(per3 = sum(pnts),
per3_pct = per3 / sum(per3)) %>%
ungroup()
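If the ratio you are after is the share of each group within its parent group (an assumption about what the chart is meant to show), a minimal sketch along the same lines could look like this. The column names total, cls_ttl, cls_pct, grd_ttl and grd_pct are made up for illustration.

```r
library(dplyr)

df <- data.frame(cls  = c(rep("A", 4), rep("B", 4)),
                 grd  = c("A1", rep("A2", 3), rep(c("B1", "B2"), 2)),
                 typ  = c(rep("m", 2), rep("o", 2), "m", "n", rep("p", 2)),
                 pnts = c(rep(1:4, 2)))

shares <- df %>%
  # Grand total of points, used as the denominator for the cls share
  mutate(total = sum(pnts)) %>%
  group_by(cls) %>%
  mutate(cls_ttl = sum(pnts),
         cls_pct = cls_ttl / total) %>%
  # Share of each grd within its cls
  group_by(cls, grd) %>%
  mutate(grd_ttl = sum(pnts),
         grd_pct = grd_ttl / cls_ttl) %>%
  ungroup()

shares
```

Each denominator is the total of the enclosing group, so the shares of the sub-groups within one parent add up to 1.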
EDIT: Here's a general function to calculate the stats for a given grouping, making it easier to combine a few groupings together in long format better suited for visualization.
df_sum <- function(df, level, ...) {
df %>%
group_by(...) %>%
summarize(grp_ttl = sum(pnts)) %>%
mutate(ttl = sum(grp_ttl),
pct = grp_ttl / ttl) %>%
ungroup() %>%
mutate(level = {{ level }} )
}
df_sum(df, level = 1, cls) %>%
bind_rows(df_sum(df, level = 2, cls, grd)) %>%
bind_rows(df_sum(df, level = 3, cls, grd, typ)) %>%
mutate(label = coalesce(as.character(typ), # This grabs the first non-NA
as.character(grd),
as.character(cls))) -> df_summed
df_summed %>%
ggplot(aes(level, grp_ttl)) +
geom_col(color = "white") +
geom_text(aes(label = paste0(label, "\n", grp_ttl, "/", ttl)),
color = "white",
position = position_stack(vjust = 0.5)) +
scale_x_reverse() + # To make level 1 at the top
coord_flip() # To switch from vertical to horizontal orientation
