Highlight positions without data in facet_wrap ggplot - r

When facetting barplots in ggplot the x-axis includes all factor levels. However, not all levels may be present in each group. In addition, zero values may be present, so from the barplot alone it is not possible to distinguish between x-axis values with no data and those with zero y-values. Consider the following example:
library(tidyverse)
set.seed(43)
site <- c("A","B","C","D","E") %>% sample(20, replace=T) %>% sort()
year <- c("2010","2011","2012","2013","2014","2010","2011","2012","2013","2014","2010","2012","2013","2014","2010","2011","2012","2014","2012","2014")
isZero = rbinom(n = 20, size = 1, prob = 0.40)
value <- ifelse(isZero==1, 0, rnorm(20,10,3)) %>% round(0)
df <- data.frame(site,year,value)
ggplot(df, aes(x=year, y=value)) +
geom_bar(stat="identity") +
facet_wrap(~site)
This is fish census data: not all sites were fished in every year, and sometimes a site was fished but no fish were caught, hence the need to differentiate between the two situations. For example, there was no catch at site C in 2010 and the site was not fished in 2011, and the reader cannot tell the difference. I would like to add something like "no data" to the plot for 2011. Maybe it is possible to fill in the rows where data is missing, generate another column with the desired text, and then add this text via geom_text?

So here is an example of your proposed method:
# Tabulate sites vs year, take zero entries
tab <- table(df$site, df$year)
idx <- which(tab == 0, arr.ind = T)
# Build new data.frame
missing <- data.frame(site = rownames(tab)[idx[, "row"]],
year = colnames(tab)[idx[, "col"]],
value = 1,
label = "N.D.") # For 'no data'
ggplot(df, aes(year, value)) +
geom_col() +
geom_text(data = missing, aes(label = label)) +
facet_wrap(~site)
Alternatively, you could also let the facets omit unused x-axis values:
ggplot(df, aes(x=year, y=value)) +
geom_bar(stat="identity") +
facet_wrap(~site, scales = "free_x")
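The table of missing site/year combinations used in the answer above could also be built with tidyr and dplyr. This is only a sketch on a small hypothetical data frame (the toy values are not from the question); it assumes that a combination counts as "no data" when it does not appear in df at all.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: site B was not fished in 2011
df <- data.frame(
  site  = c("A", "A", "B"),
  year  = c("2010", "2011", "2010"),
  value = c(3, 0, 5)
)

# All site/year combinations, minus the ones actually observed
missing <- df %>%
  expand(site, year) %>%
  anti_join(df, by = c("site", "year")) %>%
  mutate(value = 1, label = "N.D.")  # y position and text for geom_text
```

The resulting missing data frame can be passed to geom_text(data = missing, aes(label = label)) exactly as in the answer.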

Related

Add labels for selected observations in ggplot2 histogram at the same height as the bins

I'd like to add an "id" annotation to certain observations in a histogram.
So far, I'm able to add the annotation with no problem, but I'd like the 'y' position of my annotations to be the count of the bin + 1 (for aesthetic reasons).
This is what I have so far:
library(tidyverse)
library(ggrepel)
selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
values <- rnorm(1000)
plot_df <- tibble(id = paste0("S", 1:1000),
values = values) %>%
mutate(obs_labels = ifelse(id %in% selected_obs, id, NA))
ggplot(plot_df, aes(values)) +
geom_histogram(binwidth = 0.3, color = "white") +
geom_label_repel(aes(label = obs_labels, y = 100))
I've seen multiple answers dealing with annotating the count for each bin using geom_text(stat = "count", aes(y = ..count.., label = ..count..)).
Based on that, I've tried these two work-arounds, but no success:
geom_label_repel(stat = "count", aes(label = obs_labels, y = ..count..)) yields:
"Error: geom_label_repel requires the following missing aesthetics: label"
geom_label_repel(aes(label = obs_labels, y = ..count..)) yields "Error: Aesthetics must be valid computed stats. Problematic aesthetic(s): y = ..count...
Did you map your stat in the wrong layer?".
Anybody that can shed some light here?
That may be a mildly misleading visualisation, because you are labelling a unique ID, but with the positioning of this label to the count height you are suggesting that this ID was counted that often. Anyways.
The most straightforward option is to manually calculate the bin to which each ID belongs, count the observations in that bin, and then use this data to set the x and y positions of your labels.
Unfortunately, I have to use R online and cannot create a nice reprex, therefore including a screenshot. But the code should be reproducible, as it is running online
library(tidyverse)
library(ggrepel)
selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
values <- rnorm(1000)
plot_df <- tibble(id = paste0("S", 1:1000),
values = values) %>%
mutate(obs_labels = ifelse(id %in% selected_obs, id, NA),
bins = as.factor( as.numeric( cut(values, 30)))) # cutting into 30 bins
label_df<- plot_df %>% filter(id %in% selected_obs) %>% left_join(plot_df, by = 'bins') %>%
group_by(values = values.x, obs_labels = obs_labels.x) %>% count
ggplot(plot_df, aes(values)) +
geom_histogram(color = "white") + # removed your bin argument, as to default to 30
geom_label(data = label_df, aes(label = obs_labels, y = n))
The label positions are not quite perfect - this is because I chose to cut into 30 equal bins and the binning may be slightly different between cut and histogram. This may need some tweaking, depending on the size of your bins, and if you include upper/lower margins.
P.S. Credit to cut into equal bins goes to this answer by user pedrosaurio
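One possible way to avoid the mismatch between cut and the histogram binning is to read the bins back from the plot itself with ggplot_build. The following is a sketch rather than a tested drop-in: it assumes the histogram is the first layer of the plot and that the bins are right-closed (the stat_bin default), and the names bins, idx and p_labelled are made up for illustration.

```r
library(ggplot2)

set.seed(0)
plot_df <- data.frame(id = paste0("S", 1:1000), values = rnorm(1000))
selected_obs <- c("S10", "S100", "S245", "S900")

p <- ggplot(plot_df, aes(values)) +
  geom_histogram(binwidth = 0.3, color = "white")

# Bins as computed by ggplot2 for the first layer (xmin, xmax, count)
bins <- ggplot_build(p)$data[[1]]

label_df <- plot_df[plot_df$id %in% selected_obs, ]

# Index of the bin that contains each selected value
breaks <- c(bins$xmin, bins$xmax[nrow(bins)])
idx <- findInterval(label_df$values, breaks,
                    left.open = TRUE, rightmost.closed = TRUE)
label_df$n <- bins$count[idx]

# Labels placed one unit above the top of their bin
p_labelled <- p +
  geom_label(data = label_df, aes(label = id, y = n + 1))
```

Printing p_labelled draws the histogram with the four labels; geom_label_repel from ggrepel could be swapped in for geom_label if labels overlap.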

Avoid repetitive, similar analysis and plots

I have a table with many variables. One of the variables contains year information: from 1999 till 2010.
I need to do for each year the same analysis, for instance, to plot a graph, a histogram, etc.
Currently, I subset the data so that each year goes into a data frame(table) and I do the analysis in turn for each year. This is very inefficient:
library(data.table)
library(ggplot2)
dates <- (sample(seq(as.Date('1999/01/01'), as.Date('2010/01/01'), by="day"), 50, replace = TRUE))
dt<-data.table( YEAR = format.Date(dates,"%Y"),
Var1=sample(0:100, 50, rep=TRUE),
Var2 =sample(0:500, 50, rep=TRUE)
)
year_1999<-dt[YEAR=="1999"]
plot_1999<- ggplot(year_1999, aes (x=Var1))+
geom_line(aes(y=Var2), size=1, color="blue") +
labs(y="V2", x="V1", title="Year 1999")
plot_1999
How can I better write this in a compact way? I suppose I need a function but I have no idea how to.
Instead of repeating the code several times, we can specify the 'YEAR' in facet_wrap
library(ggplot2)
ggplot(dt, aes(x = Var1, y = Var2)) +
geom_line(size = 1, color = "blue") + # constants belong outside aes()
labs(y = "V2", x = "V1") +
facet_wrap(~ YEAR)
Try this if you want to create a separate plot object for each unique year in dt$YEAR:
for (i in unique(dt$YEAR)) {
year <- dt[YEAR==i]
plot <- ggplot(year, aes (x=Var1))+
geom_line(aes(y=Var2), size=1, color="blue") +
labs(y="V2", x="V1", title=paste("Year", i)) # title follows the loop variable
assign(paste("plot", i, sep=""), plot)
}
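As a variation on the loop above, the plots can be collected in a named list instead of being created as separate variables with assign. This is a sketch only: plot_year is a hypothetical helper name, and the toy data uses a plain data.frame with three years rather than the data.table from the question.

```r
library(ggplot2)

set.seed(1)
# Toy stand-in for the question's table (hypothetical values)
dt <- data.frame(
  YEAR = sample(as.character(1999:2001), 50, replace = TRUE),
  Var1 = sample(0:100, 50, replace = TRUE),
  Var2 = sample(0:500, 50, replace = TRUE)
)

# Hypothetical helper: builds the plot for a single year
plot_year <- function(d, yr) {
  ggplot(d[d$YEAR == yr, ], aes(x = Var1, y = Var2)) +
    geom_line(color = "blue") +
    labs(y = "V2", x = "V1", title = paste("Year", yr))
}

years <- sort(unique(dt$YEAR))
plots <- lapply(years, function(yr) plot_year(dt, yr))
names(plots) <- years  # plots are looked up by year
```

A single plot is then available as plots[["1999"]], and the whole list can be passed on to other functions without cluttering the workspace.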

How to graph "before and after" measures using ggplot with connecting lines and subsets?

I'm totally new to ggplot, relatively fresh with R, and want to make a smashing "before-and-after" scatterplot with connecting lines to illustrate the movement in percentages of different subgroups before and after a special training initiative. I've tried some options, but have yet to:
show each individual observation separately (now same values are overlapping)
connect the related before and after measures (x=0 and X=1) with lines to more clearly illustrate the direction of variation
subset the data along class and id using shape and colors
How can I best create a scatter plot using ggplot (or other) fulfilling the above demands?
Main alternative: geom_point()
Here is some sample data and example code using geom_point:
x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1) # 0=before, 1=after
y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95) # percentage of "feelings of peace"
class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1) # 0=multiple days 1=one day
id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6) # id = per individual
df <- data.frame(x,y,class,id)
ggplot(df, aes(x=x, y=y), fill=id, shape=class) + geom_point()
Alternative: scale_size()
I have explored stat_sum() to summarize the frequencies of overlapping observations, but then not being able to subset using colors and shapes due to overlap.
ggplot(df, aes(x=x, y=y)) +
stat_sum()
Alternative: geom_dotplot()
I have also explored geom_dotplot() to clarify the overlapping observations that arise from using geom_point(), as I do in the example below. However, I have yet to understand how to combine the before and after measures into the same plot.
df1 <- df[1:10,] # data before
df2 <- df[11:20,] # data after
p1 <- ggplot(df1, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
p2 <- ggplot(df2, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
grid.arrange(p1,p2, nrow=1) # GridExtra package
Or maybe it is better to summarize data by x, id, class as mean/median of y, filter out ids producing NAs (e.g. ids 3 and 6), and connect the points by lines? So in case if you don't really need to show variability for some ids (which could be true if the plot only illustrates tendencies) you can do it this way:
library(ggplot2)
library(dplyr)
#library(ggthemes)
df <- df %>%
group_by(x, id, class) %>%
summarize(y = median(y, na.rm = T)) %>%
ungroup() %>%
mutate(
id = factor(id),
x = factor(x, labels = c("before", "after")),
class = factor(class, labels = c("multiple days", "one day")) # 0 = multiple days, 1 = one day
) %>%
group_by(id) %>%
mutate(nas = any(is.na(y))) %>%
ungroup() %>%
filter(!nas) %>%
select(-nas)
ggplot(df, aes(x = x, y = y, col = id, group = id)) +
geom_point(aes(shape = class)) +
geom_line(show.legend = F) +
#theme_few() +
#theme(legend.position = "none") +
ylab("Feelings of peace, %") +
xlab("")
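To see which ids the filtering step in this answer removes, the summary can be computed on its own. The sketch below reuses the sample data from the question; the names med and dropped are introduced here for illustration only.

```r
library(dplyr)

x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95)
class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1)
id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6)
df <- data.frame(x, y, class, id)

# Median per id and time point; a group with only NAs yields NA
med <- df %>%
  group_by(x, id, class) %>%
  summarize(y = median(y, na.rm = TRUE), .groups = "drop")

# ids with a missing median either before or after
dropped <- med %>%
  group_by(id) %>%
  filter(any(is.na(y))) %>%
  distinct(id) %>%
  pull(id)
```

With this data, dropped contains ids 3 and 6, which matches the ids named in the answer: neither has a usable value on both sides, so no connecting line could be drawn for them.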
Here's one possible solution for you.
First - to get the color and shapes determined by variables, you need to put these into the aes function. I turned several into factors, so the labs function fixes the labels so they don't appear as "factor(x)" but just "x".
To address multiple points, one solution is to use geom_smooth with method = "lm". This plots the regression line, instead of connecting all the dots.
The option se = FALSE prevents confidence intervals from being plotted - I don't think they add a lot to your plot, but play with it.
Connecting the dots is done by geom_line - feel free to try that as well.
Within geom_point, the option position = position_jitter(width = .1) adds random noise to the x-axis so points do not overlap.
ggplot(df, aes(x=factor(x), y=y, color=factor(id), shape=factor(class), group = id)) +
geom_point(position = position_jitter(width = .1)) +
geom_smooth(method = 'lm', se = FALSE) +
labs(
x = "x",
color = "ID",
shape = 'Class'
)

Drawing several "numeric" lines using ggplot

I have a dataset which contains 200 different groups, each of which can take some value between 0 and 200. I would like to draw a line for every group, so a total of 200 lines, and have the legend be "numeric". I know how to do this with a factor, but can't get it to work with a numeric variable. Not the best example:
library(tidyverse)
df <- data.frame(Day = 1:100)
df <- df %>% mutate(A = Day + runif(100,1,400) + rnorm(100,3,400) + 2500,
B = Day + rnorm(100,2,900) + -5000 ,
C = Day + runif(100,1,50) + rnorm(100,1,1000) -500,
D = (A+B+C)/5 - rnorm(100, 3,450) - 2500)
df <- gather(df, "Key", "Value", -Day)
df$Key1 <- apply(df, 1, function(x) which(LETTERS == x[2]))
ggplot(df, aes(Day, Value, col = Key)) + geom_line() # I would like to keep 4 lines, but with a numeric legend
ggplot(df, aes(Day, Value, col = Key1)) + geom_line() # Not correct lines
ggplot(df, aes(Day, Value)) + geom_line(aes(col = Key1)) # Not correct lines
Likely a duplicate, but I can't find the answer and guess there is something small that is incorrect.
Is this what you mean? I'm not sure since you say you want 200 lines, but in your code you say you want 4 lines.
ggplot(df, aes(Day, Value, group = Key, col=Key1)) + geom_line()
Using group gives you the different lines, using col gives you the different colours.
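A minimal sketch of the group/colour distinction on made-up data (the column values here are hypothetical): the numeric column drives a continuous colour scale, while group keeps one line per key.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: 4 numeric keys, 50 days each
df <- data.frame(
  Day  = rep(1:50, times = 4),
  Key1 = rep(1:4, each = 50)
)
df$Value <- df$Key1 * 10 + rnorm(nrow(df))

# group splits the lines; colour stays numeric (continuous legend)
p <- ggplot(df, aes(Day, Value, group = Key1, colour = Key1)) +
  geom_line()

# Number of separate lines ggplot2 will draw
n_lines <- length(unique(ggplot_build(p)$data[[1]]$group))
```

Without group = Key1, a numeric colour aesthetic does not split the data, so all points would be joined into a single zig-zag line.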

superpose densities, non exclusive subsets

I need to have several density functions onto a single plot. Each density corresponds to a subset of my overall dataset. The subsets are defined by the value taken by one of the variables in the dataset.
Concretely, I would like to draw a density function for 1, 3, and 10 years horizons. Of course, the 10 years horizons includes the shorter ones. Likewise, the 3 year horizon density should be constructed taking data from the last year.
The subsets need to correspond to data[period == 1,], data[period <= 3, ], data[period <= 10,] (that is, the full dataset for the 10-year horizon).
I have managed to do so by adding geom_densitys on top of each other, i.e., by redefining the data each time.
ggplot() +
geom_density(data = data[period <=3,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="red") +
geom_density(data = data[period ==1,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="grey") +
geom_density(data = data, aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="green")
It works fine but I feel like this is not the right way to do it (and indeed, it makes e.g., the creation of a legend cumbersome).
On the other hand, doing like that :
ggplot(data, aes(x=BEST_CUR_EV_TO_EBITDA, color=period)) +
geom_density(alpha=.2, fill="blue")
won't do because then the periods are taken to be mutually exclusive.
Is there a way to specify aes(color) based on the value taken by period where subsets overlap?
Running code:
library(data.table)
library(lubridate)
library(ggplot2)
YEARS <- 10
today <- Sys.Date()
lastYr <- Sys.Date()-years(1)
last3Yr <- Sys.Date()-years(3)
start.date = Sys.Date()-years(YEARS)
date = seq(start.date, Sys.Date(), by=1)
BEST_CUR_EV_TO_EBITDA <- rnorm(length(date), 3,1)
data <- cbind.data.frame(date, BEST_CUR_EV_TO_EBITDA)
data <- cbind.data.frame(data, period = rep(10, nrow(data)))
subPeriods <- function(aDf, from, to, value){
aDf[aDf$date >= from & aDf$date <= to, "period"] = value
return(aDf)
}
data <- subPeriods(data, last3Yr, today, 3)
data <- subPeriods(data, lastYr, today, 1)
data <- data.table(data)
colScale <- scale_colour_manual(
name = "horizon"
, values = c("1 Y" = "grey", "3 Y" = "red", "10 Y" = "green"))
ggplot() +
geom_density(data = data[period <=3,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="red") +
geom_density(data = data[period ==1,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="grey") +
geom_density(data = data, aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="green") +
colScale
One of the ways to deal with dependent grouping is to create an independent grouping based on the existing groups. The way I'd opted to do it below is by creating three new columns (period_one, period_three and period_ten) with mutate function, where
period_one= BEST_CUR_EV_TO_EBITDA values for period==1
period_three= BEST_CUR_EV_TO_EBITDA values for period<=3
period_ten= BEST_CUR_EV_TO_EBITDA values for all periods
These columns were then converted into the long-format using gather function, where the columns (period_one, period_three and period_ten) are stacked in "period" variable, and the corresponding values in the column "val".
library(dplyr)
library(tidyr)
df2 <- data %>%
mutate(period_one=ifelse(period==1, BEST_CUR_EV_TO_EBITDA, NA),
period_three=ifelse(period<=3, BEST_CUR_EV_TO_EBITDA, NA),
period_ten=BEST_CUR_EV_TO_EBITDA) %>%
select(date, starts_with("period_")) %>%
gather(period, val, period_one, period_three, period_ten)
The ggplot is straightforward with long format consisting of independent grouping:
ggplot(df2, aes(val, fill=period)) + geom_density(alpha=.2)
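Another way to get the same independent grouping is to stack the overlapping subsets on top of each other with a label column. This is a sketch under assumptions: the data is simulated here rather than built from dates as in the question, and horizons is a name made up for the stacked table.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Simulated stand-in for the question's data (hypothetical values)
data <- data.frame(
  period = sample(c(1, 3, 10), 300, replace = TRUE),
  BEST_CUR_EV_TO_EBITDA = rnorm(300, 3, 1)
)

# Each horizon gets its own copy of the rows it covers,
# so rows with period == 1 appear three times
horizons <- bind_rows(
  "1 Y"  = filter(data, period == 1),
  "3 Y"  = filter(data, period <= 3),
  "10 Y" = data,
  .id = "horizon"
)
horizons$horizon <- factor(horizons$horizon, levels = c("1 Y", "3 Y", "10 Y"))

p <- ggplot(horizons, aes(BEST_CUR_EV_TO_EBITDA, fill = horizon)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(
    name = "horizon",
    values = c("1 Y" = "grey", "3 Y" = "red", "10 Y" = "green")
  )
```

Because fill is now mapped inside aes, the legend is generated automatically and the manual scale only has to assign the colours.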
