Overlaying boxplot with a lineplot - r

I have some fake data representing the answering times of different users answering an online survey.
The dataset has three variables: the id of the respondent (user), the name of the question (question) and the answering time for each question (time).
library(dplyr)
library(ggplot2)
n <- 1000
dat <- data.frame(user = 1:n,
question = sample(paste("q", 1:4, sep = ""), size = n, replace = TRUE),
time = round(rnorm(n, mean = 10, sd=4), 0)
)
pltSingleRespondent <- function(df, highlightUsers){
df %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') + coord_flip() +
ggtitle("Answering time per question")
}
pltSingleRespondent(dat, c(1, 31) )
I created a function that draws a boxplot of the answering times for each question. Now I'd like to overlay that plot with the answering times of specific respondents (highlightUsers). The following image shows an example:
Can someone please explain to me how to do this?

I think the most direct way to do this is to subset your data within a call to geom_line.
I'll start with a different set of random data, since the sample data in the question does not include all questions for a user.
set.seed(2021)
dat <- expand.grid(user = factor(1:50), question = paste0("q", 1:4))
dat$time <- round(rnorm(200, mean = 10, sd = 4), 0)
dat %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') + coord_flip() +
ggtitle("Answering time per question") +
geom_line(aes(color = user, group = user), size = 2,
data = ~ subset(., user %in% c(1L, 34L)))
You can wrap this in a function however you like. If you're using dplyr, you can use dplyr::filter instead of subset with no other change.
Also, I chose to factor(user), since otherwise ggplot2 treats user as continuous (for color=user). You can choose whether to do this, though without it you may need more wrangling to get a discrete color scale.
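As a minimal sketch of the function version, the layer-level subset can be passed to geom_line as a function of the plot data. The names plot_highlighted and highlight_users are made up for illustration, and linewidth assumes ggplot2 3.4 or later (older versions use size instead).

```r
library(ggplot2)

# Hypothetical wrapper; function and argument names are illustrative only
plot_highlighted <- function(df, highlight_users) {
  ggplot(df, aes(x = question, y = time)) +
    geom_boxplot(fill = "orange") +
    # Layer-level data: keep only the rows of the highlighted users
    geom_line(
      aes(color = user, group = user),
      data = function(d) d[d$user %in% highlight_users, , drop = FALSE],
      linewidth = 2  # assumes ggplot2 >= 3.4; use size = 2 on older versions
    ) +
    coord_flip() +
    ggtitle("Answering time per question")
}

# Same complete design as above: every user answers every question
set.seed(2021)
dat <- expand.grid(user = factor(1:50), question = paste0("q", 1:4))
dat$time <- round(rnorm(200, mean = 10, sd = 4), 0)

p <- plot_highlighted(dat, c(1, 34))
p
```

The boxplot layer still sees all rows; only the line layer is restricted to the chosen users.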

Slightly different approach. Add a column to the data that indicates the highlighted users and map that variable to geom_line. Use scale_color_discrete(na.translate = FALSE) to color only the non-NA values.
library(dplyr)
library(ggplot2)
pltSingleRespondent <- function(df, highlightUsers) {
df %>%
mutate(User = factor(ifelse(user %in% highlightUsers, user, NA))) %>%
ggplot(aes(question, time)) +
geom_boxplot(fill = "orange") +
geom_line(aes(color = User, group = User)) +
ggtitle("Answering time per question") +
scale_color_discrete(na.translate = FALSE) +
coord_flip() +
theme_bw()
}
Using the example data from @r2evans:
pltSingleRespondent(dat, c(1, 34))
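One thing to watch with the ifelse() step: when user is already a factor, ifelse() returns the underlying integer codes rather than the labels. With users labelled 1 to 50 the codes and labels coincide, so the function above works, but with other labels it would not. A small sketch with made-up user names shows the difference:

```r
# Made-up labels to show the behaviour; not part of the original data
user <- factor(c("ann", "bob", "cat"))
highlight <- c("ann", "cat")

# ifelse() drops the factor attributes and keeps the integer codes
codes <- ifelse(user %in% highlight, user, NA)
codes   # 1 NA 3

# Converting to character first keeps the labels
labels <- ifelse(user %in% highlight, as.character(user), NA)
labels  # "ann" NA "cat"
```

If your user ids are not plain integers, wrapping user in as.character() inside the mutate call is probably the safer choice.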

Related

How to plot time intervals on a "one-year" scale illustrating days that are within the interval?

Context, question and reproducible example:
Context: I am trying to compare time intervals on a "one-year" scale to visualize the days covered by each interval regardless of the years.
Problem: geom_segment() draws a straight line between the start and the end, even when the "start" falls after the "end" on the plot. When an interval spans two different years, it highlights the wrong days, i.e. the ones not covered by the interval (see the 1st plot below).
Question: is it possible to obtain the second plot without manually creating the df_exp object? Any tips and tricks are welcome!
library(lubridate)
library(ggplot2)
# dummy dataset
df <- data.frame(
time_int = c(interval("2010-03-01", "2010-06-15"), interval("2015-10-23", "2016-02-20")),
obs = c("A", "B")
)
# current result
ggplot(data = df) +
geom_segment(aes(x = `year<-`(int_start(time_int), 0000),
xend = `year<-`(int_end(time_int), 0000),
y = obs, yend = obs)) +
labs(x = "Current segment output")
# expected output
df_exp <- data.frame(
time_int = c(interval("0000-03-01", "0000-06-15"),
interval("0000-10-23", "0000-12-31"),
interval("0000-01-01", "0000-02-20")),
obs = c("A", "B", "B")
)
ggplot(data = df_exp) +
geom_segment(aes(x = int_start(time_int),
xend = int_end(time_int),
y = obs, yend = obs)) +
labs(x = "Expected segment output")
Possible start:
Even if the output is very close to the expected one, I feel like creating these vectors of explicit days is not very efficient; maybe there is a smarter way.
# Possible solution: make days explicit
library(dplyr)
library(tibble)
purrr::map(.x = split(df, ~obs), .f = ~ seq(int_start(.x$time_int), int_end(.x$time_int), "1 day")) %>%
enframe() %>%
tidyr::unnest(value) %>%
ggplot(data = .) +
geom_point(aes(x = `year<-`(value, 0000), y = name))
Created on 2022-10-18 with reprex v2.0.2
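One possible direction, sketched under some assumptions: store the start and end as plain Date columns instead of lubridate intervals, project both onto a reference year, and split any interval whose projected start falls after its projected end into two segments. The reference year 2000 and the helper name to_ref_year are arbitrary choices for illustration, and the sketch assumes every interval is shorter than one year.

```r
library(ggplot2)

# Same two observations as the dummy dataset, as plain Date columns
df <- data.frame(
  start = as.Date(c("2010-03-01", "2015-10-23")),
  end   = as.Date(c("2010-06-15", "2016-02-20")),
  obs   = c("A", "B")
)

# Project a date onto a reference year (2000 is a leap year, so Feb 29 is valid)
to_ref_year <- function(d) as.Date(format(d, "2000-%m-%d"))

s <- to_ref_year(df$start)
e <- to_ref_year(df$end)
wraps <- s > e  # TRUE when the interval crosses New Year

# Non-wrapping intervals stay whole; wrapping ones become two segments
segments <- rbind(
  data.frame(obs = df$obs[!wraps], x = s[!wraps], xend = e[!wraps]),
  data.frame(obs = df$obs[wraps], x = s[wraps],
             xend = rep(as.Date("2000-12-31"), sum(wraps))),
  data.frame(obs = df$obs[wraps],
             x = rep(as.Date("2000-01-01"), sum(wraps)), xend = e[wraps])
)

p <- ggplot(segments) +
  geom_segment(aes(x = x, xend = xend, y = obs, yend = obs)) +
  labs(x = "Day of year")
p
```

This avoids enumerating every day, at the cost of handling the wrap-around explicitly.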

Is there a way I could plot t = 300, 350, 450, and 500 lines in one graph?

I wanted to plot multiple lines in one graph but I couldn't figure out which code to use. Also, is there a way I could assign colors to each of the lines? I'm new to RStudio and was assigned to pick up someone's work, so I've been doing a lot of trial and error, but I haven't had any luck for the past few days. Hope someone can help me with this! Thank you so much.
ecdf.shift <- function(OUR_threshold, des_cap = 40, nint = 10000){
#create some empty vectors for later use in the loop
ecdf_med = c()
ecdf_obs = c()
for (i in 1:length(OUR_threshold)){
# filter out the OUR threshold data, then select only the capture column and create a ecdf function
ecdf_fun <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
ecdf()
# extract the ecdf data and put in tibble dataframe, then create a linear interpolation of the curve.
ecdf_data <- tibble(TSS_con = environment(ecdf_fun)$x, prob = environment(ecdf_fun)$y)
ecdf_interpol <- approx(x = ecdf_data$TSS_con, y = ecdf_data$prob, n = nint)
# find the indices in x which correspond to the desired capture, then look up the matching probabilities in the y vector. Take the median in case of multiple hits. Store this number at position i, as dictated by the loop index.
ecdf_med[i] <- median(ecdf_interpol$y[(round(ecdf_interpol$x,1) == des_cap)])
# calculate the number of observations when the filtering takes place.
ecdf_obs[i] <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
length()
# Flush the ecdf data. The ecdf is encoded as a function with global parameters, so you want to reset them every time the loop is done to avoid pesky bugs.
rm(ecdf_data)
}
#create a tibble dataframe with all the loop data.
ecdf_out <- tibble(OUR_ratio_cutoff = OUR_threshold, prob = (ecdf_med)*100, nobs = ecdf_obs)
return(ecdf_out)
}
ratio_threshold <- seq(0,115, by = 5)
t = ecdf_MLSS_target <- 400 %>%
ecdf.shift(ratio_threshold, .) %>%
filter(nobs > 2) %>%
ggplot(aes( x = OUR_ratio_cutoff, y = prob)) +
geom_line() +
geom_point() +
theme_bw(base_size = 12) +
theme(panel.grid = element_blank()) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,300, by = 5),
expand = c(0,0)) +
scale_x_continuous(limits = c(0,120),
breaks = seq(0,110, by = 10),
expand = c(0,0)) +
labs(x = "ESS mg TSS/L",
y = "Probability of contactor MLSS > 400 mg TSS/L ")
plot(t)
Easiest would be to loop over your different t values first and bring the resulting data frames into one big data frame, and use this for your plot. Your code is not fully reproducible (it requires data that we do not have, i.e. HRP_rESS_no). So I have stripped down the function to the core - creating a data frame which makes different "lines" depending on your t value. I just used it as slope.
I hope the idea is clear.
library(tidyverse)
ecdf.shift <- function(OUR_threshold, t) {
data.frame(x = OUR_threshold, y = t * OUR_threshold)
}
ratio_threshold <- seq(0, 115, by = 5)
t_df <-
map(1:5, function(t) ecdf.shift(ratio_threshold, t)) %>%
bind_rows(.id = "t")
ggplot(t_df, aes(x, y, color = t)) +
geom_line() +
geom_point()
Created on 2020-05-07 by the reprex package (v0.3.0)
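For the specific values in the question (t = 300, 350, 450, 500) and the request for per-line colors, the same idea might look like the sketch below. Since HRP_rESS_no is not available, ecdf_shift_stub is a made-up stand-in whose formula has no meaning beyond producing four distinct curves; the color names are arbitrary too.

```r
library(ggplot2)

# Stand-in for the real ecdf.shift(); the formula is invented for illustration
ecdf_shift_stub <- function(threshold, des_cap) {
  data.frame(OUR_ratio_cutoff = threshold,
             prob = 100 * threshold / (threshold + des_cap / 10))
}

ratio_threshold <- seq(0, 115, by = 5)
t_values <- c(300, 350, 450, 500)

# One data frame per t value, stacked, with t kept as a discrete label
curves <- do.call(rbind, lapply(t_values, function(t) {
  out <- ecdf_shift_stub(ratio_threshold, t)
  out$t <- factor(t)
  out
}))

p <- ggplot(curves, aes(OUR_ratio_cutoff, prob, color = t)) +
  geom_line() +
  geom_point() +
  # A named vector ties each t value to a fixed color
  scale_color_manual(values = c("300" = "steelblue", "350" = "darkgreen",
                                "450" = "orange", "500" = "firebrick"))
p
```

With the real data you would call your own ecdf.shift(ratio_threshold, t) inside the lapply() instead of the stub.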

How to plot a(n unknown) number of data series as geom_line in same chart

My first Q here, so please go lightly if I'm out of step anywhere.
I'm trying to code R to produce a single chart containing a number of data series lines. The number of data series may vary but will be provided in the data frame. I have tried to rearrange another thread's content to print the geom_line, but not successfully.
The logic is:
#desire to replace loop of 1:5 with ncol(df)
print(ggplot(df,aes(x=time))
for (i in 1:5) {
print (+ geom_line(aes(y=df[,i]))
}
#functioning geom point loops ggplot production:
for (i in 1:5) {
print(ggplot(df,aes(x=time,y=df[,i]))+geom_point())
}
#functioning multi-line ggplot where n is explicit:
ggplot(data=df, aes(x=time), group=1) +
geom_line(aes(y=df$`3`))+
geom_line(aes(y=df$`4`))
The functioning example code produces n number of point charts, 5 in this case. I would like just one chart to contain n line series.
This may be similar to How to plot n dimensional matrix? for which there are currently no relevant answers
Any contributions much appreciated, thanks
You can use gather from the tidyverse to do that.
As you didn't supply sample data, I used mtcars.
I created two data frames, one with 3 variables besides mpg and one with 9. For each of them I plotted all of the variables against mpg.
library(tidyverse)
df3Columns <- mtcars[, 1:4]
df9Columns <- mtcars[, 1:10]
df3Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
df9Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
Edit - using the sample data in comments.
library(tidyverse)
df %>%
rownames_to_column("time") %>%
gather(var, value, -time) %>%
ggplot(aes(time, value, group = var, color = var)) +
geom_line()
Sample data:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
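In current versions of tidyr, gather() is superseded by pivot_longer(). A roughly equivalent sketch on the sample data, assuming tidyr 1.0 or later and taking the row names as the time axis (stored as integers here so the x axis is continuous), could be:

```r
library(tidyr)
library(ggplot2)

df <- structure(list("39083" = c(96, 100, 100),
                     "39090" = c(99, 100, 100),
                     "39097" = c(99, 100, 100)),
                row.names = 3:5, class = "data.frame")

# Keep the row names as an explicit integer time column
df$time <- as.integer(rownames(df))

# Long format: one row per (time, series) pair, however many series there are
long <- pivot_longer(df, cols = -time, names_to = "var", values_to = "value")

p <- ggplot(long, aes(time, value, group = var, color = var)) +
  geom_line()
p
```

The number of series never appears in the code, so the same call works for any number of columns.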
To strictly answer your question, you can simply store your ggplot in a variable and add the geom_line one by one:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
g <- ggplot(df, aes(x = 1:nrow(df)))
for (i in colnames(df))
{
g <- g + geom_line(y = df[,i])
}
g <- g + scale_y_continuous(limits = c(min(df), max(df)))
print(g)
However, this is not a very convenient solution. I would highly recommend reshaping your data frame into the long format that ggplot works best with.
df.ultimate <- data.frame(time = numeric(), value = numeric(), group = character())
for (i in colnames(df))
{
df.ultimate <- rbind(df.ultimate, data.frame(time = 1:nrow(df), value = df[, i], group = i))
}
g <- ggplot(df.ultimate, aes(x = time, y = value, color = group))
g <- g + geom_line()
print(g)
A one-line solution:
ggplot(data.frame(time = rep(1:nrow(df), ncol(df)),
value = as.vector(as.matrix(df)),
group = rep(colnames(df), each = nrow(df))),
aes(x = time, y = value, color = group)) + geom_line()

ggplot2 - Two color series in area chart

I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
And I'm building an area chart over approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending) in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # trust me
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
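Applied to the small spec_trend_pie example from the question, a simpler hand-rolled variant of the same idea is sketched below: order the fill levels so that items of the same input_type sit next to each other, then assign greens to labor and blues to supply. The hex colors are hand-picked for illustration; with 39 descriptions you would want to generate the shades as in the answer above rather than type them out.

```r
library(ggplot2)

spec_trend_pie <- data.frame(
  year = rep(2006:2008, each = 4),
  input_type = rep(c("labor", "labor", "supply", "supply"), times = 3),
  input_desc = rep(c("plumber", "manager", "pipe", "truck"), times = 3),
  value = c(1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4)
)

# Order the fill levels so each input_type forms one contiguous band
desc_order <- unique(spec_trend_pie[order(spec_trend_pie$input_type), "input_desc"])
spec_trend_pie$input_desc <- factor(spec_trend_pie$input_desc, levels = desc_order)

# Illustrative palette: greens for labor, blues for supply
fill_values <- c(plumber = "#31a354", manager = "#a1d99b",
                 pipe = "#3182bd", truck = "#9ecae1")

p <- ggplot(spec_trend_pie, aes(x = year, y = value, fill = input_desc)) +
  geom_area(position = "stack") +
  scale_fill_manual(values = fill_values)
p
```

Because fill is the only grouping variable, each input_desc gets its own ribbon and the "Aesthetics can not vary with a ribbon" error does not arise.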

Plotting a continuous of time data in R

I am trying to plot the distribution of turtle nesting activity over a night using ggplot, but I want to exclude the times from 8am - 6pm. Also, I need the x-axis to start at 7pm and end at 7am.
My code is
ggplot(sub.1) + geom_bar(aes(x = sub.1$time)) + scale_x_continuous(expand = c(0, 0), limits = c(0, 23), breaks = seq(0, 23, 1)) + xlab("Hour") + ylab ("Frequency")
Any assistance would really be appreciated.
Try this. I'll post some comments in there when I get a chance. Hope it helps!
# Example Data
SampleHours <- sample(1:23, 3000, replace = TRUE)
# Keep wanted Hours
IncludedHours <- c(19:23, 1:7)
Index <- SampleHours %in% IncludedHours
# Create dataframe
sub.1 <- data.frame(Hour = SampleHours[Index])
# Reorder the factor levels so the axis runs from 19:00 to 07:00
# (assigning to levels() would relabel the data rather than reorder it)
FactorLevels <- c(19:23, 1:7)
sub.1$Hour <- factor(sub.1$Hour, levels = FactorLevels)
# Plot
library(ggplot2)
ggplot(sub.1) +
geom_bar(aes(x = Hour)) +
xlab("Hour") +
ylab ("Frequency")
EDIT
Changed a part to identify which hours to keep, not which ones to exclude. I think it makes it easier to follow
My approach used dplyr. First, I generated some fake data:
times = sample(seq(0,23), 10000, replace=TRUE)
nest = sample(c(0,1), 10000, replace=TRUE)
data = data.frame(times, nest)
Then, I used dplyr to pipe the results:
library(dplyr)
data %>% filter(times>18 | times<8) %>%
transform(times=factor(times, levels=c(19:23,0:7))) %>%
ggplot() + geom_bar(aes(x=times)) +
xlab("Hour") +
ylab ("Frequency")
The filter call selects the hours; the transform does the same as @William did to create the order 19-7.
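A note on reordering a factor for the axis: factor(x, levels = ...) changes the order of the levels while leaving the data alone, whereas assigning to levels(x) renames the existing levels and therefore changes the data. A tiny sketch with four made-up hours shows the difference:

```r
hours <- c(1, 2, 22, 23)
night_order <- c(22, 23, 1, 2)

# Relabelling: the sorted levels 1, 2, 22, 23 are renamed to 22, 23, 1, 2
relabelled <- factor(hours)
levels(relabelled) <- night_order
as.character(relabelled)  # "22" "23" "1" "2"  (the data changed)

# Reordering: values are untouched, only the level order differs
reordered <- factor(hours, levels = night_order)
as.character(reordered)   # "1" "2" "22" "23"  (the data is intact)
levels(reordered)         # "22" "23" "1" "2"
```

For a bar chart that should run from evening to morning, the second form is the one that keeps each count attached to the right hour.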

Resources