R: identify outliers and mark them in a boxplot

I have the following fake data representing the answering times (in seconds) of different users in an online questionnaire:
library(dplyr)
library(ggplot2)
n <- 1000
dat <- data.frame(user = 1:n,
question = sample(paste("q", 1:10, sep = ""), size = n, replace = TRUE),
time = round(rnorm(n, mean = 10, sd=4), 0)
)
dat %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') +
ggtitle("Answering time per question")
I then plot the answering times as a boxplot for each question. But how can I first add a binary column indicating whether a case is an outlier (defined as lying outside median(time) +/- 3 * mad(time)) within each question?

library(dplyr)
library(ggplot2)
dat %>%
group_by(question) %>%
mutate(outlier = abs(time - median(time)) > 3*mad(time) ) %>%
ungroup() %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') +
geom_point(data = . %>% filter(outlier), color = "red") +
ggtitle("Answering time per question")
Because the data are grouped by question first, each row is compared against the median and MAD of its own question rather than of the whole dataset.
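To make the rule concrete, here is a minimal base-R sketch on a tiny made-up data set (the `toy` values and the helper name `flag_outlier` are assumptions for illustration, not part of the question). Note that `mad()` in R scales the raw median absolute deviation by 1.4826 by default.

```r
# Toy data: two questions with six answering times each (made-up values)
toy <- data.frame(
  question = rep(c("q1", "q2"), each = 6),
  time     = c(8, 9, 10, 11, 12, 40,  5, 6, 7, 8, 9, 10)
)

# Hypothetical helper: TRUE where a value lies more than k MADs from the median
flag_outlier <- function(x, k = 3) abs(x - median(x)) > k * mad(x)

# ave() applies the helper within each question; it returns 0/1 for a
# numeric input, so convert the result back to logical
toy$outlier <- as.logical(ave(toy$time, toy$question, FUN = flag_outlier))

toy[toy$outlier, ]
```

For q1 the median is 10.5 and the MAD is 1.4826 * 1.5, so the threshold is roughly 6.67 and only the value 40 is flagged.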

Related

How to add number of valid observations of each group at each timepoint to my linechart

I have the evolution of the mean values for two groups. But as the number of valid observations changes at each timepoint, I want to add the number of valid values for each group to the graph at each timepoint. The aim is to let the reader see that the means over time are not calculated on the same number of individuals.
mydata<-data.frame(
ID=1:10,
groupe=c(rep("A",5),rep("B",5)),
value1=c(50,49,47,46,44,39,37,36,30,30),
value2=c(43,40,42,36,25,37,36,35,30,28),
value3=c(32,30,38,32,NA,34,36,32,27,NA),
value4=c(24,25,30,NA,NA,30,32,28,NA,28),
value5=c(24,22,NA,NA,NA,25,27,NA,NA,NA)
)
library(dplyr)
library(tidyr)    # pivot_longer()
library(ggplot2)
mydata2<-mydata %>%
group_by(groupe) %>%
summarise(mean_value1 = mean(value1),
mean_value2 = mean(value2),
mean_value3 = mean(value3,na.rm=T),
mean_value4 = mean(value4,na.rm=T),
mean_value5 = mean(value5,na.rm=T)
)
mydata2Lg<-mydata2%>%pivot_longer(
cols = mean_value1 :mean_value5,
names_to = "time",values_to = "mean",
names_prefix = "mean_value"
)
mydata2Lg$groupe<-as.factor(mydata2Lg$groupe)
ggplot(mydata2Lg,aes(x=time, y=mean, group=groupe,color=groupe))+
geom_line(aes(linetype=groupe),size=1)+
geom_point(aes(shape=groupe))
I'm sorry for not giving a clear visual indication of what I want. I hope you understand what I mean.
A typical way to show the uncertainty caused by differing sample sizes is to use error bars or a ribbon to indicate the standard error. This gives a nice visual intuition of the uncertainty introduced by both the spread of the data and the sample size. However, you can also add count labels. You just need to summarize your data appropriately.
For completeness, here is your data represented with both a standard error ribbon and labels of the number of samples at each time point:
library(tidyverse)
mydata %>%
pivot_longer(value1:value5) %>%
group_by(groupe, name) %>%
summarize(count = sum(!is.na(value)),
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE)) %>%
mutate(time = as.numeric(gsub("\\D", "", name)),
upper = mean + sd/sqrt(count),
lower = mean - sd/sqrt(count)) %>%
ggplot(aes(time, mean, color = groupe)) +
geom_ribbon(aes(ymin = lower, ymax = upper, fill = groupe),
color = NA, alpha = 0.2) +
geom_point() +
geom_line() +
geom_label(aes(label = paste0("n = ", count),
y = mean + ifelse(groupe == "A", 1,-1)),
key_glyph = draw_key_blank) +
scale_color_manual(values = c("orangered3", "deepskyblue4")) +
scale_fill_manual(values = c("orangered3", "deepskyblue4")) +
labs(title = 'Mean values for each group over time \u00B1 standard error',
subtitle = expression(italic("Labels show sample size at each point"))) +
theme_light(base_size = 16)
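As a quick sanity check of what the summary step computes, here is a small base-R sketch for a single cell of the data (group A at value4, taken from mydata above). The variable names are only illustrative.

```r
# Group A, value4 from the question's data: three valid values, two missing
x <- c(24, 25, 30, NA, NA)

n_valid <- sum(!is.na(x))                      # number of non-missing values
m       <- mean(x, na.rm = TRUE)               # mean of the valid values
se      <- sd(x, na.rm = TRUE) / sqrt(n_valid) # standard error of the mean

c(n = n_valid, mean = m, lower = m - se, upper = m + se)
```

The ribbon in the plot is drawn between `lower` and `upper`, and the label shows `n`.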

How to plot line graph of normalized differences from binned data with ggplot?

I have several sets of data that I calculate binned normalized differences for. The results I want to plot within a single line plot using ggplot. The lines representing different combinations of the paired differences are supposed to be distinguished by colors and line types.
I am stuck on taking the computed values from the bins (would be y-axis values now), and plotting these onto an x-axis.
Below is the code I use for importing the data and calculating the normalized differences.
# Read data from column 3 as data table for different number of rows
# you could use replicate here for test
# dat1 <- data.frame(replicate(1,sample(25:50,10000,rep=TRUE)))
# dat2 <- data.frame(replicate(1,sample(25:50,9500,rep=TRUE)))
dat1 <- fread("/dir01/a/dat01.txt", header = FALSE, data.table=FALSE, select=c(3))
dat2 <- fread("/dir02/c/dat02.txt", header = FALSE, data.table=FALSE, select=c(3))
# Change column names
colnames(dat1) <- c("Dat1")
colnames(dat2) <- c("Dat2")
# Perhaps there is a better way to compute the following as all-in-one? I have broken these down step by step.
# 1) Sum for each bin
bin1 = cut(dat1$Dat1, breaks = seq(25, 50, by = 2))
sum1 = tapply(dat1$Dat1, bin1, sum)
bin2 = cut(dat2$Dat2, breaks = seq(25, 50, by = 2))
sum2 = tapply(dat2$Dat2, bin2, sum)
# 2) Total sum of all bins
sumt1 = sum(sum1)
sumt2 = sum(sum2)
# 3) Divide each bin by total sum of all bins
sumn1 = lapply(sum1, `/`, sumt1)
sumn2 = lapply(sum2, `/`, sumt2)
# 4) Convert to data frame as I'm not sure how to difference otherwise
df_sumn1 = data.frame(sumn1)
df_sumn2 = data.frame(sumn2)
# 5) Difference between the two as percentage
dbin = (df_sumn1 - df_sumn2)*100
How can I plot those results using ggplot() and geom_line()?
I want
dbin values on the x-axis ranging from 25-50
different colors and line types for the lines
Here is what I tried:
p1 <- ggplot(dbin, aes(x = ?, color=Data, linetype=Data)) +
geom_line() +
scale_linetype_manual(values=c("solid")) +
scale_x_continuous(limits = c(25, 50)) +
scale_color_manual(values = c("#000000"))
dput(dbin) outputs:
structure(list(X.25.27. = -0.0729132928804117, X.27.29. = -0.119044772581772,
X.29.31. = 0.316016473225017, X.31.33. = -0.292812782147632,
X.33.35. = 0.0776336591308158, X.35.37. = 0.0205584754637611,
X.37.39. = -0.300768421159599, X.39.41. = -0.403235174844081,
X.41.43. = 0.392510458816457, X.43.45. = 0.686758883448307,
X.45.47. = -0.25387105113263, X.47.49. = -0.0508324553382303), class = "data.frame", row.names = c(NA,
-1L))
Edit
The final, working code, which uses only the dbin values and plots multiple dbin series:
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100)))
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100)))
dat3 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 12:37/100)))
dat4 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 37:12/100)))
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinA = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
diff_data2 <-
full_join(
calc_bin_props(data = dat3),
calc_bin_props(dat4),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinB = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
# Combine two differences, and remove sum.x and sum.y
full_data <- cbind(diff_data, diff_data2[,4])
full_data <- full_data[,-c(2:3)]
# Melt the data to plot more than 1 variable on a plot
m <- reshape2::melt(full_data, id.vars="bin")
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(m, aes(as.numeric(bin), value, col=variable, linetype = variable)) +
geom_line() +
scale_linetype_manual(values=c("solid", "longdash")) +
scale_color_manual(values = c("black", "black"))
dev.off()
library(tidyverse)
Create example data as shown in the question, but with different probabilities in the two sample() calls, so that there is a visible difference between the two sets of randomized data.
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100))) %>% as_tibble()
Using dplyr we can handle this within data.frames (tibbles) without
the need to switch to other datatypes.
Let’s define a function that can be applied to both datasets to get
the preprocessing done.
We use base::cut() to create
a new column that pairs each value with its bin. We then group the data
by bin, calculate the sum for each bin and finally divide the bin sums
by the total sum.
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
Now we call calc_bin_props() on both datasets and join them by bin.
This gives us a dataframe with the columns bin, sum.x and sum.y.
The latter two correspond to the normalized bin sums derived from dat1 and
dat2. With the mutate() line we calculate the differences between the
two columns.
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
mutate(dbin = (sum.x - sum.y),
bin = as.numeric(as.character(bin))) %>%
select(-starts_with("trsh"))
Before we feed the data into ggplot() we convert it to the long
format using pivot_longer(). This allows us to instruct ggplot() to
plot the results for sum.x, sum.y and dbin as separate lines.
diff_data %>%
pivot_longer(-bin) %>%
ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
geom_line() +
scale_linetype_manual(values=c("longdash", "solid", "solid")) +
scale_color_manual(values = c("black", "purple", "green"))
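For comparison, the same binned, normalized difference can be sketched in base R only. This is a rough sketch, not the answer's method: the helper name `bin_share` and the seeded sample vectors `a` and `b` are assumptions for illustration.

```r
# Hypothetical helper: share of the total sum that falls into each bin
bin_share <- function(x, breaks = seq(25, 50, by = 2)) {
  # cut() uses right-closed intervals by default, so values outside
  # (25, 49] get an NA bin and are dropped by tapply()
  sums <- tapply(x, cut(x, breaks = breaks), sum)
  sums[is.na(sums)] <- 0          # bins with no observations
  sums / sum(sums)
}

set.seed(1)
a <- sample(25:50, 1000, replace = TRUE)
b <- sample(25:50, 950,  replace = TRUE)

# Difference between the two normalized bin sums, in percentage points
dbin <- (bin_share(a) - bin_share(b)) * 100
dbin
```

Because each set of shares sums to 1, the differences in `dbin` sum to zero, which is a handy check that the normalization was applied before differencing.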

Overlaying boxplot with a lineplot

I have some fake data representing the answering times of different users answering an online survey.
The dataset has three variables: the id of the respondent (user), the name of the question (question) and the answering time for each question (time).
n <- 1000
dat <- data.frame(user = 1:n,
question = sample(paste("q", 1:4, sep = ""), size = n, replace = TRUE),
time = round(rnorm(n, mean = 10, sd=4), 0)
)
pltSingleRespondent <- function(df, highlightUsers){
df %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') + coord_flip() +
ggtitle("Answering time per question")
}
pltSingleRespondent(dat, c(1, 31) )
I was creating a function that plots a boxplot with the answering times for each question. However, now I'd like to overlay that plot with the answering times of specific respondents (highlightUsers). The following image shows an example:
Can someone please explain to me how to do this?
I think the most direct way to do this is to subset your data within a call to geom_line.
I'll start with a different set of random data, since the sample data in the question does not include all questions for a user.
set.seed(2021)
dat <- expand.grid(user = factor(1:50), question = paste0("q", 1:4))
dat$time <- round(rnorm(200, mean = 10, sd = 4), 0)
dat %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') + coord_flip() +
ggtitle("Answering time per question") +
geom_line(aes(color = user, group = user), size = 2,
data = ~ subset(., user %in% c(1L, 34L)))
You can functionize it however you want. If you're using dplyr, you can use dplyr::filter instead of subset with no other change.
Also, I chose to factor(user), since otherwise ggplot2 tends to think its data is continuous (for color=user). You can choose to use or not use this, though you may need more wrangling to get it to be discrete.
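To see why the regenerated data works for the line overlay, here is a small sketch of the subsetting step on its own (the name `highlight` is just illustrative). With expand.grid() every user has exactly one row per question, so each highlighted user yields one line through all four boxplots.

```r
# Same layout as above: every user answers every question
dat <- expand.grid(user = factor(1:50), question = paste0("q", 1:4))

# The subset that geom_line() receives for the highlighted users;
# %in% compares the factor's labels with the given ids
highlight <- subset(dat, user %in% c(1L, 34L))

# One row per question for each highlighted user
table(droplevels(highlight$user), highlight$question)
```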
Slightly different approach. Add a column to the data that indicates the highlighted users and map that variable to geom_line. Use scale_color_discrete(na.translate = FALSE) to color only the non-NA values.
library(dplyr)
library(ggplot2)
pltSingleRespondent <- function(df, highlightUsers) {
df %>%
mutate(User = factor(ifelse(user %in% highlightUsers, as.character(user), NA))) %>%
ggplot(aes(question, time)) +
geom_boxplot(fill = "orange") +
geom_line(aes(color = User, group = User)) +
ggtitle("Answering time per question") +
scale_color_discrete(na.translate = FALSE) +
coord_flip() +
theme_bw()
}
Using the example data from @r2evans:
pltSingleRespondent(dat, c(1, 34))

How do I get a single percentage/proportion plot using ggplot for separate groups?

df <- data.frame(k = sample(1:3, 100, replace = TRUE),
g = sample(1:2, 100, replace = TRUE, prob = c(0.3, 0.7)))
In this data frame I have two groups g whose members are each in one of three conditions k.
Now, I want to see the proportions of the conditions k in both groups.
ggplot(df, aes(x = k, fill = as.factor(g), y = (..count..)/sum(..count..))) +
geom_bar(position=position_dodge())
That looks nice at first, but there is a problem. Group 2 is larger than group 1, so the proportions are not right: it looks as if all conditions were more likely in group 2 than in group 1. I need to calculate y = (..count..)/sum(..count..) for each group separately. How do I do this?
Here's how you can do it:
library(tidyverse)
df %>%
group_by(g) %>%
count(k) %>%
mutate(share = n / sum(n)) %>%
ggplot(aes(x = k, fill = as.factor(g), y = share)) +
geom_col(position = position_dodge())
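If you want to check the per-group shares without dplyr, a base-R sketch with prop.table() gives the same numbers. The seed and the name `shares` are assumptions added here for reproducibility.

```r
set.seed(42)
df <- data.frame(k = sample(1:3, 100, replace = TRUE),
                 g = sample(1:2, 100, replace = TRUE, prob = c(0.3, 0.7)))

# Rows are groups g, columns are conditions k;
# margin = 1 normalizes within each row (that is, within each group)
shares <- prop.table(table(g = df$g, k = df$k), margin = 1)

shares
rowSums(shares)  # each group's shares add up to 1
```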

Having trouble plotting means of subgroups of data

I have data that describes a series of observations (sound level) grouped by date and hour. I want to plot the mean sound level per hour for each day, with sound level on the Y axis, hour on the X axis, and one line for each day. Example data:
Hour Date SPL
1 18-May 107.9868
2 18-May 106.5656
1 19-May 107.4321
2 19-May 107.8993
I have played around with the group_by function but I'm not sure how to do any better than this:
spl_mean <- group_by(sound, Hour) %>%
summarize(count = n(), Mean = mean(SPL, na.rm = T))
ggplot(data=spl_mean) + geom_line(aes(x = Hour, y = Mean, group = 1), size = 2)
Which obviously just gives mean for SPL by hour but doesn't preserve the days subgroup.
Use dplyr to calculate the mean per day and hour, and then ggplot2 to plot the result.
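library(dplyr)
library(ggplot2)
# df holds the Hour, Date and SPL columns (called `sound` in the question)
df %>%
group_by(Date, Hour) %>%
summarise(SPL_mean = mean(SPL, na.rm = TRUE), .groups = "drop") %>%
arrange(Date, Hour) %>%
ggplot(aes(x = Hour, y = SPL_mean, color = Date)) + geom_line()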
Using ggplot, where D is day, H is hour and V is volume:
# setup for demo
library('tidyverse')
df <- tibble(
'D' = c(1:5,1:5),
'H' = rep(c(1,2), each = 5),
'V' = rnorm(10, 100, 5))
# Figure
ggplot(data = df) +
geom_line(mapping = aes(x = H, y = V, group = D, color = D))
This is telling it to do hour on the x, volume on the y and plot different days individually.
If you need to get per day and hour means first then group_by day and hour, then summarise:
df %>%
group_by(D, H) %>%
summarize(MV = mean(V)) %>%
ggplot() +
geom_line(mapping = aes(x = H, y = MV, group = D, color = D))
Then go make it pretty with labs, theme, scales etc.
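The grouping step can also be sketched in base R with aggregate(). The SPL values below are made-up toy numbers (two readings per date and hour), chosen only so the means are easy to verify by eye.

```r
# Toy data in the shape of the question's table (values are invented)
sound <- data.frame(
  Hour = c(1, 1, 2, 2, 1, 1, 2, 2),
  Date = rep(c("18-May", "19-May"), each = 4),
  SPL  = c(100, 102, 104, 106, 110, 112, 114, 116)
)

# One mean per Date/Hour combination, so the day grouping is preserved
spl_mean <- aggregate(SPL ~ Date + Hour, data = sound, FUN = mean)
spl_mean
```

The result has one row per day and hour, which is the shape needed to draw one line per day with `color = Date`.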

Resources