I have a data frame (below) in which I want to highlight, for each day, which employees were outliers in terms of time spent.
Emp_Id 3 is consistently an outlier on the 1st, 2nd and 3rd of January. My actual dataset contains thousands of employees.
How can I show these outliers visually in a plot?
df <- data.frame(date = as.Date(c("2020-01-01","2020-01-01","2020-01-01","2020-01-01",
"2020-01-02","2020-01-02","2020-01-02","2020-01-02",
"2020-01-03","2020-01-03","2020-01-03","2020-01-03")),
Emp_Id = c(1,2,3,4,1,2,3,4,1,2,3,4),
time = c(5,2,80,3,3,1,90,80,5,6,75,7))
date Emp_Id time
2020-01-01 1 5
2020-01-01 2 2
2020-01-01 3 80
2020-01-01 4 3
2020-01-02 1 3
2020-01-02 2 1
2020-01-02 3 90
2020-01-02 4 80
2020-01-03 1 5
2020-01-03 2 6
2020-01-03 3 75
2020-01-03 4 7
The answer depends on the metric you choose to define an outlier.
Here is an example that flags employees who use more than twice the mean time for that day. You can build on this with other thresholds, e.g. more than the mean time, more than three times the mean time, etc. The important thing is to choose a meaningful metric.
In the example, only outliers are labeled, and a horizontal line marks the threshold above which a point counts as an outlier.
# Example data from question
df <- data.frame(date = as.Date(c("2020-01-01","2020-01-01","2020-01-01","2020-01-01",
"2020-01-02","2020-01-02","2020-01-02","2020-01-02",
"2020-01-03","2020-01-03","2020-01-03","2020-01-03")),
Emp_Id = c(1,2,3,4,1,2,3,4,1,2,3,4),
time = c(5,2,80,3,3,1,90,80,5,6,75,7))
library(dplyr)
library(ggplot2)
# Flag outliers with the chosen metric (twice the daily mean)
emp_data <- df %>%
  mutate(date = as.factor(date)) %>%
  group_by(date) %>%
  mutate(metric = mean(time) * 2,
         outlier = time > metric) %>%
  ungroup()
# Visualize it: one panel per day, outliers labeled with Emp_Id
ggplot(emp_data, aes(x = date, y = time, col = outlier)) +
  geom_point() +
  geom_text(data = filter(emp_data, outlier), aes(label = Emp_Id), hjust = 2, vjust = 0) +
  facet_wrap(~date, scales = "free") +
  geom_hline(aes(yintercept = metric)) +
  labs(x = "Date", y = "Time", col = "Outlier") +
  theme_classic()
Created on 2021-04-09 by the reprex package (v0.3.0)
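Since the threshold is a choice, a different rule can give different flags. As a minimal sketch (not part of the answer above), the following applies the usual boxplot rule, 1.5 times the interquartile range beyond the quartiles, within each day using base R only. The helper name is_iqr_outlier is hypothetical.

```r
# Example data from the question
df <- data.frame(
  date   = as.Date(rep(c("2020-01-01", "2020-01-02", "2020-01-03"), each = 4)),
  Emp_Id = rep(1:4, times = 3),
  time   = c(5, 2, 80, 3, 3, 1, 90, 80, 5, 6, 75, 7)
)

# Hypothetical helper: TRUE for values outside the k * IQR fences
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Apply the rule separately within each day
df$outlier <- as.logical(ave(df$time, as.character(df$date), FUN = is_iqr_outlier))

# Rows flagged as outliers
df[df$outlier, ]
```

With this rule employee 3 is flagged on 1 and 3 January but not on 2 January, because the 80 recorded by employee 4 widens that day's interquartile range. The resulting outlier column can be used in the ggplot code above in place of the mean-based one.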
Related
I have two separate data frames, each representing a feature (activity and sleep) and the number of days on which that feature was recorded for each id number. The number of days should appear on the y-axis and the feature itself on the x-axis.
I managed to draw the boxplots separately, showing the outliers clearly, especially for one of the sets. However, when I place the two boxplots next to each other, the outliers do not show up clearly. Also, how do I get the names of the two features (activity and sleep) on my x-axis?
The data frame for the "sleep" feature:
head(idday)
A tibble: 6 x 2
id days
<dbl> <int>
1 1503960366 25
2 1644430081 4
3 1844505072 3
4 1927972279 5
5 2026352035 28
6 2320127002 1
The data frame for the "activity" feature:
head(iddaya)
A tibble: 6 x 2
id days
<dbl> <int>
1 1503960366 31
2 1624580081 31
3 1644430081 30
4 1844505072 31
5 1927972279 31
6 2022484408 31
My attempt for sleep:
ggplot(idday, aes(y = days), boxwex = 0.05) +
stat_boxplot(geom = "errorbar",
width = 0.2) +
geom_boxplot(alpha=0.9, outlier.color="red")
and for activity:
ggplot(iddaya, aes(y = days), boxwex = 0.05) +
stat_boxplot(geom = "errorbar",
width = 0.2) +
geom_boxplot(alpha=0.9, outlier.color="red")
I then combined them:
boxplot(summary(idday$days), summary(iddaya$days))
In this final image the outliers do not show clearly, and I want to name my x-axis and y-axis.
There are several ways to do this. One way:
If your data frames are called df_sleep and df_activity, combine them in a named list, let bind_rows() add a new column feature from the list names, then plot:
df_sleep
df_activity
library(tidyverse)
bind_rows(list(sleep = df_sleep, activity = df_activity), .id = 'feature') %>%
ggplot(aes(x = feature, y=days, fill=feature))+
geom_boxplot()
If you want to compare these two boxplots with each other, I recommend using the same range for the y-axis. To achieve this you first have to combine both data frames. You can do this with inner_join() from the dplyr package. Note that inner_join() keeps only the ids that appear in both data frames.
data_combined <- inner_join(idday, iddaya,
by = "id",
suffix = c("_sleep", "_activity"))
Then you need to transform your data frame into long-format with pivot_longer() from the tidyr package:
data_combined_long <- data_combined %>%
pivot_longer(days_sleep:days_activity,
names_to = "features",
names_prefix = "days_",
values_to = "days")
After that you can again use ggplot() to create your boxplot. But now you have to define that you want your x-axis to represent your features:
ggplot(data_combined_long, aes(x = features, y = days)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(alpha = 0.9, outlier.color = "red")
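The two answers treat unmatched ids differently: stacking keeps every row, while a join keeps only ids present in both data frames. A small base R sketch of the difference, using only the six rows shown by head() in the question (so the counts are illustrative, not the real totals):

```r
# First six rows of each data frame, copied from the head() output in the question
idday <- data.frame(
  id   = c(1503960366, 1644430081, 1844505072, 1927972279, 2026352035, 2320127002),
  days = c(25L, 4L, 3L, 5L, 28L, 1L)
)
iddaya <- data.frame(
  id   = c(1503960366, 1624580081, 1644430081, 1844505072, 1927972279, 2022484408),
  days = c(31L, 31L, 30L, 31L, 31L, 31L)
)

# Stacking: every row survives, with a column naming the feature
stacked <- rbind(
  transform(idday,  feature = "sleep"),
  transform(iddaya, feature = "activity")
)

# Joining: only ids found in both data frames survive
joined <- merge(idday, iddaya, by = "id", suffixes = c("_sleep", "_activity"))

nrow(stacked)  # 12 rows
nrow(joined)   # 4 rows

# Base R side-by-side boxplots with named axes, built from the raw values
# rather than from summary()
boxplot(days ~ feature, data = stacked, xlab = "Feature", ylab = "Days")
```

Which behaviour is appropriate depends on whether ids recorded for only one feature should appear in the comparison.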
I have a dataset about accidents in the UK. Among other variables it contains the month of the accident and the severity (ranging from 1 to 3). Thus, you can imagine the dataset like this:
ID Month Accident_Severity
1 01 3
2 01 2
3 04 1
4 07 2
I would like to produce a bar chart with the months on the x-axis and, on the y-axis, the relative share of accidents of a given severity class that happened in that month. This means each month should have three bars, say red, blue and green. For each color, the shares indicated by all bars of that color should sum to 100%. For example, if blue means Accident_Severity = 2 and the blue bar indicates 10% for January, then 10% of all accidents with a severity of 2 happened in January.
I managed to get these numbers as a table doing the following:
pivot_rel <- df %>%
select(month, Accident_Severity) %>%
group_by(month) %>%
table()
for (i in c(1,2,3)) {
for (j in seq(1,12)) {
pivot_rel[j,i] <- round(pivot_rel[j,i]/sum_severity[i],3)
}
}
pivot_rel
However, I cannot use this object with ggplot. When I try, I get the error: "Error: data must be a data frame, or other object coercible by fortify(), not an S3 object with class table"
How do I visualize this table or is there an easier way to do what I try to achieve? Many Thanks!
Use xtabs to table the data and colSums to get the proportions. Then, with packages ggplot2 and scales, plot the graph.
library(ggplot2)
library(scales)
tbl <- xtabs( ~ Month + Accident_Severity, df1)
t(tbl)/colSums(tbl)
# Month
#Accident_Severity 1 4 7
# 1 0.0 1.0 0.0
# 2 0.5 0.0 0.5
# 3 1.0 0.0 0.0
as.data.frame(t(tbl)/colSums(tbl)) |>
ggplot(aes(factor(Month), Freq, fill = factor(Accident_Severity))) +
geom_col(position = position_dodge()) +
scale_fill_manual(values = c("red", "green", "blue")) +
scale_y_continuous(labels = percent_format()) +
xlab("Month") +
guides(fill = guide_legend(title = "Accident Severity"))
Data
df1 <- read.table(text = "
ID Month Accident_Severity
1 01 3
2 01 2
3 04 1
4 07 2
", header = TRUE)
A simple fix would be to convert the table to a data frame, which can be used with ggplot.
pivot_rel <- as.data.frame.matrix(pivot_rel)
However, you might also go a step back and use count instead of table to generate the frequency counts of month and Accident_Severity.
library(dplyr)
pivot_rel <- df %>% count(month, Accident_Severity)
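The counts from count() still need to be turned into shares within each severity class before plotting. A minimal sketch of that step, assuming the four-row df1 example data from the answer above (with columns Month and Accident_Severity) rather than the asker's full dataset:

```r
library(dplyr)
library(ggplot2)

# Example data, as in the xtabs answer
df1 <- read.table(text = "
ID Month Accident_Severity
1 01 3
2 01 2
3 04 1
4 07 2
", header = TRUE)

# Count per month and severity, then divide by the total of each severity class
shares <- df1 %>%
  count(Month, Accident_Severity) %>%
  group_by(Accident_Severity) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Dodged bars: one bar per severity class within each month
ggplot(shares, aes(factor(Month), share, fill = factor(Accident_Severity))) +
  geom_col(position = position_dodge(preserve = "single")) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Month", y = "Share within severity class", fill = "Accident severity")
```

Because the grouping is by Accident_Severity, the shares of each color sum to 100% across months, which is the normalization the question asks for.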
Using proportions on xtabs and base barplot.
proportions(xtabs( ~ Month + Accident_Severity, d), margin=2) |>
as.data.frame() |>
with(barplot(Freq ~ Accident_Severity + Month, beside=T, col=2:4,
main='Relative Frequencies',
legend.text=sort(unique(d$Accident_Severity)),
args.legend=list(title='Accident_Severity')))
Data:
d <- read.table(header=T, text='
ID Month Accident_Severity
1 01 3
2 01 2
3 04 1
4 07 2')
I am working with longitudinal data and assessing the utilization of a policy over 13 months. In order to get some bar plots with the different months on the x-axis, I converted my data from wide format to long format.
So now, my dataset looks like this
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
I thought that after reshaping I could easily use my newly created "month" variable as a factor and plot some graphs. However, it does not work: R tells me it's a list or an atomic vector. Transforming it into a factor did not work either, and I really need it as a factor.
Does anybody know how to turn it into a factor?
Thank you very much for your help!
EDIT.
The OP's graph code was posted in a comment. Here it is.
library(ggplot2)
ggplot(data, aes(x = hours, y = month)) + geom_density() + labs(title = 'Distribution of hours')
# Loading ggplot2
library(ggplot2)
# Placing example in dataframe
data <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
# Converting month to factor (levels come from the data, so all 13 months of the full dataset are kept)
data$month <- factor(data$month)
# Plotting grouping by id
ggplot(data, aes(x = month, y = hours, group = id, color = factor(id))) + geom_line()
# Plotting hour density by month
ggplot(data, aes(hours, color = month)) + geom_density()
The problem seems to be in the aes. geom_density only needs an x value; a y aesthetic does not make sense here. You want the density of the x values, so the vertical axis shows the estimated density, not another variable from the dataset.
First, read in the data.
Indirekte_long <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
Now graph it.
library(ggplot2)
g <- ggplot(Indirekte_long, aes(hours))
g + geom_density() + labs(title = 'Distribution of hours')
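To see why no y variable is needed, it can help to look at what a kernel density estimate returns. The sketch below uses base R's density() on the example hours; it is an illustration of the idea under default settings, not a reproduction of ggplot2's internals.

```r
# Hours from the example data
hours <- c(13, 16, 20, 0, 0, 10)

# Kernel density estimate: a grid of x positions and a computed height at each
d <- density(hours)

length(d$x)  # 512 grid points by default
range(d$y)   # the heights are computed from hours alone, never read from the data

# The area under the curve is approximately 1
sum(d$y) * diff(d$x[1:2])
```

The y values are an output of the estimate, which is why mapping a dataset column to y in aes() conflicts with geom_density().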
I have data and I have been able to put it into a ggplot graph (time series data). The data is over 12 years and there are specific spikes in the data for certain periods (the data is in weeks). I would like to try and color code one particular week of each year where the spikes begin but do not know where to begin.
My hypothesis is that the spike occurs in January, when the Super Bowl happens; that would be the week column from 2001-01-01 to 2001-01-31. Is it possible to subset a period in ggplot and color code the graph accordingly, so that the Super Bowl week uses a different color?
That is, for each year from 2001 to 2012, color January (01-01 to 01-31) red, for example. That is 4 weeks of data. What I currently have is:
df[, .(df_sales = (sum(qty) * (EUR))), by = week] %>%
ggplot(aes(x = week, y = df_sales)) +
labs(x = 'wks', title = 'TS plot of qty x eur')
This gives me a nice plot, but I would like to color code the spikes (i.e. my hypothesis that they occur in January, the week of the Super Bowl). I can post the graph for clarification if necessary.
ID unit qty NA EUR KEY identity week
1: 1123539 1147 1 GR 2.39 652159 10090100003 2001-08-20
2: 3102228 1129 1 GR 2.15 257871 10090100003 2001-04-16
3: 3321265 1129 1 GR 2.15 257871 10090100003 2001-04-16
4: 3321265 1122 1 GR 2.15 257871 10090100004 2001-02-26
5: 1120774 1151 1 GR 2.39 213290 10090100005 2001-09-17
6: 1145763 1157 1 GR 2.39 213290 10090100005 2001-10-29
EDIT: I attach the graph for clarification
EDIT2: I attach the new graph
You can just use a second geom in conjunction with subset, like this:
library(lubridate)
library(ggplot2)
ggplot(df, aes(x = week, y = df_sales)) +
  geom_bar(stat = "identity") +
  # Redraw the January rows on top in red
  geom_bar(data = subset(df, month(week) == 1), stat = "identity", col = "red", fill = "red") +
  labs(x = 'wks', title = 'TS plot of qty x eur')
Here we use lubridate::month to check which row belongs to a week in January.
For some fictional random data:
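A self-contained sketch of the same idea, using made-up weekly sales (the data frame weekly and its values are invented for illustration). Instead of a second geom it maps a logical January flag to fill, and it uses format() in place of lubridate::month() to avoid the extra dependency:

```r
library(ggplot2)

set.seed(1)  # made-up data, for illustration only
weekly <- data.frame(
  week = seq(as.Date("2001-01-01"), as.Date("2003-12-29"), by = "week")
)
weekly$df_sales <- runif(nrow(weekly), min = 50, max = 100)

# TRUE for weeks that start in January
weekly$january <- format(weekly$week, "%m") == "01"

ggplot(weekly, aes(x = week, y = df_sales, fill = january)) +
  geom_col() +
  scale_fill_manual(values = c(`FALSE` = "grey50", `TRUE` = "red"), name = "January") +
  labs(x = "wks", title = "TS plot of qty x eur")
```

Mapping the flag to fill also produces a legend, which the two-geom version does not.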
I have been given a challenging problem and was hoping for some recommendations.
I have activity data that I would like to display graphically and am looking for a package or program that could be used to solve my problem (preferably R).
The data is count of movements (Activity) collected hourly (Time of day) for 3 weeks (Calendar Date) or more with associated variables (Food/Vegetation).
Typically, as I've been told, the data can be processed and graphed in a program called Clocklab, which is a Matlab product. However, the added complication is the desire to plot this data according to a classification of feeding groups. I was trying to find an equivalent program/package in R for this but have come up short.
What the data looks like is simply:
Activity time of day Food type Calendar Date
0 01:00 B 03/24/2007
13 02:00 --- 03/24/2007
0 03:00 B 03/24/2007
0 04:00 B 03/24/2007
: : : :
1246 18:00 C 03/24/2007
3423 19:00 C 03/24/2007
: : : :
0 00:00 --- 03/25/2007
This data is circadian, circular, activity-budgeting data, and I would like a graph, possibly 3-D in nature, that shows the diet selection and how much activity is associated with that diet, plotted over time for multiple days/weeks. I would do this by individual and then at a population level. I have a link to the program and an example plot of what Clocklab typically produces.
Absent real data, this is the best I can come up with. No special packages required, just ggplot2 and plyr:
library(ggplot2)
library(plyr)
#Some imagined data
dat <- data.frame(time = factor(rep(0:23,times = 20)),
count = sample(200,size = 480,replace = TRUE),
grp = sample(LETTERS[1:3],480,replace = TRUE))
head(dat)
time count grp
1 0 79 A
2 1 19 A
3 2 9 C
4 3 11 A
5 4 123 B
6 5 37 A
dat1 <- ddply(dat,.(time,grp),summarise,tot = sum(count))
head(dat1)
time grp tot
1 0 A 693
2 0 B 670
3 0 C 461
4 1 A 601
5 1 B 890
6 1 C 580
ggplot(data = dat1,aes(x = time,y = tot,fill = grp)) +
geom_bar(stat = "identity",position = "stack") +
coord_polar()
I just coded the hours of the day as integers 0-23, and simply grabbed some random values for Activity counts. But this seems like it's generally what you're after.
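To get from the layout in the question to integer hours like these, the time and date strings need parsing first. A rough sketch in base R, assuming the column names below (hypothetical, since the question shows only display headers) and that "---" means no food type was recorded:

```r
# A few rows typed in from the question's example; column names are assumed
act <- data.frame(
  activity    = c(0, 13, 0, 1246),
  time_of_day = c("01:00", "02:00", "03:00", "18:00"),
  food_type   = c("B", "---", "B", "C"),
  date        = c("03/24/2007", "03/24/2007", "03/24/2007", "03/24/2007"),
  stringsAsFactors = FALSE
)

# Hour of day as an integer 0-23, taken from the "HH:MM" string
act$hour <- as.integer(substr(act$time_of_day, 1, 2))

# Calendar date from month/day/year text
act$date <- as.Date(act$date, format = "%m/%d/%Y")

# Assumption: "---" marks a missing food type
act$food_type[act$food_type == "---"] <- NA

act
```

After this step, hour, date and food_type can play the roles of time, day and grp in the plotting code.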
Edit
A few more options based on comments:
#Force some banding patterns
xx <- sample(10,9,replace = TRUE)
dat <- data.frame(time = factor(rep(0:23,times = 20)),
day = factor(rep(1:20,each = 24),levels = 20:1),
count = rep(c(xx,rep(0,4)),length.out = 20*24),
grp = sample(LETTERS[1:3],480,replace = TRUE))
Option one, using faceting:
ggplot(dat,aes(x = time,y = day)) +
facet_wrap(~grp,nrow = 3) +
geom_tile(aes(alpha = count))
Option two using color (i.e. fill):
ggplot(dat,aes(x = time,y = day)) +
geom_tile(aes(alpha = count,fill = grp))