I have a dataset about accidents in the UK. Among other variables it contains the month of the accident and the severity (ranging from 1 to 3). Thus, you can imagine the dataset like this:
ID  Month  Accident_Severity
1   01     3
2   01     2
3   04     1
4   07     2
I would like to produce a bar chart with the months on the x-axis and, on the y-axis, the relative share of accidents of a given severity class that happened in that month. This means each month should have three bars, let's say red, blue and green. Summing the relative shares of all bars of one color should give 100% for each color. For example, if blue means Accident_Severity = 2 and the blue bar indicates 10% for January, then 10% of all accidents with a severity of 2 happened in January.
I managed to get these numbers as a table doing the following:
pivot_rel <- df %>%
  select(month, Accident_Severity) %>%
  group_by(month) %>%
  table()
# sum_severity is not defined in the question; presumably it holds the total count per severity class
for (i in c(1,2,3)) {
  for (j in seq(1,12)) {
    pivot_rel[j,i] <- round(pivot_rel[j,i]/sum_severity[i],3)
  }
}
pivot_rel
However, I cannot use the object with ggplot. When I try, I receive the error: "Error: data must be a data frame, or other object coercible by fortify(), not an S3 object with class table"
How do I visualize this table, or is there an easier way to do what I am trying to achieve? Many thanks!
Use xtabs to table the data and colSums to get the proportions. Then, with packages ggplot2 and scales, plot the graph.
library(ggplot2)
library(scales)
tbl <- xtabs( ~ Month + Accident_Severity, df1)
t(tbl)/colSums(tbl)
#                  Month
#Accident_Severity   1   4   7
#                1 0.0 1.0 0.0
#                2 0.5 0.0 0.5
#                3 1.0 0.0 0.0
as.data.frame(t(tbl)/colSums(tbl)) |>
  ggplot(aes(factor(Month), Freq, fill = factor(Accident_Severity))) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values = c("red", "green", "blue")) +
  scale_y_continuous(labels = percent_format()) +
  xlab("Month") +
  guides(fill = guide_legend(title = "Accident Severity"))
Data
df1 <- read.table(text = "
ID Month Accident_Severity
1 01 3
2 01 2
3 04 1
4 07 2
", header = TRUE)
A simple fix would be to convert the table to a data frame, which can be used with ggplot.
pivot_rel <- as.data.frame.matrix(pivot_rel)
However, you might also go a step back and use count instead of table to generate the frequency counts of month and Accident_Severity.
library(dplyr)
pivot_rel <- df %>% count(month, Accident_Severity)
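Note that count() on its own gives absolute frequencies. To get the share within each severity class that the question asks for, one possible sketch is shown below. It assumes the columns are named Month and Accident_Severity as in the example table, uses the four example rows as stand-in data, and the names rel_share and share are only illustrative:

```r
library(dplyr)
library(ggplot2)

# Stand-in data: the four example rows from the question
df <- data.frame(
  ID = 1:4,
  Month = c("01", "01", "04", "07"),
  Accident_Severity = c(3, 2, 1, 2)
)

# Count accidents per month and severity, then divide by the total
# of each severity class so that the shares sum to 1 per class
rel_share <- df %>%
  count(Month, Accident_Severity) %>%
  group_by(Accident_Severity) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# One dodged bar per severity class within each month
p <- ggplot(rel_share, aes(Month, share, fill = factor(Accident_Severity))) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(fill = "Accident Severity")
p
```

Grouping by Accident_Severity before dividing is what makes the bars of one color add up to 100%.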
Using proportions on xtabs and base barplot.
proportions(xtabs( ~ Month + Accident_Severity, d), margin=2) |>
  as.data.frame() |>
  with(barplot(Freq ~ Accident_Severity + Month, beside=TRUE, col=2:4,
               main='Relative Frequencies',
               legend.text=sort(unique(d$Accident_Severity)),
               args.legend=list(title='Accident_Severity')))
Data:
d <- read.table(header=T, text='
ID Month Accident_Severity
1 01 3
2 01 2
3 04 1
4 07 2')
I have the data frame below and want to highlight, for each day, which employees were outliers in terms of time spent.
Emp_Id 3 is consistently an outlier on the 1st, 2nd and 3rd of January among all employees. My actual dataset contains thousands of employees.
How can I show these outliers visually in a plot?
df <- data.frame(date = as.Date(c("2020-01-01","2020-01-01","2020-01-01","2020-01-01",
                                  "2020-01-02","2020-01-02","2020-01-02","2020-01-02",
                                  "2020-01-03","2020-01-03","2020-01-03","2020-01-03")),
                 Emp_Id = c(1,2,3,4,1,2,3,4,1,2,3,4),
                 time = c(5,2,80,3,3,1,90,80,5,6,75,7))
date Emp_Id time
2020-01-01 1 5
2020-01-01 2 2
2020-01-01 3 80
2020-01-01 4 3
2020-01-02 1 3
2020-01-02 2 1
2020-01-02 3 90
2020-01-02 4 80
2020-01-03 1 5
2020-01-03 2 6
2020-01-03 3 75
2020-01-03 4 7
This answer will depend on your chosen metric, and how you want to define it.
Here is an example that will show you employees who use more than twice the mean time. You can build on this to add various degrees of metrics, e.g. more than the mean time, more than twice the mean time, etc. The important thing is to choose a meaningful metric.
In the example, only outliers are labeled, and a horizontal line marks the threshold above which a point counts as an outlier.
# Example data from question
df <- data.frame(date = as.Date(c("2020-01-01","2020-01-01","2020-01-01","2020-01-01",
                                  "2020-01-02","2020-01-02","2020-01-02","2020-01-02",
                                  "2020-01-03","2020-01-03","2020-01-03","2020-01-03")),
                 Emp_Id = c(1,2,3,4,1,2,3,4,1,2,3,4),
                 time = c(5,2,80,3,3,1,90,80,5,6,75,7))
library(dplyr)
library(ggplot2)
# Create our data with chosen metric for outlier
emp_data = df %>%
  mutate(date = as.factor(date)) %>%
  group_by(date) %>%
  mutate(metric = mean(time) * 2) %>%
  mutate(outlier = time > metric)
# Visualize it
ggplot(data = emp_data, aes(x = as.factor(date), y = time, label = Emp_Id, col = outlier, group = date)) +
  geom_point() +
  geom_text(data = filter(emp_data, outlier == TRUE), aes(label=Emp_Id),hjust=2, vjust=0) +
  facet_wrap(~date, scales = "free") +
  geom_hline(aes(yintercept = metric)) +
  labs(x = "Date", y = "Time", col = "Outlier") +
  theme_classic()
Created on 2021-04-09 by the reprex package (v0.3.0)
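As one example of a different metric, the usual boxplot rule (above the third quartile plus 1.5 times the interquartile range) could be swapped in. This is only a sketch of an alternative definition, not the metric used above; the name emp_iqr is illustrative, and with just four employees per day the quartiles are rough at best:

```r
library(dplyr)

# Example data from the question
df <- data.frame(date = as.Date(rep(c("2020-01-01", "2020-01-02", "2020-01-03"), each = 4)),
                 Emp_Id = rep(1:4, times = 3),
                 time = c(5, 2, 80, 3, 3, 1, 90, 80, 5, 6, 75, 7))

# Per day: threshold = Q3 + 1.5 * IQR, flag everything above it
emp_iqr <- df %>%
  group_by(date) %>%
  mutate(metric = quantile(time, 0.75) + 1.5 * IQR(time),
         outlier = time > metric) %>%
  ungroup()

# Rows flagged as outliers under this rule
filter(emp_iqr, outlier)
```

The resulting emp_iqr has the same metric and outlier columns as emp_data, so after converting date to a factor it could be passed to the same plotting code. Different metrics can flag different points, which is why the choice of metric matters.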
I am a newbie here, sorry if the question is not written properly :p
1. The aim is to plot the mean NDVI value of my study site (named RB1) over a time period (8 dates were chosen from 2019-05 to 2019-10), and to draw vertical lines marking the dates of grass-cutting events.
2. I have calculated the NDVI value for these 8 chosen dates and made a CSV file.
(PS: "cutting" marks when the grassland on the study site was cut, so the corresponding dates should be shown as vertical lines using geom_vline.)
library(readr)
infor <- read_csv("plotting information.csv")
infor
# A tibble: 142 x 3
date NDVI cutting
<date> <dbl> <lgl>
1 2019-05-12 NA NA
2 2019-05-13 NA NA
3 2019-05-14 NA NA
4 2019-05-15 NA NA
5 2019-05-16 NA NA
6 2019-05-17 0.787 TRUE
# ... with 132 more rows
3. The problem is with the ggplot call. First, I want the x-axis to span the whole time period (2019-05 to 2019-10) without labelling every date in between, since that would put far too many dates on the x-axis. So I use scale_x_discrete(breaks=, labels=) to label only the dates that have NDVI values.
Second, I also want to mark the dates on which the grass was cut with geom_vline.
BUT it seems that scale_x_discrete requires my date to be a factor, while geom_vline requires the date to be numeric.
These two requirements seem to be contradictory.
y1 <- ggplot(infor, aes(factor(date), NDVI, group = 1)) +
  geom_point() +
  geom_line(data = infor[!is.na(infor$NDVI), ]) +
  scale_x_discrete(breaks = c("2019-05-17", "2019-06-18", "2019-06-26", "2019-06-28", "2019-07-23", "2019-07-28", "2019-08-27", "2019-08-30", "2019-09-21"),
                   labels = c("0517", "0618", "0626", "0628", "0723", "0728", "0827", "0830", "0921"))
y2 <- ggplot(infor, aes(date, NDVI, group = 1)) +
  geom_point() +
  geom_line(data = infor[!is.na(infor$NDVI), ])
When I add geom_vline to y1, the vertical lines do not show on my plot:
[plot: y1 + geom_vline]
When I add it to y2, the vertical lines are shown, but the dates on the x-axis look odd (not as in y1, because scale_x_ is not applied here):
[plot: y2 + geom_vline]
y1 +
  geom_vline(data = filter(infor, cutting == "TRUE"), aes(xintercept = as.numeric(date)), color = "red", linetype = "dashed")
Would be appreciated if you can help!
thanks in advance! :D
I agree with the comment about leaving dates as dates. In this case, you can specify the x-intercept of geom_vline as a date.
Given basic data:
library(dplyr)    # also provides tribble() and %>%
library(ggplot2)
df <- tribble(
  ~Date, ~Volume, ~Cut,
  '1-1-2010', 123456, 'FALSE',
  '5-1-2010', 789012, 'TRUE',
  '9-1-2010', 5858585, 'TRUE',
  '12-31-2010', 2543425, 'FALSE'
)
I set the date and then pull the subset for Cut=='TRUE' into a new object:
df <- mutate(df, Date = lubridate::mdy(Date))
d2 <- filter(df, Cut == 'TRUE') %>% pull(Date)
And finally use the object to specify intercepts:
df %>%
  ggplot(aes(x = Date, y = Volume)) +
  geom_vline(xintercept = d2) +
  geom_line()
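Applied to the question's setup, a sketch might look like the following. The small infor data frame here is made up for illustration (the real one is read from the CSV file, and only the 0.787 value comes from the question), and scale_x_date() with date_labels = "%m%d" stands in for the scale_x_discrete() call so that the dates can stay dates:

```r
library(ggplot2)

# Made-up stand-in for the question's data: daily dates, NDVI on a few days only
infor <- data.frame(date = seq(as.Date("2019-05-12"), as.Date("2019-09-30"), by = "day"))
infor$NDVI <- NA_real_
infor$cutting <- NA
obs <- as.Date(c("2019-05-17", "2019-06-18", "2019-07-23", "2019-08-27"))
infor$NDVI[match(obs, infor$date)] <- c(0.787, 0.65, 0.80, 0.72)     # illustrative values
infor$cutting[match(obs, infor$date)] <- c(TRUE, FALSE, TRUE, FALSE)  # illustrative flags

# Rows with an NDVI value, and the dates flagged as cutting events
ndvi_obs <- infor[!is.na(infor$NDVI), ]
cut_dates <- infor$date[which(infor$cutting)]

p <- ggplot(ndvi_obs, aes(date, NDVI)) +
  geom_point() +
  geom_line() +
  # Vertical lines at the cutting dates, given directly as Date values
  geom_vline(xintercept = cut_dates, color = "red", linetype = "dashed") +
  # Label only the dates that have NDVI values, in month-day form
  scale_x_date(breaks = ndvi_obs$date, date_labels = "%m%d",
               limits = range(infor$date))
p
```

Keeping the axis continuous means the spacing between labels reflects the actual time between observations, and the limits argument keeps the full period visible even where NDVI is missing.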
I am working with longitudinal data and assessing the utilization of a policy over 13 months. In order to get some bar plots with the different months on the x-axis, I converted my data from wide format to long format.
So now, my dataset looks like this
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
I thought that after reshaping I could easily use my newly created "month" variable as a factor and plot some graphs. However, it does not work, and I am told that it's a list or an atomic vector. Transforming it into a factor did not work either, and I desperately need it as a factor.
Does anybody know how to turn it into a factor?
Thank you very much for your help!
EDIT.
The OP's graph code was posted in a comment. Here it is.
library(ggplot2)
ggplot(data, aes(x = hours, y = month)) + geom_density() + labs(title = 'Distribution of hours')
# Loading ggplot2
library(ggplot2)
# Placing example in dataframe
data <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
# Converting month to factor
data$month <- factor(data$month, levels = 1:12, labels = 1:12)
# Plotting grouping by id
ggplot(data, aes(x = month, y = hours, group = id, color = factor(id))) + geom_line()
# Plotting hour density by month
ggplot(data, aes(hours, color = month)) + geom_density()
The problem seems to be in the aes. geom_density only needs an x value; if you think about it, y doesn't make sense here. You want the density of the x values, so the vertical axis shows the values of that density, not some other variable from the dataset.
First, read in the data.
Indirekte_long <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
Now graph it.
library(ggplot2)
g <- ggplot(Indirekte_long, aes(hours))
g + geom_density() + labs(title = 'Distribution of hours')
I have time series data that I have been able to plot with ggplot. The data spans 12 years in weekly steps, and there are spikes in certain periods. I would like to color-code one particular period of each year, where the spikes begin, but do not know where to start.
My idea is that the spike occurs in January, when the Super Bowl happens. In the week column that would be 2001-01-01 to 2001-01-31. Is it possible to subset a period within ggplot and color the graph accordingly, i.e. use a different color for the Super Bowl weeks?
That is, for each year from 2001 to 2012, color January (01-01 to 01-31) red, for example. That is 4 weeks of data per year. What I currently have is:
df[, .(df_sales = (sum(qty) * (EUR))), by = week] %>%
  ggplot(aes(x = week, y = df_sales)) +
  labs(x = 'wks', title = 'TS plot of qty x eur')
Which gives me a nice plot, but I would like to color-code the spikes (my hypothesis being that they occur in January, around the week of the Super Bowl). I can post the graph for clarification if necessary.
ID unit qty NA EUR KEY identity week
1: 1123539 1147 1 GR 2.39 652159 10090100003 2001-08-20
2: 3102228 1129 1 GR 2.15 257871 10090100003 2001-04-16
3: 3321265 1129 1 GR 2.15 257871 10090100003 2001-04-16
4: 3321265 1122 1 GR 2.15 257871 10090100004 2001-02-26
5: 1120774 1151 1 GR 2.39 213290 10090100005 2001-09-17
6: 1145763 1157 1 GR 2.39 213290 10090100005 2001-10-29
EDIT: I attach the graph for clarification
EDIT2: I attach the new graph
You can just use a second geom in conjunction with subset, like this:
library(ggplot2)
library(lubridate)
ggplot(df, aes(x = week, y = df_sales)) +
  geom_bar(stat = "identity") +
  geom_bar(data = subset(df, month(week) == 1), stat = "identity", col = "red") +
  labs(x = 'wks', title = 'TS plot of qty x eur')
Here we use lubridate::month to check which row belongs to a week in January.
For some fictional random data:
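A self-contained sketch of the same idea with simulated weekly sales follows. The numbers are random and purely illustrative, and fill is used here so that the January bars are colored throughout, which is a small departure from the col argument above:

```r
library(ggplot2)
library(lubridate)

set.seed(1)
# Simulated weekly data over two years; the values are random
df <- data.frame(week = seq(as.Date("2001-01-01"), as.Date("2002-12-30"), by = "week"))
df$df_sales <- runif(nrow(df), 50, 100)

# Make the January weeks spike so there is something to highlight
jan <- month(df$week) == 1
df$df_sales[jan] <- df$df_sales[jan] + 100

# Draw all weeks, then draw the January weeks again in red on top
p <- ggplot(df, aes(x = week, y = df_sales)) +
  geom_col() +
  geom_col(data = subset(df, month(week) == 1), fill = "red") +
  labs(x = 'wks', title = 'TS plot of qty x eur')
p
```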
I have the following:
> ArkHouse2014 <- read.csv(file="C:/Rwork/ar14.csv", header=TRUE, sep=",")
> ArkHouse2014
DISTRICT GOP DEM
1 AR-60 3,951 4,001
2 AR-61 3,899 4,634
3 AR-62 5,130 4,319
4 AR-100 6,550 3,850
5 AR-52 5,425 3,019
6 AR-10 3,638 5,009
7 AR-32 6,980 5,349
What I would like to do is make a barplot (or series of barplots) to compare the totals in the second and third columns on the y-axis while the x-axis would display the information in the first column.
It seems like this should be very easy to do, but most of the information on making barplots that I can find has you make a table from the data and then barplot that, e.g.,
> table(ArkHouse2014$GOP)
2,936 3,258 3,508 3,573 3,581 3,588 3,638 3,830 3,899 3,951 4,133 4,166 4,319 4,330 4,345 4,391 4,396 4,588
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4,969 5,130 5,177 5,343 5,425 5,466 5,710 5,991 6,070 6,100 6,234 6,490 6,550 6,980 7,847 8,846
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I don't want the counts of how many have each total, I'd like to just represent the quantities visually. I feel pretty stupid not being able to figure this out, so thanks in advance for any advice you have to offer me.
Here's an option using libraries reshape2 and ggplot2:
I first read your data (with dec = ",", so that a value such as 3,951 is read as the number 3.951, i.e. in thousands):
df <- read.table(header=TRUE, text="DISTRICT GOP DEM
1 AR-60 3,951 4,001
2 AR-61 3,899 4,634
3 AR-62 5,130 4,319
4 AR-100 6,550 3,850
5 AR-52 5,425 3,019
6 AR-10 3,638 5,009
7 AR-32 6,980 5,349", dec = ",")
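Because dec = "," treats the comma as a decimal mark, the bar heights come out in thousands of votes. If the raw counts are wanted instead, one possible sketch is to read the columns as text and strip the commas (the name df_counts is just illustrative):

```r
# Read every column as character so the commas survive
df_counts <- read.table(header = TRUE, colClasses = "character", text = "DISTRICT GOP DEM
1 AR-60 3,951 4,001
2 AR-61 3,899 4,634
3 AR-62 5,130 4,319
4 AR-100 6,550 3,850
5 AR-52 5,425 3,019
6 AR-10 3,638 5,009
7 AR-32 6,980 5,349")

# Drop the thousands separators and convert to numbers
df_counts$GOP <- as.numeric(gsub(",", "", df_counts$GOP))
df_counts$DEM <- as.numeric(gsub(",", "", df_counts$DEM))
df_counts
```

df_counts could then be passed to melt() in place of df; the shape of the plot stays the same and only the y-axis scale changes.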
Then reshape it to long format:
library(reshape2)
df_long <- melt(df, id.var = "DISTRICT")
Then create a barplot using ggplot:
library(ggplot2)
ggplot(df_long, aes(x = DISTRICT, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge")
or if you want the bars stacked:
ggplot(df_long, aes(x = DISTRICT, y = value, fill = variable)) +
  geom_bar(stat = "identity")