Pasting the RMSE and MAE next to a forecast in ggplot2 - r

I have some locations with products that I need to forecast, around 300 location-product combinations in total. Each plot shows one location with all of its products (like in the image). Now I want to print the RMSE and MAE for each product next to the graph in the plot. They are calculated and stored in data_accuracy in their own columns, but I have no idea how to print them into the plot. Is there a feasible way of doing it?
Ps: I know about the out of bounds error. It just happens with this example data but the results work as normal.
library(tidyverse)
library(tsibble)
library(fable)
library(lubridate)
data_ts <- data.frame(sales_year = c("2015-01-01", "2015-01-01", "2015-01-01", "2016-01-01",
"2016-01-01", "2016-01-01", "2017-01-01", "2017-01-01", "2017-01-01", "2018-01-01", "2018-01-01",
"2018-01-01", "2019-01-01", "2019-01-01", "2019-01-01"), product = c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c"), sales = c(6, 11, 13, 6, 2, 6, 11, 12, 10, 4, 12, 2,
17, 6, 8))
data_ts <- data_ts %>%
  mutate(sales_year = year(sales_year))
data_ts <- tsibble(data_ts, index = sales_year, key = product)
# sales_year is now a numeric year, so compare it with a number, not a date string
data_train <- data_ts %>%
  filter(sales_year < 2018)
data_ses <- data_train %>%
  model(ETS(sales ~ error("A") + trend("N") + season("N")))
data_ses_fc <- data_ses %>%
  forecast(h = 1)
data_accuracy <- data_ses_fc %>%
  accuracy(data_ts)
data_ses_fc %>%
  autoplot(data_ts) +
  geom_line(aes(y = .fitted), col = "red",
            data = augment(data_ses))

Something like this?
data_accuracy <- data_ses_fc %>%
  accuracy(data_ts) %>%
  mutate(res = paste0("RMSE: ", round(RMSE, 2), "\n",
                      "MAE: ", round(MAE, 2)))
data_ses_fc %>%
  autoplot(data_ts) +
  geom_line(aes(y = .fitted), col = "red",
            data = augment(data_ses)) +
  geom_text(data = data_accuracy, aes(x = 2016, y = 12, group = product, label = res))
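If each product ends up in its own facet, the same idea can be sketched with plain ggplot2. This is only an illustration under assumptions: the sales and acc data frames below are made up (the accuracy numbers are not real results), and the plot is faceted with facet_wrap() rather than produced by autoplot(). Because the label data frame carries the faceting column, each label lands in the matching panel, and using -Inf/Inf pins it to the top-left corner regardless of the data range.

```r
library(ggplot2)

# Hypothetical data: three products over five years
sales <- data.frame(
  year    = rep(2015:2019, each = 3),
  product = rep(c("a", "b", "c"), times = 5),
  sales   = c(6, 11, 13, 6, 2, 6, 11, 12, 10, 4, 12, 2, 17, 6, 8)
)

# Hypothetical accuracy table (made-up numbers), one row per product
acc <- data.frame(
  product = c("a", "b", "c"),
  RMSE    = c(1.234, 2.5, 0.75),
  MAE     = c(1.0, 2.1, 0.6)
)
acc$res <- paste0("RMSE: ", round(acc$RMSE, 2), "\n",
                  "MAE: ", round(acc$MAE, 2))

p <- ggplot(sales, aes(year, sales)) +
  geom_line() +
  facet_wrap(~product) +
  # one label per panel, anchored to the top-left corner
  geom_text(data = acc, aes(x = -Inf, y = Inf, label = res),
            hjust = -0.1, vjust = 1.2, inherit.aes = FALSE)
p
```

With around 300 combinations, building the label column once and letting the facets place it is usually easier than hard-coding x and y per product.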

Related

R ggplot legend with Waffle chart

library(tidyverse)
library(waffle)
df_2 <- structure(list(group = c(2, 2, 2, 1, 1, 1),
parts = c("A", "B", "C", "A", "B", "C"),
values = c(1, 39, 60, 14, 15, 71)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
df_2 %>% ggplot(aes(label = parts)) +
  geom_pictogram(
    n_rows = 10, aes(color = parts, values = values),
    family = "fontawesome-webfont",
    flip = TRUE
  ) +
  scale_label_pictogram(
    name = "Case",
    values = c("male"),
    breaks = c("A", "B", "C"),
    labels = c("A", "B", "C")
  ) +
  scale_color_manual(
    name = "Case",
    values = c("A" = "red", "B" = "green", "C" = "grey85"),
    breaks = c("A", "B", "C"),
    labels = c("A", "B", "C")
  ) +
  facet_grid(~group)
With the above code, I got the legend I expected:
However, when I replaced df_2 with the following df_1 dataframe, I was unable to combine two legends.
df_1 <- structure(list(group = c(2, 2, 2, 1, 1, 1),
parts = c("A", "B", "C", "A", "B", "C"),
values = c(0, 0, 100, 0, 0, 100)),
row.names = c(NA,-6L), class = c("tbl_df", "tbl", "data.frame"))
I kind of know the cause of the problem (0 values) but I would like to keep the legend the same as the graph above. Any suggestions would be appreciated.
To make it clear, the package "waffle" referred to here is not the CRAN package "waffle", but the GitHub-only package:
remotes::install_github("hrbrmstr/waffle")
library(waffle)
You will also need a way of displaying the pictograms, such as:
library(emojifont)
load.fontawesome()
Now, as with any other discrete scale, if you want to add values that are not present in the (post-stat) data, you need to use the limits argument:
df_1 %>% ggplot(aes(label = parts)) +
  geom_pictogram(
    n_rows = 10, aes(color = parts, values = values),
    family = "fontawesome-webfont",
    flip = TRUE
  ) +
  scale_label_pictogram(
    name = "Case",
    values = c("male"),
    breaks = c("A", "B", "C"),
    labels = c("A", "B", "C"),
    limits = c("A", "B", "C")
  ) +
  scale_color_manual(
    name = "Case",
    values = c("A" = "red", "B" = "green", "C" = "grey85"),
    breaks = c("A", "B", "C"),
    labels = c("A", "B", "C")
  ) +
  facet_grid(~group)
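The limits argument works the same way on ordinary ggplot2 scales, so here is a minimal sketch that does not need the waffle package. The data frame d is made up for illustration and contains only level "C"; the other two levels are declared through limits. In recent ggplot2 versions you may also need show.legend = TRUE on the layer for the key glyphs of the unused levels to be drawn, which is why it is set here.

```r
library(ggplot2)

# Hypothetical data in which only level "C" actually occurs
d <- data.frame(x = 1:3, y = c(2, 5, 3), parts = "C")

sc <- scale_colour_manual(
  name   = "Case",
  values = c(A = "red", B = "green", C = "grey85"),
  limits = c("A", "B", "C")  # declare all three levels, present or not
)

p <- ggplot(d, aes(x, y, colour = parts)) +
  geom_point(show.legend = TRUE) +
  sc
p
```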
It is a bit tricky, but you could add 1 to all values so that every level is drawn and shows up in the legend as before, then use ggplot_build() to remove one row per level from each panel so the counts are right again, like this:
library(tidyverse) # includes ggplot2 and dplyr
library(waffle)
library(emojifont)
library(extrafont)
p <- df_1 %>% ggplot(aes(label = parts)) +
  geom_pictogram(
    n_rows = 10, aes(color = parts, values = values + 1),
    family = "fontawesome-webfont",
    flip = TRUE
  ) +
  scale_label_pictogram(
    name = "Case",
    values = c("male"),
    breaks = c("A", "B", "C"),
    labels = c("A", "B", "C")
  ) +
  scale_color_manual(
    name = "Case",
    values = c("A" = "red", "B" = "green", "C" = "grey85"),
    breaks = c("A", "B", "C"),
    labels = c("A", "B", "C")
  ) +
  facet_grid(~group)
q <- ggplot_build(p)
q$data[[1]] <- q$data[[1]] %>%
  group_by(PANEL) %>%
  slice(4:n())
q <- ggplot_gtable(q)
plot(q)
Created on 2022-10-20 with reprex v2.0.2

Problem with 'mutate()' input 'data' in ANOVA (rstatix)

This is driving me crazy. I am using anova_test from rstatix and it's telling me that my columns aren't there when they clearly are.
This is what my dataframe looks like:
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
df <- data.frame(ID, Form, Pen, Time)
ID, Form, and Pen are factors, while Time is numeric. So each subject completed forms A and B with Red, Blue, and Green pens, and I measured how long each took in completing the form.
This is a fake dataset that I've purposefully come up with to ask this question. In reality, this dataframe is derived from a larger dataset with several more variables. Each variable has a lot more observations (so not just one datapoint for subject 1 & Form A & Red Pen, as in this example, but multiple), so I've summarized them to get mean Time.
df <- original.df %>% dplyr::select(ID, Form, Pen, Time)
df <- df %>% dplyr::group_by(ID, Form, Pen) %>% dplyr::summarise(Time = mean(Time))
df <- df %>% convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
I wanted to test the main and interaction effects, so I'm doing a 2 by 3 repeated measures ANOVA (a two-way ANOVA, because Form and Pen are two independent variables).
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
and I KEEP getting this error:
Error: Problem with `mutate()` input `data`.
x Can't subset columns that don't exist.
x Columns `ID` and `Form` don't exist.
ℹ Input `data` is `map(.data$data, .f, ...)`.
WHY?! Any help would be greatly appreciated. I've been searching solutions for HOURS and I'm getting pretty frustrated.
Thank you for adding the additional details to the post - based on what you've provided it looks like you need to ungroup your df before passing it to anova_test(), e.g.
#install.packages("rstatix")
library(rstatix)
library(tidyverse)
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
original.df <- data.frame(ID, Form, Pen, Time)
df <- original.df %>%
  dplyr::select(ID, Form, Pen, Time)
df <- df %>%
  dplyr::group_by(ID, Form, Pen) %>%
  dplyr::summarise(Time = mean(Time))
df <- df %>%
  convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
df <- ungroup(df)
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
You can see whether a dataframe is grouped using str(), e.g. running str(df) before and after ungroup() shows you the difference. Please let me know if you are still getting errors after making this change.
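As a quicker check than str(), dplyr also has is_grouped_df() and group_vars(). The sketch below rebuilds the example data from the question and shows that summarise() only drops the last grouping variable, so the result is still grouped by ID and Form. Passing .groups = "drop" is an alternative to calling ungroup() afterwards; the object names grouped and flat are just for illustration.

```r
library(dplyr)

df <- data.frame(
  ID   = rep(1:3, each = 6),
  Form = rep(rep(c("A", "B"), each = 3), times = 3),
  Pen  = rep(c("Red", "Blue", "Green"), times = 6),
  Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
)

# summarise() peels off only the last grouping level (Pen)
grouped <- df %>%
  group_by(ID, Form, Pen) %>%
  summarise(Time = mean(Time))

is_grouped_df(grouped)  # TRUE
group_vars(grouped)     # "ID" "Form"

# Drop all grouping in the same step instead of calling ungroup() later
flat <- df %>%
  group_by(ID, Form, Pen) %>%
  summarise(Time = mean(Time), .groups = "drop")

is_grouped_df(flat)     # FALSE
```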

Plotting based on occurrence in group

I would like to make a bar chart that plots each bar as a proportion of the total number of groups rather than the usual percentage. For a var to "count", it only needs to occur once in a group. For example, in this df, where id is the grouping variable:
df <-
tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
vars = c("a", NA, "b", "c", "d", "e", "a", "a", "a"))
The bars would be:
a = 2/3 # since a occurs in 2 out of 3 groups
b = 1/3
c = 1/3
d = 1/3
e = 1/3
If I understand you correctly, a one-liner would suffice:
ggplot(distinct(df)) + geom_bar(aes(vars, after_stat(count) / n_distinct(df$id)))
Working answer:
tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
       vars = c("a", "a", "b", "c", "d", "e", "a", "a", "a")) %>%
  group_by(id) %>%
  distinct(vars) %>%
  ungroup() %>%
  add_count(vars) %>%
  mutate(prop = n / n_distinct(id)) %>%
  distinct(vars, .keep_all = TRUE) %>%
  ggplot(aes(vars, prop)) +
  geom_col()
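To check the numbers before plotting, the proportions can be computed on their own. This is a sketch using the data from the question (including the NA); dropping the NA row is my assumption about what is wanted, since the expected output lists no NA bar.

```r
library(dplyr)

df <- tibble(id = rep(1:3, each = 3),
             vars = c("a", NA, "b", "c", "d", "e", "a", "a", "a"))

props <- df %>%
  filter(!is.na(vars)) %>%   # assumption: NA should not get its own bar
  distinct(id, vars) %>%     # each var counts at most once per id
  count(vars) %>%            # number of ids in which each var occurs
  mutate(prop = n / n_distinct(df$id))

props
```

Here a occurs in ids 1 and 3, so its proportion is 2/3, and every other value occurs in exactly one id.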

Find overlapping rows between dataframes using dplyr?

df1 <- tibble(time1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
              time2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
              id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))
df2 <- tibble(time = sort(runif(100, 0, 10)),
              C = rbinom(100, 1, 0.5))
For every row in df1, I want to find the rows in df2 that overlap for time, then assign the median C value for this group of df2 rows to a new column in df1. I'm sure there's some simple way to do this with dplyr's between function, but I'm new to R and haven't been able to figure it out. Thanks!
Here's a way, using the merge function to basically do a SQL style cross join, then using the between function:
library(tidyverse)
merge(df1, df2, all = TRUE) %>%
  rowwise() %>%
  mutate(time_between = between(time, time1, time2)) %>%
  filter(time_between) %>%
  group_by(time1, time2, id) %>%
  summarise(med_C = median(C))
Using the filter function may result in losing some rows from df1, so an alternative method would be:
merge(df1, df2, all = TRUE) %>%
  rowwise() %>%
  mutate(time_between = between(time, time1, time2)) %>%
  group_by(time1, time2, id) %>%
  summarise(med_C = median(ifelse(time_between, C, NA), na.rm = TRUE))
You can do this in base R with sapply:
df1$median_c <- sapply(seq_along(df1$id), function(i) {
  median(df2$C[df2$time > df1$time1[i] & df2$time < df1$time2[i]])
})
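Since the data in the question is random, here is a small deterministic sketch of the base R approach that can be checked by hand. The values are made up, the bounds are inclusive (as with between()), and vapply() is used in place of sapply() only to guarantee a numeric result. Note that an interval with no matching rows in df2 is kept and gets NA rather than being dropped.

```r
# Hypothetical, hand-checkable data
df1 <- data.frame(time1 = c(0, 1, 2),
                  time2 = c(1, 2, 3),
                  id    = c("a", "b", "c"))
df2 <- data.frame(time = c(0.2, 0.5, 0.9, 1.5, 1.7),
                  C    = c(1, 0, 1, 0, 0))

df1$med_C <- vapply(seq_len(nrow(df1)), function(i) {
  # rows of df2 whose time falls inside the i-th interval (inclusive)
  inside <- df2$time >= df1$time1[i] & df2$time <= df1$time2[i]
  median(df2$C[inside])  # NA when no row falls inside
}, numeric(1))

df1
```

Interval a covers C values 1, 0, 1 (median 1), interval b covers 0, 0 (median 0), and interval c covers nothing, so its median is NA.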

Plot observation number (label) in outlier points

I have this boxplot with outliers. I need to label each outlier with the row number of its observation, to make it easy to go into the data set and find the value. Can somebody help me?
set.seed(1)
a <- runif(10,1,100)
b <-c("A","A","A","A","A","B","B","B","B","B")
t <- cbind(a,b)
bp <- boxplot(a~b)
text(x = 1, y = bp$stats[,1] + 2, labels = round(bp$stats[,1], 2))
text(x = 2, y = bp$stats[,2] + 2, labels = round(bp$stats[,2], 2))
What is the point of t <- cbind(a, b)? That makes a character matrix and converts your numbers to character strings, and you don't use it anyway. If you want a single data structure, use data.frame(a, b), which keeps a numeric and stores b as a character vector (or as a factor in R versions before 4.0). I do not get the plot you do with set.seed(1), so I'll provide slightly different data. Note the use of the pos= and offset= arguments in text(). Be sure to read the manual page to see what they are doing:
a <- c(99.19, 59.48, 48.95, 18.17, 75.73, 45.94, 51.61, 21.55, 37.41,
59.98, 57.91, 35.54, 4.52, 64.64, 75.03, 60.21, 56.53, 53.08,
98.52, 51.26)
b <- c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B")
bp <- boxplot(a~b)
text(x = 1, y = bp$stats[,1], labels = round(bp$stats[, 1], 2),
pos=c(1, 3, 3, 1, 3), offset=.2)
text(x = 2, y = bp$stats[, 2], labels = round(bp$stats[, 2], 2),
pos=c(1, 3, 3, 1, 3), offset=.2)
obs <- which(a %in% bp$out)
text(bp$group, bp$out, obs, pos=4)
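One thing to watch with which(a %in% bp$out) is that it returns indices in data order, while bp$out is ordered by group, so the labels could be swapped if an outlier in a later group sits in an earlier row. A hedged alternative is match(), which looks up each outlier in the order boxplot() reports it. This sketch reuses the data above and assumes the outlier values are unique in a (with ties, match() returns only the first position).

```r
a <- c(99.19, 59.48, 48.95, 18.17, 75.73, 45.94, 51.61, 21.55, 37.41,
       59.98, 57.91, 35.54, 4.52, 64.64, 75.03, 60.21, 56.53, 53.08,
       98.52, 51.26)
b <- rep(c("A", "B"), each = 10)

bp <- boxplot(a ~ b)

# Row number of each outlier, in the same order as bp$out and bp$group
obs <- match(bp$out, a)

# Label each outlier with its row number, to the right of the point
text(bp$group, bp$out, labels = obs, pos = 4)
```

For this data the outliers are rows 1 (group A) and 13 and 19 (group B).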
