Inventory Projection Calculation in R

I am trying to replace an obsolete Excel report, currently used by our supply chain team for sales forecasting and inventory projections, with R.
The desired output is a data frame in which one column holds the projected closing inventory position for each week across a span of N weeks.
The part I am struggling with is the recursive calculation for the closing inventory positions. Below is a subset of the data frame with dummy data, where "stock_projection" is the desired result.
I've just started learning about recursion in R, so I am not really sure how to implement it here. Any help will be much appreciated!
week  forecast  opening_stock  stock_projection
   1        10            100               100
   2        11             NA                89
   3        12             NA                77
   4        10             NA                67
   5        11             NA                56
   6        10             NA                46
   7        12             NA                34
   8        11             NA                23
   9         9             NA                14
  10        12             NA                 2
Update
I have managed to modify the solution explained here and have replicated the above outcome:
library(dplyr)
library(tidyr)

inventory <- tibble(week = 1, opening_stock = 100)
forecast <- tibble(week = 2:10, forecast = c(11, 12, 10, 11, 10, 12, 11, 9, 12))

dat <- full_join(inventory, forecast)

dat2 <- dat %>%
  mutate(forecast = -forecast) %>%
  gather(transaction, value, -week) %>%
  arrange(week) %>%
  mutate(value = replace_na(value, 0))

dat2 %>%
  mutate(value = cumsum(value)) %>%
  group_by(week) %>%
  summarise(stock_projection = last(value))
It works like a charm, but I am wondering whether there is another way to achieve this?

I think in the question above you don't have to worry about recursion at all, because the stock projection is just the opening stock minus the cumulative sum of the forecast. You could do that with:
library(dplyr)

dat <- tibble(
  week = 1:10,
  forecast = c(10, 11, 12, 10, 11, 10, 12, 11, 9, 12),
  opening_stock = c(100, rep(NA, 9))
)

dat <- dat %>%
  mutate(fcst = case_when(week == 1 ~ 0,
                          TRUE ~ forecast),
         stock_projection = case_when(
           week == 1 ~ opening_stock,
           TRUE ~ opening_stock[1] - cumsum(fcst))) %>%
  dplyr::select(-fcst)

dat
# # A tibble: 10 × 4
# week forecast opening_stock stock_projection
# <int> <dbl> <dbl> <dbl>
# 1 1 10 100 100
# 2 2 11 NA 89
# 3 3 12 NA 77
# 4 4 10 NA 67
# 5 5 11 NA 56
# 6 6 10 NA 46
# 7 7 12 NA 34
# 8 8 11 NA 23
# 9 9 9 NA 14
# 10 10 12 NA 2
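Since the question asks about recursion: the recurrence "this week's projection = last week's projection minus this week's forecast" can also be written directly with purrr::accumulate(), which folds over the forecasts and keeps every intermediate result. A minimal sketch, reusing the dat built above (stock_projection2 is just a new column for comparison):

library(purrr)

# start from the week-1 opening stock, then subtract each later week's forecast;
# accumulate() returns all intermediate values, one per week
dat$stock_projection2 <- accumulate(dat$forecast[-1], `-`, .init = dat$opening_stock[1])
dat$stock_projection2
# [1] 100  89  77  67  56  46  34  23  14   2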

Related

How to use R to create baseball Splits from Game Logs

I'm trying to use R to recreate Baseball Splits as found on MLB.com. The splits are created from Game Logs and provide different cuts of the data: for example, home games vs. away games, day games vs. night games, August vs. September, and many more, all in one convenient table. I believe the ratios (AVG, OBP, SLG) can all be added via mutate once the basic splits have been totaled.
My question is: what's the best and most efficient way to create these splits, and how should the data be shaped? The game log obviously has additional (hidden) columns that contain the split topics. The nature of the problem leads me to believe purrr might be a tool to employ, but I can't quite wrap my mind around how to approach this one.
Here is how I believe the data should be shaped and a link to a sample game log. I would appreciate any thoughts, ideas or solutions to this problem.
Links and images of Game Logs and Splits for Nationals outfielder Juan Soto are set forth below.
Game Logs: Juan Soto Game Log
Splits: Juan Soto Game Splits
I've gone through the dataset, although I'm not sure the sums or the averages match the images above.
You're right about using mutate to create the ratio values you mention.
Hopefully my approach can help you get what you're after.
library(tidyverse)
library(data.table)

game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- fread(game.splits, fill = TRUE)

game.splits.pivot <- game.splits
game.splits.pivot$Var1 <- ifelse(game.splits.pivot$Var1 %in% "HOME", 1, 0)
game.splits.pivot$Var2 <- ifelse(game.splits.pivot$Var2 %in% "NIGHT", 3, 2)
game.splits.pivot$Var3 <- ifelse(game.splits.pivot$Var3 %in% "SEPTEMBER", 5, 4)

game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:16, 20))
colnames(game.splits.pivot)[19] <- "name_c"
game.splits.pivot <- game.splits.pivot[, -c(17, 18)]
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:3, 17))

# aggregate the splits
game.splits.pivot_test <- game.splits.pivot[, -c(1, 2, 3)]
game.splits.pivot_test <- aggregate(value ~ name_c + name, game.splits.pivot_test, sum)
game.splits.pivot_test <- game.splits.pivot_test %>%
  pivot_wider(names_from = name, values_from = value)

lc_name <- tibble(name_c = 0:5, split = c("HOME", "AWAY", "DAY", "NIGHT", "AUGUST", "SEPTEMBER"))

game.splits.pivot_test <- game.splits.pivot_test %>%
  inner_join(lc_name, by = "name_c") %>%
  arrange(name_c) %>%
  select(-name_c)

game.splits.pivot_test <- game.splits.pivot_test[, c(14, 3, 9, 6, 1, 2, 7, 10, 4, 8, 12, 11, 5, 13)]
A look into the dataset:
# A tibble: 6 x 14
split AB R H `2B` `3B` HR RBI BB IBB SO SB CS TB
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 HOME 88 24 32 5 0 9 23 15 5 12 1 2 64
2 AWAY 66 15 22 9 0 4 14 26 7 16 5 0 43
3 DAY 29 21 18 4 0 5 17 12 4 3 4 0 37
4 NIGHT 125 18 36 10 0 8 20 29 8 25 2 2 70
5 AUGUST 90 21 33 6 0 11 25 13 1 13 1 1 72
6 SEPTEMBER 64 18 21 8 0 2 12 28 11 15 5 1 35
This turned out to be more straightforward than I had thought. The following solution relies upon pivot_longer to shape the data and summarise_if to tally the splits - no rbinds or purrr needed.
library(tidyverse)

game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- read_csv(game.splits)

game.splits %>%
  pivot_longer(Var1:Var3, names_to = "split") %>%
  arrange(split) %>%
  select(split, value, everything()) %>%
  select(-Date, -OPP) %>%
  mutate(value = str_c(split, "_", value)) %>%
  group_by(value) %>%
  summarise_if(is.numeric, sum) %>%
  mutate(value = str_replace(value, "(Var\\d_)", ""))
#> # A tibble: 6 x 14
#> value AB R H TB `2B` `3B` HR RBI BB IBB SO SB
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AWAY 88 24 32 64 5 0 9 23 15 5 12 1
#> 2 HOME 66 15 22 43 9 0 4 14 26 7 16 5
#> 3 DAY 29 21 18 37 4 0 5 17 12 4 3 4
#> 4 NIGHT 125 18 36 70 10 0 8 20 29 8 25 2
#> 5 AUGUST 90 21 33 72 6 0 11 25 13 1 13 1
#> 6 SEPTE~ 64 18 21 35 8 0 2 12 28 11 15 5
Created on 2021-03-03 by the reprex package (v0.3.0)
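To get the rate stats the question mentions, a rough follow-up sketch (assuming the summarised tibble above is stored as splits; OBP is simplified because the sample log has no HBP or SF columns, and BB is assumed to already include IBB):

splits %>%
  mutate(AVG = H / AB,                # batting average
         SLG = TB / AB,               # slugging percentage
         OBP = (H + BB) / (AB + BB))  # simplified on-base percentage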

paste together first and last value by group

I have a df that looks like this:
group sequence link
90 1 11|S1
90 2 10|S1
90 3 12|10
91 1 9|10
91 2 13|9
93 1 15|20
...
How can I store the first and last value of the link variable in each group as a new variable?
Desired output is:
group sequence link Key
90 1 11|S1 11|S1, 12|10
90 2 10|S1 11|S1, 12|10
90 3 12|10 11|S1, 12|10
91 1 9|10 9|10, 13|9
91 2 13|9 9|10, 13|9
93 1 15|20
....
You could do:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(Key = paste(link[1], link[n()], sep = ", "))
Though that wouldn't quite match your desired output: in your example data frame, group 93 has only one value, and the above code would give you 15|20 repeated as both beginning and end.
If you'd like to only display one value in such cases, you can do:
df %>%
  group_by(group) %>%
  mutate(Key = case_when(
    n() > 1 ~ paste(link[1], link[n()], sep = ", "),
    TRUE ~ as.character(link)
  ))
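If the real table is large, the same first/last logic ports to data.table; a sketch, assuming df has the columns shown above:

library(data.table)

setDT(df)  # convert to a data.table in place
df[, Key := paste(link[1], link[.N], sep = ", "), by = group]  # first and last link per group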
I think you could use arrange() and slice() to find the first/last links in your data. My solution is lengthier than @arg0naut91's, but perhaps more intuitive.
Create toy data frame...
df <- data.frame(group = rep(letters, 3),      # create toy data frame
                 sequence = rep(1:3, 26),
                 link = sample(9:13, 78, T)) %>%
  arrange(group, sequence) %>%                 # arrange data
  group_by(group, link) %>% sample_n(1) %>%    # drop duplicate link values (to create an uneven sequence var)
  ungroup() %>% arrange(group, sequence)       # arrange again to view
glimpse(df)
Find first and last links. Add them as new columns to the data frame.
df <- df %>% arrange(group, link) %>% group_by(group) %>%
  slice(1) %>% mutate(link.first = link) %>%      # find first link for each group
  select(group, link.first) %>% left_join(df, .)  # add to original data frame

df <- df %>% arrange(group, link) %>% group_by(group) %>%
  slice(n()) %>% mutate(link.last = link) %>%     # find last link for each group
  select(group, link.last) %>% left_join(df, .)   # add to original data frame

df %>% mutate(key = paste(link.first, link.last, sep = ', '))  # paste links to form key
# A tibble: 62 x 6
group sequence link link.first link.last key
<fct> <int> <int> <int> <int> <chr>
1 a 1 10 10 12 10, 12
2 a 2 12 10 12 10, 12
3 b 2 9 9 11 9, 11
4 b 3 11 9 11 9, 11
5 c 1 13 9 13 9, 13
6 c 2 12 9 13 9, 13
7 c 3 9 9 13 9, 13
8 d 1 9 9 13 9, 13
9 d 3 13 9 13 9, 13
10 e 1 11 9 11 9, 11
Since I used sample() with replacement to generate the data, there may be some groups with only one row (i.e., the same first and last link values), which can be filtered out.
df %>% filter(link.first==link.last)
# A tibble: 2 x 5
group sequence link link.first link.last
<fct> <int> <int> <int> <int>
1 k 2 9 9 9
2 z 1 9 9 9
df %>% count(group) %>% filter(n==1)

programming R loop

I need help with programming in R. I have a data.frame B with one column:
x <- c("300", "300", "300", "400", "400", "400", "500", "500", "500", ...)  # 2 million rows
and I need to create a second column with a rank that restarts at 1 whenever the value changes. The new column should look like:
y <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, ...)
I used a for loop:
B$y[1]=1
for (i in 2:length(B$x))
{
B$y[i]<-ifelse(B$x[i]==B$x[i-1], B$y[i-1]+1, 1)
}
The process ran for 4 hours, so I am looking for anything that speeds this up. Thanks for your answers.
Here is a solution with base R:
B <- data.frame(x = rep(c(300, 400, 400), sample(c(5:10), 3)))
B
# number the rows 1, 2, 3, ... within each distinct value of x
B$y <- ave(B$x, B$x, FUN = seq_along)
Here's an approach with dplyr that takes about 0.2 seconds on 2 million rows.
First I make sample data:
n <- 2E6  # number of rows in test
library(dplyr)

sample_data <- data.frame(
  x = round(runif(n = n, min = 1, max = 100000), digits = 0)
) %>%
  arrange(x)  # optional; makes the output clearer, so that each x is adjacent to its matches
Then I group by x and make y show which # occurrence of x it is within that group.
sample_data_with_rank <- sample_data %>%
  group_by(x) %>%
  mutate(y = row_number()) %>%
  ungroup()
head(sample_data_with_rank, 20)
# A tibble: 20 x 2
x y
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 1 13
14 1 14
15 1 15
16 2 1
17 2 2
18 2 3
19 2 4
20 2 5
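For comparison, a data.table sketch (assuming B is the original data frame) does the same grouping-by-value in one line and is also fast on 2 million rows:

library(data.table)

setDT(B)            # convert to a data.table in place
B[, y := rowid(x)]  # 1, 2, 3, ... within each distinct value of x
# if equal values can recur in separate runs and each run should restart at 1,
# use rowid(rleid(x)) instead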

Rolling sum over multiple columns in r

I am working in R with a dataset that looks like this:
test <- data.frame("1991" = c(1, 5, 3), "1992" = c(4, 3, 3), "1993" = c(10, 5, 3),
                   "1994" = c(1, 1, 1), "1995" = c(2, 2, 6))
test <- plyr::rename(test, c("X1991" = "1991", "X1992" = "1992", "X1993" = "1993",
                             "X1994" = "1994", "X1995" = "1995"))
I want to create variables called Pre1991, Pre1992, Pre1993, ... that store the cumulative values up to and including that year, e.g.
Pre1991 = test$`1991`
Pre1992 = test$`1991` + test$`1992`
Pre1993 = test$`1991` + test$`1992` + test$`1993`
and so on.
My real dataset has variables for the years 1900-2017, so I can't do this manually. I tried to write a for loop, but it didn't work:
for (i in 1900:2017) {
  x <- paste0("Pre", i)
  df[[x]] <- rowSums(df[, (colnames(df) <= i)])
}
Can someone please help to review my code/ suggest other ways to do it? Thanks!
Edit 1:
Thanks so much! And I'm wondering if there's a way to use the cumsum function in the reverse direction? For example, if I am interested in what happened after a particular year:
Post1991 = test$`1992` + test$`1993` + test$`1994` + test$`1995` + ...
Post1992 = test$`1993` + test$`1994` + test$`1995` + ...
Post1993 = test$`1994` + test$`1995` + ...
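As an aside on why the question's loop fails: when exactly one column name satisfies the condition, df[, cond] drops to a plain vector, which rowSums() rejects, and once the first Pre column has been added, colnames(df) no longer contains only years. One possible repair (illustrative only, assuming every column of the frame is a year; the answers below are cleaner):

year_cols <- colnames(test)  # capture the year columns before any Pre columns are added
for (i in 1991:1995) {
  keep <- year_cols[as.numeric(year_cols) <= i]
  test[[paste0("Pre", i)]] <- rowSums(test[, keep, drop = FALSE])  # drop = FALSE keeps a data frame
}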
This is a little inefficient in that it is converting from a data.frame to a matrix and back, but ...
as.data.frame(t(apply(as.matrix(test), 1, cumsum)))
# 1991 1992 1993 1994 1995
# 1 1 5 15 16 18
# 2 5 8 13 14 16
# 3 3 6 9 10 16
If your data has other columns that are not year-based, such as
test$quux <- LETTERS[3:5]
test
# 1991 1992 1993 1994 1995 quux
# 1 1 4 10 1 2 C
# 2 5 3 5 1 2 D
# 3 3 3 3 1 6 E
then subset on both sides:
test[1:5] <- as.data.frame(t(apply(as.matrix(test[1:5]), 1, cumsum)))
test
# 1991 1992 1993 1994 1995 quux
# 1 1 5 15 16 18 C
# 2 5 8 13 14 16 D
# 3 3 6 9 10 16 E
EDIT
In reverse, just use repeated rev:
as.data.frame(t(apply(as.matrix(test), 1, function(a) rev(cumsum(rev(a)))-a)))
# 1991 1992 1993 1994 1995
# 1 17 13 3 2 0
# 2 11 8 3 2 0
# 3 13 10 7 6 0
Using the tidyverse, we can gather, calculate, and then spread again. For this to work, the data will need to be arranged.
library(tidyverse)
test <- data.frame("1991" = c(1, 5, 3),
                   "1992" = c(4, 3, 3),
                   "1993" = c(10, 5, 3),
                   "1994" = c(1, 1, 1),
                   "1995" = c(2, 2, 6))
test <- plyr::rename(test, c("X1991" = "1991",
                             "X1992" = "1992",
                             "X1993" = "1993",
                             "X1994" = "1994",
                             "X1995" = "1995"))
Forwards
test %>%
  mutate(id = 1:nrow(.)) %>%          # add an ID to identify groups
  gather(year, value, -id) %>%        # wide to long format
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(value = cumsum(value)) %>%
  ungroup() %>%
  spread(year, value) %>%             # long to wide format
  select(-id) %>%
  setNames(paste0("pre", names(.)))   # add prefix to columns
## A tibble: 3 x 5
# pre1991 pre1992 pre1993 pre1994 pre1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1. 5. 15. 16. 18.
# 2 5. 8. 13. 14. 16.
# 3 3. 6. 9. 10. 16.
Reverse direction
As your definition specifies, it's not strictly the reverse cumulative sum but the reverse cumulative sum excluding the year itself, i.e. the lagged reverse cumulative sum.
test %>%
  mutate(id = 1:nrow(.)) %>%
  gather(year, value, -id) %>%
  arrange(id, desc(year)) %>%                          # using desc() to reverse the sorting
  group_by(id) %>%
  mutate(value = cumsum(lag(value, default = 0))) %>%  # lagged cumsum
  ungroup() %>%
  spread(year, value) %>%
  select(-id) %>%
  setNames(paste0("post", names(.)))
## A tibble: 3 x 5
# post1991 post1992 post1993 post1994 post1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 17. 13. 3. 2. 0.
# 2 11. 8. 3. 2. 0.
# 3 13. 10. 7. 6. 0.
We can use rowCumsums from matrixStats
library(matrixStats)
test[] <- rowCumsums(as.matrix(test))
test
# 1991 1992 1993 1994 1995
#1 1 5 15 16 18
#2 5 8 13 14 16
#3 3 6 9 10 16
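Whichever answer you go with, the Pre prefix the question asks for can be added at the end with a plain rename, e.g.:

names(test) <- paste0("Pre", names(test))  # "Pre1991" ... "Pre1995"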

Manually calculate variance from count data for categorical ratings

I am trying to manually calculate the variance (and mean) from categorical rating count data.
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item = Item, Never = cat1, Rarely = cat2, Occasionally = cat3,
                   Sometimes = cat4, Frequently = cat5, Usually = cat6, Always = cat7,
                   stringsAsFactors = FALSE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always
1 A 4 NA 17 10 3 2 7
2 B 12 10 5 12 21 14 NA
3 C 17 20 12 17 NA 12 18
4 D NA 15 6 NA 16 20 23
Each categorical rating has an equivalent numeric value (1:7). I have calculated the average numerical rating for each Item as follows:
Rating_wt <- 1:7  # vector of weights for each frequency rating
Rating.wt.mat <- rep(Rating_wt, each = dim(Data[, 2:8])[1])
Data$Avg_rating <- rowSums(Data[, 2:8] * Rating.wt.mat, na.rm = TRUE) /
  rowSums(Data[, 2:8], na.rm = TRUE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg_rating
1 A 4 NA 17 10 3 2 7 3.976744
2 B 12 10 5 12 21 14 NA 3.837838
3 C 17 20 12 17 NA 12 18 3.739583
4 D NA 15 6 NA 16 20 23 5.112500
I would like to also calculate the variance for each Average and store that as a new variable in Data.
I believe I need to subtract the Average for each item from each numeric rating and multiply that value by the count in each respective cell, then sum those results across rows, then divide by the total counts in each row.
But, I can't figure out how to set up the element-wise calculations to accomplish that.
Conceptually, I think it should be something like this:
Data$Rating_var <- rowSums((Numeric_Rating - Avg_rating) * Value, na.rm = TRUE) /
  rowSums(Data[, 2:8], na.rm = TRUE)
Where Numeric_Rating corresponds to Rating_wt:
Never = 1
Rarely = 2
Occasionally = 3
Sometimes = 4
Frequently = 5
Usually = 6
Always = 7
and Value is the corresponding cell for each Numeric_Rating by Item intersection.
I'd suggest you try to reshape your dataset before you apply your calculations, as it will be easier.
library(dplyr)
library(tidyr)
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data %>%
  gather(category, value, -Item) %>%        # reshape dataset
  mutate(Rating = recode(category,
                         "Never" = 1, "Rarely" = 2, "Occasionally" = 3,
                         "Sometimes" = 4, "Frequently" = 5,
                         "Usually" = 6, "Always" = 7)) %>%  # assign numeric ratings
  group_by(Item) %>%                        # for each item
  mutate(Avg = sum(Rating * value, na.rm = TRUE) / sum(value, na.rm = TRUE),  # calculate Avg
         variance = sum(abs(Rating - Avg) * value, na.rm = TRUE) /
           sum(value, na.rm = TRUE)) %>%    # calculate the spread using the Avg
  ungroup() %>%                             # forget the grouping
  select(-Rating) %>%                       # the rating is no longer needed
  spread(category, value) %>%               # reshape back to the original form
  select_(.dots = c(names(Data), "Avg", "variance"))  # get columns in the desired order
# # A tibble: 4 x 10
# Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg variance
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 4 NA 17 10 3 2 7 3.976744 1.326122
# 2 B 12 10 5 12 21 14 NA 3.837838 1.530314
# 3 C 17 20 12 17 NA 12 18 3.739583 1.879991
# 4 D NA 15 6 NA 16 20 23 5.112500 1.529062
Try to run the piped process step by step to see how it works, especially if you're not familiar with the dplyr and tidyr syntax.
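One caveat: sum(abs(Rating - Avg) * value) / sum(value) is the mean absolute deviation, not the variance. If you want the variance proper, square the deviations inside the same mutate() instead (this would change the numbers in the variance column above, which were computed with the absolute-deviation version):

variance = sum((Rating - Avg)^2 * value, na.rm = TRUE) / sum(value, na.rm = TRUE)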
