I'm trying to use R to recreate the Baseball Splits found on MLB.com. The splits are built from game logs and provide different cuts of the data: for example, home games vs. away games, day games vs. night games, August vs. September, and many more, all in one convenient table. I believe the rate stats (AVG, OBP, SLG) can all be added via mutate once the basic splits have been totaled.
My question is: what's the best and most efficient way to create these splits, and how should the data be shaped? The game log obviously has additional (hidden) columns that contain the split topics. The nature of the problem leads me to believe purrr might be a tool to employ, but I can't quite wrap my mind around how to approach this one.
Below is how I believe the data should be shaped, along with a link to a sample game log. I would appreciate any thoughts, ideas or solutions to this problem.
Links and images of Game Logs and Splits for Nationals outfielder Juan Soto are set forth below.
Game Logs: Juan Soto Game Log
Splits: Juan Soto Game Splits
I've gone through the dataset, although I'm not sure whether the sums (or the averages) match the images above.
You're right that the values you mention can be added with mutate.
In any case, hopefully my approach can help you get what you're after.
library(tidyverse)
library(data.table)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- fread(game.splits, fill = TRUE)
game.splits.pivot <- game.splits
game.splits.pivot$Var1 <- ifelse(game.splits.pivot$Var1 %in% "HOME", 1, 0)
game.splits.pivot$Var2 <- ifelse(game.splits.pivot$Var2 %in% "NIGHT", 3, 2)
game.splits.pivot$Var3 <- ifelse(game.splits.pivot$Var3 %in% "SEPTEMBER", 5, 4)
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:16, 20))
colnames(game.splits.pivot)[19] <- "name_c"
game.splits.pivot <- game.splits.pivot[, -c(17, 18)]
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:3, 17))
#test
game.splits.pivot_test <- game.splits.pivot[, -c(1, 2, 3)]
game.splits.pivot_test <- aggregate(value ~ name_c + name, game.splits.pivot_test, sum)
game.splits.pivot_test <- game.splits.pivot_test %>% pivot_wider(names_from = name, values_from = value)
lc_name <- tibble(name_c = 0:5, split = c("HOME", "AWAY", "DAY", "NIGHT", "AUGUST", "SEPTEMBER"))
game.splits.pivot_test <- game.splits.pivot_test %>%
inner_join(lc_name, by = "name_c") %>%
arrange(name_c) %>%
select(-name_c)
game.splits.pivot_test <- game.splits.pivot_test[, c(14, 3, 9, 6, 1, 2, 7, 10, 4, 8, 12, 11, 5, 13)]
A look into the dataset:
# A tibble: 6 x 14
split AB R H `2B` `3B` HR RBI BB IBB SO SB CS TB
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 AWAY 88 24 32 5 0 9 23 15 5 12 1 2 64
2 HOME 66 15 22 9 0 4 14 26 7 16 5 0 43
3 DAY 29 21 18 4 0 5 17 12 4 3 4 0 37
4 NIGHT 125 18 36 10 0 8 20 29 8 25 2 2 70
5 AUGUST 90 21 33 6 0 11 25 13 1 13 1 1 72
6 SEPTEMBER 64 18 21 8 0 2 12 28 11 15 5 1 35
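As the question suggests, the rate stats can then be layered on top with mutate. A rough sketch using only the columns in the table above (OBP is approximate because the sample log carries no HBP or SF columns):
library(dplyr)

game.splits.pivot_test %>%
  mutate(
    AVG = round(H / AB, 3),                # batting average: hits per at-bat
    OBP = round((H + BB) / (AB + BB), 3),  # approximate on-base %, no HBP/SF available
    SLG = round(TB / AB, 3)                # slugging: total bases per at-bat
  )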
This turned out to be more straightforward than I had thought. The following solution relies on pivot_longer to shape the data and summarise_if to tally the splits; no rbinds or purrr needed.
library(tidyverse)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- read_csv(game.splits)
game.splits %>%
  pivot_longer(Var1:Var3, names_to = "split") %>%
  group_by(split) %>%
  arrange(split) %>%
  select(split, value, everything()) %>%
  ungroup() %>%
  select(split, value, everything()) %>%
  select(-Date, -OPP) %>%
  mutate(value = str_c(split, "_", value)) %>%
  group_by(value) %>%
  summarise_if(is.numeric, sum) %>%
  mutate(value = str_replace(value, "(Var\\d_)", ""))
#> # A tibble: 6 x 14
#> value AB R H TB `2B` `3B` HR RBI BB IBB SO SB
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AWAY 88 24 32 64 5 0 9 23 15 5 12 1
#> 2 HOME 66 15 22 43 9 0 4 14 26 7 16 5
#> 3 DAY 29 21 18 37 4 0 5 17 12 4 3 4
#> 4 NIGHT 125 18 36 70 10 0 8 20 29 8 25 2
#> 5 AUGUST 90 21 33 72 6 0 11 25 13 1 13 1
#> 6 SEPTE~ 64 18 21 35 8 0 2 12 28 11 15 5
Created on 2021-03-03 by the reprex package (v0.3.0)
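As an aside, summarise_if() has since been superseded in dplyr. On dplyr >= 1.0 the tallying step could equally be written with across(); a slightly condensed sketch of the same pipeline (results are identical, only the idiom changes):
# same totals with the newer across() idiom (dplyr >= 1.0)
game.splits %>%
  pivot_longer(Var1:Var3, names_to = "split") %>%
  select(-Date, -OPP) %>%
  mutate(value = str_c(split, "_", value)) %>%
  group_by(value) %>%
  summarise(across(where(is.numeric), sum)) %>%
  mutate(value = str_replace(value, "Var\\d_", ""))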
Related
I am trying to replace an obsolete Excel report currently used for sales forecasting and inventory projections by our supply chain team, and I am using R for this.
The desired output is a data frame in which one of the columns holds the projected closing inventory position for each week across a span of N weeks.
The part I am struggling with is the recursive calculation of the closing inventory positions. Below is a subset of the data frame with dummy data, where stock_projection is the desired result.
I've just started learning about recursion in R, so I am not really sure how to implement it here. Any help will be much appreciated!
 week forecast opening_stock stock_projection
    1       10           100              100
    2       11                             89
    3       12                             77
    4       10                             67
    5       11                             56
    6       10                             46
    7       12                             34
    8       11                             23
    9        9                             14
   10       12                              2
Update
I have managed to modify the solution explained here and have replicated the above outcome:
library(tidyverse)

inventory <- tibble(week = 1, opening_stock = 100)
forecast <- tibble(week = 2:10, forecast = c(11, 12, 10, 11, 10, 12, 11, 9, 12))

dat <- full_join(inventory, forecast)

dat2 <- dat %>%
  mutate(forecast = -forecast) %>%
  gather(transaction, value, -week) %>%
  arrange(week) %>%
  mutate(value = replace_na(value, 0))

dat2 %>%
  mutate(value = cumsum(value)) %>%
  ungroup() %>%
  group_by(week) %>%
  summarise(stock_projection = last(value))
Despite working like a charm, I am wondering whether there is another way to achieve this?
I think in the question above you don't have to worry about recursion, because the stock projection is simply the opening stock minus the cumulative sum of the forecast. You could do that with:
library(dplyr)

dat <- tibble(
  week = 1:10,
  forecast = c(10, 11, 12, 10, 11, 10, 12, 11, 9, 12),
  opening_stock = c(100, rep(NA, 9))
)

dat <- dat %>%
  mutate(fcst = case_when(week == 1 ~ 0,
                          TRUE ~ forecast),
         stock_projection = case_when(week == 1 ~ opening_stock,
                                      TRUE ~ opening_stock[1] - cumsum(fcst))) %>%
  dplyr::select(-fcst)

dat
# # A tibble: 10 × 4
# week forecast opening_stock stock_projection
# <int> <dbl> <dbl> <dbl>
# 1 1 10 100 100
# 2 2 11 NA 89
# 3 3 12 NA 77
# 4 4 10 NA 67
# 5 5 11 NA 56
# 6 6 10 NA 46
# 7 7 12 NA 34
# 8 8 11 NA 23
# 9 9 9 NA 14
# 10 10 12 NA 2
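For comparison, the same idea fits in one line of base R (a sketch, assuming as above that week 1 carries the opening stock and consumes no forecast):
# cumulative drawdown from the week-1 opening stock; week 1's forecast is treated as 0
dat$stock_projection <- dat$opening_stock[1] - cumsum(replace(dat$forecast, 1, 0))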
I'm trying to group a dataset and get the first and highest values based on two separate measures of time and speed. So I need the time and speed for the earliest record in each group and then the time and speed for the fastest record in each group. I've got this far but need some help...
library(tidyverse)
group <- c(1,1,1,1,1,2,2,3,3,4,4,4,4,4,4)
time <- c(1,6,4,5,7,12,10,2,3,8,9,11,13,14,15)
speed <- c(17,6, 99, 34, 12, 5, 67, 43, 23, 12, 15, 78, 61, 78, 20)
data = data.frame(group, time, speed)
summary <- data %>%
  group_by(group) %>%
  summarise(
    firstTime = ,        # lowest time
    HighestSpeedTime = , # time for highest speed
    firstSpeed = ,       # speed for lowest time
    highestSpeed = max(speed) # highest speed
  )
Update:
This should work. Note that in group 4 there is a tie (the highest speed occurs at two time points), so we get two rows:
library(dplyr)
data %>%
  group_by(group) %>%
  summarise(
    firstTime = min(time),                                # lowest time
    HighestSpeedTime = time[which(speed == max(speed))],  # time for highest speed
    firstSpeed = speed[which(time == min(time))],         # speed for lowest time
    highestSpeed = max(speed)                             # highest speed
  )
output:
group firstTime HighestSpeedTime firstSpeed highestSpeed
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 4 17 99
2 2 10 10 67 67
3 3 2 2 43 43
4 4 8 11 12 78
5 4 8 14 12 78
Does this work?
library(tidyverse)
group <- c(1,1,1,1,1,2,2,3,3,4,4,4,4,4,4)
time <- c(1,6,4,5,7,12,10,2,3,8,9,11,13,14,15)
speed <- c(17,6, 99, 34, 12, 5, 67, 43, 23, 12, 15, 78, 61, 78, 20)
data = data.frame(group, time, speed)
summary <- data |>
  arrange(group, time) |>
  group_by(group) |>
  summarise(
    firsttime = min(time),
    highest_speed = max(speed)
  ) |>
  left_join(data, by = c("group", "highest_speed" = "speed")) |>
  group_by(group) |>
  slice(1) |>
  rename(highest_speed_time = time) |>
  left_join(data, by = c("group", "firsttime" = "time")) |>
  rename(first_speed = speed)
summary
# group firsttime highest_speed highest_speed_time first_speed
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 99 4 17
# 2 10 67 10 67
# 3 2 43 2 43
# 4 8 78 11 12
Here is a data.table approach:
library(data.table)
setDT(data)
temp <- data[data[, .I[speed == max(speed)], by = .(group)]$V1]
setnames(temp, new = c("group", "maxSpeedTime", "maxSpeed"))
# join together
data[, .(firstTime = time[which.min(time)],    # earliest record per group
         firstSpeed = speed[which.min(time)]),
     by = .(group)][temp, on = .(group)]
# group firstTime firstSpeed maxSpeedTime maxSpeed
# 1: 1 1 17 4 99
# 2: 2 10 67 10 67
# 3: 3 2 43 2 43
# 4: 4 8 12 11 78
# 5: 4 8 12 14 78
Another solution, with a chained inner_join:
library(tidyverse)
data %>%
  group_by(group) %>%
  summarise(firstTime = min(time)) %>%
  inner_join(data, by = c("group", "firstTime" = "time")) %>%
  rename(firstSpeed = speed) %>%
  inner_join(
    data %>%
      group_by(group) %>%
      summarise(highestSpeed = max(speed)) %>%
      inner_join(data, by = c("group", "highestSpeed" = "speed"))
  ) %>%
  relocate(highestTime = time, .before = "highestSpeed")
#> Joining, by = "group"
#> # A tibble: 5 × 5
#> group firstTime firstSpeed highestTime highestSpeed
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 17 4 99
#> 2 2 10 67 10 67
#> 3 3 2 43 2 43
#> 4 4 8 12 11 78
#> 5 4 8 12 14 78
An alternative solution, based on purrr::map_dfr:
library(tidyverse)
data %>%
  group_split(group) %>%
  map_dfr(
    ~ data.frame(
      group = .x$group[1],
      firstTime = .x$time[min(.x$time) == .x$time],
      firstSpeed = .x$speed[min(.x$time) == .x$time],
      highestTime = .x$time[max(.x$speed) == .x$speed],
      highestSpeed = .x$speed[max(.x$speed) == .x$speed]))
#> group firstTime firstSpeed highestTime highestSpeed
#> 1 1 1 17 4 99
#> 2 2 10 67 10 67
#> 3 3 2 43 2 43
#> 4 4 8 12 11 78
#> 5 4 8 12 14 78
And more succinctly:
library(tidyverse)
data %>%
  group_split(group) %>%
  map_dfr(~ data.frame(
    group = integer(), firstTime = integer(), firstSpeed = integer(),
    highestTime = integer(), highestSpeed = integer()) %>%
    add_row(!!!setNames(c(.x$group[1], .x[min(.x$time) == .x$time, -1],
                          .x[max(.x$speed) == .x$speed, -1]), names(.))))
#> group firstTime firstSpeed highestTime highestSpeed
#> 1 1 1 17 4 99
#> 2 2 10 67 10 67
#> 3 3 2 43 2 43
#> 4 4 8 12 11 78
#> 5 4 8 12 14 78
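If dplyr >= 1.0 is available, slice_min()/slice_max() plus a join would be another way to sketch this (slice_max() keeps ties by default, so group 4 again yields two rows):
library(dplyr)

firsts <- data %>%
  group_by(group) %>%
  slice_min(time, n = 1) %>%                  # earliest record per group
  rename(firstTime = time, firstSpeed = speed)

fastest <- data %>%
  group_by(group) %>%
  slice_max(speed, n = 1) %>%                 # fastest record(s) per group, ties kept
  rename(highestTime = time, highestSpeed = speed)

left_join(firsts, fastest, by = "group")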
I have a data frame, a sample of which is given below (the full data has about 8K rows and 1.6K sellers).
# create sample data frame
df <- data.frame(
  name = c('Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'jack', 'jack', 'jack', 'jack', 'jack', 'Malik'),
  week = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1),
  sell = c(20, 21, 19, 18, 23, 24, 36, 35, 46, 50, 44),
  demand = c(28, 16, 43, NaN, NaN, 30, 35, 35, 72, NaN, 60)
)
df$`demand-sell` <- df$demand - df$sell
df
Expected output logic:
For week = 4: fill the NaN demand values for week 4 with the sum of the remaining demand (demand - sell) over weeks 1, 2 and 3 for the same seller (name).
Note: if the week 4 demand is not NaN, instead add the week 4 demand to that sum of (demand - sell) over weeks 1, 2 and 3 (e.g. in the case of name = jack).
For week = 5: fill the NaN demand values for week 5 with the sum of the remaining demand (demand - sell) over weeks 1 to 4 for the same seller.
Note: if the week 5 demand is not NaN, instead add the week 5 demand to that sum of (demand - sell) over weeks 1 to 4.
Expected Output (sample data)
Update with the correct answer:
The issue was the lack of an is.nan(demand) check. Here is the corrected version:
df %>%
  mutate(`demand-sell` = demand - sell) %>%
  group_by(name) %>%
  mutate(demand = case_when(week == 4 & is.nan(demand) ~ sum(`demand-sell`[1:3]),
                            week == 4 & !is.nan(demand) ~ demand + sum(`demand-sell`[1:3]),
                            TRUE ~ demand)) %>%
  mutate(`demand-sell` = case_when(week == 4 ~ demand - sell,
                                   TRUE ~ `demand-sell`)) %>%
  mutate(demand = case_when(week == 5 ~ `demand-sell`[4],
                            TRUE ~ demand)) %>%
  mutate(`demand-sell` = case_when(week == 5 ~ demand - sell,
                                   TRUE ~ `demand-sell`))
Correct output:
name week sell demand `demand-sell`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Tom 1 20 28 8
2 Tom 2 21 16 -5
3 Tom 3 19 43 24
4 Tom 4 18 27 9
5 Tom 5 23 9 -14
6 jack 1 24 30 6
7 jack 2 36 35 -1
8 jack 3 35 35 0
9 jack 4 46 77 31
10 jack 5 50 31 -19
11 Malik 1 44 60 16
First answer:
Here is a solution. At least for Tom it is correct; I don't know if your expected output for jack is right.
If the logic is the same for every name, it should look like this:
df %>%
  mutate(`demand-sell` = demand - sell) %>%
  group_by(name) %>%
  mutate(demand = case_when(week == 4 ~ sum(`demand-sell`[1:3]),
                            TRUE ~ demand)) %>%
  mutate(`demand-sell` = case_when(week == 4 ~ demand - sell,
                                   TRUE ~ `demand-sell`)) %>%
  mutate(demand = case_when(week == 5 ~ `demand-sell`[4],
                            TRUE ~ demand)) %>%
  mutate(`demand-sell` = case_when(week == 5 ~ demand - sell,
                                   TRUE ~ `demand-sell`))
Output:
name week sell demand `demand-sell`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Tom 1 20 28 8
2 Tom 2 21 16 -5
3 Tom 3 19 43 24
4 Tom 4 18 27 9
5 Tom 5 23 9 -14
6 jack 1 24 30 6
7 jack 2 36 35 -1
8 jack 3 35 35 0
9 jack 4 46 5 -41
10 jack 5 50 -41 -91
11 Malik 1 44 60 16
I have a data.frame (df), see the example below, that contains information about people. Based on a key column (sleutel), I know whether people live together (e.g. form a family) or not. Now I need to create new columns with information about the 'head' of the family.
name sex gzverh sleutel gzhfd lft
1 Loekens Man 6 1847LS 9 3 49
2 Kemel Vrouw 5 1847LK 10 2 18
3 Kemel Man 5 1847LK 10 2 22
4 Boersma Vrouw 4 1847LK 10 2 52
5 Kemel Man 2 1847LK 10 1 54
So, for example, row 5: Kemel, male, gzhfd 1 (= head of the family Kemel). He is married to Mrs. Boersma (same key). I want to mutate a new column (lfthb) with the age of the head of the family for all family members, so it should become something like:
name sex gzverh sleutel gzhfd lft lfthb
1 Loekens Man 6 1847LS 9 3 49 NA
2 Kemel Vrouw 5 1847LK 10 2 18 54
3 Kemel Man 5 1847LK 10 2 22 54
4 Boersma Vrouw 4 1847LK 10 2 52 54
5 Kemel Man 2 1847LK 10 1 54 54
I tried multiple ways with dplyr, using various combinations of group_by, case_when and if_else statements. I managed to mutate the column for the head of the family itself, but not for the other members.
For example, the following evidently only changes the value for the head itself:
df <- df %>% mutate(lfthb = case_when(sleutel == lag(sleutel) & gzhfd == 1 ~ lft))
But how to include the gzhfd == 1 after the ~?
dput of example data:
structure(list(naam = c("Loekens", "Kemel", "Kemel", "Boersma",
"Kemel"), gesl = c("Man", "Vrouw", "Man", "Vrouw", "Man"), gzverh = c(6L,
5L, 5L, 4L, 2L), sleutel = c("1847LS 9", "1847LK 10", "1847LK 10",
"1847LK 10", "1847LK 10"), gzhfd = c(3, 2, 2, 2, 1), lft = c(49,
18, 22, 52, 54)), row.names = c(NA, 5L), class = "data.frame")
A combination of replace and ifelse will do the job, i.e.,
library(tidyverse)
df %>%
  group_by(sleutel) %>%
  mutate(lfthb = ifelse(any(gzhfd == 1), replace(lft, gzhfd != 1, lft[gzhfd == 1]), NA))
which gives,
# A tibble: 5 x 7
# Groups: sleutel [2]
naam gesl gzverh sleutel gzhfd lft lfthb
<chr> <chr> <int> <chr> <dbl> <dbl> <dbl>
1 Loekens Man 6 1847LS 9 3 49 NA
2 Kemel Vrouw 5 1847LK 10 2 18 54
3 Kemel Man 5 1847LK 10 2 22 54
4 Boersma Vrouw 4 1847LK 10 2 52 54
5 Kemel Man 2 1847LK 10 1 54 54
As @Ronak mentions, we can omit the replace part:
df %>%
  group_by(sleutel) %>%
  mutate(lfthb = if (any(gzhfd == 1)) lft[gzhfd == 1] else NA)
A data.table approach (note: it returns -Inf instead of NA for the group that has no head):
library(data.table)

dt <- df %>%
  as.data.table() %>%
  .[gzhfd == 1, lfthb := lft, by = "sleutel"] %>%
  .[, lfthb := max(lfthb, na.rm = TRUE), by = "sleutel"]
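If NA is preferred over -Inf for groups without a head, one extra line could clean that up (a sketch against the dt object created above):
# replace the -Inf that max(..., na.rm = TRUE) yields for an all-NA group with NA
dt[is.infinite(lfthb), lfthb := NA_real_]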
I have a set of data like below:
BETA_LACT R I S
- 23 25 91
- 30 0 109
- 0 0 136
+ 73 0 0
+ 14 0 59
+ 0 0 49
I want to convert the data to the format below:
R_- I_- S_- R_+ I_+ S_+
23 25 91 73 0 0
30 0 109 14 0 59
0 0 136 0 0 49
I tried spread() but failed. Could anybody help me?
I suspect your problem using spread and gather is that there is nothing in your sample data to indicate which rows should be collapsed together. As a human, I can observe that you wish to combine rows 1 and 4, 2 and 5, and so on; however, there are no other columns or "keys" per se in your dataset to indicate this.
One solution is to add an index column, as I show in the second example below using group_by and mutate. The following reprex (reproducible example) shows both a non-working example analogous to your case and a working example.
library(tidyr)
library(dplyr)
example_data <- data.frame(
  categ = rep(1:3, 3),
  x = 1:9,
  y = 11:19,
  z = 21:29
)

# won't work
example_data %>%
  gather(var, value, -categ) %>%
  unite(new_col_name, var, categ) %>%
  spread(new_col_name, value)
#> Error: Duplicate identifiers for rows (1, 4, 7), (2, 5, 8), (3, 6, 9), (10, 13, 16), (11, 14, 17), (12, 15, 18), (19, 22, 25), (20, 23, 26), (21, 24, 27)
# will work
example_data %>%
  group_by(categ) %>%
  mutate(id = row_number()) %>%
  gather(var, value, -categ, -id) %>%
  unite(new_col_name, var, categ) %>%
  spread(new_col_name, value)
#> # A tibble: 3 x 10
#> id x_1 x_2 x_3 y_1 y_2 y_3 z_1 z_2 z_3
#> * <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 1 2 3 11 12 13 21 22 23
#> 2 2 4 5 6 14 15 16 24 25 26
#> 3 3 7 8 9 17 18 19 27 28 29
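As a further aside, spread() and gather() have since been superseded in tidyr. If a newer tidyr is available, the working example above could equally be sketched with pivot_longer()/pivot_wider():
# same reshape with the newer pivot verbs (tidyr >= 1.0);
# column order may differ from spread()'s alphabetical ordering
example_data %>%
  group_by(categ) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_longer(-c(categ, id), names_to = "var") %>%
  pivot_wider(names_from = c(var, categ), values_from = value)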
(As a sidenote, please check out the reprex package! This helps you make a "reproducible example" and ask better questions which will facilitate better community support. Notice how easy it is to copy the above code and run it locally.)