how to convert different levels in one row - r

I have a set of data like below:
BETA_LACT R I S
- 23 25 91
- 30 0 109
- 0 0 136
+ 73 0 0
+ 14 0 59
+ 0 0 49
I want to convert the data to the format below:
R_- I_- S_- R_+ I_+ S_+
23 25 91 73 0 0
30 0 109 14 0 59
0 0 136 0 0 49
I tried spread() but failed, could anybody help me?

I suspect your problem using spread and gather is that there is nothing your sample data to suggest which rows should be collapsed. As a human, I can observe that you wish to combine rows 1 and 4, 2 and 5, etc. However, there are no other columns or "keys" persay in your dataset to indicate this.
One solution would be to add an index column as I show in the second example below using group_by and mutate. The following reprex (reproducible example) shows both a non-working example analogous to your case and a working example.
library(tidyr)
library(dplyr)
example_data <- data.frame(
categ = rep(1:3, 3),
x = 1:9,
y = 11:19,
z = 21:29
)
# won't work
example_data %>%
gather(var, value, -categ) %>%
unite(new_col_name, var, categ) %>%
spread(new_col_name, value)
#> Error: Duplicate identifiers for rows (1, 4, 7), (2, 5, 8), (3, 6, 9), (10, 13, 16), (11, 14, 17), (12, 15, 18), (19, 22, 25), (20, 23, 26), (21, 24, 27)
# will work
example_data %>%
group_by(categ) %>%
mutate(id = row_number()) %>%
gather(var, value, -categ, -id) %>%
unite(new_col_name, var, categ) %>%
spread(new_col_name, value)
#> # A tibble: 3 x 10
#> id x_1 x_2 x_3 y_1 y_2 y_3 z_1 z_2 z_3
#> * <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 1 2 3 11 12 13 21 22 23
#> 2 2 4 5 6 14 15 16 24 25 26
#> 3 3 7 8 9 17 18 19 27 28 29
(As a sidenote, please check out the reprex package! This helps you make a "reproducible example" and ask better questions which will facilitate better community support. Notice how easy it is to copy the above code and run it locally.)

Related

R: creating a longitudinal dataset using tidyr

I am looking to generate a longitudinal dataset. I have generated my pat numbers and treatment groups:
library(dplyr)
set.seed(420)
Pat_TNO <- 1001:1618
data.frame(Pat_TNO = Pat_TNO) %>%
rowwise() %>%
mutate(
trt = rbinom(1, 1, 0.5)
)
My timepoints (in days) are:
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
How can I pivot this dataset using the vector timepoint_days, so I have 10 rows per participant and column names Pat_TNO, trt, timepoint_days.
You can use the unnest function from tidyr to achieve what you want.
Here is the code
library(dplyr)
library(tidyr)
set.seed(420)
Pat_TNO <- 1001:1618
x <- data.frame(Pat_TNO = Pat_TNO) %>%
rowwise() %>%
mutate(
trt = rbinom(1, 1, 0.5)
)
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
x %>%
mutate(timepoint_days = list(timepoint_days)) %>%
unnest()
Output
# A tibble: 6,180 × 3
Pat_TNO trt timepoint_days
<int> <int> <dbl>
1 1001 1 14
2 1001 1 28
3 1001 1 42
4 1001 1 56
5 1001 1 70
6 1001 1 84
7 1001 1 112
8 1001 1 140
9 1001 1 168
10 1001 1 364
# … with 6,170 more rows
Here I used the mutate function to add a column with a list containing timepoint_days in every row. And then unnest collapses each row to get 10 rows per participant.

Inventory Projection Calculation in R

I am trying to replace an obsolete Excel report currently used for sales forecasting and inventory projections by our supply chain team and I am using R for this.
The desired output is a data frame with one of the columns being the projected closing inventory positions for each week across a span of N weeks.
The part I am struggling with is the recursive calculation for the closing inventory positions. Below is a subset of the data frame with dummy data where "stock_projection" is the desire result.
I've just started learning about recursion in R so I am not really sure on how to implement this here. Any help will be much appreciated!
week
forecast
opening_stock
stock_projection
1
10
100
100
2
11
89
3
12
77
4
10
67
5
11
56
6
10
46
7
12
34
8
11
23
9
9
14
10
12
2
Update
I have managed to modify the solution explained here and have replicated the above outcome:
inventory<- tibble(week = 1, opening_stock = 100)
forecast<- tibble(week = 2:10, forecast = c(11, 12, 10, 11, 10, 12, 11, 9, 12) )
dat <- full_join(inventory, forecast)
dat2 <- dat %>%
mutate(forecast = -forecast) %>%
gather(transaction, value, -week) %>%
arrange(week) %>%
mutate(value = replace_na(value, 0))
dat2 %>%
mutate(value = cumsum(value)) %>%
ungroup() %>%
group_by(week) %>%
summarise(stock_projection = last(value))
Despite working like a charm, I am wondering whether there is another way to achieve this?
I think in the question above, you don't have to worry too much about recursion because the stock projection looks just like the opening stock minus the cumulative sum of the forecast. You could do that with:
library(dplyr)
dat <- tibble(
week = 1:10,
forecast = c(10,11,12,10,11,10,12,11,9,12),
opening_stock = c(100, rep(NA, 9))
)
dat <- dat %>%
mutate(fcst = case_when(week == 1 ~ 0,
TRUE ~ forecast),
stock_projection = case_when(
week == 1 ~ opening_stock,
TRUE ~ opening_stock[1] - cumsum(fcst))) %>%
dplyr::select(-fcst)
dat
# # A tibble: 10 × 4
# week forecast opening_stock stock_projection
# <int> <dbl> <dbl> <dbl>
# 1 1 10 100 100
# 2 2 11 NA 89
# 3 3 12 NA 77
# 4 4 10 NA 67
# 5 5 11 NA 56
# 6 6 10 NA 46
# 7 7 12 NA 34
# 8 8 11 NA 23
# 9 9 9 NA 14
# 10 10 12 NA 2

extrapolate position of vehicle by adding new rows to tibbles

I have time t, speed v and position x of a vehicle along with other informations: info1, info2, status
How can I extend the tibble by linearly extrapolating the position based on last speed (v = 14) for given timestamps. So the added rows are copy of the last row except with the predicted positions and status changed to "predicted"`.
Is it possible to do it without using loop.
tbl <- tibble(info1 = rep("a", 3),
info2 = rep("b", 3),
t = c(1, 2, 3),
v = c(12, 13, 14),
x = c(12, 24, 37),
status = rep("real", 3))
timestamps <- c(4, 5, 6, 8) # timestamps does not linearly increase
# desired output
tbl_desired <- tibble(info1 = rep("a", 7),
info2 = rep("b", 7),
t = c(1, 2, 3, 4, 5, 6, 8),
v = c(12, 13, 14, 14, 14, 14, 14),
x = c(12, 24, 37, 51, 65, 79, 107),
status = c(rep("real", 3), rep("predicted", 4)))
The conditions are not clear especially the predicted values in 'x'. Below code works in the following way
Extracts the last row (slice_tail, n = 1)
update the columns 't', 'v', 'x' (summarise)
Bind the rows from the original data (bind_rows)
library(dplyr)
tbl %>%
slice_tail(n = 1) %>%
summarise(info1, info2, t = timestamps, v = v,
x = (x + cumsum(c(1, diff(t)) *
rep(last(v), length(t)))), status = 'predicted') %>%
bind_rows(tbl, .)
-output
# A tibble: 7 × 6
info1 info2 t v x status
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 a b 1 12 12 real
2 a b 2 13 24 real
3 a b 3 14 37 real
4 a b 4 14 51 predicted
5 a b 5 14 65 predicted
6 a b 6 14 79 predicted
7 a b 8 14 107 predicted
If there are many columns, after sliceing the last row, use mutate to update only the columns that needs to be changed and wrap in a list whereever the length is greater than 1, then unnest the list column
library(tidyr)
tbl %>%
slice_tail(n = 1) %>%
mutate(t = list(timestamps), v = v,
x = list((x + cumsum(c(1, diff(timestamps)) *
rep(last(v), length(timestamps))))), status = 'predicted') %>%
unnest(where(is.list)) %>%
bind_rows(tbl, .)
-output
# A tibble: 7 × 6
info1 info2 t v x status
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 a b 1 12 12 real
2 a b 2 13 24 real
3 a b 3 14 37 real
4 a b 4 14 51 predicted
5 a b 5 14 65 predicted
6 a b 6 14 79 predicted
7 a b 8 14 107 predicted
Or use add_row and then fill the NA rows with previous non-NA for those columns not specified in the add_row
library(tibble)
tbl %>%
add_row(t = timestamps, v = last(.$v),
x = (last(.$x) + cumsum(c(1, diff(timestamps)) *
rep(last(.$v), length(timestamps)))), status = 'predicted') %>%
fill(everything())
-output
# A tibble: 7 × 6
info1 info2 t v x status
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 a b 1 12 12 real
2 a b 2 13 24 real
3 a b 3 14 37 real
4 a b 4 14 51 predicted
5 a b 5 14 65 predicted
6 a b 6 14 79 predicted
7 a b 8 14 107 predicted

Identify the highest number of consecutive numbers in a dataframe and add rows with NA

Here is a reproducible example of the situation I need help for. I have a database (db1) in which weekly ratings of behavioral outcomes are recorded. The variable "Week" corresponds to the number of the week from the beginning of the year (e.g., Week = 1 indicates the week between January 1st and 7th, and so on...) and the variable "Score" to the value obtained by the subject on the criterion measure. In the real data set, I have several participants and a different number of ratings for each subject; however, in this example there is only one subject to make things easier.
library(magrittr)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1
# Week Score
#1 14 34
#2 18 21
#3 19 45
#4 20 32
#5 21 56
#6 23 45
#7 24 23
#8 25 48
What I need to do is to identify the highest number of ratings occurred in consecutive weeks in the database. In the example, the highest number is 4 because the ratings were consecutive from week 18 to 21. Here I added a column for demonstration, but it might not be necessary for the solution.
x2 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y2 <- c(34, 21, 45, 32, 56, 45, 23, 48)
z2 <- c(1, 1, 2, 3, 4, 1, 2, 3)
db2 <- cbind(x2, y2, z2) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db2
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 23 45 1
#7 24 23 2
#8 25 48 3
Finally, because every subject has to have a total of five consecutive ratings, I need to add a row with a missing datum where the highest number of consecutive weeks is below five (so that I can impute the missing data later on). However, there might be ratings before and after the sequence. If that is the case, I want to add the row based on the minimal distance between the first or last week of the longest series of consecutive weeks from the other existing rating. In the example, that means that the row with missing datum will be added after 21 because there are 4 missing weeks between week 14 and 18 whereas only 1 between week 21 and 23.
x3 <- c(14, 18, 19, 20, 21, 22, 23, 24, 25)
y3 <- c(34, 21, 45, 32, 56, NA, 45, 23, 48)
z3 <- c(1, 1, 2, 3, 4, 5, 1, 2, 3)
db3 <- cbind(x3, y3, z3) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db3
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 22 NA 5
#7 23 45 1
#8 24 23 2
#9 25 48 3
For your information, this is not going to be part of the main statistical analyses but rather one of several ways I want to use to test the sensitivity of my model. So do not worry about whether it makes sense from a methodological point of view. In addition, if possible, a tidyverse solution would be greatly appreciated.
Thanks so much to anyone who will take the time.
The code is relatively easier, if you want to do it just for max group and if more than one, just for one.
db1 %>% mutate(consecutive = accumulate(diff(Week), .init = 1, ~if(.y == 1) { .x +1} else {1}),
dummy = max(consecutive) == consecutive & max(consecutive) < 5) %>%
group_by(grp = cumsum(consecutive == 1)) %>%
filter(sum(dummy) > 0) %>% #filter out group(s) with max consecutive
ungroup() %>% select(-dummy) %>%
filter(grp == min(grp)) %>% # filter out first such group, if there are more than 1
complete(consecutive = 1:5) %>%
select(-grp) %>%
mutate(Week = first(Week) + consecutive -1)
# A tibble: 5 x 3
consecutive Week Score
<dbl> <dbl> <dbl>
1 1 18 21
2 2 19 45
3 3 20 32
4 4 21 56
5 5 22 NA
OLD ANSWER Another tidyverse strategy (this can be modified to suit your additional column requirements which you have not given in sample)
library(tidyverse)
db1
#> Week Score
#> 1 14 34
#> 2 18 21
#> 3 19 45
#> 4 20 32
#> 5 21 56
#> 6 23 45
#> 7 24 23
#> 8 25 48
library(data.table)
db1 %>% mutate(consecutive = accumulate(diff(Week), .init = 1, ~if(.y == 1) { .x +1} else {1}),
dummy = max(consecutive) == consecutive & max(consecutive) < 5,
dummy2 = rleid(dummy)) %>%
group_split(dummy2, .keep = F) %>%
map_if( ~.x$dummy[[1]], ~.x %>% complete(consecutive = seq(max(consecutive), 5, 1), fill = list(Week = 1)) %>%
mutate(Week = cumsum(Week))) %>%
map_dfr(~.x %>% select(-dummy))
#> # A tibble: 9 x 3
#> Week Score consecutive
#> <dbl> <dbl> <dbl>
#> 1 14 34 1
#> 2 18 21 1
#> 3 19 45 2
#> 4 20 32 3
#> 5 21 56 4
#> 6 22 NA 5
#> 7 23 45 1
#> 8 24 23 2
#> 9 25 48 3
Created on 2021-06-10 by the reprex package (v2.0.0)
if I understand correctly
library(data.table)
library(tidyverse)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1 %>%
mutate(grp = cumsum(c(0, diff(Week)) > 1)) %>%
group_by(grp) %>%
mutate(n_grp = n()) %>%
ungroup() %>%
filter(n_grp == max(n_grp, na.rm = TRUE)) %>%
complete(grp,
n_grp,
nesting(Week = seq(from = first(Week), length = 5))) %>%
select(-c(grp, n_grp)) %>%
rows_upsert(db1, by = c("Week", "Score"))
#> # A tibble: 9 x 2
#> Week Score
#> <dbl> <dbl>
#> 1 18 21
#> 2 19 45
#> 3 20 32
#> 4 21 56
#> 5 22 NA
#> 6 14 34
#> 7 23 45
#> 8 24 23
#> 9 25 48
Created on 2021-06-10 by the reprex package (v2.0.0)
You can also use the following solution. Midway through this solution before we use add_row to add your additional rows, we can filter the whole data set for we use group_split I filtered the whole data set to keep only those groups with the maximum observations which means they have longer consecutive Weeks than others. So after we split by grouping variable we may end of with 2 or more groups of equal consecutive Weeks so then you can choose whichever your like based on your preference:
library(dplyr)
library(purrr)
library(tibble)
db1 %>%
mutate(Consecutive = +(Week - lag(Week, default = first(Week)) == 1),
grp = cumsum(Consecutive == 0)) %>%
group_by(grp) %>%
mutate(Consecutive = row_number()) %>%
group_by(grp, .drop = TRUE) %>%
add_count() %>%
ungroup() -> db2 # We create our grouping variable `grp` here
db2 %>%
filter(n == max(n)) %>%
group_split(grp) %>%
map_dfr(~ add_row(.x, Week = .x$Week[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
Consecutive = .x$Consecutive[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
grp = .x$grp[1])) %>%
bind_rows(db2 %>%
filter(n != max(n))) %>%
select(-c(grp, n)) %>%
arrange(Week)
# A tibble: 9 x 3
Week Score Consecutive
<dbl> <dbl> <dbl>
1 14 34 1
2 18 21 1
3 19 45 2
4 20 32 3
5 21 56 4
6 22 NA 5
7 23 45 1
8 24 23 2
9 25 48 3

How to use R to create baseball Splits from Game Logs

I'm trying to use R to recreate Baseball Splits as found on MLB.com. The splits are created from Game Logs and provide different cuts of the data. For example, home games vs. away games, day games vs. night games, August vs. September and many more all in one convenient table. I believe the ratios (AVG, OBP SLG) can all be added via mutate once the basic splits have been totaled.
My Question is, what's the best and most efficient way to create these splits and how should the data be shaped. The game log obviously has additional (hidden) column(s) that contain the Split topics. The nature of the problem leads me to believe purrr might be a tool to employ but I can't quite wrap my mind around how to approach this one.
Here is how I believe the data should be shaped and a link to a sample game log. I would appreciate any thoughts, ideas or solutions to this problem.
Links and images of Game Logs and Splits for National outfielder Juan Soto are set forth below.
Game Logs: Juan Soto Game Log
Splits: Juan Soto Game Splits
Splits
I've gone through the dataset, although I'm not sure if the sum values match, and neither the averages relative to the images above.
You're right about mutating for creating the values you suggest.
However, hopefully my approach can help you get what you're after.
library(tidyverse)
library(data.table)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- fread(game.splits, fill = TRUE)
game.splits.pivot <- game.splits
game.splits.pivot$Var1 <- ifelse(game.splits.pivot$Var1 %in% "HOME", 1, 0)
game.splits.pivot$Var2 <- ifelse(game.splits.pivot$Var2 %in% "NIGHT", 3, 2)
game.splits.pivot$Var3 <- ifelse(game.splits.pivot$Var3 %in% "SEPTEMBER", 5, 4)
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:16, 20))
colnames(game.splits.pivot)[19] <- "name_c"
game.splits.pivot <- game.splits.pivot[, -c(17, 18)]
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:3, 17))
#test
game.splits.pivot_test <- game.splits.pivot[, -c(1, 2, 3)]
game.splits.pivot_test <- aggregate(value ~ name_c + name, game.splits.pivot_test, sum)
game.splits.pivot_test <- game.splits.pivot_test %>% pivot_wider(names_from = name, values_from = value)
lc_name <- tibble(name_c = 0:5, split = c("HOME", "AWAY", "DAY", "NIGHT", "AUGUST", "SEPTEMBER"))
game.splits.pivot_test <- game.splits.pivot_test %>%
inner_join(lc_name, by = "name_c") %>%
arrange(name_c) %>%
select(-name_c)
game.splits.pivot_test <- game.splits.pivot_test[, c(14, 3, 9, 6, 1, 2, 7, 10, 4, 8, 12, 11, 5, 13)]
A look into the dataset:
# A tibble: 6 x 14
split AB R H `2B` `3B` HR RBI BB IBB SO SB CS TB
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 HOME 88 24 32 5 0 9 23 15 5 12 1 2 64
2 AWAY 66 15 22 9 0 4 14 26 7 16 5 0 43
3 DAY 29 21 18 4 0 5 17 12 4 3 4 0 37
4 NIGHT 125 18 36 10 0 8 20 29 8 25 2 2 70
5 AUGUST 90 21 33 6 0 11 25 13 1 13 1 1 72
6 SEPTEMBER 64 18 21 8 0 2 12 28 11 15 5 1 35
This turned out to be more straight-forward than I had thought. The following solution relies upon pivot_longer to shape the data and summarise_if to tally the splits - no rbinds or purrr needed.
library(tidyverse)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- read_csv(game.splits)
game.splits %>%
pivot_longer(Var1:Var3, names_to = "split") %>%
group_by(split) %>%
arrange(split) %>%
select(split, value, everything()) %>%
ungroup() %>%
select(split, value, everything()) %>%
select(-Date, -OPP) %>%
mutate(value = str_c(split, "_", value)) %>%
group_by(value) %>%
summarise_if(is.numeric, sum) %>%
mutate(value= str_replace(value, "(Var\\d_)",""))
#> # A tibble: 6 x 14
#> value AB R H TB `2B` `3B` HR RBI BB IBB SO SB
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AWAY 88 24 32 64 5 0 9 23 15 5 12 1
#> 2 HOME 66 15 22 43 9 0 4 14 26 7 16 5
#> 3 DAY 29 21 18 37 4 0 5 17 12 4 3 4
#> 4 NIGHT 125 18 36 70 10 0 8 20 29 8 25 2
#> 5 AUGUST 90 21 33 72 6 0 11 25 13 1 13 1
#> 6 SEPTE~ 64 18 21 35 8 0 2 12 28 11 15 5
Created on 2021-03-03 by the reprex package (v0.3.0)

Resources