I am working in R with a dataset that looks like this:
test=data.frame("1991" = c(1,5,3), "1992" = c(4,3,3), "1993" = c(10,5,3), "1994" = c(1,1,1), "1995" = c(2,2,6))
test=plyr::rename(test, c("X1991"="1991", "X1992"="1992", "X1993"="1993", "X1994"="1994", "X1995"="1995"))
What I want to do is create variables called Pre1991, Pre1992, Pre1993, ... that store the values cumulated up to that year, e.g.
Pre1991 = test$`1991`
Pre1992 = test$`1991` + test$`1992`
Pre1993 = test$`1991` + test$`1992` + test$`1993`
and so on.
My real dataset has variables for the years 1900-2017, so I can't do this manually. I tried to write a for loop, but it didn't work:
for (i in 1900:2017){
  x = paste0("Pre", i)
  df[[x]] = rowSums(df[, (colnames(df) <= i)])
}
Can someone please review my code or suggest other ways to do it? Thanks!
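For reference, a likely fix, sketched under the assumption that every column of df is a year column: the main pitfall in the loop above is that df[, cond] drops to a vector when only one column matches (i = 1900), which makes rowSums() fail; also, colnames(df) <= i silently coerces i to character, and the newly added Pre columns enter that comparison. Restricting the selection to a fixed set of year columns sidesteps both issues:
yrs <- as.character(1900:2017)
for (i in 1900:2017) {
  # compare years numerically, and only over the original year columns,
  # so the newly added Pre columns are never included
  cols <- yrs[as.numeric(yrs) <= i]
  # drop = FALSE keeps a one-column selection as a data frame for rowSums()
  df[[paste0("Pre", i)]] <- rowSums(df[, cols, drop = FALSE])
}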
Edit 1:
Thanks so much! And I'm wondering if there's a way to use the cumsum function in the reverse direction? For example, if I am interested in what happened after a particular year:
Post1991 = test$`1992` + test$`1993` + test$`1994` + test$`1995` + ...
Post1992 = test$`1993` + test$`1994` + test$`1995` + ...
Post1993 = test$`1994` + test$`1995` + ...
This is a little inefficient in that it is converting from a data.frame to a matrix and back, but ...
as.data.frame(t(apply(as.matrix(test), 1, cumsum)))
# 1991 1992 1993 1994 1995
# 1 1 5 15 16 18
# 2 5 8 13 14 16
# 3 3 6 9 10 16
If your data has other columns that are not year-based, such as
test$quux <- LETTERS[3:5]
test
# 1991 1992 1993 1994 1995 quux
# 1 1 4 10 1 2 C
# 2 5 3 5 1 2 D
# 3 3 3 3 1 6 E
then subset on both sides:
test[1:5] <- as.data.frame(t(apply(as.matrix(test[1:5]), 1, cumsum)))
test
# 1991 1992 1993 1994 1995 quux
# 1 1 5 15 16 18 C
# 2 5 8 13 14 16 D
# 3 3 6 9 10 16 E
EDIT
In reverse, apply rev twice (starting again from the original all-numeric test, before the cumulative sums above):
as.data.frame(t(apply(as.matrix(test), 1, function(a) rev(cumsum(rev(a))) - a)))
# 1991 1992 1993 1994 1995
# 1 17 13 3 2 0
# 2 11 8 3 2 0
# 3 13 10 7 6 0
Using the tidyverse we can gather, calculate, and then spread again. For this to work, the data will need to be arranged.
library(tidyverse)
test <- data.frame("1991" = c(1, 5, 3),
"1992" = c(4, 3, 3),
"1993" = c(10, 5, 3),
"1994" = c(1, 1, 1),
"1995" = c(2, 2, 6))
test <- plyr::rename(test, c("X1991" = "1991",
"X1992" = "1992",
"X1993" = "1993",
"X1994" = "1994",
"X1995" = "1995"))
Forwards
test %>%
mutate(id = 1:nrow(.)) %>% # adding an ID to identify groups
gather(year, value, -id) %>% # wide to long format
arrange(id, year) %>%
group_by(id) %>%
mutate(value = cumsum(value)) %>%
ungroup() %>%
spread(year, value) %>% # long to wide format
select(-id) %>%
setNames(paste0("pre", names(.))) # add prefix to columns
## A tibble: 3 x 5
# pre1991 pre1992 pre1993 pre1994 pre1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1. 5. 15. 16. 18.
# 2 5. 8. 13. 14. 16.
# 3 3. 6. 9. 10. 16.
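As an aside, gather() and spread() are superseded in current tidyr; the same forward calculation could be sketched with their replacements, pivot_longer() and pivot_wider():
test %>%
  mutate(id = row_number()) %>%            # ID to identify rows
  pivot_longer(-id, names_to = "year") %>% # wide to long format
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(value = cumsum(value)) %>%        # running total per row ID
  ungroup() %>%
  pivot_wider(names_from = year) %>%       # long to wide format
  select(-id) %>%
  rename_with(~ paste0("pre", .x))         # add prefix to columns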
Reverse direction
As your definition specifies, it's not strictly the reverse cumulative sum: it's the reverse cumulative sum excluding the current year, i.e. the lagged cumulative sum in reverse order.
test %>%
mutate(id = 1:nrow(.)) %>%
gather(year, value, -id) %>%
arrange(id, desc(year)) %>% # using desc() to reverse sorting
group_by(id) %>%
mutate(value = cumsum(lag(value, default = 0))) %>% # lag cumsum
ungroup() %>%
spread(year, value) %>%
select(-id) %>%
setNames(paste0("post", names(.)))
## A tibble: 3 x 5
# post1991 post1992 post1993 post1994 post1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 17. 13. 3. 2. 0.
# 2 11. 8. 3. 2. 0.
# 3 13. 10. 7. 6. 0.
We can use rowCumsums from matrixStats
library(matrixStats)
test[] <- rowCumsums(as.matrix(test))
test
# 1991 1992 1993 1994 1995
#1 1 5 15 16 18
#2 5 8 13 14 16
#3 3 6 9 10 16
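To tie this back to the Pre/Post naming in the question, the cumulated values can also be kept as separate, renamed columns instead of overwriting test. A sketch, starting again from the original data and assuming every column of test is a year column:
library(matrixStats)
m <- as.matrix(test)
pre <- as.data.frame(rowCumsums(m))               # left-to-right totals
names(pre) <- paste0("Pre", colnames(m))
# reverse direction: cumulate right to left, then exclude the year itself
post <- as.data.frame(rowCumsums(m[, ncol(m):1])[, ncol(m):1] - m)
names(post) <- paste0("Post", colnames(m))
cbind(test, pre, post)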
Related
I am trying to replace an obsolete Excel report currently used for sales forecasting and inventory projections by our supply chain team and I am using R for this.
The desired output is a data frame with one of the columns being the projected closing inventory positions for each week across a span of N weeks.
The part I am struggling with is the recursive calculation for the closing inventory positions. Below is a subset of the data frame with dummy data, where "stock_projection" is the desired result.
I've just started learning about recursion in R so I am not really sure on how to implement this here. Any help will be much appreciated!
week  forecast  opening_stock  stock_projection
   1        10            100               100
   2        11                               89
   3        12                               77
   4        10                               67
   5        11                               56
   6        10                               46
   7        12                               34
   8        11                               23
   9         9                               14
  10        12                                2
Update
I have managed to modify the solution explained here and have replicated the above outcome:
library(tidyverse)
inventory <- tibble(week = 1, opening_stock = 100)
forecast<- tibble(week = 2:10, forecast = c(11, 12, 10, 11, 10, 12, 11, 9, 12) )
dat <- full_join(inventory, forecast)
dat2 <- dat %>%
mutate(forecast = -forecast) %>%
gather(transaction, value, -week) %>%
arrange(week) %>%
mutate(value = replace_na(value, 0))
dat2 %>%
mutate(value = cumsum(value)) %>%
ungroup() %>%
group_by(week) %>%
summarise(stock_projection = last(value))
This works like a charm, but I am wondering whether there is another way to achieve it?
I think that in the question above you don't have to worry about recursion, because the stock projection is just the opening stock minus the cumulative sum of the forecast. You could do that with:
library(dplyr)
dat <- tibble(
week = 1:10,
forecast = c(10,11,12,10,11,10,12,11,9,12),
opening_stock = c(100, rep(NA, 9))
)
dat <- dat %>%
mutate(fcst = case_when(week == 1 ~ 0,
TRUE ~ forecast),
stock_projection = case_when(
week == 1 ~ opening_stock,
TRUE ~ opening_stock[1] - cumsum(fcst))) %>%
dplyr::select(-fcst)
dat
# # A tibble: 10 × 4
# week forecast opening_stock stock_projection
# <int> <dbl> <dbl> <dbl>
# 1 1 10 100 100
# 2 2 11 NA 89
# 3 3 12 NA 77
# 4 4 10 NA 67
# 5 5 11 NA 56
# 6 6 10 NA 46
# 7 7 12 NA 34
# 8 8 11 NA 23
# 9 9 9 NA 14
# 10 10 12 NA 2
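If the relationship were genuinely recursive, i.e. each week's closing stock depended on the previous week's value in a way a plain cumsum cannot express, purrr::accumulate() is the usual tidyverse tool. A minimal sketch reproducing the same projection with dat from above:
library(dplyr)
library(purrr)
dat %>%
  mutate(stock_projection = accumulate(forecast[-1], `-`,
                                       .init = opening_stock[1]))
# accumulate() carries the running value forward: it starts at the
# opening stock (100) and subtracts each forecast from week 2 onwards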
I'm looking for a tidy solution, preferably using the tidyverse.
This question is in line with this answer, it does however have an added twist. My data has an overall grouping variable 'grp'. Within each such group, I want to perform calculations based on cumulative sum (cumsum) within sub-groups defined by 'trial', here X and Y.
However, for the calculations within both sub-groups, trial "X" and trial "Y", I need to use a single, common group-specific baseline, i.e. where trial is B.
My desired outcome is Value3 in the data set desired_outcome below:
# library(tidyverse)
# library(dplyr)
desired_outcome # see below for how I got this `desired_outcome`
# A tibble: 10 x 6
# Groups: grp [2]
grp trial yr value1 value2 Value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 7
5 A Y 2023 6 4 16
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 5
10 B Y 2023 4 6 14
My minimal working example. Data first,
tabl <- tribble(~grp, ~trial, ~yr, ~value1, ~value2,
'A', "B", 2021, 2, 0,
'A', "X", 2022, 3, 1,
'A', "X", 2023, 4, 2,
'A', "Y", 2022, 5, 3,
'A', "Y", 2023, 6, 4,
'B', "B", 2021, 0, 2,
'B', "X", 2022, 1, 3,
'B', "X", 2023, 2, 4,
'B', "Y", 2022, 3, 5,
'B', "Y", 2023, 4, 6) %>%
mutate(trial = factor(trial, levels = c("B", "X", "Y"))) %>%
arrange(grp, trial, yr)
Now, I need to use group_by(), but I can't group on trial as I need to use the baseline, B in calculations for both "X" and "Y".
undesired_outcome_tidier_code <- tabl %>%
group_by(grp) %>% # this do not work!
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1) %>%
select(-Value1.1, -Value2.1)
In undesired_outcome_tidier_code, rows 4-5 and 9-10 are, for obvious reasons, not using rows 1 and 6, respectively, as the baseline. As shown here,
undesired_outcome_tidier_code
# A tibble: 10 x 6
# Groups: grp [2]
grp trial yr value1 value2 Value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 17
5 A Y 2023 6 4 26
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 15
10 B Y 2023 4 6 24
I am looking for a solution that gets me desired_outcome (see below) in a tidy way.
I can, in this smaller example, work my way around it to get to my desired_outcome, but it's a cumbersome two-step solution. There must be a better/tidier way.
step1 <- tabl %>% arrange(grp, trial, yr) %>% filter(trial != 'Y') %>%
group_by(grp) %>%
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1)
step2 <- tabl %>% arrange(grp, trial, yr) %>% filter(trial != 'X') %>%
group_by(grp) %>%
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1)
desired_outcome <- rbind(step1,
step2 %>% filter(trial != 'B')
) %>% select(-Value1.1, -Value2.1) %>% arrange(grp, trial, yr)
With the addition of purrr, you could do:
map(.x = c("X", "Y"),
~ tabl %>%
arrange(grp, trial, yr) %>%
filter(trial != .x) %>%
group_by(grp) %>%
mutate(value3 = cumsum(value1) + lag(cumsum(value2), default = 0))) %>%
reduce(full_join) %>%
arrange(grp, trial, yr)
grp trial yr value1 value2 value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 7
5 A Y 2023 6 4 16
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 5
10 B Y 2023 4 6 14
You can try this.
calculate_value3 is a function that calculates value3 as you described, doing so for every level of trial while always including the baseline observation. It doesn't matter if the trial labels differ from X and Y, and the baseline can be any label you want; I set it to "B" for now.
Inside the pipes you go for a map-reduce solution: map runs calculate_value3 for each unique trial, and reduce combines the results with coalesce, which fills in the NAs (this is why v3 is initialized as a vector of all NAs in calculate_value3).
calculate_value3 <- function(ut, # trial under examination
tr, # trial vector
v1, # value1 vector
v2, # value2 vector
baseline = "B"){ # baseline id
v3 <- rep_len(NA, length(tr))
ind <- ut == tr | baseline == tr
cumv1 <- cumsum(v1[ind])
cumlv2 <- cumsum(lag(v2[ind], default = 0))
v3[ind] <- cumv1 + cumlv2
v3
}
library(purrr)
tabl %>%
group_by(grp) %>%
mutate(value3 = reduce(
map(unique(trial), calculate_value3,
tr = trial, v1 = value1, v2 = value2),
coalesce)) %>%
ungroup()
#> # A tibble: 10 x 6
#> grp trial yr value1 value2 value3
#> <chr> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A B 2021 2 0 2
#> 2 A X 2022 3 1 5
#> 3 A X 2023 4 2 10
#> 4 A Y 2022 5 3 7
#> 5 A Y 2023 6 4 16
#> 6 B B 2021 0 2 0
#> 7 B X 2022 1 3 3
#> 8 B X 2023 2 4 8
#> 9 B Y 2022 3 5 5
#> 10 B Y 2023 4 6 14
The solution is flexible with respect to the trial identifiers and seems reasonably easy to debug and edit if need be [at least to me].
Because tidyverse didn't seem like a strict requirement, I take the opportunity to suggest a data.table alternative:
Starting with the 'desired_outcome' data, just to make it easier to compare results:
library(data.table)
setDT(desired_outcome)
desired_outcome[ , v3 := {
c(value1[1], sapply(c("X", "Y"), function(g){
.SD[trial %in% c("B", g), (cumsum(value1) + cumsum(shift(value2, fill = 0)))[-1]]
}))}, by = grp]
# grp trial yr value1 value2 Value3 v3
# 1: A B 2021 2 0 2 2
# 2: A X 2022 3 1 5 5
# 3: A X 2023 4 2 10 10
# 4: A Y 2022 5 3 7 7
# 5: A Y 2023 6 4 16 16
# 6: B B 2021 0 2 0 0
# 7: B X 2022 1 3 3 3
# 8: B X 2023 2 4 8 8
# 9: B Y 2022 3 5 5 5
# 10: B Y 2023 4 6 14 14
For each 'grp' (by = grp), loop over trials "X" and "Y" (sapply(c("X", "Y"), ...)). Within each sub-dataset defined by by (.SD), select rows where 'trial' equals "B" or the current value of the loop (trial %in% c("B", g)).
Do the calculation (cumsum(value1) + cumsum(shift(value2, fill = 0))) and remove the first value ([-1]). Prepend the first value within each 'grp', i.e. the one corresponding to trial "B" (c(value1[1], ...)). Assign the result to a new variable by reference (v3 :=).
I have a data frame as below:
dt <- data.frame(year = sample(c(2000:2019),100,replace = T ),
month = sample(c(1:12),100,replace = T ),
paitent_ID = sample(c(1:50),100,replace = T ),
state = sample(c(1:10),100,replace = T ) )
and I need to apply the function below to this dataset after grouping and sorting:
newState <- function(dt){
dt["new"]= dt[0,"state"]*3
dt
}
So, this function is supposed to add a new column called new to each group.
Here is the group_by:
library(dplyr)
dt %>%
group_by(paitent_ID) %>%
group_map( ~ .x %>%
arrange( year,month)) %>%
group_map( ~ .x %>%
newState())
when i run the code, it complains with:
Error in UseMethod("group_split") :
no applicable method for 'group_split' applied to an object of class "list"
As #André Oliveira mentions in the comments, it is recommended to use mutate for adding a column. (The error arises because group_map() returns a plain list of data frames, so the second group_map() call has nothing grouped to split.) However, it is possible to use group_modify after making some small changes to your function: indexing with 1 instead of 0, and accepting the group keys as a second argument.
newState <- function(dt, groupvars){ # group_modify() supplies the group keys as a second argument
  dt["new"] = dt[1, "state"]*3       # row 1, not 0: R indexing starts at 1
  dt
}
dt %>%
group_by(paitent_ID) %>%
arrange(year, month) %>%
group_modify(newState) %>%
ungroup
# # A tibble: 100 x 5
# paitent_ID year month state new
# <int> <int> <int> <int> <dbl>
# 1 1 2006 5 3 9
# 2 2 2012 12 3 9
# 3 3 2013 11 8 24
# 4 3 2014 10 1 24
# 5 3 2019 5 6 24
# 6 4 2006 7 5 15
# 7 4 2006 7 2 15
# 8 5 2003 8 8 24
# 9 7 2015 12 2 6
# 10 7 2017 8 10 6
And a more conventional approach:
dt %>%
group_by(paitent_ID) %>%
arrange(year, month) %>%
mutate(new = state[1]*3)
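One subtlety: arrange() ignores groups by default. That happens to be harmless here, since a global sort by year and month also orders the rows within each patient, but .by_group = TRUE makes the intent explicit, and dplyr's first() reads more clearly than [1]. A sketch:
dt %>%
  group_by(paitent_ID) %>%
  arrange(year, month, .by_group = TRUE) %>% # sort within each patient
  mutate(new = first(state) * 3) %>%         # earliest state per patient, tripled
  ungroup()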
df = data.frame(group=c(1,1,1,2,2,2,3,3,3),
score=c(11,NA,7,NA,NA,4,6,9,15),
MAKE=c(11,11,11,4,4,4,15,15,15))
Say you have data as above with group and score, and the objective is to make a new variable MAKE which is the maximum value of score within each group, repeated for every row of that group.
This is my attempt, but it does not work:
df %>%
group_by(group) %>%
summarise(Value = max(is.na(score)))
For that you need
df %>% group_by(group) %>% mutate(MAKE = max(score, na.rm = TRUE))
# A tibble: 9 x 3
# Groups: group [3]
# group score MAKE
# <dbl> <dbl> <dbl>
# 1 1 11 11
# 2 1 NA 11
# 3 1 7 11
# 4 2 NA 4
# 5 2 NA 4
# 6 2 4 4
# 7 3 6 15
# 8 3 9 15
# 9 3 15 15
The issue with max(is.na(score)) is that is.na(score) is a logical vector; when max is applied, it is coerced to integer, with 1 for TRUE and 0 for FALSE, so the result says whether the group contains any NA rather than giving the maximum score. A somewhat less natural solution, but closer to what you tried, would be
df %>% group_by(group) %>% mutate(MAKE = max(score[!is.na(score)]))
which finds the maximal value among all those values of score that are not NA.
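To see the coercion concretely, a quick illustration:
x <- c(11, NA, 7)
is.na(x)              # FALSE  TRUE FALSE
max(is.na(x))         # 1 -- "does the group contain any NA?", not a score
max(x, na.rm = TRUE)  # 11
max(x[!is.na(x)])     # 11, equivalent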
library(tidyverse)
df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
How do I make dplyr look at this data frame df and collapse all these occurrences of 2 into a single summed group, and all the occurrences of 1 into a single summed group? And also keep the rest of the data frame.
Turn this:
# A tibble: 20 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
11 11 2
12 12 2
13 13 2
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
into this:
# A tibble: 5 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
[Edit] - I fixed the example data. Sorry about that.
We group by a manufactured sortkey to maintain the sort order. We use the fact that b is in descending order in the input; if that is not the case in your actual data, replace sortkey = -b with the more general sortkey = data.table::rleid(b), or the longer sortkey = cumsum(coalesce(b != lag(b), FALSE)).
We also convert b to the group names, giving a new a. It wasn't clear which groups are to be converted to grp... form: hard-coded 1 and 2? Any group with more than one row? Groups at the end with more than one row? At any rate, it would be easy enough to change the condition in the if_else once that were clarified.
Finally perform the summation and then remove the sortkey.
df %>%
group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
summarize(b = sum(b)) %>%
ungroup %>%
select(-sortkey)
giving:
# A tibble: 5 x 2
a b
<chr> <int>
1 50 50
2 20 20
3 13 13
4 grp2 20
5 grp1 7
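For data where b is not already in descending order, the rleid() variant mentioned above would look like this (a sketch):
df %>%
  group_by(sortkey = data.table::rleid(b),  # run-length id keeps input order
           a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
  summarize(b = sum(b)) %>%
  ungroup %>%
  select(-sortkey)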
Here's a way. I have converted a from factor to character to make things easier. You can convert it back to factor if you want. Also your test data was a bit wrong.
df <- tibble(a = as.character(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>%
mutate(
a = case_when(
b == 1 ~ "grp1",
b == 2 ~ "grp2",
TRUE ~ a
)
) %>%
group_by(a) %>%
summarise(b = sum(b))
# A tibble: 5 x 2
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp1 7
5 grp2 20
This is an approach which gives you the desired group names and where you don't need to know in advance how many such groups there will be (e.g. it would create grp3, grp4, ... depending on the values in b).
library(dplyr)
df %>%
  mutate(
    grp = as.numeric(lag(b) != b),
    grp = cumsum(ifelse(is.na(grp), 0, grp))
  ) %>%
  group_by(grp) %>%
  mutate(
    a = ifelse(n() > 1, paste0("grp", b), a),
    b = sum(b)
  ) %>%
  ungroup() %>%
  distinct(a, b)
Output:
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
Note that the code could also be condensed, but that leads to a certain lack of readability in my opinion:
df %>%
  group_by(grp = cumsum(ifelse(is.na(as.numeric(lag(b) != b)), 0, as.numeric(lag(b) != b)))) %>%
  mutate(
    a = ifelse(n() > 1, paste0("grp", b), a),
    b = sum(b)
  ) %>%
  ungroup() %>%
  distinct(a, b)