adding rows by group to get same number of observations by group - r

I have what seems like a pretty simple question, but I haven't been able to successfully adapt solutions to similar ones to my situation, including this one: Add row for each group with missing value
I have some data that looks like this:
# A tibble: 265 x 4
anon_ID assistance_date Benefit_1 nth_assistance_interaction
<int> <chr> <chr> <dbl>
1 8 2020-04-10 Medical 5
2 8 2020-04-13 Medical 10
3 8 2020-04-15 Medical 15
4 8 2020-04-21 Medical 20
5 11 2020-06-17 Housing 5
6 11 2020-06-25 Financial 10
7 11 2021-01-27 Financial 15
8 26 2020-05-18 Legal 5
9 26 2021-06-01 Food 10
10 26 2021-08-02 Utilities 15
# ... with 255 more rows
I want to modify it so that each anon_ID has four observations, one for each unique value of nth_assistance_interaction. The values of assistance_date and Benefit_1 should be NA when real values for these variables don't exist.
e.g., for anon_ID = 11, these two variables would have NA values when nth_assistance_interaction = 20.
# A tibble: 265 x 4
anon_ID assistance_date Benefit_1 nth_assistance_interaction
<int> <chr> <chr> <dbl>
1 8 2020-04-10 Medical 5
2 8 2020-04-13 Medical 10
3 8 2020-04-15 Medical 15
4 8 2020-04-21 Medical 20
5 11 2020-06-17 Housing 5
6 11 2020-06-25 Financial 10
7 11 2021-01-27 Financial 15
8 11 NA NA 20
9 26 2020-05-18 Legal 5
10 26 2021-06-01 Food 10
11 26 2021-08-02 Utilities 15
# ... with 255 more rows
This is just one example of what I'm trying to accomplish. It could also be the case that anon_ID = 27 only has one observation for nth_assistance_interaction, and so I would need to add three rows for them.
How can I go about making this happen? Thanks in advance.

We may group by 'anon_ID' and use complete to expand the data
library(dplyr)
library(tidyr)
df1 %>
group_by(anon_ID) %>%
complete(nth_assistance_interaction = c(5, 10, 15, 20)) %>%
ungroup

Related

How to add a new column with values specific to grouped variables

I'm new to R and have found similar solutions to my problem, but I'm struggling to apply these to my code. Please help...
These data are simplified, as the id variables are many:
df = data.frame(id = rep(c("a_10", "a_11", "b_10", "b_11"), each = 5),
site = rep(1:5, 4),
value = sample(1:20))
The aim is to add another column labelled "year" with values that are grouped by "id" but the true names are many - so I'm trying to simplify the code by using the ending digits.
I can use dplyr to split the dataframe into each id variable using this code (repeated for each id variable):
df %>%
select(site, id, value) %>%
filter(grepl("10$", id)) %>%
mutate(Year = "2010")`
Rather than using merge to re-combine the dataframes back into one, is there not a more simple method?
I tried modifying case_when with mutate as described in a previous answer:
[https://stackoverflow.com/a/63043920/12313457][1]
mutate(year = case_when(grepl(c("10$", "11$", id) == c("2010", "2011"))))
is something like this possible??
Thanks in advance
In case your id column has different string lengths you can use sub:
df %>%
mutate(Year = paste0("20", sub('^.*_(\\d+)$', '\\1', id)))
#> id site value Year
#> 1 a_10 1 2 2010
#> 2 a_10 2 7 2010
#> 3 a_10 3 16 2010
#> 4 a_10 4 10 2010
#> 5 a_10 5 11 2010
#> 6 a_11 1 5 2011
#> 7 a_11 2 13 2011
#> 8 a_11 3 14 2011
#> 9 a_11 4 6 2011
#> 10 a_11 5 12 2011
#> 11 b_10 1 17 2010
#> 12 b_10 2 1 2010
#> 13 b_10 3 4 2010
#> 14 b_10 4 15 2010
#> 15 b_10 5 9 2010
#> 16 b_11 1 8 2011
#> 17 b_11 2 20 2011
#> 18 b_11 3 19 2011
#> 19 b_11 4 18 2011
#> 20 b_11 5 3 2011
Created on 2022-04-21 by the reprex package (v2.0.1)
You can use substr to get the final two digits of id and then paste0 this to "20" to recreate the year.
df |> dplyr::mutate(Year = paste0("20", substr(id, 3, 4)))
#> id site value Year
#> 1 a_10 1 5 2010
#> 2 a_10 2 12 2010
#> 3 a_10 3 9 2010
#> 4 a_10 4 7 2010
#> 5 a_10 5 13 2010
#> 6 a_11 1 3 2011
#> 7 a_11 2 4 2011
#> 8 a_11 3 16 2011
#> 9 a_11 4 2 2011
#> 10 a_11 5 6 2011
#> 11 b_10 1 19 2010
#> 12 b_10 2 14 2010
#> 13 b_10 3 15 2010
#> 14 b_10 4 10 2010
#> 15 b_10 5 11 2010
#> 16 b_11 1 18 2011
#> 17 b_11 2 1 2011
#> 18 b_11 3 20 2011
#> 19 b_11 4 17 2011
#> 20 b_11 5 8 2011
Created on 2022-04-21 by the reprex package (v2.0.1)

Sum up with the next line into a new colum

I'm having some trouble on figuring out how to create a new column with the sum of 2 subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now, I want a new column where the first line is the sum of 1+2, the second line is the sum of 1+2+3 , the third line is the sum 1+2+3+4 and so on.
As 1, 2, 3, 4... are hipoteticall values, I need to measure the absolute growth from a decade to another in order to create later on a new variable to measure the percentage change from a decade to another.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> setDT(df)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df,
newCol = c(cumsum(Values)[-1],sum(Values))
)

Aggregate columns into new ones depending on row value in r [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I'm searching for a way to aggregate rows, based on date and ID, but arranging the values depending on one of te colums' values.
Basically I want to convert this:
DATE ID V1 V2 SHIFT
1 2020-05-21 A 21 34 morning
2 2020-05-21 A 6 12 afternoon
3 2020-05-20 A 19 15 morning
4 2020-05-20 A 7 16 afternoon
5 2020-05-20 B 17 20 morning
6 2020-05-20 B 5 11 afternoon
Into this:
DATE ID V1_mor V2_mor V1_aft V2_aft
1 2020-05-21 A 21 34 6 12
2 2020-05-20 A 19 15 7 16
3 2020-05-20 B 17 20 5 11
I've tried the route of making a loop and adding a bunch of "if" statements, but I don't want to get caught in bad practice.
The goal is so I can later add another column with a calculation of the style of (v1_mor * v2_aft) / v2_mor
Your goal can be achieved by using pivot_wider() from library tidyr:
library(tidyverse)
df %>%
pivot_wider(names_from = SHIFT, values_from = c(V1, V2))
This gives
# A tibble: 3 x 6
# Groups: DATE [2]
DATE ID V1_morning V1_afternoon V2_morning V2_afternoon
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2020-05-21 A 21 6 34 12
2 2020-05-20 A 19 7 15 16
3 2020-05-20 B 17 5 20 11
Data
df <- read_table2("DATE ID V1 V2 SHIFT
2020-05-21 A 21 34 morning
2020-05-21 A 6 12 afternoon
2020-05-20 A 19 15 morning
2020-05-20 A 7 16 afternoon
2020-05-20 B 17 20 morning
2020-05-20 B 5 11 afternoon")

A running sum for daily data that resets when month turns

I have a 2 column table (tibble), made up of a date object and a numeric variable. There is maximum one entry per day but not every day has an entry (ie date is a natural primary key). I am attempting to do a running sum of the numeric column along with dates but with the running sum resetting when the month turns (the data is sorted by ascending date). I have replicated what I want to get as a result below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package "runner" is possibly suited to this but I don't really understand how to instruct it. I know I could use a join operation plus a group_by using dplyr to do this, but the data set is very very large and doing so would be wildly inefficient. i could also manually iterate through the list with a loop, but that also seems inelegant. last option i can think of is selecting out a unique vector of yearmon objects and then cutting the original list into many shorter lists and running a plain cumsum on it, but that also feels unoptimal. I am sure this is not the first time someone has to do this, and given how many tools there is in the tidyverse to do things, I think I just need help finding the right one. The reason I am looking for a tool instead of using one of the methods I described above (which would take less time than writing this post) is because this code needs to be very very readable by an audience that is less comfortable with code.
We can also use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
][, monthly.running.sum := cumsum(score),format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract month and year values from the date, group_by those values and them perform the cumulative sum as follow:
library(lubridate)
library(dplyr)
df %>% mutate(Month = month(mdy(Date)),
Year = year(mdy(Date))) %>%
group_by(Month, Year) %>%
mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
An alternative will be to use floor_date function in order ot convert each date as the first day of each month and the calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>% mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
group_by(Floor) %>%
mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative :
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"),FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(DF, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33

Cumulative sums in R with multiple conditions?

I am trying to figure out how to create a cumulative or rolling sum in R based on a few conditions.
The data set in question is a few million observations of library loans, and the question is to determine how many copies of a given book/title would be necessary to meet demand.
So for each Title.ID, begin with 1 copy for the first instance (ID.Index). Then for each instance after, determine whether another copy is needed based on whether the REQUEST.DATE is within 16 weeks (112 days) of the previous request.
# A tibble: 15 x 3
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index
<date> <int> <int>
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5
The tricky part is that determining whether a new copy is needed is based not only on the number of request (ID.Index) and the REQUEST.DATE of some previous loan, but also on the preceding accumulating sum.
For instance, for the third request for title 2 (Title.ID 2, ID.Index 3), there are now two copies, so to determine whether a new copy is needed, you have to see whether the REQUEST.DATE is within 112 days of the first (not second) request (ID.Index 1). By contrast, for the third request for title 6 (Title.ID 6, ID.Index 3), there is only one copy available (since request 2 was not within 112 days), so determining whether a new copy is needed is based on looking back to the REQUEST.DATE of ID.Index 2.
The desired output ("Copies") would take each new request (ID.Index), then look back to the relevant REQUEST.DATE based on the number of available copies, and doing that would mean looking at the accumulating sum for the preceding calculation. (Note: The max number of copies would be 10.)
I've provided the desired output for the sample below ("Copies").
# A tibble: 15 x 4
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index Copies
<date> <int> <int> <dbl>
1 2013-07-09 2 1 1
2 2013-08-07 2 2 2
3 2013-08-20 2 3 3
4 2013-09-08 2 4 4
5 2013-09-28 2 5 5
6 2013-12-27 2 6 5
7 2014-02-10 2 7 5
8 2014-03-12 2 8 5
9 2014-03-14 2 9 5
10 2014-08-27 2 10 5
11 2014-04-27 6 1 1
12 2014-08-01 6 2 2
13 2014-11-13 6 3 2
14 2015-02-14 6 4 2
15 2015-05-14 6 5 2
>
I recognize that the solution will be way beyond my abilities, so I will be extremely grateful for any solution or advice about how to solve this type of problem in the future.
Thanks a million!
*4/19 update: new examples where new copy may be added after delay, i.e., not in sequence. I've also added columns showing days since a given previous request, which helps checking whether a new copy should be added, based on how many copies there are.
Sample 2: new copy should be added with third request, since it has only been 96 days since last request (and there is only one copy)
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10680332 2013-10-17 6 1 NA days NA days NA days NA days NA days 1
2 PEN-10835735 2014-04-27 6 2 192 days NA days NA days NA days NA days 1
3 PEN-10873506 2014-08-01 6 3 96 days 288 days NA days NA days NA days 1
4 PEN-10951264 2014-11-13 6 4 104 days 200 days 392 days NA days NA days 1
5 PEN-11029526 2015-02-14 6 5 93 days 197 days 293 days 485 days NA days 1
6 PEN-11106581 2015-05-14 6 6 89 days 182 days 286 days 382 days 574 days 1
Sample 3: new copy should be added with last request, since there are two copies, and the oldest request is 45 days.
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10999392 2015-01-20 76 1 NA days NA days NA days NA days NA days 1
2 YAL-11004302 2015-01-22 76 2 2 days NA days NA days NA days NA days 2
3 COR-11108471 2015-05-18 76 3 116 days 118 days NA days NA days NA days 2
4 HVD-11136632 2015-07-27 76 4 70 days 186 days 188 days NA days NA days 2
5 MIT-11164843 2015-09-09 76 5 44 days 114 days 230 days 232 days NA days 2
6 HVD-11166239 2015-09-10 76 6 1 days 45 days 115 days 231 days 233 days 2
You can use runner package to apply any R function on cumulative window.
This time we execute function f using x = REQUEST.DATE. We just count number of observations which are within min(x) + 112.
library(dplyr)
library(runner)
data %>%
group_by(Title.ID) %>%
mutate(
Copies = runner(
x = REQUEST.DATE,
f = function(x) {
length(x[x <= (min(x + 112))])
}
)
)
# # A tibble: 15 x 4
# # Groups: Title.ID [2]
# REQUEST.DATE Title.ID ID.Index Copies
# <date> <int> <int> <int>
# 1 2013-07-09 2 1 1
# 2 2013-08-07 2 2 2
# 3 2013-08-20 2 3 3
# 4 2013-09-08 2 4 4
# 5 2013-09-28 2 5 5
# 6 2013-12-27 2 6 5
# 7 2014-02-10 2 7 5
# 8 2014-03-12 2 8 5
# 9 2014-03-14 2 9 5
# 10 2014-08-27 2 10 5
# 11 2014-04-27 6 1 1
# 12 2014-08-01 6 2 2
# 13 2014-11-13 6 3 2
# 14 2015-02-14 6 4 2
# 15 2015-05-14 6 5 2
data
data <- read.table(
text = " REQUEST.DATE Title.ID ID.Index
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5",
header = TRUE)
data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))
I was able to find a workable solution based on finding the max number of other requests within 112 days of a request (after creating return date), for each title.
data$RETURN.DATE <- as.Date(data$REQUEST.DATE + 112)
data <- data %>%
group_by(Title.ID) %>%
mutate(
Copies = sapply(REQUEST.DATE, function(x)
sum(as.Date(REQUEST.DATE) <= as.Date(x) &
as.Date(RETURN.DATE) >= as.Date(x)
))
)
Then I de-duplicated the list of titles, using the max number for each title, and added it back to the original data.
I still think there's a solution to the original problem, where I could go back and see at which point new copies needed to be added (for analysis based on when a title is published), but this works for now.

Resources