Create weekly cumulative totals from R dataframe

I have a dataframe that has launch weeks for products across markets. Here is a snapshot of the dataframe.
Prod_ID      Market_Name       START_WEEK
11044913000  PHOENIX, AZ             1397
11044913000  WEST TEX/NEW MEX        1206
11159402003  PORTLAND,OR             1188
11159402003  SEATTLE/TACOMA          1188
11159402003  SPOKANE                 1195
11159410010  PORTLAND,OR             1186
11159410010  SALT LAKE CITY          1190
11159410010  SEATTLE/TACOMA          1186
11159410010  SPOKANE                 1187
11159410010  WEST TEX/NEW MEX        1197
11159410014  PORTLAND,OR             1198
11159410014  SEATTLE/TACOMA          1239
I would like to create another dataframe that gives me, for each Prod_ID, the cumulative total of markets the product has been launched in, week by week, for the first 6 weeks. For the above snippet of data, the output should look something like this.
Prod_ID      Week1 Week2 Week3 Week4 Week5 Week6
11044913000      1     1     1     1     1     1
11159402003      2     2     2     2     2     2
11159410010      2     3     3     3     4     4
11159410014      1     1     1     1     1     1
For ease of display, I have shown the output only through Week 6, but I need to track through Week 12. Week is denoted by a 4-digit number in my dataset and is not in date format. Please note that not all products have the same starting week, so I need to infer the earliest week for a Prod_ID from the START_WEEK variable, and then identify the next 6 weeks to generate the total number of markets launched in each week.
Any help to do this is appreciated.

I think I understand your problem. Here is my shot at it. There are several phases to this solution.
The first phase is to calculate, for each Prod_ID, the cumulative number of markets and the number of weeks since launch. This is done with the following code chunk.
library(dplyr)

df1 <- df %>%
  group_by(Prod_ID, START_WEEK) %>%
  count() %>%                                      # markets launched per product-week
  arrange(Prod_ID, START_WEEK) %>%
  ungroup() %>%
  group_by(Prod_ID) %>%
  mutate(tot_market = cumsum(n)) %>%               # running market total per product
  mutate(min_START_WEEK = min(START_WEEK)) %>%     # each product's first launch week
  mutate(week = START_WEEK - min_START_WEEK + 1)   # weeks since launch (week 1 = launch)
df1
# # A tibble: 10 x 6
# # Groups: Prod_ID [4]
# Prod_ID START_WEEK n tot_market min_START_WEEK week
# <dbl> <int> <int> <int> <dbl> <dbl>
# 1 11044913000. 1206 1 1 1206. 1.
# 2 11044913000. 1397 1 2 1206. 192.
# 3 11159402003. 1188 2 2 1188. 1.
# 4 11159402003. 1195 1 3 1188. 8.
# 5 11159410010. 1186 2 2 1186. 1.
# 6 11159410010. 1187 1 3 1186. 2.
# 7 11159410010. 1190 1 4 1186. 5.
# 8 11159410010. 1197 1 5 1186. 12.
# 9 11159410014. 1198 1 1 1198. 1.
# 10 11159410014. 1239 1 2 1198. 42.
The second phase is to expand week and Prod_ID into a full grid, so that every Prod_ID has a row for every week from the minimum to the maximum value of week.
df2 <- expand.grid(min(df1$week):max(df1$week), unique(df1$Prod_ID))
colnames(df2) <- c("week", "Prod_ID")
The third phase is to merge df1 into df2 and use zoo::na.locf to fill the NAs in tot_market (total markets) within each Prod_ID with the preceding value.
library(tidyr)   # for spread()

df2 %>%
  left_join(df1) %>%
  select(-START_WEEK, -n, -min_START_WEEK) %>%
  group_by(Prod_ID) %>%
  arrange(Prod_ID, week) %>%
  mutate(tot_market = zoo::na.locf(tot_market)) %>%   # carry the last total forward
  spread(week, tot_market) %>%
  ungroup() %>%
  mutate_at(vars(Prod_ID), as.character) %>%
  rename_if(is.integer, function(x) paste0("Week", x))
# # A tibble: 4 x 193
# Prod_ID Week1 Week2 Week3 Week4 Week5 Week6 Week7 Week8 Week9 Week10 Week11
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 11044913000 1 1 1 1 1 1 1 1 1 1 1
# 2 11159402003 2 2 2 2 2 2 2 3 3 3 3
# 3 11159410010 2 3 3 3 4 4 4 4 4 4 4
# 4 11159410014 1 1 1 1 1 1 1 1 1 1 1
# # ... with 181 more variables
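Since only the first 12 weeks are actually needed, the grid built in the second phase can be capped so the spread yields exactly Week1 through Week12 rather than 192 columns. A minimal sketch, reusing df1 from above:
# Weeks 1..12 for every product; running the same join/fill/spread
# pipeline as above then produces only the Week1..Week12 columns
df2 <- expand.grid(week = 1:12, Prod_ID = unique(df1$Prod_ID))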


Is there a better way to add a new value/field for every key (sym) in a tibble instead of using mutate then pivot_longer?

I have the following table and I would like to apply a function (ret) to the values BY sym. However, instead of creating a new column with that result (simple mutate), I would like to keep the table in long format and create a new row (field/value) for each day/sym.
library(dplyr)

x <- tibble(day = rep(1:5, 2),
            sym = c(rep('a', 5), rep('b', 5)),
            field = rep('price', 10),
            value = as.numeric(c(101:105, 501:505))) %>%
  arrange(day, sym)
> x
# A tibble: 10 x 4
day sym field value
<int> <chr> <chr> <dbl>
1 1 a price 101
2 1 b price 501
3 2 a price 102
4 2 b price 502
5 3 a price 103
6 3 b price 503
I can accomplish this task by mutate to create a new column and then pivot_longer and bind_rows but I have a feeling there is a more concise way...
Here is my solution:
library(tidyr)   # for pivot_longer()

ret <- function(x) c(NA, diff(x)) / x

x2 <- x %>%
  group_by(sym) %>%
  mutate(ret = ret(value)) %>%
  select(day, sym, ret) %>%
  pivot_longer(cols = c(-day, -sym), names_to = 'field', values_to = 'value') %>%
  bind_rows(x) %>%
  ungroup() %>%
  arrange(day, sym, field)
> x2
# A tibble: 20 x 4
day sym field value
<int> <chr> <chr> <dbl>
1 1 a price 101
2 1 a ret NA
3 1 b price 501
4 1 b ret NA
5 2 a price 102
6 2 a ret 0.00980
7 2 b price 502
8 2 b ret 0.00199
9 3 a price 103
10 3 a ret 0.00971
11 3 b price 503
12 3 b ret 0.00199
Thank you!! Please let me know your thoughts
There's no need to use bind_rows since you already have the price variable in the data.frame. If you rename value to price and don't remove it before pivoting, then you'll have both 'ret' and 'price' in your field variable without having to bind it back in:
x %>%
  group_by(sym) %>%
  mutate(ret = ret(value)) %>%
  select(day, sym, ret, 'price' = value) %>%
  pivot_longer(cols = c(-day, -sym),
               names_to = 'field',
               values_to = 'value')
# A tibble: 20 x 4
# Groups: sym [2]
day sym field value
<int> <chr> <chr> <dbl>
1 1 a ret NA
2 1 a price 101
3 1 b ret NA
4 1 b price 501
5 2 a ret 0.00980
6 2 a price 102
7 2 b ret 0.00199
8 2 b price 502
9 3 a ret 0.00971
10 3 a price 103
11 3 b ret 0.00199
12 3 b price 503
13 4 a ret 0.00962
14 4 a price 104
15 4 b ret 0.00198
16 4 b price 504
17 5 a ret 0.00952
18 5 a price 105
19 5 b ret 0.00198
20 5 b price 505
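One small difference from x2: pivot_longer emits ret before price within each day, because it follows the column order from the select. If the price-first ordering matters, append an arrange at the end; since 'price' sorts before 'ret', that restores the original row order:
x %>%
  group_by(sym) %>%
  mutate(ret = ret(value)) %>%
  select(day, sym, ret, 'price' = value) %>%
  pivot_longer(cols = c(-day, -sym), names_to = 'field', values_to = 'value') %>%
  ungroup() %>%
  arrange(day, sym, field)  # 'price' < 'ret' alphabetically, so price rows come first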
How about
library(dplyr)

x %>%
  group_by(sym) %>%
  mutate(value = ret(value), field = "ret") %>%
  full_join(x) %>%
  arrange(day, sym, field)
which returns
Joining, by = c("day", "sym", "field", "value")
# A tibble: 20 x 4
# Groups: sym [2]
day sym field value
<int> <chr> <chr> <dbl>
1 1 a price 101
2 1 a ret NA
3 1 b price 501
4 1 b ret NA
5 2 a price 102
6 2 a ret 0.00980
7 2 b price 502
8 2 b ret 0.00199
9 3 a price 103
10 3 a ret 0.00971
11 3 b price 503
12 3 b ret 0.00199
13 4 a price 104
14 4 a ret 0.00962
15 4 b price 504
16 4 b ret 0.00198
17 5 a price 105
18 5 a ret 0.00952
19 5 b price 505
20 5 b ret 0.00198
Or replace the full_join(x) with rbind(x).
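For the same result with dplyr verbs throughout, a bind_rows variant works too, reusing ret and x from the question (a sketch of the same idea):
x %>%
  group_by(sym) %>%
  mutate(value = ret(value), field = "ret") %>%
  ungroup() %>%
  bind_rows(x) %>%       # append the original price rows
  arrange(day, sym, field)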

r conditional subtract number

I am trying to do the following logic to create 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986    A     1           2        2
1986    B     1           2        4
1987    A     2           4        5
1987    C     1           4        2
1987    D     1           4        5
1988    C     3           5
1988    E     2           5
That is, if firm i in year t also appears in year t+1, subtract its count at t+1 from the sum_of_year at t+1;
if firm i does not appear in t+1, just use the sum_of_year at t+1, as shown in the sample.
I am having difficulty writing this conditional code.
How can I do this in a generalized way?
Thank you for your help.
One way is to use dplyr with the help of tidyr::complete. We complete the missing combinations of year and firm, filling count with 0. For each year, we subtract each firm's count from the total count for that year; finally, for each firm, we take the value from the next year using lead.
library(dplyr)
df %>%
  tidyr::complete(year, firm, fill = list(count = 0)) %>%
  group_by(year) %>%
  mutate(n = sum(count) - count) %>%
  group_by(firm) %>%
  mutate(subtract = lead(n)) %>%
  filter(count != 0) %>%
  select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA
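To reproduce this, the question's table can be rebuilt as a data frame (a reconstruction from the sample shown; the column types are an assumption):
# Reconstructed from the question's table; the blank 'subtract' cells are omitted
df <- data.frame(year        = c(1986, 1986, 1987, 1987, 1987, 1988, 1988),
                 firm        = c("A", "B", "A", "C", "D", "C", "E"),
                 count       = c(1, 1, 2, 1, 1, 3, 2),
                 sum_of_year = c(2, 2, 4, 4, 4, 5, 5))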

Spread valued column into binary 'time series' in R

I'm attempting to spread a valued column first into a set of binary columns and then gather them again in a 'time series' format.
By way of example, consider locations that have been conquered at certain times, with data that looks like this:
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
locationID conquered_in
1 1 1931
2 2 1932
3 3 1929
I'm attempting to reshape the data to look like this:
df2 <- data.frame(locationID = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  year = c(1929,1930,1931,1932,1929,1930,1931,1932,1929,1930,1931,1932),
                  conquered = c(0,0,1,1,0,0,0,0,1,1,1,1))
locationID year conquered
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 0
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
My original strategy was to spread on conquered and then attempt a gather. This answer seemed close, but I can't seem to get it right with fill, since I'm trying to populate the later years with 1's also.
You can use complete() to expand the data frame and then use cumsum() when conquered equals 1 to fill the grouped data downwards.
library(tidyr)
library(dplyr)
df1 %>%
  mutate(conquered = 1) %>%
  complete(locationID,
           conquered_in = seq(min(conquered_in), max(conquered_in)),
           fill = list(conquered = 0)) %>%
  group_by(locationID) %>%
  mutate(conquered = cumsum(conquered == 1))
# A tibble: 12 x 3
# Groups: locationID [3]
locationID conquered_in conquered
<dbl> <dbl> <int>
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 1
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
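A small caveat: cumsum(conquered == 1) keeps incrementing if a location were ever marked conquered more than once. If that can happen in the real data (an assumption beyond this example), cummax clamps the flag back to 0/1. A sketch, reusing the libraries loaded above:
df1 %>%
  mutate(conquered = 1) %>%
  complete(locationID,
           conquered_in = seq(min(conquered_in), max(conquered_in)),
           fill = list(conquered = 0)) %>%
  group_by(locationID) %>%
  mutate(conquered = cummax(conquered))  # stays at 1 after the first conquest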
Using complete from tidyr would be a better choice, though we need to be aware that the conquered years may not fully cover all the years from the beginning to the end of the war.
library(dplyr)
library(tidyr)
library(magrittr)
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
# A data frame of all the years you want to cover
df2 <- data.frame(year = seq(1929, 1940, by = 1))
# Create a data frame with every combination of year and location, plus the conquered data
df3 <- full_join(df2, df1, by = c("year" = "conquered_in")) %>%
  mutate(conquered = if_else(!is.na(locationID), 1, 0)) %>%
  complete(year, locationID) %>%
  arrange(locationID) %>%
  filter(!is.na(locationID))
# Calculate conquered based on the first year it was conquered - grouped by location
df3 %<>%
  group_by(locationID) %>%
  # the year 2000 in min() guards against locations that are never conquered
  mutate(conquered = if_else(year >= min(2000, year[conquered == 1], na.rm = TRUE), 1, 0)) %>%
  ungroup()
df3 %>% filter(year <= 1932)
# A tibble: 12 x 3
year locationID conquered
<dbl> <dbl> <dbl>
1 1929 1 0
2 1930 1 0
3 1931 1 1
4 1932 1 1
5 1929 2 0
6 1930 2 0
7 1931 2 0
8 1932 2 1
9 1929 3 1
10 1930 3 1
11 1931 3 1
12 1932 3 1

Summarise? Count occurrences in column based on another column

I believe this may have a simple solution, but I'm having trouble describing what I need to do (and hence what to search for). I think I need the summarise function. My goal output is at the very bottom.
I'm trying to count the occurrences of each value in one column for each unique value in another column. Here is an example df that hopefully illustrates what I need to do.
library(dplyr)
set.seed(1)
df <- tibble("name" = c(rep("dinah",2),rep("lucy",4),rep("sora",9)),
"meal" = c(rep(c("chicken","beef","fish"),5)),
"date" = seq(as.Date("1999/1/1"),as.Date("2000/1/1"),25),
"num.wins" = sample(0:30)[1:15])
Among other things, I'm trying to summarize (sum) the types of meals each name had using this data.
df
# A tibble: 15 x 4
name meal date num.wins
<chr> <chr> <date> <int>
1 dinah chicken 1999-01-01 8
2 dinah beef 1999-01-26 11
3 lucy fish 1999-02-20 16
4 lucy chicken 1999-03-17 25
5 lucy beef 1999-04-11 5
6 lucy fish 1999-05-06 23
7 sora chicken 1999-05-31 27
8 sora beef 1999-06-25 15
9 sora fish 1999-07-20 14
10 sora chicken 1999-08-14 1
11 sora beef 1999-09-08 4
12 sora fish 1999-10-03 3
13 sora chicken 1999-10-28 13
14 sora beef 1999-11-22 6
15 sora fish 1999-12-17 18
I've made progress with other calculations I'm interested in, below:
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins))
# A tibble: 3 x 5
name count medianDate life wins
<chr> <int> <date> <time> <int>
1 dinah 2 1999-01-13 25 days 19
2 lucy 4 1999-03-29 75 days 69
3 sora 9 1999-09-08 200 days 101
My goal is to add an additional column for each type of food, and have the sum of the occurrences of that food displayed in each row, like so:
name count medianDate life wins chicken beef fish
1 dinah 2 1999-01-13 25 days 19 1 1 0
2 lucy 4 1999-03-29 75 days 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
Though older, and possibly on a deprecation path, reshape2::dcast does this nicely:
reshape2::dcast(df, name ~ meal)
# name beef chicken fish
# 1 dinah 1 1 0
# 2 lucy 1 1 2
# 3 sora 3 3 3
You can understand the formula as rows ~ columns. By default, it will aggregate the values in the columns using the length function, which gives exactly what you want: the count of each.
This can be easily joined to your summary data:
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins)) %>%
  left_join(reshape2::dcast(df, name ~ meal))
# # A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <int> <int> <int>
# 1 dinah 2 1999-01-13 25 days 19 1 1 0
# 2 lucy 4 1999-03-29 75 days 69 1 1 2
# 3 sora 9 1999-09-08 200 days 101 3 3 3
One option is to use table inside summarise as a list column, unnest, and then spread it to 'wide' format:
library(tidyverse)
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins),
            n = list(enframe(table(meal)))) %>%
  unnest %>%
  spread(name1, value, fill = 0)
# A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <dbl> <dbl> <dbl>
#1 dinah 2 1999-01-13 25 days 19 1 1 0
#2 lucy 4 1999-03-29 75 days 69 1 1 2
#3 sora 9 1999-09-08 200 days 101 3 3 3
I'm not entirely sure why I'm getting the funky formatting for life, but I think this gets at your need for a count of the meal types.
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins),
            chicken = sum(meal == "chicken"),
            beef = sum(meal == "beef"),
            fish = sum(meal == "fish"))
# A tibble: 3 x 8
name count medianDate life wins chicken beef fish
<chr> <int> <date> <time> <int> <int> <int> <int>
1 dinah 2 1999-01-13 " 25 days" 19 1 1 0
2 lucy 4 1999-03-29 " 75 days" 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
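With current tidyr, the count-and-spread step can also be written with pivot_wider. A sketch, assuming tidyr >= 1.1 so that values_fill accepts a scalar:
library(dplyr)
library(tidyr)

df %>%
  count(name, meal) %>%                # occurrences of each meal per name
  pivot_wider(names_from = meal, values_from = n, values_fill = 0)
The result can then be left_join-ed to the summary table exactly as in the dcast answer above.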

Get difference with closest previous row in a group which meets criterion

I'm trying, for each row, to calculate the difference with the closest previous row belonging to the same group which meets a certain criterion.
Suppose I have the following dataframe:
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = T, sep = "")
For each Visit_num and each Patient, I'd like to get the difference from the closest previous row for which the patient was admitted (i.e. Admitted is Yes). Note that the Day column is ordered by date, and the time unit for this example is days.
Here is what I wanted my dataframe to look like:
Visit_num Patient Day Admitted Diff_days
1 1 2015/01/01 Yes NA
2 1 2015/01/10 No 9
3 1 2015/01/15 Yes 14
4 1 2015/02/10 No 26
5 1 2015/03/08 Yes 52
6 2 2015/01/01 Yes NA
7 2 2015/04/01 No 90
8 2 2015/04/10 No 99
9 3 2015/04/01 No NA
10 3 2015/04/10 No NA
Any help is appreciated.
Here is an option with the tidyverse. Convert 'Day' to Date class (ymd from lubridate), arrange by 'Patient' and 'Day', group by 'Patient' and take the difference of adjacent 'Day' values, create a group 'grp' based on the occurrences of 'Yes' in 'Admitted', and take the cumulative sum of 'Diff_days'.
library(tidyverse)
library(lubridate)   # for ymd(); not attached by library(tidyverse)

s %>%
  mutate(Day = ymd(Day)) %>%
  arrange(Patient, Day) %>%
  group_by(Patient) %>%
  mutate(Diff_days = c(NA, diff(Day))) %>%
  group_by(grp = cumsum(lag(Admitted == "Yes", default = TRUE)), add = TRUE) %>%
  mutate(Diff_days = cumsum(replace_na(Diff_days, 0))) %>%
  ungroup %>%
  select(-grp) %>%
  mutate(Diff_days = na_if(Diff_days, 0))
# A tibble: 8 x 5
# Visit_num Patient Day Admitted Diff_days
# <int> <int> <date> <fct> <dbl>
#1 1 1 2015-01-01 Yes NA
#2 2 1 2015-01-10 No 9
#3 3 1 2015-01-15 Yes 14
#4 4 1 2015-02-10 No 26
#5 5 1 2015-03-08 Yes 52
#6 6 2 2015-01-01 Yes NA
#7 7 2 2015-04-01 No 90
#8 8 2 2015-04-10 No 99
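An alternative sketch: carry the date of the most recent prior 'Yes' forward within each patient and subtract. This also leaves NA for patients who were never admitted, so Patient 3 keeps NA in both rows:
library(dplyr)
library(tidyr)
library(lubridate)

s %>%
  mutate(Day = ymd(Day)) %>%
  arrange(Patient, Day) %>%
  group_by(Patient) %>%
  # date of the previous admitted visit, carried forward within each patient
  mutate(last_admit = lag(replace(Day, Admitted != "Yes", NA))) %>%
  fill(last_admit) %>%
  mutate(Diff_days = as.numeric(Day - last_admit)) %>%
  select(-last_admit) %>%
  ungroup()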
