R - Number of days since last event in another dataframe

I have the following two data frames:
> head(Reaction_per_park_per_day_3)
Park Date Type_1_2 Number_AC_events
<chr> <date> <chr> <int>
1 Beaverdam Flats 2018-09-25 0 1
2 Nosehill 64 ave 2018-09-26 0 1
3 Nosehill 64 ave 2018-09-26 0 1
4 Nosehill Macewin 2018-09-26 0 1
5 Crestmont 2018-09-27 0 2
6 Country Hills G.C. - Nose Creek 2018-09-28 0 1
> head(All_reports_per_month2)
Month Park Code Reports_per_month
<date> <chr> <chr> <dbl>
1 2018-09-29 Beaverdam Flats 1 1
2 2018-10-12 Nosehill 64 ave 2 1
3 2018-10-25 Nosehill 64 ave 1 2
4 2018-09-21 Crestmont 1 1
5 2018-09-29 Crestmont 2 1
I would like to add a "days since last AC event" column to All_reports_per_month2 that would take into account the date and the park of the AC event as well as the date and park of the report. If the report date is prior to the first AC event in a certain park, NA would appear. See example below:
Month Park Code Reports_per_month Days_since_last_AC
<date> <chr> <chr> <dbl> <chr>
1 2018-09-29 Beaverdam Flats 1 1 4
2 2018-10-12 Nosehill 64 ave 2 1 16
3 2018-10-25 Nosehill 64 ave 1 2 29
4 2018-09-21 Crestmont 1 1 NA
5 2018-09-29 Crestmont 2 1 2
Any help would be appreciated!

This is a joining and filtering operation that will use the dplyr package.
# import the package
library(dplyr)
# join the tables on Park, then keep only AC events on or before each
# report date, so that we are always looking back in time
days_since <- All_reports_per_month2 %>%
  left_join(Reaction_per_park_per_day_3, by = "Park") %>%
  filter(Date <= Month) %>%
  group_by(Park, Month) %>%
  summarize(Days_since_last_AC = as.numeric(Month - max(Date)), .groups = "drop")
# the filter drops reports with no prior AC event, so join the result back
# to restore those rows with NA
All_reports_per_month2 %>%
  left_join(days_since, by = c("Park", "Month"))
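The per-report lookup can also be sketched in base R on toy data (Crestmont only, hypothetical values), which makes the NA case from the question explicit: a report dated before every AC event in its park gets NA.

```r
# toy data mirroring the two tables (Crestmont only, hypothetical values)
reports <- data.frame(Park  = c("Crestmont", "Crestmont"),
                      Month = as.Date(c("2018-09-21", "2018-09-29")))
events  <- data.frame(Park = "Crestmont",
                      Date = as.Date("2018-09-27"))

# for each report: days since the most recent AC event in the same park,
# or NA when no event precedes the report
reports$Days_since_last_AC <- sapply(seq_len(nrow(reports)), function(i) {
  d <- events$Date[events$Park == reports$Park[i] &
                   events$Date <= reports$Month[i]]
  if (length(d) == 0) NA_real_ else as.numeric(reports$Month[i] - max(d))
})

reports$Days_since_last_AC  # NA 2
```

This matches the asker's example table: the 2018-09-21 Crestmont report precedes the 2018-09-27 event (NA), and the 2018-09-29 report is 2 days after it.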


R adding a column to one dataframe based on another dataframe and the date

I have a dataframe (Reports_following_AC) where each row represents a report. This dataframe looks like this:
> head(Reports_following_AC)
Park Month Obs_con Coy_Season Number_AC Number_4w_AC
<chr> <date> <dbl> <dbl> <int> <int>
1 14st NE - Coventry 2019-06-14 1 2 8 0
2 14st NE - Coventry 2019-10-12 0 3 10 0
3 14st NE - Coventry 2019-10-13 0 3 10 0
4 14st NE - Coventry 2021-06-23 1 2 10 0
5 Airways Park 2020-07-05 0 2 3 0
6 Airways Park 2021-07-18 1 2 6 0
I would like to add a column to my Reports_following_AC dataframe, "Last_treatment", based on the "AC_code" column of the Reaction_per_park_per_day_3 dataframe (below). In my Reaction_per_park_per_day_3 dataframe, each row represents an AC event.
The Last_treatment column added to the Reports_following_AC dataframe would hold the "AC_code" (treatment) of the last AC event prior to a report in a park, if that AC event occurred within the 4 weeks (28 days) before the report.
> head(Reaction_per_park_per_day_3)
# A tibble: 6 x 10
Park Date AC_code
<chr> <date> <dbl>
1 14st NE - Coventry 2019-06-05 6
2 14st NE - Coventry 2019-07-12 7
3 14st NE - Coventry 2019-10-05 1
4 14st NE - Coventry 2021-06-18 2
5 Airways Park 2020-06-26 1
6 Airways Park 2021-06-30 5
The resulting dataframe would therefore look like this:
Park Month Obs_con Coy_Season Number_AC Number_4w_AC Last_treatment
<chr> <date> <dbl> <dbl> <int> <int> <dbl>
1 14st NE - Coventry 2019-06-14 1 2 8 0 6
2 14st NE - Coventry 2019-10-12 0 3 10 0 1
3 14st NE - Coventry 2019-10-13 0 3 10 0 1
4 14st NE - Coventry 2021-06-23 1 2 10 0 NA
5 Airways Park 2020-07-05 0 2 3 0 1
6 Airways Park 2021-07-18 1 2 6 0 5
I tried the following code, but it's not quite working: instead of providing the AC_code of the last AC event prior to a report (if within 28 days of the report), it provides the AC_code of every AC event within 28 days of the report.
Reports_following_AC_1 <- Reports_following_AC %>%
  left_join(select(Reaction_per_park_per_day_3, c(Park, Date, AC_code))) %>%
  filter(Date <= Month) %>%
  group_by(Park, Month, Obs_con, Coy_Season) %>%
  mutate(Last_treatment = if_else((Month - max(Date)) < 28, AC_code, as.character(NA))) %>%
  distinct()
> head(Reports_following_AC_1)
Park Month Obs_con Coy_Season Number_AC Number_4w_AC Date AC_code Last_treatment
<chr> <date> <dbl> <dbl> <int> <int> <date> <chr> <chr>
1 14st NE - Coventry 2019-06-14 1 2 8 0 2019-01-30 3 NA
2 14st NE - Coventry 2019-06-14 1 2 8 0 2019-01-30 4 NA
3 14st NE - Coventry 2019-06-14 1 2 8 0 2019-01-30 1 NA
4 14st NE - Coventry 2019-06-14 1 2 8 0 2019-02-01 4 NA
5 14st NE - Coventry 2019-06-14 1 2 8 0 2019-02-01 2 NA
6 14st NE - Coventry 2019-06-14 1 2 8 0 2019-02-04 1 NA
I'm ideally looking for a dplyr solution, but I'm open to other possibilities.
You want to join with a selection of columns from Reaction_per_park_per_day_3, if I understand correctly? This should work (taking only the AC_code of the most recent event per group):
Reports_following_AC_1 <- Reports_following_AC %>%
  left_join(select(Reaction_per_park_per_day_3, Park, Date, AC_code), by = "Park") %>%
  filter(Date <= Month) %>%
  group_by(Park, Month, Obs_con, Coy_Season) %>%
  mutate(Last_treatment = if_else((Month - max(Date)) < 28,
                                  AC_code[which.max(Date)],
                                  as.character(NA))) %>%
  distinct()
I figured it out!
Reports_following_AC_1 <- Reports_following_AC %>%
  left_join(select(Reaction_per_park_per_day_3, c(Park, Date, AC_code))) %>%
  filter(Date < Month) %>%
  group_by(Park, Month, Obs_con, Coy_Season, Number_4w_AC) %>%
  mutate(Last_treatment = last(if_else((Month - max(Date)) < 28, AC_code, as.character(NA)))) %>%
  select(c(Park, Month, Obs_con, Coy_Season, Number_4w_AC, Last_treatment)) %>%
  distinct()
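The "last treatment within 28 days" rule can also be isolated in a small base-R helper, which makes the logic easy to test on toy data (park name, dates, and codes below are hypothetical; the 28-day window follows the question):

```r
# toy data: two AC events in one park (hypothetical values)
events <- data.frame(Park    = c("Park A", "Park A"),
                     Date    = as.Date(c("2020-01-01", "2020-02-01")),
                     AC_code = c(3, 5))

# AC_code of the last event strictly before report date `m` in park `p`,
# but only if that event falls within `window` days of the report
last_treatment <- function(p, m, events, window = 28) {
  idx <- which(events$Park == p & events$Date < m)
  if (length(idx) == 0) return(NA_real_)
  i <- idx[which.max(events$Date[idx])]
  if (as.numeric(m - events$Date[i]) <= window) events$AC_code[i] else NA_real_
}

last_treatment("Park A", as.Date("2020-02-10"), events)  # 5  (9 days after last event)
last_treatment("Park A", as.Date("2020-04-01"), events)  # NA (60 days after last event)
```

A helper like this could be applied per report row with mapply or a loop; it returns NA both when no event precedes the report and when the last one is too old.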

R - Count the number of reports a month before a week

This is very similar to the question I asked previously (see Count the number of rows a month before a date), but the solution suggested does not fix my issue in this case.
I have a dataframe that looks like this:
> Reports_per_park_per_week_3
Park Week Coy_Season Reports_per_week Number_4w_AC Year
<chr> <date> <chr> <dbl> <int> <chr>
1 Airways Park 2018-04-29 1 5 0 2018
2 Airways Park 2018-05-06 2 2 1 2018
3 Airways Park 2018-05-13 2 0 1 2018
4 Baker Park 2018-05-20 2 3 2 2018
5 Baker Park 2018-05-27 2 9 2 2018
6 Baker Park 2018-06-03 2 2 5 2018
I would like to create another column that would calculate the total number of reports per park in the month prior to the week being evaluated. The column in question would therefore have to take into account the Park column, the Week column and the Reports per week column.
> Reports_per_park_per_week_3
Park Week Coy_Season Reports_per_week Number_4w_AC Year Reports_4w
<chr> <date> <chr> <dbl> <int> <chr> <dbl>
1 Airways Park 2018-04-29 1 5 0 2018 5
2 Airways Park 2018-05-06 2 2 1 2018 7
3 Airways Park 2018-05-13 2 0 1 2018 7
4 Baker Park 2018-05-20 2 3 2 2018 3
5 Baker Park 2018-05-27 2 9 2 2018 12
6 Baker Park 2018-06-03 2 2 5 2018 14
Does this do what you want? It is assumed here that your time series all have 1-week spacing throughout (no weeks are skipped) and that there are zero reports prior to the earliest week in each time series.
library("dplyr")
library("zoo")
Reports_per_park_per_week_3 %>%
  group_by(Park) %>%
  arrange(Week, .by_group = TRUE) %>%
  # pad with three zero-weeks so the width-4 rolling sum covers the current
  # week plus the three weeks before it
  mutate(Reports_4w = rollsum(c(integer(3L), Reports_per_week), 4L))
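As a quick sanity check of the zero-padding trick on one park's numbers (the Airways Park values from the example above; zoo must be installed):

```r
library(zoo)

# Reports_per_week for one park over three consecutive weeks
reports <- c(5, 2, 0)

# pad with three zeros so the width-4 window covers the current week
# plus the three weeks before it
rollsum(c(integer(3L), reports), 4L)  # 5 7 7
```

These match the Reports_4w values 5, 7, 7 shown for Airways Park in the asker's expected output.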

Number of reports one week before an event R

I'm trying to add a column (AC_1_before) to my dataframe that would count the number of reports in the week (or two, three or four weeks) prior to an event within a park.
My dataframe currently looks like this:
View(Reaction_per_park_per_day_3)
Park Date Type_1_2 Coy_season AC_code Year Total_prior_AC
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl>
1 Airways Park 2019-01-14 1 1 3 2019 0
2 Airways Park 2019-01-16 0 1 2 2019 1
3 Airways Park 2019-01-24 0 1 2 2019 2
4 Auburn Bay 2021-03-02 1 1 1 2021 0
5 Auburn Bay 2021-03-03 0 1 1 2021 1
6 Auburn Bay 2021-05-08 0 1 1 2021 2
7 Bears Paw 2019-05-22 0 2 1 2019 0
8 Bears Paw 2019-05-22 0 2 2 2019 1
Where Type_1_2 represents a specific reaction, Coy_season refers to a season, AC_code represents a treatment, and Total_prior_AC represents the total number of events prior to a report within a park.
With the added column, I would like my dataframe to look like this:
Park Date Type_1_2 Coy_season AC_code Year Total_prior_AC AC_1_before
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Airways Park 2019-01-14 1 1 3 2019 0 0
2 Airways Park 2019-01-16 0 1 2 2019 1 1
3 Airways Park 2019-01-24 0 1 2 2019 2 1
4 Auburn Bay 2021-03-02 1 1 1 2021 0 0
5 Auburn Bay 2021-03-03 0 1 1 2021 1 1
6 Auburn Bay 2021-05-08 0 1 1 2021 2 0
7 Bears Paw 2019-05-22 0 2 1 2019 0 0
8 Bears Paw 2019-05-22 0 2 2 2019 1 1
I tried this:
library(lubridate)
library(dplyr)
Reaction_per_park_per_day_4 <- Reaction_per_park_per_day_3 %>%
group_by(Park, Date) %>%
mutate(Start_date = min(Date)) %>%
group_by(Park, Date, Start_date) %>%
summarise(AC_1_before = sum(Date <= Start_date & Date >= Start_date - weeks(1)),
.groups = "drop")
This does not seem to work; although the code does run, the result obtained is not correct (I get 1s where I should get 0s, and the sums are often wrong). By grouping by Park and Date, I also group together events that were conducted on the same park and on the same day, which I do not want to do.
Any ideas on how I could do this?
If I understood you correctly, one way to do this could be to use a for loop. For simplicity I made a new dataframe:
library(dplyr)
library(lubridate)
Reaction_per_park_per_day_3 <- data.frame(
  "Park" = c(rep("Airways Park", 3), rep("Auburn Bay", 3), rep("Bears Paw", 2)),
  "Date" = as.POSIXct(c("2019-01-14", "2019-01-16", "2019-01-24", "2021-03-02",
                        "2021-03-03", "2021-05-08", "2019-05-22", "2019-05-22")),
  "Type_1_2" = c(1, 0, 0, 1, 0, 0, 0, 0),
  "Coy_season" = c(1, 1, 1, 1, 1, 1, 2, 2),
  "AC_code" = c(3, 2, 2, 1, 1, 1, 1, 2),
  "Year" = c(2019, 2019, 2019, 2021, 2021, 2021, 2019, 2019),
  "Total_prior_AC" = c(0, 1, 2, 0, 1, 2, 0, 1))
for (i in 1:nrow(Reaction_per_park_per_day_3)) {
  Reaction_per_park_per_day_3$AC_1_before[i] <- nrow(
    Reaction_per_park_per_day_3[0:(i - 1), ] %>%
      filter(Park == Reaction_per_park_per_day_3$Park[i] &
             Date %within% interval(Reaction_per_park_per_day_3$Date[i] - 604800,
                                    Reaction_per_park_per_day_3$Date[i])))
  # 604800 is the number of seconds in a week
}
So for each row, this counts the number of earlier rows that match on the "Park" column and fall within the 7 days before the current row's date. I'm sure there's a better way to do this, but this could work I think!
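The same count can be written without growing the dataframe inside an explicit loop; here is a base-R sketch over the answer's sample parks and dates. Note that with a strict 7-day window the 2019-01-24 row gets 0, because the previous Airways Park event (2019-01-16) is 8 days earlier.

```r
df <- data.frame(
  Park = c(rep("Airways Park", 3), rep("Auburn Bay", 3), rep("Bears Paw", 2)),
  Date = as.Date(c("2019-01-14", "2019-01-16", "2019-01-24",
                   "2021-03-02", "2021-03-03", "2021-05-08",
                   "2019-05-22", "2019-05-22")))

# for each row, count earlier rows in the same park dated within the
# previous 7 days (inclusive of the same day)
df$AC_1_before <- sapply(seq_len(nrow(df)), function(i) {
  prior <- seq_len(i - 1)
  sum(df$Park[prior] == df$Park[i] &
      df$Date[prior] >= df$Date[i] - 7 &
      df$Date[prior] <= df$Date[i])
})

df$AC_1_before  # 0 1 0 0 1 0 0 1
```

Using Date arithmetic (`- 7`) instead of seconds avoids the 604800 magic number; widening the window to 14, 21, or 28 days is a one-character change.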

Summarise? Count occurrences in column based on another column

I believe this may have a simple solution but I'm having trouble describing what I need to do (and hence what to search for). I think I need the summarize function. My goal output is at the very bottom.
I'm trying to count the occurrences of a value between each unique value in another column. Here is an example df that hopefully illustrates what I need to do.
library(dplyr)
set.seed(1)
df <- tibble("name" = c(rep("dinah", 2), rep("lucy", 4), rep("sora", 9)),
             "meal" = c(rep(c("chicken", "beef", "fish"), 5)),
             "date" = seq(as.Date("1999/1/1"), as.Date("2000/1/1"), 25),
             "num.wins" = sample(0:30)[1:15])
Among other things, I'm trying to summarize (sum) the types of meals each name had using this data.
df
# A tibble: 15 x 4
name meal date num.wins
<chr> <chr> <date> <int>
1 dinah chicken 1999-01-01 8
2 dinah beef 1999-01-26 11
3 lucy fish 1999-02-20 16
4 lucy chicken 1999-03-17 25
5 lucy beef 1999-04-11 5
6 lucy fish 1999-05-06 23
7 sora chicken 1999-05-31 27
8 sora beef 1999-06-25 15
9 sora fish 1999-07-20 14
10 sora chicken 1999-08-14 1
11 sora beef 1999-09-08 4
12 sora fish 1999-10-03 3
13 sora chicken 1999-10-28 13
14 sora beef 1999-11-22 6
15 sora fish 1999-12-17 18
I've made progress with other calculations I'm interested in, below:
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins))
# A tibble: 3 x 5
name count medianDate life wins
<chr> <int> <date> <time> <int>
1 dinah 2 1999-01-13 25 days 19
2 lucy 4 1999-03-29 75 days 69
3 sora 9 1999-09-08 200 days 101
My goal is to add an additional column for each type of food, and have the sum of the occurrences of that food displayed in each row, like so:
name count medianDate life wins chicken beef fish
1 dinah 2 1999-01-13 25 days 19 1 1 0
2 lucy 4 1999-03-29 75 days 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
Though older, and possibly on a deprecation path, reshape2::dcast does this nicely:
reshape2::dcast(df, name ~ meal)
# name beef chicken fish
# 1 dinah 1 1 0
# 2 lucy 1 1 2
# 3 sora 3 3 3
You can understand the formula as rows ~ columns. By default, it will aggregate the values in the columns using the length function, which gives exactly what you want: the count of each.
This can be easily joined to your summary data:
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins)) %>%
  left_join(reshape2::dcast(df, name ~ meal))
# # A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <int> <int> <int>
# 1 dinah 2 1999-01-13 25 days 19 1 1 0
# 2 lucy 4 1999-03-29 75 days 69 1 1 2
# 3 sora 9 1999-09-08 200 days 101 3 3 3
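Since reshape2 is on that deprecation path, the same wide count can be produced with tidyr's pivot_wider (a sketch, assuming tidyr >= 1.1 for the scalar values_fill):

```r
library(dplyr)
library(tidyr)

# same name/meal structure as the question's df (dates and wins omitted,
# since only the meal counts matter here)
df <- tibble(name = c(rep("dinah", 2), rep("lucy", 4), rep("sora", 9)),
             meal = rep(c("chicken", "beef", "fish"), 5))

# count name/meal pairs, then spread meals into columns, filling absent
# combinations with 0
wide <- df %>%
  count(name, meal) %>%
  pivot_wider(names_from = meal, values_from = n, values_fill = 0)
wide
```

This gives the same chicken/beef/fish counts as the dcast call and can be left_join-ed to the summary table in the same way.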
One option is to use table inside summarise as a list column, unnest and then spread it to 'wide' format
library(tidyverse)
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins),
            n = list(enframe(table(meal)))) %>%
  unnest() %>%
  spread(name1, value, fill = 0)
# A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <dbl> <dbl> <dbl>
#1 dinah 2 1999-01-13 25 days 19 1 1 0
#2 lucy 4 1999-03-29 75 days 69 1 1 2
#3 sora 9 1999-09-08 200 days 101 3 3 3
I'm not entirely sure why I'm getting the funky formatting for life, but I think this gets at your need for a count of the meal types.
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins),
            chicken = sum(meal == "chicken"),
            beef = sum(meal == "beef"),
            fish = sum(meal == "fish"))
# A tibble: 3 x 8
name count medianDate life wins chicken beef fish
<chr> <int> <date> <time> <int> <int> <int> <int>
1 dinah 2 1999-01-13 " 25 days" 19 1 1 0
2 lucy 4 1999-03-29 " 75 days" 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3

Create weekly cumulative totals from R dataframe

I have a dataframe that has launch weeks for products across markets. Here is a snapshot of the dataframe.
Prod_ID Market_Name START_WEEK
11044913000 PHOENIX, AZ 1397
11044913000 WEST TEX/NEW MEX 1206
11159402003 PORTLAND,OR 1188
11159402003 SEATTLE/TACOMA 1188
11159402003 SPOKANE 1195
11159410010 PORTLAND,OR 1186
11159410010 SALT LAKE CITY 1190
11159410010 SEATTLE/TACOMA 1186
11159410010 SPOKANE 1187
11159410010 WEST TEX/NEW MEX 1197
11159410014 PORTLAND,OR 1198
11159410014 SEATTLE/TACOMA 1239
I would like to create another dataframe which will give me, for each Prod_ID, cumulative totals of the number of markets a product has been launched in on a weekly basis for the first 6 weeks. For the above snippet of data, the output should look something like this.
Prod_ID Week1 Week2 Week3 Week4 Week5 Week6
11044913000 1 1 1 1 1 1
11159402003 2 2 2 2 2 2
11159410010 2 3 3 3 4 4
11159410014 1 1 1 1 1 1
For ease of displaying, I have shown the output only till Week 6, but I need to track till Week 12 for my need. Week is denoted by a 4 digit number in my dataset and is not in date format. Please note that not all products have the same starting week, so I need to infer the earliest week for a Prod_ID from the START_WEEK variable. And then identify the next 6 weeks to generate the total number of markets launched in each week.
Any help to do this is appreciated.
I think I understand your problem. Here is my shot. There are several phases to this solution.
The first step is to calculate the cumulative sum of markets for the weeks and the week number for each Prod_ID since they opened. This is done with the following code chunk.
df1 <- df %>%
  group_by(Prod_ID, START_WEEK) %>%
  count() %>%
  arrange(Prod_ID, START_WEEK) %>%
  ungroup() %>%
  group_by(Prod_ID) %>%
  mutate(tot_market = cumsum(n)) %>%
  ungroup() %>%
  group_by(Prod_ID) %>%
  mutate(min_START_WEEK = min(START_WEEK)) %>%
  mutate(week = START_WEEK - min_START_WEEK + 1)
df1
# # A tibble: 10 x 6
# # Groups: Prod_ID [4]
# Prod_ID START_WEEK n tot_market min_START_WEEK week
# <dbl> <int> <int> <int> <dbl> <dbl>
# 1 11044913000. 1206 1 1 1206. 1.
# 2 11044913000. 1397 1 2 1206. 192.
# 3 11159402003. 1188 2 2 1188. 1.
# 4 11159402003. 1195 1 3 1188. 8.
# 5 11159410010. 1186 2 2 1186. 1.
# 6 11159410010. 1187 1 3 1186. 2.
# 7 11159410010. 1190 1 4 1186. 5.
# 8 11159410010. 1197 1 5 1186. 12.
# 9 11159410014. 1198 1 1 1198. 1.
# 10 11159410014. 1239 1 2 1198. 42.
The second phase is to expand the week and Prod_ID to the maximum number of weeks in week.
df2 <- expand.grid(min(df1$week):max(df1$week), unique(df1$Prod_ID))
colnames(df2) <- c("week", "Prod_ID")
The third phase is done by merging df1 and df2 and using zoo::na.locf to fill the NA's in tot_market (total market) by Prod_ID with the preceding value.
df2 %>%
  left_join(df1) %>%
  select(-START_WEEK, -n, -min_START_WEEK) %>%
  group_by(Prod_ID) %>%
  arrange(Prod_ID, week) %>%
  mutate(tot_market = zoo::na.locf(tot_market)) %>%
  spread(week, tot_market) %>%
  ungroup() %>%
  mutate_at(vars(Prod_ID), as.character) %>%
  rename_if(is.integer, function(x) paste0("Week", x))
# # A tibble: 4 x 193
# Prod_ID Week1 Week2 Week3 Week4 Week5 Week6 Week7 Week8 Week9 Week10 Week11
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 11044913000 1 1 1 1 1 1 1 1 1 1 1
# 2 11159402003 2 2 2 2 2 2 2 3 3 3 3
# 3 11159410010 2 3 3 3 4 4 4 4 4 4 4
# 4 11159410014 1 1 1 1 1 1 1 1 1 1 1
# # ... with 181 more variables
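For comparison, a compact base-R sketch of the same cumulative market counts, using the snapshot data from the question and showing only the first six weeks as in the asker's example:

```r
df <- data.frame(
  Prod_ID = c("11044913000", "11044913000",
              "11159402003", "11159402003", "11159402003",
              "11159410010", "11159410010", "11159410010", "11159410010", "11159410010",
              "11159410014", "11159410014"),
  START_WEEK = c(1397, 1206,
                 1188, 1188, 1195,
                 1186, 1190, 1186, 1187, 1197,
                 1198, 1239))

# week number of each launch relative to the product's first launch
wk <- ave(df$START_WEEK, df$Prod_ID, FUN = function(x) x - min(x) + 1)

# for each of the first 6 weeks, count markets launched up to that week
cum <- sapply(1:6, function(w) tapply(wk <= w, df$Prod_ID, sum))
colnames(cum) <- paste0("Week", 1:6)
cum
```

The resulting matrix reproduces the asker's expected table (e.g. 11159410010 reaches 2 markets in Week1, 3 in Week2, and 4 in Week5); extending to Week 12 is just `1:12` in the sapply call.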
