Converting days into weeks in R

I want to convert days into weeks, with all the values from each week summed up.
Right now I have the following df:
Date x
1 2018-02-23 15
2 2018-03-26 4
3 2018-03-29 3
4 2018-03-30 6
5 2018-04-03 5
6 2018-04-04 12
7 2018-04-05 7
8 2018-04-06 5
9 2018-04-07 5
10 2018-04-09 13
11 2018-04-10 8
12 2018-04-11 2
ETC.
The x in this df stands for the number of items sent on a certain day.
There are days in this df on which no items are being transported.
This df has a total of 688 rows.
What I would like to see is:
Date x
1 Week 8 2018 19
2 Week 9 2018 26
3 Week 10 2018 33
ETC.
Can someone help me out?

You can use aggregate and get the weeks with format %V:
aggregate(df$x, list(Date=format(df$Date, "%V %Y")), sum)
# Date x
#1 08 2018 15
#2 13 2018 13
#3 14 2018 34
#4 15 2018 23
Or prefixed with "Week" (thanks to @sindri-baldur for the comment):
aggregate(df$x, list(Date=sub("^0?", "Week ", format(df$Date, "%V %Y"))), sum)
#aggregate(df$x, list(Date=format(df$Date, "Week %-V %Y")), sum) #Alternative
# Date x
#1 Week 13 2018 13
#2 Week 14 2018 34
#3 Week 15 2018 23
#4 Week 8 2018 15
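A caveat worth adding (my note, not from the original answer): %V is the ISO-8601 week number, and near year boundaries it pairs with the ISO year %G rather than the calendar year %Y, so late-December or early-January dates can get mislabeled. A small illustration, assuming a platform whose strftime supports %V and %G (most do):

```r
# 2018-12-31 (a Monday) belongs to ISO week 01 of ISO year 2019
d <- as.Date("2018-12-31")
format(d, "Week %V %Y")  # "Week 01 2018" -- week number and year disagree
format(d, "Week %V %G")  # "Week 01 2019" -- consistent ISO week/year pair
```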
Data:
df <- read.table(header=TRUE, text=" Date x
1 2018-02-23 15
2 2018-03-26 4
3 2018-03-29 3
4 2018-03-30 6
5 2018-04-03 5
6 2018-04-04 12
7 2018-04-05 7
8 2018-04-06 5
9 2018-04-07 5
10 2018-04-09 13
11 2018-04-10 8
12 2018-04-11 2")
df$Date <- as.Date(df$Date)

library(lubridate)
library(tidyverse)

## Random data
df <- data.frame(date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"), by = "day"),
                 x = runif(n = 365, min = 0, max = 25))

## Aggregating by week
df2 <- df %>%
  mutate(week = lubridate::week(ymd(date))) %>%
  group_by(week) %>%
  summarise(total_per_week = sum(x))
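One thing to be aware of here (my note, not part of the original answer): lubridate::week() numbers complete seven-day periods counted from January 1, which is not the ISO numbering that format(..., "%V") uses; lubridate::isoweek() is the ISO-compatible counterpart, so the two approaches can disagree:

```r
library(lubridate)

d <- as.Date("2018-12-31")
week(d)     # 53 -- seven-day periods counted from Jan 1
isoweek(d)  # 1  -- ISO-8601: this Monday opens week 1 of 2019
```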

Using collapse:
library(collapse)
library(lubridate)
library(magrittr)

df %>%
  ftransform(week = week(ymd(Date))) %>%
  fgroup_by(week) %>%
  fsummarise(total_per_week = fsum(x))
# week total_per_week
#1 8 15
#2 13 13
#3 14 34
#4 15 23
data
df <- structure(list(Date = c("2018-02-23", "2018-03-26", "2018-03-29",
"2018-03-30", "2018-04-03", "2018-04-04", "2018-04-05", "2018-04-06",
"2018-04-07", "2018-04-09", "2018-04-10", "2018-04-11"), x = c(15L,
4L, 3L, 6L, 5L, 12L, 7L, 5L, 5L, 13L, 8L, 2L)), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

@akrun
This almost worked. Right now I get 52 rows out of 3 years of data:
week total_per_week
1 1 246
2 2 491
3 3 458
4 4 556
5 5 454
6 6 594
7 7 592
8 8 496
9 9 567
10 10 615
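The 52-row result happens because grouping on the week number alone merges the same week across different years. A sketch of a fix (assuming the same Date and x columns as above) is to group by the year as well:

```r
library(dplyr)
library(lubridate)

df2 <- df %>%
  mutate(year = year(Date), week = week(Date)) %>%
  group_by(year, week) %>%   # week 8 of 2018 stays separate from week 8 of 2019
  summarise(total_per_week = sum(x), .groups = "drop")
```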

Related

r: rearrangement of data frame based on time

I have a large drug use database:
library(data.table)
df <- data.frame(
  ID         = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  IndexDate  = c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01",
                 "2019-05-01", "2019-05-01", "2019-05-01", "2019-07-01", "2019-07-01"),
  CensorDate = c("2019-06-30", "2019-06-30", "2019-06-30", "2019-06-30",
                 "2019-07-30", "2019-07-30", "2019-07-30", "2019-12-31", "2019-12-31"),
  DrugStart  = c("2019-02-01", "2019-03-01", "2019-04-01", "2019-06-01",
                 "2019-03-01", "2019-04-15", "2019-05-16", "2019-07-05", "2020-01-01"),
  DrugEnd    = c("2019-02-15", "2019-04-15", "2019-04-30", "2019-06-05",
                 "2019-03-15", "2019-05-15", "2019-05-30", "2019-07-15", "2020-01-15"),
  Notes      = c("", "", "Overlap 15 days", "", "All days before IndexDate",
                 "15 days before IndexDate", "", "", "15 days after CensorDate"))
df
ID IndexDate CensorDate DrugStart DrugEnd Notes
1 1 2019-01-01 2019-06-30 2019-02-01 2019-02-15
2 1 2019-01-01 2019-06-30 2019-03-01 2019-04-15
3 1 2019-01-01 2019-06-30 2019-04-01 2019-04-30 Overlap 15 days
4 1 2019-01-01 2019-06-30 2019-06-01 2019-06-05
5 2 2019-05-01 2019-07-30 2019-03-01 2019-03-15 All days before IndexDate
6 2 2019-05-01 2019-07-30 2019-04-15 2019-05-15 15 days before IndexDate
7 2 2019-05-01 2019-07-30 2019-05-16 2019-05-30
8 3 2019-07-01 2019-12-31 2019-07-05 2019-07-15
9 3 2019-07-01 2019-12-31 2020-01-01 2020-01-15 15 days after CensorDate
The IndexDate and CensorDate are the same for each ID. The observation period runs from IndexDate to CensorDate.
I would like to rearrange it by the following criteria:
Linked by ID;
Neglect days before IndexDate or after CensorDate;
Overlapping time periods are counted only once;
df is a drug use database. Every period in df (from DrugStart to DrugEnd) means use of the drug; periods missing from df but within the observation period (from IndexDate to CensorDate) mean no use of the drug;
Drug use is labeled as 2 (use) and 1 (not use);
IndexDate is defined as Day 0 (meaning every "TimeStart" starts counting from 0).
I expect results as follows:
> df2 <- data.frame(
    ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
    TimeStart = c("0", "31", "46", "59", "120", "151", "156", "0", "30", "0", "4", "15"),
    TimeEnd = c("30", "45", "58", "119", "150", "155", "180", "29", "90", "3", "14", "183"),
    DrugUse = c("1", "2", "1", "2", "1", "2", "1", "2", "1", "1", "2", "1"))
> df2
ID TimeStart TimeEnd DrugUse
1 1 0 30 1
2 1 31 45 2
3 1 46 58 1
4 1 59 119 2
5 1 120 150 1
6 1 151 155 2
7 1 156 180 1
8 2 0 29 2
9 2 30 90 1
10 3 0 3 1
11 3 4 14 2
12 3 15 183 1
Now, I know how to generate TimeStart and TimeEnd by "DrugStart-IndexDate" and "DrugEnd-IndexDate", as follows:
df$TimeStart<- as.Date(df$DrugStart, format="%Y-%m-%d")-as.Date(df$IndexDate, format="%Y-%m-%d")
df$TimeEnd<- as.Date(df$DrugEnd, format="%Y-%m-%d")-as.Date(df$IndexDate, format="%Y-%m-%d")
df
ID IndexDate CensorDate DrugStart DrugEnd Notes_Drug.use.days TimeStart TimeEnd
1 1 2019-01-01 2019-06-30 2019-02-01 2019-02-15 15days 31 days 45 days
2 1 2019-01-01 2019-06-30 2019-03-01 2019-04-15 46days 59 days 104 days
3 1 2019-01-01 2019-06-30 2019-04-01 2019-04-30 Overlap 15days + 15days 90 days 119 days
4 1 2019-01-01 2019-06-30 2019-06-01 2019-06-05 5days 151 days 155 days
5 2 2019-05-01 2019-07-30 2019-03-01 2019-03-15 15days before IndexDate -61 days -47 days
6 2 2019-05-01 2019-07-30 2019-04-15 2019-05-15 15days before IndexDate+15days -16 days 14 days
7 2 2019-05-01 2019-07-30 2019-05-16 2019-05-30 15days 15 days 29 days
8 3 2019-07-01 2019-12-31 2019-07-05 2019-07-15 11days 4 days 14 days
9 3 2019-07-01 2019-12-31 2020-01-01 2020-01-15 15days after CensorDate 184 days 198 days
But I do not know how to deal with the overlapping periods and the continuous periods, as follows:
# Overlapped periods:
# Transform
ID TimeStart TimeEnd
2 1 59 days 104 days
3 1 90 days 119 days
# to
ID TimeStart TimeEnd
2 1 59 days 119 days
# And continuous periods:
# Transform
ID TimeStart TimeEnd
6 2 -16 days 14 days
7 2 15 days 29 days
# To
ID TimeStart TimeEnd
6 2 0 days 29 days
Also, how do I add the periods during which the drug is not used (those with DrugUse = 1)? Such as these lines:
ID TimeStart TimeEnd DrugUse
1 1 0 30 1
3 1 46 58 1
5 1 120 150 1
7 1 156 180 1
9 2 30 90 1
10 3 0 3 1
12 3 15 183 1
Can anyone help me? Thank you very much!
#####################################################
Updated:
Thanks to Bas for the answer!! I made minor revisions to it; the code below might be the final version!!
library(data.table)
df <- data.frame(
  ID         = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  IndexDate  = c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01",
                 "2019-05-01", "2019-05-01", "2019-05-01", "2019-07-01", "2019-07-01"),
  CensorDate = c("2019-06-30", "2019-06-30", "2019-06-30", "2019-06-30",
                 "2019-07-30", "2019-07-30", "2019-07-30", "2019-12-31", "2019-12-31"),
  DrugStart  = c("2019-02-01", "2019-03-01", "2019-04-01", "2019-06-01",
                 "2019-03-01", "2019-04-15", "2019-05-16", "2019-07-05", "2020-01-01"),
  DrugEnd    = c("2019-02-15", "2019-04-15", "2019-04-30", "2019-06-05",
                 "2019-03-15", "2019-05-15", "2019-05-30", "2019-07-15", "2020-01-15"),
  Notes      = c("", "", "Overlap 15 days", "", "All days before IndexDate",
                 "15 days before IndexDate", "", "", "15 days after CensorDate"))
df$DrugEnd <- as.Date(df$DrugEnd, format="%Y-%m-%d") + 1
df$CensorDate <- as.Date(df$CensorDate, format="%Y-%m-%d") + 1
library(dplyr)
library(tidyr)
library(lubridate)
df2 <- df %>%
  mutate(across(IndexDate:DrugEnd, as.Date)) %>%
  filter(DrugStart <= CensorDate,  # Neglect days before IndexDate or after CensorDate
         DrugEnd >= IndexDate) %>%
  group_by(ID) %>%
  mutate(interval = list(int_diff(sort(unique(c(IndexDate, CensorDate, DrugStart, DrugEnd)))))) %>%
  unnest(interval) %>%
  mutate(DrugUse = DrugStart < int_end(interval) & DrugEnd > int_start(interval)) %>%
  group_by(ID, interval) %>%
  summarise(IndexDate = first(IndexDate),
            CensorDate = first(CensorDate),
            DrugUse = if_else(sum(DrugUse) > 0, 2, 1)) %>%
  ungroup() %>%
  filter(int_end(interval) <= CensorDate,
         int_start(interval) >= IndexDate) %>%
  mutate(TimeStart = as.numeric(difftime(int_start(interval), IndexDate, units = "days")),
         TimeEnd = as.numeric(difftime(int_end(interval), IndexDate, units = "days")) - 1) %>%
  group_by(ID, data.table::rleid(DrugUse)) %>%
  summarise(TimeStart = min(TimeStart),
            TimeEnd = max(TimeEnd),
            DrugUse = first(DrugUse)) %>%
  select(ID, TimeStart, TimeEnd, DrugUse)
> df2
# A tibble: 12 x 4
# Groups: ID [3]
ID TimeStart TimeEnd DrugUse
<dbl> <dbl> <dbl> <dbl>
1 1 0 30 1
2 1 31 45 2
3 1 46 58 1
4 1 59 119 2
5 1 120 150 1
6 1 151 155 2
7 1 156 180 1
8 2 0 29 2
9 2 30 90 1
10 3 0 3 1
11 3 4 14 2
12 3 15 183 1
#####################################################
2nd update:
If your dataset is very large (for example, more than one million records), the code above may be very slow: the data after unnest() becomes extremely large, and that step dominates the runtime.
In this case, we can split the data with split() (ideally no more than ten thousand records per file), process each piece in a loop (for (i in sequence) {statement}), and then combine the results with rbind().
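A minimal sketch of that split/loop/rbind approach (process_chunk is a hypothetical wrapper around the dplyr pipeline above; splitting on ID keeps each ID's rows together in one chunk):

```r
# Assign each ID to a chunk (~1,000 IDs per chunk here; tune to your data)
chunk_id <- ceiling(match(df$ID, unique(df$ID)) / 1000)
chunks <- split(df, chunk_id)

results <- vector("list", length(chunks))
for (i in seq_along(chunks)) {
  # process_chunk() is hypothetical: the interval pipeline above, wrapped in a function
  results[[i]] <- process_chunk(chunks[[i]])
}
out <- do.call(rbind, results)
```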
Good luck!
Using dplyr, tidyr and lubridate, this gets you close but not quite there:
df %>%
  mutate(across(IndexDate:DrugEnd, as.Date)) %>%
  filter(DrugStart <= CensorDate,  # Neglect days before IndexDate or after CensorDate
         DrugEnd >= IndexDate) %>%
  group_by(ID) %>%
  mutate(interval = list(int_diff(sort(unique(c(IndexDate, CensorDate, DrugStart, DrugEnd)))))) %>%
  unnest(interval) %>%
  mutate(DrugUse = DrugStart < int_end(interval) & DrugEnd > int_start(interval)) %>%
  group_by(ID, interval) %>%
  summarise(IndexDate = first(IndexDate),
            CensorDate = first(CensorDate),
            DrugUse = if_else(sum(DrugUse) > 0, 2, 1)) %>%
  ungroup() %>%
  filter(int_end(interval) <= CensorDate,
         int_start(interval) >= IndexDate) %>%
  mutate(TimeStart = as.numeric(difftime(int_start(interval), IndexDate, units = "days")),
         TimeEnd = as.numeric(difftime(int_end(interval), IndexDate, units = "days"))) %>%
  group_by(ID, data.table::rleid(DrugUse)) %>%
  summarise(TimeStart = min(TimeStart),
            TimeEnd = max(TimeEnd),
            DrugUse = first(DrugUse)) %>%
  select(ID, TimeStart, TimeEnd, DrugUse)
which gives
ID TimeStart TimeEnd DrugUse
<dbl> <dbl> <dbl> <dbl>
1 1 0 31 1
2 1 31 45 2
3 1 45 59 1
4 1 59 119 2
5 1 119 151 1
6 1 151 155 2
7 1 155 180 1
8 2 0 14 2
9 2 14 15 1
10 2 15 29 2
11 2 29 90 1
12 3 0 4 1
13 3 4 14 2
14 3 14 183 1

A running sum for daily data that resets when month turns

I have a 2-column table (tibble), made up of a date object and a numeric variable. There is a maximum of one entry per day, but not every day has an entry (i.e. date is a natural primary key). I am attempting to do a running sum of the numeric column along the dates, but with the running sum resetting when the month turns (the data is sorted by ascending date). I have replicated what I want to get as a result below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package "runner" is possibly suited to this, but I don't really understand how to instruct it.
I know I could use a join operation plus a group_by using dplyr, but the data set is very large and doing so would be wildly inefficient. I could also manually iterate through the list with a loop, but that seems inelegant. The last option I can think of is selecting out a unique vector of yearmon objects, cutting the original list into many shorter lists, and running a plain cumsum on each, but that also feels suboptimal.
I am sure this is not the first time someone has had to do this, and given how many tools there are in the tidyverse, I think I just need help finding the right one. The reason I am looking for a tool instead of using one of the methods described above (which would take less time than writing this post) is that this code needs to be very readable by an audience that is less comfortable with code.
We can also use data.table:
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
          ][, monthly.running.sum := cumsum(score), format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract month and year values from the date, group_by those values, and then perform the cumulative sum as follows:
library(lubridate)
library(dplyr)
df %>%
  mutate(Month = month(mdy(Date)),
         Year = year(mdy(Date))) %>%
  group_by(Month, Year) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
An alternative is to use the floor_date function to convert each date to the first day of its month and then calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>%
  mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
  group_by(Floor) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative:
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"),FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects, so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(DF, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33

Calculate average number of individuals present on each date in R

I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to give a sample of your data in a friendlier format using dput(yourData) so that others can easily regenerate it. Here is the dput() output you could have shared:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16,
46, 26, 89, 12, 14, 18, 19, 39), start.date = structure(c(17310,
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"),
end.date = structure(c(17322, 17306, 17309, 17299, 17300,
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame", row.names =
c(NA,
-9L))
To do this easily, we first need to expand start.date and end.date into individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)) {
  expand <- data.frame(site = dat$site[i],
                       ID = dat$ID[i],
                       Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
  newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
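As an aside (my sketch, not part of the original answer), the same expansion can be done without the explicit loop, e.g. with dplyr and tidyr:

```r
library(dplyr)
library(tidyr)

newDat <- dat %>%
  rowwise() %>%
  mutate(Dates = list(seq.Date(start.date, end.date, by = 1))) %>%  # one date sequence per row
  ungroup() %>%
  unnest(Dates) %>%
  select(site, ID, Dates)
```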
Then we calculate the number of other individuals present at each site on each day:
individualCount <- newDat %>%
  group_by(site, Dates) %>%
  summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
  group_by(site, ID) %>%
  summarise(duration = max(Dates) - min(Dates) + 1,
            av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 2.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 2.33
6 2 14 7 1.14
7 2 18 21 0.857
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required column to the original dataset (dat) again with left_join():
dat <- dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857

R: cumulative sum over rolling date range

In R, how can I calculate a cumulative sum over a defined time period prior to each row? Prefer dplyr if possible.
For example, if the period was 10 days, then the function would achieve cum_rolling10:
date value cumsum cum_rolling10
1/01/2000 9 9 9
2/01/2000 1 10 10
5/01/2000 9 19 19
6/01/2000 3 22 22
7/01/2000 4 26 26
8/01/2000 3 29 29
13/01/2000 10 39 29
14/01/2000 9 48 38
18/01/2000 2 50 21
19/01/2000 9 59 30
21/01/2000 8 67 38
25/01/2000 5 72 24
26/01/2000 1 73 25
30/01/2000 6 79 20
31/01/2000 6 85 18
A solution using dplyr, tidyr, lubridate, and zoo:
library(dplyr)
library(tidyr)
library(lubridate)
library(zoo)

dt2 <- dt %>%
  mutate(date = dmy(date)) %>%
  mutate(cumsum = cumsum(value)) %>%
  complete(date = full_seq(date, period = 1), fill = list(value = 0)) %>%
  mutate(cum_rolling10 = rollapplyr(value, width = 10, FUN = sum, partial = TRUE)) %>%
  drop_na(cumsum)
dt2
# A tibble: 15 x 4
date value cumsum cum_rolling10
<date> <dbl> <int> <dbl>
1 2000-01-01 9 9 9
2 2000-01-02 1 10 10
3 2000-01-05 9 19 19
4 2000-01-06 3 22 22
5 2000-01-07 4 26 26
6 2000-01-08 3 29 29
7 2000-01-13 10 39 29
8 2000-01-14 9 48 38
9 2000-01-18 2 50 21
10 2000-01-19 9 59 30
11 2000-01-21 8 67 38
12 2000-01-25 5 72 24
13 2000-01-26 1 73 25
14 2000-01-30 6 79 20
15 2000-01-31 6 85 18
DATA
dt <- structure(list(date = c("1/01/2000", "2/01/2000", "5/01/2000",
"6/01/2000", "7/01/2000", "8/01/2000", "13/01/2000", "14/01/2000",
"18/01/2000", "19/01/2000", "21/01/2000", "25/01/2000", "26/01/2000",
"30/01/2000", "31/01/2000"), value = c(9L, 1L, 9L, 3L, 4L, 3L,
10L, 9L, 2L, 9L, 8L, 5L, 1L, 6L, 6L)), .Names = c("date", "value"
), row.names = c(NA, -15L), class = "data.frame")
I recommend the runner package, which is designed to apply functions over rolling/running windows. You can achieve this with sum_run as a one-liner:
library(runner)
library(dplyr)
df <- df %>%
  mutate(
    cum_rolling_10 = sum_run(
      x = df$value,
      k = 10,
      idx = as.Date(df$date, format = "%d/%m/%Y"))
  )
df
# date value cum_rolling_10
# 1 1/01/2000 9 9
# 2 2/01/2000 1 10
# 3 5/01/2000 9 19
# 4 6/01/2000 3 22
# 5 7/01/2000 4 26
# 6 8/01/2000 3 29
# 7 13/01/2000 10 29
# 8 14/01/2000 9 38
# 9 18/01/2000 2 21
# 10 19/01/2000 9 30
# 11 21/01/2000 8 38
# 12 25/01/2000 5 24
# 13 26/01/2000 1 25
# 14 30/01/2000 6 20
# 15 31/01/2000 6 18
Enjoy!
This solution avoids memory overhead, and migrating it to sparklyr would be easy:
lag <- 7
dt %>%
  mutate(date = dmy(date)) %>%
  mutate(order = as.numeric(date - min(date))) %>%  # on sparklyr, use datediff(date, min(date))
  arrange(desc(order)) %>%
  mutate(n_order = lag(order + lag, 1L, default = 0)) %>%
  mutate(b_order = ifelse(order - n_order >= 0, order, -1)) %>%
  mutate(m_order = cummax(b_order)) %>%
  group_by(m_order) %>%
  mutate(rolling_value = cumsum(value))
Use slide_index_sum() from slider, which is designed to have the same API as purrr.
library(slider)
library(dplyr)
df <- tibble(
date = c(
"1/01/2000", "2/01/2000", "5/01/2000", "6/01/2000", "7/01/2000",
"8/01/2000", "13/01/2000", "14/01/2000", "18/01/2000", "19/01/2000",
"21/01/2000", "25/01/2000", "26/01/2000", "30/01/2000", "31/01/2000"
),
value = c(9L, 1L, 9L, 3L, 4L, 3L, 10L, 9L, 2L, 9L, 8L, 5L, 1L, 6L, 6L)
)
df <- mutate(df, date = as.Date(date, format = "%d/%m/%Y"))
df %>%
mutate(
cumsum = cumsum(value),
cum_rolling10 = slide_index_sum(value, date, before = 9L)
)
#> # A tibble: 15 × 4
#> date value cumsum cum_rolling10
#> <date> <int> <int> <dbl>
#> 1 2000-01-01 9 9 9
#> 2 2000-01-02 1 10 10
#> 3 2000-01-05 9 19 19
#> 4 2000-01-06 3 22 22
#> 5 2000-01-07 4 26 26
#> 6 2000-01-08 3 29 29
#> 7 2000-01-13 10 39 29
#> 8 2000-01-14 9 48 38
#> 9 2000-01-18 2 50 21
#> 10 2000-01-19 9 59 30
#> 11 2000-01-21 8 67 38
#> 12 2000-01-25 5 72 24
#> 13 2000-01-26 1 73 25
#> 14 2000-01-30 6 79 20
#> 15 2000-01-31 6 85 18

Remove dates which are not continuous in the data in R

I have a dataframe and I want to filter out the entries that are not continuous in date. In other words, I am looking for clusters of continuous dates.
a %>% group_by(day) %>% summarise(count = n()) %>% mutate(day_dif = day - lag(day))
Source: local data frame [20 x 3]
day count day_dif
(date) (int) (dfft)
1 2016-02-02 12 NA days
2 2016-02-03 80 1 days
3 2016-02-04 102 1 days
4 2016-02-05 97 1 days
5 2016-02-06 118 1 days
6 2016-02-07 115 1 days
7 2016-02-08 4 1 days
8 2016-02-20 13 12 days
9 2016-02-21 136 1 days
10 2016-02-22 114 1 days
11 2016-02-23 134 1 days
12 2016-02-24 126 1 days
13 2016-02-25 128 1 days
14 2016-02-26 63 1 days
15 2016-02-27 118 1 days
16 2016-03-06 1 8 days
17 2016-03-29 28 23 days
18 2016-04-03 18 5 days
19 2016-04-08 18 5 days
20 2016-04-27 23 19 days
In this, I want to filter out entries that are not continuous in date. For example, 2016-03-06, 2016-03-29, and 2016-04-03 are single-day entries, which need to be removed. I am looking only for entries from runs of consecutive days, i.e. entries that span multiple days. My ideal output is:
day count day_dif Cluster
(date) (int) (dfft)
1 2016-02-02 12 NA days 1
2 2016-02-03 80 1 days 1
3 2016-02-04 102 1 days 1
4 2016-02-05 97 1 days 1
5 2016-02-06 118 1 days 1
6 2016-02-07 115 1 days 1
7 2016-02-08 4 1 days 1
8 2016-02-20 13 12 days 2
9 2016-02-21 136 1 days 2
10 2016-02-22 114 1 days 2
11 2016-02-23 134 1 days 2
12 2016-02-24 126 1 days 2
13 2016-02-25 128 1 days 2
14 2016-02-26 63 1 days 2
15 2016-02-27 118 1 days 2
Here the Cluster column indicates the date clusters, and the single dates are removed from the output: 1 in the Cluster column indicates the first group of dates and 2 the second group. If there are more than 3 continuous days, I want to consider them one cluster.
I have been trying to do this with lag functions and the like, but without much success. Can anybody help me? Any ideas would be appreciated.
Thanks
We can use rle to subset the rows:
i1 <- c(TRUE, a1$day_dif[-1] >= 3)
i2 <- inverse.rle(within.list(rle(i1), {
  values1 <- values
  values[values1 & lengths > 3] <- FALSE
  values[!values1] <- TRUE
}))
a1$Cluster <- cumsum(i1)
a1[i2, ]
# day count day_dif Cluster
#1 2016-02-02 12 NA days 1
#2 2016-02-03 80 1 days 1
#3 2016-02-04 102 1 days 1
#4 2016-02-05 97 1 days 1
#5 2016-02-06 118 1 days 1
#6 2016-02-07 115 1 days 1
#7 2016-02-08 4 1 days 1
#8 2016-02-20 13 12 days 2
#9 2016-02-21 136 1 days 2
#10 2016-02-22 114 1 days 2
#11 2016-02-23 134 1 days 2
#12 2016-02-24 126 1 days 2
#13 2016-02-25 128 1 days 2
#14 2016-02-26 63 1 days 2
#15 2016-02-27 118 1 days 2
The above code can also be chained (%>%):
a1 %>%
  mutate(i1 = c(TRUE, day_dif[-1] >= 3)) %>%
  do(data.frame(., i2 = inverse.rle(within.list(rle(.$i1), {
    values1 <- values
    values[values1 & lengths > 3] <- FALSE
    values[!values1] <- TRUE
  })))) %>%
  mutate(Cluster = cumsum(i1)) %>%
  filter(i2) %>%
  select(-i1, -i2)
# day count day_dif Cluster
#1 2016-02-02 12 NA days 1
#2 2016-02-03 80 1 days 1
#3 2016-02-04 102 1 days 1
#4 2016-02-05 97 1 days 1
#5 2016-02-06 118 1 days 1
#6 2016-02-07 115 1 days 1
#7 2016-02-08 4 1 days 1
#8 2016-02-20 13 12 days 2
#9 2016-02-21 136 1 days 2
#10 2016-02-22 114 1 days 2
#11 2016-02-23 134 1 days 2
#12 2016-02-24 126 1 days 2
#13 2016-02-25 128 1 days 2
#14 2016-02-26 63 1 days 2
#15 2016-02-27 118 1 days 2
data
a <- structure(list(day = structure(c(16833, 16834, 16835, 16836,
16837, 16838, 16839, 16851, 16852, 16853, 16854, 16855, 16856,
16857, 16858, 16866, 16889, 16894, 16899, 16918), class = "Date"),
count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L,
114L, 134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L
)), .Names = c("day", "count"), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"), class = "data.frame")
a1 <- a %>%
  mutate(day_dif = day - lag(day))
There is probably a better way to deal with the first NA value; here, I manually assigned it to 0. Then, because the difference between consecutive continuous dates is 1, you can take advantage of this property to create a boolean vector and then use cumsum to get the cluster IDs. Finally, you can remove those groups whose length is equal to 1.
# Let the first NA equal 0
df[which(is.na(df), arr.ind = TRUE)] <- 0
df %>%
  mutate(cluster = cumsum(day_dif != 1)) %>%
  group_by(cluster) %>%
  filter(length(cluster) > 1) %>%
  ungroup()
# Source: local data frame [15 x 4]
# day count day_dif cluster
# (date) (int) (dfft) (int)
# 1 2016-02-02 12 0 days 1
# 2 2016-02-03 80 1 days 1
# 3 2016-02-04 102 1 days 1
# 4 2016-02-05 97 1 days 1
# 5 2016-02-06 118 1 days 1
# 6 2016-02-07 115 1 days 1
# 7 2016-02-08 4 1 days 1
# 8 2016-02-20 13 12 days 2
# 9 2016-02-21 136 1 days 2
# 10 2016-02-22 114 1 days 2
# 11 2016-02-23 134 1 days 2
# 12 2016-02-24 126 1 days 2
# 13 2016-02-25 128 1 days 2
# 14 2016-02-26 63 1 days 2
# 15 2016-02-27 118 1 days 2
