I have a large drug use database:
library(data.table)
df <- data.frame("ID" = c(1,1,1,1,2,2,2,3,3), "IndexDate" = c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01", "2019-05-01", "2019-05-01", "2019-05-01", "2019-07-01", "2019-07-01"), "CensorDate" = c("2019-06-30", "2019-06-30", "2019-06-30", "2019-06-30", "2019-07-30", "2019-07-30", "2019-07-30", "2019-12-31", "2019-12-31"), "DrugStart" = c("2019-02-01", "2019-03-01", "2019-04-01", "2019-06-01", "2019-03-01", "2019-04-15", "2019-05-16", "2019-07-05", "2020-01-01"), "DrugEnd" = c("2019-02-15", "2019-04-15", "2019-04-30", "2019-06-05", "2019-03-15", "2019-05-15", "2019-05-30", "2019-07-15", "2020-01-15"),"Notes" = c("", "", "Overlap 15 days", "", "All days before IndexDate", "15 days before IndexDate", "", "", "15 days after CensorDate"))
df
ID IndexDate CensorDate DrugStart DrugEnd Notes
1 1 2019-01-01 2019-06-30 2019-02-01 2019-02-15
2 1 2019-01-01 2019-06-30 2019-03-01 2019-04-15
3 1 2019-01-01 2019-06-30 2019-04-01 2019-04-30 Overlap 15 days
4 1 2019-01-01 2019-06-30 2019-06-01 2019-06-05
5 2 2019-05-01 2019-07-30 2019-03-01 2019-03-15 All days before IndexDate
6 2 2019-05-01 2019-07-30 2019-04-15 2019-05-15 15 days before IndexDate
7 2 2019-05-01 2019-07-30 2019-05-16 2019-05-30
8 3 2019-07-01 2019-12-31 2019-07-05 2019-07-15
9 3 2019-07-01 2019-12-31 2020-01-01 2020-01-15 15 days after CensorDate
The IndexDate and CensorDate are the same for every row of a given ID. The observation period runs from IndexDate to CensorDate.
I would like to rearrange it according to the following criteria:
Rows are linked by ID;
Ignore days before IndexDate or after CensorDate;
Overlapping time periods are counted only once;
df is a drug use database. Every period in df (from DrugStart to DrugEnd) means the drug was used. Any period not covered in df but within the observation period (from IndexDate to CensorDate) means the drug was not used;
Drug use is labeled 2 (use) or 1 (no use);
IndexDate is defined as Day 0 (so the first TimeStart of every ID is 0).
I expect results as follows:
> df2 <- data.frame("ID" = c(1,1,1,1,1,1,1,2,2,3,3,3), "TimeStart" = c("0", "31", "46", "59", "120", "151", "156", "0", "30", "0", "4", "15"), "TimeEnd" = c("30", "45", "58", "119", "150", "155", "180", "29", "90", "3", "14", "183"), "DrugUse" = c("1", "2", "1", "2", "1", "2", "1", "2", "1", "1", "2", "1"))
> df2
ID TimeStart TimeEnd DrugUse
1 1 0 30 1
2 1 31 45 2
3 1 46 58 1
4 1 59 119 2
5 1 120 150 1
6 1 151 155 2
7 1 156 180 1
8 2 0 29 2
9 2 30 90 1
10 3 0 3 1
11 3 4 14 2
12 3 15 183 1
Now, I know how to generate TimeStart and TimeEnd by "DrugStart-IndexDate" and "DrugEnd-IndexDate", as follows:
df$TimeStart<- as.Date(df$DrugStart, format="%Y-%m-%d")-as.Date(df$IndexDate, format="%Y-%m-%d")
df$TimeEnd<- as.Date(df$DrugEnd, format="%Y-%m-%d")-as.Date(df$IndexDate, format="%Y-%m-%d")
df
ID IndexDate CensorDate DrugStart DrugEnd Notes_Drug.use.days TimeStart TimeEnd
1 1 2019-01-01 2019-06-30 2019-02-01 2019-02-15 15days 31 days 45 days
2 1 2019-01-01 2019-06-30 2019-03-01 2019-04-15 46days 59 days 104 days
3 1 2019-01-01 2019-06-30 2019-04-01 2019-04-30 Overlap 15days + 15days 90 days 119 days
4 1 2019-01-01 2019-06-30 2019-06-01 2019-06-05 5days 151 days 155 days
5 2 2019-05-01 2019-07-30 2019-03-01 2019-03-15 15days before IndexDate -61 days -47 days
6 2 2019-05-01 2019-07-30 2019-04-15 2019-05-15 15days before IndexDate+15days -16 days 14 days
7 2 2019-05-01 2019-07-30 2019-05-16 2019-05-30 15days 15 days 29 days
8 3 2019-07-01 2019-12-31 2019-07-05 2019-07-15 11days 4 days 14 days
9 3 2019-07-01 2019-12-31 2020-01-01 2020-01-15 15days after CensorDate 184 days 198 days
But I do not know how to handle overlapping periods and contiguous periods, as in the following:
# Overlapped periods:
# Transform
ID TimeStart TimeEnd
2 1 59 days 104 days
3 1 90 days 119 days
# to
ID TimeStart TimeEnd
2 1 59 days 119 days
# And contiguous periods:
# Transform
ID TimeStart TimeEnd
6 2 -16 days 14 days
7 2 15 days 29 days
# To
ID TimeStart TimeEnd
6 2 0 days 29 days
Also, how can I add the periods in which the drug is not used (DrugUse = 1), such as these rows?
ID TimeStart TimeEnd DrugUse
1 1 0 30 1
3 1 46 58 1
5 1 120 150 1
7 1 156 180 1
9 2 30 90 1
10 3 0 3 1
12 3 15 183 1
Can anyone help me? Thank you very much!
#####################################################
Updated:
Thank you for Bas's answer!! I made minor revisions on Bas's answer. The following code might be the final version!!
library(data.table)
df <- data.frame("ID" = c(1,1,1,1,2,2,2,3,3), "IndexDate" = c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01", "2019-05-01", "2019-05-01", "2019-05-01", "2019-07-01", "2019-07-01"), "CensorDate" = c("2019-06-30", "2019-06-30", "2019-06-30", "2019-06-30", "2019-07-30", "2019-07-30", "2019-07-30", "2019-12-31", "2019-12-31"), "DrugStart" = c("2019-02-01", "2019-03-01", "2019-04-01", "2019-06-01", "2019-03-01", "2019-04-15", "2019-05-16", "2019-07-05", "2020-01-01"), "DrugEnd" = c("2019-02-15", "2019-04-15", "2019-04-30", "2019-06-05", "2019-03-15", "2019-05-15", "2019-05-30", "2019-07-15", "2020-01-15"),"Notes" = c("", "", "Overlap 15 days", "", "All days before IndexDate", "15 days before IndexDate", "", "", "15 days after CensorDate"))
df$DrugEnd <- as.Date(df$DrugEnd, format="%Y-%m-%d") + 1
df$CensorDate <- as.Date(df$CensorDate, format="%Y-%m-%d") + 1
library(dplyr)
library(tidyr)
library(lubridate)
df2 <- df %>%
mutate(across(IndexDate:DrugEnd, as.Date)) %>%
filter(DrugStart <= CensorDate, # Neglect days before IndexDate or after CensorDate
DrugEnd >= IndexDate) %>%
group_by(ID) %>%
mutate(interval = list(int_diff(sort(unique(c(IndexDate, CensorDate, DrugStart, DrugEnd)))))) %>%
unnest(interval) %>%
mutate(DrugUse = DrugStart < int_end(interval) & DrugEnd > int_start(interval)) %>%
group_by(ID, interval) %>%
summarise(IndexDate = first(IndexDate),
CensorDate = first(CensorDate),
DrugUse = if_else(sum(DrugUse) > 0, 2, 1)) %>%
ungroup() %>%
filter(int_end(interval) <= CensorDate,
int_start(interval) >= IndexDate) %>%
mutate(TimeStart = as.numeric(difftime(int_start(interval), IndexDate, units = "days")),
TimeEnd = as.numeric(difftime(int_end(interval), IndexDate, units = "days"))-1) %>%
group_by(ID, data.table::rleid(DrugUse)) %>%
summarise(TimeStart = min(TimeStart),
TimeEnd = max(TimeEnd),
DrugUse = first(DrugUse)) %>%
select(ID, TimeStart, TimeEnd, DrugUse)
> df2
# A tibble: 12 x 4
# Groups: ID [3]
ID TimeStart TimeEnd DrugUse
<dbl> <dbl> <dbl> <dbl>
1 1 0 30 1
2 1 31 45 2
3 1 46 58 1
4 1 59 119 2
5 1 120 150 1
6 1 151 155 2
7 1 156 180 1
8 2 0 29 2
9 2 30 90 1
10 3 0 3 1
11 3 4 14 2
12 3 15 183 1
#####################################################
2nd update:
If your dataset is very large (for example, more than one million records), the above code may be very slow. The data frame produced by unnest() is extremely large, which makes this step slow.
In that case, we can split the data with split() (preferably no more than ten thousand records per piece), process each piece in a loop (for(i in sequence){statement}), and then combine the results with rbind().
Good luck!
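The split / loop / rbind idea above could be sketched roughly as follows. This is only a minimal illustration: process_chunk() is a hypothetical stand-in for the dplyr pipeline (here it just counts rows per ID), the column x and the chunk size are made up, and the split is done on whole IDs so that all rows of one ID end up in the same piece.

```r
# Toy data; `x` is a placeholder column, not part of the original data
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3), x = 1:9)

# Hypothetical stand-in for the dplyr pipeline above:
# here it only counts the rows of each ID
process_chunk <- function(chunk) {
  aggregate(x ~ ID, data = chunk, FUN = length)
}

ids_per_chunk <- 2                      # assumed chunk size, tune to your memory
ids <- unique(df$ID)
chunk_of_id <- ceiling(seq_along(ids) / ids_per_chunk)

# Split by chunk number so that all rows of one ID stay in the same chunk
chunks <- split(df, chunk_of_id[match(df$ID, ids)])

# Process each chunk in a loop, then combine the pieces
results <- vector("list", length(chunks))
for (i in seq_along(chunks)) {
  results[[i]] <- process_chunk(chunks[[i]])
}
combined <- do.call(rbind, results)
rownames(combined) <- NULL
combined
```

Splitting on ID rather than on raw row counts matters here because the pipeline groups by ID; cutting one ID across two pieces would give wrong intervals for it.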
Using dplyr, tidyr and lubridate, this gets you close but not quite there:
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
mutate(across(IndexDate:DrugEnd, as.Date)) %>%
filter(DrugStart <= CensorDate, # Neglect days before IndexDate or after CensorDate
DrugEnd >= IndexDate) %>%
group_by(ID) %>%
mutate(interval = list(int_diff(sort(unique(c(IndexDate, CensorDate, DrugStart, DrugEnd)))))) %>%
unnest(interval) %>%
mutate(DrugUse = DrugStart < int_end(interval) & DrugEnd > int_start(interval)) %>%
group_by(ID, interval) %>%
summarise(IndexDate = first(IndexDate),
CensorDate = first(CensorDate),
DrugUse = if_else(sum(DrugUse) > 0, 2, 1)) %>%
ungroup() %>%
filter(int_end(interval) <= CensorDate,
int_start(interval) >= IndexDate) %>%
mutate(TimeStart = as.numeric(difftime(int_start(interval), IndexDate, units = "days")),
TimeEnd = as.numeric(difftime(int_end(interval), IndexDate, units = "days"))) %>%
group_by(ID, data.table::rleid(DrugUse)) %>%
summarise(TimeStart = min(TimeStart),
TimeEnd = max(TimeEnd),
DrugUse = first(DrugUse)) %>%
select(ID, TimeStart, TimeEnd, DrugUse)
which gives
ID TimeStart TimeEnd DrugUse
<dbl> <dbl> <dbl> <dbl>
1 1 0 31 1
2 1 31 45 2
3 1 45 59 1
4 1 59 119 2
5 1 119 151 1
6 1 151 155 2
7 1 155 180 1
8 2 0 14 2
9 2 14 15 1
10 2 15 29 2
11 2 29 90 1
12 3 0 4 1
13 3 4 14 2
14 3 14 183 1
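If the unnest() step is the bottleneck, one possible alternative is to merge the intervals directly on the day offsets. The following is a minimal base R sketch, not the method of the answer above: drug_timeline() is a hypothetical helper, and it assumes TimeStart and TimeEnd have already been computed per ID as inclusive day offsets from IndexDate (as in the question), with last_day being CensorDate minus IndexDate.

```r
# Merge inclusive day intervals for ONE ID and fill the gaps in [0, last_day].
# Labels follow the question: 2 = drug use, 1 = no use.
drug_timeline <- function(start, end, last_day) {
  # Drop periods entirely outside the observation window, then clip the rest
  keep <- end >= 0 & start <= last_day
  s <- pmax(start[keep], 0)
  e <- pmin(end[keep], last_day)
  if (length(s) == 0) {
    return(data.frame(TimeStart = 0, TimeEnd = last_day, DrugUse = 1))
  }
  o <- order(s)
  s <- s[o]
  e <- e[o]

  # A new block starts when a period begins more than one day after
  # the latest end seen so far (overlapping and adjacent periods are merged)
  run_end <- cummax(e)
  new_block <- c(TRUE, s[-1] > head(run_end, -1) + 1)
  grp <- cumsum(new_block)
  use_start <- as.vector(tapply(s, grp, min))
  use_end <- as.vector(tapply(e, grp, max))

  # Gaps before, between and after the merged blocks are "no use" periods
  gap_start <- c(0, use_end + 1)
  gap_end <- c(use_start - 1, last_day)
  ok <- gap_start <= gap_end

  out <- rbind(
    data.frame(TimeStart = use_start, TimeEnd = use_end,
               DrugUse = rep(2, length(use_start))),
    data.frame(TimeStart = gap_start[ok], TimeEnd = gap_end[ok],
               DrugUse = rep(1, sum(ok)))
  )
  out <- out[order(out$TimeStart), ]
  rownames(out) <- NULL
  out
}

# Day offsets for ID 1 and ID 2, taken from the question
id1 <- drug_timeline(c(31, 59, 90, 151), c(45, 104, 119, 155), last_day = 180)
id2 <- drug_timeline(c(-61, -16, 15), c(-47, 14, 29), last_day = 90)
id1
id2
```

For these two IDs the result should match the corresponding rows of the expected df2 in the question. To cover all IDs, the helper could be applied to each piece of split(df, df$ID) and the results bound together.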
I want to convert days into weeks with all the values from that week summed up
Right now I have the following df
Date x
1 2018-02-23 15
2 2018-03-26 4
3 2018-03-29 3
4 2018-03-30 6
5 2018-04-03 5
6 2018-04-04 12
7 2018-04-05 7
8 2018-04-06 5
9 2018-04-07 5
10 2018-04-09 13
11 2018-04-10 8
12 2018-04-11 2
ETC.
The x in this df stands for the number of items sent on a given day.
There are days in this df on which no items were transported.
This df has a total of 688 rows.
What I would like to see is:
Date x
1 Week 8 2018 19
2 Week 9 2018 26
3 Week 10 2018 33
ETC.
Can someone help me out?
You can use aggregate and get the weeks with format %V:
aggregate(df$x, list(Date=format(df$Date, "%V %Y")), sum)
# Date x
#1 08 2018 15
#2 13 2018 13
#3 14 2018 34
#4 15 2018 23
Or with Week (thanks to @sindri-baldur for the comment):
aggregate(df$x, list(Date=sub("^0?", "Week ", format(df$Date, "%V %Y"))), sum)
#aggregate(df$x, list(Date=format(df$Date, "Week %-V %Y")), sum) #Alternative
# Date x
#1 Week 13 2018 13
#2 Week 14 2018 34
#3 Week 15 2018 23
#4 Week 8 2018 15
Data:
df <- read.table(header=TRUE, text=" Date x
1 2018-02-23 15
2 2018-03-26 4
3 2018-03-29 3
4 2018-03-30 6
5 2018-04-03 5
6 2018-04-04 12
7 2018-04-05 7
8 2018-04-06 5
9 2018-04-07 5
10 2018-04-09 13
11 2018-04-10 8
12 2018-04-11 2")
df$Date <- as.Date(df$Date)
library(lubridate)
library(tidyverse)
## Random data
df <- data.frame(date=seq.Date(from = as.Date("2018-01-01"), to=as.Date("2018-12-31"), by = "day"),x=runif(n=365,min=0,max=25))
## Aggregating by week
df2 <- df %>%
mutate(week = lubridate::week(ymd(date))) %>%
group_by(week) %>%
summarise(total_per_week = sum(x))
Using collapse
library(collapse)
library(lubridate)
library(magrittr)
df %>%
ftransform(week = week(ymd(Date))) %>%
fgroup_by(week) %>%
fsummarise(total_per_week = fsum(x))
# week total_per_week
#1 8 15
#2 13 13
#3 14 34
#4 15 23
data
df <- structure(list(Date = c("2018-02-23", "2018-03-26", "2018-03-29",
"2018-03-30", "2018-04-03", "2018-04-04", "2018-04-05", "2018-04-06",
"2018-04-07", "2018-04-09", "2018-04-10", "2018-04-11"), x = c(15L,
4L, 3L, 6L, 5L, 12L, 7L, 5L, 5L, 13L, 8L, 2L)), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
@akrun This almost worked. Right now I get 52 rows out of 3 years of data:
week total_per_week
1 1 246
2 2 491
3 3 458
4 4 556
5 5 454
6 6 594
7 7 592
8 8 496
9 9 567
10 10 615
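The 52 rows are probably a consequence of grouping by the week number alone, so the same week of different years falls into one group. A minimal base R sketch of one way around this is to group by year and week together. The third row below (2019) is made-up illustrative data, and the "%G-W%V" label (ISO week-based year plus ISO week) is just one possible choice of key.

```r
# Two rows from the question plus one made-up row from the following year
df <- data.frame(
  Date = as.Date(c("2018-02-23", "2018-03-26", "2019-02-22")),
  x = c(15, 4, 7)
)

# %G is the ISO 8601 week-based year and %V the ISO week number,
# so week 8 of 2018 and week 8 of 2019 get different keys
df$yearweek <- format(df$Date, "%G-W%V")

weekly <- aggregate(x ~ yearweek, data = df, FUN = sum)
weekly
```

Using %G rather than %Y together with %V avoids mislabelling the days around New Year, where the ISO week can belong to the neighbouring calendar year.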
I have a large dataset of processes (their IDs), start-dates and corresponding end dates.
What I want has two parts. First, how many processes are running on each day. Second, the mean number of days the running processes have been running since they started.
A sample data set:
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
> df
Process Start End
1 P001 01-01-2020 10-01-2020
2 P002 02-01-2020 09-01-2020
3 P003 03-01-2020 04-01-2020
4 P004 08-01-2020 17-01-2020
5 P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
arrange(dates, event) %>%
mutate(processes = ifelse(event == 'Start', 1, -1),
processes = cumsum(processes)) %>%
select(-Process, -event) %>%
complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
fill(processes)
# A tibble: 20 x 2
dates processes
<date> <dbl>
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 2
6 2020-01-06 2
7 2020-01-07 2
8 2020-01-08 3
9 2020-01-09 3
10 2020-01-10 2
11 2020-01-11 1
12 2020-01-12 1
13 2020-01-13 2
14 2020-01-14 2
15 2020-01-15 2
16 2020-01-16 2
17 2020-01-17 2
18 2020-01-18 1
19 2020-01-19 1
20 2020-01-20 0
For the second part, the desired output is a mean days column: for each date, the mean number of days that the processes running on that date have already been running.
A tidyverse approach is preferred, please.
Here is one approach :
library(tidyverse)
df %>%
#Convert to date
mutate(across(c(Start, End), lubridate::dmy),
#Create a sequence of dates from start to end
Dates = map2(Start, End, seq, by = 'day')) %>%
#Get data in long format
unnest(Dates) %>%
#Remove columns
select(-Start, -End) %>%
#For each process
group_by(Process) %>%
#Count number of days spent on it
mutate(days_spent = row_number() - 1) %>%
#For each date
group_by(Dates) %>%
#Count number of process running and average days
summarise(process = n(),
mean_days = mean(days_spent))
This returns :
# Dates process mean_days
# <date> <int> <dbl>
# 1 2020-01-01 1 0
# 2 2020-01-02 2 0.5
# 3 2020-01-03 3 1
# 4 2020-01-04 3 2
# 5 2020-01-05 2 3.5
# 6 2020-01-06 2 4.5
# 7 2020-01-07 2 5.5
# 8 2020-01-08 3 4.33
# 9 2020-01-09 3 5.33
#10 2020-01-10 2 5.5
#11 2020-01-11 1 3
#12 2020-01-12 1 4
#13 2020-01-13 2 2.5
#14 2020-01-14 2 3.5
#15 2020-01-15 2 4.5
#16 2020-01-16 2 5.5
#17 2020-01-17 2 6.5
#18 2020-01-18 1 5
#19 2020-01-19 1 6
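As a cross-check of what mean_days means, the same numbers can be computed date by date in base R. This is only a sketch of the definition (days elapsed since Start, averaged over the processes running on that date), not a replacement for the tidyverse answer above.

```r
df <- data.frame(
  Process = c("P001", "P002", "P003", "P004", "P005"),
  Start = as.Date(c("01-01-2020", "02-01-2020", "03-01-2020",
                    "08-01-2020", "13-01-2020"), format = "%d-%m-%Y"),
  End = as.Date(c("10-01-2020", "09-01-2020", "04-01-2020",
                  "17-01-2020", "19-01-2020"), format = "%d-%m-%Y")
)

days <- seq(min(df$Start), max(df$End), by = "day")

# Loop over indices rather than over `days` itself,
# because lapply() on a Date vector drops the Date class
res <- do.call(rbind, lapply(seq_along(days), function(i) {
  d <- days[i]
  running <- df$Start <= d & df$End >= d          # processes active on day d
  data.frame(
    Dates = d,
    process = sum(running),
    mean_days = mean(as.numeric(d - df$Start[running]))
  )
}))
res
```

For example, on 2020-01-08 the running processes are P001 (7 days), P002 (6 days) and P004 (0 days), so mean_days is 13/3, which matches the 4.33 shown above.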
I have yearly observations of income for a series of geographies, like this:
library(dplyr)
library(lubridate)
date <- c("2004-01-01", "2005-01-01", "2006-01-01",
"2004-01-01", "2005-01-01", "2006-01-01")
geo <- c(1, 1, 1, 2, 2, 2)
inc <- c(10, 12, 14, 32, 34, 50)
data <- tibble(date = ymd(date), geo, inc)
date geo inc
<date> <dbl> <dbl>
1 2004-01-01 1 10
2 2005-01-01 1 12
3 2006-01-01 1 14
4 2004-01-01 2 32
5 2005-01-01 2 34
6 2006-01-01 2 50
I need to insert mid-year values, as averages of the start-of-year and end-of-year observations, so that the data is every 6 months. The outcome would look like this:
2004-01-01 1 10
2004-06-01 1 11
2005-01-01 1 12
2005-06-01 1 13
2006-01-01 1 14
2004-01-01 2 32
2004-06-01 2 33
2005-01-01 2 34
2005-06-01 2 42
2006-01-01 2 50
Would appreciate any ideas.
Grouped by 'geo', add (+) the 'inc' to the next value (lead) and divide by 2 to get the average, add 5 months to the 'date', then filter out the NA elements in 'inc' and bind the rows with the original data
library(dplyr)
library(lubridate)
data %>%
group_by(geo) %>%
summarise(date = date %m+% months(5),
inc = (inc + lead(inc))/2, .groups = 'drop') %>%
filter(!is.na(inc)) %>%
bind_rows(data, .) %>%
arrange(geo, date)
-output
# A tibble: 10 x 3
# date geo inc
# <date> <dbl> <dbl>
# 1 2004-01-01 1 10
# 2 2004-06-01 1 11
# 3 2005-01-01 1 12
# 4 2005-06-01 1 13
# 5 2006-01-01 1 14
# 6 2004-01-01 2 32
# 7 2004-06-01 2 33
# 8 2005-01-01 2 34
# 9 2005-06-01 2 42
#10 2006-01-01 2 50
You can use complete to create a sequence of dates for 6 months and then use na.approx to fill the NA values with interpolated values.
library(dplyr)
library(lubridate)
data %>%
group_by(geo) %>%
tidyr::complete(date = seq(min(date), max(date), by = '6 months')) %>%
mutate(date = if_else(is.na(inc), date %m-% months(1), date),
inc = zoo::na.approx(inc))
# geo date inc
# <dbl> <date> <dbl>
# 1 1 2004-01-01 10
# 2 1 2004-06-01 11
# 3 1 2005-01-01 12
# 4 1 2005-06-01 13
# 5 1 2006-01-01 14
# 6 2 2004-01-01 32
# 7 2 2004-06-01 33
# 8 2 2005-01-01 34
# 9 2 2005-06-01 42
#10 2 2006-01-01 50
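For comparison, the same midpoints can be built without dplyr. This is a minimal base R sketch that assumes, as in the example, one observation per year dated January 1st within each geo, and it labels each midpoint as June 1st of the earlier year to match the desired output.

```r
data <- data.frame(
  date = as.Date(rep(c("2004-01-01", "2005-01-01", "2006-01-01"), 2)),
  geo = rep(c(1, 2), each = 3),
  inc = c(10, 12, 14, 32, 34, 50)
)

# For each geo, average every observation with the next one
mid <- do.call(rbind, lapply(split(data, data$geo), function(g) {
  g <- g[order(g$date), ]
  n <- nrow(g)
  data.frame(
    date = as.Date(paste0(format(g$date[-n], "%Y"), "-06-01")),
    geo = g$geo[-n],
    inc = (g$inc[-n] + g$inc[-1]) / 2
  )
}))

# Interleave the midpoints with the original rows
out <- rbind(data, mid)
out <- out[order(out$geo, out$date), ]
rownames(out) <- NULL
out
```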
I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to give a sample of the data in a friendlier format using dput(yourData) so that others can easily regenerate your data. Here is the output of dput() that would have been better to share:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16,
46, 26, 89, 12, 14, 18, 19, 39), start.date = structure(c(17310,
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"),
end.date = structure(c(17322, 17306, 17309, 17299, 17300,
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame",
row.names = c(NA, -9L))
To do this easily we first need to unpack the start.date and end.date to individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)){
expand <- data.frame(site = dat$site[i],
ID = dat$ID[i],
Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
Then we calculate the number of other individuals present at each site on each day:
library(dplyr)
individualCount <- newDat %>%
group_by(site, Dates) %>%
summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
group_by(site, ID) %>%
summarise(duration = max(Dates) - min(Dates)+1,
av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 2.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 2.33
6 2 14 7 1.14
7 2 18 21 0.857
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required column to the original dataset (dat) again with left_join():
dat <- dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857
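If expanding every residence period into daily rows is too heavy for a large dataset, the same average can also be obtained from pairwise overlaps. This is a minimal base R sketch of that alternative, not the method used above; it assumes each row is one individual at one site, and it relies on the fact that summing the daily counts of other individuals equals summing the overlap days with each other individual.

```r
dat <- data.frame(
  site = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  ID = c(16, 46, 26, 89, 12, 14, 18, 19, 39),
  start.date = as.Date(c("2017-05-24", "2017-04-30", "2017-04-30",
                         "2017-05-05", "2017-05-11", "2017-05-04",
                         "2017-05-09", "2017-05-24", "2017-05-05")),
  end.date = as.Date(c("2017-06-05", "2017-05-20", "2017-05-23",
                       "2017-05-13", "2017-05-14", "2017-05-10",
                       "2017-05-29", "2017-06-10", "2017-05-18"))
)

# Work on plain day numbers to keep the arithmetic simple
s <- as.numeric(dat$start.date)
e <- as.numeric(dat$end.date)

dat$av.individuals <- sapply(seq_len(nrow(dat)), function(i) {
  # Other rows at the same site
  other <- dat$site == dat$site[i] & seq_len(nrow(dat)) != i
  # Inclusive number of shared days with each other individual (0 if none)
  overlap <- pmax(0, pmin(e[other], e[i]) - pmax(s[other], s[i]) + 1)
  # Total shared days divided by this individual's residence days
  sum(overlap) / (e[i] - s[i] + 1)
})
dat
```

For ID 12 (4 residence days) the overlaps are 4 days with ID 46, 4 days with ID 26 and 3 days with ID 89, giving 11 / 4 = 2.75, in line with the joined table above.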
I have a dataframe and I want to filter out the entries that are not continuous in date. In other words, I am looking at the cluster of continuous dates.
a %>% group_by(day) %>% summarise(count = n()) %>% mutate(day_dif = day - lag(day))
Source: local data frame [20 x 3]
day count day_dif
(date) (int) (dfft)
1 2016-02-02 12 NA days
2 2016-02-03 80 1 days
3 2016-02-04 102 1 days
4 2016-02-05 97 1 days
5 2016-02-06 118 1 days
6 2016-02-07 115 1 days
7 2016-02-08 4 1 days
8 2016-02-20 13 12 days
9 2016-02-21 136 1 days
10 2016-02-22 114 1 days
11 2016-02-23 134 1 days
12 2016-02-24 126 1 days
13 2016-02-25 128 1 days
14 2016-02-26 63 1 days
15 2016-02-27 118 1 days
16 2016-03-06 1 8 days
17 2016-03-29 28 23 days
18 2016-04-03 18 5 days
19 2016-04-08 18 5 days
20 2016-04-27 23 19 days
Here, I want to filter out entries that are not continuous in date. For example, 2016-03-06, 2016-03-29 and 2016-04-03 are single-day entries that need to be removed. I am looking only for entries on consecutive days, that is, entries that span multiple days. My ideal output is:
day count day_dif Cluster
(date) (int) (dfft)
1 2016-02-02 12 NA days 1
2 2016-02-03 80 1 days 1
3 2016-02-04 102 1 days 1
4 2016-02-05 97 1 days 1
5 2016-02-06 118 1 days 1
6 2016-02-07 115 1 days 1
7 2016-02-08 4 1 days 1
8 2016-02-20 13 12 days 2
9 2016-02-21 136 1 days 2
10 2016-02-22 114 1 days 2
11 2016-02-23 134 1 days 2
12 2016-02-24 126 1 days 2
13 2016-02-25 128 1 days 2
14 2016-02-26 63 1 days 2
15 2016-02-27 118 1 days 2
Here the Cluster column indicates the date clusters, and the output also removes the single dates. A 1 in the Cluster column indicates the first group of dates and a 2 indicates the second group. If there are more than 3 consecutive days, I want to treat them as one cluster.
I have been trying to do this with lag functions and the like, but without much success. Can anybody help me with this? Any idea would be appreciated.
Thanks
We can use rle to subset the rows
i1 <- c(TRUE, a1$day_dif[-1] >=3)
i2 <- inverse.rle(within.list(rle(i1), {values1 <- values
values[values1 &lengths >3] <- FALSE
values[!values1]<- TRUE}))
a1$Cluster <- cumsum(i1)
a1[i2,]
# day count day_dif Cluster
#1 2016-02-02 12 NA days 1
#2 2016-02-03 80 1 days 1
#3 2016-02-04 102 1 days 1
#4 2016-02-05 97 1 days 1
#5 2016-02-06 118 1 days 1
#6 2016-02-07 115 1 days 1
#7 2016-02-08 4 1 days 1
#8 2016-02-20 13 12 days 2
#9 2016-02-21 136 1 days 2
#10 2016-02-22 114 1 days 2
#11 2016-02-23 134 1 days 2
#12 2016-02-24 126 1 days 2
#13 2016-02-25 128 1 days 2
#14 2016-02-26 63 1 days 2
#15 2016-02-27 118 1 days 2
The above code can also be chained (%>%)
a1 %>%
mutate(i1 = c(TRUE, day_dif[-1] >=3)) %>%
do(data.frame(., i2 = inverse.rle(within.list(rle(.$i1), {
values1 <- values
values[values1 & lengths >3] <- FALSE
values[!values1] <- TRUE
})))) %>%
mutate(Cluster = cumsum(i1)) %>%
filter(i2) %>%
select(-i1, -i2)
# day count day_dif Cluster
#1 2016-02-02 12 NA days 1
#2 2016-02-03 80 1 days 1
#3 2016-02-04 102 1 days 1
#4 2016-02-05 97 1 days 1
#5 2016-02-06 118 1 days 1
#6 2016-02-07 115 1 days 1
#7 2016-02-08 4 1 days 1
#8 2016-02-20 13 12 days 2
#9 2016-02-21 136 1 days 2
#10 2016-02-22 114 1 days 2
#11 2016-02-23 134 1 days 2
#12 2016-02-24 126 1 days 2
#13 2016-02-25 128 1 days 2
#14 2016-02-26 63 1 days 2
#15 2016-02-27 118 1 days 2
data
a <- structure(list(day = structure(c(16833, 16834, 16835, 16836,
16837, 16838, 16839, 16851, 16852, 16853, 16854, 16855, 16856,
16857, 16858, 16866, 16889, 16894, 16899, 16918), class = "Date"),
count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L,
114L, 134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L
)), .Names = c("day", "count"), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"), class = "data.frame")
a1 <- a %>%
mutate(day_dif = day - lag(day))
There is probably a better way to deal with the first NA value. Here, I manually set it to 0. Because the difference between consecutive dates is 1, you can use this property to create a boolean vector and then use cumsum to label the groups. Finally, you can remove the groups whose length is equal to 1.
# Let the first NA equal to 0
df[which(is.na(df), arr.ind=TRUE)] <- 0
df %>% mutate(cluster=cumsum(day_dif !=1)) %>%
group_by(cluster) %>% filter(length(cluster) > 1) %>% ungroup()
# Source: local data frame [15 x 4]
# day count day_dif cluster
# (date) (int) (dfft) (int)
# 1 2016-02-02 12 0 days 1
# 2 2016-02-03 80 1 days 1
# 3 2016-02-04 102 1 days 1
# 4 2016-02-05 97 1 days 1
# 5 2016-02-06 118 1 days 1
# 6 2016-02-07 115 1 days 1
# 7 2016-02-08 4 1 days 1
# 8 2016-02-20 13 12 days 2
# 9 2016-02-21 136 1 days 2
# 10 2016-02-22 114 1 days 2
# 11 2016-02-23 134 1 days 2
# 12 2016-02-24 126 1 days 2
# 13 2016-02-25 128 1 days 2
# 14 2016-02-26 63 1 days 2
# 15 2016-02-27 118 1 days 2
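The same idea can be sketched in base R without overwriting the NA. This is a minimal illustration on a shortened, made-up subset of the dates; run_id and run_len are names introduced here, and the cutoff run_len > 1 only drops single dates (use a larger cutoff if a cluster must span more days).

```r
# Shortened sample: two runs of consecutive days and one isolated date
day <- as.Date(c("2016-02-02", "2016-02-03", "2016-02-04",
                 "2016-02-20", "2016-02-21", "2016-03-06"))

# Gap in days to the previous date; the first date has no predecessor
gap <- c(NA, as.numeric(diff(day)))

# A new run starts at the first date or whenever the gap is not exactly 1 day
run_id <- cumsum(is.na(gap) | gap != 1)

# Length of the run each date belongs to
run_len <- ave(seq_along(day), run_id, FUN = length)

# Keep only runs longer than one day and renumber them 1, 2, ...
keep <- run_len > 1
res <- data.frame(
  day = day[keep],
  Cluster = match(run_id[keep], unique(run_id[keep]))
)
res
```

Renumbering with match() keeps the cluster labels consecutive even when a dropped single date sits between two kept runs.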