Deduplicate observations based on a window of time in R

I have data on a large number of individuals, and there may be multiple observations per person. I want to deduplicate the data into 'episodes' of 28 days for each individual, dropping those records whose observation date is 28 days or less after the start date of the current episode.
Sample data for six observations of a single individual are below. The duplicate and new_episode variables are dummy variables that are not present in the original data; they illustrate the intended logic.
dat <- data.frame(id = rep(1, 6),
                  spec_n = seq(1, 6, 1),
                  spec_date = as.Date(c("2016/01/01", "2016/01/02", "2016/01/30",
                                        "2016/01/31", "2016/02/02", "2016/02/28")),
                  duplicate = c(0, 1, 0, 1, 1, 0),
                  new_episode = c(1, 0, 1, 0, 0, 1),
                  stringsAsFactors = FALSE)
dat
id spec_n spec_date duplicate new_episode
1 1 1 2016-01-01 0 1
2 1 2 2016-01-02 1 0
3 1 3 2016-01-30 0 1
4 1 4 2016-01-31 1 0
5 1 5 2016-02-02 1 0
6 1 6 2016-02-28 0 1
With dplyr I can calculate the time since the previous observation and the time since the earliest observation, but deduplicating on date_diff alone would not produce the data I require.
library(dplyr)
dat <- dat %>%
  group_by(id) %>%
  mutate(date_diff = spec_date - lag(spec_date),
         earliest_spec_date = min(spec_date),
         diff_earliest = spec_date - earliest_spec_date)
dat
id spec_n spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl> <dbl> <date> <dbl> <dbl> <time> <date> <time>
1 1 1 2016-01-01 0 1 NA days 2016-01-01 0 days
2 1 2 2016-01-02 1 0 1 days 2016-01-01 1 days
3 1 3 2016-01-30 0 1 28 days 2016-01-01 29 days
4 1 4 2016-01-31 1 0 1 days 2016-01-01 30 days
5 1 5 2016-02-02 1 0 2 days 2016-01-01 32 days
6 1 6 2016-02-28 0 1 26 days 2016-01-01 58 days
However, this does not quite provide what I need: spec_n == 6 is less than 28 days after the previous observation, but more than 28 days after the start of the most recent episode (spec_n == 3), so it should begin a new episode.
Expected output would be those rows where duplicate is 0 (equivalently, where new_episode is 1), e.g.
id spec_n spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl> <dbl> <date> <dbl> <dbl> <time> <date> <time>
1 1 1 2016-01-01 0 1 NA days 2016-01-01 0 days
2 1 3 2016-01-30 0 1 28 days 2016-01-01 29 days
3 1 6 2016-02-28 0 1 26 days 2016-01-01 58 days

This should work (it's an implementation of the idea Llopis suggested, I think).
I make some simulated data first:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by=1), data=rnorm(365))
head(df)
date data
1 2015-01-01 -1.4493544
2 2015-01-02 -0.8860342
3 2015-01-03 1.3629541
4 2015-01-04 -2.0131108
5 2015-01-05 -0.4527413
6 2015-01-06 0.8428585
Now we write a function that takes the first date and checks whether each subsequent date is 28 or more days distant from it, returning 0 if not and 1 if so. Whenever a date is 28 or more days away, that date becomes the basis for future comparisons.
dupFinder <- function(x) {
  n <- 1
  N <- length(x)
  res <- rep(1, N)   # first observation always starts an episode
  start <- x[n]      # start date of the current episode
  while (n < N) {
    if (as.numeric(x[n + 1] - start) >= 28) {
      # 28 or more days since the episode start: new episode
      res[n + 1] <- 1
      n <- n + 1
      start <- x[n]
    } else {
      # within the 28-day window: duplicate
      res[n + 1] <- 0
      n <- n + 1
    }
  }
  return(res)
}
The function dupFinder returns a vector with one element per row of your data frame, which you can then use to subset the data frame to the rows of interest. Thus:
df[dupFinder(df$date)==1,]
date data
1 2015-01-01 -1.4493544
29 2015-01-29 0.2084123
57 2015-02-26 1.4541566
85 2015-03-26 0.6794230
113 2015-04-23 -0.8285670
141 2015-05-21 -0.8686872
169 2015-06-18 2.1657994
197 2015-07-16 -1.1802231
225 2015-08-13 0.1808395
253 2015-09-10 -0.4762835
281 2015-10-08 -0.3769593
309 2015-11-05 0.2825544
337 2015-12-03 -0.7132649
365 2015-12-31 -1.8111226
As expected we start with January 1, then January 29, then February 26; since February has 28 days we next get March 26, etc.


How to create new date (month, year) data in R

I have a very simple question and I hope you can help me.
I have a dataset with monthly temperatures from 1958 to 2020. This gives a total of 756 observations, which matches the number of months.
Temperature is the only column I have, and I would like to add a column with the date in month-year format, starting with 01-1958 for the first observation, then 02-1958, 03-1958, ..., up to 12-2020.
Any ideas?
Thank you very much!
Two things:
I think a Date object would be much better (there is no Month object), since it has natural number-like properties that allow you to compute differences, plot without bias, etc. Note that, stored this way, every other representation can be derived trivially for reports/renders.
Even if you must go with a string, I suggest putting year first so that sorting works as expected.
You offered no data, so I'll make something up:
mydata <- data.frame(val = 1:756)
mydata$date <- seq(as.Date("1958-01-01"), length.out=756, by="month")
mydata$ym_chr <- format(mydata$date, format = "%Y-%m")
mydata$my_chr <- format(mydata$date, format = "%m-%Y")
mydata[c(1:5, 752:756),]
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 2 2 1958-02-01 1958-02 02-1958
# 3 3 1958-03-01 1958-03 03-1958
# 4 4 1958-04-01 1958-04 04-1958
# 5 5 1958-05-01 1958-05 05-1958
# 752 752 2020-08-01 2020-08 08-2020
# 753 753 2020-09-01 2020-09 09-2020
# 754 754 2020-10-01 2020-10 10-2020
# 755 755 2020-11-01 2020-11 11-2020
# 756 756 2020-12-01 2020-12 12-2020
As a quick demonstration that we have exactly one observation (no more, no fewer) per month per year, across all months and all years, here's a quick table:
table(year=gsub(".*-", "", mydata$my_chr), month=gsub("-.*", "", mydata$my_chr))
# month
# year 01 02 03 04 05 06 07 08 09 10 11 12
# 1958 1 1 1 1 1 1 1 1 1 1 1 1
# 1959 1 1 1 1 1 1 1 1 1 1 1 1
# 1960 1 1 1 1 1 1 1 1 1 1 1 1
# ...
# 2018 1 1 1 1 1 1 1 1 1 1 1 1
# 2019 1 1 1 1 1 1 1 1 1 1 1 1
# 2020 1 1 1 1 1 1 1 1 1 1 1 1
All snipped rows are identical in all but the year, i.e., all 1s. The sum(.) of this table is 756. (Just checking, since I wanted to make sure I was doing it right.)
Lastly, to highlight my comment about sorting, here are some examples premised on the knowledge that val is incrementing from 1.
head(mydata[order(mydata$ym_chr),])
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 2 2 1958-02-01 1958-02 02-1958
# 3 3 1958-03-01 1958-03 03-1958
# 4 4 1958-04-01 1958-04 04-1958
# 5 5 1958-05-01 1958-05 05-1958
# 6 6 1958-06-01 1958-06 06-1958
head(mydata[order(mydata$my_chr),])
# val date ym_chr my_chr
# 1 1 1958-01-01 1958-01 01-1958
# 13 13 1959-01-01 1959-01 01-1959
# 25 25 1960-01-01 1960-01 01-1960
# 37 37 1961-01-01 1961-01 01-1961
# 49 49 1962-01-01 1962-01 01-1962
# 61 61 1963-01-01 1963-01 01-1963
If being able to sort by date is important, then I suggest it will be much simpler to use either $date or the string $ym_chr.

Using dplyr for multiple mutate on different criteria

This is my reprex:
library(lubridate)   # for dst()

dates <- seq(as.POSIXct("2015-01-01 13:10:00", tz = "UTC"),
             as.POSIXct("2015-01-01 13:10:10", tz = "UTC"), by = "1 sec")
dates[dst(dates)] <- dates[dst(dates)] - 3600
datavalues <- data.frame(x = c(90, 90, 80, 65, NA, 64, 71, 75, 62, 63, 74))
data <- cbind(dates, datavalues)
data
dates x
1 2015-01-01 13:10:00 90
2 2015-01-01 13:10:01 90
3 2015-01-01 13:10:02 80
4 2015-01-01 13:10:03 65
5 2015-01-01 13:10:04 NA
6 2015-01-01 13:10:05 64
7 2015-01-01 13:10:06 71
8 2015-01-01 13:10:07 75
9 2015-01-01 13:10:08 62
10 2015-01-01 13:10:09 63
11 2015-01-01 13:10:10 74
I would have to obtain the following data frame (which I will concatenate to data):
results <- data.frame(Duration = c(3, 3, 3, 0, 0, 0, 2, 2, 0, 0, 1),
                      Maxx = c(90, 90, 90, 0, 0, 0, 75, 75, 0, 0, 74),
                      Delta = c(0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 11))
results
Duration Maxx Delta
1 3 90 0
2 3 90 0
3 3 90 0
4 0 0 0
5 0 0 0
6 0 0 0
7 2 75 7
8 2 75 0
9 0 0 0
10 0 0 0
11 1 74 11
I fix the threshold at 70.
The Duration column is the number of consecutive observations exceeding the threshold.
The Maxx column is the maximum of x within each run of non-zero duration.
Lastly, the Delta column is the difference between the first x exceeding 70 and the preceding x.
If possible I would like code using dplyr, because the surrounding code already uses dplyr. Thank you in advance.
With the help of data.table's rleid you can create groups of consecutive values that are above or below the threshold and calculate the required numbers within each group.
library(dplyr)
library(data.table)
threshold <- 70
data %>%
  # Create a unique group of consecutive values
  group_by(group = rleid(replace(x, is.na(x), 0) < threshold)) %>%
  # If the value is less than the threshold put 0 in Duration, otherwise
  # the number of observations in the group. Do the same for the max value.
  mutate(Duration = if_else(x < threshold, 0L, n(), missing = 0L),
         # +(Duration > 0) is used to turn values below the threshold into 0
         Maxx = max(x, na.rm = TRUE) * +(Duration > 0)) %>%
  ungroup() %>%
  # Subtract the previous value from the current value
  mutate(Delta = x - lag(x),
         # Keep Delta only on the first row of each above-threshold group
         Delta = replace(Delta, group == lag(group, default = first(group)) |
                           Duration == 0, 0)) %>%
  select(-group)
# dates x Duration Maxx Delta
# <dttm> <dbl> <int> <dbl> <dbl>
# 1 2015-01-01 13:10:00 90 3 90 0
# 2 2015-01-01 13:10:01 90 3 90 0
# 3 2015-01-01 13:10:02 80 3 90 0
# 4 2015-01-01 13:10:03 65 0 0 0
# 5 2015-01-01 13:10:04 NA 0 0 0
# 6 2015-01-01 13:10:05 64 0 0 0
# 7 2015-01-01 13:10:06 71 2 75 7
# 8 2015-01-01 13:10:07 75 2 75 0
# 9 2015-01-01 13:10:08 62 0 0 0
#10 2015-01-01 13:10:09 63 0 0 0
#11 2015-01-01 13:10:10 74 1 74 11

`mstate`: prepare "long" format data into "mstate" format data

The typical preparation steps for mstate involve converting "wide" format data (1x row per 'patient') into "multi-state" format data (multiple rows per 'patient' for each possible transition in the multi-state model).
For example, data in wide format:
library(mstate)
data(ebmt4)
ebmt <- ebmt4
> head(ebmt)
id rec rec.s ae ae.s recae recae.s rel rel.s srv srv.s year agecl proph match
1 1 22 1 995 0 995 0 995 0 995 0 1995-1998 20-40 no no gender mismatch
2 2 29 1 12 1 29 1 422 1 579 1 1995-1998 20-40 no no gender mismatch
3 3 1264 0 27 1 1264 0 1264 0 1264 0 1995-1998 20-40 no no gender mismatch
4 4 50 1 42 1 50 1 84 1 117 1 1995-1998 20-40 no gender mismatch
5 5 22 1 1133 0 1133 0 114 1 1133 0 1995-1998 >40 no gender mismatch
6 6 33 1 27 1 33 1 1427 0 1427 0 1995-1998 20-40 no no gender mismatch
Is converted to multi-state format:
tmat <- transMat(x = list(c(2, 3, 5, 6), c(4, 5, 6), c(4, 5, 6),
                          c(5, 6), c(), c()),
                 names = c("Tx", "Rec", "AE", "Rec+AE", "Rel", "Death"))
msebmt <- msprep(data = ebmt, trans = tmat,
                 time = c(NA, "rec", "ae", "recae", "rel", "srv"),
                 status = c(NA, "rec.s", "ae.s", "recae.s", "rel.s", "srv.s"),
                 keep = c("match", "proph", "year", "agecl"))
> head(msebmt)
An object of class 'msdata'
Data:
id from to trans Tstart Tstop time status match proph year agecl
1 1 1 2 1 0 22 22 1 no gender mismatch no 1995-1998 20-40
2 1 1 3 2 0 22 22 0 no gender mismatch no 1995-1998 20-40
3 1 1 5 3 0 22 22 0 no gender mismatch no 1995-1998 20-40
4 1 1 6 4 0 22 22 0 no gender mismatch no 1995-1998 20-40
5 1 2 4 5 22 995 973 0 no gender mismatch no 1995-1998 20-40
6 1 2 5 6 22 995 973 0 no gender mismatch no 1995-1998 20-40
But what if my original dataset has time-varying covariates (i.e. long format) and I want to format the data into multi-state mode? All of the tutorials I have found online are only for converting initially wide data to multi-state data (not initially long data); for example the mstate package vignette.
So, let's say I have the below data df, where id identifies a 'patient', (start, stop] gives the time period, state is the state the patient is in at the end of the time period, and tv.cov is their time-varying covariate (assumed constant over the time period). Note that only patient id=5 has three entries, and that person's tv.cov changes.
id start stop state tv.cov
1 0 1 1 1
2 0 4 1 2
3 0 7 1 1
4 0 10 1 5
5 0 6 1 4
5 6 10 2 10
5 10 15 3 12
Assuming the basic "illness-death" transition model:
tmat <- mstate::trans.illdeath(names = c("healthy", "sick", "death"))
> tmat
to
from healthy sick death
healthy NA 1 2
sick NA NA 3
death NA NA NA
How can I prep df into multi-state format?
As a hack, should I set up the data in "wide" format, convert it to "multi-state" format using msprep, and then join on another frame containing the time-varying covariates for each patient at each time interval?

Use dplyr (I think) to manipulate a dataset

I was given a data set called ChickWeight, which has the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0.
I first cleaned the data set and kept only the chicks that were recorded for all 12 weigh-ins:
library(datasets)
library(dplyr)
Frequency <- dplyr::count(ChickWeight, Chick)   # columns: Chick, n
a <- inner_join(ChickWeight, Frequency, by = "Chick")
complete <- a[a$n == 12, ]
head(complete, 3)
This data set, ChickWeight, is in R's datasets package.
You can try:
library(dplyr)
ChickWeight %>%
  group_by(Chick) %>%
  filter(any(Time == 21)) %>%
  mutate(wdiff = weight - first(weight))
# A tibble: 540 x 5
# Groups: Chick [45]
weight Time Chick Diet wdiff
<dbl> <dbl> <ord> <fct> <dbl>
1 42 0 1 1 0
2 51 2 1 1 9
3 59 4 1 1 17
4 64 6 1 1 22
5 76 8 1 1 34
6 93 10 1 1 51
7 106 12 1 1 64
8 125 14 1 1 83
9 149 16 1 1 107
10 171 18 1 1 129
# ... with 530 more rows

Creating a counter variable in R grouped by ID that conditionally resets [duplicate]

This question already has answers here:
Is there a dplyr equivalent to data.table::rleid?
(6 answers)
Closed 5 years ago.
I'm trying to count the number of consecutive days inactive (consecDaysInactive) per ID.
I have already created an indicator variable inactive that is 1 on days where an id is inactive and 0 when active. I also have an id variable and a date variable. My analysis dataset will have hundreds of thousands of rows, so efficiency is important.
The logic I'm trying to create is as follows:
per id, if user is active, consecDaysInactive = 0
per id, if user is inactive, and was active on previous day, consecDaysInactive = 1
per id, if user is inactive on previous day, consecDaysInactive = 1 + # previous consecutive inactive days
consecDaysInactive should reset to 0 for new values of id.
I've been able to create a cumulative sum, but I have been unable to get it to reset to 0 after rows where inactive == 0.
I've illustrated below the result that I want (consecDaysInactive), as well as the result that I was able to achieve programmatically (bad_consecDaysInactive).
library(dplyr)
d <- data.frame(id = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
                date = as.Date(c('2017-01-01','2017-01-02','2017-01-03','2017-01-04',
                                 '2017-01-05','2017-01-06','2017-01-07','2017-01-08',
                                 '2017-01-01','2017-01-02','2017-01-03','2017-01-04',
                                 '2017-01-05','2017-01-06','2017-01-07','2017-01-08')),
                inactive = c(0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1),
                consecDaysInactive = c(0,0,0,1,2,3,0,1,0,1,2,3,4,0,0,1))
d <- d %>%
  group_by(id) %>%
  arrange(id, date) %>%
  do(data.frame(., bad_consecDaysInactive = cumsum(ifelse(.$inactive == 1, 1, 0))))
d
where consecDaysInactive increments by 1 for each consecutive inactive day, resets to 0 on each date the user is active, and resets to 0 for new values of id. As the output below shows, I'm unable to get bad_consecDaysInactive to reset to 0 -- e.g. row 7, where it stays at 3 even though the user was active that day:
id date inactive consecDaysInactive bad_consecDaysInactive
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0
2 1 2017-01-02 0 0 0
3 1 2017-01-03 0 0 0
4 1 2017-01-04 1 1 1
5 1 2017-01-05 1 2 2
6 1 2017-01-06 1 3 3
7 1 2017-01-07 0 0 3
8 1 2017-01-08 1 1 4
9 2 2017-01-01 0 0 0
10 2 2017-01-02 1 1 1
11 2 2017-01-03 1 2 2
12 2 2017-01-04 1 3 3
13 2 2017-01-05 1 4 4
14 2 2017-01-06 0 0 4
15 2 2017-01-07 0 0 4
16 2 2017-01-08 1 1 5
I also considered (and tried) incrementing a variable within group_by() and do(), but since do() isn't iterative, I can't get my counter past 2:
d2 <- d %>%
  group_by(id) %>%
  do(data.frame(., bad_consecDaysInactive2 =
                  ifelse(.$inactive == 0, 0,
                         ifelse(.$inactive == 1, .$inactive + lag(.$inactive),
                                .$inactive))))
d2
which yielded, as described above:
id date inactive consecDaysInactive bad_consecDaysInactive bad_consecDaysInactive2
<dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0 0
2 1 2017-01-02 0 0 0 0
3 1 2017-01-03 0 0 0 0
4 1 2017-01-04 1 1 1 1
5 1 2017-01-05 1 2 2 2
6 1 2017-01-06 1 3 3 2
7 1 2017-01-07 0 0 3 0
8 1 2017-01-08 1 1 4 1
9 2 2017-01-01 0 0 0 0
10 2 2017-01-02 1 1 1 1
11 2 2017-01-03 1 2 2 2
12 2 2017-01-04 1 3 3 2
13 2 2017-01-05 1 4 4 2
14 2 2017-01-06 0 0 4 0
15 2 2017-01-07 0 0 4 0
16 2 2017-01-08 1 1 5 1
As you can see, my iterator bad_consecDaysInactive2 resets at 0, but doesn't increment past 2! If there's a data.table solution, I'd be happy to hear it as well.
Here's a cute way to do it with a for-loop:
a <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0)
b <- rep(NA, length(a))
b[1] <- a[1]
for (i in 2:length(a)) {
  # a[i] == 0 zeroes the counter; a[i] == 1 adds 1 to the previous count
  b[i] <- a[i] * (a[i] + b[i - 1])
}
a
b
It may not be the most efficient way to do this, but it will be pretty darn fast. 11.7 seconds for ten million rows on my computer.
a <- round(runif(10000000, 0, 1))
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for (i in 2:length(a)) {
  b[i] <- a[i] * (a[i] + b[i - 1])
}
b
Sys.time() - t
Time difference of 11.73612 secs
But this doesn't account for the need to do things per id. That's easy to fix, at a minimal efficiency penalty. Your example dataframe is sorted by id; if your actual data are not already sorted, sort them first. Then:
a <- round(runif(10000000, 0, 1))
id <- round(runif(10000000, 1, 1000))
id <- id[order(id)]
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for (i in 2:length(a)) {
  b[i] <- a[i] * (a[i] + b[i - 1])
  if (id[i] != id[i - 1]) {
    # new id: restart the counter
    b[i] <- a[i]
  }
}
b
Sys.time() - t
Time difference of 13.54373 secs
If we include the time that it took to sort id, then the time difference is closer to 19 seconds. Still not too bad!
How much of an efficiency savings can we get using Frank's answer in the comments on the OP?
library(data.table)
d <- data.frame(inactive = a, id = id)
t2 <- Sys.time()
b <- setDT(d)[, v := if (inactive[1]) seq.int(.N) else 0L, by = rleid(inactive)]
# (to also reset at id boundaries, use by = .(id, rleid(inactive)))
Sys.time() - t2
Time difference of 2.233547 secs
