R: Compute max of next 12 hours for each timestep

I have this dataframe df <- tibble(id = c(1, 1, 2, 2), v= c(0, 3, 1, 2), time = c(as.POSIXct("2016-12-01 12:30:00"), as.POSIXct("2016-12-01 20:30:00"), as.POSIXct("2016-12-01 3:30:00"), as.POSIXct("2016-12-01 12:30:00")))
# A tibble: 4 x 3
id v time
<dbl> <dbl> <dttm>
1 1 0 2016-12-01 12:30:00
2 1 3 2016-12-01 20:30:00
3 2 1 2016-12-01 03:30:00
4 2 2 2016-12-01 12:30:00
For each timestep and within each id, I want to compute the max value of v within a specific time period, e.g. 12 hours. My solution is the following:
df %>% group_by(id) %>% mutate(max_in_12h = purrr::map_dbl(time, function(t){max(v[time > t & time <= t + 60*60*12])}))
id v time max_in_12h
<dbl> <dbl> <dttm> <dbl>
1 1 0 2016-12-01 12:30:00 3
2 1 3 2016-12-01 20:30:00 -Inf
3 2 1 2016-12-01 03:30:00 2
4 2 2 2016-12-01 12:30:00 -Inf
However, in my experience, purrr scales poorly when the dataframe has millions of rows. Is there another neat option?

You will need to test whether the performance is adequate but here is an alternative.
library(sqldf)
sqldf("select a.*, max(b.v) as max
from df a
left join df b on a.id = b.id and
b.time > a.time and b.time <= a.time + 60 * 60 * 12
group by a.rowid")
giving:
id v time max
1 1 0 2016-12-01 12:30:00 3
2 1 3 2016-12-01 20:30:00 NA
3 2 1 2016-12-01 03:30:00 2
4 2 2 2016-12-01 12:30:00 NA
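For comparison, here is a dependency-free sketch of the same window logic in base R. It is not benchmarked and only illustrates the idea: sort each id's times once, then use findInterval() to locate the slice of rows that falls in (t, t + 12h] instead of comparing every row against every other row. The helper name max_next_window and its window_secs argument are made up for this example.

```r
df <- data.frame(
  id = c(1, 1, 2, 2),
  v = c(0, 3, 1, 2),
  time = as.POSIXct(c("2016-12-01 12:30:00", "2016-12-01 20:30:00",
                      "2016-12-01 03:30:00", "2016-12-01 12:30:00"), tz = "UTC")
)

# Hypothetical helper: for each t, max of v over rows with t < time <= t + window_secs.
# Meant to be called on the rows of a single id.
max_next_window <- function(time, v, window_secs = 12 * 60 * 60) {
  ord <- order(time)
  t_sorted <- as.numeric(time[ord])
  v_sorted <- v[ord]
  first <- findInterval(t_sorted, t_sorted) + 1L          # first row strictly after t
  last <- findInterval(t_sorted + window_secs, t_sorted)  # last row at or before t + window
  out_sorted <- vapply(seq_along(t_sorted), function(i) {
    if (first[i] > last[i]) NA_real_ else max(v_sorted[first[i]:last[i]])
  }, numeric(1))
  out <- numeric(length(v))
  out[ord] <- out_sorted  # restore the caller's row order
  out
}

# Apply the helper id by id
df$max_in_12h <- NA_real_
for (idx in split(seq_len(nrow(df)), df$id)) {
  df$max_in_12h[idx] <- max_next_window(df$time[idx], df$v[idx])
}
df
```

Like the SQL join, this returns NA rather than -Inf when no later row falls inside the window.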

Related

calculate number of frost change days (number of days) from the weather hourly data in r

I have to calculate the number of frost change days (NFCD) on a weekly basis.
That means the number of days in which minimum temperature and maximum temperature cross 0°C.
Let's say I work with years 1957-1980 with hourly temp.
Example data (couple of rows look like):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
Using the data from r2evans's answer you can also use tidyverse logic:
library(tidyverse)
dat %>%
group_by(date) %>%
summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE
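Both answers stop at a per-day flag. A possible way to carry that through to the weekly table the question asks for is sketched below in base R. The hourly data is made up for illustration, and the week definition (days 1-7 of a month are week 1, days 8-14 are week 2, and so on) is an assumption, since the question does not say how weeks are counted.

```r
# Made-up hourly data: 14 days in July 1957, 5 degrees everywhere ...
stamp <- seq(as.POSIXct("1957-07-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = 14 * 24)
hourly <- data.frame(stamp = stamp, temp = 5)
hourly$day <- as.Date(hourly$stamp, tz = "UTC")

# ... except for one sub-zero reading at 04:00 on three chosen days
frosty <- c("1957-07-03", "1957-07-09", "1957-07-10")
hourly$temp[format(hourly$day) %in% frosty &
            format(hourly$stamp, "%H") == "04"] <- -1

# Step 1: one flag per day (same test as in the aggregate() answer)
daily <- aggregate(temp ~ day, data = hourly,
                   FUN = function(z) min(z) <= 0 && max(z) > 0)
names(daily)[2] <- "frost_change"

# Step 2: year / month / week-of-month keys (assumed week definition)
daily$year <- as.integer(format(daily$day, "%Y"))
daily$month <- as.integer(format(daily$day, "%m"))
daily$week <- (as.integer(format(daily$day, "%d")) - 1L) %/% 7L + 1L

# Step 3: count flagged days per week
weekly <- aggregate(frost_change ~ year + month + week, data = daily, FUN = sum)
names(weekly)[4] <- "NFCD"
weekly
```

If calendar weeks are wanted instead, the week key could be swapped for format(daily$day, "%V") or a similar definition.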

Filter data based on subgroups R

In reality it's much more complex, but let's say my data looks like this:
df <- data.frame(
id = c(1,1,1,2,2,2,2,3,3,3),
event = c(0,0,0,1,1,1,1,0,0,0),
day = c(1,3,3,1,6,6,7,1,4,6),
time = c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
"2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00", "2016-10-29 12:00:00",
"2016-10-24 15:00:00", "2016-10-27 15:00:15", "2016-10-29 16:00:00"))
df$time <- as.POSIXct(df$time)
Output:
id event day time
1 1 0 1 2016-10-25 14:00:00
2 1 0 3 2016-10-27 12:00:15
3 1 0 3 2016-10-27 15:30:00
4 2 1 1 2016-10-23 11:00:00
5 2 1 6 2016-10-28 08:00:15
6 2 1 6 2016-10-28 23:00:00
7 2 1 7 2016-10-29 12:00:00
8 3 0 1 2016-10-24 15:00:00
9 3 0 4 2016-10-27 15:00:15
10 3 0 6 2016-10-29 16:00:00
What I need to do:
If event is 0, I want to keep only the last 24 hours per id.
If event is 1, I want to keep the 6th day.
I know how to keep the last 24 hours in general:
library(lubridate)
last_twentyfour_hours <- df %>%
group_by(id) %>%
filter(time > last(time) - hours(24))
But how do I filter differently for each group?
Thank you very much in advance!
Group by 'id' and 'event', then filter with if/else: if 0 is in 'event', use the OP's condition; otherwise keep the rows where 'day' is 6.
library(dplyr)
library(lubridate)
df %>%
group_by(id, event) %>%
filter(if(0 %in% event) time > last(time) - hours(24) else
day == 6) %>%
ungroup
Output:
# A tibble: 5 × 4
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
We could use the & and | operators:
df %>%
group_by(id) %>%
filter(event == 0 & time > last(time) - hours(24) |
event == 1 & day==6)
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
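The unparenthesised condition works because R evaluates comparisons first, then &, then |. A small sketch with made-up vectors (recent stands in for time > last(time) - hours(24)) shows the implicit grouping:

```r
event <- c(0, 0, 1, 1)
recent <- c(TRUE, FALSE, TRUE, FALSE)  # stand-in for the 24-hour test
day <- c(3, 1, 6, 7)

# As written in the filter() call: no parentheses
implicit <- event == 0 & recent | event == 1 & day == 6

# What R actually evaluates: & binds tighter than |
explicit <- (event == 0 & recent) | (event == 1 & day == 6)

# A different grouping gives a different answer
other <- event == 0 & (recent | event == 1) & day == 6

identical(implicit, explicit)
```

Adding the parentheses explicitly costs nothing and makes the intent easier to read.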

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of different id by country over the last 8 months, but only if id occurs at least twice during these 8 months. Hence, for the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists both in id and country. Since this could otherwise be taken care of, accounting for this is not necessary but still helpful.
My current best attempt is below. It does not account for the restriction (an id must appear at least twice in the previous 8 months), and I also find its counting odd: for date "2017-12-12", desired_unrestricted should equal 4 by my count, but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment:
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine them with do.call(c, ...) (since c(..) creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join-period. Filter(fun, ...) lets us drop the ids that have a count below 2, leaving a named vector of ids that had 2 or more visits. Its length is the number we need.
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straightforward. The range is open on the left and closed on the right (date > date8, date <= date); it can be changed to closed on both ends if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
There are some discrepancies in my counts versus your desired_output, I wonder if those are just typos in the OP; I think the math is right ...
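The two base-R building blocks from the walk-through can be tried in isolation. The sketch below uses a few dates from the question and a made-up id vector; the helper name minus8 is only for this example.

```r
# "8 months before" for a single date
z <- as.Date("2017-12-12")
seq(z, length.out = 2, by = "-8 months")[2]

# Applying it to several dates: sapply() drops the Date class, lapply() keeps it
dates <- as.Date(c("2017-01-01", "2017-05-01", "2017-12-12"))
minus8 <- function(d) seq(d, length.out = 2, by = "-8 months")[2]

via_sapply <- sapply(dates, minus8)              # plain numbers (days since 1970-01-01)
via_lapply <- do.call(c, lapply(dates, minus8))  # still class Date
class(via_sapply)
class(via_lapply)

# Counting ids that occur at least twice
ids <- c("1", "1", "2", "3", "3", NA)  # made-up ids; table() ignores NA by default
counts <- table(ids)
n_repeat <- length(Filter(function(n) n > 1, counts))
n_repeat
```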
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: the number of ids in the last 8 months that occur at least twice.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
group_by(country) %>%
group_modify(~ .x %>%
rowwise() %>%
mutate(totalids = .x %>%
filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
pull(id) %>%
table() %>%
`[`(. >1) %>%
length
))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)

R: new column for unique observations over time elapsed (indexing from previous)

I am trying to create a new column that assigns a unique value to the observation (row) only IF the recorded observation occur after a specific time following the last observation (see data frame).
Context:
I set up camera trap to observe what species visit a particular plot, every visit by a species should get a unique visitID. The actual database contains more complexity but this is the main problem I have.
new.df <- data.frame(
species = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
visit.time = c(seq(ymd_hm('2015-01-01 00:00'), ymd_hm('2015-01-01 00:10'), by = '2 mins'),
seq(ymd_hm('2015-01-01 00:00'), ymd_hm('2015-01-01 00:10'), by = '2 mins'))
)
> new.df
species visit.time
1 A 2015-01-01 00:00:00
2 A 2015-01-01 00:02:00
3 A 2015-01-01 00:04:00
4 A 2015-01-01 00:06:00
5 A 2015-01-01 00:08:00
6 A 2015-01-01 00:10:00
7 B 2015-01-01 00:00:00
8 B 2015-01-01 00:02:00
9 B 2015-01-01 00:04:00
10 B 2015-01-01 00:06:00
11 B 2015-01-01 00:08:00
12 B 2015-01-01 00:10:00
I would like to create a new column called "visitID" that records each species' visit that occurred. However, I want to assign a unique number only if the visit occurred at least 2 minutes after the previous recorded visit:
> new.df
species visit.time visitID
1 A 2015-01-01 00:00:00 1
2 A 2015-01-01 00:02:00 -
3 A 2015-01-01 00:04:00 2
4 A 2015-01-01 00:06:00 -
5 A 2015-01-01 00:08:00 3
6 A 2015-01-01 00:10:00 -
7 B 2015-01-01 00:00:00 1
8 B 2015-01-01 00:02:00 -
9 B 2015-01-01 00:04:00 2
10 B 2015-01-01 00:06:00 -
11 B 2015-01-01 00:08:00 3
12 B 2015-01-01 00:10:00 -
where - is just an NA
I would usually try using dplyr::mutate with conditional terms such as ifelse; the problem is that I do not know how to account for the time elapsed since the previous visit.
Please let me know if there are more details that I could provide. Thanks!
From your desired output it seems you want a new ID when the time difference between the current and the last recorded visit that received a new ID exceeds 2 minutes. In that case, we could use a cumulative sum that resets at a certain threshold. I've used the function from this answer: dplyr / R cumulative sum with reset
library(dplyr)
library(purrr) # provides accumulate()
sum_reset_at <- function(thresh) {
function(x) {
accumulate(x, ~if_else(.x>thresh, .y, .x+.y))
}
}
new.df <- new.df %>%
group_by(species) %>% # group df by species
arrange(species, visit.time) %>% # sort the data
mutate(
time.elapsed = as.numeric(difftime(visit.time, lag(visit.time), units = "mins")), # calculate time difference in minutes
time.elapsed = ifelse(is.na(time.elapsed), 0, time.elapsed), # replace NAs at first entries with 0s
time.elapsed.cum = sum_reset_at(2)(time.elapsed), # build cumulative sum that resets once the value is greater (not greater or equal) to two
newID = ifelse(time.elapsed.cum > 2, TRUE, FALSE), # build logical vector that marks the position where a new ID starts
visitID = cumsum(newID) + 1, # generate visit IDs
visitID = replace(visitID, duplicated(visitID), NA) # keep only first entry of an id, replace rest with NA
)
Output:
> new.df
# A tibble: 12 x 6
# Groups: species [2]
species visit.time time.elapsed time.elapsed.cum newID visitID
<fct> <dttm> <dbl> <dbl> <lgl> <dbl>
1 A 2015-01-01 00:00:00 0 0 FALSE 1
2 A 2015-01-01 00:02:00 2 2 FALSE NA
3 A 2015-01-01 00:04:00 2 4 TRUE 2
4 A 2015-01-01 00:06:00 2 2 FALSE NA
5 A 2015-01-01 00:08:00 2 4 TRUE 3
6 A 2015-01-01 00:10:00 2 2 FALSE NA
7 B 2015-01-01 00:00:00 0 0 FALSE 1
8 B 2015-01-01 00:02:00 2 2 FALSE NA
9 B 2015-01-01 00:04:00 2 4 TRUE 2
10 B 2015-01-01 00:06:00 2 2 FALSE NA
11 B 2015-01-01 00:08:00 2 4 TRUE 3
12 B 2015-01-01 00:10:00 2 2 FALSE NA
So basically we are summing up the time differences until they exceed two minutes; on the next row the sum restarts from that row's own time difference. Where this cumulative sum is greater than two, we need to add a new ID. We do this by adding a logical vector and building the cumsum of that vector (because TRUE = 1 and FALSE = 0). Lastly, we replace the duplicated IDs in the groups to get the output you specified. We can drop the columns you don't need:
> new.df %>% select(-c(time.elapsed, time.elapsed.cum, newID))
# A tibble: 12 x 3
# Groups: species [2]
species visit.time visitID
<fct> <dttm> <dbl>
1 A 2015-01-01 00:00:00 1
2 A 2015-01-01 00:02:00 NA
3 A 2015-01-01 00:04:00 2
4 A 2015-01-01 00:06:00 NA
5 A 2015-01-01 00:08:00 3
6 A 2015-01-01 00:10:00 NA
7 B 2015-01-01 00:00:00 1
8 B 2015-01-01 00:02:00 NA
9 B 2015-01-01 00:04:00 2
10 B 2015-01-01 00:06:00 NA
11 B 2015-01-01 00:08:00 3
12 B 2015-01-01 00:10:00 NA
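The reset logic can also be traced on a single species without any packages. This is only a base-R sketch of the same idea, with Reduce(accumulate = TRUE) standing in for purrr's accumulate(); the variable names are made up.

```r
elapsed <- c(0, 2, 2, 2, 2, 2)  # minutes since the previous row, 0 for the first row
thresh <- 2

# Running sum that restarts from the current value once it has exceeded thresh
running <- Reduce(function(acc, nxt) if (acc > thresh) nxt else acc + nxt,
                  elapsed, accumulate = TRUE)

new_id <- running > thresh              # TRUE where a new visit starts
visit_id <- cumsum(new_id) + 1          # 1 1 2 2 3 3
visit_id[duplicated(visit_id)] <- NA    # keep only the first row of each visit
data.frame(elapsed, running, new_id, visit_id)
```

The running column reproduces time.elapsed.cum for species A in the output above.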
You can return the differences using diff(). Just make sure to prepend a 2 to each group of species, i.e. c(2, diff(visit.time) / 60), so that the first visit for each species always gets an ID. Without it, diff() returns one value fewer than the group has rows and mutate() throws an error.
The only criterion you've given for visitID is that the values for each species are unique, but not that they are consecutive, so I'll assume that 1 5 6 is just as valid as 1 2 3. This simplifies things quite a bit:
library(dplyr)
df %>%
group_by(species) %>%
mutate(tdiff = c(2, diff(visit.time) / 60),
visitID = seq_along(species),
visitID = ifelse(tdiff >= 2, visitID, NA)
)
Which will return the following data frame:
# A tibble: 12 x 4
# Groups: species [2]
species visit.time tdiff visitID
<fct> <dttm> <dbl> <int>
1 A 2015-01-01 00:02:10 2 1
2 A 2015-01-01 00:03:00 0.833 NA
3 A 2015-01-01 00:03:10 0.167 NA
4 A 2015-01-01 00:04:00 0.833 NA
5 A 2015-01-01 00:07:40 3.67 5
6 A 2015-01-01 00:09:40 2 6
7 B 2015-01-01 00:00:40 2 1
8 B 2015-01-01 00:01:10 0.5 NA
9 B 2015-01-01 00:04:10 3 3
10 B 2015-01-01 00:05:40 1.5 NA
11 B 2015-01-01 00:09:40 4 5
12 B 2015-01-01 00:09:50 0.167 NA
Note that I've used a modified dataset because the differences between the times in the example you provide are all == 2.
Data:
df <- structure(list(species = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
visit.time = structure(c(1420070530, 1420070580, 1420070590,
1420070640, 1420070860, 1420070980, 1420070440, 1420070470,
1420070650, 1420070740, 1420070980, 1420070990), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA,
-12L))
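One caveat with diff(visit.time) / 60: diff() on date-times picks its units automatically, so dividing by 60 only yields minutes when the result happens to be in seconds. A hedged sketch of a units-safe variant for species A of the modified data, using difftime() with explicit units as in the other answer:

```r
visit_time <- as.POSIXct(
  c(1420070530, 1420070580, 1420070590, 1420070640, 1420070860, 1420070980),
  origin = "1970-01-01", tz = "UTC"
)

# Explicit minutes between consecutive visits, independent of automatic units
gaps <- difftime(visit_time[-1], visit_time[-length(visit_time)], units = "mins")
gap_mins <- c(2, as.numeric(gaps))  # leading 2 so the first visit gets an ID
visit_id <- ifelse(gap_mins >= 2, seq_along(gap_mins), NA)
visit_id

# With evenly spaced 2-minute visits, diff() already reports minutes,
# so an extra division by 60 would shrink every gap to about 0.03
even <- seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
            by = "2 mins", length.out = 3)
units(diff(even))
```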

Fill in missing cases till specific condition per group

I'm attempting to create a data frame that shows all of the in between months for my data set, by subject. Here is an example of what the data looks like:
dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
'2017-02-01', '2017-04-01'))
colnames(dat) <- c('id', 'value', 'date')
dat$Out.Of.Study <- c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
dat
id value date Out.Of.Study
1 1 30 2017-01-01
2 1 30 2017-02-01
3 1 25 2017-04-01 Out
4 2 25 2017-02-01 Out
5 3 25 2017-01-01
6 3 25 2017-02-01
7 3 25 2017-03-01 Out
8 4 20 2017-01-01
9 4 20 2017-02-01
10 4 20 2017-04-01 Out
If I want to show the in between months where no data was collected (but the subject was still enrolled in the study) I can use the complete() function. However, the issue is that I get all missing months for each subject id based on the min and max month identified in the data set:
## Add Dates by Group
library(tidyr)
complete(dat, id, date)
id date value Out.Of.Study
1 1 2017-01-01 30
2 1 2017-02-01 30
3 1 2017-03-01 NA <NA>
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA <NA>
6 2 2017-02-01 25 Out
7 2 2017-03-01 NA <NA>
8 2 2017-04-01 NA <NA>
9 3 2017-01-01 25
10 3 2017-02-01 25
11 3 2017-03-01 25 Out
12 3 2017-04-01 NA <NA>
13 4 2017-01-01 20
14 4 2017-02-01 20
15 4 2017-03-01 NA <NA>
16 4 2017-04-01 20 Out
The issue with this is that I don't want the missing months to exceed the subject's final observed month (essentially, I have subjects who are censored and would need to be removed from the study) or show up prior to the month a subject started the study. For example, subject 2 was only a participant in the month '2017-02-01'. Therefore, I'd like the data to show that this was the only month they were in the study, without the extra months after and the extra month before, as shown above. The same is the case with subject 3, who has an extra month even though they are out of the study.
Perhaps the complete() isn't the best way to go about this?
This can be solved by creating a sequence of months individually for each id and by joining the sequences with dat to complete the missing months.
1. data.table
(The question is tagged with tidyr. But as I am more acquainted with data.table I have tried this first.)
library(data.table)
# coerce date strings to class Date
setDT(dat)[, date := as.Date(date)]
# create sequence of months for each id
sdt <- dat[, .(date = seq(min(date), max(date), "month")), by = id]
# join
dat[sdt, on = .(id, date)]
id value date Out.Of.Study
1: 1 30 2017-01-01
2: 1 30 2017-02-01
3: 1 NA 2017-03-01 <NA>
4: 1 25 2017-04-01 Out
5: 2 25 2017-02-01 Out
6: 3 25 2017-01-01
7: 3 25 2017-02-01
8: 3 25 2017-03-01 Out
9: 4 20 2017-01-01
10: 4 20 2017-02-01
11: 4 NA 2017-03-01 <NA>
12: 4 20 2017-04-01 Out
Note that there is only one row for id == 2 as requested by the OP.
This approach requires coercing date from factor to class Date to make sure that all missing months will be completed.
This is also safer than relying on the available date factors in the dataset. For illustration, let's assume that id == 4 is Out in month 2017-06-01 (June) instead of 2017-04-01 (April). Then, there would be no month 2017-05-01 (May) in the whole dataset and the final result would be incomplete.
Without creating the temporary variable sdt the code becomes
library(data.table)
setDT(dat)[, date := as.Date(date)][
dat[, .(date = seq(min(date), max(date), "month")), by = id], on = .(id, date)]
2. tidyr / dplyr
library(dplyr)
library(tidyr)
# coerce date strings to class Date
dat <- dat %>%
mutate(date = as.Date(date))
dat %>%
# create sequence of months for each id
group_by(id) %>%
expand(date = seq(min(date), max(date), "month")) %>%
# join to complete the missing month for each id
left_join(dat, by = c("id", "date"))
# A tibble: 12 x 4
# Groups: id [?]
id date value Out.Of.Study
<dbl> <date> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-02-01 25 Out
6 3 2017-01-01 25 ""
7 3 2017-02-01 25 ""
8 3 2017-03-01 25 Out
9 4 2017-01-01 20 ""
10 4 2017-02-01 20 ""
11 4 2017-03-01 NA NA
12 4 2017-04-01 20 Out
There is a variant which does not update dat:
library(dplyr)
library(tidyr)
dat %>%
mutate(date = as.Date(date)) %>%
right_join(group_by(., id) %>%
expand(date = seq(min(date), max(date), "month")),
by = c("id", "date"))
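The same idea (one month sequence per id, then a join) can be sketched without any packages. This base-R version is only an illustration of the approach; the names per_id, grid and filled are made up.

```r
dat <- data.frame(
  id = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  value = c(rep(30, 2), rep(25, 5), rep(20, 3)),
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-04-01", "2017-02-01",
                   "2017-01-01", "2017-02-01", "2017-03-01", "2017-01-01",
                   "2017-02-01", "2017-04-01")),
  Out.Of.Study = c("", "", "Out", "Out", "", "", "Out", "", "", "Out"),
  stringsAsFactors = FALSE
)

# Monthly sequence from each id's first to last observed month
per_id <- lapply(split(dat$date, dat$id),
                 function(d) seq(min(d), max(d), by = "month"))

# One row per id and month
grid <- data.frame(
  id = rep(as.numeric(names(per_id)), lengths(per_id)),
  date = do.call(c, unname(per_id))
)

# Left join: months without data get NA in value and Out.Of.Study
filled <- merge(grid, dat, by = c("id", "date"), all.x = TRUE)
filled
```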
I would still use complete (probably the right method here), but afterwards I would drop the rows that come after the row with "Out". You can do this with dplyr::between.
dat %>%
group_by(id) %>%
complete(date) %>%
# Filter rows that are between 1 and the one that has "Out"
filter(between(row_number(), 1, which(Out.Of.Study == "Out")))
id date value Out.Of.Study
<dbl> <fct> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA NA
6 2 2017-02-01 25 Out
7 3 2017-01-01 25 ""
8 3 2017-02-01 25 ""
9 3 2017-03-01 25 Out
10 4 2017-01-01 20 ""
11 4 2017-02-01 20 ""
12 4 2017-03-01 NA NA
13 4 2017-04-01 20 Out
