How to calculate moving average by specified grouping and deal with NAs - r

I have a data.table for which a moving average needs to be calculated over the previous n days of data (let's use n=2 for simplicity, not including the current day) for a specified grouping (ID1, ID2). The moving average should attempt to include the last 2 days of values for each ID1-ID2 pair. I would like to handle NAs in two separate ways when calculating the moving average:
1. Only calculate when there are 2 non-NA observations, otherwise the average should be NA (e.g. the first 2 days within an ID1-ID2 pair will always be NA).
2. Calculate the moving average based on any non-NA observations within the last 2 days (na.rm=TRUE ?).
I've tried to use the zoo package and various functions within it. I've settled on the following (shift() excludes the current day from the average; dates are in reverse order to highlight that they are not always ordered initially):
library(zoo)
library(data.table)
DATE = rev(rep(seq(as.Date("2018-01-01"), as.Date("2018-01-04"), "day"), 4))
VALUE = seq(1, 16, 1)
VALUE[16] <- NA
ID1 = rep(c("A", "B"), each = 8)
ID2 = rep(1:2, 2, each = 4)
testdata = data.frame(DATE, ID1, ID2, VALUE)
setDT(testdata)[order(DATE), VALUE_AVG := shift(rollapplyr(VALUE, 2, mean, na.rm = TRUE, fill = NA)),
                by = c("ID1", "ID2")]
I seem to have trouble grouping by multiple columns. Groupings where VALUE begins or ends with NA values also seem to cause issues. I'm open to any solution that makes sense within a data.table framework, especially frollmean (I need to update my versions of R and data.table). I don't know whether I need to order the dates differently in conjunction with a specified alignment (e.g. "right").
I would hope my output would look something like the following except ordered by oldest date first per ID1-ID2 grouping:
DATE ID1 ID2 VALUE VALUE_AVG
1: 2018-01-04 A 1 1 2.5
2: 2018-01-03 A 1 2 3.5
3: 2018-01-02 A 1 3 NA
4: 2018-01-01 A 1 4 NA
5: 2018-01-04 A 2 5 6.5
6: 2018-01-03 A 2 6 7.5
7: 2018-01-02 A 2 7 NA
8: 2018-01-01 A 2 8 NA
9: 2018-01-04 B 1 9 10.5
10: 2018-01-03 B 1 10 11.5
11: 2018-01-02 B 1 11 NA
12: 2018-01-01 B 1 12 NA
13: 2018-01-04 B 2 13 14.5
14: 2018-01-03 B 2 14 15.0
15: 2018-01-02 B 2 15 NA
16: 2018-01-01 B 2 NA NA
My code seems to roughly achieve the desired results for the sample data. Nevertheless, when trying to run the same code on a large dataset for a 4-week average, where ID1 and ID2 are both integers, I get the following error:
Error in seq.default(start.at, NROW(data), by = by) :
wrong sign in 'by' argument
My results seem right for most ID1-ID2 combinations but there are specific cases of ID1 where VALUE has leading and trailing NAs. I'm guessing this is causing the issue, although it hasn't for the example above.

Using shift complicates this unnecessarily; rollapply can already handle it by itself. In rollapplyr specify:
a width of list(-seq(2)) to specify that it should act on offsets -1 and -2 (the two prior rows).
partial = TRUE to indicate that if there are fewer than 2 prior rows it should use whatever is there.
fill = NA to fill empty cells with NA.
na.rm = TRUE to remove any NAs and take the mean of the remaining cells. If the prior cells are all NA then mean gives NaN.
To only consider situations where there are 2 prior non-NAs, giving NA otherwise, remove the partial = TRUE and na.rm = TRUE arguments. A tiny illustration of the offset width follows.
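For intuition, here is a minimal sketch of the offset width on a plain vector: each output element is computed from the two rows before it, and without partial = TRUE the first two positions have incomplete windows and are filled with NA.
library(zoo)
rollapplyr(1:5, list(-seq(2)), sum, fill = NA)
# [1] NA NA  3  5  7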
First case
Take the mean of the non-NAs in the prior 2 rows, or of fewer rows if there are fewer prior rows.
testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG :=
rollapplyr(VALUE, list(-seq(2)), mean, fill = NA, partial = TRUE, na.rm = TRUE),
by = c("ID1", "ID2")]
testdata
giving:
DATE ID1 ID2 VALUE VALUE_AVG
1: 2018-01-01 A 1 4 NA
2: 2018-01-02 A 1 3 4.0
3: 2018-01-03 A 1 2 3.5
4: 2018-01-04 A 1 1 2.5
5: 2018-01-01 A 2 8 NA
6: 2018-01-02 A 2 7 8.0
7: 2018-01-03 A 2 6 7.5
8: 2018-01-04 A 2 5 6.5
9: 2018-01-01 B 1 12 NA
10: 2018-01-02 B 1 11 12.0
11: 2018-01-03 B 1 10 11.5
12: 2018-01-04 B 1 9 10.5
13: 2018-01-01 B 2 NA NA
14: 2018-01-02 B 2 15 NaN
15: 2018-01-03 B 2 14 15.0
16: 2018-01-04 B 2 13 14.5
Second case
NA if any of the prior 2 rows are NA or if there are fewer than 2 prior rows.
testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG :=
rollapplyr(VALUE, list(-seq(2)), mean, fill = NA),
by = c("ID1", "ID2")]
testdata
giving:
DATE ID1 ID2 VALUE VALUE_AVG
1: 2018-01-01 A 1 4 NA
2: 2018-01-02 A 1 3 NA
3: 2018-01-03 A 1 2 3.5
4: 2018-01-04 A 1 1 2.5
5: 2018-01-01 A 2 8 NA
6: 2018-01-02 A 2 7 NA
7: 2018-01-03 A 2 6 7.5
8: 2018-01-04 A 2 5 6.5
9: 2018-01-01 B 1 12 NA
10: 2018-01-02 B 1 11 NA
11: 2018-01-03 B 1 10 11.5
12: 2018-01-04 B 1 9 10.5
13: 2018-01-01 B 2 NA NA
14: 2018-01-02 B 2 15 NA
15: 2018-01-03 B 2 14 NA
16: 2018-01-04 B 2 13 14.5

Maybe something like:
setorder(setDT(testdata), ID1, ID2, DATE)
testdata[, VALUE_AVG := shift(
  rollapplyr(VALUE, 2L,
             function(x) if (sum(!is.na(x)) > 0L) mean(x, na.rm = TRUE) else NA_real_,
             fill = NA_real_)
), by = c("ID1", "ID2")]
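Since the question also mentions frollmean, here is a hedged data.table-only sketch (assuming data.table >= 1.12.0, where frollmean was introduced); shift() again drops the current day from the window:
library(data.table)
setorder(testdata, ID1, ID2, DATE)
# Case 2: mean of whatever non-NA values fall in the prior 2 days
testdata[, VALUE_AVG2 := shift(frollmean(VALUE, 2, na.rm = TRUE)), by = .(ID1, ID2)]
# Case 1: NA unless both prior observations are non-NA (default na.rm = FALSE)
testdata[, VALUE_AVG1 := shift(frollmean(VALUE, 2)), by = .(ID1, ID2)]
Like mean(), frollmean(..., na.rm = TRUE) over an all-NA window returns NaN rather than NA.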

Related

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of different ids by country over the last 8 months, but only if an id occurs at least twice during these 8 months. Hence, for the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NAs exist in both id and country. Since this could otherwise be taken care of, accounting for it is not necessary but still helpful.
My current best try is below, but it does not account for the restriction (an id must appear at least twice in the previous 8 months), and I also find its counting odd when looking at the date "2017-12-12", where desired_unrestricted should be equal to 4 according to my counting but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment:
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (the current date for each row) and produces a sequence of length 2 with 8 months between the elements. seq always starts at the first argument, so length=1 won't work (it would only return z); length=2 guarantees that the second value of the returned vector is the "8 months before date" that we need.
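A quick demonstration on one of the sample dates:
seq(as.Date("2017-12-12"), length = 2, by = "-8 months")[2]
# [1] "2017-04-12"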
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to have length 1, so we need to sapply or lapply over the dates; unfortunately, sapply drops the Date class, so we lapply and then combine the results with do.call(c, ...) (c() on the individual Dates preserves the class, whereas unlist would drop it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce table(id), which gives us the count of each id within the join window. Filter(fun, ...) keeps only the elements for which fun returns TRUE, so we drop the ids with a count below 2 and are left with a named vector of ids that had 2 or more visits. Its length is the count we need.
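In isolation, with a made-up id vector:
ids <- c("1", "1", "2")
length(Filter(function(z) z > 1, table(ids)))
# [1] 1
Only id "1" appears at least twice, so the count is 1.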
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straightforward. This is an open/closed range; it can be changed to closed on both ends if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join computed a value (length(Filter(...))) without a name, it's named V1, and that is all we want. (To be honest, I don't know exactly why assigning it more directly doesn't work ... the counts come out all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
There are some discrepancies between my counts and your desired_output; I wonder if those are just typos in the OP. I think the math is right ...
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting output that differs from desired, as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: the number of ids in the last 8 months which have a count of at least 2.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
  group_by(country) %>%
  group_modify(~ .x %>%
    rowwise() %>%
    mutate(totalids = .x %>%
      filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
      pull(id) %>%
      table() %>%
      `[`(. > 1) %>%
      length
    ))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)

creating a unique variable based on row differences of another variable considering groups

Using the data below, I want to create a new unique customer id by considering the contact date.
Rule: after every two days, each customer should get a new unique customer id, which is preserved on the following record if the next contact date for the same customer falls within the following two days; if not, the same customer is assigned a new id.
I couldn't get any further than calculating the date differences.
The original dataset I work is bigger; therefore, I prefer a data.table solution if possible.
library(data.table)
threshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
                     contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
                                              "2019-01-08","2019-01-09","2019-02-02","2019-02-05",
                                              "2019-02-05","2019-02-09","2019-02-12","2019-02-01",
                                              "2019-02-01","2019-02-05")),
                     desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
                class = "data.frame",
                row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days := as.numeric(contact_date - shift(contact_date, type = "lag")), by = customer_id]
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
Rule: after every two days, each customer should get a new unique customer id, which is preserved on the following record if the next contact date for the same customer falls within the following two days; if not, the same customer is assigned a new id.
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by=.(
customer_id,
cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check on your actual data (comparing against results from, eg, Gregor's approach) to be sure.
We use cumsum to increment whenever date_diff_in_days is NA or when the threshold is exceeded.
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > threshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
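To see why the cumsum works, here is the idea in isolation on a made-up date_diff_in_days vector for a single customer (threshold = 2): every NA or gap above the threshold bumps the counter, and everything in between keeps the current id.
x <- c(NA, 0, 1, 6, 1)
cumsum(is.na(x) | x > 2)
# [1] 1 1 1 2 2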

In R, use mutate() to create a new column based on conditions by group

For each person there are two types of visits, and each visit has a date record. The dataset looks like below.
p <- c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(p, type, date)
So now the dataset looks like this.
> d
p type date
1 1 15 2014-02-03
2 1 20 2014-02-04
3 1 20 2014-02-06
4 2 15 2014-01-28
5 2 20 2014-02-03
6 2 15 2014-03-03
7 2 20 2014-03-13
8 3 20 2014-04-03
9 3 15 2014-04-09
10 3 15 2014-12-03
Now, I'd like to create three new columns:
1. An indicator of whether a type 20 visit happens within 7 days after the type 15 visit: 1 if yes, otherwise 0 (for example, for p2 the value in line 4 should be 1, and in line 6 it should be 0).
2. The first date of a type 20 visit that happened within 7 days after the type 15 visit. If there is no type 20 visit within 7 days after the type 15, leave it blank (for example, for p1 the value should be 2014-02-04 rather than 2014-02-06).
3. The number of days between the type 15 visit and the type 20 visit that happened within 7 days of it. If there is no type 20 visit within 7 days after the type 15, leave it blank (for example, the value in line 1 should be 1).
I'm a super newbie in R and basically have no idea what to do. I tried a for loop within each group, but it never works:
group_by(p) %>%
for (i in 1:length(date)) {
  if (type[i] == 15 && date[i] + 7 >= date[i+1:length(date)]) {
    indicator = 1
    first_date =
    days =      # Have no idea how to check this part
  } else {
    indicator = 0
    first_date = NA
    days = NA
  }
The expected output is as below.
p type date ind first_date days
1 1 15 2014-02-03 1 2014-02-04 1 # = 2014-02-04 - 2014-02-03
2 1 20 2014-02-04 NA <NA> NA
3 1 20 2014-02-06 NA <NA> NA
4 2 15 2014-01-28 1 2014-02-03 6 # = 2014-02-03 - 2014-01-28
5 2 20 2014-02-03 NA <NA> NA
6 2 15 2014-03-03 0 <NA> NA # since (2014-03-13 - 2014-03-03) > 7
7 2 20 2014-03-13 NA <NA> NA
8 3 20 2014-04-03 NA <NA> NA #I don't care about the value for type 20 lines
9 3 15 2014-04-09 0 <NA> NA
10 3 15 2014-12-03 0 <NA> NA
So I came up with a new idea: what if we group records by p and type == 15? Then we can use subtraction within groups for days, and the rest will be easy.
I found one way of doing this:
d[, group := cumsum(type == 15)]
However, this counts a new group at every new type 15 record regardless of p. How can I add p as another grouping condition?
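(A hedged aside: one way to make that counter restart per person, assuming d has been converted to a data.table with setDT(), is to add p to by=:
setDT(d)[, group := cumsum(type == 15), by = p]
This is essentially the grouping the base R answer below constructs with ave().)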
I took a stab at this. There's one caveat, though: my answer assumes that after a type 15 visit occurs, the next visit within 7 days will be a type 20 visit. If that's not the case, i.e. there's another type 15 visit within 7 days, the first type 15 visit won't be considered, and only the second type 15 visit matters:
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
  mutate(rownum = 1:n()) %>%
  spread(type, date, sep = "_") %>%
  group_by(p) %>%
  mutate(ind = ifelse(lead(type_20) - type_15 <= 7, 1, 0)) %>%
  mutate(ind = ifelse(is.na(ind), 0, ind)) %>%
  mutate(ind = ifelse(is.na(type_15), NA, ind)) %>%
  mutate(first_date = ifelse(ind == 1, lead(type_20), NA)) %>%
  mutate(first_date = as.Date(first_date, origin = lubridate::origin)) %>%
  mutate(days = first_date - type_15) %>%
  gather("type", "date", type_15, type_20) %>%
  filter(!is.na(date)) %>%
  arrange(p, date) %>%
  select(p, type, date, ind, first_date, days)
# p type date ind first_date days
# <dbl> <chr> <date> <dbl> <date> <time>
#1 1 type_15 2014-02-03 1 2014-02-04 1 days
#2 1 type_20 2014-02-04 NA <NA> NA days
#3 1 type_20 2014-02-06 NA <NA> NA days
#4 2 type_15 2014-01-28 1 2014-02-03 6 days
#5 2 type_20 2014-02-03 NA <NA> NA days
#6 2 type_15 2014-03-03 0 <NA> NA days
#7 2 type_20 2014-03-13 NA <NA> NA days
#8 3 type_20 2014-04-03 NA <NA> NA days
#9 3 type_15 2014-04-09 0 <NA> NA days
#10 3 type_15 2014-12-03 0 <NA> NA days
Let me try to explain what I'm doing:
First the type and date columns are spread so that type and date appear in separate columns (this makes it easier to compare dates of the two different types). Next come a couple of mutates. The first three apply the conditions outlined in the question, as follows: if lead(type_20) - type_15 <= 7, there was a type 20 visit within 7 days of a type 15 visit, so we mark it as 1, else 0. After this, if ind is NA we assume no type 20 visit was found, so we also mark it as 0. In the third mutate we mark the lines where type_15 is NA as NA.
The next three mutate lines add the columns outlined in 2 and 3 in the question.
Finally, the columns are gathered back up to their previous format, redundant rows are filtered out, the dataframe is arranged by p and date, and the needed columns are selected.
I hope this is clear enough. It might be helpful to run the code line by line, stopping to view the transformed data frame after each line to see how the transformations act on the dataframe.
If you're willing to use some functions from the purrr package and to use some custom functions, here is another option...
Packages you'll need
library(dplyr)
library(purrr)
Set up data (as per question)
p <- c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(p, type, date)
Create custom functions that will work with the purrr map_* functions to iterate through your data frame and create ind and first_date.
# Function to manage ind
ind_manager <- function(type, date, dates_20) {
if (type == 20)
return (NA_integer_)
checks <- map_lgl(dates_20, between, date, date + 7)
return (as.integer(any(checks)))
}
# Function to manage first_date
first_date_manager <- function(ind, date, dates_20) {
if (is.na(ind) || ind != 1)
return (NA_character_)
dates_20 <- dates_20[order(dates_20)]
as.character(dates_20[which.max(date < dates_20)])
}
Save a vector of dates where type == 20 to be used as comparisons
dates_20 <- d$date[d$type == 20]
The final mutate() call
# mutate() call to create variables
d %>%
mutate(
ind = map2_int(type, date, ind_manager, dates_20),
first_date = as.Date(map2_chr(ind, date, first_date_manager, dates_20)),
days = as.integer(first_date - date)
)
#> p type date ind first_date days
#> 1 1 15 2014-02-03 1 2014-02-04 1
#> 2 1 20 2014-02-04 NA <NA> NA
#> 3 1 20 2014-02-06 NA <NA> NA
#> 4 2 15 2014-01-28 1 2014-02-03 6
#> 5 2 20 2014-02-03 NA <NA> NA
#> 6 2 15 2014-03-03 0 <NA> NA
#> 7 2 20 2014-03-13 NA <NA> NA
#> 8 3 20 2014-04-03 NA <NA> NA
#> 9 3 15 2014-04-09 0 <NA> NA
#> 10 3 15 2014-12-03 0 <NA> NA
Here is a base R way. Generally, I prefer to create a function that does your task which can then be repeated on other pieces and debugged on test cases where it doesn't seem to work.
The first step is to define the pieces:
d <- structure(list(p = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15),
date = structure(c(16104, 16105, 16107, 16098, 16104, 16132, 16142, 16163, 16169, 16407), class = "Date")),
.Names = c("p", "type", "date"),
row.names = c(NA, -10L), class = "data.frame")
id <- with(d, {
id <- ave(type, p, FUN = function(x) cumsum(x == 15))
factor(paste0(p, id), unique(paste0(p, id)))
})
sp <- split(d, id)
So sp creates a list of data frames to which we will apply a function. Each piece is a single unique p with at most one type == 15 (plus however many type == 20s follow).
The first two pieces are
sp[1:2]
# $`11`
# p type date
# 1 1 15 2014-02-03
# 2 1 20 2014-02-04
# 3 1 20 2014-02-06
#
# $`21`
# p type date
# 4 2 15 2014-01-28
# 5 2 20 2014-02-03
And we can apply the function below on each one
first_date(sp[[1]])
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
first_date(sp[[2]])
# p type date ind first_date days
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
Or all at once with a loop
(sp1 <- lapply(sp, first_date))
`rownames<-`(do.call('rbind', sp1), NULL)
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA
# 7 2 20 2014-03-13 NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA
You can take advantage of the arguments, like window, or any others you add, without changing much of the function. For example, to change the window (note that applying first_date to sp1, which already holds the window = 7 results, appends the window = 14 columns alongside for comparison):
(sp2 <- lapply(sp1, first_date, window = 14))
`rownames<-`(do.call('rbind', sp2), NULL)
# p type date ind first_date days ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA 1 2014-03-13 10
# 7 2 20 2014-03-13 NA <NA> NA NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA 0 <NA> NA
The first_date function used above is defined as follows:
first_date <- function(data, window = 7) {
nr <- nrow(data)
## check at least one type 15 and > 1 row
ty15 <- data$type == 15
dt15 <- data$date[ty15]
if (!any(ty15) | nr == 1L)
return(cbind(data, ind = ifelse(any(ty15), 0, NA),
first_date = NA, days = NA))
## first date vector
dts <- rep(min(data$date[!ty15]), nr)
dts[!ty15] <- NA
## days from the type 15 date
days <- as.numeric(data$date[!ty15] - min(dt15))
days <- c(days, rep(NA, nr - length(days)))
## convert to NA if criteria not met
to_na <- days > window | is.na(dts)
days[to_na] <- dts[to_na] <- NA
## ind vector -- 1 or 0 if type 15, NA otherwise
ind <- rep(NA, nr)
ind[ty15] <- as.integer(!is.na(dts[ty15]))
## combine
cbind(data, ind = ind, first_date = dts, days = days)
}

How to recreate the table by key?

I thought this would be a very easy question, but I am a real beginner with R.
I have a data.table with lots of rows and several columns, two of which can be set as the key. I want to recreate the table by key.
For example, take the simple data below. In this case the key is ID and Act, and we get a total of 4 groups.
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
Here is a code to generate test data:
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                 ValueDate = c("2015-01-01","2015-02-01","2015-01-01","2015-02-01","2015-01-01",
                               "2015-02-01","2015-01-01","2015-02-01","2015-03-01"),
                 Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
                 Volume = c(21,22,12,11,5,7,4,2,6))
After the change, each column should represent a specific group defined by the key (ID and Act).
Below is the result:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
2015-01-01 21 12 5 4
2015-02-01 22 11 7 2
2015-03-01 NA NA NA 6
Thanks a lot!
What you are trying to do is not recreating the data.table, but reshaping it from a long format to a wide format. You can use dcast for this:
dcast(dd, ValueDate ~ ID + Act, value.var = "Volume")
which gives:
ValueDate 1_EUR 1_MAD 2_EUR 3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
If you want the numbers in the resulting columns to be preceded with ID, then you can use:
dcast(dd, ValueDate ~ paste0("ID",ID) + Act, value.var = "Volume")
which gives:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
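As a side note, the reverse reshaping (wide back to long) is done with melt(); here is a minimal sketch, where the variable.name/value.name choices are just illustrative:
wide <- dcast(dd, ValueDate ~ paste0("ID", ID) + Act, value.var = "Volume")
melt(wide, id.vars = "ValueDate", variable.name = "ID_Act", value.name = "Volume", na.rm = TRUE)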

Calculate cumulative product based on date by category

I want to add a new column to my data.table containing the cumulative product of Data1 based on the Date. The cumulative product should be calculated for each category (Cat) and should start with the latest available Date.
Sample data:
DF = data.frame(Cat = rep(c("A","B"), each = 4),
                Date = rep(c("01-08-2013","01-07-2013","01-04-2013","01-03-2013"), 2),
                Data1 = c(1:8))
DF$Date = as.Date(DF$Date, "%m-%d-%Y")
DT = data.table(DF)
DT[, Data1_cum := NA_real_]
DT
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 NA
2: A 2013-01-07 2 NA
3: A 2013-01-04 3 NA
4: A 2013-01-03 4 NA
5: B 2013-01-08 5 NA
6: B 2013-01-07 6 NA
7: B 2013-01-04 7 NA
8: B 2013-01-03 8 NA
The result should look like this:
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-07 2 2
3: A 2013-01-04 3 6
4: A 2013-01-03 4 24
5: B 2013-01-08 5 5
6: B 2013-01-07 6 30
7: B 2013-01-04 7 210
8: B 2013-01-03 8 1680
I figured out that I could do something similar using cumprod(), but I do not know how to handle the categories. NAs in Data1 should be ignored / treated as 1.
The real dataset has about 8 million rows and 1000 categories.
If the only issue is the ordering...
DT[order(Date, decreasing=TRUE), Data1_cum := cumprod(Data1), by=Cat]
DT
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-07 2 2
3: A 2013-01-04 3 6
4: A 2013-01-03 4 24
5: B 2013-01-08 5 5
6: B 2013-01-07 6 30
7: B 2013-01-04 7 210
8: B 2013-01-03 8 1680
However, if you have NA's to deal with, then there are a few extra steps:
Note: if you shuffle the order of the rows, your results can vary. Be careful with how you implement the order(.) command.
## Let's add some NA values
DT <- rbind(DT, DT)
DT[c(2, 6, 11, 15), Data1 := NA]
# shuffle the rows, to make sure this is right
set.seed(1)
DT <- DT[sample(nrow(DT))]
Assigning the cumulative product:
Leaving NA's
## If you want to leave the NA's as NA's in the cum prod, use:
DT[ , Data1_cum := NA_real_ ]
DT[ intersect(order(Date, decreasing=TRUE), which(!is.na(Data1)))
, Data1_cum := cumprod(Data1)
, by=Cat]
# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-08 1 1
3: A 2013-01-07 2 2
4: A 2013-01-07 NA NA <~~~~~~~ Note that the NA rows stay NA; cumprod skips them
5: A 2013-01-04 3 6
6: A 2013-01-04 NA NA <~~~~~~~ Note that the NA rows stay NA; cumprod skips them
7: A 2013-01-03 4 24
8: A 2013-01-03 4 96
9: B 2013-01-08 5 5
10: B 2013-01-08 5 25
11: B 2013-01-07 6 150
12: B 2013-01-07 NA NA <~~~~~~~ Note that the NA rows stay NA; cumprod skips them
13: B 2013-01-04 7 1050
14: B 2013-01-04 NA NA <~~~~~~~ Note that the NA rows stay NA; cumprod skips them
15: B 2013-01-03 8 8400
16: B 2013-01-03 8 67200
Treating NA's as 1 (the cumulative product carries the previous row's value forward)
## If instead you want to treat the NA's as 1, use:
DT[order(Date, decreasing=TRUE),
   Data1_cum := {x <- Data1; x[is.na(x)] <- 1; cumprod(x)},
   by=Cat]
# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-08 1 1
3: A 2013-01-07 2 2
4: A 2013-01-07 NA 2 <~~~~~~~ Rows with NA took on values of the previous Row
5: A 2013-01-04 3 6
6: A 2013-01-04 NA 6 <~~~~~~~ Rows with NA took on values of the previous Row
7: A 2013-01-03 4 24
8: A 2013-01-03 4 96
9: B 2013-01-08 5 5
10: B 2013-01-08 5 25
11: B 2013-01-07 6 150
12: B 2013-01-07 NA 150 <~~~~~~~ Rows with NA took on values of the previous Row
13: B 2013-01-04 7 1050
14: B 2013-01-04 NA 1050 <~~~~~~~ Rows with NA took on values of the previous Row
15: B 2013-01-03 8 8400
16: B 2013-01-03 8 67200
Alternatively, if you already have the cumulative product and simply want to fill in the NA's from the previous row, you can do so as follows:
# fix the NA's with the previous value
DT[order(Date, decreasing=TRUE),
Data1_cum := {tmp <- c(0, head(Data1_cum, -1));
Data1_cum[is.na(Data1_cum)] <- tmp[is.na(Data1_cum)];
Data1_cum }
, by=Cat ]
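As a side note, a more compact way to express the treat-NA-as-1 variant, assuming a data.table version that provides fifelse() (1.12.4 or later), would be a one-liner like this:
DT[order(Date, decreasing = TRUE),
   Data1_cum := cumprod(fifelse(is.na(Data1), 1, Data1)),
   by = Cat]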
