Related
In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of different id by country over the last 8 months, but only if id occurs at least twice during these 8 months. Hence, for the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists both in id and country. Since this could otherwise be taken care off, accounting for this is not necessary but still helpful.
My current best try is, but does not account for the restriction (ID must appear at least twice in the previous 8 months) and also I find its counting odd when looking at the dates="2017-12-12", where desired_unrestricted should be equal to 4 according to my counting but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment,
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine them with do.call(c, ...) (since c(..) creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join-period. Filter(fun, ...) allows us to reduce those that have a count below 2, and we're left with a named-vector of ids that had 2 or more visits. Retrieving the length is what we need.
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straight-forward. This is an open/closed ranging, it can be changed to both-closed if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
There are some discrepancies in my counts versus your desired_output, I wonder if those are just typos in the OP; I think the math is right ...
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: The number of ìds in the last 8 months which have a count of at least greater than 2.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
group_by(country) %>%
group_modify(~ .x %>%
rowwise() %>%
mutate(totalids = .x %>%
filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
pull(id) %>%
table() %>%
`[`(. >1) %>%
length
))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)
I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid on the 'id' column, slice the last row, remove the temporary grouping column
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr:
myData %>%
group_by(id,date_cat) %>%
top_n(1,date)
I'm trying to add missing lines for "day" and extrapolate the data for "value". In my data each subject ("id") has 2 periods (period 1 and period 2) and values for consecutive days.
An example of my data looks like this:
df <- data.frame(
id = c(1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3),
period = c(1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2),
day= c(1,2,4,5, 1,3,4,5, 2,3,4,5, 1,2,3,5, 2,3,4,5, 1,2,3,4),
value =c(10,12,15,16, 11,14,15,17, 13,14,15,16, 15,16,18,20, 16,17,19,29, 14,16,18,20))
For each id and period I am missing data for days 3,2,1,4,1,5, respectively. I want to expand the data to let's say 10 days and extrapolate the data on value column (e.g. with linear regression).
My final df should be something like that:
df2 <- data.frame(
id = c(1,1,1,1,1,1,1, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 2,2,2,2,2,2,2, 3,3,3,3,3,3,3, 3,3,3,3,3,3,3),
period = c(1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2),
day= c(1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7),
value =c(10,12,13,15,16,17,18, 11,12,14,15,17,18,19, 12,13,14,15,16,18,22, 15,16,18,19,20,22,23, 15,16,17,19,29,39,49, 14,16,18,20,22,24,26))
The most similar example I found doesn't extrapolate by two variables (ID and period in my case), it extrapolates only by year. I tried to adapt the code but no success :(
Another example extrapolates the data by multiple id but doesn't add rows for missing data.
I couldn't combine both codes with my limited experience in R. Any suggestions?
Thanks in advance...
We can use complete
library(dplyr)
library(tidyr)
library(forecast)
df %>%
group_by(id, period) %>%
complete(day =1:7)%>%
mutate(value = as.numeric(na.interp(value)))
#akrun's answer is good, as long as you don't mind using linear interpolation. However, if you do want to use a linear model, you could try this data.table approach.
library(data.table)
model <- lm(value ~ day + period + id,data=df)
dt <- as.data.table(df)[,.SD[,.(day = 1:7,value = value[match(1:7,day)])],by=.(id,period)]
dt[is.na(value), value := predict(model,.SD),]
dt
id period day value
1: 1 1 1 10.00000
2: 1 1 2 12.00000
3: 1 1 3 12.86714
4: 1 1 4 15.00000
5: 1 1 5 16.00000
6: 1 1 6 18.13725
7: 1 1 7 19.89396
8: 1 2 1 11.00000
9: 1 2 2 12.15545
10: 1 2 3 14.00000
11: 1 2 4 15.00000
12: 1 2 5 17.00000
13: 1 2 6 19.18227
14: 1 2 7 20.93898
15: 2 1 1 11.90102
16: 2 1 2 13.00000
17: 2 1 3 14.00000
18: 2 1 4 15.00000
19: 2 1 5 16.00000
20: 2 1 6 20.68455
21: 2 1 7 22.44125
22: 2 2 1 15.00000
23: 2 2 2 16.00000
24: 2 2 3 18.00000
25: 2 2 4 18.21616
26: 2 2 5 20.00000
27: 2 2 6 21.72957
28: 2 2 7 23.48627
29: 3 1 1 14.44831
30: 3 1 2 16.00000
31: 3 1 3 17.00000
32: 3 1 4 19.00000
33: 3 1 5 29.00000
34: 3 1 6 23.23184
35: 3 1 7 24.98855
36: 3 2 1 14.00000
37: 3 2 2 16.00000
38: 3 2 3 18.00000
39: 3 2 4 20.00000
40: 3 2 5 22.52016
41: 3 2 6 24.27686
42: 3 2 7 26.03357
id period day value
By using the data below, I want to create a new unique customer id by considering their contact date.
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
I couldn't go any further than calculating date differences.
The original dataset I work is bigger; therefore, I prefer a data.table solution if possible.
library(data.table)
treshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
"2019-01-08","2019-01-09","2019-02-02","2019-02-05",
"2019-02-05","2019-02-09","2019-02-12","2019-02-01",
"2019-02-01","2019-02-05")),
desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
class = "data.frame",
row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days:=contact_date - shift(contact_date, type = c("lag")), by=customer_id]
dt[, date_diff_in_days:=as.numeric(date_diff_in_days)]
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by=.(
customer_id,
cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check on your actual data (comparing against results from, eg, Gregor's approach) to be sure.
We use cumsum to increment whenever date_diff_in_days is NA or when the threshold is exceeded.
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > treshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
I have a large panel dataset (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000) and date.
Now I want to calculate a rolling sum of three values including the prior two dates' and the current date's scores, creating a new corresponding sum column.
The sums should be calculated for each score column by date and group.
For 1st and 2nd dates within a group, fewer than 3 values is allowed.
GROUP DATE LAGGED SCORE1 SUM1 SCORE2 SUM2 ... SCORE50 SUM50
#1 A 2017-04-01 2017-03-30 1 1|1 2 2|2 4 4|4
#2 A 2017-04-02 2017-03-31 1 1+1|2 3 3+2|5 3 3+4|7
#3 A 2017-04-04 2017-04-02 2 2+1+1|4 4 4+3+2|9 2 2+3+4|9
#5 B 2017-04-02 2017-03-31 2 2|2 3 3|3 1 1|1
#6 B 2017-04-05 2017-04-03 2 2+2|4 2 2+3|5 1 1+1|2
#7 B 2017-04-08 2017-04-06 3 3+2+2|7 1 1+2+3|6 3 3+1+1|5
#8 C 2017-04-02 2017-03-31 3 3|3 1 1|1 1 1|1
#9 C 2017-04-03 2017-04-01 2 2+3|5 3 3+1|4 2 2+1|3
: : : : : : : : : :
#10M XX 2018-03-30 2018-03-28 2 2 1 1 ... 1 1
David's answer from this post covered most of my questions on summing rolling windows by groups but I'm still missing a couple pieces.
library(data.table) #v1.10.4
## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
Date = as.IDate(Date)
Date2 = Date - 2L
.(Date, Date2)
}]
## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")), ## This removes the dupes
on = .(Group, Date <= Date, Date >= Date2), ## The join condition
.(Score = sum(Score)), ## sum the scores
keyby = .EACHI] ## Run the sum by each row in
## unique(input, by = c("Date", "Group"))
My question has two parts:
What code should replace "Score" to calculate time window sums for each column in a range of columns?
Is the solution provided the most efficient version for fast calculation on large dataset?
A possible solution:
cols <- grep('^SCORE', names(input), value = TRUE)
input[, gsub('SCORE','SUM',cols) := lapply(.SD, cumsum)
, by = GROUP
, .SDcols = cols][]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 4 9
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 4 5
6: B 2017-04-08 2017-04-06 3 1 7 6
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
When you want to take a time window into account as well, you could do (assuming LAGGED is the start of the time-window):
input[input[input[, .(GROUP, DATE, LAGGED)]
, on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
][, setNames(lapply(.SD, sum), gsub('SCORE','SUM',cols))
, by = .(GROUP, DATE = DATE.1)
, .SDcols = cols]
, on = .(GROUP, DATE)]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 3 7
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 2 2
6: B 2017-04-08 2017-04-06 3 1 3 1
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
Used data:
input <- fread(' GROUP DATE LAGGED SCORE1 SCORE2
A 2017-04-01 2017-03-30 1 2
A 2017-04-02 2017-03-31 1 3
A 2017-04-04 2017-04-02 2 4
B 2017-04-02 2017-03-31 2 3
B 2017-04-05 2017-04-03 2 2
B 2017-04-08 2017-04-06 3 1
C 2017-04-02 2017-03-31 3 1
C 2017-04-03 2017-04-01 2 3')