In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of distinct ids by country over the last 8 months, but only if an id occurs at least twice during those 8 months. For the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists both in id and country. Since this could otherwise be taken care of, accounting for it is not necessary but still helpful.
My current best try is below. It does not account for the restriction (an id must appear at least twice in the previous 8 months), and I also find its counting odd: for date "2017-12-12", desired_unrestricted should equal 4 by my count, but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment:
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine them with do.call(c, ...) (since c(..) creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join period. Filter(fun, ...) lets us drop those that have a count below 2, leaving a named vector of ids that had 2 or more visits. Its length is what we need.
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straightforward. This is an open/closed range (date8 excluded, date included); it can be changed to both-closed if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
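As a quick sanity check of the date arithmetic in the walk-through, here is a minimal base-R sketch (the helper name back8 is made up for illustration). It also hints at why the attempt in the question seemed to look back only a few days: subtracting a bare number from a Date shifts it by days, and as far as I can tell lubridate::month(8) just hands back the number 8 rather than an 8-month period.

```r
# Base R only; `back8` is a hypothetical helper name used for illustration.
back8 <- function(z) seq(z, length.out = 2, by = "-8 months")[2]

d <- as.Date("2017-12-12")
back8(d)   # eight calendar months earlier: "2017-04-12"
d - 8      # eight *days* earlier: "2017-12-04"

dates <- as.Date(c("2017-01-05", "2017-12-12"))
class(sapply(dates, back8))               # "numeric": the Date class is lost
class(do.call(c, lapply(dates, back8)))   # "Date": the class survives
```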
There are some discrepancies between my counts and your desired_output. I wonder if those are just typos in the OP; I think the math is right ...
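To make the counting step easier to check by hand, here is a small sketch using only the US ids that fall in the 8 months up to 2017-12-12 (copied from the sample data; the variable names are mine). Note that table() drops NA by default, so the NA id never contributes to the count.

```r
# US ids with 2017-04-12 < date <= 2017-12-12, taken from the sample data
ids <- c("1", "2", "3", "3", NA, "4")

counts   <- table(ids)                          # NA is dropped by default
repeated <- Filter(function(n) n > 1, counts)   # keep ids seen at least twice

names(repeated)    # "3"
length(repeated)   # 1, the value in `results` for the 2017-12-12 rows

# Unrestricted variant: distinct non-NA ids in the same window
length(unique(ids[!is.na(ids)]))   # 4
```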
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
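For readers unsure what the duplicated() idiom inside the join computes, here is a rough base-R sketch on a made-up id vector (unique() plus length() stands in for data.table's uniqueN). duplicated() flags the second and later occurrences, so an id seen three times is flagged twice and still counts once.

```r
# Made-up ids for illustration; not taken from the question's data
v <- c("1", "1", "1", "2", "3", "3", NA)
v <- v[!is.na(v)]                          # same NA handling as in the answer

duplicated(v)                              # FALSE TRUE TRUE FALSE FALSE TRUE
repeated_ids <- unique(v[duplicated(v)])   # "1" "3"

length(repeated_ids)   # 2 ids occur at least twice (restricted count)
length(unique(v))      # 3 distinct ids overall (unrestricted count)
```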
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: the number of ids in the last 8 months that occur more than once.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
group_by(country) %>%
group_modify(~ .x %>%
rowwise() %>%
mutate(totalids = .x %>%
filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
pull(id) %>%
table() %>%
`[`(. >1) %>%
length
))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)
I'm trying to add missing lines for "day" and extrapolate the data for "value". In my data each subject ("id") has 2 periods (period 1 and period 2) and values for consecutive days.
An example of my data looks like this:
df <- data.frame(
id = c(1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3),
period = c(1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2),
day= c(1,2,4,5, 1,3,4,5, 2,3,4,5, 1,2,3,5, 2,3,4,5, 1,2,3,4),
value =c(10,12,15,16, 11,14,15,17, 13,14,15,16, 15,16,18,20, 16,17,19,29, 14,16,18,20))
For each id and period I am missing data for days 3,2,1,4,1,5, respectively. I want to expand the data to let's say 10 days and extrapolate the data on value column (e.g. with linear regression).
My final df should be something like that:
df2 <- data.frame(
id = c(1,1,1,1,1,1,1, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 2,2,2,2,2,2,2, 3,3,3,3,3,3,3, 3,3,3,3,3,3,3),
period = c(1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2),
day= c(1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7),
value =c(10,12,13,15,16,17,18, 11,12,14,15,17,18,19, 12,13,14,15,16,18,22, 15,16,18,19,20,22,23, 15,16,17,19,29,39,49, 14,16,18,20,22,24,26))
The most similar example I found doesn't extrapolate by two variables (ID and period in my case), it extrapolates only by year. I tried to adapt the code but no success :(
Another example extrapolates the data by multiple id but doesn't add rows for missing data.
I couldn't combine both codes with my limited experience in R. Any suggestions?
Thanks in advance...
We can use complete
library(dplyr)
library(tidyr)
library(forecast)
df %>%
group_by(id, period) %>%
complete(day =1:7)%>%
mutate(value = as.numeric(na.interp(value)))
@akrun's answer is good, as long as you don't mind using linear interpolation. However, if you do want to use a linear model, you could try this data.table approach.
library(data.table)
model <- lm(value ~ day + period + id,data=df)
dt <- as.data.table(df)[,.SD[,.(day = 1:7,value = value[match(1:7,day)])],by=.(id,period)]
dt[is.na(value), value := predict(model,.SD),]
dt
id period day value
1: 1 1 1 10.00000
2: 1 1 2 12.00000
3: 1 1 3 12.86714
4: 1 1 4 15.00000
5: 1 1 5 16.00000
6: 1 1 6 18.13725
7: 1 1 7 19.89396
8: 1 2 1 11.00000
9: 1 2 2 12.15545
10: 1 2 3 14.00000
11: 1 2 4 15.00000
12: 1 2 5 17.00000
13: 1 2 6 19.18227
14: 1 2 7 20.93898
15: 2 1 1 11.90102
16: 2 1 2 13.00000
17: 2 1 3 14.00000
18: 2 1 4 15.00000
19: 2 1 5 16.00000
20: 2 1 6 20.68455
21: 2 1 7 22.44125
22: 2 2 1 15.00000
23: 2 2 2 16.00000
24: 2 2 3 18.00000
25: 2 2 4 18.21616
26: 2 2 5 20.00000
27: 2 2 6 21.72957
28: 2 2 7 23.48627
29: 3 1 1 14.44831
30: 3 1 2 16.00000
31: 3 1 3 17.00000
32: 3 1 4 19.00000
33: 3 1 5 29.00000
34: 3 1 6 23.23184
35: 3 1 7 24.98855
36: 3 2 1 14.00000
37: 3 2 2 16.00000
38: 3 2 3 18.00000
39: 3 2 4 20.00000
40: 3 2 5 22.52016
41: 3 2 6 24.27686
42: 3 2 7 26.03357
id period day value
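The model above is fitted once, so every id and period shares the same slope for day. If a separate straight line per id and period is preferred, a rough base-R sketch could look like the following. The helper fill_group is hypothetical, and only the rows for id 1 from the question are used to keep the example short.

```r
# Rows for id 1 from the question's example data
df <- data.frame(
  id     = c(1, 1, 1, 1,  1, 1, 1, 1),
  period = c(1, 1, 1, 1,  2, 2, 2, 2),
  day    = c(1, 2, 4, 5,  1, 3, 4, 5),
  value  = c(10, 12, 15, 16,  11, 14, 15, 17))

# Hypothetical helper: expand one (id, period) group to `days`, keep the
# observed values, and fill the gaps from that group's own linear fit.
fill_group <- function(g, days = 1:7) {
  fit <- lm(value ~ day, data = g)
  out <- data.frame(id = g$id[1], period = g$period[1], day = days)
  out$value <- g$value[match(days, g$day)]
  miss <- is.na(out$value)
  out$value[miss] <- predict(fit, newdata = out[miss, , drop = FALSE])
  out
}

groups <- split(df, list(df$id, df$period), drop = TRUE)
filled <- do.call(rbind, lapply(groups, fill_group))
```

With these numbers the fitted line for id 1, period 1 has slope 1.5, so the missing day 3 is filled with 13.25 while the observed days are left untouched.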
Is it possible to create a binary variable based on the first appearance of another (date) variable?
For my thesis I am trying to create a variable that captures the number of first-time forecasts issued and revised during the month divided by the number of forecasts at the month-end for a firm in a given year. For convenience I would like to separate the first-time forecasts issued and revised in different columns.
Example data
dt <- data.table(
analyst = rep((1:2),10),
id = rep((1:5),4),
year = rep(as.Date(c('2009-12-31','2009-12-31','2010-12-31','2010-12-31'),format='%Y-%m-%d'),5),
fdate = rep(as.Date(c('2009-07-31','2009-02-26','2010-01-31','2010-05-15','2009-06-30','2009-10-08','2010-07-31','2010-11-30','2009-01-31','2009-06-26','2010-05-03','2010-04-13','2009-10-30','2009-11-02','2010-03-28','2010-10-14','2009-02-17','2009-09-14','2010-08-02','2010-10-03'),format='%Y-%m-%d')))
To create the variable, I used the following steps:
First, identifying the issuance of the first-time forecasts for a given year (for firms by analysts) with the following code:
dt2 <- setkey(setDT(dt), id, year, analyst)[order(fdate),.SD[1L] ,by=list(id,year)]
However, this generates a table with only the first-time forecast by id, year and analyst. Secondly, I give the first-time forecasts the value 1 with:
dt3 <- print(dt2[, first:=1L])
Third, combine the two data.tables:
dt4 <- dt3[dt, on = c('id', 'year', 'analyst', 'fdate')]
Fourth, I replace the NAs with 0:
dt4[is.na(dt4)] <- 0
Fifth, creating the revised binary variable:
dt4$rev <- ifelse(dt4$first == 0,"1", "0")
Last, I sum the number of first-time and revised forecasts for every month for a firm.
Is there a more elegant way of creating this variable so I can learn more of R/data.table? I have tried to incorporate the dcast function, based on the answers from:
R data.table - categorical values in one column to binary values in multiple columns
How to programmatically create binary columns based on a categorical variable in data.table?
Data table dcast column headings
However, it doesn't work out for me.
Current result, based on the previous mentioned steps:
id year analyst fdate first rev
1 2009-12-31 1 2009-07-31 1 0
1 2009-12-31 2 2009-10-08 0 1
1 2010-12-31 1 2010-05-03 1 0
1 2010-12-31 2 2010-10-14 0 1
2 2009-12-31 1 2009-02-17 1 0
2 2009-12-31 2 2009-02-26 0 1
2 2010-12-31 1 2010-07-31 0 1
2 2010-12-31 2 2010-04-13 1 0
3 2009-12-31 1 2009-10-30 0 1
3 2009-12-31 2 2009-09-14 1 0
3 2010-12-31 1 2010-01-31 1 0
3 2010-12-31 2 2010-11-30 0 1
4 2009-12-31 1 2009-01-31 1 0
4 2009-12-31 2 2009-11-02 0 1
4 2010-12-31 1 2010-08-02 0 1
4 2010-12-31 2 2010-05-15 1 0
5 2009-12-31 1 2009-06-30 0 1
5 2009-12-31 2 2009-06-26 1 0
5 2010-12-31 1 2010-03-28 1 0
5 2010-12-31 2 2010-10-03 0 1
We can replace the ifelse and the base R steps. Create 'first' as 0, then join with 'dt2' on the columns in the post and set 'first' to 1 for the matching rows. Then negate 'first' (!), convert the result to integer with unary + (or as.integer), and assign it to 'rev':
dt[, first := 0][dt2, first := 1, on = .(id, year, analyst, fdate)]
dt[, rev := +(!first)][]
# analyst id year fdate first rev
# 1: 1 1 2009-12-31 2009-07-31 1 0
# 2: 2 1 2009-12-31 2009-10-08 0 1
# 3: 1 1 2010-12-31 2010-05-03 1 0
# 4: 2 1 2010-12-31 2010-10-14 0 1
# 5: 1 2 2009-12-31 2009-02-17 1 0
# 6: 2 2 2009-12-31 2009-02-26 0 1
# 7: 1 2 2010-12-31 2010-07-31 0 1
# 8: 2 2 2010-12-31 2010-04-13 1 0
# 9: 1 3 2009-12-31 2009-10-30 0 1
#10: 2 3 2009-12-31 2009-09-14 1 0
#11: 1 3 2010-12-31 2010-01-31 1 0
#12: 2 3 2010-12-31 2010-11-30 0 1
#13: 1 4 2009-12-31 2009-01-31 1 0
#14: 2 4 2009-12-31 2009-11-02 0 1
#15: 1 4 2010-12-31 2010-08-02 0 1
#16: 2 4 2010-12-31 2010-05-15 1 0
#17: 1 5 2009-12-31 2009-06-30 0 1
#18: 2 5 2009-12-31 2009-06-26 1 0
#19: 1 5 2010-12-31 2010-03-28 1 0
#20: 2 5 2010-12-31 2010-10-03 0 1
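To see the update join and the +(!x) trick in isolation, here is a minimal sketch on a made-up four-row table (all_rows, first_rows and first_alt are illustrative names, not from the post). The last line is an alternative I would consider, not part of the answer above: it flags the earliest date per group directly, with the caveat that tied dates would all be flagged.

```r
library(data.table)

# Made-up forecasts: two per id
all_rows <- data.table(
  id    = c(1, 1, 2, 2),
  fdate = as.Date(c("2009-02-01", "2009-07-01", "2009-03-01", "2009-01-15")))

# Earliest forecast per id, as in the question's first step
first_rows <- all_rows[order(fdate), .SD[1L], by = id]

# Update join: start at 0, set matching rows to 1, then flip for `rev`
all_rows[, first := 0L][first_rows, first := 1L, on = .(id, fdate)]
all_rows[, rev := +(!first)]   # !0L is TRUE, unary + turns TRUE into 1L

# Possible shortcut without the join (assumes no tied dates within an id)
all_rows[, first_alt := +(fdate == min(fdate)), by = id]
all_rows
```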
We will start with the following data.table:
id date
1: 1 2015-12-31
2: 1 2014-12-31
3: 1 2013-12-31
4: 1 2012-12-31
5: 1 2011-12-31
6: 2 2015-12-31
7: 2 2014-12-31
8: 2 2014-01-25
9: 2 2013-01-25
10: 2 2012-01-25
library(data.table)
DT <- data.table(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
as.IDate(c("2015-12-31", "2014-12-31", "2013-12-31", "2012-12-31",
"2011-12-31", "2015-12-31", "2014-12-31", "2014-01-25",
"2013-01-25", "2012-01-25")))
setnames(DT, c("id", "date"))
For every unique id, I want to create a ranking. The most recent date for a specific id should have a rank of 0. Going back one year from that date should then give rank -1, and so on. If the month is not the same as the month of the rank-0 date, the ranking should stop. For example, at line 8, for id=2, the month is not December, so the ranking should stop.
We would get the following result:
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 NA
9: 2 2013-01-25 NA
10: 2 2012-01-25 NA
I have the following code so far (given by @Frank and @akrun):
DT <- DT[order(id, -date)]
DT <- DT[,rank_year := { z = month(date) + year(date)*12
as.integer( (z - z[1L])/12) # 12 months
}, by = id]
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 -1
9: 2 2013-01-25 -2
10: 2 2012-01-25 -3
Ok, I guess I would do it like
DT[, rank_year := replace(
year(date) - year(date)[1L],
month(date) != month(date[1L]),
NA_integer_
), by=id]
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 NA
9: 2 2013-01-25 NA
10: 2 2012-01-25 NA
See ?replace for details on how this works.
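For a feel of what replace() does here, a small base-R sketch on four of the id=2 dates; format() stands in for data.table's year() and month() so the snippet runs without any packages.

```r
d  <- as.Date(c("2015-12-31", "2014-12-31", "2014-01-25", "2013-01-25"))
yr <- as.integer(format(d, "%Y"))
mo <- as.integer(format(d, "%m"))

# replace(x, condition, value): wherever the month differs from the
# first row's month, overwrite the year difference with NA
rank_year <- replace(yr - yr[1L], mo != mo[1L], NA_integer_)
rank_year   # 0 -1 NA NA
```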
One way of extending the old answer is
DT[, r := {
z = month(date) + year(date)*12
res = (z - z[1L])/12
as.integer( replace(res, res %% 1 != 0, NA) )
}, by=id]
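The reason the earlier code gave -1, -2, ... for the January rows is that as.integer() truncates toward zero, which hides a fractional year difference. A base-R sketch of the same arithmetic on four of the id=2 dates (the helper as_months is a made-up name):

```r
# Hypothetical helper: months since year 0, same idea as month + year * 12
as_months <- function(d) {
  as.integer(format(d, "%m")) + as.integer(format(d, "%Y")) * 12L
}

d   <- as.Date(c("2015-12-31", "2014-12-31", "2014-01-25", "2013-01-25"))
z   <- as_months(d)
res <- (z - z[1L]) / 12   # 0, -1, -23/12, -35/12

as.integer(res)                               # 0 -1 -1 -2: mismatch hidden
as.integer(replace(res, res %% 1 != 0, NA))   # 0 -1 NA NA
```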
My question is somewhat related to Fastest way to add rows for missing values in a data.frame? but a bit tougher, I think, and I can't figure out how to adapt that solution to my problem.
Here is what my data.table looks like :
ida idb value date
1: A 2 26600 2004-12-31
2: A 3 19600 2005-03-31
3: B 3 18200 2005-06-30
4: B 4 1230 2005-09-30
5: C 2 8700 2005-12-31
The difference is that every 'ida' has its own dates, and there is at least one row where 'ida' appears with each date, but not necessarily for all 'idb'. I want to insert every missing ('ida','idb') pair with the corresponding date and 0 as the value.
Moreover, there is no periodicity for the dates.
How would you do this ?
Desired output :
ida idb value date
1: A 2 26600 2004-12-31
2: A 2 0 2005-03-31
3: A 3 19600 2005-03-31
4: A 3 0 2004-12-31
5: B 3 18200 2005-06-30
6: B 3 0 2005-09-30
7: B 4 1230 2005-09-30
8: B 4 0 2005-06-30
9: C 2 8700 2005-12-31
The order doesn't matter. Every date missing is filled with a 0 value.
You just do the same thing as in your linked question by each ida:
setkey(dt, idb, date)
dt[, .SD[CJ(unique(idb), unique(date))], by = ida][is.na(value), value := 0][]
# ida idb value date
#1: A 2 26600 2004-12-31
#2: A 2 0 2005-03-31
#3: A 3 0 2004-12-31
#4: A 3 19600 2005-03-31
#5: C 2 8700 2005-12-31
#6: B 3 18200 2005-06-30
#7: B 3 0 2005-09-30
#8: B 4 0 2005-06-30
#9: B 4 1230 2005-09-30
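For anyone who wants to see the pieces separately, here is a sketch that rebuilds the question's table and first prints the grid CJ() produces for one ida. It is a variant of the code above, not the same call: the join columns are named through on= instead of relying on setkey, which I assume behaves the same way here. grid_A and filled are illustrative names.

```r
library(data.table)

dt <- data.table(
  ida   = c("A", "A", "B", "B", "C"),
  idb   = c(2, 3, 3, 4, 2),
  value = c(26600, 19600, 18200, 1230, 8700),
  date  = as.Date(c("2004-12-31", "2005-03-31", "2005-06-30",
                    "2005-09-30", "2005-12-31")))

# All (idb, date) combinations seen within ida "A": 2 x 2 = 4 rows
grid_A <- dt[ida == "A", CJ(idb = unique(idb), date = unique(date))]

# Join each group's rows onto its own grid; unmatched cells come back as NA
filled <- dt[, .SD[CJ(idb = unique(idb), date = unique(date)),
                   on = .(idb, date)],
             by = ida]
filled[is.na(value), value := 0]
filled
```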