I am trying to find the max date across rows of a data.table using lapply. I have some rows where all values in the row are NA and in this case I want to return a specific date. I wrote a function to do this but I am not getting the results that I expected.
library(data.table)
my.max = function(x){
  if(all(is.na(x))){
    return(as.Date("9999-12-01")) # we can use this to identify which BPIDs have no end date
  } else {
    return(max(x, na.rm = TRUE))
  }
}
DT = data.table("Date1" = c(as.Date("2015-12-30"), NA, NA),
                "Date2" = c(as.Date("2013-02-04"), as.Date("2014-01-01"), NA))
DT[ , "Row" := 1:.N]
DT[ , "Max_Date" := lapply(.SD, my.max), by = .(Row), .SDcols = c("Date1", "Date2")]
This returns
> DT
Date1 Date2 Row Max_Date
1: 2015-12-30 2013-02-04 1 2015-12-30
2: <NA> 2014-01-01 2 9999-12-01
3: <NA> <NA> 3 9999-12-01
So, it does work if all values are NA, but if only one of the values is NA it also returns 9999-12-01. I put print statements into my.max to find out what was happening, and it looks like it is passed one value of x at a time. That explains why all(is.na(x)) would be TRUE, but I expected it to be passed a vector of both dates in the row. Otherwise, how would it know which values to take the max of?
How can I change my function so it returns 9999-12-01 only if both of the other dates are NA?
Here is one method that will work. It encapsulates multiple statements in {} to form a single code block:
DT[, "this" := {temp=pmax(Date1, Date2, na.rm=TRUE);
temp[is.na(temp)] = as.Date("9999-12-01"); temp}]
which returns
DT
Date1 Date2 this
1: 2015-12-30 2013-02-04 2015-12-30
2: <NA> 2014-01-01 2014-01-01
3: <NA> <NA> 9999-12-01
data
DT = data.table("Date1" = c(as.Date("2015-12-30"),NA, NA),
"Date2" = c(as.Date("2013-02-04"), as.Date("2014-01-01"), NA))
This way, you don't have to loop through each row, which can be quite slow.
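If there are more than two date columns, the same idea extends without naming each column in pmax. A minimal sketch, assuming every column listed in date_cols is of class Date (fifelse() is data.table's type-stable ifelse):
date_cols <- c("Date1", "Date2")  # extend with further date columns as needed
DT[, Max_Date := {
  temp <- do.call(pmax, c(.SD, na.rm = TRUE))        # row-wise max across all columns
  fifelse(is.na(temp), as.Date("9999-12-01"), temp)  # all-NA rows get the sentinel date
}, .SDcols = date_cols]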
While I don't recommend by-row processing, note that the original attempt fails because lapply(.SD, my.max) calls my.max once per column of .SD, and with by = Row each column has length 1, so all(is.na(x)) is TRUE whenever that single value is NA. Passing the whole row in as one vector,
DT[ , "Row" := 1:.N]
DT[ , "Max_Date" := my.max(unlist(.SD)), by = .(Row), .SDcols = c("Date1", "Date2")]
will produce the same output for this example.
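The as.Date(..., origin = "1970-01-01") wrapper is needed because unlist() drops the Date class and returns the underlying day counts:
unlist(DT[1, .(Date1, Date2)])
# Date1 Date2
# 16799 15740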
Try this out:
library(data.table)
my.max <- function(x){
  if(all(is.na(x))){
    return("9999-12-01")
  } else {
    return(max(x, na.rm = TRUE))
  }
}
DT <- data.table("Date1" = c(as.Date("2015-12-30"),NA, NA), "Date2" = c(as.Date("2013-02-04"), as.Date("2014-01-01"), NA))
print(DT)
DT[ , "Max_Date" ] <- apply(DT, 1, my.max)
print(DT)
> DT <- data.table("Date1" = c(as.Date("2015-12-30"),NA, NA), "Date2" = c(as.Date("2013-02-04"), as.Date("2014-01-01"), NA))
> print(DT)
Date1 Date2
1: 2015-12-30 2013-02-04
2: <NA> 2014-01-01
3: <NA> <NA>
> DT[ , "Max_Date" ] <- apply(DT, 1, my.max)
> print(DT)
Date1 Date2 Max_Date
1: 2015-12-30 2013-02-04 2015-12-30
2: <NA> 2014-01-01 2014-01-01
3: <NA> <NA> 9999-12-01
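One caveat worth knowing: apply() first converts the data.table to a character matrix via as.matrix(), so my.max() compares strings and Max_Date comes back as a character column. It happens to work here because ISO "YYYY-MM-DD" dates sort correctly as text. A quick check:
# each row arrives in my.max() as a character vector:
apply(DT[, .(Date1, Date2)], 1, class)
# [1] "character" "character" "character"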
I'm trying to cross join a data.table by three variables (group, id, and date). The R code below accomplishes exactly what I want to do, i.e., each id within each group is expanded to include all of the dates_wanted. But is there a way to do the same thing more efficiently using the excellent data.table package?
library(data.table)
data <- data.table(
  group = c(rep("A", 10), rep("B", 10)),
  id = c(rep("frank", 5), rep("tony", 5), rep("arthur", 5), rep("edward", 5)),
  date = seq(as.IDate("2020-01-01"), as.IDate("2020-01-20"), by = "day")
)
data
dates_wanted <- seq(as.IDate("2020-01-01"), as.IDate("2020-01-31"), by = "day")
names_A <- data[group == "A"][["id"]]
names_B <- data[group == "B"][["id"]]
names_A <- CJ(group = "A", id = names_A, date = dates_wanted, unique = TRUE)
names_B <- CJ(group = "B", id = names_B, date = dates_wanted, unique = TRUE)
alldates <- rbind(names_A, names_B)
alldates
data[alldates, on = .(group, id, date)]
You can also do this (j is evaluated once per (group, id) pair, so the full dates_wanted vector becomes the date column for each pair):
data[, .(date=dates_wanted), .(group,id)]
Output:
group id date
1: A frank 2020-01-01
2: A frank 2020-01-02
3: A frank 2020-01-03
4: A frank 2020-01-04
5: A frank 2020-01-05
---
120: B edward 2020-01-27
121: B edward 2020-01-28
122: B edward 2020-01-29
123: B edward 2020-01-30
124: B edward 2020-01-31
We can use do.call with CJ on the id column and the dates_wanted vector, grouped by group:
out <- data[, do.call(CJ, c(.(id = id, date = dates_wanted),
unique = TRUE)), group]
... checking:
> dim(out)
[1] 124 3
> out0 <- data[alldates, on = .(group, id, date)]
> dim(out0)
[1] 124 3
> all.equal(out, out0)
[1] TRUE
Data:
set.seed(42)
df1 = data.frame(
  Date = seq.Date(as.Date("2018-01-01"), as.Date("2018-01-30"), 1),
  value = sample(1:30),
  Y = sample(c("yes", "no"), 30, replace = TRUE)
)
df2 = data.frame(
  Date = seq.Date(as.Date("2018-01-01"), as.Date("2018-01-30"), 7)
)
For a sum when the data falls within a range, this works (from my previous question):
library(data.table)
df1$start <- df1$Date
df1$end <- df1$Date
df2$start <- df2$Date
df2$end <- df2$Date + 6
setDT(df1, key = c("start", "end"))
setDT(df2, key = c("start", "end"))
d = foverlaps(df1, df2)[, list(mySum = sum(value)), by = Date ]
How can I do a countif? When I try
d = foverlaps(df1, df2)[, list(mySum = count(value)), by = Date ]
I get the error:
no applicable method for 'groups' applied to an object of class "c('double', 'numeric')"
We can use .N (count() is not a base R or data.table function; the error message suggests it is dplyr's count(), which expects a data frame rather than a vector):
foverlaps(df1, df2)[, list(myCount = .N), by = Date ]
# Date myCount
# 1: 2018-01-01 7
# 2: 2018-01-08 7
# 3: 2018-01-15 7
# 4: 2018-01-22 7
# 5: 2018-01-29 2
d = foverlaps(df1, df2)[, .N, by = Date]
If you want to count the number of rows per Date, you can try .N
foverlaps(df1, df2)[, .(mysum = .N), by = Date ]
Date mysum
1: 2018-01-01 7
2: 2018-01-08 7
3: 2018-01-15 7
4: 2018-01-22 7
5: 2018-01-29 2
If you want the count of unique values per Date you can try uniqueN()
foverlaps(df1, df2)[, .(mysum = uniqueN(value)), by = Date ]
Date mysum
1: 2018-01-01 7
2: 2018-01-08 7
3: 2018-01-15 7
4: 2018-01-22 7
5: 2018-01-29 2
Both .N and uniqueN() are from {data.table}.
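They agree here only because sample(1:30) produces no duplicated values; with repeats the two differ:
x <- c(5, 5, 7)
length(x)                # 3 elements in total
data.table::uniqueN(x)   # 2 distinct values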
Instead of list(mySum = count(value)), try c(mySum = count(value)). The code then runs for me.
d2 <- foverlaps(df1, df2)[, c(mySum = count(value)), by = Date ]
library(data.table)
library(lubridate)
x1 <- c(20090101, "2009-01-02", "2009 01 03", "2009-1-4",
        "2009-1, 5", "Created on 2009 1 6", "200901 !!! 07")
dt2 <- data.table(id = c(1, 1, 1, 2, 2, 2, 2), date1 = ymd(x1),
                  charval = c("aa", "vv", "ss", "a", "b", "c", "d"))
id date1 charval
1: 1 2009-01-01 aa
2: 1 2009-01-02 vv
3: 1 2009-01-03 ss
4: 2 2009-01-04 a
5: 2 2009-01-05 b
6: 2 2009-01-06 c
7: 2 2009-01-07 d
I use the following code to group by id:
dt3 <- dt2[, Map(function(x,y) ifelse(x != "paste", get(x)(y, na.rm = TRUE), paste(y, sep = ";")),
setNames(c("mean", "paste"), names(.SD)), .SD), by = id]
to get something like this:
id date1 charval
1: 1 2009-01-02 aa;vv;ss
2: 2 2009-01-05 a;b;c;d
but in reality I get this result:
id date1 charval
1: 1 NA aa
2: 2 NA a
1) I don't understand why paste doesn't work.
2) I don't understand why mean(date1) doesn't work, because, for example, this code works fine:
mean(dt2$date1)
[1] "2009-01-04"
It is not clear why we have to go through Map and get. After grouping by 'id', get the mean of 'date1' and paste the 'charval' together
dt2[, .(date1 = mean(date1), charval = toString(charval)), id]
# id date1 charval
#1: 1 2009-01-02 aa, vv, ss
#2: 2 2009-01-05 a, b, c, d
Note: toString(x) is paste(x, collapse = ', ').
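This also explains why the OP's paste(y, sep = ";") had no effect: sep separates multiple vector arguments, while collapse is what joins the elements of a single vector into one string:
paste(c("aa", "vv", "ss"), sep = ";")       # "aa" "vv" "ss" (unchanged)
paste(c("aa", "vv", "ss"), collapse = ";")  # "aa;vv;ss"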
dt2[, .(date1 = mean(date1), charval = paste(charval, collapse=";")), id]
# id date1 charval
#1: 1 2009-01-02 aa;vv;ss
#2: 2 2009-01-05 a;b;c;d
The OP's question is about using Map with get to call mean. That seems to trigger this check inside mean.default:
if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
    warning("argument is not numeric or logical: returning NA")
    return(NA_real_)
}
which returns NA when it finds that 'date1' is of class Date, even though it is stored as numeric underneath. One option is to specify the envir in get.
Another problem is the use of ifelse: its result has the length of the test (here length 1), so only the first element of paste(y, sep = ";") would ever be returned. It is better to use if/else, as the condition is a single value:
dt2[, Map(function(x, y) if(x != "paste") get(x, envir = parent.frame())(y, na.rm = TRUE)
else paste(y, collapse=':'), setNames(c("mean", "paste"), names(.SD)), .SD), by = id]
# id date1 charval
#1: 1 2009-01-02 aa:vv:ss
#2: 2 2009-01-05 a:b:c:d
get is kind of tricky; if you specify the correct environment, it works as expected:
get("mean")(dt2$date1)
#[1] "2009-01-04"
Or, instead of comparing against the "paste" string with if/else, we can check the column class: if it is character, do the paste, or else return the mean:
dt2[, Map(function(x, y) if(is.character(y)) get(x)(y, collapse=":")
else get(x, envir = parent.frame())(y, na.rm = TRUE),
setNames(c("mean", "paste"), names(.SD)), .SD), by = id]
# id date1 charval
#1: 1 2009-01-02 aa:vv:ss
#2: 2 2009-01-05 a:b:c:d
Note that it is better to use the first approach, which avoids these hassles altogether.
I still have some problems understanding the data.table notation. Could anyone explain why the following is not working?
I'm trying to classify dates into groups using cut. The breaks are found in another data.table and depend on the by argument of the outer data.table:
data <- data.table(A = c(1, 1, 1, 2, 2, 2),
                   DATE = as.POSIXct(c("01-01-2012", "30-05-2015", "01-01-2020",
                                       "30-06-2012", "30-06-2013", "01-01-1999"),
                                     format = "%d-%m-%Y"))
breaks <- data.table(B = c(1, 1, 2, 2),
                     BREAKPOINT = as.POSIXct(c("01-01-2015", "01-01-2016",
                                               "30-06-2012", "30-06-2013"),
                                             format = "%d-%m-%Y"))
data[, bucket := cut(DATE, breaks[B == A, BREAKPOINT], ordered_result = T), by = A]
I can get the desired result by doing
# expected
data[A == 1, bucket := cut(DATE, breaks[B == 1, BREAKPOINT], ordered_result = T)]
data[A == 2, bucket := cut(DATE, breaks[B == 2, BREAKPOINT], ordered_result = T)]
data
# A DATE bucket
# 1: 1 2012-01-01 NA
# 2: 1 2015-05-30 2015-01-01
# 3: 1 2020-01-01 NA
# 4: 2 2012-06-30 2012-06-30
# 5: 2 2013-06-30 NA
# 6: 2 1999-01-01 NA
Thanks,
Michael
The problem is that cut produces factors, and those are not handled correctly in the data.table by operation (this is a bug and should be reported: the factor levels should be handled the same way they are in rbind.data.table or rbindlist). An easy fix for your original expression is to convert to character:
data[, bucket := as.character(cut(DATE, breaks[B == A, BREAKPOINT], ordered_result = TRUE)),
     by = A]
# A DATE bucket
#1: 1 2012-01-01 NA
#2: 1 2015-05-30 2015-01-01
#3: 1 2020-01-01 NA
#4: 2 2012-06-30 2012-06-30
#5: 2 2013-06-30 NA
#6: 2 1999-01-01 NA
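If you'd rather sidestep factors entirely, here is a sketch using findInterval() instead of cut(); it labels each date with the left endpoint of its interval, formatted as character to match the output above (the intermediate names bp, idx, ok, and out are my own):
data[, bucket := {
  bp  <- breaks[B == A, sort(BREAKPOINT)]
  idx <- findInterval(DATE, bp)
  ok  <- idx >= 1L & idx < length(bp)  # NA outside the outermost breakpoints, like cut()
  out <- bp[pmax(idx, 1L)]             # left endpoint of each date's interval
  out[!ok] <- NA
  format(out)
}, by = A]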
I have a set of data along these lines
d1 <- data.frame(
  cat1 = sample(c('a', 'b', 'c'), 100, replace = TRUE),
  date = rep(Sys.Date() - sample(1:100)),
  val = rnorm(100, 50, 5)
)
require(data.table)
d2 <- data.table(d1)
I can get a daily sum without problem
d2[ , list(.N, sum(val)), by = c("cat1", "date")]
I want to get a sum over 2 days (and then 7 days)
This works:
d.list <- sort(unique(d2$date))
o.list <- list()
for(i in seq_along(d.list)){
  o.list[[i]] <- d2[d2$date >= d.list[i] - 1 & d2$date <= d.list[i],
                    list(.N, sum(val), max(date)), by = c("cat1")]
}
do.call(rbind, o.list)
But it slows down on a bigger data set, and it doesn't seem to be the best use of data.table.
Is there a more efficient way?
This is a bit faster:
First we join for exact matches and obtain the last index (in case of multiple matches)
setkey(d2, cat1, date)
tmp1 = d2[unique(d2, by=key(d2)), which=TRUE, mult="last", allow.cartesian=TRUE]
Then we construct a copy of d2 and change date to date-1 by reference, and perform a join with roll = -Inf, which is next observation carried backward: if there's no exact match, it fills in the next available value.
d3 = copy(d2)[, date := date-1]
setkey(d3, cat1, date)
tmp2 = d2[unique(d3, by=key(d2)), roll=-Inf, which=TRUE, allow.cartesian=TRUE]
From here, we put together the indices:
idx1 = tmp1 - tmp2 + 1L                            # number of rows in each window
idx2 = data.table:::vecseq(tmp2, idx1, sum(idx1))  # all row indices covered by each window
Subset d2 from idx2 and generate unique ids from idx1:
ans1 = d2[idx2][, grp := rep(seq_along(idx1), idx1)]
Finally aggregate by grp and get the desired result:
ans1 = ans1[, list(cat1=cat1[1L], date=date[.N],
N = .N, val=sum(val)), by=grp][, grp:=NULL]
> head(ans1, 10L)
# cat1 date N val
# 1: a 2014-01-20 1 47.69178
# 2: a 2014-01-25 1 52.01006
# 3: a 2014-02-01 1 46.82132
# 4: a 2014-02-06 1 44.62404
# 5: a 2014-02-11 1 49.63218
# 6: a 2014-02-14 1 48.80676
# 7: a 2014-02-22 1 49.27800
# 8: a 2014-02-23 2 96.17617
# 9: a 2014-02-26 1 49.20623
# 10: a 2014-02-28 1 46.72708
The results are identical to those from your solution. This one took 0.02 seconds on my laptop, whereas yours took 0.58 seconds.
For 7 days, just change:
d3 = copy(d2)[, date := date-1]
to
d3 = copy(d2)[, date := date-6]
It's not explained very clearly in the OP what you want, but this seems to be it:
# generate the [date-1,date] sequences for each date
# adjust length.out to suit your needs
dates = d2[, list(date.seq = seq(date, by = -1, length.out = 2)), by = date]
setkey(dates, date.seq)
setkey(d2, date)
# merge and extract info needed
dates[d2][, list(.N, sum(val), date.seq[.N]), by = list(date, cat1)][, !"date"]
# cat1 N V2 V3
# 1: a 1 38.95774 2014-01-21
# 2: a 1 38.95774 2014-01-21
# 3: c 1 55.68445 2014-01-22
# 4: c 2 102.20806 2014-01-23
# 5: c 1 46.52361 2014-01-23
# ---
#164: c 1 50.17986 2014-04-27
#165: b 1 51.43489 2014-04-28
#166: b 2 100.91982 2014-04-29
#167: b 1 49.48493 2014-04-29
#168: c 1 54.93311 2014-04-30
Would it be possible to set up a binned date, and then do by on that?
d2$day7 <- as.integer(d2$date) %/% 7
d2[ , list(.N, sum(val)), by = c("cat1", "day7")]
That would give a binned value; if you want a sliding 7-day window, I'd need to think again. Also, for a binned approach, you might need to subtract an offset before doing the %/% if you want to choose the day of the week the groups start at.
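For a true sliding 7-day window, the foverlaps() pattern from the count question above can be reused. A sketch (the win table and the start/end helper columns are my own naming):
# each observation becomes a point interval; each (cat1, date) pair gets a 7-day window
d2[, `:=`(start = date, end = date)]
win <- unique(d2[, .(cat1, date)])
win[, `:=`(start = date - 6, end = date)]
setkey(d2, cat1, start, end)
setkey(win, cat1, start, end)
# every point inside a window is matched to it, then aggregated per window end date
foverlaps(d2, win)[, .(N = .N, val = sum(val)), by = .(cat1, date)]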