Warnings in `.SD` when selecting columns named in a variable - r

Assuming I have a data.table as below
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), v = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6), a = 1:9, b = 9:1)
> DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
I have a variable sl <- c("a","b") that selects columns to compute rowSums. If I try the code below
DT[, ab := rowSums(.SD[, ..sl])]
I am still able to get the desired output, but I am given a warning message:
DT[, ab := rowSums(.SD[, ..sl])]
Warning message:
In `[.data.table`(.SD, , ..sl) :
Both 'sl' and '..sl' exist in calling scope. Please remove the '..sl' variable in calling scope for clarity.
However, no warnings occur when running
DT[, ab := rowSums(.SD[, sl, with = FALSE])]
I am wondering how to fix the warning issue when using .SD[, ..sl]. Thanks in advance!

It may be that the intended syntax is either to specify .SDcols and work on .SD, or to use the ..cols prefix directly on the original object (see the sketch after the quote below). According to ?data.table:
x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols]
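For example, the following equivalent call avoids nesting .SD[, ..sl] and should not emit the warning (a sketch based on the equivalence quoted above):
DT[, ab := rowSums(.SD), .SDcols = sl]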
If we check the source code of data.table, line 248 seems to be the one triggering the warning, as
DT[, exists(..sl, where = DT)]
#[1] TRUE
and
DT[, .SD[, exists(..sl)]]
#[1] TRUE
DT[, .SD[, exists(..sl, where = .SD)]]
#[1] TRUE

Shift with dynamic n (number of position lead / lag by)

I have the below df:
df <- data.table(user = c('a', 'a', 'a', 'b', 'b')
, spend = 1:5
, shift_by = c(1,1,2,1,1)
); df
user spend shift_by
1: a 1 1
2: a 2 1
3: a 3 2
4: b 4 1
5: b 5 1
I am looking to create a lead/lag column, only this time the n parameter in data.table's shift function is dynamic and takes df$shift_by as input. My expected result is:
df[, spend_shifted := c(NA, 1, 1, NA, 4)]; df
user spend shift_by spend_shifted
1: a 1 1 NA
2: a 2 1 1
3: a 3 2 1
4: b 4 1 NA
5: b 5 1 4
However, with the below attempt it gives:
df[, spend_shifted := shift(x=spend, n=shift_by, type="lag"), user]; df
user spend shift_by spend_shifted
1: a 1 1 NA
2: a 2 1 NA
3: a 3 2 NA
4: b 4 1 NA
5: b 5 1 NA
This is the closest example I could find. However, I need a group by and am after a data.table solution because of speed. I truly look forward to any ideas.
I believe this will work. You can drop the newindex column afterwards.
df[, newindex := rowid(user) - shift_by]
df[newindex < 0, newindex := 0]
df[newindex > 0, spend_shifted := df[, spend[newindex], by = .(user)]$V1]
# user spend shift_by newindex spend_shifted
# 1: a 1 1 0 NA
# 2: a 2 1 1 1
# 3: a 3 2 1 1
# 4: b 4 1 0 NA
# 5: b 5 1 1 4
Here's another approach, using a data.table join. I use two helper-columns to join on:
df[, row := .I, by = .(user)]
df[, match_row := row - shift_by]
df[df, on = .(user, match_row = row), x := i.spend]
df[, c('row', 'match_row') := NULL]
# user spend shift_by spend_shifted x
# 1: a 1 1 NA NA
# 2: a 2 1 1 1
# 3: a 3 2 1 1
# 4: b 4 1 NA NA
# 5: b 5 1 4 4
Using matrix subsetting of data.frames:
df[,
spend_shifted :=
data.frame(shift(spend, n = unique(sort(shift_by))))[cbind(1:.N, shift_by)],
by = user]
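For context, here is my own standalone illustration of the trick (not part of the original answer): data.frame(shift(x, n = ...)) holds one column per lag, and indexing it with a two-column (row, column) matrix picks, for each row i, the column matching that row's own lag.
m <- data.frame(shift(1:3, n = 1:2))  # column 1 = lag 1, column 2 = lag 2
m[cbind(1:3, c(1, 1, 2))]             # per-row picks: lag 1, lag 1, lag 2
# [1] NA  1  1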
Another solution (in addition to Wimpel's) without shift:
df[, {rows <- 1:nrow(.SD) - shift_by; .SD[replace(rows, rows <= 0, NA), spend]},
by = user]
Maybe this could help
> df[, spend_shifted := spend[replace(seq(.N) - shift_by, seq(.N) <= shift_by, NA)], user][]
user spend shift_by spend_shifted
1: a 1 1 NA
2: a 2 1 1
3: a 3 2 1
4: b 4 1 NA
5: b 5 1 4
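A worked trace of the indexing for group "a" (spend = 1:3, shift_by = c(1, 1, 2)), added purely as illustration:
idx <- replace(seq(3) - c(1, 1, 2), seq(3) <= c(1, 1, 2), NA)
idx         # NA  1  1  (rows whose lag reaches before the group start get NA)
(1:3)[idx]  # NA  1  1  -> the desired lagged spend values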
I have carried out a benchmark test, as scalability is very important for me.
df is the same as the original, only repeated 10,000,000 times, giving 50,000,000 rows.
x <- 1e7
df <- data.table(user = rep(c('a', 'a', 'a', 'b', 'b'), x)
, spend = rep(1:5, x)
, shift_by = rep(c(1,1,2,1,1), x)
); df
user spend shift_by
1: a 1 1
2: a 2 1
3: a 3 2
4: b 4 1
5: b 5 1
benchmark (microbenchmark for timing, ggplot2 for autoplot()):
library(microbenchmark)
library(ggplot2)
a <-
microbenchmark(wimpel = {df[, newindex := rowid(user) - shift_by]
df[newindex < 0, newindex := 0]
df[newindex > 0, spend_shifted := df[, spend[newindex], by = .(user)]$V1]
}
, r2evans = {df[, spend_shifted := spend[{o <- seq_len(.N) - shift_by; o[o<1] <- NA; o; }], by = user]}
, sindri_1 = {df[, spend_shifted := data.frame(shift(spend, n = unique(sort(shift_by))))[cbind(1:.N, shift_by)], by = user]}
, sindri_2 = {df[, {rows <- 1:nrow(.SD) - shift_by; .SD[replace(rows, rows == 0, NA), spend]}, by = user]}
, talat = {df[, row := .I, by = .(user)]
df[, match_row := row - shift_by]
df[df, on = .(user, match_row = row), x := i.spend]
df[, c('row', 'match_row') := NULL]
}
, thomas = {df[, spend_shifted := spend[replace(seq(.N) - shift_by, seq(.N) <= shift_by, NA)], user]}
, times = 20
)
autoplot(a)
@ThomasIsCoding's and @r2evans' methods are almost identical.
a[, .(mean = mean(time)), expr][order(mean)]
expr mean
1: thomas 1974759530
2: r2evans 2121604845
3: sindri_2 2530492745
4: wimpel 4337907900
5: sindri_1 4585692780
6: talat 7252938170
I am still in the process of parsing the logic of all the methods provided. I cannot thank you all enough for the many methods contributed. I shall be voting for an answer in due course.

Column type set by first element being evaluated in r/data.table

I have a function that returns NA under certain conditions and an integer otherwise (an integer vector in fact, but it doesn't matter now).
When I apply this function to groups of elements in a data.table and the first group returns NA, then the whole column is erroneously set to logical thus screwing up the following elements. How can I prevent this behaviour?
Example:
library(data.table)
myfun <- function(x) {
  if(x == 0) {
    return(NA)
  } else {
    return(x*2)
  }
}
DT <- data.table(x= c(0, 1, 2, 3), y= LETTERS[1:4])
DT
x y
1: 0 A
2: 1 B
3: 2 C
4: 3 D
The following should assign to column x2 the values c(NA, 2, 4, 6). Instead, I get c(NA, TRUE, TRUE, TRUE) with warnings:
DT[, x2 := myfun(x), by= y]
Warning messages:
1: In `[.data.table`(DT, , `:=`(x2, myfun(x)), by = y) :
Group 2 column 'x2': 2.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical'
2: In `[.data.table`(DT, , `:=`(x2, myfun(x)), by = y) :
Group 3 column 'x2': 4.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical'
3: In `[.data.table`(DT, , `:=`(x2, myfun(x)), by = y) :
Group 4 column 'x2': 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical'
DT
x y x2
1: 0 A NA
2: 1 B TRUE
3: 2 C TRUE
4: 3 D TRUE
Changing the order of the rows gives the expected result, because the first group evaluated now returns a double, so the column is created as numeric:
DT <- data.table(x= c(1, 2, 3, 0), y= LETTERS[1:4])
DT[, x2 := myfun(x), by= y]
DT
x y x2
1: 1 A 2
2: 2 B 4
3: 3 C 6
4: 0 D NA
I can preset the type of column x2:
DT <- data.table(x= c(0, 1, 2, 3), y= LETTERS[1:4])
DT[, x2 := integer()]
DT[, x2 := myfun(x), by= y]
DT
x y x2
1: 0 A NA
2: 1 B 2
3: 2 C 4
4: 3 D 6
but I wonder if there are better options that don't require me to set the column type beforehand.
This is with data.table v1.14.0, R 3.6.3
Do not let your function return NA, but NA_integer_ or NA_real_ instead.
Problem solved ;-)
myfun <- function(x) {
  if(x == 0) {
    return(NA_integer_) #<-- !!
  } else {
    return(x*2)
  }
}
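Since x*2 is a double, NA_real_ is the closest type match; a quick check (my own verification, not part of the original answer) reproduces the expected column:
myfun <- function(x) {
  if(x == 0) {
    return(NA_real_)  # typed NA, so the column is created as numeric
  } else {
    return(x*2)
  }
}
DT <- data.table(x = c(0, 1, 2, 3), y = LETTERS[1:4])
DT[, x2 := myfun(x), by = y]
DT
#    x y x2
# 1: 0 A NA
# 2: 1 B  2
# 3: 2 C  4
# 4: 3 D  6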

replace missing values using other rows only when other columns are the same in R

I guess other people have already asked about this, but I couldn't find what I'm looking for.
I want to replace NA values with the value of the row above, only when all other values are the same. Bonus point for data.table solution.
Right now, I've managed to do it only with a (very inefficient) loop.
In addition, my current code does not replace NA in case there are two NAs in the same row.
I have a strong feeling that I'm overthinking this problem. Any ideas of making this stuff easier?
ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(NA, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
desired_ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(2, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
col_names <- paste0("attr", 1:4)
r <- 1
for (r in 1:nrow(ex)) {
  print(r)
  to_check <- col_names[colSums(is.na(ex[r, .SD, .SDcols = col_names])) > 0]
  if (length(to_check) == 0) {
    print("no NA- next")
    next
  }
  for (col_check in to_check) {
    .ex <- copy(ex)[seq(from = r, to = r + 1), ]
    .ex[[col_check]] <- NULL
    if (nrow(unique(.ex)) == 1) {
      ex[[col_check]][r] <- ex[[col_check]][r + 1]
    }
  }
}
all.equal(ex, desired_ex)
Here is a solution which will work for an arbitrary number of rows and columns within each id, not just pairs of rows:
library(data.table)
ex[,
   if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) {
     lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N))
   } else {
     .SD
   }, by = id]
or, more compact,
ex[, if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id attr1 attr2 attr3 attr4
1: 1 NA 2 2 1
2: 1 NA 2 2 1
3: 2 3 NA 2 1
4: 2 3 3 1 3
Explanation
For each id it is checked whether the rows fulfill the condition. If not, .SD is returned unchanged. If the condition is fulfilled, a new .SD is created by picking the first non-NA value in each column (or NA in case all values are NA) using fcoalesce() and replicating this value as many times as there are rows in .SD.
The check for the condition consists of two parts. First, it is checked for each column in .SD whether all values are identical, ignoring any NA. Then it is checked whether this is TRUE for all columns.
Note that .SD is a data.table containing the Subset of Data for each group, excluding any columns used in by.
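To illustrate the fcoalesce() step in isolation (my own toy example): passing a column as a list of length-1 vectors makes fcoalesce() return the first non-NA value, or NA if there is none.
fcoalesce(as.list(c(NA, 2, NA)))            # 2
fcoalesce(as.list(c(NA_real_, NA_real_)))   # NA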
Another use case with more rows and columns
ex2 <- fread("
id foo bar baz attr4 attr5
1 NA 2 NA 1 5
1 NA 2 2 1 NA
1 NA 2 NA NA NA
2 3 NA 2 1 2
2 3 3 1 3 2
2 3 3 1 4 2
3 5 2 NA 1 3
3 NA 2 2 1 3
4 NA NA NA NA NA
")
ex2[, if (sum(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))) == ncol(.SD))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id foo bar baz attr4 attr5
1: 1 NA 2 2 1 5
2: 1 NA 2 2 1 5
3: 1 NA 2 2 1 5
4: 2 3 NA 2 1 2
5: 2 3 3 1 3 2
6: 2 3 3 1 4 2
7: 3 5 2 2 1 3
8: 3 5 2 2 1 3
9: 4 NA NA NA NA NA
Here is an option mixing base R with data.table:
#lead the values for comparison
cols <- paste0("attr", 1L:4L)
lcols <- paste0("lead_", cols)
ex[, (lcols) := shift(.SD, -1L), id]
#check which rows fulfill the criteria
flags <- apply(ex[, ..cols] == ex[, ..lcols], 1L, all, na.rm=TRUE) &
  apply(ex[, ..lcols], 1L, function(x) !all(is.na(x)))
#update those rows with values from row below
ex[(flags), (cols) :=
     mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY=FALSE)]
ex[, (lcols) := NULL][]
The solution assumes that there is no recursive populating, where the row after next would be used to fill the current row if the criteria are met.

How to get indices of top k values for each (selected) column in data.table

How to find the indices of the top k (say k = 3) values for each column?
> dt <- data.table( x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1) )
> dt
x y
1: 1 1
2: 1 2
3: 3 1
4: 1 2
5: 3 2
6: 1 1
7: 1 1
Required output:
> output.1
x y
1: 1 2
2: 3 4
3: 5 5
Or even better (notice the additional helpful descending sort in x):
> output.2
var top1 top2 top3
1: x 3 5 1
2: y 2 4 5
Having the first output would already be a great help.
We can use sort (with index.return=TRUE) after looping over the columns of the dataset with lapply
dt[, lapply(.SD, function(x) sort(head(sort(x,
decreasing=TRUE, index.return=TRUE)$ix,3)))]
# x y
#1: 1 2
#2: 3 4
#3: 5 5
Or use order
dt[, lapply(.SD, function(x) sort(head(order(-x),3)))]
If the order of elements with the same rank doesn't matter, then this answer would also be valid.
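To also get the output.2 layout asked for (one row per variable, columns top1 to top3), here is a possible sketch building on order(); this is my own addition rather than part of the answer above:
k <- 3
idx <- dt[, lapply(.SD, function(x) head(order(-x), k))]  # top-k row indices per column
out2 <- data.table(var = names(idx), transpose(idx))
setnames(out2, c("var", paste0("top", 1:k)))
out2
#    var top1 top2 top3
# 1:   x    3    5    1
# 2:   y    2    4    5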
The order information can be extracted from the data.table index.
library(data.table)
dt = data.table(x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1))
set2key(dt, x)
set2key(dt, y)
tail.index = function(dt, index, n){
  idx = attr(attr(dt, "index"), index)
  rev(tail(idx, n))
}
tail.index(dt, "__x", 3L)
#[1] 5 3 7
tail.index(dt, "__y", 3L)
#[1] 5 4 2
Here's a verbose solution which I'm sure undermines the slickness of the data.table package:
dt$idx <- seq.int(1:nrow(dt))
k <- 3
top_x <- dt[order(-x), idx[1:k]]
top_y <- dt[order(-y), idx[1:k]]
dt_top <- data.table(top_x, top_y)
dt_top
# top_x top_y
# 1: 3 2
# 2: 5 4
# 3: 1 5

R Data Aggregation With WHERE Clause on Group

As an example, I have the data.table shown below. I want to do a simple aggregation where b = sum(b). For c, however, I want the value of c from the record where b is at its maximum. The desired output is shown below (dt.aggr). This leads to a few questions:
1) Is there a way to do this in data.table?
2) Is there a simpler way to do this in plyr?
3) In plyr the output object got changed from a data.table to a data.frame. Can I avoid this behavior?
library(plyr)
library(data.table)
dt <- data.table(a=c('a', 'a', 'a', 'b', 'b'), b=c(1, 2, 3, 4, 5),
c=c('m', 'n', 'p', 'q', 'r'))
dt
# a b c
# 1: a 1 m
# 2: a 2 n
# 3: a 3 p
# 4: b 4 q
# 5: b 5 r
dt.split <- split(dt, dt$a)
dt.aggr <- ldply(lapply(dt.split,
FUN=function(dt){ dt[, .(b=sum(b), c=dt[b==max(b), c]),
by=.(a)] }), .id='a')
dt.aggr
# a b c
# 1 a 6 p
# 2 b 9 r
class(dt.aggr)
# [1] "data.frame"
This is a simple operation within the data.table scope
dt[, .(b = sum(b), c = c[which.max(b)]), by = a]
# a b c
# 1: a 6 p
# 2: b 9 r
A similar option would be
dt[order(b), .(b = sum(b), c = c[.N]), by = a]
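After ordering by b, the last row within each group carries the maximum b, so c[.N] picks its c value. On the dt above this should give the same result:
#    a b c
# 1: a 6 p
# 2: b 9 r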
