set() in data.table - matching on names instead of column number - r

Using set() for efficiency to update values inside a data.table, I ran into problems when the order of the columns changed. To prevent that, I used a workaround that matches the column name instead of the column position.
I would like to know if there's a better way of addressing the column in the j part of the set query.
DT <- as.data.table(cbind(Period = 1:10,
                          Col.Name = NA))
set(DT, i = 1L , j = as.integer(match("Col.Name",names(DT))), value = 0)
set(DT, i = 3L , j = 2L, value = 0)
So I would like to ask if there's a data.table workaround for this, perhaps a fast match on the column names that is already available.

We can use the column name directly in 'j':
set(DT, i = 1L , j = "Col.Name", value = 0)
DT
# Period Col.Name
# 1: 1 0
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
#10: 10 NA
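This matters most when set() runs inside a loop. A minimal sketch (the loop indices here are just illustrative): because j takes the column name, the code stays correct even if DT's column order changes later.

```r
library(data.table)

DT <- data.table(Period = 1:10, Col.Name = NA_real_)

# Passing the column name to 'j' avoids hard-coding a position,
# so a later change in column order cannot silently break the loop
for (r in c(1L, 3L, 7L)) {
  set(DT, i = r, j = "Col.Name", value = 0)
}
```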


data.table subassignment with `on = `

When making a subassignment, the RHS length must either be 1 (single values are ok) or match the LHS length exactly, as the error message says when the rule is not followed.
However, the following works:
tab.01 <- data.table( a = 1L:5L, b = 11L:15L )
tab.02 <- data.table( a = c(1L, 1L, 2L), x = c(11L, 12L, 22L) )
tab.01[ tab.02, x := i.x, on = "a"]
# a b x
# 1: 1 11 12
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
The column x is not functionally dependent on the column a. Yet, an assignment is made and, if my guess is right, the last element of each matching subgroup is assigned.
Can this default behaviour be changed, e.g. to choose the first element? The following trials do not work:
mult = "first" has no effect.
tab.01[ tab.02, x := first(i.x), on = "a" ] assigns the value 11L to all matches.
tab.01[ tab.02, x := first(i.x), on = "a", by = "a"]
results in an error, because i.x is not available anymore (or any other column in i).
tab.01[ tab.02, x := first(i.x), on = "a", by = .EACHI ] does not raise an error, but does not fix anything either. The values in the group are assigned in the order of the rows, hence the last value is kept.
One can use a version of tab.02 with functionally dependent columns:
tab.02[ , y := f_fd(x), by = "a" ] # e.g. f_fd <- data.table::first
tab.01[ tab.02, x := y, on = "a"]
Is this the most concise way to perform this task?
I believe there's no built-in method specifically for accomplishing this. However, it is possible to do this update without modifying tab.02.
You could create a subset
tab.01[tab.02[rowid(a) == 1], x := i.x, on = "a"][]
# a b x
# 1: 1 11 11
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
or order before joining
tab.01[tab.02[order(-x)], x := i.x, on = "a"][]
# a b x
# 1: 1 11 11
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
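A third option (not shown above, but following the same idea) is to deduplicate tab.02 on the fly: unique() keeps the first row per group, so the first x per 'a' wins, again without modifying tab.02 itself.

```r
library(data.table)

tab.01 <- data.table(a = 1L:5L, b = 11L:15L)
tab.02 <- data.table(a = c(1L, 1L, 2L), x = c(11L, 12L, 22L))

# unique(..., by = "a") keeps the first row for each 'a',
# so the first x in each group is the one assigned
tab.01[unique(tab.02, by = "a"), x := i.x, on = "a"]
```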

Find and subset patterns in data table

Suppose that we have a data table with missing values (see example below).
library(data.table)
mat <- matrix(rnorm(50), ncol = 5)
mat[c(1,3,5,9,10,11,14,37,38)] <- NA
DT <- as.data.table(mat)
In total, we have 5 unique missing data patterns in our example (see unique(!is.na(DT))).
Suppose now further that we would like to find these patterns and identify them according to their frequency of occurrence (starting with the most frequent pattern indicated by 1).
DTna <- as.data.table(!is.na(DT))
DTna <- DTna[, n := .N, by = names(x = DTna)]
DTna <- DTna[, id := 1:nrow(x = DTna)]
DTna <- DTna[order(n, decreasing = TRUE)]
DTna <- DTna[, m := .GRP, by = eval(names(x = DT))]
Finally, observations with a particular pattern should be subsetted according to a prespecification (here e.g. 1 for the most frequent pattern).
pattern <- 1
i <- DTna[m == pattern, id]
DT[i]
In summary, I need to find observations which share the same missing data pattern and subsequently subset them according to a prespecification (e.g. the most frequent pattern). Please note that I need to subset DT instead of DTna.
Question
So far, the above code works as expected, but is there a more elegant way using data.table?
I would add a grouping column to DT to join and filter on:
DT[, nag := do.call(paste0, lapply(.SD, function(x) +is.na(x)))]
nagDT = DT[, .N, by=nag][order(-N), nagid := .I][, setorder(.SD, nagid)]
# nag N nagid
# 1: 10000 4 1
# 2: 00000 2 2
# 3: 00010 2 3
# 4: 11000 1 4
# 5: 01000 1 5
# subsetting
my_id = 1L
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0]
which gives
V1 V2 V3 V4 V5 nag
1: NA 1.3306093 -2.1030978 0.06115726 -0.2527502 10000
2: NA 0.2852518 -0.1894425 0.86698633 -0.2099998 10000
3: NA -0.1325032 -0.5201166 -0.94392417 0.6515976 10000
4: NA 0.3199076 -1.0152518 -1.61417902 -0.6458374 10000
If you want to omit the new column in the result:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"]
And to also omit the blank columns:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"][,
Filter(function(x) !anyNA(x), .SD)]
An alternative, which is undoubtedly inferior (but nonetheless provided for variety), is
DT[, patCnt := setDT(stack(transpose(DT)))[,
paste(+(is.na(values)), collapse=""), by="ind"][,
patCnt := .N, by=(V1)]$patCnt]
which returns
DT
V1 V2 V3 V4 V5 patCnt
1: NA NA -1.5062011 -0.9846015 0.12153714 1
2: 1.4176784 -0.08078952 -0.8101335 0.6437340 -0.49474613 2
3: NA -0.08410076 -1.1709337 -0.9182901 0.67985806 4
4: 0.2104999 NA -0.1458075 0.8192693 0.05217464 1
5: NA -0.73361504 2.1431392 -1.0041705 0.29198857 4
6: 0.3841267 -0.75943774 0.6931461 -1.3417511 -1.53291515 2
7: -0.8011166 0.26857593 1.1249757 NA -0.57850361 2
8: -1.5518674 0.52004986 1.6505470 NA -0.34061924 2
9: NA 0.83135928 0.9155882 0.1856450 0.31346976 4
10: NA 0.60328545 1.3042894 -0.5835755 -0.17132227 4
Then subset
DT[patCnt == max(patCnt)]
V1 V2 V3 V4 V5 patCnt
1: NA -0.08410076 -1.1709337 -0.9182901 0.6798581 4
2: NA -0.73361504 2.1431392 -1.0041705 0.2919886 4
3: NA 0.83135928 0.9155882 0.1856450 0.3134698 4
4: NA 0.60328545 1.3042894 -0.5835755 -0.1713223 4
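A variant of the paste0 trick above that leaves DT entirely untouched is to build the pattern vector as a separate object and subset on it directly; a sketch (column names V1..V5 are those produced by as.data.table(mat)):

```r
library(data.table)

mat <- matrix(rnorm(50), ncol = 5)
mat[c(1, 3, 5, 9, 10, 11, 14, 37, 38)] <- NA
DT <- as.data.table(mat)

# one pattern string per row, e.g. "10000" = NA in V1 only
pat <- DT[, do.call(paste0, lapply(.SD, function(x) +is.na(x)))]

# subset DT to the rows sharing the most frequent pattern
top <- names(which.max(table(pat)))
res <- DT[pat == top]
```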

Retaining by variables in R data.table by-without-by

I'd like to retain the by variables in a by-without-by operation using data.table.
I have a by-without-by that used to work (ca. 2 years ago), and now with the latest version of data.table I think the behavior must have changed.
Here's a reproducible example:
library(data.table)
dt <- data.table( by1 = letters[1:3], by2 = LETTERS[1:3], x = runif(3) )
by <- c("by1","by2")
allPermutationsOfByvars <- do.call(CJ, sapply(dt[,by,with=FALSE], unique, simplify=FALSE)) ## CJ() to form index
setkeyv(dt, by)
dt[ allPermutationsOfByvars, list( x = x ) ]
Which produces:
> dt[ allPermutationsOfByvars, list( x = x ) ]
x
1: 0.9880997
2: NA
3: NA
4: NA
5: 0.4650647
6: NA
7: NA
8: NA
9: 0.4899873
I could just do:
> cbind( allPermutationsOfByvars, dt[ allPermutationsOfByvars, list( x = x ) ] )
by1 by2 x
1: a A 0.9880997
2: a B NA
3: a C NA
4: b A NA
5: b B 0.4650647
6: b C NA
7: c A NA
8: c B NA
9: c C 0.4899873
Which indeed works, but is inelegant and possibly inefficient.
Is there an argument I'm missing or a clever stratagem to retain the by variables?
Add by = .EACHI to get the "by-without-by" aka by-EACH-element-of-I:
dt[allPermutationsOfByvars, x, by = .EACHI]
And this is how I'd have done the initial part:
allPermutationsOfByvars = dt[, do.call(CJ, unique(setDT(mget(by))))]
Finally, the on argument is usually the better choice now (vs setkey).
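Putting those pieces together, a sketch of the full pattern with on= (no setkey needed) and by = .EACHI, which retains the join columns in the result:

```r
library(data.table)

dt <- data.table(by1 = letters[1:3], by2 = LETTERS[1:3], x = runif(3))
by <- c("by1", "by2")

# cross join of the unique values of each by-variable
allPermutationsOfByvars <- dt[, do.call(CJ, lapply(mget(by), unique))]

# by = .EACHI groups by each row of i, so by1/by2 appear in the result;
# unmatched combinations get NA in x
res <- dt[allPermutationsOfByvars, x, on = by, by = .EACHI]
```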

Inline ifelse assignment in data.table

Let the following data set be given:
library('data.table')
set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y =sample(10))
my.rows <- sample(1:dim(DT)[1], 3)
I want to add a new column to the data set such that, whenever a row of the data set matches one of the row numbers given by my.rows, the entry is populated with, say, true, and with false otherwise.
I have got DT[my.rows, z:= "true"], which gives
head(DT)
x y z
1: A 2 NA
2: B 6 NA
3: C 5 true
4: D 8 NA
5: E 9 true
6: F 4 NA
but I do not know how to automatically populate the else condition as well, at the same time. I guess I should make use of some sort of inline ifelse but I am lacking the correct syntax.
We can compare 'my.rows' with the sequence of rows using %in% to create a logical vector, and assign it (:=) to create the 'z' column.
DT[, z:= 1:.N %in% my.rows ]
Or another option would be to create 'z' as a column of FALSE values; then, using 'my.rows' as 'i', we assign the elements of 'z' that correspond to 'i' as TRUE.
DT[, z:= FALSE][my.rows, z:= TRUE]
Alternatively, using ifelse (note that this yields NA rather than FALSE for non-matching rows):
DT <- cbind(DT, z = ifelse(DT[, .I] %in% my.rows, TRUE, NA))
> DT
# x y z
# 1: A 2 NA
# 2: B 6 NA
# 3: C 5 TRUE
# 4: D 8 NA
# 5: E 9 TRUE
# 6: F 4 NA
# 7: G 1 TRUE
# 8: H 7 NA
# 9: I 10 NA
#10: J 3 NA
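For completeness, recent versions of data.table also provide fifelse(), a faster, type-stable ifelse; a sketch of the same assignment (seq_len(.N) gives the row numbers):

```r
library(data.table)

set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y = sample(10))
my.rows <- sample(1:dim(DT)[1], 3)

# fifelse() requires both branches to have the same type,
# which catches accidental type mixing early
DT[, z := fifelse(seq_len(.N) %in% my.rows, TRUE, FALSE)]
```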

Subsetting and assignment on several columns of a data table

Lets say I have a data table like the one below:
library(data.table)
N = 10
x = data.table(id = 1:N,
segm = sample(c("A","B","C"),N,replace=T), r = rnorm(N,20,5),
aa = sample(0:1,N,replace=T), ab = sample(0:1,N,replace=T),
ba = sample(0:1,N,replace=T), bb = sample(0:1,N,replace=T))
I'd like to know how to replace the 1 values with NA, but only for the columns aa, ab, ba and bb, using the data.table package. I know how to do this using a data frame.
I tried using the following:
f = c("aa","ab","ba","bb")
x[,f,with=F][x[,f,with=F]==1] <- "NA"
but I'm getting an error: Error in `[<-.data.table`(`*tmp*`, , f, with = F, value = list(aa = c("0", : unused argument (with = F)
To sum up, my question is: How can I subset and assign on several columns of a data table at the same time.
The line of code:
x[f==1,f:="NA"]
is just not working. Why?
Any help is appreciated.
It's possible to accomplish this in another way, for this particular case:
x[, (f) := lapply(.SD, function(x) x * (x | NA)), .SDcols=f]
We use the fact that TRUE | NA = TRUE and FALSE | NA = NA here. The ( in the LHS of := makes data.table treat it as an expression (rather than a column name), so it is evaluated to obtain the column names it contains. Specifying .SDcols provides .SD with just the columns in f, which is what we want. And we apply this hack of a function to replace each column, by reference.
DT[f == 1, f := NA]
doesn't work because:
Let's write your expression as DT[i, LHS := RHS]. i, being an expression, gets evaluated within the scope of DT. [.data.table tries to find a column f within the scope of DT and, since there isn't one, it tries to find it in the calling scope instead, picking up the value stored there; the expression then becomes c("aa", "ab", "ba", "bb") == 1. This evaluates to FALSE, FALSE, FALSE, FALSE, resulting in an empty data.table, so the assignment in j has no effect.
Also note the ( in LHS in my answer. This is so that we can still conveniently use DT[, f := val] where f is the column name.
There's nothing wrong with using a for() loop here.
Given the nature of your problem, with a different subset of rows being operated on in each of the four columns, you're going to need to use some sort of loop; you might as well construct an explicit one that allows you to take full advantage of data.table's modify-by-reference := operator.
for (i in f)
x[get(i)==1, (i):=NA]
x
# id segm r aa ab ba bb
# 1: 1 C 15.203246 NA NA 0 0
# 2: 2 B 23.536583 NA 0 0 NA
# 3: 3 A 16.404203 NA 0 NA 0
# 4: 4 A 18.673618 0 0 NA NA
# 5: 5 C 30.528967 NA 0 NA NA
# 6: 6 A 18.887781 0 NA NA NA
# 7: 7 C 24.476124 0 0 NA NA
# 8: 8 B 26.862686 0 0 NA 0
# 9: 9 C 9.047837 0 0 0 NA
# 10: 10 C 17.532379 0 0 NA NA
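The same loop can also be written with set(), which skips the overhead of [.data.table on each iteration; a minimal sketch with two of the columns (tying back to the set() question at the top of this page):

```r
library(data.table)

set.seed(1)
x <- data.table(id = 1:10,
                aa = sample(0:1, 10, replace = TRUE),
                ab = sample(0:1, 10, replace = TRUE))
f <- c("aa", "ab")

# set() updates by reference, with no [.data.table dispatch per iteration
for (j in f) {
  ones <- which(x[[j]] == 1L)
  set(x, i = ones, j = j, value = NA_integer_)
}
```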
