Sub-assign rows by reference using data.table - r

I have the following data.table:
DT1 <- data.table(col1 = c(1,2,3,4,5,6,7), col2 = letters[1:7], col3 = rep(TRUE,7))
col1 col2 col3
1: 1 a TRUE
2: 2 b TRUE
3: 3 c TRUE
4: 4 d TRUE
5: 5 e TRUE
6: 6 f TRUE
7: 7 g TRUE
Then I define:
vec <- c(2,5,6)
And with:
DT1[col1 == vec, col3 := FALSE]
I obtain:
col1 col2 col3
1: 1 a TRUE
2: 2 b TRUE
3: 3 c TRUE
4: 4 d TRUE
5: 5 e FALSE
6: 6 f FALSE
7: 7 g TRUE
I expected col3 in the second row to be set to FALSE as well, but it was not.
But for example, this works as I expect:
DT1[vec, col3 := FALSE]
What am I missing?

data.table has the format DT[i,j,by] with i meaning location / where, j meaning select / update / compute / assign and by meaning group by.
The mistake you are making is the following:
In your assignment, the DT1[col1 == vec, ...] part is equivalent to the following index:
DT1$col1 == vec
This compares the elements of the col1 column of DT1 with vec, element by element. Since vec has only 3 elements, it is recycled to the length of col1 (giving 2, 5, 6, 2, 5, 6, 2), and with the specific values in your vec and col1, only the 5th and 6th comparisons turn out to be TRUE.
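The recycling can be reproduced in base R without data.table. This is a minimal sketch; the names recycled and eq are only illustrative, and suppressWarnings() hides the "longer object length is not a multiple of shorter object length" warning that R raises for this comparison.

```r
col1 <- c(1, 2, 3, 4, 5, 6, 7)
vec  <- c(2, 5, 6)

# What `==` effectively compares col1 against: vec repeated to length 7
recycled <- rep(vec, length.out = length(col1))

# Element-wise comparison with recycling (R warns about the length mismatch)
eq <- suppressWarnings(col1 == vec)

# Positions where the comparison is TRUE
pos_eq <- which(eq)

# Membership test: one TRUE/FALSE per element of col1, no recycling of vec
pos_in <- which(col1 %in% vec)
```

Here pos_eq contains rows 5 and 6 only, while pos_in contains rows 2, 5 and 6.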
The correct way to do what you want to do is:
Method 1: (preferred)
DT1[vec, col3 := FALSE]
Method 2: (equivalent to data.frame, but not preferred for data.table)
DT1$col3[vec] <- FALSE
or, the following will also work:
DT1[vec]$col3 <- FALSE
Method 3: Here is another possibility, which matches on the values of col1 rather than on row positions (although it is slower than the first method):
DT1[col1 %in% vec, col3 := FALSE]
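Methods 1 and 3 give the same result in the question only because col1 happens to equal the row number. The sketch below uses a hypothetical table, DT, with col1 in reverse order to show the difference; copy() is used so that each variant modifies its own copy.

```r
library(data.table)

# Hypothetical data: col1 no longer coincides with the row number
DT  <- data.table(col1 = c(7, 6, 5, 4, 3, 2, 1), col3 = TRUE)
vec <- c(2, 5, 6)

# Numeric i selects rows 2, 5 and 6 by position
by_pos <- copy(DT)[vec, col3 := FALSE]

# Logical i selects the rows whose col1 value is 2, 5 or 6
by_val <- copy(DT)[col1 %in% vec, col3 := FALSE]

pos_hit <- by_pos[col3 == FALSE, col1]
val_hit <- by_val[col3 == FALSE, col1]
```

With this data, pos_hit holds the col1 values 6, 3, 2 (whatever sits in rows 2, 5 and 6), whereas val_hit holds 6, 5, 2.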
Hope this helps!!

Use %in%, which tests each element of col1 for membership in vec and returns a logical vector with the same length as col1:
> DT1<-data.table(col1=c(1,2,3,4,5,6,7),col2=letters[1:7],col3=rep(TRUE,7))
> vec <- c(2,5,6)
> DT1[col1 %in% vec, col3 := FALSE]
> DT1
col1 col2 col3
1: 1 a TRUE
2: 2 b FALSE
3: 3 c TRUE
4: 4 d TRUE
5: 5 e FALSE
6: 6 f FALSE
7: 7 g TRUE

Related

Can %in% be used in base R to match value pairs?

I'm familiar with %in% generally, and I'm looking for a base R solution, if one exists.
Suppose I want to know whether a particular combination of values from multiple fields in a data frame exists in another data frame. As a work-around, sometimes I concatenate all these values into a single field and match on the custom concatenation, but I'm wondering if there's a way to pass the value combinations to %in% directly.
I'm imagining syntax similar to deduplicating on unique combinations of values across multiple columns, whose syntax works like this, by way of a generic example:
df[!duplicated(df[,c("col1","col2","col3")]),]
I was sort of expecting something like this to work, but I see why it doesn't:
df1[df1[,c("col1","col2")] %in% df2[,c("col1","col2")],]
... above, I'm attempting to ask which value pairs in df1 also exist as value pairs in df2.
You can use mapply to create a logical matrix of matches and then use it to subset df1.
Test data.
set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4),
col2 = sample(1:10, 4), col3 = 11:14)
I start by putting the column names in a vector, which simplifies the code.
cols <- c("col1", "col2")
(i <- mapply(\(x, y) x %in% y, df1[cols], df2[cols]))
# col1 col2
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] TRUE TRUE
# [5,] FALSE FALSE
# [6,] TRUE TRUE
# [7,] TRUE TRUE
# [8,] FALSE FALSE
# [9,] FALSE TRUE
#[10,] FALSE FALSE
Now subset. The question is not very clear on which of the following is asked for.
# at least one column match
j <- rowSums(i) > 0L
df1[j, ]
# col1 col2 col3
#3 c 3 13
#4 d 4 14
#6 f 6 16
#7 g 7 17
#9 i 9 19
# all columns match
k <- rowSums(i) == length(cols)
df1[k, ]
# col1 col2 col3
#4 d 4 14
#6 f 6 16
#7 g 7 17
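Note that the mapply() approach tests each column independently, so "all columns match" does not require the values to come from the same row of df2. If true value pairs are wanted, one base R option is the concatenation workaround the question mentions. The sketch below uses small hypothetical data frames and assumes the separator "\r" never occurs in the data; key1, key2 and the other result names are illustrative.

```r
df1 <- data.frame(col1 = c("a", "b", "c"), col2 = c(1, 2, 3))
df2 <- data.frame(col1 = c("a", "b"),      col2 = c(2, 1))
cols <- c("col1", "col2")

# Column-wise test: each column is checked on its own
colwise  <- mapply(function(x, y) x %in% y, df1[cols], df2[cols])
all_cols <- rowSums(colwise) == length(cols)

# Pair-wise test: build one key per row, then match the keys
key1 <- do.call(paste, c(df1[cols], sep = "\r"))
key2 <- do.call(paste, c(df2[cols], sep = "\r"))
pairwise <- key1 %in% key2
```

Rows 1 and 2 of df1 pass the column-wise test because "a", "b", 1 and 2 all occur somewhere in df2, but no row passes the pair-wise test because the pairs ("a", 1) and ("b", 2) do not occur in df2.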
I think just doing a merge() by the two columns of interest gets you what you need. You can then subset the merged output to just the columns from the original data.frame. This returns only the rows of your query data.frame where col1 and col2 match their cognate values in the reference data.frame. Please clarify if that's NOT your goal.
# simulate two DFs with some common values in col1 and col2
x <- data.frame(col1 = LETTERS[1:5],
col2 = 1:5,
col3 = runif(5))
y <- data.frame(col1 = LETTERS[4:8],
col2 = 4:8,
col3 = runif(5))
x
#> col1 col2 col3
#> 1 A 1 0.4306611
#> 2 B 2 0.7149893
#> 3 C 3 0.2808990
#> 4 D 4 0.4383580
#> 5 E 5 0.1372991
y
#> col1 col2 col3
#> 1 D 4 0.40191250
#> 2 E 5 0.94833538
#> 3 F 6 0.85608320
#> 4 G 7 0.05758958
#> 5 H 8 0.29011770
# merge without adding .x suffix to col3 from x
# then subset to only keep columns from x
merge(x, y,
by = c("col1", "col2"),
suffixes = c("", ".drop"))[,1:ncol(x)]
#> col1 col2 col3
#> 1 D 4 0.4383580
#> 2 E 5 0.1372991
Created on 2022-01-08 by the reprex package (v2.0.1)

data.table - check if one column is in another (list) column

I have a data table with a column containing lists. I want to check, row by row, whether the value in another column is present in the list column.
library(data.table)
dt <- data.table("a" = 1:3, "b" = list(1:2, 3:4, 5:6))
I tried:
dt[, is_a_in_b := a %in% b]
dt
# a b is_a_in_b
# 1: 1 1,2 FALSE
# 2: 2 3,4 FALSE
# 3: 3 5,6 FALSE
which does not give the correct result. The desired table would be
dt
# a b is_a_in_b
# 1: 1 1,2 TRUE
# 2: 2 3,4 FALSE
# 3: 3 5,6 FALSE
You can use mapply to apply the function %in% to the two columns a and b in parallel. In effect it walks over both columns together and, for every index ix, produces the result of a[ix] %in% b[[ix]].
dt[, is_a_in_b := mapply('%in%', a, b)]
> dt
a b is_a_in_b
1: 1 1,2 TRUE
2: 2 3,4 FALSE
3: 3 5,6 FALSE
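The same behaviour can be checked outside data.table with a plain vector and a plain list. This is a small base R sketch; the names whole and per_row are illustrative.

```r
a <- 1:3
b <- list(1:2, 3:4, 5:6)

# Matching a against the list as a whole: no element of a equals a list element
whole <- a %in% b

# Pairing a[ix] with b[[ix]], one index at a time
per_row <- mapply(`%in%`, a, b)
```

Only per_row reports that 1 is found in the first list element.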

how ifelse (in data.table) works [duplicate]

This question already has answers here:
Why can't R's ifelse statements return vectors?
In my data.table, I wanted to number the entries within each by group whenever the group contains more than one:
dt1 <- data.table(col1=1:4, col2 = c('A', 'B', 'B', 'C'))
# col1 col2
# 1: 1 A
# 2: 2 B
# 3: 3 B
# 4: 4 C
dt1[, col3:={
if (.N>1) {paste0((1:.N), "_", col2)} else {col2};
}, by=col2]
# col1 col2 col3
# 1: 1 A A
# 2: 2 B 1_B
# 3: 3 B 2_B
# 4: 4 C C
This works fine, but didn't work when I tried to use ifelse() instead:
dt1[, col4:=ifelse (.N>1, paste0((1:.N), "_", col2), col2), by=col2]
# col1 col2 col3 col4
# 1: 1 A A A
# 2: 2 B 1_B 1_B
# 3: 3 B 2_B 1_B
# 4: 4 C C C
Can anyone explain why?
This is only indirectly related to data.table; the core issue is that ifelse is designed to be used like this:
ifelse(test, yes, no)
where test, yes, and no all have the same length -- the output will be the same length as test, and all the elements corresponding to where test is TRUE will be the corresponding element from yes, and similarly for where test is FALSE.
When test is a scalar and yes or no are vectors, as in your case, you have to look at what ifelse is doing to understand what's going on:
Relevant source:
if (any(test[ok])) # is any element of `test` `TRUE`?
    ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok]
What is rep(c(1, 2), length.out = 1)? It's just 1 -- the second element is truncated.
That's what's happened here -- the value of ifelse is only the first element of paste0(1:.N, "_", col2). When passed to `:=`, this single element is recycled.
When your logical condition is a scalar, you should use if, not ifelse. I'll also add that I do my damndest to avoid using ifelse in general because it's slow.
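The truncation can be seen in base R with a stand-in for one group. In this sketch, x plays the role of col2 within the "B" group and n plays the role of .N; both names are assumptions made for illustration.

```r
x <- c("B", "B")   # stand-in for col2 within one group
n <- length(x)     # stand-in for .N

# Scalar test: ifelse() returns a result as long as the test, i.e. length 1
res_ifelse <- ifelse(n > 1, paste0(seq_len(n), "_", x), x)

# if/else returns the whole vector of the chosen branch
res_if <- if (n > 1) paste0(seq_len(n), "_", x) else x
```

res_ifelse has a single element, "1_B", which := would then recycle across the group, while res_if keeps both "1_B" and "2_B".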

Inline ifelse assignment in data.table

Let the following data set be given:
library('data.table')
set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y =sample(10))
my.rows <- sample(1:dim(DT)[1], 3)
I want to add a new column to the data set such that, whenever the rows of the data set match the row numbers given by my.rows the entry is populated with, say, true, or false otherwise.
I have got DT[my.rows, z:= "true"], which gives
head(DT)
x y z
1: A 2 NA
2: B 6 NA
3: C 5 true
4: D 8 NA
5: E 9 true
6: F 4 NA
but I do not know how to automatically populate the else condition as well, at the same time. I guess I should make use of some sort of inline ifelse but I am lacking the correct syntax.
We can compare 'my.rows' with the sequence of row numbers using %in% to create a logical vector, and assign (:=) it to create the 'z' column.
DT[, z:= 1:.N %in% my.rows ]
Or another option would be to create 'z' as a column of 'FALSE', using 'my.rows' as 'i', we assign the elements in 'z' that correspond to 'i' as 'TRUE'.
DT[, z:= FALSE][my.rows, z:= TRUE]
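Another option uses ifelse(); note that it leaves NA, not FALSE, in the rows that are not in my.rows:
DT <- cbind(DT, z = ifelse(DT[, .I] %in% my.rows, TRUE, NA))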
> DT
# x y z
# 1: A 2 NA
# 2: B 6 NA
# 3: C 5 TRUE
# 4: D 8 NA
# 5: E 9 TRUE
# 6: F 4 NA
# 7: G 1 TRUE
# 8: H 7 NA
# 9: I 10 NA
#10: J 3 NA

How to pass different arguments to each group in grouping of data.table?

Example:
Here is a data table called dt:
> library(data.table)
> dt <- data.table(colA=rep(letters[1:3],each=3), colB=0:8)
> dt
colA colB
1: a 0
2: a 1
3: a 2
4: b 3
5: b 4
6: b 5
7: c 6
8: c 7
9: c 8
I want to know:
For colA equal to "a", are there any values in colB > 2?
For colA equal to "b", are there any values in colB > 3?
For colA equal to "c", are there any values in colB > 4?
I create a vector called arg to hold arguments for group "a", "b" & "c":
arg <- c(2,3,4)
Could anyone give me a simple way to pass arg to grouping of dt by colA?
Here is my desired result:
colA V1
1: a FALSE
2: b TRUE
3: c TRUE
This is my first question here and I tried to make it simple. Thank you in advance.
For each subgroup that it operates on, [.data.table() stores information about the current value(s) of the grouping variable(s) in a variable named .BY.
If you first set up a named vector that maps the grouping variable's levels to the desired parameter values, you can use .BY to index into it, extracting the appropriate values, like so:
arg <- setNames(c(2, 3, 4), c("a", "b", "c"))
arg
# a b c
# 2 3 4
dt[, any(colB > arg[unlist(.BY)]), by="colA"]
# colA V1
# 1: a FALSE
# 2: b TRUE
# 3: c TRUE
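The named-vector lookup does not depend on data.table itself. As a rough base R analogue (not the .BY mechanism, just the same idea of indexing thresholds by group name), the groups can be split and paired with their thresholds; groups and res are illustrative names.

```r
colA <- rep(letters[1:3], each = 3)
colB <- 0:8
arg  <- c(a = 2, b = 3, c = 4)   # thresholds keyed by group name

# One vector of colB values per group, named by the group
groups <- split(colB, colA)

# Pair each group with its own threshold, looked up by name
res <- mapply(function(values, threshold) any(values > threshold),
              groups, arg[names(groups)])
```

Looking thresholds up by name rather than by position means the result does not depend on the order in which the groups appear.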
dt[ , thresh := (2:4)[as.numeric(factor(colA))] ]
dt
colA colB thresh
1: a 0 2
2: a 1 2
3: a 2 2
4: b 3 3
5: b 4 3
6: b 5 3
7: c 6 4
8: c 7 4
9: c 8 4
dt[, any(colB > thresh),by=colA]
colA V1
1: a FALSE
2: b TRUE
3: c TRUE
Probably not the most elegant way, but I will give it a shot...
# List the components of each group
ref <- dt[,list(colB.list=list(I(colB))),by=colA][,ord:=.I]
# Feed the arguments
ref[,arg:=c(2,3,4)]
# Apply the comparison function
ref[,V1:=mapply(FUN=function(X,Y){sum(colB.list[[X]]>Y)>0},X=ord,Y=arg)]
