Unexpected result from data.table when looking in other table - r

I am trying to check whether a value from one data.table is present in another data.table. However, I do not get the correct output:
> dt1 <- data.table(x=c(8,5,3), y=rnorm(3))
> dt2 <- data.table(a=c(1,2,3,4,5), b=rnorm(5))
> setkey(dt1,x)
> setkey(dt2,a)
>
> dt1
x y
1: 3 0.84929113
2: 5 1.33433818
3: 8 0.04170333
> dt2
a b
1: 1 2.00634915
2: 2 -1.53137195
3: 3 -1.49436741
4: 4 -1.66878993
5: 5 -0.06394713
>
> dt1[,is_present_in_dt2:=nrow(dt2[x, nomatch=0L])]
> dt1
x y is_present_in_dt2
1: 3 0.84929113 3
2: 5 1.33433818 3
3: 8 0.04170333 3
Expected result:
x y is_present_in_dt2
1: 3 0.84929113 1
2: 5 1.33433818 1
3: 8 0.04170333 0

I think this is actually more straightforward than you are making it. Think of it as subsetting dt1 with dt2 in the i statement.
dt1 <- data.table(x=c(8,5,3), y=rnorm(3))
dt2 <- data.table(a=c(1,2,3,4,5), b=rnorm(5))
setkey(dt1,x)
setkey(dt2,a)
dt1[dt2, present := 1] # Where they merge, make it a 1
dt1[!dt2, present := 0] # Where they don't merge, make it a 0
And the result:
x y present
1: 3 -0.6938894 1
2: 5 0.4891611 1
3: 8 -1.8227498 0
And another way to think of it:
overlap <- intersect(dt1$x,dt2$a)
dt1[x %in% overlap, present := 1]
dt1[!(x %in% overlap), present := 0]
The first way is much faster; the second way may just help in understanding the first.
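If you just want the 0/1 flag in a single statement, a %in%-based sketch along the lines of the second approach should also work (an illustration of mine, assuming dt1 and dt2 as defined above):
dt1[, is_present_in_dt2 := as.integer(x %in% dt2$a)]  # 1 if x occurs in dt2$a, else 0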

Related

How to check if values in individual rows of a data.table are identical

Suppose I have the following data.table:
dt <- data.table(a = 1:2, b = 1:2, c = c(1, 1))
# dt
# a b c
# 1: 1 1 1
# 2: 2 2 1
What would be the fastest way to create a fourth column d indicating that the preexisting values in each row are all identical, so that the resulting data.table will look like the following?
# dt
# a b c d
# 1: 1 1 1 identical
# 2: 2 2 1 not_identical
I want to avoid the duplicated function and stick to identical or a similar function, even if it means iterating through the items within each row.
uniqueN can be applied grouped by row, and the result turned into a logical expression (== 1):
library(data.table)
dt[, d := c("not_identical", "identical")[(uniqueN(unlist(.SD)) == 1) +
1], 1:nrow(dt)]
-output
dt
# a b c d
#1: 1 1 1 identical
#2: 2 2 1 not_identical
Another efficient approach is to compare every column against the first column and build the index with rowSums:
dt[, d := c("identical", "not_identical")[1 + (rowSums(.SD[[1]] != .SD) > 0)]]
Here is another data.table option using var
dt[, d := ifelse(var(unlist(.SD)) == 0, "identical", "non_identical"), seq(nrow(dt))]
which gives
> dt
a b c d
1: 1 1 1 identical
2: 2 2 1 non_identical
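For completeness, here is a sketch of mine (not one of the original answers) that stays closer to the question's row-wise wording by comparing every column element-wise against the first column; fifelse is data.table's fast ifelse, available in recent versions:
library(data.table)
dt <- data.table(a = 1:2, b = 1:2, c = c(1, 1))
# TRUE where every column equals the first column in that row
same_as_first <- Reduce(`&`, lapply(dt, `==`, dt[[1]]))
dt[, d := fifelse(same_as_first, "identical", "not_identical")]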

What does ".N" mean in data.table?

I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
a N
1: A 2
2: B 1
3: C 1
I know that by=a or by="a" means grouping by the a column, and that the N column counts how many rows each value of a has. However, I never call nrow(), yet I still get that result. Is .N not just a column name? I can't find the documentation via ??".N" in R. I tried .K, but it doesn't work. What does .N mean?
Think of .N as a variable for the number of instances. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
Your example returns a new variable with the number of rows per case:
dt[, new_var := .N, by = a]
dt
# a b new_var
# 1: A 4 2 # 2 'A's
# 2: A 5 2
# 3: B 6 1 # 1 'B'
# 4: C 7 1 # 1 'C'
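As a further illustration (my own sketch, not part of the original answer), .N is also handy inside .SD for picking the last row of each group:
dt[, .SD[.N], by = a]
# a b
# 1: A 5
# 2: B 6
# 3: C 7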
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols

Select dataframe if both values exist

Here is example:
df1 <- data.frame(x=1:2, account=c(-1,-1))
df2 <- data.frame(x=1:3, account=c(1,-1,1))
df3 <- data.frame(x=1, account=c(-1))
ls <- list(df1,df2,df3)
Failed attempt:
for(i in 1:length(ls)){
d <- ls[[i]]; if(d$account %in% c(-1,1)) { dout <- d} else {next}
}
I also tried: (not sure why this doesn't work)
grepl(paste(c(-1,1), collapse="|"), as.character(df1$account))
gives: (which is correct, since | means or, so one of the values is matched)
[1] TRUE TRUE
however, I have tried this:
df1 <- data.frame(x=1:2, account=c(-1,1))
grepl(paste(c(-1,1), collapse="&"), as.character(df1$account))
gives:
[1] FALSE FALSE
I would like to keep only those data frames that contain both -1 and 1 in the account column, and discard the rest.
Desired result:
d
x account
1 1 1
2 2 -1
3 3 1
Or, you could stop using a list of data.frames:
library(data.table)
DT <- rbindlist(ls, idcol="id")
# id x account
# 1: 1 1 -1
# 2: 1 2 -1
# 3: 2 1 1
# 4: 2 2 -1
# 5: 2 3 1
# 6: 3 1 -1
And filter the single table:
DT[, if (uniqueN(account) > 1) .SD, by=id]
# id x account
# 1: 2 1 1
# 2: 2 2 -1
# 3: 2 3 1
(This follows #akrun's answer; uniqueN(x) is a fast shortcut to length(unique(x)).)
We could loop through the list and check whether the number of unique elements in 'account' is greater than 1 (assuming that -1 and 1 are the only possible values). Use this logical index to filter the list.
ls[sapply(ls, function(x) length(unique(x$account))>1)]
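If values other than -1 and 1 can occur in account, a stricter sketch checks for both values explicitly (a hedged variant of the same idea, using the stacked DT from the previous answer):
ls[sapply(ls, function(x) all(c(-1, 1) %in% x$account))]   # base R, on the list
DT[, if (all(c(-1, 1) %in% account)) .SD, by = id]         # data.table, on the stacked table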

R data.table doing an inner join on a field and operating on another?

I have the following scenario, I first create a data table as shown below
x = data.table(f1 = c('a','b','c','d'))
x = x[,rn := .I]
This yields
> x
f1 rn
1: a 1
2: b 2
3: c 3
4: d 4
>
Where rn is simply the row number. Now, I have another data.table y as
y = data.table(f2=c('b','c','f'))
What I would like to do is: for elements of y that are also in x, subtract 2 from the corresponding values of rn. So the expected data.table is
x
f1 rn
1: a 1
2: b 0
3: c 1
4: d 4
How does one get to this? x[y] and y[x] don't help at all as they just do joins.
You can use %chin% in i to subset x by the required rows and then run your j expression...
x[ f1 %chin% y$f2 , rn := rn - 2L ]
x
# f1 rn
#1: a 1
#2: b 0
#3: c 1
#4: d 4
%chin% is data.table's fast version of the %in% operator, specifically for character vectors. Note that the 2 should be written 2L to keep an "integer" type, otherwise you will get a warning (obviously don't use 2L if rn is a "numeric" type).
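If you want to verify the speed claim on your own data, here is a rough sketch using the microbenchmark package (assuming it is installed; timings will vary with data size):
library(data.table)
library(microbenchmark)
big <- sample(letters, 1e6, replace = TRUE)   # a large character vector
microbenchmark(
  base = big %in% c("a", "b"),
  chin = big %chin% c("a", "b")
)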
If your data is keyed, you can use a join like so:
setkey(x, f1)
x[y, rn := rn - 2L]
x
# f1 rn
#1: a 1
#2: b 0
#3: c 1
#4: d 4
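If you'd rather not set keys at all, recent data.table versions also allow an ad hoc join with on= (a sketch assuming a fresh copy of x and y as defined above):
x[y, rn := rn - 2L, on = .(f1 = f2)]   # update rn only for rows of x whose f1 appears in y$f2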

data.table: create new columns with lapply

I have a data.table and want to apply a function to each subset of rows (each group).
Normally one would do this as follows: DT[, lapply(.SD, function), by = y]
But in my case the function does not return a single value per column but a longer vector.
Is there a way to do something like this?
library(data.table)
set.seed(9)
DT <- data.table(x1 = letters[sample(x=2L, size=6, replace=TRUE)],
                 x2 = letters[sample(x=2L, size=6, replace=TRUE)],
                 y  = rep(1:2, 3), key = "y")
DT
# x1 x2 y
#1: a a 1
#2: a b 1
#3: a a 1
#4: a a 2
#5: a b 2
#6: a a 2
DT[, lapply(.SD, table), by = y]
# Desired Result, something like this:
# x1_a x2_a x2_b
# 3 2 1
# 3 2 1
Thanks in advance. Also: I would not mind if the result of the function had to have a fixed length.
You simply need to unlist the table and then coerce back to a list:
> DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by=y]
> DTCounts
y x1.a x2.a x2.b
1: 1 3 2 1
2: 2 3 2 1
if you do not like the dots in the names, you can sub them out:
> setnames(DTCounts, sub("\\.", "_", names(DTCounts)))
> DTCounts
y x1_a x2_a x2_b
1: 1 3 2 1
2: 2 3 2 1
Note that if not all values in a column are present for each group
(ie, if x2=c("a", "b") when y=1, but x2=c("b", "b") when y=2)
then the above breaks.
The solution is to make the columns factors before counting.
DT[, lapply(.SD, is.factor)]       # check which columns are already factors
## OR
columnsToConvert <- c("x1", "x2")  # or .. <- setdiff(names(DT), "y")
# rebuild DT with x1 and x2 as factors, keeping y as-is
DT <- cbind(DT[, lapply(.SD, factor), .SDcols = columnsToConvert], y = DT[, y])
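To see why the factor conversion matters, here is a small made-up example of mine (DT2 illustrates the y=1 / y=2 case described above): level "a" never occurs in x2 for the second group, yet it still gets its own zero-count column because the factor levels are fixed:
DT2 <- data.table(x1 = factor(c("a", "a", "a", "a")),
                  x2 = factor(c("a", "b", "b", "b"), levels = c("a", "b")),
                  y  = rep(1:2, each = 2), key = "y")
DT2[, as.list(unlist(lapply(.SD, table))), by = y]
#    y x1.a x2.a x2.b
# 1: 1    2    1    1
# 2: 2    2    0    2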
