between vs inrange in data.table - r

In R's data.table, when should one choose between %between% and %inrange% for subsetting operations? I've read the help page for ?between and I'm still scratching my head as to the differences.
library(data.table)
X = data.table(a=1:5, b=6:10, c=c(5:1))
> X[b %between% c(7,9)]
a b c
1: 2 7 4
2: 3 8 3
3: 4 9 2
> X[b %inrange% c(7,9)]
a b c
1: 2 7 4
2: 3 8 3
3: 4 9 2
They look the same to me. Could someone please explain why there exist both operations?

> X
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
4: 4 9 2
5: 5 10 1
Using the example in the comments:
> X[a %between% list(c, b)]
a b c
1: 3 8 3
2: 4 9 2
3: 5 10 1
> X[a %inrange% list(c, b)]
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
4: 4 9 2
5: 5 10 1
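It seems between looks at each row individually and checks whether the value in a satisfies c <= a <= b for that row.
inrange instead treats each pair (c[i], b[i]) as an interval and checks, for each value in a, whether it falls inside any of those intervals. Here the intervals are [5,6], [4,7], [3,8], [2,9] and [1,10], which together cover everything from 1 to 10, so every row matches. It only looks like a single range [cmin, bmax] because the intervals overlap; with disjoint intervals, values falling in the gaps would not match.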
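The difference is easier to see when the intervals do not overlap. The following is a minimal sketch with made-up data (the column names lo and hi are chosen for illustration only): the value 5 lies between the smallest lower bound and the largest upper bound, yet it is not inside any single interval.

```r
library(data.table)

# Three disjoint intervals: [0, 2], [8, 12], [20, 30]
X <- data.table(a = c(1, 5, 10), lo = c(0, 8, 20), hi = c(2, 12, 30))

# %between% compares row by row: a[i] against [lo[i], hi[i]]
between_rows <- X[a %between% list(lo, hi)]

# %inrange% checks each a[i] against ALL intervals and keeps it if any one matches
inrange_rows <- X[a %inrange% list(lo, hi)]

print(between_rows$a)  # 1
print(inrange_rows$a)  # 1 10
```

Here 10 is kept by %inrange% because it falls in [8, 12], an interval defined on a different row, while 5 is dropped by both operators.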

Related

Is there some way to keep variable names from .SD + .SDcols together with non-.SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) is lost. (This is why I am re-asking the question, which I initially marked as a duplicate of the one linked above.)
Is there some way to keep the names from both .SD variables and non .SD variables?
The goal is simultaneously being able to use .() to select variables without quotes and being able to select variables through .SDcols = patterns("...")
Thanks in advance!
not really sure why.. but it works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6
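Another option, sketched here as an assumption rather than as the accepted solution, is to name the non-.SD element explicitly before combining it with .SD. The example uses list(a = a) and a character vector for .SDcols to stay on the conservative side; the explicit name costs a little typing, but it does not depend on how (.SD) is evaluated.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = rep(c(1, 3, 6), 3),
                 a = 1:9,
                 b = 9:1)

# c() flattens the named list and .SD into one list of columns,
# so the explicit name "a" and the .SD column names are both kept
res <- DT[, c(list(a = a), .SD), .SDcols = c("x", "v", "y")]

print(names(res))  # "a" "x" "v" "y"
```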

Mixing by and .SDcols in data.table

I am trying to mix by and .SDcols in data.table 1.9.6 from CRAN (also tested on the dev version from GitHub, so it is likely a misunderstanding on my part).
f = function(x){
print(x);
res=data.table(X=x,Y=x*x);
return(res)
}
DT = data.table(x=1:4, y=rep(c('a','b'),2))
DT[,c('A','B'):=lapply(.SD,FUN=f),.SDcols='x',by=y]
I get:
[1] 1 3
Error in `[.data.table`(DT, , `:=`(c("A", "B"), lapply(.SD, FUN = f)), :
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
I would expect
x y A B
1: 1 a 1 1
2: 2 b 2 4
3: 3 a 3 9
4: 4 b 4 16
I would have expected the by operation to take place and .SDcols to be replaced by 'x'. Could someone explain why I am wrong here?
All of the following work. As @Frank pointed out, the problem was the extra level of list nesting introduced by lapply.
DT[,f(.SD[[1]]),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,lapply(.SD, f)[[1]],.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,rbindlist(lapply(.SD, f)),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,sapply(.SD, f),.SDcols='x',by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,mapply(FUN=f, mget('x')),by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
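To get the result the question actually expected (new columns A and B added to DT by reference), the same idea can be applied with :=. lapply(.SD, f) returns a list with one element per .SD column, and that element is itself a two-column table, so := receives one item instead of two. Calling the function directly on the column avoids the extra level. The function f2 below is a hypothetical, print-free variant of f that returns a plain list.

```r
library(data.table)

# Hypothetical variant of f: returns a plain two-element list
f2 <- function(x) list(X = x, Y = x * x)

DT <- data.table(x = 1:4, y = rep(c("a", "b"), 2))

# .SD[[1]] is the single column selected by .SDcols, passed as a vector;
# f2 returns two vectors, which := assigns to A and B within each group
DT[, c("A", "B") := f2(.SD[[1]]), by = y, .SDcols = "x"]

print(DT)
```

Because := assigns in place, the rows keep their original order instead of being sorted by group.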

Replace row values in data.table using 'by' and conditions

I am trying to replace certain row values in a column according to conditions in another column, within a grouping.
EDIT: edited to highlight the recursive nature of the problem.
E.g.
DT = data.table(y=rep(c(1,3), each = 3)
,v=as.numeric(c(1,2,4,4,5,8))
,x=as.numeric(rep(c(9:11),each=2)),key=c("y","v"))
DT
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 10
4: 3 4 10
5: 3 5 11
6: 3 8 11
Within each 'y', I then want to replace the value of 'x' with 2222 (or, in reality, the result of a function) in every row whose 'v' equals another 'v' in the same group plus t (e.g. t = 3), to get the following result:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 10
5: 3 5 11
6: 3 8 2222
I have tried the following, but to no avail.
DT[which((v-3) %in% v), x:= 2222, y][]
And it mysteriously (?) results in:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 2222
5: 3 5 2222
6: 3 8 2222
Running:
DT[,print(which((v-3) %in% v)), by =y]
indicates that the indexing within the groups is correct, but I don't understand what happens from there.
You could try using replace (which could have some overhead because it copies whole x)
DT[, x:=replace(x, which(v %in% (v+3)), 2222), by=y]
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Alternatively, you could create a logical index column and then do the assignment in the next step
DT[,indx:=v %in% (v+3), by=y][(indx), x:=2222, by=y][, indx:=NULL]
DT
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Or slightly modifying your own approach using .I in order to create an index
indx <- DT[, .I[which((v-3) %in% v)], by = y]$V1
DT[indx, x := 2222]
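The reason the original attempt touched rows 3 to 6 is that i is evaluated once on the whole table, before by splits the data into groups. The sketch below rebuilds the question's data (without the key, which does not matter here) and compares the row numbers found globally with those found per group via .I.

```r
library(data.table)

DT <- data.table(y = rep(c(1, 3), each = 3),
                 v = c(1, 2, 4, 4, 5, 8),
                 x = c(9, 9, 10, 10, 11, 11))

# Evaluated over all six rows at once, ignoring y:
# this is what happens when the condition is placed in i
rows_global <- DT[, which((v - 3) %in% v)]

# Evaluated separately within each y; .I holds the row numbers in DT
rows_grouped <- DT[, .I[(v - 3) %in% v], by = y]$V1

print(rows_global)   # 3 4 5 6
print(rows_grouped)  # 3 6
```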

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove all rows belonging to an id that occurs fewer than 5 times. The variable "id" is the grouping variable, and a group should be deleted when it has fewer than 5 rows. In DT, I need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows, leaving:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)

data.table 1.8.1.: "DT1 = DT2" is not the same as DT1 = copy(DT2)?

I've noticed some inconsistent (inconsistent to me) behaviour in data.table when using different assignment operators. I have to admit I never quite got the difference between "=" and copy(), so maybe we can shed some light here. If you use "=" or "<-" instead of copy() below, upon changing the copied data.table, the original data.table will change as well.
Please execute the following commands and you will see what I mean
library(data.table)
example(data.table)
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
DT2 = DT
Now I'll change the v column of DT2:
DT2[ ,v:=3L]
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
but look what happened to DT:
DT
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
It changed as well: changing DT2 changed the original DT. This does not happen if I use copy():
example(data.table) # reset DT
DT3 <- copy(DT)
DT3[, v:= 3L]
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
Is this behaviour expected?
Yes. This is expected behaviour, and well documented.
Since data.table uses references to the original object to achieve modify-in-place, it is very fast.
For this reason, if you really want to copy the data, you need to use copy(DT)
From the documentation for ?copy:
The data.table is modified by reference, and returned (invisibly) so
it can be used in compound statements; e.g., setkey(DT,a)[J("foo")].
If you require a copy, take a copy first (using DT2=copy(DT)). copy()
may also sometimes be useful before := is used to subassign to a
column by reference. See ?copy.
See also this question :
Understanding exactly when a data.table is a reference to vs a copy of another
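A compact way to see the difference, using a small made-up table instead of the one from example(data.table):

```r
library(data.table)

DT <- data.table(v = 1:3)

# Plain assignment: DT2 is another name for the same object
DT2 <- DT
DT2[, v := 0L]
after_alias <- DT$v   # DT has changed too

# copy(): DT3 is an independent table
DT3 <- copy(DT)
DT3[, v := 9L]
after_copy <- DT$v    # DT is unaffected this time

print(after_alias)  # 0 0 0
print(after_copy)   # 0 0 0
print(DT3$v)        # 9 9 9
```

The key point is that := modifies the table in place, so every name bound to that table sees the change; copy() is what creates a second, separate table.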
