Subsetting data.table by values with repetition - r

I would like to subset data.table in a way which would provide some rows repeatedly. This works well with indices but I am not sure how to do it in a simple way with values especially when the values does not appear in one row only.
E.g.:
library(data.table)
dt<-data.table(x1=c('a','a','b','b','c','c'),x2=c(1,2,3,4,5,6))
xsel<-c('a','b','a','a','c','b')
dt[x1%in%xsel,]
will provide this output:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: c 5
6: c 6
I would like to get it in the original order and with repetition exactly as it is in the xsel vector. Is it possible to do it in a reasonably simple way without looping? Thanks.

Using:
setkey(dt, x1) # set the key
dt[J(xsel)] # or: dt[.(xsel)]
gives:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: a 1
6: a 2
7: a 1
8: a 2
9: c 5
10: c 6
11: b 3
12: b 4
Without setting the key, you could use:
dt[.(xsel), on = .(x1)]

Related

Renaming multiple columns in R data.table

This is related to this question from Henrik
Assign multiple columns using := in data.table, by group
But what if I want to create a new data.table with given column names instead of assigning new columns to an existing one?
f <- function(x){list(head(x,2),tail(x,2))}
dt <- data.table(group=sample(c('a','b'),10,replace = TRUE),val=1:10)
> dt
group val
1: b 1
2: b 2
3: a 3
4: b 4
5: a 5
6: b 6
7: a 7
8: a 8
9: b 9
10: b 10
I want to get a new data.table with predefined column names by calling the function f:
dt[,c('head','tail')=f(val),by=group]
I wish to get this:
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
But it gives me an error. What I can do is create the table then change the column names, but that seems cumbersome:
> dt2 <- dt[,f(val),by=group]
> dt2
group V1 V2
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
> colnames(dt2)[-1] <- c('head','tail')
> dt2
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
Is it something I can do with one call?
From running your code as-is, this is the error I get:
dt[,c('head','tail')=f(val),by=group]
# Error: unexpected '=' in "dt2[,c('head','tail')="
The problem is using = instead of := for assignment.
On to your problem of wanting a new data.table:
dt2 <- dt[, setNames(f(val), c('head', 'tail')), by = group]

I found a strange thing (bug?) about 'combn' function and 'data.table' package [all possible combinations by group]

I tried to find all possible combinations by group. I tried to use combn function and data.table package as a below post teaches [(here is the link)](Generate All ID Pairs, by group with data.table in R
This gives me the expected result.
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. It's very weird. I tried to understand this result for about 3 hours but I can't. Isn't it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .( groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
I really appreciate it for your teaching.

How to sort a data.table using a target vector

So, I have the following data.table
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
> DT
x y
1: b 1
2: b 2
3: b 3
4: a 1
5: a 2
6: a 3
7: c 1
8: c 2
9: c 3
And I have the following vector
k <- c("2","3","1")
I want to use k as a target vector to sort DT using y and get something like this.
> DT
x y
1: b 2
2: a 2
3: c 2
4: b 3
5: a 3
6: c 3
7: b 1
8: a 1
9: c 1
Any ideas? If I use DT[order(k)] I get a subset of the original data, and that isn't what I am looking for.
Throw a call to match() in there.
DT[order(match(y, as.numeric(k)))]
# x y
# 1: b 2
# 2: a 2
# 3: c 2
# 4: b 3
# 5: a 3
# 6: c 3
# 7: b 1
# 8: a 1
# 9: c 1
Actually DT[order(match(y, k))] would work as well, but it is probably safest to make the arguments to match() of the same class just in case.
Note: match() is known to be sub-optimal in some cases. If you have a large number of rows, you may want to switch to fastmatch::fmatch for faster matching.
You can do this:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
setkey(DT,y)
DT[data.table(as.numeric(k))]
or (from the comment of Richard)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
DT[data.table(y = as.numeric(k)), on = "y"]

Extract the best attributes from a data.table

I have a data.table:
> (a <- data.table(id=c(1,1,1,2,2,3),
attribute=c("a","b","c","a","b","c"),
importance=1:6,
key=c("id","importance")))
id attribute importance
1: 1 a 1
2: 1 b 2
3: 1 c 3
4: 2 a 4
5: 2 b 5
6: 3 c 6
I want:
--1-- sort it by the second key in the decreasing order (i.e., the most important attributes should come first)
--2-- select the top 2 (or 10) attributes for each id, i.e.:
id attribute importance
3: 1 c 3
2: 1 b 2
5: 2 b 5
4: 2 a 4
6: 3 c 6
--3-- pivot the above:
id attribute.1 importance.1 attribute.2 importance.2
1 c 3 b 2
2 b 5 a 4
3 c 6 NA NA
It appears that the last operation can be done with something like:
a[,{
tmp <- .SD[.N:1];
list(a1 = tmp$attribute[1],
i1 = tmp$importance[1])
}, by=id]
Is this The Right Way?
How do I do the first two tasks?
I'd do the first two tasks like this:
a[a[, .I[.N:(.N-1)], by=list(id)]$V1]
The inner a[, .I[.N:(.N-1)], ,by=list(id)] gives you the indices in the order you require for every unique group in id. Then you subset a with the V1 column (which has the indices in the order you require).
You'll have to take care of negative indices here, maybe something like:
a[a[, .I[seq.int(.N, max(.N-1L, 1L))], by=list(id)]$V1]

Duplicates after aggregating wiith data.table

I'm new to data.table, and trying to find how many rows in my table have the same value in two columns. The resulting table had multiple rows containing the same key combination. Can someone help me with what I'm doing wrong?
labs_raw_df <- data.table(labs_raw)
setkey(labs_raw_df, NAT, LAB_TST_AN_LAB_TST_CD)
lab_pt_count <- labs_raw_df[,
list(n=.N)
,by=list(NAT, LAB_TST_AN_LAB_TST_CD)]
Both columns are character.
Writing an answer since this is too long for a comment.
I assume that you use data.table 1.8.6.
Let's create some dummy data:
set.seed(42)
labs_raw_df <- data.frame(NAT=sample(c("A","B","C"),20,TRUE),
LAB_TST_AN_LAB_TST_CD=sample(c("A","B","C"),20,TRUE),
value=sample(0:1,20,TRUE))
Now your code (with some minor corrections of naming):
library(data.table)
labs_raw_dt <- data.table(labs_raw_df)
setkey(labs_raw_dt, NAT, LAB_TST_AN_LAB_TST_CD)
lab_pt_count <- labs_raw_dt[,
list(n=.N),
by=list(NAT, LAB_TST_AN_LAB_TST_CD)]
print(lab_pt_count)
NAT LAB_TST_AN_LAB_TST_CD n
1: A A 1
2: A C 3
3: B A 2
4: B B 3
5: B C 2
6: C A 2
7: C B 2
8: C C 5
This is the expected result. Can you elaborate on how that doesn't meet your expectation?
Of course we can simplify a bit:
lab_pt_count <- labs_raw_dt[,
.N,
by=key(labs_raw_dt)]
print(lab_pt_count)
NAT LAB_TST_AN_LAB_TST_CD N
1: A A 1
2: A C 3
3: B A 2
4: B B 3
5: B C 2
6: C A 2
7: C B 2
8: C C 5
But the result is the same.

Resources