Extract the best attributes from a data.table - r

I have a data.table:
> (a <- data.table(id=c(1,1,1,2,2,3),
                   attribute=c("a","b","c","a","b","c"),
                   importance=1:6,
                   key=c("id","importance")))
   id attribute importance
1:  1         a          1
2:  1         b          2
3:  1         c          3
4:  2         a          4
5:  2         b          5
6:  3         c          6
I want:
--1-- sort it by the second key in decreasing order (i.e., the most important attributes should come first)
--2-- select the top 2 (or 10) attributes for each id, i.e.:
   id attribute importance
3:  1         c          3
2:  1         b          2
5:  2         b          5
4:  2         a          4
6:  3         c          6
--3-- pivot the above:
id attribute.1 importance.1 attribute.2 importance.2
 1           c            3           b            2
 2           b            5           a            4
 3           c            6          NA           NA
It appears that the last operation can be done with something like:
a[, {
    tmp <- .SD[.N:1]
    list(a1 = tmp$attribute[1],
         i1 = tmp$importance[1])
}, by=id]
Is this The Right Way?
How do I do the first two tasks?

I'd do the first two tasks like this:
a[a[, .I[.N:(.N-1)], by=list(id)]$V1]
The inner a[, .I[.N:(.N-1)], by=list(id)] gives you the indices in the order you require for every unique group in id. Then you subset a with the V1 column (which holds those indices in the order you require).
You'll have to take care of out-of-range (zero or negative) indices here, for groups with fewer rows than you're selecting; maybe something like:
a[a[, .I[seq.int(.N, max(.N-1L, 1L))], by=list(id)]$V1]
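Putting all three tasks together, here is a minimal sketch (assuming data.table >= 1.9.6, where dcast accepts multiple value.var columns; the rank helper column is mine, not from the question):
library(data.table)

# Tasks 1-2: top 2 rows per id, most important first,
# guarding groups that have fewer than 2 rows
top2 <- a[a[, .I[seq.int(.N, max(.N - 1L, 1L))], by=id]$V1]

# Task 3: number the rows within each id, then cast to wide format
top2[, rank := seq_len(.N), by=id]
dcast(top2, id ~ rank, value.var = c("attribute", "importance"))
#    id attribute_1 attribute_2 importance_1 importance_2
# 1:  1           c           b            3            2
# 2:  2           b           a            5            4
# 3:  3           c        <NA>            6           NA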

Related

count characters based on the order they appear

How does one count the characters of a length-one string based on the order in which they appear? Below is a minimal example:
x <- "abbccdddaab"
My first thought was this, but it only counts them irrespective of order:
table(unlist(strsplit(x, "\\b")))
a b c d
3 3 2 3
But the desired output is:
a b c d a b
1 2 2 3 2 1
I would imagine the solution would require a for loop?
We can use rle instead of table, as rle returns values together with run lengths, computed by checking whether adjacent elements are the same or not:
out <- rle(strsplit(x, "\\b")[[1]])
setNames(out$lengths, out$values)
# a b c d a b
# 1 2 2 3 2 1
Using data.table::rleid:
x <- "abbccdddaab"
tmp <- strsplit(x, "\\b")[[1]]
table(data.table::rleid(tmp))
#1 2 3 4 5 6
#1 2 2 3 2 1
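If you also want the character labels on that table (as in the rle answer), one small sketch of my own names the counts with the first element of each run:
r <- data.table::rleid(tmp)
counts <- table(r)
names(counts) <- tmp[!duplicated(r)]  # first character of each run
counts
# a b c d a b
# 1 2 2 3 2 1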

Using another data table to condition on columns in a primary data table - r

Suppose I have two data tables, and I want to use the second one, which contains a single row of column values, to filter the first one.
Specifically, I want to select the rows of d1 whose values in those columns are less than or equal to the corresponding values in d2.
d1 = data.table('d'=1,'v1'=1:10, 'v2'=1:10)
d2 = data.table('v1'=5, 'v2'=5)
So I would want the output to be
d v1 v2
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 1 4 4
5: 1 5 5
But I want to do this without referencing specific names unless it's in a very general way, e.g. names(d2).
You could do it with a bit of text manipulation and a join:
d2[d1, on=sprintf("%1$s>=%1$s", names(d2)), nomatch=0]
# v1 v2 d
#1: 1 1 1
#2: 2 2 1
#3: 3 3 1
#4: 4 4 1
#5: 5 5 1
It works because the sprintf expands to:
sprintf("%1$s>=%1$s", names(d2))
#[1] "v1>=v1" "v2>=v2"

I found a strange thing (bug?) about the 'combn' function and the 'data.table' package [all possible combinations by group]

I tried to find all possible combinations of ids by group, using the combn function and the data.table package as a related post teaches (Generate All ID Pairs, by group with data.table in R).
This gives me the expected result.
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. It's very weird; I tried to understand it for about three hours, but I can't. Isn't it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
Any explanation would be much appreciated.
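This isn't a data.table bug; it's documented behavior of combn: when x is a single positive integer, combn(x, m) treats it as seq_len(x). Groups A and C each contain a single id here, so for group C combn(4, 2) enumerates all pairs of 1:4. A sketch of a guard for single-row groups:
library(data.table)
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))

# Return NULL for single-id groups so they are skipped,
# avoiding combn's seq_len() expansion of a lone integer
dat2[, if (.N > 1) as.data.table(t(combn(ids, 2))), by = groups]
#    groups V1 V2
# 1:      B  1  3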

Keep only 'by' variables when collapsing data.table

I have a very large data.table:
DT <- data.table(a=c(1,1,1,1,2,2,2,2,3,3,3,3),b=c(1,1,2,2),c=1:12)
And I need to collapse it by several variables, e.g. list(a,b). Easy:
DT[,sum(c),by=list(a,b)]
a b V1
1: 1 1 3
2: 1 2 7
3: 2 1 11
4: 2 2 15
5: 3 1 19
6: 3 2 23
However, I don't want to perform any operation on c; I just want to drop it:
DT[,,by=list(a,b)] # includes a,b,c, thus does not collapse
DT[,list(),by=list(a,b)] # zero rows
DT[,a,by=list(a,b)] # what I want but adds extraneous column a after 'by' columns
How can I specify X below to get the indicated result?
DT[,X,by=list(a,b)]
a b
1: 1 1
2: 1 2
3: 2 1
4: 2 2
5: 3 1
6: 3 2
unique.data.table has a by argument; you can then subset the result to keep only the columns you want, e.g.:
unique(DT, by = c('a', 'b'))[, c('a','b')]
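An alternative, since only the grouping columns are needed, is to select them first and then deduplicate, which avoids carrying c through at all:
unique(DT[, .(a, b)])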

R data.table subsetting on multiple conditions

With the below data set, how do I write a data.table call that subsets the table and returns all customer IDs and the associated orders for any customer who has ever purchased SKU 1?
The expected result should exclude cids 3 and 5, which never match that condition, and include every row for the customers matching sku==1.
I am getting stuck because I don't know how to write a "contains" statement; a literal == returns only the rows whose sku matches the condition. I am sure there is a better way...
library("data.table")
df <- data.frame(cid=c(1,1,1,1,1,2,2,2,2,2,3,4,5,5,6,6),
                 order=c(1,1,1,2,3,4,4,4,5,5,6,7,8,8,9,9),
                 sku=c(1,2,3,2,3,1,2,3,1,3,2,1,2,3,1,2))
dt=as.data.table(df)
This is similar to a previous answer, but here the subsetting works in a more data.table-like manner.
First, let's take the cids that meet our condition:
matching_cids = dt[sku==1, cid]
The %in% operator allows us to filter to just those items that are contained in the list. So, using the above:
dt[cid %in% matching_cids]
or on one line:
> dt[cid %in% dt[sku==1, cid]]
cid order sku
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 2 2
5: 1 3 3
6: 2 4 1
7: 2 4 2
8: 2 4 3
9: 2 5 1
10: 2 5 3
11: 4 7 1
12: 6 9 1
13: 6 9 2
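Another idiomatic option (a sketch using grouped filtering) keeps each cid's rows only when the group contains sku 1; dt[, if (any(sku == 1)) .SD, by = cid] is equivalent:
# Keep a whole group iff any of its rows has sku == 1
dt[, .SD[any(sku == 1)], by = cid]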
I would have thought that it was more (?!) data.table to use keys. I couldn't quite work out how to get the whole lot onto a single line, but I think this would be a bit quicker on large data because, as I understand it (and I may very well be mistaken), this is the only solution presented so far that avoids vector scanning, which is slow compared to binary search:
# Set initial key
setkey(dt,sku)
# Select only rows with 1 in the sku and return first example of each, setting key to customer id
dts <- dt[ J(1) , .SD[1] , keyby = cid ]
# change key of dt to cid to match customer id
setkey(dt,cid)
# join based on common key
dt[dts,.SD]
# cid order sku
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 2
# 4: 1 1 3
# 5: 1 3 3
# 6: 2 4 1
# 7: 2 5 1
# 8: 2 4 2
# 9: 2 4 3
#10: 2 5 3
#11: 4 7 1
#12: 6 9 1
#13: 6 9 2
An alternative that you can do on one line is to use a data.table merge like so...
setkey(dt,sku)
merge( dt[ J(1) , .SD[1] , keyby = cid ] , dt , by = "cid" )
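For what it's worth, on current data.table versions you can also write this as a keyless semi-join with on= (my own addition, not part of the answer above):
# Join dt to the unique cids that ever bought sku 1; no setkey needed
dt[unique(dt[sku == 1, .(cid)]), on = "cid"]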
