R data.table subsetting on multiple conditions

With the data set below, how do I write a data.table call that subsets this table and returns all customer IDs and the associated orders for those customers IF a customer has ever purchased SKU 1?
The expected result should exclude cid 3 and 5 on that condition and return every row for customers matching sku == 1.
I am getting stuck because I don't know how to write a "contains" statement; a literal == returns only the SKUs matching the condition. I am sure there is a better way.
library("data.table")
df<-data.frame(cid=c(1,1,1,1,1,2,2,2,2,2,3,4,5,5,6,6),
order=c(1,1,1,2,3,4,4,4,5,5,6,7,8,8,9,9),
sku=c(1,2,3,2,3,1,2,3,1,3,2,1,2,3,1,2))
dt=as.data.table(df)

This is similar to a previous answer, but here the subsetting works in a more data.table-like manner.
First, let's take the cids that meet our condition:
matching_cids = dt[sku==1, cid]
The %in% operator lets us filter to just those items that are contained in that vector. So, using the above:
dt[cid %in% matching_cids]
Or, on one line:
> dt[cid %in% dt[sku==1, cid]]
cid order sku
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 2 2
5: 1 3 3
6: 2 4 1
7: 2 4 2
8: 2 4 3
9: 2 5 1
10: 2 5 3
11: 4 7 1
12: 6 9 1
13: 6 9 2
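A related idiom worth knowing (a sketch; it gives the same rows here) is to filter whole groups in j, returning .SD only for customers who ever bought SKU 1:
# keep every row of each cid group where any order contains sku 1
dt[, if (any(sku == 1)) .SD, by = cid]
Groups for which the condition is FALSE return NULL and are dropped, so no intermediate vector of cids is needed.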

I would have thought it was more (?!) data.table-idiomatic to use keys. I couldn't quite work out how to put the whole lot on a single line, but I think this would be a bit quicker on large data because, as I understand it (and I may very well be mistaken), this is the only solution presented so far that avoids vector scanning, which is slow compared to binary search:
# Set initial key
setkey(dt,sku)
# Select only rows with 1 in the sku and return first example of each, setting key to customer id
dts <- dt[ J(1) , .SD[1] , keyby = cid ]
# change key of dt to cid to match customer id
setkey(dt,cid)
# join based on common key
dt[dts,.SD]
# cid order sku
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 2
# 4: 1 1 3
# 5: 1 3 3
# 6: 2 4 1
# 7: 2 5 1
# 8: 2 4 2
# 9: 2 4 3
#10: 2 5 3
#11: 4 7 1
#12: 6 9 1
#13: 6 9 2
An alternative that you can do on one line is to use a data.table merge, like so:
setkey(dt,sku)
merge( dt[ J(1) , .SD[1] , keyby = cid ] , dt , by = "cid" )
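In more recent versions of data.table (1.9.6+), on= joins let you skip setkey altogether; a one-line sketch of the same idea:
# join dt to the de-duplicated cids that ever had sku == 1
dt[unique(dt[sku == 1, .(cid)]), on = "cid"]
unique() matters here: without it, customers with several sku == 1 orders would have their rows repeated once per match.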

Related

Using another data table to condition on columns in a primary data table

Suppose I have two data tables, and I want to use the second one, which contains a single row of column values, to condition the first one.
Specifically, I want to use d2 to select the rows of d1 whose variables are less than or equal to the corresponding values in d2.
d1 = data.table('d'=1,'v1'=1:10, 'v2'=1:10)
d2 = data.table('v1'=5, 'v2'=5)
So I would want the output to be
d v1 v2
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 1 4 4
5: 1 5 5
But I want to do this without referencing specific names unless it's in a very general way, e.g. names(d2).
You could do it with a bit of text manipulation and a join:
d2[d1, on=sprintf("%1$s>=%1$s", names(d2)), nomatch=0]
# v1 v2 d
#1: 1 1 1
#2: 2 2 1
#3: 3 3 1
#4: 4 4 1
#5: 5 5 1
It works because the sprintf expands to:
sprintf("%1$s>=%1$s", names(d2))
#[1] "v1>=v1" "v2>=v2"

Summing the number of times a value appears in either of 2 columns

I have a large data set of around 32 million rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Where N shows that Tel=1 appears once, Tel=2 appears once, and Tel=3, 4 and 5 each appear twice.
We can do a melt and match:
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
Another option is to loop through columns 2 and 3, use %in% to check whether the values in 'Tel' are present, then use Reduce and `+` to sum the logical vectors for each 'Tel', and assign (:=) the values to 'N':
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than #akrun's, but it can be useful to see.
# get a temporary data.table with the combined Origin and Destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
                 c("Tel", "N"))
# convert the columns to integer (Tel comes from the names of the table() result, and is thus character)
temp <- temp[, lapply(.SD, as.integer)]
Now, join it onto the original table:
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))

ifelse assignment in data.table

I am a teacher, and would like to correctly use the data.table package in R to automatically grade student answers in a log file, i.e. add a column called correct that is 1 if the student's answer to a particular question is a correct answer to that question, and 0 otherwise. I can do this easily if each question has only one answer, but I am getting tripped up when a question has multiple possible correct answers (questions and their possible correct answers are stored in another table).
Below is a MWE:
set.seed(123)
question_table <- data.table(id = c(1,1,2,2,3,4), correct_ans = sample(1:4, 6, replace = T))
log <- data.table(student = sample(letters[1:3], 10, replace = T),
                  question_id = c(1,1,1,2,2,2,3,3,4,4),
                  student_answer = c(2,4,1,3,2,4,4,5,2,1))
My question is: what is the correct data.table way to use ifelse in j, especially when it depends on another table?
log[, correct := ifelse(student_answer %in%
                        question_table[log$question_id %in% id]$correct_ans, 1, 0)]
As can be seen below, question 1 and 2 both have multiple possible correct answers.
> question_table
id correct_ans
1: 1 2
2: 1 4
3: 2 2
4: 2 4
5: 3 4
6: 4 1
While the correct column is calculated without errors, something isn't right: e.g. when student b answers question 1 incorrectly (row 3), he is still awarded a correct score. Only some entries of the correct column are off, which leads me to believe there is something I am not getting about how variables are scoped.
> log
student question_id student_answer correct
1: b 1 2 1
2: c 1 4 1
3: b 1 1 1 <- ?
4: b 2 3 0
5: c 2 2 1
6: b 2 4 1
7: c 3 4 1
8: b 3 5 0
9: a 4 2 1 <- ?
10: c 4 1 1
I considered making a helper column with the correct answer in the log table by joining with question_table, but that does not work since the key is not unique in the latter.
Any and all help would be appreciated.
Thanks in advance.
You can use a join:
# initialize to zero
log[, correct := 0L ]
# update to 1 if matched
log[question_table, on = c(question_id = "id", student_answer = "correct_ans"),
    correct := 1L]
student question_id student_answer correct
1: b 1 2 1
2: c 1 4 1
3: b 1 1 0
4: b 2 3 0
5: c 2 2 1
6: b 2 4 1
7: c 3 4 1
8: b 3 5 0
9: a 4 2 0
10: c 4 1 1
How it works. The syntax for an update join is X[Y, on=cols, xvar := z] (a minimal toy sketch follows this list):
If col names differ between X and Y, use on=c(xcol = "ycol", xcol2 = "ycol2") or, in version 1.9.7+, .(xcol = ycol, xcol2 = ycol2).
xvar := z will only operate on the rows of X that are matched. Sometimes it is also useful to add by=.EACHI here, depending on how many rows of X are matched by each row of Y and how complicated the expression for z is.
See ?data.table for full documentation on the syntax.
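As promised above, a minimal, self-contained sketch of the pattern (toy tables with hypothetical names, just for illustration):
X <- data.table(id = 1:3, val = 0L)
Y <- data.table(id = 2:3, z = c(10L, 20L))
X[Y, on = "id", val := i.z]  # the i. prefix refers to columns of Y
X
#    id val
# 1:  1   0
# 2:  2  10
# 3:  3  20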

Keep only 'by' variables when collapsing data.table

I have a very large data.table:
DT <- data.table(a=c(1,1,1,1,2,2,2,2,3,3,3,3),b=c(1,1,2,2),c=1:12)
And I need to collapse it by several variables, e.g. list(a,b). Easy:
DT[,sum(c),by=list(a,b)]
a b V1
1: 1 1 3
2: 1 2 7
3: 2 1 11
4: 2 2 15
5: 3 1 19
6: 3 2 23
However, I don't want to perform any operation on c; I just want to drop it:
DT[,,by=list(a,b)] # includes a,b,c, thus does not collapse
DT[,list(),by=list(a,b)] # zero rows
DT[,a,by=list(a,b)] # what I want but adds extraneous column a after 'by' columns
How can I specify X below to get the indicated result?
DT[,X,by=list(a,b)]
a b
1: 1 1
2: 1 2
3: 2 1
4: 2 2
5: 3 1
6: 3 2
unique.data.table has a by argument; you can then subset the result to get the columns you want.
E.g.:
unique(DT, by = c('a', 'b'))[, c('a','b')]
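Alternatively (a sketch), you can select the grouping columns first and then de-duplicate, which avoids carrying the other columns along at all:
unique(DT[, .(a, b)])
#    a b
# 1: 1 1
# 2: 1 2
# 3: 2 1
# 4: 2 2
# 5: 3 1
# 6: 3 2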

Extract the best attributes from a data.table

I have a data.table:
> (a <- data.table(id=c(1,1,1,2,2,3),
                   attribute=c("a","b","c","a","b","c"),
                   importance=1:6,
                   key=c("id","importance")))
id attribute importance
1: 1 a 1
2: 1 b 2
3: 1 c 3
4: 2 a 4
5: 2 b 5
6: 3 c 6
I want:
--1-- sort it by the second key in the decreasing order (i.e., the most important attributes should come first)
--2-- select the top 2 (or 10) attributes for each id, i.e.:
id attribute importance
3: 1 c 3
2: 1 b 2
5: 2 b 5
4: 2 a 4
6: 3 c 6
--3-- pivot the above:
id attribute.1 importance.1 attribute.2 importance.2
1 c 3 b 2
2 b 5 a 4
3 c 6 NA NA
It appears that the last operation can be done with something like:
a[, {
  tmp <- .SD[.N:1]
  list(a1 = tmp$attribute[1],
       i1 = tmp$importance[1])
}, by = id]
Is this The Right Way?
How do I do the first two tasks?
I'd do the first two tasks like this:
a[a[, .I[.N:(.N-1)], by=list(id)]$V1]
The inner a[, .I[.N:(.N-1)], by=list(id)] gives you the indices in the order you require for every unique group in id. Then you subset a with the V1 column (which holds those indices in the order you require).
You'll have to take care of zero or negative indices here, when a group has fewer rows than you ask for; maybe something like:
a[a[, .I[seq.int(.N, max(.N-1L, 1L))], by=list(id)]$V1]
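For the pivot in --3--, a possible sketch using dcast (assuming data.table 1.9.6+, which supports multiple value.var columns):
# top 2 rows per id, as above
top2 <- a[a[, .I[seq.int(.N, max(.N - 1L, 1L))], by = id]$V1]
# rank within each id, then cast both columns wide
top2[, rank := seq_len(.N), by = id]
dcast(top2, id ~ rank, value.var = c("attribute", "importance"))
#    id attribute_1 attribute_2 importance_1 importance_2
# 1:  1           c           b            3            2
# 2:  2           b           a            5            4
# 3:  3           c        <NA>            6           NA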
