ifelse assignment in data.table - r

I am a teacher and would like to use the data.table package in R to automatically grade student answers in a log file, i.e. add a column called correct that is 1 if the student's answer to a particular question is one of the correct answers to that question, and 0 otherwise. I can do this easily if each question has only one correct answer, but I am getting tripped up when a question has multiple possible answers (questions and their possible correct answers are stored in another table).
Below is an MWE:
library(data.table)

set.seed(123)
question_table <- data.table(id = c(1, 1, 2, 2, 3, 4),
                             correct_ans = sample(1:4, 6, replace = TRUE))
log <- data.table(student = sample(letters[1:3], 10, replace = TRUE),
                  question_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
                  student_answer = c(2, 4, 1, 3, 2, 4, 4, 5, 2, 1))
My question is: what is the correct data.table way to use ifelse in j, especially when the condition depends on another table? This is what I tried:
log[, correct := ifelse(student_answer %in%
                          question_table[log$question_id %in% id]$correct_ans, 1, 0)]
As can be seen below, questions 1 and 2 both have multiple possible correct answers.
> question_table
id correct_ans
1: 1 2
2: 1 4
3: 2 2
4: 2 4
5: 3 4
6: 4 1
While the correct column is calculated without errors, something isn't right: e.g. when student b answers question 1 with 1 (row 3 below), he is awarded a correct score even though he answered incorrectly. Only some entries of the correct column are off, which leads me to believe there is something I am not getting about how variables are scoped.
> log
student question_id student_answer correct
1: b 1 2 1
2: c 1 4 1
3: b 1 1 1 <- ?
4: b 2 3 0
5: c 2 2 1
6: b 2 4 1
7: c 3 4 1
8: b 3 5 0
9: a 4 2 1 <- ?
10: c 4 1 1
I considered making a helper column with the correct answer in the log table by joining with question_table, but that does not work since the key is not unique in the latter.
Any and all help would be appreciated.
Thanks in advance.

You can use a join:
# initialize to zero
log[, correct := 0L ]
# update to 1 if matched
log[question_table, on = c(question_id = "id", student_answer = "correct_ans"),
    correct := 1L]
student question_id student_answer correct
1: b 1 2 1
2: c 1 4 1
3: b 1 1 0
4: b 2 3 0
5: c 2 2 1
6: b 2 4 1
7: c 3 4 1
8: b 3 5 0
9: a 4 2 0
10: c 4 1 1
How it works. The syntax for an update join is X[Y, on=cols, xvar := z]:
If col names differ between X and Y, use on=c(xcol = "ycol", xcol2 = "ycol2") or, in version 1.9.7+, .(xcol = ycol, xcol2 = ycol2).
xvar := z will only operate on the rows of X that are matched. Sometimes it is also useful to use by=.EACHI here, depending on how many rows of X are matched by each row of Y and how complicated the expression for z is (see the sketch below).
See ?data.table for full documentation on the syntax.
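As a rough illustration of by=.EACHI with the same two tables (a sketch, not part of the original answer): j is evaluated once per row of the i table, grouping over the rows of X that it matches, so you can, for example, count how many log rows each question id matches.
# count log attempts per question: j runs once for each row of the i table
# (the unique question ids), over the matched rows of log
log[unique(question_table[, .(id)]),
    on = c(question_id = "id"),
    .(attempts = .N),
    by = .EACHI]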

Related

Cross join in data.table doesn't seem to retain column names

The data.table documentation (see ?CJ) says this:
x = c(1,1,2)
y = c(4,6,4)
CJ(x, y) # output columns are automatically named 'x' and 'y'
However, when I run the example, the names don't seem to be retained:
x = c(1,1,2)
y = c(4,6,4)
CJ(x, y)
V1 V2
1: 1 4
2: 1 4
3: 1 4
4: 1 4
5: 1 6
6: 1 6
7: 2 4
8: 2 4
9: 2 6
That names are retained is not mentioned in the main body of the help file ?CJ, that is, in its Details or Value sections. However, name retention is mentioned in a comment in the Examples section of the help file (which appears to be where you got your example).
Digging around in the CJ function, which appears to be entirely implemented in R, there is a block near the end:
if (getOption("datatable.CJ.names", FALSE))
    vnames = name_dots(...)$vnames
Running getOption("datatable.CJ.names", FALSE) returns FALSE with data.table version 1.12.0. When we set this to TRUE with
options("datatable.CJ.names"=TRUE)
then the code
x = c(1,1,2)
y = c(4,6,4)
CJ(x, y)
returns
x y
1: 1 4
2: 1 4
3: 1 4
4: 1 4
5: 1 6
6: 1 6
7: 2 4
8: 2 4
9: 2 6
However, you are also able to directly provide names (which is not mentioned in the help file).
CJ(uu=x, vv=y)
which returns
uu vv
1: 1 4
2: 1 4
3: 1 4
4: 1 4
5: 1 6
6: 1 6
7: 2 4
8: 2 4
9: 2 6
Note that this overrides the above option.
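If you would rather not depend on the global option at all (leaving it at its default FALSE), one workaround, just a sketch and not taken from the help file, is to rename the result explicitly after the cross join:
res <- CJ(x, y)
setnames(res, c("x", "y"))  # overwrite the default V1/V2 names
res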

Subsetting data.table by values with repetition

I would like to subset a data.table in a way that returns some rows repeatedly. This works well with indices, but I am not sure how to do it simply with values, especially when a value appears in more than one row.
E.g.:
library(data.table)
dt <- data.table(x1 = c('a','a','b','b','c','c'), x2 = c(1,2,3,4,5,6))
xsel <- c('a','b','a','a','c','b')
dt[x1 %in% xsel, ]
will provide this output:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: c 5
6: c 6
I would like to get the rows in the order and with the repetition given by the xsel vector. Is it possible to do this in a reasonably simple way without looping? Thanks.
Using:
setkey(dt, x1) # set the key
dt[J(xsel)] # or: dt[.(xsel)]
gives:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: a 1
6: a 2
7: a 1
8: a 2
9: c 5
10: c 6
11: b 3
12: b 4
Without setting the key, you could use:
dt[.(xsel), on = .(x1)]
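An equivalent way to write the keyless join, in case the .() alias looks unfamiliar (a sketch of the same idea): pass the lookup values as a one-column data.table and join on x1; each element of xsel pulls in all of its matching rows, in xsel order.
dt[data.table(x1 = xsel), on = "x1"]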

How can I skip groups while subsetting with key by in data.table?

I have this DT:
dt = data.table(ID = c(rep(letters[1:2], each = 4), 'b'), value = seq(1, 9))
ID value
1: a 1
2: a 2
3: a 3
4: a 4
5: b 5
6: b 6
7: b 7
8: b 8
9: b 9
I need to eliminate groups while subsetting but only when the data fulfils some condition. Something like this does not work:
dt[, {if (.N == 4) .SD else NULL
      v1}, by = "ID"]
So I need to remove the groups that do not meet the condition. In this example I would like to skip the groups whose length is different from 4, so that I get:
ID value
1: a 1
2: a 2
3: a 3
4: a 4
But I haven't been able to work this out; I would appreciate any help.
@jangorecki came up with the answer:
does dt[, if (.N==4) .SD, by="ID"] answer your question?
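For reference, here is that one-liner alongside a roughly equivalent variant built on row indices (only a sketch; the .I form mirrors the idiom used in the next answer below):
# j returns .SD only for groups with exactly 4 rows, so other groups
# contribute nothing to the result
dt[, if (.N == 4) .SD, by = "ID"]

# roughly equivalent: .I holds the row numbers of each group; they are kept
# only when the group size is 4, and the collected indices subset dt
dt[dt[, .I[.N == 4], by = ID]$V1]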

Extract the best attributes from a data.table

I have a data.table:
> (a <- data.table(id = c(1, 1, 1, 2, 2, 3),
                   attribute = c("a", "b", "c", "a", "b", "c"),
                   importance = 1:6,
                   key = c("id", "importance")))
id attribute importance
1: 1 a 1
2: 1 b 2
3: 1 c 3
4: 2 a 4
5: 2 b 5
6: 3 c 6
I want:
--1-- sort it by the second key in decreasing order (i.e., the most important attributes should come first)
--2-- select the top 2 (or 10) attributes for each id, i.e.:
id attribute importance
3: 1 c 3
2: 1 b 2
5: 2 b 5
4: 2 a 4
6: 3 c 6
--3-- pivot the above:
id attribute.1 importance.1 attribute.2 importance.2
1 c 3 b 2
2 b 5 a 4
3 c 6 NA NA
It appears that the last operation can be done with something like:
a[, {
  tmp <- .SD[.N:1]
  list(a1 = tmp$attribute[1],
       i1 = tmp$importance[1])
}, by = id]
Is this The Right Way?
How do I do the first two tasks?
I'd do the first two tasks like this:
a[a[, .I[.N:(.N-1)], by=list(id)]$V1]
The inner a[, .I[.N:(.N-1)], by=list(id)] gives you the indices in the order you require for every unique group in id. Then you subset a with the V1 column (which holds those indices in the required order).
You'll have to take care of negative indices here, maybe something like:
a[a[, .I[seq.int(.N, max(.N-1L, 1L))], by=list(id)]$V1]
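For the third task (the pivot), here is a possible sketch built on the same idea, using dcast (not part of the original answer; the result's column names are whatever dcast generates, e.g. attribute_1, importance_1):
top2 <- a[order(-importance), head(.SD, 2), by = id]  # tasks 1 and 2
top2[, rank := seq_len(.N), by = id]                  # 1 = most important
dcast(top2, id ~ rank, value.var = c("attribute", "importance"))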

R data.table subsetting on multiple conditions.

With the data set below, how do I write a data.table call that subsets this table and returns all customer IDs and the associated orders for each customer IF that customer has ever purchased SKU 1?
The expected result should exclude cids 3 and 5 on that condition and keep every row for the customers that do have a sku == 1 purchase.
I am getting stuck because I don't know how to write a "contains" statement; a literal == returns only the rows whose sku matches the condition... I am sure there is a better way.
library("data.table")
df<-data.frame(cid=c(1,1,1,1,1,2,2,2,2,2,3,4,5,5,6,6),
order=c(1,1,1,2,3,4,4,4,5,5,6,7,8,8,9,9),
sku=c(1,2,3,2,3,1,2,3,1,3,2,1,2,3,1,2))
dt=as.data.table(df)
This is similar to a previous answer, but here the subsetting works in a more data.table-like manner.
First, let's take the cids that meet our condition:
matching_cids = dt[sku==1, cid]
The %in% operator allows us to filter to just those items that are contained in that list. So, using the above:
dt[cid %in% matching_cids]
or on one line:
> dt[cid %in% dt[sku==1, cid]]
cid order sku
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 2 2
5: 1 3 3
6: 2 4 1
7: 2 4 2
8: 2 4 3
9: 2 5 1
10: 2 5 3
11: 4 7 1
12: 6 9 1
13: 6 9 2
I would have thought that it was more (?!) data.table to use keys. I couldn't quite work out how to stick the whole lot on a single line, but I think that this would be a bit quicker on large data, because as I understand it (and I may very well be mistaken) this is the only solution presented thus far that avoids vector scanning (which is slow compared to binary search):
# Set initial key
setkey(dt,sku)
# Select only rows with 1 in the sku and return first example of each, setting key to customer id
dts <- dt[ J(1) , .SD[1] , keyby = cid ]
# change key of dt to cid to match customer id
setkey(dt,cid)
# join based on common key
dt[dts,.SD]
# cid order sku
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 2
# 4: 1 1 3
# 5: 1 3 3
# 6: 2 4 1
# 7: 2 5 1
# 8: 2 4 2
# 9: 2 4 3
#10: 2 5 3
#11: 4 7 1
#12: 6 9 1
#13: 6 9 2
An alternative that you can do on one line is to use a data.table merge like so...
setkey(dt,sku)
merge( dt[ J(1) , .SD[1] , keyby = cid ] , dt , by = "cid" )
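In more recent data.table versions the same idea can also be written without setting keys at all, using on= (just a sketch of the "customers who ever bought SKU 1" approach, not one of the original answers):
# unique customers with at least one sku == 1 row, joined back to dt
dt[unique(dt[sku == 1, .(cid)]), on = "cid"]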
