Complex filtering with data.table

I am trying to select information by group in a data.frame (or data.table), but haven't found the proper way of doing it. Consider the following example:
library(data.table)
DF <- data.table(value = c(seq(5, 1, -1), c(5, 5, 3, 2, 1)),
                 group = rep(c("A", "B"), each = 5),
                 status = rep(c("D", "A", "A", "A", "A"), 2))
    value group status
 1:     5     A      D
 2:     4     A      A
 3:     3     A      A
 4:     2     A      A
 5:     1     A      A
 6:     5     B      D
 7:     5     B      A
 8:     3     B      A
 9:     2     B      A
10:     1     B      A
I'd now like to get the max value by group when the status is alive ("A"). I have tried this:
DF[,.I[value==max(value[status!="D"])],by=group]
   group V1
1:     A  2
2:     B  6
3:     B  7
But the 6th row has status "D" (dead) and I'd like to avoid that row. I can't subset the data like this:
DF[status!="D",.I[value==max(value[status!="D"])],by=group]
as I need to compute different stats by groups, such as (doesn't work):
DF[, list("max" = max(value[status != "D"], na.rm = TRUE),
          "group" = group[.I[value == max(value[status != "D"], na.rm = TRUE)]]), by = group]
Any hint would be greatly appreciated!

If we need an index of the rows where 'status' is not 'D' and 'value' is the max of 'value', grouped by 'group':
i1 <- DF[status != "D", .I[value == max(value)], by = group]$V1
Use the index for further summarizing
DF[i1, .SD[value == max(value)], by = group]
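For reference, on the sample data i1 evaluates to c(2, 7) (row 2 for group A, row 7 for group B), so the final call should print something like:
   group value status
1:     A     4      A
2:     B     5      A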

Group a data.table using a column which is list

I have a really big problem: looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
   i j   k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6   b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to first group all the rows having "a" in column k and calculate sum(j), then all rows having "b", and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint on how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
   k V1
1: a  4
2: b  8
3: c  2
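To see what the chained call is doing, the intermediate result of the first step alone would be (derived by hand from the sample data):
a[, .(k = unlist(k)), by = .(i, j)]
#    i j k
# 1: 1 2 a
# 2: 1 2 b
# 3: 2 2 a
# 4: 2 2 c
# 5: 3 6 b
The second [...] then just sums j within each unlisted k.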
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j), k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
library(dplyr)
unnest(a, k) %>%
    group_by(k) %>%
    summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
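A note for newer package versions (not part of the original answer): tidyr 1.0.0 changed the unnest() interface, so on current tidyr both snippets above need the cols argument, and the result may come back as a tibble rather than a data.table:
library(tidyr)
library(data.table)
# tidyr >= 1.0.0 requires naming the list column via cols;
# setDT() converts the result back so data.table syntax works:
setDT(unnest(a, cols = k))[, sum(j), by = k]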
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
   i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format rather than in a list column. From there, as in @MikeyMike's answer, we can dat[, sum(j), by=k].
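On the sample data this gives the same summary as the answers above:
dat[, sum(j), by = k]
#    k V1
# 1: a  4
# 2: b  8
# 3: c  2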
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]

Get the last row of a previous group in data.table

This is what my data table looks like:
library(data.table)
dt <- fread('
Product Group LastProductOfPriorGroup
A 1 NA
B 1 NA
C 2 B
D 2 B
E 2 B
F 3 E
G 3 E
')
The LastProductOfPriorGroup column is my desired column. I am trying to fetch the product from the last row of the prior group. So in the first two rows, there are no prior groups and therefore it is NA. In the third row, the product in the last row of the prior group 1 is B. I am trying to accomplish this by
dt[,LastGroupProduct:= shift(Product,1), by=shift(Group,1)]
to no avail.
You could do
dt[, newcol := shift(dt[, last(Product), by = Group]$V1)[.GRP], by = Group]
This results in the following updated dt, where newcol matches your desired column with the unnecessarily long name. ;)
   Product Group LastProductOfPriorGroup newcol
1:       A     1                      NA     NA
2:       B     1                      NA     NA
3:       C     2                       B      B
4:       D     2                       B      B
5:       E     2                       B      B
6:       F     3                       E      E
7:       G     3                       E      E
Let's break the code down from the inside out. I will use ... to denote the accumulated code:
dt[, last(Product), by = Group]$V1 is getting the last values from each group as a character vector.
shift(...) shifts the character vector in the previous call
dt[, newcol := ...[.GRP], by = Group] groups by Group and uses the internal .GRP values for indexing
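Concretely, on this data the inner call and its shift evaluate to:
dt[, last(Product), by = Group]$V1          # "B" "E" "G"
shift(dt[, last(Product), by = Group]$V1)   # NA  "B" "E"
so .GRP (the group counter 1, 2, 3) picks NA for group 1, "B" for group 2 and "E" for group 3.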
Update: Frank brings up a good point about my code above calculating the shift for every group over and over again. To avoid that, we can use either
shifted <- shift(dt[, last(Product), Group]$V1)
dt[, newcol := shifted[.GRP], by = Group]
so that we don't calculate the shift for every group. Or, we can take Frank's nice suggestion in the comments and do the following.
dt[dt[, last(Product), by = Group][, v := shift(V1)], on="Group", newcol := i.v]
Another way is to save the last group's value in a variable.
this = NA_character_ # initialize
dt[, LastProductOfPriorGroup:={ last<-this; this<-last(Product); last }, by=Group]
dt
   Product Group LastProductOfPriorGroup
1:       A     1                      NA
2:       B     1                      NA
3:       C     2                       B
4:       D     2                       B
5:       E     2                       B
6:       F     3                       E
7:       G     3                       E
NB: last() is a data.table function which returns the last item of a vector (of the Product column in this case).
This should also be fast since no logic is being invoked to fetch the last group's value; it just relies on the groups running in order (which they do).
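To make the mechanism concrete, here is a hand trace of the two variables across the groups (an annotation, not package output):
# group 1: last <- NA  (the initial this); this becomes "B"; group 1 rows get NA
# group 2: last <- "B";                    this becomes "E"; group 2 rows get "B"
# group 3: last <- "E";                    this becomes "G"; group 3 rows get "E"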

What does ".N" mean in data.table?

I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
   a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
   a N
1: A 2
2: B 1
3: C 1
I know that by=a or by="a" means grouping by the a column, and that the N column counts how many times each value of a occurs. However, I didn't use nrow(), yet I got the result. Isn't .N just a column name? I can't find the documentation via ??".N" in R. I tried to use .K, but it doesn't work. What does .N mean?
Think of .N as a variable for the number of instances. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
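Used on its own in j, .N is simply the number of rows, so another illustration is:
dt[, .N]   # total row count
# [1] 4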
Your example returns a new variable with the number of rows per case:
dt[, new_var := .N, by = a]
dt
#    a b new_var
# 1: A 4       2   # 2 'A's
# 2: A 5       2
# 3: B 6       1   # 1 'B'
# 4: C 7       1   # 1 'C'
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols

R: By group, check if for each unique value of one var, there is at least one observation where the value of the var equals the value of another var

I think I am headed in the right direction with this code, but I am not quite there yet.
I tried finding something useful on Google and SE, but I did not seem to be able to formulate the question in a way that gets me the answer I am looking for.
I could write a for-loop for this, comparing each unique value of a against b for every id, but I strive for a higher level of R-understanding and thus want to avoid loops.
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
a <- c(1,1,1,2,2,2,3,3,4,4,4,5,5,5,6)
b <- c(1,2,3,3,3,4,3,4,5,4,4,5,6,7,8)
require(data.table)
dt <- data.table(id, a, b)
dt
dt[,unique(a) %in% b, by=id]
tmp <- dt[,unique(a) %in% b, by=id]
tmp$id[tmp$V1 == FALSE]
In my example, IDs 2, 3 and 5 should be the result, the decision rule being: "By id, check for each unique value of a whether there is at least one observation where the value of b equals the value of a."
However, my code only outputs IDs 2 and 5, but not 3. This is because %in% matches values anywhere within the id: for ID 3, the 4 in a is matched with the 4 in b from the previous observation rather than from the same row.
The result should either output the IDs for which the condition is not met, or add a dummy variable to the original table that indicates whether the condition is met for the ID.
How about
dt[, all(sapply(unique(a), function(i) any(a == i & b == i))), by = id]
#   id    V1
#1:  1  TRUE
#2:  2 FALSE
#3:  3 FALSE
#4:  4  TRUE
#5:  5 FALSE
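To see why id 3 fails under this rule, check its rows directly (a quick sanity check, not part of the original answer):
dt[id == 3]   # a = c(3, 3, 4), b = c(3, 4, 5)
dt[id == 3, sapply(unique(a), function(i) any(a == i & b == i))]
# [1]  TRUE FALSE   # no single row has both a == 4 and b == 4, so id 3 fails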
If you want to add a dummy variable to the original table, you can modify it like
dt[, check:=all(sapply(unique(a), function(i) any(a == i & b == i))), by = id]
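The failing ids can then be pulled from that flag:
dt[check == FALSE, unique(id)]
# [1] 2 3 5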
I wondered if I could find a more data.table-esque solution for this old question using the enhanced join capabilities which were introduced to data.table in version 1.9.6 (on CRAN 19 Sep 2015). With that version, data.table gained the ability to join without having to set keys, by using the on argument.
Variant 1
dt[a == b][dt[, unique(a), by = id], on = .(id, a == V1)][is.na(b), unique(id)]
[1] 2 3 5
First, the rows of dt where a and b are equal are selected. Only these rows are right joined with the unique values of a for each id. The result of the join is
dt[a == b][dt[, unique(a), by = id], on = .(id, a == V1)]
   id a  b
1:  1 1  1
2:  2 2 NA
3:  3 3  3
4:  3 4 NA
5:  4 4  4
6:  4 4  4
7:  4 5  5
8:  5 5 NA
9:  5 6 NA
The NA values in column b indicate that no match is found. Any id which has an NA value indicates that OP's condition is not met.
Variant 2
dt[dt[, unique(a), by = id], on = .(id, a == V1, b == V1), unique(id[is.na(x.a)])]
[1] 2 3 5
This variant right joins dt (unfiltered!) with the unique values of a for each id, but the join conditions require matches in id as well as in a and b. (This resembles the a == i & b == i expression in konvas' accepted answer.) Finally, those ids are returned which have at least one NA value in the join result, indicating a missing match.

Pairing rows in data.table

I have a data table similar to the one obtained with the following command:
dt <- data.table(
  time = 1:8,
  part = rep(c(1, 1, 2, 2), 2),
  type = rep(c('A', 'B'), 4),
  data = rep(c(runif(1), 0), 4))
Basically, such a table contains two different types of instances (A or B). The time column contains a timestamp for when a request arrived at or left a certain part. If the instance type is A, the timestamp states the arrival time (enter), and if the type is B, the timestamp states the leaving time (exit).
   time part type      data
1:    1    1    A 0.5842668
2:    2    1    B 0.0000000
3:    3    2    A 0.5842668
4:    4    2    B 0.0000000
5:    5    1    A 0.5842668
6:    6    1    B 0.0000000
7:    7    2    A 0.5842668
8:    8    2    B 0.0000000
I would like to pair A and B instances, and obtain the following data table:
   part      data enter.time exit.time
1:    1 0.4658239          1         2
2:    1 0.4658239          5         6
3:    2 0.4658239          3         4
4:    2 0.4658239          7         8
I have tried the following:
pair.types <- function(x) {
  a.type <- x[type == 'A']
  b.type <- x[type == 'B']
  return(data.table(
    enter.time = a.type$time,
    exit.time = b.type$time,
    data = a.type$data))
}
dt[, c('enter.time', 'exit.time', 'data') := pair.types(.SD), by = list(part)]
But, that gives me the following, which is not exactly what I want:
   time part type      data enter.time exit.time
1:    1    1    A 0.3441592          1         2
2:    2    1    B 0.3441592          5         6
3:    3    2    A 0.3441592          3         4
4:    4    2    B 0.3441592          7         8
5:    5    1    A 0.3441592          1         2
6:    6    1    B 0.3441592          5         6
7:    7    2    A 0.3441592          3         4
8:    8    2    B 0.3441592          7         8
It is kind of close, but since column 'type' is kept, some rows are duplicated. Perhaps I can try to remove columns 'time' and 'type' and then remove the second half of the rows, but I am not sure whether that will work in all cases, and I would like to learn a better way to do this operation.
Assuming your data looks like your example data:
dt[, list(part = part[1],
          data = data[1],
          enter.time = time[1],
          exit.time = time[2]),
   by = as.integer((seq_len(nrow(dt)) + 1)/2)]
#  by = rep(seq(1, nrow(dt), 2), each = 2)]
#  ^^^ a slightly shorter and a little more readable alternative
The idea is very simple: group the rows in pairs of 2 (that's the by part), so that each group is one A and one B; then, for each group, take the first part and the first data, and the enter and exit times are just the first and second time respectively. This is likely how you'd do it if you followed the by-hand logic, which makes it easy to read (once you know just a tiny bit about how data.table works).
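One practical tweak (an addition to the original answer): naming the by expression gives the helper grouping column a handle so it can be dropped afterwards; pair is just an illustrative name:
res <- dt[, list(part = part[1], data = data[1],
                 enter.time = time[1], exit.time = time[2]),
          by = .(pair = as.integer((seq_len(nrow(dt)) + 1)/2))]
res[, pair := NULL][]   # drop the helper column and print the result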
Another way:
setkey(dt, "type")
dt.out <- cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
#    part      data entry.time exit.time
# 1:    1 0.1294204          1         2
# 2:    2 0.1294204          3         4
# 3:    1 0.1294204          5         6
# 4:    2 0.1294204          7         8
If you want you can now do setkey(dt.out, "part") to get the same order.
The idea: Your problem seems a simple "reshaping" one to me. The way I've approached it is first to create a key column on type. Now, we can subset the data.table for a specific value in the key column by: dt[J("A")]. This returns all the rows where type is "A". Since you want the column time renamed, I explicitly mention which columns to subset using:
dt[J("A"), list(part, data, entry.time = time)]
Of course this'll return also the type column (= A) which we've to remove. So, I've added a [, type := NULL] to remove column type by reference.
Now we've the first part. All we need is the exit.time. This can be obtained similarly as:
dt[J("B"), list(time)] # I don't name the column here
But this gives a data.table when you need just the time column, which can be accessed by:
dt[J("B"), list(time)]$time
So, while using cbind I name this column as exit.time to get the final result as:
cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
Hope this helps.
