I want to use data.table to create a function that only keeps rows where the ID column(s) - stored as a vector of strings - are duplicated. Note that where there are multiple ID columns, I only want to keep rows where the combination of ID columns is duplicated.
library(data.table)
dt <- data.table(x = c(1:5,5), y = rep(c(1,3,5), each = 2), z = rep(1:3, 2))
get_duplicate_id_rows1 <- function(dt_in, id_str) {
  dt_in[, if (.N > 1) .SD, by = id_str]
}
get_duplicate_id_rows1(dt, c("x", "y"))
#> x y z
#> 1: 5 5 2
#> 2: 5 5 3
get_duplicate_id_rows1(dt[, .(x,y)], c("x", "y"))
#> Empty data.table (0 rows and 2 cols): x,y
As above, my first attempt works when the data table has one non-ID column. However, when all of the columns are ID columns, the result has no rows. I think this is because, as per ?data.table, .SD includes all variables of the original data table except the grouping columns. Consequently, .SD has zero columns, which seems to be causing my issue.
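Indeed, checking ncol(.SD) per group seems to confirm this:
dt[, .(sd_cols = ncol(.SD)), by = c("x", "y")]             # 1 for every group: only z remains in .SD
dt[, .(x, y)][, .(sd_cols = ncol(.SD)), by = c("x", "y")]  # 0: all columns are grouping columns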
get_duplicate_id_rows2 <- function(dt_in, id_str) {
  dt_in[, if (.N > 1) .SD, by = id_str, .SDcols = names(dt_in)]
}
get_duplicate_id_rows2(dt, c("x", "y"))
#> x y x y z
#> 1: 5 5 5 5 2
#> 2: 5 5 5 5 3
get_duplicate_id_rows2(dt[, .(x,y)], c("x", "y"))
#> x y x y
#> 1: 5 5 5 5
#> 2: 5 5 5 5
My second attempt tries to circumvent my issues with my first attempt by using .SDcols. This does resolve the issue where all the columns in my data table are ID columns. However, here the column names in id_str are duplicated.
I think this is because one set of column names comes from the by argument and the other set of column names comes from .SDcols, although I'm not certain about this, because in my first attempt, the resultant data table had zero rows, not zero columns.
Consequently, I'd love to understand what's going on here, and what the most efficient solution to my problem is - particularly for large data sets, which is why I'm moving from tidyverse to data.table.
Created on 2020-04-09 by the reprex package (v0.3.0)
We can use .I to get the row indices of groups whose frequency count is greater than 1, extract that index column (V1), and subset the data.table:
dt[dt[, .I[.N > 1], by = .(x, y)]$V1]
NOTE: It should be faster than .SD
Here is another option:
dt[unique(dt[rowid(x, y) > 1, .(x, y)]), on = .(x, y)]
The unique() guards against a key combination that appears more than twice, which would otherwise produce duplicated rows in the join.
In your example, your explanation for returning 0 rows is correct. Since the grouping columns are used for grouping, their values are identical within each group and can be accessed via .BY; hence .SD need not carry these columns, which prevents duplication.
By default, when by is used, the grouping columns are also returned as the leftmost columns of the output; hence in get_duplicate_id_rows2 you see x, y and then the columns from .SD as specified in .SDcols.
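For instance, the grouping values for the current group are available via .BY (a small sketch):
dt[, .(by_vals = paste(.BY, collapse = ","), n = .N), by = .(x, y)]  # .BY holds the current group's x and y values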
Lastly, regarding efficiency, you can time the various options posted here using microbenchmark with your actual dataset and share your results.
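For instance, a benchmark sketch along those lines (microbenchmark is assumed to be installed, and get_duplicate_id_rows3 is just an illustrative name for the .I approach wrapped as a function):
library(microbenchmark)
# function form of the .I approach from above
get_duplicate_id_rows3 <- function(dt_in, id_str) {
  dt_in[dt_in[, .I[.N > 1], by = id_str]$V1]
}
microbenchmark(
  SD = get_duplicate_id_rows1(dt, c("x", "y")),
  I  = get_duplicate_id_rows3(dt, c("x", "y")),
  times = 100
)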
My data looks like
Name Pd1 Pd2 Pd3 Pd4
A 2 6 8 9
B 6 3 7 1
I want to collect the column names, ordered from the highest to the lowest value in each row.
I wish to see my data like this:
Name pdts
A    c(Pd4,Pd3,Pd2,Pd1)
B    c(Pd3,Pd1,Pd2,Pd4)
Kindly help me to do this in R.
You can use data.table together with apply() and sort.list() to do this:
library(data.table)
setDT(df)
df <- df[, list(list(colnames(.SD)[c(t(apply(.SD, 1, function(x) sort.list(x, decreasing = TRUE))))])), by = Name]
print(df)
Name V1
1: A Pd4,Pd3,Pd2,Pd1
2: B Pd3,Pd1,Pd2,Pd4
Explanation:
1. apply(.SD, 1, function(x) sort.list(x, decreasing = TRUE)) - gives the indices of the columns, sorted by value, for each row.
2. t - we transpose the result so that it reads row by row.
3. c(t(apply(.SD, 1, function(x) sort.list(x, decreasing = TRUE)))) - flattening the transposed matrix yields the sorted column indices as a single row-wise vector, which is the key to solving this problem.
4. colnames(.SD) - .SD is a special symbol in data.table; it refers to the grouped data, and here we take its column names.
5. Finally, we subset the column names by the indices we got in step 3.
6. And we group by the Name column to get the solution for each Name.
7. You might find this overwhelming; to understand it, run it step by step and see how the solution evolves.
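For comparison, here is a possibly simpler way to get the same result with order() (a sketch, assuming one row per Name as in the example; pdts is an illustrative column name):
setDT(df)
df[, .(pdts = list(names(.SD)[order(-unlist(.SD))])), by = Name]
#    Name            pdts
# 1:    A Pd4,Pd3,Pd2,Pd1
# 2:    B Pd3,Pd1,Pd2,Pd4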
I have a data.table and need to know the index of the row containing a minimal value under a given condition. Simple example:
dt <- data.table(i=11:13, val=21:23)
# i val
# 1: 11 21
# 2: 12 22
# 3: 13 23
Now, suppose I'd like to know in which row val is minimal under the condition i>=12, which is 2 in this case.
What didn't work:
dt[i>=12, which.min(val)]
# [1] 1
returns 1, because within dt[i>=12] it is the first row.
Also
dt[i>=12, .I[which.min(val)]]
# [1] 1
returned 1, because .I is only supposed to be used with grouping.
What did work:
To apply .I correctly, I added a grouping column:
dt[i>=12, g:=TRUE]
dt[i>=12, .I[which.min(val)], by=g][, V1]
# [1] 2
Note that g is NA for i < 12; thus which.min excludes that group from the result.
But, this requires extra computational power to add the column and perform the grouping. My productive data.table has several millions of rows and I have to find the minimum very often, so I'd like to avoid any extra computations.
Do you have any idea, how to efficiently solve this?
But, this requires extra computational power to add the column and perform the grouping.
So, keep the data sorted by val if it's so important:
setorder(dt, val)
dt[.(i_min = 12), on=.(i >= i_min), mult="first", which = TRUE]
# 2
This can also be extended to check more threshold i values. Just give a vector in i_min =:
dt[.(i_min = 9:14), on=.(i >= i_min), mult="first", which = TRUE]
# [1] 1 1 1 2 3 NA
How it works
x[i, on=, ...] is the syntax for a join.
i can be another table or equivalently a list of equal-length vectors.
.() is a shorthand for list().
on= can have inequalities for a "non-equi join".
mult= can determine what happens when a row of i has more than one match in x.
which=TRUE will return row numbers of x instead of the full joined table.
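To see mult= and which= in action, compare the default (all matches) with mult="first" (a small sketch):
dt <- data.table(i = 11:13, val = 21:23)
setorder(dt, val)  # keep sorted by val so the first match is the minimum
dt[.(i_min = 12), on = .(i >= i_min), which = TRUE]                  # all matching rows: 2 3
dt[.(i_min = 12), on = .(i >= i_min), mult = "first", which = TRUE]  # first match only: 2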
You can use the fact that which.min will ignore NA values to "mask" the values you don't want to consider:
dt[, which.min(ifelse(i >= 12, val, NA))]
As a simple example of this behavior, which.min(c(NA, 2, 1)) returns 3, because the 3rd element is the min among all the non-NA values.
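Applied to the example data, this returns the desired row number directly:
dt[, which.min(ifelse(i >= 12, val, NA))]
# [1] 2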
I'm really sorry to ask this dumb question but I don't get what is going wrong.
I have a dataset which I convert into a data.table object:
#generate 100,000 ids associated to a group in a data-set called base
id=c(1:100000)
group=sample(c(1:5),100000,TRUE)
base=cbind(id,group)
base=as.data.table(base)
I make a basic group-by computation to get the number of rows for each group, and the result table still contains the same number of rows:
counting=base[,COUNT:= .N, by = group]
nrow(counting)
#100000
What did I miss? Is there an option in data.table to address my problem?
Following up on akrun's comment, I decided to provide an answer. It seems that you were not sure how to summarise your data and got confused. First, one point about constructing a data set:
set.seed(123)
id = c(1:100000)
group = sample(c(1:5),100000,TRUE)
base = data.frame(id,group)
setDT(base)
base
id group
1: 1 2
2: 2 4
3: 3 3
4: 4 5
5: 5 5
....
When you use cbind() on multiple vectors, they are coerced to the same class to make a matrix. The safer way to go is to use data.frame(), which allows mixed column classes. And, if you have a data.frame, you can turn it into a data.table by reference with setDT, without needing to assign the result.
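A quick illustration of the coercion, using a hypothetical character column:
m <- cbind(id = 1:3, name = c("a", "b", "c"))
class(m[, "id"])   # "character": the ids were silently coerced
d <- data.frame(id = 1:3, name = c("a", "b", "c"))
class(d$id)        # "integer": classes are preserved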
Adding a new column. Your code was basically adding a new column in the data.table object. When you use :=, you are doing the equivalent of mutate() in dplyr or transform() in base R, with one important difference. With :=, the column is added to the data.table by reference, so there is no need to assign the result.
base[, COUNT := .N, by = group]
base
id group COUNT
1: 1 2 20099
2: 2 4 19934
3: 3 3 20001
4: 4 5 19933
5: 5 5 19933
...
Here, you are counting how many data points exist for each group and assigning that value to every row. For instance, the total count of group 2 is 20099, and this number is given to all rows with group == 2. You are creating a new column, not summarising the data; hence, you still have 100000 rows. There is currently no function to modify the number of rows by reference.
Summarising the data. If you want to count how many data points exist for each group and summarise the data, you want the following.
dt2 <- base[, .(COUNT = .N), by = group]
dt2
group COUNT
1: 2 20099
2: 4 19934
3: 3 20001
4: 5 19933
5: 1 20033
dim(dt2)
[1] 5 2
Here, make sure that you use =, not :=, since you are summarising the data. This time it is necessary to assign the result, because we are creating a new data.table. I hope this clears things up.
Have you noticed that
base$regroup = group
base[, .(Count = .N, regroup), by = group]
gives 100,000 rows, even though group and regroup are identical? Because regroup is returned at its full group length, the length-one .N is recycled to match it, so the groups do not collapse.
I cannot seem to find any documentation on what exactly .EACHI does in data.table. I see a brief mention of it in the documentation:
Aggregation for a subset of known groups is particularly efficient
when passing those groups in i and setting by=.EACHI. When i is a
data.table, DT[i,j,by=.EACHI] evaluates j for the groups of DT that
each row in i joins to. We call this grouping by each i.
But what does "groups" in the context of DT mean? Is a group determined by the key that is set on DT? Is the group every distinct row that uses all the columns as the key? I fully understand how to run something like DT[i,j,by=my_grouping_variable] but am confused as to how .EACHI would work. Could someone explain please?
I've added this to the list here, and hopefully we'll be able to deliver as planned.
The reason is most likely that by=.EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X and Y:
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")
We know that we can join by doing X[Y]. This is like a subset operation, except that we subset using a data.table (instead of integers, row names, or logical values). For each row in Y, taking Y's key columns, it finds and returns the matching rows in X's key columns (plus Y's remaining columns).
X[Y]
# x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
Now let's say that, for each row of Y's key columns (here only one key column), we'd like to get the count of matches in X. In versions of data.table < 1.9.4, we could do this by simply specifying .N in j as follows:
# < 1.9.4
X[Y, .N]
# x N
# 1: 2 2
# 2: 6 1
What this implicitly does is, in the presence of j, evaluate the j-expression on each matched result of X (corresponding to the row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.
The issue was that this would always perform a by operation. So, if we wanted to know the number of rows after a join, we'd have to do X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have the j expression in the same call if we didn't want a by-without-by. As a result, when we did, for example, X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
Additionally data.table users requested this to be explicit - see this and this for more context.
Hence by=.EACHI was added. Now, when we do:
X[Y, .N]
# [1] 3
it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.
And,
X[Y, .N, by=.EACHI]
evaluates the j-expression on the matching rows for each row in Y (corresponding to each value in Y's key columns here). It'd be easier to see this using which=TRUE.
X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7
If we compute .N for each of these, we should get 2 and 1:
X[Y, .N, by=.EACHI]
# x N
# 1: 2 2
# 2: 6 1
So we now have both functionalities.
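In short, both behaviours are now available explicitly:
X[Y, .N]               # total rows resulting from the join: 3
X[Y, .N, by = .EACHI]  # matches for each row of Y: 2 and 1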
The data.table package in R provides the option:
which: ‘TRUE’ returns the integer row numbers of ‘x’ that ‘i’
matches to.
However, I see no way of obtaining, within j, the integer row numbers of 'x' within the groups established using by.
For example, given...
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
...I would like to know the indices into DT for each value of y.
The value to me is that I am using a data.table in parallel with Another Data Structure (ADS), on which I intend to perform groupwise computations based on the efficiently computed groupings of the data.table.
For example, assuming ADS is a vector with a value for each row in DT:
ADS<-sample(100,nrow(DT))
As a workaround, I can compute the groupwise mean of ADS, grouped by DT$y, if I first add a sequence column to the data.table:
DT[,seqNum:=seq_len(nrow(DT))]
DT[,mean(ADS[seqNum]),by=y]
This gives the result I want, at the cost of adding a new column.
I realize that in this example I can get the same answer using tapply:
tapply(ADS,DT$y,mean)
However, I will not then get the performance benefit of data.table's efficient grouping (especially when the 'by' columns are indexed).
Perhaps there is some syntax I am overlooking???
Perhaps this is an easy feature to add to data.table and I should request it (wink, wink)???
Proposed syntax: optionally set '.which' to the group indices, allowing to write:
DT[,mean(ADS[.which]),by=y,which=TRUE]
Available since data.table 1.8.3, you can use .I in the j of a data.table to get the row indices by group...
DT[ , list( yidx = list(.I) ) , by = y ]
# y yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9
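Applied to the ADS example from your question, .I plays exactly the role of your proposed .which, with no seqNum column needed:
ADS <- sample(100, nrow(DT))
DT[, mean(ADS[.I]), by = y]  # groupwise mean of ADS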
A keyed data.table will be sorted so that groups are stored in contiguous blocks. In that case, you could use .N to extract the group-wise indexing information:
DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
setkey(DT, y)
ii <- DT[,.N, by=y]
ii[, start := cumsum(N) - N + 1][, end := cumsum(N)][, N := NULL]
# y start end
# 1: 1 1 3
# 2: 3 4 6
# 3: 6 7 9
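These blocks can then drive groupwise computations on a parallel structure such as ADS (a sketch; it assumes ADS is aligned with the keyed row order):
ADS <- sample(100, nrow(DT))
ii[, .(grp_mean = mean(ADS[start:end])), by = y]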
(Personally, I'd probably just add an indexing column like your suggested seqNum. Seems simpler, I don't think it will affect performance too much unless you are really pushing the limits.)