I want to split a data.table in R into groups based on a condition on the value of a row. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to for-loop across rows).
I have data like this:
library(data.table)
dt1 <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)))
I'd like to start a new group at each large number (above a settable threshold), producing the example below:
dt.desired <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)), group=c(rep(1,50),rep(2,46),rep(3,43)))
dt1[ , group := cumsum(t > 200) + 1]
dt1[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
dt.desired[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
You can use a test like t > 100 to find the large values, and then use cumsum() on that logical vector to get a running integer that increments at each large value.
# assuming you can define "large" as >100
dt1[ , islarge := t>100]
dt1[ , group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run). Note that this differs from dt.desired, where each large value starts the new group; for that behaviour, drop the shift().
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
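For what it's worth, the two conventions can be checked side by side on the example data (a quick sketch, not part of the original answers): cumsum(t > 200) + 1 makes each large value start a new group, matching dt.desired, while the shift() variant makes it close the previous group.

```r
library(data.table)

dt1 <- data.table(x = 1:139,
                  t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)))

# large value starts a new group (matches dt.desired)
dt1[, group_new := cumsum(t > 200) + 1]

# large value ends the previous group (the shift() variant, in one step)
dt1[, group_prev := shift(cumsum(t > 200), fill = 0) + 1]

# row 51 holds the first large value: group 2 in one scheme, group 1 in the other
dt1[50:52, .(x, t, group_new, group_prev)]
```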
DC<-data.table(l=c(0,0,1,4,5),d=c(1,2,0,0,1),y=c(0,1,0,1,7))
Hello,
how can I get a count of a particular value in a column using data.table?
I tried the following:
DC[, lapply(.SD, function(x) length(which(DC==0)))]
But this returns the count of zeros in the entire dataset, not indexing by column. So, how do I index by column?
Thanks
The question is not very well formulated, but I think @Sathish answered it perfectly in the comments.
Let's write it here one more time: to me, colSums(DC == 0) is one answer to the question.
All credit goes to @Sathish. Very helpful.
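For concreteness, here is that one-liner run on the DC from the question (a quick check; a data.table is also a data.frame, so DC == 0 yields a logical matrix that colSums() can sum per column):

```r
library(data.table)

DC <- data.table(l = c(0, 0, 1, 4, 5),
                 d = c(1, 2, 0, 0, 1),
                 y = c(0, 1, 0, 1, 7))

# count the zeros in each column
colSums(DC == 0)
# l d y
# 2 2 2
```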
If I understand your question, you'd like to form a frequency count of the values that comprise a given data table column. If that is true, let's say that you want to do so on column d of the data table you supplied:
> DC <- data.table(l=c(0,0,1,4,5), d=c(1,2,0,0,1), y=c(0,1,0,1,7))
> DC[, .N, by = d]
d N
1: 1 2
2: 2 1
3: 0 2
Then, if you want the count of a particular value in d, you would do so by accessing the corresponding row of the above aggregation as follows:
> DC[, .N, by = d][d == 0, N]
[1] 2
I'm kind of new to data.tables and I have a table containing DNA genomic coordinates like this:
chrom pause strand coverage
1: 1 3025794 + 1
2: 1 3102057 + 2
3: 1 3102058 + 2
4: 1 3102078 + 1
5: 1 3108840 - 1
6: 1 3133041 + 1
I wrote a custom function that I want to apply to each row of my roughly 2-million-row table. It uses GenomicFeatures' mapToTranscripts to retrieve two related values, a string and a new coordinate, which I want to add to my table as two new columns, like this:
chrom pause strand coverage transcriptID CDS
1: 1 3025794 + 1 ENSMUST00000116652 196
2: 1 3102057 + 2 ENSMUST00000116652 35
3: 1 3102058 + 2 ENSMUST00000156816 888
4: 1 3102078 + 1 ENSMUST00000156816 883
5: 1 3108840 - 1 ENSMUST00000156816 882
6: 1 3133041 + 1 ENSMUST00000156816 880
The function is the following:
get_feature <- function(dt){
  coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand)
  hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE)
  tx_id <- tx_names[as.character(seqnames(hit))]
  cds_coordinate <- sapply(ranges(hit), '[[', 1)
  if (length(tx_id) == 0 || length(cds_coordinate) == 0) {
    out <- list('NaN', 0)
  } else {
    out <- list(tx_id, cds_coordinate)
  }
  return(out)
}
Then, I do:
counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]
And I get these warnings, indicating that the function is returning two lists shorter than the original table, instead of one new element per row:
Warning messages:
1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... :
Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... :
Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).
I assumed that using the .I operator would apply the function on a row basis and return one value per row. I also made sure the function was not returning empty values using the if statement.
Then I tried this mock version of the function:
get_feature <- function(dt) {
  return('I should be returned once for each row')
}
And called it like this:
new.table <- counts[, get_feature(.SD), by = .I]
It makes a 1-row data table, instead of one of the original length. So I concluded that my function, or maybe the way I'm calling it, is collapsing the elements of the resulting vector somehow. What am I doing wrong?
Update (with solution): As @StatLearner pointed out, this answer explains that, as documented in ?data.table, .I is only intended for use in j (as in DT[i, j, by=]). Therefore by = .I is equivalent to by = NULL, and the proper syntax is by = 1:nrow(dt) in order to group by row number and apply the function row-wise.
Unfortunately, for my particular case this is utterly inefficient: I measured an execution time of 20 seconds for 100 rows. For my 36-million-row dataset, that would take 3 months to complete.
In my case, I had to give up and use the mapToTranscripts function on the entire table like this, which takes a couple of seconds and was obviously the intended use.
get_features <- function(dt){
  coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand)  # define coordinate
  hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE)  # map it to a transcript
  tx_hit <- as.character(seqnames(hits))  # get transcript number
  tx_id <- tx_names[tx_hit]  # get transcript name from translation table
  return(data.table('transcriptID' = tx_id,
                    'CDS_coordinate' = start(hits)))
}
density <- counts[, get_features(.SD)]
Then I mapped back to the genome using mapFromTranscripts from the GenomicFeatures package, so I could use a data.table join to retrieve information from the original table, which was the intended purpose of what I was trying to do.
The way I do it when I need to apply a function to each row in a data.table is grouping by row number:
counts[, get_feature(.SD), by = 1:nrow(counts)]
As explained in this answer, .I is not intended for use in by, since it is meant to hold the sequence of row indices produced by grouping. The reason why by = .I doesn't throw an error is that data.table creates an object .I equal to NULL in the data.table namespace, hence by = .I is equivalent to by = NULL.
Note that using by=1:nrow(dt) groups by row number and allows your function to access only a single row from data.table:
require(data.table)
counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
pause = sample((3 * 10^6):(3.2 * 10^6), size = 100),
strand = sample(c('-','+'), size = 100, replace = TRUE),
coverage = sample.int(3, size = 100, replace = TRUE))
get_feature <- function(dt){
  coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
  rowNum <- nrow(coordinate)
  return(list(text = 'Number of rows in dt', rowNum = rowNum))
}
counts[, get_feature(.SD), by = 1:nrow(counts)]
will produce a data.table with the same number of rows as counts, but coordinate will contain just a single row from counts:
nrow text rowNum
1: 1 Number of rows in dt 1
2: 2 Number of rows in dt 1
3: 3 Number of rows in dt 1
4: 4 Number of rows in dt 1
5: 5 Number of rows in dt 1
while by = NULL will supply the entire data.table to the function:
counts[, get_feature(.SD), by = NULL]
text rowNum
1: Number of rows in dt 100
which is the intended way for by to work.
Given a data.table such as:
library(data.table)
n = 5000
set.seed(123)
pop = data.table(id=1:n, age=sample(18:80, n, replace=TRUE))
and a function which converts a numeric vector into an ordered factor, such as:
toAgeGroups <- function(x){
  groups <- c('Under 40', '40-64', '65+')
  grp <- findInterval(x, c(40, 65)) + 1
  factor(groups[grp], levels = groups, ordered = TRUE)
}
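(A quick sanity check of the bucketing, not in the original post: findInterval() returns 0, 1, or 2 depending on how many cut points x has passed, so the + 1 indexes into groups.)

```r
toAgeGroups <- function(x){
  groups <- c('Under 40', '40-64', '65+')
  grp <- findInterval(x, c(40, 65)) + 1
  factor(groups[grp], levels = groups, ordered = TRUE)
}

as.character(toAgeGroups(c(18, 39, 40, 64, 65, 80)))
# -> Under 40, Under 40, 40-64, 40-64, 65+, 65+
```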
I am seeing unexpected results when grouping on the output of this function as a key and indexing with .GRP.
pop[, .(age_segment_id = .GRP, pop_count=.N), keyby=.(age_segment = toAgeGroups(age))]
returns:
age_segment age_segment_id pop_count
1: Under 40 1 1743
2: 40-64 3 2015
3: 65+ 2 1242
I would have expected the age_segment_id values to be c(1,2,3), not c(1,3,2), but .GRP seems set on order of occurrence in underlying data (as in by= order) rather than sorted order (as in keyby=).
I was planning on using .GRP as an index for some additional labelling, but instead I need to do something like:
pop[, .(pop_count=.N), keyby=.(age_segment = toAgeGroups(age))][, age_segment_id := .I][]
to get what I want.
Is this expected behavior? If so, is there a better workaround?
(v. 1.9.6)
This issue should no longer occur in versions 1.9.8+ of data.table.
library(data.table) #1.9.8+
pop[, .(age_segment_id = .GRP, pop_count=.N),
keyby=.(age_segment = toAgeGroups(age))]
# age_segment age_segment_id pop_count
# 1: Under 40 1 1743
# 2: 40-64 2 2015
# 3: 65+ 3 1242
For some more, see the discussion here. Basically, by works internally by returning sorted rows for each group and then re-sorting the table back into its original order.
The change recognized that this re-sort is unnecessary if keyby is specified, so now your approach works as you expected.
Before (through 1.9.6), keyby would just re-sort the answer at the end by running setkey, as documented in ?data.table:
[keyby is the s]ame as by, but with an additional setkey() run on the by columns of the result.
Thus, on less-than-brand-new versions of data.table, you'd have to fix your code as:
pop[order(age), .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]
I'm a noob to data.table in R and I'd like to skip the last value of z=3 in this example:
> DT = data.table(y=c(1,1,2,2,2,3,3,3,4,4),x=1:10,z=c(1,1,1,1,2,2,2,2,3,3))
> DT[,list(list(predict(smooth.spline(x,y),c(4,5,6))$y)),by=z]
Error in smooth.spline(x, y) : need at least four unique 'x' values
If I simply delete z=3 I get the answer I want:
> DT = data.table(y=c(1,1,2,2,2,3,3,3),x=1:8,z=c(1,1,1,1,2,2,2,2))
> DT[,list(list(predict(smooth.spline(x,y),c(4,5,6))$y)),by=z]
z V1
1: 1 2.09999998977689,2.49999997903384,2.89999996829078
2: 2 0.999895853971133,2.04533519691888,2.90932467439562
What a great package!
Omitting the rows where z is 3 is as simple as
DT[z!=3, <whatever expression you'd like>]
If your data.table is keyed by z then you can use
DT[!.(3), .....]
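As a concrete sketch of that keyed form (using the DT from the question):

```r
library(data.table)

DT <- data.table(y = c(1,1,2,2,2,3,3,3,4,4),
                 x = 1:10,
                 z = c(1,1,1,1,2,2,2,2,3,3))
setkey(DT, z)

# not-join on the key: all rows whose z is not 3
DT[!.(3)]
```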
If you simply want to omit the results when .N < 4, you can use if (not ifelse). If .N < 4, nothing is returned:
DT[,if(.N>=4){ list(list(predict(smooth.spline(x,y),c(4,5,6))$y))},by=z]
# z V1
# 1: 1 2.1000000266026,2.50000003412706,2.90000004165153
# 2: 2 0.999895884129996,2.04533520266699,2.90932466433092
How do you use data.table to select a subset of rows by rank? I have a large data set and I hope to do so efficiently.
> dt <- data.table(id=1:200, category=sample(LETTERS, 200, replace=T))
> dt[,count:=length(id), by=category]
> dt
id category count
1: 1 O 13
2: 2 O 13
---
199: 170 N 3
200: 171 H 3
What I want to do is to efficiently change the category to 'OTHER' for any category not in the k most common ones. Something along the lines of:
dt[rank > 5,category:="OTHER", by=category]
I'm new to data.table and I'm not quite sure how to get the rank in an efficient way. Here's a way that works, but seems clunky.
counts <- unique(dt$count)
decision <- max(counts[rank(-counts)>5])
dt[count<=decision, category:='OTHER']
I would appreciate any advice. To be honest, I don't even need the 'count' column if it's not necessary.
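One possible approach, sketched here (not from the original question): count rows per category once, take the k most frequent, and mark everything else as 'OTHER'. No persistent count column is needed.

```r
library(data.table)

set.seed(42)
dt <- data.table(id = 1:200, category = sample(LETTERS, 200, replace = TRUE))

k <- 5
# the k most common categories, by row count
topk <- dt[, .N, by = category][order(-N), head(category, k)]

# everything outside the top k becomes 'OTHER'
dt[!category %in% topk, category := "OTHER"]

dt[, uniqueN(category)]  # at most k + 1 distinct values remain
```

Ties at the cutoff are broken arbitrarily here; data.table's frank(-N, ties.method = 'min') would give an explicit rank to cut on instead.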