Counting existing pairs of data in large data table


This question is based on the response given by @Arun here. In that response, @Arun proposes a clever way to avoid creating sparse matrices: look only at the pairs that actually occur, which avoids storing many zeros and avoids counting each pair twice (as A-B and as B-A).
The following is copy-pasted from his answer:
Step 1: Construct sample data of your dimensions approximately:
require(data.table) ## 1.9.4+
set.seed(1L) ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)
Step 2: Pre-processing:
DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT) ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id) ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]
Step 3: Solution:
ans = DT[DT, {
    tmp = runlen < i.runlen;
    list(motif[tmp], i.motif[any(tmp)])
  },
  by=.EACHI][, .N, by="V1,V2"]
Running this works fine provided you have enough memory on your computer. I also humbly admit I have no idea what exactly the code is doing to create the wanted results, so I'm just looking at input and output, agnostic of the process.
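To make the idea of "counting occurring pairs" concrete, here is a minimal sketch on toy data. It is not the quoted code: it uses a plain self-join on id followed by a filter, which is easier to read but materialises every within-id combination and is therefore less memory-friendly. The toy values and the names pairs and pair_counts are assumptions made for illustration.

```r
library(data.table)

# Toy data: two ids, a handful of motifs (hypothetical values)
DT <- data.table(id    = c(1L, 1L, 1L, 2L, 2L),
                 motif = c("a", "b", "c", "a", "b"))
DT <- unique(DT)  # no duplicate motif within an id

# Join the table to itself on id: every motif meets every motif of the same id.
# Non-join columns from the right-hand table get the prefix "i.".
pairs <- DT[DT, on = "id", allow.cartesian = TRUE]

# Keep each unordered pair once (A-B but not B-A, and no A-A), then count
# in how many ids each pair occurs.
pair_counts <- pairs[motif < i.motif, .N, by = .(V1 = motif, V2 = i.motif)]
print(pair_counts)
```

Here the pair a-b occurs under both ids, while a-c and b-c occur only under id 1.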
When applying the exact same code to my data, what seems to happen is that pairs appear that are not in the original data.
I'm running the following code, which is a slightly adapted version of what @Arun provided. The adaptation is needed because I have to run the code for 17 different blocks, i.e. I'm looking for the pairs that occur within a specific block.
cooc <- data.frame()
for (j in 1:17) {
  DT <- dt[block == j, c("pid", "cid"), with = FALSE]
  DT$pid <- as.factor(DT$pid)
  setorder(DT)
  setkey(DT, pid)
  DT[, runlen := .I]
  ans <- DT[DT, {
      tmp = runlen < i.runlen;
      list(cid[tmp], i.cid[any(tmp)])
    },
    by = .EACHI][, .N, by = "V1,V2"]
  ans$block <- j
  cooc <- data.table(rbind(cooc, ans))
  rm(ans)
}
As far as I understand the code, it's all identical, just wrapped in a for loop to do the same thing for 17 blocks. Both pid and cid are just integers that identify a variable of interest.
For j = 1 , the following goes:
DT[cid == 39] # cid is my equivalent of motif above and pid is my equivalent of id above
pid cid runlen
20319 39 3614
This suggests there is only one pid for which cid equals 39
Now when I look into the resulting ans datatable I get the following:
ans[V1 == 39]
V1 V2 N block
1: 39 41 1 1
2: 39 42 1 1
3: 39 44 1 1
4: 39 47 1 1
5: 39 7027 1 1
6: 39 7043 1 1
7: 39 7174 1 1
8: 39 9434 1 1
9: 39 11493 1 1
10: 39 18815 1 1
11: 39 18875 1 1
12: 39 18896 1 1
13: 39 18909 1 1
14: 39 18924 1 1
15: 39 18928 1 1
16: 39 18929 1 1
17: 39 18931 1 1
18: 39 18932 1 1
19: 39 19265 1 1
20: 39 19410 1 1
Suddenly, there are 20 occurrences of V1 (if I understand the code correctly, this is the equivalent of what was cid). Yet in DT there is only 1 pid assigned to cid.
I have no idea how to reproduce this finding so I tried to show what seems to be inconsistent. I don't think the original code has this problem so I'm hoping someone can explain where the additional occurrences of cid == 39 come from, based on the info I have given here.

Two things:
First, I don't understand what's wrong with the result you get. Starting from
require(data.table)
set.seed(1L)
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)
let me recreate the data that helps answer your Q.
# keep only one of 'motif_456'
DT2 = rbind(DT[1L], DT[motif != "motif_456"])
DT2[1L]
# id motif
# 1: 49338 motif_456
DT2[ , .N, by=motif]
# motif N
# 1: motif_456 1
# 2: motif_639 1637
# 3: motif_984 1649
# 4: motif_1559 1531
# 5: motif_347 1603
# ---
# 1712: motif_46 1623
# 1713: motif_521 1613
# 1714: motif_803 1568
# 1715: motif_603 1573
# 1716: motif_461 1591
Let's check all motifs corresponding to id = 49338:
DT2[id == 49338, motif]
# [1] "motif_456" "motif_553" "motif_1048" "motif_1680" "motif_171" "motif_1706"
# [7] "motif_707" "motif_163" "motif_489" "motif_107" "motif_1419" "motif_595"
# [13] "motif_1223" "motif_1274" "motif_1164" "motif_427" "motif_505" "motif_1041"
# [19] "motif_1321" "motif_1231" "motif_1498" "motif_837" "motif_298" "motif_649"
# [25] "motif_631"
So each of the other 24 motifs listed above should form a pair with motif_456, each with a count of 1. And that's what the data.table solution provides. Here's the relevant result after running the data.table solution:
# data.table solution takes 11.2 secs
ans[V1 == "motif_456", .N] + ans[V2 == "motif_456", .N]
# [1] 24
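The same point can be checked on a toy table: a value that occurs under a single id still pairs with every other value recorded under that id. The sketch below uses made-up data and hypothetical names (toy, n_ids, partners); it only illustrates the sanity check and is not part of the original solution.

```r
library(data.table)

# Hypothetical toy data: motif "z" occurs under exactly one id
toy <- data.table(id    = c(1L, 1L, 1L, 1L, 2L, 2L),
                  motif = c("z", "a", "b", "c", "a", "b"))

# Number of distinct ids under which "z" occurs
n_ids <- toy[motif == "z", uniqueN(id)]

# Every other motif recorded under those ids is a co-occurrence partner of "z"
partners <- toy[id %in% toy[motif == "z", id] & motif != "z", unique(motif)]

print(n_ids)
print(partners)
```

So one id for a value is compatible with many pairs that involve it; the number of pairs is driven by how many other values that id carries.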
Second, while the data.table answer does well, we can do this more efficiently with the solution shown by @nograpes. Let's try it on DT2:
require(Matrix)
DT2[, names(DT2) := lapply(.SD, as.factor)]
s <- sparseMatrix(
as.integer(DT2$id),
as.integer(DT2$motif),
dimnames = list(levels(DT2$id),levels(DT2$motif)),
x = TRUE)
co.oc <- t(s) %*% s # Find co-occurrences.
tab <- summary(co.oc) # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix
ans = setDT(list(motif1 = levels(DT2$motif)[tab$i],
motif2 = levels(DT2$motif)[tab$j],
number = tab$x))
# Matrix solution takes 2.4 secs
ans[motif1 == "motif_456", .N] + ans[motif2 == "motif_456", .N]
# [1] 24
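To see what the matrix product computes, here is a small sketch on toy data, assuming the Matrix package is installed. It uses a numeric incidence matrix (x = 1) and crossprod(), which is equivalent to t(s) %*% s; the toy values and variable names are assumptions. It also assumes each (id, motif) combination appears only once, because sparseMatrix() sums duplicated entries.

```r
library(Matrix)

# Hypothetical toy data: id 1 has motifs a, b, c; id 2 has motifs a, b
id    <- factor(c(1, 1, 1, 2, 2))
motif <- factor(c("a", "b", "c", "a", "b"))

# Incidence matrix: rows are ids, columns are motifs, 1 where the motif occurs
s <- sparseMatrix(i = as.integer(id), j = as.integer(motif), x = 1,
                  dimnames = list(levels(id), levels(motif)))

# Entry [m1, m2] counts the ids that contain both m1 and m2;
# the diagonal counts the ids that contain each motif.
co <- as.matrix(crossprod(s))
print(co)
```

The off-diagonal entries are the pair counts, which is why only the upper triangle is kept in the code above.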

Related

How to filter and subset data by group using data.table?

In running the code at the bottom, I add a "total" column to data frame testDF. I need "ID" for instances where "total" > 0. So the output I'm looking for in this example is simply 1 and 50, those ID's where "total" > 0. How would I efficiently compute this using data.table? Noting that the actual database this will be run against has millions of rows so I'm hoping to avoid unnecessary calculations.
I include seemingly extraneous columns "Period_1", "Period_2", and "State", because when I was fooling around with data.table subsetting, in running things like lapply(.SD, sum), by=.(ID)][, if(sum(PUR) > 0) .SD, by=ID], I was getting errors like
"Error in sum(Period_2) : invalid 'type' (character) of argument"
I'll use these outputs for a "join", which is something I can do in data.table (I think).
Code to reproduce (the last line prints testDF_Adv, including the new total column):
library(data.table)
testDF <-
data.frame(
ID = as.numeric(c(rep(1,3),rep(50,3),rep(60,3))),
Period_1 = as.numeric(c(1:3,1:3,1:3)),
Period_2 = c("2012-06","2012-07","2012-08","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
PUR = as.numeric(c(rep(10,3),21:23,rep(0,3))),
CA = as.numeric(c(rep(5,3),11:13,rep(0,3))),
State = c("XX","AA","XX","AA","BB","CC","SS","XX","AA")
)
testDF_Adv <- testDF
setDT(testDF_Adv)[, total := sum(PUR + CA), by=list(ID)]
testDF_Adv <- as.data.frame(testDF_Adv)
testDF_Adv

You can simply do this:
setDT(testDF)[, if(sum(PUR+CA)>0) ID,ID][,ID]
Output:
[1] 1 50
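The one-liner works because a group whose j expression returns NULL (an if without an else whose condition is FALSE) contributes no rows to the result. A minimal sketch of that behaviour, using a cut-down version of the question's data (the object names dt and keep are assumptions):

```r
library(data.table)

# Cut-down version of the example data
dt <- data.table(ID  = c(1, 1, 50, 50, 60, 60),
                 PUR = c(10, 10, 21, 22, 0, 0),
                 CA  = c(5, 5, 11, 12, 0, 0))

# For each ID, return the group total only when it is positive.
# Groups for which the condition is FALSE return NULL and are dropped.
keep <- dt[, if (sum(PUR + CA) > 0) .(total = sum(PUR + CA)), by = ID]
print(keep)
```

The resulting table can be used directly as the right-hand side of a later join on ID.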

Below, I just create a new data table by filtering out the rows with 0 totals. Are you looking to do this in one step so it will be more efficient?
library(data.table)
testDF <-
data.frame(
ID = as.numeric(c(rep(1,3),rep(50,3),rep(60,3))),
Period_1 = as.numeric(c(1:3,1:3,1:3)),
Period_2 = c("2012-06","2012-07","2012-08","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
PUR = as.numeric(c(rep(10,3),21:23,rep(0,3))),
CA = as.numeric(c(rep(5,3),11:13,rep(0,3))),
State = c("XX","AA","XX","AA","BB","CC","SS","XX","AA")
)
testDF_Adv <- testDF
setDT(testDF_Adv)[, total:=sum(PUR+CA),by=list(ID)]
testDF2 = testDF_Adv[total>0,]
testDF2

You can do this in one step by chaining an additional subset in brackets.
setDT(testDF_Adv)[, total := sum(PUR + CA), by=list(ID)][total > 0]
# ID Period_1 Period_2 PUR CA State total
# 1: 1 1 2012-06 10 5 XX 45
# 2: 1 2 2012-07 10 5 AA 45
# 3: 1 3 2012-08 10 5 XX 45
# 4: 50 1 2013-06 21 11 AA 102
# 5: 50 2 2013-07 22 12 BB 102
# 6: 50 3 2013-08 23 13 CC 102

We can try
> setDT(testDF)[, .(ID[sum(PUR + CA) > 0]), ID]$V1
[1] 1 50

Search indexes in data.table R

I have a data.table, proce, where each line defines a "special procedure". Now, I have another data.table with the patient procedures, codes. For each person, I want to extract the indexes of "special procedures" that match with his/her procedures (if they have any). Here is an example:
library(data.table)
proce <- data.table(v1 = c('o09513','o721','o701','z370'), v2 = c('0w8nxzz','10d07z6','0tqd7zz','0uqg0zz'),
v3 = c('3e030vj','3e033vj',NA,NA))
codes <- data.table(a1 = c(list(c('o721','10d07z6','3e033vj')),
list(c('z370','0uqg0zz',"0tqd7zz","o701")),
list(c('o09513','o721','o701','z370','0uqg8zz'))))
> proce
v1 v2 v3
1: o09513 0w8nxzz 3e030vj
2: o721 10d07z6 3e033vj
3: o701 0tqd7zz <NA>
4: z370 0uqg0zz <NA>
> codes
a1
1: o721,10d07z6,3e033vj
2: z370,0uqg0zz,0tqd7zz,o701
3: o09513,o721,o701,z370,0uqg8zz
Here is my current implementation, but since both tables have hundreds of thousands of rows, it's slow.
index_procedures <- list()
for(i in 1:nrow(codes)){ # i <- 2
a2 <- unlist(codes[i,a1])
index_procedures[[i]] <- which(apply(proce[,.(v1,v2,v3)], 1,function(x) all(x[!is.na(x)] %in% a2)))
}
index_procedures
> index_procedures
[[1]]
[1] 2
[[2]]
[1] 3 4
[[3]]
integer(0)

If I understand correctly,
codes contains procedure steps which have been applied to a patient. One row in codes refers to one patient.
proce contains procedure steps which constitute a special procedure.
The OP wants to identify which special procedures have been applied to each patient (if any). A special procedure is only considered to have been applied to a patient if all of its procedure steps have been applied.
To solve this, I suggest reshaping all data to a tidy format, i.e., long format, first.
Then we can join on procedure steps, filter for complete special procedures and aggregate to get one per patient:
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
, n_steps := .N, by = pid][]
lp[lc, on = .(step)][
, .N == first(n_steps), by = .(cid, pid)][
(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
Note that the pids are shown in a condensed form for demonstration only; other output formats are available as well depending on subsequent processing steps.
If it is required to show all patients even if they have not received a special procedure:
lp[lc, on = .(step)][, .N == first(n_steps), by = .(cid, pid)][
V1 | is.na(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
3: 3
Commented code
# reshape data to long format, thereby adding a row number to identify patients
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
# reshape data to long format, thereby adding a row number to identify special procedures
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
# count the number of procedure steps which constitute a special procedure
, n_steps := .N, by = pid][]
# join on procedure steps
lp[lc, on = .(step)][
# group by patient and special procedure and test for completeness of steps
, .N == first(n_steps), by = .(cid, pid)][
# filter for complete special procedures and aggregate to get one row per patient
(V1), .(pid = toString(sort(pid))), by = cid]
After reshaping, lc is
cid step
1: 1 o721
2: 1 10d07z6
3: 1 3e033vj
4: 2 z370
5: 2 0uqg0zz
6: 2 0tqd7zz
7: 2 o701
8: 3 o09513
9: 3 o721
10: 3 o701
11: 3 z370
12: 3 0uqg8zz
and lp is
pid variable step n_steps
1: 1 v1 o09513 3
2: 2 v1 o721 3
3: 3 v1 o701 2
4: 4 v1 z370 2
5: 1 v2 0w8nxzz 3
6: 2 v2 10d07z6 3
7: 3 v2 0tqd7zz 2
8: 4 v2 0uqg0zz 2
9: 1 v3 3e030vj 3
10: 2 v3 3e033vj 3
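One edge case worth checking is a patient whose code list contains the same step twice: the completeness test compares the number of joined rows with n_steps, so a repeated step inflates the count. The sketch below is an assumption-laden illustration (the helper match_complete and the toy data are hypothetical, not part of the answer) showing that de-duplicating the long table with unique() guards against this.

```r
library(data.table)

# Hypothetical toy data: one patient whose list repeats the step "o721"
proce <- data.table(v1 = c("o721", "o701"),
                    v2 = c("10d07z6", "0tqd7zz"),
                    v3 = c("3e033vj", NA))
codes <- data.table(a1 = list(c("o721", "10d07z6", "3e033vj", "o721")))

# Long format, as in the answer above
codes[, cid := .I]
lc <- codes[, .(step = unlist(a1)), by = cid]
proce[, pid := .I]
lp <- melt(proce, id.vars = "pid", na.rm = TRUE, value.name = "step")
lp[, n_steps := .N, by = pid]

# Hypothetical helper: ids of special procedures whose steps are all present
match_complete <- function(lc) {
  lp[lc, on = .(step)][
    , .(complete = .N == first(n_steps)), by = .(cid, pid)][
    (complete), pid]
}

print(match_complete(lc))          # the repeated step hides procedure 1
print(match_complete(unique(lc)))  # de-duplicated input finds it
```

If repeated steps cannot occur in the real data, the extra unique() call is unnecessary.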

I'm not sure about performance, but the following code might be an alternative:
pl <- split(as.matrix(proce), seq_len(nrow(proce)))
pl <- lapply(pl, na.omit)
codes[, indexes := lapply(a1, function(x) which(unlist(lapply(pl, function(p) all(p %in% x)))) )]

Group-wise conditional subsetting where feasible

I would like to subset rows of my data
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id group x y z
1: 1 1 109.3400 208.6732 308.7595
2: 2 1 101.6920 201.0989 310.1080
3: 3 1 119.4697 217.8550 313.9384
4: 4 1 111.4261 205.2945 317.3651
5: 5 1 100.4024 212.2826 305.1375
6: 6 1 114.4711 203.6988 319.4913
in several stages, unless it results in an empty subset. In this case, I would like to skip that specific subsetting. In an earlier question, Frank has found a great solution for this:
f = function(x, ..., verbose=FALSE){
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}
where I can enter the data, and the cut-offs for each variable manually.
> f(dat, x > 119, y > 219, y > 1e6)
cond skip
1: x > 119 FALSE
2: y > 219 FALSE
3: y > 1e+06 TRUE
id group x y z
1: 55 2 119.2634 219.0044 315.6556
I now wonder how this (or something even better!) could be applied to a case where the cut-offs are in a second data.table
c <- data.table(group=1:2, x=c(110,119), y=c(210,219), z=c(310,319))
> c
group x y z
1: 1 110 210 310
2: 2 119 219 319
and specified for each group separately.
If I were to use f(.), I thought of a join of c into dat but can't figure it out. But perhaps there is a smarter way entirely.

First, I would change how c is constructed. You currently have it set up with one column per filter, but a long format would allow you to use multiple filters on the same column similar to your initial example (i.e. two filters on y):
c <- data.table(group=c(1,2,1,2,1,2,1),variable = c("x","x","y","y","z","z","y"), c_val = c(110,119,210,219,310,319,1e6))
c[, c_id := 1:.N]
c
group variable c_val c_id
1: 1 x 110 1
2: 2 x 119 2
3: 1 y 210 3
4: 2 y 219 4
5: 1 z 310 5
6: 2 z 319 6
7: 1 y 1000000 7
you can then merge your filters to your data.
dat_cut <- melt(dat, id.vars = c("id", "group"), value.name = "dat_val")
output <- merge(dat_cut, c, by = c("group","variable"), allow.cartesian = TRUE)
This line then tests the filters - you can expand this line if you want to expand your filter logic (greater than / less than, equal etc.), and can code that logic back to c
output <- output[dat_val > c_val]
You then want the ids for which the number of filters met equals the number of distinct filters that matched at least one row in that group:
output[,req_match := uniqueN(c_id), by = .(group)] # number of filters where a condition was met.
selection <- output[,.N,by = .(id, group, req_match)][N == req_match, id]
If a filter did not match any rows, it will be excluded here.
Then you can filter your initial dataset for the solution:
dat[id %in% selection]
id group x y z
1: 3 1 119.4697 217.8550 313.9384
2: 18 1 117.2930 216.5670 310.4617
3: 35 1 110.4283 218.6130 312.0904
4: 50 1 119.2519 214.2517 318.8567
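For comparison, here is a sketch of a different approach that keeps the wide cut-off table and the sequential "skip the filter if it would empty the subset" behaviour of f from the question. It is not the method above: it splits the data by group and applies the cut-offs one column at a time. The helper apply_cuts, the table name cutoffs (used instead of c to avoid masking base::c) and the toy values are all assumptions.

```r
library(data.table)

# Hypothetical toy data with fixed values (no random numbers)
dat <- data.table(id = 1:6, group = rep(1:2, each = 3),
                  x = c(105, 115, 118, 100, 101, 102),
                  y = c(205, 215, 212, 220, 221, 222))
cutoffs <- data.table(group = 1:2, x = c(110, 119), y = c(210, 219))

# Apply "column > cut-off" filters in order; skip a filter that matches no row
apply_cuts <- function(d, cuts) {
  for (v in names(cuts)) {
    keep <- d[[v]] > cuts[[v]]
    if (any(keep)) d <- d[keep]
  }
  d
}

# Process each group with its own row of cut-offs, then stack the results
res <- rbindlist(lapply(split(dat, by = "group"), function(d) {
  g <- d$group[1L]
  apply_cuts(d, as.list(cutoffs[group == g, .(x, y)]))
}))
print(res)
```

In group 2 no row exceeds the x cut-off, so that filter is skipped and only the y cut-off is applied.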

Can I use the R data.table join capability to select rows and perform some operation?

I'm not sure how to get the row indices resulting from the join of two data.tables.
To setup a simplified example, suppose dt is a data.table having column 'a' which is a letter from the alphabet, 'b' is some other piece of information.
I want to add a column 'c' and set it to either 'vowel' or 'consonant' depending on column 'a'. I have another data table dtv which serves as a table of vowels. Can I use the join capability of a data.table to efficiently perform this operation?
require(data.table)
dt <- data.table ( a = sample(letters, 25, replace = T),
b = sample(50:100, 25, replace = F))
dtv <- data.table( vowel = c( 'a','e','i','o','u') )
setkey(dt,a)
The next line of code gives me a data.table of rows with vowels
dt[dtv, nomatch=0]
But how do I grab the row indices so I can tag the row's as vowels or consonants?
dt[, c := 'consonant']
dt[{ `a` found in vowel list }, c := 'vowel']
# I want to do this where column 'a' is a vowel

Since v1.9.4, data.table is optimized to use a binary join for %in% when the data set is already keyed. So @Richard's answer should have the same performance in recent data.table versions. (Incidentally, %in% had a bug when used while datatable.auto.index = TRUE, so please make sure you have data.table v1.9.6+ installed if you are going to use it.)
Below is an illustration of data.table using a binary join when using the %in% function
require(data.table)
set.seed(123)
dt <- data.table ( a = sample(letters, 25, replace = T),
b = sample(50:100, 25, replace = F))
dtv <- data.table( vowel = c( 'a','e','i','o','u') )
setkey(dt, a)
options(datatable.verbose = TRUE)
dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
# a b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77
Either way, you were almost there and you can easily modify c while joining
dt[, c := 'consonant']
dt[dtv, c := 'vowel']
Or if you want to avoid joining unnecessary columns from dtv (in case they are present) you could join only to the first column in dtv
dt[dtv$vowel, c := 'vowel']
Notice that I haven't used .() or J(). By default, data.table performs a binary join instead of row indexing when the i argument is not of type integer or numeric. This matters if, for example, you want to perform a binary join over column b (which is of type integer). Compare
setkey(dt, b)
dt[80:85]
# a b <~~~ binary join wasn't triggered, instead an attempt to subset by rows 80:85 was made
# 1: NA NA
# 2: NA NA
# 3: NA NA
# 4: NA NA
# 5: NA NA
# 6: NA NA
And
dt[.(80:85)] # or dt[J(80:85)]
# Starting bmerge ...done in 0 secs <~~~ binary join was triggered
# a b
# 1: x 80
# 2: x 81
# 3: NA 82
# 4: NA 83
# 5: o 84
# 6: NA 85
Another difference between the two methods is that %in% won't return unmatched instances, compare
setkey(dt, a)
dt[a %in% dtv$vowel]
# Starting bmerge ...done in 0 secs
# a b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77
And
dt[dtv$vowel]
# Starting bmerge ...done in 0 secs
# a b
# 1: a NA <~~~ unmatched values returned
# 2: e NA <~~~ unmatched values returned
# 3: i 87
# 4: o 84
# 5: o 62
# 6: u 77
For this specific case it doesn't matter because the := won't modify unmatched values, but you can use nomatch = 0L in other cases
dt[dtv$vowel, nomatch = 0L]
# Starting bmerge ...done in 0 secs
# a b
# 1: i 87
# 2: o 84
# 3: o 62
# 4: u 77
Don't forget to set options(datatable.verbose = FALSE) if you don't want data.table to be so verbose.
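If setting a key is not desired, the same update can be sketched with an ad hoc join via on=, and the matching row indices the question asks about can be retrieved with which = TRUE. This is a minimal sketch on made-up data; the name vowel_rows is an assumption.

```r
library(data.table)

# Hypothetical toy data with fixed letters
dt  <- data.table(a = c("a", "b", "e", "z"), b = 1:4)
dtv <- data.table(vowel = c("a", "e", "i", "o", "u"))

# Row numbers of dt that match a vowel; nomatch = 0L drops vowels absent from dt
vowel_rows <- dt[dtv, on = .(a = vowel), which = TRUE, nomatch = 0L]

# Tag all rows as consonants, then overwrite the matched rows while joining
dt[, c := "consonant"]
dt[dtv, on = .(a = vowel), c := "vowel"]
print(dt)
```

The update join does not need the indices at all; they are shown only because the question asked how to obtain them.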

There's really no need to use a merge/join. We can use %in%.
dt[, c := "consonant"]
dt[a %in% dtv$vowel, c := "vowel"]
or the same thing in one line -
dt[, c := "consonant"][a %in% dtv$vowel, c := "vowel"]
Alternatively (and better), we can do both of those steps in a single call with the following.
dt[, c := c("consonant", "vowel")[a %in% dtv$vowel + 1L]]

How to change the last value in each group by reference, in data.table

For a data.table DT grouped by site, sorted by time t, I need to change the last value of a variable in each group. I assume it should be possible to do this by reference using :=, but I haven't found a way that works yet.
Sample data:
require(data.table) # using 1.8.11
DT <- data.table(site=c(rep("A",5), rep("B",4)),t=c(1:5,1:4),a=as.double(c(11:15,21:24)))
setkey(DT, site, t)
DT
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 15
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 24
The desired result is to change the last value of a in each group, for example to 999, so the result looks like:
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 999
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 999
It seems like .I and/or .N should be used, but I haven't found a form that works. The use of := in the same statement as .I[.N] gives an error. The following gives me the row numbers where the assignment is to be made:
DT[, .I[.N], by=site]
# site V1
# 1: A 5
# 2: B 9
but I don't seem to be able to use this with a := assignment. The following give errors:
DT[.N, a:=999, by=site]
# Null data.table (0 rows and 0 cols)
DT[, .I[.N, a:=999], by=site]
# Error in `:=`(a, 999) :
# := and `:=`(...) are defined for use in j, once only and in particular ways.
# See help(":="). Check is.data.table(DT) is TRUE.
DT[.I[.N], a:=999, by=site]
# Null data.table (0 rows and 0 cols)
Is there a way to do this by reference in data.table? Or is this better done another way in R?

Currently you can use:
DT[DT[, .I[.N], by = site][['V1']], a := 999]
# or, avoiding the overhead of a second call to `[.data.table`
set(DT, i = DT[,.I[.N],by='site'][['V1']], j = 'a', value = 999L)
alternative approaches:
use replace...
DT[, a := replace(a, .N, 999), by = site]
or shift the replacement to the RHS, wrapped by {} and return the full vector
DT[, a := {a[.N] <- 999L; a}, by = site]
or use mult='last' and take advantage of by-without-by. This requires the data.table to be keyed by the groups of interest.
DT[unique(site), a := 999, mult = 'last']
There is a feature request #2793 that would allow
DT[, a[.N] := 999]
but this is yet to be implemented
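The first approach can be broken into two steps to show why it works: .I[.N] evaluated per group yields the global row number of each group's last row, and those row numbers are then used as an ordinary i subset for the assignment by reference. A small sketch using the question's sample data (the name last_rows is an assumption):

```r
library(data.table)

DT <- data.table(site = c(rep("A", 5), rep("B", 4)),
                 t    = c(1:5, 1:4),
                 a    = as.double(c(11:15, 21:24)))
setkey(DT, site, t)

# Step 1: global row number of the last row in each group
last_rows <- DT[, .I[.N], by = site]$V1

# Step 2: assign by reference to exactly those rows
DT[last_rows, a := 999]
print(DT)
```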
