Search indexes in data.table R

I have a data.table, proce, where each row defines a "special procedure". I also have another data.table, codes, with each patient's procedures. For each patient, I want to extract the indexes of the "special procedures" that match his/her procedures (if any). Here is an example:
library(data.table)
proce <- data.table(v1 = c('o09513','o721','o701','z370'),
                    v2 = c('0w8nxzz','10d07z6','0tqd7zz','0uqg0zz'),
                    v3 = c('3e030vj','3e033vj',NA,NA))
codes <- data.table(a1 = c(list(c('o721','10d07z6','3e033vj')),
                           list(c('z370','0uqg0zz','0tqd7zz','o701')),
                           list(c('o09513','o721','o701','z370','0uqg8zz'))))
> proce
v1 v2 v3
1: o09513 0w8nxzz 3e030vj
2: o721 10d07z6 3e033vj
3: o701 0tqd7zz <NA>
4: z370 0uqg0zz <NA>
> codes
a1
1: o721,10d07z6,3e033vj
2: z370,0uqg0zz,0tqd7zz,o701
3: o09513,o721,o701,z370,0uqg8zz
Here is my implementation, but since both tables have hundreds of thousands of rows, it is slow:
index_procedures <- list()
for (i in 1:nrow(codes)) {
  a2 <- unlist(codes[i, a1])
  index_procedures[[i]] <- which(apply(proce[, .(v1, v2, v3)], 1,
                                       function(x) all(x[!is.na(x)] %in% a2)))
}
index_procedures
> index_procedures
[[1]]
[1] 2
[[2]]
[1] 3 4
[[3]]
integer(0)

If I understand correctly,
codes contains the procedure steps which have been applied to a patient; one row in codes refers to one patient.
proce contains the procedure steps which constitute a special procedure.
The OP wants to identify which special procedures have been applied to each patient (if any), where a special procedure only counts as applied if all of its procedure steps have been applied.
To solve this, I suggest reshaping all data into long (tidy) format first.
Then we can join on procedure steps, filter for complete special procedures, and aggregate to get one row per patient:
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
  , n_steps := .N, by = pid][]
lp[lc, on = .(step)][
  , .N == first(n_steps), by = .(cid, pid)][
  (V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
Note that the pids are shown in a condensed form for demonstration only; other output formats are available as well depending on subsequent processing steps.
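For instance, a list column keeps the pids machine-readable; a sketch of one such variant, using the same pipeline as above:
# same join and filter, but collect the matching pids in a list column
lp[lc, on = .(step)][
  , .N == first(n_steps), by = .(cid, pid)][
  (V1), .(pid = list(sort(pid))), by = cid]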
If it is required to show all patients even if they have not received a special procedure:
lp[lc, on = .(step)][, .N == first(n_steps), by = .(cid, pid)][
  V1 | is.na(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
3: 3
Commented code
# reshape data to long format, thereby adding a row number to identify patients
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
# reshape data to long format, thereby adding a row number to identify special procedures
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
  # count the number of procedure steps which constitute a special procedure
  , n_steps := .N, by = pid][]
# join on procedure steps
lp[lc, on = .(step)][
  # group by patient and special procedure and test for completeness of steps
  , .N == first(n_steps), by = .(cid, pid)][
  # filter for complete special procedures and aggregate to get one row per patient
  (V1), .(pid = toString(sort(pid))), by = cid]
After reshaping, lc is
cid step
1: 1 o721
2: 1 10d07z6
3: 1 3e033vj
4: 2 z370
5: 2 0uqg0zz
6: 2 0tqd7zz
7: 2 o701
8: 3 o09513
9: 3 o721
10: 3 o701
11: 3 z370
12: 3 0uqg8zz
and lp is
pid variable step n_steps
1: 1 v1 o09513 3
2: 2 v1 o721 3
3: 3 v1 o701 2
4: 4 v1 z370 2
5: 1 v2 0w8nxzz 3
6: 2 v2 10d07z6 3
7: 3 v2 0tqd7zz 2
8: 4 v2 0uqg0zz 2
9: 1 v3 3e030vj 3
10: 2 v3 3e033vj 3

I'm not sure about performance, but the following code might be an alternative:
pl <- split(as.matrix(proce), seq_len(nrow(proce)))
pl <- lapply(pl, na.omit)
codes[, indexes := lapply(a1, function(x)
  which(unlist(lapply(pl, function(p) all(p %in% x)))))]
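A quick sanity check against the loop result (assuming the example data above; unname() strips the names that split() introduced):
identical(lapply(codes$indexes, unname), index_procedures)
# expected: TRUE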

Find and subset patterns in data table

Suppose that we have a data table with missing values (see example below).
library(data.table)
mat <- matrix(rnorm(50), ncol = 5)
mat[c(1,3,5,9,10,11,14,37,38)] <- NA
DT <- as.data.table(mat)
In total, we have 5 unique missing data patterns in our example (see unique(!is.na(DT))).
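Since the NA positions above are fixed (the rnorm() values are random, but the NA indices are not), these five patterns are:
unique(!is.na(DT))
#         V1    V2   V3    V4   V5
# [1,] FALSE FALSE TRUE  TRUE TRUE
# [2,]  TRUE  TRUE TRUE  TRUE TRUE
# [3,] FALSE  TRUE TRUE  TRUE TRUE
# [4,]  TRUE FALSE TRUE  TRUE TRUE
# [5,]  TRUE  TRUE TRUE FALSE TRUE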
Suppose now further that we would like to find these patterns and identify them according to their frequency of occurrence (starting with the most frequent pattern indicated by 1).
DTna <- as.data.table(!is.na(DT))
DTna <- DTna[, n := .N, by = names(x = DTna)]
DTna <- DTna[, id := 1:nrow(x = DTna)]
DTna <- DTna[order(n, decreasing = TRUE)]
DTna <- DTna[, m := .GRP, by = eval(names(x = DT))]
Finally, observations with a particular pattern should be subsetted according to a prespecification (here e.g. 1 for the most frequent pattern).
pattern <- 1
i <- DTna[m == pattern, id]
DT[i]
In summary, I need to find observations which share the same missing data pattern and subsequently subset them according to a prespecification (e.g. the most frequent pattern). Please note that I need to subset DT instead of DTna.
Question
So far, the above code works as expected, but is there a more elegant way using data.table?
I would add a grouping column to DT to join and filter on:
DT[, nag := do.call(paste0, lapply(.SD, function(x) +is.na(x)))]
nagDT = DT[, .N, by=nag][order(-N), nagid := .I][, setorder(.SD, nagid)]
# nag N nagid
# 1: 10000 4 1
# 2: 00000 2 2
# 3: 00010 2 3
# 4: 11000 1 4
# 5: 01000 1 5
# subsetting
my_id = 1L
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0]
which gives
V1 V2 V3 V4 V5 nag
1: NA 1.3306093 -2.1030978 0.06115726 -0.2527502 10000
2: NA 0.2852518 -0.1894425 0.86698633 -0.2099998 10000
3: NA -0.1325032 -0.5201166 -0.94392417 0.6515976 10000
4: NA 0.3199076 -1.0152518 -1.61417902 -0.6458374 10000
If you want to omit the new column in the result:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"]
And to also omit the all-NA columns:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"][,
Filter(function(x) !anyNA(x), .SD)]
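As a side note, with data.table >= 1.12.0, .SDcols also accepts a predicate function, so the last step could equivalently be spelled:
# assuming data.table >= 1.12.0, where .SDcols takes a function filtering columns
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"][
  , .SD, .SDcols = function(x) !anyNA(x)]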
An alternative, which is undoubtedly inferior (but nonetheless provided for variety), is
DT[, patCnt := setDT(stack(transpose(DT)))[,
paste(+(is.na(values)), collapse=""), by="ind"][,
patCnt := .N, by=(V1)]$patCnt]
which returns
DT
V1 V2 V3 V4 V5 patCnt
1: NA NA -1.5062011 -0.9846015 0.12153714 1
2: 1.4176784 -0.08078952 -0.8101335 0.6437340 -0.49474613 2
3: NA -0.08410076 -1.1709337 -0.9182901 0.67985806 4
4: 0.2104999 NA -0.1458075 0.8192693 0.05217464 1
5: NA -0.73361504 2.1431392 -1.0041705 0.29198857 4
6: 0.3841267 -0.75943774 0.6931461 -1.3417511 -1.53291515 2
7: -0.8011166 0.26857593 1.1249757 NA -0.57850361 2
8: -1.5518674 0.52004986 1.6505470 NA -0.34061924 2
9: NA 0.83135928 0.9155882 0.1856450 0.31346976 4
10: NA 0.60328545 1.3042894 -0.5835755 -0.17132227 4
Then subset
DT[patCnt == max(patCnt)]
V1 V2 V3 V4 V5 patCnt
1: NA -0.08410076 -1.1709337 -0.9182901 0.6798581 4
2: NA -0.73361504 2.1431392 -1.0041705 0.2919886 4
3: NA 0.83135928 0.9155882 0.1856450 0.3134698 4
4: NA 0.60328545 1.3042894 -0.5835755 -0.1713223 4

Expand data.table with combinations of two columns given condition in another column

I have a data.table that gives me the connections between locations (origin and destination) for different bus routes (route_id).
library(data.table)
library(magrittr)
# data for reproducible example
dt <- data.table( origin = c('A','B','C', 'F', 'G', 'H'),
destination = c('B','C','D', 'G', 'H', 'I'),
freq = c(2,2,2,10,10,10),
route_id = c(1,1,1,2,2,2), stringsAsFactors=FALSE )
# > dt
# origin destination freq route_id
# 1: A B 2 1
# 2: B C 2 1
# 3: C D 2 1
# 4: F G 10 2
# 5: G H 10 2
# 6: H I 10 2
For my purposes, if a route_id gives a connection A-B and a connection B-C, then I want to add to the data a connection A-C for that same route_id, and so on.
Problems: So far, I've written simple code that does the job, but:
it uses a for loop that takes a long time (my real data has hundreds of thousands of observations)
it still does not cope well with direction. The direction of the connections matters here: although there is a B-C connection in the original data, there should be no C-B in the output.
My slow solution
# loop
# a) get a data subset corresponding to each route_id
# b) get all combinations of origin-destination pairs
# c) row bind the new pairs to original data
for (i in unique(dt$route_id)) {
  temp <- dt[route_id == i, ]
  subset_of_pairs <- expand.grid(temp$origin, temp$destination) %>% setDT()
  setnames(subset_of_pairs, c("origin", "destination"))
  dt <- rbind(dt, subset_of_pairs, fill = TRUE)
}
# assign route_id and freq to new pairs
dt[, route_id := route_id[1L], by=origin]
dt[, freq := freq[1L], by=route_id]
# Keep only unique pairs with different origin and destination
dt[, origin := as.character(origin) ][, destination := as.character(destination) ]
dt <- dt[ origin != destination, ][order(route_id, origin, destination)]
dt <- unique(dt)
Desired output
origin destination freq route_id
1: A B 2 1
2: A C 2 1
3: A D 2 1
4: B C 2 1
5: B D 2 1
6: C D 2 1
7: F G 10 2
8: F H 10 2
9: F I 10 2
10: G H 10 2
11: G I 10 2
12: H I 10 2
One way:
res = dt[, {
  stops = c(origin, last(destination))
  pairs = combn(.N + 1L, 2L)
  .(o = stops[pairs[1,]], d = stops[pairs[2,]])
}, by=route_id]
route_id o d
1: 1 A B
2: 1 A C
3: 1 A D
4: 1 B C
5: 1 B D
6: 1 C D
7: 2 F G
8: 2 F H
9: 2 F I
10: 2 G H
11: 2 G I
12: 2 H I
This is assuming that c(origin, last(destination)) is a full list of stops in order. If dt does not contain enough info to construct a complete order, the task becomes much more difficult.
If vars from dt are needed, an update join like res[dt, on=.(route_id), freq := i.freq] works.
Tasks like this always risk running out of memory. In this case, the OP has up to a million rows containing groups of up to 341 stops, so the end result could be as large as 1e6/341*choose(341,2) = 170 million rows. That's manageable, but in general this sort of analysis does not scale.
How it works
Generally, data.table syntax can be treated just like a loop over groups:
DT[, {
...
}, by=g]
This has a few advantages over loops:
Nothing created in the ... body will pollute the workspace.
All columns can be referenced by name.
Special symbols .N, .SD, .GRP and .BY are available, along with .() for list().
In the code above, pairs finds pairs of indices taken from 1 .. #stops (= .N + 1, where .N is the number of rows in the subset of the data associated with a given route_id). It is a matrix whose first row holds the first element of each pair and whose second row holds the second. The ... should evaluate to a list of columns; here list() is abbreviated as .().
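For illustration, with four stops (i.e. .N = 3) the pair matrix looks like this:
combn(4L, 2L)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4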
Further improvements
I guess the time is mostly devoted to computing combn many times. If multiple routes have the same number of stops, this can be addressed by computing the pair matrices beforehand (note the n + 1L, matching .N + 1L in the main code):
Ns = dt[, .N, by=route_id][, unique(N)]
cb = lapply(setNames(, Ns), function(n) combn(n + 1L, 2L))
Then grab pairs = cb[[as.character(.N)]] in the main code. Alternately, define a pairs function that uses memoization to avoid recomputing, as sketched below.
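A minimal memoizing wrapper could look like this (a sketch; pairs_memo and pair_cache are hypothetical helpers, not data.table functions):
# cache the combn result per group size in an environment
pair_cache = new.env()
pairs_memo = function(n) {
  key = as.character(n)
  if (is.null(pair_cache[[key]])) pair_cache[[key]] = combn(n, 2L)
  pair_cache[[key]]
}
# then, in the main code: pairs = pairs_memo(.N + 1L)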

Counting existing pairs of data in large data table

This question is based on the response given by @Arun here. In that response, @Arun proposes a clever way to avoid creating sparse matrices by looking only at pairs that actually occur, hence avoiding the storage of many zeros and the doubling of pairs A-B and B-A.
The following is copy-pasted from his answer:
Step 1: Construct sample data of your dimensions approximately:
require(data.table) ## 1.9.4+
set.seed(1L) ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)
Step 2: Pre-processing:
DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT) ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id) ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]  ## global row number; within an id, motifs are in sorted order
Step 3: Solution:
ans = DT[DT, {
    # self-join on id: for each row of i (one (id, motif) pair),
    # 'tmp' flags the rows of x with the same id that sort earlier
    tmp = runlen < i.runlen;
    # pair each earlier motif with the current one
    list(motif[tmp], i.motif[any(tmp)])
},
by=.EACHI][, .N, by="V1,V2"]  # count each (V1, V2) pair across all ids
Running this works fine provided you have enough memory on your computer. I also humbly admit I have no idea what exactly the code is doing to create the wanted results, so I'm just looking at input and output, agnostic of the process.
When applying the exact same code to my data, what seems to happen is that pairs appear that are not in the original data.
I'm running the following code, which is a slightly adapted version of what @Arun provided. The adaptation is needed because I run the code for 17 different blocks, i.e. I'm looking for which pairs occur within a specific block.
cooc <- data.frame()
for (j in 1:17) {
  DT <- dt[block == j, c("pid", "cid"), with = FALSE]
  DT$pid <- as.factor(DT$pid)
  setorder(DT)
  setkey(DT, pid)
  DT[, runlen := .I]
  ans <- DT[DT, {
    tmp = runlen < i.runlen;
    list(cid[tmp], i.cid[any(tmp)])
  },
  by = .EACHI][, .N, by = "V1,V2"]
  ans$block <- j
  cooc <- data.table(rbind(cooc, ans))
  rm(ans)
}
As far as I understand the code, it's all identical, just looped with for to do the same thing for each of the 17 blocks. Both pid and cid are just integers that identify a variable of interest.
For j = 1, the following holds:
DT[cid == 39] # cid is my equivalent of motif above and pid is my equivalent of id above
pid cid runlen
20319 39 3614
This suggests there is only one pid for which cid equals 39.
Now when I look into the resulting ans datatable I get the following:
ans[V1 == 39]
V1 V2 N block
1: 39 41 1 1
2: 39 42 1 1
3: 39 44 1 1
4: 39 47 1 1
5: 39 7027 1 1
6: 39 7043 1 1
7: 39 7174 1 1
8: 39 9434 1 1
9: 39 11493 1 1
10: 39 18815 1 1
11: 39 18875 1 1
12: 39 18896 1 1
13: 39 18909 1 1
14: 39 18924 1 1
15: 39 18928 1 1
16: 39 18929 1 1
17: 39 18931 1 1
18: 39 18932 1 1
19: 39 19265 1 1
20: 39 19410 1 1
Suddenly, there are 20 occurrences of V1 == 39 (if I understand the code correctly, V1 is the equivalent of what was cid). Yet in DT there is only one pid assigned to cid == 39.
I have no idea how to reproduce this finding, so I have tried to show what seems to be inconsistent. I don't think the original code has this problem, so I'm hoping someone can explain where the additional occurrences of cid == 39 come from, based on the info I have given here.
Two things:
First, I don't understand what's wrong with the result you get. Starting from
require(data.table)
set.seed(1L)
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)
let me recreate the data that helps answer your question.
# keep only one of 'motif_456'
DT2 = rbind(DT[1L], DT[motif != "motif_456"])
DT2[1L]
# id motif
# 1: 49338 motif_456
DT2[ , .N, by=motif]
# motif N
# 1: motif_456 1
# 2: motif_639 1637
# 3: motif_984 1649
# 4: motif_1559 1531
# 5: motif_347 1603
# ---
# 1712: motif_46 1623
# 1713: motif_521 1613
# 1714: motif_803 1568
# 1715: motif_603 1573
# 1716: motif_461 1591
Let's check all motifs corresponding to id = 49338:
DT2[id == 49338, motif]
# [1] "motif_456" "motif_553" "motif_1048" "motif_1680" "motif_171" "motif_1706"
# [7] "motif_707" "motif_163" "motif_489" "motif_107" "motif_1419" "motif_595"
# [13] "motif_1223" "motif_1274" "motif_1164" "motif_427" "motif_505" "motif_1041"
# [19] "motif_1321" "motif_1231" "motif_1498" "motif_837" "motif_298" "motif_649"
# [25] "motif_631"
So obviously, for each of these motifs' combinations with motif_456 the result should be 1. And that's what the data.table solution provides. Here's the relevant result after running the data.table solution:
# data.table solution takes 11.2 secs
ans[V1 == "motif_456", .N] + ans[V2 == "motif_456", .N]
# [1] 24
Second, while the data.table answer does well, we can do this more efficiently with the solution shown by @nograpes. Let's try it on DT2:
require(Matrix)
DT2[, names(DT2) := lapply(.SD, as.factor)]
s <- sparseMatrix(
  as.integer(DT2$id),
  as.integer(DT2$motif),
  dimnames = list(levels(DT2$id), levels(DT2$motif)),
  x = TRUE)
co.oc <- t(s) %*% s        # Find co-occurrences.
tab <- summary(co.oc)      # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix.
ans = setDT(list(motif1 = levels(DT2$motif)[tab$i],
                 motif2 = levels(DT2$motif)[tab$j],
                 number = tab$x))
# Matrix solution takes 2.4 secs
ans[motif1 == "motif_456", .N] + ans[motif2 == "motif_456", .N]
# [1] 24
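For intuition on why the cross-product counts co-occurrences, here is a toy example (a sketch, independent of the data above):
library(Matrix)
# two ids, three motifs: id1 has {a, b}, id2 has {b, c}
s <- sparseMatrix(i = c(1, 1, 2, 2), j = c(1, 2, 2, 3), x = TRUE,
                  dims = c(2, 3), dimnames = list(c("id1", "id2"), c("a", "b", "c")))
t(s) %*% s
# entry (m1, m2) counts the ids containing both m1 and m2,
# e.g. (a, b) = 1 and (a, c) = 0; the upper triangle holds each unordered pair once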

How to change the last value in each group by reference, in data.table

For a data.table DT grouped by site, sorted by time t, I need to change the last value of a variable in each group. I assume it should be possible to do this by reference using :=, but I haven't found a way that works yet.
Sample data:
require(data.table) # using 1.8.11
DT <- data.table(site = c(rep("A", 5), rep("B", 4)), t = c(1:5, 1:4),
                 a = as.double(c(11:15, 21:24)))
setkey(DT, site, t)
DT
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 15
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 24
The desired result is to change the last value of a in each group, for example to 999, so the result looks like:
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 999
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 999
It seems like .I and/or .N should be used, but I haven't found a form that works. The use of := in the same statement as .I[.N] gives an error. The following gives me the row numbers where the assignment is to be made:
DT[, .I[.N], by=site]
# site V1
# 1: A 5
# 2: B 9
but I don't seem to be able to use this with a := assignment. The following give errors:
DT[.N, a:=999, by=site]
# Null data.table (0 rows and 0 cols)
DT[, .I[.N, a:=999], by=site]
# Error in `:=`(a, 999) :
# := and `:=`(...) are defined for use in j, once only and in particular ways.
# See help(":="). Check is.data.table(DT) is TRUE.
DT[.I[.N], a:=999, by=site]
# Null data.table (0 rows and 0 cols)
Is there a way to do this by reference in data.table? Or is this better done another way in R?
Currently you can use:
DT[DT[, .I[.N], by = site][['V1']], a := 999]
# or, avoiding the overhead of a second call to `[.data.table`
set(DT, i = DT[,.I[.N],by='site'][['V1']], j = 'a', value = 999L)
alternative approaches:
use replace...
DT[, a := replace(a, .N, 999), by = site]
or shift the replacement to the RHS, wrapping it in {} and returning the full vector
DT[, a := {a[.N] <- 999L; a}, by = site]
or use mult='last' and take advantage of by-without-by. This requires the data.table to be keyed by the groups of interest.
DT[unique(site), a := 999, mult = 'last']
There is a feature request #2793 that would allow
DT[, a[.N] := 999]
but this is yet to be implemented

Pairing rows in data.table

I have a data table similar to the one obtained with the following command:
dt <- data.table(
  time = 1:8,
  part = rep(c(1, 1, 2, 2), 2),
  type = rep(c('A', 'B'), 4),
  data = rep(c(runif(1), 0), 4))
Basically, such a table contains two different types of instances (A or B). The time column contains a timestamp for when a request arrived at or left a certain part. If the instance type is A, the timestamp is the arrival (enter) time; if the type is B, it is the leaving (exit) time.
time part type data
1: 1 1 A 0.5842668
2: 2 1 B 0.0000000
3: 3 2 A 0.5842668
4: 4 2 B 0.0000000
5: 5 1 A 0.5842668
6: 6 1 B 0.0000000
7: 7 2 A 0.5842668
8: 8 2 B 0.0000000
I would like to pair A and B instances, and obtain the following data table:
part data enter.time exit.time
1: 1 0.5842668 1 2
2: 1 0.5842668 5 6
3: 2 0.5842668 3 4
4: 2 0.5842668 7 8
I have tried the following:
pair.types <- function(x) {
  a.type <- x[type == 'A']
  b.type <- x[type == 'B']
  return(data.table(
    enter.time = a.type$time,
    exit.time = b.type$time,
    data = a.type$data))
}
dt[, c('enter.time', 'exit.time', 'data') := pair.types(.SD), by = list(part)]
But, that gives me the following, which is not exactly what I want:
time part type data enter.time exit.time
1: 1 1 A 0.5842668 1 2
2: 2 1 B 0.5842668 5 6
3: 3 2 A 0.5842668 3 4
4: 4 2 B 0.5842668 7 8
5: 5 1 A 0.5842668 1 2
6: 6 1 B 0.5842668 5 6
7: 7 2 A 0.5842668 3 4
8: 8 2 B 0.5842668 7 8
It is kind of close, but since the type column is kept, some rows are duplicated. Perhaps I could remove the time and type columns and then drop the second half of the rows, but I am not sure whether that would work in all cases, and I would like to learn a better way to do this operation.
Assuming your data looks like your example data:
dt[, list(part = part[1],
          data = data[1],
          enter.time = time[1],
          exit.time = time[2]),
   by = as.integer((seq_len(nrow(dt)) + 1)/2)]
# by = rep(seq(1, nrow(dt), 2), each = 2)]
# ^^^ a slightly shorter and a little more readable alternative
The idea is very simple: group the rows in groups of 2 (that's the by part), i.e. each group will be one A and one B; then for each group take the first part and the first data, and the enter and exit times are just the first and second time respectively. This is likely how you'd do it if you followed the by-hand logic, making it easy to read (once you know just a tiny bit about how data.table works).
Another way:
setkey(dt, "type")
dt.out <- cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
                exit.time = dt[J("B"), list(time)]$time)
# part      data entry.time exit.time
# 1:    1 0.5842668          1         2
# 2:    2 0.5842668          3         4
# 3:    1 0.5842668          5         6
# 4:    2 0.5842668          7         8
If you want you can now do setkey(dt.out, "part") to get the same order.
The idea: your problem seems a simple "reshaping" one to me. The way I've approached it is first to create a key column on type. Now we can subset the data.table for a specific value of the key column with dt[J("A")], which returns all columns. Since you want the time column renamed, I explicitly mention which columns to subset using:
dt[J("A"), list(part, data, entry.time = time)]
Of course this will also return the type column (all values = "A"), which we have to remove. So I've added [, type := NULL] to remove the type column by reference.
Now we've the first part. All we need is the exit.time. This can be obtained similarly as:
dt[J("B"), list(time)] # I don't name the column here
But this gives a data.table when you need just the time column, which can be accessed by:
dt[J("B"), list(time)]$time
So, while using cbind I name this column as exit.time to get the final result as:
cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
Hope this helps.
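Since the problem is essentially a reshape, a dcast()-based variant is also conceivable. This is only a sketch, assuming data.table >= 1.9.8 (for rowid() and multiple value.var columns) and that each part has matching A/B rows in time order:
# tag the k-th A and the k-th B within each part so they land in the same row
dt[, pair := rowid(part, type)]
wide <- dcast(dt, part + pair ~ type, value.var = c("time", "data"))
wide[, .(part, data = data_A, enter.time = time_A, exit.time = time_B)]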
