Quick search in data.table or quick subset - r

I have a DF with 800k+ rows containing repeated (random) values. For each row I need to take its value and find the indices of the other rows with the same value. E.g. "asd": where else does it occur? The index of the current row is NOT needed.
My current solution: subset the DF to create a temporary table without the current row. The problem is that it takes about a minute per 1000 iterations, so 800k+ rows would take roughly 13 hours to run. Any ideas? Thanks!
Running on the original DF (not subsetted) takes < 1 second, but as you can imagine it returns the index of the current row.
Edit: My real-life DF has more than one column. The example below is simplified. I need to take V1[1] and get the row numbers of the other V1 entries with the value of V1[1], then repeat for V1[2] and so on for each row.
library(data.table)
library(fastmatch)
library(stringi)
set.seed(12345)
V1 = stringi::stri_rand_strings(800000, 3)
df0 = as.data.table(V1)
mapped = matrix(NA_integer_, nrow = 800000) # fmatch() returns integer positions
print(Sys.time())
for (i in 1:1000) {
  tmp_df = df0[-i,] # This takes a very long time!!! (the table is copied on every iteration)
  mapped[i] = fmatch(df0$V1[i], tmp_df$V1)
}
print(Sys.time())
View(mapped)

Data:
library("data.table")
set.seed(12345)
V1 = stringi::stri_rand_strings(80, 3)
df0 <- data.table( sample(V1, 100, replace = TRUE ))
Code:
df0[, id := list(list(.I)), by = V1] # integer id
Output:
head(df0, 10)
# V1 id
# 1: iuR 1,2,21
# 2: iuR 1,2,21
# 3: KXc 3
# 4: LwA 4
# 5: pYn 5
# 6: qoN 6,66
# 7: 5Xt 7
# 8: wBH 8,77
# 9: V9r 9,39,54
# 10: 9ks 10,28,42,48
EDIT - Removed Current Index:
df0[, id2 := 1:.N ]
df0[, id := list(list(unlist(id)[ unlist(id) != .I ] )), by = id2 ]
df0[, id2 := NULL ]
df0[ lengths(id) > 0, ]
head( df0, 10 )
# V1 id
# 1: iuR 2,21
# 2: iuR 1,21
# 3: KXc
# 4: LwA
# 5: pYn
# 6: qoN 66
# 7: 5Xt
# 8: wBH 77
# 9: V9r 39,54
# 10: 9ks 28,42,48
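For comparison, here is a minimal base R sketch of the same idea that avoids subsetting per row: group the row numbers by value once, then drop each row's own index from its group. The vector `v` and the names `groups` and `others` are made up for illustration and are not part of the question.

```r
# Toy stand-in for df0$V1 (assumed data, not the original 800k strings)
v <- c("asd", "qwe", "asd", "zxc", "qwe", "asd")

# Row numbers of every distinct value, computed once
groups <- split(seq_along(v), v)

# For each row, take the group of its value and remove the row's own index
others <- unname(Map(function(idx, i) idx[idx != i], groups[v], seq_along(v)))

others[[1]]  # rows other than row 1 that hold "asd"
```

Rows whose value occurs only once end up with an empty integer vector, which matches the empty cells in the output above.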

Related

How to avoid redundant calculation within data.table?

For each id1, I need to find the two lowest distinct minima of the data column in a data.table, where the minima are taken per id2:
library(data.table)
library(memoise)
n <- 12
set.seed(1234)
id1 <- rep(1:2, each = 6)
id2 <- rep(1:6, each = 2)
data <- 100+100*rnorm(n)
dt <- data.table(id1=id1, id2=id2, data=data)
Below is the function that, given the second id id2, calculates the two unique minima at the same time and returns them as a vector:
detect_two_lower <- function(ids, values){
  dt <- data.table(ids, values)
  # minimum of values per id, sorted ascending
  dt <- dt[, .(V1 = min(values, na.rm = TRUE)), by = ids][order(V1)]
  # the two lowest per-id minima as a named vector
  c(min_1 = dt$V1[1], min_2 = dt$V1[2])
}
detect_two_lower <- memoise(detect_two_lower)
Then apply the function on the data.table, grouping by = id1:
dt[, `:=` ( min_1 = detect_two_lower(id2, data)[1]
,min_2 = detect_two_lower(id2, data)[2])
, by = id1
]
The calculation runs as expected (see below). Note, however, that the code calls detect_two_lower twice with the same parameters. As a workaround I tried to minimize the reworking with memoise, but I would like to avoid this patch.
Is there a better way to accomplish the same result?
dt
id1 id2 data min_1 min_2
1: 1 1 -20.7065749 -134.5697703 -20.70657
2: 1 1 127.7429242 -134.5697703 -20.70657
3: 1 2 208.4441177 -134.5697703 -20.70657
4: 1 2 -134.5697703 -134.5697703 -20.70657
5: 1 3 142.9124689 -134.5697703 -20.70657
6: 1 3 150.6055892 -134.5697703 -20.70657
7: 2 4 42.5260040 0.1613555 10.99622
8: 2 4 45.3368144 0.1613555 10.99622
9: 2 5 43.5548001 0.1613555 10.99622
10: 2 5 10.9962171 0.1613555 10.99622
11: 2 6 52.2807300 0.1613555 10.99622
12: 2 6 0.1613555 0.1613555 10.99622
Return a list from the function
library(data.table)
detect_two_lower <- function(ids, values){
  dt <- data.table(ids, values)
  dt <- dt[, .(V1 = min(values, na.rm = TRUE)), by = ids][order(V1)]
  as.list(dt$V1[1:2]) # one list element per target column (min_1, min_2)
}
So you can assign them directly:
dt[, c('min_1', 'min_2') := detect_two_lower(id2, data), id1]
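A compact variant of the same idea, sketched with made-up numbers so the result is easy to check by hand. The helper name `two_lowest` and the toy values are assumptions for illustration; the point is only that the function runs once per `id1` group and fills both columns from a single call.

```r
library(data.table)

# Toy data (assumed values, not the rnorm() data from the question)
dt <- data.table(id1  = rep(1:2, each = 6),
                 id2  = rep(1:6, each = 2),
                 data = c(5, 3, 9, 1, 7, 8, 4, 6, 2, 10, 12, 11))

# Hypothetical helper: per-id minima, sorted, first two returned as a list
two_lowest <- function(ids, values) {
  mins <- sort(tapply(values, ids, min, na.rm = TRUE))
  as.list(as.numeric(mins)[1:2])
}

# One call per id1 group; the two list elements become the two columns
dt[, c("min_1", "min_2") := two_lowest(id2, data), by = id1]
dt
```

If a group has only one distinct `id2`, the second element is `NA`, which may or may not be what is wanted.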

Search indexes in data.table R

I have a data.table, proce, where each line defines a "special procedure". Now, I have another data.table with the patient procedures, codes. For each person, I want to extract the indexes of "special procedures" that match with his/her procedures (if they have any). Here is an example:
library(data.table)
proce <- data.table(v1 = c('o09513','o721','o701','z370'), v2 = c('0w8nxzz','10d07z6','0tqd7zz','0uqg0zz'),
v3 = c('3e030vj','3e033vj',NA,NA))
codes <- data.table(a1 = c(list(c('o721','10d07z6','3e033vj')),
list(c('z370','0uqg0zz',"0tqd7zz","o701")),
list(c('o09513','o721','o701','z370','0uqg8zz'))))
> proce
v1 v2 v3
1: o09513 0w8nxzz 3e030vj
2: o721 10d07z6 3e033vj
3: o701 0tqd7zz <NA>
4: z370 0uqg0zz <NA>
> codes
a1
1: o721,10d07z6,3e033vj
2: z370,0uqg0zz,0tqd7zz,o701
3: o09513,o721,o701,z370,0uqg8zz
Here is my implementation, but since both tables have hundreds of thousands of rows, it is slow.
index_procedures <- list()
for(i in 1:nrow(codes)){ # i <- 2
a2 <- unlist(codes[i,a1])
index_procedures[[i]] <- which(apply(proce[,.(v1,v2,v3)], 1,function(x) all(x[!is.na(x)] %in% a2)))
}
index_procedures
> index_procedures
[[1]]
[1] 2
[[2]]
[1] 3 4
[[3]]
integer(0)
If I understand correctly,
codes contains procedure steps which have been applied to a patient. One row in codes refers to one patient.
proce contains procedure steps which constitute a special procedure.
The OP wants to identify which special procedures have been applied to each patient (if any). A special procedure is only considered to have been applied to a patient if all of its procedure steps have been applied.
To solve this, I suggest reshaping all data into a tidy format, i.e., long format, first.
Then we can join on procedure steps, filter for complete special procedures, and aggregate to get one row per patient:
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
, n_steps := .N, by = pid][]
lp[lc, on = .(step)][
, .N == first(n_steps), by = .(cid, pid)][
(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
Note that the pids are shown in a condensed form for demonstration only; other output formats are available as well depending on subsequent processing steps.
If it is required to show all patients even if they have not received a special procedure:
lp[lc, on = .(step)][, .N == first(n_steps), by = .(cid, pid)][
V1 | is.na(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
3: 3
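As one example of another output format, the sketch below keeps the matching pids as an integer list column instead of a comma-separated string. It uses a small made-up `proce`/`codes` pair so the result can be verified by hand, and it assumes each patient lists a given step at most once (otherwise the count comparison would need deduplication first). The column name `complete` is introduced here for readability.

```r
library(data.table)

# Toy inputs (assumed data): procedure 1 has 3 steps, procedure 2 has 2 steps
proce <- data.table(v1 = c("a", "b"), v2 = c("x", "y"), v3 = c("m", NA))
codes <- data.table(a1 = list(c("a", "x", "m", "b"), c("b", "y"), c("q")))

# Long format: one row per (patient, step) and per (procedure, step)
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
lp <- melt(proce[, pid := .I], id.vars = "pid", na.rm = TRUE, value.name = "step")
lp[, n_steps := .N, by = pid]

# Inner join on step, then test whether every step of a procedure was matched
matched <- lp[lc, on = .(step), nomatch = NULL][
  , .(complete = .N == n_steps[1]), by = .(cid, pid)]

# Keep complete procedures; store the pids as an integer vector per patient
res <- matched[(complete), .(pid = list(sort(pid))), by = cid]
res
```

A list column is convenient when the indexes are processed further, since no string parsing is needed.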
Commented code
# reshape data to long format, thereby adding a row number to identify patients
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
# reshape data to long format, thereby adding a row number to identify special procedures
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
# count the number of procedure steps which constitute a special procedure
, n_steps := .N, by = pid][]
# join on procedure steps
lp[lc, on = .(step)][
# group by patient and special procedure and test for completeness of steps
, .N == first(n_steps), by = .(cid, pid)][
# filter for complete special procedures and aggregate to get one row per patient
(V1), .(pid = toString(sort(pid))), by = cid]
After reshaping, lc is
cid step
1: 1 o721
2: 1 10d07z6
3: 1 3e033vj
4: 2 z370
5: 2 0uqg0zz
6: 2 0tqd7zz
7: 2 o701
8: 3 o09513
9: 3 o721
10: 3 o701
11: 3 z370
12: 3 0uqg8zz
and lp is
pid variable step n_steps
1: 1 v1 o09513 3
2: 2 v1 o721 3
3: 3 v1 o701 2
4: 4 v1 z370 2
5: 1 v2 0w8nxzz 3
6: 2 v2 10d07z6 3
7: 3 v2 0tqd7zz 2
8: 4 v2 0uqg0zz 2
9: 1 v3 3e030vj 3
10: 2 v3 3e033vj 3
I'm not sure about performance, but the following code might be an alternative:
pl <- split(as.matrix(proce), seq_len(nrow(proce)))
pl <- lapply(pl, na.omit)
codes[, indexes := lapply(a1, function(x) which(unlist(lapply(pl, function(p) all(p %in% x)))) )]

Binary search for integer64 in data.table

I have an integer64-indexed data.table object:
library(data.table)
library(bit64)
# pass the values as strings: numeric literals this large are rounded to double before as.integer64() sees them
some_data = as.integer64(c("1514772184120000026", "1514772184120000068", "1514772184120000042", "1514772184120000078", "1514772184120000011", "1514772184120000043", "1514772184120000094", "1514772184120000085",
"1514772184120000083", "1514772184120000017", "1514772184120000013", "1514772184120000060", "1514772184120000032", "1514772184120000059", "1514772184120000029"))
#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
Then I have a bunch of other integer64 that I need to binary-search for in the indexes of my original data.table object (x):
search_values <- some_data[(n+1):length(some_data)]
If these were native integers I could use findInterval() to solve the problem:
values_index <- findInterval(search_values, x$new_col)
but when the arguments to findInterval are integer64, I get:
Warning messages:
1: In as.double.integer64(vec) :
integer precision lost while converting to double
2: In as.double.integer64(x) :
integer precision lost while converting to double
and wrong indexes:
> values_index
[1] 10 10 10 10 10
i.e., it is not true that the entries of search_values are all larger than all entries of x$new_col.
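The precision loss behind those warnings can be reproduced in plain R without bit64. A double has a 53-bit significand, so integers of this magnitude (between 2^60 and 2^61) are stored only to the nearest multiple of 256. The sketch below uses two of the values from some_data as numeric literals and assumes nothing beyond base R.

```r
# Two distinct 19-digit values written as ordinary numeric (double) literals
a <- 1514772184120000026
b <- 1514772184120000068

# Both are rounded to the same double, so they compare equal
collide <- a == b
collide

# Above this bound, not every integer is exactly representable as a double
2^53
```

This is why any function that converts integer64 to double before comparing cannot tell these values apart.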
Edit:
Desired output:
print(values_index)
9 10 6 10 1
Why?:
values_index has as many entries as search_values. For each entry of search_values, the corresponding entry of values_index gives the rank that entry would have if it were inserted into x$new_col. So the first entry of values_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.
Maybe you want something like this:
findInterval2 <- function(y, x) {
toadd <- y[!(y %in% x$new_col)] # search_values that is not in data
x2 <- copy(x)
x2[, i := .I] # mark the original data set
x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
use.names = T, fill = T) # add missing search_values
setkey(x2, new_col) # order
x2[, index := cumsum(!is.na(i))]
x2[match(y, new_col), index]
}
# x2 is:
# a new_col i index
# 1: 0.56602278 1514772184120000011 1 1
# 2: NA 1514772184120000013 NA 1
# 3: 0.29408237 1514772184120000017 2 2
# 4: 0.28532378 1514772184120000026 3 3
# 5: NA 1514772184120000029 NA 3
# 6: NA 1514772184120000032 NA 3
# 7: 0.66844754 1514772184120000042 4 4
# 8: 0.83008829 1514772184120000043 5 5
# 9: NA 1514772184120000059 NA 5
# 10: NA 1514772184120000060 NA 5
# 11: 0.76992760 1514772184120000068 6 6
# 12: 0.57049677 1514772184120000078 7 7
# 13: 0.14406169 1514772184120000083 8 8
# 14: 0.02044602 1514772184120000085 9 9
# 15: 0.68016024 1514772184120000094 10 10
findInterval2(search_values, x)
# [1] 1 5 3 5 3
If not, then maybe you could change the code as needed.
Update
Look at this integer example to see that this function gives the same result as base findInterval:
now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]
r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)
If I get what you want, then a quick workaround could be:
toadd <- search_values[!(search_values %in% x$new_col)] # search_values that is not in data
x[, i := .I] # mark the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
use.names = T, fill = T) # add missing search_values
setkey(x, new_col) # order
x[, index := new_col %in% search_values] # mark where the values are
x[, index := cumsum(index)] # get indexes
x <- x[!is.na(i)] # remove added rows
x$index # should contain your desired output
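Since bit64 compares integer64 values exactly, another option is to write the binary search directly in R so that nothing is converted to double. This is only a sketch: the function name `find_interval64` is made up here, the input is assumed to be sorted ascending, and a plain R loop will be slower than compiled code on large inputs.

```r
library(bit64)

# Hypothetical helper: for each value, the number of entries of `sorted`
# that are <= the value (same convention as findInterval on sorted input)
find_interval64 <- function(values, sorted) {
  vapply(seq_along(values), function(k) {
    lo <- 0L
    hi <- length(sorted)
    # invariant: every entry up to position lo is <= values[k],
    # and every entry after position hi is > values[k]
    while (lo < hi) {
      mid <- (lo + hi + 1L) %/% 2L
      if (sorted[mid] <= values[k]) lo <- mid else hi <- mid - 1L
    }
    lo
  }, integer(1))
}

# Values given as strings so that no precision is lost on input
sorted <- as.integer64(c("1514772184120000011", "1514772184120000017",
                         "1514772184120000026", "1514772184120000042"))
values <- as.integer64(c("1514772184120000013", "1514772184120000060",
                         "1514772184120000005"))
res <- find_interval64(values, sorted)
res
```

With the data from the question this would be called as `find_interval64(search_values, x$new_col)` after `setorder(x, new_col)`.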

Find and subset patterns in data table

Suppose that we have a data table with missing values (see example below).
library(data.table)
mat <- matrix(rnorm(50), ncol = 5)
mat[c(1,3,5,9,10,11,14,37,38)] <- NA
DT <- as.data.table(mat)
In total, we have 5 unique missing data patterns in our example (see unique(!is.na(DT))).
Suppose now further that we would like to find these patterns and identify them according to their frequency of occurrence (starting with the most frequent pattern indicated by 1).
DTna <- as.data.table(!is.na(DT))
DTna <- DTna[, n := .N, by = names(x = DTna)]
DTna <- DTna[, id := 1:nrow(x = DTna)]
DTna <- DTna[order(n, decreasing = TRUE)]
DTna <- DTna[, m := .GRP, by = eval(names(x = DT))]
Finally, observations with a particular pattern should be subsetted according to a prespecification (here e.g. 1 for the most frequent pattern).
pattern <- 1
i <- DTna[m == pattern, id]
DT[i]
In summary, I need to find observations which share the same missing data pattern and subsequently subset them according to a prespecification (e.g. the most frequent pattern). Please note that I need to subset DT instead of DTna.
Question
So far, the above code works as expected, but is there a more elegant way using data.table?
I would add a grouping column to DT to join and filter on:
DT[, nag := do.call(paste0, lapply(.SD, function(x) +is.na(x)))]
nagDT = DT[, .N, by=nag][order(-N), nagid := .I][, setorder(.SD, nagid)]
# nag N nagid
# 1: 10000 4 1
# 2: 00000 2 2
# 3: 00010 2 3
# 4: 11000 1 4
# 5: 01000 1 5
# subsetting
my_id = 1L
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0]
which gives
V1 V2 V3 V4 V5 nag
1: NA 1.3306093 -2.1030978 0.06115726 -0.2527502 10000
2: NA 0.2852518 -0.1894425 0.86698633 -0.2099998 10000
3: NA -0.1325032 -0.5201166 -0.94392417 0.6515976 10000
4: NA 0.3199076 -1.0152518 -1.61417902 -0.6458374 10000
If you want to omit the new column in the result:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"]
And to also omit the blank columns:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"][,
Filter(function(x) !anyNA(x), .SD)]
An alternative, which is undoubtedly inferior (but nonetheless provided for variety), is
DT[, patCnt := setDT(stack(transpose(DT)))[,
paste(+(is.na(values)), collapse=""), by="ind"][,
patCnt := .N, by=(V1)]$patCnt]
which returns
DT
V1 V2 V3 V4 V5 patCnt
1: NA NA -1.5062011 -0.9846015 0.12153714 1
2: 1.4176784 -0.08078952 -0.8101335 0.6437340 -0.49474613 2
3: NA -0.08410076 -1.1709337 -0.9182901 0.67985806 4
4: 0.2104999 NA -0.1458075 0.8192693 0.05217464 1
5: NA -0.73361504 2.1431392 -1.0041705 0.29198857 4
6: 0.3841267 -0.75943774 0.6931461 -1.3417511 -1.53291515 2
7: -0.8011166 0.26857593 1.1249757 NA -0.57850361 2
8: -1.5518674 0.52004986 1.6505470 NA -0.34061924 2
9: NA 0.83135928 0.9155882 0.1856450 0.31346976 4
10: NA 0.60328545 1.3042894 -0.5835755 -0.17132227 4
Then subset
DT[patCnt == max(patCnt)]
V1 V2 V3 V4 V5 patCnt
1: NA -0.08410076 -1.1709337 -0.9182901 0.6798581 4
2: NA -0.73361504 2.1431392 -1.0041705 0.2919886 4
3: NA 0.83135928 0.9155882 0.1856450 0.3134698 4
4: NA 0.60328545 1.3042894 -0.5835755 -0.1713223 4
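The pattern-then-rank idea from the answers can also be sketched on a tiny table with fixed values, which makes the result easy to verify. The column names `pattern`, `n`, and `pat_rank` are made up for this example; `frank()` with `ties.method = "dense"` gives equally frequent patterns the same rank, which is an assumption about how ties should be handled.

```r
library(data.table)

# Toy data (assumed values): three rows are missing V1 only
DT <- data.table(V1 = c(NA, 1, NA, 2, NA),
                 V2 = c(1, NA, 2, 3, 4))

# Encode each row's missingness as a string such as "10" (V1 missing, V2 present)
DT[, pattern := do.call(paste0, lapply(.SD, function(x) as.integer(is.na(x))))]

# Frequency of each pattern, then a dense rank with 1 = most frequent
DT[, n := .N, by = pattern]
DT[, pat_rank := frank(-n, ties.method = "dense")]

# Subset the original rows for the chosen pattern, dropping the helper columns
top <- DT[pat_rank == 1L, !c("pattern", "n", "pat_rank")]
top
```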

create column in data.table depending on its values

I have a single column in a data.table:
library(data.table)
DT <- data.table(con=c(1:5))
The result should be a data.table with a new column x calculated as follows: the first value of x should be the first value of con (here: 1), the second value should be the second value of con multiplied by the first value of x, the third value of x is the third value of con multiplied by the second value of x, and so on. Result:
DT <- data.table(con=c(1:5), x = c(1,2,6,24,120))
I tried using shift() but it did not help. Below are some lines of my code:
DT <- data.table(con=c(1:5))
DT[, x := shift(con,1, type = "lead")]
DT[, x := shift(x, 1)]
DT[, x := con * x]
You are looking for cumprod
DT[,x:=cumprod(con)]
DT
con x
1: 1 1
2: 2 2
3: 3 6
4: 4 24
5: 5 120
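To see why cumprod matches the description in the question, the recurrence x[i] = con[i] * x[i - 1] can be written out as an explicit loop and compared. This is a base R sketch for illustration only; the loop is not something you would use in practice.

```r
con <- 1:5

# Explicit recurrence: x[1] = con[1], x[i] = con[i] * x[i - 1]
x <- numeric(length(con))
x[1] <- con[1]
for (i in 2:length(con)) {
  x[i] <- con[i] * x[i - 1]
}

# The loop and cumprod() produce the same values
same <- isTRUE(all.equal(x, cumprod(con)))
same
```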
We can use the accumulate function from the purrr package.
library(data.table)
library(purrr)
DT <- data.table(con=c(1:5))
DT[, x := accumulate(con, `*`)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
Or the Reduce function from base R.
DT <- data.table(con=c(1:5))
DT[, x:= Reduce(`*`, con, accumulate = TRUE)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
