How to avoid redundant calculation within data.table?

I need to find the two unique minima in the data column of a data.table, by the two ids id1 and id2:
library(data.table)
n <- 12
set.seed(1234)
id1 <- rep(1:2, each = 6)
id2 <- rep(1:6, each = 2)
data <- 100 + 100 * rnorm(n)
dt <- data.table(id1 = id1, id2 = id2, data = data)
Below is the function that, given the second id id2 and the values, calculates the two unique minima at once and returns them as a named vector:
library(memoise)

detect_two_lower <- function(ids, values){
  dt <- data.table(ids, values)
  dt <- dt[, .(V1 = min(values, na.rm = TRUE)), by = ids][order(V1)]
  min_1 <- dt$V1[1]
  min_2 <- dt$V1[2]
  nn <- c(min_1 = min_1, min_2 = min_2)
}
detect_two_lower <- memoise(detect_two_lower)
Then apply the function to the data.table, grouping by id1:
dt[, `:=`(min_1 = detect_two_lower(id2, data)[1],
          min_2 = detect_two_lower(id2, data)[2]),
   by = id1]
The calculation runs as expected (see below). Note, however, that the code calls detect_two_lower twice with the same arguments. As a workaround I tried to minimize the redundant work with memoise, but I would like to avoid this patch.
Is there a better way to accomplish the same result?
dt
id1 id2 data min_1 min_2
1: 1 1 -20.7065749 -134.5697703 -20.70657
2: 1 1 127.7429242 -134.5697703 -20.70657
3: 1 2 208.4441177 -134.5697703 -20.70657
4: 1 2 -134.5697703 -134.5697703 -20.70657
5: 1 3 142.9124689 -134.5697703 -20.70657
6: 1 3 150.6055892 -134.5697703 -20.70657
7: 2 4 42.5260040 0.1613555 10.99622
8: 2 4 45.3368144 0.1613555 10.99622
9: 2 5 43.5548001 0.1613555 10.99622
10: 2 5 10.9962171 0.1613555 10.99622
11: 2 6 52.2807300 0.1613555 10.99622
12: 2 6 0.1613555 0.1613555 10.99622

Return a list from the function
library(data.table)
detect_two_lower <- function(ids, values){
  dt <- data.table(ids, values)
  dt <- dt[, .(V1 = min(values, na.rm = TRUE)), by = ids][order(V1)]
  as.list(dt$V1)
}
So you can assign them directly:
dt[, c('min_1', 'min_2') := detect_two_lower(id2, data), id1]
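Alternatively, keeping the question's original vector-returning detect_two_lower, the function can be called once per group inside a braces block and its result split into the two columns; this is just a minimal sketch of that idea:
dt[, c("min_1", "min_2") := {
  v <- detect_two_lower(id2, data)  # evaluated once per id1 group
  .(v[[1]], v[[2]])
}, by = id1]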

Related

Search indexes in data.table R

I have a data.table, proce, where each line defines a "special procedure". Now I have another data.table with the patient procedures, codes. For each person, I want to extract the indexes of the "special procedures" that match his/her procedures (if there are any). Here is an example:
library(data.table)
proce <- data.table(v1 = c('o09513','o721','o701','z370'),
                    v2 = c('0w8nxzz','10d07z6','0tqd7zz','0uqg0zz'),
                    v3 = c('3e030vj','3e033vj',NA,NA))
codes <- data.table(a1 = c(list(c('o721','10d07z6','3e033vj')),
                           list(c('z370','0uqg0zz','0tqd7zz','o701')),
                           list(c('o09513','o721','o701','z370','0uqg8zz'))))
> proce
v1 v2 v3
1: o09513 0w8nxzz 3e030vj
2: o721 10d07z6 3e033vj
3: o701 0tqd7zz <NA>
4: z370 0uqg0zz <NA>
> codes
a1
1: o721,10d07z6,3e033vj
2: z370,0uqg0zz,0tqd7zz,o701
3: o09513,o721,o701,z370,0uqg8zz
My implementation is below, but since both tables have hundreds of thousands of lines, it's slow.
index_procedures <- list()
for (i in 1:nrow(codes)) {
  a2 <- unlist(codes[i, a1])
  index_procedures[[i]] <- which(apply(proce[, .(v1, v2, v3)], 1,
                                       function(x) all(x[!is.na(x)] %in% a2)))
}
index_procedures
> index_procedures
[[1]]
[1] 2
[[2]]
[1] 3 4
[[3]]
integer(0)
If I understand correctly,
codes contains procedure steps which have been applied to a patient. One row in codes refers to one patient.
proce contains procedure steps which constitute a special procedure.
The OP wants to identify which special procedures have been applied to each patient (if any). A special procedure only counts as applied to a patient if all of its procedure steps have been applied.
To solve this, I suggest reshaping all the data into a tidy, i.e. long, format first.
Then we can join on procedure steps, filter for complete special procedures, and aggregate to get one row per patient:
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
, n_steps := .N, by = pid][]
lp[lc, on = .(step)][
, .N == first(n_steps), by = .(cid, pid)][
(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
Note that the pids are shown in a condensed form for demonstration only; other output formats are available as well depending on subsequent processing steps.
If it is required to show all patients even if they have not received a special procedure:
lp[lc, on = .(step)][, .N == first(n_steps), by = .(cid, pid)][
V1 | is.na(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
3: 3
Commented code
# reshape data to long format, thereby adding a row number to identify patients
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
# reshape data to long format, thereby adding a row number to identify special procedures
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
# count the number of procedure steps which constitute a special procedure
, n_steps := .N, by = pid][]
# join on procedure steps
lp[lc, on = .(step)][
# group by patient and special procedure and test for completeness of steps
, .N == first(n_steps), by = .(cid, pid)][
# filter for complete special procedures and aggregate to get one row per patient
(V1), .(pid = toString(sort(pid))), by = cid]
After reshaping, lc is
cid step
1: 1 o721
2: 1 10d07z6
3: 1 3e033vj
4: 2 z370
5: 2 0uqg0zz
6: 2 0tqd7zz
7: 2 o701
8: 3 o09513
9: 3 o721
10: 3 o701
11: 3 z370
12: 3 0uqg8zz
and lp is
pid variable step n_steps
1: 1 v1 o09513 3
2: 2 v1 o721 3
3: 3 v1 o701 2
4: 4 v1 z370 2
5: 1 v2 0w8nxzz 3
6: 2 v2 10d07z6 3
7: 3 v2 0tqd7zz 2
8: 4 v2 0uqg0zz 2
9: 1 v3 3e030vj 3
10: 2 v3 3e033vj 3
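If the result needs to match the list format of index_procedures from the question, here is a small sketch building on the lc and lp tables above:
res <- lp[lc, on = .(step)][, .(complete = .N == first(n_steps)), by = .(cid, pid)]
index_procedures <- lapply(seq_len(nrow(codes)),
                           function(i) sort(res[cid == i & complete, pid]))
This reproduces integer(0) for patients without any complete special procedure.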
I'm not sure about performance, but the following code might be an alternative:
pl <- split(as.matrix(proce), seq_len(nrow(proce)))
pl <- lapply(pl, na.omit)
codes[, indexes := lapply(a1, function(x) which(unlist(lapply(pl, function(p) all(p %in% x)))) )]

Quick search in data.table or quick subset

I have a DF with 800k+ rows with repeated (random) values. For each row I need to take its value and find the index(es) of other row(s) with the same value. E.g. "asd" - where else do I see it? The index of the current row itself is NOT needed.
My current solution: subset the DF and create a temporary frame/table by removing the current row. Problem: it takes a minute per 1000 iterations, so 800k+ rows would take about 13 hours to run. Any ideas? Thanks!
Running on original DF (not subsetted) is < 1 second, but as you can imagine it gives me the index of the current row.
Edit: My real-life DF has more than one column; the example below is simplified. I need to take V1[1] and get the row numbers of the other rows where V1 has the value of V1[1], then repeat for V1[2] and so on for each row.
library(data.table)
library(fastmatch)
library(stringi)
set.seed(12345)
V1 = stringi::stri_rand_strings(800000, 3)
df0 = as.data.table(V1)
mapped = matrix("", nrow = 800000)
print(Sys.time())
for (i in 1:1000) {
  tmp_df = df0[-i, ]                      # This takes very long time!!!
  mapped[i] = fmatch(df0$V1[i], tmp_df$V1)
}
print(Sys.time())
View(mapped)
Data:
library("data.table")
set.seed(12345)
V1 = stringi::stri_rand_strings(80, 3)
df0 <- data.table( sample(V1, 100, replace = TRUE ))
Code:
df0[, id := list(list(.I)), by = V1] # integer id
Output:
head(df0, 10)
# V1 id
# 1: iuR 1,2,21
# 2: iuR 1,2,21
# 3: KXc 3
# 4: LwA 4
# 5: pYn 5
# 6: qoN 6,66
# 7: 5Xt 7
# 8: wBH 8,77
# 9: V9r 9,39,54
# 10: 9ks 10,28,42,48
EDIT - Removed Current Index:
df0[, id2 := 1:.N ]
df0[, id := list(list(unlist(id)[ unlist(id) != .I ] )), by = id2 ]
df0[, id2 := NULL ]
df0[ lengths(id) > 0, ]
head( df0, 10 )
# V1 id
# 1: iuR 2,21
# 2: iuR 1,21
# 3: KXc
# 4: LwA
# 5: pYn
# 6: qoN 66
# 7: 5Xt
# 8: wBH 77
# 9: V9r 39,54
# 10: 9ks 28,42,48
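As a sketch of an equivalent approach that excludes the current row in a single grouped pass (using the same df0 as above):
df0[, idx := .I]                                                # global row number
df0[, id := lapply(idx, function(i) setdiff(idx, i)), by = V1]  # other rows with the same V1
df0[, idx := NULL]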

Create column in data.table depending on its values

I have got a single column in a data.table:
library(data.table)
DT <- data.table(con=c(1:5))
The desired result is a data.table with a new column x calculated as follows: the first value should be the first value of con (here: 1); the second value should be the second value of con multiplied by the first value of x; the third value of x should be the third value of con multiplied by the second value of x, and so on. Result:
DT <- data.table(con=c(1:5), x = c(1,2,6,24,120))
I tried using shift but it did not help; below are some lines of my code:
DT <- data.table(con=c(1:5))
DT[, x := shift(con,1, type = "lead")]
DT[, x := shift(x, 1)]
DT[, x := con * x]
You are looking for cumprod
DT[,x:=cumprod(con)]
DT
con x
1: 1 1
2: 2 2
3: 3 6
4: 4 24
5: 5 120
We can use the accumulate function from the purrr package.
library(data.table)
library(purrr)
DT <- data.table(con=c(1:5))
DT[, x := accumulate(con, `*`)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
Or the Reduce function from base R.
DT <- data.table(con=c(1:5))
DT[, x:= Reduce(`*`, con, accumulate = TRUE)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
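For reference, the recurrence described in the question can also be written as an explicit loop; this sketch is slower than cumprod and only illustrates the definition x[i] = con[i] * x[i-1]:
DT <- data.table(con = 1:5)
x <- numeric(nrow(DT))
x[1] <- DT$con[1]                                   # first value of x is the first value of con
for (i in 2:nrow(DT)) x[i] <- DT$con[i] * x[i - 1]  # each value multiplies con by the previous x
DT[, x := x]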

Error using function list in dcast.data.table

I am trying to reshape data using dcast.data.table, but when I use a predefined function list, dcast.data.table throws an error.
require(data.table)
require(Hmisc)
n <- 2
contributors <- 1:2
dates <- 2
DT <- data.table(ID = rep(rep(1:n, contributors), each = dates))
DT[, contributor := c(1,1,2,2,2,3)]
DT[, date := c(1,2,1,1,2,2)]
DT[, amount := rnorm(.N)]
DT[, rate := c(1,1,1,3,3,4)]
DT
# ID contributor date amount rate
# 1: 1 1 1 -1.3888607 1
# 2: 1 1 2 -0.2787888 1
# 3: 2 2 1 -0.1333213 1
# 4: 2 2 1 0.6359504 3
# 5: 2 2 2 -0.2842529 3
# 6: 2 3 2 -2.6564554 4
var.list <- as.list(Cs(amount, rate))
collapse <- function(x) paste(x, collapse = ',')
fun.list <- list(sum, collapse)
dcast.data.table(data = DT, ID + contributor ~ date,
fun.aggregate = fun.list,
value.var = var.list, fill = NA)
# Error in aggregate_funs(fun.call, lvals, sep, ...) :
# When 'fun.aggregate' and 'value.var' are both lists, 'value.var' must be either of length =1 or =length(fun.aggregate).
But the lengths are equal:
length(var.list) == length(fun.list)
# [1] TRUE
When fun.aggregate is defined directly in the dcast call, there isn't any problem:
dcast.data.table(data = DT, ID + contributor ~ date,
fun.aggregate = list(sum, collapse),
value.var = var.list, fill = NA)
# ID contributor amount_sum_1 amount_sum_2 rate_collapse_1 rate_collapse_2
# 1: 1 1 -1.3888607 -0.2787888 1 1
# 2: 2 2 0.5026291 -0.2842529 1,3 3
# 3: 2 3 NA -2.6564554 NA 4
I would like to know why this is happening and how I could bypass this error, so that I can use a predefined function list in dcast.data.table.
For what it's worth, you can build the call to dcast by hand, using substitute() to pass the user-provided list literal on to dcast, like so:
z = as.data.table(expand.grid(a = LETTERS[1:3], b = 1:3, c = 5:6, d = 3:4,
                              stringsAsFactors = FALSE))[sample(36, 9)]
myfun = function(DT, fmla, funs, vars)
  do.call("dcast", list(DT, fmla, fun = substitute(funs), value.var = vars))
myfun(z, a ~ ., list(sum, mean), list('c', 'd'))
> a c_sum d_mean
> 1: A 24 3.500000
> 2: B 10 3.500000
> 3: C 18 3.333333
However, your user (i.e. whoever calls myfun() in this example) will have to provide a list literal, as this doesn't get around the internals of dcast, which walk the AST of the argument passed to fun.aggregate and expect a list literal.
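Another way to sidestep the AST inspection, assuming the same DT and collapse as defined in the question, is a sketch that calls dcast once per function/value pair and merges the results (the column names differ slightly from the combined call):
d1 <- dcast(DT, ID + contributor ~ date, fun.aggregate = sum,      value.var = "amount", fill = NA)
d2 <- dcast(DT, ID + contributor ~ date, fun.aggregate = collapse, value.var = "rate",   fill = NA)
merge(d1, d2, by = c("ID", "contributor"), suffixes = c("_amount_sum", "_rate_collapse"))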

Remove constant columns with or without NAs

I am trying to get many lm models to work inside a function, and I need to automatically drop constant columns from my data.table. That is, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have only two distinct entries: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y, which has only one unique value when NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
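If the constant columns should instead be dropped from df by reference rather than by building a new table, here is a minimal sketch combining uniqueN with := NULL (same df as above):
const_cols <- names(df)[sapply(df, function(v) uniqueN(v, na.rm = TRUE) <= 1)]  # y, z, d
if (length(const_cols)) df[, (const_cols) := NULL]                              # drop by reference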
There is a simple solution with the base R function Filter:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# Select only columns for which SD is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[, which(df[, unlist(sapply(.SD, function(x) length(unique(x[!is.na(x)])) > 1))]),
   with = FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique values other than NA, and keep only the columns that have more than one.
If you really mean dropping those columns, here is a solution:
library(data.table)
dt <- data.table(x=c(1,2,3,NA,5),
y=c(1,1,NA,NA,NA),
z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
for (col in names(copy(dt))) {
  v = var(dt[[col]], na.rm = TRUE)
  if (v == 0 | is.na(v)) dt[, (col) := NULL]
}
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
Final code:
same <- sapply( df, function(.col){ all( is.na(.col) | .col[1L] == .col ) } )
df1 <- df[,!same, with=F]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
For removing constant columns with pandas (note that this answer is Python, not R):
Numeric columns:
constant_col = [const for const in df.columns if df[const].std() == 0]
print(len(constant_col))
print(constant_col)
Categorical columns:
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print(len(constant_col))
print(constant_col)
Then you can drop those columns using the DataFrame's drop method.
library(janitor)
df %>%
remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5
