I have a data table similar to the one obtained with the following command:
dt <- data.table(
  time = 1:8,
  part = rep(c(1, 1, 2, 2), 2),
  type = rep(c('A', 'B'), 4),
  data = rep(c(runif(1), 0), 4))
Basically, such a table contains two different types of instances (A or B). The time column contains a timestamp for when a request arrived at or left a certain part. If the instance type is A, the timestamp states the arrival time (enter), and if the type is B, the timestamp states the leaving time (exit).
time part type data
1: 1 1 A 0.5842668
2: 2 1 B 0.0000000
3: 3 2 A 0.5842668
4: 4 2 B 0.0000000
5: 5 1 A 0.5842668
6: 6 1 B 0.0000000
7: 7 2 A 0.5842668
8: 8 2 B 0.0000000
I would like to pair A and B instances, and obtain the following data table:
part data enter.time exit.time
1: 1 0.5842668 1 2
2: 1 0.5842668 5 6
3: 2 0.5842668 3 4
4: 2 0.5842668 7 8
I have tried the following:
pair.types <- function(x) {
  a.type <- x[type == 'A']
  b.type <- x[type == 'B']
  return(data.table(
    enter.time = a.type$time,
    exit.time = b.type$time,
    data = a.type$data))
}
dt[, c('enter.time', 'exit.time', 'data') := pair.types(.SD), by = list(part)]
But, that gives me the following, which is not exactly what I want:
time part type data enter.time exit.time
1: 1 1 A 0.5842668 1 2
2: 2 1 B 0.5842668 5 6
3: 3 2 A 0.5842668 3 4
4: 4 2 B 0.5842668 7 8
5: 5 1 A 0.5842668 1 2
6: 6 1 B 0.5842668 5 6
7: 7 2 A 0.5842668 3 4
8: 8 2 B 0.5842668 7 8
It is kind of close, but since the type column is kept, some rows are duplicated. Perhaps I could remove the time and type columns and then drop the second half of the rows, but I am not sure whether that would work in all cases, and I would like to learn a better way to do this operation.
Assuming your data looks like your example data:
dt[, list(part = part[1],
          data = data[1],
          enter.time = time[1],
          exit.time = time[2]),
   by = as.integer((seq_len(nrow(dt)) + 1)/2)]
#  by = rep(seq(1, nrow(dt), 2), each = 2)]
#  ^^^ a slightly shorter and a little more readable alternative
The idea is very simple: group the rows in pairs (that's the by part), so that each group is one A and one B; then, for each group, take the first part and the first data, and the enter and exit times are just the first and second time respectively. This is likely how you'd do it by hand, which makes it easy to read (once you know just a tiny bit about how data.table works).
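A small variant derives the grouping from the data itself rather than from row positions (a sketch, assuming each A row is immediately followed by its matching B row): cumsum(type == 'A') numbers the pairs just as well, and pair is a helper name introduced here.
dt[, list(part = part[1], data = data[1],
          enter.time = time[type == 'A'], exit.time = time[type == 'B']),
   by = list(pair = cumsum(type == 'A'))][, pair := NULL][]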
Another way:
setkey(dt, "type")
dt.out <- cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
# part data entry.time exit.time
# 1: 1 0.5842668 1 2
# 2: 2 0.5842668 3 4
# 3: 1 0.5842668 5 6
# 4: 2 0.5842668 7 8
If you want you can now do setkey(dt.out, "part") to get the same order.
The idea: your problem seems like a simple "reshaping" one to me. The way I've approached it is to first create a key column on type. Now we can subset the data.table for a specific value of the key column with dt[J("A")]; this returns every column for the rows where type is "A". Since you want the time column renamed, I explicitly mention which columns to subset:
dt[J("A"), list(part, data, entry.time = time)]
Of course this will also return the type column (with the by-without-by behaviour of older data.table versions, the join column is included in the result), which we have to remove. So I've added [, type := NULL] to remove the type column by reference.
Now we have the first part. All we need is the exit.time, which can be obtained similarly:
dt[J("B"), list(time)] # I don't name the column here
But this gives a data.table, whereas you need just the time column, which can be accessed by:
dt[J("B"), list(time)]$time
So, while using cbind I name this column as exit.time to get the final result as:
cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
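As an aside, since this is a reshape at heart, on newer data.table versions the same result can be spelled with dcast (a sketch, assuming data.table >= 1.9.8 for rowid(); pair and out are helper names introduced here to number the A/B pairs within each part and hold the result):
dt[, pair := rowid(part, type)]  # 1st, 2nd, ... occurrence of each part/type combination
out <- dcast(dt, part + pair ~ type, value.var = 'time')
setnames(out, c('A', 'B'), c('enter.time', 'exit.time'))
out[dt[type == 'A'], data := i.data, on = .(part, pair)]  # carry over the A-row data values
out[, pair := NULL][]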
Hope this helps.
Related
I need to calculate a formula in a data frame. Each set of values across a few columns has to be, let's say for simplicity's sake, aggregated. However, I do not want a calculation across all rows; I want to combine each set with another set selected by a condition elsewhere in the table.
This is what I mean:
I have a data.table.
data <- data.table(A = c("a", "c", "b", "b", "a"),
                   B = 1:5,
                   C = 1:5)
setorder(data, A)
> data
A B C
1: a 1 1
2: a 5 5
3: b 3 3
4: b 4 4
5: c 2 2
In column D I need an aggregate of the values in B and C with the values of B and C from the rows where A is "a". As I have more than one "a" row, multiple aggregates are needed, and from these aggregates the minimum should be written in.
Here is an example.
For row 1: (1+1)+(1+1) = 4 and (5+5)+(1+1) = 12, so 4 is the minimum: D1 = 4.
For row 3: (3+3)+(1+1) = 8 and (3+3)+(5+5) = 16, so D3 = 8. And so on.
This is what I expect
> data_new
A B C D
1: a 1 1 4
2: a 5 5 12
3: b 3 3 8
4: b 4 4 10
5: c 2 2 6
I tried this and ran into issues.
for (i in data)data[i, D:=(min((data[i,B+C]) + (data[a=="a",(B+C)])))]
The expression below for the minimum selection works fine on its own: when I substitute a row number for i, min() gets the list of sums and returns the proper value. Below, the answer is 8.
min((data[3,B+C]) + (data[A=="a",(B+C)]))
My previous attempts involved grid.expansion() and intersection(). However, with the size of my data set I ran into memory issues and RStudio quit on me. As a side note, I need to run the calculations because I cannot project the smallest outcome by "a" beforehand: the values are a set of coordinates and they do not correlate with the magnitude of the answer.
Any suggestion where my glaring issue is?
You can store the values of B + C from the rows where A is 'a' in a variable (val). For each row, you can then take the minimum of B + C + val:
library(data.table)
val <- data[A =='a', B + C]
data[, D := min(B + C + val), seq_len(nrow(data))]
data
# A B C D
#1: a 1 1 4
#2: a 5 5 12
#3: b 3 3 8
#4: b 4 4 10
#5: c 2 2 6
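Since B + C is constant within a row, min(B + C + val) is simply B + C + min(val), so the row-wise grouping can be dropped entirely (a sketch using the same val as above):
data[, D := B + C + min(val)]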
You can also use lapply:
data[, D := lapply(B + C, function(x) min(x + val))]
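Note that lapply() makes D a list column; sapply() does the same computation but returns a plain numeric vector:
data[, D := sapply(B + C, function(x) min(x + val))]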
Another option is to replicate the 'a' rows after taking the min of 'B' and 'C', and then do a direct + with the 'B' and 'C' columns. The advantage is that we don't have to group or loop:
library(data.table)
Reduce(`+`, (data[A == 'a', .(B = min(B), C = min(C))][rep(seq_len(.N), nrow(data))] + data[, .(B, C)]))
#[1] 4 12 8 10 6
Or in a single line:
data[, D := B + C + min(B[A == 'a']) + min(C[A == 'a'])]
data$D
#[1] 4 12 8 10 6
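One caveat with these last two variants: they take the minima of B and C over the 'a' rows separately, which equals the minimum of B + C only when a single 'a' row minimizes both columns (as happens in this example). To minimize the sum itself, a small adjustment:
data[, D := B + C + min((B + C)[A == 'a'])]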
I have a data.table, proce, where each line defines a "special procedure". Now, I have another data.table with the patient procedures, codes. For each person, I want to extract the indexes of "special procedures" that match with his/her procedures (if they have any). Here is an example:
library(data.table)
proce <- data.table(v1 = c('o09513', 'o721', 'o701', 'z370'),
                    v2 = c('0w8nxzz', '10d07z6', '0tqd7zz', '0uqg0zz'),
                    v3 = c('3e030vj', '3e033vj', NA, NA))
codes <- data.table(a1 = c(list(c('o721', '10d07z6', '3e033vj')),
                           list(c('z370', '0uqg0zz', '0tqd7zz', 'o701')),
                           list(c('o09513', 'o721', 'o701', 'z370', '0uqg8zz'))))
> proce
v1 v2 v3
1: o09513 0w8nxzz 3e030vj
2: o721 10d07z6 3e033vj
3: o701 0tqd7zz <NA>
4: z370 0uqg0zz <NA>
> codes
a1
1: o721,10d07z6,3e033vj
2: z370,0uqg0zz,0tqd7zz,o701
3: o09513,o721,o701,z370,0uqg8zz
My implementation is below, but since both tables have hundreds of thousands of lines, it's slow.
index_procedures <- list()
for(i in 1:nrow(codes)){ # i <- 2
a2 <- unlist(codes[i,a1])
index_procedures[[i]] <- which(apply(proce[,.(v1,v2,v3)], 1,function(x) all(x[!is.na(x)] %in% a2)))
}
index_procedures
> index_procedures
[[1]]
[1] 2
[[2]]
[1] 3 4
[[3]]
integer(0)
If I understand correctly,
codes contains procedure steps which have been applied to a patient. One row in codes refers to one patient.
proce contains procedure steps which constitute a special procedure.
The OP wants to identify which special procedures have been applied to each patient (if any). Thereby, a special procedure is only considered to have been applied to a patient if all of its procedure steps have been applied.
To solve this, I suggest reshaping all data into a tidy, i.e. long, format first.
Then we can join on procedure steps, filter for complete special procedures, and aggregate to get one row per patient:
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
, n_steps := .N, by = pid][]
lp[lc, on = .(step)][
, .N == first(n_steps), by = .(cid, pid)][
(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
Note that the pids are shown in a condensed form for demonstration only; other output formats are available as well depending on subsequent processing steps.
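For instance, if subsequent steps need the matches as a list column (like the OP's index_procedures), the final aggregation can return a list instead of a string; a sketch:
lp[lc, on = .(step)][
  , .N == first(n_steps), by = .(cid, pid)][
  (V1), .(pid = list(sort(pid))), by = cid]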
If it is required to show all patients even if they have not received a special procedure:
lp[lc, on = .(step)][, .N == first(n_steps), by = .(cid, pid)][
V1 | is.na(V1), .(pid = toString(sort(pid))), by = cid]
cid pid
1: 1 2
2: 2 3, 4
3: 3
Commented code
# reshape data to long format, thereby adding a row number to identify patients
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
# reshape data to long format, thereby adding a row number to identify special procedures
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
# count the number of procedure steps which constitute a special procedure
, n_steps := .N, by = pid][]
# join on procedure steps
lp[lc, on = .(step)][
# group by patient and special procedure and test for completeness of steps
, .N == first(n_steps), by = .(cid, pid)][
# filter for complete special procedures and aggregate to get one row per patient
(V1), .(pid = toString(sort(pid))), by = cid]
After reshaping, lc is
cid step
1: 1 o721
2: 1 10d07z6
3: 1 3e033vj
4: 2 z370
5: 2 0uqg0zz
6: 2 0tqd7zz
7: 2 o701
8: 3 o09513
9: 3 o721
10: 3 o701
11: 3 z370
12: 3 0uqg8zz
and lp is
pid variable step n_steps
1: 1 v1 o09513 3
2: 2 v1 o721 3
3: 3 v1 o701 2
4: 4 v1 z370 2
5: 1 v2 0w8nxzz 3
6: 2 v2 10d07z6 3
7: 3 v2 0tqd7zz 2
8: 4 v2 0uqg0zz 2
9: 1 v3 3e030vj 3
10: 2 v3 3e033vj 3
I'm not sure about performance, but the following code might be an alternative:
pl <- split(as.matrix(proce), seq_len(nrow(proce)))  # one vector of steps per special procedure
pl <- lapply(pl, na.omit)                            # drop the NA padding
codes[, indexes := lapply(a1, function(x)
  which(unlist(lapply(pl, function(p) all(p %in% x)))))]
I need to put a number on the first (or a random) item in each group.
I do the following:
item <- sample(c("a", "b", "c"), 30, replace = TRUE)
week <- rep(c("1", "2", "3"), 10)
volume <- 1:30
DT <- data.table(item, week, volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[,.SD[1], by= list(item,week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
The sampleDT comes out correct, but the last line puts 5 in newCol on every row instead of just the conditioned ones.
What am I doing wrong?
I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
Also there is a pending FR to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]
The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
A simpler solution (I think) would be using !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, by = c("item", "week")), newCol := 5]
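The question also mentions marking a random item per group; the same .I idiom extends to that case (a sketch; set.seed() is only for reproducibility):
set.seed(1)
DT[DT[, .I[sample(.N, 1)], by = list(item, week)]$V1, newCol := 5]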
When joining two data.tables, I can specify the table I want the column from, like
X[Y, i.id] # `id` is taken from Y
My problem is that I have a big table with ~80 columns. Every night a data refresh happens and, according to some parameters, some rows get replaced by a new version of the table (same table, just new data).
current <- data.table(id=1:4, var=1:4, var2=1:4, key="id")
new <- data.table(id=1:4, var=11:14, var2=11:14, key="id")
current[new[c(1,3)], `:=`(var=i.var, var2=i.var2)]
> current
id var var2
1: 1 11 11
2: 2 2 2
3: 3 13 13
4: 4 4 4
As I said, in my real case I have many more columns, so (besides rbind()ing pieces of the two tables) I wonder how I can assign all the columns from the data.table used as the i argument of the join. I could spend half an hour hard-coding all of them, but it wouldn't be maintainable code (in case new columns get added to the tables in the future).
How about constructing the j-expression and just eval'ing it?
nc = names(current)[-1L]
nn = paste0("i.", nc)
expr = lapply(nn, as.name)
setattr(expr, 'names', nc)
expr = as.call(c(quote(`:=`), expr))
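For the two-column example, expr should now hold the call `:=`(var = i.var, var2 = i.var2), i.e. exactly the j-expression one would have typed by hand: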
> current[new[c(1,3)], eval(expr)]
> current
## id var var2
## 1: 1 11 11
## 2: 2 2 2
## 3: 3 13 13
## 4: 4 4 4
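Another idiom often seen for this task builds the right-hand side with mget() on the i.-prefixed column names (a sketch, assuming a data.table version recent enough to expose the i.* columns in j; cols is a helper name introduced here):
cols <- setdiff(names(current), 'id')  # every non-key column
current[new[c(1, 3)], (cols) := mget(paste0('i.', cols))]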
I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
# 1 2
# 1 4
# 2 1
# 2 3
# 3 5
# 4 2
The mean for id 1 is 3 and for id 2 is 2, and these means would then replace each of the original var values.
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to actually check whether the ids are duplicated; you don't need to, since for non-duplicated ids you can just use the mean of the single var value:
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
There is no need to treat unique values differently from duplicated values, as the mean of a single value is that value.
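To keep the original column name instead of the default V1, name the aggregate in j:
z[, list(var = mean(var)), by = id]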
A base R alternative uses aggregate():
zt <- aggregate(var ~ id, data = z, mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2