Match datasets on a subset of conditions in R

When matching two datasets, is it possible to specify the matching such that an observation from the first dataset is matched to the second if at least one of the conditions is met?
Let's say I have the following 2 data.tables:
dt1 <- data.table(c1 = c(rep('a', 2), rep('b', 2), rep('c', 2)),
                  c2 = c('x', 'y', 'x', 'y', 'x', 'z'),
                  c3.min = c(rep(0, 3), rep(-1, 3)),
                  c3.max = c(rep(10, 3), rep(11, 3)),
                  x = 1:6)
dt2 <- data.table(c1 = c(rep('a', 3), rep('b', 3), rep('c', 4)),
                  c2 = rep(c('x', 'y'), 5),
                  c3 = c(-1, 2, 0, 10, 11, -1, 3, 6, 3, 12),
                  y = 1:10)
I have 3 conditions based on which I want to match dt1 to dt2, and the 3rd condition is a range. If I just do a normal merge on these 3 conditions, I will get:
> dt2[dt1, on=.(c1,
+ c2,
+ c3 <= c3.max,
+ c3 >= c3.min), nomatch=NA ]
c1 c2 c3 y c3.1 x
1: a x 10 3 0 1
2: a y 10 2 0 2
3: b x 10 NA 0 3
4: b y 11 4 -1 4
5: b y 11 6 -1 4
6: c x 11 7 -1 5
7: c x 11 9 -1 5
8: c z 11 NA -1 6
As you can see, the observations from dt1 with x=3 and x=6 aren't matched. My main concern is to find at least one match for as many observations in dt1 as possible, even if I have to relax some conditions. So I want to know if there is any way to perform a match where dt1 matches dt2 on at least 1 out of the 3 conditions.
I could write a loop, but in reality my two datasets are much bigger than this (the first has 10K observations and the second has 300K), and I have 4 conditions in total, so I'm looking for a more efficient way.
Thanks!

My first instinct with this type of problem would be to use the sqldf package, since we need to join using OR conditions, not AND conditions.
library(sqldf)
names(dt1) <- c("c1", "c2", "c3_min", "c3_max", "x") # need to get rid of the "."
query1 <- "select * from dt1
left join dt2
on (dt1.c1 = dt2.c1) or (dt1.c2 = dt2.c2) or (dt2.c3 between dt1.c3_min and dt1.c3_max)"
sqldf(query1)
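If you then want just a single best candidate per row of dt1, one possible follow-up is to count how many of the three conditions each candidate satisfies and keep a row attaining the maximum. This is only a sketch: it leans on SQLite's documented behaviour of returning the other selected columns from a row where max() is attained, and it breaks ties arbitrarily.
query2 <- "select dt1.*, dt2.c3, dt2.y,
                  max((dt1.c1 = dt2.c1) + (dt1.c2 = dt2.c2) +
                      (dt2.c3 between dt1.c3_min and dt1.c3_max)) as n_met
           from dt1 left join dt2
             on dt1.c1 = dt2.c1 or dt1.c2 = dt2.c2
                or dt2.c3 between dt1.c3_min and dt1.c3_max
           group by dt1.x  -- x is unique per row of dt1"
sqldf(query2)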

Related

Compute conditionally across rows in data.table in R

I have a data.table with three relevant columns: id, timepoint and metric (actual size is much larger).
I am trying to calculate the percent change between the metric values at timepoints A and D and use it to create a label (Good metric, Half-decent metric, Subpar metric).
The situation becomes more complicated because if the metric is less than or equal to 2, then the new column should report "Super metric!". If not, then the percent difference should be calculated. Based on the percent change, the ids will be classified as either "Subpar metric" (< 30%), "Half-decent metric" (30-50%), or "Good metric" (> 50%).
If there is an NA value at timepoint A or D, then returning NA is okay. If timepoint A or D is missing, also return NA.
My initial thought was that I could calculate this in data.table without creating unnecessary columns, but I haven't even been able to get the simpler solution working, where I do the calculations separately and then join them back later.
# Example data
library(data.table)
dat <- data.table(
  id = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 6, 6, 10, 10, 10, 11, 11, 12, 12, 14, 14),
  timepoint = c("A", "B", "C", "D", "A", "D", "A", "B", "C", "D", "A", "B", "C",
                "A", "D", "A", "B", "D", "A", "D", "A", "D", "A", "D"),
  metric = c(NA, 3, 3, 4, 4, 2, 3, 3, 2, 1, 4, 3, NA, NA, 4, 1, 5, 2, 5, 3, 5, 5, 6, 3)
)
Partial solution: first identify the "Super metric" ids, but I would like this to class all instances of "Super metric" ids as such (right now it returns "Super metric" only for timepoint D).
# Inefficient solution
# Step 1: Identify id's that need to be computed
dat1 <- dat[, `:=` (Metric_score = if (metric <= 2 & timepoint == "D")
                                     "Super metric"
                                   else "Calc PC"),
            by = 'id,timepoint']
# id timepoint metric Metric_score
# 1: 1 A NA Calc PC
# 2: 1 B 3 Calc PC
# 3: 1 C 3 Calc PC
# 4: 1 D 4 Calc PC
# 5: 2 A 4 Calc PC # Should be Super metric
# 6: 2 D 2 Super metric
Performing the calculation:
This calculates the percent change for all ids, regardless of whether it needs to be calculated:
# Step 2: Calculate percent change between timepoint D and A
dat[, `:=` (col = (metric[timepoint == "A"] - metric[timepoint == "D"]) /
                  metric[timepoint == "A"] * 100),
    by = 'id']
Desired output: class each metric as "Super metric" when the final score (timepoint D) is <= 2; otherwise, calculate the percent change ((metric at time D - metric at time A) / metric at time A * 100) and classify based on the result: "Subpar metric" (< 30%), "Half-decent metric" (30-50%), "Good metric" (> 50%).
id   timepoint   metric   metric_class
 1   A           NA       NA
 1   B            3       NA
 1   C            3       NA
 1   D            4       NA
 2   A            4       Super metric
 2   D            2       Super metric
 3   A            3       Super metric
 3   B            3       Super metric
 3   C            2       Super metric
 3   D            1       Super metric
 4   A            4       NA
 4   B            3       NA
 4   C           NA       NA
 6   A           NA       NA
 6   D            4       NA
10   A            1       Super metric
10   B            5       Super metric
10   D            2       Super metric
11   A            5       Half-decent metric
11   D            3       Half-decent metric
12   A            5       Subpar metric
12   D            5       Subpar metric
14   A            6       Good metric
14   D            3       Good metric
Using fcase should give you the desired result.
Since 0.5 is both in the 0.3-0.5 range and >= 0.5, fcase takes the first matching case in the list, which here is "Good metric"; if you want that changed, you can simply reorder the cases.
metrics <- dcast.data.table(dat, id ~ timepoint)
metrics[, metric_class := fcase(D <= 2, "Super metric",
                                abs(D - A) / A < 0.3, "Subpar metric",
                                abs(D - A) / A >= 0.5, "Good metric",
                                between(abs(D - A) / A, 0.3, 0.5), "Half-decent metric")]
dat <- merge(dat, metrics[, .(id, metric_class)], by = "id")
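dcast will guess value.var = "metric" here (with a message). A variant sketch that names it explicitly and spells out the question's NA rule (NA whenever timepoint A or D is absent or NA) as its own case, rather than relying on fcase falling through to NA:
metrics <- dcast.data.table(dat, id ~ timepoint, value.var = "metric")
metrics[, metric_class := fcase(is.na(A) | is.na(D), NA_character_,  # NA at A or D -> NA
                                D <= 2, "Super metric",
                                abs(D - A) / A < 0.3, "Subpar metric",
                                abs(D - A) / A >= 0.5, "Good metric",
                                between(abs(D - A) / A, 0.3, 0.5), "Half-decent metric")]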
Here is another approach that doesn't require dcast.
metric_class <- function(t, m) {
  # the whole id is "Super metric" when its timepoint-D metric is <= 2
  if ("D" %in% t && m[t == "D"] <= 2) return(rep("Super metric", length(t)))
  mvals <- c("a" = m[t == "A"], "d" = m[t == "D"])
  val <- abs((mvals["d"] - mvals["a"]) / mvals["a"])  # |percent change| as a fraction
  rep(fcase(val < 0.3, "Subpar metric",
            val >= 0.5, "Good metric",
            val >= 0.3 & val < 0.5, "Half-decent metric"),
      length(t))
}
setDT(dat)[, metric_class := metric_class(timepoint, metric), by = id][]
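One caveat with this function: if an id ever had an NA metric at timepoint D, the bare if (m[t == "D"] <= 2) would error. A defensive variant, only a sketch assuming (per the question) that NA at A or D should yield NA:
metric_class_safe <- function(t, m) {
  d <- m[t == "D"][1]   # NA when timepoint D is absent or its metric is NA
  a <- m[t == "A"][1]   # likewise for timepoint A
  if (isTRUE(d <= 2)) return(rep("Super metric", length(t)))
  val <- abs((d - a) / a)
  if (is.na(val)) return(rep(NA_character_, length(t)))
  rep(fcase(val < 0.3, "Subpar metric",
            val >= 0.5, "Good metric",
            val >= 0.3 & val < 0.5, "Half-decent metric"),
      length(t))
}
setDT(dat)[, metric_class := metric_class_safe(timepoint, metric), by = id][]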

efficient way to extract conditional data by group from data.table, during non-equi join in r

I have two tables, one with time-series data (dat) and another with some reference points (pts) for a bunch of different observations (time.group and well). Please see the minimal example tables:
set.seed(5)
dat = data.table(time.group = c(rep("base", 42), rep("4h", 42)),
                 well = c(rep("A1", 20), rep("B1", 22), rep("A1", 19), rep("B1", 23)),
                 frame = c(1:20, 1:22, 1:19, 1:23),
                 signal = runif(84, 0, 1))
pts = data.table(time.group = c(rep("base", 2), rep("4h", 2)),
                 well = rep(c("A1", "B1"), 2),
                 frame.start = c(3, 4, 3, 6),
                 frame.stop = c(17, 18, 12, 19))
head(dat)
time.group well frame signal
1: base A1 1 0.2002145
2: base A1 2 0.6852186
3: base A1 3 0.9168758
4: base A1 4 0.2843995
5: base A1 5 0.1046501
6: base A1 6 0.7010575
head(pts)
time.group well frame.start frame.stop
1: base A1 3 17
2: base B1 4 18
3: 4h A1 3 12
4: 4h B1 6 19
For each time.group and well, I would like to extract the frame from the dat table at which the signal is highest, between the frame.start and frame.stop bounds from the pts table.
What is the most efficient way to do so, as I have pretty large data sets with lots of time.groups and wells, and a few other "signal"-like data columns?
These are the strategies I have come up with so far:
Example 1: This works, but I feel that this is redundant/slow, as it essentially has to perform the "by" grouping twice:
dat[pts, .(time.group, well, frame = x.frame, signal),  # returns dat's frame column (desired)
    on = .(time.group, well, frame >= frame.start, frame <= frame.stop)  # non-equi join, groups once
    ][, .SD[which.max(signal), .(plus = frame)],  # extract frame at max(signal)
      by = .(time.group, well)]                   # groups again
>>>>>
time.group well plus
1: base A1 9
2: base B1 8
3: 4h A1 12
4: 4h B1 8
Example 2: Here, I would get the right numbers if I added i.plus to the first frame column and subtracted 1; however, I can't do that, and it trips up because there are two columns named "frame" in the output after the join.
Also, it wouldn't work if frame didn't start from 1 for every group:
dat[pts,
    on = .(time.group, well, frame >= frame.start, frame <= frame.stop),  # non-equi join
    .(i.plus = which.max(signal)),  # adding i.plus to the first frame column, minus 1, gives what I want, but there are two columns named frame
    by = .EACHI
    ]
>>>>>>
time.group well frame frame i.plus
1: base A1 3 17 7
2: base B1 4 18 5
3: 4h A1 3 12 10
4: 4h B1 6 19 3
Example 3: This also works and gives the same table as example 1, but it just seems like lots of code:
tmp =
  dat[pts,
      on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
      .(plus = .I[which.max(signal)]),  # returns row indices from the original data.table (dat)
      by = .EACHI][["plus"]]
dat[tmp, .(time.group, well, plus = frame)]  # extract from the original table
Example 4: And this does not return the original frame column from dat, but returns the columns from pts, so I can't access the frame that corresponds to max(signal) in dat:
dat[pts,
    on = .(time.group, well, frame >= frame.start, frame <= frame.stop),  # non-equi join
    .SD[which.max(signal)],  # does not return the original frame column (x.frame), so I can't extract it
    by = .EACHI
    ]
>>>>>>>>
time.group well frame frame signal
1: base A1 3 17 0.9565001
2: base B1 4 18 0.9659641
3: 4h A1 3 12 0.9758776
4: 4h B1 6 19 0.9304595
I'm not sure if I should approach this from an entirely different angle and try to join pts into dat instead; I have no idea! Any insight into whether there are more elegant ways of accomplishing this is greatly appreciated!
I'd also like to note that coming up with an optimal strategy to do this is pretty important, as I will be doing these types of data extractions many times, so I've been cracking my head about it for a while now :(
Thank you!
Is this what you're looking for?
dat[pts, on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
    .(plus = x.frame[which.max(signal)]),
    by = .EACHI]
# time.group well frame frame plus
# 1: base A1 3 17 9
# 2: base B1 4 18 8
# 3: 4h A1 3 12 12
# 4: 4h B1 6 19 8
For some reason, using frame instead of x.frame, i.e., frame[which.max(signal)], returns all NA, which I'd suppose is a bug. Could you please file an issue linking to this post? Thanks.
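Since you mention a few other "signal"-like columns: the same by = .EACHI pattern extends one column at a time. A sketch with a hypothetical second column signal2 (not in your data; added here only for illustration):
dat[, signal2 := runif(.N)]  # hypothetical extra measurement
dat[pts, on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
    .(plus  = x.frame[which.max(signal)],
      plus2 = x.frame[which.max(signal2)]),
    by = .EACHI]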

R: find data frame index of multiple conditions

Given two data frames s and q with five observations each:
set.seed(8)
s <- data.frame(id = sample(c('Z', 'X'), 5, TRUE),
                t0 = sample(1:10, 5, TRUE),
                t1 = sample(11:30, 5, TRUE))
q <- data.frame(id = sample(c('Z', 'X'), 5, TRUE),
                t0 = sample(1:10, 5, TRUE),
                t1 = sample(11:30, 5, TRUE))
> s
id t0 t1
1 Z 8 20
2 Z 3 12
3 X 10 19
4 X 8 21
5 Z 7 13
> q
id t0 t1
1 X 3 30
2 Z 5 12
3 Z 7 23
4 Z 3 21
5 X 7 27
The midpoint between the variables t0 and t1 (e.g. for the s data) is:
s$t0+(s$t1-s$t0)/2
To find the index of the (first) observation in s whose midpoint is closest to, say, the first observation in q I can do:
i <- which.min(abs(s$t0+(s$t1-s$t0)/2 - (q$t0[1]+(q$t1[1]-q$t0[1])/2)))
s[i,]
gives:
id t0 t1
3 X 10 19
But I cannot figure out how to find the same index in the original data s if I also want to condition on the id variable (e.g. pseudo-code like which.min(....) & s$id == q$id[1]; in this case the midpoint is sought among ids being 'X'). This SO question is close but not spot on.
Again: I need an index to be used in the original 5-row data set.
Set the values passed to which.min to infinity where your condition is not met:
val <- abs(s$t0+(s$t1-s$t0)/2 - (q$t0[1]+(q$t1[1]-q$t0[1])/2))
val[s$id != q$id[1]] <- Inf
i <- which.min(val)
By the way, you can simplify the expression in the first line as:
val <- abs((s$t0+s$t1)/2-(q$t0[1]+q$t1[1])/2)
or even
val <- abs(s$t0+s$t1-q$t0[1]-q$t1[1])/2
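A sketch extending this to every row of q at once (my addition, not part of the simplification above); it returns NA for a row of q when no row of s shares its id:
idx <- sapply(seq_len(nrow(q)), function(j) {
  val <- abs(s$t0 + s$t1 - q$t0[j] - q$t1[j]) / 2
  val[s$id != q$id[j]] <- Inf  # rule out mismatched ids
  if (all(is.infinite(val))) NA_integer_ else which.min(val)
})
idx  # one index into the original 5-row s per row of q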

data.table execute function on groups of columns

If I have the following data.table,
m = matrix(1:12, ncol=4)
colnames(m) = c('A1','A2','B1','B2')
d = data.table(m)
is it possible to execute a function on sets of columns?
For example, the following would be the row-wise sums of A1, A2 and of B1, B2:
A B
1: 5 17
2: 7 19
3: 9 21
The solution would preferably work with a 500k x 100 matrix.
Solution
A trick would be to split the column into groups.
Then you can use rowSums as Frank suggests (see comments on question):
# using your data example
m <- matrix(1:12, ncol = 4)
colnames(m) <- c('A1', 'A2', 'B1', 'B2')
d <- data.table(m)
# 1) group columns
groups <- split(colnames(d), substr(colnames(d), 1, 1))
# 2) group-wise row sums
d[, lapply(groups, function(i) rowSums(d[, i, with = FALSE]))]
Result
This will return the data.table:
A B
1: 5 17
2: 7 19
3: 9 21
Explanation
split creates a list of column names for each group, defined by a factor (or something coercible to one).
substr(colnames(m), 1, 1) takes the first letter as the group id; for a variable number of letters, use a different approach (e.g. sub("([A-Z]+).*", "\\1", colnames(m))).
lapply is commonly used to apply functions over multiple columns in a data.table. Here we create a list output, named as the groups, containing the rowSums. with = FALSE is important to use the value of i to get the respective columns from d.
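For the 500k x 100 case, a possible variant (only a sketch; whether it is faster would need benchmarking on your data): rowSums is fast on a plain numeric matrix, so you can compute each group's sum directly on m and assemble the result afterwards.
res <- as.data.table(lapply(groups, function(g) rowSums(m[, g, drop = FALSE])))
res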
Definitely possible...
d[, ":=" (A = A1 + A2, B = B1 + B2)]
d
A1 A2 B1 B2 A B
1: 1 4 7 10 5 17
2: 2 5 8 11 7 19
3: 3 6 9 12 9 21
# Want to drop the old columns?
set(d, j = which(names(d) %in% c("A1", "B1", "A2", "B2")), value = NULL)
d
A B
1: 5 17
2: 7 19
3: 9 21
Whether it is desirable I shall not tell. Probably better to follow Frank's advice (see comments).

subset using R language

I have 2 tables
In the first table, I have two columns. In the first column, the values run from 1 to 2 million (call them x). In the second column, I have random numbers (call them y).
In the second table, I have two columns. In the first column, I have the same x values, but they do not run from 1 to 2 million; instead they are in random increasing order, like 222, 249, 562, and so on. In the second column, I have random numbers (call them z).
Now, I am trying to add a third column to my second table with the y values from the first table. I decided to use apply, but you can use join or merge, whichever is more efficient. Here the x value connects the y and the z.
To start with minimal data, you can use this code:
t1 <- cbind(1:20, sample(100:999, 20, TRUE))
t2 <- rbind(c(2, 4), c(6, 12), c(17, 18))
apply(t2, 1, function(...) )
Could you help me fill in the ... blanks?
The output should be of the form:
2 4 --
6 12 --
17 18 --
You can use merge for this:
merge(as.data.frame(t2), as.data.frame(t1), by='V1')
V1 V2.x V2.y
1 2 4 751
2 6 12 298
3 17 18 218
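For the real 2-million-row tables, a keyed data.table join may scale better than merge on data frames. A sketch (the column names x, y, z are assumptions for illustration):
library(data.table)
d1 <- as.data.table(t1); setnames(d1, c("x", "y"))
d2 <- as.data.table(t2); setnames(d2, c("x", "z"))
d1[d2, on = "x"]  # each row of the second table, with its y value attached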
Does this meet your requirements?
require(plyr)
t1 <- as.data.frame(cbind(1:20, sample(100:999, 20, TRUE)))
t2 <- as.data.frame(rbind(c(2, 4), c(6, 12), c(17, 18)))
t3 <- join(t2, t1, type = "left", by = "V1")
t3
> t3
V1 V2 V2
1 2 4 779
2 6 12 898
3 17 18 903
