I want to obtain the minimum distance between two columns; however, the same name may appear in both Column A and Column B. See the example below:
Patient1 Patient2 Distance
A B 8
A C 11
A D 19
A E 23
B F 6
C G 25
So the output I need is:
Patient Patient_closest_distance Distance
A B 8
B F 6
C A 11
I have tried using the list function:
library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]
However, I just get the minimum distance for each column separately; C, for example, gets two results because it appears in both columns, rather than one closest patient considering both columns. Also, I only get a list of distances, so I can't see which patient is linked to which:
Patient1 SNP
1: A 8
The code below works:
# Create sample data frame
df <- data.frame(
Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variable (instead of factor)
df$Patient1 <- as.character(df$Patient1); df$Patient2 <- as.character(df$Patient2);
# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)
# group pairs by min distance
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), min(Distance))
# Resort, min to top.
nearest <- df[order(df$`min(Distance)`), ]
# Keep only the first of each group
nearest <- nearest[!duplicated(nearest$Patient1),]
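On the sample data, printing nearest should give something like this (row order among equal distances may vary):
# Patient1 Patient2 min(Distance)
# A        B                    1
# B        A                    1
# D        A                    3
# C        A                   20
# F        B                   20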
I have four large vectors of unequal length. Below I am providing a toy dataset similar to my original dataset:
a <- c(1021.923, 3491.31, 102.3, 12019.11, 879.2, 583.1)
b <- c(21,32,523,123.1,123.4545,12345,95.434, 879.25, 1021.9,11,12,662)
c <- c(52,21,1021.9288,12019.12, 879.1)
d <- c(432.432,23466.3,45435,3456,123,6688,1021.95)
Is there a way to compare all of these vectors one by one with an allowed threshold of ±0.5 for the match? In other words, I want to report the numbers that are common among all four vectors while allowing a drift of 0.5.
In the case of the toy dataset above, the final answer is:
Match1
a 1021.923
b 1021.900
c 1021.929
d 1021.950
I understand that this is possible for two vectors, but how can I do it for four vectors?
Related:
All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches
Compare two vectors of numbers based on threshold of tolerance (±) of 0.5
Here is a data.table solution.
It is scalable to n vectors, so feed it as many as you like. It also performs well when multiple values have 'hits' in all vectors.
sample data
a <- c(1021.923, 3491.31, 102.3, 12019.11, 879.2, 583.1)
b <- c(21,32,523,123.1,123.4545,12345,95.434, 879.25, 1021.9,11,12,662)
c <- c(52,21,1021.9288,12019.12, 879.1)
d <- c(432.432,23466.3,45435,3456,123,6688,1021.95)
code
library(data.table)
#create list with vectors
l <- list(a, b, c, d)
names(l) <- letters[1:4]
#create data.table to work with
DT <- rbindlist( lapply(l, function(x) {data.table( value = x)} ), idcol = "group")
#add margins to each value
DT[, `:=`( id = 1:.N, start = value - 0.5, end = value + 0.5 ) ]
#set keys for joining
setkey(DT, start, end)
#perform overlap-join
result <- foverlaps(DT,DT)
#cast, to check how the 'hits' each id has in each group (a,b,c,d)
answer <- dcast( result,
group + value ~ i.group,
fun.aggregate = function(x){ x * 1 },
value.var = "i.value",
fill = NA )
#get your final answer
#set columns to look at (i.e. the names from the earlier created list)
cols = names(l)
#keep the rows without NA (use rowSums, because TRUE = 1, FALSE = 0)
#so if rowSums == 0, the columns in the vector 'cols' do not contain an NA
answer[ rowSums( is.na( answer[ , ..cols ] ) ) == 0, ]
output
# group value a b c d
# 1: a 1021.923 1021.923 1021.9 1021.929 1021.95
# 2: b 1021.900 1021.923 1021.9 1021.929 1021.95
# 3: c 1021.929 1021.923 1021.9 1021.929 1021.95
# 4: d 1021.950 1021.923 1021.9 1021.929 1021.95
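If you need a different tolerance, only the margin step changes; for example, with a tolerance of ±0.1:
#add margins of 0.1 instead of 0.5 to each value
DT[, `:=`( id = 1:.N, start = value - 0.1, end = value + 0.1 ) ]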
How can I compare each value in a data.table column against the values in the next two positions of the same column? The following example illustrates the problem and the desired result.
library(data.table)
dt <- data.table(a = c(2, 3, 2, 4))
result <- data.table(a = c(2, 3, 2, 4), b = c(T, F, NA, NA))
We can use shift to create two lead columns from 'a' by specifying n = 1:2. Loop through those columns with lapply, check whether each is equal to 'a', Reduce the results to a single logical vector with |, and assign it to the 'b' column.
dt[, b := Reduce(`|`, lapply(shift(a, 1:2, type = 'lead'), `==`, a))]
dt
# a b
#1: 2 TRUE
#2: 3 FALSE
#3: 2 NA
#4: 4 NA
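The NA entries appear because shift(..., type = 'lead') pads the end of each shifted vector with NA, so the last rows have nothing to compare against:
shift(c(2, 3, 2, 4), 1:2, type = 'lead')
#[[1]]
#[1]  3  2  4 NA
#
#[[2]]
#[1]  2  4 NA NA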
As @Mike H. suggested, if we are only comparing against the next two values, writing the comparisons out individually may be easier to understand:
dt[, b := (shift(a, 1, type = 'lead') == a) | (shift(a, 2, type = 'lead') == a)]
You could do a rolling join on row number:
dt[, r := .I]
dt[head(1:.N, -2), found :=
dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, .N, by=.EACHI]$N > 0L]
a r found
1: 2 1 TRUE
2: 3 2 FALSE
3: 2 3 NA
4: 4 4 NA
To see how it works, replace .N with x.r:
dt[head(1:.N, -2), dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, x.r, by=.EACHI]]
a r x.r
1: 2 2 3
2: 3 3 NA
The idea is that we look for the nearest match on a starting from r + 1, giving up after rolling one more row ahead.
I have a data.table that looks like this:
dt <- data.table(a = 1, b = 1, c = 1)
I need column b to be treated as an integer vector of variable length, so I can append additional elements to it. For instance, I want to append the value 2 to b in the first row. I tried
dt[a == 1, b := c(b, 2)]
but that doesn't work. It gives me a warning:
Warning message:
In `[.data.table`(dt, a == 1, `:=`(b, c(b, 2))) :
Supplied 2 items to be assigned to 1 items of column 'b' (1 unused)
What's the right syntax for this?
dt <- data.table(a = 1, b = 1:3, c = 1)
dt[, b := .(lapply(b, c, 2))][]
# a b c
#1: 1 1,2 1
#2: 1 2,2 1
#3: 1 3,2 1
If b first needs to be converted to a list column (i.e. when it is not already a list, and you are subsetting or using by), run dt[, b := .(as.list(b))] before the above, as in the sketch below.
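For instance, a minimal sketch using the single-row table from the question:
dt <- data.table(a = 1, b = 1, c = 1)
dt[, b := .(as.list(b))]             # convert b to a list column first
dt[a == 1, b := .(lapply(b, c, 2))]  # appending to a subset now works
dt
# a b c
#1: 1 1,2 1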
How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:
x <- data.table(x = 1:2, y = c("a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
x[y]
# x y z
# 1: 1 a 10
# 2: 1 a 11
A semi-join would return just x[1]
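For reference, x[1] is:
#    x y
# 1: 1 a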
More possibilities :
w = unique(x[y,which=TRUE]) # the row numbers in x which have a match from y
x[w]
If there are duplicate key values in x, then that needs :
w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]
Or, the other way around :
setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]
If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
But the anti-anti-join appeals too :-)
One solution I can think of is:
tmp <- x[!y]
x[!tmp]
In data.table, you can use another data.table as the i expression (i.e., the first argument inside the [...] call), and that will perform a join, e.g.:
x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)
> x[y]
x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4
The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of the data.table documentation. So the first assignment, tmp <- x[!y], evaluates to the subset of x that has no rows where the key (column x) is present in y:
> x[!y]
x y
1: 2 b
2: 4 d
3: 6 f
4: 7 g
5: 8 h
6: 9 i
7: 10 j
It is similar to setdiff in this regard. The second statement, x[!tmp], therefore returns all the rows in x whose key is present in y.
The ! feature was added in data.table 1.8.4 with the following note in NEWS:
o A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
DT[-DT["a", which=TRUE, nomatch=0]] # old not-join idiom, still works
DT[!"a"] # same result, now preferred.
DT[!J(6),...] # !J == not-join
DT[!2:3,...] # ! on all types of i
DT[colA!=6L | colB!=23L,...] # multiple vector scanning approach (slow)
DT[!J(6L,23L)] # same result, faster binary search
'!' has been used rather than '-' :
* to match the 'not-join'/'not-where' nomenclature
* with '-', DT[-0] would return DT rather than DT[0] and not be backwards
compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
base R) and after this new feature.
* to leave DT[+J...] and DT[-J...] available for future use
For some reason, the following doesn't work: x[!(x[!y])]. Probably data.table is too smart about parsing the argument.
P.S. As Josh O'Brien pointed out in another answer, a one-liner is x[!eval(x[!y])].
I'm confused with all the not-joins above, isn't what you want simply:
unique(x[y, .SD])
# x y
#1: 1 a
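(Here .SD restricts the join result to x's own columns, and unique() drops rows duplicated by repeated key values in y.)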
If x can have duplicate keys, then you can unique y instead:
## Create an example data.table with a three-times-repeated first row
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
setkey(y, x)
x[eval(unique(y, by = key(y))), .SD] # data.table >= 1.9.8 requires by=key(y)
# x y
# 1: 1 a
# 2: 1 a
# 3: 1 a
Update. Based on all the discussion here, I would do something like this, which should be fast and work in the most general case:
x[eval(unique(y[, key(x), with = FALSE]))]
Here is another, more direct solution:
unique(x[eval(y$x)])
It runs faster, too; here is a run-time comparison with my previous solution:
# Generate some large data
N <- 1000000 * 26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
system.time(r1 <- x[!eval(x[!y])])
user system elapsed
7.772 1.217 11.998
system.time(r2 <- unique(x[eval(y$x)]))
user system elapsed
0.540 0.142 0.723
In a more general case, you can do something like
x[eval(y[, key(x), with = FALSE])]
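(y[, key(x), with = FALSE] takes just x's key column(s) from y to join on; as before, wrap it in unique() if y can repeat key values.)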
I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.
sJ <- function(x,y){
ycols <- 1:min(ncol(y),length(key(x)))
yjoin <- unique(y[, ..ycols])
yjoin
}
x[eval(sJ(x,y))]
For Victor's simpler example, this gives the desired output:
x y
1: 1 a
2: 3 c
3: 5 e
This is about 30% slower than Victor's way.
EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:
N <- 1e5*26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
require(microbenchmark)
microbenchmark(
sJ=x[eval(sJ(x,y))],
dolla=unique(x[eval(y$x)]),
brack=x[eval(unique(y[['x']]))]
)
# Unit: milliseconds
#  expr       min        lq    median        uq      max neval
#    sJ 120.22700 125.04900 126.50704 132.35326 217.6566   100
# dolla 105.05373 108.33804 109.16249 118.17613 285.9814   100
# brack  53.95656  61.32669  61.88227  65.21571 235.8048   100
I'm guessing the [[ vs. $ difference doesn't account for the speed, but I didn't check.
This thread is old, but I noticed that the solution can be derived directly from the definition of semi-join given in the original post:
"A semi-join is like an inner join except that it only returns the
columns of X (not also those of Y), and does not repeat the rows of X
to match the rows of Y"
library(data.table)
dt1 <- data.table(ProdId = 1:4,
Product = c("Bread", "Cheese", "Pizza", "Butter"))
dt2 <- data.table(ProdId = c(1, 1, 3, 4, 5),
Company = c("A", "B", "C", "D", "E"))
# semi-join
unique(merge(dt1, dt2, by="ProdId")[, names(dt1), with=F])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
I've simply applied the inner-join syntax, then kept only the columns of the first table, with unique() removing the rows of the first table that were repeated to match rows of the second.
Edit: The above approach matches dplyr::semi_join() output only if the first table has unique rows. If we need to output all rows, including duplicates, from the first table, we can use the fsetdiff() approach shown below.
Another one line data.table solution:
fsetdiff(dt1, dt1[!dt2, on="ProdId"])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
Here I've removed from the first table its anti-join with the second, which seems simpler to me. If the first table has duplicate rows, we need:
fsetdiff(dt1, dt1[!dt2, on="ProdId"], all=T)
With all=T, the fsetdiff() result matches the output from dplyr:
dplyr::semi_join(dt1, dt2, by="ProdId")
ProdId Product
1 1 Bread
2 3 Pizza
3 4 Butter
Using another set of data taken from one of the previous answers:
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
y <- data.table(x = c(1, 1), z = 10:11)
With dplyr:
dplyr::semi_join(x, y, by="x")
x y
1 1 a
2 1 a
3 1 a
With data.table:
fsetdiff(x, x[!y, on="x"], all=T)
x y
1: 1 a
2: 1 a
3: 1 a
Without all=T, the duplicate rows are removed:
fsetdiff(x, x[!y, on="x"])
x y
1: 1 a
The dplyr package supports, among others, the following four join types:
inner_join, left_join, semi_join, anti_join
So for the semi-join, try the following code:
library("dplyr")
table1 <- data.table(x = 1:2, y = c("a", "b"))
table2 <- data.table(x = c(1, 1), z = 10:11)
semi_join(table1, table2)
The output is as expected:
# Joining by: "x"
# Source: local data table [1 x 2]
#
# x y
# (int) (chr)
# 1 1 a
Try the following:
w <- y[, unique(x)]  # unique values of column x present in y
x[x %in% w]          # keep rows of x whose x value appears in w
Output will be:
x y
1: 1 a