How to find the distance to the nearest non-overlapping element in R?

I have a table like the one below, where each cluster (column 1) contains annotations of different elements (column 4) in small regions with a start (column 2) and an end (column 3) coordinate. For each entry, I would like to add a column giving the distance to the nearest other element in that cluster, excluding pairs of elements that have identical start/end coordinates or overlapping regions. How can I produce such an extra nearest_distance column for this data frame?
cluster-47593-walk-0125 252 306 AR
cluster-47593-walk-0125 6 23 ZNF148
cluster-47593-walk-0125 357 381 CEBPA
cluster-47593-walk-0125 263 276 CEBPB
cluster-47593-walk-0125 246 324 NR3C1
cluster-47593-walk-0125 139 170 HMGA1
cluster-47593-walk-0125 139 170 HMGA2
cluster-47593-walk-0125 207 227 IRF8
cluster-47593-walk-0125 207 227 IRF1
cluster-47593-walk-0125 207 245 IRF2
cluster-47593-walk-0125 207 227 IRF3
cluster-47593-walk-0125 207 227 IRF4
cluster-47593-walk-0125 207 227 IRF5
cluster-47593-walk-0125 207 227 IRF6
cluster-47593-walk-0125 204 245 IRF7
cluster-47593-walk-0125 13 36 PATZ1
cluster-47593-walk-0125 14 143 PAX4
cluster-47593-walk-0125 4 25 RREB1
cluster-47593-walk-0125 73 87 SMAD1
cluster-47593-walk-0125 73 87 SMAD2
cluster-47593-walk-0125 73 87 SMAD3
cluster-47593-walk-0125 71 89 SMAD4
cluster-47593-walk-0125 11 40 SP1
cluster-47593-walk-0125 11 38 SP2
cluster-47593-walk-0125 7 38 SP3
cluster-47593-walk-0125 11 38 SP4
cluster-47593-walk-0125 13 33 GTF2I
cluster-47593-walk-0125 281 352 YY1
cluster-47586-walk-0222 252 306 AR
cluster-47586-walk-0222 6 23 ZNF148
[...]
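Assuming the table above sits in a plain whitespace-delimited text file (the file name below is just a placeholder), it can be read in with something like:
data <- read.table("clusters.txt", header = FALSE, stringsAsFactors = FALSE)  # hypothetical file name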

First, some column names
names(data) <- c("cluster", "start", "end", "element")
data
cluster start end element
1 cluster-47593-walk-0125 252 306 AR
2 cluster-47593-walk-0125 6 23 ZNF148
3 cluster-47593-walk-0125 357 381 CEBPA
4 cluster-47593-walk-0125 263 276 CEBPB
Now, creating the new column:
data$nearest_distance <- apply(data, 1, function(x) {
  cluster <- x[1]
  start <- as.numeric(x[2])
  end <- as.numeric(x[3])
  elem <- x[4]
  posb <- data[data$cluster == cluster & data$element != elem &
                 ((data$start > end) | (data$end < start)), ]
  startDist <- as.matrix(dist(c(end, posb$start)))[, 1]
  endDist <- as.matrix(dist(c(start, posb$end)))[, 1]
  best.dist <- min(startDist[startDist > 0], endDist[endDist > 0])
  return(best.dist)
})
I'm not entirely happy with the function, especially its beginning, but I couldn't come up with a better solution. So we have
cluster start end element nearest_distance
1 cluster-47593-walk-0125 252 306 AR 7
2 cluster-47593-walk-0125 6 23 ZNF148 48
3 cluster-47593-walk-0125 357 381 CEBPA 5
4 cluster-47593-walk-0125 263 276 CEBPB 5
5 cluster-47593-walk-0125 246 324 NR3C1 1
.....
Edit: after fixing the system.time() test it turned out that this is a very inefficient approach. Obviously, it is redundant to compute the whole dist() matrix, so we can change these two lines to
startDist <- abs(end-posb$start)
endDist <- abs(start-posb$end)
Another minor change is that we can drop the data$element != elem constraint, because the > 0 filter later already excludes zero distances. Testing this function on 1,000 clusters with 30 rows each took more than three minutes. The subsetting remains a problem, so I tried splitting the data into a list; this also lets us use matrices instead of data frames (since the cluster constraint disappears), which improves efficiency further. This time we have 10,000 clusters with 30 rows each:
data <- data[rep(1:30, each = 10000), ]
data$cluster <- factor(rep(1:10000, 30))
spl <- split(data[, c(2:3)], data$cluster)
spl <- lapply(spl, data.matrix)
system.time({
  x <- lapply(spl, function(z) {
    apply(z, 1, function(x) {
      start <- x[1]
      end <- x[2]
      posb <- z[z[, 1] > end | z[, 2] < start, , drop = FALSE]
      startDist <- abs(end - posb[, 1])
      endDist <- abs(start - posb[, 2])
      best.dist <- min(startDist[startDist > 0], endDist[endDist > 0])
      return(best.dist)
    })
  })
})
data$nearest_distance <- unsplit(x, data$cluster)
user system elapsed
18.16 0.00 18.35
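As a further sketch (under the assumption that the semantics above are exactly what is wanted; verify on a small cluster first), the pairwise gaps within a cluster can be computed in one shot with outer(), so each element's nearest distance becomes a row minimum of a gap matrix instead of repeated subsetting:
nearest_in_cluster <- function(m) {
  # m: two-column matrix of start/end positions for one cluster
  gap1 <- outer(m[, 2], m[, 1], function(e, s) s - e)  # other's start minus own end
  gap2 <- outer(m[, 1], m[, 2], function(s, e) s - e)  # own start minus other's end
  gaps <- pmax(gap1, gap2)   # positive entries are gaps between non-overlapping elements
  gaps[gaps <= 0] <- Inf     # overlapping, touching and identical pairs (and self) are excluded
  apply(gaps, 1, min)
}
x2 <- lapply(spl, nearest_in_cluster)   # should reproduce the result above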

Related

Using dplyr to compute calculated fields depending on multiple columns without explicitly writing column names

Consider the following code.
set.seed(56)
library(dplyr)
df <- data.frame(
  NUM_1 = sample.int(500, replace = TRUE),
  DENOM_1 = sample.int(500, replace = TRUE),
  NUM_2 = sample.int(500, replace = TRUE),
  DENOM_2 = sample.int(500, replace = TRUE)
)
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2
1 417 379 154 173
2 160 437 239 154
3 243 315 106 361
4 291 169 393 340
5 170 450 429 421
6 422 131 75 64
Without having to manually specify each of the column names (the actual problem has about 40 of these I need to create), I would like to create columns FRAC_1 and FRAC_2 for which FRAC_X = NUM_X/DENOM_X.
So, this would be what I'm looking for with regard to output, but since I'm dealing with about 40 of these, I don't want to have to manually type out each column:
df_frac <- df %>%
  mutate(FRAC_1 = NUM_1 / DENOM_1,
         FRAC_2 = NUM_2 / DENOM_2)
head(df_frac)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
I would strongly prefer a dplyr solution to this. I thought maybe I could use mutate() with across(), but it isn't clear to me how to tell across() to pair the NUM_x with the corresponding DENOM_x columns.
Here is one option in tidyverse:
Loop across the columns whose names start with 'NUM'.
Get the current column name with cur_column(), replace the substring 'NUM' with 'DENOM' via str_replace, and get() the values of that column.
Divide the NUM column by those values, and change the output column names through .names to create the 'FRAC' columns.
library(dplyr)
library(stringr)
df <- df %>%
  mutate(across(starts_with("NUM"), ~
    . / get(str_replace(cur_column(), 'NUM', 'DENOM')),
    .names = "{str_replace(.col, 'NUM', 'FRAC')}"))
Output:
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
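Since the question asks specifically for dplyr, this is only a side check: the same FRAC columns can be reproduced with plain block-wise division in base R (column names assumed to follow the NUM_/DENOM_ pattern above), which is handy for verifying the across() result:
num_cols   <- grep("^NUM_", names(df), value = TRUE)
denom_cols <- sub("^NUM_", "DENOM_", num_cols)
frac_cols  <- sub("^NUM_", "FRAC_", num_cols)
df[frac_cols] <- df[num_cols] / df[denom_cols]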

Finding nearest matching points

What I would like to do is, for each of the red points, find the nearest equivalent blue point on the other side of the abline (i.e. for (1,5) find (5,1)).
Data:
https://1drv.ms/f/s!Asb7WztvacfOuesIq4evh0jjvejZ4Q
Edit: to open data do readRDS("path/to/data")
So what I have tried is to find the difference between the x and y coordinates, rank them, and then find the min value going down the ranks for both x and y. The results are pretty bad. The thing I'm struggling with is finding a way to find the nearest match of tuples.
My attempt:
find_nearest <- function(query, subject){
  weight_df <- data.frame(ID=query$ID)
  #find difference of first, then second, rank and find match in both going from top to bottom
  tmp_df <- query
  for(i in 1:nrow(subject)){
    first_order <- order(abs(query$mean_score_n-subject$mean_score_n[i]))
    second_order <- order(abs(query$mean_score_p-subject$mean_score_p[i]))
    tmp_df$order_1[first_order] <- seq(1, nrow(tmp_df))
    tmp_df$order_2[second_order] <- seq(1, nrow(tmp_df))
    weight_df[,i+1] <- tmp_df$order_1 + tmp_df$order_2
  }
  rownames(weight_df) <- weight_df$ID
  weight_df$ID <- NULL
  print(dim(weight_df))
  nearest_match <- list()
  count <- 1
  subject_ids <- NA
  query_ids <- NA
  while(ncol(weight_df) > 0 & count <= ncol(weight_df)){
    pos <- which(weight_df == min(weight_df, na.rm = TRUE), arr.ind = TRUE)
    if(length(unique(rownames(pos))) > 1){
      for(i in nrow(pos)){
        #if subject/query already used then mask and find another
        if(subject$ID[pos[i,2]] %in% subject_ids){
          weight_df[pos[i,1],pos[i,2]] <- NA
        }else if(query$ID[pos[i,1]] %in% query_ids){
          weight_df[pos[i,1],pos[i,2]] <- NA
        }else{
          subject_ids <- c(subject_ids, subject$ID[pos[i,2]])
          query_ids <- c(query_ids, query$ID[pos[i,1]])
          nearest_match[[count]] <- data.frame(query=query[pos[i,1],]$ID, subject=subject[pos[i,2],]$ID)
          #mask
          weight_df[pos[i,1],pos[i,2]] <- NA
          count <- count + 1
        }
      }
    }else if(nrow(pos) > 1){
      #if subject/query already used then mask and find another
      if(subject$ID[pos[1,2]] %in% subject_ids){
        weight_df[pos[1,1],pos[1,2]] <- NA
      }else if(query$ID[pos[1,1]] %in% query_ids){
        weight_df[pos[1,1],pos[1,2]] <- NA
      }else{
        subject_ids <- c(subject_ids, subject$ID[pos[1,1]])
        query_ids <- c(query_ids, query$ID[pos[1,1]])
        nearest_match[[count]] <- data.frame(query=query[pos[1,1],]$ID, subject=subject[pos[1,2],]$ID)
        #mask
        weight_df[pos[1,1],pos[1,2]] <- NA
        count <- count + 1
      }
    }else{
      #if subject/query already used then mask and find another
      if(subject$ID[pos[2]] %in% subject_ids){
        weight_df[pos[1],pos[2]] <- NA
      }else if(query$ID[pos[1]] %in% query_ids){
        weight_df[pos[1],pos[2]] <- NA
      }else{
        subject_ids <- c(subject_ids, subject$ID[pos[2]])
        query_ids <- c(query_ids, query$ID[pos[1]])
        nearest_match[[count]] <- data.frame(query=query[pos[1],]$ID, subject=subject[pos[2],]$ID)
        #mask
        weight_df[pos[1],pos[2]] <- NA
        count <- count + 1
      }
    }
  }
  out <- plyr::ldply(nearest_match, rbind)
  out <- merge(out, data.frame(subject=subject$ID,
                               mean_score_p_n=subject$mean_score_p,
                               mean_score_n_n=subject$mean_score_n), by="subject", all.x=TRUE)
  out <- merge(out, data.frame(query=query$ID,
                               mean_score_p_p=query$mean_score_p,
                               mean_score_n_p=query$mean_score_n), by="query", all.x=TRUE)
  return(out)
}
Edit: is this what the solution looks like for you?
ggplot() +
  geom_point(data=B[out,], aes(x=mean_score_p, y=mean_score_n, color="red")) +
  geom_point(data=A, aes(x=mean_score_p, y=mean_score_n, color="blue")) +
  geom_abline(intercept = 0, slope = 1)
Let
query <- readRDS("query.dms")
subject <- readRDS("subject.dms")
kA <- nrow(subject)
kB <- nrow(query)
A <- as.matrix(subject[, 2:3])
B <- as.matrix(query[, 2:3])
where we want to find the closest "reverse" point (row) in B to each point in A.
Solution permitting non-unique results
Then, assuming that you are using the Euclidean distance,
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA]
unname(apply(D, 2, which.min))
# [1] 268 183 350 284 21 360 132 287 100 298 58 56 170 70 47 305 353
# [18] 43 266 198 58 215 198 389 412 321 255 181 79 340 292 268 198 54
# [35] 390 38 376 47 19 94 244 18 168 201 160 194 114 247 287 273 182
# [52] 87 94 87 192 63 160 244 101 298 62
are the corresponding row numbers in B. The trick was to switch the coordinates of the points in B by using B[, 2:1].
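As a quick spot check of that construction (a sketch using only the objects defined above), the first column of D should hold the distances from the first point of A to every coordinate-swapped point of B:
j <- 1
d1 <- sqrt((B[, 2] - A[j, 1])^2 + (B[, 1] - A[j, 2])^2)
which.min(d1)   # should agree with unname(apply(D, 2, which.min))[1]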
Solution with unique results
out <- vector("numeric", length = kA)
colnames(D) <- 1:ncol(D)
rownames(D) <- 1:nrow(D)
while(any(out == 0))
  for(i in 1:nrow(D)) {
    aux <- apply(D, 2, which.min)
    if(i %in% aux) {
      win <- which(aux == i)[which.min(D[i, aux == i])]
      out[as.numeric(names(win))] <- as.numeric(rownames(D)[i])
      D <- D[-i, -win, drop = FALSE]
    }
  }
out
# [1] 268 183 350 284 21 360 132 213 100 298 22 56 170 70 128 305 353
# [18] 43 266 198 58 215 294 389 412 321 255 181 79 340 292 20 347 54
# [35] 390 38 376 47 19 94 73 18 168 201 160 194 114 247 287 273 182
# [52] 87 365 158 192 63 211 244 101 68 62
whereas
all(table(out) == 1)
# [1] TRUE
confirms uniqueness. The solution is not the most efficient, but on your dataset it takes only a couple of seconds. It takes some time because it keeps going over all the available points in B, checking whether each is the closest one to any of the points in A. If so, the corresponding point in B is assigned to the closest one in A, and both points are eliminated from the distance matrix. The loop continues until every point in A has a match in B.
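To make the result easier to use downstream, the index vector can be turned into a small lookup table (a sketch that assumes the ID columns used in the question's own attempt):
matches <- data.frame(subject_ID = subject$ID,
                      query_ID   = query$ID[out])
head(matches)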

Subset timeseries (date sequence) into a list

I have a dataframe with a series of dates, here's a simplified version of it:
> eventdates
dr.rank dr.start dr.end
1 14 1964-09-30 1964-10-06
2 16 1964-11-01 1964-12-24
I also have a time series of dates with values etc. associated with that, here's a much simplified version of the timeseries:
ts1964 <- data.frame(DATE = seq(from = as.Date("1964-01-01"), to = as.Date("1964-12-31"), by = "days"),
                     Q = 1:366)
What I am trying to do is subset by each date in eventdates, i.e.:
> filter(ts1964, ts1964$DATE >= eventdates[1,2] & ts1964$DATE <= eventdates[1,3])
        DATE   Q
1 1964-09-30 274
2 1964-10-01 275
3 1964-10-02 276
4 1964-10-03 277
5 1964-10-04 278
6 1964-10-05 279
7 1964-10-06 280
>
But I need to do this hundreds of times. What I would like to do is have each subset form an element in a list. I would normally consider using something like dlply from plyr, but that isn't an option here since I'm using dplyr. Could anyone advise how I might achieve this otherwise? Thanks
We can use Map
Map(function(x, y) filter(ts1964, DATE >= x & DATE <= y),
    eventdates$dr.start, eventdates$dr.end)
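If named list elements are convenient downstream, the same Map() call can be wrapped in setNames() (just a usage sketch building on the answer above, with dr.rank used for the names and dplyr loaded as in the question):
event_list <- setNames(
  Map(function(x, y) filter(ts1964, DATE >= x & DATE <= y),
      eventdates$dr.start, eventdates$dr.end),
  eventdates$dr.rank
)
event_list[["14"]]   # the subset for the first event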

10 fold cross validation using logspline in R

I would like to do 10-fold cross-validation and then use MSE for model selection in R. I can divide the data into 10 groups, but I get the following error; how can I fix it?
crossvalind <- function(N, kfold) {
  len.seg <- ceiling(N/kfold)
  incomplete <- kfold*len.seg - N
  complete <- kfold - incomplete
  ind <- matrix(c(sample(1:N), rep(NA, incomplete)), nrow = len.seg, byrow = TRUE)
  cvi <- lapply(as.data.frame(ind), function(x) c(na.omit(x))) # a list
  return(cvi)
}
I am using logspline package for estimation of a density function.
library(logspline)
x = rnorm(300, 0, 1)
kfold <- 10
cvi <- crossvalind(N = 300, kfold = 10)
for (i in 1:length(cvi)) {
  xc <- x[cvi[-i]] # x in training set
  xt <- x[cvi[i]]  # x in test set
  fit <- logspline(xc)
  f.pred <- dlogspline(xt, fit)
  f.true <- dnorm(xt, 0, 1)
  mse[i] <- mean((f.true - f.pred)^2)
}
Error in x[cvi[-i]] : invalid subscript type 'list'
cvi is a list object, so cvi[-1] and cvi[1] are also list objects, and then you try to get x[cvi[-1]], which subscripts x using a list object. That doesn't make sense, because list objects can be complex objects containing numbers, characters, dates and other lists.
Subscripting a list with single square brackets always returns a list. Use double square brackets to get the constituents, which in this case are vectors.
> cvi[1] # this is a list with one element
$V1
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
> cvi[[1]] # a length 30 vector:
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
so you can then get those elements of x:
> x[cvi[[1]]]
[1] 0.32751014 -1.13362827 -0.13286966 0.47774044 -0.63942372 0.37453378
[7] -1.09954301 -0.52806368 -0.27923480 -0.43530831 1.09462984 0.38454106
[13] -0.68283862 -1.23407793 1.60511404 0.93178122 0.47314510 -0.68034783
[19] 2.13496564 1.20117869 -0.44558321 -0.94099782 -0.19366673 0.26640705
[25] -0.96841548 -1.03443796 1.24849113 0.09258465 -0.32922472 0.83169736
this doesn't work with negative indexes:
> cvi[[-1]]
Error in cvi[[-1]] : attempt to select more than one element
So instead of trying to subscript x with the nine list elements you want to keep, subscript it with the negative of the indexes in the fold you are leaving out (since the folds partition the data):
> x[-cvi[[1]]]
will return the other 270 elements. Note that I've used 1 here for the first pass through your loop; replace it with i and insert it in your code.
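Putting the pieces together, the corrected loop might look like the sketch below: double brackets select the test fold, a negative index builds the training set, and mse is pre-allocated (which the original snippet was also missing):
library(logspline)
mse <- numeric(length(cvi))
for (i in seq_along(cvi)) {
  xc <- x[-cvi[[i]]]   # training set: everything outside fold i
  xt <- x[cvi[[i]]]    # test set: fold i
  fit <- logspline(xc)
  f.pred <- dlogspline(xt, fit)
  f.true <- dnorm(xt, 0, 1)
  mse[i] <- mean((f.true - f.pred)^2)
}
mse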

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins look like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
Taking the mean of the bin counts, I get about 900 samples per bin on average:
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin contains about 900 samples on average (e.g. (0, 27] = 900, (27, 28.5] = 900, and so on). I found something similar here, but it deals with only one variable, not the whole dataframe.
I also tried the Hmisc package, but unfortunately the bins don't contain equal frequencies:
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
df$var,
breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
include.lowest=T
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
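For example, on the simulated df above, something along these lines should give (near) equal-frequency groups; how exact it is depends on ties in the values, which is likely why cut2 looked uneven on df2$val (a hedged sketch, not tested on your data):
library(Hmisc)
df.split2 <- split(df, cut2(df$var, g = 50))   # 50 quantile-based groups
sapply(df.split2, nrow)                        # roughly 10 rows per group here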
This can be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x, n){
  # assign each value of x to one of n bins of (nearly) equal frequency
  nx <- length(x)
  nrepl <- floor(nx/n)                 # base bin size
  nplus <- sample(1:n, nx - nrepl*n)   # bins that get one extra value
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  x[order(x)] <- rep(seq.int(n), nrep) # bin labels assigned in sorted order
  x
}
data<-split(df2, EqualFreq2(df2$val, 25))
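A quick sanity check of what EqualFreq2() returns (using the simulated df from the earlier answer, since df2 itself is not shown): with 500 values and 25 groups, every bin label should occur exactly 20 times.
table(EqualFreq2(df$var, 25))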
