Finding the overlapping range of a set of vectors in R

I have a data.frame of intervals given row-wise: each interval starts in column one and ends in column two. The numbers are not integers. How can I find the overlapping range, if any, of all intervals? E.g.:
df <- cbind(c(1.5, 3, 2.1, 1), c(6, 5, 3.7, 10.1))
plot(1:11, ylim = c(0, 5), col = NA)
segments(x0 = c(1.5, 3, 2.1, 1), y0 = 1:4, x1 = c(6, 5, 3.7, 10.1), y1 = 1:4)
abline(v = 3, col = "red", lty = 2)
abline(v = 3.7, col = "red", lty = 2)
somefunc(df)
[1] 3 3.7
A nice, fast base R (or common package like dplyr etc.) solution is preferred. I already know of foverlaps (data.table) and IRanges, but they do not seem to address my problem. For bonus points, if there were interval(s) that prevented total overlap, e.g. rbind'ing c(20, 25) to df above, then the function should still return the largest possible overlap of any of the vectors, i.e. still returning c(3, 3.7).
EDIT: the solution linked by Henrik is good, but relies on generating a sequence with a given step (e.g. seq(start, end, by = 1)) and then reducing the sequences to get the intersection. My intervals may be narrower than this step. Ideally I need a solution that operates using logical comparison or something like that. The second solution in the linked page is also not quite right (see below).
EDIT EDIT: The intersection should be returned only if it is common to all ranges. Solution 2 in the post linked by Henrik groups together intervals even if not all intervals in the group intersect with every other interval

Here is a solution which seems to return the expected result for the given sample datasets.
It takes the vector of all unique interval endpoints and counts the number of intervals they are intersecting (by aggregating in a non-equi join). Among the subset of points with the maximum number of intersections, the range is taken.
library(data.table)
# enhanced dataset with 2 additional intervals
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
8.3 , 12
20 , 25")
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][N == max(N), range(lb)]
res
[1] 3.0 3.7
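For readers who want to avoid the package dependency, the same counting idea can be sketched in base R. This is only an illustration of the approach described above, not the answer's own code; the helper name common_range is made up for this example, and intervals are assumed to be closed (endpoints included).

```r
# intervals from the enhanced dataset above
lb <- c(1.5, 3, 2.1, 1, 8.3, 20)
ub <- c(6, 5, 3.7, 10.1, 12, 25)

# (a) intersection common to ALL intervals (hypothetical helper):
#     it exists only if the largest start does not exceed the smallest end
common_range <- function(lb, ub) {
  if (max(lb) <= min(ub)) c(max(lb), min(ub)) else NULL
}
common_range(lb[1:4], ub[1:4])  # 3.0 3.7 (the four intervals from the question)
common_range(lb, ub)            # NULL (the two extra intervals break total overlap)

# (b) largest overlap: count how many intervals cover each endpoint,
#     then take the range of the endpoints with the maximum count
pts <- sort(unique(c(lb, ub)))
n_cover <- vapply(pts, function(p) sum(lb <= p & p <= ub), integer(1))
res <- range(pts[n_cover == max(n_cover)])
res  # 3.0 3.7
```

Part (b) loops over the endpoints, so it does roughly length(pts) * length(lb) comparisons; for large inputs the non-equi join is likely the better choice.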
visualisation
library(ggplot2)
ggplot(dt) +
aes(x = lb, y = seq_along(lb), xend = ub, yend = seq_along(ub)) +
geom_segment() +
geom_vline(xintercept = res, col = "red", lty = 2)
EDIT: Handling of no overlaps
The OP has pointed out that the case where there are no overlaps needs to be recognized and handled separately. So I have modified the code:
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][
N == max(N), {
if (max(N) > 1) {
cat("Maximum overlaps found:", max(N), "out of", nrow(dt), "intervals\n")
range(lb)
} else {
cat("No overlaps found\n")
NULL
}
}]
This code will recognize the situation where there are no overlaps and will return NULL in these cases. In addition, a message is printed.
In all other cases, it will print an informative message, e.g.,
Maximum overlaps found: 4 out of 6 intervals
For OP's sample dataset without overlaps
dt <- data.table(lb = c(3, 6, 10), ub = c(5, 9, 15))
it will print
No overlaps found
Caveat
In case of multiple solutions, the code above will return the overall range, i.e., the start of the first interval and the end of the last interval, instead of a list of separate intervals.
Sample data for this use case:
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
11.5, 16
13 , 15
12.1, 13.7
11 , 20.1
0 , 22
")
So, there is a 5-fold overlap between 3 and 3.7 and a second 5-fold overlap between 13 and 13.7.
Furthermore, there is another use case which needs to be considered: How shall intervals be treated which overlap only in one point, i.e. one interval ends where another starts?
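One possible way to handle the multiple-solutions caveat is sketched below in base R. It is an assumption-laden illustration rather than part of the answer: the function name max_overlaps is invented here, intervals are treated as closed (so two intervals that merely touch do overlap in that single point), and the midpoints between neighbouring endpoints are tested as well so that a gap between two maximal regions is not bridged by accident.

```r
# hypothetical helper: returns every maximal-overlap region as its own row
max_overlaps <- function(lb, ub) {
  ends <- sort(unique(c(lb, ub)))
  # midpoints between neighbouring endpoints, to detect drops in coverage
  mids <- (head(ends, -1) + tail(ends, -1)) / 2
  pts <- sort(c(ends, mids))
  n_cover <- vapply(pts, function(p) sum(lb <= p & p <= ub), integer(1))
  # consecutive points with maximal coverage form one region
  runs <- rle(n_cover == max(n_cover))
  last <- cumsum(runs$lengths)
  first <- last - runs$lengths + 1
  data.frame(start = pts[first[runs$values]],
             end   = pts[last[runs$values]],
             n     = max(n_cover))
}

# sample data for the multiple-solutions use case
lb <- c(1.5, 3, 2.1, 1, 11.5, 13, 12.1, 11, 0)
ub <- c(6, 5, 3.7, 10.1, 16, 15, 13.7, 20.1, 22)
max_overlaps(lb, ub)
#   start  end n
# 1     3  3.7 5
# 2    13 13.7 5
```

If touching intervals should not count as overlapping, one option would be to drop regions where start equals end.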

Related

Slight difference between data.table's frollmean() and forecast's ma() with even order

I was deseasonalizing a time trend and I realized that data.table's frollmean() and forecast's ma() produce slightly different results with even order (ex: quarterly data, n = 4). At first, I thought the difference between frollmean(n = 4) and ma(order = 4) was just because ma() has a rounding up method. From the documentation:
k=(m-1)/2 [m = order]
When an even order is specified, the observations averaged will include one more observation from the future than the past (k is rounded up). If centre is TRUE, the value from two moving averages (where k is rounded up and down respectively) are averaged, centering the moving average.
However, as you can see below, even when averaging frollmean(n = 4) and frollmean(n = 5), the difference dif is nonzero and consistently above 0 (for this arbitrary time series). This does not occur for odd order (ex: n = 3). Any ideas why?
# toy example
library(data.table) # frollmean()
library(forecast)   # ma()
set.seed(0)
dt = data.table(x = 1:100 + 10*rnorm(100))
dt[, fm4 := frollmean(x = x, n = 4, align = "center")]
dt[, fm5 := frollmean(x = x, n = 5, align = "center")]
dt[, fm4p5 := .5 * (fm4 + fm5)]
dt[, ma4 := ma(x = x, order = 4, centre = TRUE)]
dt[, dif := fm4p5 - ma4]
plot(dt[["dif"]])
mean(dt[["dif"]], na.rm = TRUE)
I think what it means is that ma is averaging two length-4 rolling means, one slightly ahead of centre and one slightly lagging, i.e.
dt[, fm4c := (fm4+shift(fm4))/2]
dt[, sd(fm4c-ma4, na.rm = TRUE)]
#> [1] 5.599379e-15
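The arithmetic behind this can be checked without either package. The base R sketch below follows the documentation quoted in the question (two 4-term means, one leaning towards the past and one towards the future, averaged); it does not call ma() or frollmean() themselves, so it only illustrates the weights involved.

```r
set.seed(0)
x <- 1:100 + 10 * rnorm(100)
idx <- 3:(length(x) - 2)  # positions where all windows below are complete

# 4-term mean with one more observation from the past (i-2 .. i+1)
lag4  <- vapply(idx, function(i) mean(x[(i - 2):(i + 1)]), numeric(1))
# 4-term mean with one more observation from the future (i-1 .. i+2)
lead4 <- vapply(idx, function(i) mean(x[(i - 1):(i + 2)]), numeric(1))
centred <- (lag4 + lead4) / 2

# the same result written as a single weighted 5-term average
w <- c(1, 2, 2, 2, 1) / 8
weighted <- vapply(idx, function(i) sum(w * x[(i - 2):(i + 2)]), numeric(1))
all.equal(centred, weighted)  # TRUE

# averaging a 4-term and a 5-term mean uses different weights
# (0.1, 0.225, 0.225, 0.225, 0.225), so it cannot match
mean5 <- vapply(idx, function(i) mean(x[(i - 2):(i + 2)]), numeric(1))
isTRUE(all.equal((lead4 + mean5) / 2, weighted))  # FALSE
```

So a centred even-order moving average is a 5-term filter with half weights at both ends, which is not what the mean of a 4-term and a 5-term window produces.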

T-Test For Genes using Apply Function in Dataframe

I’m trying to run a t.test on two data frames.
The data frame (which I carved out of a larger data.frame) has the data I need in rows 1:143. I’ve already created sub-variables as I needed to calculate rowMeans.
> c.mRNA<-rowMeans(c007[1:143,(4:9)])
> h.mRNA<-rowMeans(c007[1:143,(10:15)])
I’m simply trying to run a t.test for each row, and then plot the p-values as histograms. This is what I thought would work…
Pvals<-apply(mRNA143.data,1,function(x) {t.test(x[c.mRNA],x[h.mRNA])$p.value})
But I keep getting an error?
Error in t.test.default(x[c.mRNA], x[h.mRNA]) :
not enough 'x' observations
I’ve got something off in my syntax and cannot figure it out for the life of me!
EDIT: I've created a data.frame so it's now just two columns, I need a p-value for each row. Below is a sample of my data...
c.mRNA h.mRNA
1 8.224342 8.520142
2 9.096665 11.762597
3 10.698863 10.815275
4 10.666233 10.972130
5 12.043525 12.140297
I tried this...
pvals=apply(mRNA143.data,1,function(x) {t.test(mRNA143.data[,1],mRNA143.data[, 2])$p.value})
But I can tell from my plot that I'm off (the plots are in a straight line).
A reproducible example would go a long way. In preparing it, you might have realized that you are trying to subset columns based on mean, which doesn't make sense, really.
What you want to do is go through rows of your data, subset columns belonging to a certain group, repeat for the second group and pass that to t.test function.
This is how I would do it.
group1 <- matrix(rnorm(50, mean = 0, sd = 2), ncol = 5)
group2 <- matrix(rnorm(50, mean = 5, sd = 2), ncol = 5)
xy <- cbind(group1, group2)
# this is just a visualization of the test you're performing
plot(0, 0, xlim = c(-5, 11), ylim = c(0, 0.25), type = "n")
curve(dnorm(x, mean = 5, sd = 2), add = TRUE)
curve(dnorm(x, mean = 0, sd = 2), add = TRUE)
out <- apply(xy, MARGIN = 1, FUN = function(x) {
# x is a vector, e.g. xy[i, ] or xy[1, ]
t.test(x = x[1:5], y = x[6:10])$p.value
})
out
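Applied to a layout like the one in the question (replicates for one group in columns 4:9 and for the other in columns 10:15), the same idea might look like the sketch below. The data here are simulated stand-ins, since the real c007 is not shown; the column names gene, chr and pos are invented for the example.

```r
set.seed(1)
n_genes <- 20
# simulated stand-in for c007: 3 annotation columns, then 6 + 6 replicate columns
expr <- matrix(rnorm(n_genes * 12, mean = 10), nrow = n_genes)
c007 <- data.frame(gene = paste0("g", seq_len(n_genes)), chr = NA, pos = NA, expr)

c_cols <- 4:9    # columns used for c.mRNA in the question
h_cols <- 10:15  # columns used for h.mRNA in the question

# one t-test per row, comparing the six replicates of each group
pvals <- vapply(seq_len(nrow(c007)), function(i) {
  t.test(unlist(c007[i, c_cols]), unlist(c007[i, h_cols]))$p.value
}, numeric(1))

head(pvals)
# hist(pvals)  # uncomment to draw the histogram of p-values
```

The key point is that the test needs the replicate values of each row, not the two row means: a single mean per group leaves nothing to estimate a variance from.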

for each row in a data frame, find whether there is a "close" row in another data frame

I have the following data frame:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(seq(0, 1, 0.1), 5, replace = T), y = sample(seq(0, 1, 0.1), 5, replace = T), z= sample(seq(0, 1, 0.1), 5, replace = T) )
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- data_frame(x = sample(seq(0, 1, 0.1), 10, replace = T), y = sample(seq(0, 1, 0.1), 10, replace = T), z= sample(seq(0, 1, 0.1), 10, replace = T) )
In this case there is no "neighbor", so I'm supposed to get "FALSE" for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows), so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs, for example I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But got an error.
Any ideas?
You can use package fuzzyjoin
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2,df[1:2,] +0.01)
df %>%
fuzzy_left_join(df2,match_fun= function(x,y) y<x+0.1 & y> x-0.1 ) %>%
mutate(found=!is.na(x.y)) %>%
select(-4:-6)
# # A tibble: 5 x 4
# x.x y.x z.x found
# <dbl> <dbl> <dbl> <lgl>
# 1 1 0.5 0.5 TRUE
# 2 1 0.8 0.7 TRUE
# 3 0.3 0.1 1 FALSE
# 4 0.9 0.7 0.2 FALSE
# 5 0.7 0.7 0.5 FALSE
Find more info here: Joining/matching data frames in R
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidean distance.
The general approach is to normalize all the attributes. Make the range for each column the same, zero to one or negative one to one. That equalizes the effect of the columns with large and small values. When more advanced approaches are used one would center the adjusted column values on zero. The test criterion is scaled the same way.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidean distance from observation1 (row1) to observation2 (row2) is sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + ...). Choose your criterion and select.
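As a small illustration of the two steps just described, the sketch below rescales each column to the range zero to one and then computes the Euclidean distance between the first two rows. The toy matrix and the helper name normalise are made up for the example.

```r
# hypothetical helper: rescale a column to the range [0, 1]
normalise <- function(v) (v - min(v)) / (max(v) - min(v))

# toy data with columns on very different scales
m <- cbind(x = c(1, 5, 9), y = c(100, 300, 200))
m01 <- apply(m, 2, normalise)

# Euclidean distance between row 1 and row 2, written out
d12 <- sqrt(sum((m01[1, ] - m01[2, ])^2))
d12  # sqrt(0.5^2 + 1^2), about 1.118

# the built-in dist() gives the same number
all.equal(d12, as.matrix(dist(m01))[1, 2])  # TRUE
```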
In your case, the selection criterion is simpler. Two observations are close if no attribute is more than 0.1 from the other observation. I assume that df and df2 have the same number of columns in the same order. I make the assumption that close observations are relatively rare. My approach is that once we discover a pair is distant, we discontinue the investigation of that pair. If you have hundreds of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each, and four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do comparisons with tearful results. R wanted matrices of the same size. The kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
Here is how I intended to present the results before transferring the results to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix describes the distance from observation in df (rows in dfclose) to observation in df2 (columns in dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that all observations in df are close to df2.
The total score starts at zero. For each column we add 1 whenever the absolute difference in that column (the per-column Manhattan distance) is .1 or more. Once the score of a pair is above zero, the two rows are no longer close, so that pair would not need to be evaluated any further.
closeCriterion <- function(origin, dest) {
manhattanDistance <- abs(origin-dest)
#print(paste("manhattanDistance =", manhattanDistance))
if (manhattanDistance < .1) ret <- 0 else ret <- 1
}
convertScore <- function(x) if (x>0) FALSE else TRUE
for (j in 1:ncol(df)) {
print(paste("col =",j))
for (i in 1:nrow(df)) {
print(paste("df row =",i))
for (k in 1:nrow(df2)) {
# print(paste("df2 row (and dflist column) =", k))
distantScore <- closeCriterion(df[i,j], df2[k,j])
#print(paste("df and dfdist row =", i, " df2 row (and dflist column) =", k, " distantScore = ", distantScore))
dfdist[i,k] <- dfdist[i,k] + distantScore
}
}
}
dfdist # have a look at the numerical results
dfclose <- matrix(sapply(dfdist, convertScore), ncol = nrow(df2)) # sapply gives a logical matrix rather than a matrix of lists
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <-matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might consider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row only has a 2.2% chance of matching. The example data is not such a good test case for the algorithm.
Best of luck
Here's one way to calculate that column without fuzzyjoin
library(tidyverse)
found <-
expand.grid(row.df = seq(nrow(df)),
row.df2 = seq(nrow(df2))) %>%
mutate(in.range = pmap_lgl(., ~ all(abs(df[.x,] - df2[.y,]) <= 0.1))) %>%
group_by(row.df) %>%
summarise_at('in.range', any) %>%
select(in.range)
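The column-by-column idea from the loop answer can also be written with outer(), which handles one column at a time without explicit row loops. This is a sketch under a few assumptions: the function name has_neighbour is invented here, both inputs have the same columns in the same order, and a tiny slack (1e-9) is added to the tolerance because differences such as 1 - 0.9 are not exactly 0.1 in floating point.

```r
# hypothetical helper: TRUE for each row of a that has a close row in b
has_neighbour <- function(a, b, tol = 0.1) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  # close[i, k] stays TRUE only while row i of a and row k of b agree
  # within tol in every column seen so far
  close <- matrix(TRUE, nrow(a), nrow(b))
  for (j in seq_len(ncol(a))) {
    close <- close & (abs(outer(a[, j], b[, j], "-")) <= tol + 1e-9)
  }
  rowSums(close) > 0
}

a <- data.frame(x = c(1, 0.3), y = c(0.5, 0.1), z = c(0.5, 1))
b <- data.frame(x = c(0.9, 0.7), y = c(0.6, 0.7), z = c(0.4, 0.5))
has_neighbour(a, b)  # TRUE FALSE
```

The intermediate matrix has nrow(a) * nrow(b) entries, so for hundreds of thousands of rows it would have to be applied to chunks of a rather than to the whole table at once.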

Check if decimal values are in a range in R

I need to re-categorize codes that represent various diseases so as to form appropriate groups for later analysis.
Many of the groupings include ranges that look like this:
1.0 to 1.5, 1.8 to 2.5, 3.0
where another might be 37.0
Originally I thought that something like this might work:
x <-c(0:.9, 1.9:2.9, 7.9:8.9, 4.0:4.9, 3:3.9, 5:5.9, 6:6.9, 11:11.9, 9:9.9, 10:10.9, 12.9, 13:13.9, 14,14.2, 14.8)
df$disease_cat[df$site_code %in% x] <- "disease a"
The problem is, 0.1,0.2 etc. are not being recognized as being in the range of 0:0.9.
I now understand that 5:10 (for example) in r is actually 5,6,7...10
What is a better way to code these intervals so that the decimals will be recognized as being in the interval 0 to 0.9? (keeping in mind that there will be many "mini" ranges and the idea of coding them all explicitly isn't particularly appealing)
You can find the answer by printing the content of c(1.1:4). The result is [1] 1.1 2.1 3.1. The thing you need is findInterval function. Check out this solution:
findInterval(c(1,2,3,4.5), c(1.1,4)) == 1
If you would like the right boundary to be inclusive, i.e. the interval [1.1, 4], you can use the rightmost.closed parameter:
findInterval(c(1,2,3,4.5), c(1.1,4), rightmost.closed = TRUE) == 1
EDIT:
Here is the solution for a more general problem you have described:
d = data.frame(disease = c('d1', 'd2', 'd3'), minValue = c(0.3, 1.2, 2.2), maxValue = c(0.6, 1.9, 2.5))
measurements = c(0.1, 0.5, 2.2, 0.3, 2.7)
findDiagnosis <- function(data, measurement) {
diagnosis = data[data$minValue <= measurement & measurement <= data$maxValue,]
if (nrow(diagnosis) == 0) {
return(NA)
} else {
return(diagnosis$disease)
}
}
sapply(measurements, findDiagnosis, data = d)
I think you want this:
c(1,2,3,4.5) >= 1.1 & c(1,2,3,4.5) <= 4
[1] FALSE TRUE TRUE FALSE
Examine the output of 1.1:4:
1.1:4
[1] 1.1 2.1 3.1
You are actually testing whether elements from your vector are exactly equal to 1.1, 2.1, or 3.1
# This is the list of ranges that you want to check
ranges = list(c(0,.9), c(1.9,2.9), c(7.9,8.9), c(4.0,4.9), c(3,3.9), c(5,5.9), c(6,6.9), c(11,11.9), c(9,9.9), c(10,10.9), c(12.9), c(13,13.9), c(14),c(14.2), c(14.8))
# These are the values that you want to check against each range in ranges
values = c(1,2,3,4.5)
# You can check each value against each range with the following command
# (<= and >= keep the boundaries inclusive, so single-value ranges such as c(14) can match)
output = data.frame(t(sapply(ranges, function(x) (min(x)<=values & max(x)>=values))))
#Maybe set column names to values so you know clearly what you are checking.
#Column names are values, row names are indexes of the ranges
colnames(output) = values
output$ranges = sapply(ranges, function(x) paste(x,collapse = "-"))
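To turn this into the recoding the question asks for, the mini-ranges of one group can be kept in a small lookup table and tested in one go. The sketch below uses the example ranges from the question (1.0 to 1.5, 1.8 to 2.5, 3.0); the function name in_any_range and the test codes are made up for illustration.

```r
# one row per mini-range; a single value is a range with lo == hi
ranges <- data.frame(lo = c(1.0, 1.8, 3.0),
                     hi = c(1.5, 2.5, 3.0))

# hypothetical helper: is each value inside at least one (inclusive) range?
in_any_range <- function(v, lo, hi) {
  vapply(v, function(z) any(z >= lo & z <= hi), logical(1))
}

codes <- c(0.9, 1.2, 1.6, 2.5, 3.0, 37.0)
in_any_range(codes, ranges$lo, ranges$hi)
# FALSE  TRUE FALSE  TRUE  TRUE FALSE
```

The logical vector can then be used for the assignment, for example df$disease_cat[in_any_range(df$site_code, ranges$lo, ranges$hi)] <- "disease a", assuming df and site_code as in the question.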

Faster way to subset data table instead of a for loop R

I have a data table (you'll need the data table package installed) in R generated with X and Y coordinates and random data values from both normal and uniform distributions. The coordinates represent points on a 2000x1600 array and has to be divided into 16 smaller "sectors" each 500x400. These sectors need their mean of Normal Distribution values taken, divided by the min^2 of the Uniform Distribution values. I also created two variables x and y using a provided function startstop, that have the coordinates for the 16 sectors and a function that calculates the numbers for each sector.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000), Norm =rnorm(1600*2000), Unif = runif(1600*2000))
sectorCalc <- function(x,y,DT) {
sector <- numeric(length = 16)
for (i in 1:length(sector)) {
sect <- DT[X %between% c(x[[1]][i],x[[2]][i]) & Y %between% c(y[[1]][i],y[[2]][i])]
sector[i] <- sCalc(sect)
}
return(sector)
}
startstop <- function(width, y = FALSE) {
startend <- width - (width/4 - 1)
start <- round(seq(0, startend, length.out = 4))
stop <- round(seq(width/4, width, length.out = 4))
if (length(c(start,stop)[anyDuplicated(c(start,stop))]) != 0) {
dup <- anyDuplicated(c(start,stop))
stop[which(stop == c(start,stop)[dup])] <- stop[which(stop == c(start,stop)[dup])] - 1
}
if (y == TRUE) {
coord <- list(rep(start, each = 4), rep(stop, each = 4))
} else if (y == FALSE) {
coord <- list(rep(start, times = 4), rep(stop, times = 4))
}
return(coord)
}
x <- startstop(2000)
y <- startstop(1600, T)
sectorNos <- sectorCalc(x,y,DT)
The startstop function isn't really an issue but I need a faster way to subset the data table. Some modifications have to be made to the 'sectorCalc' function. The for loop was the best way I could think of but I don't have too much experience with data tables. Any ideas on a faster method of breaking up the data table?
A solution using not only the package data.table but also the cut function to build the interval "groups":
# Create your test data
library(data.table)
set.seed(123) # make random numbers reproducible to allow comparison of different answers
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000), Norm =rnorm(1600*2000), Unif = runif(1600*2000))
# calculate the sector by cutting the x and y values into groups defined by the interval breaks
DT[, x.sect := cut(DT[, X], c(0, 499, 1000, 1500, 2000), dig.lab=10)] # Intervals should be: seq(0, 2000, by=500) lower bound is less one since it is not included in the interval (see help for cut function)
DT[, y.sect := cut(DT[, Y], c(0, 399, 800, 1200, 1600), dig.lab=10)] # Intervals should be: seq(0, 1600, by=400)
# Now calculate per group (calculation logic "stolen" from the working answer of user "Symbolix")
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)]
Please note: I think the size of the first and second interval is wrong in the original solution (499 instead of 500 for x and 399 instead of 400 for y so that I could not use the seq function to reproduce your desired intervals but had to enumerate the interval breaks manually).
Edit 1: I have replaced the original code that adds the x.sect and y.sect columns by an improved solution that adds columns by reference (:=).
Edit 2: If you want to order the result you have (at least) two options:
# "Chaining" (output is input of next)
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)][order(x.sect, y.sect),]
# Or: Use the "keyby" param instead of "by"
DT[, .(sect = mean(Norm)/min(Unif)^2), keyby=.(x.sect, y.sect)]
Edit 3: Added dig.lab=10 param to cut function in code above to avoid scientific notation of the interval breaks.
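To see that grouping by cut() selects the same rows as subsetting sector by sector, here is a scaled-down check in base R. It is only a sketch: the grid is 20 x 16 instead of 2000 x 1600, and the breaks are assumed to be equally spaced, which differs slightly from the 499/399 breaks used above.

```r
set.seed(123)
g <- expand.grid(X = 1:20, Y = 1:16)
g$Norm <- rnorm(nrow(g))
g$Unif <- runif(nrow(g))

# statistic from the question: mean of Norm divided by squared minimum of Unif
stat <- function(d) mean(d$Norm) / min(d$Unif)^2

# grouping approach: label each point with its sector, then split once
g$xs <- cut(g$X, seq(0, 20, by = 5))
g$ys <- cut(g$Y, seq(0, 16, by = 4))
by_cut <- sapply(split(g, list(g$xs, g$ys)), stat)

# loop approach: subset the table once per sector (x sector varies fastest)
by_loop <- numeric(0)
for (yb in 0:3) {
  for (xb in 0:3) {
    sub <- g[g$X > xb * 5 & g$X <= (xb + 1) * 5 &
             g$Y > yb * 4 & g$Y <= (yb + 1) * 4, ]
    by_loop <- c(by_loop, stat(sub))
  }
}

all.equal(unname(by_cut), by_loop)  # TRUE
```

The grouped version scans the table once, whereas the loop scans it once per sector, which is where the speed-up on the full 3.2 million rows would come from.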
To replace your sectorCalc function, I think we can make use of data.table's joins. As you are looping over each row of sector, you just have to create a data.table to join onto that holds your sector data, specify a column to join on (here I'm using key_col), and specify a 'group' variable for each row, to enable us to do the calculation at the end:
x <- startstop(2000)
y <- startstop(1600, T)
## copy the original DT
dt <- copy(DT)
dt_xy <- data.table(x_1 = x[[1]],
x_2 = x[[2]],
y_1 = y[[1]],
y_2 = y[[2]])
dt[, key_col := 1]
dt_xy[, `:=`(key_col = 1, xy_grp = seq(1,.N))]
## Use a data.table join, allowing cartesian, then filter out results.
dt_res <- dt[ dt_xy, on="key_col", allow.cartesian=T][x_1 <= X & X <= x_2 & y_1 <= Y & Y <= y_2]
## calculate 'sect' as required.
dt_sect <- dt_res[, .(sect = mean(Norm)/min(Unif)^2) , by=.(xy_grp)]
