Extract integers from ranges in R

In R, what's an efficient way to extract the integers from ranges?
Let's say I have a matrix of ranges (column 1 = start, column 2 = end):
1 5
3 6
10 13
I would like to store the unique integers covered by all the ranges in the matrix in a single object:
1
2
3
4
5
6
10
11
12
13
This would be applied to a matrix containing ~4 million ranges, so hopefully someone can offer a solution that is somewhat efficient.

Suppose you had start = 3, end = 7, and you'd marked each as a '1' on a number line starting at 1
starts: 0 0 1 0 0 0 0 0 0 ...
ends + 1: 0 0 0 0 0 0 0 1 0 ...
The cumulative sums of the starts and of the ends, and the difference between the two, are
cumsum(starts): 0 0 1 1 1 1 1 1 1 ...
cumsum(ends + 1): 0 0 0 0 0 0 0 1 1 ...
diff: 0 0 1 1 1 1 1 0 0
and the locations of the 1's in the diff are
which(diff > 0): 3 4 5 6 7
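As a quick sanity check of the idea in R, using the toy example above:
starts <- c(0, 0, 1, 0, 0, 0, 0, 0, 0)  # a start at 3
ends1 <- c(0, 0, 0, 0, 0, 0, 0, 1, 0)   # the end at 7, marked at 7 + 1
which(cumsum(starts) - cumsum(ends1) > 0)
[1] 3 4 5 6 7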
Use tabulate to allow for multiple starts / ends at the same location, and
range2 <- function(ranges) {
    max <- max(ranges)
    starts <- tabulate(ranges[, 1], max)
    ends <- tabulate(ranges[, 2] + 1L, max)
    which(cumsum(starts) - cumsum(ends) > 0L)
}
For the question, this gives
> eg <- matrix(c(1, 3, 10, 5, 6, 13), 3)
> range2(eg)
[1] 1 2 3 4 5 6 10 11 12 13
It is pretty fast; for Andrie's example below (the 1-million-row matrix xx):
> system.time(runs <- range2(xx))
user system elapsed
0.108 0.000 0.111
(this sounds a bit like DNA sequence analysis, for which GenomicRanges might be your friend; you'd use the coverage and slice functions on reads, perhaps input with readGappedAlignments).
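For the Bioconductor route, a minimal sketch assuming the IRanges package (which GenomicRanges builds on) is installed; coverage() counts how many ranges overlap each integer position:
library(IRanges)
ir <- IRanges(start = eg[, 1], end = eg[, 2])  # eg as defined above
which(coverage(ir) > 0)
[1] 1 2 3 4 5 6 10 11 12 13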

I don't know that it is particularly efficient, but if your matrix of ranges is called ranges, then the following should work:
unique(unlist(apply(ranges, 1, function(x) x[1]:x[2])))
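A closely related variant (a sketch, not benchmarked) that builds each run with Map instead of row-wise apply:
unique(unlist(Map(seq.int, ranges[, 1], ranges[, 2])))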

Use sequence and rep:
x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)
ranges <- function(x){
    len <- x[, 2] - x[, 1] + 1
    # allocate space
    a <- b <- vector("numeric", sum(len))
    a <- rep(x[, 1], len)
    b <- sequence(len) - 1
    unique(a + b)
}
ranges(x)
[1] 1 2 3 4 5 6 10 11 12 13
Since this makes use of only vectorised code, this should be quite fast, even for large data sets. On my machine an input matrix of 1 million rows takes ~5 seconds to run:
set.seed(1)
xx <- sample(1e6, 1e6)
xx <- matrix(c(xx, xx+sample(1:100, 1e6, replace=TRUE)), ncol=2)
str(xx)
int [1:1000000, 1:2] 265509 372124 572853 908206 201682 898386 944670 660794 629110 61786 ...
system.time(zz <- ranges(xx))
user system elapsed
4.33 0.78 5.22
str(zz)
num [1:51470518] 265509 265510 265511 265512 265513 ...

Is it not something as simple as:
x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)
do.call(":",as.list(range(x)))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
Edit
Looks like I got the wrong end of the stick, but my answer can be modified to use union, although this is just a wrapper for unique:
Reduce("union",apply(x,1,function(y) do.call(":",as.list(y))))
[1] 1 2 3 4 5 6 10 11 12 13
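Since do.call(":", as.list(y)) is just a programmatic spelling of y[1]:y[2], an equivalent and perhaps clearer form is:
Reduce("union", apply(x, 1, function(y) y[1]:y[2]))
[1] 1 2 3 4 5 6 10 11 12 13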

Related

How to insert elements in a vector at regular intervals in R

Is there an alternative method in R for the problem explained here: How to insert elements in a vector at regular intervals in Matlab?
Namely, from a vector x <- c(1,2,3,4,5,6,7,8,9,10,11,12), I want to obtain a vector y given by
y <- c(0, 1, 2, 3,
0, 4, 5, 6,
0, 7, 8, 9,
0,10,11,12)
... I found the following page, which may be a duplicate:
R: insert elements into vector (a variation)
Edit: I slightly modified the answer of @jay.sf. I think his interval.length is not the intuitive interval length.
x <- 1:16
interval.length <- 2
co_interval.length <- length(x)/interval.length
as.vector(t(cbind(0, matrix(x, co_interval.length, byrow=T))))
[1] 0 1 2 0 3 4 0 5 6 0 7 8 0 9 10 0 11 12 0 13 14 0 15 16
You could make a matrix and coerce it into a vector.
interval.length <- 4
as.vector(t(cbind(0, matrix(x, interval.length, byrow=T))))
# [1] 0 1 2 3 0 4 5 6 0 7 8 9 0 10 11 12
Another way is to make use of arithmetical indexing:
y <- numeric(16)
y[x + 1 + (x - 1) %/% 3] <- x
y
#> [1] 0 1 2 3 0 4 5 6 0 7 8 9 0 10 11 12
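The same indexing trick generalizes to any group size; here is a sketch with a hypothetical helper insert_zeros, where k is the number of elements between consecutive zeroes:
insert_zeros <- function(x, k) {
    i <- seq_along(x)
    y <- numeric(length(x) + ceiling(length(x) / k))  # leave room for the zeroes
    y[i + 1 + (i - 1) %/% k] <- x                     # shift each block of k right by one
    y
}
insert_zeros(1:12, 3)
#> [1] 0 1 2 3 0 4 5 6 0 7 8 9 0 10 11 12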

Removing rows/columns with only one element from a binary matrix

I'm trying to remove "singletons" from a binary matrix. Here, singletons refers to elements that are the only "1" value in the row AND the column in which they appear. For example, given the following matrix:
> matrix(c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1), nrow=6)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0 1 0 0 0 0 0
[2,] 1 0 1 0 0 0 0
[3,] 0 0 0 1 0 0 0
[4,] 1 1 0 0 0 0 0
[5,] 0 0 0 0 1 1 1
[6,] 0 0 0 0 1 0 1
...I would like to remove all of row 3 (and, if possible, all of column 4), because the 1 in [3,4] is the only 1 in that row/column combination. [1,2] is fine, since there are other 1's in column [,2]; similarly, [2,3] is fine, since there are other 1's in row [2,]. Any help would be appreciated - thanks!
You first want to find which rows and columns are singletons and then check if there are pairs of singleton rows and columns that share an index. Here is a short bit of code to accomplish this task:
foo <- matrix(c(0,1,0,...))
singRows <- which(rowSums(foo) == 1)
singCols <- which(colSums(foo) == 1)
singCombinations <- expand.grid(singRows, singCols)
singPairs <- singCombinations[apply(singCombinations, 1,
                function(x) which(foo[x[1], ] == 1) == x[2]), ]
noSingFoo <- foo[-unique(singPairs[,1]), -unique(singPairs[,2])]
With many singleton rows or columns you might need to make this a bit more efficient, but it does the job.
UPDATE: Here is the more efficient version I knew could be done. This way you loop only over the rows (or columns if desired) and not all combinations. Thus it is much more efficient for matrices with many singleton rows/columns.
## starting with foo and singRows as before
singPairRows <- singRows[sapply(singRows, function(singRow)
    sum(foo[, foo[singRow, ] == 1]) == 1)]
singPairs <- sapply(singPairRows, function(singRow)
    c(singRow, which(foo[singRow, ] == 1)))
noSingFoo <- foo[-singPairs[1, ], -singPairs[2, ]]
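One hedged caveat: if the matrix contains no singleton pairs at all, singPairs comes back empty and the indexing above will fail, so a guard along these lines may be needed:
if (length(singPairs) > 0) {
    noSingFoo <- foo[-singPairs[1, ], -singPairs[2, ]]
} else {
    noSingFoo <- foo  # nothing to remove
}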
UPDATE 2: I have compared the two methods (mine = non-sparse and @Chris's = sparse) using the rbenchmark package, over a range of matrix sizes (from 10 to 1000 rows/columns; square matrices only) and levels of sparsity (from 0.1 to 5 non-zero entries per row/column). The relative performance was shown in a heat map (not reproduced here): equal performance (log2 ratio of run times) in white, faster with the sparse method in red, and faster with the non-sparse method in blue. Note that I am not including the conversion to a sparse matrix in the performance calculation, so that will add some time to the sparse method. Just thought it was worth a little effort to see where this boundary was.
cr1msonB1ade's way is a great answer. For more computationally intensive matrices (millions x millions), you can use this method:
Encode your matrix in sparse notation:
DT <- structure(list(i = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6),
                     j = c(2, 1, 3, 4, 1, 2, 5, 6, 7, 5, 7),
                     val = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
                .Names = c("i", "j", "val"),
                row.names = c(NA, -11L), class = "data.frame")
Gives (0s are implicit)
> DT
i j val
1 1 2 1
2 2 1 1
3 2 3 1
4 3 4 1
5 4 1 1
6 4 2 1
7 5 5 1
8 5 6 1
9 5 7 1
10 6 5 1
11 6 7 1
Then we can filter using:
library(data.table)
DT <- data.table(DT)
DT[, rowcount := .N, by = i]
DT[, colcount := .N, by = j]
Giving:
> DT[!(rowcount * colcount == 1)]
i j val rowcount colcount
1: 1 2 1 1 2
2: 2 1 1 2 2
3: 2 3 1 2 1
4: 4 1 1 2 2
5: 4 2 1 2 2
6: 5 5 1 3 2
7: 5 6 1 3 1
8: 5 7 1 3 2
9: 6 5 1 2 2
10: 6 7 1 2 2
(Note the (3,4) row is now missing)
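For completeness, a sketch (base R only) of deriving that triplet encoding from the dense matrix foo in the previous answer, rather than typing it out:
idx <- which(foo == 1, arr.ind = TRUE)  # row/column positions of the 1s
DT <- data.frame(i = idx[, "row"], j = idx[, "col"], val = 1)
# (row order differs from the hand-typed version, which doesn't matter here)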

R: sample() command subject to a constraint

I am trying to randomly sample 7 numbers from 0 to 7 (with replacement), but subject to the constraint that the numbers chosen add up to 7. So for instance, the output 0 1 1 2 3 0 0 is okay, but the output 1 2 3 4 5 6 7 is not. Is there a way to use the sample command with added constraints?
I intend to use the replicate() function with the sample command as an argument, to return a list of N different vectors from the sample command. The way I am currently using the sample command (without any constraints), I need N to be very large in order to get as many possible vectors that sum to exactly 7 as possible. I figure there must be an easier way to do this!
Here is my code for that part:
x <- replicate(100000, sample(0:7, 7, replace=T))
Ideally, I want 10,000 or 100,000 vectors in x to sum to 7, but would need an enormous N value to do this. Thanks for any help.
To make sure you're sampling uniformly, you could just generate all the permutations and limit to those that sum to 7:
library(gtools)
perms <- permutations(8, 7, 0:7, repeats.allowed=T)
perms7 <- perms[rowSums(perms) == 7,]
From nrow(perms7), we see there are only 1716 possible permutations that sum to 7. Now you can uniformly sample from the permutations:
set.seed(144)
my.perms <- perms7[sample(nrow(perms7), 100000, replace=T),]
head(my.perms)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 0 0 0 2 5 0 0
# [2,] 1 3 0 1 2 0 0
# [3,] 1 4 1 1 0 0 0
# [4,] 1 0 0 3 0 3 0
# [5,] 0 2 0 0 0 5 0
# [6,] 1 1 2 0 0 2 1
An advantage of this approach is that it's easy to see that we're sampling uniformly at random. Also, it's quite quick -- building perms7 took 0.3 seconds on my computer and building a 1 million-row my.perms took 0.04 seconds. If you need to draw many vectors this will be quite a bit quicker than a recursive approach because you're just using matrix indexing into perms7 instead of generating each vector separately.
Here's a distribution of counts of numbers in the sample:
# 0 1 2 3 4 5 6 7
# 323347 188162 102812 51344 22811 8629 2472 423
Start with all zeroes, add one to any element, do 7 times:
sumTo = function(){
    v = rep(0, 7)
    for(i in 1:7){
        addTo = sample(7)[1]
        v[addTo] = v[addTo] + 1
    }
    v
}
Or equivalently, just choose which of the 7 elements you are going to increment in one sample of length 7, then tabulate those, making sure you tabulate up to 7:
sumTo = function(){tabulate(sample(7, 7, replace = TRUE), 7)}
> sumTo()
[1] 2 1 0 0 4 0 0
> sumTo()
[1] 1 3 1 0 1 0 1
> sumTo()
[1] 1 1 0 2 1 0 2
I don't know if this will produce a uniform sample from all possible combinations... (it is in fact multinomial, so compositions that can be arranged in many orders are over-weighted relative to lopsided ones).
The distribution of individual elements over 100,000 reps is:
> X = replicate(100000,sumTo())
> table(X)
X
0 1 2 3 4 5 6
237709 277926 138810 38465 6427 627 36
Didn't hit a 0,0,0,0,0,7 that time!
This recursive algorithm will output a distribution with a higher probability for large numbers than the other solutions. The idea is to throw a random number y in 0:7 in any of the seven available slots, then repeat with a random number in 0:(7-y), etc:
sample.sum <- function(x = 0:7, n = 7L, s = 7L) {
    if (n == 1) return(s)
    x <- x[x <= s]
    y <- sample(x, 1)
    sample(c(y, Recall(x, n - 1L, s - y)))
}
set.seed(123L)
sample.sum()
# [1] 0 4 0 2 0 0 1
Drawing 100,000 vectors took 11 seconds on my machine and here is the distribution I get:
# 0 1 2 3 4 5 6 7
# 441607 98359 50587 33364 25055 20257 16527 14244
There may be an easier and/or more elegant way, but here's a brute-force method using the LSPM:::.nPri function. The link includes the definition for an R-only version of the algorithm, for those interested.
#install.packages("LSPM", repos="http://r-forge.r-project.org")
library(LSPM)
# generate all possible permutations, since there are only ~2.1e6 of them
# (this takes < 40s on my 2.2 GHz laptop)
x <- lapply(seq_len(8^7), nPri, n=8, r=7, replace=TRUE)
# set each permutation that doesn't sum to 7 to NULL
y <- lapply(x, function(p) if(sum(p-1) != 7) NULL else p-1)
# subset all non-NULL permutations
z <- y[which(!sapply(y, is.null))]
Now you can sample from z and be assured that you're getting a permutation that sums to 7.
I find this question intriguing and gave it some extra thought. Another (more general) approach to sample (approximately) uniformly from all feasible solutions, without generating and storing all permutations (which is clearly not possible with many more than 7 numbers), is a simple MCMC implementation in R using sample():
S <- c(0, 1, 1, 2, 3, 0, 0) # initial solution
N <- 100 # number of dependent samples (or burn-in period)
series <- numeric(N)
for(i in 1:N){
    b <- sample(1:length(S), 2, replace=FALSE) # pick 2 elements at random
    opt <- sum(S[-b]) # sum of complementary elements
    a <- sample(0:(7-opt), 1) # sample a substitute
    S[b[1]] <- a # change elements
    S[b[2]] <- 7 - opt - a
}
S # new sample
This is of course really fast for a few samples. The "distribution":
#"distribution" N=100.000: 0 1 2 3 4 5 6 7
# 321729 189647 103206 52129 22287 8038 2532 432
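The loop above produces one vector; here is a sketch of wrapping it into a reusable sampler (hypothetical helper mcmc_draws, where each draw advances the chain N further steps, so successive draws stay mildly dependent):
mcmc_draws <- function(n_draws, N = 100) {
    S <- c(0, 1, 1, 2, 3, 0, 0)           # initial feasible solution
    out <- matrix(0, nrow = 7, ncol = n_draws)
    for (d in seq_len(n_draws)) {
        for (i in seq_len(N)) {
            b <- sample(seq_along(S), 2)  # pick 2 elements at random
            opt <- sum(S[-b])             # sum of the untouched elements
            a <- sample(0:(7 - opt), 1)   # redistribute the remainder
            S[b[1]] <- a
            S[b[2]] <- 7 - opt - a
        }
        out[, d] <- S                     # record one draw, then keep going
    }
    out
}
X <- mcmc_draws(1000)  # 1000 columns, each summing to 7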
Of course, in this case, where it's actually possible to find and store all combinations, if you want a huge sample from all feasible outcomes, just use partitions::compositions(7, 7) (as also suggested by Josh O'Brien in the comments) to avoid calculating all the permutations when only a small fraction is needed:
perms7 <- partitions::compositions(7, 7)
> tabulate(perms7[, sample(ncol(perms7), 100000, TRUE)] + 1, 8)
# "distribution", N = 100,000: 0 1 2 3 4 5 6 7
# 323075 188787 102328 51511 22754 8697 2413 435

How to create a new column with multiple values based on another column in R

I have a data frame in R called A.Data.
It has 8 different columns: plate, row, col, TOF, EXT, green, red, and yellow.
Below is an example of the data.
> head(A.Data)
plate row col TOF EXT green red yellow
1 1 A 12 20 21 2 0 0
2 1 C 12 20 17 0 1 0
3 1 C 11 20 17 0 0 1
4 1 A 10 20 16 1 1 3
5 1 A 10 20 16 0 0 0
6 1 A 10 20 15 0 0 0
I'm trying to add a new column to A.Data called conc (short for concentration).
The new column called conc depends on the value in the col column.
-If col is 1 or 7, conc should equal to 0
-If col is 2 or 8, conc should equal to 0.5
-If col is 3 or 9, conc should equal to 1
-If col is 4 or 10, conc should equal to 2
-If col is 5 or 11, conc should equal to 4
-If col is 6 or 12, conc should say NA
So for the first 6 rows of data, the conc column should say NA, NA, 4, 2, 2, 2 because the col column values for the first 6 rows are 12, 12, 11, 10, 10, 10.
I asked my professor for help and he gave me this hint:
df$newcol <- rep(1, 1000) will add a new column to the df data frame called newcol and will have 1 replicated 1000 times
Try to add a concentration column called conc with 0, 0.5, 1, 2, 4, NA replicated as many times as you need for the entire column.
Here is the summary of A.Data$col, in case you might find it useful...
> summary (A.Data$col)
1 2 3 4 5 6 7 8 9 10 11 12 NA's
1128 703 538 256 156 30 2101 1039 741 294 73 60 11
Thank you!
Not tested, but this may work
map_column <- rep(c(0, 0.5, 1, 2, 4, NA),2)
df$newcol <- map_column[df$col]
EDIT: The idea behind this code: map_column, a vector of length 12, serves here as a map (in the mathematical sense) between the numbers 1 to 12 and the values in the vector. For instance,
map_column[[1]]
returns the first element of the vector (0), and
map_column[[9]]
returns the 9th element of the vector (1), and so on. Now R vectors have the capability to process several inputs at once, so that
map_column[c(1,9)]
returns the corresponding elements (c(0,1)) at these positions in one go. Note that it is important to use a single square bracket [ instead of [[ here.
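Applied to the example data, where the first six values of col are 12, 12, 11, 10, 10, 10, this lookup gives exactly the expected concentrations:
map_column <- rep(c(0, 0.5, 1, 2, 4, NA), 2)
map_column[c(12, 12, 11, 10, 10, 10)]
# [1] NA NA  4  2  2  2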
This works.
convert <- function(number){
    if(number == 1 | number == 7){return(0)}
    if(number == 2 | number == 8){return(.5)}
    if(number == 3 | number == 9){return(1)}
    if(number == 4 | number == 10){return(2)}
    if(number == 5 | number == 11){return(4)}
    if(number == 6 | number == 12){return(NA)}
}
A.Data$newcol <- do.call(rbind, lapply(A.Data$col, convert))
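A small aside: do.call(rbind, ...) yields a one-column matrix rather than a plain vector, and convert will error if col contains NA (the summary above shows 11 NA's), so a sketch of a slightly more defensive variant (hypothetical helper convert_safe) might be:
convert_safe <- function(number){
    if(is.na(number)){return(NA)}  # pass NAs through instead of erroring
    convert(number)
}
A.Data$newcol <- sapply(A.Data$col, convert_safe)  # plain vector instead of a matrix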
Use merge.
augment <- data.frame(col=1:12,conc=rep(c(0, 0.5, 1, 2, 4, NA),2))
A.Data <-merge(A.Data,augment,by="col",sort=F)
A.Data
# col plate row TOF EXT green red yellow conc
# 1 12 1 A 20 21 2 0 0 NA
# 2 12 1 C 20 17 0 1 0 NA
# 3 11 1 C 20 17 0 0 1 4
# 4 10 1 A 20 16 1 1 3 2
# 5 10 1 A 20 16 0 0 0 2
# 6 10 1 A 20 15 0 0 0 2
This creates an augment data frame with two columns: col, corresponding to col in A.Data, and conc with the new values. Then merge that with A.Data based on col.
Here's a very different approach based on mathematical and logical operations:
x <- c(1:12, NA) # an example vector including all possible values
floor(2 ^ (z <- x %% 6 - 2)) / 2 * (z + 2 | NA)
The result:
[1] 0.0 0.5 1.0 2.0 4.0 NA 0.0 0.5 1.0 2.0 4.0 NA NA
Here x %% 6 - 2 maps columns 1:5 and 7:11 onto -1:3, so floor(2^z) / 2 yields 0, 0.5, 1, 2, 4; for columns 6 and 12, z + 2 is 0, and 0 | NA evaluates to NA, which turns the whole product into NA. (I fear that this solution may appear like obfuscation.)
