Implementation of an optimised 2-dimensional subset sum problem in R

I am dealing with a variation of the well-known subset sum problem and need some help with it. I have a matrix m with two columns (a, b) and n rows. I want to find the rows whose a values sum to a_target and whose b values sum to b_target. The values a, b, a_target and b_target are all positive integers. I only need the first solution that meets both targets or, if no solution meets them, the closest one, where closeness is defined as the sum of the errors across the two targets. As this method will be run on large datasets, the solution needs to be optimised.
The problem could be set up as follows:
m <- matrix(data=sample(1:100, 200, replace=T),
ncol=2,
dimnames = list(
NULL,c("a","b")
))
head(m)
a b
[1,] 44 80
[2,] 51 24
[3,] 31 68
[4,] 46 55
[5,] 34 98
[6,] 93 49
a_target <- 500
b_target <- 700
To give some background, the ordinary subset sum problem asks for any subset of a set of integers that sums to some target t, and it is NP-complete. There are multiple methods for solving it, with varying time optimisations. One such package in R is subsetsum (see its documentation). I have taken code from this package with the aim of modifying it for my problem, but I'm not sure whether that is possible. For instance, this solution requires t to be in increasing order to work, so I'm not sure how applicable it would be with two t values. The code for the case where t relates to a single column, i.e. a vector, is:
subsetsum <- function(S, t) {
n <- length(S)
inds <- NULL
x <- logical(n)
F <- numeric(t + 1)
G <- logical(t + 1)
G[1] <- TRUE
print(paste("n,inds,x,F,G",n,inds,x,F,G))
for (k in 1:n) {
H <- c(logical(S[k]), G[1:(t + 1 - S[k])])
H <- (G < H)
j <- which(H)
F[j] <- k
G[j] <- TRUE
if (G[t + 1]) break
}
wch <- which(G)
j <- wch[length(wch)]
fmax <- j - 1
while (j > 1) {
k <- F[j]
x[k] <- TRUE
j <- j - S[k]
}
inds <- which(x)
return(list(val = sum(S[inds]), inds = inds))
}
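One way the single-target routine above might be extended to two targets is to track reachable (a, b) sum pairs in a logical matrix rather than a logical vector. The following is only a sketch under stated assumptions, not code from the subsetsum package: the name subsetsum2d is hypothetical, only sums that stay at or below both targets are tracked, and the "closest" solution is taken to be the reachable pair with the smallest total shortfall. Time and memory grow with a_target * b_target, so this is practical only for moderate targets.

```r
# Hypothetical helper: 0/1 subset sum over pairs (a[k], b[k]).
# Assumes a, b, a_target, b_target are positive integers.
subsetsum2d <- function(a, b, a_target, b_target) {
  n <- length(a)
  # G[i + 1, j + 1] is TRUE when the sum pair (i, j) is reachable
  G <- matrix(FALSE, a_target + 1, b_target + 1)
  # F[i + 1, j + 1] stores the row index that first reached (i, j)
  F <- matrix(0L, a_target + 1, b_target + 1)
  G[1, 1] <- TRUE
  for (k in seq_len(n)) {
    if (a[k] > a_target || b[k] > b_target) next
    # Shift the reachable set by (a[k], b[k]); G is read before it is updated,
    # so each row is used at most once
    H <- matrix(FALSE, a_target + 1, b_target + 1)
    ri <- (a[k] + 1):(a_target + 1)
    ci <- (b[k] + 1):(b_target + 1)
    H[ri, ci] <- G[ri - a[k], ci - b[k]]
    new <- H & !G
    F[new] <- k
    G[new] <- TRUE
    if (G[a_target + 1, b_target + 1]) break  # exact hit on both targets
  }
  # Pick the reachable pair with the smallest combined shortfall
  reach <- which(G, arr.ind = TRUE)
  err <- (a_target + 1 - reach[, 1]) + (b_target + 1 - reach[, 2])
  best <- reach[which.min(err), ]
  i <- best[1]
  j <- best[2]
  # Walk back through F to recover the chosen rows
  inds <- integer(0)
  while (i > 1 || j > 1) {
    k <- F[i, j]
    inds <- c(k, inds)
    i <- i - a[k]
    j <- j - b[k]
  }
  list(a_sum = sum(a[inds]), b_sum = sum(b[inds]), inds = inds)
}

res <- subsetsum2d(a = c(3, 5, 2, 7), b = c(4, 1, 6, 2), a_target = 10, b_target = 6)
res$inds
```

With the matrix m from the question, the call would be subsetsum2d(m[, "a"], m[, "b"], a_target, b_target). Allowing sums that overshoot the targets would need a larger state matrix and is left out of this sketch.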

Related

How to modify non-zero elements of a large sparse matrix based on a second sparse matrix in R

I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
The syntax mat[rn, cn] (whether mat is a dense or a sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select only (rn[1], cn[1]), (rn[2], cn[2]), and so on. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
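The same distinction applies to assignment, which is where the cost shows up in the question. As a small illustrative sketch (the copies A and B are made up for this demo), the two forms overwrite a different number of cells:

```r
A <- matrix(1:9, 3, 3)
B <- A
rn <- 1:2
cn <- 2:3

A[rn, cn] <- 0L          # overwrites the full 2 x 2 block (4 cells)
B[cbind(rn, cn)] <- 0L   # overwrites only (1, 2) and (2, 3) (2 cells)

c(block = sum(A == 0L), pairs = sum(B == 0L))
```

With millions of index pairs, the block form addresses length(rn) * length(cn) cells, which is consistent with the assignment in the question not finishing.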
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame giving the (i, j, x) triplets, where both i and j are 1-based indices. You could have worked with this output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there an even better solution? Yes! Note that the nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
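To make the slot-based idea concrete, here is a tiny sketch on two 3 x 3 sparse matrices that share the same nonzero pattern (the toy values are made up for illustration). It also calls drop0(), which is not part of the approach above: assigning 0 into the x slot leaves explicitly stored zeros behind, and drop0() removes them if that matters for later computations.

```r
library(Matrix)

# Two toy matrices with identical nonzero positions
m1 <- sparseMatrix(i = c(1, 2, 3), j = c(1, 3, 2), x = c(0.01, 0.5, 0.03))
m2 <- sparseMatrix(i = c(1, 2, 3), j = c(1, 3, 2), x = c(3, 2, 1))

# Work directly on the stored values; positions line up because the
# sparsity patterns are the same
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0

# Optional: remove the explicit zeros now stored in m1
m1 <- drop0(m1)

m1
m2
```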
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.

Partial sorting of a vector

Say I have a vector of random numbers, I can order them lowest to highest:
set.seed(1)
x <- runif(20)
v <- x[order(x)]
Now, say I want to order them but with some degree of noise.
I can randomly move elements like this:
z <-sample(1:20,2)
replace(v, z, v[rev(z)])
but this doesn't necessarily move closely related values. I could be equally likely to randomly switch the 1st and 20th values as the 5th and 6th. I would like to have some control over the switching, so I can switch more closely related values.
Ideally, I would be able to reorder the vector to have a specific Spearman's correlation. Say rather than the Spearman correlation of rank order being 1 when they are perfectly ordered, is there a way to reorder that same vector of numbers to have e.g. a Spearman's correlation of 0.5 ?
What if you added some noise to their rankings? This makes sure values don't get moved too far away from their starting point. For example:
set.seed(1)
N <- 50
D <- 3 # controls how far things can move
x <- runif(N)
v <- x[vx <- order(rank(x) + runif(N, -D, D))]
z <- x[order(x)]
layout(matrix(c(1,3,2,3), nrow=2))
plot(z, main ="Ordered")
plot(v, main ="Mixed")
plot(z, v, xlab="ordered", ylab="mixed"); abline(0,1)
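To see how the noise width relates to the Spearman correlation asked about in the question, the rank-noise idea can be wrapped in a small helper and evaluated for several widths. This is only a sketch: the helper name noisy_rho and the chosen values of D are assumptions for illustration, and the exact correlations depend on the random seed.

```r
set.seed(1)
N <- 50
x <- runif(N)

# Spearman correlation between the sorted vector and a rank-noised ordering
noisy_rho <- function(D) {
  v <- x[order(rank(x) + runif(N, -D, D))]
  cor(v, sort(x), method = "spearman")
}

# Larger D lets elements move further, so the correlation tends to drop
sapply(c(0, 3, 10, 30), noisy_rho)
```

A target correlation such as 0.5 could then be approached by searching over D, although the result is random for any single draw.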
I don't think I have completely understood your question, but here's a start. I am simply swapping random consecutive values of the sorted vector, one pair after another. You can control the amount of swapping with the n_swaps argument.
noisy_sort <- function(x, n_swaps) {
sorted_x <- sort(x)
indices <- sample(seq_along(x[-1]), n_swaps)
for(i in indices) {
sorted_x[c(i, i+1)] <- sorted_x[c(i+1, i)]
}
sorted_x
}
set.seed(1)
x <- runif(20)
result <- noisy_sort(x, 3)
order(result)
[1] 1 2 3 5 4 6 7 8 9 10 11 13 12 14 15 16 17 19 18 20
^ ^ ^ ^ ^ ^
Here is a very rudimentary algorithm.
Using the formula for the Spearman correlation of distinct ranks, you can back out the desired sum of squared differences (SSE) between ranks. Then, using a Markov chain Monte Carlo (MCMC) style approach, you sample a pair of indices to swap and move to the new vector with swapped elements if that brings the SSE closer to the desired value.
I used the number of iterations as the stopping criterion. You can change the condition so that it meets a target tolerance level.
set.seed(1)
n <- 20
x <- runif(n)
v <- sort(x)
calc_exp_sse <- function(rho, N) {
(1 - rho) * N * (N^2 - 1) / 6
}
exp_sse <- calc_exp_sse(0.5, n)
ord <- 1:n
vec <- ord
for (i in 1:1000) {
swap <- vec
swid <- sample(n, 2L)
swap[swid] <- swap[c(swid[2L], swid[1L])]
if (abs(exp_sse - sum((ord-swap)^2)) < abs(exp_sse - sum((ord-vec)^2))) {
vec <- swap
}
}
vec
cor(vec, ord, method="spearman")
#[1] 0.5007519
cor(v, v[vec], method="spearman")
#[1] 0.5007519
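The relation used in calc_exp_sse is the standard Spearman formula for distinct ranks, rho = 1 - 6 * SSE / (N * (N^2 - 1)), solved for SSE. A quick sketch to check this numerically on an arbitrary permutation (the seed and the permutation are made up for the check):

```r
set.seed(42)
n <- 20
ord <- 1:n
perm <- sample(n)                      # arbitrary permutation of the ranks

rho <- cor(ord, perm, method = "spearman")
sse <- sum((ord - perm)^2)             # sum of squared rank differences

# The SSE implied by rho should match the SSE computed directly
all.equal(sse, (1 - rho) * n * (n^2 - 1) / 6)
```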

Computing pairwise Hamming distance between all rows of two integer matrices/data frames

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row to df1 in terms of hamming distance.
I used e1071 package to compute hamming distance. Hamming distance between two vectors x and y can be computed as for example:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method takes a double loop nest and looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.
Thanks.
Fast computation of hamming distance between two integer vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute the hamming distance between two integer vectors x and y of equal length. This only uses base R, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
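As a side note, the same count can be obtained by XOR-ing the integers first with base R's bitwXor() and then counting the set bits. This is an alternative sketch, not the approach used above; hmd_xor is a hypothetical name, and the inputs must be integer vectors.

```r
# Version from above: XOR the bit representations
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

# Alternative sketch: XOR the integers, then count the set bits
hmd_xor <- function(x, y) sum(as.logical(intToBits(bitwXor(x, y))))

# First three values of x and y from the question; only the second pair differs
x <- c(356739L, 324074L, 904133L)
y <- c(356739L, 324042L, 904133L)

hmd0(x, y)
hmd_xor(x, y)
```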
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
Note that:
hmd performs computation column-wise. It is designed to be CPU cache friendly. In this way, if we want to do some row-wise computation, we should transpose the matrix first;
there is no obvious loop here; instead, we use tapply().
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices, df1 and df2, and computes the distance between df1 and each row of df2. The argument p is an integer giving the number of results to retain: p = 3 keeps the 3 smallest distances together with their row ids in df1.
foo <- function(df1, df2, p) {
## check p: at most nrow(df1) matches are available for each row of df2
if (p > nrow(df1)) p <- nrow(df1)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
k <- 1:p
for (i in 1:n) {
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
k <- k + p
}
## recode "id" and "d" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
list(id = id, d = d)
}
Note that:
transposition is done at the beginning, for the reasons given above;
a for loop is used here. This is actually efficient, because a considerable amount of computation is done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (row id id and distance d).
Experiment
This part uses a small dataset to test and demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
# min.1 min.2
# 1 1 4
# 2 2 3
# 3 5 2
# $d
# mindist.1 mindist.2
# 1 2 2
# 2 1 3
# 3 1 3
If you want to append some columns to df2, you know what to do, right?
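For readers who would like it spelled out, one possible way is a plain cbind(). The sketch below does not call foo; it uses a hand-made list res with the same shape as foo's return value (data frames id and d), and the toy values are placeholders rather than real results.

```r
# Toy stand-ins: df2 and a result shaped like the list returned by foo(df1, df2, p = 2)
df2 <- data.frame(V1 = c(1L, 2L, 3L), V2 = c(4L, 5L, 6L))
res <- list(
  id = data.frame(min.1 = c(1L, 2L, 5L), min.2 = c(4L, 3L, 2L)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

# Append the row ids and distances as new columns of df2
df2_out <- cbind(df2, res$id, res$d)
df2_out
```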
Please don't be surprised that I am adding another section. This part covers something related. It is not what the OP asks for, but it may help other readers.
General hamming distance computation
In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:
set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)
we will end up with a scalar:
hmd0(x,y)
# 13
What if we want to compute pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do:
hamming.distance <- function(x, y, pairwise = TRUE) {
nx <- length(x)
ny <- length(y)
rawx <- intToBits(x)
rawy <- intToBits(y)
if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x),intToBits(y)))))
if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
bits <- length(intToBits(0)) ## number of bits per integer (32 in R)
result <- unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nx, each = bits), sum))
}
if (pairwise) result else sum(result)
}
Now
hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
Hamming distance matrix
If we want to compute the hamming distance matrix, for example,
set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)
The distance matrix between x and y is:
outer(x, y, hamming.distance) ## pairwise argument has no effect here
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 2 3 4 3 4 4 2
# [2,] 7 6 3 4 3 3 3
# [3,] 4 5 4 3 6 4 2
# [4,] 2 3 2 5 6 4 2
# [5,] 4 3 4 3 2 0 2
We can also do:
outer(x, x, hamming.distance)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 5 2 2 4
# [2,] 5 0 3 5 3
# [3,] 2 3 0 2 4
# [4,] 2 5 2 0 4
# [5,] 4 3 4 4 0
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R, and should be very fast, especially when your df1 and df2 have many rows. The main reason for this is that it does not use any R-level looping for calculating the Hamming distances, such as for-loops, while-loops, or *apply functions. Instead, it uses matrix multiplication for computing the Hamming distance. In R, this is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) It uses partial sorting for finding the best two matches for each row in df2, and (2) It stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in one single step, without using any R-level loops.
The function that does all the work:
# INPUT:
# X corresponds to your entire df1, but is a matrix
# Y corresponds to your entire df2, but is a matrix
# OUTPUT:
# Matrix with four columns corresponding to the values
# that you specified in your question
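fun <- function(X, Y) {
# Record the dimensions before X and Y are overwritten
nx <- nrow(X)
ny <- nrow(Y)
nc <- ncol(X)
# Convert integers to bits
X <- intToBits(t(X))
# Reshape into matrix: one column of bits per row of the input
dim(X) <- c(nc * 32, nx)
# Convert integers to bits
Y <- intToBits(t(Y))
# Reshape into matrix
dim(Y) <- c(nc * 32, ny)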
# Calculate pairwise hamming distances using matrix
# multiplication.
# Columns of H index into Y; rows index into X.
# The code for the hamming() function was retrieved
# from this page:
# https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
H <- hamming(X, Y)
# Now, for each row in Y, find the two best matches
# in X. In other words: for each column in H, find
# the two smallest values and their row indices.
t(apply(H, 2, function(h) {
mindists <- sort(h, partial = 1:2)
c(
ind1 = which(h == mindists[1])[1],
val1 = mindists[1],
ind2 = which(h == mindists[2])[1],
val2 = mindists[2]
)
}))
}
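The function above relies on a hamming() helper that is not reproduced here. As a rough idea of how a matrix-multiplication Hamming distance on bit matrices can work, here is a minimal sketch. It is not the code from the linked page; the name hamming_sketch is hypothetical, and it assumes X and Y are raw bit matrices with one column per item, as produced by intToBits() plus dim<-.

```r
# Sketch: pairwise Hamming distances between the columns of two bit matrices.
# Entry [i, j] counts positions where column i of X and column j of Y differ.
hamming_sketch <- function(X, Y) {
  Xn <- matrix(as.integer(X), nrow = nrow(X))
  Yn <- matrix(as.integer(Y), nrow = nrow(Y))
  # (bit set in X, unset in Y) + (bit unset in X, set in Y)
  crossprod(Xn, 1 - Yn) + crossprod(1 - Xn, Yn)
}

# Small check on a few integers
x <- c(5L, 12L, 7L)
y <- c(5L, 3L)
X <- intToBits(x); dim(X) <- c(32, length(x))
Y <- intToBits(y); dim(Y) <- c(32, length(y))
H <- hamming_sketch(X, Y)
H
```

Rows of H index into x and columns into y, matching the convention in the comments of fun.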
To call the function on some random data:
# Generate some random test data with no. of columns
# corresponding to your data
nrows <- 1000
ncols <- 26
# X corresponds to your df1
X <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
# Y corresponds to your df2
Y <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
res <- fun(X, Y)
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.

Loop While condition is TRUE

I am trying to generate n random numbers whose sum is less than 1.
So I can't just run runif(3). But I can condition each iteration on the sum of all values generated up to that point.
The idea is to start with an empty vector, v, and set up a loop such that for each iteration, i, a value is generated with runif(). Before it is accepted as an element of v, i.e. v[i] <- runif(1), the test sum(v) > 1 is carried out. If the test is FALSE, the entry v[i] is accepted; if it is TRUE, that is, the sum is greater than 1, v[i] is tossed out of the vector and iteration i is repeated.
I am far from implementing this idea, but I would like to resolve it along the lines of something similar to what follows. It's not so much a practical problem, but more of an exercise to understand the syntax of loops in general:
n <- 4
v <- 0
for (i in 1:n){
rdom <- runif(1)
if((sum(v) + rdom) < 1) v[i] <- rdom
# otherwise: keep trying before moving on to iteration i + 1???? i <- stays i?????
}
I have looked into while (in fact I included it in the title); however, I need the vector to have n elements. I get stuck if I try something that basically tells R to add random uniform realizations as elements of v while sum(v) < 1, because I can end up with fewer than n elements in v.
Here's a possible solution. It originally used the more generic repeat instead of while; I edited it to use while, which saves a couple of lines.
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
while (i < n) {
ith <- runif(1)
if (sum(c(v, ith)) < 1) {
i <- i+1
v[i] <- ith
}
}
v
# [1] 0.89669720 0.06178627 0.01339033 0.02333120
With a repeat block you must check the stopping condition yourself but, with the vector-growing problem removed, it looks very similar:
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
repeat {
ith <- runif(1)
if (sum(c(v, ith)) < 1) {
i <- i+1
v[i] <- ith
}
if (i == n) break
}
If you really want to keep exactly the same procedure that you have posted (i.e. iteratively sample the n values one at a time from the standard uniform distribution, rejecting any samples that cause your sum to exceed 1), then the following code is mathematically equivalent, shorter, and more efficient:
samp <- function(n) {
v <- rep(0, n)
for (i in 1:n) {
v[i] <- runif(1, 0, 1-sum(v))
}
v
}
Basically, this code uses the mathematical fact that if the sum of the vector is currently sum(v), then sampling from the standard uniform distribution until you get a value no greater than 1-sum(v) is exactly equivalent to sampling in the uniform distribution from 0 to 1-sum(v). The advantage of using the latter approach is that it's much more efficient -- we don't need to keep rejecting samples and trying again, and can instead just sample once for each element.
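The equivalence can be checked empirically for a single step. The sketch below fixes a hypothetical running total s (standing in for sum(v)), draws one value by rejection and one directly, and compares the two samples; the seed, s and the sample size are arbitrary choices for illustration.

```r
set.seed(1)
s <- 0.7  # hypothetical running total, standing in for sum(v)

# Rejection: redraw from U(0, 1) until the value fits under the limit
rejection <- replicate(5000, {
  r <- runif(1)
  while (s + r > 1) r <- runif(1)
  r
})

# Direct: draw once from U(0, 1 - s)
direct <- runif(5000, 0, 1 - s)

# Both samples should look like U(0, 0.3)
rbind(rejection = summary(rejection), direct = summary(direct))
```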
To get a sense of the runtime differences, consider sampling 100 observations with n=10, comparing to a working implementation of the code from your post (copied from my other answer to this question):
OP <- function(n) {
v <- rep(0, n)
for (i in 1:n){
rdom <- runif(1)
while (sum(v) + rdom > 1) rdom <- runif(1)
v[i] <- rdom
}
v
}
set.seed(144)
system.time(samples.OP <- replicate(100, OP(10)))
# user system elapsed
# 261.937 1.641 265.805
system.time(samples.josliber <- replicate(100, samp(10)))
# user system elapsed
# 0.004 0.001 0.004
In this case, the new approach is more than 60,000 times faster.
It sounds like you're trying to uniformly sample from a space of n variables where the following constraints hold:
x_1 + x_2 + ... + x_n <= 1
x_1 >= 0
x_2 >= 0
...
x_n >= 0
The "hit and run" algorithm is the mathematical machinery that enables you to do exactly this. In 2-dimensional space, the algorithm samples uniformly from the triangle with vertices (0, 0), (1, 0) and (0, 1), with each location in that area being equally likely to be selected.
The algorithm is provided in R through the hitandrun package, which requires you to specify the linear inequalities that define the space through a constraint matrix, direction vector, and right-hand side vector:
library(hitandrun)
n <- 3
constr <- list(constr = rbind(rep(1, n), -diag(n)),
dir = c(rep("<=", n+1)),
rhs = c(1, rep(0, n)))
set.seed(144)
samples <- hitandrun(constr, n.samples=1000)
head(samples, 10)
# [,1] [,2] [,3]
# [1,] 0.28914690 0.01620488 0.42663224
# [2,] 0.65489979 0.28455231 0.00199671
# [3,] 0.23215115 0.00661661 0.63597912
# [4,] 0.29644234 0.06398131 0.60707269
# [5,] 0.58335047 0.13891392 0.06151205
# [6,] 0.09442808 0.30287832 0.55118290
# [7,] 0.51462261 0.44094683 0.02641638
# [8,] 0.38847794 0.15501252 0.31572793
# [9,] 0.52155055 0.09921046 0.13304728
# [10,] 0.70503030 0.03770875 0.14299089
Breaking down this code a bit, we generated the following constraint matrix:
constr
# $constr
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] -1 0 0
# [3,] 0 -1 0
# [4,] 0 0 -1
#
# $dir
# [1] "<=" "<=" "<=" "<="
#
# $rhs
# [1] 1 0 0 0
Reading across the first line of constr$constr we have 1, 1, 1 which indicates "1*x1 + 1*x2 + 1*x3". The first element of constr$dir is <=, and the first element of constr$rhs is 1; putting it together we have x1 + x2 + x3 <= 1. From the second row of constr$constr we read -1, 0, 0 which indicates "-1*x1 + 0*x2 + 0*x3". The second element of constr$dir is <= and the second element of constr$rhs is 0; putting it together we have -x1 <= 0 which is the same as saying x1 >= 0. The similar non-negativity constraints follow in the remaining rows.
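The reading of the constraint rows can also be checked with plain matrix arithmetic, without the hitandrun package. In this sketch the helper name satisfies and the three test points are made up for illustration; a point is feasible when every row of A %*% x is at most the matching entry of rhs.

```r
n <- 3
A <- rbind(rep(1, n), -diag(n))   # same matrix as constr$constr
rhs <- c(1, rep(0, n))            # same vector as constr$rhs

# TRUE when x meets every "<=" constraint
satisfies <- function(x) all(A %*% x <= rhs)

satisfies(c(0.2, 0.3, 0.4))    # inside the region
satisfies(c(0.6, 0.5, 0.1))    # violates x1 + x2 + x3 <= 1
satisfies(c(-0.1, 0.3, 0.4))   # violates x1 >= 0
```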
Note that the hit and run algorithm has the nice property of having the exact same distribution for each of the variables:
hist(samples[,1])
hist(samples[,2])
hist(samples[,3])
Meanwhile, the distribution of the samples from your procedure will be highly uneven, and as n increases this problem will get worse and worse.
OP <- function(n) {
v <- rep(0, n)
for (i in 1:n){
rdom <- runif(1)
while (sum(v) + rdom > 1) rdom <- runif(1)
v[i] <- rdom
}
v
}
samples.OP <- t(replicate(1000, OP(3)))
hist(samples.OP[,1])
hist(samples.OP[,2])
hist(samples.OP[,3])
An added advantage is that the hit-and-run algorithm appears faster -- I generated these 1000 replicates in 0.006 seconds on my computer with hit-and-run and it took 0.3 seconds using the modified code from the OP.
Here's how I would do it, without any loop, if or while:
set.seed(123)
x <- runif(1) # start with the sum that you want to obtain
n <- 4 # number of generated random numbers, can be chosen arbitrarily
y <- sort(runif(n-1,0,x)) # choose n-1 random points to cut the range [0:x]
z <- c(y[1],diff(y),x-y[n-1]) # result: determine the length of the segments
#> z
#[1] 0.11761257 0.10908627 0.02723712 0.03364156
#> sum(z)
#[1] 0.2875775
#> all.equal(sum(z),x)
#[1] TRUE
The advantage here is that you can determine exactly which sum you want to obtain and how many numbers n you want to generate for this. If you set, e.g., x <- 1 in the second line, the n random numbers stored in the vector z will add up to one.
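The same cut-point idea can be wrapped in a small function. This is a sketch with a hypothetical name, split_total, which takes the number of pieces and the total they should add up to.

```r
# Split `total` into n non-negative pieces using n - 1 random cut points
split_total <- function(n, total) {
  cuts <- sort(runif(n - 1, 0, total))
  diff(c(0, cuts, total))   # lengths of the segments between cut points
}

set.seed(123)
z <- split_total(4, 0.9)
z
sum(z)
```

To keep the sum strictly below 1, as in the question, total could itself be drawn with runif(1), as in the code above.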

How to find the most similar columns in a matrix?

I have a matrix in which I would like to find the columns that are very similar (I am not looking for identical columns).
# to generate a matrix
Mat<- matrix(rexp(200, rate=.1), ncol=1000, nrow=400)
I thought of using "cor" or "all.equal" and tried the following, but it did not work.
indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))
What I need as output is which columns are highly similar and their degree of similarity (it can be the correlation coefficient).
Similar means their values are similar within some threshold (for example, over 75% of the residuals, e.g. column1 - column2, are less than 0.5 in absolute value).
I would also love to see how this differs from being correlated. Do they give identical results?
Using correlation you could try (with a simpler matrix for demonstration)
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)
corr <- cor(Mat)
res <-which(lower.triangle(corr)>.3, arr.ind = TRUE)
data.frame(res[res[,1] != res[,2],], correlation = corr[res[res[,1] != res[,2],]])
row col correlation
1 8 1 0.3387738
2 6 2 0.3350891
Both row and col actually refer to the columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738
I'd take a linear regression approach:
Mat<- matrix(rexp(200, rate=.1), ncol=100, nrow=400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for(i in 1:ncol(combinations)){
sigma <- c(sigma, summary(lm(Mat[,combinations[1,i]] ~ Mat[,combinations[2,i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))
The residual standard error can then serve as an optional criterion.
You can further order the data frame by sigma to get the best and worst combinations.
If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:
set.seed(1)
Mat <- matrix(runif(40000), ncol=100, nrow=400)
col.combs <- t(combn(1:ncol(Mat), 2))
similar <- data.frame(Col1=NULL, Col2=NULL, Corr=NULL, Pct.Diff=NULL)
# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
i <- col.combs[k, 1]
j <- col.combs[k, 2]
# Difference within threshold?
diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)
pair.corr <- cor(Mat[, i], Mat[, j])
if (mean(diff.thresh) > 0.75)
similar <- rbind(similar, c(i, j, pair.corr, 100*mean(diff.thresh)))
}
In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.
> head(similar)
Col1 Col2 Corr Pct.Diff
1 1 2 -0.003187894 76.75
2 1 3 0.074061019 76.75
3 1 4 0.082668387 78.00
4 1 5 0.001713751 75.50
5 1 8 0.052228907 75.75
6 1 12 -0.017921978 78.00
Perhaps it's not the best solution, but gets the job done.
Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.
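For reference, the same comparison can be written without growing a data frame inside a loop. This is a sketch on a smaller made-up matrix (10 columns rather than 100); it computes the correlation matrix once and looks up each pair with two-column matrix indexing.

```r
set.seed(1)
Mat <- matrix(runif(4000), nrow = 400, ncol = 10)

# All distinct column pairs, one pair per row
pairs <- t(combn(ncol(Mat), 2))

# Fraction of rows where the two columns differ by less than 0.5
pct <- apply(pairs, 1, function(p) mean(abs(Mat[, p[1]] - Mat[, p[2]]) < 0.5))

all_pairs <- data.frame(Col1 = pairs[, 1], Col2 = pairs[, 2],
                        Corr = cor(Mat)[pairs],   # look up cor(i, j) per pair
                        Pct.Diff = 100 * pct)

# Keep the pairs with more than 75% of values within the threshold
similar <- all_pairs[all_pairs$Pct.Diff > 75, ]
head(similar)
```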
