I want to create a 100*4 matrix of 0s and 1s, such that each row has only one 1 and each column has at least two 1s, in R.
MyMat <- as.matrix(rsparsematrix(nrow=100, ncol=4, nnz = 100))
I am thinking of rsparsematrix but yet I am not sure how to apply my required conditions.
Edit: my other attempt was dummy_cols, but I am still stuck on applying the two conditions. I guess there must be a more straightforward way of creating such a matrix.
1) A matrix consisting of 25 4x4 identity matrices stacked one on top of each other satisfies these requirements
m <- matrix(1, 25) %x% diag(4)
2) Exchanging the two arguments of %x% would also work and gives a different matrix which also satisfies this.
3) Any permutation of the rows and the columns of the two solution matrices in (1) and (2) would also satisfy the conditions.
m[sample(100), sample(4)]
4) If the objective is to generate a random table containing 0/1 values whose row sums are each 1 and whose column sums are each 25 then use r2dtable:
r <- r2dtable(1, rep(1, 100), rep(25, 4))[[1]]
5) or if it is desired to allow any column sums of at least 2 then:
rsums <- rep(1, 100)
csums <- rmultinom(1, 92, rep(0.25, 4)) + 2
r <- r2dtable(1, rsums, csums)[[1]]
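Any of the candidates above can be checked against the two conditions with a few lines of base R. This is only a sketch: check_matrix is a hypothetical helper name introduced here, not something from the answer.

```r
# check_matrix is a hypothetical helper, not part of the answer above
check_matrix <- function(m) {
  all(m %in% c(0, 1)) &&       # only 0/1 entries
    all(rowSums(m) == 1) &&    # exactly one 1 in every row
    all(colSums(m) >= 2)       # at least two 1s in every column
}

m1 <- matrix(1, 25) %x% diag(4)   # 25 stacked 4x4 identity matrices
m2 <- diag(4) %x% matrix(1, 25)   # same Kronecker product, arguments exchanged
r  <- r2dtable(1, rep(1, 100), rep(25, 4))[[1]]

# One logical per candidate, including a row/column permutation of m1
sapply(list(m1, m2, m1[sample(100), sample(4)], r), check_matrix)
```

All four entries should come back TRUE; a matrix that violates either rule, such as matrix(1, 100, 4), gives FALSE.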
Stochastically, with two rules:
All rows must have exactly one 1; and
All columns must have at least two 1s.
I control the first implicitly by construction; I test against the second.
nr <- 100 ; nc <- 4
set.seed(42)
lim <- 10000
while (lim > 0) {
lim <- lim - 1
M <- t(replicate(nr, sample(c(1, rep(0, nc-1)))))
if (all(colSums(M > 0) >= 2)) break
}
head(M)
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 0
# [2,] 0 0 0 1
# [3,] 0 0 0 1
# [4,] 0 1 0 0
# [5,] 0 0 0 1
# [6,] 0 1 0 0
colSums(M)
# [1] 25 30 21 24
lim
# [1] 9999
My use of lim is hardly needed in this example, but is there as a mechanism to stop this from running infinitely: if you change the dimensions and/or the rules, it might become highly unlikely or infeasible to meet all rules, so this keeps the execution time limited. (10000 is completely arbitrary.)
My point in the comment is that it would be rather difficult to find a 100x4 matrix that matches rule 1 but does not match rule 2. In fact, since the probabilities of a 0 or a 1 in any one cell are 0.75 and 0.25, respectively, the probability that a given column (among 100 rows) contains fewer than two 1s is around 1.1e-11.
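The 1.1e-11 figure can be reproduced with the binomial distribution function, assuming (as the construction implies) that the number of 1s in a given column is Binomial(100, 0.25). The names p_col and p_any_upper are only illustrative.

```r
# P(a given column has 0 or 1 ones) when each of 100 rows puts its
# single 1 in that column with probability 0.25
p_col <- pbinom(1, size = 100, prob = 0.25)
p_col

# Union bound: the chance that ANY of the 4 columns fails is at most 4 * p_col
p_any_upper <- 4 * p_col
p_any_upper
```

So the while loop is expected to exit on its first pass essentially every time, which matches lim being 9999 after the run.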
Here is a simple way to generate the 100 rows with the 1's randomly positioned and then create the matrix by transposing the rows object. The matrix generation is wrapped by a while loop (THX r2evans) to ensure each column contains at least two 1's.
minval <- 0
while(minval < 2) {
rows <- replicate(100, sample(c(0,0,0,1), 4))
m <- t(rows)
minval <- min(colSums(m))
}
m
[,1] [,2] [,3] [,4]
[1,] 0 0 0 1
[2,] 1 0 0 0
[3,] 0 0 0 1
[4,] 0 0 1 0
[5,] 1 0 0 0
[6,] 0 0 0 1
[7,] 1 0 0 0
[8,] 0 0 1 0
[9,] 0 1 0 0
[10,] 1 0 0 0
Code:
v <- tabulate(sample(1:4, 100-2*4, replace=TRUE), nbins=4) + 2
m <- diag(length(v))[sample(rep(seq_along(v), v)),]
Result check:
> dim(m)
[1] 100 4
> range(rowSums(m))
[1] 1 1
> range(colSums(m))
[1] 20 30
This works with any matrix size: just adjust the numbers 4 and 100. The first controls the number of columns and the second the number of rows:
v <- tabulate(sample(1:10, 200-2*10, replace=TRUE), nbins=10) + 2
m <- diag(length(v))[sample(rep(seq_along(v), v)),]
> dim(m)
[1] 200 10
> range(rowSums(m))
[1] 1 1
> range(colSums(m))
[1] 15 31
Explanation: this works backwards from the properties of the resulting matrix. If you have 100 rows and 4 columns, with each row having only one 1 then the matrix will have 100 1s in total. Which means that the sum of all column-sums should also be 100. So we start with a vector of numbers (summing up to 100) which represents how many 1s each column will have. Say this vector is c(50,25,20,5). This tells us that there will be 50 rows of the form (1,0,0,0), 25 rows with the form (0,1,0,0), and so on. The final step is to generate all these rows and shuffle them.
The trick here:
v <- tabulate(sample(1:4, 100-2*4, replace=TRUE), nbins=4) + 2
Is to generate random column-sums while making sure the minimum is at least 2. We do this by generating values summing up to 92 and then adding 2 to each value (which, with 4 columns, ends up as additional 8).
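The two lines above can be wrapped in a function so the dimensions and the per-column minimum become arguments. This is a sketch under stated assumptions: make_indicator and min_per_col are names made up here, and the feasibility check is an addition that the original answer does not include.

```r
# make_indicator and min_per_col are hypothetical names for illustration
make_indicator <- function(nr, nc, min_per_col = 2) {
  # There must be enough rows to give every column its minimum
  stopifnot(nr >= nc * min_per_col)

  # Random column sums: distribute the "spare" rows, then add the minimum
  spare <- sample.int(nc, nr - nc * min_per_col, replace = TRUE)
  v <- tabulate(spare, nbins = nc) + min_per_col

  # Column index of the 1 in each row, shuffled
  idx <- rep(seq_len(nc), v)
  idx <- idx[sample.int(length(idx))]

  # Pick the matching rows of the identity matrix
  diag(nc)[idx, , drop = FALSE]
}

m <- make_indicator(100, 4)
dim(m)
range(rowSums(m))
min(colSums(m))
```

Indexing with sample.int(length(idx)) rather than sample(idx) avoids the surprising behaviour of sample() when it is handed a single number.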
I want to generate a matrix (4 rows and 30 columns) in R with random elements between 0 and 1, such that each row sums to 1.
Here's a solution based on the softmax (multinomial logit) transform.
m <- matrix(rnorm(4 * 30), nrow=4)  # 4 rows and 30 columns, as asked
prob <- exp(m)/rowSums(exp(m))
rowSums(prob)
#[1] 1 1 1 1
all(prob > 0 & prob < 1)
#[1] TRUE
If you pick n numbers in [0,1] which sum to 1 you are in effect picking n-1 breakpoints. You can pick the breakpoints and then work backwards to the numbers:
rand.sum <- function(n){
x <- sort(runif(n-1))
c(x,1) - c(0,x)
}
And then
t(replicate(4,rand.sum(30)))
will be a 4x30 matrix of random numbers where each row sums to 1.
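A quick check of the breakpoint construction, as a sketch: rand.sum is repeated here so the snippet runs on its own, and the seed is arbitrary.

```r
rand.sum <- function(n) {
  x <- sort(runif(n - 1))   # n - 1 sorted breakpoints in [0, 1]
  c(x, 1) - c(0, x)         # gaps between consecutive breakpoints
}

set.seed(1)
m <- t(replicate(4, rand.sum(30)))

dim(m)                             # 4 rows, 30 columns
all.equal(rowSums(m), rep(1, 4))   # rows sum to 1 up to floating point
all(m >= 0 & m <= 1)               # every element lies in [0, 1]
```

The gaps between sorted uniform breakpoints are spread evenly over all non-negative vectors summing to 1, whereas the softmax approach concentrates differently depending on the distribution fed into it.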
I have an example word by document matrix (from Landauer and Dumais, 1997):
wxd <- matrix(c(1,1,1,0,0,0,0,0,0,0,0,0,
0,0,1,1,1,1,1,0,1,0,0,0,
0,1,0,1,1,0,0,1,0,0,0,0,
1,0,0,0,2,0,0,1,0,0,0,0,
0,0,0,1,0,1,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,0,1,1,0,
0,0,0,0,0,0,0,0,0,1,1,1,
0,0,0,0,0,0,0,0,1,0,1,1)
,12, 9)
rownames(wxd) <- c("human", "interface", "computer", "user", "system",
"response", "time", "EPS", "survey", "trees", "graph", "minors")
colnames(wxd) <- c(paste0("c", 1:5), paste0("m", 1:4))
I can perform Singular Value Decomposition on this matrix using the svd() function and have three matrices U, S, and V:
SVD <- svd(wxd)
U <- SVD$u
S <- diag(SVD$d)
V <- SVD$v
I can multiply these matrices and get my original matrix back (within some small margin of error):
U %*% S %*% t(V)
I can also take the first two columns of the U and V matrices and the first two columns and rows of the S matrix to get the least squares best approximation of the original data. This fits with the results of the same procedure in the paper I mentioned above:
U[ , 1:2] %*% S[1:2, 1:2] %*% t(V[ , 1:2])
I want to make sure I understand what this function is doing (as best I can), and I have been able to generate the V and S matrices to match those from the svd() function:
ATA <- t(wxd) %*% wxd
V2 <- eigen(ATA)$vectors
S2 <- sqrt(diag(eigen(ATA)$values))
But, the U matrix I generate has the same absolute values for the first 9 columns then adds an additional 3 columns. And some elements of this U matrix have different signs than the U matrix from the svd() function:
AAT <- wxd %*% t(wxd)
U2 <- eigen(AAT)$vectors
So my question is, why is the U matrix different than when I attempt to calculate it from scratch?
wxd has rank of 9. Therefore, your AAT only has 9 non-zero eigenvalues (the rest are very small ~1e-16). For those zero eigenvalues, the eigenvectors are arbitrary as long as they span the subspace orthogonal to that spanned by the other eigenvectors in R^12.
Now, by default svd only computes nu=min(n,p) left singular vectors (and similarly nv=min(n,p) right singular vectors), where n is the number of rows and p is the number of columns in the input (see ?svd). Therefore, you only get 9 left singular vectors. To generate all 12, call svd with:
svd(wxd,nu=nrow(wxd))
However, those extra 3 left singular vectors will not correspond to those found with eigen(AAT)$vectors again because these eigenvectors are determined somewhat arbitrarily to span that orthogonal subspace.
As for why some of the signs have changed, recall that eigenvectors are only determined up to a scale factor. Although these eigenvectors are normalized, they may still differ by a factor of -1. To check, divide each column of U elementwise by the corresponding column of U2. You should get columns of all 1s or all -1s:
U[,1:9]/U2[,1:9]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1 -1 1 -1 1 -1 1 1 1
## [2,] 1 -1 1 -1 1 -1 1 1 1
## [3,] 1 -1 1 -1 1 -1 1 1 1
## [4,] 1 -1 1 -1 1 -1 1 1 1
## [5,] 1 -1 1 -1 1 -1 1 1 1
## [6,] 1 -1 1 -1 1 -1 1 1 1
## [7,] 1 -1 1 -1 1 -1 1 1 1
## [8,] 1 -1 1 -1 1 -1 1 1 1
## [9,] 1 -1 1 -1 1 -1 1 1 1
##[10,] 1 -1 1 -1 1 -1 1 1 1
##[11,] 1 -1 1 -1 1 -1 1 1 1
##[12,] 1 -1 1 -1 1 -1 1 1 1
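One way to sidestep the sign mismatch when building the decomposition by hand is to derive U from V instead of running a second, independent eigen() call: each left singular vector is u_i = A v_i / sigma_i, so it inherits whatever sign eigen() chose for v_i. The sketch below uses a random 12x9 matrix A as a stand-in for wxd (an assumption, so that the snippet is self-contained) and relies on A having full column rank.

```r
set.seed(1)
A <- matrix(rnorm(12 * 9), nrow = 12)   # stand-in for wxd; assumed full column rank

e  <- eigen(crossprod(A), symmetric = TRUE)   # crossprod(A) is t(A) %*% A
V2 <- e$vectors                               # right singular vectors
s  <- sqrt(e$values)                          # singular values

# u_i = A v_i / sigma_i: scale column i of A %*% V2 by 1 / s[i]
U2 <- A %*% V2 %*% diag(1 / s)

all.equal(U2 %*% diag(s) %*% t(V2), A)   # the product reconstructs A
all.equal(crossprod(U2), diag(9))        # columns of U2 are orthonormal
```

U2 and V2 may still differ from svd(A)$u and svd(A)$v by a sign per column, but they are consistent with each other, which is all the reconstruction needs.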
Update to explain why Eigenvector is only determined up to a scale factor
This can be seen from the definition of the eigenvector. From Wikipedia,
In linear algebra, an eigenvector or characteristic vector of a linear transformation is a non-zero vector that does not change its direction when that linear transformation is applied to it.
In finite-dimensional vector space, the linear transformation is in terms of multiplying the vector with a square matrix A, and therefore the definition is (This is where I wish SO supports LaTeX markdown as this is not an equation in code; that is * is matrix-multiply here):
A * v = lambda * v
which is known as the Eigenvalue Equation for the matrix A where lambda is the eigenvalue associated with the eigenvector v. From this equation, it is clear that if v is an eigenvector of A then any k * v for some scalar k is also an eigenvector of A with associated eigenvalue lambda.
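A small numerical illustration of that last statement, using an arbitrary 2x2 symmetric matrix and an arbitrary scalar k (both chosen here only for the example):

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2)

e      <- eigen(A)
v      <- e$vectors[, 1]   # eigenvector for the largest eigenvalue
lambda <- e$values[1]

k <- -3                    # any non-zero scalar, including negative ones
lhs <- as.vector(A %*% (k * v))
rhs <- lambda * (k * v)

all.equal(lhs, rhs)        # k * v satisfies the same eigenvalue equation
```

With k = -1 this is exactly the sign flip seen between U and U2.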
I am trying to randomly sample 7 numbers from 0 to 7 (with replacement), but subject to the constraint that the numbers chosen add up to 7. So for instance, the output 0 1 1 2 3 0 0 is okay, but the output 1 2 3 4 5 6 7 is not. Is there a way to use the sample command with added constraints?
I intend to use the replicate() function with the sample command as an argument, to return a list of N different vectors from the sample command. The way I am currently using the sample command (without any constraints), I need N to be very large in order to get as many possible vectors that sum to exactly 7 as possible. I figure there must be an easier way to do this!
Here is my code for that part:
x <- replicate(100000, sample(0:7, 7, replace=T))
Ideally, I want 10,000 or 100,000 vectors in x to sum to 7, but would need an enormous N value to do this. Thanks for any help.
To make sure you're sampling uniformly, you could just generate all the permutations and limit to those that sum to 7:
library(gtools)
perms <- permutations(8, 7, 0:7, repeats.allowed=T)
perms7 <- perms[rowSums(perms) == 7,]
From nrow(perms7), we see there are only 1716 possible permutations that sum to 7. Now you can uniformly sample from the permutations:
set.seed(144)
my.perms <- perms7[sample(nrow(perms7), 100000, replace=T),]
head(my.perms)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 0 0 0 2 5 0 0
# [2,] 1 3 0 1 2 0 0
# [3,] 1 4 1 1 0 0 0
# [4,] 1 0 0 3 0 3 0
# [5,] 0 2 0 0 0 5 0
# [6,] 1 1 2 0 0 2 1
An advantage of this approach is that it's easy to see that we're sampling uniformly at random. Also, it's quite quick -- building perms7 took 0.3 seconds on my computer and building a 1 million-row my.perms took 0.04 seconds. If you need to draw many vectors this will be quite a bit quicker than a recursive approach because you're just using matrix indexing into perms7 instead of generating each vector separately.
Here's a distribution of counts of numbers in the sample:
# 0 1 2 3 4 5 6 7
# 323347 188162 102812 51344 22811 8629 2472 423
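Both the 1716 count and the shape of this distribution can be cross-checked by counting, without enumerating anything. The sketch below uses the stars-and-bars formula: there are choose(13, 6) length-7 vectors of non-negative integers summing to 7, and fixing one element at k leaves 6 elements summing to 7 - k, which can be done in choose(12 - k, 5) ways. The variable names are illustrative only.

```r
n_comp <- choose(7 + 7 - 1, 7 - 1)   # stars and bars: vectors of length 7 summing to 7
n_comp

k   <- 0:7
p_k <- choose(12 - k, 5) / n_comp    # P(a given element equals k) under uniform sampling

# Expected counts among the 7 * 100000 elements of my.perms
expected <- setNames(round(7e5 * p_k), k)
expected
```

The expected counts land close to the observed row above, which is what uniform sampling over the 1716 vectors should produce.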
Start with all zeroes, add one to any element, do 7 times:
sumTo = function(){
v = rep(0,7)
for(i in 1:7){
addTo=sample(7)[1]
v[addTo]=v[addTo]+1
}
v
}
Or equivalently, just choose which of the 7 elements you are going to increment in one sample of length 7, then tabulate those, making sure you tabulate up to 7:
sumTo = function(){tabulate(sample(7, 7, replace = TRUE), 7)}
> sumTo()
[1] 2 1 0 0 4 0 0
> sumTo()
[1] 1 3 1 0 1 0 1
> sumTo()
[1] 1 1 0 2 1 0 2
I don't know if this will produce a uniform sample from all possible combinations...
The distribution of individual elements over 100,000 reps is:
> X = replicate(100000,sumTo())
> table(X)
X
0 1 2 3 4 5 6
237709 277926 138810 38465 6427 627 36
Didn't hit a 0,0,0,0,0,0,7 that time!
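On the uniformity question: dropping 7 balls independently into 7 equally likely slots is a multinomial draw, so the probability of each outcome can be computed directly with dmultinom from base R. A sketch, with variable names chosen here for illustration:

```r
p <- rep(1 / 7, 7)   # each slot equally likely on every increment

p_flat  <- dmultinom(rep(1, 7), prob = p)               # outcome 1,1,1,1,1,1,1
p_spike <- dmultinom(c(7, 0, 0, 0, 0, 0, 0), prob = p)  # outcome 7,0,0,0,0,0,0

p_flat / p_spike   # ratio of the two probabilities
```

The ratio is 7! = 5040, so this method is far from uniform over the possible vectors: balanced outcomes are heavily favoured, which is why a lone 7 is so rare in the table above.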
This recursive algorithm will output a distribution with a higher probability for large numbers than the other solutions. The idea is to throw a random number y in 0:7 in any of the seven available slots, then repeat with a random number in 0:(7-y), etc:
sample.sum <- function(x = 0:7, n = 7L, s = 7L) {
if (n == 1) return(s)
x <- x[x <= s]
y <- sample(x, 1)
sample(c(y, Recall(x, n - 1L, s - y)))
}
set.seed(123L)
sample.sum()
# [1] 0 4 0 2 0 0 1
Drawing 100,000 vectors took 11 seconds on my machine and here is the distribution I get:
# 0 1 2 3 4 5 6 7
# 441607 98359 50587 33364 25055 20257 16527 14244
There may be an easier and/or more elegant way, but here's a brute-force method using the LSPM:::.nPri function. The link includes the definition for an R-only version of the algorithm, for those interested.
#install.packages("LSPM", repos="http://r-forge.r-project.org")
library(LSPM)
# generate all possible permutations, since there are only ~2.1e6 of them
# (this takes < 40s on my 2.2Ghz laptop)
x <- lapply(seq_len(8^7), nPri, n=8, r=7, replace=TRUE)
# set each permutation that doesn't sum to 7 to NULL
y <- lapply(x, function(p) if(sum(p-1) != 7) NULL else p-1)
# subset all non-NULL permutations
z <- y[which(!sapply(y, is.null))]
Now you can sample from z and be assured that you're getting a permutation that sums to 7.
I find this question intriguing and gave it some extra thought. Another (more general) approach to sample (approximately) uniformly from all feasible solutions, without generating and storing all permutations (which is clearly not possible with many more than 7 numbers), using sample() in R, could be a simple MCMC implementation:
S <- c(0, 1, 1, 2, 3, 0, 0) #initial solution
N <- 100 #number of dependent samples (or burn in period)
series <- numeric(N)
for(i in 1:N){
b <- sample(1:length(S), 2, replace=FALSE) #pick 2 elements at random
opt <- sum(S[-b]) #sum of complementary elements
a <- sample(0:(7-opt), 1) #sample a substitute
S[b[1]] <- a #change elements
S[b[2]] <- 7 - opt - a
}
S #new sample
This is of course really fast for a few samples. The "distribution":
#"distribution" N=100.000: 0 1 2 3 4 5 6 7
# 321729 189647 103206 52129 22287 8038 2532 432
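To draw many vectors this way, the loop above can be wrapped in a function that runs a fixed number of pair updates between recorded samples. This is only a sketch: mcmc_sum and its arguments (n_samples, total, len, thin) are names invented here, and thin = 100 simply mirrors the N used above rather than being a tuned value.

```r
# mcmc_sum and its argument names are hypothetical, for illustration only
mcmc_sum <- function(n_samples, total = 7L, len = 7L, thin = 100L) {
  s   <- c(total, rep(0L, len - 1L))        # any feasible starting vector
  out <- matrix(0L, nrow = n_samples, ncol = len)

  for (i in seq_len(n_samples)) {
    for (j in seq_len(thin)) {
      b   <- sample.int(len, 2)             # pick 2 distinct positions
      rem <- s[b[1]] + s[b[2]]              # amount to redistribute between them
      a   <- sample.int(rem + 1L, 1) - 1L   # uniform on 0..rem
      s[b] <- c(a, rem - a)                 # the total stays unchanged
    }
    out[i, ] <- s                           # record one (dependent) sample
  }
  out
}

set.seed(1)
x <- mcmc_sum(1000)
all(rowSums(x) == 7)
table(x)
```

Using sample.int(rem + 1L, 1) - 1L instead of sample(0:rem, 1) avoids relying on how sample() treats a length-one vector when rem is 0.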
Of course in this case, where it's actually possible to find and store all combinations, and if you want a huge sample from all feasible outcomes, just use partitions::compositions(7, 7), as also suggested by Josh O'Brien in the comments, to avoid calculating all the permutations, when only a small fraction is needed:
perms7 <- partitions::compositions(7, 7)
>tabulate(perms7[, sample(ncol(perms7), 100000, TRUE)]+1, 8)
#"distribution" N=100.000: 0 1 2 3 4 5 6 7
# 323075 188787 102328 51511 22754 8697 2413 435