I have a matrix Q that is relatively high dimensional (100X500000), and I want to downsample it. By downsample, I will explain with an example.
Let Q =
1 4 9
3 2 1
and downsample size= n. I want to draw n balls from a jar of sum(Q) = 20 balls, each ball colored 1 of 6 ways corresponding to a different index pair of the matrix. It's like I have 1 ball of color A, 4 balls of color B, etc, and I'm drawing n balls without replacement.
I want it to be returned in the same format, as a matrix. One example return value, for example, downsample(Q, 3) =
0 0 2
1 0 0
My approach is trying to use sample:
sample(length(as.vector(Q)), size=n, replace=FALSE, prob = as.vector(Q))
However the problem with this is, sample considers 1:length(as.vector(Q)) as all the balls I have, so I can't draw more than length(as.vector(Q)) balls since I'm not replacing my balls.
So then to adapt my method, I would need to update my prob by subtracting 1 from this vector, and call sample one by one using a for loop of some sort. It doesn't sound like nice code.
Is there a better way to do this in a R-friendly, no for loop way?
It's a little inefficient, but if sum(Q) isn't too large you can do this by disaggregating/replicating the vector and then sampling, then reaggregating/tabulating.
Q <- setNames(c(1,4,9,3,2,1),LETTERS[1:6])
n <- 10
set.seed(101)
s0 <- sample(rep(names(Q),Q),
size=n,replace=FALSE)
Q2 <- table(factor(s0,levels=names(Q)))
## A B C D E F
## 1 2 5 1 0 1
I'm not sure about your matrix structure. You could use dim(Q2) <- dim(Q) to reorganize the results in the same order as your original matrix ...
Here's one way that's pretty good. You could improve its efficiency (if necessary) by replacing which(x <= cq)[1] with a function special-built for finding the first TRUE value.
Q = matrix(c(1, 4, 9, 3, 2, 1), nrow = 2)
set.seed(47)
samp = sample(sum(Q), size = 3)
cq = cumsum(Q)
inds = table(sapply(samp, function(x) which(x <= cq)[1]))
result = integer(length(Q))
result[as.integer(names(inds))] = inds
dim(result) = dim(Q)
# [,1] [,2] [,3]
# [1,] 0 2 0
# [2,] 0 0 1
Related
I am working with the R programming language. I am trying to build a loop that performs the following :
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many random numbers had to be generated until it took for Step 1 to be completed
Step 3: Repeat Step 1 and Step 2 100 times
Since I do not know how to keep generating random numbers until a condition is met, I tried to generate a large amount of random numbers hoping that the condition is met (there is probably a better way to write this):
results <- list()
for (i in 1:100){
# do until break
repeat {
# repeat many random numbers
a = rnorm(10000,10,1)
b = rnorm(10000,10,1)
# does any pair meet the requirement
if (any(a > 12 & b > 12)) {
# put it in a data.frame
d_i = data.frame(a,b)
# end repeat
break
}
}
# select all rows until the first time the requirement is met
# it must be met, otherwise the loop would not have ended
d_i <- d_i[1:which(d_i$a > 10 & d_i$b > 10)[1], ]
# prep other variables and only keep last row (i.e. the row where the condition was met)
d_i$index = seq_len(nrow(d_i))
d_i$iteration = as.factor(i)
e_i = d_i[nrow(d_i),]
results[[i]] <- e_i
}
results_df <- do.call(rbind.data.frame, results)
Problem: When I look at the results, I noticed that the loop is incorrectly considering the condition to be met, for example:
head(results_df)
a b index iteration
4 10.29053 10.56263 4 1
5 10.95308 10.32236 5 2
3 10.74808 10.50135 3 3
13 11.87705 10.75067 13 4
1 10.17850 10.58678 1 5
14 10.14741 11.07238 1 6
For instance, in each one of these rows - both "a" and "b" are smaller than 12.
Does anyone know why this is happening and can someone please show me how to fix this problem?
Thanks!
How about this way? As you tag while-loop, I tried using it.
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- 1
while(a < 12 | b < 12) {
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- i + 1
}
x <- c(a,b,i)
res <- rbind(res, x)
}
head(res)
[,1] [,2] [,3]
x 12.14232 12.08977 399
x 12.27158 12.01319 1695
x 12.57345 12.42135 302
x 12.07494 12.64841 600
x 12.03210 12.07949 82
x 12.34006 12.00365 782
dim(res)
[1] 100 3
I want to create a function which replaces the a chosen row of a matrix with zeros. I try to think of the matrix as arbitrary but for this example I have done it with a sample 3x3 matrix with the numbers 1-9, called a_matrix
1 4 7
2 5 8
3 6 9
I have done:
zero_row <- function(M, n){
n <- c(0,0,0)
M*n
}
And then I have set the matrix and tried to get my desired result by using my zero_row function
mat1 <- a_matrix
zero_row(M = mat1, n = 1)
zero_row(M = mat1, n = 2)
zero_row(M = mat1, n = 3)
However, right now all I get is a matrix with only zeros, which I do understand why. But if I instead change the vector n to one of the following
n <- c(0,1,1)
n <- c(1,0,1)
n <- c(1,1,0)
I get my desired result for when n=1, n=2, n=3 separately. But what i want is, depending on which n I put in, I get that row to zero, so I have a function that does it for every different n, instead of me having to change the vector for every separate n. So that I get (n=2 for example)
1 4 7
0 0 0
3 6 9
And is it better to do it in another form, instead of using vectors?
Here is a way.
zero_row <- function(M, n){
stopifnot(n <= nrow(M))
M[n, ] <- 0
M
}
A <- matrix(1:9, nrow = 3)
zero_row(A, 1)
zero_row(A, 2)
zero_row(A, 3)
I know I can use expand.grid for this, but I am trying to learn actual programming. My goal is to take what I have below and use a recursion to get all 2^n binary sequences of length n.
I can do this for n = 1, but I don't understand how I would use the same function in a recursive way to get the answer for higher dimensions.
Here is for n = 1:
binseq <- function(n){
binmat <- matrix(nrow = 2^n, ncol = n)
r <- 0 #row counter
for (i in 0:1) {
r <- r + 1
binmat[r,] <- i
}
return(binmat)
}
I know I have to use probably a cbind in the return statement. My intuition says the return statement should be something like cbind(binseq(n-1), binseq(n)). But, honestly, I'm completely lost at this point.
The desired output should produce something like what expand.grid gives:
n = 5
expand.grid(replicate(n, 0:1, simplify = FALSE))
It should just be a matrix as binmat is being filled recursively.
As requested in a comment (below), here is a limited implementation for binary sequences only:
eg.binary <- function(n, digits=0:1) {
if (n <= 0) return(matrix(0,0,0))
if (n == 1) return(matrix(digits, 2))
x <- eg.binary(n-1)
rbind(cbind(digits[1], x), cbind(digits[2], x))
}
After taking care of an initial case that R cannot handle correctly, it treats the "base case" of n=1 and then recursively obtains all n-1-digit binary strings and prepends each digit to each of them. The digits are prepended so that the binary strings end up in their usual lexicographic order (the same as expand.grid).
Example:
eg.binary(3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 1
[3,] 0 1 0
[4,] 0 1 1
[5,] 1 0 0
[6,] 1 0 1
[7,] 1 1 0
[8,] 1 1 1
A general explanation (with a more flexible solution) follows.
Distill the problem down to the basic operation of tacking the values of an array y onto the rows of a dataframe X, associating a whole copy of X with each value (via cbind) and appending the whole lot (via rbind):
cross <- function(X, y) {
do.call("rbind", lapply(y, function(z) cbind(X, z)))
}
For example,
cross(data.frame(A=1:2, b=letters[1:2]), c("X","Y"))
A b z
1 1 a X
2 2 b X
3 1 a Y
4 2 b Y
(Let's worry about the column names later.)
The recursive solution for a list of such arrays y assumes you have already carried out these operations for all but the last element of the list. It has to start somewhere, which evidently consists of converting an array into a one-column data frame. Thus:
eg_ <- function(y) {
n <- length(y)
if (n <= 1) {
as.data.frame(y)
} else {
cross(eg_(y[-n]), y[[n]])
}
}
Why the funny name? Because we might want to do some post-processing, such as giving the result nice names. Here's a fuller implementation:
eg <- function(y) {
# (Define `eg_` here to keep it local to `eg` if you like)
X <- eg_(y)
names.default <- paste0("Var", seq.int(length(y)))
if (is.null(names(y))) {
colnames(X) <- names.default
} else {
colnames(X) <- ifelse(names(y)=="", names.default, names(y))
}
X
}
For example:
eg(replicate(3, 0:1, simplify=FALSE))
Var1 Var2 Var3
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
eg(list(0:1, B=2:3))
Var1 B
1 0 2
2 1 2
3 0 3
4 1 3
Apparently this was the desired recursive code:
binseq <- function(n){
if(n == 1){
binmat <- matrix(c(0,1), nrow = 2, ncol = 1)
}else if(n > 1){
A <- binseq(n-1)
B <- cbind(rep(0, nrow(A)), A)
C <- cbind(rep(1, nrow(A)), A)
binmat <- rbind(B,C)
}
return(binmat)
}
Basically for n = 1 we create a [0, 1] matrix. For every n there after we add a column of 0's to the original matrix, and, separately, a column of 1's. Then we rbind the two matrices to get the final product. So I get what the algorithm is doing, but I don't really understand what the recursion is doing. For example, I don't understand the step from n = 2 to n = 3 based on the algorithm.
I have 10M rows matrix with integer values
A row in this matrix can look as follows:
1 1 1 1 2
I need to transform the row above to the following vector:
4 1 0 0 0 0 0 0 0
Other example:
1 2 3 4 5
To:
1 1 1 1 1 0 0 0 0
How to do it efficiently in R
?
Update:
There is a function that does exactly what I need: base::tabulate (suggested here before)
but it is extremely slow (took at least 15 mins to go over my init matrix)
I would try something like this:
m <- nrow(x)
n <- ncol(x)
i.idx <- seq_len(m)
j.idx <- seq_len(n)
out <- matrix(0L, m, max(x))
for (j in j.idx) {
ij <- cbind(i.idx, x[, j])
out[ij] <- out[ij] + 1L
}
A for loop might sound surprising for a question that asks for an efficient implementation. However, this solution is vectorized for a given column and only loops through five columns. This will be many, many times faster than looping over 10 million rows using apply.
Testing with:
n <- 1e7
m <- 5
x <- matrix(sample(1:9, n*m, T), n ,m)
this approach takes less than six seconds while a naive t(apply(x, 1, tabulate, 9)) takes close to two minutes.
I am trying to construct a summed area table or integral image given an image matrix. For those of you who dont know what it is, from wikipedia:
A summed area table (also known as an integral image) is a data structure and algorithm for quickly and efficiently generating the sum of values in a rectangular subset of a grid
In other words, its used to sum up values of any rectangular region in the image/matrix in constant time.
I am trying to implement this in R. However, my code seems to take too long to run.
Here is the pseudo code from this link. in is the input matrix or image and intImg is whats returned
for i=0 to w do
sum←0
for j=0 to h do
sum ← sum + in[i, j]
if i = 0 then
intImg[i, j] ← sum
else
intImg[i, j] ← intImg[i − 1, j] + sum
end if
end for
end for
And here is my implementation
w = ncol(im)
h = nrow(im)
intImg = c(NA)
length(intImg) = w*h
for(i in 1:w){ #x
sum = 0;
for(j in 1:h){ #y
ind = ((j-1)*w)+ (i-1) + 1 #index
sum = sum + im[ind]
if(i == 1){
intImg[ind] = sum
}else{
intImg[ind] = intImg[ind-1]+sum
}
}
}
intImg = matrix(intImg, h, w, byrow=T)
Example of input and output matrix:
However, on a 480x640 matrix, this takes ~ 4 seconds. In the paper they describe it to take on the order of milliseconds for those dimensions.
Am I doing something inefficient in my loops or indexing?
I considered writing it in C++ and wrapping it in R, but I am not very familiar with C++.
Thank you
You could try to use apply (isn't faster than your for-loops if you pre-allocating the memory):
areaTable <- function(x) {
return(apply(apply(x, 1, cumsum), 1, cumsum))
}
areaTable(m)
# [,1] [,2] [,3] [,4]
# [1,] 4 5 7 9
# [2,] 4 9 12 17
# [3,] 7 13 16 25
# [4,] 9 16 22 33