R efficient way to use values as indexes

I have a matrix with 10M rows of integer values.
A row in this matrix can look as follows:
1 1 1 1 2
I need to transform the row above to the following vector:
4 1 0 0 0 0 0 0 0
Other example:
1 2 3 4 5
To:
1 1 1 1 1 0 0 0 0
How can I do this efficiently in R?
Update:
There is a function that does exactly what I need, base::tabulate (suggested here before),
but it is extremely slow: it took at least 15 minutes to go over my initial matrix.

I would try something like this:
m <- nrow(x)
n <- ncol(x)
i.idx <- seq_len(m)
j.idx <- seq_len(n)
out <- matrix(0L, m, max(x))
for (j in j.idx) {
ij <- cbind(i.idx, x[, j])
out[ij] <- out[ij] + 1L
}
A for loop might sound surprising for a question that asks for an efficient implementation. However, this solution is vectorized for a given column and only loops through five columns. This will be many, many times faster than looping over 10 million rows using apply.
Testing with:
n <- 1e7
m <- 5
x <- matrix(sample(1:9, n*m, TRUE), n, m)
this approach takes less than six seconds while a naive t(apply(x, 1, tabulate, 9)) takes close to two minutes.
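To make the column loop concrete, here is a minimal sketch of the same idea applied to the two example rows from the question. The function name count_by_row and its nbins argument are made up for illustration. The key step is R's matrix indexing with a two-column (row, column) index matrix.

```r
# Hypothetical helper: count how often each value 1..nbins occurs in every row.
count_by_row <- function(x, nbins = max(x)) {
  out  <- matrix(0L, nrow(x), nbins)
  rows <- seq_len(nrow(x))
  for (j in seq_len(ncol(x))) {
    # One (row, value) pair per row of x: row i points at cell out[i, x[i, j]].
    ij <- cbind(rows, x[, j])
    out[ij] <- out[ij] + 1L
  }
  out
}

x <- rbind(c(1L, 1L, 1L, 1L, 2L),
           c(1L, 2L, 3L, 4L, 5L))
res <- count_by_row(x, 9L)
res
# row 1: 4 1 0 0 0 0 0 0 0
# row 2: 1 1 1 1 1 0 0 0 0
```

Within one pass of the loop every row index appears exactly once in ij, so no cell is updated twice by a single assignment, and the increment is safe.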

Related

How to create matrix of all 2^n binary sequences of length n using recursion in R?

I know I can use expand.grid for this, but I am trying to learn actual programming. My goal is to take what I have below and use a recursion to get all 2^n binary sequences of length n.
I can do this for n = 1, but I don't understand how I would use the same function in a recursive way to get the answer for higher dimensions.
Here is for n = 1:
binseq <- function(n){
binmat <- matrix(nrow = 2^n, ncol = n)
r <- 0 #row counter
for (i in 0:1) {
r <- r + 1
binmat[r,] <- i
}
return(binmat)
}
I know I probably have to use cbind in the return statement. My intuition says the return statement should be something like cbind(binseq(n-1), binseq(n)). But, honestly, I'm completely lost at this point.
The desired output should produce something like what expand.grid gives:
n = 5
expand.grid(replicate(n, 0:1, simplify = FALSE))
It should just be a matrix as binmat is being filled recursively.
As requested in a comment (below), here is a limited implementation for binary sequences only:
eg.binary <- function(n, digits=0:1) {
if (n <= 0) return(matrix(0,0,0))
if (n == 1) return(matrix(digits, 2))
x <- eg.binary(n-1, digits)  # pass digits along so non-default values survive the recursion
rbind(cbind(digits[1], x), cbind(digits[2], x))
}
After taking care of an initial case that R cannot handle correctly, it treats the "base case" of n=1 and then recursively obtains all n-1-digit binary strings and prepends each digit to each of them. The digits are prepended so that the binary strings end up in their usual lexicographic order (the same as expand.grid).
Example:
eg.binary(3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 1
[3,] 0 1 0
[4,] 0 1 1
[5,] 1 0 0
[6,] 1 0 1
[7,] 1 1 0
[8,] 1 1 1
A general explanation (with a more flexible solution) follows.
Distill the problem down to the basic operation of tacking the values of an array y onto the rows of a dataframe X, associating a whole copy of X with each value (via cbind) and appending the whole lot (via rbind):
cross <- function(X, y) {
do.call("rbind", lapply(y, function(z) cbind(X, z)))
}
For example,
cross(data.frame(A=1:2, b=letters[1:2]), c("X","Y"))
A b z
1 1 a X
2 2 b X
3 1 a Y
4 2 b Y
(Let's worry about the column names later.)
The recursive solution for a list of such arrays y assumes you have already carried out these operations for all but the last element of the list. It has to start somewhere, which evidently consists of converting an array into a one-column data frame. Thus:
eg_ <- function(y) {
n <- length(y)
if (n <= 1) {
as.data.frame(y)
} else {
cross(eg_(y[-n]), y[[n]])
}
}
Why the funny name? Because we might want to do some post-processing, such as giving the result nice names. Here's a fuller implementation:
eg <- function(y) {
# (Define `eg_` here to keep it local to `eg` if you like)
X <- eg_(y)
names.default <- paste0("Var", seq.int(length(y)))
if (is.null(names(y))) {
colnames(X) <- names.default
} else {
colnames(X) <- ifelse(names(y)=="", names.default, names(y))
}
X
}
For example:
eg(replicate(3, 0:1, simplify=FALSE))
Var1 Var2 Var3
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
eg(list(0:1, B=2:3))
Var1 B
1 0 2
2 1 2
3 0 3
4 1 3
Apparently this was the desired recursive code:
binseq <- function(n){
if(n == 1){
binmat <- matrix(c(0,1), nrow = 2, ncol = 1)
}else if(n > 1){
A <- binseq(n-1)
B <- cbind(rep(0, nrow(A)), A)
C <- cbind(rep(1, nrow(A)), A)
binmat <- rbind(B,C)
}
return(binmat)
}
Basically, for n = 1 we create a [0, 1] matrix. For every n thereafter we add a column of 0's to the matrix from the previous step and, separately, a column of 1's. Then we rbind the two matrices to get the final product. So I get what the algorithm is doing, but I don't really understand what the recursion is doing. For example, I don't understand the step from n = 2 to n = 3 based on the algorithm.
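One way to look at the step from n = 2 to n = 3 is to carry it out by hand. The sketch below uses a condensed restatement of binseq (same logic, shorter body) and then repeats, outside the function, exactly what the call binseq(3) does with the result of binseq(2). The names B, C and by_hand are only for illustration.

```r
# Condensed restatement of the recursive function.
binseq <- function(n) {
  if (n == 1) return(matrix(c(0, 1), nrow = 2, ncol = 1))
  A <- binseq(n - 1)
  rbind(cbind(0, A), cbind(1, A))
}

A <- binseq(2)           # the 4 sequences of length 2
B <- cbind(0, A)         # the same 4 sequences with a leading 0
C <- cbind(1, A)         # the same 4 sequences with a leading 1
by_hand <- rbind(B, C)   # 8 sequences of length 3

all(by_hand == binseq(3))
```

The recursion does nothing more than this: each call asks for the finished table one size smaller and then doubles it, and the chain of calls stops when it reaches n = 1.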

R: Remove the number of occurrences of values in one vector from another vector, but not all

Apologies for the confusing title, but I don't know how to express my problem otherwise. In R, I have the following problem which I want to solve:
x <- seq(1,1, length.out=10)
y <- seq(0,0, length.out=10)
z <- c(x, y)
p <- c(1,0,1,1,0,0)
How can I remove the elements of p from z, so that a new vector i has three fewer occurrences of 1 and three fewer occurrences of 0 than z? In other words, what do I have to do to arrive at the following result? The order of the 1's and 0's in z should not matter (they might just as well be in random order), and other numbers may be involved as well.
i
> 1 1 1 1 1 1 1 0 0 0 0 0 0 0
Thanks in advance!
Similar to @VincentGuillemot's answer, but in a functional programming style, using the purrr package:
library(purrr)
i <- z
map(p, function(x) { i <<- i[-min(which(i == x))]})
i
[1] 1 1 1 1 1 1 1 0 0 0 0 0 0 0
There might be numerous better ways to do it:
i <- z
for (val in p) {
if (val %in% i) {
i <- i[ - which(i==val)[1] ]
}
}
Another solution that I like better because it does not require a test (thanks to @Franck for the suggestion):
for (val in p)
i <- i[ - match(val, i, nomatch = integer(0) ) ]
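For long vectors, the same "remove one copy per element of p" rule can also be written without a loop. The following is only a sketch under the assumption that z contains no NA values; the function name remove_counts is hypothetical. It numbers the occurrences of each value in z and keeps an element only if its occurrence number exceeds the number of copies requested for removal.

```r
# Hypothetical helper: drop, for each value, as many leading copies from z
# as that value occurs in p. Assumes z has no NA values.
remove_counts <- function(z, p) {
  u <- unique(z)
  m <- match(p, u)
  # Number of copies to drop for each distinct value of z;
  # values of p that never occur in z are ignored.
  n_drop <- tabulate(m[!is.na(m)], nbins = length(u))
  # Running occurrence number of every element within its own value.
  occ <- ave(seq_along(z), z, FUN = seq_along)
  z[occ > n_drop[match(z, u)]]
}

z <- c(rep(1, 10), rep(0, 10))
p <- c(1, 0, 1, 1, 0, 0)
i <- remove_counts(z, p)
i
# [1] 1 1 1 1 1 1 1 0 0 0 0 0 0 0
```

Like the loops above, this removes the first occurrences of each value and leaves the relative order of the remaining elements unchanged.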

R create vector with a for and while loop

Good morning,
I have the following problem.
My Data.frame "data" has the format:
Type amount
1 2
2 0
3 3
I would like to create a vector with the format:
1
1
3
3
3
This means I would like to transform my data.
I created a vector and wrote the following code for my transformation in R:
vector <- numeric(5)
for (i in 1:3){
k <- 1
while (k <= data[i,2]){
vector[k] <- data[i,1]
k <- k+1
}
}
The problem is that I get the following result, and I have no idea where I went wrong:
3
3
3
0
0
There might be many different ways in solving this particular problem in R but I am curious why my solution doesn't work. I am thankful for alternatives, but really would like to know what my mistake is.
Thanks for your help!
Try this solution:
df <- data.frame(type = c(1, 2, 3), amount = c(2, 0, 3))
result <- unlist(mapply(function(x, y) rep.int(x, y), df[, "type"], df[, "amount"]))
result
The output is as follows:
# [1] 1 1 3 3 3
Your code is indeed buggy. The corrected code should look as follows:
df <- data.frame(type = c(1, 2, 3), amount = c(2, 0, 3))
vector <- numeric(5)
k <- 1
for (i in 1:3) {
j <- 1
while (j <= df[i, 2]) {
vector[k] <- df[i, 1]
k <- k + 1
j <- j + 1
}
}
vector
# [1] 1 1 3 3 3
Probably the fastest and most elegant way to obtain this result has been posted before in a comment by @akrun:
with(data, rep(Type, amount))
[1] 1 1 3 3 3
However, if you want to do this with for/while loops, it could be helpful to use a list for such cases, where the number of entries is not known at the beginning.
Here is an example with minimal modifications of your code:
my_list <- vector("list", 3)
for (i in 1:3) {
k <- 1
while (k <= data[i,2]){
my_list[[i]][k] <- data[i,1]
k <- k + 1
}
}
vector <- unlist(my_list)
#> vector
#[1] 1 1 3 3 3
The reason why your code didn't work was essentially that you were trying to put too much information into a single variable, k. It cannot serve both as an index into your output vector and as a counter for the individual entries in the first column of data, because that counter is reset to 1 each time the while loop starts over for a new row.
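To see the effect of resetting k, the sketch below runs the original loop next to a variant that keeps a separate write position. The data frame is rebuilt from the question; the names buggy, fixed and pos are only for illustration.

```r
data <- data.frame(Type = c(1, 2, 3), amount = c(2, 0, 3))

# Original logic: k restarts at 1 for every row, so each row
# overwrites the output starting from position 1 again.
buggy <- numeric(5)
for (i in 1:3) {
  k <- 1
  while (k <= data[i, 2]) {
    buggy[k] <- data[i, 1]
    k <- k + 1
  }
}
buggy
# [1] 3 3 3 0 0

# Variant with a separate write position that is never reset.
fixed <- numeric(sum(data$amount))
pos <- 0
for (i in seq_len(nrow(data))) {
  for (j in seq_len(data$amount[i])) {
    pos <- pos + 1
    fixed[pos] <- data$Type[i]
  }
}
fixed
# [1] 1 1 3 3 3
```

In the first loop the two 1's written for row 1 are overwritten by the three 3's of row 3, which is exactly the result reported in the question.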

Aggregate rows in a large matrix by rowname

I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (same problem when using ddply). Is there a more efficient way to arrive at the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know they are stored in a sparse matrix format.
Here's a solution using by and colSums, but requires some fiddling due to the default output of by.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9
There is now an aggregate function in Matrix.utils. This can accomplish what you want with a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=TRUE)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider using another one of these other functions only if you were using a Matrix, especially a sparse one, or if you were performing an aggregation besides sum.
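For reference, rowsum is part of base R and needs no extra package. A minimal sketch on the small matrix from the question (rebuilt here by hand; the name agg is only for illustration):

```r
# The example matrix from the question, with duplicated row name "1".
M <- matrix(c(1, 1, 2, 0,
              2, 3, 4, 2,
              3, 0, 1, 2,
              4, 2, 5, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("1", "1", "2", "3"), c("a", "b", "c", "d")))

# Sum the rows that share a row name; the groups become the new row names.
agg <- rowsum(M, group = rownames(M))
agg
#   a b c d
# 1 3 4 6 2
# 2 3 0 1 2
# 3 4 2 5 2
```

This works on an ordinary dense matrix. Whether it is practical for a DocumentTermMatrix depends on whether converting it to a dense matrix fits in memory, which is not tested here.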
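The answer by James works as expected, but it is quite slow for large matrices. Here is a version that avoids creating new objects. Note that, as written, it combines rows by taking the column-wise max; replace max with sum to add the values instead: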
combineByRow <- function(m) {
m <- m[ order(rownames(m)), ]
## keep track of previous row name
prev <- rownames(m)[1]
i.start <- 1
i.end <- 1
## cache the rownames -- profiling shows that it takes
## forever to look at them
m.rownames <- rownames(m)
stopifnot(all(!is.na(m.rownames)))
## go through matrix in a loop, as we need to combine some unknown
## set of rows
for (i in 2:(1+nrow(m))) {
curr <- m.rownames[i]
## if we found a new row name (or are at the end of the matrix),
## combine all rows and mark invalid rows
if (prev != curr || is.na(curr)) {
if (i.start < i.end) {
m[i.start,] <- apply(m[i.start:i.end,], 2, max)
m.rownames[(1+i.start):i.end] <- NA
}
prev <- curr
i.start <- i
} else {
i.end <- i
}
}
m[ which(!is.na(m.rownames)),]
}
Testing shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=TRUE)
start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)
start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)
all(m1 == m2)

Multiply each row of a matrix by the matrix

I am a new R user and I have a kind of algorithm problem. I did some research on the web and on Stack Overflow, but I can't find an answer.
I have a square matrix, for example:
A B C D
A 0 0 0 1
B 0 1 1 0
C 1 0 0 0
D 0 1 1 1
This matrix represents links between keywords (A, B, C and D here). A '1' (or a TRUE) means keywords are in relation. For example, the '1' on the first row means A is linked to D.
I need to find the two most linked keywords on the matrix. I know I need to compute the scalar product between each row and the initial matrix. Then I take the sum of the rows and get the maximum.
But what R code puts the product of each row of my matrix with the matrix itself into a new matrix?
Thanks!
I thought I had a cleverer answer but it turns out to be slower ...
tmp1 <- function(a) {
n <- nrow(a)
aa <- apply(array(apply(a,1,"*",a),
rep(n,3)),3,rowSums)
apply(aa,2,which.max)
}
Previous solution:
tmp2 <- function(a) {
n <- nrow(a)
r <- numeric(n)
for(i in seq(n)) {
b <- rowSums(a[i,]*a)
r[i] <- which.max(b)
}
r
}
Test this on something reasonably large:
n <- 50
a <- matrix(0,nrow=n,ncol=n)
a[sample(length(a),size=n^2/5,replace=TRUE)] <- 1
all(tmp1(a)==tmp2(a)) ## TRUE
library(rbenchmark)
benchmark(tmp1(a),tmp2(a))
test replications elapsed relative user.self sys.self
1 tmp1(a) 100 4.030 9.264368 2.052 1.96
2 tmp2(a) 100 0.435 1.000000 0.232 0.20
You will presumably do even better if you can do it in terms of sparse matrices.
Like this?
a=matrix(c(0,0,0,1,0,1,1,0,1,0,0,0,0,1,1,1), ncol=4, byrow=TRUE)
for(i in 1:4){
b = rowSums(a[i,]*a)
print(which(b==max(b)))
}
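If "most linked" is read as the largest scalar product between two different rows (an interpretation, not something the question confirms), all pairwise row products can be obtained in one step with base R's tcrossprod. Note that a[i,]*a recycles the row down the columns, so it scales row r of a by a[i,r] rather than forming row-by-row scalar products. The sketch below uses the matrix from the question; the names s and best are only for illustration.

```r
a <- matrix(c(0, 0, 0, 1,
              0, 1, 1, 0,
              1, 0, 0, 0,
              0, 1, 1, 1),
            ncol = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# s[i, j] is the scalar product of row i and row j, i.e. sum(a[i, ] * a[j, ]).
s <- tcrossprod(a)

# Ignore the diagonal (a row with itself) and the mirrored lower triangle.
s[lower.tri(s, diag = TRUE)] <- NA

# Position of the largest remaining product.
best <- which(s == max(s, na.rm = TRUE), arr.ind = TRUE)
rownames(a)[best[1, ]]
# [1] "B" "D"
```

For this matrix rows B and D share two linked keywords, which is the largest overlap among distinct rows. If several pairs tie, best has one row per pair.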
