I am a new R user and I have something of an algorithm problem. I have searched the web and Stack Overflow, but I can't find an answer.
I have a square matrix, for example:
A B C D
A 0 0 0 1
B 0 1 1 0
C 1 0 0 0
D 0 1 1 1
This matrix represents links between keywords (A, B, C and D here). A '1' (or a TRUE) means the two keywords are related. For example, the '1' in the first row means A is linked to D.
I need to find the two most strongly linked keywords in the matrix. I know I need to compute the scalar product between each row and the initial matrix, then take the row sums and find the maximum.
But what is the R code that puts the product between each row of my matrix and the matrix itself into a new matrix?
Thanks!
I thought I had a cleverer answer but it turns out to be slower ...
tmp1 <- function(a) {
    n <- nrow(a)
    aa <- apply(array(apply(a, 1, "*", a),
                      rep(n, 3)), 3, rowSums)
    apply(aa, 2, which.max)
}
Previous solution:
tmp2 <- function(a) {
    n <- nrow(a)
    r <- numeric(n)
    for (i in seq(n)) {
        b <- rowSums(a[i, ] * a)
        r[i] <- which.max(b)
    }
    r
}
Test this on something reasonably large:
n <- 50
a <- matrix(0,nrow=n,ncol=n)
a[sample(length(a),size=n^2/5,replace=TRUE)] <- 1
all(tmp1(a)==tmp2(a)) ## TRUE
library(rbenchmark)
benchmark(tmp1(a),tmp2(a))
> benchmark(tmp1(a),tmp2(a))
test replications elapsed relative user.self sys.self
1 tmp1(a) 100 4.030 9.264368 2.052 1.96
2 tmp2(a) 100 0.435 1.000000 0.232 0.20
You will presumably do even better if you can do it in terms of sparse matrices.
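For what it's worth, here is an untested sketch of that sparse-matrix idea using the Matrix package, with a being the 0/1 test matrix defined above; the cross-product a %*% t(a) collects in one step the same quantities that rowSums(a[i,]*a) computes one row at a time:
library(Matrix)
a.sp <- Matrix(a, sparse = TRUE)
## entry [i, k] of the cross-product is sum_j a[i,j] * a[k,j],
## i.e. the score that tmp2() computes for row i against row k
cooccur <- a.sp %*% t(a.sp)
## per-row maxima, matching the output of tmp1()/tmp2()
apply(as.matrix(cooccur), 1, which.max)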
Like this?
a <- matrix(c(0,0,0,1, 0,1,1,0, 1,0,0,0, 0,1,1,1), ncol=4, byrow=TRUE)
for (i in 1:4) {
    b <- rowSums(a[i, ] * a)
    print(which(b == max(b)))
}
I am developing a Bayesian hierarchical model in R, with BUGS code run through JAGS.
In my model, I have two matrices that hold related information about each other at exactly the same matrix positions. My information is structured by rows. I apply a mathematical operation to the first matrix, Distmat, by row:
diffmat[i,j] <- abs(Distmat[birthterr[i],j] - Dist[i])
I want to record the column position of the minimum value in each row of diffmat in a new vector, and then use this vector to index the second matrix. This would be relatively easy in regular R code using the functions which() or which.min():
a <- numeric()
for (i in 1:dim(diffmat)[1]) {
    a[i] <- which.min(diffmat[i, ])
}
And then apply vector "a" to the second matrix (terrmat) to obtain the values associated with Distmat positions:
b <- numeric(0)
for (i in 1:dim(diffmat)[1]) {
    b[i] <- terrmat[i, a[i]]
}
However, BUGS code apparently recognizes neither which() nor which.min(), and I am struggling to find a way to store these row positions in a vector. Perhaps there is a very simple solution to this, but I am really stuck. I hope my explanation is clear enough.
Any suggestions would be very appreciated. Thanks for your time!
Here's a minimal working example. The analogue is that x plays the role of your diffmat. I'm drawing it at random here, but it should still work if you define it otherwise. Essentially, you rank the values of x in each row and build a dummy matrix e that is coded 1 where x[i,j] has rank 1 and 0 otherwise. Then you take the inner product of that row with a vector of values 1:ncol(terrmat), assuming terrmat and diffmat have the same dimensions; that gives you the column index of the first-ranked value for observation i. The ymat in the example below is where your terrmat would go. I think it will be pretty slow on any real-sized problem, but it appears to work, judging from the output below.
dl <- list(
ymat = matrix(1:3, ncol=3, nrow=5, byrow=TRUE),
yinds = 1:3
)
mods <- "model{
for(i in 1:5){
for(j in 1:3){
x[i,j] ~ dnorm(0,1)
e[i,j] <- equals(rx[i,j], 1)
}
rx[i,1:3] <- rank(x[i,1:3])
ind[i] <- inprod(e[i,], yinds)
yval[i] <- ymat[i,ind[i]]
}
}"
library(runjags)
out <- run.jags(mods, data=dl, monitor="yval")
out
#
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#
# Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# yval[1] 1 2 3 1.9973 0.8139 2 0.0058421 0.7 19409 -0.0077146 0.99996
# yval[2] 1 2 3 2.0067 0.81605 3 0.0057704 0.7 20000 0.00049096 1.0003
# yval[3] 1 2 3 1.9895 0.8142 2 0.0057573 0.7 20000 0.00066309 1
# yval[4] 1 2 3 1.9973 0.81638 1 0.0057727 0.7 20000 -0.00040016 0.99998
# yval[5] 1 2 3 1.993 0.81611 1 0.0057708 0.7 20000 -0.0027988 0.99996
#
# Total time taken: 0.7 seconds
Suppose I write the following R code:
first.value <- sample(100, 100, replace=TRUE)
second.value <- sample(10, 100, replace=TRUE)
X <- data.frame(first.value, second.value)
split.X <- split(X, second.value)
This code creates a data frame with two fields and splits it into bins according to the second field. Now suppose I want to normalize each bin, i.e., subtract the mean and divide by the standard deviation. I could accomplish this with
normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})
But this creates a new list with the normalized versions of each bin. What I really want to do is replace the copy of the data in split.X with its normalized version.
To illustrate, here's some sample output:
> first.value <- sample(100, 100, replace=TRUE)
> second.value <- sample(10, 100, replace=TRUE)
> X <- data.frame(first.value, second.value)
> split.X <- split(X, second.value)
> normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})
> split.X[[1]]
first.value second.value
4 34 1
8 40 1
24 21 1
31 34 1
37 23 1
40 22 1
> normalized.first.value[[1]]
[1] 0.625 1.375 -1.000 0.625 -0.750 -0.875
What I really want to do is to put the values of normalized.first.value[[1]] into split.X[[1]]$first.value, and the same for the other indices.
This could be achieved with a for loop as follows:
for (i in 1:length(split.X)) {
    split.X[[i]]$first.value <- (split.X[[i]]$first.value - mean(split.X[[i]]$first.value)) / sd(split.X[[i]]$first.value)
}
But for loops are BAD in R, and I'd like to use sapply, lapply, etc. if I can. Unfortunately, when dealing with a list of data frames, sapply and lapply don't seem to iterate in the way I want.
Here's a more arcane way (though I still reckon the for loop is fine in this case)
new.split.X <- mapply(`[<-`, split.X, T, 'first.value', normalized.first.value,
SIMPLIFY=F)
How it works: it applies `[<-` to each split.X[[i]]. The T is the i index to replace (i.e. all rows), 'first.value' is the j index to replace (that column), and normalized.first.value contains the replacements.
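As a tiny standalone illustration of that mechanism (my own toy example, not part of the benchmark below), `[<-` can be called like an ordinary function:
df <- data.frame(x = 1:3, y = 4:6)
## same effect as df[TRUE, "x"] <- c(10, 20, 30), but returns the modified copy
`[<-`(df, TRUE, "x", c(10, 20, 30))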
The loop may be easier to read in the end though, and probably not slower than tricksy *apply solutions.
library(rbenchmark)
benchmark(loop={
for (i in 1:length(split.X))
split.X[[i]]$first.value <- normalized.first.value[[i]]
},
mapply={
mapply(`[<-`, split.X, T, 'first.value', normalized.first.value,
SIMPLIFY=F)
},
Map={
Map(function(x,y) {x[['first.value']] <- y;x} ,split.X, normalized.first.value)
},
lapply={
lapply(seq_along(split.X), function(i) {
x1 <- split.X[[i]]
x1[,'first.value'] <- normalized.first.value[[i]]
x1})
})
test replications elapsed relative user.self sys.self user.child sys.child
4 lapply 100 0.034 4.857 0.035 0 0 0
1 loop 100 0.007 1.000 0.007 0 0 0
3 Map 100 0.012 1.714 0.013 0 0 0
2 mapply 100 0.030 4.286 0.032 0 0 0
So the explicit loop is the fastest, and the easiest to read anyway.
You can use Map, as both lists have the same length. It works by replacing the first column of each element of 'split.X' with the corresponding list element of 'normalized.first.value'.
Map(function(x,y) {x[['first.value']] <- y;x} ,split.X, normalized.first.value)
Or we can loop over the indices of 'split.X', extract the corresponding elements of 'split.X' and 'normalized.first.value' by index, and then replace.
lapply(seq_along(split.X), function(i) {
x1 <- split.X[[i]]
x1[,'first.value'] <- normalized.first.value[[i]]
x1})
I have a 10M-row matrix with integer values.
A row in this matrix can look as follows:
1 1 1 1 2
I need to transform the row above to the following vector:
4 1 0 0 0 0 0 0 0
Other example:
1 2 3 4 5
To:
1 1 1 1 1 0 0 0 0
How can I do this efficiently in R?
Update:
There is a function that does exactly what I need: base::tabulate (suggested here before), but it is extremely slow (it took at least 15 minutes to go over my initial matrix).
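For a single row, tabulate does exactly the transformation described above (assuming, as in the examples, that the values run from 1 to 9):
tabulate(c(1, 1, 1, 1, 2), nbins = 9)
## [1] 4 1 0 0 0 0 0 0 0
tabulate(c(1, 2, 3, 4, 5), nbins = 9)
## [1] 1 1 1 1 1 0 0 0 0
## applied row by row, e.g. t(apply(x, 1, tabulate, 9)), this is what gets slow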
I would try something like this:
m <- nrow(x)
n <- ncol(x)
i.idx <- seq_len(m)
j.idx <- seq_len(n)
out <- matrix(0L, m, max(x))
for (j in j.idx) {
    ij <- cbind(i.idx, x[, j])
    out[ij] <- out[ij] + 1L
}
A for loop might sound surprising for a question that asks for an efficient implementation. However, this solution is vectorized for a given column and only loops through five columns. This will be many, many times faster than looping over 10 million rows using apply.
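To see why the out[ij] update is vectorized, here is a tiny self-contained illustration (with made-up values) of indexing a matrix by a two-column (row, column) matrix:
out <- matrix(0L, 3, 4)
ij <- cbind(1:3, c(2, 4, 1))   # one (row, column) pair per row of ij
out[ij] <- out[ij] + 1L        # increments cells (1,2), (2,4) and (3,1) in one shot
out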
Testing with:
n <- 1e7
m <- 5
x <- matrix(sample(1:9, n*m, T), n ,m)
this approach takes less than six seconds while a naive t(apply(x, 1, tabulate, 9)) takes close to two minutes.
I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (the same problem arises when using ddply). Is there a more efficient way to arrive at the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know they are stored in a sparse matrix format.
Here's a solution using by and colSums, but requires some fiddling due to the default output of by.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9
There is now an aggregate function in Matrix.utils. This can accomplish what you want with a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider using another one of these other functions only if you were using a Matrix, especially a sparse one, or if you were performing an aggregation besides sum.
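For reference, a minimal sketch of the rowsum() route on the same test data as above; it sums the rows of a base matrix that share a row name in a single call:
r <- rowsum(m, group = rownames(m))
## each row of r is the column-wise sum of the rows of m carrying that rowname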
The answer by James works as expected, but is quite slow for large matrices. Here is a version that avoids creating new objects:
combineByRow <- function(m) {
    m <- m[order(rownames(m)), ]

    ## keep track of previous row name
    prev <- rownames(m)[1]
    i.start <- 1
    i.end <- 1

    ## cache the rownames -- profiling shows that it takes
    ## forever to look at them
    m.rownames <- rownames(m)
    stopifnot(all(!is.na(m.rownames)))

    ## go through matrix in a loop, as we need to combine some unknown
    ## set of rows
    for (i in 2:(1 + nrow(m))) {
        curr <- m.rownames[i]

        ## if we found a new row name (or are at the end of the matrix),
        ## combine all rows and mark invalid rows
        if (prev != curr || is.na(curr)) {
            if (i.start < i.end) {
                m[i.start, ] <- apply(m[i.start:i.end, ], 2, max)
                m.rownames[(1 + i.start):i.end] <- NA
            }
            prev <- curr
            i.start <- i
        } else {
            i.end <- i
        }
    }

    m[which(!is.na(m.rownames)), ]
}
Testing it shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)
start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)
all(m1 == m2)
I'm trying to speed up/vectorize some calculations in a time series.
Can I vectorize a calculation in a for loop which can depend on results from an earlier iteration? For example:
z <- c(1,1,0,0,0,0)
zi <- 2:6
for (i in zi) {z[i] <- ifelse (z[i-1]== 1, 1, 0) }
uses the z[i] values updated in earlier steps:
> z
[1] 1 1 1 1 1 1
In my effort at vectorizing this
z <- c(1,1,0,0,0,0)
z[zi] <- ifelse( z[zi-1] == 1, 1, 0)
the element-by-element operations don't use results updated in the operation:
> z
[1] 1 1 1 0 0 0
So this vectorized operation operates in 'parallel' rather than iterative fashion. Is there a way I can write/vectorize this to get the results of the for loop?
ifelse is vectorized and there's a bit of a penalty if you're using it on one element at a time in a for-loop. In your example, you can get a pretty good speedup by using if instead of ifelse.
fun1 <- function(z) {
    for (i in 2:NROW(z)) {
        z[i] <- ifelse(z[i-1] == 1, 1, 0)
    }
    z
}
fun2 <- function(z) {
    for (i in 2:NROW(z)) {
        z[i] <- if (z[i-1] == 1) 1 else 0
    }
    z
}
z <- c(1,1,0,0,0,0)
identical(fun1(z),fun2(z))
# [1] TRUE
system.time(replicate(10000, fun1(z)))
# user system elapsed
# 1.13 0.00 1.32
system.time(replicate(10000, fun2(z)))
# user system elapsed
# 0.27 0.00 0.26
You can get some additional speed gains out of fun2 by compiling it.
library(compiler)
cfun2 <- cmpfun(fun2)
system.time(replicate(10000, cfun2(z)))
# user system elapsed
# 0.11 0.00 0.11
So there's a 10x speedup without vectorization. As others have said (and some have illustrated) there are ways to vectorize your example, but that may not translate to your actual problem. Hopefully this is general enough to be applicable.
The filter function may be useful to you as well, if you can figure out how to express your problem in terms of an autoregressive or moving average process.
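A small sketch of that filter() suggestion (this does not reproduce the if/else rule above; it only illustrates the recursive-filter mechanism, here with an AR coefficient of 1):
x <- c(1, 1, 0, 0, 0, 0)
## y[i] = x[i] + 1 * y[i-1], computed in compiled code; with a coefficient
## of 1 this amounts to a cumulative sum of x
stats::filter(x, filter = 1, method = "recursive")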
This is a nice and simple example where Rcpp can shine.
So let us first recast functions 1 and 2 and their compiled counterparts:
library(inline)
library(rbenchmark)
library(compiler)
fun1 <- function(z) {
    for (i in 2:NROW(z)) {
        z[i] <- ifelse(z[i-1] == 1, 1, 0)
    }
    z
}
fun1c <- cmpfun(fun1)
fun2 <- function(z) {
    for (i in 2:NROW(z)) {
        z[i] <- if (z[i-1] == 1) 1 else 0
    }
    z
}
fun2c <- cmpfun(fun2)
We can write an Rcpp variant very easily:
funRcpp <- cxxfunction(signature(zs="numeric"), plugin="Rcpp", body="
    Rcpp::NumericVector z = Rcpp::NumericVector(zs);
    int n = z.size();
    for (int i = 1; i < n; i++) {
        z[i] = (z[i-1] == 1.0 ? 1.0 : 0.0);
    }
    return(z);
")
This uses the inline package to compile, load and link the five-liner on the fly.
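As an aside (not part of the original answer), current Rcpp can compile the same kernel without inline via cppFunction; clone() is used here so the caller's vector is not modified in place:
library(Rcpp)
cppFunction('
NumericVector funRcpp2(NumericVector zs) {
    NumericVector z = clone(zs);   // work on a copy, leave zs untouched
    int n = z.size();
    for (int i = 1; i < n; i++) {
        z[i] = (z[i-1] == 1.0 ? 1.0 : 0.0);
    }
    return z;
}')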
Now we can define our test data, which we make a little longer than the original (since running the original only a few times results in unmeasurable times):
R> z <- rep(c(1,1,0,0,0,0), 100)
R> identical(fun1(z),fun2(z),fun1c(z),fun2c(z),funRcpp(z))
[1] TRUE
R>
All answers are seen as identical.
Finally, we can benchmark:
R> res <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=1000)
R> print(res)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 1000 0.005 1.0 0.01 0
4 fun2c(z) 1000 0.466 93.2 0.46 0
2 fun2(z) 1000 1.918 383.6 1.92 0
3 fun1c(z) 1000 10.865 2173.0 10.86 0
1 fun1(z) 1000 12.480 2496.0 12.47 0
The compiled version wins by a factor of almost 400 against the best R version, and by almost 100 against its byte-compiled variant. For function 1, byte compilation matters much less, and both variants trail the C++ version by a factor of well over two thousand.
It took about one minute to write the C++ version. The speed gain suggests it was a minute well spent.
For comparison, here is the result for the original short vector called more often:
R> z <- c(1,1,0,0,0,0)
R> res2 <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications",
+ "elapsed", "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=10000)
R> print(res2)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 10000 0.046 1.000000 0.04 0
4 fun2c(z) 10000 0.132 2.869565 0.13 0
2 fun2(z) 10000 0.271 5.891304 0.27 0
3 fun1c(z) 10000 1.045 22.717391 1.05 0
1 fun1(z) 10000 1.202 26.130435 1.20 0
The qualitative ranking is unchanged: the Rcpp version dominates, and function 2 is second-best, with the byte-compiled version about twice as fast as the plain R variant but still almost three times slower than the C++ version. The relative differences are smaller, though: relatively speaking, the function call overhead matters more here and the actual looping matters less, and it is on the actual loop operations of the longer vectors that C++ gains its bigger advantage. That is an important result, as it suggests that on more realistically sized data, the compiled version may reap a larger benefit.
Edited to correct two small oversights in the code examples. And edited again with thanks to Josh for catching a setup error affecting fun2c.
I think this is cheating and not generalizable, but: according to the rules you have above, any occurrence of 1 in the vector will make all subsequent elements 1 (by recursion: z[i] is set to 1 if z[i-1] equals 1; therefore z[i] will be set to 1 if z[i-2] equals 1; and so forth). Depending on what you really want to do, there may be such a recursive solution available if you think carefully about it ...
z <- c(1,1,0,0,0,0)
first1 <- min(which(z==1))
z[seq_along(z)>first1] <- 1
edit: this is wrong, but I'm leaving it up to admit my mistakes. Based on a little bit of playing (and less thinking), I think the actual solution to this recursion is more symmetric and even simpler:
rep(z[1],length(z))
Test cases:
z <- c(1,1,0,0,0,0)
z <- c(0,1,1,0,0,0)
z <- c(0,0,1,0,0,0)
Check out the rollapply function in zoo.
I'm not super familiar with it, but I think this does what you want:
> c( 1, rollapply(z,2,function(x) x[1]) )
[1] 1 1 1 1 1 1
I'm sort of kludging it by using a window of 2 and then only using the first element of that window.
For more complicated examples you could perform some calculation on x[1] and return that instead.
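For instance (my own toy variant, still assuming zoo is loaded), the window calculation could be a running mean instead of just taking x[1]:
library(zoo)
z <- c(1, 1, 0, 0, 0, 0)
rollapply(z, 2, mean)   # mean of each width-2 window: 1.0 0.5 0.0 0.0 0.0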
Sometimes you just need to think about it totally differently. What you're doing is creating a vector where every item equals the first element if that element is a 1, and 0 otherwise.
z <- c(1,1,0,0,0,0)
if (z[1] != 1) z[1] <- 0
z[2:length(z)] <- z[1]
There is a function that does this particular calculation: cumprod (cumulative product)
> cumprod(z[zi])
[1] 1 0 0 0 0
> cumprod(c(1,2,3,4,0,5))
[1] 1 2 6 24 0 0
Otherwise, vectorize with Rcpp as other answers have shown.
It's also possible to do this with "apply" using the original vector and a lagged version of the vector as the constituent columns of a data frame.
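A rough, untested sketch of that lagged-data-frame idea (note that, like the vectorized ifelse attempt earlier, it only sees the original lagged values, not values updated during the pass, so it does not reproduce the recursive loop in general):
z <- c(1, 1, 0, 0, 0, 0)
d <- data.frame(prev = c(NA, head(z, -1)), cur = z)
out <- apply(d[-1, ], 1, function(r) if (r["prev"] == 1) 1 else 0)
c(z[1], unname(out))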