Adding successive four / n numbers in large matrix in R

I have a very large dataset with dimensions 60K x 4K. I am trying to add every four successive values in each row, column-wise. The following is a smaller example dataset.
set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 0 1 1 1 0 1 1 0 1 1 0 0
[2,] 1 0 0 1 0 1 1 0 1 0 0 0
[3,] 0 1 1 0 0 1 1 1 0 0 0 0
[4,] 1 1 0 1 1 1 1 1 0 0 0 0
Here is what I am trying to perform:
mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3
i.e. add every four values and output.
mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2
Keep going to end of matrix (here to 12).
mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]
Once first row is done apply the same to second row, like:
mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4]
mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]
The result will be nrow x (ncol)/4 matrix.
The expected result will look like:
col1-col4 col5-8 col9-12
row1 3 2 2
row2 2 2 1
row3 2 3 0
row4 3 4 0
The same applies to row 3 and every remaining row of the matrix. How can I loop over this efficiently?

While Matthew's answer is really cool (+1, btw), you can get a much (~100x) faster solution if you avoid apply and use the *Sums functions (in this case colSums), and a bit of vector manipulation trickery:
funSums <- function(mat) {
t.mat <- t(mat) # rows become columns
dim(t.mat) <- c(4, length(t.mat) / 4) # wrap columns every four items (this is what we want to sum)
t(matrix(colSums(t.mat), nrow=ncol(mat) / 4)) # sum our new 4 element columns, and reconstruct desired output format
}
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
funSums(mat)
Produces desired output:
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 2 2 1
[3,] 2 3 0
[4,] 3 4 0
Now, let's make something the real size and compare against the other options:
set.seed(123)
mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)
funApply <- function(mat) { # Matthew's Solution
apply(array(mat, dim=c(4, 4, ncol(mat) / 4)), MARGIN=c(1,3), FUN=sum)
}
funRcpp <- function(mat) { # David's Solution
roll_sum(mat, 4, by.column = F)[, seq_len(ncol(mat) - 4 + 1)%%4 == 1]
}
library(RcppRoll) # provides roll_sum, used by funRcpp
library(microbenchmark)
microbenchmark(times=10,
funSums(mat),
funApply(mat),
funRcpp(mat)
)
Produces:
Unit: milliseconds
expr min lq median uq max neval
funSums(mat) 4.035823 4.079707 5.256517 7.5359 42.06529 10
funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960 10
funRcpp(mat) 18.481184 20.364885 38.595383 106.0277 132.93382 10
And to check:
all.equal(funSums(mat), funApply(mat))
# [1] TRUE
all.equal(funSums(mat), funRcpp(mat))
# [1] TRUE
The key point is that the *Sums functions are fully "vectorized", inasmuch as all the calculations happen in C. apply still has to do a fair amount of work in R that is not vectorized in the primitive-C-function sense, so it is slower (but far more flexible).
Specific to this problem, it might be possible to make it 2-3x faster as about half the time is spent on the transpositions, which are only necessary so that the dim changes do what I need for colSums to work.
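As a rough sketch of how the transpositions might be avoided (not benchmarked at the 60K x 4K scale, so treat any speed gain as an assumption): adding the four column slices directly never calls t(). The name funSlices and the k argument are made up for this illustration.

```r
# Hypothetical helper: sum every k successive columns without transposing.
funSlices <- function(mat, k = 4) {
  stopifnot(ncol(mat) %% k == 0)
  # Slice j holds columns j, j + k, j + 2k, ...
  slices <- lapply(seq_len(k), function(j) {
    mat[, seq(j, ncol(mat), by = k), drop = FALSE]
  })
  # Element-wise sum of the k slices gives one column per group of k.
  Reduce(`+`, slices)
}

small <- matrix(1:24, nrow = 2)
funSlices(small)
#      [,1] [,2] [,3]
# [1,]   16   48   80
# [2,]   20   52   84
```

Whether this beats funSums probably depends on k, since it makes k passes over the data.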

Dividing the matrix up into a 3D array is one way:
apply(array(mat, dim=c(4, 4, 3)), MARGIN=c(1,3), FUN=sum)
# [,1] [,2] [,3]
# [1,] 3 2 2
# [2,] 2 2 1
# [3,] 2 3 0
# [4,] 3 4 0
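The dim=c(4, 4, 3) above is specific to the 4 x 12 example. A minimal sketch of the same idea for any matrix whose column count is a multiple of the group size could look like this (the function name sum_blocks_array and the k argument are assumptions, not part of the original answer). It works because R stores matrices column by column, so each slab of the array holds k adjacent columns.

```r
# Hypothetical generalisation of the 3D-array approach.
sum_blocks_array <- function(mat, k = 4) {
  stopifnot(ncol(mat) %% k == 0)
  # Dimensions: rows x columns-within-group x group.
  arr <- array(mat, dim = c(nrow(mat), k, ncol(mat) / k))
  # Sum over the middle dimension, keeping row and group.
  apply(arr, MARGIN = c(1, 3), FUN = sum)
}

sum_blocks_array(matrix(1:24, nrow = 2))
#      [,1] [,2] [,3]
# [1,]   16   48   80
# [2,]   20   52   84
```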

Here's another approach using the RcppRoll package
library(RcppRoll) # Uses C++/Rcpp
n <- 4 # The summing range
roll_sum(mat, n, by.column = F)[, seq_len(ncol(mat) - n + 1) %% n == 1]
## [,1] [,2] [,3]
## [1,] 3 2 2
## [2,] 2 2 1
## [3,] 2 3 0
## [4,] 3 4 0

This might be the slowest of all:
set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
mat
output <- sapply(seq(4,ncol(mat),4), function(i) { apply(mat,1,function(j){
sum(j[c(i-3, i-2, i-1, i)], na.rm=TRUE)
})})
output
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 2 2 1
[3,] 2 3 0
[4,] 3 4 0
Maybe nested for-loops would be slower, but this answer is pretty close to being nested for-loops.

Related

Transform adjacency lists to binary matrix in R

Given a list of the locations of 1s in each row, I'm trying to find an efficient way to construct a binary matrix. Here's a small example, although I’m trying to find something that scales well -
Given a binary matrix:
> M <- matrix(rbinom(25,1,0.5),5,5)
> M
[,1] [,2] [,3] [,4] [,5]
[1,] 0 1 1 1 0
[2,] 0 1 1 1 1
[3,] 1 1 0 1 1
[4,] 1 0 0 1 0
[5,] 0 1 1 0 0
I can transform M into an adjacency list using:
> Mlist <- apply(M==1, 1, which, simplify = FALSE)
> Mlist
[[1]]
[1] 2 3 4
[[2]]
[1] 2 3 4 5
[[3]]
[1] 1 2 4 5
[[4]]
[1] 1 4
[[5]]
[1] 2 3
I'd like to transform Mlist back into M. One possibility is:
M.new <- matrix(0,5,5)
for (row in 1:5){M.new[row,Mlist[[row]]] <- 1}
But, it seems like there should be a more efficient way.
Thanks!
1) Using M and Mlist defined in the Note at the end, sapply over its components replacing a vector of zeros with ones at the needed locations. Transpose at the end.
M2 <- t(sapply(Mlist, replace, x = integer(length(Mlist)), 1L))
identical(M, M2) # check that M2 equals M
## [1] TRUE
2) A variation with slightly more keystrokes, but faster, would be
M3 <- do.call("rbind", lapply(Mlist, replace, x = integer(length(Mlist)), 1L))
identical(M, M3)
## [1] TRUE
Benchmark
Here ex1 and ex2 are (1) and (2) above and ex0 is the for loop in the question except we used integer instead of double. Note that (2) is about 100x faster than the loop in the question.
library(microbenchmark)
microbenchmark(
ex0 = { M.new <- matrix(0L,5,5); for (row in 1:5){M.new[row,Mlist[[row]]] <- 1L} },
ex1 = t(sapply(Mlist, replace, x = integer(length(Mlist)), 1L)),
ex2 = do.call("rbind", lapply(Mlist, replace, x = integer(length(Mlist)), 1L))
)
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
ex0 4454.4 4504.15 4639.111 4564.1 4670.10 8450.2 100 b
ex1 73.1 84.75 98.220 94.3 111.75 130.8 100 a
ex2 32.0 36.20 43.866 42.7 51.85 82.5 100 a
Note
set.seed(123)
M <- matrix(rbinom(25,1,0.5),5,5)
Mlist <- apply(M==1, 1, which, simplify = FALSE)
Use vectorized row/column indexing: replicate the row numbers of 'Mlist' by the lengths of its elements, and cbind them with the unlisted 'Mlist' to create a two-column index matrix. That index matrix selects the elements of 'M.new' to set to 1.
M.new <- matrix(0, 5, 5) # start from all zeros
ind <- cbind(rep(seq_along(Mlist), lengths(Mlist)), unlist(Mlist))
M.new[ind] <- 1
-checking
> all.equal(M, M.new)
[1] TRUE
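If the matrix is not square, or some rows have no 1s at all, the index-matrix idea can be wrapped up roughly as follows. This is only a sketch: the name list_to_binary and the n_col argument are invented here, and the default for n_col is a guess that cannot know about trailing all-zero columns.

```r
# Hypothetical wrapper around matrix indexing with an explicit column count.
list_to_binary <- function(adj, n_col = max(unlist(adj), 0L)) {
  out <- matrix(0L, nrow = length(adj), ncol = n_col)
  # One (row, column) pair per 1 in the result; empty rows contribute nothing.
  idx <- cbind(rep(seq_along(adj), lengths(adj)), unlist(adj))
  out[idx] <- 1L
  out
}

adj <- list(c(2, 3, 4), integer(0), c(1, 5))
list_to_binary(adj, n_col = 5)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    1    1    0
# [2,]    0    0    0    0    0
# [3,]    1    0    0    0    1
```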
Or another option is sparseMatrix
library(Matrix)
as.matrix(sparseMatrix(i = rep(seq_along(Mlist), lengths(Mlist)),
j = unlist(Mlist), x = 1))
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 1 1 1
[2,] 0 1 0 1 0
[3,] 1 0 0 1 0
[4,] 0 1 0 1 0
[5,] 1 0 1 1 1

Create matrix with for-loop

I am trying to create the following matrix A for n rows and n+1 columns. n will likely be around 20 or 30, but for the purpose of the question I put it at 4 and 5.
Here is what I have so far:
N <- 5 # n+1
n <- 4 # n
columns <- list()
# first column:
columns[1] <- c(-1, 1, rep(0, N-2))
# all other columns:
for(i in N:2) {
columns[i] <- c((rep(0, N-i), 1, -2, 1, rep(0, i-3)))
}
# combine into matrix:
A <- cbind(columns)
I keep getting the following error msg:
In columns[1] <- c(-1, 1, rep(0, N - 2)) :
number of items to replace is not a multiple of replacement length
And later
"for(i in N:2) {
columns[i] <- c((rep(0, N-i),"
}
Error: unexpected '}' in "}"
I guess you can try the for loop below to create your matrix A:
N <- 5
n <- 4
A <- matrix(0,n,N)
for (i in 1:nrow(A)) {
if (i == 1) {
A[i,1:2] <- c(-1,1)
} else {
A[i,i+(-1:1)] <- c(1,-2,1)
}
}
such that
> A
[,1] [,2] [,3] [,4] [,5]
[1,] -1 1 0 0 0
[2,] 1 -2 1 0 0
[3,] 0 1 -2 1 0
[4,] 0 0 1 -2 1
Another solution is to use outer, and this method would be faster and looks more compact than the for loop approach, i.e.,
A <- `diag<-`(replace(z<-abs(outer(1:n,1:N,"-")),!z %in% c(0,1),0),
c(-1,rep(-2,length(diag(z))-1)))
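For readers who find the one-liner hard to parse, here is the same idea unpacked into steps. It is a sketch of what the expression does, using ifelse in place of replace, not a claim about how the answer's author would write it.

```r
n <- 4
N <- n + 1

# z[i, j] = |i - j|, the distance of each cell from the main diagonal.
z <- abs(outer(1:n, 1:N, "-"))

# Ones where the distance is exactly 1 (the two off-diagonals), zeros elsewhere.
A <- ifelse(z == 1, 1, 0)

# Main diagonal: -1 in the first row, -2 in all others.
diag(A) <- c(-1, rep(-2, n - 1))
A
#      [,1] [,2] [,3] [,4] [,5]
# [1,]   -1    1    0    0    0
# [2,]    1   -2    1    0    0
# [3,]    0    1   -2    1    0
# [4,]    0    0    1   -2    1
```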
I thought this would be fast compared to the loop, but when I tested on a 5000x5001 example, the loop in ThomasIsCoding's answer was about 5x faster. Go with that one!
N = 5
n = N - 1
A = matrix(0, nrow = n, ncol = N)
delta = row(A) - col(A)
diag(A) = -2
A[delta %in% c(1, -1)] = 1
A[1, 1] = -1
A
# [,1] [,2] [,3] [,4] [,5]
# [1,] -1 1 0 0 0
# [2,] 1 -2 1 0 0
# [3,] 0 1 -2 1 0
# [4,] 0 0 1 -2 1
You could use data.table::shift to shift the vector c(1, -2, 1, 0) by all increments from -1 (backwards shift / lead by 1) to n - 1 (forward shift / lagged by n - 1) and then cbind all the shifted outputs together. The first-row first-column element doesn't follow this pattern so that's fixed at the end.
library(data.table)
out <- do.call(cbind, shift(c(1, -2, 1, 0), seq(-1, n - 1), fill = 0))
out[1, 1] <- -1
out
# [,1] [,2] [,3] [,4] [,5]
# [1,] -1 1 0 0 0
# [2,] 1 -2 1 0 0
# [3,] 0 1 -2 1 0
# [4,] 0 0 1 -2 1

Randomly populate R dataframe with integers between

I would like to create an R dataframe with random integers WITHOUT repetition.
I have come up with this approach which works:
rank_random<-data.frame(matrix(NA, nrow = 13, ncol = 30))
for (colIdx in seq(1:30)) {
rank_random[colIdx,] <-sample(1:ncol(subset(exc_ret, select=-c(Date))), 30,
replace=F)
}
I assume that you mean without repetition on each row. If you meant something else, please clarify.
For your example:
N= ncol(subset(exc_ret, select=-c(Date)))
num.rows = 30
t(sapply( seq(num.rows),
FUN=function(x){sample(1:N, num.rows, replace=F)} ))
To test it for a simpler case
N= 5
num.rows = 5
t(sapply( seq(num.rows),
FUN=function(x){sample(1:N, num.rows, replace=F)} ))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 2 4 5 1 3
# [2,] 2 5 1 3 4
# [3,] 5 1 4 3 2
# [4,] 3 4 5 2 1
# [5,] 3 2 5 1 4
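A slightly more compact variant, sketched here with replicate and with the number of rows kept separate from the number of values drawn per row. The names random_rows, n_rows, n_cols and the example sizes are made up for illustration; N must be at least n_cols or sample will fail.

```r
# Hypothetical helper: each row is a draw of n_cols distinct integers from 1:N.
random_rows <- function(n_rows, n_cols, N) {
  stopifnot(n_cols <= N)
  # replicate() returns one draw per column, so transpose to get one per row.
  t(replicate(n_rows, sample(N, n_cols)))
}

set.seed(1)
ranks <- random_rows(n_rows = 13, n_cols = 30, N = 50)
dim(ranks)
# [1] 13 30
rank_random <- as.data.frame(ranks)
```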

Generate all possible binary vectors of length n in R

I'm looking to generate all possible binary vectors of length n in R. What is the best way (preferably both computationally efficient and readable code) to do this?
n = 3
expand.grid(replicate(n, 0:1, simplify = FALSE))
# Var1 Var2 Var3
#1 0 0 0
#2 1 0 0
#3 0 1 0
#4 1 1 0
#5 0 0 1
#6 1 0 1
#7 0 1 1
#8 1 1 1
Inspired by this question generating all possible binary vectors of length n containing less than m 1s, I've extended this code to produce all possible combinations. It's not pretty, though.
> n <- 3
> z <- rep(0, n)
> do.call(rbind, lapply(0:n, function(i) t(apply(combn(1:n,i), 2, function(k) {z[k]=1;z}))))
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 0 0
[3,] 0 1 0
[4,] 0 0 1
[5,] 1 1 0
[6,] 1 0 1
[7,] 0 1 1
[8,] 1 1 1
What is it doing? Once we strip it back, the core of this one-liner is the following:
apply(combn(1:n,i), 2, function(k) {z[k]=1;z})
To understand this, let's step back one level further. The function combn(x,m) generates all possible combinations of x taken m at a time.
> combn(1:n, 1)
[,1] [,2] [,3]
[1,] 1 2 3
> combn(1:n, 2)
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
> combn(1:n, 3)
[,1]
[1,] 1
[2,] 2
[3,] 3
With apply(MARGIN=2), we pass one column of this matrix at a time to our inner function, function(k) {z[k]=1;z}, which simply replaces the values at the indices k with 1. Since they were originally all 0, this gives us each possible binary vector.
The rest is just window dressing. apply gives us a wide, short matrix; we transpose it with t. lapply returns a list; we bind the matrices in each element of the list together with do.call(rbind, .).
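To make the inner step concrete, here is what it produces for a single value of i (i = 2 with n = 3). This is just the core expression from above run on its own, with the result stored under the made-up name two_ones.

```r
n <- 3
z <- rep(0, n)

# Each column of combn(1:n, 2) is a pair of positions to set to 1.
two_ones <- t(apply(combn(1:n, 2), 2, function(k) { z[k] <- 1; z }))
two_ones
#      [,1] [,2] [,3]
# [1,]    1    1    0
# [2,]    1    0    1
# [3,]    0    1    1
```

These are rows 5 to 7 of the full result shown earlier.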
You should define what is "the best way" (fastest? shortest code?, etc.).
One way is to use the package R.utils and the function intToBin for converting decimal numbers to binary numbers. See the example.
require(R.utils)
n <- 5
strsplit(intToBin(0:(2 ^ n - 1)), split = "")
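If you would rather avoid the extra package and get a numeric matrix instead of character vectors, a similar result can be sketched in base R with intToBits. The function name binary_vectors is made up, and the cap at n = 30 is an assumption to stay within 32-bit integers.

```r
# Hypothetical base-R helper: one row per binary vector of length n.
binary_vectors <- function(n) {
  stopifnot(n >= 1, n <= 30)
  bits <- lapply(0:(2^n - 1), function(i) {
    # intToBits gives 32 bits, least significant first; keep the first n.
    as.integer(intToBits(i))[seq_len(n)]
  })
  do.call(rbind, bits)
}

binary_vectors(3)
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    0    0
# [3,]    0    1    0
# [4,]    1    1    0
# [5,]    0    0    1
# [6,]    1    0    1
# [7,]    0    1    1
# [8,]    1    1    1
```

The first column is the least significant bit, so the row order matches the expand.grid output.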

splitting integers and converting into matrix

I was wondering whether it is possible to string-split each integer in a set of numbers and transform the result into a transition matrix, e.g.
data<-c(11,123,142,1423,1234,12)
What I would like to do is split each integer in the data into its digits and count the transitions between consecutive digits. Considering only the first two elements of the dataset: the first element becomes 1,1 and the second becomes 1,2,3. Then 1,1 is a transition from 1 to 1, 1,2 is a transition from 1 to 2, and 2,3 is a transition from 2 to 3, generating the following matrix:
1 2 3 4 5
1 1 1 0 0 0
2 0 0 1 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
My matrix will never go past 5x5. Below is what I have done, which works but is really, really tedious.
data2<-as.matrix(as.character(data))
for(i in 1:nrow(data2)) {
values<-strsplit(data2,"")
}
values2<-t(sapply(values, '[', 1:max(sapply(values, length))))
values2[is.na(values2)]<-0
values3<-apply(values2,2,as.numeric)
from1to1<-0
from1to2<-0
from1to3<-0
from1to4<-0
from1to5<-0
from2to1<-0
from2to2<-0
from2to3<-0
from2to4<-0
...
from5to4<-0
from5to5<-0
for(i in 1:nrow(values3)){
for(j in 1:(ncol(values3)-1))
if (((values3[i,j]==1)&(values3[i,j+1]==1))){
from1to1<-from1to1 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==2))){
from1to2<-from1to2 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==3))){
from1to3<-from1to3 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==4))){
from1to4<-from1to4 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==5))){
from1to5<-from1to5 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==1))){
from1to1<-from1to1 + 1
}else{.....continues through all other from2to1...from5to5
I then place every single number into a 5x5 matrix.
This is obviously tedious, long and ridiculous. Is there any way to shorten this? Any suggestions are appreciated.
Here's an option, presented here piped so as to be easy to follow:
library(magrittr) # for the pipe
# initialize a matrix of zeros
mat <- matrix(0, 5, 5)
# split each element into individual digits
strsplit(as.character(data), '') %>%
# turn list elements back to integers
lapply(as.integer) %>%
# make a 2 column matrix of each digit paired with the previous digit
lapply(function(x){matrix(c(x[-length(x)], x[-1]), ncol = 2)}) %>%
# reduce list to a single 2-column matrix
do.call(rbind, .) %>%
# for each row, add 1 to the element of mat they subset
apply(1, function(x){mat[x[1], x[2]] <<- mat[x[1], x[2]] + 1; x})
# output is the transpose of the matrix; the real results are stored in mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 1 1 2 1 4 1 4 2 1 2 3 1
## [2,] 1 2 3 4 2 4 2 3 2 3 4 2
mat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 0 2 0
## [2,] 0 0 3 0 0
## [3,] 0 0 0 1 0
## [4,] 0 2 0 0 0
## [5,] 0 0 0 0 0
Alternately, if you'd like xtabs as suggested by alexis_laz, replace the last line with xtabs(formula = ~ .[,1] + .[,2]) instead of using mat.
You might also check out the permutations package, which from what I can tell seems to be for working with this kind of data, though it's somewhat high-level.
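For comparison, the counting step can also be sketched in base R with table, which is close in spirit to the xtabs suggestion. This assumes all digits are between 1 and 5, as stated in the question; the variable names are only illustrative.

```r
data <- c(11, 123, 142, 1423, 1234, 12)

# Split each number into its digits.
digits <- lapply(strsplit(as.character(data), ""), as.integer)

# Pair every digit with the one that follows it.
from <- unlist(lapply(digits, function(x) x[-length(x)]))
to   <- unlist(lapply(digits, function(x) x[-1]))

# Fixing the factor levels keeps the table 5 x 5 even for unseen digits.
trans <- table(factor(from, levels = 1:5), factor(to, levels = 1:5))
trans
#     1 2 3 4 5
#   1 1 3 0 2 0
#   2 0 0 3 0 0
#   3 0 0 0 1 0
#   4 0 2 0 0 0
#   5 0 0 0 0 0
```

Use unclass(trans) or matrix(trans, 5, 5) if a plain matrix is needed.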
