Best way to fill a sparse matrix - julia

What is the most efficient way to fill a sparse matrix? I know that sparse matrices are stored in CSC format, so I expected filling them column by column to be fast, like this:
using SparseArrays
M = 100
N = 1000
sparr = spzeros(Float64, M, N)
for n = 1:N
# do some math here
idx = <<boolean index array of nonzero elements in the nth column>>
vals = <<values of nonzero elements in the nth column>>
sparr[idx, n] = vals
end
However, I find that this scales very poorly with N. Is there a better way to fill the array? Or perhaps, I should not bother with filling the array and instead initialize the matrix differently?

You can do sparse(I, J, V, M, N) directly:
julia> using SparseArrays
julia> M = 100;
julia> N = 1000;
julia> nz = 2000; # number of nonzeros
julia> I = rand(1:M, nz); # dummy I indices
julia> J = rand(1:N, nz); # dummy J indices
julia> V = randn(nz); # dummy matrix values
julia> sparse(I, J, V, M, N)
100×1000 SparseMatrixCSC{Float64, Int64} with 1982 stored entries:
⣻⣿⣿⣿⣿⡿⣾⣿⣿⣿⣿⣿⣿⣷⣾⣽⣿⢿⢿⣿⣿⣿⢿⣿⣾⣿⣽⣿⣿⣾⣿⣿⣿⣿⣿⣿⣿⣿⣻⣿
⣼⣿⣿⡿⣿⣿⡽⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣻⣿⡿⣿⣿⣿⡿⣿⡿⣯⢿⣿⠾⣿⣿⡿⢿⣿⣻⡿⣾
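which should scale decently with size. Note that sparse sums the values of duplicate (I, J) pairs, which is why only 1982 entries are stored here rather than 2000. For more expert use, you could directly construct the SparseMatrixCSC object.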
EDIT:
Note that if you have to stick with the pseudo code you gave with a for-loop for column indices and values, you could simply concatenate them and create I, J, and V:
I = Int[]
J = Int[]
V = Float64[]
for n = 1:N
# do some math here
idx = <<boolean index array of nonzero elements in the nth column>>
vals = <<values of nonzero elements in the nth column>>
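rows = findall(idx) # convert the boolean mask into row indices
append!(I, rows)
append!(J, fill(n, length(rows))) # one column index per stored value in this column
append!(V, vals)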
end
but that'd be slower I think.

Related

R: find sum of every i < j without using for loop

How can I find the sum over i < j (i, j = 1 to 25) of i in R without using a for loop?
This equation is what I am trying to code exactly, I need to get the index of both i and j and calculate sum of determination from there.
{(x_i, j_i)}i = 1 to 25
We can use outer
sum(outer(i, j, FUN = `<`))
If we need to find the sum of 'x'
sum(matrix(x, 25, 25)[outer(i, j, FUN = `<`)])
data
i <- 1:25
j <- 1:25
x <- rnorm(25)
For vector x, you can try the code below
sum(cumsum(x)[-length(x)])
# DATA
set.seed(42)
n = 25
v = 1:n
x = rnorm(n)
sum(rep(v, n) < rep(v, each = n))
sum(rep(x, n)[rep(v, n) < rep(v, each = n)])
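As a rough cross-check of the answers above, the sketch below assumes the target is the sum of x_i over all index pairs i < j. Under that assumption each x_i is counted once for every j > i, that is (n - i) times, so the mask-based sum, the cumsum shortcut, and a closed form should all agree. The names idx, mask, s_outer, s_cumsum, and s_closed are illustrative only.

```r
set.seed(42)
n <- 25
x <- rnorm(n)
idx <- seq_len(n)

# TRUE where the row index i is smaller than the column index j
mask <- outer(idx, idx, FUN = `<`)

# matrix(x, n, n) recycles x down each column, so element [i, j] is x[i]
s_outer <- sum(matrix(x, n, n)[mask])

# cumsum shortcut: the partial sums up to n - 1 count x_i once per j > i
s_cumsum <- sum(cumsum(x)[-n])

# closed form: x_i appears in (n - i) pairs
s_closed <- sum((n - idx) * x)

isTRUE(all.equal(s_outer, s_cumsum))
isTRUE(all.equal(s_outer, s_closed))
```

The closed form needs only O(n) work and no n-by-n matrix, which may matter if n grows well beyond 25.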

Generating all gapped k-mer sequences from string

I am interested in making all gapped k-mers from a sequence, with a gapped k-mer defined as a sequence of length k separated by up to m positions from another sequence of length k. So, for example, for "sequence CAGAT the gappy pair kernel with k = 1 and m = 2 finds pairs of monomers with zero to two irrelevant positions in between, i.e. it finds the features CA, C.G, C..A, AG, A.A, A..T, GA, G.T and AT".
replacefxn <- function(x, k, m) {
substr(x, k + 1, k + m) <- paste(rep("X", m), collapse = "")
return(x)
}
gappedkmersfxn <- function(x, k, m) {
n <- (2 * k + m)
subseq <-
substring(x, seq(from = 1, to = (nchar(x) - n + 1)), seq(from = n, to = nchar(x)))
return(sapply(subseq, replacefxn, k, m))
}
allgappedkmersfxn <- function(x, k, m) {
kmers <- list()
for (i in 0:m) {
kmers[[i]] <- gappedkmersfxn(x, k, i)
}
kmers <- unlist(kmers)
return(kmers)
}
allgappedkmersfxn is how I have it implemented currently, but it does not add the features with no gap (m is maximum gap, but goes from 0 to m), thus not giving me all of the desired features (see example of "CAGAT"). In addition, it is very slow and inefficient when doing millions of sequences at a time. It is also coded poorly, but with limited experience in R I am not sure how to improve it.
What would the most efficient way of doing this be while making sure that all expected subsequences (ex: CAGAT -> CA, C.G, C..A, AG, A.A, A..T, GA, G.T and AT for k=1, m=2) are included in the output?
Thanks!
You may want to look at the implementation of a gappy pair kernel in the Bioconductor kebabs package.
Installation:
## biocLite() is deprecated; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("kebabs")
To generate a k = 1, m = 2 kernel:
library(kebabs)
gappyK1M2 <- gappyPairKernel(k = 1, m = 2)
To generate an explicit representation from a DNA sequence:
dnaseqs <- DNAStringSet("CAGAT")
dnaseqsrep <- getExRep(dnaseqs, gappyK1M2)
The k-mers are stored in the Dimnames slot:
dnaseqsrep@Dimnames[[2]]
[1] "A.A" "AG" "AT" "A..T" "CA" "C..A" "C.G" "GA" "G.T"
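If installing kebabs is not an option, the same feature set can be sketched in base R. The function below is a hypothetical helper, not part of kebabs: gapped_kmers and its gap_char argument are names made up for this illustration. It loops only over the gap sizes 0 to m and uses the vectorized substring for everything else, so the zero-gap features are included.

```r
# Hypothetical helper (not from kebabs): all gappy pairs of one sequence.
# x: a single character string; k: k-mer length; m: maximum gap size.
gapped_kmers <- function(x, k, m, gap_char = ".") {
  len <- nchar(x)
  out <- character(0)
  for (g in 0:m) {                      # gap sizes 0, 1, ..., m
    width <- 2 * k + g                  # total span of one feature
    if (width > len) next               # sequence too short for this gap
    starts <- seq_len(len - width + 1)  # every valid start position
    left  <- substring(x, starts, starts + k - 1)
    right <- substring(x, starts + k + g, starts + width - 1)
    out <- c(out, paste0(left, strrep(gap_char, g), right))
  }
  unique(out)
}

features <- gapped_kmers("CAGAT", k = 1, m = 2)
features
```

For millions of sequences this could be applied with lapply or vapply over a character vector; whether it is fast enough compared with the compiled kebabs implementation is untested here.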

Create an accumulated binomial distribution table

I am new to R and trying to build a cumulative binomial distribution table, but I got stuck in the loop.
r = readline("please enter an interger n:")
p = seq(from = 0.1, to = 1,by = 0.1 )
r = seq(from = 0, to = 100)
n <- ""
for (each in r) {
x=qbinom(x,r,p)
}
print(x)
As an alternative to the loop, you can use expand.grid to create all combinations of k and p, and avoid the loop entirely because pbinom accepts vectors.
# Create input values
p = 1:9/10
k = 0:25
n = 25
# Create all combinations of `k` and `p`: use this to make grid of values
ex <- expand.grid(p=p, k=k)
# Find probabilities for each value set
ex$P <- with(ex, pbinom(k, n, p ))
# Reshape to your required table format
round(reshape2::dcast(k ~ p, data=ex, value.var = "P"), 3)
Loop approach
# Values to match new example
p = 1:19/20
k = 0:25
n = 4
# Create matrix to match the dimensions of our required output
# We will fill this as we iterate through the loop
mat1 <- mat2 <- matrix(0, ncol=length(p), nrow=length(k), dimnames=list(k, p))
# Loop through the values of k
# We will also use the fact that you can pass vectors to `pbinom`
# so for each value of `k`, we pass the vector of `p`
# So we will update each row of our output matrix with
# each iteration of the loop
for(i in seq_along(k)){
mat1[i, ] <- pbinom(k[i], n, p)
}
Just for completeness, we could have updated the columns of our output matrix instead, that is, for each value of p, pass the vector k
for(j in seq_along(p)){
mat2[, j] <- pbinom(k, n, p[j])
}
# Check that they give the same result
all.equal(mat1, mat2)
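Another loop-free variant, offered here as a sketch rather than as part of the original answer, is to let outer build the table directly. It relies only on pbinom being vectorized; the name tab is illustrative, and the values of p, k, and n are taken from the first example above.

```r
p <- 1:9 / 10
k <- 0:25
n <- 25

# outer() expands k and p to all combinations and calls the function once
# on the two expanded vectors; pbinom is vectorized, so this just works
tab <- outer(k, p, function(k, p) pbinom(k, n, p))
dimnames(tab) <- list(k = k, p = p)

# rows are k, columns are p, entries are P(X <= k)
round(tab, 3)
```

This avoids the reshape2 dependency and returns a plain matrix, which may or may not suit the table format you need.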

Error with `norm` function when estimating `pi` using Monte Carlo simulation on a unit circle

## simulate `N` uniformly distributed points on unit square
N <- 1000
x <- matrix(runif(2 * N), ncol = 2)
## count number of points inside unit circle
n <- 0; for(i in 1:N) {if (norm(x[i,]) < 1) {n <- n + 1} }
n <- n / N
## estimate of pi
4 * n
But I get:
"Error in norm(x[i,]): 'A' must be a numeric matrix"
Not sure what is wrong.
norm gives you an error because it expects a matrix. However, x[i, ] is not a matrix but a vector. In other words, when you extract a single row or column from a matrix, its dimension is dropped. You can use x[i, , drop = FALSE] to keep the matrix class.
The second issue is, you want L2-norm here. So set type = "2" inside norm. Altogether, use
norm(x[i, , drop = FALSE], type = "2") < 1
norm is not the only solution. You can also use either of the following:
sqrt(c(crossprod(x[i,])))
sqrt(sum(x[i,] ^ 2))
and in fact, they are more efficient. They also underpin the idea of using rowSums in the vectorized approach below.
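A small check of the points above, using a made-up 2 x 2 matrix whose first row is (0.3, 0.4), so the Euclidean norm of that row is 0.5. The variable names are illustrative only.

```r
x <- matrix(c(0.3, 0.4,
              0.6, 0.8), ncol = 2, byrow = TRUE)

# Extracting one row drops the dimension unless drop = FALSE is given
dropped <- is.matrix(x[1, ])                 # FALSE: plain numeric vector
kept    <- is.matrix(x[1, , drop = FALSE])   # TRUE: 1 x 2 matrix

# Three ways to get the L2 norm of the first row
n_norm  <- norm(x[1, , drop = FALSE], type = "2")
n_cross <- sqrt(c(crossprod(x[1, ])))
n_sum   <- sqrt(sum(x[1, ]^2))

c(n_norm, n_cross, n_sum)
```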
Vectorization
We can avoid the loop via:
n <- mean(sqrt(rowSums(x ^ 2)) < 1) ## or simply `mean(rowSums(x ^ 2) < 1)`
sqrt(rowSums(x ^ 2)) gives the L2-norm of every row. After comparing with 1 (the radius) we get a logical vector, with TRUE indicating "inside the circle". The value n you want is just the proportion of TRUE values. You can sum this logical vector and then divide by N, or simply take its mean.
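Put together, the vectorized estimate might look like the sketch below. The seed, the larger N, and the standard-error line are additions for illustration and are not part of the original answer; the standard error uses the usual binomial-proportion formula scaled by 4.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
N <- 1e5
x <- matrix(runif(2 * N), ncol = 2)

# TRUE for points of the unit square that fall inside the quarter circle
inside <- rowSums(x^2) < 1

# proportion inside approximates pi / 4
p_hat  <- mean(inside)
pi_hat <- 4 * p_hat

# rough standard error of the estimate (binomial proportion, scaled by 4)
se <- 4 * sqrt(p_hat * (1 - p_hat) / N)

c(estimate = pi_hat, std_error = se)
```

The error shrinks like 1 / sqrt(N), so each extra digit of accuracy costs roughly a hundred times more points.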

Allocating space for a sparse matrix in R

I construct a large, sparse matrix for which I know the number of non-zero elements in advance. Is it possible in R to allocate space for this matrix up front, instead of having its space automatically increased every time I add an element? Something like what spalloc does in Matlab.
As a simplified code-example of what I want, consider the construction of the following block-wise diagonal matrix.
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
x = Matrix(0, n*q, p*q, sparse=TRUE);
for(i in 1:q) {
x[((i-1)*n+1):(i*n),((i-1)*p+1):(i*p)] = x_i;
}
I think this process would be much faster if I could tell R in advance that the matrix will contain n*p*q non-zero elements.
Thanks in advance!
Edit: I now see that for the blockwise matrix I should use bdiag()
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
lst = list();
for(i in 1:q) {
lst[[i]] = x_i;
}
x = bdiag(lst);
This is much faster.
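For cases where bdiag does not fit, a common alternative to preallocation is to collect all (row, column, value) triplets first and build the matrix in one call to sparseMatrix from the Matrix package. The sketch below uses deliberately tiny dimensions so it runs instantly; the names blk, i, j, x_trip, and x_bdiag are illustrative, and the comparison with bdiag is only a sanity check.

```r
library(Matrix)

n <- 3; p <- 4; q <- 2                 # tiny sizes, for illustration only
x_i <- matrix(rnorm(n * p), n, p)

# Block number (0-based) for every entry of every copy of x_i
blk <- rep(seq_len(q) - 1, each = n * p)

# Row and column indices of x_i, shifted into the right diagonal block
i <- rep(as.vector(row(x_i)), q) + blk * n
j <- rep(as.vector(col(x_i)), q) + blk * p

# Build the whole sparse matrix in one step from the triplets
x_trip <- sparseMatrix(i = i, j = j, x = rep(as.vector(x_i), q),
                       dims = c(n * q, p * q))

# Same matrix via bdiag, as a cross-check
x_bdiag <- bdiag(replicate(q, x_i, simplify = FALSE))
isTRUE(all.equal(as.matrix(x_trip), as.matrix(x_bdiag),
                 check.attributes = FALSE))
```

Because the triplets are assembled as ordinary vectors, the sparse structure is created once instead of being regrown on every assignment.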
