What is the most efficient way to fill a sparse matrix? I know that sparse matrices are stored in CSC format, so I expected it to be fast to fill them column by column, like this:
using SparseArrays
M = 100
N = 1000
sparr = spzeros(Float64, M, N)
for n = 1:N
    # do some math here
    idx = <<boolean index array of nonzero elements in the nth column>>
    vals = <<values of nonzero elements in the nth column>>
    sparr[idx, n] = vals
end
However, I find that this scales very poorly with N. Is there a better way to fill the array? Or perhaps, I should not bother with filling the array and instead initialize the matrix differently?
You can do sparse(I, J, V, M, N) directly:
julia> using SparseArrays
julia> M = 100;
julia> N = 1000;
julia> nz = 2000; # number of nonzeros
julia> I = rand(1:M, nz); # dummy I indices
julia> J = rand(1:N, nz); # dummy J indices
julia> V = randn(nz); # dummy matrix values
julia> sparse(I, J, V, M, N)
100×1000 SparseMatrixCSC{Float64, Int64} with 1982 stored entries:
which should scale decently with size. For more expert use, you could directly construct the SparseMatrixCSC object.
Note that if you have to stick with the pseudo code you gave with a for-loop for column indices and values, you could simply concatenate them and create I, J, and V:
I = Int[]
J = Int[]
V = Float64[]
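for n = 1:N
    # do some math here
    idx = <<boolean index array of nonzero elements in the nth column>>
    vals = <<values of nonzero elements in the nth column>>
    rows = findall(idx)                  # row indices of the nonzeros in column n
    append!(I, rows)
    append!(J, fill(n, length(rows)))    # one column index per new entry
    append!(V, vals)
end
sparr = sparse(I, J, V, M, N)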
but I think that would be slower.
How can I find the sum over i < j (i, j = 1 to 25) of i without using a for loop in R?
This is exactly the expression I am trying to code. I need to get the indices of both i and j and calculate the sum of determination from there.
{(x_i, j_i)}, i = 1 to 25
We can use outer. With the data
i <- 1:25
j <- 1:25
x <- rnorm(25)
the number of index pairs with i < j is
sum(outer(i, j, FUN = `<`))
If we need to find the sum of 'x' over those pairs
sum(matrix(x, 25, 25)[outer(i, j, FUN = `<`)])
For vector x, you can try the code below
n = 25
v = 1:n
x = rnorm(n)
sum(rep(v, n) < rep(v, each = n))
sum(rep(x, n)[rep(v, n) < rep(v, each = n)])
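As a sanity check on the approaches above, here is a minimal sketch that assumes the target is the sum of x_i over all index pairs i < j. The names `idx`, `mask`, and `sum_closed` are illustrative and not from the original posts. Each x_i is counted once for every j > i, that is (n - i) times, which gives a closed form to compare against.

```r
set.seed(1)                       # only so the example is reproducible
n <- 25
idx <- seq_len(n)
x <- rnorm(n)

# mask[i, j] is TRUE exactly when i < j
mask <- outer(idx, idx, FUN = `<`)

# number of pairs with i < j; should equal n * (n - 1) / 2
pair_count <- sum(mask)

# matrix(x, n, n)[i, j] is x[i], so this sums x[i] over all pairs i < j
sum_outer <- sum(matrix(x, n, n)[mask])

# the same sum via the rep()-based comparison
sum_rep <- sum(rep(x, n)[rep(idx, n) < rep(idx, each = n)])

# closed form: x[i] appears once for each j > i
sum_closed <- sum(x * (n - idx))

isTRUE(all.equal(sum_outer, sum_closed))
isTRUE(all.equal(sum_rep, sum_closed))
```

The closed form avoids building the n-by-n mask at all, which may matter if n is much larger than 25.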
I am interested in making all gapped-kmers from a sequence, with a gapped-kmer defined as a sequence of length k separated by up to m positions from another sequence of length k. For example, "for sequence CAGAT the gappy pair kernel with k = 1 and m = 2 finds pairs of monomers with zero to two irrelevant positions in between, i.e. it finds the features CA, C.G, C..A, AG, A.A, A..T, GA, G.T and AT".
replacefxn <- function(x, k, m) {
  substr(x, k + 1, k + m) <- paste(rep("X", m), collapse = "")
  return(x)
}
gappedkmersfxn <- function(x, k, m) {
  n <- (2 * k + m)
  subseq <-
    substring(x, seq(from = 1, to = (nchar(x) - n + 1)), seq(from = n, to = nchar(x)))
  return(sapply(subseq, replacefxn, k, m))
}
allgappedkmersfxn <- function(x, k, m) {
  kmers <- list()
  for (i in 0:m) {
    kmers[[i]] <- gappedkmersfxn(x, k, i)
  }
  kmers <- unlist(kmers)
  return(kmers)
}
allgappedkmersfxn is how I have it implemented currently, but it does not add the features with no gap (m is maximum gap, but goes from 0 to m), thus not giving me all of the desired features (see example of "CAGAT"). In addition, it is very slow and inefficient when doing millions of sequences at a time. It is also coded poorly, but with limited experience in R I am not sure how to improve it.
What would the most efficient way of doing this be while making sure that all expected subsequences (ex: CAGAT -> CA, C.G, C..A, AG, A.A, A..T, GA, G.T and AT for k=1, m=2) are included in the output?
You may want to look at the implementation of a gappy pair kernel in the Bioconductor kebabs package.
## install from Bioconductor (requires the BiocManager package), then load
BiocManager::install("kebabs")
library(kebabs)
To generate a k = 1, m = 2 kernel:
gappyK1M2 <- gappyPairKernel(k = 1, m = 2)
To generate an explicit representation from a DNA sequence:
dnaseqs <- DNAStringSet("CAGAT")
dnaseqsrep <- getExRep(dnaseqs, gappyK1M2)
The k-mers are stored in the Dimnames slot:
[1] "A.A" "AG" "AT" "A..T" "CA" "C..A" "C.G" "GA" "G.T"
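If installing kebabs is not an option, the same feature set can be sketched in base R. This is a minimal illustration, not the kebabs implementation: the function name `gapped_kmers` and its arguments are hypothetical, it handles a single sequence, and it assumes R >= 3.3 for `strrep`. It loops over the gap sizes 0 to m (so the ungapped features are included) but is vectorized over the start positions.

```r
# Hypothetical helper: all gapped k-mer pairs of one sequence,
# with gaps of 0..m positions written as dots.
gapped_kmers <- function(x, k, m) {
  len <- nchar(x)
  out <- character(0)
  for (gap in 0:m) {
    width <- 2 * k + gap                 # total span of one feature
    if (width > len) next                # sequence too short for this gap
    starts <- seq_len(len - width + 1)   # every valid start position
    left  <- substring(x, starts, starts + k - 1)
    right <- substring(x, starts + k + gap, starts + width - 1)
    out <- c(out, paste0(left, strrep(".", gap), right))
  }
  out
}

features <- gapped_kmers("CAGAT", k = 1, m = 2)
features
```

For "CAGAT" with k = 1 and m = 2 this yields the nine features listed in the question. For millions of sequences, one would apply it per sequence (for example with `lapply`) or tabulate the results, and the kebabs kernels are likely to be faster.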
I am new to R and am trying to build a cumulative binomial distribution table, but I got stuck in the loop.
r = readline("please enter an integer n:")
p = seq(from = 0.1, to = 1,by = 0.1 )
r = seq(from = 0, to = 100)
n <- ""
for (each in r) {
As an alternative to the loop, you can use expand.grid to create all combinations of k and p, and avoid the loop entirely because pbinom accepts vectors.
# Create input values
p = 1:9/10
k = 0:25
n = 25
# Create all combinations of `k` and `p`: use this to make grid of values
ex <- expand.grid(p=p, k=k)
# Find probabilities for each value set
ex$P <- with(ex, pbinom(k, n, p ))
# Reshape to your required table format
round(reshape2::dcast(k ~ p, data=ex, value.var = "P"), 3)
Loop approach
# Values to match new example
p = 1:19/20
k = 0:25
n = 4
# Create matrix to match the dimensions of our required output
# We will fill this as we iterate through the loop
mat1 <- mat2 <- matrix(0, ncol=length(p), nrow=length(k), dimnames=list(k, p))
# Loop through the values of k
# We will also use the fact that you can pass vectors to `pbinom`
# so for each value of `k`, we pass the vector of `p`
# So we will update each row of our output matrix with
# each iteration of the loop
for(i in seq_along(k)){
  mat1[i, ] <- pbinom(k[i], n, p)
}
Just for completeness, we could have updated the columns of our output matrix instead, that is, for each value of p pass the vector k
for(j in seq_along(p)){
  mat2[, j] <- pbinom(k, n, p[j])
}
# Check that they give the same result
all.equal(mat1, mat2)
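The same table can also be built in one call with base R's `outer`, sketched here as an alternative to both the reshape and the loop. The name `tab` is illustrative. `outer` expands `k` and `p` into all pairs and calls the function once on the expanded vectors, which works because `pbinom` is vectorized.

```r
p <- 1:9 / 10
k <- 0:25
n <- 25

# rows index k, columns index p; entry [i, j] is P(X <= k[i]) for prob p[j]
tab <- outer(k, p, FUN = function(k, p) pbinom(k, size = n, prob = p))
dimnames(tab) <- list(k = as.character(k), p = as.character(p))

round(tab, 3)
```

This avoids the reshape2 dependency, at the cost of returning a matrix rather than a data frame.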
## simulate `N` uniformly distributed points on unit square
N <- 1000
x <- matrix(runif(2 * N), ncol = 2)
## count number of points inside unit circle
n <- 0; for(i in 1:N) {if (norm(x[i,]) < 1) {n <- n + 1} }
n <- n / N
## estimate of pi
4 * n
But I get:
"Error in norm(x[i,]): 'A' must be a numeric matrix"
Not sure what is wrong.
norm gives you an error because it expects a matrix. However, x[i, ] is not a matrix but a vector. In other words, when you extract a single row or column from a matrix, its dimension is dropped. You can use x[i, , drop = FALSE] to keep the matrix class.
The second issue is, you want L2-norm here. So set type = "2" inside norm. Altogether, use
norm(x[i, , drop = FALSE], type = "2") < 1
norm is not the only solution. You can also use either of the following:
sqrt(c(crossprod(x[i,])))
sqrt(sum(x[i,] ^ 2))
and in fact, they are more efficient. They also underpin the idea of using rowSums in the vectorized approach below.
We can avoid the loop via:
n <- mean(sqrt(rowSums(x ^ 2)) < 1) ## or simply `mean(rowSums(x ^ 2) < 1)`
sqrt(rowSums(x ^ 2)) gives the L2-norm of every row. After comparison with 1 (the radius) we get a logical vector, with TRUE indicating "inside the circle". The value n you want is just the proportion of TRUE values. You can sum over this logical vector and then divide by N, or simply take the mean of the vector.
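Putting the pieces together, here is a minimal end-to-end sketch comparing the loop with the vectorized count. The seed and the names `inside_loop`, `inside_vec`, and `pi_hat` are illustrative assumptions. Both versions compare the squared norm with 1, which is equivalent to comparing the norm with 1 and skips the square root.

```r
set.seed(42)                          # only so the example is reproducible
N <- 1000
x <- matrix(runif(2 * N), ncol = 2)   # N points in the unit square

# loop version: count points whose squared distance from the origin is below 1
inside_loop <- 0
for (i in seq_len(N)) {
  if (sum(x[i, ]^2) < 1) inside_loop <- inside_loop + 1
}

# vectorized version: one squared norm per row, then count the TRUEs
inside_vec <- sum(rowSums(x^2) < 1)

# the quarter circle has area pi / 4, so scale the proportion by 4
pi_hat <- 4 * inside_vec / N
pi_hat
```

The estimate varies with the seed and with N; its standard error shrinks roughly like 1 / sqrt(N).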
I am constructing a large, sparse matrix for which I know the number of non-zero elements in advance. Is it possible in R to allocate space for this matrix, instead of having its space automatically increased every time I add an element? Something like what spalloc does in Matlab.
As a simplified code-example of what I want, consider the construction of the following block-wise diagonal matrix.
library(Matrix)
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
x = Matrix(0, n*q, p*q, sparse=TRUE);
for(i in 1:q) {
  x[((i-1)*n+1):(i*n),((i-1)*p+1):(i*p)] = x_i;
}
I think this process would be much faster if I could tell R in advance that the matrix will contain n*p*q non-zero elements.
Thanks in advance!
Edit: I now see that for the blockwise matrix I should use bdiag():
library(Matrix)
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
lst = list();
for(i in 1:q) {
  lst[[i]] = x_i;   # [[i]] stores the whole matrix as one list element
}
x = bdiag(lst);
This is much faster.
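For the more general case where the non-zero positions are known up front but do not form diagonal blocks, one option is to collect all row indices, column indices, and values first and build the matrix in a single call to `sparseMatrix` from the Matrix package. The sketch below uses deliberately small, illustrative sizes and hypothetical names (`block`, `i_all`, `j_all`, `x_trip`), and checks the result against `bdiag`.

```r
library(Matrix)

set.seed(1)                           # only so the example is reproducible
n <- 3; p <- 4; q <- 2                # small sizes, for illustration only
block <- matrix(rnorm(n * p), n, p)

# bdiag route: q copies of the same block on the diagonal
x_bdiag <- bdiag(replicate(q, block, simplify = FALSE))

# triplet route: positions of the block's entries in column-major order ...
rows <- rep(seq_len(n), times = p)
cols <- rep(seq_len(p), each = n)
# ... shifted by n rows and p columns for each successive block
shift <- rep(seq_len(q) - 1, each = n * p)
i_all <- rep(rows, q) + shift * n
j_all <- rep(cols, q) + shift * p

x_trip <- sparseMatrix(i = i_all, j = j_all,
                       x = rep(as.vector(block), q),
                       dims = c(n * q, p * q))

isTRUE(all.equal(as.matrix(x_bdiag), as.matrix(x_trip)))
```

Because every entry is supplied at once, the matrix is assembled in one pass rather than being regrown on each assignment, which is the effect the preallocation was meant to achieve.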