I use the dgCMatrix class from the Matrix package to store a square matrix of about 255 million entries, which takes up only about 1.7 MB.
However, after I perform variable <- variable/rowSums(variable), where variable is the sparse matrix, the result changes to class dgeMatrix and its size balloons to almost 2 GB, effectively taking up all available memory and in some instances crashing the script.
Is there a way to coerce the output to remain in the dgCMatrix class?
I suspect the reason is that the number of non-zero elements increases to the point that the matrix is no longer considered sparse, due to the introduction of NaN in rows whose sums are zero. If there is a workaround to address the NaNs, I'm open to that too. Note, however, that I cannot avoid producing the zero rows: my matrix needs to be square, and the corresponding column sums are generally non-zero.
You could try a simple ifelse() on the divisor:
variable <- variable/ifelse(rowSums(variable)!=0,rowSums(variable),1)
Unless there's some reason you need to be dividing by zero there, that seems like the simplest way to avoid NaNs.
I have the same problem. This is the workaround I am using to avoid NaNs and keep the output in the dgCMatrix class:
tmp <- 1/rowSums(variable)          # reciprocal of each row sum
tmp[is.infinite(tmp)] <- 0          # rows summing to zero would give Inf; zero them out
variable <- variable * tmp          # multiplying (rather than dividing) keeps the matrix sparse
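Here is a self-contained sketch of that workaround on a made-up toy matrix (all names here are illustrative, not from the original post):
library(Matrix)

# toy 4 x 4 sparse matrix with one all-zero row (row 3)
m <- sparseMatrix(i = c(1, 2, 2, 4), j = c(1, 3, 4, 2),
                  x = c(2, 1, 3, 5), dims = c(4, 4))

inv_rs <- 1/rowSums(m)
inv_rs[is.infinite(inv_rs)] <- 0   # zero rows stay zero instead of producing NaN

m_norm <- m * inv_rs               # the vector recycles down each column, so row i is scaled by inv_rs[i]
class(m_norm)                      # remains a sparse "dgCMatrix"
rowSums(m_norm)                    # 1 1 0 1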
I'm creating a data frame from a deck of cards (1,2,3,3,4,4,5,6,7,8). When I try to plot with ggplot after applying tt=sapply(t,card_2), R gives me an error saying that dim(X) must have a positive length. Can anyone help me with this? Thank you.
This is failing because of the following lines:
sum_a=apply(a,2,sum)
min_a=apply(a,2,min)
sum_b=apply(b,2,sum)
min_b=apply(b,2,min)
The sum and min functions are aggregating functions: they return a single value for an entire vector (or matrix). The error arises because a and b are plain vectors with no dim attribute, so apply() cannot iterate over their columns, and there is no need for it here anyway. Just do:
sum_a=sum(a)
min_a=min(a)
sum_b=sum(b)
min_b=min(b)
Also, you need to make sure a and b are numeric first.
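For instance, a small sketch using the deck from the question (the commented-out line reproduces the error):
a <- c(1, 2, 3, 3, 4, 4, 5, 6, 7, 8)   # a plain numeric vector: dim(a) is NULL
# apply(a, 2, sum)                     # errors: dim(X) must have a positive length
sum(a)                                 # 43
min(a)                                 # 1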
The following code causes a memory error:
diag(1:100000)
Is there any alternative to diag() that allows producing a huge diagonal matrix?
Longer answer: I suggest not creating a diagonal matrix at all, because in most situations you can do without one. To see why, consider the most typical matrix operations:
Multiply the diagonal matrix D by a vector v to produce Dv. Instead of maintaining a matrix, keep your "matrix" as a vector d of the diagonal elements, and then multiply d elementwise by v. Same result.
Invert the matrix. Again, easy: take the reciprocal of each diagonal element (of course, only for diagonal matrices is this the correct inverse).
Various decompositions/eigenvalues/determinants/trace. Again, these can all be computed from the vector d: the eigenvalues are the entries of d, the determinant is prod(d), and the trace is sum(d).
In short, though it requires a bit of attention in your code, you can always represent a diagonal matrix as a vector, and that should solve your memory issues.
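A minimal sketch of the vector representation (d and v are made-up names):
d <- 1:100000                # the diagonal entries; never build diag(d) itself
v <- runif(100000)

Dv   <- d * v                # same result as diag(d) %*% v, without the ~80 GB matrix
Dinv <- 1/d                  # diagonal of the inverse
trD  <- sum(d)               # trace
detD <- prod(d)              # determinant (overflows to Inf for this particular d)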
Shorter answer: Now, having said all that, people have of course already implemented these ideas using sparse matrices, which do the above steps under the hood. In R, the Matrix package is nice for sparse matrices: https://cran.r-project.org/web/packages/Matrix/Matrix.pdf
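For instance, a sketch using the package's Diagonal() constructor (not mentioned in the answer above, but it addresses the question directly):
library(Matrix)

D <- Diagonal(x = as.numeric(1:100000))  # sparse diagonal matrix: only the 100000 diagonal entries are stored
object.size(D)                           # well under 1 MB, versus ~80 GB for diag(1:100000)
y <- D %*% runif(100000)                 # matrix products work as usual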
I'm looking to preallocate a sparse matrix in R (using simple_triplet_matrix) by providing the dimensions of the matrix, m x n, and also the number of non-zero elements I expect to have. Matlab has the function "spalloc" (see below), but I have not been able to find an equivalent in R. Any suggestions?
S = spalloc(m,n,nzmax) creates an all zero sparse matrix S of size m-by-n with room to hold nzmax nonzeros.
Whereas it may make sense to preallocate a traditional dense matrix in R (in the same way it is much more efficient to preallocate a regular (atomic) vector rather than growing it one element at a time), I'm pretty sure it will not pay to preallocate sparse matrices in R in most situations.
Why?
For dense matrices, you allocate and then assign "piece by piece", e.g.,
m[i,j] <- value
For sparse matrices, however, it is very different: if you do something like
S[i,j] <- value
the internal code has to check whether [i,j] is an existing (typically non-zero) entry. If it is, it can simply change the value; otherwise, one way or another, the triplet (i, j, value) needs to be stored, which means extending the current structure, and so on. Doing this piece by piece is inefficient, largely irrespective of whether you preallocated or not.
If, on the other hand, you already know in advance all the [i,j] combinations that will contain non-zeros, you could "preallocate", but in that case just store the vectors i and j of length nnzero, say, and then use your underlying "algorithm" to also construct a vector x of the same length containing all the corresponding values, i.e., the entries.
Now, indeed, as @Pafnucy suggested, use spMatrix() or sparseMatrix(), two slightly different versions of the same functionality: constructing a sparse matrix, given its contents.
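For instance, a minimal sketch with made-up indices and values:
library(Matrix)

i <- c(1, 3, 5)        # row indices of the non-zero entries
j <- c(2, 3, 1)        # column indices
x <- c(10, 20, 30)     # the corresponding values

S <- sparseMatrix(i = i, j = j, x = x, dims = c(5, 5))
S                      # a 5 x 5 dgCMatrix holding exactly 3 stored entries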
I am happy to help further, as I am the maintainer of the Matrix package.
I have an assignment using R and have a little problem. In the assignment, several matrices have to be generated with a random number of rows and later used for various calculations. Everything works perfectly unless the number of rows is 1.
In the calculations I use nrow(matrix) in different ways, for example if (i <= nrow(matrix)) {action}, and also statements like matrix[,4] and so on.
So when the number of rows is 1 (I know it is actually a vector), R gives errors, evidently because nrow() of a 1-dimensional object is NULL. Is there a simple way to deal with this? Otherwise the whole code will probably have to be rewritten, but I'm very short on time :(
It is not that single-row/single-column matrices in R have ncol/nrow set to NULL -- in R everything is a 1D vector, which can behave like a matrix (i.e. print as a matrix, accept matrix indexing, etc.) when it has a dim attribute set. It only seems otherwise because simply indexing a matrix down to a single row or column drops dim and leaves the data in its default (1D vector) state.
Thus you can accomplish your goal either by directly recreating the dim attribute of a vector (say it is called x):
dim(x)<-c(length(x),1)
x #Now a single column matrix
dim(x)<-c(1,length(x))
x #Now a single row matrix
OR by preventing the [ ] operator from dropping dim, by adding the drop=FALSE argument:
x<-matrix(1:12,3,4)
x #OK, matrix
x[,3] #Boo, vector
x[,3,drop=FALSE] #Matrixicity saved!
Let's call your vector x. Try using matrix(x) or t(matrix(x)) to convert it into a proper (2D) matrix.
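A small sketch of both fixes (m and x are made-up names; x plays the role of a row that lost its dimensions):
m <- matrix(1:12, 3, 4)
x <- m[2, ]               # indexing drops to a plain vector; nrow(x) is NULL
nrow(matrix(x))           # 4 -- a 4 x 1 column matrix
nrow(t(matrix(x)))        # 1 -- a 1 x 4 row matrix, matching the original row's shape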
New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the data consist of variables taking values in the natural numbers plus zero, and have missing values. After trying to use the np (nonparametric) package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, when I have the same number of elements in each vector?
Second, I would like to declare the data as ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic? It is just a vector of natural numbers and NAs.
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You did of course try bw=npregbw(ydat=y, xdat=x)? ordered() turns your vector into an ordered factor (see ?ordered), which is not an atomic vector (see the section 2.1.1 link and ?factor).
EDIT2: So the problem was the way the data were being subset. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data); in that case they work with column names.
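A small sketch of the difference (the data frame and its columns are made up):
d <- data.frame(x = c(3, 4, NA, 2), y = 1:4)

class(d$x)       # "numeric"    -- a plain vector, fine to pass as xdat
class(d[, 1])    # "numeric"    -- also a vector
class(d["x"])    # "data.frame" -- still a data frame
class(d[1])      # "data.frame" -- still a data frame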