I have a sparse matrix (dgCMatrix) as the result of fitting a glmnet model. I want to write this result to a .csv file, but I can't call write.table() on the matrix because it can't be coerced into a data.frame.
Is there a way to coerce the sparse matrix to either a data.frame or a regular matrix? Or is there a way to write it to a file while keeping the coefficient names which are probably row names?
Converting the sparse matrix to a dense one is dangerous if the matrix is large. In my case (a text classification task), the matrix was 22,490 by 120,000. Stored densely as doubles, that would need more than 20 GB, which will most likely exhaust R's memory.
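As a rough check of that figure, the dense size can be estimated from the dimensions alone. This is only a sketch, assuming 8 bytes per double-precision entry and ignoring R's small per-object overhead; the variable names are made up for illustration.

```r
# Estimated size of a dense 22490 x 120000 numeric matrix
n_rows <- 22490
n_cols <- 120000
bytes_per_double <- 8          # assumption: entries are stored as doubles

dense_bytes <- n_rows * n_cols * bytes_per_double
dense_gb <- dense_bytes / 1e9  # decimal gigabytes
print(round(dense_gb, 1))
```

The estimate comes out at roughly 21.6 GB, which is consistent with the "more than 20 GB" figure.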
So my suggestion is to store the sparse matrix in an efficient, memory-friendly format such as the Matrix Market format, which keeps only the non-zero values and their coordinates (row and column numbers). In R you can use the writeMM() function from the Matrix package.
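A minimal sketch of that round trip, assuming the Matrix package is installed. The small matrix and the temporary file name are made up for illustration. As far as I know the Matrix Market format has no place for row or column names, so those would have to be saved separately.

```r
library(Matrix)

# Small illustrative sparse matrix (3 x 4, three non-zero entries)
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 2, 4),
                  x = c(1.5, -2, 3), dims = c(3, 4))

# Write in Matrix Market format and read it back
f <- tempfile(fileext = ".mtx")
writeMM(m, f)
m2 <- readMM(f)

# Compare the values; the classes differ (readMM returns a triplet-form matrix)
same_values <- isTRUE(all.equal(as.matrix(m), as.matrix(m2),
                                check.attributes = FALSE))
print(same_values)
```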
as.matrix() will convert to the full dense representation:
> as.matrix(Matrix(0, 3, 2))
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 0 0
You can write the resulting object out using write.csv or write.table.
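For a small coefficient matrix this could look like the sketch below. The object beta is a hypothetical stand-in for the output of coef() on a glmnet fit (a one-column sparse matrix with named rows); it is built by hand here so the example does not depend on glmnet.

```r
library(Matrix)

# Hypothetical stand-in for coef(fit): one sparse column with row names
beta <- Matrix(c(0.5, 0, -1.2, 0), ncol = 1, sparse = TRUE,
               dimnames = list(c("(Intercept)", "x1", "x2", "x3"), "s0"))

# Dense conversion keeps the dimnames, so write.csv() stores the names too
dense <- as.matrix(beta)
out <- tempfile(fileext = ".csv")
write.csv(dense, out)

# Read it back, using the first column as row names
back <- read.csv(out, row.names = 1)
print(back)
```

This is only reasonable when the dense matrix fits comfortably in memory, as is usually the case for a single coefficient vector.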
Converting directly to a dense matrix is likely to waste a lot of memory. The R package Matrix allows converting the sparse matrix into a memory-efficient coordinate triplet format data frame using the summary() function, which could then be written easily to csv. This is probably simpler and easier than the matrix market approach. See the answer to this related question: Sparse matrix to a data frame in R
Also, here is an illustration from the Matrix package documentation:
## very simple export - in triplet format - to text file:
data(CAex)
s.CA <- summary(CAex)
s.CA # shows (i, j, x) [columns of a data frame]
message("writing to ", outf <- tempfile())
write.table(s.CA, file = outf, row.names=FALSE)
## and read it back -- showing off sparseMatrix():
str(dd <- read.table(outf, header=TRUE))
## has columns (i, j, x) -> we can use via do.call() as arguments to sparseMatrix():
mm <- do.call(sparseMatrix, dd)
stopifnot(all.equal(mm, CAex, tolerance=1e-15))
# input: a sparse matrix with named rows and columns (dimnames)
# returns: a data frame representing triplets (r, c, x) suitable for writing to a CSV file
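sparse2triples <- function(m) {
  SM <- summary(m)                  # triplets (i, j, x) of the non-zero entries
  D1 <- m@Dimnames[[1]][SM[, 1]]    # row names looked up by row index
  D2 <- m@Dimnames[[2]][SM[, 2]]    # column names looked up by column index
  data.frame(row = D1, col = D2, x = m@x)
}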
Example
> library(Matrix)
> dn <- list(LETTERS[1:3], letters[1:5])
> m <- sparseMatrix(i = c(3,1,3,2,2,1), p= c(0:2, 4,4,6), x = 1:6, dimnames = dn)
> m
3 x 5 sparse Matrix of class "dgCMatrix"
a b c d e
A . 2 . . 6
B . . 4 . 5
C 1 . 3 . .
> sparse2triples(m)
row col x
1 C a 1
2 A b 2
3 B c 4
4 C c 3
5 A e 6
6 B e 5
[EDIT: use data.frame]
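To connect this back to the original CSV question, here is a sketch of writing named triplets to a file and rebuilding the sparse matrix from them. It builds the triplet data frame inline from summary() rather than calling the function above, and it assumes the row and column names are known when reading back (they are passed as dn). The file name is a temporary one.

```r
library(Matrix)

dn <- list(LETTERS[1:3], letters[1:5])
m <- sparseMatrix(i = c(3, 1, 3, 2, 2, 1), p = c(0:2, 4, 4, 6),
                  x = 1:6, dimnames = dn)

# Named triplets: one row per non-zero entry
s <- summary(m)
trip <- data.frame(row = rownames(m)[s$i], col = colnames(m)[s$j], x = s$x)

f <- tempfile(fileext = ".csv")
write.csv(trip, f, row.names = FALSE)

# Rebuild: map names back to indices, then call sparseMatrix()
back <- read.csv(f, stringsAsFactors = FALSE)
m2 <- sparseMatrix(i = match(back$row, dn[[1]]),
                   j = match(back$col, dn[[2]]),
                   x = back$x, dims = dim(m), dimnames = dn)
round_trip_ok <- isTRUE(all.equal(as.matrix(m), as.matrix(m2)))
print(round_trip_ok)
```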
I'm trying to figure out how to iteratively load a matrix (this forms part of a bigger function I can't reproduce here).
Let's suppose that I create a matrix:
m <- matrix(c(1:9), nrow = 3, ncol = 3)
m
This matrix can be named "m", "x" or anything else. Then I need to load the matrix iteratively inside the function:
if (interactive()) {
  mat <- readline("Your matrix, please: ")
}
So far, the function "knows" the name of the matrix, since mat returns [1] "m", and m is an object listed in ls(). But when I try to get the matrix values, for example through x <- get(mat), I keep getting an error:
Error in get(mat) : unused argument (mat)
Can anybody be so kind as to tell me what I'm doing wrong here?
1) Assuming you mean interactive, not iterative,
get_matrix <- function() {
nr <- as.numeric(readline("how many rows? "))
cat("Enter space separated data row by row. Enter empty row when finished.\n")
nums <- scan(stdin())
matrix(nums, nr, byrow = TRUE)
}
m <- get_matrix()
Here is a test:
> m <- get_matrix()
how many rows? 3
Enter space separated data row by row. Enter empty row when finished.
1: 1 2
3: 3 4
5: 5 6
7:
Read 6 items
> m
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
>
2) Another possibility is to require that the user create a matrix using R and then just give the name of the matrix:
get_matrix2 <- function(envir = parent.frame()) {
m <- readline("Enter name of matrix: ")
get(m, envir)
}
Test it:
> m <- matrix(1:6, 3)
> mat <- get_matrix2()
Enter name of matrix: m
> mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
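The lookup step in approach 2 can also be tried without readline(), which makes it easier to test. The sketch below uses a hypothetical helper, fetch_matrix, that takes the name as a string and adds two checks that are not in the answer above: that the object exists and that it is a matrix.

```r
# Hypothetical helper: look up a matrix by name in the caller's environment
fetch_matrix <- function(name, envir = parent.frame()) {
  if (!is.character(name) || length(name) != 1L) {
    stop("name must be a single character string")
  }
  if (!exists(name, envir = envir)) {
    stop("no object called '", name, "' was found")
  }
  obj <- get(name, envir = envir)
  if (!is.matrix(obj)) {
    stop("'", name, "' is not a matrix")
  }
  obj
}

m <- matrix(1:9, nrow = 3, ncol = 3)
mat <- fetch_matrix("m")
print(mat)
```

In an interactive session the string would come from readline(), as in get_matrix2.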
Good afternoon,
I have developed this R function that hashes data into buckets:
# The used packages
library("pacman")
pacman::p_load(dplyr, tidyr, devtools, MASS, pracma, mvtnorm, interval, intervals)
pacman::p_load(sprof, RDocumentation, helpRFunctions, foreach , philentropy , Rcpp , RcppAlgos)
hash <- function(v, p) {
  if (dot(v, p) > 0) 1 else 0
}
LSH_Band <- function(data, K) {
  # Retrieve the numeric columns of data
  t <- list.df.var.types(data)
  df.r <- as.matrix(data[c(t$numeric, t$Intervals)])
  n <- nrow(df.r)
  # Create a K*K matrix drawn from the standard normal distribution
  rn <- array(rnorm(K * K, 0, 1), c(K, K))
  # Create a K*K matrix of integers drawn from a uniform distribution; integers are unique within each column
  rd <- unique.array(array(unique(ceiling(runif(K * K, 0, ncol(df.r)))), c(K, K)))
  buckets <- array(NA, c(K, n))
  for (i in 1:K) {
    for (j in 1:n) {
      buckets[i, j] <- hash(df.r[j, ][rd[, i]], rn[, i])
    }
  }
  return(buckets)
}
> df.r
age height salaire.1 salaire.2
1 27 180 0 5000
2 26 178 0 5000
3 30 190 7000 10000
4 31 185 7000 10000
5 31 187 7000 10000
6 38 160 10000 15000
7 39 158 10000 15000
> LSH_Band(df.r, 3 )
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 1 1 1 1 1 1
[2,] 1 1 0 0 0 0 0
[3,] 0 0 0 0 0 0 0
The dot function is the scalar product of two vectors.
My LSH function takes a row of my data, then takes part of that row using df.r[j,][rd[,i]]. df.r[j,] is the j-th row of the data.
rd[,i]: rd is a K*K matrix of integers between 1 and ncol(df.r); each column of the matrix contains only unique integers.
rn[,i]: rn is a K*K matrix that contains values drawn from the N(0,1) distribution.
In the resulting table, observations are represented in columns, and there are K rows. For the last row, I compute the scalar product between df.r[j,][rd[,K]] and rn[,K], and I obtain 1 if the scalar product is positive. rd[,K] and rn[,K] are used only for the last row of the resulting table, for all observations in that row.
My question:
Is it possible to replace the loops over i and j with an lapply call?
My real data will be large, which is why I'm asking this question.
Thank you!
The following is a bit too long as a comment, so here are some pointers/issues/remarks:
First off, I have to say I struggle to understand what LSH_Band does. Perhaps some context would help here.
I don't understand the purpose of certain functions like helpRFunctions::list.df.var.types, which simply seems to return the column names of data in a list. Note also that t$Intervals returns NULL based on the sample data you give. So I'm not sure what's going on there.
I don't see the point of function pracma::dot either. The dot product between two vectors can be calculated in base R using %*%. There's really no need for an additional package.
Function hash can be written more compactly as
hash <- function(v, p) +(as.numeric(v %*% p) > 0)
This avoids the explicit if conditional and returns the result of the comparison directly as 0 or 1.
Notwithstanding my lack of understanding what it is you're trying to do, here are some tweaks to your code
hash <- function(v, p) +(as.numeric(v %*% p) > 0)
LSH_Band <- function(data, K, seed = NULL) {
  # Retrieve the numeric columns of data
  data <- as.matrix(data[sapply(data, is.numeric)])
  # Create a K*K matrix drawn from the standard normal distribution
  if (!is.null(seed)) set.seed(seed)
  rn <- matrix(rnorm(K * K, 0, 1), nrow = K, ncol = K)
  # Create a K*K matrix of uniformly sampled integers; integers are unique within each column
  rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))
  buckets <- matrix(NA, nrow = K, ncol = nrow(data))
  for (i in 1:K) {
    buckets[i, ] <- apply(data, 1, function(row) hash(row[rd[, i]], rn[, i]))
  }
  buckets
}
Always add an option to use a reproducible seed when working with random numbers. That will make debugging a lot easier.
You can replace at least one for loop with apply (which when using MARGIN = 1 iterates through the rows of a matrix (or array)).
I've removed all the unnecessary package dependencies, and replaced the functionality with base R functions.
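Going one step further, the remaining apply() over rows could be replaced by a single matrix-vector product per hash function, since every observation uses the same column subset rd[, i] and the same weights rn[, i]. The sketch below is a variant under that assumption, not the original poster's code. The name lsh_band_vec is made up, the data frame df reproduces the sample data from the question, and like the version above it assumes 2 <= K <= ncol(data).

```r
# Sketch: same random draws as the loop version above, but each row of
# buckets comes from one matrix-vector product instead of a per-row apply().
lsh_band_vec <- function(data, K, seed = NULL) {
  data <- as.matrix(data[sapply(data, is.numeric)])
  if (!is.null(seed)) set.seed(seed)
  rn <- matrix(rnorm(K * K, 0, 1), nrow = K, ncol = K)
  rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))
  buckets <- matrix(NA_integer_, nrow = K, ncol = nrow(data))
  for (i in seq_len(K)) {
    # data[, rd[, i]] selects the K columns for all observations at once
    proj <- data[, rd[, i], drop = FALSE] %*% rn[, i]
    buckets[i, ] <- as.integer(proj > 0)
  }
  buckets
}

# Sample data from the question
df <- data.frame(
  age       = c(27, 26, 30, 31, 31, 38, 39),
  height    = c(180, 178, 190, 185, 187, 160, 158),
  salaire.1 = c(0, 0, 7000, 7000, 7000, 10000, 10000),
  salaire.2 = c(5000, 5000, 10000, 10000, 10000, 15000, 15000)
)
b <- lsh_band_vec(df, K = 3, seed = 1)
print(dim(b))  # K rows (hash functions) by nrow(df) columns (observations)
```

The loop over i is kept because K is typically small; the expensive part, the pass over all observations, is handled by %*%.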
I am learning R and reading the book Guide to programming algorithms in r.
The book gives an example function:
# MATRIX-VECTOR MULTIPLICATION
matvecmult = function(A, x) {
  m = nrow(A)
  n = ncol(A)
  y = matrix(0, nrow = m)
  for (i in 1:m) {
    sumvalue = 0
    for (j in 1:n) {
      sumvalue = sumvalue + A[i, j] * x[j]
    }
    y[i] = sumvalue
  }
  return(y)
}
How do I call this function in the R console? And what exactly is passed into this function as A and x?
The function takes an argument A, which should be a matrix, and an argument x, which should be a numeric vector whose length equals the number of columns of A.
If
A <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
then you have 3 values (number of columns, ncol) per row, thus x needs to be something like
x <- c(4,5,6)
The function itself iterates over all rows, and in each row, each value is multiplied by a value from x: the value in the first column is multiplied by the first value in x, the value in A's second column is multiplied by the second value in x, and so on. This is repeated for each row, and the sum for each row is returned by the function.
matvecmult(A, x)
[,1]
[1,] 49 # 1*4 + 3*5 + 5*6
[2,] 64 # 2*4 + 4*5 + 6*6
To run this function, you first have to define it (for example by sourcing the file that contains it) and then run these three lines in order:
A <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
x <- c(4,5,6)
matvecmult(A, x)
This function is designed to return the product of a matrix A with a vector x; i.e. the result will be the matrix product A x (where - as is usual in R, the vector is a column vector). An example should make things clear.
# define a matrix
mymatrix <- matrix(sample(12), nrow = 4)
# see what the matrix looks like
mymatrix
# [,1] [,2] [,3]
# [1,] 2 10 9
# [2,] 3 1 12
# [3,] 11 7 5
# [4,] 8 4 6
# define a vector where multiplication of our matrix times the vector will be defined
vec3 <- c(-1,0,1)
# apply the function to our matrix and vector
result <- matvecmult(mymatrix, vec3)
result
# [,1]
# [1,] 7
# [2,] 9
# [3,] -6
# [4,] -2
class(result)
# [1] "matrix" "array"   (R >= 4.0.0; earlier versions print just "matrix")
So matvecmult(mymatrix, vec3) is how you would call this function, and the result is an n by 1 matrix, where n is the number of rows in the matrix argument.
You can also get some insight by playing around and seeing what happens when you pass something other than a matrix-vector pair where the product is defined. In some cases, you will get an error; sometimes you get nonsense; and sometimes you get something you might not expect just from the function name. See what happens when you call matvecmult(mymatrix, mymatrix).
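For that last experiment, here is one way to see what happens. R stores a matrix column by column, so x[j] with a single index on a matrix picks the j-th element in that order. The sketch below re-declares matvecmult from the question and, as an assumption for reproducibility, uses the matrix printed above as fixed values instead of sample(12).

```r
# matvecmult as given in the question
matvecmult <- function(A, x) {
  m <- nrow(A)
  n <- ncol(A)
  y <- matrix(0, nrow = m)
  for (i in 1:m) {
    sumvalue <- 0
    for (j in 1:n) {
      sumvalue <- sumvalue + A[i, j] * x[j]
    }
    y[i] <- sumvalue
  }
  return(y)
}

# Fixed copy of the 4 x 3 matrix printed above
A <- matrix(c(2, 3, 11, 8, 10, 1, 7, 4, 9, 12, 5, 6), nrow = 4)

# No error: only x[1], x[2], x[3] are used, i.e. the first three
# elements of the first column of A
res <- matvecmult(A, A)
expected <- A %*% A[1:3]
print(isTRUE(all.equal(res, expected)))
```

So the call silently treats the matrix as a vector and ignores most of it, which is the "something you might not expect" case.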
The function calculates the product of a matrix and a column vector. It assumes that the number of columns of the matrix is equal to the number of elements in the vector.
It stores the number of columns of A in n and the number of rows in m.
It then initializes a matrix of m rows with all values set to 0.
It iterates along the rows of A and multiplies each value in each row by the corresponding value in x.
The sum for each row is then stored in y, and finally the function returns the single-column matrix y.
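The same product is available in base R through the %*% operator, which can be used to check the function's output. A small sketch, reusing the example values A and x from the earlier answer and adding an explicit dimension check that the book's function does not perform:

```r
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
x <- c(4, 5, 6)

# The product is only defined when ncol(A) matches length(x)
stopifnot(ncol(A) == length(x))

# Built-in matrix-vector product; the result is a 2 x 1 matrix
y <- A %*% x
print(y)

# drop() turns the single-column matrix into a plain vector
y_vec <- drop(y)
print(y_vec)
```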
I have a vector of row indices and a vector of column indices, say c(1,2) and c(7,100). I want to extract the elements at positions (1,7) and (2,100).
However, I find that Matrix[row, column] returns the whole cross-product of rows and columns (a submatrix), not just a vector of two numbers.
What should I do?
You want to exploit the feature that if m is a matrix containing the row/col indices required, then subsetting by passing m as argument i of [ gives the desired behaviour. From ?'['
i, j, ...: indices specifying elements to extract or replace.
.... snipped ....
When indexing arrays by ‘[’ a single argument ‘i’ can be a
matrix with as many columns as there are dimensions of ‘x’;
the result is then a vector with elements corresponding to
the sets of indices in each row of ‘i’.
Here is an example
rv <- 1:2
cv <- 3:4
mat <- matrix(1:25, ncol = 5)
mat[cbind(rv, cv)]
R> cbind(rv, cv)
rv cv
[1,] 1 3
[2,] 2 4
R> mat[cbind(rv, cv)]
[1] 11 17
You can use 2 column subsetting matrices inside [:
mx <- matrix(1:200, nrow=2)
mx[cbind(c(1, 2), c(7, 100))]
produces:
[1] 13 200
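To make the difference between the two kinds of indexing explicit, here is a short sketch on a 5 x 5 matrix. The names rows and cols are made up for illustration.

```r
mat <- matrix(1:25, ncol = 5)
rows <- c(1, 2)
cols <- c(3, 4)

# Two separate index vectors: every row combined with every column (2 x 2 block)
block <- mat[rows, cols]
print(block)

# One two-column index matrix: one element per (row, column) pair
pairs <- mat[cbind(rows, cols)]
print(pairs)

# The paired elements are the diagonal of the block
print(identical(diag(block), pairs))
```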
Matrix computations such as A %*% B require a data.frame to be converted to a matrix first with as.matrix(), which is cumbersome. Is there a more convenient way to do this?
If your objection is just that you have to wrap your data frame in as.matrix() before using %*%, then you could write your own binary operator that does the wrapping for you:
`%*df%` <- function(x, y){as.matrix(x) %*% as.matrix(y)}
x <- data.frame(a = 1:2, b = 3:4)
x %*df% x
# a b
#[1,] 7 15
#[2,] 10 22
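One caveat with this wrapper: if the data frame has a non-numeric column, as.matrix() produces a character matrix and %*% fails. A possible variant, sketched below under the assumption that silently dropping non-numeric columns is acceptable, keeps only the numeric columns first. The operator name %*num% and the helper to_num are hypothetical.

```r
# Hypothetical variant: keep only numeric columns of data frames before multiplying
`%*num%` <- function(x, y) {
  to_num <- function(z) {
    if (is.data.frame(z)) {
      as.matrix(z[vapply(z, is.numeric, logical(1))])
    } else {
      as.matrix(z)
    }
  }
  to_num(x) %*% to_num(y)
}

df <- data.frame(id = c("u", "v"), a = 1:2, b = 3:4)
res <- df %*num% matrix(1:4, nrow = 2)
print(res)
```

Whether dropping columns silently is a good idea depends on the use case; raising an error on non-numeric columns would be the stricter choice.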