I'm trying to implement feature hashing in R to help me with a text classification problem, but I'm not sure I'm doing it the way it should be done. Part of my code is based on this post: Hashing function for mapping integers to a given range?.
My code:
random.data = function(n = 200, wlen = 40, ncol = 10){
  # generate one random alphanumeric word of length n
  random.word = function(n){
    paste0(sample(c(letters, 0:9), n, TRUE), collapse = '')
  }
  matrix(replicate(n, random.word(wlen)), ncol = ncol)
}
feature_hash = function(doc, N){
  library(digest)
  doc = as.matrix(doc)
  # bucket index per word: take hex digits 28-32 of the md5 hash, convert to
  # an integer, and reduce modulo N + 1 (note: this yields values in 0..N,
  # so bucket 0 is never counted by the loop below)
  idx = matrix(strtoi(substr(sapply(doc, digest), 28, 32), 16L) %% (N + 1), ncol = ncol(doc))
  # count, for each row, how many words fell into each bucket 1..N
  sapply(1:N, function(r) apply(idx, 1, function(v) sum(v == r)))
}
set.seed(1)
doc = random.data(50, 16, 5)
feature_hash(doc, 3)
[,1] [,2] [,3]
[1,] 2 0 1
[2,] 2 1 1
[3,] 2 0 1
[4,] 0 2 1
[5,] 1 1 1
[6,] 1 0 1
[7,] 1 2 0
[8,] 2 0 0
[9,] 3 1 0
[10,] 2 1 0
So, I'm basically converting the strings to integers using the last 5 hex digits of the md5 hash returned by digest. Questions:
1 - Is there any package that can do this for me? I haven't found any.
2 - Is it a good idea to use digest as the hash function? If not, what can I do?
PS: I should have tested whether it works before posting, but my files are quite big and take a lot of processing time, so I think it's smarter to have someone point me in the right direction, because I'm sure I'm doing something wrong!
Thanks for any help on this!
I don't know of any existing CRAN package for this.
However, I wrote a package for myself to do feature hashing. The source code is here: https://github.com/wush978/FeatureHashing, but the API is different.
In my case, I use it to convert a data.frame to a CSRMatrix, a customized sparse matrix in the package. I also implemented a helper function to convert the CSRMatrix to a Matrix::dgCMatrix. For text classification, I guess the sparse matrix will be more suitable.
If you want to try it, please check the test script here: https://github.com/wush978/FeatureHashing/blob/master/tests/test-conver-to-dgCMatrix.R
Note that I have only used it on Ubuntu, so I don't know whether it works on Windows or macOS. Please feel free to ask me any question about the package on https://github.com/wush978/FeatureHashing/issues.
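If you want a rough idea of how the current CRAN release is used, something like the following should work (a sketch, assuming the hashed.model.matrix() interface and its split() formula helper; the toy data frame is illustrative and not from the thread):
library(FeatureHashing)
# toy corpus: two "documents" of space-delimited words
df <- data.frame(text = c("apple banana", "banana cherry"),
                 stringsAsFactors = FALSE)
# hash each word of 'text' into 2^10 buckets; the result is a sparse
# Matrix::dgCMatrix that sparse-aware classifiers can consume directly
m <- hashed.model.matrix(~ split(text, delim = " ", type = "existence"),
                         data = df, hash.size = 2^10)
class(m)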
When applying Matrix::qr() to a sparse matrix in R, the output is quite different from that of base::qr. There are V, beta, p, R, q, but no rank and no pivot. Below is a small example. I want to detect the linearly dependent columns of the sparse matrix A, which requires the pivot and rank. How can I get this information?
library(Matrix)
A <- matrix(c(0, -2, 1, 0,
              0, -4, 2, 0,
              1, -2, 1, 2,
              1, -2, 1, 2,
              1, -2, 1, 2), nrow = 5, byrow = TRUE)
A.s <- as(A, "dgCMatrix")
qrA.M <- Matrix::qr(A.s)
qrA.R <- base::qr(A)
There is another related but unanswered question: Get base::qr pivoting in Matrix::qr method.
I would first permute the columns of your example matrix A a little bit:
A <- A[, c(1,4,3,2)]
# [,1] [,2] [,3] [,4]
#[1,] 0 0 1 -2
#[2,] 0 0 2 -4
#[3,] 1 2 1 -2
#[4,] 1 2 1 -2
#[5,] 1 2 1 -2
You did not mention in your question why rank and pivot returned by a dense QR factorization are useful. But I think this is what you are looking for:
dQR <- base::qr(A)
with(dQR, pivot[1:rank])
#[1] 1 3
So columns 1 and 3 of A give a basis for A's column space.
I don't really understand the logic of a sparse QR factorization. The 2nd column of A is perfectly linearly dependent on the 1st column, so I expect column pivoting to take place during the factorization. But very much to my surprise, it doesn't!
library(Matrix)
sA <- Matrix(A, sparse = TRUE)
sQR <- Matrix::qr(sA)
sQR@q + 1L
#[1] 1 2 3 4
No column pivoting is done! As a result, there isn't an obvious way to determine the rank of A.
At this moment, I could only think of performing a dense QR factorization on the R factor to get what you are looking for.
R <- as.matrix(Matrix::qrR(sQR))
QRR <- base::qr(R)
with(QRR, pivot[1:rank])
#[1] 1 3
Why does this work? Well, the Q factor has orthogonal, hence linearly independent, columns, so the columns of R inherit the linear dependence or independence of the columns of A. For a matrix with many more rows than columns, the computational cost of this 2nd QR factorization is negligible.
I need to figure out the algorithm behind a sparse QR factorization before coming up with a better idea.
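Until then, the two-step trick can be wrapped into a small helper (a sketch; the function name sparse_rank_pivot is mine, and it simply chains the steps shown above):
library(Matrix)
sparse_rank_pivot <- function(M) {
  # sparse QR (no column pivoting), then a dense, pivoted QR of the R factor
  R <- as.matrix(Matrix::qrR(Matrix::qr(M)))
  dQR <- base::qr(R)
  list(rank = dQR$rank,
       basis_cols = sort(dQR$pivot[seq_len(dQR$rank)]))
}
sparse_rank_pivot(sA)
# $rank
# [1] 2
# $basis_cols
# [1] 1 3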
I've been looking at a similar problem, and I ended up not relying on Matrix::qr() to calculate the rank and to detect linear dependence. Instead I programmed the function GaussIndependent in the package SSBtools.
In the package examples I included an example that demonstrates a wrong conclusion from rankMatrix(x, method = "qr"). The input x is a 44*20 dummy matrix.
Starting with your example matrix, A.s:
library(SSBtools)
GaussIndependent(A.s) # List of logical vectors specifying independent rows and columns
# $rows
# [1] TRUE FALSE TRUE FALSE FALSE
#
# $columns
# [1] TRUE TRUE FALSE FALSE
GaussRank(A.s) # the rank
# [1] 2
Suppose the following system of equations Ax = b with:
> A <- matrix(c(2,0,-1,0,0,2,2,1,-1,2,0,0,0,1,0,0), ncol = 4)
> A
[,1] [,2] [,3] [,4]
[1,] 2 0 -1 0
[2,] 0 2 2 1
[3,] -1 2 0 0
[4,] 0 1 0 0
> b <- c(-2,5,0,0)
Solving these equations with solve() yields:
> x <- solve(A,b)
> x
[1] 6.66e-16 4.44e-16 2.00e+00 1.00e+00
This is just an example, but A and b can be of any form.
I need to detect whether any component of x is 0. Now, the first two components should actually be 0, but they are both larger than the machine epsilon .Machine$double.eps = 2.22e-16, which makes them very small, but not equal to zero.
I think I understand that this is caused by rounding errors in floating-point arithmetic inside solve(). What I need to know is whether it is possible (from a practical point of view) to determine an upper bound on these errors, so that 0s can be detected. For example, instead of
> x == 0
[1] FALSE FALSE FALSE FALSE
one would use something like this:
> x > -1e-15 & x < 1e-15
[1] TRUE TRUE FALSE FALSE
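For what it's worth, base R's zapsmall() already does this kind of relative thresholding; it rounds components that are negligible compared to the largest one (shown here on the x above):
> zapsmall(x)
[1] 0 0 2 1
But this still relies on a heuristic number of digits rather than a proper bound on the error.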
Giving more insight into this problem would be appreciated.
One way to approach this is to check whether we can find a better solution to the linear system if we assume those components to be zero. For that we would want to solve A[, 3:4] %*% y = b, since A %*% c(0, 0, x[3], x[4]) = A[, 3:4] %*% c(x[3], x[4]). This is an overdetermined system, so we can't use solve to find a solution. We can, however, use qr.solve:
> x.new = c(0,0,qr.solve(A[,3:4],b))
It remains to check if this solution is really better:
> norm(A%*%x.new - b) < norm(A%*%x - b)
[1] TRUE
Thus we have a good reason to suspect that x[1]==x[2]==0.
In this simple example it is obviously possible to guess the true solution by looking at the approximate solution:
> x.true = c(0,0,2,1)
> norm(A%*%x.true - b)
[1] 0
This is however not very helpful in the general case.
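That said, the residual-comparison idea does generalize. Here is a sketch (the helper name test_zeros and the cutoff are my own choices, not from the answer above):
test_zeros <- function(A, b, x, cutoff = 1e-12) {
  zero <- abs(x) < cutoff                       # candidate zero components
  if (!any(zero)) return(x)
  x.new <- numeric(length(x))
  # re-solve the overdetermined system restricted to the remaining columns
  x.new[!zero] <- qr.solve(A[, !zero, drop = FALSE], b)
  # keep the zeroed solution only if the residual did not get worse
  if (norm(A %*% x.new - b) <= norm(A %*% x - b)) x.new else x
}
test_zeros(A, b, x)
# [1] 0 0 2 1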
Presently, I am working through the rbprobitGibbs example in the bayesm help file, which contains the following sample:
##
## rbprobitGibbs example
##
library(bayesm)  # provides rbprobitGibbs

if (nchar(Sys.getenv("LONG_TEST")) != 0) {R = 2000} else {R = 10}

set.seed(66)
simbprobit = function(X, beta) {
  ## function to simulate from binary probit including x variable
  y = ifelse((X %*% beta + rnorm(nrow(X))) < 0, 0, 1)
  list(X = X, y = y, beta = beta)
}

nobs = 200
X = cbind(rep(1, nobs), runif(nobs), runif(nobs))
beta = c(0, 1, -1)
nvar = ncol(X)
simout = simbprobit(X, beta)

Data1 = list(X = simout$X, y = simout$y)
Mcmc1 = list(R = R, keep = 1)

out = rbprobitGibbs(Data = Data1, Mcmc = Mcmc1)
summary(out$betadraw, tvalues = beta)

if (0) {
  ## plotting example
  plot(out$betadraw, tvalues = beta)
}
When I step through the code, I don't see anywhere that the A matrix is set. It is only when I reach this line:
out=rbprobitGibbs(Data=Data1,Mcmc=Mcmc1)
that I see the A matrix displayed in the output. I understand it has to be a k x k matrix, where betabar is a k x 1 vector.
Prior Parms:
betabar
# [1] 0 0 0
A
# [,1] [,2] [,3]
# [1,] 0.01 0.00 0.00
# [2,] 0.00 0.01 0.00
# [3,] 0.00 0.00 0.01
So I can understand how A gets its dimensions; however, what is not clear to me is how the values in A are set to 0.01. I am trying to figure out how I can allow a user calling the rbprobitGibbs function to set the precision via A to whatever they like. I can see where A is output, but how are its values derived from the input? Does anyone have any suggestions? TIA.
UPDATE:
Here is the output produced; as far as I can determine, it is identical whether I use prior = list(rep(0,3), .2*diag(3)) or not:
> out
$betadraw
[,1] [,2] [,3]
[1,] 0.3565099 0.6369436 -0.9859025
[2,] 0.4705437 0.7211755 -1.1955608
[3,] 0.1478930 0.6538157 -0.6989660
[4,] 0.4118663 0.7910846 -1.3919411
[5,] 0.0385419 0.9421720 -0.7359932
[6,] 0.1091359 0.7991905 -0.7731041
[7,] 0.4072556 0.5183280 -0.7993501
[8,] 0.3869478 0.8116237 -1.2831395
[9,] 0.8893555 0.5448905 -1.8526630
[10,] 0.3165972 0.6484716 -0.9857531
attr(,"class")
[1] "bayesm.mat" "mcmc"
attr(,"mcpar")
[1] 1 10 1
It gets this value from a scaling constant applied to the prior precision matrix. In the source, you will note that if you do not supply a prior precision, it will generate a k x k identity matrix and multiply it by .01. Nothing fancy here. These scaling constants for all of the various functions in bayesm can be found in the ./bayesm/R/bayesmConstants.R file.
if (is.null(Prior$A)) {
  A = BayesmConstant.A * diag(nvar)
}
Should you like to, you could supply your own constant, say .2, as follows: prior = list(rep(0,k), .2*diag(k)). You could even introduce some relational information into the prior.
Very late to the party, but I ran across this same issue and just figured it out. In order to change the A matrix and the betabar vector, you have to name them as well, since all of your other input arguments are named.
For example, your code should be:
rbprobitGibbs(Data=Data1, Prior=list(betabar=betabar1, A=A1), Mcmc=Mcmc1)
If you do that, you are able to set your own values for betabar and A.
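Putting the two answers together, a call with a user-chosen precision might look like this (a sketch; betabar1 and A1 are illustrative names, and k is the number of columns of X):
library(bayesm)
k <- ncol(X)                 # number of regression coefficients (3 here)
betabar1 <- rep(0, k)        # prior mean
A1 <- 0.2 * diag(k)          # prior precision, replacing the 0.01 * I default
out <- rbprobitGibbs(Data = Data1,
                     Prior = list(betabar = betabar1, A = A1),
                     Mcmc = Mcmc1)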
I was trying to learn R using a book, and was working through an example where, for each row of a matrix, the corresponding element of a vector is set to either 1 or 0, depending on whether the majority of the first d elements in that row is 1 or 0. The code used was:
copymaj <- function(rw, d) {
  maj <- sum(rw[1:d]) / d
  return(if (maj > 0.5) 1 else 0)
}
x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[2,] 1 1 1 1 0
[3,] 1 0 0 1 1
[4,] 0 1 1 1 0
apply(x,1,copymaj,3)
It is showing the error Error in FUN(newX[, i], ...) : could not find function "sim". An error is also shown if I do apply(x,1,copymaj(3)).
R 2.13 is installed
Please help!
As @BenBarnes pointed out, you probably misspelled sum; I think you wrote sim instead of sum.
I was able to reproduce your error by doing:
copymaj0 <- function(rw, d) {
  maj <- sim(rw[1:d]) / d # here you have sim, this causes the error
  return(if (maj > 0.5) 1 else 0)
}
copymaj1 <- function(rw, d) {
  maj <- sum(rw[1:d]) / d # here you have sum, which works well for me
  return(if (maj > 0.5) 1 else 0)
}
x <- matrix(c(1,0,1,1,0,
1,1,1,1,0,
1,0,0,1,1,
0,1,1,1,0), ncol=5, byrow=TRUE)
apply(x,1,copymaj0,3) # prints error
Error in FUN(newX[, i], ...) : could not find function "sim"
apply(x,1,copymaj1,3) # works well
[1] 1 1 0 1
I really think you misspelled sum.
apply(x,1,copymaj1(3)) won't work because, if you read ?apply, you'll see the signature is apply(X, MARGIN, FUN, ...). Writing copymaj1(3) calls the function immediately (with rw = 3 and d missing) instead of passing it as FUN. Extra arguments to FUN have to be handed to apply's ..., as in apply(x, 1, copymaj1, 3), not apply(x, 1, copymaj1(3)).
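For what it's worth, an anonymous function achieves the same thing and makes the argument passing explicit:
apply(x, 1, copymaj1, 3)                    # d = 3 passed via apply's ...
apply(x, 1, function(rw) copymaj1(rw, 3))   # same result with a wrapper
# [1] 1 1 0 1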
R has a qr() function, which performs QR decomposition using either LINPACK or LAPACK (in my experience, the latter is 5% faster). The main object returned is a matrix "qr" whose upper triangle contains R (i.e. R = qr[upper.tri(qr)]). So far so good. The lower triangular part of qr contains Q "in compact form". One can extract Q from the qr decomposition by using qr.Q(). I would like to invert qr.Q(): in other words, I have Q and R, and would like to put them back into a "qr" object. R is trivial, but Q is not. The goal is to apply qr.solve() to it, which is much faster than solve() on large systems.
Introduction
R uses the LINPACK dqrdc routine, by default, or the LAPACK DGEQP3 routine, when specified, for computing the QR decomposition. Both routines compute the decomposition using Householder reflections. An m x n matrix A is decomposed into an m x n economy-size orthogonal matrix (Q) and an n x n upper triangular matrix (R) as A = QR, where Q can be computed as the product of t Householder reflection matrices, with t being the lesser of m-1 and n: Q = H1 H2 ... Ht.
Each reflection matrix Hi can be represented by a length-(m-i+1) vector. For example, H1 requires a length-m vector for compact storage. All but one entry of this vector is placed in the first column of the lower triangle of the input matrix (the diagonal is used by the R factor). Therefore, each reflection needs one more scalar of storage, and this is provided by an auxiliary vector (called $qraux in the result from R's qr).
The compact representation used is different between the LINPACK and LAPACK routines.
The LINPACK Way
A Householder reflection is computed as Hi = I - vi viT / pi, where I is the identity matrix, pi is the corresponding entry in $qraux, and vi is as follows:
vi[1..i-1] = 0,
vi[i] = pi,
vi[i+1..m] = A[i+1..m, i] (i.e., a column of the lower triangle of A after calling qr)
LINPACK Example
Let's work through the example from the QR decomposition article at Wikipedia in R.
The matrix being decomposed is
> A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow=3)
> A
[,1] [,2] [,3]
[1,] 12 -51 4
[2,] 6 167 -68
[3,] -4 24 -41
We do the decomposition, and the most relevant portions of the result are shown below:
> Aqr = qr(A)
> Aqr
$qr
[,1] [,2] [,3]
[1,] -14.0000000 -21.0000000 14
[2,] 0.4285714 -175.0000000 70
[3,] -0.2857143 0.1107692 -35
[snip...]
$qraux
[1] 1.857143 1.993846 35.000000
[snip...]
This decomposition was done (under the covers) by computing two Householder reflections and multiplying them by A to get R. We will now recreate the reflections from the information in $qr and $qraux.
> p = Aqr$qraux # for convenience
> v1 <- matrix(c(p[1], Aqr$qr[2:3,1]))
> v1
[,1]
[1,] 1.8571429
[2,] 0.4285714
[3,] -0.2857143
> v2 <- matrix(c(0, p[2], Aqr$qr[3,2]))
> v2
[,1]
[1,] 0.0000000
[2,] 1.9938462
[3,] 0.1107692
> I = diag(3) # identity matrix
> H1 = I - v1 %*% t(v1)/p[1] # I - v1*v1^T/p[1]
> H2 = I - v2 %*% t(v2)/p[2] # I - v2*v2^T/p[2]
> Q = H1 %*% H2
> Q
[,1] [,2] [,3]
[1,] -0.8571429 0.3942857 0.33142857
[2,] -0.4285714 -0.9028571 -0.03428571
[3,] 0.2857143 -0.1714286 0.94285714
Now let's verify the Q computed above is correct:
> qr.Q(Aqr)
[,1] [,2] [,3]
[1,] -0.8571429 0.3942857 0.33142857
[2,] -0.4285714 -0.9028571 -0.03428571
[3,] 0.2857143 -0.1714286 0.94285714
Looks good! We can also verify QR is equal to A.
> R = qr.R(Aqr) # extract R from Aqr$qr
> Q %*% R
[,1] [,2] [,3]
[1,] 12 -51 4
[2,] 6 167 -68
[3,] -4 24 -41
The LAPACK Way
A Householder reflection is computed as Hi = I - pi vi viT, where I is the identity matrix, pi is the corresponding entry in $qraux, and vi is as follows:
vi[1..i-1] = 0,
vi[i] = 1,
vi[i+1..m] = A[i+1..m, i] (i.e., a column of the lower triangle of A after calling qr)
There is another twist when using the LAPACK routine in R: column pivoting is used, so the decomposition is solving a different, related problem: AP = QR, where P is a permutation matrix.
LAPACK Example
This section does the same example as before.
> A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow=3)
> Bqr = qr(A, LAPACK=TRUE)
> Bqr
$qr
[,1] [,2] [,3]
[1,] 176.2554964 -71.1694118 1.668033
[2,] -0.7348557 35.4388886 -2.180855
[3,] -0.1056080 0.6859203 -13.728129
[snip...]
$qraux
[1] 1.289353 1.360094 0.000000
$pivot
[1] 2 3 1
attr(,"useLAPACK")
[1] TRUE
[snip...]
Notice the $pivot field; we will come back to that. Now we generate Q from the information in Bqr.
> p = Bqr$qraux # for convenience
> v1 = matrix(c(1, Bqr$qr[2:3,1]))
> v1
[,1]
[1,] 1.0000000
[2,] -0.7348557
[3,] -0.1056080
> v2 = matrix(c(0, 1, Bqr$qr[3,2]))
> v2
[,1]
[1,] 0.0000000
[2,] 1.0000000
[3,] 0.6859203
> H1 = I - p[1]*v1 %*% t(v1) # I - p[1]*v1*v1^T
> H2 = I - p[2]*v2 %*% t(v2) # I - p[2]*v2*v2^T
> Q = H1 %*% H2
> Q
[,1] [,2] [,3]
[1,] -0.2893527 -0.46821615 -0.8348944
[2,] 0.9474882 -0.01602261 -0.3193891
[3,] 0.1361660 -0.88346868 0.4482655
Once again, the Q computed above agrees with the R-provided Q.
> qr.Q(Bqr)
[,1] [,2] [,3]
[1,] -0.2893527 -0.46821615 -0.8348944
[2,] 0.9474882 -0.01602261 -0.3193891
[3,] 0.1361660 -0.88346868 0.4482655
Finally, let's compute QR.
> R = qr.R(Bqr)
> Q %*% R
[,1] [,2] [,3]
[1,] -51 4 12
[2,] 167 -68 6
[3,] 24 -41 -4
Notice the difference? QR is A with its columns permuted given the order in Bqr$pivot above.
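We can double-check this by undoing the pivoting (a quick sketch; order() inverts the permutation):
> (Q %*% R)[, order(Bqr$pivot)]
     [,1] [,2] [,3]
[1,]   12  -51    4
[2,]    6  167  -68
[3,]   -4   24  -41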
I have researched this same problem that the OP asks about, and I don't think it is possible. Basically, the OP's question is whether, having the explicitly computed Q, one can recover H1 H2 ... Ht. I do not think this is possible without computing the QR from scratch, but I would also be very interested to know whether such a solution exists.
I have a similar issue to the OP's, but in a different context: my iterative algorithm needs to mutate the matrix A by adding columns and/or rows. The first time, the QR is computed using DGEQRF and thus stored in the compact LAPACK format. After the matrix A is mutated, e.g. with new rows, I can quickly build a new set of reflectors or rotators that annihilate the non-zero elements of the lowest diagonal of my existing R and build a new R. But now I have a set of H1_old H2_old ... Hn_old and H1_new H2_new ... Hn_new (and similarly taus) which can't be merged into a single compact QR representation. The two possibilities I have (and maybe the OP has the same two) are:
Always maintain Q and R explicitly separated, whether computed the first time or after every update, at the cost of extra flops but keeping the required memory well bounded.
Stick to the compact LAPACK format, but then every time a new update comes in, keep a list of all these mini sets of update reflectors. At the point of solving the system, one would do a big Q'*c, i.e. H1_u3*H2_u3*...*Hn_u3*H1_u2*H2_u2*...*Hn_u2*H1_u1*H2_u1*...*Hn_u1*H1*H2*...*Hn*c, where ui is the QR update number. This is potentially a lot of multiplications to do and a lot of memory to keep track of, but it is definitely the fastest way; a rough sketch of this apply step follows below.
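To illustrate the apply step of option 2, here is a sketch in R (the storage format, a list of list(v, p) pairs using the LINPACK convention Hi = I - vi viT/pi from the long answer, is my own choice):
# Apply a stored sequence of Householder reflectors to x without forming Q.
# Applying H1, ..., Ht in order computes Ht...H1 x = Q'x.
apply_reflectors <- function(reflectors, x) {
  x <- as.matrix(x)
  for (h in reflectors) {
    x <- x - h$v %*% (crossprod(h$v, x) / h$p)  # H %*% x, with H never built
  }
  x
}
# Check against the LINPACK example above: Q'A recovers R.
p  <- Aqr$qraux
v1 <- matrix(c(p[1], Aqr$qr[2:3, 1]))
v2 <- matrix(c(0, p[2], Aqr$qr[3, 2]))
reflectors <- list(list(v = v1, p = p[1]), list(v = v2, p = p[2]))
round(apply_reflectors(reflectors, A), 6)   # upper triangle matches qr.R(Aqr)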
The long answer from David basically explains what the compact QR format is, but not how to get to this compact QR format having the explicitly computed Q and R as input.