Regularized Latent Semantic Indexing in R

I am trying to implement the Regularized Latent Semantic Indexing (RLSI) algorithm in R.
The original paper can be found here:
http://research.microsoft.com/en-us/people/hangli/sigirfp372-wang.pdf
Below is my code.
Here, I generate a matrix D from two matrices U and V. Each column of U corresponds to a topic vector, and it is made to be sparse. After that, I apply RLSI to the D matrix to see if I can factorize it into two matrices, one of which has sparse vectors like U.
However, the resulting U is far from sparse. In fact, every element of it is nonzero.
Is there something wrong with my code?
Thank you very much in advance.
library(magrittr)
# functions
updateU <- function(D, U, V) {
  S <- V %*% t(V)
  R <- D %*% t(V)
  for (m in 1:M) {
    u_m <- rep(0, K)
    u_previous <- u_m
    diff_u <- 100
    while (diff_u > 0.1) {
      for (k in 1:K) {
        w_mk <- R[m, k] - S[k, -k] %*% U[m, -k]
        in_hinge <- (abs(w_mk) - 0.5 * lambda_1)
        u_m[k] <- (ifelse(in_hinge > 0, in_hinge, 0) * ifelse(w_mk >= 0, 1, -1)) / S[k, k]
      }
      diff_u <- sum(u_m - u_previous)
      u_previous <- u_m
    }
    U[m, ] <- u_m
  }
  return(U)
}
updateV <- function(D, U, V) {
  Sigma <- solve(t(U) %*% U + lambda_2 * diag(K))
  Phi <- t(U) %*% D
  V <- Sigma %*% Phi
  return(V)
}
# Set constants
M <- 5000
N <- 1000
K <- 30
lambda_1 <- 1
lambda_2 <- 0.5
# Create D
originalU <- c(rpois(50000, lambda = 10), rep(0, 100000)) %>% sample(., 150000) %>% matrix(., M, K)
originalV <- rpois(30000, lambda = 5) %>% sample(., 30000) %>% matrix(., K, N)
D <- originalU %*% originalV
# Initialize U and V
V <- matrix(rpois(30000, lambda = 5), K, N)
U <- matrix(0, M, K)
# Run RLSI (iterate 100 times for now)
for (t in 1:100) {
  cat(t, ": ")
  U <- updateU(D, U, V)
  V <- updateV(D, U, V)
  loss <- sum((D - U %*% V)^2)
  cat(loss, "\n")
}

I've got it. Each row of U has to be reset to a zero vector each time the updateU function is run.
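For reference, here is a minimal sketch of that fix as I read it (an assumption about the intended change, not the poster's exact code): zero the stored row before the coordinate sweep so that values left over from the previous outer iteration do not enter the S[k, -k] %*% U[m, -k] term.

updateU <- function(D, U, V) {
  S <- V %*% t(V)
  R <- D %*% t(V)
  for (m in 1:M) {
    U[m, ] <- 0  # the reset described above: start each row from a zero vector
    u_m <- rep(0, K)
    u_previous <- u_m
    diff_u <- 100
    while (diff_u > 0.1) {
      for (k in 1:K) {
        w_mk <- R[m, k] - S[k, -k] %*% U[m, -k]
        in_hinge <- (abs(w_mk) - 0.5 * lambda_1)
        u_m[k] <- (ifelse(in_hinge > 0, in_hinge, 0) * ifelse(w_mk >= 0, 1, -1)) / S[k, k]
      }
      diff_u <- sum(u_m - u_previous)
      u_previous <- u_m
    }
    U[m, ] <- u_m
  }
  return(U)
}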

Related

What is going on inside the varimax function in R?

I have been trying to figure out the core part of the varimax function in R. I found a wiki link that writes out the algorithm, but why is B <- t(x) %*% (z^3 - z %*% diag(drop(rep(1, p) %*% z^2))/p) computed? I am also not sure why the SVD of the matrix B is taken. The iteration step is presumably there to maximize/minimize the variance, and the singular values would then be the variances of the principal components, but I am unsure about that as well. I am pasting the whole code of varimax for convenience, but the relevant part, and therefore my question about what is actually happening under the hood, is the for loop.
function (x, normalize = TRUE, eps = 1e-05)
{
    nc <- ncol(x)
    if (nc < 2)
        return(x)
    if (normalize) {
        sc <- sqrt(drop(apply(x, 1L, function(x) sum(x^2))))
        x <- x/sc
    }
    p <- nrow(x)
    TT <- diag(nc)
    d <- 0
    for (i in 1L:1000L) {
        z <- x %*% TT
        B <- t(x) %*% (z^3 - z %*% diag(drop(rep(1, p) %*% z^2))/p)
        sB <- La.svd(B)
        TT <- sB$u %*% sB$vt
        dpast <- d
        d <- sum(sB$d)
        if (d < dpast * (1 + eps))
            break
    }
    z <- x %*% TT
    if (normalize)
        z <- z * sc
    dimnames(z) <- dimnames(x)
    class(z) <- "loadings"
    list(loadings = z, rotmat = TT)
}
Edit: The algorithm is available in the book "Factor Analysis of Data Matrices" by Paul Horst (published by Holt, Rinehart and Winston), and the actual sources can be found therein. This book is also cited in the documentation of the varimax function in R.
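For reference, a minimal numeric check (my own sketch, assuming the usual varimax objective, i.e. the sum over factors of the variance of the squared loadings): the rotation returned by stats::varimax should not decrease this criterion.

varimax_crit <- function(z) sum(apply(z^2, 2, var))  # assumed form of the criterion

A <- loadings(factanal(mtcars[, 1:6], factors = 2, rotation = "none"))
fit <- varimax(A, normalize = FALSE)
varimax_crit(A)             # criterion before rotation
varimax_crit(fit$loadings)  # criterion after rotation: should be at least as large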

How to compute a probability matrix based on a binary matrix?

My attempt is:
library(igraph)
set.seed(41)
n<-10
A <- sample.int (2, n*n, TRUE)-1L; dim(A) <- c(n,n);
m <- sum(A)
g <- graph_from_adjacency_matrix(A)
k_in <- degree(g, v = V(g), mode = "in", loops = TRUE, normalized = FALSE)#; k_in
k_out <- degree(g, v = V(g), mode = "out", loops = TRUE, normalized = FALSE)#; k_out
p <- (k_in %*% t(k_out) / (2*m))/(k_in %*% t(k_out) / (2*m) + k_in %*% t(k_out) / (2*m))
round(p, 3)
All values of the probability matrix p are 0.5.
I think the error is in the denominator of p, because the matrix A is not symmetric.
Question: how do I specify the denominator correctly?
Edit. After Stéphane Laurent's answer.
I think we should have four different values: k_j_out, k_i_in, k_i_out, k_j_in.
Finally, I need to obtain the weight matrix, W.
I <- matrix(0, n, n); diag(I) <- 1
W <- A %*% (I - P) - t(A) %*% (I - P)
And I think this matrix should be symmetric.
The 2m is useless because it appears in both the numerator and the denominator. You can do:
p <- (k_in %*% t(k_out))/(k_out %*% t(k_in) + k_in %*% t(k_out))
Or, the same with less computations:
M <- k_in %*% t(k_out)
M / (M + t(M))
EDIT
We can check:
i <- 2; j <- 3
k_out[j] * k_in[i] / (k_out[j] * k_in[i] + k_out[i] * k_in[j])
# 0.5384615
p[i,j]
# 0.5384615
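For completeness, here is a sketch (my own, not part of the answer) that plugs the corrected probabilities into the W formula from the question's edit; isSymmetric() simply tests the poster's expectation rather than asserting that it holds:

P <- M / (M + t(M))                               # corrected probability matrix from above
W <- A %*% (diag(n) - P) - t(A) %*% (diag(n) - P) # weight matrix from the question's edit
isSymmetric(W)                                    # check whether W is in fact symmetric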

How to use a matrix as an input in a User-Defined Function and Loop it in R?

Here is the current script I have:
delta <- 1/52
T <- 0.5
S0 <- 25
sigma <- 0.30
K <- 25
r <- 0.05
n <- 1000000
m <- T/delta
S <- numeric(m + 1)
S[1] <- S0
#Payoff asian option
asian_option_price <- function() {
  for (j in 1:m) {
    W <- rnorm(1)
    S[j + 1] <- S[j] * exp((r - 0.5 * sigma^2) * delta + sigma * sqrt(delta) * W)
  }
  Si.bar <- mean(S)
  exp(-r * T) * max(Si.bar - K, 0)
}
#Loops
C <- raply(n, asian_option_price(), .progress = "text")  # raply() comes from the plyr package
My issue is that I need to use "-W" for a second simulation right after this one is done. The way the script is written, "W" is inside my loop, which makes it impossible (I think) to use the corresponding "-W" afterwards. I think I need an independent matrix filled with rnorm(), e.g. mat(x) = matrix(rnorm(m*n, mean=0, sd=1), m, n), so that I can simply use -mat(x) in my second simulation. I don't get how to take "W" out of my loop and still use its corresponding matrix. Any help would be very useful. Thanks!
Your idea to preallocate all the random numbers is correct. You could then loop over the individual entries. However, it is faster to go for a vectorized approach:
delta <- 1/52
T <- 0.5
S0 <- 25
sigma <- 0.30
K <- 25
r <- 0.05
n <- 100000
m <- ceiling(T/delta)
W <- matrix(rnorm(n*m), nrow = m, ncol = n)
S <- apply(exp((r - 0.5 * sigma^2) * delta + sigma * sqrt(delta) * W), 2, cumprod)
S <- S0 * rbind(1, S)
Si_bar <- apply(S, 2, mean)
mean(pmax(Si_bar - K, 0)) * exp(-r*T)
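Since W is now a preallocated matrix, the antithetic run the question asks about ("-W") can reuse it directly. A small sketch building on the code above (the *_anti names are just illustrative):

S_anti <- S0 * rbind(1, apply(exp((r - 0.5 * sigma^2) * delta - sigma * sqrt(delta) * W), 2, cumprod))
Si_bar_anti <- apply(S_anti, 2, mean)
price_anti <- mean(pmax(Si_bar_anti - K, 0)) * exp(-r * T)
# averaging the two estimates gives the antithetic-variates price
(mean(pmax(Si_bar - K, 0)) * exp(-r * T) + price_anti) / 2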

Why does this optimization algorithm in R stop after a few function evaluations?

I have some code which has been used for a paper.
After defining the function to be optimized, the author used the Nelder-Mead method to estimate the parameters needed. When I run the code, it freezes after 493 function evaluations; it doesn't show any error message or anything. I've been trying to find some info, but I haven't been lucky. How can I modify the optim command in order to evaluate all possible combinations, and/or what is preventing the function from being optimized?
Here's the code. It's relatively long, BUT the second-to-last line (system.time(stcopfit...)) is the ONLY ONE I need to make work / fix / modify. So you can just copy & paste the code (as I said, taken from the author of the mentioned paper) and let it run; you don't have to go through all the code, just the last few lines. This is the data over which to run the optimization, i.e. a matrix of [0,1] uniform variables of dimension 2172x9.
Any help is appreciated, thanks!
Here's a screenshot in RStudio (it took around 2 minutes to arrive at 493, and then it's been stuck like this for the last 30 minutes):
Code:
#download older version of "sn" package
url <- "https://cran.r-project.org/src/contrib/Archive/sn/sn_1.0-0.tar.gz"
install.packages(url, repos=NULL, type="source")
install.packages("signal")
library(sn)
library(signal)
#1. redefine qst function
qst <- function (p, xi = 0, omega = 1, alpha = 0, nu = Inf, tol = 1e-08)
{
  if (length(alpha) > 1)
    stop("'alpha' must be a single value")
  if (length(nu) > 1)
    stop("'nu' must be a single value")
  if (nu <= 0)
    stop("nu must be non-negative")
  if (nu == Inf)
    return(qsn(p, xi, omega, alpha))
  if (nu == 1)
    return(qsc(p, xi, omega, alpha))
  if (alpha == Inf)
    return(xi + omega * sqrt(qf(p, 1, nu)))
  if (alpha == -Inf)
    return(xi - omega * sqrt(qf(1 - p, 1, nu)))
  na <- is.na(p) | (p < 0) | (p > 1)
  abs.alpha <- abs(alpha)
  if (alpha < 0)
    p <- (1 - p)
  zero <- (p == 0)
  one <- (p == 1)
  x <- xa <- xb <- xc <- fa <- fb <- fc <- rep(NA, length(p))
  nc <- rep(TRUE, length(p))
  nc[(na | zero | one)] <- FALSE
  fc[!nc] <- 0
  xa[nc] <- qt(p[nc], nu)
  xb[nc] <- sqrt(qf(p[nc], 1, nu))
  fa[nc] <- pst(xa[nc], 0, 1, abs.alpha, nu) - p[nc]
  fb[nc] <- pst(xb[nc], 0, 1, abs.alpha, nu) - p[nc]
  regula.falsi <- FALSE
  while (sum(nc) > 0) {
    xc[nc] <- if (regula.falsi)
      xb[nc] - fb[nc] * (xb[nc] - xa[nc])/(fb[nc] - fa[nc])
    else (xb[nc] + xa[nc])/2
    fc[nc] <- pst(xc[nc], 0, 1, abs.alpha, nu) - p[nc]
    pos <- (fc[nc] > 0)
    xa[nc][!pos] <- xc[nc][!pos]
    fa[nc][!pos] <- fc[nc][!pos]
    xb[nc][pos] <- xc[nc][pos]
    fb[nc][pos] <- fc[nc][pos]
    x[nc] <- xc[nc]
    nc[(abs(fc) < tol)] <- FALSE
    regula.falsi <- !regula.falsi
  }
  x <- replace(x, zero, -Inf)
  x <- replace(x, one, Inf)
  Sign <- function(x) sign(x) + as.numeric(x == 0)
  q <- as.numeric(xi + omega * Sign(alpha) * x)
  names(q) <- names(p)
  return(q)
}
#2. initial parameter setting
mkParam <- function(Omega, delta, nu){
  ndim <- length(delta)+1;
  R <- diag(ndim);
  for (i in 2:ndim){
    R[i,1] <- R[1,i] <- delta[i-1];
    if (i>=3){for (j in 2:(i-1)){R[i,j] <- R[j,i] <- Omega[i-1,j-1];}}
  }
  LTR <- t(chol(R));
  Mtheta <- matrix(0, nrow=ndim, ncol=ndim);
  for (i in 2:ndim){
    Mtheta[i,1] <- acos(LTR[i,1]);
    cumsin <- sin(Mtheta[i,1]);
    if (i >=3){for (j in 2:(i-1)){
      Mtheta[i,j] <- acos(LTR[i,j]/cumsin);
      cumsin <- cumsin*sin(Mtheta[i,j]);}
    }
  }
  c(Mtheta[lower.tri(Mtheta)], log(nu-2));
}
#3. from internal to original parameters
paramToExtCorr <- function(param){
  ntheta <- dim*(dim+1)/2;
  theta <- param[1:ntheta];
  ndim <- (1+sqrt(1+8*length(theta)))/2;
  LTR <- diag(ndim);
  for (i in 2:ndim){
    LTR[i,1] <- cos(theta[i-1]);
    cumsin <- sin(theta[i-1]);
    if (i >=3){for (j in 2:(i-1)){
      k <- i+ndim*(j-1)-j*(j+1)/2;
      LTR[i,j] <- cumsin*cos(theta[k]);
      cumsin <- cumsin*sin(theta[k]);}
    }
    LTR[i,i] <- cumsin;
  }
  R <- LTR %*% t(LTR);
  R;
}
#4. show estimated parameters and log likelihood
resultVec <- function(fit){
  R <- paramToExtCorr(fit$par);
  logLik <- -fit$value;
  Omega <- R[-1, -1];
  delta <- R[1, -1];
  ntheta <- dim*(dim+1)/2;
  nu <- exp(fit$par[ntheta+1])+2;
  c(Omega[lower.tri(Omega)], delta, nu, logLik);
}
#5. negative log likelihood for multivariate skew-t copula
stcopn11 <- function(param){
  N <- nrow(udat);
  mpoints <- 150;
  npar <- length(param);
  nu <- exp(param[npar])+2;
  R <- paramToExtCorr(param);
  Omega <- R[-1, -1];
  delta <- R[1, -1];
  zeta <- delta/sqrt(1-delta*delta);
  iOmega <- solve(Omega);
  alpha <- iOmega %*% delta / sqrt(1-(t(delta) %*% iOmega %*% delta)[1,1]);
  ix <- matrix(0, nrow=N, ncol=dim);
  lm <- matrix(0, nrow=N, ncol=dim);
  for (j in 1:dim){
    minx <- qst(min(udat[,j]), alpha=zeta[j], nu=nu);
    maxx <- qst(max(udat[,j]), alpha=zeta[j], nu=nu);
    xx <- seq(minx, maxx, length=mpoints);
    px <- sort(pst(xx, alpha=zeta[j], nu=nu));
    ix[,j] <- pchip(px, xx, udat[,j]);
    lm[,j] <- dst(ix[,j], alpha=zeta[j], nu=nu, log=TRUE);
  }
  lc <- dmst(ix, Omega=Omega, alpha=alpha, nu=nu, log=TRUE);
  -sum(lc)+sum(lm)
}
#6. sample setting
dim <- 9;
smdelta <- c(-0.36,-0.33,-0.48,-0.36,-0.33,-0.48,-0.36,-0.33,-0.48);
smdf <- 5;
smOmega <- cor(udat);
smzeta <- smdelta/sqrt(1-smdelta*smdelta);
iOmega <- solve(smOmega);
smalpha <- iOmega %*% smdelta /sqrt(1-(t(smdelta) %*% iOmega %*% smdelta)[1,1]);
#7. estimation
iniPar <- mkParam(diag(dim),numeric(dim),6);
system.time(stcopfit<-optim(iniPar,stcopn11,control=list(reltol=1e-8,trace=6)));
resultVec(stcopfit);
The parameters you arrive at by step 493 lead to an infinite loop in your qst function: not having any idea what this very complex code is actually doing, I'm afraid I can't diagnose further. Here's what I did to get that far:
I set cur.param <- NULL in the global environment, then put cur.param <<- param within stcopn11; this saves the current set of parameters to the global environment, so that when you break out of the optim() call manually (via Control-C or ESC, depending on your platform) you can inspect the current set of parameters and restart from them easily
I put in old-school debugging statements (e.g. cat("entering stcopn11\n") and cat("leaving stcopn11\n") at the beginning and at the next-to-last line of the objective function, plus a few more within stcopn11 to serve as progress markers)
once I had the "bad" parameters I used debug(stcopn11) and stcopn11(cur.param) to step through the function
I discovered that it was hanging on dimension 3 (j==3 in the for loop within stcopn11) and particularly on the first qst() call
I added a maxit=1e5 argument to qst; initialized it <- 1 before the while loop; set it <- it+1 each time through the loop; changed the stopping criterion to while (sum(nc) > 0 && it<maxit); and added if (it==maxit) stop("hit max number of iterations in qst") right after the loop
1e5 iterations in qst took 74 seconds; I have no idea whether it might stop eventually, but didn't want to wait to find out.
This was my modified version of stcopn11:
cur.param <- NULL ## set parameter placeholder
##5. negative log likelihood for multivariate skew-t copula
stcopn11 <- function(param,debug=FALSE) {
cat("stcopn11\n")
cur.param <<- param ## record current params outside function
N <- nrow(udat)
mpoints <- 150
npar <- length(param)
nu <- exp(param[npar])+2
R <- paramToExtCorr(param)
Omega <- R[-1, -1]
delta <- R[1, -1]
zeta <- delta/sqrt(1-delta*delta)
cat("... solving iOmega")
iOmega <- solve(Omega)
alpha <- iOmega %*% delta /
sqrt(1-(t(delta) %*% iOmega %*% delta)[1,1])
ix <- matrix(0, nrow=N, ncol=dim)
lm <- matrix(0, nrow=N, ncol=dim)
cat("... entering dim loop\n")
for (j in 1:dim){
if (debug) cat(j,"\n")
minx <- qst(min(udat[,j]), alpha=zeta[j], nu=nu)
maxx <- qst(max(udat[,j]), alpha=zeta[j], nu=nu)
xx <- seq(minx, maxx, length=mpoints)
px <- sort(pst(xx, alpha=zeta[j], nu=nu))
ix[,j] <- pchip(px, xx, udat[,j])
lm[,j] <- dst(ix[,j], alpha=zeta[j], nu=nu, log=TRUE)
}
lc <- dmst(ix, Omega=Omega, alpha=alpha, nu=nu, log=TRUE)
cat("leaving stcopn11\n")
-sum(lc)+sum(lm)
}
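With the parameters checkpointed that way, the interrupted fit can be inspected and resumed; a sketch of the intended usage (my assumption, not from the original answer):

## after interrupting optim() with Ctrl-C / ESC:
print(cur.param)                          # the last parameter vector optim() tried
stcopfit <- optim(cur.param, stcopn11,
                  control = list(reltol = 1e-8, trace = 6))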

Help speeding up a loop in R

Basically I want to perform diagonal averaging in R. Below is some code adapted from the simsalabim package to do the diagonal averaging. The problem is that it is slow.
Any suggestions for vectorizing this instead of using sapply?
reconSSA <- function(S, v, group=1){
  ### S : matrix
  ### v : vector
  N <- length(v)
  L <- nrow(S)
  K <- N-L+1
  XX <- matrix(0, nrow=L, ncol=K)
  IND <- row(XX)+col(XX)-1
  XX <- matrix(v[row(XX)+col(XX)-1], nrow=L, ncol=K)
  XX <- S[,group] %*% t(t(XX) %*% S[,group])
  ## Diagonal Averaging
  .intFun <- function(i, x, ind) mean(x[ind==i])
  RC <- sapply(1:N, .intFun, x=XX, ind=IND)
  return(RC)
}
For data you could use the following
data(AirPassengers)
v <- AirPassengers
L <- 30
T <- length(v)
K <- T-L+1
x.b <- matrix(nrow=L,ncol=K)
x.b <- matrix(v[row(x.b)+col(x.b)-1],nrow=L,ncol=K)
S <- eigen(x.b %*% t(x.b))[["vectors"]]
out <- reconSSA(S, v, 1:10)
You can speed up the computation by almost 10 times with the help of a very specialized trick with rowsum:
reconSSA_1 <- function(S, v, group=1){
  ### S : matrix
  ### v : vector
  N <- length(v)
  L <- nrow(S)
  K <- N-L+1
  XX <- matrix(0, nrow=L, ncol=K)
  IND <- row(XX)+col(XX)-1
  XX <- matrix(v[row(XX)+col(XX)-1], nrow=L, ncol=K)
  XX <- S[,group] %*% t(t(XX) %*% S[,group])
  ## Diagonal Averaging
  SUMS <- rowsum.default(c(XX), c(IND))
  counts <- if(L <= K) c(1:L, rep(L, K-L-1), L:1)
            else c(1:K, rep(K, L-K-1), K:1)
  c(SUMS/counts)
}
all.equal(reconSSA(S, v, 1:10), reconSSA_1(S, v, 1:10))
[1] TRUE
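As a further sanity check (my own sketch, not from the original answer), the closed-form counts vector inside reconSSA_1 matches the number of elements on each anti-diagonal, using the L and K from the example data:

L <- 30; K <- 115                                   # dimensions from the AirPassengers example
IND <- row(matrix(0, L, K)) + col(matrix(0, L, K)) - 1
counts <- if (L <= K) c(1:L, rep(L, K - L - 1), L:1) else c(1:K, rep(K, L - K - 1), K:1)
all(counts == as.vector(table(IND)))                # TRUE: one count per anti-diagonal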
library(rbenchmark)
benchmark(SSA = reconSSA(S, v, 1:10),
          SSA_1 = reconSSA_1(S, v, 1:10),
          columns = c("test", "elapsed", "relative"),
          order = "relative")
test elapsed relative
2 SSA_1 0.23 1.0000
1 SSA 2.08 9.0435
[Update: As Joshua suggested, it could be sped up even further by using the crux of the rowsum code:
reconSSA_2 <- function(S, v, group=1){
  ### S : matrix
  ### v : vector
  N <- length(v)
  L <- nrow(S)
  K <- N-L+1
  XX <- matrix(0, nrow=L, ncol=K)
  IND <- c(row(XX)+col(XX)-1L)
  XX <- matrix(v[row(XX)+col(XX)-1], nrow=L, ncol=K)
  XX <- c(S[,group] %*% t(t(XX) %*% S[,group]))
  ## Diagonal Averaging
  SUMS <- .Call("Rrowsum_matrix", XX, 1L, IND, 1:N,
                TRUE, PACKAGE = "base")
  counts <- if(L <= K) c(1:L, rep(L, K-L-1), L:1)
            else c(1:K, rep(K, L-K-1), K:1)
  c(SUMS/counts)
}
test elapsed relative
3 SSA_2 0.156 1.000000
2 SSA_1 0.559 3.583333
1 SSA 5.389 34.544872
A 34.5x speedup compared to the original code!
]
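To see what the rowsum trick above is doing, here is a tiny illustration (my own, not part of the answer): group the entries of a matrix by anti-diagonal index and average each group.

X <- matrix(1:6, nrow = 2)                  # a 2 x 3 toy matrix
idx <- c(row(X) + col(X) - 1)               # anti-diagonal index of every entry
rowsum(c(X), idx) / as.vector(table(idx))   # the four anti-diagonal means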
I can't get your example to produce sensible results. I think there are several errors in your function.
XX is used in sapply, but is not defined in the function
sapply works over 1:N, where N=144 in your example, but x.b only has 115 columns
reconSSA simply returns x
Regardless, I think you want:
data(AirPassengers)
x <- AirPassengers
rowMeans(embed(x,30))
UPDATE: I've re-worked and profiled the function. Most of the time is spent in mean, so it may be hard to make this much faster using R code. That said, you can get a ~20% speedup by using sum instead.
reconSSA <- function(S, v, group=1){
  N <- length(v)
  L <- nrow(S)
  K <- N-L+1
  XX <- matrix(0, nrow=L, ncol=K)
  IND <- row(XX)+col(XX)-1
  XX <- matrix(v[row(XX)+col(XX)-1], nrow=L, ncol=K)
  XX <- S[,group] %*% t(t(XX) %*% S[,group])
  ## Diagonal Averaging
  .intFun <- function(i, x, ind) {
    I <- ind==i
    sum(x[I])/sum(I)
  }
  RC <- sapply(1:N, .intFun, x=XX, ind=IND)
  return(RC)
}
