Parallel big matrix multiplication in R

I need to multiply two big matrices A and B as follows:
library(bigmemory)
library(bigalgebra)
library(biganalytics)
A <- as.big.matrix( replicate(100, rnorm(10^5)) )  # as.big.matrix() converts an ordinary matrix
B <- as.big.matrix( replicate(10^5, rnorm(100)) )  # (big.matrix() expects dimensions, not data)
AB <- A %*% B
How could I compute this multiplication in parallel?
The only tutorial I've come across so far is this one:
> library("doRedis")
> registerDoRedis(queue="example")
> L = foreach(j=1:2,.packages="VAM",.combine=c) %dopar%
+ {
+ key = paste("X",j,sep="")
+ ridx = ((j-1)*5 + 1):min((j*5),nrow(A))
+ X = A[ridx,] %*% B[,]
+ Y = as.big.matrix(X,backingfile=key)
+ vnew(Y, key)
+ key
+ }
> X = vam(matrix(L,nrow=2))
> sum(X[,] - A[,] %*% B[,])
[1] 0
But I'm not sure how to put it into practice. Is there also a simpler or more efficient way to achieve the same result?
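For reference, here is a minimal sketch of the same row-block idea written against the more common doParallel backend instead of doRedis. It assumes A and B are shared or file-backed big.matrix objects (the bigmemory default), so that describe() and attach.big.matrix() can hand them to the workers, and it returns an ordinary matrix, so the product must fit in RAM; the worker and block counts are arbitrary choices:
library(bigmemory)
library(foreach)
library(doParallel)

descA <- describe(A)               # descriptors are serializable,
descB <- describe(B)               # unlike big.matrix pointers

cl <- makeCluster(2)               # assumption: 2 local workers
registerDoParallel(cl)

blocks <- split(1:nrow(A), cut(1:nrow(A), 2, labels = FALSE))  # 2 row blocks

AB <- foreach(idx = blocks, .combine = rbind,
              .packages = "bigmemory") %dopar% {
  A_w <- attach.big.matrix(descA)  # re-attach inside each worker
  B_w <- attach.big.matrix(descB)
  A_w[idx, ] %*% B_w[, ]           # one row block of the product
}
stopCluster(cl)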

Installing Microsoft R Open, I go from 3 sec to 0.1 sec! (Microsoft R Open links R against the multi-threaded Intel MKL BLAS, which is what parallelizes the matrix product.)
library(bigmemory)
library(bigalgebra)
N <- 200
M <- 1e5
A <- as.big.matrix(matrix(rnorm(N * M), N, M))  # as.big.matrix() fills from a matrix;
B <- as.big.matrix(matrix(rnorm(N * M), M, N))  # big.matrix()'s init takes a scalar
system.time(AB <- A %*% B)
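If you want to confirm that the speedup really comes from the multi-threaded BLAS, one quick check is to pin the BLAS to a single thread and re-time (a sketch; it assumes the RhpcBLASctl package is installed):
library(RhpcBLASctl)
blas_get_num_procs()           # threads available to the BLAS
blas_set_num_threads(1)        # force single-threaded BLAS
system.time(AB <- A %*% B)     # timing should move back toward the slow case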

Related

Improving my R function performance with VCM

I am running a simulation using varying-coefficient models, but with some adjustments, and there is no R package that does what I am looking for.
My code is not running fast enough; I would like to make the vcm function below run faster.
###########################################################################
###########################################################################
### ###
### EPANECHNIKOV FUNCTION ###
### ###
###########################################################################
###########################################################################
epan <- function(t, h){
  idx = 0.75 * (1 - (t/h)**2) / h
  kernel = 0.50 * (abs(idx) + idx)  # truncates negative values to zero
  kernel
}
###########################################################################
###########################################################################
### ###
### UNPENALIZED ###
### VARYING COEFFICIENT MODEL ###
### ###
###########################################################################
###########################################################################
vcm <- function(x, y, z, z0) {
  n = dim(x)[1]
  p = dim(x)[2]
  n0 = length(z0)
  Z = outer(z, z0, "-")
  Width = sd(z) * n**(-0.2) * 2
  H = sapply(X = 1:n0, FUN = function(X) epan(t = Z[,X], h = Width))
  diag(H) = 0
  W_h = H / rep(colSums(H), each = n)  # normalize each column of weights
  G = lapply(X = 1:n0, FUN = function(X) cbind(x, Z[,X]*x))
  AB = matrix(NA, n0, 2*p)
  II = 1e-4 * diag(2*p) # to avoid singularity
  for (i in 1:n0) {
    AB[i,] = solve(crossprod(G[[i]] * W_h[,i], G[[i]]) + II) %*% crossprod(G[[i]] * W_h[,i], y)
  }
  AB
}
What I have done so far:
Profiled the code to see where the slow parts are.
Used sapply and lapply instead of a for loop, with no significant difference.
How is the code used? Here is a small simulation that calls the functions above.
n = 100000
p = 5
n0 = 1000
z = runif(n)
z0 = seq(0.05, 0.95, length.out = n0)
x = MASS::mvrnorm(n, rep(0,p), diag(p))
gz = cbind(2*sin(2*pi*z), 3*z*(1-2*z), exp(-2*z + z**2), 2*z, 0)
y = apply(x * gz, 1, sum) + rnorm(n)
vvc_m = vcm(x,y,z,z0)
I am willing to use Rcpp or any other libraries if they would significantly improve my code's performance even though I have no experience with Rcpp.
Your help is appreciated!
The sapply and lapply calls are not needed. Also, G[[i]] * W_h[,i] needs to be computed only once per iteration. These changes shave off a few seconds, but the bulk of the time is spent in the for loop. You are probably correct that any further gains will have to come from Rcpp/RcppArmadillo.
vcm2 <- function(x, y, z, z0) {
  n = dim(x)[1]
  p = dim(x)[2]
  n0 = length(z0)
  Z = outer(z, z0, "-")
  Width = sd(z) * n**(-0.2) * 2
  H = epan(Z, Width)  # epan() is vectorized, no sapply needed
  diag(H) = 0
  W_h = H / rep(colSums(H), each = n)
  AB = matrix(NA, n0, 2*p)
  II = 1e-4 * diag(2*p) # to avoid singularity
  G = matrix(x, n, 2*p)  # reuse one buffer for [x, Z_i * x]
  idx = (p + 1):(2*p)
  for (i in 1:n0) {
    G[,idx] = Z[,i]*x
    GW_h = G*W_h[,i]   # computed once, used in both crossprods
    AB[i,] = solve(crossprod(GW_h, G) + II) %*% crossprod(GW_h, y)
  }
  AB
}
system.time(vvc_m <- vcm(x,y,z,z0))
#> user system elapsed
#> 21.71 5.42 27.14
system.time(vvc_m2 <- vcm2(x,y,z,z0))
#> user system elapsed
#> 19.45 3.52 22.99
identical(vvc_m, vvc_m2)
#> [1] TRUE
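In that spirit, here is a minimal RcppArmadillo sketch of the hot loop (my own translation, with vcm_arma a hypothetical name; it assumes the Rcpp and RcppArmadillo packages are installed). Z and W_h would still be computed in R exactly as in vcm2, and the result should match AB up to floating-point error:
Rcpp::cppFunction(depends = "RcppArmadillo", '
arma::mat vcm_arma(const arma::mat& x, const arma::vec& y,
                   const arma::mat& Z, const arma::mat& W_h) {
  const arma::uword n = x.n_rows, p = x.n_cols, n0 = Z.n_cols;
  arma::mat AB(n0, 2 * p);
  arma::mat G(n, 2 * p);
  G.cols(0, p - 1) = x;
  arma::mat II = 1e-4 * arma::eye(2 * p, 2 * p);
  for (arma::uword i = 0; i < n0; ++i) {
    G.cols(p, 2 * p - 1) = x.each_col() % Z.col(i);  // [x, Z_i * x]
    arma::mat GW = G.each_col() % W_h.col(i);        // weighted design, once
    AB.row(i) = arma::solve(GW.t() * G + II, GW.t() * y).t();
  }
  return AB;
}')
# then: AB <- vcm_arma(x, y, Z, W_h)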

Iterative optimization of alternative glm family

I'm setting up an alternative response function to the exponential function commonly used in Poisson GLMs. It is called softplus and is defined as $\frac{1}{c} \log(1+\exp(c \eta))$, where $\eta$ corresponds to the linear predictor $X\beta$.
I already managed the optimization by setting the parameter $c$ to arbitrary fixed values and only searching for $\hat{\beta}$.
But now, for the next step, I have to optimize the parameter $c$ as well, iteratively alternating between an updated $\beta$ and the current $c$.
I tried to write a log-likelihood function and a score function and then set up a Newton-Raphson optimization (using a while loop), but I don't know how to separate the updating of $c$ in an outer step from the updating of $\beta$ in an inner step.
Are there any suggestions?
# Response function: softplus(eta) = log(1 + exp(c * eta)) / c,
# evaluated in a numerically stable way
sp <- function(eta, c = 1) {
  (pmax(c * eta, 0) + log1p(exp(-abs(c * eta)))) / c
}
# Log-likelihood
l.lpois <- function(par, y, X){
  beta <- par[1:(length(par)-1)]
  c <- par[length(par)]
  l <- rep(NA, times = length(y))
  for (i in 1:length(l)){
    l[i] <- y[i] * log(sp(X[i,]%*%beta, c)) - sp(X[i,]%*%beta, c)
  }
  sum(l)
}
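# Editor's aside: sp() is vectorised, so the element-wise loop above can be
# collapsed into one expression (l.lpois_vec is a hypothetical name; it
# computes the same log-likelihood):
l.lpois_vec <- function(par, y, X){
  beta <- par[-length(par)]
  c <- par[length(par)]
  mu <- sp(X %*% beta, c)
  sum(y * log(mu) - mu)
}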
# Score function
score <- function(y, X, par){
  beta <- par[1:(length(par)-1)]
  c <- par[length(par)]
  s <- matrix(rep(NA, times = length(y)*length(par)), ncol = length(y))
  for (i in 1:length(y)){
    s[,i] <- c(X[i,], 1) * (y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
  }
  score <- rep(NA, times = nrow(s))
  for (j in 1:length(score)){
    score[j] <- sum(s[j,])
  }
  return(score)
}
# Optimization function
opt <- function(y, X, b.start, eps = 0.0001, maxiter = 1e5){
  beta <- b.start[1:(length(b.start)-1)]
  c <- b.start[length(b.start)]
  b.old <- b.start
  i <- 0
  conv <- FALSE
  while (conv == FALSE){
    eta <- X %*% b.old[1:(length(b.old)-1)]
    s <- score(y, X, b.old)
    h <- numDeriv::hessian(l.lpois, b.old, y = y, X = X)
    invh <- solve(h)
    # update
    b.new <- b.old + invh %*% s
    i <- i + 1
    # Test
    if (any(is.nan(b.new))){
      b.new <- b.old
      warning("convergence failed")
      break
    }
    # convergence reached?
    if (sqrt(sum((b.new - b.old)^2)) / sqrt(sum(b.old^2)) < eps | i >= maxiter){
      conv <- TRUE
    }
    b.old <- b.new
  }
  eta <- X %*% b.new[1:(length(b.new)-1)]
  # covariance
  invh <- solve(numDeriv::hessian(l.lpois, b.new, y = y, X = X))
  fitted <- sp(eta, b.new[length(b.new)])
  list("coefficients" = c(beta = b.new),
       "fitted.values" = fitted,
       "covariance" = invh)
}
# Running fails ..
n <- 100
x <- runif(n, 0, 1)
Xdes <- cbind(1, x)
eta <- 1 + 2 * x
y <- rpois(n, sp(eta, c = 1))
opt(y,Xdes,c(0,1,1))
You have two bugs in your score function.
First, the expression
(y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
returns a 1x1 matrix, so you must convert it to a numeric:
as.numeric(y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
Second, check the placement of the closing ) in the matrix() call; it must be
s <- matrix(rep(NA, times = length(y)*length(par)), ncol = length(y))
and not
s <- matrix(rep(NA, times = length(y)*length(par), ncol = length(y))
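As for separating the two updates, one possible shape for the alternating scheme is sketched below. This is an editor's sketch, not the poster's method: it assumes the bug-fixed score from above, fit_beta is a hypothetical helper, and the starting values and the search interval for c are arbitrary choices.
# Inner step: Newton-Raphson in beta for a fixed c
fit_beta <- function(y, X, beta, c, eps = 1e-4, maxiter = 100){
  p <- length(beta)
  for (i in seq_len(maxiter)){
    par <- c(beta, c)
    s <- score(y, X, par)[1:p]                                   # beta block of the score
    h <- numDeriv::hessian(l.lpois, par, y = y, X = X)[1:p, 1:p]
    beta.new <- beta - solve(h, s)                               # Newton step for a maximum
    if (sqrt(sum((beta.new - beta)^2)) < eps) break
    beta <- beta.new
  }
  beta
}

# Outer step: one-dimensional search over c on the profile log-likelihood
profile_ll <- function(c){
  beta.c <- fit_beta(y, Xdes, beta = c(0, 1), c = c)
  l.lpois(c(beta.c, c), y, Xdes)
}
c.hat <- optimize(profile_ll, interval = c(0.1, 10), maximum = TRUE)$maximum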

How to use a matrix as an input in a User-Defined Function and Loop it in R?

Here is the current script I have:
delta <- 1/52
T <- 0.5
S0 <- 25
sigma <- 0.30
K <- 25
r <- 0.05
n <- 1000000
m <- T/delta
S <- numeric(m + 1)
S[1] <- S0
# Payoff of the Asian option
asian_option_price <- function() {
  for (j in 1:m) {
    W <- rnorm(1)
    S[j + 1] <- S[j] * exp((r - 0.5 * sigma^2) * delta + sigma * sqrt(delta) * W)
  }
  Si.bar <- mean(S)
  exp(-r * T) * max(Si.bar - K, 0)
}
# Loop over simulations (raply() comes from plyr)
library(plyr)
C <- raply(n, asian_option_price(), .progress = "text")
My issue is that I need to use "-W" for a second simulation right after this one is done. The way the script is written, "W" is generated inside my loop, which makes it impossible (I think) to reuse the corresponding "-W" afterwards. I think I need an independent matrix filled with rnorm() draws, mat(x) = matrix(rnorm(m*n,mean=0,sd=1), m, n), so that I can simply use -mat(x) in my second simulation. I don't get how to take "W" out of my loop and still use its corresponding matrix. Any help would be very useful. Thanks!
Your idea to preallocate all the random numbers is correct: with W stored up front, the antithetic run simply reuses -W. You could then loop over the individual entries, but it is faster to go for a vectorized approach:
delta <- 1/52
T <- 0.5
S0 <- 25
sigma <- 0.30
K <- 25
r <- 0.05
n <- 100000
m <- ceiling(T/delta)
W <- matrix(rnorm(n*m), nrow = m, ncol = n)  # one column of increments per path
S <- apply(exp((r - 0.5 * sigma^2) * delta + sigma * sqrt(delta) * W), 2, cumprod)
S <- S0 * rbind(1, S)                        # prepend the starting price
Si_bar <- apply(S, 2, mean)                  # average price along each path
mean(pmax(Si_bar - K, 0)) * exp(-r * T)      # discounted mean payoff
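The antithetic pass the question asks about then just reuses the same W with the sign flipped; a short sketch building on the objects above (my addition), averaging each payoff with its mirrored twin:
S_anti <- S0 * rbind(1, apply(exp((r - 0.5 * sigma^2) * delta - sigma * sqrt(delta) * W), 2, cumprod))
Si_bar_anti <- apply(S_anti, 2, mean)

# antithetic-variates estimator: average paired payoffs, then discount
payoff      <- pmax(Si_bar - K, 0)
payoff_anti <- pmax(Si_bar_anti - K, 0)
exp(-r * T) * mean((payoff + payoff_anti) / 2)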

Regularized Latent Semantic Indexing in R

I am trying to implement the Regularized Latent Semantic Indexing (RLSI) algorithm on R.
The original paper can be found here:
http://research.microsoft.com/en-us/people/hangli/sigirfp372-wang.pdf
Below is my code.
Here, I generate a matrix D from two matrices U and V. Each column of U corresponds to a topic vector, and it is made to be sparse. After that, I apply RLSI to the D matrix to see if I can factorize it into two matrices, one of which has sparse vectors like U.
However, the resulting U is far from sparse; actually, every element of it is filled with numbers.
Is there something wrong with my code?
Thank you very much in advance.
library(magrittr)
# functions
updateU <- function(D, U, V){
  S <- V %*% t(V)
  R <- D %*% t(V)
  for (m in 1:M){
    u_m <- rep(0, K)
    u_previous <- u_m
    diff_u <- 100
    while (diff_u > 0.1){
      for (k in 1:K){
        w_mk <- R[m,k] - S[k,-k] %*% U[m,-k]
        in_hinge <- (abs(w_mk) - 0.5 * lambda_1)
        u_m[k] <- (ifelse(in_hinge > 0, in_hinge, 0) * ifelse(w_mk >= 0, 1, -1)) / S[k,k]
      }
      diff_u <- sum(u_m - u_previous)
      u_previous <- u_m
    }
    U[m,] <- u_m
  }
  return(U)
}
updateV <- function(D, U, V){
  Sigma <- solve(t(U) %*% U + lambda_2 * diag(K))
  Phi <- t(U) %*% D
  V <- Sigma %*% Phi
  return(V)
}
# Set constants
M <- 5000
N <- 1000
K <- 30
lambda_1 <- 1
lambda_2 <- 0.5
# Create D
originalU <- c(rpois(50000, lambda = 10), rep(0, 100000)) %>% sample(., 150000) %>% matrix(., M, K)
originalV <- rpois(30000, lambda = 5) %>% sample(., 30000) %>% matrix(., K, N)
D <- originalU %*% originalV
# Initialize U and V
V <- matrix(rpois(30000, lambda = 5), K, N)
U <- matrix(0, M, K)
# Run RLSI (iterate 100 times for now)
for (t in 1:100){
  cat(t, ":")
  U <- updateU(D, U, V)
  V <- updateV(D, U, V)
  loss <- sum((D - U %*% V) ^ 2)
  cat(loss, "\n")
}
I've got it: each row of U has to be reset to a zero vector each time the updateU function is run.
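For concreteness, here is updateU with that reset applied (a sketch; two small tweaks beyond the answer's one-line fix are my own judgment calls and are marked in the comments):
updateU <- function(D, U, V){
  S <- V %*% t(V)
  R <- D %*% t(V)
  for (m in 1:M){
    U[m,] <- 0                    # the fix: start each row from a zero vector
    u_m <- rep(0, K)
    u_previous <- u_m
    diff_u <- 100
    while (diff_u > 0.1){
      for (k in 1:K){
        w_mk <- R[m,k] - S[k,-k] %*% U[m,-k]
        in_hinge <- abs(w_mk) - 0.5 * lambda_1
        u_m[k] <- (ifelse(in_hinge > 0, in_hinge, 0) * ifelse(w_mk >= 0, 1, -1)) / S[k,k]
        U[m,k] <- u_m[k]          # my tweak: keep the row current for the next coordinate
      }
      diff_u <- sum(abs(u_m - u_previous))  # my tweak: absolute difference
      u_previous <- u_m
    }
  }
  return(U)
}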

Optimisation in R using Ucminf package

I am not able to apply ucminf function to minimise my cost function in R.
Here is my cost function:
costfunction <- function(X, y, theta){
  m <- length(y)
  J <- 1/m * ((-t(y) %*% log(sigmoid(as.matrix(X) %*% as.matrix(theta)))) - ((1 - t(y)) %*% log(1 - sigmoid(as.matrix(X) %*% as.matrix(theta)))))
  J
}
Here is my sigmoid function:
sigmoid <- function(t){
  1 / (1 + exp(-t))
}
Here is my gradient function:
gradfunction <- function(X, y, theta){
  grad <- 1/m * (t(X) %*% (sigmoid(as.matrix(X) %*% as.matrix(theta)) - y))
  grad
}
I am trying to do the following:
library("ucminf")
data <- read.csv("ex2data1.txt",header=FALSE)
X <<- data[,c(1,2)]
y <<- data[,3]
qplot(X[,1],X[,2],colour=factor(y))
m <- dim(X)[1]
n <- dim(X)[2]
X <- cbind(1,X)
initial_theta <<- matrix(0,nrow=n+1,ncol=1)
cost <- costfunction(X,y,initial_theta)
grad <- gradfunction(X,y,initial_theta)
This is where I want to call ucminf to find the minimum cost and values of theta. I am not sure how to do this.
Looks like you are trying to do the week 2 problem of the Machine Learning course on Coursera.
There is no need for the ucminf package here; you can simply use the base R function optim, and it works.
We will define the sigmoid and cost functions first.
sigmoid <- function(z)
  1 / (1 + exp(-z))

costFunction <- function(theta, X, y) {
  m <- length(y)
  J <- -(1 / m) * crossprod(c(y, 1 - y),
                            c(log(sigmoid(X %*% theta)), log(1 - sigmoid(X %*% theta))))
  grad <- (1 / m) * crossprod(X, sigmoid(X %*% theta) - y)
  list(J = J, grad = grad)
}
Let's load the data now. To make this code reproducible, I put the data in my Dropbox.
download.file("https://dl.dropboxusercontent.com/u/8750577/ex2data1.txt",
              method = "curl", destfile = "/tmp/ex2data1.txt")
data <- matrix(scan('/tmp/ex2data1.txt', what = double(), sep = ","),
               ncol = 3, byrow = TRUE)
X <- data[, 1:2]
y <- data[, 3, drop = FALSE]
m <- nrow(X)
n <- ncol(X)
X <- cbind(1, X)
initial_theta <- matrix(0, nrow = n + 1)
We can then compute the result of the cost function at the initial theta like this
cost <- costFunction(initial_theta, X, y)
(grad <- cost$grad)
## [,1]
## [1,] -0.100
## [2,] -12.009
## [3,] -11.263
(cost <- cost$J)
## [,1]
## [1,] 0.69315
Finally, we can use optim to get the optimal theta:
res <- optim(par = initial_theta,
             fn = function(t) costFunction(t, X, y)$J,
             gr = function(t) costFunction(t, X, y)$grad,
             method = "BFGS", control = list(maxit = 400))
(theta <- res$par)
## [,1]
## [1,] -25.08949
## [2,] 0.20566
## [3,] 0.20089
(cost <- res$value)
## [1] 0.2035
If you have a problem with the function download.file, the data can also be downloaded here.
As you did not provide a reproducible example, it is hard to give you exactly the code you need, but the general idea is to hand your functions over to ucminf:
ucminf(start, costfunction, gradfunction, y = y, theta = initial_theta)
Note that start needs to be a vector of initial starting values which, when handed over as X to the two functions, produces a result. Usually you would use random starting values (e.g., runif).
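For completeness, here is a minimal sketch of a ucminf call that optimizes theta directly, reusing costFunction from the first answer (the wrapper closures and the drop() coercions to plain numerics are my own; like optim, ucminf optimizes over the first argument of fn):
library(ucminf)

fit <- ucminf(par = rep(0, ncol(X)),   # theta starts at zero
              fn  = function(theta) drop(costFunction(theta, X, y)$J),
              gr  = function(theta) drop(costFunction(theta, X, y)$grad))
fit$par    # estimated theta
fit$value  # minimized cost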
