How to optimize finding the trace of a square matrix multiplication in R?

I'm trying to optimize an spdep function of R for my use case, since it is very slow for large data sets. I was doing mostly fine, but I got stuck at one point where I am trying to find the trace of my weights matrix for the LM error test. I think the formula is tr[(W' + W) W] (page 82 of Anselin, L., Bera, A. K., Florax, R. and Yoon, M. J. 1996, Simple diagnostic tests for spatial dependence, Regional Science and Urban Economics, 26, 77–104). W is a square weights matrix holding the spatial relation of each observation to the others, and tr() is the sum of the diagonal elements.
In my case, the weights matrix is symmetric and its diagonal is zero. So I thought the formula tr[(W' + W) W] equals 2*sumsq(W), which is super fast to compute. But apparently I am mistaken somewhere, because the results do not match those of the spdep library, which are likely to be right.
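The algebra behind the shortcut is that tr[(W' + W) W] = tr(W'W) + tr(WW), and for a symmetric W the two terms are equal, so the expression reduces to 2*sum(W^2) (which is what multiway::sumsq computes) regardless of the diagonal. A quick sanity check of the identity on a small dense matrix (a sketch, not spdep code):
set.seed(1)
W <- matrix(runif(25), 5, 5)
W <- (W + t(W)) / 2              # force symmetry
diag(W) <- 0
sum(diag((t(W) + W) %*% W))      # tr[(W' + W) W]
2 * sum(W^2)                     # identical for a symmetric W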
The relevant part of the spdep library is shown below. Can anybody explain how the result of the following function differs from 2*sumsq(W), or how to make it much faster? This function is where lm.LMtests gets clogged for large data sets.
tracew <- function (listw) {
    dlmtr <- 0
    n <- length(listw$neighbours)
    if (n < 1) stop("non-positive n")
    ndij <- card(listw$neighbours)
    dlmtr <- 0
    for (i in 1:n) {
        dij <- listw$neighbours[[i]]
        wdij <- listw$weights[[i]]
        for (j in seq(length=ndij[i])) {
            k <- dij[j]
            # Luc Anselin 2006-11-11 problem with asymmetric listw
            dk <- which(listw$neighbours[[k]] == i)
            if (length(dk) > 0L && dk > 0L &&
                dk <= length(listw$neighbours[[k]]))
                wdk <- listw$weights[[k]][dk]
            else wdk <- 0
            dlmtr <- dlmtr + (wdij[j]*wdij[j]) + (wdij[j]*wdk)
        }
    }
    dlmtr
}
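For reference, on a dense matrix the loop above accumulates the sum over all i, j of w_ij^2 + w_ij*w_ji, i.e. tr(W'W) + tr(WW); a dense-matrix sketch of the same quantity, which collapses to 2*sum(W^2) only when W is exactly symmetric:
tracew_dense <- function(W) {
    # same quantity as tracew(), computed on an ordinary weights matrix
    sum(W^2) + sum(W * t(W))
}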
Additional explanation for those who are not familiar with the spdep library of R:
The input of the function, listw, holds a "graph" representation of the weight matrix as two lists of lists. listw$neighbours is a list in which each item is a vector of the indices of the observations that observation is related to. listw$weights is a list with the same structure as listw$neighbours, except that it holds the weights of those relations.
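For illustration, a tiny hand-built object with that shape (hypothetical values, not actual spdep output):
# 3 observations: obs 1 is linked to obs 2 and 3, and obs 2 and 3 each link back to obs 1
toy <- list(
    neighbours = list(c(2L, 3L), 1L, 1L),
    weights    = list(c(0.5, 0.5), 1.0, 1.0)
)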
Thanks in advance for any comments and directions.
# example code
# initialize
library(spdep)
library(multiway)   # for sumsq()
library(geosphere)  # for distm()
# load the tracew function above
data(columbus)
# the difference becomes dramatic when n is high; try without replicating first to see the results
columbus = columbus[rep(row.names(columbus), 20), ]
# manual calculation, using sumsq
w = distm(cbind(columbus$X, columbus$Y))
w[w > 1000000] = Inf  # remove some relations acc. to a pre-defined rule
w = 1/(1+w)
diag(w) = 0
w = w / (sum(w) / length(columbus$X))  # "C style" standardization
2*sumsq(w)
# spdep calculation
neighs.band = dnearneigh(cbind(columbus$X, columbus$Y), 0, 1000, longlat = TRUE)
w.spdep = lapply(nbdists(neighs.band, cbind(columbus$X, columbus$Y), longlat = TRUE),
                 function(x) 1/(0.001+x))
my.listw = nb2listw(neighs.band, glist = w.spdep, style = "C")
tracew(my.listw)
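One way to see where the two calculations diverge is to densify the listw object and apply both formulas to the very same matrix (a debugging sketch using spdep's listw2mat() and the tracew_dense() helper sketched above):
W2 <- listw2mat(my.listw)   # dense form of the weights that tracew() actually sees
c(loop   = tracew(my.listw),
  dense  = sum(W2^2) + sum(W2 * t(W2)),  # what the loop computes
  sumsq2 = 2 * sum(W2^2))                # the shortcut, valid only if W2 is symmetric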

Related

R code for replacing the values of a Matrix

Hey everyone, I have a large matrix X with dimensions 645 x 7095. I want to subset this matrix and replace the values of the subsetted part of X with another matrix which I have created. The R code is as follows:
install.packages("Matrix")
install.packages("base")
library(Matrix)
library(base)
T = 215
n = 3
k = 33
X = matrix(0,T*n,T*k)
IN = diag(n)
K1 = Matrix(0, n*n, n*(n-1)/2, sparse = TRUE)
for(i in 1:(n-1)){
K1[(2+(i-1)*(n+1)):(i*n), (1+(i-1)*(n-i/2)):(i*(n-i)*(i+1)/2)] <- diag(n-i)
}
yin = matrix(rnorm(645), ncol = 3)
Xu = matrix(rnorm(2150), ncol = 10)
#Till yet I have defined the variables and matrices which will be used in subsetting.
The code above runs perfectly fine; however, the code below throws an error:
# Loop for X subsetting
for(i in 1:T){
  X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <- cbind(t(kronecker(yin[i,], IN)) %*% K1,
                                                       t(kronecker(Xu[i,], IN)))
}
Here kronecker() computes the Kronecker tensor product of two matrices A and B; it is available in base R.
When I run the code above, the error shown is:
Error in X[(((i - 1) * n) + 1):(i * n), ] <- cbind((t(kronecker(yin[i, :
number of items to replace is not a multiple of replacement length
However, when I run the same command in MATLAB it works perfectly fine. MATLAB code:
X = zeros(T*n, T*k);
for i = 1:T
    X((i-1)*n+1:i*n, (i-1)*k+1:i*k) = [kron(yin(i,:), IN)*K1, kron(Xu(i,:), IN)];
end
MATLAB fills in the values block by block, exactly over the rows and columns defined in the loop. I have attached a snapshot of the desired output that MATLAB gives. However, R throws an error for the same operation.
Can someone enlighten me as to where I am going wrong with the R code?
I appreciate the help, many thanks.
I think the problem is how the class 'dgeMatrix' is handled. Try
for (i in 1:T) {
  X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <-
    as.matrix(cbind(t(kronecker(yin[i,], IN)) %*% K1, t(kronecker(Xu[i,], IN))))
}
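For what it's worth, a quick way to see what is going on is to inspect the intermediate object (a sketch using the objects from the question; the exact class can vary with the Matrix version):
blk <- cbind(t(kronecker(yin[1, ], IN)) %*% K1, t(kronecker(Xu[1, ], IN)))
class(blk)  # typically an S4 Matrix class such as "dgeMatrix", because K1 is a sparse Matrix
dim(blk)    # 3 x 33, i.e. the n x k block being filled
X[1:n, 1:k] <- as.matrix(blk)  # coercing to a base matrix makes the sub-assignment behave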

Converting Mahalanobis p1 probabilities to p2 probabilities - is vectorization possible in this context?

I'm trying to write a function that takes in p1 probabilities for Mahalanobis distances and returns p2 probabilities. The formula for p2, along with a worked example, is given on the IBM website. I have written a function (below) that solves the problem and lets me reproduce the p2 values given in the worked example on that page.
p1_to_p2 <- function(p1, N) {
  p2 <- numeric(length(p1))
  for (i in 1:length(p1)) {
    k <- i
    p1_value <- p1[i]
    start_value <- 1
    while (k >= 1) {
      start_value = start_value - choose(N, N-k+1) * (1-p1_value)^(N-k+1) * (p1_value)^(k-1)
      k <- k-1
    }
    p2[i] <- start_value
  }
  return(p2)
}
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
N <- 73
p1_to_p2(p1,N)
Although the function works, a colleague has suggested that it is inefficient/poorly written because it is not vectorized. This is potentially relevant since in general we will be converting a lot more than just 5 p1 values to p2 values.
I have some limited experience vectorizing code, but I am wondering if a vectorized solution is possible in this context since within the loop the variable start_value constantly needs to update itself. If vectorization is not possible, is there some other way I should improve the code so that it works better?
Here is one way to do it. Breaking the calculation into steps can help (please read the comments):
# Input:
N <- 73
p1 <- c(.0046132, .0085718, .0390278, .0437704, .0475222)
n <- N:(N-length(p1)+1)
# Code:
mahalanobis_dist = function(x = x, n){
  m = max(n)
  max_min = Reduce(`*`, c(1, n[-length(n)]), accumulate = TRUE)
  acc = c(1, Reduce(`*`, seq_along(n), accumulate = TRUE)[-length(n)])
  comns = max_min/acc                      # the binomial coefficients choose(N, m - n)
  exp <- comns*((1 - x)**n)*(x**(m - n))   # the terms of the sum
  return(1 - sum(exp))
}  # converts a single p1 value to its p2 value
# This is just an iterator for each of the sequences over which we have to run the function above,
# creating a list of iterators
ls <- lapply(n, function(x)(max(n):x))
# mapply (or Map) can iterate over multiple inputs of the function in parallel;
# here the inputs are p1 (your input points) and ls (the iterators created above)
mapply(mahalanobis_dist, p1, ls)
# Output:
#> mapply(mahalanobis_dist, p1, ls)
#[1] 0.2864785 0.1299047 0.5461263 0.3973690
#[5] 0.2662369
Note: the last two steps can also be joined as below; with one function and the correct iterators the same result is achieved:
mapply(mahalanobis_dist,p1, lapply(n, function(x)(max(n):x)))
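As an aside, the inner sum is the lower tail of a Binomial(N, p1) distribution, so under the same pairing of the i-th p1 value with k = i the whole conversion appears to reduce to a single vectorized pbinom() call (a sketch, checked only against the values printed above):
p1 <- c(.0046132, .0085718, .0390278, .0437704, .0475222)
N  <- 73
# p2 = P(Binomial(N, p1) >= k), i.e. the upper tail at k - 1
pbinom(seq_along(p1) - 1, size = N, prob = p1, lower.tail = FALSE)
#[1] 0.2864785 0.1299047 0.5461263 0.3973690 0.2662369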

Computing the null space of a bigmatrix in R

I cannot find any function or package to calculate the null space (or the QR decomposition) of a bigmatrix (from library(bigmemory)) in R. For example:
library(bigmemory)
a <- big.matrix(1000000, 1000, type='double', init=0)
I tried the following but got the errors shown. How can I find the null space of a bigmemory object?
a.qr <- Matrix::qr(a)
# Error in as.vector(data) :
# no method for coercing this S4 class to a vector
q.null <- MASS::Null(a)
# Error in as.vector(data) :
# no method for coercing this S4 class to a vector
If you want to compute the full SVD of the matrix, you can use package bigstatsr to perform the computation by blocks. An FBM stands for a Filebacked Big Matrix and is an object similar to a filebacked big.matrix object of package bigmemory.
library(bigstatsr)
options(bigstatsr.block.sizeGB = 0.5)

# Initialize FBM with random numbers
a <- FBM(1e6, 1e3)
big_apply(a, a.FUN = function(X, ind) {
  X[, ind] <- rnorm(nrow(X) * length(ind))
  NULL
}, a.combine = 'c')

# Compute t(a) * a
K <- big_crossprodSelf(a, big_scale(center = FALSE, scale = FALSE))

# Get v and d, where a = u * d * t(v) is the SVD of a
eig <- eigen(K[])
v <- eig$vectors
d <- sqrt(eig$values)

# Get u if you need it. It will be the same size as a,
# so I store it as an FBM.
u <- FBM(nrow(a), ncol(a))
big_apply(u, a.FUN = function(X, ind, a, v, d) {
  X[ind, ] <- sweep(a[ind, ] %*% v, 2, d, "/")
  NULL
}, a.combine = 'c', block.size = 50e3, ind = rows_along(u),
   a = a, v = v, d = d)

# Verification
ind <- sample(nrow(a), 1000)
all.equal(a[ind, ], tcrossprod(sweep(u[ind, ], 2, d, "*"), v))
This takes approximately 10 minutes on my computer.
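If it is specifically the null space you are after, it can be read off from the factors computed above: the null space of a is spanned by the columns of v whose singular values are numerically zero. A sketch, with the tolerance being an assumption in the spirit of the usual SVD-based rank test:
tol <- max(dim(a)) * max(d) * .Machine$double.eps
null_basis <- v[, d < tol, drop = FALSE]  # each column is a null-space basis vector
dim(null_basis)  # for the random matrix above this will typically be empty (full column rank)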
@Mahon @user20650 @F.Privé For clarity I pinged the bigmemory team and asked:
Essentially, is there an implementation of the QR function (QR decomposition) that works with big memory matrices?
I felt it was useful to get clarity on the original question asked. @F.Privé, nice answer. Hopefully your answer and their response will help guide people in the future. Their response is below:
Thanks for the note. There is not currently an implementation of the qr decomposition. Ideally, you would implement this using Householder reflections (if the matrix is dense) or Givens rotations (if it is sparse).
The irlba package is compatible with bigmemory. It provides a truncated singular value decomposition. So, if your matrix is relatively sparse, you could truncate at the rank of the matrix. This is probably your best option. If you don't know the rank then you can use the package to update the truncation iteratively.
Please note that if your matrix is tall and skinny (or short and fat) then the SO solution above is OK. However, any time you resort to calculating the cross-product you lose some numerical stability. This can be an issue if you are planning on inverting the matrix.
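For reference, a truncated SVD call with irlba looks roughly like this on an ordinary matrix; whether a big.matrix can be passed directly or needs a custom multiplication routine depends on the irlba version, so treat this as a sketch:
library(irlba)
m <- matrix(rnorm(1e4 * 50), 1e4, 50)  # stand-in for the large matrix
s <- irlba(m, nv = 10)                 # 10 largest singular triplets
str(s[c("d", "u", "v")])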

R: How to calculate and sort two variables with a for loop

I want to perform an IDW cross-validation and find out which "power" value gives the smallest RMSE. In order to do this, I want to store the power and RMSE values in a list and sort them by the smallest RMSE.
For example, I'd like something like this:
RMSE Power
[1,] 1.230 2.5
[2,] 1.464 1.5
[3,] 1.698 2.0
[4,] 1.932 3.0
What I have so far is this:
require(sp)
require(gstat)
data("meuse")

#### create grid:
pixels <- 500  # define resolution
# define extent
raster.grd <- expand.grid(x = seq(floor(min(x = meuse$x)),
                                  ceiling(max(x = meuse$x)),
                                  length.out = pixels),
                          y = seq(floor(min(y = meuse$y)),
                                  ceiling(max(y = meuse$y)),
                                  length.out = pixels))
# convert the data frame to spatial points and then to spatial pixels
grd.pts <- SpatialPixels(SpatialPoints((raster.grd)))
grd <- as(grd.pts, "SpatialGrid")
gridded(grd) = TRUE

#### perform IDW and loop through different power values
power = seq(from = 1.5, to = 3, by = 0.5)
results = list()
results.cv = list()
for(i in power) {
  results[[paste0(i, "P")]] <- gstat::idw(meuse$zinc ~ 1, meuse, grd, idp = i)
  results.cv[[paste0(i, "P")]] <- krige.cv(zinc ~ 1, meuse, nfold = nrow(meuse), set = list(idp = i))
}
Now my attempt to calculate and store the RMSE with a for-loop:
results_rmse <- list()
pwr <- names(results.cv)
for(i in results.cv){        # for each element (1.5P, 2P, etc.) in results.cv
  for(j in 1:length(pwr)){   # for each power
    results_rmse <- sqrt(mean(i$residual^2))
    print(pwr[j])
  }
  print(paste("RMSE", results_rmse))
}
But with this loop, it prints each RMSE individually. So I changed the code like this
results_rmse[[i]] <- sqrt(mean(i$residual^2))
But then I get an error
Error in results_rmse[[i]] <- sqrt(mean(i$residual^2)) : invalid subscript type 'S4'
I tried several versions of the for-loop, but I couldn't even figure out how to store the values in a list, let alone sort them by the smallest RMSE.
The extra loop over j in the RMSE calculation is not needed, as far as I understand the problem. Also, I rearranged the loop so that it cycles through a sequence of indices rather than calling the elements by their names.
# Data, because your script doesn't run for me. The rest is identical to your code.
for(i in power) {
  results.cv[[paste0(i, "P")]]$residual <- rnorm(50)
}
# Fixed loop
for(i in 1:length(results.cv)){
  results_rmse[[i]] <- sqrt(mean(results.cv[[i]]$residual^2))
}
names(results_rmse) <- names(results.cv)
Alternatively, the for loop can be avoided with lapply. The result is a named list corresponding to the input names, so the last line above can be omitted to achieve the same results_rmse.
results_rmse <- lapply(results.cv, function(x) sqrt(mean(x$residual^2)))
To print the data as you showed in your question:
cbind(RMSE=unlist(results_rmse), Power=power)
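And to sort by the smallest RMSE, as asked in the question (a small addition using the objects built above):
res <- cbind(RMSE = unlist(results_rmse), Power = power)
res[order(res[, "RMSE"]), ]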

R: gdistance different results for accCost and costDistance

The accCost() and costDistance() functions from the R package gdistance produce different values when going from a source coordinate A to a destination coordinate B. Shouldn't the cost accumulation value at B be equivalent to the costDistance value from A to B, given an equivalent anisotropic transition matrix and that both functions use the Dijkstra algorithm?
If not, what is the fundamental difference between the calculations? If so, what accounts for the different values produced by the code below? In the example, the A-to-B costDistance is 0.13 hours, while accCost gives 0.11 hours at point B. My other tests suggest that accCost is consistently less than costDistance, and considerably so over long distances. The code is based on the example provided in the accCost documentation.
require(gdistance)
r <- raster(system.file("external/maungawhau.grd", package="gdistance"))
altDiff <- function(x){x[2] - x[1]}
hd <- transition(r, altDiff, 8, symm=FALSE)
slope <- geoCorrection(hd)
adj <- adjacent(r, cells=1:ncell(r), pairs=TRUE, directions=8)
speed <- slope
speed[adj] <- 6 * 1000 * exp(-3.5 * abs(slope[adj] + 0.05))  # 1000 to convert to a common spatial unit of meters
Conductance <- geoCorrection(speed)
A <- matrix(c(2667670, 6479000),ncol=2)
B <- matrix(c(2667800, 6479400),ncol=2)
ca <- accCost(Conductance,fromCoords=A)
extract(ca,B)
costDistance(Conductance,fromCoords=A,toCoords=B)
There should be no difference. The current version of accCost has a small bug that arises from a change in the igraph package.
For the moment, please see if this function solves the problem.
setMethod("accCost", signature(x = "TransitionLayer", fromCoords = "Coords"),
def = function(x, fromCoords)
{
fromCoords <- .coordsToMatrix(fromCoords)
fromCells <- cellFromXY(x, fromCoords)
if(!all(!is.na(fromCells))){
warning("some coordinates not found and omitted")
fromCells <- fromCells[!is.na(fromCells)]
}
tr <- transitionMatrix(x)
tr <- rBind(tr,rep(0,nrow(tr)))
tr <- cBind(tr,rep(0,nrow(tr)))
startNode <- nrow(tr) #extra node to serve as origin
adjP <- cbind(rep(startNode, times=length(fromCells)), fromCells)
tr[adjP] <- Inf
adjacencyGraph <- graph.adjacency(tr, mode="directed", weighted=TRUE)
E(adjacencyGraph)$weight <- 1/E(adjacencyGraph)$weight
shortestPaths <- shortest.paths(adjacencyGraph, v=startNode, mode="out")[-startNode]
result <- as(x, "RasterLayer")
result <- setValues(result, shortestPaths)
return(result)
}
)
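After sourcing this patched method, the comparison from the question can be re-run; with the fix the two values should agree (a quick check, not verified output):
ca2 <- accCost(Conductance, fromCoords = A)
extract(ca2, B)                                            # accumulated cost at B
costDistance(Conductance, fromCoords = A, toCoords = B)    # should now match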
This issue has been resolved in gdistance 1.2-1.
