R: gdistance different results for accCost and costDistance

The accCost() and costDistance() functions from R's gdistance package produce different values when going from source coordinate A to destination coordinate B. Shouldn't the cost accumulation value at B be equivalent to the costDistance value from A to B, given the same anisotropic transition matrix and the fact that both functions use Dijkstra's algorithm?
If not, what is the fundamental difference between the calculations? If so, what accounts for the different values produced by the code below? In the example, the A-to-B costDistance is 0.13 hours, while accCost gives 0.11 hours at point B. My other tests suggest that accCost is consistently less than costDistance, and considerably so over long distances. The code is based on the example in the accCost documentation.
require(gdistance)
r <- raster(system.file("external/maungawhau.grd", package = "gdistance"))
altDiff <- function(x) { x[2] - x[1] }
hd <- transition(r, altDiff, 8, symm = FALSE)
slope <- geoCorrection(hd)
adj <- adjacent(r, cells = 1:ncell(r), pairs = TRUE, directions = 8)
speed <- slope
# 1000 to convert to a common spatial unit of meters
speed[adj] <- 6 * 1000 * exp(-3.5 * abs(slope[adj] + 0.05))
Conductance <- geoCorrection(speed)
A <- matrix(c(2667670, 6479000), ncol = 2)
B <- matrix(c(2667800, 6479400), ncol = 2)
ca <- accCost(Conductance, fromCoords = A)
extract(ca, B)
costDistance(Conductance, fromCoords = A, toCoords = B)

There should be no difference. The current version of accCost has a small bug that arises from a change in the igraph package.
For the moment, please see if this function solves the problem.
setMethod("accCost", signature(x = "TransitionLayer", fromCoords = "Coords"),
def = function(x, fromCoords)
{
fromCoords <- .coordsToMatrix(fromCoords)
fromCells <- cellFromXY(x, fromCoords)
if(!all(!is.na(fromCells))){
warning("some coordinates not found and omitted")
fromCells <- fromCells[!is.na(fromCells)]
}
tr <- transitionMatrix(x)
tr <- rBind(tr,rep(0,nrow(tr)))
tr <- cBind(tr,rep(0,nrow(tr)))
startNode <- nrow(tr) #extra node to serve as origin
adjP <- cbind(rep(startNode, times=length(fromCells)), fromCells)
tr[adjP] <- Inf
adjacencyGraph <- graph.adjacency(tr, mode="directed", weighted=TRUE)
E(adjacencyGraph)$weight <- 1/E(adjacencyGraph)$weight
shortestPaths <- shortest.paths(adjacencyGraph, v=startNode, mode="out")[-startNode]
result <- as(x, "RasterLayer")
result <- setValues(result, shortestPaths)
return(result)
}
)
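If this patched method is sourced into the session, re-running the comparison from the question should now give matching values (a quick check, reusing the objects built above):
ca <- accCost(Conductance, fromCoords = A)
extract(ca, B)                                           # accumulated cost at B
costDistance(Conductance, fromCoords = A, toCoords = B)  # should now agree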

This issue has been resolved in gdistance 1.2-1.

Related

Converting Mahalanobis p1 probabilities to p2 probabilities - is vectorization possible in this context?

I'm trying to write a function that takes in p1 probabilities for Mahalanobis distances and returns p2 probabilities. The formula for p2, along with a worked example, is given on the IBM website. I have written a function (below) that solves the problem and allows me to reproduce the p2 values given in the worked example on that webpage.
p1_to_p2 <- function(p1, N) {
  p2 <- numeric(length(p1))
  for (i in 1:length(p1)) {
    k <- i
    p1_value <- p1[i]
    start_value <- 1
    while (k >= 1) {
      start_value <- start_value - choose(N, N - k + 1) * (1 - p1_value)^(N - k + 1) * (p1_value)^(k - 1)
      k <- k - 1
    }
    p2[i] <- start_value
  }
  return(p2)
}
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
N <- 73
p1_to_p2(p1,N)
Although the function works, it's been suggested to me by a colleague that it's inefficient/poorly written as it's not vectorized. This is indeed potentially relevant since in general we will be converting a lot more than just 5 p1 values to p2 values.
I have some limited experience vectorizing code, but I am wondering if a vectorized solution is possible in this context since within the loop the variable start_value constantly needs to update itself. If vectorization is not possible, is there some other way I should improve the code so that it works better?
Here is one way to do it. Breaking it into steps helps (please read the comments):
# Input:
N <- 73
p1 <- c(.0046132, .0085718, .0390278, .0437704, .0475222)
n <- N:(N - length(p1) + 1)

# Code:
mahalanobis_dist <- function(x, n) {
  m <- max(n)
  # build the binomial coefficients choose(m, m - n) without calling choose():
  # cumulative products of m, m-1, ... divided by cumulative factorials
  max_min <- Reduce(`*`, c(1, n[-length(n)]), accumulate = TRUE)
  acc <- c(1, Reduce(`*`, seq_along(n), accumulate = TRUE)[-length(n)])
  comns <- max_min / acc
  # the summation terms of the p2 formula for a single p1 value
  exp <- comns * ((1 - x)**n) * (x**(m - n))
  return(1 - sum(exp))
}

# A list of iterators: one sequence of exponents per p1 value,
# over which the function above has to run
ls <- lapply(n, function(x) (max(n):x))

# mapply (or Map) iterates over multiple inputs in parallel:
# p1 holds the input probabilities, ls the iterators created above
mapply(mahalanobis_dist, p1, ls)

# Output:
#> mapply(mahalanobis_dist, p1, ls)
#[1] 0.2864785 0.1299047 0.5461263 0.3973690
#[5] 0.2662369
Note:
The last two steps can also be combined, supplying the function and the list of iterators in a single call:
mapply(mahalanobis_dist, p1, lapply(n, function(x) (max(n):x)))
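As an aside, the summation in the p2 formula is the lower tail of a Binomial(N, p1) distribution, so (if that reading of the formula is correct) the whole conversion can be done in a single vectorized call to pbinom; a minimal sketch:
# Vectorized alternative, assuming p2_i = 1 - sum_{j=0}^{i-1} choose(N, j) * p1_i^j * (1 - p1_i)^(N - j),
# i.e. the upper tail of a Binomial(N, p1_i) distribution at i
p1 <- c(.0046132, .0085718, .0390278, .0437704, .0475222)
N <- 73
pbinom(seq_along(p1) - 1, size = N, prob = p1, lower.tail = FALSE)
# should match the loop-based p1_to_p2(p1, N) above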

How to optimize finding trace of a square matrix multiplication?

I'm trying to optimize a function from R's spdep package for my use case, since it is very slow for large databases. I was doing mostly fine, but I got stuck at one point: finding the trace of my weights matrix for the LM error test. I think the formula is tr[(W' + W)W] (page 82 of Anselin, L., Bera, A. K., Florax, R. and Yoon, M. J. 1996, Simple diagnostic tests for spatial dependence. Regional Science and Urban Economics, 26, 77–104). W is a square weights matrix holding the spatial relation of each observation to the others, and the tr() operation is the sum of the diagonal elements.
In my case, the weights matrix is symmetric and its diagonal is zero, so I thought tr[(W' + W)W] equals 2*sumsq(W), which is super fast. But apparently I am mistaken somewhere, because the results do not match those of the spdep library, which is likely to be right.
The relevant part of the spdep library is shown below. Can anybody help me understand how the result of the following function differs from 2*sumsq(W), or how to make it much faster? This function is where lm.LMtests gets clogged for large data sets.
tracew <- function(listw) {
  dlmtr <- 0
  n <- length(listw$neighbours)
  if (n < 1) stop("non-positive n")
  ndij <- card(listw$neighbours)
  dlmtr <- 0
  for (i in 1:n) {
    dij <- listw$neighbours[[i]]
    wdij <- listw$weights[[i]]
    for (j in seq(length = ndij[i])) {
      k <- dij[j]
      # Luc Anselin 2006-11-11 problem with asymmetric listw
      dk <- which(listw$neighbours[[k]] == i)
      if (length(dk) > 0L && dk > 0L &&
          dk <= length(listw$neighbours[[k]]))
        wdk <- listw$weights[[k]][dk]
      else wdk <- 0
      dlmtr <- dlmtr + (wdij[j] * wdij[j]) + (wdij[j] * wdk)
    }
  }
  dlmtr
}
Additional explanation for those who are not familiar with R's spdep library:
The input of the function, listw, holds a "graph" representation of the weight matrix as two lists of lists. listw$neighbours is a list in which each item is the vector of indices of the observations to which that observation is related. listw$weights is a list with the same structure as listw$neighbours, except that it holds the weights of those relations.
Thanks in advance for any comments and directions.
# example code
# initialize
library(spdep)
library(multiway)   # for sumsq()
library(geosphere)  # for distm()
# load the tracew function above
data(columbus)
# the difference becomes dramatic when n is high; try without replicating first to see the results
columbus <- columbus[rep(row.names(columbus), 20), ]

# manual calculation, using sumsq
w <- distm(cbind(columbus$X, columbus$Y))
w[w > 1000000] <- Inf  # remove some relations according to a pre-defined rule
w <- 1 / (1 + w)
diag(w) <- 0
w <- w / (sum(w) / length(columbus$X))  # "C style" standardization
2 * sumsq(w)

# spdep calculation
neighs.band <- dnearneigh(cbind(columbus$X, columbus$Y), 0, 1000, longlat = TRUE)
w.spdep <- lapply(nbdists(neighs.band, cbind(columbus$X, columbus$Y), longlat = TRUE),
                  function(x) 1 / (0.001 + x))
my.listw <- nb2listw(neighs.band, glist = w.spdep, style = "C")
tracew(my.listw)
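One way to see whether the 2*sumsq shortcut itself is at fault, rather than the two weight matrices simply being different, is to densify the very listw object that tracew receives and evaluate both expressions on the same matrix (a sketch, using spdep's listw2mat converter and the identity tr[(W' + W)W] = sum((t(W) + W) * W)):
# Sanity check on the same weights matrix (compare with the tracew(my.listw) value above)
W <- listw2mat(my.listw)                    # dense form of the listw object
sum(diag((t(W) + W) %*% W))                 # literal tr[(W' + W) W]
sum((t(W) + W) * W)                         # same trace without the full matrix product
2 * sum(W^2)                                # the proposed shortcut; valid only if W is symmetric
isSymmetric(W, check.attributes = FALSE)    # if FALSE, the symmetry assumption does not hold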

input k-means in R

I'm trying to perform k-means on a data frame with 69 columns and 1000 rows. First, I need to decide on the optimal number of clusters using the Davies-Bouldin index. This algorithm requires that the input be in the form of a matrix, so I used this code first:
totalm <- data.matrix(total)
followed by this code (Davies-Bouldin index):
clusternumber <- 0
max_cluster_number <- 30
# Davies-Bouldin algorithm
library(clusterCrit)
smallest <- 99999
for (b in 2:max_cluster_number) {
  a <- 99999
  for (i in 1:200) {
    cl <- kmeans(totalm, b)
    cl <- as.numeric(cl)
    intCriteria(totalm, cl$cluster, c("dav"))
    if (intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin < a) {
      a <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    }
  }
  if (a < smallest) {
    smallest <- a
    clusternumber <- b
  }
}
print("##clusternumber##")
print(clusternumber)
print("##smallest##")
print(smallest)
I keep getting this error: (list) object cannot be coerced to type 'double'.
How can I solve this?
Reproducible example:
a <- c(0,0,1,0,1,0,0)
b <- c(0,0,1,0,0,0,0)
c <- c(1,1,0,0,0,0,1)
d <- c(1,1,0,0,0,0,0)
total <- cbind(a,b,c,d)
The error is coming from cl<-as.numeric(cl). The result of a call to kmeans is an object, which is a list containing various information about the model.
Run ?kmeans
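For example, inspecting a fitted model shows why as.numeric() fails and which component intCriteria() actually needs (a small illustration; the component names are documented in ?kmeans):
cl <- kmeans(totalm, 2)
is.list(cl)   # TRUE: a whole list cannot be coerced to a single numeric vector
names(cl)     # "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
cl$cluster    # the integer vector of cluster assignments that intCriteria() expects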
I would also recommend adding nstart = 20 to your kmeans call. k-means clustering is a random process, so this runs the algorithm 20 times for each number of centers and keeps the best fit.
for (b in 2:max_cluster_number) {
  a <- 99999
  for (i in 1:200) {
    cl <- kmeans(totalm, centers = b, nstart = 20)
    # cl <- as.numeric(cl)
    intCriteria(totalm, cl$cluster, c("dav"))
    if (intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin < a) {
      a <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    }
  }
  if (a < smallest) {
    smallest <- a
    clusternumber <- b
  }
}
This gave me
[1] "##clusternumber##"
[1] 4
[1] "##smallest##"
[1] 0.138675
(temporarily changing the maximum number of clusters to 4, since the reproducible data set is small)
EDIT Integer Error
I was able to reproduce your error using
a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))
totalm <- cbind(a,b,c,d)
So that an integer matrix is created.
I was then able to remove the error by using
storage.mode(totalm) <- "double"
Note that
total <- cbind(a,b,c,d)
totalm <- data.matrix(total)
is unnecessary for the data in this example
> identical(total,totalm)
[1] TRUE

R Simulation Programming Efficiency

I am a relatively new R programmer and have written a script that takes some statistical results and ultimately compares them to a vector of results in which the target variable has been randomized. The result vector contains the statistical results of n simulations. As the number of simulations increases (I would like to run at least 10,000), the run time becomes longer than I would like. I have tried to improve performance in the ways I know, but would love the help of others in optimizing the code. The relevant part is below.
# CREATE DATA
require(plyr)
Simulations <- 10001
Variation <- c("Control", "A", "B", "C")
Trials <- c(727, 724, 723, 720)
NonResponse <- c(692, 669, 679, 682)
Response <- c(35, 55, 44, 38)
ConfLevel <- .95

# PERFORM INITIAL CALCS
NonResponse <- Trials - Response
Data <- data.frame(Variation, NonResponse, Response, Trials)
total <- ddply(Data, .(Variation), function(x) {
  data.frame(value = rep(c(0, 1), times = c(x$NonResponse, x$Response)))
})
total <- total[sample(1:nrow(total)), ]
colnames(total) <- c("Variation", "Response")

# CREATE FUNCTION TO PERFORM SIMULATIONS
targetshuffle <- function(x) {
  shuffle_target <- x[, "Response"]
  shuffle_target <- data.frame(sample(shuffle_target))
  revised <- cbind(x[, "Variation"], shuffle_target)
  colnames(revised) <- c("Variation", "Yes")
  yes_variation <- data.frame(table(revised$Yes, revised$Variation))
  colnames(yes_variation) <- c("Yes", "Variation", "Shuffled_Response")
  Shuffled_Data <- subset(yes_variation, yes_variation$Yes == 1)
  Shuffled_Data <- Shuffled_Data[match(Variation, Shuffled_Data$Variation), ]
  yes_variation <- cbind(Data, Shuffled_Data)
  VectorPTest_All <- yes_variation[, c("Variation", "NonResponse", "Response", "Trials", "Shuffled_Response")]
  Control_Only <- yes_variation[yes_variation$Variation == "Control", ]
  VectorPTest_Chall <- subset(yes_variation, !(Variation == "Control"))
  VectorPTest_Chall <- VectorPTest_Chall[, c("Variation", "NonResponse", "Response", "Trials", "Shuffled_Response")]
  ControlResponse <- Control_Only$Response
  ControlResponseRevised <- Control_Only$Shuffled_Response
  ControlTotal <- Control_Only$Trials
  VariationCount <- length(VectorPTest_Chall$Variation)
  VP <- data.frame(c(VectorPTest_Chall, rep(ControlResponse), rep(ControlResponseRevised), rep(ControlTotal)))
  names(VP) <- c("Variation", "NonResponse", "Response", "Trials", "ResponseShuffled", "ControlReponse",
                 "ControlResponseShuffled", "ControlTotal")
  VP1 <<- data.frame(VP[, c(5, 7, 4, 8)])
  VP2 <<- data.frame(VP[, c(3, 6, 4, 8)])
  ptest <- apply(VP1, 1, function(column) prop.test(x = c(column[1], column[2]),
                                                    n = c(column[3], column[4]),
                                                    alternative = "two.sided",
                                                    conf.level = ConfLevel, correct = FALSE)$p.value)
  min_p_value <- min(ptest)
  return(min_p_value)
}

# CALL FUNCTION
sim_result <- do.call(rbind, rlply(Simulations, targetshuffle(total)))
Offhand, one thing to look at is creating all the data frames. Each time you do that you're copying all the data in the constituent object. If the dimensions are predictable you might consider creating empty matrices at the beginning of the function and populating them as you go.
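A minimal sketch of that preallocation idea applied to the final collection step (it leaves targetshuffle() unchanged and only replaces the rlply/rbind call):
# Reserve the result vector once and fill it in place,
# instead of building a list and rbind-ing it afterwards
sim_result <- numeric(Simulations)
for (s in seq_len(Simulations)) {
  sim_result[s] <- targetshuffle(total)
}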

choose n most distant points in R

Given a set of xy coordinates, how can I choose n points such that those n points are most distant from each other?
An inefficient method that probably wouldn't do too well with a big dataset would be the following (identify 20 points out of 1000 that are most distant):
xy <- cbind(rnorm(1000), rnorm(1000))
n <- 20

bestavg <- 0
bestSet <- NA
for (i in 1:1000) {
  subset <- xy[sample(1:nrow(xy), n), ]
  avg <- mean(dist(subset))
  if (avg > bestavg) {
    bestavg <- avg
    bestSet <- subset
  }
}
This code, based on Pascal's code, repeatedly drops the point with the smallest row sum in the distance matrix (i.e. the point closest to all the others) until n points remain.
m2 <- function(xy, n) {
  subset <- xy
  alldist <- as.matrix(dist(subset))
  while (nrow(subset) > n) {
    cdists <- rowSums(alldist)
    closest <- which(cdists == min(cdists))[1]
    subset <- subset[-closest, ]
    alldist <- alldist[-closest, -closest]
  }
  return(subset)
}
Run on a Gaussian cloud, where m1 is @Pascal's function:
> set.seed(310366)
> xy <- cbind(rnorm(1000),rnorm(1000))
> m1s = m1(xy,20)
> m2s = m2(xy,20)
See who did best by looking at the sum of the interpoint distances:
> sum(dist(m1s))
[1] 646.0357
> sum(dist(m2s))
[1] 811.7975
Method 2 wins! And compare with a random sample of 20 points:
> sum(dist(xy[sample(1000,20),]))
[1] 349.3905
which does pretty poorly as expected.
So what's going on? Let's plot:
> plot(xy,asp=1)
> points(m2s,col="blue",pch=19)
> points(m1s,col="red",pch=19,cex=0.8)
Method 1 generates the red points, which are evenly spaced out over the space. Method 2 creates the blue points, which almost define the perimeter. I suspect the reason for this is easy to work out (and even easier in one dimension...).
Using a bimodal pattern of initial points also illustrates this: again method 2 produces a much larger total sum of distances than method 1, and both do better than random sampling:
> sum(dist(m1s2))
[1] 958.3518
> sum(dist(m2s2))
[1] 1206.439
> sum(dist(xy2[sample(1000,20),]))
[1] 574.34
Following @Spacedman's suggestion, I have written code that drops a point from the closest pair until the desired number of points remains. It seems to work well; however, it slows down pretty quickly as you add points.
xy <- cbind(rnorm(1000), rnorm(1000))
n <- 20

subset <- xy
alldist <- as.matrix(dist(subset))
diag(alldist) <- NA
alldist[upper.tri(alldist)] <- NA
while (nrow(subset) > n) {
  closest <- which(alldist == min(alldist, na.rm = TRUE), arr.ind = TRUE)
  subset <- subset[-closest[1, 1], ]
  alldist <- alldist[-closest[1, 1], -closest[1, 1]]
}
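To score this approach with the same measure used above, the surviving points can be summed the same way (a small check reusing the xy, n, and subset objects just created):
sum(dist(subset))                     # total interpoint distance of the retained points
sum(dist(xy[sample(nrow(xy), n), ]))  # a random sample of the same size, for reference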
