spBayes spLM function with duplicate coordinates - r

I am using the spRecover function in package spBayes to produce a spatial univariate model.
Here is a reproducible example where there I made a duplicate coordinate point. The modeling procedure itself executes just fine, but it won't let me recover the spatial effects for each site:
require(spBayes)
set.seed(444)
N = 200
y = rnorm(N,0,100)
x = rnorm(N,2,7)
df <- as.data.frame(cbind((rnorm(N,5,2.5)),rep('location1',N)))
coord <- cbind(runif(N,-30,30),runif(N,-180,180))
coord[2,] <- coord [1,]
n.samples <- 1000
bef.sp <- spLM(y ~ x, ## the equation
data = df, coords=coord, ## data and coordinates
starting=list("phi"=3/200,"sigma.sq"=0.08,"tau.sq"=0.02),## start values
tuning=list("phi"=0.1, "sigma.sq"=0.05, "tau.sq"=0.05), ## tuning values
priors=list("phi.Unif"=c(3/1500, 3/50), "sigma.sq.IG"=c(2, 0.08),"tau.sq.IG"=c(2, 0.02)), ## priors
cov.model="exponential",n.samples=n.samples)
burn.in <- floor(0.75*n.samples)
bef.sp <- spRecover(bef.sp, start=burn.in, thin=2)
The error received is:
Error in spRecover(bef.sp, start = burn.in, thin = 2) :
c++ error: dpotrf failed
I found a post by the package author indicating this error might come up if one has replicated coordinates. I definitely have duplicated coordinates, since many sites were sampled many times (on the same day; this is not a time-series issue). How do I get the model to accept that there is lots of replication within each coordinate pair, and to recover individual spatial effects values for each site?
Thanks!

Related

Principal component analysis using R. Automatic and manual results do not match

Two different methods of the principal component analysis were conducted to analyze the following data (ch082.dat) using the Box1's R-code, below.https://drive.google.com/file/d/1xykl6ln-bUnXIs-jIA3n5S3XgHjQbkWB/view?usp=sharing
The first method uses the rotation matrix (See 'ans_mat' under the '#rotated data' of the Box1's code) and,
the second method uses the 'pcomp' function (See 'rpca' under the '#rotated data' of the Box1's code).
However, there is a subtle discrepancy in the answer between the method using the rotation matrix and the method using the 'pcomp' function.
make it match
My Question
What should I do so that the result of the rotation matrix -based method matches the result of the'pcomp' function?
As far as I've tried with various data, including other data, the actual discrepancies seem to be limited to scale shifts and mirroring transformations.
The results of the rotation matrix -based method is shown in left panel.
The results of the pcomp function -based method is shown in right panel.
Mirror inversion can be seen in "ch082.dat" data.(See Fig.1);
It seems that, in some j, the sign of the "jth eigenvector of the correlation matrix" and the sign of the "jth column of the output value of the prcomp function" may be reversed. If there is a degree of overlap in the eigenvalues, it is possible that the difference may be more complex than mirror inversion.
Fig.1
There is a scale shift for the Box2's data (See See Fig.2), despite the centralization and normalization to the data.
Fig.2
Box.1
#dataload
##Use the 'setwd' function to specify the directory containing 'ch082.dat'.
##For example, if you put this file directly under the C drive of your Windows PC, you can run the following command.
setwd("C:/") #Depending on where you put the file, you may need to change the path.
getwd()
w1<-read.table("ch082.dat",header = TRUE,row.names = 1,fileEncoding = "UTF-8")
w1
#Function for standardizing data
#Thanks to https://qiita.com/ohisama2/items/5922fac0c8a6c21fcbf8
standalize <- function(data)
{ for(i in length(data[1,]))
{
x <- as.matrix(data[,i])
y <- (x-mean(x)/sd(x))
data[,i] <- y
}
return(data)}
#Method using rotation matrix
z_=standalize(w1)
B_mat=cor(z_) #Compute correlation matrix
eigen_m <- eigen(B_mat)
sample_mat <- as.matrix(z_)
ans_mat=sample_mat
for(j in 1:length(sample_mat[1,])){
ans_mat[,j]=sample_mat%*%eigen_m$vectors[,j]
}
#Method using "rpca" function
rpca <- prcomp(w1,center=TRUE, scale=TRUE)
#eigen vectors
eigen_m$vectors
rpca
#rotated data
ans_mat
rpca$x
#Graph Plots
par(mfrow=c(1,2))
plot(
ans_mat[,1],
ans_mat[,2],
main="Rotation using eigenvectors"
)
plot(rpca$x[,1], rpca$x[,2],
main="Principal component score")
par(mfrow=c(1,1))
#summary
summary(rpca)$importance
Box2.
sample_data <- data.frame(
X = c(2,4, 6, 5,7, 8,10),
Y = c(6,8,10,11,9,12,14)
)
X = c(2,4, 6, 5,7, 8,10)
Y = c(6,8,10,11,9,12,14)
plot(Y ~ X)
w1=sample_data
Reference
https://logics-of-blue.com/principal-components-analysis/
(Written in Japanease)
The two sets of results agree. First we can simplify your code a bit. You don't need your function or the for loop:
z_ <- scale(w1)
B_mat <- cor(z_)
eigen_m <- eigen(B_mat)
ans_mat <- z_ %*% eigen_m$vectors
Now the prcomp version
z_pca <- prcomp(z_)
z_pca$sdev^2 # Equals eigen_m$values
z_pca$rotation # Equals eigen_m$vectors
z_pca$x # Equals ans_mat
Your original code mislabeled ans_mat columns. They are actually the principal component scores. You can fix that with
colnames(ans_mat) <- colnames(z_pca$x)
The pc loadings (and therefore the scores) are not uniquely defined with respect to reflection. In other words multiplying all of the loadings or scores in one component by -1 flips them but does not change their relationships to one another. Multiply z_pca$x[, 1] by -1 and the plots will match:
z_pca$x[, 1] <- z_pca$x[, 1] * -1
dev.new(width=10, height=6)
par(mfrow=c(1,2))
plot(ans_mat[,1], ans_mat[,2], main="Rotation using eigenvectors")
plot(z_pca$x[,1], z_pca$x[,2], main="Principal component score")

Maximum pseudo-likelihood estimator for soft-core point process

I am trying to fit a soft-core point process model on a set of point pattern using maximum pseudo-likelihood. I followed the instructions given in this paper by Baddeley and Turner
And here is the R-code I came up with
`library(deldir)
library(tidyverse)
library(fields)
#MPLE
# irregular parameter k
k <- 0.4
## Generate dummy points 50X50. "RA" and "DE" are x and y coordinates
dum.x <- seq(ramin, ramax, length = 50)
dum.y <- seq(demin, demax, length = 50)
dum <- expand.grid(dum.x, dum.y)
colnames(dum) <- c("RA", "DE")
## Combine with data and specify which is data point and which is dummy, X is the point pattern to be fitted
bind.x <- bind_rows(X, dum) %>%
mutate(Ind = c(rep(1, nrow(X)), rep(0, nrow(dum))))
## Calculate Quadrature weights using Voronoi cell area
w <- deldir(bind.x$RA, bind.x$DE)$summary$dir.area
## Response
y <- bind.x$Ind/w
# the sum of distances between all pairs of points (the sufficient statistics)
tmp <- cbind(bind.x$RA, bind.x$DE)
t1 <- rdist(tmp)^(-2/k)
t1[t1 == Inf] <- 0
t1 <- rowSums(t1)
t <- -t1
# fit the model using quasipoisson regression
fit <- glm(y ~ t, family = quasipoisson, weights = w)
`
However, the fitted parameter for t is negative which is obviously not a correct value for a softcore point process. Also, my point pattern is actually simulated from a softcore process so it does not make sense that the fitted parameter is negative. I tried my best to find any bugs in the code but I can't seem to find it. The only potential issue I see is that my sufficient statistics is extremely large (on the order of 10^14) which I fear may cause numerical issues. But the statistics are large because my observation window spans a very small unit and the average distance between a pair of points is around 0.006. So sufficient statistics based on this will certainly be very large and my intuition tells me that it should not cause a numerical problem and make the fitted parameter to be negative.
Can anybody help and check if my code is correct? Thanks very much!

How to predict cluster labeling using DBSCAN object and Gower distance matrix for new data in R

I'm having issue with predicting cluster labeling for a test data, based on a dbscan clustering model on the training data.
I used gower distance matrix when creating the model:
> gowerdist_train <- daisy(analdata_train,
metric = "gower",
stand = FALSE,
type = list(asymm = c(5,6)))
Using this gowerdist matrix, the dbscan clustering model created was:
> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)
Then I try to use predict to label a test dataset using the above dbscan object:
> predict(sb, newdata = analdata_test, data = analdata_train)
But I receive the following error:
Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE,
...) : x has to be a numeric matrix
I can take a guess on where this error might be coming from, which is probably due to the absence of the gower distance matrix that hasn't been created for the test data.
My question is, should I create a gower distance matrix for all data (datanal_train + datanal_test) separately and feed it into predict? how else would the algorithm know what the distance of test data from the train data is, in order to label?
In that case, would the newdata parameter be the new gower distance matrix that contains ALL (train + test) data? and the data parameter in predict would be the training distance matrix, gowerdist_train?
What I am not quite sure about is how would the predict algorithm distinguish between the test and train data set in the newly created gowerdist_all matrix?
The two matrices (new gowerdist for all data and the gowerdist_train) would obviously not have the same dimensions. Also, it doesn't make sense to me to create a gower distance matrix only for test data because distances must be relative to the test data, not the test data itself.
Edit:
I tried using gower distance matrix for all data (train + test) as my new data and received an error when fed to predict:
> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
metric = "gower",
stand = FALSE,
type = list(asymm = c(5,6)))
> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)
ERROR: Error in 1:nrow(data) : argument of length 0 In addition:
Warning message: In rbind(data, newdata) : number of columns of
result is not a multiple of vector length (arg 1)
So, my suggested solution doesn't work.
I decided to create a code that would use KNN algorithm in dbscan to predict cluster labeling using gower distance matrix. The code is not very pretty and definitely not programmaticaly efficient but it works. Happy for any suggestions that would improve it.
The pseydocode is:
1) calculate new gower distance matrix for all data, including test and train
2) use the above distance matrix in kNN function (dbscan package) to determine the k nearest neighbours to each test data point.
3) determine the cluster labels for all those nearest points for each test point. Some of them will have no cluster labeling because they are test points themselves
4) create a count matrix to count the frequency of clusters for the k nearest points for each test point
5) use very simple likelihood calculation to choose the cluster for the test point based on its neighbours clusters (the maximum frequency). this part also considers the neighbouring test points. That is, the cluster for the test point is chosen only when the maximum frequency is largest when you add the number of neighbouring test points to the other clusters. Otherwise, it doesn't decide the cluster for that test point and waits for the next iteration when hopefully more of its neighboring test points have had their cluster label decided based on their neighbours.
6) repeat above (steps 2-5) until you've decided all clusters
** Note: this algorithm doesn't converge all the time. (once you do the math, it's obvious why that is) so, in the code i break out of the algorithm when the number of unclustered test points doesn't change after a while. then i repeat 2-6 again with new knn (change the number of nearest neighbours and then run the code again). This will ensure more points are involved in deciding in th enext round. I've tried both larger and smaller knn's and both work. Would be good to know which one is better. I haven't had to run the code more than twice so far to decide the clusters for the test data point.
Here is the code:
#calculate gower distance for all data (test + train)
gowerdist_test <- daisy(all_data[rangeofdataforgowerdist],
metric = "gower",
stand = FALSE,
type = list(asymm = listofasymmvars),
weights = Weights)
summary(gowerdist_test)
Then use the code below to label clusters for test data.
#library(dbscan)
# find the k nearest neibours for each point and order them with distance
iteration_MAX <- 50
iteration_current <- 0
maxUnclusterRepeatNum <- 10
repeatedUnclustNum <- 0
unclusteredNum <- sum(is.na(all_data$Cluster))
previousUnclustereNum <- sum(is.na(all_data$Cluster))
nn_k = 30 #number of neighbourhoods
while (anyNA(all_data$Cluster) & iteration_current < iteration_MAX)
{
if (repeatedUnclustNum >= maxUnclusterRepeatNum) {
print(paste("Max number of repetition (", maxUnclusterRepeatNum ,") for same unclustered data has reached. Clustering terminated unsuccessfully."))
invisible(gc())
break;
}
nn_test <- kNN(gowerdist_test, k = nn_k, sort = TRUE)
# for the TEST points in all data, find the closets TRAIN points and decide statistically which cluster they could belong to, based on the clusters of the nearest TRAIN points
test_matrix <- nn_test$id[1: nrow(analdata_test),] #create matrix of test data knn id's
numClusts <- nlevels(as.factor(sb_train$cluster))
NameClusts <- as.character(levels(as.factor(sb_train$cluster)))
count_clusters <- matrix(0, nrow = nrow(analdata_test), ncol = numClusts + 1) #create a count matrix that would count number of clusters + NA
colnames(count_clusters) <- c("NA", NameClusts) #name each column of the count matrix to cluster numbers
# get the cluster number of each k nearest neibhour of each test point
for (i in 1:nrow(analdata_test))
for (j in 1:nn_k)
{
test_matrix[i,j] <- all_data[nn_test$id[i,j], "Cluster"]
}
# populate the count matrix for the total clusters of the neighbours for each test point
for (i in 1:nrow(analdata_test))
for (j in 1:nn_k)
{
if (!is.na(test_matrix[i,j]))
count_clusters[i, c(as.character(test_matrix[i,j]))] <- count_clusters[i, c(as.character(test_matrix[i,j]))] + 1
else
count_clusters[i, c("NA")] <- count_clusters[i, c("NA")] + 1
}
# add NA's (TEST points) to the other clusters for comparison
count_clusters_withNA <- count_clusters
for (i in 2:ncol(count_clusters))
{
count_clusters_withNA[,i] <- t(rowSums(count_clusters[,c(1,i)]))
}
# This block of code decides the maximum count of cluster for each row considering the number other test points (NA clusters) in the neighbourhood
max_col_countclusters <- apply(count_clusters,1,which.max) #get the column that corresponds to the maximum value of each row
for (i in 1:length(max_col_countclusters)) #insert the maximum value of each row in its associated column in count_clusters_withNA
count_clusters_withNA[i, max_col_countclusters[i]] <- count_clusters[i, max_col_countclusters[i]]
max_col_countclusters_withNA <- apply(count_clusters_withNA,1,which.max) #get the column that corresponds to the maximum value of each row with NA added
compareCountClust <- max_col_countclusters_withNA == max_col_countclusters #compare the two count matrices
all_data$Cluster[1:nrow(analdata_test)] <- ifelse(compareCountClust, NameClusts[max_col_countclusters - 1], all_data$Cluster) #you subtract one because of additional NA column
iteration_current <- iteration_current + 1
unclusteredNum <- sum(is.na(all_data$Cluster))
if (previousUnclustereNum == unclusteredNum)
repeatedUnclustNum <- repeatedUnclustNum + 1
else {
repeatedUnclustNum <- 0
previousUnclustereNum <- unclusteredNum
}
print(paste("Iteration: ", iteration_current, " - Number of remaining unclustered:", sum(is.na(all_data$Cluster))))
if (unclusteredNum == 0)
print("Cluster labeling successfully Completed.")
invisible(gc())
}
I guess you can use this for any other type of clustering algorithm, it doesn't matter how you decided the cluster labels for the train data, as long as they are in your all_data before running the code.
Hope this help.
Not the most efficient or rigorous code. So, happy to see suggestions how to improve it.
*Note: I used t-SNE to compare the clustering of train with the test data and looks impressively clean. so, it seems it is working.

siber.ellipses -> Error in rmultireg(Y, X, Bbar, A, nu, V) : not a matrix

I am very new to R and am currently trying to create siber ellipses.
I watched the potcast Using ellipses to compare community members:(http://www.tcd.ie/Zoology/research/research/theoretical/Rpodcasts.php#siber) and got along just fine in the beginning. Whenever i get to the function of the siber.ellipses i get an Error:
(Error in rmultireg(Y, X, Bbar, A, nu, V) : not a matrix)
I can not figure out why. I get it with my own data as well as with the example data in the zip file provided along with the script.
I have researched the Error message online but could not come up with an answer.
It almost can be an error in the script or the data since I used those exactly as provided. My R version is 3.3.2
Does it have something to do with some kind of settings?
What could be the reason?
Can someone please helpe me :)
thanks
# this demo generates some random data for M consumers based on N samples and
# constructs a standard ellipse for each based on SEAc and SEA_B
rm(list = ls())
library(siar)
-------------------------------------------------------------------------
# ANDREW - REMOVE THESE LINES WHICH SHOULD BE REDUNDANT
# change this line
setwd("C:/Users/elisabeth/Desktop/R/demo")
# -----------------------------------------------------------------------------
# now close all currently open windows
graphics.off()
# read in some data
# NB the column names have to be exactly, "group", "x", "y"
mydata <- read.table("example_ellipse_data.txt",sep="\t",header=T)
# make the column names availble for direct calling
attach(mydata)
# now loop through the data and calculate the ellipses
ngroups <- length(unique(group))
# split the isotope data based on group
spx <- split(x,group)
spy <- split(y,group)
# create some empty vectors for recording our metrics
SEA <- numeric(ngroups)
SEAc <- numeric(ngroups)
TA <- numeric(ngroups)
dev.new()
plot(x,y,col=group,type="p")
legend("topright",legend=as.character(paste("Group ",unique(group))),
pch=19,col=1:length(unique(group)))
for (j in unique(group)){
# Fit a standard ellipse to the data
SE <- standard.ellipse(spx[[j]],spy[[j]],steps=1)
# Extract the estimated SEA and SEAc from this object
SEA[j] <- SE$SEA
SEAc[j] <- SE$SEAc
# plot the standard ellipse with d.f. = 2 (i.e. SEAc)
# These are plotted here as thick solid lines
lines(SE$xSEAc,SE$ySEAc,col=j,lty=1,lwd=3)
# Also, for comparison we can fit and plot the convex hull
# the convex hull is plotted as dotted thin lines
#
# Calculate the convex hull for the jth group's isotope values
# held in the objects created using split() called spx and spy
CH <- convexhull(spx[[j]],spy[[j]])
# Extract the area of the convex hull from this object
TA[j] <- CH$TA
# Plot the convex hull
lines(CH$xcoords,CH$ycoords,lwd=1,lty=3)
}
# print the area metrics to screen for comparison
# NB if you are working with real data rather than simulated then you wont be
# able to calculate the population SEA (pop.SEA)
# If you do this enough times or for enough groups you will easily see the
# bias in SEA as an estimate of pop.SEA as compared to SEAc which is unbiased.
# Both measures are equally variable.
print(cbind(SEA,SEAc,TA))
# So far we have fitted the standard ellipses based on frequentist methods
# and calculated the relevant metrics (SEA and SEAc). Now we turn our attention
# to producing a Bayesian estimate of the standard ellipse and its area SEA_B
reps <- 10^4 # the number of posterior draws to make
# Generate the Bayesian estimates for the SEA for each group using the
# utility function siber.ellipses
SEA.B <- siber.ellipses(x,y,group,R=reps)
Error in rmultireg(Y, X, Bbar, A, nu, V) : not a matrix
Reinstall the SIAR package using Jackson's Github site:
library(devtools)
install_github("andrewljackson/siar#v4.2.2", build_vingettes == TRUE)
library(siar)

Why does gstat.predict() function often return NaN values (GSTAT Package)? (R version 3.3.2, Windows 10)

I am trying to simulate a combination of two different random fields (yy1 and yy2 with different mean and correlation length) with an irregular boundary using Gstat package in R. I have attached the picture of my expected outcome. The code is not giving such output consistently and I am frequently getting atleast one of the yy1 and yy2 as NaNs, which results in the Undesired output as shown in image.
The key steps I used are:
1) Created two gstat objects with different means and psill (rf1 and rf2)
2) Created two computational grids (one for each random field) in the form of data frame with two variables “x” and “y” coordinates.
3) Predicted two random fields using unconditional simulation.
Any help in this regard would be highly appreciated.
Attachments: 2 images (link provided) and 1 R code
1) Expected Outcome
2) Undesired Outcome
library(gstat)
xy <- expand.grid(1:150, 1:200) # 150 x 200 grid is created in the form of a dataframe with x and y vectors
names(xy)<-c('x','y') # giving names to the variables
# creating gsat objects
rf1<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(1,0,0), model=vgm(psill=0.025, range=5, model='Exp'), nmax=20) # dummy=T treats this as a unditional simulation
rf2<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(4,0,0), model=vgm(psill=0.025, range=10, model='Exp'), nmax=20) # dummy=T treats this as a unditional simulation
# creating two computational grid
rows<-nrow(xy)
xy_shift <- expand.grid(60:90, 75:100)
names(xy_shift)<-c('x','y')
library(dplyr) # for antijoin
xy1<-xy[1:(rows/2),]
xy1<-anti_join(xy1, xy_shift, by = c("x","y")) # creating the irregular boundary
xy2<-rbind(xy[(rows/2+1):rows,],xy_shift)
library(sp)
yy1<- predict(rf1, newdata=xy1, nsim=1) # random field 1
yy2<- predict(rf2, newdata=xy2, nsim=1) # random field 2
rf1_label<-gl(1,length(yy1[,1]),labels="field1")
rf2_label<-gl(1,length(yy2[,1]),labels="field2")
yy1<-cbind(yy1,field=rf1_label)
yy2<-cbind(yy2,field=rf2_label)
yy<-rbind(yy1,yy2)
yyplot<-yy[,c(1,2,3)]
# plotting the field
gridded(yyplot) = ~x+y
spplot(obj=yyplot[1],scales=list(draw = TRUE))

Resources