K-means clustering of spatially constrained data - skater in the spdep package - R

I want to cluster the codebook from a self-organizing map using k-means clustering. However, given the 'spatial' nature of the data, I want to constrain the clustering so that only contiguous nodes are clustered together.
After looking around, I decided to try to use the function skater in the spdep package.
Here's an example of what I've been doing.
# the 'codebook' data obtained from the self-organizing map.
# My grid is 15 by 15 nodes.
data <- data.frame(var1=rnorm(15*15, mean = 0, sd = 1), var2=rnorm(15*15, mean = 5, sd = 2))
# creating a matrix with all edges listed
# (so basically one row to show a connection between each pair of adjacent nodes)
require(spdep)
nbs <- cell2nb(nrow=15, ncol=15)
edges <- data.frame(node=rep(1:(15*15), each=4))
edges$nb <- NA
for (i in 1:(15*15)) {
  vals <- nbs[[i]][1:4] # up to 4 rook neighbours; absent ones come back as NA
  edges$nb[(i-1)*4+1] <- vals[1]
  edges$nb[(i-1)*4+2] <- vals[2]
  edges$nb[(i-1)*4+3] <- vals[3]
  edges$nb[(i-1)*4+4] <- vals[4]
}
edges <- edges[which(!is.na(edges$nb)),]
edges$from <- apply(edges[c("node", "nb")], 1, min)
edges$to <- apply(edges[c("node", "nb")], 1, max)
edges <- edges[c("to", "from")]
edges <- edges[!duplicated(edges),]
edges <- as.matrix(edges)
I know the code above is clumsy and inelegant (please bear with me). I tried using mstree(nb2listw(nbs))[,1:2], but it didn't list all the links. I'm not sure I quite understood what it was doing, so I created my matrix of edges manually.
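For reference, a more compact way to build the same edge list from the nb object might be (a sketch, reusing the nbs object from above):
edges2 <- do.call(rbind, lapply(seq_along(nbs), function(i) {
  # pair node i with each neighbour, ordered (from = min, to = max)
  cbind(from = pmin(i, nbs[[i]]), to = pmax(i, nbs[[i]]))
}))
edges2 <- unique(edges2) # drop the duplicate of each undirected link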
Then I tried to pass this matrix to the skater function:
test <- skater(edges=edges, data=data, ncuts=5)
but I get the following error message:
Error in colMeans(data[id, , drop = FALSE]) :
error in evaluating the argument 'x' in selecting a method for function 'colMeans': Error in data[id, , drop = FALSE] : subscript out of bounds
However, if I use the mstree edges, I don't get an error message but the results don't make sense at all.
test <- skater(edges=mstree(nb2listw(nbs))[,1:2], data=data, ncuts=5)
Any help on this error message (or alternative suggestions as to how to do the spatially constrained clustering I would like to do) is much appreciated.
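For reference, the usual SKATER recipe in spdep first attaches dissimilarity costs to the contiguity graph and then passes the edges of a minimum spanning tree, not the full edge list, to skater. A sketch under that assumption, reusing nbs and data from above:
costs <- nbcosts(nbs, data) # dissimilarity between neighbouring nodes in attribute space
lcosts <- nb2listw(nbs, glist = costs, style = "B") # weights list carrying those costs
mst <- mstree(lcosts) # minimum spanning tree over the cost-weighted graph
test <- skater(edges = mst[, 1:2], data = data, ncuts = 5)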

Related

igraph error long vectors not supported yet when trying to create adjacency matrix

I'm trying to perform a social network analysis in R, and I'm having some trouble creating adjacency matrices from very large matrices using the igraph package. One of the main matrices has 10,998,555,876 elements (82 GB), created from a dataset with 176,881 rows.
The error I get when running:
adjacency_matrix <- graph.adjacency(one_mode_matrix, mode = "undirected", weighted = TRUE, diag = TRUE)
is as follows:
Error in graph.adjacency.dense(adjmatrix, mode = mode, weighted = weighted, :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
The data is two-mode, so I've had to project it to get the one-mode matrix with the units I'm interested in. The code used earlier to create the matrix is:
graph <- graph.data.frame(data, directed = FALSE) # Making a graph object from the dataframe.
types <- bipartite.mapping(graph)$type
matrix <- as_incidence_matrix(graph, types = types) # Creating a two-mode matrix.
one_mode_matrix <- tcrossprod(matrix) # Projecting onto a one-mode matrix (matrix %*% t(matrix)).
diag(one_mode_matrix) <- 0
mode(one_mode_matrix) <- "numeric"
adjacency_matrix <- graph.adjacency(one_mode_matrix, mode = "undirected", weighted = TRUE, diag = FALSE) # This is where things break down.
Having done some research, e.g. in this thread https://github.com/igraph/rigraph/issues/255 , it looks like a limitation in base R. It seems to me (without being an expert on these things) that igraph is trying to create an object in a format that R cannot handle because it is too big. Does anybody know how to handle this issue? Perhaps there are other packages for creating adjacency matrices that cope better with very large matrices?
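For intuition, the numbers are consistent with a base-R limit: a dense numeric matrix costs 8 bytes per element, and anything above 2^31 - 1 elements is a "long vector", which parts of igraph's dense code path cannot handle:
10998555876 * 8 / 1024^3 # ~82 GiB of dense storage
10998555876 > 2^31 - 1 # TRUE: a "long vector" in R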
Solution, for anybody who might be interested:
I discovered that igraph can handle sparse matrices. Convert the matrix to a sparse matrix using the Matrix package like so:
sparse_matrix <- as(one_mode_matrix, "sparseMatrix")
Then make it into a graph object like this:
g <- graph_from_adjacency_matrix(sparse_matrix)
And revel in all the functionality igraph has to offer.
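A sketch of keeping the whole pipeline sparse, so the huge dense matrix is never materialised (assuming the graph and types objects from the question):
library(igraph)
library(Matrix)
inc <- as_incidence_matrix(graph, types = types, sparse = TRUE) # sparse two-mode matrix
one_mode_sparse <- tcrossprod(inc) # Matrix's method keeps the projection sparse
diag(one_mode_sparse) <- 0
g <- graph_from_adjacency_matrix(one_mode_sparse, mode = "undirected", weighted = TRUE)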

Extraction of optimal scores from clValid - subscript out of bounds

I am trying to extract the optimal scores from the clValid package's internal validation. For some datasets the model-based clustering algorithm cannot fit certain numbers of clusters. For those impossible numbers of clusters the package sets the internal validation measures to NA, and in that case the function optimalScores() fails with the following error:
Error in which(x == min(x), arr.ind = TRUE)[1, ] :
subscript out of bounds
I know optimalScores() is the way to get the optimal values, but here it doesn't work. summary(intvalid) also shows the optimal scores, but I couldn't find a way to extract them from the summary.
Example:
set.seed(199)
df1<-data.frame(replicate(4,sample(1:100,400,rep=TRUE)))
require(clValid)
intvalid <- clValid(df1, 2:10, clMethods=c("model"),
                    validation="internal", maxitems = 1000)
# doesn't work
optimalScores(intvalid)
# shows optimal scores
summary(intvalid)
How can I get the optimal scores despite NAs for some values?
Did you try:
intvalid <- clValid(df1, 2:10, clMethods="model",
                    validation="internal", maxitems = 1000)
I think if you use just one method you don't need the c().
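A workaround sketch, assuming the measures() accessor returns the raw array of validation measures: pick the best value per measure yourself, since which.min()/which.max() skip NAs. Note that Connectivity is minimised, while Dunn and Silhouette are maximised.
m <- measures(intvalid)[, , "model"] # matrix: measures x number of clusters
names(which.min(m["Connectivity", ])) # best cluster count; smaller is better
names(which.max(m["Dunn", ])) # larger is better
names(which.max(m["Silhouette", ])) # larger is better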

Duplicate data when using gstat or automap package in R

I am trying to use ordinary kriging to spatially predict where an animal will occur based on predictor variables, using the gstat or automap package in R. I have many (over 100) duplicated coordinate points, which I cannot throw out, since those stations were sampled multiple times over many years. Every time I run the code below for ordinary kriging, I get an LDLfactor error, which is due to the duplicate points. Does anyone know how to fix this problem without throwing out data? I have tried the code from the automap package that is supposed to correct for duplicates, but I can't get it to work. Thank you for the help!
coordinates(fish) <- ~ LONGITUDE+LATITUDE
x.range <- range(fish@coords[,1])
y.range <- range(fish@coords[,2])
grd <- expand.grid(x=seq(from=x.range[1], to=x.range[2], by=3), y=seq(from=y.range[1], to=y.range[2], by=3))
coordinates(grd) <- ~ x+y
plot(grd, pch=16, cex=.5)
gridded(grd) <- TRUE
library(gstat)
zerodist(fish) ###146 duplicate points
v <- variogram(log(WATER_TEMP) ~1, fish, na.rm=TRUE)
plot(v)
vgm()
f <- vgm(1, "Sph", 300, 0.5)
print(f)
v.fit <- fit.variogram(v,f)
plot(v, model=v.fit) ####In fit.variogram(v, d) : Warning: singular model in variogram fit
krg <- krige(log(WATER_TEMP) ~ 1, fish, grd, v.fit)
## [using ordinary kriging]
##"chfactor.c", line 131: singular matrix in function LDLfactor()Error in predict.gstat(g, newdata = newdata, block = block, nsim = nsim,: LDLfactor
##automap code for correcting for duplicates
fish.dup = rbind(fish, fish[1,]) # Create duplicate
coordinates(fish.dup) = ~LONGITUDE + LATITUDE
kr = autoKrige(WATER_TEMP, fish.dup, grd)
###Error in inherits(formula, "SpatialPointsDataFrame"):object 'WATER_TEMP' not found
###somehow my predictor variables are no longer available when in a Spatial Points Data Frame??
automap::autoKrige expects a formula as its first argument; try
kr = autoKrige(WATER_TEMP~1, fish.dup, grd)
automap has a very simple fix for duplicate observations, and that is to discard them. So automap does not really solve the issue you have. I see some options:
Discard the duplicates.
Slightly perturb the coordinates of the duplicates so that they are not on exactly the same location anymore.
Perform space-time kriging using gstat.
In regard to your specific issue, please make your example reproducible. What I can guess is that rbind on your fish object is not doing what you expect...
Alternatively, you can use the function jitterDupCoords from the geoR package; a sketch follows the reference link.
https://cran.r-project.org/web/packages/geoR/geoR.pdf
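A sketch of the jitter option (the max argument bounds the perturbation and should be tiny relative to your coordinate units; variable names reuse the question's objects):
library(sp)
library(geoR)
xy <- coordinates(fish)
xy.jit <- jitterDupCoords(xy, max = 1e-4) # perturb only duplicated locations
fish.jit <- SpatialPointsDataFrame(xy.jit, data = fish@data)
krg <- krige(log(WATER_TEMP) ~ 1, fish.jit, grd, v.fit)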

Traversing a Binary Tree to Get The Splitting Conditions - ctree(party), recursive function

I am trying to reproduce, on a general dataset, the error I get with my own dataset. Please correct me if I am missing something.
After fitting a classification tree using library(party), I am trying to get the split conditions of the tree at each node. I managed to write code which I believed was working fine, until I found a bug. Could anyone help me solve it?
My code:
require(party)
iris$Petal.Width <- as.factor(iris$Petal.Width) # important: convert to a factor
(ct <- ctree(Species ~ ., data = iris))
plot(ct)
#print(ct)
a <- ct # ct is an S4 object of class BinaryTree
t <- a@tree # the root node of the tree
#recursive function to traverse the tree and get the splitting conditions
recurse_tree <- function(tree, ret_list = list(), sub_list = list()) {
  if (!tree$terminal) {
    sub_list$assign <- list(tree$psplit$splitpoint, tree$psplit$variableName, class(tree$psplit))
    names(sub_list)[which(names(sub_list) == "assign")] <- paste("node", tree$nodeID, sep = "")
    ret_list <- recurse_tree(tree$left, ret_list, sub_list)
    ret_list <- recurse_tree(tree$right, ret_list, sub_list)
  }
  if (tree$terminal) {
    ret_list$assign <- c(sub_list, tree$prediction)
    names(ret_list)[which(names(ret_list) == "assign")] <- paste("node", tree$nodeID, sep = "")
    return(ret_list)
  }
  return(ret_list)
}
result <- recurse_tree(t) #call to the functions
Now, the result gives me the list of all nodes, split conditions and predictions (I assumed). But when I check the split conditions for node 5, the expected output (from printing the tree with print(ct)) is {1.1, 1.2, 1.6, 1.7}, whereas the output I get for node 5 from my function is {"1", "1.3", "1.4", "1.5"}, which is basically the split condition of node 6 and therefore wrong. How did I get this?
z <- result[2] # I know node5 is second in the list
z <- unlist(z, recursive = F, use.names = T) # unlist
levels(z[[3]][[1]])[which((z[[3]][[1]])==0)] # to find the levels of the corresponding values
What I suspect is that my function (recurse_tree) always gives me the split conditions of the right terminal node and not the left node. Any help will be appreciated.
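As an aside, the newer partykit package can list the accumulated split conditions per terminal node directly, which sidesteps hand-rolled traversal; a sketch using its internal helper (hence the triple colon):
library(partykit)
ct2 <- partykit::ctree(Species ~ ., data = iris)
partykit:::.list.rules.party(ct2) # character vector of rules, one per terminal node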

spdep "Not yet able to subset general weights lists" listw

I have a problem with spdep. Starting with a matrix of non-missing distances produced by a function:
dist_m <- geoDistMatrix(data1, group = 'fips_dist')
dist_m[upper.tri(dist_m)] <- t(dist_m)[upper.tri(dist_m)]
we then turn it into weights with a linear inverse:
max_dist <- max(dist_m)
w1 <- (max_dist + 1 - dist_m)/(max_dist + 1)
and now
lw <- mat2listw(w1, row.names = rownames(w1), style = 'M')
I check to make sure there are no missing weights:
any(is.na(lw$weights))
and since there aren't any, go ahead with:
errorsarlm(cvote ~ inc, data = data1, lw, method = 'eigen', quiet = F, zero.policy = TRUE)
which leads to the following error:
Error in subset.listw(listw, subset, zero.policy = zero.policy) :
Not yet able to subset general weights lists
This is because at least one observation in data1 is not complete, i.e. has missing values. Hence errorsarlm wants to subset the data, i.e. restrict it to complete cases, but it cannot do that for a general weights list - that is what the error message says.
The best approach is to subset the data manually or correct the incomplete cases; a sketch follows.
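A minimal sketch of that manual subsetting, assuming the variable names from the question:
ok <- complete.cases(data1[, c("cvote", "inc")]) # rows with all model variables present
data1c <- data1[ok, ]
lwc <- mat2listw(w1[ok, ok], row.names = rownames(w1)[ok], style = 'M') # weights to match
errorsarlm(cvote ~ inc, data = data1c, lwc, method = 'eigen', quiet = FALSE, zero.policy = TRUE)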
This is because the spdep function by default creates a listw object only for non-general weights. Set zero.policy=TRUE before you run mat2listw or nb2listw, so that it considers non-neighbours that have zero weight.
