Simulate a list of kppm objects in R spatstat

I would like to use envelope simulation on a list of kppm objects.
I have a point dataset arranged in a grid pattern. I would like to shift each point so that its position within its cell is random, then run a spatstat simulation to determine whether the points are randomly distributed (obviously this will only hold at sub-cell scales). Then I want to randomly shift each point inside its cell again and re-run the simulation, repeating n times to generate an envelope of all the simulations. There are two ways to do this: 1) using a list of point sets, and 2) using simulate.
Method 1) - on a list called my_kppm_list
Lmat = envelope(fitM, Lest, nsim = 19, global=FALSE, simulate=my_kppm_list)
plot(Lmat)
How to create a list of cluster process models (kppm)?
Naive way:
my_list <- list(kppm_0, kppm_1)
This fails when trying to run simulation:
Lmat = envelope(fitM, Lest, nsim = 19, global=FALSE, simulate=my_list)
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim = nsim, :
‘simulate’ should be an expression, or a list of point patterns of the same kind as X
I can convert the models in a list to ppm objects:
fitM_0 <- kppm(create_pts(), ~1, "MatClust")
fitM_1 <- kppm(create_pts(), ~1, "MatClust")
my_list <- list(fitM_0, fitM_1)
ppm_list <- lapply(my_list, as.ppm)
But trying to convert to kppm fails with an error
kppm_list <- lapply(my_list, as.kppm)
Method 2) supply a function to simulate so that a random shift is applied to each point before each simulation is run, and the envelope is built from all the simulations (see page 399 of Baddeley et al., Spatial Point Patterns (2016)):
e_rand <- function(){
  # random jitter of up to +/- 10 units in x and y for every point in c_df
  j_x <- runif(nrow(c_df), min = -10, max = 10)
  j_y <- runif(nrow(c_df), min = -10, max = 10)
  x_j <- c_df[, 1] + j_x
  y_j <- c_df[, 2] + j_y
  c_j <- ppp(x = x_j, y = y_j, window = window)
  return(c_j)
}
Lmat = envelope(fitM, Lest, nsim = 19, global=TRUE, simulate=e_rand)
However I found that the null model (red dashed line in output plot) had kinks in it when simulate is added - kinks that do not appear without simulate.

How to create a list of cluster process models (kppm)?
This is not the problem. In your example you are successfully creating a list of objects of class kppm. If fit1 and fit2 are fitted models of class kppm, then
m <- list(fit1, fit2)
is a list of objects of class kppm.
The problem is that you then pass this list as an argument to the function envelope, which does not accept this format. The error message from envelope.ppp or envelope.kppm says that the argument simulate must be an expression, or a list of point patterns of the same kind as X.
An envelope is constructed by generating simulated point patterns, computing a summary function for each simulated pattern, and computing the upper and lower extremes of these summary functions. The argument simulate (if it is given) is a recipe for generating the simulated point patterns. If simulate is an expression like expression(runifpoint(42)) then this expression will be evaluated nsim times to produce nsim different point patterns. If simulate is a fitted model, then nsim simulated realisations of the model will be generated. If simulate is a list of point patterns, then they will be taken as the simulated random patterns.
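For example, a minimal sketch illustrating the three accepted forms of simulate (assuming X is the observed point pattern and fit is a fitted kppm):
# expression: evaluated nsim times to produce nsim point patterns
e1 <- envelope(X, Lest, nsim = 19, simulate = expression(runifpoint(npoints(X), Window(X))))
# fitted model: nsim realisations are generated from it
e2 <- envelope(X, Lest, nsim = 19, simulate = fit)
# list of point patterns: used directly as the simulated patterns
patts <- simulate(fit, nsim = 19)
e3 <- envelope(X, Lest, nsim = 19, simulate = patts)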
It is unclear what you want to do with your list of models.
Do you want to construct a single envelope, or a list of envelopes?
Option 1: you have a list m of models of class kppm. For each of these models m[[i]], you want to construct an envelope e[[i]], where the limits are determined by simulation from m[[i]].
Option 2: you have a list m of models of class kppm. For each model m[[i]] you want to generate one point pattern, say X[[i]], and build an envelope e using these patterns.
For option 1, type something like
e <- anylapply(m, function(fit) {
  # Y is the observed point pattern
  envelope(Y, Lest, nsim = 19, global = FALSE, simulate = fit)
})
For option 2,
X <- solapply(m, simulate, nsim=1, drop=TRUE)
e <- envelope(Y, Lest, nsim=19, global=FALSE, simulate=X)
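In either case the result can be plotted as usual, for example:
plot(e[[1]])  # Option 1: envelope for the first model
plot(e)       # Option 2: the single pooled envelope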
If you wanted something else, please clarify.

Related

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the results of the prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of components, so results[[1]] will be the results for 10 components. You will not get the values of prediction or confusion.mat unless they are included in what the function returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This produces a list of lists, so that results[[1]][[1]] will be the prediction for 10 components, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
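For example, a minimal sketch of that modified confmat, assuming the same objects as in the question:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  return(list(prediction, confusion.mat, get.BER(confusion.mat)))
}
results <- lapply(10:2, confmat)
results[[1]][[3]]  # balanced error rate for 10 components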

What is the purpose of using preProcess from the "caret" package in R?

"Hi everyone. When I see them using the K Nearest Network to classify the groups. I don't know why they just use the preProcess to standardize the data. Here are the code"
preProc <- preProcess(UB2[3:12])
UBn <- predict(preProc, UB2)
set.seed(12)
UBKm <- kmeans(UBn[3:12], centers = 5, iter.max = 1000)
You use preProcess to scale and center your variables, basically to have them in the same range.
In situations where the columns have different ranges, if you apply kmeans directly, it will mainly form clusters that minimize the variance on columns that have higher values.
For example we simulate three clusters that can be separated on variables of different scales:
library(caret)
library(MASS)
library(rgl)
set.seed(111)
Sigma <- matrix(c(10,1,1,1,1,1,1,1,1),3,3)  # 3x3 covariance matrix
X = rbind(mvrnorm(n=200,c(50,1,1), Sigma),
mvrnorm(n=200,c(20,5,1), Sigma),
mvrnorm(n=200,c(20,2.5,2.5), Sigma))
X = data.frame(X,cluster=factor(rep(1:3,each=200)))
plot3d(X[,1:3],col=factor(rep(1:3,each=200)))
Note that X1 is in the range 0-60 while X2 and X3 are around -1 to 10.
If we do kmeans without scaling:
clus = kmeans(X[,1:3],3)
COLS = heat.colors(3)
plot3d(X[,1:3],col=COLS[clus$cluster])
It primarily tries to split using X1, ignoring X2 and X3, resulting in a split within the original cluster 1.
So we scale and cluster:
clus = kmeans(predict(preProcess(X[,1:3]),X[,1:3]),3)
COLS = heat.colors(3)
plot3d(X[,1:3],col=COLS[clus$cluster])
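To check how well the scaled clustering recovers the simulated groups, you can cross-tabulate the assignments against the true labels (a small addition using the objects defined above; the cluster numbers themselves may be permuted):
# compare recovered clusters with the true simulated labels
table(true = X$cluster, assigned = clus$cluster)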

How to get gap statistic for hierarchical average clustering

I perform a hierarchical cluster analysis based on 'average linkage'. In base R, I use
dist_mat <- dist(cdata, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")
I want to calculate the gap statistic to decide the optimal number of clusters. I use the 'cluster' library and the clusGap function. Since I can't pass the hclust solution nor specify average hierarchical clustering in the clusGap function, I use these lines:
cluster_fun <- function(x, k) {
  list(cluster = cutree(hclust(dist(x, method = "euclidean"), method = "average"), k = k))
}
gap_stat <- clusGap(cdata, FUN=cluster_fun, K.max=10, B=50)
print(gap_stat)
However, here I can't check the cluster solution. So, my question is - can I be sure that the gap statistic is calculated on the same solution as hclust_avg?
Is there a better way of doing this?
Yes, it should be the same. The clusGap function calls your cluster_fun for each k you provide, then calculates the pooled within-cluster sum of squares around the cluster means, as described in the gap statistic paper (Tibshirani, Walther & Hastie, 2001).
This is the bit of code called inside clusGap that calls your custom function:
W.k <- function(X, kk) {
  # ii (the row indices) and d.power are defined earlier inside clusGap
  clus <- if (kk > 1)
    FUNcluster(X, kk, ...)$cluster
  else rep.int(1L, nrow(X))
  0.5 * sum(vapply(split(ii, clus), function(I) {
    xs <- X[I, , drop = FALSE]
    sum(dist(xs)^d.power / nrow(xs))
  }, 0))
}
And from there, the gap statistic is calculated.
You could calculate the gap statistic with custom code, but for the sake of reproducibility it is probably easier to just use clusGap.
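If you do use clusGap, the optimal number of clusters can then be read off with maxSE, which implements the selection rules from the paper (a minimal sketch, assuming gap_stat from the question):
library(cluster)
# pick k using the Tibshirani et al. one-standard-error style rule
k_best <- maxSE(f = gap_stat$Tab[, "gap"],
                SE.f = gap_stat$Tab[, "SE.sim"],
                method = "Tibs2001SEmax")
k_best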
Thanks for solving it. I must say that is a good enough solution, but you can also try the Python code below (a gap statistic implementation for k-means).
# Gap Statistic for k-means (Python)
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates the KMeans optimal K using the Gap Statistic.
    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of sample reference datasets to create
        maxClusters: maximum number of clusters to test for
    Returns: (optimalK, resultsdf)
    """
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        # For nrefs references, generate a random sample and run kmeans,
        # recording the resulting dispersion of each loop
        for i in range(nrefs):
            # Create a new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(n_clusters=k)
            km.fit(randomReference)
            refDisps[i] = km.inertia_
        # Fit clustering to the original data and get its dispersion
        km = KMeans(n_clusters=k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate the gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        # pd.concat replaces the old DataFrame.append, which was removed in pandas 2.0
        resultsdf = pd.concat(
            [resultsdf, pd.DataFrame({'clusterCount': [k], 'gap': [gap]})],
            ignore_index=True)
    return (gaps.argmax() + 1, resultsdf)

score_g, df = optimalK(cluster_df, nrefs=5, maxClusters=30)
plt.plot(df['clusterCount'], df['gap'], linestyle='--', marker='o', color='b')
plt.xlabel('K')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic vs. K')
plt.show()

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling. I need to come up with the optimal number of topics. So, I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics.
I have divided my corpus into ten batches and set aside one batch for a holdout set. I have run latent Dirichlet allocation (LDA) using nine batches (180 documents in total) with 10 to 60 topics. Now, I have to calculate the perplexity or log likelihood for the holdout set.
I found this code from one of CV's discussion sessions. I really don't understand several lines of code below. I have dtm matrix using the holdout set (20 documents). But I don't know how to calculate the perplexity or log likelihood of this holdout set.
Questions:
Can anybody explain to me what seq(2, 100, by = 1) means here? Also, what does AssociatedPress[21:30] mean? What is function(k) doing here?
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
If I want to calculate the perplexity or log likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions, but since I'm new I cannot figure out how to apply them to my holdout matrix, called dtm.
How can I do ten fold cross validation with my corpus, containing 200 documents? Is there existing code that I can invoke? I found caret for this purpose, but again cannot figure that out either.
The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset and how to use cross-validation.
Using perplexity for simple validation
Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to compute with the perplexity function, which takes as arguments a previously fitted topic model and a new set of data, and returns a single number. The lower, the better.
For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):
# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)
data("AssociatedPress", package = "topicmodels")
burnin = 1000
iter = 1000
keep = 50
full_data <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5
splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300
The perplexity is higher for the validation set than the training set, because the topics have been optimised based on the training set.
Using perplexity and cross-validation to determine a good number of topics
The extension of this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.
You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.
The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:
#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)
# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
  library(topicmodels)
})
folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics
# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))
# we parallelize by the different number of topics. A processor is allocated a value
# of k, and does the cross-validation serially. This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise
system.time({
  results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar% {
    k <- candidate_k[j]
    results_1k <- matrix(0, nrow = folds, ncol = 2)
    colnames(results_1k) <- c("k", "perplexity")
    for(i in 1:folds){
      train_set <- full_data[splitfolds != i, ]
      valid_set <- full_data[splitfolds == i, ]
      fitted <- LDA(train_set, k = k, method = "Gibbs",
                    control = list(burnin = burnin, iter = iter, keep = keep))
      results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
    }
    return(results_1k)
  }
})
stopCluster(cluster)
results_df <- as.data.frame(results)
ggplot(results_df, aes(x = k, y = perplexity)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
          "(ie five different models fit for each candidate number of topics)") +
  labs(x = "Candidate number of topics",
       y = "Perplexity when fitting the trained model to the hold-out set")
We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.
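As a quick check of that, you can average the hold-out perplexities over the five folds for each candidate number of topics (a small addition using results_df from above):
# average hold-out perplexity per candidate k
aggregate(perplexity ~ k, data = results_df, FUN = mean)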
I wrote the answer on CV that you refer to, here's a bit more detail:
seq(2, 100, by =1) simply creates a number sequence from 2 to 100 by ones, so 2, 3, 4, 5, ... 100. Those are the numbers of topics that I want to use in the models. One model with 2 topics, another with 3 topics, another with 4 topics and so on to 100 topics.
AssociatedPress[21:30] is simply a subset of the built-in data in the topicmodels package. I just used a subset in that example so that it would run faster.
Regarding the general question of optimal topic numbers, I now follow the example of Martin Ponweiser on Model Selection by Harmonic Mean (section 4.3.3 of his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:
library(topicmodels)
#
# get some of the example data that's bundled with the package
#
data("AssociatedPress", package = "topicmodels")
harmonicMean <- function(logLikelihoods, precision = 2000L) {
  library("Rmpfr")
  llMed <- median(logLikelihoods)
  as.double(llMed - log(mean(exp(-mpfr(logLikelihoods, prec = precision) + llMed))))
}
# The log-likelihood values are then determined by first fitting the model using for example
k = 20
burnin = 1000
iter = 1000
keep = 50
fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )
# where keep indicates that every keep iteration the log-likelihood is evaluated and stored. This returns all log-likelihood values including burnin, i.e., these need to be omitted before calculating the harmonic mean:
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
# (assuming that burnin is a multiple of keep) and then compute the harmonic mean:
harmonicMean(logLiks)
So to do this over a sequence of topic models with different numbers of topics...
# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 2 to 50, by ones.
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
# extract logliks from each topic
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l")
# compute optimum number of topics
sequ[which.max(hm_many)]
## 6
Here's the output, with numbers of topics along the x-axis, indicating that 6 topics is optimum.
Cross-validation of topic models is pretty well documented in the docs that come with the package, see here for example: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf Give that a try and then come back with a more specific question about coding CV with topic models.

Randomizations and hierarchical tree

I am trying to permute (column-wise only) my data matrix 1000 times and then do hierarchical clustering in R, so that I end up with a final tree for my data after 1000 randomizations.
This is where I am lost. I have this loop
for(i in 1:1000)
{
  permuted <- test2_matrix[, sample(ncol(test2_matrix), 12, replace = TRUE)]  # this permutes my columns
  d <- dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
  clust <- hclust(d, method = "complete", members = NULL)
}
png (filename="cluster_dendrogram_bootstrap.png", width=1024, height=1024, pointsize=10)
plot(clust)
I am not sure if the final tree is the product of the 1000 randomizations or just the last tree calculated in the loop. Also, if I want to display the bootstrap values on the tree, how should I go about it?
Many thanks!!
The value of clust in your example is indeed the final tree calculated in the loop. Here's a way of making and saving 1000 permutations of your matrix
make.permuted.clust <- function(i){  # this argument is not used
  permuted <- data.matrix[, sample(ncol(data.matrix), 12, replace = TRUE)]
  d <- dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
  clust <- hclust(d, method = "complete", members = NULL)
  clust  # return value
}
all.clust <- lapply(1:1000, make.permuted.clust) # 1000 hclust trees
The second part of your question should be answered here.
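For putting bootstrap support values on the dendrogram, one commonly used option is the pvclust package, which resamples the data, re-clusters it, and annotates each branch with bootstrap probabilities (a minimal sketch, assuming your data are in test2_matrix):
library(pvclust)
# pvclust clusters the columns of the matrix, so transpose to cluster rows as in your loop
fit <- pvclust(t(test2_matrix), method.hclust = "complete",
               method.dist = "euclidean", nboot = 1000)
plot(fit)                  # dendrogram with AU/BP values on each branch
pvrect(fit, alpha = 0.95)  # highlight clusters with high support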
You may be interested in the RandomForest method implemented in the randomForest package, which implements both bootstrapping of the data and of the splitting variables and allows you to save trees and get a consensus tree.
library(randomForest)
The original random forest (in FORTRAN 77) developers site
The package manual
