Extracting gap statistic info to identify K for Kmeans clustering - r

I was looking at the 'cluster' library which has the function 'clusGap' to extract the number of clusters for Kmeans clustering.
This is the code:
# Compute Gap statistic (http://web.stanford.edu/~hastie/Papers/gap.pdf)
computeGapStatistic <- function(data) {
  gap <- clusGap(data, FUN = kmeans, K.max = 8, B = 3)
  if (ENABLE_PLOTS) {
    plot(gap, main = "Gap statistic for the Nursing shift data")
  }
  print(gap)
  return(gap)
}
Which gives me the following output when 'gap' is printed out:
> print(gap)
Clustering Gap statistic ["clusGap"].
B=3 simulated reference sets, k = 1..8
--> Number of clusters (method 'firstSEmax', SE.factor=1): 2
logW E.logW gap SE.sim
[1,] 8.702334 9.238385 0.53605067 0.007945542
[2,] 7.940133 8.544323 0.60418996 0.003790244
[3,] 7.772673 8.139836 0.36716303 0.005755805
[4,] 7.325798 7.849233 0.52343473 0.002732731
[5,] 7.233667 7.629954 0.39628748 0.003496058
[6,] 7.020220 7.439709 0.41948820 0.006451708
[7,] 6.707678 7.285907 0.57822872 0.002810682
[8,] 7.166932 7.150724 -0.01620749 0.004274151
and this is what the plot looks like: [plot of the gap statistic against the number of clusters, not reproduced here]
Question:
How do I extract the number of clusters from the 'gap' variable? 'gap' seems to be a list. From the printed output above, it seems to have found 2 clusters.

I figured this out on my own. This is what I used: with(gap, maxSE(Tab[,"gap"], Tab[,"SE.sim"]))
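Spelled out with named arguments (a minimal sketch; the defaults method = "firstSEmax" and SE.factor = 1 match what the printed summary reports):
k.best <- with(gap, maxSE(f = Tab[, "gap"], SE.f = Tab[, "SE.sim"],
                          method = "firstSEmax", SE.factor = 1))
k.best
# [1] 2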


Using terra and sf in R: Why am I getting illogical distance measurements?

I am using terra to get "curvy" distances between points within a bounding polygon and comparing those to straight-line distances that ignore the polygon. The results I'm getting back don't make sense, and I am hoping you all could help me figure out what is going on.
We load the US Congressional map used in the 114th Congress for the state of Texas first:
texas = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/texascongressmaps")
ggplot() + geom_sf(data = texas$geometry)
We also make some storage objects:
longest.dist.district.straight = rep(NA, 36)
longest.dist.district.curved = rep(NA, 36)
Then, we go district by district (n = 36). For each, we take a sample of 100 random points within that district's polygon. Then, we ask "What is the longest straight-line distance between any two of our 100 points?" We then rasterize the polygon, mask it, and go point by point, asking "How far is this point from all others, assuming we cannot travel outside the polygon?" This means we'll have to bend around within the polygon to get between the points some of the time. We find the longest such distance between any two points. We then compare the straight-line and curvy-line approaches, with the assumption that the curvy-line approaches will always be longer by some amount...
for(c in 1:36) { # Texas had 36 districts.
  if(c %% 3 == 0) { print(c) } # Progress bar
  this.district = texas[c, ] # Get the current district
  # We'll get a sample of 100 randomly placed points around the district.
  rand.ptsDistrict = sf::st_sample(this.district,
                                   size = 100,
                                   type = 'random',
                                   exact = TRUE)
  # What's the max straight-line distance between any two points?
  longest.dist.district.straight[c] = max(sf::st_distance(rand.ptsDistrict))
  # Now, calculate our 'as the politician would walk' distances (aka curvy
  # distances). We need to do this for each of our 100 points separately,
  # with each as the target point in turn, and save the longest value we get...
  current.raster = terra::ext(this.district) # Rasterizing
  current.raster = terra::rast(current.raster,
                               nrow = 100, ncol = 100,
                               crs = crs(this.district),
                               vals = 1)
  current.raster = terra::mask(current.raster, # Masking
                               terra::vect(this.district),
                               updatevalue = NA)
  point.locs = terra::cellFromXY(current.raster, # Getting point locations in the new grid
                                 sf::st_coordinates(rand.ptsDistrict))
  longest.dists.i = rep(NA, 100) # Storage object
  for(i in 1:100) {
    point.i.loc = cellFromXY(current.raster, # Focal point this time.
                             st_coordinates(rand.ptsDistrict[i]))
    point.noni.loc = cellFromXY(current.raster, # All other points
                                st_coordinates(rand.ptsDistrict[-i]))
    terra::values(current.raster)[point.i.loc] = 2 # Make focal point the target value
    all.dists = terra::gridDistance(current.raster, # Get all distances to the target value
                                    target = 2, scale = 1)
    longest.dists.i[i] = max(values(all.dists)[point.noni.loc], na.rm = TRUE) # Find the longest of these for this point and store it.
    terra::values(current.raster)[point.i.loc] = 1
  }
  longest.dist.district.curved[c] = max(longest.dists.i) # Find the longest curved distance between any two points in the current district.
}
When I do this, I always get straight-line distances that are strictly longer than the curvy distances from the same district, which doesn't logically make sense--how could a straight line between two points ever be longer than a curvy line between them?
> (cbind(longest.dist.district.straight, longest.dist.district.curved))
longest.dist.district.straight longest.dist.district.curved
[1,] 239285.77 121703.64
[2,] 63249.88 48238.89
[3,] 49495.09 24823.91
[4,] 290542.38 147894.80
[5,] 213758.13 108663.63
[6,] 129261.83 68351.77
[7,] 36705.18 22081.22
[8,] 165759.58 87749.33
[9,] 38317.61 19903.54
[10,] 196211.38 100959.66
[11,] 505130.81 261479.58
[12,] 79502.87 45134.11
[13,] 604901.43 313317.24
[14,] 201724.57 115286.81
[15,] 414257.14 208204.75
[16,] 61867.34 32115.77
[17,] 193198.96 103829.75
[18,] 41693.26 26462.02
[19,] 433902.07 225041.00
[20,] 32201.45 17060.41
[21,] 212300.45 119597.54
[22,] 88143.49 46720.59
[23,] 777236.95 394663.54
[24,] 39692.06 21192.98
[25,] 299336.81 153871.46
[26,] 65901.64 35200.83
[27,] 272822.43 158724.70
[28,] 362477.84 205297.74
[29,] 40210.19 30094.43
[30,] 44693.37 23430.33
[31,] 93781.16 50340.85
[32,] 38941.81 21047.40
[33,] 52395.85 31169.46
[34,] 394586.71 206545.50
[35,] 138182.61 73556.10
[36,] 223351.15 112601.38
I can only guess I have either messed up the code somewhere or else have found a bug. Please help! Thanks!
Edit: I just noticed after posting this that it looks like if I were to multiply the curvy distances by 2, I'd get values that were believable (the curvy distances are always longer but by a variable amount)--but I don't see a coding reason to need to do this...can anyone else see one I'm missing?
You are comparing the shortest (straight-line) distance ("as the crow flies", for those who have not seen crows fly) with the grid distance (moving from the center of a cell to the center of a neighboring cell), allowing movement only through the grid cells that fall within a district.
When I run a condensed version of your code, I see that the distances are very similar, with the grid distance always longer, as it should be, except for district 14, as that district is not contiguous.
library(terra)
# terra 1.6.47
texas <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/texascongressmaps")
tex <- vect(texas)

# generate random points
set.seed(0)
b <- spatSample(tex[, "DISTRICT"], size = 100, method = "random", strata = 1:nrow(tex))

# max distance between any two random points by district.
pdist <- sapply(tex$DISTRICT, \(i) max(distance(b[b$DISTRICT == i, ])))

# max grid distance between any two random points by district.
pgrid <- rep(NA, nrow(tex))
for (i in 1:nrow(tex)) {
  r <- rast(tex[i, ], nrow = 100, ncol = 100)
  r <- rasterize(tex[i, ], r)
  xy <- crds(b[b$DISTRICT == i, ])
  cells <- cellFromXY(r, xy)
  maxdists <- rep(NA, 100)
  for (j in 1:100) {
    r[cells[j]] <- 2
    dists <- gridDist(r, target = 2)
    # Find the longest of these for this point
    maxdists[j] <- max(dists[cells[-j]], na.rm = TRUE)
    r[cells[j]] <- 1
  }
  pgrid[i] <- max(maxdists)
}
The results look good:
head(cbind(pdist, pgrid))
# pdist pgrid
#1 217746.46 223906.22
#2 61707.87 99422.07
#3 50520.61 51479.98
#4 282744.13 293656.59
#5 196074.08 202014.45
#6 120913.60 126532.72
plot(pdist, pgrid)
abline(0, 1, col="red")
If your results are different, you are perhaps using an older version of "terra". I assume you are, because you use gridDistance, which still works, but with a warning, because it was renamed to gridDist in the current version.
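A quick way to check which version you have (a minimal sketch):
packageVersion("terra")
# compare against the 1.6.47 used in the code above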
You use different grid cell sizes for each district. I do not know what your goal is, but it might be more reasonable to use a single template raster for all of Texas. You could do something like
# outside the loop
rr <- rast(tex, res=1/60, vals=1)
# inside the loop
r <- crop(rr, tex[i,], mask=TRUE)
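Note that res = 1/60 is in the units of the CRS (one arc-minute for lon/lat data), so every district is rasterized at the same cell size and the grid distances become comparable across districts. A quick check (a sketch):
res(rr)
# [1] 0.01666667 0.01666667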

How to get the optimal number of clusters from the clusGap function as an output?

I have a data frame with 2 variables and I want to use the clusGap function to find the optimal number of clusters. The following code produces a similar result (nstart = n assumes some n defined earlier):
library(cluster)
x <- as.vector(runif(100, 0, 1))
y <- as.vector(runif(100, 0, 1))
df <- data.frame(x, y)
gap_stat <- clusGap(df, FUN = kmeans, nstart = n,
                    K.max = 10, B = 50)
gap_stat
Result:
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = n)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstSEmax', SE.factor=1): 1
logW E.logW gap SE.sim
[1,] 2.569315 2.584217 0.0149021144 0.03210076
[2,] 2.285049 2.284537 -0.0005116382 0.03231529
[3,] 2.053193 2.033653 -0.0195399122 0.03282376
[4,] 1.839085 1.835590 -0.0034952935 0.03443303
[5,] 1.691219 1.708479 0.0172603348 0.03419994
[6,] 1.585084 1.597277 0.0121935992 0.03440672
[7,] 1.504763 1.496853 -0.0079104306 0.03422321
[8,] 1.416176 1.405903 -0.0102731340 0.03371149
[9,] 1.333721 1.323658 -0.0100626869 0.03245958
[10,] 1.253199 1.250366 -0.0028330498 0.03034140
As you can see in line 4 of the output, the optimal number of clusters is 1. I would like the function to give me that 1 as an output. I need the optimal number of clusters to be an object in the environment, such as n being 1.
Typically such information is stored somewhere directly inside the object, like gap_stat$nc; to look for it, str(gap_stat) would usually suffice.
In this case, however, that strategy isn't enough. But the fact that you can see your number of interest in the printed output means that print.clusGap (because the class of gap_stat is clusGap) will show how to obtain this number. So, inspecting cluster:::print.clusGap leads to
maxSE(f = gap_stat$Tab[, "gap"], SE.f = gap_stat$Tab[, "SE.sim"])
# [1] 1
This may have been less transparent in the past, but you can actually specify the method directly:
nc <- maxSE(f = gap_stat$Tab[, "gap"],
            SE.f = gap_stat$Tab[, "SE.sim"],
            method = "firstSEmax",
            SE.factor = 1)
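Wrapped into a small helper (a sketch; the name optimal_k is made up here, not part of cluster):
optimal_k <- function(gs, method = "firstSEmax", SE.factor = 1) {
  maxSE(f = gs$Tab[, "gap"], SE.f = gs$Tab[, "SE.sim"],
        method = method, SE.factor = SE.factor)
}
n <- optimal_k(gap_stat)
n
# [1] 1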

How to calculate "terms" from predict-function manually when regression has an interaction term

Does anyone know how the predict function calculates the terms when there is an interaction term in a regression model? I know how to compute the terms by hand when the regression has no interaction terms, but when I add one I can't reproduce them manually anymore. Here is some example data, and I would like to see how to calculate those values manually. Thanks! -Aleksi
set.seed(2)
a <- c(4,3,2,5,3) # first I make some data
b <- c(2,1,4,3,5)
e <- rnorm(5)
y <- 0.6*a + e
data <- data.frame(a,b,y)
model1 <- lm(y~a*b,data=data) # regression
predict(model1,type='terms',data) # terms
#This gives the result:
a b a:b
1 0.04870807 -0.3649011 0.2049069
2 -0.03247205 -0.7298021 0.7740928
3 -0.11365216 0.3649011 0.2049069
4 0.12988818 0.0000000 -0.5919534
5 -0.03247205 0.7298021 -0.5919534
attr(,"constant")
[1] 1.973031
Your model is technically y = b0 + b1*a + b2*b + b3*a*b + e. Each term reported by predict(type = 'terms') is calculated by multiplying the corresponding variable by its coefficient and centering the result. So for example, the term for a would be
cf <- coef(model1)
scale(a * cf[2], scale = FALSE)
[,1]
[1,] 0.04870807
[2,] -0.03247205
[3,] -0.11365216
[4,] 0.12988818
[5,] -0.03247205
which matches your output above.
And since the interaction term is nothing more than the product of the independent variables, this translates to (cf[4] being the coefficient of a:b)
scale(a * b * cf[4], scale = FALSE)
[,1]
[1,] 0.2049069
[2,] 0.7740928
[3,] 0.2049069
[4,] -0.5919534
[5,] -0.5919534
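For completeness, the same recipe recovers the b term (cf[3] is the coefficient of b here), and the centered terms plus the reported constant reproduce the ordinary fitted values. A quick sketch:
scale(b * cf[3], scale = FALSE)
tt <- predict(model1, type = "terms")
all.equal(as.numeric(rowSums(tt) + attr(tt, "constant")),
          as.numeric(fitted(model1)))
# [1] TRUE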

Extract knots, basis, coefficients and predictions for P-splines in adaptive smooth

I'm using the mgcv package to fit some polynomial splines to some data via:
library(mgcv)
x.gam <- gam(cts ~ s(time, bs = "ad"), data = x.dd,
             family = poisson(link = "log"))
I'm trying to extract the functional form of the fit. x.gam is a gamObject, and I've been reading the documentation but haven't found enough information in order to manually reconstruct the fitted function.
x.gam$smooth contains information about where the knots have been placed;
x.gam$coefficients gives the spline coefficients, but I don't know what order of polynomial splines is used, and looking at the code has not revealed anything.
Is there a neat way to extract the knots, coefficients and basis used so that one can manually reconstruct the fit?
I don't have your data, so I take the following example from ?adaptive.smooth to show you where you can find the information you want. Note that though this example is for Gaussian data rather than Poisson data, only the link function is different; all the rest is the same.
library(mgcv)
x <- 1:1000/1000 # data between [0, 1]
mu <- exp(-400*(x-.6)^2) + 5*exp(-500*(x-.75)^2)/3 + 2*exp(-500*(x-.9)^2)
y <- mu + 0.5*rnorm(1000)
b <- gam(y ~ s(x, bs = "ad", k = 40, m = 5))
Now, all information on smooth construction is stored in b$smooth; we take it out:
smooth <- b$smooth[[1]] ## extract smooth object for first smooth term
knots:
smooth$knots gives you location of knots.
> smooth$knots
[1] -0.081161 -0.054107 -0.027053 0.000001 0.027055 0.054109 0.081163
[8] 0.108217 0.135271 0.162325 0.189379 0.216433 0.243487 0.270541
[15] 0.297595 0.324649 0.351703 0.378757 0.405811 0.432865 0.459919
[22] 0.486973 0.514027 0.541081 0.568135 0.595189 0.622243 0.649297
[29] 0.676351 0.703405 0.730459 0.757513 0.784567 0.811621 0.838675
[36] 0.865729 0.892783 0.919837 0.946891 0.973945 1.000999 1.028053
[43] 1.055107 1.082161
Note that three exterior knots are placed beyond each end of [0, 1] to construct the spline basis.
basis class
attr(smooth, "class") tells you the type of spline. As you can read from ?adaptive.smooth, for bs = "ad" mgcv uses P-splines, hence you get "pspline.smooth".
mgcv uses 2nd-order P-splines here; you can verify this by checking the difference matrix smooth$D. Below is a snapshot:
> smooth$D[1:6,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 -2 1 0 0 0
[2,] 0 1 -2 1 0 0
[3,] 0 0 1 -2 1 0
[4,] 0 0 0 1 -2 1
[5,] 0 0 0 0 1 -2
[6,] 0 0 0 0 0 1
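Each row of smooth$D applies a second-order difference to adjacent coefficients; the same pattern can be reproduced in base R (a minimal sketch):
diff(diag(6), differences = 2)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1   -2    1    0    0    0
# [2,]    0    1   -2    1    0    0
# [3,]    0    0    1   -2    1    0
# [4,]    0    0    0    1   -2    1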
coefficients
You already know that b$coefficients contains the model coefficients:
beta <- b$coefficients
Note this is a named vector:
> beta
(Intercept) s(x).1 s(x).2 s(x).3 s(x).4 s(x).5
0.37792619 -0.33500685 -0.30943814 -0.30908847 -0.31141148 -0.31373448
s(x).6 s(x).7 s(x).8 s(x).9 s(x).10 s(x).11
-0.31605749 -0.31838050 -0.32070350 -0.32302651 -0.32534952 -0.32767252
s(x).12 s(x).13 s(x).14 s(x).15 s(x).16 s(x).17
-0.32999553 -0.33231853 -0.33464154 -0.33696455 -0.33928755 -0.34161055
s(x).18 s(x).19 s(x).20 s(x).21 s(x).22 s(x).23
-0.34393354 -0.34625650 -0.34857906 -0.05057041 0.48319491 0.77251118
s(x).24 s(x).25 s(x).26 s(x).27 s(x).28 s(x).29
0.49825345 0.09540020 -0.18950763 0.16117012 1.10141701 1.31089436
s(x).30 s(x).31 s(x).32 s(x).33 s(x).34 s(x).35
0.62742937 -0.23435309 -0.19127140 0.79615752 1.85600016 1.55794576
s(x).36 s(x).37 s(x).38 s(x).39
0.40890236 -0.20731309 -0.47246357 -0.44855437
basis matrix / model matrix / linear predictor matrix (lpmatrix)
You can get model matrix from:
mat <- predict.gam(b, type = "lpmatrix")
This is an n-by-p matrix, where n is the number of observations and p is the number of coefficients. The matrix has column names:
> head(mat[,1:5])
(Intercept) s(x).1 s(x).2 s(x).3 s(x).4
1 1 0.6465774 0.1490613 -0.03843899 -0.03844738
2 1 0.6437580 0.1715691 -0.03612433 -0.03619157
3 1 0.6384074 0.1949416 -0.03391686 -0.03414389
4 1 0.6306815 0.2190356 -0.03175713 -0.03229541
5 1 0.6207361 0.2437083 -0.02958570 -0.03063719
6 1 0.6087272 0.2688168 -0.02734314 -0.02916029
The first column is all 1s, giving the intercept, while s(x).1 is the first basis function for s(x). If you want to see what an individual basis function looks like, you can plot a column of mat against your variable. For example:
plot(x, mat[, "s(x).20"], type = "l", main = "20th basis")
linear predictor
If you want to manually construct the fit, you can do:
pred.linear <- mat %*% beta
Note that this is exactly what you can get from b$linear.predictors or
predict.gam(b, type = "link")
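As a quick sanity check (a sketch using the example object b from above):
all.equal(as.numeric(pred.linear), as.numeric(b$linear.predictors))
# [1] TRUE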
response / fitted values
For non-Gaussian data, if you want the response variable, you can apply the inverse link function to the linear predictor to map it back to the original scale.
Family information is stored in gamObject$family, and gamObject$family$linkinv is the inverse link function. The above example uses the identity link, but for your fitted object x.gam you can do:
x.gam$family$linkinv(x.gam$linear.predictors)
Note this is the same as x.gam$fitted, or
predict.gam(x.gam, type = "response").
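For the Poisson model with log link in your question, linkinv is just exp(), so equivalently (a sketch, assuming your fitted object x.gam):
manual.response <- exp(x.gam$linear.predictors)
all.equal(manual.response, x.gam$fitted.values)
# [1] TRUE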
Other links
I have just realized that there were quite a lot of similar questions before.
This answer by Gavin Simpson is great, for predict.gam( , type = 'lpmatrix').
This answer is about predict.gam(, type = 'terms').
But anyway, the best reference is always ?predict.gam, which includes extensive examples.

parallel k-means in R

I am trying to understand how to parallelize some of my code using R. So, in the following example I want to use k-means to cluster data using 2,3,4,5,6 centers, while using 20 iterations.
Here is the code:
library(parallel)
library(BLR)
data(wheat)
parallel.function <- function(i) {
  kmeans(X[1:100, 100], centers = ??, nstart = i)
}
out <- mclapply(c(5, 5, 5, 5), FUN = parallel.function)
How can we parallelize over both the iterations and the centers simultaneously?
How do I keep track of the outputs, assuming I want to keep all the outputs from k-means across all iterations and centers, just to learn how?
This looked very simple to me at first ... and then I tried it. After a lot of monkey typing and face palming during my lunch break, however, I arrived at this:
library(parallel)
library(BLR)
data(wheat)
# 2:6 is matched to the unnamed `centers` argument, while the wheat
# data matrix X is passed to `x` by name, so each worker runs kmeans(X, k).
mc = mclapply(2:6, function(x, centers) kmeans(x, centers), x = X)
It looks right though I didn't check how sensible the clustering was.
> summary(mc)
Length Class Mode
[1,] 9 kmeans list
[2,] 9 kmeans list
[3,] 9 kmeans list
[4,] 9 kmeans list
[5,] 9 kmeans list
On reflection the command syntax seems sensible, although a lot of other stuff that failed seemed reasonable too... The examples in the help documentation are maybe not that great.
Hope it helps.
EDIT
As requested, here is the same thing over the two variables nstart and centers:
(pars = expand.grid(i=1:3, cent=2:4))
i cent
1 1 2
2 2 2
3 3 2
4 1 3
5 2 3
6 3 3
7 1 4
8 2 4
9 3 4
L = list()
# zikes, horrible: turn each row of pars into a list with elements i and cent
pars2 = apply(pars, 1, append, L)
mc = mclapply(pars2,
              function(x, pars) kmeans(x, centers = pars$cent, nstart = pars$i),
              x = X)
> summary(mc)
Length Class Mode
[1,] 9 kmeans list
[2,] 9 kmeans list
[3,] 9 kmeans list
[4,] 9 kmeans list
[5,] 9 kmeans list
[6,] 9 kmeans list
[7,] 9 kmeans list
[8,] 9 kmeans list
[9,] 9 kmeans list
How'd you like them apples?
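A tidier way to build that parameter list (a sketch, not the code used above) is to split the grid by row:
pars2 <- split(pars, seq_len(nrow(pars)))
mc <- mclapply(pars2,
               function(p, x) kmeans(x, centers = p$cent, nstart = p$i),
               x = X)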
There's a CRAN package called knor, derived from a research paper, that improves performance using a memory-efficient variant of Elkan's pruning algorithm. It's an order of magnitude faster than everything in these answers.
install.packages("knor")
require(knor)
iris.mat <- as.matrix(iris[,1:4])
k <- length(unique(iris[, dim(iris)[2]])) # Number of unique classes
nthread <- 4
kms <- Kmeans(iris.mat, k, nthread=nthread)
You may use parallel to try K-Means from different random starting points on multiple cores.
The code below is an example. (K=K in K-means, N= number of random starting points, C = number of cores you would like to use)
suppressMessages( library("Matrix") )
suppressMessages( library("irlba") )
suppressMessages( library("stats") )
suppressMessages( library("cluster") )
suppressMessages( library("fpc") )
suppressMessages( library("parallel") )
# Calculate k-means results
calcKMeans <- function(matrix, K, N, C){
  # Parallel runs from various random starting points (using C cores)
  results <- mclapply(rep(N %/% C, C),
                      FUN = function(nstart) kmeans(matrix, K, iter.max=15, nstart=nstart),
                      mc.cores = C)
  # Find the solution with the smallest total within-cluster sum of squares
  tmp <- sapply(results, function(r){ r[['tot.withinss']] })
  km <- results[[which.min(tmp)]]
  # returns cluster, centers, totss, withinss, tot.withinss, betweenss, size
  return(km)
}

runKMeans <- function(fin_uf, K, N, C,
                      #fout_center, fout_label, fout_size,
                      fin_record=NULL, fout_prediction=NULL){
  uf = read.table(fin_uf)
  km = calcKMeans(uf, K, N, C)
  rm(uf)
  #write.table(km$cluster, file=fout_label, row.names=FALSE, col.names=FALSE)
  #write.table(km$center, file=fout_center, row.names=FALSE, col.names=FALSE)
  #write.table(km$size, file=fout_size, row.names=FALSE, col.names=FALSE)
  str(km)
  return(km$center)
}
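A minimal usage sketch for calcKMeans (made-up data, and assuming a Unix-alike, since mclapply relies on forking):
m <- matrix(rnorm(1000), ncol = 10) # 100 observations, 10 variables
km <- calcKMeans(m, K = 3, N = 20, C = 4) # 20 random starts spread over 4 cores
km$size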
Hope it helps!
