Spatial clustering by value in R

I have this simple dataset, organized by hypothetical geographical unit (e.g. postal code), with 3 variables: longitude, latitude and someValue (sales).
lon<-rep(1:10,each=10)
lat<-rep(1:10,10)
someValue<-rnorm(100, mean = 20, sd = 5)
dataset<-data.frame(lon,lat,someValue)
The problem I'm facing is territory alignment: given a proposed number of territories, I need to group postal codes into territories in such a way that each territory consists of adjacent postal codes and the sums of someValue are roughly the same (within +/- 15% of the average for the specified number of territories).
The best idea I have at this point is to: 1. cluster on lon/lat first to establish candidates; 2. cluster on someValue using the centroids from step 1 as centers with iter.max=1; 3. iterate over 1 and 2 until some convergence cut-off.
I would like to ask the community: what would be a proper methodology to implement something like this in R? I searched for spatial clustering and was not able to find anything relevant.

You can do the clustering with kmeans by considering only the first two columns (lon and lat):
# How many clusters do you want initially?
initialClasses <- 2
#clustering using kmeans
initClust <- kmeans(dataset[,1:2], initialClasses, iter.max = 100)
dataset$classes <- initClust$cluster
initClust$cluster then contains your cluster classes. You can add them to your data frame and use dplyr to calculate some statistics, for example the sum of someValue per cluster:
library(dplyr)
statistics <- dataset %>% group_by(classes) %>% summarize(sum = sum(someValue))
Here for example the sum of someValue over two classes:
  classes      sum
    (int)    (dbl)
1       1 975.7783
2       2 978.9166
Let's say your data is fairly evenly distributed and you want the sum of someValue per cluster to be smaller. Then you need to rerun the clustering with more (e.g. 3) classes:
newRun <- kmeans(dataset[,1:2], 3, iter.max = 100)
dataset$classes <- newRun$cluster
Here the output statistics for three classes:
  classes      sum
    (int)    (dbl)
1       1 577.6573
2       2 739.9668
3       3 637.0707
By wrapping this inside a loop and calculating additional criteria (e.g. the variance of the per-cluster sums) you can tune your clustering to the right size; a sketch follows below. Hope it helps.
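Here is a minimal sketch of such a loop, reusing dataset from the question and checking the +/- 15% balance criterion mentioned there. It fixes the number of territories (4 is just an illustrative choice) and keeps the first random kmeans restart whose per-territory sums all fall within the tolerance; there is no guarantee such a run exists, so the restart count is an assumption:
library(dplyr)

n.territories <- 4
target <- sum(dataset$someValue) / n.territories   # average sum per territory

for (run in 1:100) {
  fit <- kmeans(dataset[, 1:2], n.territories, iter.max = 100)
  sums <- dataset %>%
    mutate(classes = fit$cluster) %>%
    group_by(classes) %>%
    summarize(sum = sum(someValue)) %>%
    pull(sum)
  # keep the first run where every territory is within 15% of the average
  if (all(abs(sums - target) / target <= 0.15)) {
    dataset$classes <- fit$cluster
    break
  }
}
Because kmeans on lon/lat produces convex (Voronoi-cell) groups, adjacency of the postal codes inside each territory is essentially automatic; only the balance of someValue has to be checked.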

Related

Hierarchical clustering: consistent and dichotomous group names representing hierarchy within tree

I aim to produce a typology of sites through hierarchical clustering of species abundance data. To do so, I successively cut the dendrogram into 2, 3, 4 ... z groups.
The cluster group names are automatically attributed by the function cutree() representing numbers from 1 to z in a non-consistent manner. For instance, in a clustering with three groups, "group 2" may not correspond to "group 2" in a clustering with six groups. This makes interpreting the dendrogram very difficult.
The code below provides a reproducible example. It produces a hierarchical clustering of 50 observations and successively cuts the dendrogram in a for loop. The final output data frame 'cluster.grps' contains the cluster group affiliation for each observation at each successive cut (HC_2 = hierarchical clustering with 2 groups; HC_3 = hierarchical clustering with 3 groups; etc.).
set.seed(1)
data <- data.frame(replicate(10,sample(0:10,50,rep=TRUE))) # create random site x species dataframe
clust <- hclust(dist(data), method = "ward.D") # implement hierarchical clustering
# Set maximum number of groups
z <- 6
# Loop for successive tree cutting
lst <- list()
for (i in 2:z) {
# Slicing the dendrogram
grp <- cutree(clust, k = i) # k = number of groups
lst[[i - 1]] <- grp
}
names(lst) <- 2:z
cluster.grps <- as.data.frame(lst)
colnames(cluster.grps) <- paste("HC",as.character(2:(z)),sep ="_")
I now wish to attribute dichotomous names that represent the level of hierarchy in the tree: 1, 2 for the first level; 1.1, 1.2, 2.1, 2.2 for the second level; 1.1.1, 1.1.2, 1.2.1, 1.2.2, etc. for the third level and so on.
Ideally, the table 'cluster.grps' would look like this:
Site     HC_2   HC_3   HC_4
Site 1   1      1.1    1.1
Site 2   2      2      2.1
Site 3   1      1.2    1.2
Site 4   2      2      2.2
My first thought was to code nested clusterings: start with a first clustering of all observations into two groups, then split each of those groups independently into two further groups, yielding four groups at the second hierarchical level, and so on. That requires quite a lot of code, though, and I was wondering whether there might be a more elegant way.
Any thoughts?
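One possible base-R sketch (reusing clust and z from the code above): walk through the successive cutree() results and append a within-parent index only where a group actually splits, which reproduces the naming scheme of the table. The variable and column names here are illustrative, not a standard function:
labels <- as.character(cutree(clust, k = 2))        # first level: "1", "2"
out <- data.frame(HC_2 = labels)
for (k in 3:z) {
  grp <- cutree(clust, k = k)
  # number the distinct child groups 1, 2, ... within each parent label
  child <- ave(grp, labels, FUN = function(g) match(g, unique(g)))
  # extend a label only where its parent actually split into several children
  split.parents <- names(which(tapply(child, labels, max) > 1))
  labels <- ifelse(labels %in% split.parents,
                   paste(labels, child, sep = "."),
                   labels)
  out[[paste("HC", k, sep = "_")]] <- labels
}
head(out)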

Creating clusters via weighted randomization

I need to assign weights to a sample for a country. I already have the population for each of the 85 regions, but I cannot perform the cluster sampling. Basically, I need to create 100 clusters of 15 units each, 1500 respondents overall. I have an Excel file with all the variables for the 85 regions.
Question 1:
How can I use the already generated population probability to do a weighted randomization for 100 clusters (with 15 units each)?
Question 2:
I need to draw from the 85 regions and generate 100 clusters. Logically, the capital and some of the other big cities should have more than one cluster due to their higher population, which gives them a higher probability of receiving a cluster. So how can I draw the clusters (15 units each) and assign a number of clusters to the different regions? For instance, if the capital's cluster probability is 0.08, that means 8 of the 100 clusters (15 units each) should be assigned to the capital. How do I add that column?
Specifically, the problem with my current results is that I cannot generate the column with the number of clusters per region: for instance, region A should have 3 clusters, region B 1, and so forth.
Here is my code:
data1$clusProb1 = (data1$Population.2018)/sum(data1$Population.2018)
sampInd = c(1:length(data1$Federal.Subject), sample(1:length(data1$Federal.Subject), length(data1$Federal.Subject)*14, prob = data1$clusProb1, replace = TRUE))
sampFields = data.frame(id = 1:(length(data1$Federal.Subject)*15), Gender = sample(c(0,1), length(data1$Federal.Subject)*15, replace = TRUE))
sampleData = cbind(data1[sampInd,],sampFields)
sampleData
summary(sampleData)
The result should look like:
Cluster number   Region
1                A
2                A
3                A
4                C
5                D
6
NOTE: A represents a region with a higher population, which should therefore have more clusters assigned to it.
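A minimal sketch of one way to get that column, reusing the Population.2018 and Federal.Subject columns from the question's own code (the seed and object names are illustrative). A single multinomial draw allocates all 100 clusters across regions in proportion to population, and the result is expanded to one row per cluster:
set.seed(123)
n.clusters <- 100
clusProb <- data1$Population.2018 / sum(data1$Population.2018)
# one multinomial draw assigns all 100 clusters at once
data1$n.clusters <- as.vector(rmultinom(1, size = n.clusters, prob = clusProb))
# expand to one row per cluster, numbered 1..100
cluster.table <- data.frame(
  Cluster.number = seq_len(n.clusters),
  Region = rep(data1$Federal.Subject, times = data1$n.clusters)
)
head(cluster.table)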

Clustering - how to find the nearest to a cluster

Hints I got on a different question puzzled me quite a bit.
I got an exercise, actually part of a larger exercise:
Cluster some data, using hclust (done)
Given a totally new vector, find out which of the clusters obtained in step 1 it is nearest to.
According to the exercise, this should be doable in quite a short time.
However, after weeks I am puzzled whether this can be done at all, as apparently all I really get from hclust is a tree - and not, as I assumed, a number of clusters.
As I suppose I was unclear:
Say, for instance, I feed hclust a matrix that consists of 15 1×5 vectors: 5 times (1 1 1 1 1), 5 times (2 2 2 2 2) and 5 times (3 3 3 3 3). This should give me three quite distinct clusters of size 5; anyone could easily do that by hand. Is there a command I can use to actually find out from the program that there are 3 such clusters in my hclust object and what they contain?
You'll have to think about what the right metric is to define closeness to the cluster. Building on the example in the hclust doc, here's a way to compute the means for each cluster and then measure the distance between the new data point and the set of means.
# Leave out one state
A <- USArrests
B <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]
# Put the B data into 10 clusters
hc <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster <- memb[rownames(B)]
# Compute the averages over the clusters
M <- aggregate(. ~ cluster, data = B, FUN = mean)
M$cluster <- NULL
# Now add the held-out state to the set of averages
M <- rbind(M, KY)
# Compute the distance between the cluster means and the held-out state.
# This is a pretty silly way to do it, but it works.
D <- as.matrix(dist(as.matrix(M), diag = TRUE, upper = TRUE))["Kentucky", ]
names(D) <- rownames(M)
KYclust <- which.min(D[-length(D)])
memb[memb == KYclust]
# Now cluster the full set of states and compare the results.
hc <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a <- memb[names(memb) == "Kentucky"]
memb[memb == a]
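For the toy matrix described in the question, the same cutree() call recovers the three obvious groups directly (a quick check of the mechanism used above; the object names are just illustrative):
m <- rbind(matrix(1, 5, 5), matrix(2, 5, 5), matrix(3, 5, 5))  # three groups of five identical rows
toy <- cutree(hclust(dist(m)), k = 3)
toy          # one label (1, 2 or 3) per row
table(toy)   # three clusters of five rows each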
In contrast to k-means, clusters found by hclust can be of arbitrary shape, so the distance to the nearest cluster center is not always meaningful. Doing a 1-nearest-neighbor style assignment is probably better.
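A minimal base-R sketch of that 1-nearest-neighbor idea, sticking with the USArrests example from the answer above: the held-out state is assigned the cluster of its single closest clustered state.
A <- USArrests
B <- A[rownames(A) != "Kentucky", ]
KY <- unlist(A[rownames(A) == "Kentucky", ])
memb <- cutree(hclust(dist(B), "ave"), k = 10)
# Euclidean distance from the held-out state to every clustered state
d <- apply(B, 1, function(row) sqrt(sum((row - KY)^2)))
memb[names(which.min(d))]   # cluster label of the single nearest state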

How to calculate all pairwise distances in two dimensions

Say I have data concerning the positions of animals on a 2D plane (as determined by video monitoring from a camera directly overhead), for example a matrix with 15 rows (one for each animal) and 2 columns (x position and y position):
animal.ids<-letters[1:15]
xpos<-runif(15) # x coordinates
ypos<-runif(15) # y coordinates
raw.data.t1<-data.frame(xpos, ypos)
rownames(raw.data.t1) = animal.ids
I want to calculate all the pairwise distances between animals. That is, get the distance from animal a (row 1) to the animals in row 2, row 3, ... row 15, and then repeat that step for all rows, avoiding redundant distance calculations. The desired output of a function that does this would be the mean of all the pairwise distances. I should clarify that I mean the simple linear distance, from the formula d<-sqrt(((x1-x2)^2)+((y1-y2)^2)). Any help would be greatly appreciated.
Furthermore, how could this be extended to a similar matrix with an arbitrarily large even number of columns (every two columns representing x and y positions at a given time point). The goal here would be to calculate mean pairwise distances for every two columns and output a table with each time point and its corresponding mean pairwise distance. Here is an example of the data structure with 3 time points:
xpos1<-runif(15)
ypos1<-runif(15)
xpos2<-runif(15)
ypos2<-runif(15)
xpos3<-runif(15)
ypos3<-runif(15)
pos.data<-cbind(xpos1, ypos1, xpos2, ypos2, xpos3, ypos3)
rownames(pos.data) = letters[1:15]
The aptly named dist() will do this:
x <- matrix(rnorm(100), nrow=5)
dist(x)
         1        2        3        4
2 7.734978
3 7.823720 5.376545
4 8.665365 5.429437 5.971924
5 7.105536 5.922752 5.134960 6.677726
See ?dist for more details.
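For the second part of the question, a minimal sketch (assuming the pos.data matrix from above, with one x/y column pair per time point) applies dist() to each pair of columns and takes the mean of the resulting pairwise distances:
time.points <- ncol(pos.data) / 2
mean.pairwise <- sapply(seq_len(time.points), function(t) {
  xy <- pos.data[, c(2 * t - 1, 2 * t)]   # the x and y columns for time point t
  mean(dist(xy))                          # mean of all unique pairwise distances
})
data.frame(time = seq_len(time.points), mean.pairwise.distance = mean.pairwise)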
Why compute d<-sqrt(((x1-x2)^2)+((y1-y2)^2)) at all? If you only need to compare distances, work with the squared distance ((x1-x2)^2)+((y1-y2)^2) instead; skipping the square root costs you much less.

R: Statistics of distribution

I have the number of samples per unit and need to calculate statistics with R.
The table is like this (all rows and columns are actually filled with values; I only show a few here for easier visibility, and there are many more columns):
Hour       1    2    3    4
H1        72   11   98   65
H2        19   27
H3
H4
H5
:
H200000
I.e. in the first hour (H1) there were 72 samples of value 1, 11 samples of value 2, etc. In the second hour (H2) there were 19 samples of value 1, 27 samples of value 2, etc.
I need to calculate the mean and standard deviation per hour (i.e. per row). As there are many thousands of rows, I need a fast method.
Example: The manual mean-calculation for hour 1 (H1) would be:
(72*1 + 11*2 + 98*3 + 65*4) / (72 + 11 + 98 + 65) ≈ 2.63
I suppose there are R-methods or packages that can do this, but I fail to find where. Your support is highly appreciated.
Thanks,
Chris
You want to calculate a weighted mean, so you need weighted.mean. For the first row:
values <- c(1, 2, 3, 4)
weights <- c(72, 11, 98, 65)
weighted.mean(values, weights)
The weighted standard deviation is less standard, but since your weights are sample counts you can compute a frequency-weighted standard deviation around that mean by hand:
# same values and weights as above
wm <- weighted.mean(values, weights)
sqrt(sum(weights * (values - wm)^2) / sum(weights))
You should read your data into a table and iterate over every row. Also, "many thousands of rows" is not necessarily a large number for such a simple calculation. This is very basic stuff, maybe checking out a tutorial would also be beneficial.
You are much better off (i.e. faster calculations) using matrix operations instead of applying something by row. For example, assuming X is the matrix containing your counts, you can get the row-wise weighted means with a single matrix product:
v <- 1:ncol(X)                    # the sample values are the column indices 1, 2, 3, ...
wmeans <- (X %*% v) / rowSums(X)  # weighted sum of values divided by the total count per row
Assuming your table is a matrix called dataset with one row per hour and the sample values 1, 2, 3, ... as columns, you just need to apply weighted.mean over the rows, with each row of counts used as the weights:
values <- 1:ncol(dataset)
# The 1 as 2nd parameter indicates to apply the function on the rows
w.means <- apply(dataset, 1, function(counts) weighted.mean(values, w = counts))
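As a quick check against the hand calculation for H1 above, here is a hypothetical two-row matrix run through that apply() call (H2 padded with zeros for its missing columns):
X <- rbind(H1 = c(72, 11, 98, 65),
           H2 = c(19, 27, 0, 0))
values <- 1:ncol(X)
apply(X, 1, function(counts) weighted.mean(values, w = counts))
#       H1       H2
# 2.634146 1.586957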
