Ordering clustered points using Kmeans and R

I have a set of data (5000 points with 4 dimensions) that I have clustered using kmeans in R.
I want to order the points in each cluster by their distance to the center of that cluster.
Very simply, the data looks like this (I am using a subset to test out various approaches):
id Ans Acc Que Kudos
1 100 100 100 100
2 85 83 80 75
3 69 65 30 29
4 41 45 30 22
5 10 12 18 16
6 10 13 10 9
7 10 16 16 19
8 65 68 100 100
9 36 30 35 29
10 36 30 26 22
Firstly, I used the following method to cluster the dataset into 2 clusters:
(result <- kmeans(data, 2))
This returns a kmeans object that has the following components:
cluster, centers etc.
But I cannot figure out how to compare each point and produce an ordered list.
Secondly, I tried the seriation approach as suggested by another SO user here
I use these commands:
clus <- kmeans(scale(x, scale = FALSE), centers = 3, iter.max = 50, nstart = 10)
mns <- sapply(split(x, clus$cluster), function(x) mean(unlist(x)))
result <- x[order(order(mns)[clus$cluster]), ]
Which seems to produce an ordered list but if I bind it to the labeled clusters (using the following cbind command):
result <- cbind(x[order(order(mns)[clus$cluster]), ],clus$cluster)
I get the following result, which does not appear to be ordered correctly:
id Ans Acc Que Kudos clus
1 3 69 65 30 29 1
2 4 41 45 30 22 1
3 5 10 12 18 16 2
4 6 10 13 10 9 2
5 7 10 16 16 19 2
6 9 36 30 35 29 2
7 10 36 30 26 22 2
8 1 100 100 100 100 1
9 2 85 83 80 75 2
10 8 65 68 100 100 2
I don't want to be writing commands willy-nilly; I want to understand how the approach works. If anyone could help out or shed some light on this, it would be really great.
EDIT:::::::::::
As the clusters can be easily plotted, I'd imagine there is a more straightforward way to get and rank the distances between points and the center.
The centers for the above clusters (when using k = 2) are as follows. But I do not know how to get and compare this with each individual point.
Ans Accep Que Kudos
1 83.33333 83.66667 93.33333 91.66667
2 30.28571 30.14286 23.57143 20.85714
NB::::::::
I don't need to use kmeans, but I want to specify the number of clusters and retrieve an ordered list of points from those clusters.

Here is an example that does what you ask, using the first example from ?kmeans. It is probably not terribly efficient, but is something to build upon.
#Taken straight from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)
x <- cbind(x,cl = cl$cluster)
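#Function to apply to each cluster to
# do the ordering
orderCluster <- function(i,data,centers){
  #Extract cluster and center
  dt <- data[data[,3] == i, , drop = FALSE]
  ct <- centers[i,]
  #Calculate squared distances; sweep() subtracts the center from each row
  #(a plain dt[,1:2] - ct would recycle ct down the columns instead)
  dt <- cbind(dt,dist = rowSums(sweep(dt[,1:2, drop = FALSE], 2, ct)^2))
  #Sort
  dt[order(dt[,4]), , drop = FALSE]
}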
do.call(rbind,lapply(sort(unique(cl$cluster)),orderCluster,data = x,centers = cl$centers))
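A more compact variant of the same idea, as a minimal sketch only: it assumes the sample data from the question (typed in by hand here as `dat`) and computes each point's Euclidean distance to its own cluster center in one vectorised step. The names `dat`, `km`, `pts` and `res` are made up for this illustration.

```r
# Sample data from the question, entered by hand
dat <- data.frame(
  id    = 1:10,
  Ans   = c(100, 85, 69, 41, 10, 10, 10, 65, 36, 36),
  Acc   = c(100, 83, 65, 45, 12, 13, 16, 68, 30, 30),
  Que   = c(100, 80, 30, 30, 18, 10, 16, 100, 35, 26),
  Kudos = c(100, 75, 29, 22, 16, 9, 19, 100, 29, 22)
)

set.seed(42)  # kmeans starts from random centers
km <- kmeans(dat[, -1], centers = 2, nstart = 10)

# km$centers[km$cluster, ] repeats the matching center row for every point,
# so the subtraction below is row-by-row
pts <- as.matrix(dat[, -1])
d <- sqrt(rowSums((pts - km$centers[km$cluster, ])^2))

# Attach cluster label and distance, then sort within each cluster
res <- cbind(dat, cluster = km$cluster, dist = d)
res <- res[order(res$cluster, res$dist), ]
res
```

As a sanity check, the sum of the squared distances should match `km$tot.withinss`, since that is the quantity kmeans minimises.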

Related

test for in-tile for Dirichlet tile, using R

So I can take points and use the R libraries deldir or spatstat::dirichlet to find the Dirichlet tessellation of those points.
Now I have a point not in the set, and I want to know the indices of the points forming the dirichlet tile which my not-in-set-point is interior to. I can get there by knowing the tile label (or index).
Are there any libraries or methods to do this? I'm thinking spatstat, but not finding something there yet.
The function cut.ppp() can take a point pattern and find which tessellation
tile each point in the pattern belongs to. Below is the code for a simple
example of a point pattern that only contains a single point (0.5, 0.5).
library(spatstat)
dd <- dirichlet(cells)
plot.tess(dd, do.labels = TRUE)
xx <- ppp(.5, .5, window = Window(dd))
plot(xx, add = TRUE, col = "red", cex = 2, pch = 20)
yy <- cut(xx, dd)
yy
#> Marked planar point pattern: 1 point
#> Multitype, with levels =
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#> 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
#> window: rectangle = [0, 1] x [0, 1] units
marks(yy)
#> [1] 18
#> 42 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 42
Created on 2018-12-03 by the reprex package (v0.2.1)
If X is a point pattern and B is a tessellation, then
M <- marks(cut(X, B))
returns a factor (vector of categorical values) identifying which tile contains each of the points of X. Alternatively,
M <- tileindex(X$x, X$y, B)
or
f <- as.function(B)
M <- f(X)
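Since a Dirichlet (Voronoi) tile is, by definition, the set of locations closer to its generating point than to any other, the containing tile can also be found without any package by looking for the nearest generating point. The following is a rough base-R sketch; the coordinates and the helper name `nearest_tile` are hypothetical, and it assumes tiles are numbered in the same order as the generating points. Locations exactly on a tile boundary are assigned to the lowest index.

```r
# Hypothetical generating points (not the 'cells' data set used above)
px <- c(0.1, 0.8, 0.5, 0.3)
py <- c(0.2, 0.3, 0.9, 0.6)

# For each query location, return the index of the nearest generating point,
# which is the index of the Dirichlet tile containing that location
nearest_tile <- function(qx, qy, px, py) {
  vapply(seq_along(qx), function(k) {
    which.min((px - qx[k])^2 + (py - qy[k])^2)
  }, integer(1))
}

tile_idx <- nearest_tile(c(0.5, 0.15), c(0.5, 0.25), px, py)
tile_idx
```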

multiplying columns in R

I have a data frame like this.
> abc
ID 1.x 2.x 1.y 2.y
1 4 10 20 30 40
2 16 5 10 5 10
3 42 16 17 18 19
4 91 20 20 20 20
5 103 103 42 56 84
How do I create two additional columns '1' and '2' by multiplying 1.x * 1.y and 2.x * 2.y in a generalized way?
I am trying to get a generalized solution where the number of columns can be very large. So I want to multiply every x column with the matching y column. While x and y are fixed, n has to be figured out from the data frame.
For simplicity let's assume n is also fixed, although it is a large number.
One thing I can try is:
abc[,c(6,7)]=abc[,c(2,3)]*abc[,c(4,5)]
It will work only if the column positions are contiguous. This is good enough for me. If anyone has a more general solution, it will benefit us all.
If there are only a couple of variables to multiply, we can do this with transform by multiplying the columns of interest
transform(abc, new1 = `1.x`*`1.y`, new2 = `2.x`*`2.y`, check.names = FALSE)
# ID 1.x 2.x 1.y 2.y new1 new2
#1 4 10 20 30 40 300 800
#2 16 5 10 5 10 25 100
#3 42 16 17 18 19 288 323
#4 91 20 20 20 20 400 400
#5 103 103 42 56 84 5768 3528
If we have lots of columns, then one approach is to split the dataset into a list of data.frames by taking the substring of the names, and then loop through the list and multiply the columns of each piece elementwise with do.call
abc[paste0("new", 1:2)] <- lapply(split.default(abc[-1],
sub("\\.[a-z]+$", "", names(abc)[-1])), function(x) do.call(`*`, x))
Or another option is (based on the pairwise column multiplication)
apply(aperm(array(unlist(abc[-1]), c(5, 2, 2)),
c(3, 1, 2)), 3, matrixStats::colProds)
With dplyr, mutate() keeps the original variables and adds the new ones; mutate_all() applies a function to every column of the data frame.
abc %>%
mutate(new_vary1 = `1.x`* `2.x`,
new_vary2 = `1.y`* `2.y`) %>%
mutate_all(funs(.*`1.x`))
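When the column positions are not contiguous, the pairs can be matched by name instead of by position. The sketch below assumes every `n.x` column has a matching `n.y` column and rebuilds the sample data from the question by hand; the helper names `x_cols`, `y_cols` and `new_cols` are made up for this illustration.

```r
# Sample data from the question, entered by hand
abc <- data.frame(
  ID    = c(4, 16, 42, 91, 103),
  `1.x` = c(10, 5, 16, 20, 103),
  `2.x` = c(20, 10, 17, 20, 42),
  `1.y` = c(30, 5, 18, 20, 56),
  `2.y` = c(40, 10, 19, 20, 84),
  check.names = FALSE
)

# Find all ".x" columns and derive the matching ".y" and result names
x_cols   <- grep("\\.x$", names(abc), value = TRUE)
y_cols   <- sub("\\.x$", ".y", x_cols)
new_cols <- sub("\\.x$", "", x_cols)

# Multiply the two blocks elementwise and store them as columns "1", "2", ...
abc[new_cols] <- abc[x_cols] * abc[y_cols]
abc
```

Because the pairing goes through the names, the order of the columns in the data frame does not matter.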

Ceil and floor values in R

I have a data.table of integers with values between 1 and 60.
My question is about flooring or ceiling any number to the following values: 12 18 24 30 36 ... 60.
For example, let's say my data.table contains the number 13. I want R to "transform" this number into 12 and 18 as 13 lies in between those numbers. Moreover, if I have 18 I want R to keep it at 18.
If my data.table contains the value 50, I want R to convert that number into 48 and 54 and so on.
My goal is to get two different data.tables. One where the floored values are saved and one where the ceiled values are saved.
Any idea how one could do this in R?
EDIT: Numbers smaller than 12 should always be transformed to 12.
Example output:
If I have the following data.table data.table(c(1,28,29,41,53,53,17,41,41,53))
I want the following two output data.tables: floored values data.table(c(12,24,24,36,48,48,12,36,36,48))
I want the following two output data.tables: ceiled values data.table(c(12,30,30,42,54,54,18,42,42,54))
Here is a fairly direct way (edited to round up to 12 if any values are below):
df <- data.frame(nums = 10:20)
df$floors <- with(df,pmax(12,6*floor(nums/6)))
df$ceils <- with(df,pmax(12,6*ceiling(nums/6)))
Leading to:
> df
nums floors ceils
1 10 12 12
2 11 12 12
3 12 12 12
4 13 12 18
5 14 12 18
6 15 12 18
7 16 12 18
8 17 12 18
9 18 18 18
10 19 18 24
11 20 18 24
Here's a way we could do this, using sapply and the which.min functions. From your question, it's not immediately clear how values < 12 should be handled.
x <- 1:60
num_list <- seq(12, 60, 6)
floorr <- sapply(x, function(x){
diff_vec <- x - num_list
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
ceill <- sapply(x, function(x){
diff_vec <- num_list - x
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
tail(cbind(x, floorr, ceill))
x floorr ceill
[55,] 55 54 60
[56,] 56 54 60
[57,] 57 54 60
[58,] 58 54 60
[59,] 59 54 60
[60,] 60 60 60
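Another option, shown here only as a sketch, is base R's findInterval(), which looks up each value in the sorted vector of break points. It uses the example vector from the question and assumes all values are at most 60; the names `grid`, `floored` and `ceiled` are made up for this illustration.

```r
x    <- c(1, 28, 29, 41, 53, 53, 17, 41, 41, 53)  # example from the question
grid <- seq(12, 60, by = 6)                        # 12 18 24 ... 60

# Floor: index of the last grid value <= x (0 for x < 12, so clamp to 1)
floored <- grid[pmax(findInterval(x, grid), 1)]

# Ceiling: with left.open = TRUE the result counts grid values < x,
# so adding 1 gives the first grid value >= x
ceiled <- grid[findInterval(x, grid, left.open = TRUE) + 1]

floored
ceiled
```

The results can then be wrapped in data.table() to get the two tables asked for.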

Performance for calculating the distance between two positions on a tree?

Here is a tree. The first column is an identifier for the branch, where 0 is the trunk, L is the first branch on the left and R is the first branch on the right. LL is the branch on the extreme left after the second bifurcation, etc. The variable length contains the length of each branch.
> tree
branch length
1 0 20
2 L 12
3 LL 19
4 R 19
5 RL 12
6 RLL 10
7 RLR 12
8 RR 17
tree = data.frame(branch = c("0","L", "LL", "R", "RL", "RLL", "RLR", "RR"), length=c(20,12,19,19,12,10,12,17))
tree$branch = as.character(tree$branch)
and here is a drawing of this tree
Here are two positions on this tree
posA = tree[4,]; posA$length = 12
posB = tree[6,]; posB$length = 3
The positions are given by the branch ID and the distance (variable length) to the origin of the branch (more info in edits).
I wrote the following messy distance function to calculate the shortest distance along the branches between any two points on the tree. The shortest distance along the branches can be understood as the minimal distance an ant would need to walk along the branches to reach one position from the other position.
distance = function(tree, pos1, pos2){
if (identical(pos1$branch, pos2$branch)){Dist=pos1$length-pos2$length;return(Dist)}
pos1path = strsplit(pos1$branch, "")[[1]]
if (pos1path[1]!="0") {pos1path = c("0", pos1path)}
pos2path = strsplit(pos2$branch, "")[[1]]
if (pos2path[1]!="0") {pos2path = c("0", pos2path)}
loop = 1:min(length(pos1path), length(pos2path))
loop = loop[-which(loop == 1)]
CommonTrace="included"; for (i in loop) {
if (pos1path[i] != pos2path[i]) {
CommonTrace = i-1; break
}
}
if(CommonTrace=="included"){
CommonTrace = min(length(pos1path), length(pos2path))
if (length(pos1path) > length(pos2path)) {
longerpos = pos1; shorterpos = pos2; longerpospath = pos1path
} else {
longerpos = pos2; shorterpos = pos1; longerpospath = pos2path
}
distToNode = 0
if ((CommonTrace+1) != length(longerpospath)){
for (i in (CommonTrace+1):(length(longerpospath)-1)){
distToNode = distToNode + tree$length[tree$branch == paste0(longerpospath[2:i], collapse='')]
}
}
Dist = distToNode + longerpos$length + (tree[tree$branch == shorterpos$branch,]$length-shorterpos$length)
if (identical(shorterpos, pos1)){Dist=-Dist}
return(Dist)
} else { # if they are sister branches
Dist=0
if((CommonTrace+1) != length(pos1path)){
for (i in (CommonTrace+1):(length(pos1path)-1)){
Dist = Dist + tree$length[tree$branch == paste0(pos1path[2:i], collapse='')]
}
}
if((CommonTrace+1) != length(pos2path)){
for (i in (CommonTrace+1):(length(pos2path)-1)){
Dist = Dist + tree$length[tree$branch == paste(pos2path[2:i], collapse='')]
}
}
Dist = Dist + pos1$length + pos2$length
return(Dist)
}
}
I think the algorithm works fine but it is not very efficient. Note that the sign of the distance is important. The sign only makes sense when the two positions are not found on "sister branches", that is, only if one of the two positions lies on the path between the root and the other position.
distance(tree, posA, posB) # -22
I then just loop through all positions of interest like that:
allpositions=rbind(tree, tree)
allpositions$length = c(1,5,8,2,2,3,5,6,7,8,2,3,1,2,5,6)
mat = matrix(-1, ncol=nrow(allpositions), nrow=nrow(allpositions))
for (i in 1:nrow(allpositions)){
for (j in 1:nrow(allpositions)){
posA = allpositions[i,]
posB = allpositions[j,]
mat[i,j] = distance(tree, posA, posB)
}
}
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1 0 -24 -39 -21 -40 -53 -55 -44 -6 -27 -33 -22 -39 -52 -55 -44
# 2 24 0 -15 7 26 39 41 30 18 -3 -9 8 25 38 41 30
# 3 39 15 0 22 41 54 56 45 33 12 6 23 40 53 56 45
# 4 21 7 22 0 -19 -32 -34 -23 15 10 16 -1 -18 -31 -34 -23
# 5 40 26 41 19 0 -13 -15 8 34 29 35 18 1 -12 -15 8
# 6 53 39 54 32 13 0 8 21 47 42 48 31 14 1 8 21
# 7 55 41 56 34 15 8 0 23 49 44 50 33 16 7 0 23
# 8 44 30 45 23 8 21 23 0 38 33 39 22 7 20 23 0
# 9 6 -18 -33 -15 -34 -47 -49 -38 0 -21 -27 -16 -33 -46 -49 -38
# 10 27 3 -12 10 29 42 44 33 21 0 -6 11 28 41 44 33
# 11 33 9 -6 16 35 48 50 39 27 6 0 17 34 47 50 39
# 12 22 8 23 1 -18 -31 -33 -22 16 11 17 0 -17 -30 -33 -22
# 13 39 25 40 18 -1 -14 -16 7 33 28 34 17 0 -13 -16 7
# 14 52 38 53 31 12 -1 7 20 46 41 47 30 13 0 7 20
# 15 55 41 56 34 15 8 0 23 49 44 50 33 16 7 0 23
# 16 44 30 45 23 8 21 23 0 38 33 39 22 7 20 23 0
As an example, let's consider the first and the third positions in the object allpositions. The distance between them is 39 (and -39) because an ant would need to walk 19 units on branch 0 and then walk 12 units on branch L and finally the ant would need to walk 8 units on branch LL. 19 + 12 + 8 = 39
The issue is that I have about 20 very big trees with about 50000 positions and I would like to calculate the distance between any two positions. There are therefore 20 * 50000^2 distances to compute. It takes forever! Can you help me to improve my code?
EDIT
Please let me know if anything is still unclear
tree is a description of a tree. The tree has branches of a certain length. The name of the branches (variable: branch) gives indication about the relationship between the branches. The branch RL is a "parent branch" of the two branches RLL and RLR, where R and L stand for right and left.
allpositions is a data.frame, where each line represents one independent position on the tree. You can think of the position of a squirrel. The position is defined by two pieces of information: 1) the branch (variable: branch) on which the squirrel is standing and 2) the distance between the beginning of the branch and the position of the squirrel (variable: length).
Three examples
Consider a first squirrel that is at position (variable: length) 8 on the branch RL (whose length is 12) and a second squirrel that is at position (variable: length) 2 on the branch RLL or RLR. The distance between the two squirrels is 12 - 8 + 2 = 6 (or -6).
Consider a first squirrel that is at position (variable: length) 8 on the branch RL and a second squirrel that is at position (variable: length) 2 on the branch RR. The distance between the two squirrels is 8 + 2 = 10 (or -10).
Consider a first squirrel that is at position (variable: length) 8 on the branch R (whose length is 19) and a second squirrel that is at position (variable: length) 2 on the branch RLL. Knowing that branch RL has a length of 12, the distance between the two squirrels is 19 - 8 + 12 + 2 = 25 (or -25).
The code below uses the igraph package to compute the distances between positions in tree and seems noticeably faster than the code you posted in your question. The approach is to create graph vertices at branch intersections and at positions along tree branches at the positions specified in allpositions. Graph edges are the branch segments between these vertices. It uses igraph to build a graph for the tree and allpositions and then finds the distances between the vertices corresponding to allposition data.
t.graph <- function(tree, positions) {
library(igraph)
# Assign vertex name to tree branch intersections
n_label <- nchar(tree$branch)
tree$high_vert <- tree$branch
tree$low_vert <- tree$branch
tree$brnch_type <- "tree"
for( i in 1:nrow(tree) ) {
tree$low_vert[i] <- if(n_label[i] > 1) substr(tree$branch[i], 1, n_label[i]-1)
else { if(tree$branch[i] %in% c("R","L")) "0"
else "root" }
}
# combine position data with tree data
positions$brnch_type <- "position"
temp <- merge(positions, tree, by = "branch")
positions <- temp[, c("branch","length.x","high_vert","low_vert","brnch_type.x")]
positions$high_vert <- paste(positions$branch, positions$length.x, sep="_")
colnames(positions) <- c("branch","length","high_vert","low_vert","brnch_type")
tree <- rbind(tree, positions)
# use positions to segment tree branches
tree_brnch <- split(tree, tree$branch)
tree <- data.frame( branch=NA_character_, length = NA_real_, high_vert = NA_character_,
low_vert = NA_character_, brnch_type =NA_character_, seg_len= NA_real_)
for( ib in 1: length(tree_brnch)) {
brnch_seg <- tree_brnch[[ib]][order(tree_brnch[[ib]]$length, decreasing=TRUE), ]
n_seg <- nrow(brnch_seg)
brnch_seg$seg_len <- brnch_seg$length
for( is in seq_len(n_seg-1) ) { # seq_len() skips the loop for a branch without positions
brnch_seg$seg_len[is] <- brnch_seg$length[is] - brnch_seg$length[is+1]
brnch_seg$low_vert[is] <- brnch_seg$high_vert[is+1]
}
tree <- rbind(tree, brnch_seg)
}
tree <- tree[-1,]
# Create graph of tree and positions
tree_graph <- graph.data.frame(tree[,c("low_vert","high_vert")])
E(tree_graph)$label <- tree$high_vert
E(tree_graph)$brnch_type <- tree$brnch_type
E(tree_graph)$weight <- tree$seg_len
# calculate shortest distances between position vertices
position_verts <- V(tree_graph)[grep("_", V(tree_graph)$name)]
vert_dist <- shortest.paths(tree_graph, v=position_verts, to=position_verts, mode="all")
return(dist_mat= vert_dist )
}
I've benchmarked igraph code ( the t.graph function) against the code posted in your question by making a function named Remi for your code over allposition data using your distance function. Sample trees were created as extensions of your tree and allpositions data for trees of 64, 256, and 2048 branches and allpositions equal to twice these sizes. Comparisons of execution times are shown below. Notice that times are in milliseconds.
microbenchmark(matR8 <- Remi(tree, allpositions), matG8 <- t.graph(tree, allpositions),
matR256 <- Remi(tree256, allpositions256), matG256 <- t.graph(tree256, allpositions256), times=2)
Unit: milliseconds
expr min lq mean median uq max neval
matR8 <- Remi(tree, allpositions) 58.82173 58.82173 59.92444 59.92444 61.02714 61.02714 2
matG8 <- t.graph(tree, allpositions) 11.82064 11.82064 13.15275 13.15275 14.48486 14.48486 2
matR256 <- Remi(tree256, allpositions256) 114795.50865 114795.50865 114838.99490 114838.99490 114882.48114 114882.48114 2
matG256 <- t.graph(tree256, allpositions256) 379.54559 379.54559 379.76673 379.76673 379.98787 379.98787 2
Compared to the code you posted, the igraph results are only about 5 times faster for the 8 branch case but are over 300 times faster for 256 branches so igraph seems to scale better with size. I've also benchmarked the igraph code for the 2048 branch case with the following results. Again times are in milliseconds.
microbenchmark(matG8 <- t.graph(tree, allpositions), matG64 <- t.graph(tree64, allpositions64),
matG256 <- t.graph(tree256, allpositions256), matG2k <- t.graph(tree2k, allpositions2k), times=2)
Unit: milliseconds
expr min lq mean median uq max neval
matG8 <- t.graph(tree, allpositions) 11.78072 11.78072 12.00599 12.00599 12.23126 12.23126 2
matG64 <- t.graph(tree64, allpositions64) 73.29006 73.29006 73.49409 73.49409 73.69812 73.69812 2
matG256 <- t.graph(tree256, allpositions256) 377.21756 377.21756 410.01268 410.01268 442.80780 442.80780 2
matG2k <- t.graph(tree2k, allpositions2k) 11311.05758 11311.05758 11362.93701 11362.93701 11414.81645 11414.81645 2
so the distance matrix for about 4000 positions is calculated in less than 12 seconds.
t.graph returns the distance matrix, with the rows and columns labeled as branch name, underscore, position on the branch, so for example
0_7 0_1 L_8 L_5 LL_8 LL_2 R_3 R_2 RL_2 RL_1 RLL_3 RLL_2 RLR_5 RR_6
L_5 18 24 3 0 15 9 8 7 26 25 39 38 41 30
shows the distances from L_5, the position 5 units along the L branch, to the other positions.
I don't know that this will handle your largest cases, but it may be helpful for some. You also have problems with the storage requirements for your largest cases.
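For comparison, here is a minimal base-R sketch of a different idea (not the igraph method above, and not benchmarked): precompute for every branch how far its start is from the root, then get the distance between two positions from the deepest branch they share. It returns the unsigned distance only and assumes the single-character L/R naming from the question. The names `full_path`, `start_depth`, `end_depth` and `tree_dist` are made up for this illustration.

```r
tree <- data.frame(branch = c("0", "L", "LL", "R", "RL", "RLL", "RLR", "RR"),
                   length = c(20, 12, 19, 19, 12, 10, 12, 17),
                   stringsAsFactors = FALSE)

# Full path from the root: "0" for the trunk, "0RL" for branch RL, etc.
full_path <- function(b) ifelse(b == "0", "0", paste0("0", b))
len <- setNames(tree$length, full_path(tree$branch))

# Distance from the root to the start of each branch (sum of ancestor lengths)
start_depth <- sapply(names(len), function(p) {
  n <- nchar(p)
  if (n == 1) return(0)
  sum(len[substring(p, 1, 1:(n - 1))])
})
end_depth <- start_depth + len

tree_dist <- function(b1, l1, b2, l2) {
  p1 <- full_path(b1); p2 <- full_path(b2)
  d1 <- start_depth[[p1]] + l1
  d2 <- start_depth[[p2]] + l2
  # Same branch, or one branch lies on the way to the other
  if (startsWith(p1, p2) || startsWith(p2, p1)) return(abs(d1 - d2))
  # Otherwise walk down to the fork at the end of the deepest shared branch
  c1 <- strsplit(p1, "")[[1]]; c2 <- strsplit(p2, "")[[1]]
  n <- min(length(c1), length(c2))
  k <- which(c1[1:n] != c2[1:n])[1] - 1  # length of the common prefix
  fork <- end_depth[[substr(p1, 1, k)]]
  (d1 - fork) + (d2 - fork)
}

# The three squirrel examples from the question
tree_dist("RL", 8, "RLL", 2)
tree_dist("RL", 8, "RR", 2)
tree_dist("R", 8, "RLL", 2)
```

The depth table is built once per tree, so each pairwise distance only needs a prefix comparison and a few lookups. The storage problem for a full 50000 x 50000 matrix remains.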

R: applying a function on whole dataset to find points within a circle

I am having difficulty applying my function to a data frame in R. I have a data.frame with three columns: the ID of a point, its location on the x axis and its location on the y axis. For a given point, I need to find the IDs of the points that lie in its neighborhood. I wrote a function that checks whether a point lies within a circle centered on the observed point and returns its ID if so.
Here is my code:
point_id <- locationdata$point_id
x_loc <- locationdata$x_loc
y_loc <- locationdata$y_loc
locdata <- data.frame(point_id, x_loc, y_loc)
#radius set to 1km
incircle3 <- function(x_loc, y_loc, center_x, center_y, pointid, r = 1000000){
dx = (x_loc-center_x)
dy = (y_loc-center_y)
if (b <- dx^2 + dy^2 <= r^2){
print(pointid)} ##else {print('')}
}
Unfortunately I don't know how to apply this function to the whole data frame, so that once I enter the location of the observed point it returns the IDs of all points that lie in its neighborhood. Ideally I would find this relation for all points automatically, so that it returns the points that lie in the neighborhood of each point in the dataset. Previously I have been entering center_x and center_y manually.
Thank you very much in advance for your advice!
You can tackle this with R's dist function:
# set the random seed and create some dummy data
set.seed(101)
dummy <- data.frame(id=1:100, x=runif(100), y=runif(100))
> head(dummy)
id x y
1 1 0.37219838 0.12501937
2 2 0.04382482 0.02332669
3 3 0.70968402 0.39186128
4 4 0.65769040 0.85959857
5 5 0.24985572 0.71833452
6 6 0.30005483 0.33939503
Call the dist function which returns a dist object. The default distance metric is Euclidean which is what you have coded in your question.
dists <- dist(dummy[,2:3])
Loop over the distance matrix and return the indices for each id that are within some constant distance:
neighbors <- apply(as.matrix(dists), 1, function(x) which(x < 0.33))
> neighbors[[1]]
1 6 7 8 19 23 30 32 33 34 42 44 46 51 55 87 88 91 94 99
Here's a modification to handle volatile ids:
set.seed(101)
dummy <- data.frame(id=sample(1:100, 100), x=runif(100), y=runif(100))
> head(dummy)
id x y
1 38 0.12501937 0.60567568
2 5 0.02332669 0.56259740
3 70 0.39186128 0.27685556
4 64 0.85959857 0.22614243
5 24 0.71833452 0.98355758
6 29 0.33939503 0.09838715
dists <- dist(dummy[,2:3])
neighbors <- apply(as.matrix(dists), 1, function(x) {
dummy$id[which(x < 0.33)]
})
names(neighbors) <- dummy$id
> neighbors[['38']]
[1] 38 5 55 80 63 76 17 71 47 11 88 13 41 21 36 31 73 61 99 59 39 89 94 12 18 3
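If the full distance matrix from dist() is too large to hold in memory, the circle test from the question can be vectorised for one center at a time and then applied to every point. This is only a sketch with made-up coordinates; the column names follow the question, while `neighbours_of` and `all_neighbours` are hypothetical names.

```r
# Small made-up data set using the column names from the question
locdata <- data.frame(
  point_id = c(10, 20, 30, 40),
  x_loc    = c(0, 3, 6, 0),
  y_loc    = c(0, 4, 8, 5.1)
)

# IDs of all points within radius r of the point center_id (excluding itself)
neighbours_of <- function(locdata, center_id, r) {
  ctr <- locdata[locdata$point_id == center_id, ]
  d2 <- (locdata$x_loc - ctr$x_loc)^2 + (locdata$y_loc - ctr$y_loc)^2
  locdata$point_id[d2 <= r^2 & locdata$point_id != center_id]
}

# Apply to every point; the result is a list named by point id
all_neighbours <- lapply(locdata$point_id, neighbours_of,
                         locdata = locdata, r = 5)
names(all_neighbours) <- locdata$point_id
all_neighbours
```

This still compares every pair of points, but only one row of distances is kept in memory at any time.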
