I have experimented a little with SOMs. First I used MiniSOM in Python, but I was not impressed, so I switched to the kohonen package in R, which offers more features. I applied SOM to three use cases: (1) clustering 2D generated data, (2) clustering higher-dimensional data (the built-in wine data set), and (3) outlier detection. I solved all three use cases, but I would like to raise a question in connection with the outlier detection I applied. For this purpose I used the vector som$distances, which contains one distance per row of the input data set; rows with unusually large distances can be outliers. However, I do not know how this distance is computed. The package documentation (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf) only describes this field as "distance to the closest unit".
Could you please tell me how this distance is computed?
Could you also comment on the outlier detection I used? How would you have done it? (On the generated data set it really does find the outliers. In the real wine data set there are four values that stand out clearly among the 177 wines; see the charts below. I quite like the idea of using bar charts to depict this.)
Charts:
Generated data: 100 points in 2D, in 5 distinct clusters, plus 2 outliers (category 6 marks the outliers):
Distances shown for all 102 data points; the last two are the outliers, and they were correctly identified. I repeated the test with 500 and 1,000 data points, again adding only 2 outliers, and the outliers were found in those cases as well.
Distances for the real wine data set with potential outliers:
The row id of the potential outliers:
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
It produces the following output:
index value
59 59 12.22916
110 110 13.41211
121 121 15.86576
158 158 11.50079
My annotated code snippet:
library(kohonen)
library(ggplot2)

data(wines)
scaled_wines <- scale(wines)
# creating and training the SOM
som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))
summary(som.wines)
# looking for outliers; distances = distance to the closest unit
som.wines$distances
len <- length(som.wines$distances)
index_in_vector <- c(1:len)
df_wine <- data.frame(cbind(index_in_vector, som.wines$distances))
colnames(df_wine) <- c("index", "value")
po <- ggplot(df_wine, aes(index, value)) + geom_bar(stat = "identity")
po <- po + ggtitle("Outliers?") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab("Distances in som.wines$distances") +
  xlab("Number of Rows in the Data Set")
plot(po)
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
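As an aside, instead of reading the threshold of 10 off the bar chart, the cut-off could also be derived from the distance distribution itself. A minimal sketch (my own suggestion, not part of the original analysis) using the usual boxplot/IQR rule:
# hypothetical data-driven cut-off instead of the hard-coded 10
cutoff <- quantile(df_wine$value, 0.75) + 1.5 * IQR(df_wine$value)
df_wine[df_wine$value > cutoff, ]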
Further Code Samples
Following the discussion in the comments, I am also posting the code snippets that were asked for. As far as I remember, I constructed the code lines responsible for the clustering based on examples in the kohonen package documentation (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf). However, I am not completely sure; it was more than a year ago. The code is provided as is, without any warranty :-). Please bear in mind that a particular clustering approach may perform with different accuracy on different data. I would also recommend comparing it with t-SNE on the wine data set (data(wines), available in R), and implementing heat-maps to show how the data are distributed with regard to the individual variables. (For the 2-variable example above this is not important, but it would be nice for the wine data set.)
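For those heat-maps, a possible sketch using the "property" plot type of the kohonen package, assuming the som.wines object from the snippet above (getCodes() is available in recent kohonen versions; older versions expose the codebook as som.wines$codes):
# one heat-map per variable of the codebook vectors
codes <- getCodes(som.wines)
par(mfrow = c(3, 5))
for (i in seq_len(ncol(codes))) {
  plot(som.wines, type = "property", property = codes[, i], main = colnames(codes)[i])
}
par(mfrow = c(1, 1))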
Data Generation with Five Clusters and 2 Outliers and Plotting
library(stats)
library(ggplot2)
library(kohonen)
generate_data <- function(num_of_points, num_of_clusters, outliers = TRUE) {
  num_of_points_per_cluster <- num_of_points / num_of_clusters
  cat(sprintf("#### num_of_points_per_cluster = %s, num_of_clusters = %s \n",
              num_of_points_per_cluster, num_of_clusters))
  arr <- array()
  standard_dev_y <- 6000
  standard_dev_x <- 2
  # for reproducibility, seed the random generator
  set.seed(10)
  for (i in 1:num_of_clusters) {
    centroid_y <- runif(1, min = 10000, max = 200000)
    centroid_x <- runif(1, min = 20, max = 70)
    cat(sprintf("centroid_x = %s, centroid_y = %s \n", centroid_x, centroid_y))
    vector_y <- rnorm(num_of_points_per_cluster, mean = centroid_y, sd = standard_dev_y)
    vector_x <- rnorm(num_of_points_per_cluster, mean = centroid_x, sd = standard_dev_x)
    cluster <- array(c(vector_y, vector_x), dim = c(num_of_points_per_cluster, 2))
    cluster <- cbind(cluster, i)
    arr <- rbind(arr, cluster)
  }
  if (outliers) {
    # adding two outliers (labelled as category 6)
    arr <- rbind(arr, c(10000, 30, 6))
    arr <- rbind(arr, c(150000, 70, 6))
  }
  colnames(arr) <- c("y", "x", "Cluster")
  # workaround to remove the initial NA row created by array()
  arr <- na.omit(arr)
  return(arr)
}
scatter_plot_data <- function(data_in, couloring_base_indx, main_label) {
  df <- data.frame(data_in)
  # name the first three columns; extra columns (e.g. predicted clusters) keep their names
  colnames(df)[1:3] <- c("y", "x", "Cluster")
  pl <- ggplot(data = df, aes(x = x, y = y)) +
    geom_point(aes(color = factor(df[, couloring_base_indx])))
  pl <- pl + ggtitle(main_label) + theme(plot.title = element_text(hjust = 0.5))
  print(pl)
}
##################
# generating data
data <- generate_data(100, 5, TRUE)
print(data)
scatter_plot_data(data, couloring_base_indx = 3, "Original Clusters with Outliers \n 102 Points")
Preparation, Clustering and Plotting
I used a hierarchical clustering approach on top of the Kohonen map (SOM).
normalising_data <- function(data) {
  # normalising the data points, not the cluster identifiers
  mtrx <- data.matrix(data)
  umtrx <- scale(mtrx[, 1:2])
  umtrx <- cbind(umtrx, factor(mtrx[, 3]))
  colnames(umtrx) <- c("y", "x", "Cluster")
  return(umtrx)
}
train_som <- function(umtrx) {
  # unsupervised learning
  set.seed(7)
  g <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
  # map <- som(umtrx[, 1:2], grid = g, alpha = c(0.005, 0.01), radius = 1, rlen = 1000)
  map <- som(umtrx[, 1:2], grid = g)
  summary(map)
  return(map)
}
plot_som_data <- function(map) {
  par(mfrow = c(3, 2))
  # plot some characteristics of the SOM map
  plot(map, type = 'changes')
  plot(map, type = 'codes', main = "Mapping Data")
  plot(map, type = 'count')
  plot(map, type = 'mapping')  # how many data points are held by each neuron
  plot(map, type = 'dist.neighbours')  # the darker the colours, the closer the points; the lighter the colours, the more distant the points
  # switch the plot configuration back to normal
  par(mfrow = c(1, 1))
}
plot_disstances_to_the_closest_point <- function(map) {
  # to see which neuron is assigned to which value
  map$unit.classif
  # looking for outliers; distances = distance to the closest unit
  map$distances
  max(map$distances)
  len <- length(map$distances)
  index_in_vector <- c(1:len)
  df <- data.frame(cbind(index_in_vector, map$distances))
  colnames(df) <- c("index", "value")
  po <- ggplot(df, aes(index, value)) + geom_bar(stat = "identity")
  po <- po + ggtitle("Outliers?") +
    theme(plot.title = element_text(hjust = 0.5)) +
    ylab("Distances in som$distances") +
    xlab("Number of Rows in the Data Set")
  plot(po)
  return(df)
}
###################
# unsupervised learning
umtrx <- normalising_data(data)
map<-train_som(umtrx)
plot_som_data(map)
#####################
# creating the dendrogram and then the clusters for the neurons
dendrogram <- hclust(object.distances(map, "codes"), method = 'ward.D')
plot(dendrogram)
clusters <- cutree(dendrogram, 7)
clusters
length(clusters)
# visualising the clusters on the map
par(mfrow = c(1, 1))
plot(map, type = 'dist.neighbours', main = "Mapping Data")
add.cluster.boundaries(map, clusters)
Plots with the Clusters
You can also create nice heat-maps for selected variables, but I have not implemented them here; for clustering with only 2 variables it does not really make sense. If you implement them for the wine data set, please add the code and the charts to this post.
# see the predicted clusters together with the data set
# 1. add the vector of the neuron ids to the data
mapped_neurons <- map$unit.classif
umtrx <- cbind(umtrx, mapped_neurons)
# 2. take the predicted clusters and add them to the original matrix
# very good description of the apply functions:
# https://www.guru99.com/r-apply-sapply-tapply.html
get_cluster_for_the_row <- function(x, cltrs) {
  return(cltrs[x])
}
predicted_clusters <- sapply(umtrx[, 4], get_cluster_for_the_row, cltrs = clusters)
# mtrx was local to normalising_data(), so it has to be rebuilt here
mtrx <- data.matrix(data)
mtrx <- cbind(mtrx, predicted_clusters)
scatter_plot_data(mtrx, couloring_base_indx = 4, "Predicted Clusters with Outliers \n 102 Points")
See the predicted clusters below, both for the case with outliers and for the case without.
I am not completely sure, but the distance between two objects in a given dimensional space is usually measured with the Euclidean distance. For example, two points A(x=3, y=4) and B(x=6, y=8) in a two-dimensional space are 5 distance units apart; this is the result of sqrt((3-6)^2 + (4-8)^2). The same applies to data with more dimensions, by adding a further squared difference for each extra dimension: if A(x=3, y=4, z=5) and B(x=6, y=8, z=7), then the distance is sqrt((3-6)^2 + (4-8)^2 + (5-7)^2), and so on.

In kohonen, I think that after the model has finished the training phase, the algorithm calculates the distance of each datum to all nodes and assigns it to the nearest node (the node with the smallest distance to it). Consequently, the values in the 'distances' variable returned by the model are the distances of every datum to its nearest node. One thing to note from your script is that the distance is not measured on the original property values, because the data were scaled before being fed to the model; the distance measurement is applied to the scaled version of the data. Scaling is a standard procedure to prevent one variable from dominating the others.
I believe that your method is acceptable, because the values in the 'distances' variable are the distances of each datum to its nearest node. So if the distance between a datum and its nearest node is high, then the distances from that datum to all the other nodes are obviously even higher.
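To check this empirically, one could recompute the distance of every scaled observation to the codebook vector of its winning unit and compare it with som$distances. A minimal sketch, assuming the som.wines object from the question; note that depending on the kohonen version and the dist.fcts setting, the stored value may be the sum of squared differences rather than its square root, so both variants are shown:
# codebook vector of the winning unit (BMU) for each observation
bmu_codes <- getCodes(som.wines)[som.wines$unit.classif, ]
# squared and plain Euclidean distances between each scaled row and its BMU
sq_dist <- rowSums((scaled_wines - bmu_codes)^2)
eucl_dist <- sqrt(sq_dist)
head(cbind(som.wines$distances, sq_dist, eucl_dist))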
Related
I'm working on calculating the robust working range of a process. For this purpose I'm building models from DOE data and simulating data with a Monte Carlo approach. Filtering the data with a criterion on the response leads to an allowed space (see the plots for better visualisation).
In the example below there are 3 variables, and the goal is to calculate the biggest possible square (parallel to the axes) within the allowed space. This would describe the working range of the process. The coding is just there to get every variable into the same range (-1 to 1).
library(tidyverse)
library(MASS)
library(ggplot2)
library(gridExtra)
library(rgl)
df<-data.frame(
X1=runif(100,0,2),
X2=runif(100,10,30),
X3=runif(100,5,75))%>%
mutate(Y1=2*X1-2*X2+X3)
f1<-Y1~X1+X2+X3
model1<- lm(f1, data=df)
m.c <- NULL
n <- 10000
for (k in 1:n) {
  X1 <- runif(1, 0, 2)
  X2 <- runif(1, 10, 30)
  X3 <- runif(1, 5, 75)
  m.c <- rbind(m.c, data.frame(X1, X2, X3))
}
m.c_coded <- m.c %>%
  mutate(predict1 = predict(model1, newdata = .)) %>%
  mutate(X1 = (X1 - 1) / 1) %>%
  mutate(X2 = (X2 - 20) / 10) %>%
  mutate(X3 = (X3 - 40) / 35)
Space <- m.c_coded %>%
  filter(predict1 <= 0)
p1<-ggplot(Space)+
geom_point(aes(X1, X2))+
xlim(-1,1)+
ylim(-1,1)
p2<-ggplot(Space)+
geom_point(aes(X1, X3))+
xlim(-1,1)+
ylim(-1,1)
p3<-ggplot(Space)+
geom_point(aes(X2, X3))+
xlim(-1,1)+
ylim(-1,1)
grid.arrange(arrangeGrob(p1,p2,p3, nrow = 1), nrow = 1)
MODR_plot3D <- plot3d(x = Space$X1, y = Space$X2, z = Space$X3, type = "p",
                      xlim = c(-1, 1), ylim = c(-1, 1), zlim = c(-1, 1))
There are specialised programs for this (DOE software) which can calculate this so-called design space, but I want to implement it in my own R script. Sadly, I do not have any idea how to calculate the position (edges) of this square. My approach would be to find the maximum distance to the surface from the centre of the square.
Does anyone have an idea how I can calculate this cube in a proper way? If possible, I would like to extend this to the n-dimensional case as well.
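Not being a DOE specialist, I can only offer a rough brute-force sketch under the assumption that the cube may grow around a candidate centre until one of its faces would reach either a failing Monte Carlo point (predict1 > 0) or the coded boundary at +/-1. The helper largest_halfwidth below is my own, not an established method, and the idea extends to n dimensions simply by adding columns:
fail_mat <- as.matrix(filter(m.c_coded, predict1 > 0)[, c("X1", "X2", "X3")])
pass_mat <- as.matrix(filter(m.c_coded, predict1 <= 0)[, c("X1", "X2", "X3")])
# half-width of the largest axis-parallel cube centred at 'centre' that
# contains no failing point and stays inside the coded range [-1, 1]
largest_halfwidth <- function(centre, fail_pts) {
  d_fail <- apply(abs(sweep(fail_pts, 2, centre)), 1, max)  # Chebyshev distances to failing points
  min(c(d_fail, 1 - max(abs(centre))))
}
# subsample the candidate centres to keep the brute force affordable
centres <- pass_mat[sample(nrow(pass_mat), min(500, nrow(pass_mat))), ]
hw <- apply(centres, 1, largest_halfwidth, fail_pts = fail_mat)
best <- which.max(hw)
centres[best, ]  # centre of the best cube found
hw[best]         # half-width; the edge length is 2 * hw[best]
This only checks candidate centres at passing Monte Carlo points, so it is an approximation that improves with the number of simulated points.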
I have a dataset (this is just a dummy; my real datasets are much larger) in which there are five variables: two spatial variables, X and Y (basically pairs of coordinates), and three attributes, A, B and C, associated with each X,Y point:
X Y A B C
1 1 34 11 26
1 2 47 16 31
1 3 60 21 36
1 4 73 26 41
1 5 86 31 46
2 1 99 36 51
... with 15 more rows
If I run a k-Means Clustering model on the dataset, I can easily produce a plot in which each X,Y point is coloured according to the related cluster:
library(tidyverse)
#Read the dataset
My_ds <- read_delim("test_dataset.csv",delim = ",", escape_double = FALSE, trim_ws = TRUE)
#Set the number of clusters
kClusters <- 3
#Create the model
kMeans <- kmeans(My_ds[ , c("A", "B", "C")], centers = kClusters)
#Plot the result
ggplot(My_ds, aes(X, Y)) +
geom_point(col = kMeans$cluster,
size = 15) +
theme_minimal()
k-Means scatter plot
With the kohonen package I can also use a different clustering approach based on self-organising maps (SOM):
library(kohonen)
#Prepare the dataset
My_ds_SOM <- as.matrix(scale(My_ds[ , c("A", "B", "C")]))
#Set the grid
My_Grid <- somgrid(xdim = 3, ydim = 3, topo = "hexagonal")
#Create the model
My_Model <- som(X = My_ds_SOM,
grid = My_Grid)
However, I cannot find a way to produce a scatter plot similar to the one above based on the SOM clusters. With k-Means I used kMeans$cluster to control the colour of the X,Y points; what should I use with SOM?
Update 1
OK, I made some progress thanks to this blog post. The key is to perform clustering on the SOM nodes, to isolate groups of samples with similar metrics.
First, an estimate of a suitable number of clusters can be obtained by running a K-means algorithm and looking for an elbow point in the plot of the within-cluster sum of squares (WCSS):
#View WCSS for K-means
mydata <- getCodes(My_Model)
wcss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:8) { #Second number is of one's choosing (I used number_of_nodes-1)
wcss[i] <- sum(kmeans(mydata, centers=i)$withinss)
}
plot(wcss)
WCSS plot
Then I use hierarchical clustering and the SOM plot function to visualise the clusters on the node map:
#Define colour palette
pretty_palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2')
#Use hierarchical clustering to cluster the codebook vectors
som_cluster <- cutree(hclust(dist(getCodes(My_Model))), 3)
#Plot these results
plot(My_Model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(My_Model, som_cluster)
Clusters on node map
Finally, I assign labels to the original data by combining the som_cluster vector, which maps nodes to clusters, with the My_Model$unit.classif vector, which maps data samples to nodes:
#Get vector with cluster value for each original data sample
cluster_assignment <- som_cluster[My_Model$unit.classif]
#Add the assignment as a column in the original data
My_ds$cluster <- cluster_assignment
#Plot the result
ggplot(My_ds, aes(X, Y)) +
geom_point(col = My_ds$cluster,
size = 15) +
theme_minimal()
SOM+hierarchical scatter plot
Applying hierarchical clustering on top of the SOM nodes makes the process a bit convoluted, since SOM already helps to reduce the dimensions and to cluster neighbouring nodes together, but this was the only way I could get what I wanted.
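For what it's worth, if the extra hierarchical step feels like too much, a simpler (but coarser) option would be to treat every SOM node as its own cluster and colour the points by node id directly; a minimal sketch, assuming My_Model and My_ds from above:
# colour each X,Y point by the SOM node it was mapped to
My_ds$node <- My_Model$unit.classif
ggplot(My_ds, aes(X, Y)) +
  geom_point(aes(col = factor(node)), size = 15) +
  theme_minimal()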
Update 2
Some more progress. This time I'm focusing on making the whole process fully automatic. Specifically, I want to avoid choosing 1) the SOM grid size and 2) the number of clusters during the clustering of the node map.
Regarding point 1, I used a rule of thumb suggested by Vesanto J, Alhoniemi E., "Clustering of the self-organizing map", IEEE Transactions on Neural Networks, 2000 May; 11(3): 586-600, which is #nodes = 5*sqrt(#observations). Setting the grid for the SOM model therefore works like this:
My_dim <- as.integer(sqrt(5*sqrt(nrow(My_ds_SOM))))
My_Grid <- somgrid(xdim = My_dim, ydim = My_dim, topo = "hexagonal")
Of course, this works best with large datasets. In any case, this approach should only be a starting point; the grid size can (and should) then be adjusted by looking at the resulting node count plot, weight vector plot and heat-maps.
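For reference, those diagnostic plots can be produced directly with the kohonen plot method (assuming My_Model from the earlier snippets; the per-variable heat-maps use type = "property" with one codebook column at a time):
plot(My_Model, type = "counts")   # node count plot: how many samples map to each node
plot(My_Model, type = "codes")    # weight (codebook) vector plot
plot(My_Model, type = "quality")  # mean distance of mapped samples to their codebook vectors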
About point 2, when using hierarchical clustering to cluster the codebook vectors, the kgs function of the maptree package allows the optimal number of clusters to be calculated automatically:
library(maptree)
distance <- dist(getCodes(My_Model))
clustering <- hclust(distance)
optimal_k <- kgs(clustering, distance, maxclust = 20)
clusters <- as.integer(names(optimal_k[which(optimal_k == min(optimal_k))]))
som_cluster <- cutree(clustering, clusters)
Also in this case, the number of clusters determined by the code can be compared to the one suggested by the WCSS plot, to check if there is a significant discrepancy.
I am new to R and to (unsupervised) machine learning. I'm trying to find the best clustering solution for my data in R.
What is my data about?
I have a dataset with +/- 800 long / lat WGS84 coordinates in one city.
Long is in the range 6.90 - 6.95
lat is in the range 52.29 - 52.33
What do I want?
I want to find "hotspots" based on their density. For example: a minimum of 5 long/lat points within a range of 50 metres. This is an example point plot:
Why do I want this?
For example: let's assume that every single point is a car accident. By clustering the points I hope to see which areas need attention (a minimum of x points within a range of x metres needs attention).
What have I found?
The following clustering algorithms seem possible for my solution:
DBscan (https://cran.r-project.org/web/packages/dbscan/dbscan.pdf)
HDBscan(https://cran.r-project.org/web/packages/dbscan/vignettes/hdbscan.html)
OPTICS (https://www.rdocumentation.org/packages/dbscan/versions/0.9-8/topics/optics)
City Clustering Algorithm (https://cran.r-project.org/web/packages/osc/vignettes/paper.pdf)
My questions
What is the best solution or algorithm for my case in R?
Is it true that I have to convert my long/lat to a distance / Haversine matrix first?
I found something interesting at: https://gis.stackexchange.com/questions/64392/finding-clusters-of-points-based-distance-rule-using-r
I changed this code a bit, using the outliers as the places where a lot happens:
# 0. Load the required packages (sp for spatial objects, geosphere for distm, dplyr for the pipes)#
library(sp)
library(geosphere)
library(dplyr)
# 1. Make SpatialPointsDataFrame #
xy <- SpatialPointsDataFrame(
  matrix(c(x,y), ncol=2), data.frame(ID=seq(1:length(x))),
  proj4string=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))
# 2. Use DISTM function to generate distance matrix.#
mdist <- distm(xy)
# 3. Use hierarchical clustering with complete methode#
hc <- hclust(as.dist(mdist), method="complete")
# 4. Show dendrogram#
plot(hc, labels = input$street, xlab="", sub="",cex=0.7)
# 5. Set distance: in my case 300 meter#
d=300
# 6. define clusters based on a tree "height" cutoff "d" and add them to the SpDataFrame
xy$clust <- cutree(hc, h=d)
# 7. Add clusters to dataset#
input$cluster <- xy@data[["clust"]]
# 8. Plot clusters #
plot(input$long, input$lat, col=input$cluster, pch=20)
text(input$long, input$lat, labels =input$cluster)
# 9. Count n in cluster#
selection2 <- input %>% count(cluster)
# 10. Make a boxplot #
boxplot(selection2$n)
#11. Get first outlier#
outlier <- boxplot.stats(selection2$n)$out
outlier <- sort(outlier)
outlier <- as.numeric(outlier[1])
#12. Filter clusters greater than outlier#
selectie3 <- as.vector(selection2 %>% filter(selection2$n >= outlier[1]) %>% select(cluster))
#13. Make a new DF with all outlier clusters#
heatclusters <- input %>% filter(cluster%in% c(selectie3$cluster))
#14. Plot outlier clusters#
plot(heatclusters$long, heatclusters$lat, col=heatclusters$cluster)
#15. Plot on density map ##
googlemap + geom_point(aes(x=long , y=lat), data=heatclusters, color="red", size=0.1, shape=".") +
stat_density2d(data=heatclusters,
aes(x =long, y =lat, fill= ..level..), alpha = .2, size = 0.1,
bins = 10, geom = "polygon") + scale_fill_gradient(low = "green", high = "red")
I don't know if this is a good solution, but it seems to work. Does anyone have another suggestion?
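For comparison, here is a sketch of the DBSCAN route listed above, where eps is expressed directly in metres via a Haversine distance matrix, so the "minimum 5 points within 50 metres" rule maps one-to-one onto the parameters (this assumes the same input data frame with long/lat columns and is untested on the real data):
library(geosphere)
library(dbscan)
# pairwise Haversine distances in metres
mdist_m <- distm(cbind(input$long, input$lat), fun = distHaversine)
# DBSCAN on the precomputed distance matrix: eps = 50 m, minPts = 5
db <- dbscan(as.dist(mdist_m), eps = 50, minPts = 5)
input$cluster_db <- db$cluster  # 0 = noise, > 0 = hotspot id
plot(input$long, input$lat, col = input$cluster_db + 1, pch = 20)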
I have a problem I wish to solve in R, with example data below. I know this must have been solved many times before, but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find out how to translate a set of 2D coordinates so that it best fits into another, larger set of 2D coordinates. Imagine, for example, having a Polaroid photo of a small piece of the starry sky with you out at night, and wanting to hold it up in a position where it matches the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# generate points: take a subset of the coordinates to serve as the points
# we are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield:
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represents the reference data. The left plot represents the points that I want to translate across the reference data in order to find a position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see the edit history), which used optim() to minimise the sum of distances between points, will only work in the limited circumstance where the point distribution used as reference data is in the middle of the point field. A solution that satisfies the question, and still seems workable for a few thousand points, is a brute-force delta-and-comparison algorithm: it calculates the difference between each point in the field and a single point of the reference data, and then determines how many of the remaining reference points lie within a minimum threshold (which is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for (i in 1:nrow(ref_coords)) {
  x2 <- my_coords_start[, 1]
  y2 <- my_coords_start[, 2]
  x1 <- ref_coords[, 1] + (x2[1] - ref_coords[i, 1])
  y1 <- ref_coords[, 2] + (y2[1] - ref_coords[i, 2])
  ## Calculate all pairwise distances between reference and field data:
  dists <- dist(cbind(c(x1, x2), c(y1, y2)), "euclidean")
  ## Only take distances for the sampled data:
  dists <- as.matrix(dists)[-1 * 1:length(x1), ]
  ## Calculate the number of distances within the minimum
  ## distance threshold minus the diagonal portion:
  win_thresh <- sum(rowSums(dists < min_dist) > 1)
  ## If we have more "matches" than our best so far, calculate a new
  ## dx and dy:
  if (win_thresh > win_thresh_old) {
    win_thresh_old <- win_thresh
    dx <- (x2[1] - ref_coords[i, 1])
    dy <- (y2[1] - ref_coords[i, 2])
  }
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if anyone can solve this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimisation algorithm.
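One possibly faster alternative (my own sketch, not a tested solution): since only a translation is sought, every reference/sample pair can vote for an offset, and the most heavily populated offset bin (up to a noise tolerance) gives the delta directly:
# all pairwise offsets reference - sample; the true delta collects the most votes
offs_x <- outer(ref_coords$x, my_coords_start[, "x"], "-")
offs_y <- outer(ref_coords$y, my_coords_start[, "y"], "-")
# bin the offsets with a tolerance that absorbs the noise, then count votes
tol <- 0.5
key <- paste(round(offs_x / tol), round(offs_y / tol))
best_key <- names(which.max(table(key)))
in_bin <- key == best_key
# refine the delta by averaging the offsets that fell into the winning bin
dx <- mean(offs_x[in_bin])
dy <- mean(offs_y[in_bin])
# shift the start coordinates by the estimated delta and add them to the plot
# (assumes the plot from the question is still open)
points(my_coords_start[, "x"] + dx, my_coords_start[, "y"] + dy, cex = 1.5, col = "red")
This builds n_ref * n_sample offsets, which should stay manageable for roughly 1000 reference points and a few hundred sample points.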
What would be an easy way to generate 3 different spatial distributions of points (N = 20 points) using R? For example: 1) random, 2) uniform, and 3) clustered, on the same space (a 50 x 50 grid)?
1) Here's one way to get a very even spacing of 5 points in a 25 by 25 grid numbered from 1 in each direction: put points at (3,18), (8,3), (13,13), (18,23), (23,8); you should be able to generalize from there.
2) as you suggest, you could use runif ... but I'd have assumed from your question you actually wanted points on the lattice (i.e. integers), in which case you might use sample.
Are you sure you want continuous rather than discrete random variables?
3) This one is "underdetermined"; depending on how you want to define things, there are many ways you might do it. For example, on a grid you could sample points in such a way that points close to (but not exactly on) already-sampled points have a much higher probability than ones further away; a similar setup works for continuous variables. Or you could generate more points than you need and eliminate the loneliest ones. Or you could start with random uniform points and make them gravitate toward their neighbours. Or you could generate a few cluster centres (4-10, say) and then scatter points about those centres. Or you could do any of a hundred other things (a small sketch of options 2 and 3 follows below).
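To make suggestions 2) and 3) concrete, here is a minimal base-R sketch (the 50 x 50 grid size, the number of centres and all spreads are arbitrary choices of mine, for N = 20 points):
set.seed(1)
n <- 20
# 2) random points on the integer lattice of a 50 x 50 grid, via sample()
lattice <- expand.grid(x = 1:50, y = 1:50)
lattice_pts <- lattice[sample(nrow(lattice), n), ]
# 3) one of the many clustering options: a few cluster centres, then Gaussian
#    scatter of the points around randomly chosen centres
centres <- data.frame(x = runif(4, 10, 40), y = runif(4, 10, 40))
idx <- sample(4, n, replace = TRUE)
clustered_pts <- data.frame(x = rnorm(n, centres$x[idx], sd = 2),
                            y = rnorm(n, centres$y[idx], sd = 2))
par(mfrow = c(1, 2))
plot(lattice_pts, main = "lattice sample")
plot(clustered_pts, main = "clustered", xlim = c(0, 50), ylim = c(0, 50))
par(mfrow = c(1, 1))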
A bit late, but the answers above do not really address the problem. Here is what you are looking for:
library(sp)
# make a grid of size 50*50
x1 <- seq(1, 50) - 0.5
x2 <- x1
grid <- expand.grid(x1, x2)
names(grid) <- c("x1", "x2")
# make a grid a spatial object
coordinates(grid) <- ~x1+x2
gridded(grid) <- TRUE
First: random sampling
# random sampling
random.pt <- spsample(x = grid, n= 20, type = 'random')
Second: regular sampling
# regular sampling
regular.pt <- spsample(x = grid, n= 20, type = 'regular')
Third: clustered, at a distance of up to 2 from a random location (points can fall outside the area)
# random sampling of one location
ori <- data.frame(spsample(x = grid, n = 1, type = 'random'))
# select randomly 20 distances between 0 and 2
n.point <- 20
h <- runif(n.point, min = 0, max = 2)
# empty dataframe
dxy <- data.frame(matrix(nrow = n.point, ncol = 2))
# take a random angle from the randomly selected location and make a dataframe of
# the new distances from the original sampling point, in a random direction
angle <- runif(n = n.point, min = 0, max = 2 * pi)
dxy[, 1] <- h * sin(angle)
dxy[, 2] <- h * cos(angle)
cluster <- data.frame(x = rep(NA, 20), y = rep(NA, 20))
cluster$x <- ori$coords.x1 + dxy$X1
cluster$y <- ori$coords.x2 + dxy$X2
# make a spatial object and plot
coordinates(cluster)<- ~ x+y
plot(grid)
plot(cluster, add=T, col='green')
plot(random.pt, add=T, col= 'red')
plot(regular.pt, add=T, col= 'blue')