I want to group a list of longitudes and latitudes (my_long_lats) around predetermined center points (my_center_Points).
When I run:
k <- kmeans(as.matrix(my_long_lats), centers = as.matrix(my_center_Points))
k$centers does not equal my_center_Points.
I assume k-means has adjusted my center points to the optimal centers. But what I need is for my_center_Points to stay fixed and for my_long_lats to be grouped around them.
In this link
they talk about setting initial centers, but how do I set centers that won't change once I run k-means? Or is there a better clustering algorithm for this?
I could even settle for minimizing the movement of the centers.
I still have a lot to learn in R, any help is really appreciated.
Centers are recomputed automatically when k-means clustering is performed; in fact, determining the centers is the core of how the data is divided into cluster groups. Here are a couple of options that can help (both are sketched in the code below).
Limit iter.max. You can set it to just 1 in the kmeans call. This will not guarantee that the centers stay fixed, but the changes will be small if you are dealing with a large data set.
Use dummy data. You can add many dummy points to your actual data set at the chosen centers. This puts extra weight on the pre-determined centers, so they will most likely remain unchanged.
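Here is a minimal sketch of both options on made-up stand-in data (my_long_lats and my_center_Points below are random placeholders, not the real objects from the question):
set.seed(1)
my_long_lats     <- matrix(runif(100, 0, 10), ncol = 2)            # stand-in points
my_center_Points <- matrix(c(2, 2, 8, 8), ncol = 2, byrow = TRUE)  # stand-in centers
# option 1: cap the iterations so the centers move as little as possible
# (a "did not converge" warning is expected)
k1 <- kmeans(my_long_lats, centers = my_center_Points, iter.max = 1)
# option 2: add many dummy rows at the chosen centers to anchor them
dummies <- my_center_Points[rep(1:nrow(my_center_Points), each = 500), ]
k2 <- kmeans(rbind(my_long_lats, dummies), centers = my_center_Points)
k2$cluster[1:nrow(my_long_lats)]   # cluster labels for the original points only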
Here is the calculation using the geosphere library to properly compute the distance from latitude and longitude.
The variable closestcenter is the result which identifies the closest center to each point.
#define random data
centers<-data.frame(x=c(44,44, 50, 50), y=c(44, 50, 44, 50))
pts<-data.frame(x=runif(50, 40, 55), y=runif(50, 40, 55))
library(geosphere)
#calculate the distance matrix from the defined centers to each point
#columns represent centers and the rows are the data points
dm <- sapply(seq_len(nrow(centers)), function(i) distGeo(centers[i, ], pts))
#find the column with the smallest distance
closestcenter<-apply(dm, 1, which.min)
#color code the original data for verification
colors<-c("black", "red", "blue", "green")
plot(pts , col=colors[closestcenter], pch=19)
I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if data has empty col or rows
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
The algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' to identify the 'eps' value. Since there are 17 attributes to be considered for clustering, I have taken k=17*2 =34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I try for eps and minPts, the result is the same.
Can anyone tell where I have gone wrong?
Thanks in advance.
You have two options:
Increase the search radius around your core points (the eps parameter).
Decrease the minimum number of points (minPts) needed to define a core point.
I would start by decreasing the minPts parameter: it looks very high, and if DBSCAN does not find that many points within the radius, it will not add more points to a cluster. A quick sketch of both adjustments follows below.
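This assumes data1 is the scaled matrix from the question; the exact values are only starting points to experiment with:
library(dbscan)
db_low_minPts <- dbscan(data1, eps = 4, minPts = 10)   # require fewer points per core point
db_big_eps    <- dbscan(data1, eps = 6, minPts = 34)   # use a larger neighborhood radius
db_low_minPts
db_big_eps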
A typical problem with using DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method would be to use a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster; k-means will just partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal, indicating that clustering will not find much.
To improve the clustering, you need to work on the data: you can remove irrelevant variables, transform heavily skewed variables first, or try a non-linear embedding before clustering.
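As a rough sketch of the kind of preprocessing meant here (this assumes data1 is the cleaned but not yet scaled data frame from the question; which variables are actually skewed is something you would have to check yourself):
# log-transform non-negative numeric columns to reduce skew, then rescale
data_t <- data1
nonneg <- sapply(data_t, function(col) is.numeric(col) && all(col >= 0, na.rm = TRUE))
data_t[nonneg] <- lapply(data_t[nonneg], log1p)
data_s <- scale(data_t)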
I have to figure out the percentage of overlap between polytopes in n-dimensional spaces, where my only available source of reference is a set of randomly sampled points within those polytopes.
Assume that the following two R objects are two sets of randomly sampled points from two different polytopes in 5 dimensions:
one <- matrix(runif(5000, min = 0, max = 5), ncol = 5)
two <- matrix(runif(5000, min = 0, max = 4), ncol = 5)
In this example, I selected a smaller range for the second object, so we know that there should be less than 10% overlap. Let me know if I am wrong.
EDIT:
Just to make it really clear, the question is what is the percentage of overlap between those two objects?
I need a method that generalizes to n-dimensional spaces.
This stackoverflow question is somewhat similar to what I am trying to do, but I didn't manage to get it to work.
So, the most straightforward way is to use the hypervolume package.
library(hypervolume)
one <- hypervolume(matrix(runif(5000, min = 0, max = 5), ncol = 5))
two <- hypervolume(matrix(runif(5000, min = 0, max = 4), ncol = 5))
three = hypervolume_set(one, two, check.memory=FALSE)
get_volume(three)
This will get you the volume.
hypervolume_overlap_statistics(three)
This function will output four different metrics, one of which is the Jaccard Similarity Index.
The Jaccard Similarity is the proportion of overlap between the two sample sets (the intersection divided by the union).
Alternatives
Chris suggested volesti as an alternative. Another alternative would be the geometry package.
They do not calculate the proportion directly. You first need to find the intersection (e.g. intersectn in geometry, VpolytopeIntersection in volesti), then compute the volumes of the two polytopes and of their intersection, and finally divide the volume of the intersection by the volume of the union (the sum of the two volumes minus the intersection).
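For instance, a rough sketch of that route with the geometry package (not tested here; it may be slow in five dimensions, and the accessor names may differ slightly between package versions):
library(geometry)
one <- matrix(runif(5000, min = 0, max = 5), ncol = 5)
two <- matrix(runif(5000, min = 0, max = 4), ncol = 5)
isect <- intersectn(one, two)             # convex hulls of both sets and of their intersection
v1 <- convhulln(one, options = "FA")$vol  # volume of the first hull
v2 <- convhulln(two, options = "FA")$vol  # volume of the second hull
vi <- isect$ch$vol                        # volume of the intersection hull
vi / (v1 + v2 - vi)                       # proportion of overlap (Jaccard)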
Here, they are also using a different method to calculate the volume and it might be more appropriate for you if you are trying to construct convex hulls in an n-dimensional space. For me, hypervolume is a better solution, because I am doing something more akin to Hutchinson’s n-dimensional hypervolume concept from ecology and evolutionary biology.
Let's just say I have the following scatterplot:
set.seed(665544)
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
plot(x)
I want to divide this scatterplot into square cells of a specified size and then count how many points fall into each unique cell. This will essentially give me the local density value of that cell. What is the best way of doing this? Is there an R package that can help? Perhaps a 2D histogram method like in Matlab?
Quick clarifications:
1.) I'd like the function/method to take the following 3 arguments: dimensions of total area, dimensions of cell (OR number of cells), and the data. It would then perhaps output a matrix where each value corresponds to a cell's point count (see the sketch after these clarifications).
2.) Q: Why do you want to use this method to determine local density? Isn't this much easier:
library(dbscan)
pointdensity(x, eps = .1, type = "frequency")
A: This method calculates the local density around each point. Though easy, this definition of local density makes it very difficult (optimization algorithms would be necessary) to assign new data in a way that matches the local density distribution of the original data set.
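For the record, here is a rough sketch of such a counting function using only base R's cut() and table() (the name gridCount and its argument layout are made up to match the three requested arguments):
gridCount <- function(data, xlim, ylim, nx, ny) {
  # bin each coordinate into nx (ny) equal-width intervals over the given area
  xb <- cut(data[, 1], breaks = seq(xlim[1], xlim[2], length.out = nx + 1),
            include.lowest = TRUE)
  yb <- cut(data[, 2], breaks = seq(ylim[1], ylim[2], length.out = ny + 1),
            include.lowest = TRUE)
  table(xb, yb)   # matrix of point counts, one row/column per cell
}
gridCount(x, xlim = range(x[, 1]), ylim = range(x[, 2]), nx = 10, ny = 10)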
The figure is a plot of an (x, y) data set from an Excel file, 8760 pairs in total. I want to remove the noisy data pairs in the red-circled area and output a new Excel file with the remaining pairs. How can I do this in R?
Using @G5W's example:
Make up data:
set.seed(2017)
x = runif(8760, 0,16)
y = c(abs(rnorm(8000, 0, 1)), runif(760,0,8))
XY = data.frame(x,y)
Fit a quantile regression to the 90th percentile:
library(quantreg)
library(splines)
qq <- rq(y~ns(x,20),tau=0.9,data=XY)
Compute and draw the predicted curve:
xvec <- seq(0,16,length.out=101)
pp <- predict(qq,newdata=data.frame(x=xvec))
plot(y~x,data=XY)
lines(xvec,pp,col=2,lwd=2)
Keep only points below the predicted line:
XY2 <- subset(XY,y<predict(qq,newdata=data.frame(x)))
plot(y~x,data=XY2)
lines(xvec,pp,col=2,lwd=2)
You can make the line less wiggly by lowering the number of knots, e.g. y~ns(x,10)
Both R and EXCEL read and write .csv files, so you can use those to transfer the data back and forth.
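For example, continuing from XY2 above (the file name is just a placeholder):
write.csv(XY2, "cleaned_pairs.csv", row.names = FALSE)   # open this file in Excel
check <- read.csv("cleaned_pairs.csv")                   # or read a CSV exported from Excel the same way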
You do not provide any data so I made some junk data to produce a similar problem.
DATA
set.seed(2017)
x = runif(8760, 0,16)
y = c(abs(rnorm(8000, 0, 1)), runif(760,0,8))
XY = data.frame(x,y)
One way to identify noise points is by looking at the distance to the nearest neighbors. In dense areas, the nearest neighbors will be close; in non-dense areas, they will be farther apart. The dbscan package provides a convenient function to get the distance to the k nearest neighbors. For this problem, I used k = 6, but you may need to tune this for your data. Looking at the distribution of distances to the 6th nearest neighbor, we see that most points have their 6 nearest neighbors within a distance of 0.2.
library(dbscan)
XY6 = kNNdist(XY, k = 6)   # distance from each point to its 6th nearest neighbor
plot(density(XY6))
So I will assume that points whose 6th nearest neighbor is farther away than 0.2 are noise points. Just changing the color to see which points are affected, we get
TYPE = rep(1, 8760)
TYPE[XY6 > 0.2] = 2
plot(XY, col = TYPE)
Of course, if you wish to restrict to the non-noise points, you can use
NonNoise = XY[XY6 <= 0.2, ]
I have a data frame that has 3 values for each point in the form: (x, y, boolean). I'd like to find an area bounded by values of (x, y) where roughly half the points in the area are TRUE and half are FALSE.
I can scatterplot the data and color it according to the 3rd value of each point to get a general idea, but I was wondering if there would be a better way. I understand that if you take a small enough area containing only 2 points, one TRUE and one FALSE, then you have 50/50, so I was thinking there has to be a better way of deciding what size area to look for.
Visually, I see this as drawing a square on the scatter plot and moving it around along the x and y axes, each time checking the number of TRUE and FALSE points in the area. But is there a way to determine a good size for the area based on the values?
Thanks
EDIT: G5W's answer is a step in the right direction, but based on their scatterplot, I'm looking to create a square/rectangle in which roughly half the points are green and half are red. I understand that there are potentially infinitely many such areas, but I'm thinking there might be a good way to determine an optimal size for the area (maybe it should contain at least a certain percentage of the points, or something similar).
Note update below
You do not provide any sample data, so I have created some bogus data like this:
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
y = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
z = rep(c(TRUE,FALSE), each=100))
I think that what you want is how much area is taken up by each of the TRUE and FALSE points. A way to interpret that task is to find the convex hull for each group and take its area. That is, find the minimum convex polygon that contains a group. The function chull will compute the convex hull of a set of points.
plot(TestData[,1:2], pch=20, col=as.numeric(TestData$z)+2)
CH1 = chull(TestData[TestData$z,1:2])
CH2 = chull(TestData[!TestData$z,1:2])
polygon(TestData[which(TestData$z)[CH1],1:2], lty=2, col="#00FF0011")
polygon(TestData[which(!TestData$z)[CH2],1:2], lty=2, col="#FF000011")
Once you have the polygons, the polyarea function from the pracma package will compute the area. Note that it computes a "signed" area so you either need to be careful about which direction you traverse the polygon or take the absolute value of the area.
library(pracma)
abs(polyarea(TestData[which(TestData$z)[CH1],1],
TestData[which(TestData$z)[CH1],2]))
[1] 16.48692
abs(polyarea(TestData[which(!TestData$z)[CH2],1],
TestData[which(!TestData$z)[CH2],2]))
[1] 15.17897
Update
This is a completely different answer based on the updated question. I am leaving the old answer because the question now refers to it.
The question now gives a little more information about the data ("There are about twice as many FALSE than TRUE") so I have made an updated bogus data set to reflect that.
set.seed(2017)
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(200, 1, 1)),
y = c(rnorm(100, 1, 1), rnorm(200, -1,1)),
z = rep(c(TRUE,FALSE), c(100,200)))
The problem is now to find regions where the densities of TRUE and FALSE points are approximately equal. The question asked for a rectangular region, but at least for this data, that will be difficult. We can get a good visualization to see why.
We can use the function kde2d from the MASS package to get the 2-dimensional density of the TRUE points and the FALSE points. If we take the difference of these two densities, we need only find the regions where the difference is near zero. Once we have this difference in density, we can visualize it with a contour plot.
library(MASS)
Grid1 = kde2d(TestData$x[TestData$z], TestData$y[TestData$z],
lims = c(c(-3,3), c(-3,3)))
Grid2 = kde2d(TestData$x[!TestData$z], TestData$y[!TestData$z],
lims = c(c(-3,3), c(-3,3)))
GridDiff = Grid1
GridDiff$z = Grid1$z - Grid2$z
filled.contour(GridDiff, color = terrain.colors)
In the plot it is easy to see the region near (-1, 1) where there are far more TRUE than FALSE points, and the region near (1, -1) where there are more FALSE than TRUE. We can also see that the places where the difference in density is near zero lie in a narrow band roughly along the line y = x. You might be able to get a box where a region with more TRUEs is balanced by a region with more FALSEs, but the region where the densities are nearly equal is small.
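Continuing from GridDiff above, a quick sketch of how you could flag the grid cells where the density difference is close to zero (the 0.01 threshold is arbitrary):
near_zero <- abs(GridDiff$z) < 0.01            # logical matrix over the kde2d grid
image(GridDiff$x, GridDiff$y, near_zero + 0)   # 1 = near-balanced cells, 0 = not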
Of course, this is for my bogus data set which probably bears little relation to your real data. You could perform the same sort of analysis on your data and maybe you will be luckier with a bigger region of near equal densities.