Using pdfCluster to analyze density patterns in R

I'm trying to analyze a square region and cluster the points belonging to certain groups together. I think pdfCluster is the right tool, since I need to measure the density of the points through a kernel density estimator to get the correct clusters, and then I need to actually group them together to create a plot (I have the long/lat of the points). I'm really stuck on this and any help would be greatly appreciated.
I'm running into an issue with my code while trying to use a kernel density estimator to cluster points. I am working with my data in two different forms, trying to find the more workable one. First, I have my data as a matrix. An example is below, with latitude and longitude attached to the rows and columns.
# entered row by row; byrow = TRUE keeps each line below as one matrix row
# (a plain dim()<- assignment would fill column-wise and scramble the rows)
m <- matrix(c(
  8.83, 8.89, 8.81, 8.87, 8.9,  8.87,
  8.89, 8.94, 8.85, 8.94, 8.96, 8.92,
  8.84, 8.9,  8.82, 8.92, 8.93, 8.91,
  8.79, 8.85, 8.79, 8.9,  8.94, 8.92,
  8.79, 8.88, 8.81, 8.9,  8.95, 8.92,
  8.8,  8.82, 8.78, 8.91, 8.94, 8.92,
  8.75, 8.78, 8.77, 8.91, 8.95, 8.92,
  8.8,  8.8,  8.77, 8.91, 8.95, 8.94,
  8.74, 8.81, 8.76, 8.93, 8.98, 8.99,
  8.89, 8.99, 8.92, 9.1,  9.13, 9.11,
  8.97, 8.97, 8.91, 9.09, 9.11, 9.11,
  9.04, 9.08, 9.05, 9.25, 9.28, 9.27,
  9,    9.01, 9,    9.2,  9.23, 9.2,
  8.99, 8.99, 8.98, 9.18, 9.2,  9.19,
  8.93, 8.97, 8.97, 9.18, 9.2,  9.18
), nrow = 15, ncol = 6, byrow = TRUE)
I also have my data as a table where column 1 is latitude, column 2 is longitude, and column 3 is the value:
# one point per row: (lat, long, value); byrow = TRUE keeps the rows intact
z <- matrix(c(
  8.83, 8.89,  2,
  8.89, 8.94,  4,
  8.84, 8.9,   1,
  8.79, 8.852, 4,
  8.79, 8.88,  5,
  8.8,  8.82,  2,
  8.75, 8.78,  1,
  8.8,  8.8,   2,
  8.74, 8.81,  7,
  8.89, 8.99,  1,
  8.97, 8.97,  6,
  9.04, 9.08,  8,
  9,    9.01,  1,
  8.99, 8.99,  8,
  8.93, 8.97,  2
), nrow = 15, ncol = 3, byrow = TRUE)
colnames(z) <- c("lat", "long", "value")
The actual data I am using is from larger rasters and shapefiles.
The raster is from http://beta.sedac.ciesin.columbia.edu/data/set/gpw-v4-population-count/data-download.
And the shapefiles are from http://www.gadm.org/download — I am using Nigeria.
The main question of this post is about clustering and the best data format for clustering functions. I currently have all of the grid points of the entire country as (lat, long, value) triples. I want to run a kernel density estimator across all of the points and then cluster based on certain values. The pdfCluster package seems to do just that, except I'm not sure how to get it to accept lat/long values and run across a geographic plane. Since my data spans a geographic area and isn't completely continuous, I'm running into errors. Any hints on how to adapt pdfCluster to accept such values, or which data format is best to use, would be greatly appreciated.
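For what it's worth, here is a minimal sketch of one way to call pdfCluster on the (lat, long) table above. It treats degrees of latitude/longitude as plain Euclidean coordinates, which is only a rough approximation and only defensible over a small area, and the slot used to pull out the labels should be checked against your package version:
library(pdfCluster)
# kernel density estimate over the two coordinate columns, then
# density-based clustering of the high-density regions
cl <- pdfCluster(z[, c("lat", "long")])
summary(cl)
plot(cl)
# the @clusters slot should hold one group label per point (check str(cl)
# if your version stores it differently); use it to color a map-style plot
plot(z[, "long"], z[, "lat"], col = cl@clusters, pch = 19,
     xlab = "Longitude", ylab = "Latitude")
If the value column should matter, one option is to filter rows by value first (for example z[z[, "value"] > 4, ]) before clustering on the coordinates.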

Related

interpret plot of multivariate time series clustering results from dtwclust

I'm using the dtwclust package in R for multivariate time series clustering. Here's my code.
data("uciCT")
mvc <- tsclust(CharTrajMV, k = 4L, distance = "gak", seed = 390L)
plot(mvc)
The CharTrajMV data set has 100 observations with 3 variables. As I understand it, clusters are determined based on all 3 variables, as opposed to univariate time series clustering.
Each cluster graph shows several similarly patterned time series (observations) belonging to that cluster. How is this graph drawn? There are 3 time series variables used for clustering, so how does a single pattern graph come out? The input is a 3-dimensional (3-variable) dataset, but the output is 1-dimensional.
Moreover, I can get the 3 variables' centroid for each cluster (using mvc@centroids):
plot(mvc, labels = list(nudge_x = -10, nudge_y = 1), type="centroids")
This code shows only one centroid for each cluster. Can I get centroid graphs for each of the 3 variables per cluster with a plot option, or is this the right approach?
This is covered in the documentation. Plotting so many different series in separate panes would get very congested, so, for multivariate plots, the variables are appended one after the other, and you get vertical dotted lines to see the place where that happened, maybe injecting some missing values in some places to account for differences in length. This does mean the x axis isn't so meaningful anymore, but it's only meant to be a quick visualization aid.
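If you do want one curve per variable, note that each multivariate centroid in dtwclust is stored as a matrix with one column per variable, so you can plot the variables yourself. A minimal sketch, using cluster 1 as an example:
library(dtwclust)
data("uciCT")
mvc <- tsclust(CharTrajMV, k = 4L, distance = "gak", seed = 390L)
cen <- mvc@centroids[[1]]  # centroid of cluster 1: one column per variable
matplot(cen, type = "l", lty = 1, xlab = "Time", ylab = "Value",
        main = "Cluster 1 centroid, one line per variable")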

DBSCAN Clustering returning single cluster with noise points

I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if data has empty col or rows
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
Algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' used to identify the eps value. Since there are 17 attributes to be considered for clustering, I have taken k = 17 * 2 = 34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I try for eps and minPts, the result is the same.
Can anyone tell me where I have gone wrong?
Thanks in advance.
You have two options:
Increase the radius around your core points (given by the eps parameter).
Decrease the minimum number of points (minPts) required to define a core point.
I would start by decreasing the minPts parameter, since I think it is very high: if DBSCAN does not find that many points within the radius, it will not group more points into a cluster.
A typical problem with DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method would be to use a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster. k-means would just partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal, indicating that clustering will not find much.
To improve clustering, you need to work on the data. You can remove irrelevant variables, you may have very skewed variables that should be transformed first. You could also try non-linear embedding before clustering.
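As a hedged sketch of those last suggestions, assuming you restart from the unscaled file (same CSV as above; the skewness cutoff and eps here are guesses to tune, not recommendations): log1p-transform the non-negative, heavily skewed columns, rescale, re-check the kNN distance plot, and re-run DBSCAN.
library(dbscan)
raw <- na.omit(read.csv("CC GENERAL.csv")[, 2:18])  # adjust the path as needed
# simple moment-based skewness for each column
skew <- apply(raw, 2, function(x) mean(((x - mean(x)) / sd(x))^3))
to_log <- abs(skew) > 2 & apply(raw, 2, min) >= 0   # skewed and non-negative
raw[, to_log] <- log1p(raw[, to_log])
data2 <- scale(raw)
kNNdistplot(data2, k = 34)  # look for a new knee before fixing eps
db2 <- dbscan(data2, eps = 3, minPts = 34)
db2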

How do I produce a probability histogram?

I've just started learning R and was wondering: say I have the dataset quakes, and I want to generate a probability histogram of quakes near Fiji. Would the code simply be hist(quakes$lat, freq = F)?
A histogram shows the frequency or proportion of a given value out of all the values in a data set. You need a numeric vector as the x argument for hist(), and quakes does have a lat variable, so hist(quakes$lat, freq = F) works and draws the density-scaled histogram of latitudes.
This shows the north/south geographical distribution of the earthquakes, centering around -20, and, since it is approximately normal (with a left skew), suggests that there is a mechanism for earthquake generation that centers around a specific latitude.
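Spelled out, with an optional kernel density overlay for comparison:
hist(quakes$lat, freq = FALSE,  # freq = FALSE rescales bars so total area is 1
     main = "Latitude of Fiji region earthquakes", xlab = "Latitude")
lines(density(quakes$lat), lwd = 2)  # kernel density estimate over the histogram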
The best way to learn is to try. If you wonder if that would be the way to do it, try it.
You might also want to look at this tutorial on creating kernel density plots with ggplot.

Scale outlier data to normalized data for visualization in R

I'm working with some data that has a few major outliers, mostly due to the technology used to capture the data. I removed these to normalize the data; however, for the nature of the work, I've been asked to visualize every participant's results in a series of graphs in order to compare performances. I'm a little new to R, so while the normalization wasn't difficult, I'm a little stumped as to how I might go about re-introducing these outliers to the scale of the normalized data. Is there a way to scale outliers to previously normalized data (mean=0) without skewing the data?
EDIT: I realize I left a lot of info out (still new to asking questions here), so here's an example of what my process looks like right now:
#example data of 20 participants, 18 of which are normal-range and 2 of which
#are outliers in a data frame
time <- rnorm(18, mean = 30, sd = 10)
distance <- rnorm(18, mean = 100, sd = 20)
time <- c(time, 2, 100)
distance <- c(distance, 30, 1000)
df <- data.frame(time, distance)
The outliers were mostly known due to the nature of the data collection, so I removed them:
dfClean <- df[-c(19, 20),]
And I plotted the data to check for normality afterwards (step skipped here because the data was generated to be normal).
From there, I normalized the columns in the data set so that each variable would have a mean of 0 and an sd of 1, so they could be plotted together. The goal is to use this as a "normal" range to be able to visualize spread and outliers in future data (with an accent on visualization).
#using package clusterSim
dfNorm <- data.Normalization(dfClean, type="n13", normalization = "column")
The problem is, I'm not sure how to scale the outliers to this range afterwards, or whether I'm even understanding the scale function correctly. So, how do I plot all the subjects in the original df, including the outliers, on a normalized mean = 0 scale?
I am not sure if external links alone are a good way to answer a Stack Overflow question, but you can refer to the link below to resolve your problem:
https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/
I have used this many times and found it useful.
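A more direct route to what was asked, as a minimal sketch: compute the centering and scaling parameters from the clean data only, then apply those same parameters to every row, outliers included, so the outliers land on the clean data's z-scale instead of distorting it.
# center/scale parameters from the clean (outlier-free) data only
mu  <- sapply(dfClean, mean)
sdv <- sapply(dfClean, sd)
# apply the same parameters to the full data frame, outliers included
dfAll <- as.data.frame(scale(df, center = mu, scale = sdv))
plot(dfAll$time, dfAll$distance)  # clean points sit near 0; outliers plot far out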

Graphing results of dbscan in R

Any comments, suggestions, or solutions would be greatly appreciated. Thank you.
I'm using the fpc package in R to do a DBSCAN analysis of some very dense data (3 sets of 40,000 points, each in the range -3 to 6).
I've found some clusters, and I need to graph just the significant ones. The problem is that I have a single cluster (the first) with about 39,000 points in it. I need to graph all clusters but this one.
dbscan() returns a special data type to store all of this cluster data in. It's not indexed like a data frame would be (but maybe there is a way to represent it as such?).
I can graph the dbscan object using a basic plot() call. But, like I said, this will graph the irrelevant 39,000 points.
tl;dr:
how do I graph only specific clusters of a dbscan data type?
If you look at the help page (?dbscan), it is organized like all others into sections labeled Description, Usage, Arguments, Details and Value. The Value section describes what the function dbscan returns. In this case it is simply a list (a standard R data type) with a few components.
The cluster component is simply an integer vector whose length is equal to the number of rows in your data, and which indicates which cluster each observation is a member of. So you can use this vector to subset your data, extract only those clusters you'd like, and then plot just those data points.
For example, if we use the first example from the help page:
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.2))
ds <- dbscan(x, 0.2)
we can then use the result, ds to plot only the points in clusters 1-3:
#Plot only clusters 1, 2 and 3
plot(x[ds$cluster %in% 1:3,])
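A small extension of the same idea, keeping the labels so each plotted point is colored by its cluster:
keep <- ds$cluster %in% 1:3          # logical index of points in clusters 1-3
plot(x[keep, ], col = ds$cluster[keep], pch = 19)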
Without knowing the specifics of dbscan, I can recommend that you look at the function smoothScatter. It is very useful for examining the main patterns in a scatterplot when you would otherwise have too many points to make sense of the data.
Probably the most sensible way of plotting DBSCAN results is using alpha shapes, with the radius set to the epsilon value. Alpha shapes are closely related to convex hulls, but they are not necessarily convex. The alpha radius controls the amount of non-convexity allowed.
This is quite closely related to the DBSCAN cluster model of density connected objects, and as such will give you a useful interpretation of the set.
As I'm not using R, I don't know about the alpha shape capabilities of R. There supposedly is a package called alphahull, from a quick check on Google.
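For what it's worth, a hedged sketch with that alphahull package, drawing an alpha shape around one cluster from the example above (alpha matched to the eps used for DBSCAN):
library(alphahull)
pts <- unique(x[ds$cluster == 1, ])  # ahull() requires distinct points
hull <- ahull(pts, alpha = 0.2)      # alpha = the DBSCAN eps from above
plot(hull)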
