DBSCAN on a highly dense dataset in R

I've recently been studying DBSCAN in R for transit research purposes, and I'm hoping someone can help me with this particular dataset.
A summary of my dataset is shown below:
      BTIME ATIME
1029  20001 21249
2944  24832 25687
6876  25231 26179
11120 20364 21259
11428 25550 26398
12447 24208 25172
What I am trying to do is cluster these data using BTIME as the x axis and ATIME as the y axis. Each (BTIME, ATIME) pair represents the boarding time and arrival time of a subway passenger.
For more explanation, I will add the scatter plot of my total dataset.
However, if I split my dataset into smaller time periods, the scatter plot looks like this; I would call this a sample dataset.
If I perform DBSCAN clustering on the second image (the sample dataset), the clustering works as expected.
However, DBSCAN does not seem able to find clusters in the total dataset at smaller scales, maybe because the data are too dense.
So my questions are:
Is there a way I can perform clustering on the total dataset? I think the total dataset is highly dense, which is why I tried clustering on a sample time period instead.
What criteria should be used to split the data into smaller time scales? If I separate my total data into smaller time scales, how would I choose the hyperparameters for each separated dataset? Looking at the data, the distribution is similar in both the total dataset and the separated sample dataset.
I would sincerely appreciate any advice.
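For what it's worth, here is a minimal sketch of the usual eps/minPts workflow with the dbscan package; the data frame below is a toy stand-in built from the six rows above, and the eps value is purely illustrative. The kNN distance plot is a common heuristic for choosing eps, and on a denser dataset the elbow (and hence eps) sits correspondingly lower.

library(dbscan)

# Toy stand-in built from the six rows shown above
df <- data.frame(BTIME = c(20001, 24832, 25231, 20364, 25550, 24208),
                 ATIME = c(21249, 25687, 26179, 21259, 26398, 25172))
x <- as.matrix(df)

# Common eps heuristic: sort each point's distance to its k-th nearest
# neighbour (k = minPts) and read eps off the elbow of the curve
kNNdistplot(x, k = 4)
abline(h = 200, lty = 2)   # illustrative candidate eps, not a recommendation

res <- dbscan(x, eps = 200, minPts = 4)   # cluster 0 is noise
plot(x, col = res$cluster + 1L, pch = 20)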

Related

Divide a set of curves into groups using functional data analysis

I have a dataset containing about 500 curves. In my dataset, every row is a curve (which comes from experimental measurements) and the columns are the measurement intervals (I don't think it's important, but the intervals are frequency measurements, not time measurements).
Here you can find the data:
https://drive.google.com/file/d/1q1F1any8RlCIrn-CcQEzLWyrsyTBCCNv/view?usp=sharing
curves      t1      t2
     1  -57.48  -57.56
     2  -56.22  -56.28
     3  -57.06  -57.12
I want to divide this dataset into 2 - 4 homogeneous groups of curves.
I've seen that there are some R packages (fda and funHDDC) that allow you to find clusters, but I don't know how to create the list with which to start the analysis, and I also don't understand why the initial dataset doesn't fit. How can I transform the data I have into a list suitable for processing with the above packages?
What results should I expect?
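A minimal sketch of one way to go from a plain curves-in-rows table to a funHDDC run, assuming an fd object built with fda is used as input; the matrix below is a toy stand-in for the real 500-curve table, and nbasis = 15 is an arbitrary smoothing choice.

library(fda)
library(funHDDC)

# Toy stand-in for the 500-curve table: one row per curve, one column per
# measurement point (replace with the real data frame)
set.seed(1)
curves <- matrix(rnorm(500 * 30), nrow = 500)

mat <- t(curves)                            # fda expects one curve per column
argvals <- seq(0, 1, length.out = nrow(mat))

# Smooth the raw measurements into a functional data (fd) object
basis <- create.bspline.basis(rangeval = c(0, 1), nbasis = 15)
fdobj <- Data2fd(argvals = argvals, y = mat, basisobj = basis)

# Try 2 to 4 clusters; funHDDC retains the best solution by its criterion
fit <- funHDDC(fdobj, K = 2:4)
table(fit$class)                            # cluster sizes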

time series with multiple observations per unit of time

I have a dataset of the daily spreads of 500 stocks. My eventual goal is to build a model using extreme value theory. However, as one of the first steps, I want to check my data for volatility clustering and leptokurtosis. So I first want R to treat my data as a time series, and I want to plot the data. However, I can only find examples of time series with a single observation per unit of time. Is there a way for R to treat my type of dataset as a time series, and what is the best way to plot it?
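A minimal sketch using the zoo package, which handles multivariate series with one column per stock; the matrix and date index below are toy stand-ins for the real spread data.

library(zoo)

# Toy stand-in: 250 trading days x 500 stocks of daily spreads
set.seed(1)
spreads <- matrix(abs(rnorm(250 * 500)), nrow = 250)
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = nrow(spreads))

z <- zoo(spreads, order.by = dates)   # one multivariate time series object

# Overlay all 500 series as thin, semi-transparent lines in one panel
plot(z, plot.type = "single",
     col = rgb(0, 0, 0, alpha = 0.05), ylab = "daily spread")

# Or inspect a handful of stocks in separate panels
plot(z[, 1:4])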

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust clusters the columns of a table), then ran pvclust selecting the Jaccard distance (“binary” in R) as the distance measure (suitable for species presence/absence data) and Ward's method (“ward.D2”). I used parallel = TRUE to reduce computation time. However, with the default nboot = 1000 my computer was not able to finish the computation in hours and I finally got an error, so I tried a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues seems to be the size of the dataset itself. However, here are the lines of code I used for the transposition, clustering and plotting:
library(pvclust)

tdata <- t(data)   # pvclust clusters columns, so records must become columns
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels = FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlapping branches).
As you can see, the p-values for the higher branches of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even if the clusters had very low significance.
So my questions are:
Is there anything I got wrong in the pvclust call itself?
Could my low nboot (due to a “weak” computer) be a reason for the non-significance of my results?
Are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
Edit: I have tried running the same code on a subset of 500 records with nboot = 1000. This finished in a reasonable time, but the output is still not very satisfying; see dendrogram 2 (obtained for a subset of 500 records with nboot = 1000).
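For completeness, a sketch of that subset run, with a toy stand-in generated in place of the real records-by-species table; pvrect is pvclust's helper for boxing clusters whose AU p-value exceeds a threshold.

library(pvclust)

# Toy stand-in for the records-by-species presence/absence table
set.seed(42)
data <- matrix(rbinom(2000 * 13, 1, 0.3), nrow = 2000)
data <- data[rowSums(data) > 0, ]   # drop empty records (binary dist needs this)

sub  <- data[sample(nrow(data), 500), ]   # random 500-record subset
tsub <- t(sub)                            # pvclust clusters the columns

cl <- pvclust(tsub, method.hclust = "ward.D2", method.dist = "binary",
              nboot = 1000, parallel = TRUE)
plot(cl, labels = FALSE)
pvrect(cl, alpha = 0.95)   # box clusters with AU p-value >= 0.95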

Plotting different mixture model clusters in the same curve

I have two data sets: one representing healthy operation, with 4 variables and 11,000 points, and another representing faulty operation, with 4 variables and 600 points. I have used R's mclust package to obtain a GMM clustering for each data set separately. What I want is to see both clusterings in the same frame so I can study them at the same time. How can that be done?
I have tried joining the two datasets, but the result I obtain is not what I want.
The code in use is:
Dat4M <- Mclust(Dat3, G = 3)
Dat3 is where I store my dataset, and Dat4M is where I store the result of Mclust; G = 3 is the number of Gaussian mixture components I want, in this case three. To plot the result, the following code is used:
plot(Dat4M)
The following is obtained when I apply the above code to my healthy dataset:
The following is obtained when the above code is used on the faulty dataset:
Notice that in the faulty data density curves, for the CCD/CCA pair, two density peaks appear. I want to place that same pair in the same panel as the healthy data and study the differences.
Any help on how to do this will be appreciated.
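A minimal sketch of one way to view both fits at once: fit each data set separately and draw the classification plots for one variable pair side by side. The matrices are toy stand-ins; CCD and CCA are names taken from the question, the other variable names are made up.

library(mclust)

# Toy stand-ins for the two data sets (replace with the real tables)
set.seed(1)
healthy <- matrix(rnorm(11000 * 4), ncol = 4,
                  dimnames = list(NULL, c("CCD", "CCA", "V3", "V4")))
faulty  <- matrix(rnorm(600 * 4, mean = 1), ncol = 4,
                  dimnames = list(NULL, c("CCD", "CCA", "V3", "V4")))

fitH <- Mclust(healthy, G = 3)
fitF <- Mclust(faulty,  G = 3)

# Draw the classification plots for the CCD/CCA pair side by side
op <- par(mfrow = c(1, 2))
plot(fitH, what = "classification", dimens = c(1, 2))
title("Healthy")
plot(fitF, what = "classification", dimens = c(1, 2))
title("Faulty")
par(op)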

Are there any limits on k-means in terms of k, data dimensionality, and data size (millions of samples)?

I have a dataset consisting of 2 million samples.
I want to use k-means to cluster this dataset into 2000 clusters.
Is it OK to use this number of clusters with this data size?
Note: the feature vector of each sample has 1000 dimensions.
To predict the runtime of an algorithm, you can take a look at its time complexity. This is a formula that relates the running time to parameters such as the number of data points and the number of clusters in k-means. Information about the time complexity of k-means clustering can be found here: Computational complexity of k-means
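A back-of-the-envelope check of that cost for the sizes in the question, followed by a sketch of a common workaround (subsample, fit, then assign all samples to the nearest learned center); the toy matrix stands in for the real data.

# Back-of-the-envelope for Lloyd's algorithm: each iteration costs on the
# order of n * k * d operations (n samples, k clusters, d dimensions)
n <- 2e6; k <- 2000; d <- 1000
n * k * d   # = 4e12 distance-component operations per iteration

# Sketch of a common workaround: run k-means on a random subsample, then
# assign every sample to its nearest learned center
set.seed(1)
X <- matrix(rnorm(5000 * 20), ncol = 20)   # toy stand-in for the real data
fit <- kmeans(X[sample(nrow(X), 1000), ], centers = 50, iter.max = 25)

nearest_center <- function(X, centers) {
  # squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2 (n x k matrix)
  d2 <- outer(rowSums(X^2), rep(1, nrow(centers))) -
    2 * X %*% t(centers) +
    outer(rep(1, nrow(X)), rowSums(centers^2))
  max.col(-d2)                             # index of the nearest center
}
labels <- nearest_center(X, fit$centers)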
