Identifying Data Bands based on Distance between Centroids with Clustering in R

I'm trying to use clustering to identify bands in my data set. I'm working with supply chain data, and the relevant column is the Price per Each.
The problem is that sometimes a product is incorrectly recorded as coming in a Case of 100 instead of a Case of 10, so the Price per Each might look like (2, 0.25, 3). I want to write code that only creates an additional cluster if its mean price is at least 2 times greater or smaller than that of every existing cluster.
For example, if my prices per each were (4, 5, 6, 13, 14, 15), I want it to return 2 clusters with centroids of 5 and 14. If, on the other hand, my data looked like (3, 4, 5, 6), it should return one cluster.
The goal is to write code that returns the product codes of items for which multiple clusters have been generated, so that I can audit those product codes for bad units of measure (Case 100 vs Case 10).
I'm thinking about using divisive hierarchical clustering, but I don't know how to introduce the centroid distance rule for creating new clusters.
I'm fairly new to R, but I have SQL and Stata experience, so I'm looking for a package that would do this or help with the syntax I need to accomplish this.

Don't use clustering here.
While you can probably use HAC with a ratio-like distance function and a threshold of 8x, this will be rather unreliable and expensive: clustering usually takes O(n²) or O(n³) time.
If you know that these errors happen, but not frequently, then I'd rather use a classic statistical approach. For example, compute the median and then report values that are 9x larger or smaller than the median as errors. If errors are infrequent enough, you could even use the mean, but the median is more robust.
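For concreteness, here is a minimal base-R sketch of that median check, assuming a data frame orders with hypothetical columns ProductCode and PricePerEach (the 9x cutoff is just the threshold mentioned above; tune it to your data):
ratio_threshold <- 9   # flag values this many times larger/smaller than the median
med   <- ave(orders$PricePerEach, orders$ProductCode, FUN = median)  # per-product median price
ratio <- orders$PricePerEach / med
bad   <- ratio > ratio_threshold | ratio < 1 / ratio_threshold
suspect_codes <- unique(orders$ProductCode[bad])
suspect_codes   # product codes to audit for bad units of measure (Case 100 vs Case 10)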

Related

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time, etc., and presence/absence data (0/1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust selecting Jaccard distances (“binary” in R) as the distance measure (suitable for species presence/absence data) and Ward's method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using the default of nboot = 1000, my computer was not able to finish the computation in hours and I finally got an error, so I tried with a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
library(pvclust)
tdata <- t(data)    # pvclust clusters the columns, so transpose first
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels = FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher branches of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be:
is there anything I got wrong in the pvclust function itself?
could my low nboot (due to a “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
I have since tried to run the same code on a subset of 500 records with nboot = 1000. This worked in a reasonable computation time, but the output is still not very satisfying (dendrogram obtained for a subset of 500 records with nboot = 1000).
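For reference, the subset run described above can be set up along these lines (a sketch, assuming the presence/absence records are the rows of data):
library(pvclust)
set.seed(42)
sub   <- data[sample(nrow(data), 500), ]   # random subset of 500 records
tsub  <- t(sub)                            # pvclust clusters columns, so transpose
cl500 <- pvclust(tsub, method.hclust = "ward.D2", method.dist = "binary",
                 nboot = 1000, parallel = TRUE)
plot(cl500, labels = FALSE)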

Validating Fuzzy Clustering

I would like to use fuzzy C-means clustering on a large unsupervised data set of 41 variables and 415 observations. However, I am stuck on trying to validate those clusters. When I plot with a random number of clusters, I can explain a total of 54% of the variance, which is not great, and there are no really nice clusters as there would be with the iris data set, for example.
First I ran the FCM on my scaled data with 3 clusters just to see, but if I am trying to find a way to search for the optimal number of clusters, then I do not want to set an arbitrarily defined number of clusters.
So I turned to Google and searched for "validate fuzzy clustering in R". This link here was good, but I still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clValid packages but could not find a walkthrough for the functions. I looked at the documentation of each package, but still could not discern what to do next.
I walked through some possible numbers of clusters and checked each one with the k.crisp object from fanny. I started with 100 and got down to 4. Based on the object's description in the documentation,
k.crisp: integer (≤ k) giving the number of crisp clusters; can be less than k, where it's recommended to decrease memb.exp.
this doesn't seem like a valid approach, because it compares the number of crisp clusters to our fuzzy clusters.
Is there a function where I can check the validity of my clusters for 2:10 clusters? Also, is it worthwhile to check the validity of 1 cluster? I think that is a stupid question, but I have a strange feeling that 1 optimal cluster might be what I get. (Any tips on what to do if I were to get 1 cluster, besides cry a little on the inside?)
Code
library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)
data(iris)
df<-sapply(iris[-5],scale)
res.fanny<-fanny(df,3,metric='SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning that all memberships are very close to 1/k. Maybe increase memb.exp, which I don't fully understand
# From my understanding using the SqEuclidean is equivalent to Fuzzy C-means, use the website below. Ultimately I do want to use C-means, hence I use the SqEuclidean distance
fviz_cluster(res.fanny, ellipse.type = 'norm', palette = 'jco', ggtheme = theme_minimal(), legend = 'right')
fviz_silhouette(res.fanny,palette='jco',ggtheme=theme_minimal())
# With ppclust
set.seed(123)
res.fcm<-fcm(df,centers=3,nstart=10)
As far as I know, you need to go through different numbers of clusters and see how the percentage of variance explained changes with the number of clusters. This method is called the elbow method.
wss <- sapply(2:10, function(k) {
  fcm(df, centers = k, nstart = 10)$sumsqrs$tot.within.ss
})
plot(2:10, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
In the resulting plot, the total within-cluster sum of squares changes slowly after k = 5, so k = 5 is a good candidate for the optimal number of clusters according to the elbow method.
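If you want a validity index rather than eyeballing an elbow, another option (not from the answer above, just a sketch reusing the scaled df from the question's code) is to sweep k and compare the average silhouette width reported by fanny():
library(cluster)
avg_sil <- sapply(2:10, function(k) {
  # average silhouette width of the nearest crisp clustering for each k
  fanny(df, k, metric = 'SqEuclidean')$silinfo$avg.width
})
plot(2:10, avg_sil, type = 'b', pch = 19,
     xlab = 'Number of clusters K', ylab = 'Average silhouette width')
# The k with the largest average silhouette width is a reasonable candidate.
# Note this only works for k >= 2, so it says nothing about the 1-cluster case.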

Apply k-means to examine differences between two groups in R

I have two groups. The treatment group is exposed to media; the control group is not. They are distinguished by a categorical variable in the data frame (exposure to media = 1, no media = 0).
Now, I want to examine whether there are any clear differences between these two groups. To do this, I want to apply the k-means algorithm with two clusters to four variables (proportion of Black population, proportion of male population, proportion of Hispanic population, and median income on the logarithmic scale).
How to do this in R? Could anyone give some hints? Thanks!
Try this:
km <- kmeans(your_data, centers = 2, nstart = 10)
Here your_data is a data.frame (your whole data, or just the variables you are interested in). You need to choose the number of clusters (here it is 2). A good practice for understanding your data is to try different numbers of clusters and see which one fits your data better (using, for example, a criterion such as AIC or BIC).
k-means is an approach for clustering data that is assumed to come from several different distributions, when we would like to know which distribution each observation comes from.
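A minimal end-to-end sketch for this particular question, assuming a data frame dat with hypothetical column names prop_black, prop_male, prop_hispanic, log_median_income and the treatment indicator media (0/1):
vars <- c('prop_black', 'prop_male', 'prop_hispanic', 'log_median_income')
X <- scale(dat[, vars])   # standardize so no single variable dominates the distances

set.seed(1)
km <- kmeans(X, centers = 2, nstart = 10)

# Cross-tabulate cluster assignment against the treatment indicator to see
# whether the two clusters line up with media exposure at all.
table(cluster = km$cluster, media = dat$media)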
You can also have a look at many tutorials about kmeans in R. For example,
https://onlinecourses.science.psu.edu/stat857/node/125
https://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
http://www.statmethods.net/advstats/cluster.html

Statistical functions for correlation between 2 data sets in R

This is more of a general question that I haven't been able to find. I am trying to find the correlation between 2 data sets, with the goal of matching them with a certain correlation percentage. They won't be exact matches, but will mostly be within 1%, though there will likely be some outliers. For example, every 100th point might be off by 5%, possibly more.
I am also trying to find instances where one data set might match another but at a different magnitude; for example, if you multiplied all of the data by some multiplier, you would get a match. It obviously wouldn't make sense to loop through a ton of possible multipliers. I'm contemplating matching only the sign of the slope (+1/-1) rather than the slope itself, though this would not work in some instances: the data is very granular, so it might match the shape of the data, but if you zoom in the slopes would be off.
Are there any built in functions in R? I don't have a statistical background and my searches came up with mostly how to handle a single data set and outliers in those.
For a basic Pearson, Spearman, or Kendall correlation, you can use the cor() function:
x <- c(1, 2, 5, 7, 10, 15)
y <- c(2, 4, 6, 9, 12, 13)
cor(x, y, use="pairwise.complete.obs", method="pearson")
You're going to want to adjust the "use" and "method" options based on your data. Since you didn't describe the nature of your data, I can't give you any more specific guidance.
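On the different-magnitude case in the question: Pearson correlation is unchanged by multiplying a series by a positive constant (or adding an offset), so you don't need to search over multipliers just to detect that kind of match. A quick check:
x <- c(1, 2, 5, 7, 10, 15)
cor(x, 3 * x)        # 1: a scaled copy correlates perfectly
cor(x, 3 * x + 2)    # still 1: an added offset doesn't change it either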

Merging two statistical result sets

The processing can produce a large amount of results, so I would rather not have to store all of the raw data just to recalculate combined statistics later on.
Say I have two sets of statistics that describe two different sessions of runs over a process.
Each set contains
Statistics : { mean, median, standard deviation, runs on process}
How would I merge the two sets' mean, median, and standard deviation to get a combined summary describing both sessions?
Remember, I can't preserve both sets of data that the statistics are describing.
Artelius (the answer below) is mathematically right, but the way he suggests computing the variance is numerically unstable. You want to compute the variance as follows:
new_var = ( n(0)*(var(0) + (mean(0) - new_mean)**2)
          + n(1)*(var(1) + (mean(1) - new_mean)**2)
          + ... ) / new_n
Edit, from a comment:
The problem with the original formula is that if your deviation is small compared to your mean, you will end up subtracting a large number from a large number to get a relatively small one, which causes you to lose floating-point precision. The formula above avoids this problem: rather than converting to E(X^2) and back, it just adds all the contributions to the total variance together, properly weighted according to their sample sizes.
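To make that concrete, here is a small R sketch of the merge (merge_stats is just an illustrative name; the formulas use the population variance, i.e. divide by n, to match the notation above):
merge_stats <- function(a, b) {
  # a and b are lists with n, mean, and var (population variance) of each set
  new_n    <- a$n + b$n
  new_mean <- (a$mean * a$n + b$mean * b$n) / new_n
  new_var  <- (a$n * (a$var + (a$mean - new_mean)^2) +
               b$n * (b$var + (b$mean - new_mean)^2)) / new_n
  list(n = new_n, mean = new_mean, var = new_var, sd = sqrt(new_var))
}

# Quick check against the variance of the pooled raw data:
x <- c(10.1, 10.2, 10.3); y <- c(10.4, 10.5, 10.6, 10.7)
s <- merge_stats(list(n = length(x), mean = mean(x), var = mean((x - mean(x))^2)),
                 list(n = length(y), mean = mean(y), var = mean((y - mean(y))^2)))
all.equal(s$var, mean((c(x, y) - mean(c(x, y)))^2))   # TRUE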
You can get the mean and standard deviation, but not the median.
new_n = (n(0) + n(1) + ...)
new_mean = (mean(0)*n(0) + mean(1)*n(1) + ...) / new_n
new_var = ((var(0)+mean(0)**2)*n(0) + (var(1)+mean(1)**2)*n(1) + ...) / new_n - new_mean**2
where n(0) is the number of runs in the first data set, n(1) is the number of runs in the second, and so on, mean is the mean, and var is the variance (which is just standard deviation squared). n**2 means "n squared".
Getting the combined variance relies on the fact that the variance of a data set is equal to the mean of the square of the data set minus the square of the mean of the data set. In statistical language,
Var(X) = E(X^2) - E(X)^2
The var(n)+mean(n)**2 terms above give us the E(X^2) portion which we can then combine with other data sets, and then get the desired result.
In terms of medians:
If you are combining exactly two data sets, then you can be certain that the combined median lies somewhere between the two medians (or is equal to one of them), but there is little more that you can say. Taking the average of the two medians should be OK, unless you need the combined median to be equal to an actual data point.
If you are combining many data sets in one go, you can either take the median of the medians or take their average. If there may be significant systematic differences between the different data sets, then taking their average is probably better, as taking the median reduces the effect of outliers. But if you have systematic differences between runs, disregarding them is probably not a good thing to do.
Recovering the exact median is not possible. Say you have two tuples, (1, 1, 1, 2) and (0, 0, 2, 3, 3): their medians are 1 and 2, but the overall median is 1. There is no way to tell that from the two medians alone.
