Understanding R subspace package clustering output

Somewhat related to this question.
I am using the R subspace package for subspace clustering. As in the question above, I have failed to use the generic plotting method to plot out my resulting clusters in a way native to the package. The next step is to understand the output of the command
CLIQUE(df, xi = 40, tau = 0.2)
That looks something like this:
I understand that "objects" are the row numbers of the clustered units, and "subspace" indicates the dimensions of the data in which the clustering was done. However, I don't see how the clusters in the given dimensions can be distinguished.
The documentation does not contain information on the output. Ideally, my goal is to plot out all the clusters with something like ggplot2, or in 3D, what have you. For that I need to know which units are in which clusters in the corresponding dimensions.
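For reference, a minimal way to inspect the returned object (a sketch only; it assumes the result of the CLIQUE() call above is stored as clique_model, the name also used in the code further down):
library(subspace)

clique_model <- CLIQUE(df, xi = 40, tau = 0.2)

length(clique_model)      # number of clusters found
str(clique_model[[1]])    # one cluster: $subspace (logical per dimension),
                          #              $objects  (row indices of its members)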
Additionally, I checked whether the subspaces of any two members of the output list are the same, like this:
cluster_result <- clique_model

# Pairwise check: is the subspace of cluster i identical to that of cluster j?
equalities_matrix <- matrix(
  0L, nrow = length(cluster_result), ncol = length(cluster_result)
)
for (i in seq_along(cluster_result)) {
  for (j in seq_along(cluster_result)) {
    equalities_matrix[i, j] <- all(
      cluster_result[[i]]$subspace == cluster_result[[j]]$subspace
    )
  }
}
# If no two clusters share a subspace, only the diagonal is TRUE
sum(equalities_matrix)
The answer was no: only the diagonal entries were TRUE, so no two clusters shared a subspace.

So, here is what my research led to; it might be helpful for someone in the future.
Above, the CLIQUE algorithm output one cluster per set of dimensions, possibly by virtue of the data or the tuning of the algorithm. I added more features to the data, ran it again, and repeated the check for identical subspaces. This time several subspaces were indeed the same: sum(equalities_matrix) yielded a number larger than the length of the output list (the diagonal alone accounts for that many TRUE entries).
In conclusion, the output of the algorithm is a list of lists, where each member list represents one cluster in one subspace:
subspace ... a logical vector with TRUE for the dimensions that span the subspace,
objects ... the row indices of the units belonging to the cluster.
If there is more than one cluster in a given subspace, there will be several member lists with the same subspace but different objects.
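Given that structure, here is a minimal plotting sketch (an illustration only, not a package feature): it keeps the two-dimensional subspaces, looks up the two flagged dimensions of each cluster in the original df, and facets by subspace with ggplot2.
library(ggplot2)

cluster_df <- do.call(rbind, lapply(seq_along(clique_model), function(k) {
  cl   <- clique_model[[k]]
  dims <- which(cl$subspace)            # dimensions spanning this subspace
  if (length(dims) != 2) return(NULL)   # keep only 2-D subspaces for a 2-D plot
  data.frame(object   = cl$objects,
             cluster  = factor(k),
             x        = df[cl$objects, dims[1]],
             y        = df[cl$objects, dims[2]],
             subspace = paste(dims, collapse = "x"))
}))

ggplot(cluster_df, aes(x, y, colour = cluster)) +
  geom_point() +
  facet_wrap(~ subspace, scales = "free")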
Here are the papers that helped me understand the theory:
Parsons, Haque, and Liu (2004), "Subspace Clustering for High Dimensional Data: A Review", ACM SIGKDD Explorations.
Agrawal, Gehrke, Gunopulos, and Raghavan (1998), "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications" (the original CLIQUE paper).

Related

How to merge nodes in a decision tree in R

I would like to use a decision tree as proposed by Yan et al. (2004), "Adaptive Testing With Regression Trees in the Presence of Multidimensionality" (https://journals.sagepub.com/doi/epdf/10.3102/10769986029003293), which produces trees like this:
It also seems that a similar (the same?) idea was proposed more recently under the name "decision stream" (https://arxiv.org/pdf/1704.07657.pdf).
The goal is to perform the usual splitting as in CART or a similar decision tree at each node but then merge nodes where the difference in the target variable between nodes is smaller than some prespecified value.
I didn't find a package that can do this (I think SAS might be able to, and there is a decision stream implementation in Clojure). I looked into the partykit package, hoping I could modify it a bit to produce this behaviour, but there are two main issues: the tree is constructed recursively, so nodes don't know about other nodes at the same level, and the tree representation does not allow pointing to nodes that are already in the tree, so a node can only have one parent, whereas I would need more. I also thought about repeatedly fitting tree stumps, merging nodes, and repeating, but I don't know how I would make predictions with something like that.
Edit: some example code
set.seed(1)
n_subjects <- 100
n_items <- 4

# Simulate binary item responses and an outcome built from weighted responses
responses <- matrix(rep(c(0, 1), times = (n_subjects / 2) * n_items),
                    ncol = n_items)
responses <- as.data.frame(apply(responses, 2, function(x) sample(x)))
weights <- c(20, 20, 20, 10)
responses$outcome <- rowSums(responses[, 1:n_items] * weights)

library(rpart)    # rpart()
library(rattle)   # fancyRpartPlot()
tree <- rpart(outcome ~ ., data = responses)
fancyRpartPlot(tree, tweak = 1.2, sub = '')
Outcome:
I would expect nodes 5 and 6 to get merged, because the outcome value is basically the same (34 and 35), before the splitting continues from the merged node.
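A rough post-hoc sketch of the "merge leaves whose outcome values are close" idea, using the tree and responses from the example above (an illustration only; it is not the recursive merge-then-split procedure from Yan et al., and the helper merge_leaves() and its merge_tol argument are made up here, not from an existing package): group terminal nodes whose mean outcome differs by less than merge_tol and pool their predictions.
# Group terminal nodes of a fitted rpart tree whose mean outcome is close,
# then use the pooled group mean as the merged prediction.
merge_leaves <- function(tree, data, outcome, merge_tol = 2) {
  leaf <- tree$where                                   # terminal node per row
  leaf_mean <- tapply(data[[outcome]], leaf, mean)     # mean outcome per leaf
  # single-linkage grouping of leaves within merge_tol of each other
  grp <- cutree(hclust(dist(cbind(leaf_mean)), method = "single"), h = merge_tol)
  row_grp  <- grp[as.character(leaf)]                  # merged group per row
  grp_mean <- tapply(data[[outcome]], row_grp, mean)   # pooled prediction
  data.frame(leaf       = unname(leaf),
             group      = unname(row_grp),
             prediction = unname(grp_mean[as.character(row_grp)]))
}

merged <- merge_leaves(tree, responses, "outcome", merge_tol = 2)
table(merged$leaf, merged$group)   # which leaves ended up merged
This only merges after the tree is fully grown, so it does not answer how to continue splitting from a merged node; it just shows one way to obtain merged predictions from threshold-based grouping.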

Generating multiple random graphs in R with the same number of nodes and ties?

I would like to generate a large number of random graphs with the same number of nodes and ties, and use the result to find the distributions etc of the standard metrics.
I found this link for generating random graphs with a given number of nodes and ties (Graph generation given number of edges and nodes). Is there an easy way to tell R to do this 1000 times or so and combine all of the results into one object that I can then analyse (for things like average distance, degree, diameter, etc.)?
Ultimately I want to be able to use this information for comparison with an empirical network.
I got this answer from a friend, and it appears to work:
library(igraph)

RR1 <- list()
for (i in 1:1000) {
  # G(n, m) random graph with a fixed number of nodes (559) and ties (9640)
  RR1[[i]] <- erdos.renyi.game(559, 9640, type = "gnm",
                               directed = FALSE, loops = FALSE)
}

# Sanity check: every graph has the requested number of edges
number_of_edges <- sapply(RR1, gsize)
number_of_edges
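To then get the distributions of the standard metrics mentioned above, one option is to sapply() the corresponding igraph functions over the list (a sketch; empirical_graph is a placeholder for your own observed network object):
# Distributions of a few standard metrics across the 1000 simulated graphs
avg_path  <- sapply(RR1, mean_distance, directed = FALSE)
diameters <- sapply(RR1, diameter, directed = FALSE)
mean_degs <- sapply(RR1, function(g) mean(degree(g)))

hist(avg_path, main = "Average path length, 1000 G(n, m) graphs")

# e.g. where the empirical value falls within the simulated distribution
# (empirical_graph is a placeholder for your observed network):
# mean(avg_path >= mean_distance(empirical_graph, directed = FALSE))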

R and SPSS: Different results for Hierarchical Cluster Analysis

I'm performing hierarchical cluster analysis using Ward's method on a dataset containing 1000 observations and 37 variables (all are 5-point Likert scales).
First, I ran the analysis in SPSS via
CLUSTER Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
FREQUENCIES CLU2_1.
I additionally performed the analysis in R:
datA <- subset(dat, select = Var01:Var37)
dist <- dist(datA, method = "euclidean")
hc <- hclust(d = dist, method = "ward.D2")
table(cutree(hc, k = 2))
The resulting cluster sizes are:
          1    2
SPSS    712  288
R       610  390
These results are obviously confusing to me, as they differ substantially (which becomes highly visible when comparing the dendrograms; the same applies to the 3-10 cluster solutions). "ward.D2" takes the squared distances into account, if I'm not mistaken, so I passed the simple distance matrix here. However, I tried several combinations of distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying the "ward.D" method, and so on. I also looked at the distance matrices generated by SPSS and R, which are identical (when applying the same method). Finally, I excluded duplicate cases (N=29) from my data, guessing that those might have caused differences when being allocated (randomly) at a certain point. None of this resulted in matching outputs in R and SPSS.
I tried running the analysis with the agnes() function from the cluster package, which again gave different results compared to SPSS and even hclust() (but that's a topic for another post, I guess).
Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?
If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is having tied distances handled differently in two algorithms. Tied distances might be present with the original full distance matrix, or might occur during the joining process. If one program searches the matrix and finds two or more distances tied at the minimum value at that step, and it selects the first one, while another program selects the last one, or one or both select one at random from among the ties, different results could occur.
I'd suggest starting with a small example whose values have a little randomness added, to make tied distances unlikely, and seeing whether the two programs produce matching results on those data. If not, there's a deeper problem. If so, tie handling is likely the issue.
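A small sketch of that suggestion on the R side (it assumes dat with Var01 to Var37 from the question, 1000 rows; the tiny uniform jitter exists only to break ties):
set.seed(42)

# Add tiny noise to the Likert values so exactly tied distances become unlikely
datJ <- as.matrix(subset(dat, select = Var01:Var37))
datJ <- datJ + runif(length(datJ), -1e-4, 1e-4)

hcJ <- hclust(dist(datJ, method = "euclidean"), method = "ward.D2")
table(cutree(hcJ, k = 2))

# Export datJ, run the same CLUSTER syntax on it in SPSS, and compare the
# cluster sizes; if they now match, tie handling was the culprit.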

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a Google search and came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R?); some apply to images. However, I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy solutions (ImageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r <- seq(0, 360) / 360 * (2 * pi)
x <- cos(r)
y <- sin(r)
z <- outer(x, y, "*")                                    # clean signal
noise <- 0.3 * matrix(runif(length(x) * length(y)), nrow = length(x))
zz <- z + noise                                          # noisy matrix
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience in signal processing!
Thanks.
One option may be to use a non-linear prediction method and take the fitted values from the model.
For example, by using a polynomial regression on a single column, we can predict the original data (the purple curve in the plot).
Following the same logic, you can do the same thing for all columns of the zz matrix:
# Fit a degree-2 polynomial to every column of zz and collect the fitted values
predictions <- matrix(, nrow = 361, ncol = 0)
for (i in 1:ncol(zz)) {
  pred <- as.matrix(fitted(lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE))))
  predictions <- cbind(predictions, pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns. Or you could use some other, more powerful non-linear prediction method (SVM, ANN, etc.) to get a more accurate model.
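If a plain low-pass filter is preferred over model fitting, a simple alternative (my own sketch, not from a package) is a 2D moving-average ("box") filter in base R; the window size k is the tuning parameter.
# Box filter: replace each cell by the mean of its k-by-k neighbourhood
box_filter <- function(m, k = 5) {
  half <- k %/% 2
  out <- m
  nr <- nrow(m); nc <- ncol(m)
  for (i in 1:nr) {
    for (j in 1:nc) {
      rows <- max(1, i - half):min(nr, i + half)
      cols <- max(1, j - half):min(nc, j + half)
      out[i, j] <- mean(m[rows, cols])   # average over the local window
    }
  }
  out
}

image(box_filter(zz, k = 9), main = "Box-filtered")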

Validating Fuzzy Clustering

I would like to use fuzzy c-means clustering on a large unsupervised data set of 41 variables and 415 observations. However, I am stuck trying to validate those clusters. When I plot with a random number of clusters, I can explain a total of 54% of the variance, which is not great, and there are no really nice clusters as there would be with the iris data set, for example.
First I ran the fcm with my scaled data on 3 clusters just to see, but if I am trying to find a way to search for the optimal number of clusters, then I do not want to set an arbitrarily defined number of clusters.
So I turned to Google and searched for "validate fuzzy clustering in R". This link here was good, but I still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clValid packages but could not find a walkthrough for the functions. I looked at the documentation of each package, but still could not work out what to do next.
I walked through some possible numbers of clusters and checked each one with the k.crisp object from fanny. I started with 100 and got down to 4. Based on the object's description in the documentation,
k.crisp: integer (≤ k) giving the number of crisp clusters; can be less than k, where it's recommended to decrease memb.exp.
this doesn't seem like a valid approach, because it compares the number of crisp clusters to our fuzzy clusters.
Is there a function where I can check the validity of my clusters from 2:10 clusters? Also, is it worthwhile to check the validity of 1 cluster? I think that is a stupid question, but I have a strange feeling that 1 optimal cluster might be what I get. (Any tips on what to do if I were to get 1 cluster, besides crying a little on the inside?)
Code
library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)

data(iris)
df <- sapply(iris[-5], scale)

# With cluster::fanny
res.fanny <- fanny(df, 3, metric = 'SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning "all memberships are very
# close to 1/l. Maybe increase memb.exp", which I don't fully understand.
# From my understanding, using SqEuclidean is equivalent to fuzzy c-means
# (use the website below). Ultimately I do want to use c-means, hence I use
# the SqEuclidean distance.
fviz_cluster(res.fanny, ellipse.type = 'norm', palette = 'jco',
             ggtheme = theme_minimal(), legend = 'right')
fviz_silhouette(res.fanny, palette = 'jco', ggtheme = theme_minimal())

# With ppclust
set.seed(123)
res.fcm <- fcm(df, centers = 3, nstart = 10)
As far as I know, you need to go through different numbers of clusters and see how the percentage of variance explained changes. This is called the elbow method.
wss <- sapply(2:10, function(k) {
  fcm(df, centers = k, nstart = 10)$sumsqrs$tot.within.ss
})
plot(2:10, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
The resulting plot is:
After k = 5, the total within-cluster sum of squares tends to change slowly, so k = 5 is a good candidate for the optimal number of clusters according to the elbow method.
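If, beyond the elbow plot, a fuzzy-specific validity measure is wanted, one simple option (a sketch; it assumes the membership matrix is returned as res.fcm$u, as ppclust's fcm() does) is Bezdek's partition coefficient, computed directly from the memberships:
# Partition coefficient: mean squared membership; closer to 1 = crisper partition
pc_index <- function(u) sum(u^2) / nrow(u)

set.seed(123)
pcs <- sapply(2:10, function(k) pc_index(fcm(df, centers = k, nstart = 10)$u))

plot(2:10, pcs, type = "b", pch = 19,
     xlab = "Number of clusters K", ylab = "Partition coefficient")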
