Weka Expectation-Maximization (EM) clustering result explanation - bigdata

I currently have a very large dataset with 2 attributes which contain only strings. The first attribute has search queries (single words) and the second attribute has their corresponding categories.
So the data is set up like this (a search query can have multiple categories):
Search Query | Category
X | Y
X | Z
A | B
C | G
C | H
Now I'm trying to use clustering algorithms to get an idea of the different groups my data is comprised of. I read somewhere that when clustering data that consists only of strings, it is recommended to first use the Expectation-Maximization (EM) clustering algorithm to get a sense of how many clusters I need, and then use that number with k-means.
Unfortunately, I'm still very new to machine learning and Weka, so I'm constantly reading up on everything to teach myself. I might be making some very simple mistakes here so bear with me, please :)
So I imported a sample (100,000 lines out of 2.7 million) of my dataset into Weka and ran the EM clustering algorithm, which gives me the following results:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation: testrunawk1_weka_sample.txt
Instances: 100000
Attributes: 2
att1
att2
Test mode: split 66% train, remainder test
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross-validation: 2
Number of iterations performed: 14
[135.000 lines long table with strings, 2 clusters and their values]
Time taken to build model (percentage split): 28.42 seconds
Clustered Instances
0 34000 (100%)
Log-likelihood: -20.2942
So should I infer from this that I should be using 2 or 34000 clusters with k-means?
Unfortunately, both seem unusable to me. What I was hoping for is, for example, 20 clusters which I could then look at individually to figure out what kinds of groups can be found in my data. 2 clusters seems too low given the wide range of categories in my data, and 34000 clusters would be way too many to inspect manually.
I am unsure whether I'm doing something wrong in the Weka EM algorithm settings (currently left at their defaults) or whether my data is just a mess, and if so, how would I go about making this work?
I am still very much learning how this all works, so any advice is much appreciated! If there is a need for more examples of my settings or anything else just tell me and I'll get it for you. I could also send you this dataset if that is easier, but it's too large to paste in here. :)
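For what it's worth, here is a rough sketch in R of the "EM to pick k, then k-means" workflow I read about, on made-up numeric data (mclust's Mclust and base R's kmeans stand in for Weka's EM and SimpleKMeans; my real string data would first need to be encoded numerically, e.g. as a query-by-category indicator matrix, before anything like this applies):
library(mclust)

# made-up numeric data with three blobs, purely for illustration
set.seed(42)
dat <- rbind(cbind(rnorm(100, 0), rnorm(100, 0)),
             cbind(rnorm(100, 5), rnorm(100, 5)),
             cbind(rnorm(100, 0), rnorm(100, 8)))

em_fit <- Mclust(dat)                            # EM over Gaussian mixtures; picks the number of components by BIC
k      <- em_fit$G                               # number of clusters suggested by EM
km     <- kmeans(dat, centers = k, nstart = 25)
table(km$cluster)                                # sizes of the resulting k-means clusters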

Related

Rounding in excel vs R changes results of mixed models

Does anyone know what the difference is in how Excel stores decimals, and why the values saved from R are slightly different when loaded back into R? It seems that Excel can store up to 15 decimal digits; what about R?
I have a dataset with a lot of values which R displays with 6 decimal places, and I'm using them for an analysis in lme4. But I noticed that on the same dataset (saved in two different files) the models sometimes converge and sometimes don't. I was able to narrow the problem down to the way Excel changes the values, but I'm not sure what to do about it.
I have a dataframe like this:
head(Experiment1)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
logResponseTime was obtained by taking log10 of First.Key
I then save this to a CSV file, load the data frame again, and get
head(Experiment2)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
exactly the same values, to 6 decimal places
but then this happens
Experiment1$logResponseTime - Experiment2$logResponseTime
1 2.220446e-15
2 -2.664535e-15
3 4.440892e-16
4 -4.440892e-16
5 -2.220446e-15
6 8.881784e-16
These differences are tiny, but they make the difference between convergence and non-convergence in my lmer models, where logResponseTime is the DV, which is why I'm concerned.
Is there a way to save R data frames for Excel in a format that won't introduce these changes (I use write.csv)? And more importantly, why do such tiny differences make a difference in lmer?
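In case it helps reproduce the issue, here is a minimal round trip through a temporary CSV file (just a sketch; the values are the log10 of my First.Key column above):
x   <- log10(c(2345, 927, 343, 403, 692, 459))
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(logResponseTime = x), tmp, row.names = FALSE)
y   <- read.csv(tmp)$logResponseTime
x - y   # tiny differences on the order of 1e-16 to 1e-15 (some may be exactly 0)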
These tiny bits of rounding are hard to avoid; most of the time, it's not worth trying to fix them (in general errors of this magnitude are ubiquitous in any computer system that uses floating-point values).
It's hard to say exactly what the differences are between the analyses with the rounded and unrounded numbers, but you should be aware that the diagnosis of a convergence problem is based on particular numerical thresholds for the magnitude of the gradient at the maximum likelihood estimate and other related quantities. Suppose the threshold is 0.002 and that running your model with unrounded values results in a gradient of 0.0019, while running it with the rounded values results in a gradient of 0.0021. Then your model will "converge" in one case and "fail to converge" in the other case. I can appreciate the potential inconvenience of getting slightly different values just by saving your data to a CSV (or XLSX) file and restoring them from there, but you should also be aware that even running the same models on a different operating system could produce equally large differences. My suggestions:
check to see how big the important differences are between the rounded/unrounded results ("important differences" are differences in estimates you care about for your analysis, of magnitudes that are large enough to change your conclusions)
if these are all small, you can increase the tolerance of the convergence checks slightly so they don't bother you, e.g. use control = lmerControl(check.conv.grad = .makeCC("warning", tol = 6e-3, relTol = NULL)) (the default tolerance is 2e-3, see ?lmerControl)
if these are large, that should concern you - it means your model fit is very unstable. You should probably also try running allFit() to see how big the differences are when you use different optimizers.
you might be able to use the methods described here to make your read/write flow a little more precise.
if possible, you could save your data to a .rds or .rda file rather than CSV, which will keep the full precision (see the sketch below).
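Here is a minimal sketch of two of the suggestions above (the .rds route and the tolerance tweak); the lmer formula and the condition/subject columns are hypothetical placeholders, so adapt them to your own model:
library(lme4)

# keep full precision by avoiding text formats altogether
saveRDS(Experiment1, "experiment1.rds")             # lossless, unlike write.csv
Experiment1_reloaded <- readRDS("experiment1.rds")
identical(Experiment1$logResponseTime,
          Experiment1_reloaded$logResponseTime)     # TRUE: bit-for-bit identical

# loosen the gradient convergence check a little (the default tol is 2e-3)
fit <- lmer(logResponseTime ~ condition + (1 | subject),   # hypothetical formula
            data = Experiment1,
            control = lmerControl(check.conv.grad =
                                    .makeCC("warning", tol = 6e-3, relTol = NULL)))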

Retrieve best number of clusters from NbClust

Many functions in R print some sort of console output (NbClust(), for example). Is there any way of retrieving part of that output (e.g. a certain integer value) programmatically, without having to read it off the console?
Imagine the output would look like the following output from example code provided in the package manual:
[1] "Frey index : No clustering structure in this data set"
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 1 proposed 2 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 1 proposed 7 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
How would I retrieve the value 4 from the last line of the above output?
It is better to work with returned objects than with console output. Any "good" function will return structured output that can be accessed using the $ or @ operators; use str() to see an object's structure.
In your case, I think this should work:
length(unique(res$Best.partition))
Another option is:
max(unlist(res[4]))
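A self-contained sketch of working with the returned object rather than the printed text (the data below is simulated purely for illustration):
library(NbClust)

set.seed(1)
dat <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
             matrix(rnorm(200, mean = 5), ncol = 2))

res <- NbClust(dat, distance = "euclidean", min.nc = 2, max.nc = 10,
               method = "kmeans", index = "all")

length(unique(res$Best.partition))   # number of clusters in the winning partition
res$Best.nc                          # per-index proposals behind the "majority rule" line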

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and each score is a float in [0,1]. The file is approximately 50 GB. The pairs (A,B) whose score is 1 have been removed, as more than half of the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with a data set of your size. Plus, implementations usually need more than one copy of the distance matrix in memory, so you may need around 1 TB of RAM: 2 * 8 * 250,000 * 250,000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
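To put a number on the memory estimate above (a back-of-the-envelope check, nothing more):
# rough size of a dense 250,000 x 250,000 matrix of 8-byte doubles
n <- 250000
one_copy <- 8 * n^2      # 5e+11 bytes, i.e. about 500 GB
2 * one_copy / 1024^4    # two copies in TiB: roughly 0.9, i.e. about 1 TB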

K Nearest Neighbor Questions

Hi, I am having trouble understanding the workings of the k-nearest-neighbor algorithm, specifically when trying to implement it in code. I am implementing this in R, but I mainly want to understand the process; I'm not so much worried about the code. I will post what I have, my data, and my questions:
Training Data (just a portion of it):
Feature1 | Feature2 | Class
2 | 2 | A
1 | 4 | A
3 | 10 | B
12 | 100 | B
5 | 5 | A
So far in my code:
kNN <- function(trainingData, sampleToBeClassified) {
  # file input
  train <- read.table(trainingData, sep = ",", header = TRUE)
  # get the classes (just the class column)
  labels <- as.matrix(train[, ncol(train)])
  # get the features as a matrix (every column but the class column)
  features <- as.matrix(train[, 1:(ncol(train) - 1)])
}
And for this I am calculating the "distance" using this formula:
distance <- function(x1, x2) {
  return(sqrt(sum((x1 - x2)^2)))
}
So is the process for the rest of the algorithm as follows?
1. Loop through every training instance (in this case every row of the 2 feature columns), calculate its distance to the sampleToBeClassified, and compare?
2. In the starting case where I want 1-nearest-neighbor classification, would I just store the training instance that has the smallest distance to my sampleToBeClassified?
3. Whatever the closest instance is, find out its class, and that class becomes the class of the sampleToBeClassified?
My main question is what role do the features play in this? My instinct is that the two features together are what defines that data item as a certain class, so what should I be calculating the distance between?
Am I on the right track at all?
Thanks
It looks as though you're on the right track. The three steps in your process are correct for the 1-nearest-neighbor case. For kNN, you just need to make a list of the k nearest neighbors and then determine which class is most prevalent in that list.
As for features, these are just attributes that define each instance and (hopefully) give us an indication as to what class it belongs to. For instance, if we're trying to classify animals we could use height and mass as features. So if we have an instance in the class elephant, its height might be 3.27 m and its mass might be 5142 kg. An instance in the class dog might have a height of 0.59 m and a mass of 10.4 kg. In classification, if we get something that's 0.8 m tall and has a mass of 18.5 kg, we know it's more likely to be a dog than an elephant.
Since we're only using 2 features here we can easily plot them on a graph with one feature as the X-axis and the other feature as the Y (it doesn't really matter which one) with the different classes denoted by different colors or symbols or something. If you plot the sample of your training data above, it's easy to see the separation between Class A and B.
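If it helps, here is a minimal base-R sketch of those steps; the function name, the small training data frame, and the query point are all made up for illustration:
# distance from the query to every training row, then majority vote among the k nearest
knn_classify <- function(features, labels, sample, k = 1) {
  dists   <- apply(features, 1, function(row) sqrt(sum((row - sample)^2)))
  nearest <- labels[order(dists)[1:k]]   # classes of the k closest training instances
  names(which.max(table(nearest)))       # most frequent class among them
}

train <- data.frame(Feature1 = c(2, 1, 3, 12, 5),
                    Feature2 = c(2, 4, 10, 100, 5),
                    Class    = c("A", "A", "B", "B", "A"))

features <- as.matrix(train[, c("Feature1", "Feature2")])
labels   <- train$Class

knn_classify(features, labels, sample = c(4, 6), k = 3)   # "A": two of the three nearest rows are class A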

Interpreting the results of R Mclust package

I'm using the R package mclust to estimate the number of clusters in my data and get this result:
Clustering table:
2 7 8 9
205693 4465 2418 91
Warning messages:
1: In map(z) : no assignment to 1,3,4,5,6
2: In map(z) : no assignment to 1,3,4,5,6
Nine clusters was selected as the best, but there are no assignments to 5 of those clusters.
So does this mean I want to use 9 or 5 clusters?
If the answer can be found somewhere online, a link would be appreciated. Thanks in advance.
Most likely, the method just did not work at all on your data...
You may try other seeds, because when you "lose" clusters (i.e. they become empty) this usually means your seeds were not chosen well enough. And your cluster 9 is also pretty much gone, too.
However, if your data is actually generated by a mixture of Gaussians, it's hard to find such a bad starting point... so most likely, all of your results are bad, because the data does not satisfy your assumptions.
Judging from your cluster sizes, I'd say you have 1 cluster and a lot of noise...
Have you visualized and validated the results?
Don't blindly follow some number. Validate.
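One way to do that validation in R, assuming your data is in a numeric matrix dat (a sketch, not a prescription):
library(mclust)

fit <- Mclust(dat, G = 1:9)          # refit, considering 1 to 9 components

summary(fit)                         # chosen model, component sizes, mixing proportions
plot(fit, what = "BIC")              # how strongly 9 components beats smaller models
plot(fit, what = "classification")   # visual check of the actual assignments
table(fit$classification)            # empty or tiny clusters show up immediately here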
