I'm using the R package mclust to estimate the number of clusters in my data and get this result:
Clustering table:
2 7 8 9
205693 4465 2418 91
Warning messages:
1: In map(z) : no assignment to 1,3,4,5,6
2: In map(z) : no assignment to 1,3,4,5,6
I get 9 clusters as the best model, but 5 of them receive no assignments.
So does this mean I should use 9 clusters, or just the 4 that actually received points?
If the answer can be found somewhere online, a link would be appreciated. Thanks in advance.
Most likely, the method just did not work at all on your data...
You may try other seeds: when you "lose" clusters (i.e., they end up empty), it usually means the starting values were poorly chosen. And with only 91 points, your cluster 9 is pretty much gone, too.
However, if your data is actually generated by a mixture of Gaussians, it's hard to find such a bad starting point... so most likely, all of your results are bad, because the data does not satisfy your assumptions.
Judging from your cluster sizes, I'd say you have 1 cluster and a lot of noise...
Have you visualized and validated the results?
Don't blindly follow some number. Validate.
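For example (a minimal sketch, assuming your data is in a numeric matrix or data frame called X - the name is hypothetical):

library(mclust)
fit <- Mclust(X)                    # X: your data (hypothetical name)
plot(fit, what = "BIC")             # how strongly is G = 9 actually preferred?
plot(fit, what = "classification")  # do the clusters look meaningful?
table(fit$classification)           # the clustering table you posted

If the BIC curve is nearly flat across G, or the classification plot shows one big blob, that supports the "one cluster plus noise" reading.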
Does anyone know how Excel stores decimals, and why values saved from R are slightly different when loaded back into R? It seems that Excel can store up to 15 decimal digits; what about R?
I have a dataset with many values that display 6 decimal places in R, and I'm using them for an analysis in lme4. I noticed that on the same dataset (saved in two different files) the models sometimes converge and sometimes don't. I was able to narrow the problem down to the way Excel changes the values, but I'm not sure what to do about it.
I have a dataframe like this:
head(Experiment1)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
logResponseTime was obtained by taking log10 of First.Key.
I then save this to a CSV file, load the data frame again, and get
head(Experiment2)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
exactly the same values, to 6 decimal places,
but then this happens:
Experiment1$logResponseTime - Experiment2$logResponseTime
1 2.220446e-15
2 -2.664535e-15
3 4.440892e-16
4 -4.440892e-16
5 -2.220446e-15
6 8.881784e-16
These differences are tiny, but they make a difference between convergence and non-convergence in my lmer models, where logResponseTime is the DV, which is why I'm concerned.
Is there a way to save R dataframes into excel to a format that won't make these changes (I use write.csv)? And more importantly, why do such tiny differences make a difference in lmer?
These tiny bits of rounding are hard to avoid; most of the time, it's not worth trying to fix them (in general errors of this magnitude are ubiquitous in any computer system that uses floating-point values).
It's hard to say exactly what the differences are between the analyses with the rounded and unrounded numbers, but you should be aware that the diagnosis of a convergence problem is based on particular numerical thresholds for the magnitude of the gradient at the maximum likelihood estimate and other related quantities. Suppose the threshold is 0.002, and that running your model with the unrounded values results in a gradient of 0.0019 while running it with the rounded values results in a gradient of 0.0021. Then your model will "converge" in one case and "fail to converge" in the other.

I can appreciate the potential inconvenience of getting slightly different values just by saving your data to a CSV (or XLSX) file and restoring them from there, but you should also be aware that even running the same models on a different operating system could produce equally large differences. My suggestions:
check to see how big the important differences are between the rounded/unrounded results ("important differences" are differences in estimates you care about for your analysis, of magnitudes that are large enough to change your conclusions)
if these are all small, you can increase the tolerance of the convergence checks slightly so they don't bother you, e.g. use control = lmerControl(check.conv.grad = .makeCC("warning", tol = 6e-3, relTol = NULL)) (the default tolerance is 2e-3, see ?lmerControl)
if these are large, that should concern you - it means your model fit is very unstable. You should probably also try running allFit() to see how big the differences are when you use different optimizers.
you might be able to use the methods described here to make your read/write flow a little more precise.
if possible, you could save your data to a .rds or .rda file rather than CSV, which will keep the full precision.
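For the last point, a minimal sketch of the round trip (the file name is arbitrary):

# RDS serializes R objects directly, so the doubles survive bit-for-bit
saveRDS(Experiment1, "Experiment1.rds")
Experiment2 <- readRDS("Experiment1.rds")
identical(Experiment1$logResponseTime, Experiment2$logResponseTime)  # TRUE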
I currently have a very large dataset with 2 attributes which contain only strings. The first attribute has search queries (single words) and the second attribute has their corresponding categories.
So the data is set up like this (a search query can have multiple categories):
Search Query | Category
X | Y
X | Z
A | B
C | G
C | H
Now I'm trying to use clustering algorithms to get an idea of the different groups my data is comprised of. I read somewhere that when clustering with just strings, it is recommended to first use the Expectation Maximization (EM) clustering algorithm to get a sense of how many clusters you need, and then use that number with k-means.
Unfortunately, I'm still very new to machine learning and Weka, so I'm constantly reading up on everything to teach myself. I might be making some very simple mistakes here so bear with me, please :)
So I imported a sample (100,000 rows out of 2.7 million) of my dataset into Weka and used the EM clustering algorithm, and it gives me the following results:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation: testrunawk1_weka_sample.txt
Instances: 100000
Attributes: 2
att1
att2
Test mode: split 66% train, remainder test
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross-validation: 2
Number of iterations performed: 14
[135,000-line table with strings, the 2 clusters, and their values]
Time taken to build model (percentage split): 28.42 seconds
Clustered Instances
0 34000 (100%)
Log-likelihood: -20.2942
So should I infer from this that I should be using 2 or 34000 clusters with k-means?
Unfortunately, both seem unusable to me. What I was hoping for is, for example, 20 clusters which I can then look at individually to figure out what kinds of groups can be found in my data. 2 clusters seems too low given the wide range of categories in my data, and 34,000 clusters would be far too many to inspect manually.
I am unsure whether I'm doing something wrong in the Weka EM settings (currently at their defaults) or whether my data is just a mess, and if so, how would I go about making this work?
I am still very much learning how this all works, so any advice is much appreciated! If there is a need for more examples of my settings or anything else just tell me and I'll get it for you. I could also send you this dataset if that is easier, but it's too large to paste in here. :)
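(A side note on the run information above: -N -1 means EM chooses the number of clusters by cross-validation. If you want, say, 20 clusters to inspect, you can fix that number instead. A hedged sketch using the RWeka package, an R bridge to Weka - "sample.arff" is a hypothetical file name for the 100,000-row sample:

library(RWeka)
EM <- make_Weka_clusterer("weka/clusterers/EM")
d <- read.arff("sample.arff")              # hypothetical file name
m <- EM(d, control = Weka_control(N = 20)) # N = -1 would cross-validate instead
m   # printing the model shows the per-cluster instance counts
)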
I have a requirement where I need to group my categorical variables (having more than 5 category values) into 5 groups based on their association with my continuous variable. To achieve this I am using rpart with the "anova" method.
So, for example, my categorical variable is type, with codes 1 through 15, and I want 5 groups of this variable. After growing the tree, in order to have only 5 groups I need to prune it. One way I tried is to use nsplit from the cptable, but an nsplit of 5 might give me 7-8 leaves, and similarly an nsplit of 4 might give me 5-6 leaves.
I was looking for an option by which when I prune I get only 5 leaves which would act as my 5 groups.
Can someone please suggest how I can achieve this using rpart?
Thank you!
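Not a full answer, but here is a minimal sketch of the cptable-based pruning described above, on made-up data (all object names are hypothetical):

library(rpart)
# made-up data: y continuous, type categorical with 15 levels
set.seed(1)
df <- data.frame(type = factor(sample(1:15, 500, replace = TRUE)))
df$y <- as.numeric(df$type) %% 5 + rnorm(500)
fit <- rpart(y ~ type, data = df, method = "anova",
             control = rpart.control(cp = 0))
# rpart splits are binary, so a subtree with nsplit == 4 has exactly 5 leaves;
# note the cp table may skip that size, in which case no cp gives exactly 5 groups
cpt <- fit$cptable
cp5 <- cpt[cpt[, "nsplit"] == 4, "CP"]
if (length(cp5)) {
  pruned <- prune(fit, cp = cp5[1])
  print(table(pruned$where))  # rows per leaf: these leaves are the 5 groups
}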
I am interested in deriving dominance metrics (as in a dominance hierarchy) for nodes in a dominance directed graph, aka a tournament graph. I can use R and the package igraph to easily construct such graphs, e.g.
library(igraph)
# create a data frame of edges
the.froms <- c(1,1,1,2,2,3)
the.tos <- c(2,3,4,3,4,4)
the.set <- data.frame(the.froms, the.tos)
set.graph <- graph.data.frame(the.set)
plot(set.graph)
This plotted graph shows that node 1 influences nodes 2, 3, and 4 (is dominant to them), that 2 is dominant to 3 and 4, and that 3 is dominant to 4.
However, I see no easy way to actually calculate a dominance hierarchy as on this page: https://www.math.ucdavis.edu/~daddel/linear_algebra_appl/Applications/GraphTheory/GraphTheory_9_17/node11.html . So my first and main question is: does anyone know how to derive a dominance hierarchy / node-based dominance metric for a graph like this, using some (hopefully already coded) solution in R?
Moreover, in my real case, I actually have a sparse matrix that is missing some interactions, e.g.
incomplete.set <- the.set[-2, ]
incomplete.graph <- graph.data.frame(incomplete.set)
plot(incomplete.graph)
In this plotted graph there is no connection between species 1 and 3; however, under some assumptions about transitivity, the dominance hierarchy is the same as above.
This is a much more complicated problem, but if anyone has any input about how I might go about deriving node-based metrics of dominance for sparse matrices like this, please let me know. I am hoping for an already coded solution in R, but I'm certainly MORE than willing to code it myself.
Thanks in advance!
Not sure if this is perfect or whether I fully understand it, but from some trial and error it seems to work as it should:
library(relations)
result <- relation_consensus(endorelation(graph=the.set),method="Borda")
relation_class_ids(result)
#1 2 3 4
#1 2 3 4
There are lots of potential options for method= for dealing with ties etc. - see ?relation_consensus for more information. Using method="SD/L", which is a linear order, might be the most appropriate for your data, though in more complex examples it can suggest multiple possible solutions because of conflicts. For the current simple data this is not the case, though - try:
result <- relation_consensus(endorelation(graph=the.set),method="SD/L",
control=list(n="all"))
result
#An ensemble of 1 relation of size 4 x 4.
lapply(result,relation_class_ids)
#[[1]]
#1 2 3 4
#1 2 3 4
Methods of dealing with this are again provided in the examples in ?relation_consensus.
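For what it's worth, the same call also runs on your sparse example (whether the consensus respects the transitivity you'd assume is worth checking by eye):

# the same approach on the incomplete set; Borda scoring can still rank the
# nodes even though the 1-3 comparison is missing
result.inc <- relation_consensus(endorelation(graph = incomplete.set),
                                 method = "Borda")
relation_class_ids(result.inc)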
I have been trying to figure out how to access what kknn does internally to produce its results. I am using the kknn function and package to help predict future baseball stats. It takes in 11 predictor variables (the previous 3 years' stats, PA, and level, along with age and one other predictor). The predictions work great, but when predicting a single player (doing this for hundreds of players would be impractical), I would like to see, say, the 3 closest neighbors to the player in question, along with their previous stats and what they produced the next year. I am most interested in the names of the nearest neighbors, since knowing which players are closest gives context to the prediction.
I am fine with trying to edit the actual code to the function if that is the only way to get at these. Even finding the indices would be helpful as I can backsolve from there to get the names. Thank you so much for all of your help!
Here is some sample code that should help:
name=c("McGwire,Mark","Bonds,Barry","Helton,Todd","Walker,Larry","Pujols,Albert","Pedroia,Dustin")
z
lag1=c(100,90,75,89,95,70)
lag2=c(120,80,95,79,92,90)
Runs=c(65,120,105,99,65,100)
full=cbind(name,lag1,lag2,Runs)
full=data.frame(full)
learn=full
learn
learn$lag1=as.numeric(as.character(learn$lag1))
learn$lag2=as.numeric(as.character(learn$lag2))
learn$Runs=as.numeric(as.character(learn$Runs))
valid=learn[5,]
learn=learn[-5,]
valid
k=kknn(Runs~lag1+lag2,learn,valid,k=2,distance=1)
summary(k)
fit=fitted(k)
fit
Here is the function that I am actually calling if that helps you tailor your answers for workarounds!
kknn(RVPA~(lag1*lag1LVL*lag1PA)+(lag2*lag2LVL*lag2PA)+(lag3*lag3LVL*lag3PA)+Age1+PAsize, RV.learn, RV.valid,k=86, distance = 1,kernel = "optimal")
Here's a slightly modified version of your example:
full= data.frame(
name=c("McGwire,Mark","Bonds,Barry","Helton,Todd","Walker,Larry","Pujols,Albert","Pedroia,Dustin"),
lag1=c(100,90,75,89,95,70),
lag2=c(120,80,95,79,92,90),
Runs=c(65,120,105,99,65,100)
)
library(kknn)
train=full[full$name!="Bonds,Barry",]
test=full[full$name=="Bonds,Barry",]
k=kknn(Runs~lag1+lag2,train=train, test=test,k=2,distance=1)
This predicts Bonds to have 80.2 runs. The Runs variable acts like a class label and if you call k$CL you'll get back 65 and 99 (the number of runs corresponding to the two nearest neighbors). There are two players (McGwire, Pujols) with 65 runs and one with 99, so you can't tell directly who the neighbors are. It appears that the output for kknn does not include a list of the nearest neighbors to the test set (though you could probably back it out from the various outputs).
The FNN package, however, will let you do a query against your training data in the way you want:
library(FNN)
get.knnx(data=train[,c("lag1","lag2")], query=test[,c("lag1","lag2")],k=2)
$nn.index
[,1] [,2]
[1,] 3 4
$nn.dist
[,1] [,2]
[1,] 1.414214 13
train[c(3,4),"name"]
[1] Walker,Larry Pujols,Albert
So the nearest neighbors to Bonds are Walker and Pujols.
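To wrap that lookup up for any player (a small sketch; nearest_players is a hypothetical helper, and the column names are from the toy data above):

# return the k nearest training players, nearest first
nearest_players <- function(train, test, cols, k = 2) {
  nn <- get.knnx(data = train[, cols], query = test[, cols], k = k)
  data.frame(name = as.character(train$name)[nn$nn.index[1, ]],
             dist = nn$nn.dist[1, ])
}
nearest_players(train, test, c("lag1", "lag2"), k = 2)
#            name      dist
# 1  Walker,Larry  1.414214
# 2 Pujols,Albert 13.000000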