I am trying to run a cluster analysis on a very large data set. It contains only strings for values. I have removed the NAs and replaced them with a dummy value. My k-means in R keeps failing due to NA coercion. How would the community run a cluster analysis on this data? I am showing 10 rows of a dummy example below. In this situation, let's call the data frame cluster_data.
Any help would be greatly appreciated. I am trying to see if any of the columns cause the data to split earlier than others, to try and understand a possible structure. I thought clustering with k-means was the best approach, but I do not see how to do it with strings. I have converted to factors in R and still have issues. Any example code is greatly appreciated.
Question: how do you run k-means clustering with strings?
Answer: You can't run k-means cluster analysis on categorical data. You need data that a distance function can make sense of.
K-means is designed for continuous variables, where least squares and the mean make sense to use as centers.
For other data types, it is better to use other algorithms, such as PAM, HAC, DBSCAN, OPTICS, ...
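As a minimal sketch of the PAM route, assuming cluster_data is the all-factor data frame from the question (k = 3 is purely illustrative):

library(cluster)

# Gower dissimilarity handles factor columns directly
d <- daisy(cluster_data, metric = "gower")

# PAM clusters on the dissimilarity matrix instead of computing means
pam_fit <- pam(d, k = 3)
pam_fit$clustering   # cluster label per row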
I have a question related to keras model code in R. I have finished training the model and need to predict. Predicting one row is very fast, but my data has 2,000,000,000 rows and nearly 200 columns, with a structure like the attached image.
[image: Datastructure]
I don't know if anyone has any suggestions on which method to use so that predict can run quickly and use less memory. To predict, I created matrices according to the table as shown; each matrix is 200,000 x 200. Then I use sapply to predict over all the remaining matrices. However, even though predict is fast for each matrix, creating the matrix is slow, so it makes the model run two or three times as long, and that is not taking the sapply step into account. I wonder if keras has a "smart" way to know that in each matrix the last N columns are exactly the same? I googled and saw someone mention RepeatVector, but I don't quite understand it, and it seems that it is only used for training? I already have the model and just need to predict.
Thank you so much everyone!
One of the most performant ways to feed keras models locally is by creating a tf.data.Dataset object. Please take a look at the tfdatasets R package for guides and example usage.
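A minimal sketch of that approach, assuming model is your trained keras model and X is a numeric predictor matrix (the names and batch size are illustrative):

library(keras)
library(tfdatasets)

# stream the rows in large batches instead of materialising one huge matrix
ds <- tensor_slices_dataset(X) %>%
  dataset_batch(8192)

preds <- predict(model, ds)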
I am trying to run a k-means cluster analysis in R using only a subset of data from my data source. I have created the subset (as a data frame) because I am only interested in using these variables for segmentation; the rest of the variables will be used to describe the segments.
After the k-means clustering is done, I was wondering how I could connect the clustering results back to my original dataset, which includes the descriptive variables as well.
Please let me know if I could provide any clarifications on my questions. Many thanks in advance.
Cheers,
AC
You get a cluster label for each point.
These labels refer to the same rows, in the same order, as your original data.
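In code, a minimal sketch (the object names are illustrative): full_data is the original data frame and seg_vars the subset used for segmentation.

set.seed(42)
km <- kmeans(seg_vars, centers = 4)

# kmeans() preserves row order, so the labels line up with the original rows
full_data$cluster <- km$cluster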
I've a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal (Likert 0-5) (I think it would be okay to handle them like continuous) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach with regard to the "large" dataset? We are sticking to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use the full 19-dimensional Euclidean distance. Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in 19 dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique and hopefully make the k-means output easier to read, but you know far more about the data set than we ever could, so our ability to help you is limited.
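As a minimal sketch, assuming mixed_data holds the numeric, Likert, and binary columns (the name and k are illustrative):

# coerce everything to numeric (factors become integer codes), then z-score
X <- scale(data.matrix(mixed_data))

set.seed(1)
km <- kmeans(X, centers = 5, nstart = 25)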
You can certainly encode the binary variables as 0/1, too.
It is best practice in statistics not to treat Likert-scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense on continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and such data should be handled very differently.
Don't choose the problem to fit the hammer. Maybe your data is not a nail; and even if you'd like to do it with k-means, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So, given your data, what is a good cluster? Until you have an equation that measures this, hammering the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem. Rather, it will only increase the data dimensionality, an added burden. It is best practice in statistics not to alter the original data from one form to another, like continuous to categorical or vice versa. However, if you do convert the data, the conversion must be in sync with the question you are solving, and you must provide valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like missing values, outliers, and zero variance, and use principal component analysis (for continuous variables), correspondence analysis (for categorical variables), etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of an analysis.
Regarding the distance measure for mixed data types: you do understand that the mean in k-means works only for continuous variables. So I do not understand the logic of using the k-means algorithm for mixed data types.
Consider choosing another algorithm, like k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
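A minimal sketch using the kmodes() implementation in the klaR package (the data name and number of modes are illustrative):

library(klaR)

set.seed(1)
fit <- kmodes(categorical_data, modes = 4)
fit$cluster   # cluster assignment per row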
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables by Gaussian distributions and the ordinal/binary variables by multinomial distributions.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/
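A minimal sketch (the data name is illustrative; see the tutorial above for full usage):

library(VarSelLCM)

# fit mixture models with 2 to 5 clusters, with built-in variable selection
res <- VarSelCluster(mixed_data, gvals = 2:5)

fitted(res)   # cluster memberships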
I have this script which does a simple PCA analysis on a number of variables, and at the end attaches the two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it's giving me this error:

covariance matrix is not non-negative definite

I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; specifically, psych seems to work well here, and you can use principal(biovars). While the output is a bit different, it does work using pairwise deletion. So basically it comes down to whether you want to use pairwise or listwise deletion. Thanks!
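A minimal sketch of that psych route (nfactors = 3 mirrors the three components kept in the original script and is otherwise illustrative):

library(psych)

# correlations are computed with pairwise deletion, so NAs are tolerated
pc <- principal(biovars, nfactors = 3, rotate = "none")
pc$scores   # component scores, analogous to bpc$scores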
I am using R software (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying k-means clustering and plotting it on a graph.
"'princomp' can only be used with more units than variables"
I then created a test doc of 10 rows and 10 columns which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my cluster. When I view my data set after performing k-means on it, I can see the extra results column which shows which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, been wrecking my head for a week now.
Thanks guys.
The problem is that you have more variables than sample points, and the principal component analysis that is being done is failing.
In the help file for princomp it explains (read ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature
extraction of variables. If a data matrix is supplied (possibly
via a formula) it is required that there are at least as many
units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than dimensions.
Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful.
You can use prcomp instead of princomp.
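A minimal sketch of that fix (the object names and k are illustrative): prcomp works even when there are more variables than rows, so you can project the clustered data onto the first two components for plotting.

pc <- prcomp(cluster_subset, scale. = TRUE)

# plot the first two PCs, coloured by the k-means assignment
km <- kmeans(cluster_subset, centers = 3)
plot(pc$x[, 1:2], col = km$cluster)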