How can the KNN algorithm be applied to categorical variables? (R)

I am working on a data set where most of the variables are categorical; some have as many as 5 categories. Is it possible to implement the KNN algorithm in a situation like this? If so, how should I proceed with these categorical variables? Do I have to normalize them? I am using R, and it would help if someone could direct me to a source.

Your first step would be to decide on a distance/dissimilarity function between your observations.
One option is to transform your categorical variables into dummy binary variables and then calculate the Jaccard distance between each pair of rows. Here is a simple tutorial for these steps.
Once you have a distance defined, you can proceed with the KNN algorithm as usual. I'm not sure whether any R packages already implement this or whether you would have to program it yourself, but it shouldn't be that complicated; a minimal sketch follows.
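For illustration, a sketch of that pipeline in base R (the toy data frame and the knn_predict helper are hypothetical; dist(..., method = "binary") computes the Jaccard distance on 0/1 data):

    # Hypothetical toy data: two categorical predictors and a class label.
    df <- data.frame(
      color = c("red", "blue", "red", "green", "blue"),
      size  = c("S", "M", "M", "L", "S"),
      y     = factor(c("a", "b", "a", "b", "b"))
    )

    # One-hot encode each categorical column into 0/1 dummy variables.
    X <- do.call(cbind, lapply(df[c("color", "size")],
                               function(f) model.matrix(~ factor(f) - 1)))

    # Pairwise Jaccard distances between rows.
    d <- as.matrix(dist(X, method = "binary"))

    # Classify row i by majority vote among its k nearest other rows.
    knn_predict <- function(d, y, i, k = 3) {
      nn <- setdiff(order(d[i, ]), i)[1:k]
      names(sort(table(y[nn]), decreasing = TRUE))[1]
    }
    knn_predict(d, df$y, i = 1, k = 3)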

Related

Applying a population total variable in R?

I have a weighting variable that I'd like to apply to my dataset so that I have weighted totals. In SPSS, this is straightforward enough. However, in R, I've been multiplying the variable by the weight variable to create a new variable as shown in the following example:
https://stats.stackexchange.com/questions/210697/weighting-variable-based-on-another-variable
Is there a more sophisticated way of applying weights in R?
Thanks.
If you need to work with a weighted dataset and define a complex survey sample, you can use the survey package: https://cran.r-project.org/web/packages/survey/survey.pdf.
Once you have defined the weights to be applied, you can then use all sorts of summary statistics.
However, I would reserve this for complex weighted analyses.
Otherwise, there are several other packages dealing with weights, such as questionr.
It all depends on whether you just need a simple weighted sum or will go on to other types of analysis that require more sophisticated methods.
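For example, a minimal sketch with the survey package (the data frame, the weight column wt, and the variable income are made up for illustration):

    library(survey)

    # Hypothetical data with a sampling-weight column.
    dat <- data.frame(income = c(10, 20, 30, 40),
                      wt     = c(2, 1, 3, 1))

    # Declare the design: no clustering (ids = ~1), weights taken from wt.
    des <- svydesign(ids = ~1, weights = ~wt, data = dat)

    svytotal(~income, des)  # weighted total of income
    svymean(~income, des)   # weighted mean of income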

Procedure to identify the most significant predictor variables using R when the data has severe multicollinearity?

I have a database of around 36 predictor variables which I am using to predict a target variable. The target is a categorical variable with three classes, while the predictors include both numeric and categorical variables.
However, the data is subject to severe multicollinearity. I am trying to build a parsimonious logistic regression model, so I need to reduce the number of variables. Judging by the VIF values, the results become counterintuitive as soon as I reduce the number of variables. On the other hand, I am not sure that PCR can solve the problem, as I need inferences about the sensitivity to each variable.
What is the best option to deal with such a problem?
Which R packages can I use?
Will factor analysis solve the problem?
Or can we infer everything from PCR?
First, you should run ANOVA/Kruskal-Wallis tests to check which variables are well suited to your problem. With 36 variables I don't think you will need PCA, as that would make your model lose some explainability.
Remember that PCA reduces dimensionality but explains only part of the variance in the data. Factor analysis will group variables into factors, in case you want to run a segmented logistic regression for each factor of grouped variables.
If you want to build a parsimonious logistic regression, you can apply some regularization to improve its generalization properties, instead of reducing the number of variables.
You can use the following R packages: caret (logistic regression), ROCR (AUC), ggplot2 (plots), DMwR (outliers), mice (missing values).
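For the regularization itself, one common option (not mentioned in the list above) is the glmnet package, which fits penalized multinomial logistic regression directly; a minimal sketch on made-up data:

    library(glmnet)
    set.seed(1)

    # Stand-in data: 36 predictors and a 3-class target.
    x <- matrix(rnorm(300 * 36), nrow = 300, ncol = 36)
    y <- factor(sample(c("A", "B", "C"), 300, replace = TRUE))

    # alpha = 0 is ridge, which copes well with multicollinearity;
    # alpha = 1 (lasso) instead drops variables for parsimony.
    cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 0)
    coef(cvfit, s = "lambda.min")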
Also, if you want to implement the regularization yourself, you can develop it from scratch, without a library, to adjust the steepness of the sigmoid so that you can classify your classes correctly. The cost function to minimize is the following:
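(The formula image from the original answer did not survive; what it presumably showed is the standard L2-penalized logistic cost, reconstructed here:)

    J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y_i \log h_\theta(x_i) + (1 - y_i)\log\left(1 - h_\theta(x_i)\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2,
    \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}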

K-Means Distance Measure - Large Data and Mixed Scales

I've a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal variables (Likert, 0-5; I think it would be okay to handle them like continuous ones) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach with regard to the "large" dataset? We are sticking to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use Euclidean distance on all of them. Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in that many dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique to make the k-means output easier to read, but you know far more about the data set than we ever could, so our ability to help you is limited.
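A minimal sketch of that approach on synthetic data (column names and the choice of 4 clusters are arbitrary):

    set.seed(1)
    n <- 200
    dat <- data.frame(
      x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),  # continuous
      l1 = sample(0:5, n, replace = TRUE),          # Likert, treated as numeric
      l2 = sample(0:5, n, replace = TRUE),
      b1 = sample(0:1, n, replace = TRUE),          # binary cast to 0/1
      b2 = sample(0:1, n, replace = TRUE)
    )

    X  <- scale(dat)                          # z-score every column to a common scale
    km <- kmeans(X, centers = 4, nstart = 25) # nstart > 1 avoids poor local optima
    table(km$cluster)                         # cluster sizes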
You can certainly encode the binary variables as 0/1 too.
It is best practice in statistics not to treat Likert-scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense on continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and such data should be handled very differently.
Do not choose the problem to fit the hammer. Maybe your data is not a nail; and even if you'd like to do it with k-means, it won't solve your problem. Instead, formulate your problem, then choose the right tool. So, given your data, what is a good cluster? Until you have an equation that measures this, handling the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem. Rather, it will only increase the dimensionality of the data, an added burden. It is best practice in statistics not to alter the original data from one form to another, such as continuous to categorical or vice versa. If you do perform such a conversion, it must be in sync with the question you are trying to solve, and you must provide a valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like missing values, outliers and zero variance, and consider principal component analysis (for continuous variables), correspondence analysis (for categorical variables), etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of an analysis.
Regarding the distance measure for mixed data types: you do understand that the mean in k-means works only for continuous variables, so I do not see the logic of using k-means on mixed data types.
Consider choosing another algorithm, like k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
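The answer names the algorithm, not a package; one common implementation is kmodes() in the klaR package. A minimal sketch on made-up categorical data:

    library(klaR)
    set.seed(1)

    # Made-up purely categorical data.
    cats <- data.frame(
      a = sample(c("x", "y", "z"), 100, replace = TRUE),
      b = sample(c("u", "v"), 100, replace = TRUE),
      c = sample(c("p", "q", "r"), 100, replace = TRUE)
    )

    km <- kmodes(cats, modes = 3)  # 3 clusters, chosen arbitrarily here
    km$modes                       # one modal vector per cluster
    table(km$cluster)              # cluster sizes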
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables with Gaussian distributions and the ordinal/binary variables with multinomial distributions.
Moreover, missing values can be handled by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/

Which technique is best for finding the optimal split on numeric data to reduce classification error?

I have a dataset that contains a numeric variable and a binary categorical variable. I want to find the optimal split on the numeric variable that can be used to quickly classify the categories while limiting the error.
I have used a decision tree to do this, but am wondering whether there are better optimization methods out there.
I would like to be able to do this in R, but am having trouble writing the function for it.
Please help me understand this simple optimisation problem. Thanks!
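For what it's worth, a minimal sketch of one way to write that function yourself (on synthetic data; a single-split decision tree, i.e. a decision stump, solves the same problem): exhaustively try each candidate threshold and keep the one with the lowest misclassification rate.

    # Find the threshold on x that minimizes misclassification of the binary y.
    optimal_split <- function(x, y) {
      stopifnot(is.factor(y), nlevels(y) == 2)
      cuts <- sort(unique(x))
      err <- sapply(cuts, function(cut) {
        pred <- ifelse(x <= cut, levels(y)[1], levels(y)[2])
        e <- mean(pred != y)
        min(e, 1 - e)  # either side of the split may be either class
      })
      cuts[which.min(err)]
    }

    set.seed(1)
    x <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
    y <- factor(rep(c("a", "b"), each = 50))
    optimal_split(x, y)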

Dimension Reduction for Clustering in R (PCA and other methods)

Let me preface this:
I have looked extensively into this matter and have found several intriguing possibilities to look into (such as this and this). I have also looked into principal component analysis, and I've seen some sources claim that it's a poor method for dimension reduction. However, I feel it may be a good method, but I am unsure how to implement it. All the sources I've found on this matter give a good explanation, but rarely do they provide any advice on how to actually go about applying one of these methods (i.e., how one can actually apply a method in R).
So, my question is: is there a clear-cut way to go about dimension reduction in R? My dataset contains both numeric and categorical variables (with multiple levels) and is quite large (~40k observations, 18 variables (but 37 if I transform categorical variables into dummies)).
A few points:
If I want to use PCA, I would have to somehow convert my categorical variables into numeric ones. Would it be okay to simply use a dummy-variable approach for this?
For any sort of dimension reduction in unsupervised learning, how do I treat ordinal variables? Does the concept of an ordinal variable even make sense in unsupervised learning?
My real issue with PCA is that once I have performed it and have my principal components, I have no idea what to actually do with them. From my understanding, each principal component is a combination of the variables, and as such I'm not really sure how this helps us pick and choose the best variables.
I don't think this is an R question. This is more like a statistics question.
PCA doesn't work for categorical variables. It relies on decomposing the covariance matrix, which is not defined for categorical variables.
Ordinal variables make lots of sense in both supervised and unsupervised learning. What exactly are you looking for? You should only apply PCA to ordinal variables if they are not skewed and have many levels.
PCA only gives you a new representation of the data in terms of principal components and their eigenvalues. By itself, it has nothing to do with dimension reduction. I repeat: it has nothing to do with dimension reduction. You reduce your data set only if you select a subset of the principal components. PCA is useful for regression, data visualisation, exploratory analysis, etc.
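To illustrate that last point with base R's prcomp() (using the built-in mtcars data as a numeric stand-in):

    # PCA is just a rotation; the reduction happens when you keep only
    # the leading components.
    p <- prcomp(mtcars, scale. = TRUE)
    summary(p)              # proportion of variance explained per component
    X_reduced <- p$x[, 1:3] # keep the first three components only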
A common way is to apply optimal scaling to transform your categorical variables for PCA.
Read this:
http://www.sicotests.com/psyarticle.asp?id=159
You may also want to consider correspondence analysis for categorical variables and multiple factor analysis for both categorical and continuous.
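The answer names the methods, not a package; one common implementation is FactoMineR, whose MCA() handles categorical variables and FAMD() handles mixed data. A minimal sketch on made-up mixed data:

    library(FactoMineR)
    set.seed(1)

    # Made-up mixed data: two numeric and two categorical columns.
    dat <- data.frame(
      num1 = rnorm(100), num2 = rnorm(100),
      cat1 = factor(sample(letters[1:3], 100, replace = TRUE)),
      cat2 = factor(sample(c("yes", "no"), 100, replace = TRUE))
    )

    res <- FAMD(dat, ncp = 5, graph = FALSE)
    res$eig  # eigenvalues and percentage of variance per dimension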
