Clustering of set of customers having heterogeneous variables [closed] - r

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a set of customers with attributes of different types: continuous, categorical, binary and ordinal.
How can I cluster them, given that the same distance metric cannot be applied to all of these attribute types?
Thank you in advance

As already mentioned, the daisy() function in the cluster package is an option: it handles mixed data by computing Gower dissimilarities, treating each variable according to its type. But I would suggest the following approach, and I'd ask experts to chime in.
Rather than relying on automatic selection, first identify and remove correlated variables, for example with:
Pearson correlation: for pairs of continuous variables
Chi-square test: for pairs of categorical variables
One-way ANOVA: for a categorical variable against a numerical one
Taking the subset of useful variables, consider one-hot encoding the categorical variables and perhaps converting ordinal variables to continuous (or treating them as categorical and one-hot encoding them). Then try different distance metrics, such as Euclidean and Manhattan, and compare the results. This gives you a clearer view of the overall clustering process.
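To make the mixed-type distance idea concrete, here is a minimal sketch of a Gower-style dissimilarity in plain Python. It is an illustration only, not the daisy() implementation: the customer records, the `kinds` labels, and the choice to code the ordinal tier as a rank and scale it like a numeric variable are all assumptions made for the example.

```python
# Hypothetical customers: (age, region, is_member, tier rank 1..3)
customers = [
    (25, "north", True, 1),
    (45, "south", True, 3),
    (65, "north", False, 2),
]
# "num": range-scaled absolute difference; "cat": simple 0/1 mismatch.
# The ordinal tier is coded as a rank and treated as "num" (an assumption).
kinds = ["num", "cat", "cat", "num"]

def column_ranges(rows, kinds):
    """Range (max - min) of each numeric column, None for categorical ones."""
    ranges = []
    for j, kind in enumerate(kinds):
        if kind == "num":
            col = [row[j] for row in rows]
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)
    return ranges

def gower(a, b, kinds, ranges):
    """Average of per-variable dissimilarities, each lying in [0, 1]."""
    total = 0.0
    for x, y, kind, r in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / r if r else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

ranges = column_ranges(customers, kinds)
dist_01 = gower(customers[0], customers[1], kinds, ranges)
dist_02 = gower(customers[0], customers[2], kinds, ranges)
dist_12 = gower(customers[1], customers[2], kinds, ranges)
```

The resulting pairwise dissimilarities could then be passed to any clustering method that accepts a precomputed distance matrix, such as hierarchical clustering or k-medoids.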

Related

Obtaining the eta2 table provided in the FactoMineR package [closed]

Closed 5 years ago.
I'm using FactoMineR to perform PCA on a biological dataset, where each column is a gene and the rows contain different samples. The samples belong to different groups (control/treatment; cancer/noncancer). I've included this information as qualitative supplements when applying the PCA() function, and I sort of understand that when we call $quali.sup$eta2, we get a table with the squared correlations between each categorical variable and the principal components. My question is: how exactly is that table obtained -- how exactly is the correlation calculated?
The package vignette (p. 41) identifies eta2 as the correlation coefficient.
The correlation coefficient is based on the underlying correlation matrices.
The exact methodology should be within the reference that the package author provides. Methods for producing eigenvalues and eigenvectors are usually where packages differ but the matrix algebra is generally the same.
Husson, F., Le, S. and Pages, J. (2010). Exploratory Multivariate Analysis by Example Using R. Chapman and Hall.
The correlation (squared) is computed between the coordinates of the samples (individuals in FactoMineR terms) and the categorical variable expressed as numeric factor levels.
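As a rough illustration, eta squared is conventionally defined as the correlation ratio: the between-group sum of squares of the coordinates divided by their total sum of squares. The Python sketch below assumes that definition; whether FactoMineR computes exactly this internally is an assumption here, and the coordinates and group labels are made up. For a two-level factor the correlation ratio coincides with the squared Pearson correlation against a 0/1 coding, which the sketch also checks.

```python
def eta_squared(values, groups):
    """Correlation ratio: between-group sum of squares / total sum of squares."""
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return ss_between / ss_total

def pearson_squared(x, y):
    """Squared Pearson correlation between two numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical coordinates of six samples on one principal component
coords = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
groups = ["control"] * 3 + ["treatment"] * 3

eta2 = eta_squared(coords, groups)
# Two-level factor coded 0/1: squared correlation equals eta squared
codes = [0 if g == "control" else 1 for g in groups]
r2 = pearson_squared(codes, coords)
```

With more than two levels the two quantities generally differ, because numeric factor codes impose an ordering and spacing on the categories that the correlation ratio does not.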

Machine learning - Calculating the importance of a "value" in a variable [closed]

Closed 7 years ago.
I’m analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (type of drug in this case) in a variable for boosting? I need to know if ‘drug A’ is better for prediction than ‘drug B’, both in a variable called ‘medicine’.
The logistic regression model is able to give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well. I’m currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards, Peter
See varImp() in the caret package, which supports the ML algorithms you referenced.

How to work with numeric probability distribution functions [closed]

Closed 7 years ago.
I have to calculate the probability distribution function of a random variable that is composed (by sum, division, product, exponentiation, etc.) of some other simple random variables. It is pretty complex, so I am more than happy to get a numerical solution.
While I thought this was a very standard thing to do, I was unable to find a framework for it. I'd preferably use R, but any major language will do.
What I would like therefore is a library that allowed me to:
i) create numerical random variables from classic distributions
ii) compose them by simple operations (+,-,*,/, exp,min, max,...)
Of course I could work with vectors and use convolutions and the like, but I wanted something more polished.
I am also aware that it is possible to use simulation to create the variables, then compose them with the operations and finally get the PDF using a histogram, but again, I would prefer a non-simulation approach.
Try the rv package. Note that if X is an exponential random variable with mean 1, then -log(X) has a standard Gumbel distribution.
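For the non-simulation route mentioned in the question (vectors and convolutions), here is a minimal Python sketch of how the density of a sum of two independent random variables can be approximated on a grid. It is not how the rv package works; the function names `convolve` and `sum_density`, the grid step, and the choice of two Uniform(0, 1) variables are assumptions made for the example.

```python
def convolve(p, q):
    """Discrete convolution of two lists of cell probabilities."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def sum_density(f, g, lo_f, lo_g, h, n):
    """Approximate the density of X + Y for independent X ~ f and Y ~ g.

    Each density is discretized into n cells of width h (midpoint rule),
    starting at lo_f and lo_g respectively.
    """
    p = [f(lo_f + (i + 0.5) * h) * h for i in range(n)]
    q = [g(lo_g + (j + 0.5) * h) * h for j in range(n)]
    mass = convolve(p, q)
    # Cells i and j have midpoints lo_f + (i + 0.5)h and lo_g + (j + 0.5)h,
    # so their sum sits at lo_f + lo_g + (i + j + 1)h.
    xs = [lo_f + lo_g + (k + 1) * h for k in range(len(mass))]
    dens = [m / h for m in mass]
    return xs, dens

def uniform01(x):
    """Density of Uniform(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

h, n = 0.005, 200
xs, dens = sum_density(uniform01, uniform01, 0.0, 0.0, h, n)
total_mass = sum(d * h for d in dens)  # should be close to 1
peak = dens[n - 1]                     # density at x = 1, the triangle's apex
```

The sum of two Uniform(0, 1) variables has a triangular density on [0, 2], which the grid reproduces. Products, quotients and transformations such as exp need a change of variables rather than a plain convolution, so this sketch only covers the additive case.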

comparison of regression models built on two time points [closed]

Closed 7 years ago.
I have two multiple linear regression models, built using the same groups of subjects and the same variables. The only difference is the time point: one uses baseline data and the other uses data obtained some time later.
I want to test whether the two models differ significantly. I have seen articles saying that AIC may be a better option than p-values when comparing models.
My question is: does it make sense to just purely compare the AIC using extractAIC in R, or to obtain the anova(lm)?
It is not standard to test for statistical significance between observations recorded at two points in time by estimating two different models.
You may mean that you are testing to see whether the observations recorded at a second point in time are statistically different from the first, by including some dummy variables, and testing the coefficients on these. Still, this is only estimating one model.
In your model you will have dummy variables for the second point in time: either an intercept dummy alone, or an intercept dummy together with an interaction dummy.
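Such specifications could be written, for instance, as follows. The notation is an assumption made for illustration: $D_i$ is an indicator equal to 1 for observations recorded at the second time point and 0 at baseline, $x_i$ stands for a single predictor, and the gammas are the coefficients on the dummy terms.

```latex
% Intercept dummy only: the second time point shifts the level
y_i = \alpha + \beta x_i + \gamma_1 D_i + \varepsilon_i

% Intercept and interaction dummy: the level and the slope may both change
y_i = \alpha + \beta x_i + \gamma_1 D_i + \gamma_2 (D_i \, x_i) + \varepsilon_i
```

Under this notation, testing $\gamma_1 = 0$ asks whether the level differs between the two time points, and testing $\gamma_2 = 0$ asks whether the effect of $x$ differs.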
Then you should do both - test the p-value significance for either or both gammas in the models described, and also look at the AIC. There is no definitive 'better', as the articles likely described.

Should Categorical predictors within a linear model be normally distributed? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I am running simple linear models (Y~X) in R where my predictor is a categorical variable (0-10). However, this variable is not normally distributed and none of the available transformations are helpful (e.g. log, sq etc.), as the data is not negatively or positively skewed but rather all over the place. I am aware that for lm the outcome variable (Y) has to be normally distributed, but is this also required for predictors? If yes, any suggestions on how to do this would be more than welcome.
Also, as the data I am looking at has two groups, patients vs controls (I am interested in group differences, as you can guess), do I have to look at whether the data is normally distributed within the two groups or overall across the two groups?
Thanks.
See @Roman Luštrik's comment above: it does not matter how your predictors are distributed (except for problems with multicollinearity). What is important is that the residuals are normal and have homogeneous variances.
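A small Python sketch of the point being made: the predictor below is deliberately far from normal (piled up at 0 and 10), yet ordinary least squares still recovers the slope, and it is the residuals that one would inspect afterwards. The data, the noise values and the helper name `fit_line` are made up for illustration.

```python
# Hypothetical predictor on a 0-10 scale, clearly not normally distributed
x = [0, 0, 0, 1, 9, 10, 10, 10, 10, 10]
# Fixed small errors so the example is deterministic
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.5, -0.3, -0.1, 0.0, -0.1]
y = [2.0 + 0.5 * xi + e for xi, e in zip(x, noise)]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

intercept, slope = fit_line(x, y)
# The residuals, not x itself, are what the normality assumption refers to
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

In R the analogous check would be made on the residuals of the fitted lm object, for example with a Q-Q plot, rather than on the predictor.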
