I'm using FactoMineR to perform PCA on a biological dataset, where each column is a gene and each row is a sample. The samples belong to different groups (control/treatment; cancer/noncancer). I've included this information as qualitative supplementary variables when applying the PCA() function, and I understand that calling $quali.sup$eta2 gives a table with the squared correlations between each categorical variable and the principal components. My question is: how exactly is that table obtained -- how is the correlation calculated?
The package vignette (p. 41) identifies eta2 as the correlation coefficient.
The correlation coefficient is based on the underlying correlation matrices.
The exact methodology should be in the reference that the package author provides. Methods for producing eigenvalues and eigenvectors are usually where packages differ, but the matrix algebra is generally the same.
Husson, F., Lê, S. and Pagès, J. (2010). Exploratory Multivariate Analysis by Example Using R. Chapman and Hall.
The squared correlation is computed between the coordinates of the samples (individuals, in FactoMineR terms) on each principal component and the categorical variable. For a categorical variable this is the squared correlation ratio (eta²): the proportion of the variance of the coordinates explained by the groups, i.e. the R² of a one-way ANOVA of the coordinates on the factor.
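A minimal sketch of how you could reproduce eta² by hand, using the decathlon example data shipped with FactoMineR (the lm()-based computation is my reading of the correlation ratio, not code from the package itself):
library(FactoMineR)
data(decathlon)  # 10 performance variables, 2 quantitative and 1 categorical supplement
res <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)
# eta2 for the supplementary factor on dimension 1: the R-squared of a
# one-way ANOVA of the individuals' coordinates on that factor
eta2_dim1 <- summary(lm(res$ind$coord[, 1] ~ decathlon$Competition))$r.squared
eta2_dim1
res$quali.sup$eta2[, 1]  # compare with what FactoMineR reports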
I have a set of customers with attributes of different types: continuous, categorical, binary and ordinal.
How can I cluster them, given that we cannot apply the same distance metric to these different types of attributes?
Thank you in advance
As mentioned already, the daisy() function (from the cluster package) is an option: it automatically selects a suitable dissimilarity measure based on the data types. But I would suggest the following approach, and invite experts to chime in.
Rather than relying on automatic selection, first identify and remove correlated variables, for example:
Pearson correlation: for continuous variables
Chi-squared test: for categorical variables
Categorical vs. numerical: one-way ANOVA, etc.
Taking the subset of useful variables, consider one-hot encoding the categorical variables and perhaps converting ordinals to continuous (or to categorical, then one-hot encoding). Test different distance metrics such as Euclidean and Manhattan to evaluate the results; a sketch of the Gower route is given below. You will get better clarity on the overall clustering process this way.
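A minimal sketch of the Gower-distance route, assuming a toy customers data frame (the column names and values here are made up for illustration):
library(cluster)
customers <- data.frame(
  income = c(52000, 64000, 33000, 78000),                 # continuous
  region = factor(c("north", "south", "south", "east")),  # categorical
  active = c(TRUE, FALSE, TRUE, TRUE),                    # binary
  tier   = ordered(c("low", "mid", "mid", "high"),
                   levels = c("low", "mid", "high"))      # ordinal
)
# daisy() falls back to Gower's coefficient when column types are mixed
d <- daisy(customers, metric = "gower")
# Partitioning around medoids accepts an arbitrary dissimilarity matrix
fit <- pam(d, k = 2)
fit$clustering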
I'm working with a large, national survey that was collected using complex survey methods. As such, I need to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.
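A minimal sketch of that workflow, with placeholder names (psu, strata, wt, female, and the model variables are all made up for illustration):
library(survey)
library(lavaan)
library(lavaan.survey)
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~wt,
                 data = mydata, nest = TRUE)
# subset() keeps the full design information, so variance estimation
# for the subpopulation stays correct
des_female <- subset(des, female == 1)
model <- 'outcome ~ predictor1 + predictor2'
fit <- sem(model, data = mydata)  # naive fit, ignoring the design
fit_svy <- lavaan.survey(lavaan.fit = fit, survey.design = des_female)
summary(fit_svy)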
Let's say that a dataset has both numeric and categorical features, and I've created an xgboost model using gblinear. I've analyzed the model with xgb.importance; how can I express the weights of the categorical variables?
While XGBoost is considered a black-box model, you can understand the feature importance (for both categorical and numeric features) by averaging the gain of each feature across all splits and all trees.
This is represented in the graph below.
# Get the real feature names
names <- dimnames(trainMatrix)[[2]]
# Compute the feature importance matrix
importance_matrix <- xgb.importance(feature_names = names, model = bst)
# Plot the 10 most important features
xgb.plot.importance(importance_matrix[1:10, ])
In the feature importance plot above, we can see the 10 most important features.
The function gives a color to each bar: basically, a k-means clustering is applied to group the features by importance.
Alternatively, this could be represented in a tree diagram (see the link above).
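For the tree view, a minimal sketch (assuming the same bst model and names vector as above; note this applies to tree boosters, not gblinear):
# Draw the first boosted tree (tree index 0)
xgb.plot.tree(feature_names = names, model = bst, trees = 0)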
I'm not sure if this is a question for Stack Overflow or Cross Validated.
I'm looking for a way to include covariate measures when calculating the correlation between two measures. For example, let's say I have 100 samples, for which I have two measurements, x and y. Now let's say I also have a third measure, a covariate (say, age). I want to measure the correlation between x and y, but I also want to ignore any of that correlation that comes from the covariate, age.
If I'm fitting a linear model, I could simply add the term to the model:
lm(y~x+age)
I know you can't calculate correlation with this kind of model in R (using ~).
So I want to know:
Does what I'm asking even make sense to do? I suspect it may not.
If it does, what R packages should I be using?
It sounds like you're asking for a semipartial correlation: the correlation between x and y after partialling z (the covariate, here age) out of one of the two variables. You should read about partial and semipartial correlations.
The ppcor package in R will then help you with the calculations.
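A minimal sketch with ppcor on simulated data (the data here is made up purely to show the calls):
library(ppcor)
set.seed(1)
age <- rnorm(100, mean = 50, sd = 10)
x   <- 0.5 * age + rnorm(100)
y   <- 0.3 * age + 0.4 * x + rnorm(100)
# Partial correlation: age is removed from both x and y
pcor.test(x, y, age)
# Semipartial (part) correlation: age is removed from only one of the
# two variables (see ?spcor.test for the exact convention)
spcor.test(x, y, age)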
I have two multiple linear regression models built using the same group of subjects and the same variables; the only difference is the time point: one uses baseline data, and the other uses data obtained some time later.
I want to test whether there is any statistically significant difference between the two models. I have seen articles saying that AIC may be a better option than p-values when comparing models.
My question is: does it make sense to simply compare the AIC values using extractAIC in R, or should I run anova() on the fitted models?
It is not standard to test for statistically significant differences between observations recorded at two points in time by estimating two separate models.
You may mean that you want to test whether the observations recorded at the second point in time are statistically different from the first, by including some dummy variables and testing the coefficients on them. That is still estimating only one model.
In that model you will have a dummy variable D for the second point in time, entering either as an intercept shift alone or as an intercept shift plus an interaction, e.g.
y = β0 + γ0·D + β1·x + γ1·(D·x) + ε
Then you should do both: test the significance (p-values) of either or both of the gamma coefficients (γ0, γ1) in the model described, and also look at the AIC. There is no definitive 'better', as the articles likely described.
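A minimal sketch of that pooled-model comparison (baseline_df and followup_df are placeholder data frames with the same y, x1, x2 columns; all names are made up for illustration):
# Stack the two time points and flag the second one with a dummy
dat <- rbind(
  transform(baseline_df, time2 = 0),
  transform(followup_df, time2 = 1)
)
m0 <- lm(y ~ x1 + x2, data = dat)                     # no time effect
m1 <- lm(y ~ x1 + x2 + time2, data = dat)             # intercept shift (gamma0)
m2 <- lm(y ~ x1 + x2 + time2 + time2:x1, data = dat)  # plus interaction (gamma1)
anova(m0, m1, m2)  # nested F-tests on the gamma coefficients
AIC(m0, m1, m2)    # information-criterion comparison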