Most important variables for finding group membership - r

I have a dataset of 8,100 observations of 118 variables that are used to determine which one of 4 groups each respondent falls into. I am interested in which variables are the most important for predicting group membership. My data is a combination of ordinal and binary variables. I initially did a discriminant function analysis, but then read that this does not handle binary data well. Next I tried a multinomial logistic regression, but from there I am struggling to work out which variables are the most important. I also tried an rpart decision tree, but then I read that these are not very stable, and indeed, when I ran it on a random half of my data I got different results every time. Now I am trying a dominance analysis. I can get it working for a linear model (lm), but for both the multinomial logistic regression and the discriminant function analysis I get the error:
Error in daRawResults(x = x, constants = constants, terms = terms, fit.functions = fit.functions, :
Not implemented method to retrieve data from model
Does anyone have any advice for what else I can try? Only 4 of the 118 variables are binary, so I can remove them if needed and will still have a good analysis.
Here is a reproducible example including a much smaller example dataset:
set.seed(1) ## for reproducibility
remotes::install_github("clbustos/dominanceAnalysis") # If you don't have the dominance analysis package
library(dominanceanalysis)
library(MASS)
library(nnet)
mydata <- data.frame(Segments = sample(1:4, 15, replace = TRUE),
                     var1 = sample(1:7, 15, replace = TRUE),
                     var2 = sample(1:7, 15, replace = TRUE),
                     var3 = sample(1:6, 15, replace = TRUE),
                     var4 = sample(1:2, 15, replace = TRUE))
# Show that it works for a linear model
LM <- lm(Segments ~ ., mydata)
da.LM <- dominanceAnalysis(LM); da.LM
#var1 is the most important, followed by var4
# Try the discriminant function analysis
DFA <- lda(Segments~., data=mydata)
da.DFA <- dominanceAnalysis(DFA)
# Error
# Try multinomial logistic regression
MLR <- multinom(Segments ~ ., data = mydata, maxit=500)
da.MLR <- dominanceAnalysis(MLR)
# Error

I've discovered a partial answer.
The dominanceanalysis package can only be used on these models: Ordinary Least Squares, Generalized Linear Models, Dynamic Linear Models and Hierarchical Linear Models.
Source: https://github.com/clbustos/dominanceAnalysis
This explains why it didn't work for my data - I wasn't using those models.
I have decided to pursue the decision-tree style of variable selection by using a random forest instead, since averaging over many trees should avoid the instability I saw with a single rpart tree.
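For anyone following the same route, here is a minimal sketch of what that might look like on the toy data above (the randomForest package and its importance()/varImpPlot() functions are standard; the settings are illustrative, not a recommendation):
library(randomForest)
set.seed(1)
# randomForest needs a factor response for classification
mydata$Segments <- as.factor(mydata$Segments)
RF <- randomForest(Segments ~ ., data = mydata, ntree = 500, importance = TRUE)
importance(RF)  # per-variable mean decrease in accuracy / Gini
varImpPlot(RF)  # quick visual ranking of the variables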

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and load data-------------#
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assign missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets are generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputated datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and the original and imputed data distributions is skipped for the sake of the example. The test variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputated datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimates from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where all observations are set to 3 for this variable. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model from mice, the object class is mipo rather than glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
  Unrecognized variable name in 'at': (1) pregnant
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
  mod <- glm(test ~ pregnant + glucose + diastolic +
               triceps + age + bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant = 3))
  return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
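To display the pooled predictions, the multiple-imputation example in the marginaleffects documentation then simply summarises the pooled object (as I understand it, mice's pool() accepts the list because marginaleffects supplies the tidy methods it needs); treat this as a sketch:
summary(mod_imputation)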

How can I extract coefficients from this model in caret?

I'm using the caret package with the leaps package to get the number of variables to use in a linear regression. How do I extract the model with the lowest RMSE that uses mdl$bestTune number of variables? If this can't be done, are there functions in other packages you would recommend that allow for LOOCV of a stepwise linear regression and actually let me find the final model?
Below is reproducible code. From it, I can tell from mdl$bestTune that the number of variables should be 4 (even though I would have hoped for 3). It seems like I should be able to extract the variables from the third row of summary(mdl$finalModel), but I'm not sure how I would do this in the general case rather than just for this example.
library(caret)
set.seed(101)
x <- matrix(rnorm(36*5), nrow=36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2*x[,1] + 0.3*x[,3] + 0.5*x[,4] + rnorm(36) * .0001
train.control <- trainControl(method="LOOCV")
mdl <- train(x=x, y=y, method="leapSeq", trControl = train.control, trace=FALSE)
coef(mdl$finalModel, as.double(mdl$bestTune))
mdl$bestTune
summary(mdl$finalModel)
mdl$results
Here's the context behind my question in case it's of interest. I have historical monthly returns for hundreds of mutual funds. Each fund's returns will be a dependent variable that I'd like to regress against a set of returns on a handful (e.g. 5) of factors. For each fund I want to run a stepwise regression. I expect only 1 to 3 of the five factors to be significant for any fund.
You can use:
coef(mdl$finalModel, unlist(mdl$bestTune))
unlist() collapses the one-row bestTune data frame into the plain number that the coef method for regsubsets objects expects as the subset size.
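If you also want the names of the selected variables in the general case, here is a minimal sketch building on the same call (selected and final_fit are just illustrative names):
best_coefs <- coef(mdl$finalModel, unlist(mdl$bestTune))
selected <- setdiff(names(best_coefs), "(Intercept)")
selected  # names of the chosen predictors
# Refit an ordinary lm on just those variables if you need a standard model object
final_fit <- lm(y ~ ., data = data.frame(y = y, x[, selected, drop = FALSE]))
summary(final_fit)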

R random forest inconsistent predictions

I recently built a random forest model using the ranger package in R. However, I noticed that the predictions stored in the ranger object during training (accessible with model$predictions) do not match the predictions I get if I run the predict command on the same dataset using the model created. The following code reproduces the problem on the mtcars dataset. I created a binary variable just for the sake of converting this to a classification problem, though I saw similar results with regression trees as well.
library(datasets)
library(ranger)
mtcars <- mtcars                                # copy the dataset into the workspace
mtcars$mpg2 <- ifelse(mtcars$mpg > 19.2, 1, 0)  # binarise mpg to get a classification problem
mtcars <- mtcars[,-1]                           # drop the original mpg column
mtcars$mpg2 <- as.factor(mtcars$mpg2)
set.seed(123)
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = T)
mod$predictions[1,] # Probability of 1 = 0.905
predict(mod, mtcars[1,])$predictions # Probability of 1 = 0.967
This problem also occurs with the randomForest package, where I observed a similar discrepancy, reproducible with the following code.
library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
mod$votes[1,]
predict(mod, mtcars[1,], type = "prob")
Can someone please tell me why this is happening? I would expect the results to be the same. Am I doing something wrong or is there an error in my understanding of some inherent property of random forest that leads to this scenario?
I think you may want to look a little more deeply into how a random forest works. I really recommend An Introduction to Statistical Learning with Applications in R (ISLR), which is available for free online.
That said, I believe the main issue here is that you are treating the mod$votes value and the predict() value as the same, when they are not quite the same thing. If you look at the documentation of the randomForest function, the mod$votes or mod$predicted values are out-of-bag ("OOB") predictions for the input data. This is different from the value that the predict() function produces, which evaluates an observation on the model produced by randomForest(). Typically, you would want to train the model on one set of data, and use the predict() function on the test set.
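As a quick check of this (a minimal sketch; the randomForest documentation says that calling predict() with no newdata returns the out-of-bag predictions, so the two should agree):
library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
# No newdata: out-of-bag predictions, matching mod$votes
oob_probs <- predict(mod, type = "prob")
all.equal(oob_probs, mod$votes, check.attributes = FALSE)
# With newdata: predictions from the full forest, which can differ
predict(mod, mtcars[1, ], type = "prob")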
Finally, you may need to re-run your set.seed() function every time you make the random forest if you want to achieve the same results for the mod object. I think there is a way to set the seed for an entire session, but I am not sure. This looks like a useful post: Fixing set.seed for an entire session
Side note: Here, you are not specifying the number of variables to use for each tree, but the default is good enough in most cases (check the documentation for each of the random forest functions you are using for the default). Maybe you are doing that in your actual code and didn't include it in your example, but I thought it was worth mentioning.
Hope this helps!
Edit:
I tried training the random forest using all of the data except for the first observation (Mazda RX4) and then used the predict function on just that observation, which I think illustrates my point a bit better. Try running something like this:
library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars[-1,], ntree = 200)
predict(mod, mtcars[1,], type = "prob")
Since you converted mpg to mpg2, you are indeed building a classification model. mod$predictions gives the out-of-bag probabilities produced while the model learns from your data points, whereas predict(mod, mtcars[,1:10])$predictions gives probabilities from the fully trained model. I ran the same code with probability = FALSE and got the result below: the predictions from the trained model are perfect, whereas the mod$predictions option has 3 misclassifications.
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = FALSE)
> table(mtcars$mpg2, predict(mod, mtcars[,1:10])$predictions)
     0  1
  0 17  0
  1  0 15
> table(mtcars$mpg2, mod$predictions)
     0  1
  0 15  2
  1  1 14

the questions about predict function in randomForestSRC package

As is common with other machine learning methods, I divided my original dataset 70:30 into training and test sets.
Here is my code.
install.packages("randomForestSRC")
library(randomForestSRC)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
train <- sample(1:nrow(data), round(nrow(data) * 0.70))
data.grow <- rfsrc(Surv(days, status) ~ .,
                   data[train, ],
                   ntree = 100,
                   tree.err = TRUE,
                   importance = TRUE,
                   nsplit = 1,
                   proximity = TRUE)
data.pred <- predict(data.grow,
                     data[-train, ],
                     importance = TRUE,
                     tree.err = TRUE)
My question is about the predict function in this code.
Originally, I wanted to build a prediction model based on a random survival forest to predict disease development.
For example, after building the prediction model with the training dataset, I wanted to know the probability of disease development for test data that has no information about disease incidence for each individual, because I would like to predict disease development from a subject's general characteristics such as age, BMI, sex, and so on.
However, contrary to my intention to build a prediction model as described above, the "predict" function in this package didn't work on data that has no status information (event/censored).
It seems the "predict" function must be given the outcome information (event/censored).
Therefore, I cannot understand what the "predict" function means here.
If the "predict" function works only with outcome information, then how can I make a prediction for disease development based on a subject's general characteristics in the future?
In addition, if the prediction in this model is constructed with the outcome information, what does "predict" mean in a random survival forest model?
Please let me know what the "predict" function in this package does.
Thank you for reading my long question.
The predict for this type of model, i.e. predict.rfsrc, works much like you'd expect it to if you've used predict with glm, lm, RRF or other models.
The predict statement does not require you to know the outcome for the prediction data set. I am trying to understand why you thought that it did.
Your example rfsrc statement does not work because it refers to columns that are not in the example data set.
I think the best plan is that I will show you using a reproducible example, below. If you have further questions you can ask me in a comment.
# Train an rfsrc model
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~ ., data = mtcars[1:30, ],
                     tree.err = TRUE, importance = TRUE)
# Simulate new data
new_data <- mtcars[31:32, ]
# Predict
predicted <- predict(mtcars.mreg, new_data)
predicted
Sample size of test (predict) data: 2
Number of grow trees: 1000
Average no. of grow terminal nodes: 4.898
Total no. of grow variables: 9
Analysis: RSF
Family: surv-CR
Test set error rate: NA
predicted$predicted
event.1 event.2 event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809 2.15895
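To make the point about not needing the outcome explicit, here is a sketch that drops the outcome columns from the new data entirely (the randomForestSRC documentation allows test data without the y-outcomes; when they are absent, no test-set error rate can be reported):
new_data_no_outcome <- mtcars[31:32, setdiff(names(mtcars), c("mpg", "cyl"))]
predict(mtcars.mreg, new_data_no_outcome)$predicted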

How to test your tuned SVM model on a new data-set using machine learning and Caret Package in R?

Guys!
I am a newbie to machine learning methods and have a question about them. I am trying to use the caret package in R to get started with these methods and apply them to my dataset.
I have a training dataset (Dataset1) with mutation information regarding my gene of interest, let's say Gene A.
In Dataset1, I have the mutation status of Gene A in the form Mut or Not-Mut. I used Dataset1 with an SVM model to predict the output (I chose SVM because it was more accurate than LVQ or GBM).
So, as a first step, I divided my dataset into training and test groups, because the dataset already labels each case as train or test. Then I did cross-validation with 10 folds.
I tuned my model and assessed its performance using the test dataset (using a ROC curve).
Everything goes fine till this step.
I have another dataset, Dataset2, which doesn't have mutation information regarding Gene A.
What I want to do now is use my tuned SVM model from Dataset1 on Dataset2 to see if it can give me the mutation status of Gene A in Dataset2 in the form Mut/Not-Mut. I've gone through the caret package guide but couldn't work it out. I am stuck here and don't know what to do.
I am not sure if I chose the right approach. Any suggestions or help would be greatly appreciated.
Here is my code till I tuned my model from the first dataset.
Selecting the training and test sets from the first dataset:
library(caret)
library(doParallel)
library(pROC)
M_train <- Dataset1[Dataset1$Case=='train', -1]  # creating train feature data frame
M_test <- Dataset1[Dataset1$Case=='test', -1]    # creating test feature data frame
y <- as.factor(M_train$Class)                    # target variable for training
ctrl <- trainControl(method = "repeatedcv",           # 10-fold cross-validation
                     repeats = 5,                     # do 5 repetitions of CV
                     summaryFunction = twoClassSummary,  # use AUC to pick the best model
                     classProbs = TRUE)
# Use expand.grid to specify the search space
# Note that the default search grid selects 3 values of each tuning parameter
grid <- expand.grid(interaction.depth = seq(1, 4, by = 2),  # tree depths from 1 to 4
                    n.trees = seq(10, 100, by = 10),        # let iterations go from 10 to 100
                    shrinkage = c(0.01, 0.1),               # try 2 values for the learning rate
                    n.minobsinnode = 20)
# Set up for parallel processing
#set.seed(1951)
registerDoParallel(4,cores=2)
# Train and tune the SVM
svm.tune <- train(x = M_train[, names(M_train) != "Class"],  # predictors only; the target must not be in x
                  y = y,
                  method = "svmRadial",
                  tuneLength = 9,                # 9 values of the cost parameter
                  preProc = c("center","scale"),
                  metric = "ROC",
                  trControl = ctrl)              # same ctrl as for gbm above
# Finally, assess the performance of the model using the test dataset.
# Make predictions on the test data with the SVM model
svm.pred <- predict(svm.tune, M_test)
confusionMatrix(svm.pred, as.factor(M_test$Class))
svm.probs <- predict(svm.tune, M_test, type = "prob")  # generate probabilities for the ROC curve
svm.ROC <- roc(predictor = svm.probs$mut,
               response = as.factor(M_test$Class),
               levels = levels(y))
plot(svm.ROC, main = "ROC for SVM built with GA selected features")
So, here is where I stuck, how can I use svm.tune model to predict the mutation of Gene A in Dataset2?
Thanks in advance,
Now you just take the model you built and tuned and predict from it using predict:
D2.predictions <- predict(svm.tune, newdata = Dataset2)
The keys are to be sure that you have ALL of the same predictor variables in this set, with the same column names (and, in my paranoid world, in the same order).
D2.predictions will contain your predicted classes for the unlabeled data.
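If you also want class probabilities rather than hard Mut/Not-Mut labels, caret's predict.train supports type = "prob" whenever the model was trained with classProbs = TRUE, as it was above. A small sketch (the column names follow the class levels; "mut" is just an example borrowed from the svm.probs$mut call earlier):
D2.probs <- predict(svm.tune, newdata = Dataset2, type = "prob")
head(D2.probs)  # one probability column per class level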
