Train a random forest algorithm using various columns - r

I have asked this question before here: Creating a loop for different random forest training algoritms, but didn't get a correct answer yet. So here is another attempt with a more reproducible example.
I have the following datasets:
train <- read.csv(url("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"))
test <- read.csv(url("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"))
train <- train[complete.cases(train), ]
I would like to run several random forest algorithms to see which one performs best. So what I basically want to do is:
library(randomForest)
#predict based on Pclass
fit <- randomForest(as.factor(Survived) ~ Pclass, data=train, importance=TRUE, ntree=2000)
Prediction <- predict(fit, test)
#fetch accuracy
#predict based on Pclass and Sex
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex, data=train, importance=TRUE, ntree=2000)
Prediction <- predict(fit, test)
#fetch accuracy
I would like to create some kind of loop so that I can store all values in a list and then loop over it. So like this:
list <- c(Pclass, Pclass + Sex)
for (R in list) {
modfit <- paste0("won ~ ", R, ", data=training, method=\"rf\", prox=\"TRUE")
modfit <- as.formula(modfit)
train(modfit)
}
But the code above doesn't work. It gives me the following error:
Error in parse(text = x, keep.source = FALSE) :
<text>:1:13: unexpected ','
1: won ~ Pclass,
Any thoughts on how I can get this working?

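# store the predictor sets as character strings
list <- c("Pclass", "Pclass + Sex")
for (R in list) {
# only the formula goes through as.formula()
modfit <- as.formula(paste0("won ~ ", R))
# data=, method= and prox= are arguments of train(), not part of the string
train(modfit, data=training, method="rf", prox=TRUE)
}
The stray comma before data=training is what the parser complains about, but removing it is not enough: as.formula() can only parse the formula itself, so data=, method= and prox= have to be passed to train() as ordinary arguments instead of being pasted into the string. The predictor sets also need to be quoted, otherwise c(Pclass, Pclass + Sex) looks for objects with those names.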
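The loop in the question can also be written without pasting strings, using base R's reformulate() and collecting one accuracy value per predictor set. This is a minimal sketch under stated assumptions: toy is made-up data standing in for the Titanic file, the holdout split is arbitrary, and glm() is swapped in for the random forest so the example runs with base R only.

```r
# Made-up stand-in for the training data (assumption: not the real Titanic file)
set.seed(1)
n <- 300
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
p_surv <- plogis(1.5 - 0.6 * toy$Pclass + 1.2 * (toy$Sex == "female"))
toy$Survived <- factor(rbinom(n, 1, p_surv))

# Arbitrary split so that accuracy is measured on rows the model has not seen
in_train <- sample(n, 200)
training <- toy[in_train, ]
holdout  <- toy[-in_train, ]

# One character vector of predictors per candidate model
predictor_sets <- list(c("Pclass"), c("Pclass", "Sex"))

accuracy <- sapply(predictor_sets, function(vars) {
  f <- reformulate(vars, response = "Survived")      # e.g. Survived ~ Pclass + Sex
  fit <- glm(f, data = training, family = binomial)  # stand-in for the random forest
  prob <- predict(fit, newdata = holdout, type = "response")
  pred <- ifelse(prob > 0.5, "1", "0")
  mean(pred == holdout$Survived)                     # share of correct predictions
})
names(accuracy) <- sapply(predictor_sets, paste, collapse = " + ")
accuracy
```

To use a random forest instead, the glm() line could be replaced by randomForest(f, data = training, ntree = 2000); predict() then returns classes directly, so the 0.5 threshold step is no longer needed.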

Related

Obtaining individual slopes from an lme4 object in R

I'm new to the lme4 package in R. In my example below, is it possible to obtain the gender slopes (i.e., differences) for each dep after fitting my glmer model?
dat <- data.frame(dep = rep(LETTERS[1:6],each=2), gender = rep(c("Ma","Fe"),6),
admit=c(512,89,353,17,120,202,138,131,53,94,22,24),
reject=c(313,19,207,8,205,391,279,244,138,299,351,317))
lme4::glmer(cbind(admit,reject) ~ gender+dep + (gender|dep), data=dat, family=binomial)
In lme4, ranef() returns only each department's deviation from the overall gender effect. To get the department-specific slopes in your model, add the global (fixed) term to the unit-specific (random) terms, as in the example below.
library(lme4)
dat <- data.frame(dep = rep(LETTERS[1:6],each=2), gender = rep(c("Ma","Fe"),6),
admit=c(512,89,353,17,120,202,138,131,53,94,22,24),
reject=c(313,19,207,8,205,391,279,244,138,299,351,317))
mod1 <- glmer(cbind(admit,reject) ~ gender+dep + (gender|dep), data=dat, family=binomial)
summary(mod1)
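ran_gender <- ranef(mod1)$dep   # per-department deviations from the fixed effects
fe_mod1 <- fixef(mod1)          # overall (fixed) effects
# select by name rather than by position; "genderMa" is the Ma-vs-Fe contrast
slopes <- fe_mod1[["genderMa"]] + ran_gender[, "genderMa"]
slopes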
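As a cross-check on the manual sum above, lme4's coef() method returns per-group coefficients with the fixed and random parts already added. A minimal sketch, assuming lme4 is installed and that the gender coefficient is named genderMa (R sorts the levels "Fe", "Ma" alphabetically). On a data set this small, the fit may print convergence or singular-fit messages.

```r
library(lme4)

dat <- data.frame(dep = rep(LETTERS[1:6], each = 2), gender = rep(c("Ma", "Fe"), 6),
                  admit  = c(512, 89, 353, 17, 120, 202, 138, 131, 53, 94, 22, 24),
                  reject = c(313, 19, 207, 8, 205, 391, 279, 244, 138, 299, 351, 317))
mod1 <- glmer(cbind(admit, reject) ~ gender + dep + (gender | dep),
              data = dat, family = binomial)

# coef() adds each department's random deviation to the matching fixed effect
slopes_coef <- coef(mod1)$dep[, "genderMa"]

# The same quantity computed by hand, selecting by name
slopes_manual <- fixef(mod1)[["genderMa"]] + ranef(mod1)$dep[, "genderMa"]

all.equal(slopes_coef, slopes_manual)
```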

Getting estimated means after multiple imputation using the mitml, nlme & geepack R packages

I'm running multilevel multiple imputation through the package mitml (using the panImpute() function) and am fitting linear mixed models and marginal models through the packages nlme and geepack and mitml's with() method.
I can get the estimates, p-values etc. for those through the testEstimates() function, but I'm also looking to get estimated means across my model predictors. I've tried the emmeans package, which I normally use for getting estimated means when running nlme & geepack without multiple imputation, but when I do so emmeans tells me "Can't handle an object of class “mitml.result”".
I'm wondering is there a way to get pooled estimated means from the multiple imputation analyses I've run?
The data frames I'm analyzing are longitudinal/repeated measures and in long format. In the linear mixed model I want to get the estimated means for a 2x2 interaction effect and in the marginal model I'm trying to get estimated means for the 6 levels of 'time' variable. The outcome in all models is continuous.
Here's my code
# mixed model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100, group = "treatment")
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, lme(Dep ~ time*treatment, random = ~ 1|id, method = "ML", na.action = na.exclude, control = list(opt = "optim")))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
# marginal model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100)
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, geeglm(Dep ~ time, id = id, corstr ="unstructured"))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
is there a way to get pooled estimated means from the multiple imputation analyses I've run?
This is not a reprex without Data, so I can't verify this works for you. But emmeans provides support for mira-class (lists of) models in the mice package. So if you fit your model in with() using the mids rather than mitml.list class object, then you can use that to obtain marginal means of your outcome (and any contrasts or pairwise comparisons afterward).
Using example data found here, which uncomfortably loads an external workspace:
con <- url("https://www.gerkovink.com/mimp/popular.RData")
load(con)
## imputation
library(mice)
ini <- mice(popNCR, maxit = 0)
meth <- ini$meth
meth[c(3, 5, 6, 7)] <- "norm"
pred <- ini$pred
pred[, "pupil"] <- 0
imp <- mice(popNCR, meth = meth, pred = pred, print = FALSE)
## analysis
library(lme4) # fit multilevel model
mod <- with(imp, lmer(popular ~ sex + (1|class)))
library(emmeans) # obtain pooled estimates of means
(em <- emmeans(mod, specs = ~ sex) )
pairs(em) # test comparison
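For intuition about what "pooled" means in this setting: the point estimate is the average over the imputed data sets, and its variance combines within- and between-imputation variance (Rubin's rules). The base-R sketch below uses made-up numbers, and pool_rubin is a hypothetical helper written for illustration, not a function from mice, mitml or emmeans.

```r
# Hypothetical helper: pool one scalar estimate across m imputed data sets
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)              # pooled point estimate
  w     <- mean(se^2)             # within-imputation variance
  b     <- var(est)               # between-imputation variance
  t_var <- w + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

# Made-up estimated means and standard errors from three imputations
pooled <- pool_rubin(est = c(5.1, 4.9, 5.3), se = c(0.40, 0.42, 0.38))
pooled
```

The pooled standard error is larger than any single-imputation standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data enters the result.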

Confusion matrix for multinomial logistic regression & ordered logit

I would like to create confusion matrices for a multinomial logistic regression as well as a proportional odds model but I am stuck with the implementation in R. My attempt below does not seem to give the desired output.
This is my code so far:
library(nnet) # multinom()
library(MASS) # polr()
CH <- read.table("http://data.princeton.edu/wws509/datasets/copen.dat", header=TRUE)
CH$housing <- factor(CH$housing)
CH$influence <- factor(CH$influence)
CH$satisfaction <- factor(CH$satisfaction)
CH$contact <- factor(CH$contact)
CH$satisfaction <- factor(CH$satisfaction,levels=c("low","medium","high"))
CH$housing <- factor(CH$housing,levels=c("tower","apartments","atrium","terraced"))
CH$influence <- factor(CH$influence,levels=c("low","medium","high"))
CH$contact <- relevel(CH$contact,ref=2)
model <- multinom(satisfaction ~ housing + influence + contact, weights=n, data=CH)
summary(model)
preds <- predict(model)
table(preds,CH$satisfaction)
omodel <- polr(satisfaction ~ housing + influence + contact, weights=n, data=CH, Hess=TRUE)
preds2 <- predict(omodel)
table(preds2,CH$satisfaction)
I would really appreciate some advice on how to correctly produce confusion matrices for my 2 models!
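See the related question: Predict() - Maybe I'm not understanding it
Called without newdata, predict() returns the fitted classes for the rows the model was trained on. To assess the model you need to pass unseen data to predict() through the newdata argument.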
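One further point that may explain the unexpected tables, offered as a suggestion rather than something the answer states: the models are fitted with weights=n, so each row of CH stands for n respondents, and an unweighted table() counts rows rather than people. A base-R sketch with made-up counts (toy is not the copen data) shows a weighted confusion matrix built with xtabs():

```r
lev <- c("low", "medium", "high")

# Made-up aggregated data: each row represents n respondents
toy <- data.frame(
  observed  = factor(c("low", "low", "medium", "medium", "high", "high"), levels = lev),
  predicted = factor(c("low", "high", "low", "medium", "high", "high"), levels = lev),
  n         = c(10, 5, 3, 7, 20, 4)
)

# Weighted confusion matrix: cells sum n instead of counting rows
cm <- xtabs(n ~ predicted + observed, data = toy)
cm

# Weighted accuracy: correctly classified respondents over all respondents
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

With the question's objects, the analogous call would be xtabs(n ~ preds + satisfaction, data = cbind(CH, preds = preds)).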

obtain pmml representation of glm-type model produced by caret::train

I am trying to produce PMML from a regression model trained in caret with method='glm'. Example model:
library('caret')
data('GermanCredit')
set.seed(123)
train_rows <- createDataPartition(GermanCredit$Class, p=0.6, list=FALSE)
train_x <- GermanCredit[train_rows, c('Age','ForeignWorker','Housing.Own',
'Property.RealEstate','CreditHistory.Critical') ]
train_y <- as.integer( GermanCredit[train_rows, 'Class'] == 'Good' )
some_glm <- train( train_x, train_y, method='glm', family='binomial',
trControl = trainControl(method='none') )
summary(some_glm$finalModel)
An unaccepted answer on this related question for type='rf' suggests that it is not possible to do using the matrix interface.
So I'm unable to get pmml using either the matrix or the formula syntax (which I'm pretty sure produce identical finalModels anyway):
library('pmml')
pmml(some_glm$finalModel)
# Error in if (model$call[[1]] == "glm") { : argument is of length zero
# Same problem if I try:
some_glm2 <- train( Class ~ Age + ForeignWorker + Housing.Own +
Property.RealEstate + CreditHistory.Critical,
data=GermanCredit[train_rows, ], family="binomial",
method='glm',
trControl = trainControl(method='none') )
pmml(some_glm2$finalModel)
It does work in base glm with the formula interface:
some_glm_base <- glm(Class ~ Age + ForeignWorker + Housing.Own +
Property.RealEstate + CreditHistory.Critical,
data=GermanCredit[train_rows, ], family="binomial")
pmml(some_glm_base) # works
For interoperability, I would like to continue to use caret. Is there a way to convert some_glm produced in caret back to a format that pmml() will accept? Or am I forced to use the glm() construction if I want pmml functionality?
The error message shows pmml() testing model$call[[1]] == "glm", which comes back with length zero for caret's finalModel. If you set model$call[[1]] yourself, the pmml function will work correctly.
So in your case you would want to:
library('pmml')
some_glm$finalModel$call[[1]] <- "glm"
pmml(some_glm$finalModel)

how to carry out logistic regression and random forest to predict churn rate

I am using following dataset: http://www.sgi.com/tech/mlc/db/churn.data
And the variable description: http://www.sgi.com/tech/mlc/db/churn.names
I did some preliminary coding, but I can't work out how to apply logistic regression and random forest techniques to this data to estimate variable importance and predict the churn rate.
nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names",
skip=4, colClasses=c("character", "NULL"), header=FALSE, sep=":")[[1]]
nm
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
dat
View(dat)
library(survival)
s <- with(dat, Surv(account.length, as.numeric(Churn)))
model <- coxph(s ~ total.day.charge + number.customer.service.calls, data=dat[, -4])
summary(model)
plot(survfit(model))
Also, I can't figure out how to use the model that I built in further analysis.
Please help me.
Do you have any example code of what you're trying to do? What further analysis do you have planned? If you're just trying to run a logistic regression on the data, the general format is:
lr <- glm(Churn ~ international.plan + voice.mail.plan + number.vmail.messages
+ account.length, family = "binomial", data = dat)
Try help(glm) and help(randomForest)
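To connect the glm() call above to "further analysis", the fitted model can be used to predict churn probabilities for customers it has not seen. A minimal base-R sketch under stated assumptions: toy is made-up data that reuses a few column names from the churn file, and the 350/150 split and the 0.5 cut-off are arbitrary choices.

```r
# Made-up stand-in for the churn data (assumption: not the real file)
set.seed(42)
n <- 500
toy <- data.frame(
  international.plan = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1))),
  number.customer.service.calls = rpois(n, 1.5),
  total.day.charge = rnorm(n, mean = 30, sd = 9)
)
lin <- -3 + 1.5 * (toy$international.plan == "yes") +
  0.4 * toy$number.customer.service.calls + 0.03 * toy$total.day.charge
toy$Churn <- factor(ifelse(runif(n) < plogis(lin), "True", "False"))

# Fit on one part of the data, evaluate on the rest
train_idx <- sample(n, 350)
lr <- glm(Churn ~ international.plan + number.customer.service.calls + total.day.charge,
          family = binomial, data = toy[train_idx, ])

# Predicted churn probability for each held-out customer
churn_prob <- predict(lr, newdata = toy[-train_idx, ], type = "response")

# Turn probabilities into classes and compare with the observed outcome
churn_pred <- ifelse(churn_prob > 0.5, "True", "False")
holdout_accuracy <- mean(churn_pred == toy$Churn[-train_idx])

# Coefficient table: sign and size indicate each variable's association with churn
summary(lr)$coefficients
```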
