Random forest evaluation in R - r

I am new to R and am doing my best to build my first model. I am working on a two-class random forest project and so far I have written the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model. I must be missing something in my code. I get the OOB error and the prediction per case in the test set, but not the overall prediction of the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot understand all of it. It would be very helpful if anyone could help with the code or post a link to a good practical example.

Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[,2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
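If the "overall prediction" in the question means a single test-set score such as accuracy, it can be derived from the probability matrix. The following is a minimal base R sketch under assumptions: the toy matrix prob stands in for predict(randomforest, test, type='prob'), and the class labels "0" and "1" are hypothetical, since the thread does not show the levels of goodkit.

```r
# Toy stand-in for predict(randomforest, test, type = 'prob'):
# one row per test case, one column per class, named by class label.
# The labels "0" and "1" are assumptions for illustration.
prob <- cbind(`0` = c(0.9, 0.3, 0.6, 0.2, 0.45),
              `1` = c(0.1, 0.7, 0.4, 0.8, 0.55))
truth <- factor(c("0", "1", "1", "1", "0"), levels = c("0", "1"))

# Pick the most probable class for each row.
pred_class <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
                     levels = levels(truth))

# Confusion matrix and overall accuracy on the test set.
conf_mat <- table(predicted = pred_class, actual = truth)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

With a real randomForest fit, calling predict without type='prob' returns the class labels directly, so the max.col step would not be needed.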

Your problem is that predict on a randomForest object with type='prob' returns a matrix with one column per class: each column contains the probability of belonging to that class (two columns for binary prediction).
You have to decide which of these columns to use to build the ROC curve. Fortunately, for binary classification the two columns are complementary (each is one minus the other), so either one gives the same AUC:
auc1 <-roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <-roc(test$goodkit, prediction[,2])
print(auc2)
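To see why the two columns carry the same information, here is a small base R sketch that computes the AUC from ranks (the Mann-Whitney formulation) instead of using pROC. The helper auc_rank and the toy data are hypothetical. With a fixed direction the two columns give AUCs that sum to 1; pROC's roc chooses the direction automatically by default, which is why it reports the same value for both.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# scores higher than a randomly chosen negative case (ties count 1/2).
# auc_rank is a hypothetical helper, not part of pROC or ROCR.
auc_rank <- function(labels, scores, positive) {
  is_pos <- labels == positive
  r <- rank(scores)                  # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy stand-in for the two-column probability matrix.
truth <- factor(c(0, 0, 1, 1, 0, 1))
p1 <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.9)   # assumed P(class "1")
prob <- cbind(`0` = 1 - p1, `1` = p1)

auc_col2 <- auc_rank(truth, prob[, "1"], positive = "1")
auc_col1 <- auc_rank(truth, prob[, "0"], positive = "1")
print(c(col1 = auc_col1, col2 = auc_col2))
```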

Related

Why does ranger predict give different numbers when re-applied to training data?

I am very new to machine learning. I am trying to explore fitting random forests with the ranger library in R. My dependent variable is continuous - so it would be a regression tree (and not just classification). Upon trying out the functions, I have noticed that there seems to be a discrepancy between ranger and predict ranger. The following lines result in different predictions in results and results_alternative:
rf_reg <- ranger(formula = y ~ ., data = training_df)
results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions
Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!
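A likely explanation, not confirmed anywhere in this thread: the predictions stored in a fitted ranger object are out-of-bag (OOB) predictions, so each training row is predicted only by trees that did not see it, whereas predict on the training data uses every tree. The sketch below illustrates that difference in base R with bagged lm fits as a stand-in for trees; all names and data are hypothetical and ranger itself is not used.

```r
set.seed(1)
n <- 60
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.3)

n_models <- 200
pred_all <- matrix(NA_real_, n, n_models)  # each model's prediction for each row
in_bag   <- matrix(FALSE, n, n_models)     # was row i in model b's bootstrap sample?

for (b in seq_len(n_models)) {
  idx <- sample(n, replace = TRUE)         # bootstrap sample of row indices
  fit <- lm(y ~ x, data = d[idx, ])
  pred_all[, b] <- predict(fit, newdata = d)
  in_bag[unique(idx), b] <- TRUE
}

# Average over all models: analogous to predict(rf_reg, data = training_df).
full_pred <- rowMeans(pred_all)

# Average only over models that never saw the row: analogous to rf_reg$predictions.
oob_only <- replace(pred_all, in_bag, NA)
oob_pred <- rowMeans(oob_only, na.rm = TRUE)

print(summary(full_pred - oob_pred))
```

Under this reading, the OOB values are the more honest estimate of performance on unseen data, while the predict values on training data tend to look too good.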

Is there a function to obtain pooled standardized coefficients from a linear model fitted to a multiply imputed (MI) dataset?

I imputed the missing data using the mice package.
I fitted the linear model using: summary(pool(with(imputed_base_finale,lm(....))))
I tried to obtain standardized coefficients with the function lm.beta, but it doesn't work.
lm.beta (with(imputed_base_finale,lm(...)))
Error in lm.beta(with(imputed_base_finale, lm(...)))
object has to be of class lm
How can I obtain these standardized coefficients ?
Thank you for your help!
lm.beta works on lm objects and adds standardized coefficients. It was not, however, built to work on mira objects.
Have you considered using scale on the data before you build a model, effectively getting standardized coefficients?
Instead of standardizing the data before imputation, you could also apply it with post processing during imputation.
I am not sure which of these would be the most robust option.
require(mice)
# non-standardized
imp <- mice(nhanes2)
pool(with(imp,lm(chl ~ bmi)))
# standardized
imp_scale <- mice(scale(nhanes2[,c('bmi','chl')]))
pool(with(imp_scale,lm(chl ~ bmi)))
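The idea that scaling the data first yields standardized coefficients can be checked on complete data without mice. This is a small base R sketch using the built-in mtcars data as an assumed stand-in; it does not address how pooling across imputations interacts with scaling.

```r
# Unstandardized fit on the raw data.
fit_raw <- lm(mpg ~ wt, data = mtcars)

# Standardized slope computed by hand: b * sd(x) / sd(y).
beta_manual <- coef(fit_raw)["wt"] * sd(mtcars$wt) / sd(mtcars$mpg)

# Same model on scaled (centered, unit-variance) variables.
scaled <- as.data.frame(scale(mtcars[, c("mpg", "wt")]))
fit_scaled <- lm(mpg ~ wt, data = scaled)
beta_scaled <- coef(fit_scaled)["wt"]

print(c(manual = unname(beta_manual), scaled = unname(beta_scaled)))
```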

I get an error with functions of nlme package in R

I am trying to fit a linear growth model (LGM) in R. I understand that the first steps are to fit a null model with time as a predictor of my dependent variable Y allowing for random effects, fit a second null model without random effects, and then compare the two to see whether the random effect is strong enough to justify using the model with a random intercept.
I managed to fit the model with random intercept with the lmer function of the lme4 package, but I can't find a function in that package that allows me to fit a model without random intercept.
I have tried to fit models both with a random intercept (lme function) and without (gls function) using the nlme package, but neither has worked for me.
My original code was:
library(nlme)
LMModel <- lme(Y~Time, random=~Time| ID, data=dataset,
method="ML")
and running that, I got an error saying "missing values in object" (apparently referring to my Time variable). I thus added a transformation of my dataset into a matrix with "matr <- as.matrix(dataset)" and added the missing data management part to my code, which ended up being:
LMModel <- lme(Y~Time, random=~Time| ID, data=dataset,
method="ML", na.action = na.exclude(matr))
Running this, I get the error: ' could not find function "1" '
I further tried to fit a model with no random effect with the gls function of nlme and got the exact same error.
I feel quite lost as I can't seem to figure out what that function 1 means. Any ideas of what might be happening here?
Thanks a lot in advance for the help!
Federico
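A plausible cause, offered as an assumption since the thread has no answer: na.action expects a function (or its name), such as na.exclude, not the result of calling that function on a matrix. The sketch below uses the Orthodont data that ships with nlme as a stand-in for the asker's dataset, with two missing values injected for illustration, and fits a random-intercept model and a model without random effects so that the two can be compared.

```r
library(nlme)

# Stand-in data: Orthodont from nlme, with missing values added on purpose.
d <- as.data.frame(Orthodont)
d$distance[c(3, 10)] <- NA

# Random-intercept model; na.action is given the function itself.
m_ri <- lme(distance ~ age, random = ~ 1 | Subject, data = d,
            method = "ML", na.action = na.exclude)

# Same fixed effects without any random effect.
m_fixed <- gls(distance ~ age, data = d,
               method = "ML", na.action = na.exclude)

# Likelihood-ratio comparison of the two fits.
cmp <- anova(m_ri, m_fixed)
print(cmp)
```

Both models are fitted with method = "ML" on the same rows, which keeps the likelihood comparison meaningful.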

How do I plot predictions from new data fit with gee, lme, glmer, and gamm4 in R?

I have fit my discrete count data using a variety of functions for comparison. I fit a GEE model using geepack, a linear mixed effect model on the log(count) using lme (nlme), a GLMM using glmer (lme4), and a GAMM using gamm4 (gamm4) in R.
I am interested in comparing these models and would like to plot the expected (predicted) values for a new set of data (predictor variables). My goal is to compare the predicted effects for each model under particular conditions (x variables). Of particular interest is the comparison between marginal (GEE) and conditional estimates.
I think my main problem might be getting the new data in the correct form with the correct labels and attributes and such. I am still very much an R novice and struggle with this stuff (no course on this at my university unfortunately).
I currently have fitted models
gee1 lme1 lmer1 gamm1
and can extract their fixed effect coefficients and standard errors without a problem. I also don't have a problem converting them from the log scale or estimating confidence intervals accounting for the random effects.
I also have my new dataframe newdat which has 365 observations of 23 variables (average environmental data for each day of the year).
I am stuck on how to predict new count estimates from this. I played around with the model.matrix function but couldn't get it to work. For example, I tried:
mm = model.matrix(terms(glmm1), newdat) # Error in model.frame.default(object,
# data, xlev = xlev) : object is not a matrix
newdat$pcount = mm %*% fixef(glmm1)
Any suggestions or good references would be greatly appreciated. Can anyone help with the error above?
Getting predictions for lme() and lmer() is documented on http://glmm.wikidot.com/faq
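The design-matrix approach from the question can be sketched as follows. This is an illustration under assumptions: a plain Poisson glm with coef stands in for glmer with fixef, and the variable names are hypothetical. The key step is dropping the response from the terms object, so that newdat does not need a count column.

```r
set.seed(42)
dat <- data.frame(temp = runif(100, 0, 20))
dat$count <- rpois(100, exp(0.5 + 0.1 * dat$temp))

# Stand-in model; with glmer the coefficients would come from fixef().
fit <- glm(count ~ temp, family = poisson, data = dat)

# New predictor values to predict for (no response column needed).
newdat <- data.frame(temp = seq(0, 20, length.out = 5))

# Build the fixed-effects design matrix from the terms without the response.
mm <- model.matrix(delete.response(terms(fit)), newdat)

# Linear predictor on the log scale, then back-transform to counts.
eta <- drop(mm %*% coef(fit))
newdat$pcount <- exp(eta)
print(newdat)
```

For a mixed model, predictions built this way use only the fixed effects, so they describe a typical group with random effects set to zero.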

Regression evaluation in R

Are there any utilities/packages for showing various performance metrics of a regression model on some labeled test data? Basic stuff I can easily write like RMSE, R-squared, etc., but maybe with some extra utilities for visualization, or reporting the distribution of prediction confidence/variance, or other things I haven't thought of. This is usually reported in most training utilities (like caret's train), but only over the training data (AFAICT). Thanks in advance.
This question is really quite broad and should be focused a bit, but here's a small subset of functions written to work with linear models:
x <- rnorm(100)
y <- rnorm(100)
model <- lm(x~y)
#general summary
summary(model)
#Visualize some diagnostics
plot(model)
#Coefficient values
coef(model)
#Confidence intervals
confint(model)
#predict values
predict(model)
#predict new values
predict(model, newdata = data.frame(y = 1:10))
#Residuals
resid(model)
#Standardized residuals
rstandard(model)
#Studentized residuals
rstudent(model)
#AIC
AIC(model)
#BIC
BIC(model)
#Cook's distance
cooks.distance(model)
#DFFITS
dffits(model)
#lots of measures related to model fit
influence.measures(model)
Bootstrap confidence intervals for parameters of models can be computed using the recommended package boot. It is a very general package requiring you to write a simple wrapper function to return the parameter of interest, say fit the model with some supplied data and return one of the model coefficients, whilst it takes care of the rest, doing the sampling and computation of intervals etc.
Consider also the caret package, which is a wrapper around a large number of modelling functions, but also provides facilities to compare model performance using a range of metrics using an independent test set or a resampling of the training data (k-fold, bootstrap). caret is well documented and quite easy to use, though to get the best out of it, you do need to be familiar with the modelling function you want to employ.
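For the held-out evaluation the question asks about, the basic metrics are short enough to compute directly. Here is a minimal base R sketch on simulated data; the split sizes and variable names are assumptions for illustration.

```r
set.seed(123)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Hold out 50 rows as labeled test data.
train_idx <- sample(n, 150)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)
err  <- test$y - pred

# Metrics computed on the test set only.
metrics <- c(
  RMSE = sqrt(mean(err^2)),
  MAE  = mean(abs(err)),
  R2   = 1 - sum(err^2) / sum((test$y - mean(test$y))^2)
)
print(round(metrics, 3))

# Simple visual check: predicted against observed values.
plot(test$y, pred, xlab = "observed", ylab = "predicted")
abline(0, 1)
```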

Resources