Random forest with ranger: predictions on non-continuous data - R

I'm creating a random forest model with ranger in R that trains on MNIST pixel data and predicts the number the pixels represent (0-9). After creating my model with the following code and looking at the predictions, many of my predicted numbers are continuous (decimals).
set.seed(1031)
forestranger <- ranger(Y ~ ., data = mnisttrain, num.trees = 1000)
predictions <- predict(forestranger, data = mnisttest, num.trees = 1000)
Is there a way to ensure my predictions are only a whole number between 0 and 9 and not continuous?

Converting the response variable to a factor seems to have solved the issue
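A minimal, self-contained sketch of that fix, using toy data in place of the original MNIST frames (which are not reproduced here): once the response is a factor, ranger fits a classification forest and the predictions are discrete class labels rather than averaged numbers.

```r
library(ranger)

# Toy stand-in for the MNIST data (hypothetical columns x1, x2, label Y)
set.seed(1031)
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$Y <- ifelse(toy$x1 + toy$x2 > 1, 1, 0)

# Key step: make the response a factor BEFORE fitting,
# so ranger does classification instead of regression
toy$Y <- as.factor(toy$Y)

fit  <- ranger(Y ~ ., data = toy, num.trees = 100)
pred <- predict(fit, data = toy)
is.factor(pred$predictions)   # TRUE: discrete class labels, no decimals
```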

Related

R multilevel mediation with uneven sample sizes in Y and M models

I'm trying to run a multilevel mediation analysis in R.
I get the error: Error in mediate(model.M, model.Y, treat = "treat", mediator = mediator, data=data):
number of observations do not match between mediator and outcome models
Models M and Y are multilevel lme4 models, and there are uneven sample sizes in these models. Is there anything I can do to run this analysis? Will it really only run if I have the same sample sizes in each model?
Fit the model with fewer observations first (which is, I guess, model.Y, because that model has more predictors and is therefore more likely to have missing values), then use the model frame from that model as the data for the second model:
model.M <- lmer(..., data = model.Y@frame)
That should work.
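A hedged sketch of that approach with simulated data (the variable names treat, M, Y, x, and group are placeholders, not the asker's actual columns): fitting model.Y first and reusing its model frame guarantees both models are estimated on exactly the same rows.

```r
library(lme4)

# Simulated multilevel data with missingness only in a Y-model predictor
set.seed(42)
dat <- data.frame(group = factor(rep(1:10, each = 20)),
                  treat = rbinom(200, 1, 0.5),
                  x     = rnorm(200))
dat$M <- 0.5 * dat$treat + rnorm(200)
dat$Y <- dat$M + dat$treat + dat$x + rnorm(200)
dat$x[sample(200, 15)] <- NA   # these rows drop out of model.Y only

# Fit the model that loses more rows first...
model.Y <- lmer(Y ~ treat + M + x + (1 | group), data = dat)
# ...then fit the mediator model on exactly those rows
model.M <- lmer(M ~ treat + (1 | group), data = model.Y@frame)

nobs(model.M) == nobs(model.Y)   # TRUE: matching sample sizes
```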

R - caret::train "random forest" parameters

I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() fx within the caret package. I'm using the random forest method and am returning 0.999 Accuracy on my training set, however when I use the model to predict, it classifies each test observation as the same class (i.e. each of the 20 observations are classified as "1's" out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting exact code) but I'm not sure what is happening.
My question is that when I call the final model of fit (fit$finalModel), it says it made 500 total trees (default and expected), but the number of variables tried at each split is 35. I know that for classification, the default number of variables tried at each split (mtry) is the square root of the total number of predictors (therefore sqrt(60) = 7.7, call it 8). Could this be the problem?
I'm confused on whether there is something wrong with my model or my data cleaning, etc.
set.seed(10000)
fitControl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)
fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 41
OOB estimate of error rate: 0.01%
For the final project in the Johns Hopkins Practical Machine Learning course on Coursera, a random forest will generate the same prediction for all 20 quiz test cases if students fail to remove independent variables that have more than 50% NA values.
SOLUTION: remove variables with a high proportion of missing values from the model.
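A base-R sketch of that cleaning step (the column names here are made up for illustration):

```r
# Simulated training frame where column b is 70% missing
set.seed(1)
training <- data.frame(a = rnorm(100),
                       b = rnorm(100),
                       y = sample(1:5, 100, replace = TRUE))
training$b[sample(100, 70)] <- NA

# Drop any predictor with more than 50% NA values
na_frac  <- colMeans(is.na(training))
training <- training[, na_frac <= 0.5]
names(training)   # "a" "y" -- b has been removed
```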

H2o random forest plot on r

I'm new to h2o and I'm having difficulty with this package on r.
I'm using training and test sets of 5100 and 2300 observations respectively, with 18917 variables and a binary target (0/1).
I ran a random forest:
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
forest <- h2o.randomForest(x = Words,
                           y = 18918,
                           training_frame = train_h2o,
                           ntrees = 250,
                           validation_frame = test_h2o,
                           seed = 8675309)
I know I can get a plot of logloss or MSE and so on as the number of trees changes.
But is there a way to plot an image of the model itself? I mean the final ensemble of trees used for the final predictions?
Also, another question: in the randomForest package I could use the varImp function, which returned, as well as the absolute importance, the class-specific measures (computed as mean decrease in accuracy), which I interpreted as a class-relative measure of variable importance.
varImp matrix, randomForest package:
In h2o package I only find the absolute importance measure, is there something similar?
There is no final tree at the end of a random forest in R with the randomForest package. To make the final prediction, a random forest uses majority voting. That means, for any observation:
the vote for Class 0 is the number of trees that predict Class 0 divided by the total number of trees in the forest;
the vote for Class 1 is likewise the number of trees that predict Class 1 divided by the total number of trees in the forest.
However, you can use ctree to fit and plot a single tree:
library("party")
x <- ctree(Class ~ ., data = data)
plot(x)
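The voting rule described above can be sketched in base R (the per-tree predictions here are made up for illustration):

```r
# Hypothetical predictions from 7 trees for a single observation
tree_preds <- c(0, 1, 0, 0, 1, 0, 1)

# Fraction of trees voting for each class
vote_0 <- sum(tree_preds == 0) / length(tree_preds)  # 4/7
vote_1 <- sum(tree_preds == 1) / length(tree_preds)  # 3/7

# The forest's final prediction is the majority class
final <- ifelse(vote_0 >= vote_1, 0, 1)
final   # 0
```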

The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R

While using the predict function in R to get the predictions from a Random Forest model, I misspecified the training data as newdata as follows:
RF1pred <- predict(RF1, newdata=TrainS1, type = "class")
Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there.
If someone could elaborate, I will be grateful.
Thank you!
EDIT: Important to note: I get sensible accuracy and AUC if I run the prediction without specifying a dataset altogether, like so:
RF1pred <- predict(RF1, type = "class")
If a new dataset is not explicitly specified, isn't the training data used for prediction? Hence, shouldn't I get the same results from both lines of code?
EDIT2: Here is a sample code with random data that illustrates the point. When predicting without specifying newdata, the AUC is 0.4893. When newdata=train is explicitly specified, the AUC is 0.7125.
# Generate sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=T), x2=rpois(100,10), y=sample(0:1, 100, replace=T))
# Build random forest
library(randomForest)
model <- randomForest(x1 ~ x2, data=train)
pred1 <- predict(model)
pred2 <- predict(model, newdata = train)
# Calculate AUC
library(ROCR)
ROCRpred1 <- prediction(pred1, train$x1)
AUC <- as.numeric(performance(ROCRpred1, "auc")@y.values)
AUC # 0.4893
ROCRpred2 <- prediction(pred2, train$x1)
AUC <- as.numeric(performance(ROCRpred2, "auc")@y.values)
AUC # 0.7125
If you look at the documentation for predict.randomForest you will see that if you do not supply a new data set you will get the out-of-bag (OOB) performance of the model. Since the OOB performance is theoretically related to the performance of your model on a different data set, the results will be much more realistic (although still not a substitute for a real, independently collected, validation set).

How to get the nodal raw numbers (from all the trees for a particula test vector) from which random forest calculates the prediction in R?

I'd like to predict a distribution rather than a single number using random forest regression in R. To do this, I'd like to get all the numbers from which random forest calculates (averages) the predicted value for a particular test vector. How can I get this done?
To be specific,
I'm not growing each tree to its full size, but limiting the size using the nodesize parameter. In this case, I'm interested not in the prediction of each tree in the forest (which is given by setting predict.all to TRUE), but in all the data points from which this prediction is calculated; that is, all the data points from the node on which a new observation lands, across all the trees in the forest.
Thanks,
The function predict.randomForest has a boolean parameter predict.all exactly for this purpose.
library("randomForest")
rf = randomForest(Species ~ ., data = iris)
?predict.randomForest
allpred = predict(rf, newdata = iris, predict.all = TRUE)
Now allpred$individual is a matrix whose columns correspond to the individual decision trees.
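Building on that, each row of the per-tree prediction matrix can be treated as an approximate predictive distribution for that observation. A base-R sketch on a made-up individual-style matrix (not output from an actual forest):

```r
# Hypothetical allpred$individual for a regression forest:
# rows = test observations, columns = trees
individual <- matrix(c(5.1, 5.3, 4.9, 5.2, 5.0,
                       6.0, 6.4, 5.8, 6.1, 6.2),
                     nrow = 2, byrow = TRUE)

point_pred <- rowMeans(individual)        # what predict() averages to
spread     <- apply(individual, 1, quantile, probs = c(0.1, 0.5, 0.9))

point_pred   # 5.1 6.1
spread       # per-observation quantiles of the tree-level predictions
```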
