Ranger Predicted Class Probability of each row in a data frame - r

With regard to this link, Predicted probabilities in R ranger package, I have a question.
Imagine I have a mixed data frame, df (comprising factor and numeric variables), and I want to do classification using ranger. I split this data frame into train and test sets as Train_Set and Test_Set. BiClass is my prediction target, a factor variable with two levels, 0 and 1.
I want to calculate and attach class probabilities to the data frame using ranger using the following commands:
Biclass.ranger <- ranger(BiClass ~ ., data=Train_Set, num.trees = 500, importance="impurity", save.memory = TRUE, probability=TRUE)
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 200, type='response', verbose = TRUE)$predictions)
The result, probabilities, is a data frame with 2 columns (0 and 1) and as many rows as Test_Set.
Does this mean that if I append probabilities to Test_Set as the last two columns, each row shows the probability of that row being either 0 or 1? Is my understanding correct?
My second question: when I attempt to calculate the confusion matrix through
pred = predict(Biclass.ranger, data=Test_Set, num.trees = 500, type='response', verbose = TRUE)
table(Test_Set$BiClass, pred$predictions)
I get the following error:
Error in table(Test_Set$BiClass, pred$predictions) :
all arguments must have the same length
What am I doing wrong?

For your first question: yes, it shows the probability of each row being 0 or 1. Using the example below:
library(ranger)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
mdl <- ranger(Species ~ ., data=Train_Set, importance="impurity", save.memory = TRUE, probability=TRUE)
probabilities <- as.data.frame(predict(mdl, data = Test_Set,type='response', verbose = TRUE)$predictions)
We can always check whether they agree:
par(mfrow=c(1,2))
boxplot(probabilities[,"0"] ~ Test_Set$Species,ylab="Prob 0",xlab="Actual label")
boxplot(probabilities[,"1"] ~ Test_Set$Species,ylab="Prob 1",xlab="Actual label")
Not the best plot, but if the labels were flipped you would see something weird here. Next, we need to find the column with the maximum probability in each row and assign the label; for this we do:
max.col(probabilities) - 1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0
This goes through each row of probabilities and returns 1 or 2, depending on which column holds the maximum probability; we simply subtract 1 to map that back to the 0/1 labels. For the confusion matrix:
caret::confusionMatrix(table(max.col(probabilities) - 1,Test_Set$Species))
Confusion Matrix and Statistics
0 1
0 31 2
1 0 17
Accuracy : 0.96
95% CI : (0.8629, 0.9951)
No Information Rate : 0.62
P-Value [Acc > NIR] : 2.048e-08
As for the error in your second question: with probability=TRUE, pred$predictions is a matrix with one column per class rather than a vector of predicted labels, so table(Test_Set$BiClass, pred$predictions) fails with the length error. In your case, you can just do:
confusionMatrix(table(max.col(probabilities)-1,Test_Set$BiClass))
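If you prefer not to rely on column order, a safer sketch (reusing your Test_Set and the probabilities data frame from above) picks the label from the column names and attaches both probability columns to the test set:
# Label each row by the name of its highest-probability column
pred_label <- factor(colnames(probabilities)[max.col(probabilities)],
                     levels = levels(Test_Set$BiClass))
# Appending the probabilities: these two columns are exactly the per-row
# probabilities of class 0 and class 1
Test_Set_probs <- cbind(Test_Set, prob_0 = probabilities[, "0"], prob_1 = probabilities[, "1"])
caret::confusionMatrix(table(pred_label, Test_Set$BiClass))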

Related

Error: BoxCox error during preprocess imputation R language

I am working through the answers to the Chapter 6, exercise 3 questions from the Applied Predictive Modeling book by Max Kuhn, and I am getting an error in the imputation prediction step (despite following their answer exactly). The reproducible code and problem are as follows:
library(AppliedPredictiveModeling)
library(caret)
library(RANN)
data(ChemicalManufacturingProcess)
predictors <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select="Yield")
# Impute
#Split data into training and test sets
set.seed(517)
trainingRows <- createDataPartition(yield$Yield,
p = 0.7,
list = FALSE)
trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]
#Pre-process trainPredictors and apply to trainPredictors and testPredictors
pp <- preProcess(trainPredictors,method=c("BoxCox","center","scale","knnImpute"))
ppTrainPredictors <- predict(pp,newdata=trainPredictors)
ppTestPredictors <- predict(pp,newdata=testPredictors) # This results in an error
The error it gives is: Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, : NA/NaN/Inf in foreign function call (arg 2)
When I use the YeoJohnson transformation instead, it seems to work (which I read is able to handle non-positive numbers).
However, I don't understand why it isn't working on the test data, seeing as it's just a different subset of the same data, and it's only used for the imputation step of the question.
I was unable to find any answers to this, which seems strange, as surely other people who have followed the book would have noticed? Or am I being thick?
Thank you
Thank you
You get that error because the Box-Cox transformation does not accept zero. If you check the help page for BoxCoxTrans, it says:
If any(y <= 0) or if length(unique(y)) < numUnique, lambda is not
estimated and no transformation is applied.
So if your preProcess() is run on a train set with no zeros in a column, the Box-Cox transformation is estimated and applied for that column, but it will then fail on a test set that has zeros in the same column.
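A minimal sketch of the failure mode (illustrative only; the estimated lambda depends on the sample):
library(caret)
set.seed(1)
x_train <- rlnorm(100)      # strictly positive and skewed, so a lambda is estimated
bc <- BoxCoxTrans(x_train)
bc$lambda                   # close to 0 here, i.e. the transform is log(x)
predict(bc, c(1, 2, 0))     # the 0 becomes log(0) = -Inf, which later breaks
                            # RANN::nn2 inside the knnImpute step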
In the book example, the seed was most likely set under an older version of R: R 3.6.0 changed the default sample() algorithm, so the same seed now produces a different partition. Checking with your example:
cbind(colSums(trainPredictors==0,na.rm=TRUE),colSums(testPredictors==0,na.rm=TRUE))
[,1] [,2]
BiologicalMaterial01 0 0
BiologicalMaterial02 0 0
BiologicalMaterial03 0 0
BiologicalMaterial04 0 0
BiologicalMaterial05 0 0
BiologicalMaterial06 0 0
BiologicalMaterial07 0 0
BiologicalMaterial08 0 0
BiologicalMaterial09 0 0
BiologicalMaterial10 0 0
BiologicalMaterial11 0 0
BiologicalMaterial12 0 0
ManufacturingProcess01 1 2
ManufacturingProcess02 29 6
ManufacturingProcess03 0 0
ManufacturingProcess04 0 0
ManufacturingProcess05 0 0
ManufacturingProcess06 0 0
ManufacturingProcess07 0 0
ManufacturingProcess08 0 0
ManufacturingProcess09 0 0
ManufacturingProcess10 0 0
ManufacturingProcess11 0 0
ManufacturingProcess12 104 38
ManufacturingProcess13 0 0
ManufacturingProcess14 0 0
ManufacturingProcess15 0 0
ManufacturingProcess16 1 0
ManufacturingProcess17 0 0
ManufacturingProcess18 1 0
You can see that ManufacturingProcess16 and ManufacturingProcess18 will give you problems.
The Yeo-Johnson transformation can deal with zeros and negative values, so it does not have this problem.
If you would like to carry on with the worked example, you can try another seed and re-check the zero counts:
set.seed(42) # illustrative; pick any seed whose split avoids the problem columns
trainingRows <- createDataPartition(yield$Yield,
p = 0.7,
list = FALSE)
trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]
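Alternatively, if the goal is to reproduce the book's exact partition, you can restore the pre-3.6.0 sampling algorithm before seeding (a sketch, assuming the book's split was generated under R < 3.6.0):
RNGkind(sample.kind = "Rounding")   # emits a warning; this is the pre-3.6.0 behaviour
set.seed(517)
trainingRows <- createDataPartition(yield$Yield, p = 0.7, list = FALSE)
RNGkind(sample.kind = "Rejection")  # restore the modern default afterwards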

Confusion Matrix Error: Error: `data` and `reference` should be factors with the same levels

I am currently trying to build a neural network to predict what rank people within the data will achieve.
The rank system is: A, B, C, D, E.
Everything runs very smoothly until I get to my confusion matrix, where I get the error "Error: data and reference should be factors with the same levels.". I have tried many different methods from other posts but none seem to work.
The levels are the same in NNPredictions and test$Rank; I checked them both with table().
library(readxl)
library(caret)
library(neuralnet)
library(forecast)
library(tidyverse)
library(ggplot2)
Indirect <-read_excel("C:/Users/Abdulazizs/Desktop/Projects/Indirect/FIltered Indirect.xlsx",
n_max = 500)
Indirect$Direct_or_Indirect <- NULL
Indirect$parentaccount <- NULL
sum(is.na(Indirect))
counts <- table(Indirect$Rank)
barplot(counts)
summary(counts)
part2 <- createDataPartition(Indirect$Rank, times = 1, p = .8, list = FALSE, groups = min(5, length(Indirect$Rank)))
train <- Indirect[part2, ]
test <- Indirect[-part2, ]
set.seed(1234)
TrainingParameters <- trainControl(method = "repeatedcv", number = 10, repeats=10)
train <- as.data.frame(train) # note: assign the result, or the coercion is lost
test <- as.data.frame(test)
NNModel <- train(train[,-7], train$Rank,
method = "nnet",
trControl= TrainingParameters,
preProcess=c("scale","center"),
na.action = na.omit
)
NNPredictions <-predict(NNModel, test, type = "raw")
summary(NNPredictions)
confusionMatrix(NNPredictions, test$Rank)
length(NNPredictions)
length(test$Rank)
length(NNPredictions)
[1] 98
length(test$Rank)
[1] 98
table(NNPredictions, test$Rank, useNA="ifany")
NNPredictions A B C D E
A 1 0 0 0 0
B 0 6 0 0 0
C 0 0 11 0 0
D 0 0 0 18 0
E 0 0 0 0 62
Also, change type = "prob" to type = "raw" in the predict() call, and build the table first:
Table1 <- table(NNPredictions, test$Rank, useNA = "ifany")
cnf1 <- confusionMatrix(Table1)
Answer provided by dclarson.
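If the error persists, the usual cause is that test$Rank is a character column (read_excel does not create factors) while NNPredictions is a factor; a hedged sketch of the standard fix:
# Coerce both sides to factors over the same level set
lvls <- c("A", "B", "C", "D", "E")
caret::confusionMatrix(factor(NNPredictions, levels = lvls),
                       factor(test$Rank, levels = lvls))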

EM algorithm for multivariate t mixed models

I'm trying to implement an EM algorithm for family data where I'm assuming my observations have a multivariate t distribution. I have only two siblings per family, so all of the family groups have only two observations. Basically I'm trying to follow the E(C)M steps in this article:
https://pdfs.semanticscholar.org/9445/ef865c4eb1431f9cb2abdb5efc1c361172cc.pdf
However, now I'm not sure if EM is doable for this kind of data, since my correlation matrix Psi should be block diagonal for families.
So here's an R example of how my families are structured
fam_id = sort(rep(1:5, 2))
Z= matrix(0, nrow = length(fam_id), ncol = length(unique(fam_id)))
colnames(Z) = unique(fam_id)
k = 1
i = 1
# Random effects dummy matrix
while (k <= ncol(Z)) {
  Z[i:(i+1), k] = c(1, 1)
  k = k + 1
  i = i + 2
}
> Z
1 2 3 4 5
[1,] 1 0 0 0 0
[2,] 1 0 0 0 0
[3,] 0 1 0 0 0
[4,] 0 1 0 0 0
...
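As an aside, the same family dummy matrix can be built without the loop; an equivalent sketch:
Z2 <- model.matrix(~ 0 + factor(fam_id))   # one indicator column per family
colnames(Z2) <- unique(fam_id)
all(Z2 == Z)   # TRUE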
The EM algorithm chokes after the 5th iteration, complaining that the correlation matrix Psi cannot be inverted:
Error in solve.default(psi_hat) :
system is computationally singular
If anyone could shed some light to this, I'd be very happy!
Please check this answer on Cross Validated (stats.stackexchange.com):
https://stats.stackexchange.com/questions/76488/error-system-is-computationally-singular-when-running-a-glm
You are probably ending up with a non-invertible matrix in your 5th iteration.
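While you debug, a common mitigation (a sketch, not from the linked answer) is to check the condition of psi_hat before inverting and add a small ridge to the diagonal when it is numerically singular:
# Hypothetical helper: guard solve() against near-singular matrices
safe_solve <- function(m, eps = 1e-8) {
  if (rcond(m) < .Machine$double.eps) {
    m <- m + diag(eps, nrow(m))   # small ridge on the diagonal
  }
  solve(m)
}
# e.g. in the M-step: psi_inv <- safe_solve(psi_hat)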

How to read the indexes from the prediction output of predict.ranger, R

Using the ranger package I run the following script:
rf <- ranger(Surv(time, Y) ~ ., data = train_frame[1:50000, ], write.forest = TRUE, num.trees = 100)
test_frame <- train_frame[50001:100000, ]
preds <- predict(rf, test_frame)
chfs <- preds$chf
plot(chfs[1, ])
The cumulative hazard function has indices 1 to 36 on the x-axis. Obviously this corresponds to time, but I'm not sure how: my time-of-observation variable ranges from a minimum of 0 to a maximum of 399. What is the mapping between the original data and the predicted output from predict.ranger, and how can I operationalize this to quantify the degree of risk for a given subject after a given length of time?
Here's a sample of what my time/event data looks like:
Y time
<int> <dbl>
1 1 358
2 0 90
3 0 162
4 0 35
5 0 307
6 0 69
7 0 184
8 0 24
9 0 366
10 0 33
And here's what the CHF of the first subject looks like (plot not shown).
Can anyone help me connect the dots? There are no row or columns names on the "matrix" object that is preds$chf.
In the prediction object there is a vector called unique.death.times, containing the time points where the CHF and survival estimates are computed. The chf matrix has observations in the rows and these time points in the columns; the same holds for survival.
Reproducible example:
library(survival)
library(ranger)
## Split the data
n <- nrow(veteran)
idx <- sample(n, 2/3*n)
train <- veteran[idx, ]
test <- veteran[-idx, ]
## Grow RF and predict
rf <- ranger(Surv(time, status) ~ ., train, write.forest = TRUE)
preds <- predict(rf, test)
## Example CHF plot
plot(preds$unique.death.times, preds$chf[1, ])
## Example survival plot
plot(preds$unique.death.times, preds$survival[1, ])
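To quantify the risk for a given subject after a given length of time, you can treat each row of chf as a right-continuous step function over unique.death.times and look it up at time t. A sketch using the objects above (chf_at is a hypothetical helper):
chf_at <- function(preds, i, t) {
  idx <- findInterval(t, preds$unique.death.times)
  if (idx == 0) return(0)   # t falls before the first observed event time
  preds$chf[i, idx]
}
chf_at(preds, 1, 100)       # cumulative hazard for test subject 1 at t = 100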
Setting importance = "impurity" for survival forests should throw an error.

Modelling for zero using glm function in R

I am trying to build a logistic regression model using the glm function in R. My dependent variable is binary, with values 0 and 1 only; here 0 = Non-Return and 1 = Return.
I want to model the Non-Returns (0's), but R's glm function by default models the 1's. In SAS, proc logistic by default models the lower value and the descending option changes the order; do we have something similar in glm?
One option is to swap 0 and 1 in my raw data, but I don't want to change the raw data.
Please help me or guide me on how to do this in R.
Thanks in advance.
Just specify 1 - y as the DV:
set.seed(42)
y <- sample(c(0, 1), 10, TRUE)
#[1] 1 1 0 1 1 1 1 0 1 1
fit <- glm(y ~ 1, family = binomial)
coef(fit)
# (Intercept)
# 1.386294
log(mean(y) / (1 - mean(y)))
#[1] 1.386294
1 - y
#[1] 0 0 1 0 0 0 0 1 0 0
fit1 <- glm(1 - y ~ 1, family = binomial)
coef(fit1)
#(Intercept)
#-1.386294
log(mean(1 - y) / (1 - mean(1 - y)))
#[1] -1.386294
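Equivalently, if the response is stored as a factor you can reorder its levels, since glm models the probability of the second level (a sketch reusing y from above):
yf <- factor(y, levels = c("1", "0"))   # "0" is now the modelled event
fit2 <- glm(yf ~ 1, family = binomial)
coef(fit2)
#(Intercept)
# -1.386294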
Alternatively, you can temporarily transform your data by using...transform:
glm(data = transform(mydata, y = 1 - y), ...)   # mydata stands in for your data frame
