MLR, example dependent cost of misclassification, makeCostSensWeightedPairsWrapper - r

This question has been seen 74 times and has received only one response (as of noon (PDT) Wed, Aug-14).
I've rewritten the question to make it as clear as possible and I'll appreciate any help.
As a summary, I need a small but complete example on a dataset with binary response on how to use MLR's makeCostSensWeightedPairsWrapper to obtain prediction probabilities on a test set.
In the MLR tutorial in the part on cost-sensitive classification
https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
there is a paragraph on "Example-dependent misclassification costs" and an example is given based on the iris dataset.
In the code snippet below, I modified the iris data set so as to contain only two classes as I'm interested in binary classification only.
library( mlr )
set.seed( 12347 )
n1 = 100; ntrain = 70
df = iris[ 1:n1, ] # 100 points in df so as to have two classes only (setosa and versicolor)
df$Species = factor( df$Species ) # refactor the response
# partition df into a training set (70 points) and test set (30 points)
#
ix = sample( 1:n1, ntrain, replace=FALSE )
xtest = df[ setdiff( 1:n1, ix ), ] ## test set
ntest = nrow( xtest )
xtrain = df[ ix, ] # this is the training set
# create cost matrix, same as in the MLR example
#
cost = matrix(runif(ntrain * 2, 0, 2000), ntrain) * (1 - diag(2))[xtrain$Species,] + runif(ntrain, 0, 10)
colnames(cost) = levels(xtrain$Species)
rownames(cost) = rownames(xtrain)
xtrain$Species = NULL # this is done according to the MLR example
# cost-sensitive task
#
costsens.task = makeCostSensTask(id = "xtrain", data = xtrain, cost = cost )
costsens.task
##lrn = makeLearner("classif.multinom", trace = FALSE, predict.type="prob" )
lrn = makeLearner( "classif.gbm", predict.type="prob" )
lrn = makeCostSensWeightedPairsWrapper( lrn ); lrn
mod = train(lrn, costsens.task ); mod
pred = predict( mod, newdata = xtest, pred.type="prob" );
perf = performance( pred, measures = list(auc), task = costsens.task)
# I get the following error:
# Error in FUN(X[[i]], ...) :
# You need to have a 'truth' column in your pred object for measure auc!
My original project is to do a binary classification which incorporates example-dependent misclassification costs.
The goal is to do a prediction on a test dataset, obtain the probabilities and show the performance (using ROCR, for which there are MLR-mapping functions).
NOTE: The learners I've tried are 'classif.multinom' and (I guessed) 'classif.gbm' as the two that might be compatible with the weighted pairs wrapper.
My questions are:
Q1: Where in the code snippet and how to specify that I want probabilities as an output of the cost-sensitive classifier?
Q2: Which learner can be used so as to produce classification probabilities?
Q3: How to avoid the error above and get the class probabilities?
Once again, I'd really appreciate any help, even more so if there is anyone who can answer promptly.

OK, after a number of days and almost a hundred views of this question (about 20 are mine :) there has been only one comment and no answers.
From what I could understand exploring some of the available MLR documentation, it seems that the output of example-based cost-sensitive method (makeCostSensWeightedPairsWrapper) is labels only and no prediction probabilities.
In other words, no probabilities are available from a cost-sensitive task, only the new labels are given, which are in turn computed based on the probabilities of the base classifier.
So, this the answer which I'll accept.
As for MLR errors, in this case at least, it would be helpful to get an explicit error message instead of a spurious one, or simply to note this in the documentation.

Related

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to stackoverflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data I was able to get your model to run by setting rescale = TRUE. The documentation states
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So, one solution maybe is just to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
ElUnder = sample(0:1, 1000, replace = TRUE),
NHISPID = paste0("id", 1:1000),
SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(), # this family avoids warnings
rescale=TRUE) # Weights rescaled to the sum of the sample size.
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking code for this function's method with 'survey:::svyglm.survey.design', it seems like there may be a bug. I could be wrong, but by my read when 'rescale' is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a work around if you assign a vector of numeric values to .survey.prob.weights in the global environment. No idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs to be double the length of the data.)
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(),
rescale=FALSE)
summary(El_under_glm, correlation = TRUE)

How to Create a loop (when levels do not overlap the reference)

I have written some code in R. This code takes some data and splits it into a training set and a test set. Then, I fit a "survival random forest" model on the training set. After, I use the model to predict observations within the test set.
Due to the type of problem I am dealing with ("survival analysis"), a confusion matrix has to be made for each "unique time" (inside the file "unique.death.time"). For each confusion matrix made for each unique time, I am interested in the corresponding "sensitivity" value (e.g. sensitivity_1001, sensitivity_2005, etc.). I am trying to get all these sensitivity values : I would like to make a plot with them (vs unique death times) and determine the average sensitivity value.
In order to do this, I need to repeatedly calculate the sensitivity for each time point in "unique.death.times". I tried doing this manually and it is taking a long time.
Could someone please show me how to do this with a "loop"?
I have posted my code below:
#load libraries
library(survival)
library(data.table)
library(pec)
library(ranger)
library(caret)
#load data
data(cost)
#split data into train and test
ind <- sample(1:nrow(cost),round(nrow(cost) * 0.7,0))
cost_train <- cost[ind,]
cost_test <- cost[-ind,]
#fit survival random forest model
ranger_fit <- ranger(Surv(time, status) ~ .,
data = cost_train,
mtry = 3,
verbose = TRUE,
write.forest=TRUE,
num.trees= 1000,
importance = 'permutation')
#optional: plot training results
plot(ranger_fit$unique.death.times, ranger_fit$survival[1,], type = 'l', col = 'red') # for first observation
lines(ranger_fit$unique.death.times, ranger_fit$survival[21,], type = 'l', col = 'blue') # for twenty first observation
#predict observations test set using the survival random forest model
ranger_preds <- predict(ranger_fit, cost_test, type = 'response')$survival
ranger_preds <- data.table(ranger_preds)
colnames(ranger_preds) <- as.character(ranger_fit$unique.death.times)
From here, another user (Justin Singh) from a previous post (R: how to repeatedly "loop" the results from a function?) suggested how to create a loop:
sensitivity <- list()
for (time in names(ranger_preds)) {
prediction <- ranger_preds[which(names(ranger_preds) == time)] > 0.5
real <- cost_test$time >= as.numeric(time)
confusion <- confusionMatrix(as.factor(prediction), as.factor(real), positive = 'TRUE')
sensitivity[as.character(i)] <- confusion$byclass[1]
}
But due to some of the observations used in this loop, I get the following error:
Error in confusionMatrix.default(as.factor(prediction), as.factor(real), :
The data must contain some levels that overlap the reference.
Does anyone know how to fix this?
Thanks
Certain values in prediction and/or real have only 1 unique value in them. Make sure the levels of the factors are the same.
sapply(names(ranger_preds), function(x) {
prediction <- factor(ranger_preds[[x]] > 0.5, levels = c(TRUE, FALSE))
real <- factor(cost_test$time >= as.numeric(x), levels = c(TRUE, FALSE))
confusion <- caret::confusionMatrix(prediction, real, positive = 'TRUE')
confusion$byClass[1]
}, USE.NAMES = FALSE) -> result
result

How to predict in kknn function? library(kknn)

I try to use kknn + loop to create a leave-out-one cross validation for a model, and compare that with train.kknn.
I have split the data into two parts: training (80% data), and test (20% data). In the training data, I exclude one point in the loop to manually create LOOCV.
I think something gets wrong in predict(knn.fit, data.test). I have tried to find how to predict in kknn through the kknn package instruction and online but all the examples are "summary(model)" and "table(validation...)" rather than the prediction on a separate test data. The code predict(model, dataset) works successfully in train.kknn function, so I thought I could use the similar arguments in kknn.
I am not sure if there is such a prediction function in kknn. If yes, what arguments should I give?
Look forward to your suggestion. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
# train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
stop("invalid type for prediction")) : EXPR must be a length 1
vector
Actually I try to compare train.kknn and kknn+loop to compare the results of the leave-out-one CV. I have two more questions:
1) in kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) in train.kknn: I split the data and use 80% of the whole data and intend to use the rest 20% for prediction. Is it an correct common practice?
2) Or should I just use the original data (the whole data set) for train.kknn, and create a loop: data[-i,] for training, data[i,] for validation in kknn? So they will be the counterparts?
I find that if I use the training data in the train.kknn function and use prediction on test data set, the best k and kernel are selected and directly used in generating the predicted value based on the test dataset.
In contrast, if I use kknn function and build a loop of different k values, the model generates the corresponding prediction results based on
the test data set each time the k value is changed. Finally, in kknn + loop, the best k is selected based on the best actual prediction accuracy rate of test data. In short, the best k train.kknn selected may not work best on test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
The predict command also works on objects returned by train.knn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 10,
kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by #vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 40,
kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
# pred.kknn
# pred.train.kknn 1 2
# 0 374 14
# 1 9 103

vegan::ordiR2step() doesn't find best-fit model

The vegan package includes the ordiR2step() function for model building, which can be used to identify the most important variables using the R2 and the p-value as goodness of fit measures. However for the dataset I was recently working with the function doesn't provide the best-fit model.
# data
RIKZ <- read.table("http://www.uni-koblenz-landau.de/en/campus-landau/faculty7/environmental-sciences/landscape-ecology/Teaching/RIKZ_data/at_download/file", header = TRUE)
# data preparation
Species <- RIKZ[ ,2:5]
ExplVar <- RIKZ[ , 9:15]
Species_fin <- Species[ rowSums(Species) > 0, ]
ExplVar_fin <- ExplVar[ rowSums(Species) > 0, ]
# rda
RIKZ_rda <- rda(Species_fin ~ . , data = ExplVar_fin, scale = TRUE)
# stepwise model building: ordiR2step()
require(vegan)
step_both_R2 <- ordiR2step(rda(Species_fin ~ salinity, data = ExplVar_fin, scale = TRUE),
scope = formula(RIKZ_rda),
direction = "both", R2scope = TRUE, Pin = 0.05,
steps = 1000)
Why does ordiR2step() not add the variable exposure to the model, although it would increase the explained variance?
If R2scope is set FALSE and the p-value criterion is increased (Pin = 0.15) it adds the variable exposure corretly but throws the following error:
Error in terms.formula(tmp, simplify = TRUE) :
invalid model formula in ExtractVars
If R2scope is set TRUE (Pi = 0.15) exposure is not added.
Note: This might seem more as a statistic question and therefore more suitable for CV. However I think the problem is rather technical and better off here on SO.
Please read the ordiR2step documentation: it will tell you why exposure is not added to the model. The help page tells that ordiR2step has three stopping criteria. The second criterion is that "the adjusted R2 of the ‘scope’ is exceeded". This happens with exposure and therefore it was not added. This second criterion will be ignored if you set R2scope = FALSE (also documented). So the function works like documented.

R Neural Net Issues

Here is the updated code. My issue is with the output of "results". I'll post below as the format for readability.
library("neuralnet")
library("ggplot2")
setwd("C:/Users/Aaron/Documents/UMUC/R/Data For Assignments")
trainset <- read.csv("SOTS.csv")
head(trainset)
## val data classification
str(trainset)
## building the neural network
risknet <- neuralnet(Overall.Risk.Value ~ Finance + Personnel + Information.Dissemenation.C, trainset, hidden = 10, lifesign = "minimal", linear.output = FALSE, threshold = 0.1)
##plot nn
plot(risknet, rep="best")
##import scoring set
score_set <- read.csv("SOSS.csv")
##select subsets-training and scoring match
score_test <- subset(score_set, select = c("Finance", "Personnel", "Information.Dissemenation.C"))
##display values of score_test
head(score_test)
##neural network compute function score_test and the neural net "risknet"
risknet.results <- compute(risknet, score_test)
##Actual value of Overall.Risk.Value variable wanting to predict. net.result = a matrix containing the overall result of the neural network
results <- data.frame(Actual = score_set$Overall.Risk.Value, Prediction = risknet.results$net.result)
results[1:14, ]
The output of results is not as expected. For instance, the actual data is a number between 5 and 8, whereas "Prediction" displays outputs of .9995...for each result.
Thanks again for the help.
This is how you train and predict:
Use training data to learn model parameters (the variable risknet in your case)
Use parameters to predict scores on test data
Here is an example very much similar to yours that explains how this is done.
The default activation function in neuralnet is "logistic". When linear.output is set as FALSE, it ensures that the output is mapped by the activation function to the interval [0,1].(R_Journal (neuralnet)- Frauke Günther)
I just updated linear.output=TRUE in your code and final result looks much better.
Thanks for the help!

Resources