In an assignment, we are asked to perform a cross-validation on a CART model. I have tried using the cvFit function from cvTools but got a strange error message. Here's a minimal example:
library(rpart)
library(cvTools)
data(iris)
cvFit(rpart(formula=Species~., data=iris))
The error I'm seeing is:
Error in nobs(y) : argument "y" is missing, with no default
And the traceback():
5: nobs(y)
4: cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K,
R = R, foldType = foldType, folds = folds, names = names,
predictArgs = predictArgs, costArgs = costArgs, envir = envir,
seed = seed)
3: cvFit(call, data = data, x = x, y = y, cost = cost, K = K, R = R,
foldType = foldType, folds = folds, names = names, predictArgs = predictArgs,
costArgs = costArgs, envir = envir, seed = seed)
2: cvFit.default(rpart(formula = Species ~ ., data = iris))
1: cvFit(rpart(formula = Species ~ ., data = iris))
It looks like y is mandatory for cvFit.default. But:
> cvFit(rpart(formula=Species~., data=iris), y=iris$Species)
Error in cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K, :
'x' must have 0 observations
What am I doing wrong? Which package would allow me to do a cross-validation with a CART tree without having to code it myself? (I am sooo lazy...)
The caret package makes cross validation a snap:
> library(caret)
> data(iris)
> tc <- trainControl("cv",10)
> rpart.grid <- expand.grid(.cp=0.2)
>
> (train.rpart <- train(Species ~., data=iris, method="rpart",trControl=tc,tuneGrid=rpart.grid))
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.94 0.91 0.0798 0.12
Tuning parameter 'cp' was held constant at a value of 0.2
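If you also want the fold-by-fold numbers behind that summary, they are kept inside the fitted train object; caret stores the hold-out metrics in the resample element and provides a confusionMatrix method for train objects:
# per-fold hold-out Accuracy and Kappa for the 10 CV folds
train.rpart$resample
# cross-validated confusion matrix, averaged over the folds
confusionMatrix(train.rpart)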
Finally, I was able to get it to work. As Joran noted, the cost parameter needs to be adapted. In my case I am using 0/1 loss, which means a simple function that evaluates != instead of - between y and yHat. Also, predictArgs must include c(type='class'), otherwise the predict call used internally returns class probabilities instead of the most probable class. To sum up:
library(rpart)
library(cvTools)
data(iris)
cvFit(rpart, formula=Species~., data=iris,
cost=function(y, yHat) (y != yHat) + 0, predictArgs=c(type='class'))
(This uses another variant of cvFit. Additional args to rpart can be passed by setting the args= parameter.)
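For completeness, here is a hedged sketch of how that args= variant might look when forwarding extra fitting arguments to rpart (the rpart.control(cp = 0.2) value is purely illustrative, and the cost function is written as a mean so each fold contributes a single misclassification rate):
library(rpart)
library(cvTools)
data(iris)
# sketch: cross-validate rpart, forwarding a control setting via args=
cvFit(rpart, formula = Species ~ ., data = iris,
      args = list(control = rpart.control(cp = 0.2)),  # extra rpart() arguments
      cost = function(y, yHat) mean(y != yHat),         # 0/1 loss
      predictArgs = c(type = 'class'))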
I am trying to run ridge/lasso regression with the glmnet and onehot packages and I am getting an error.
library(glmnet)
library(onehot)
set.seed(123)
Sample <- HouseData[1:1460, ]
smp_size <- floor(0.5 * nrow(Sample))
train_ind <- sample(seq_len(nrow(Sample)), size = smp_size)
train <- Sample[train_ind, ]
test <- Sample[-train_ind, ]
############Ridge & Lasso Regressions ################
# Define the response for the training + test set
y_train <- train$SalePrice
y_test <- test$SalePrice
# Define the x training and test
x_train <- train[,!names(train)=="SalePrice"]
x_test <- test[,!names(train)=="SalePrice"]
str(y_train)
## encoding information for training set
x_train_encoded_data_info <- onehot(x_train,stringsAsFactors = TRUE, max_levels = 50)
x_train_matrix <- (predict(x_train_encoded_data_info,x_train))
x_train_matrix <- as.matrix(x_train_matrix)
# create encoding information for x test
x_test_encoded_data_info <- onehot(x_test,stringsAsFactors = TRUE, max_levels = 50)
x_test_matrix <- (predict(x_test_encoded_data_info,x_test))
str(x_train_matrix)
###Calculate best lambda
cv.out <- cv.glmnet(x_train_matrix, y_train,
alpha = 0, nlambda = 100,
lambda.min.ratio = 0.0001)
best.lambda <- cv.out$lambda.min
best.lambda
model <- glmnet(x_train_matrix, y_train, alpha = 0, lambda = best.lambda)
results_ridge <- predict(model,newx=x_test_matrix)
I know my data is clean and my matrices are the same size, but I keep getting this error when I try to run my prediction:
Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
My professor has also told me to one-hot encode before I split my data, but that makes no sense to me.
It's hard to debug that specific error because it's not entirely clear where the onehot function in your code is coming from; it doesn't exist in base R or the glmnet package.
That said, I would recommend using the old built-in standby function model.matrix (or its sparse cousin, sparse.model.matrix, if you have larger datasets) for creating the x argument to glmnet. model.matrix will automatically one-hot encode factor or categorical variables for you. It requires a model formula as input, which you can create from your dataset as shown below.
# create the model formula
y_variable <- "SalePrice"
model_formula <- as.formula(paste(y_variable, "~",
paste(names(train)[names(train) != y_variable], collapse = "+")))
# test & train matrices
x_train_matrix <- model.matrix(model_formula, data = train)[, -1]
x_test_matrix <- model.matrix(model_formula, data = test)[, -1]
###Calculate best lambda
cv.out <- cv.glmnet(x_train_matrix, y_train,
alpha = 0, nlambda = 100,
lambda.min.ratio = 0.0001)
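From here, the prediction step that was failing in the question should go through, because both matrices come from the same formula and so share the same columns (a sketch that assumes the factor levels present in train and test are the same; otherwise the columns can still diverge):
# fit the ridge model at the cross-validated lambda and predict on the test matrix
best.lambda <- cv.out$lambda.min
model <- glmnet(x_train_matrix, y_train, alpha = 0, lambda = best.lambda)
results_ridge <- predict(model, newx = x_test_matrix)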
A second, newer option would be to use the built-in glmnet function makeX(), which builds matrices off of your test/train dataframes. This can just be fed into cv.glmnet as the x argument as below.
## option 2: use glmnet built in function to create x matrices
x_matrices <- glmnet::makeX(train = train[, !names(train) == "SalePrice"],
test = test[, !names(test) == "SalePrice"])
###Calculate best lambda
cv.out <- cv.glmnet(x_matrices$x, y_train,
alpha = 0, nlambda = 100,
lambda.min.ratio = 0.0001)
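To round this option off, the matching test matrix can be fed to predict() as well; a short sketch, assuming (per the glmnet documentation) that makeX returns a list with components x and xtest when both data frames are supplied:
# fit at the cross-validated lambda and predict on the one-hot encoded test set
model <- glmnet(x_matrices$x, y_train, alpha = 0, lambda = cv.out$lambda.min)
results_ridge <- predict(model, newx = x_matrices$xtest)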
I want to perform a logistic regression with the train() function from the caret package. My model looks something like this:
model <- train(Y ~.,
data = train_data,
family = "binomial",
method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. For this I know how to obtain the ROC and AUC. However, I am also interested in obtaining the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate per prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score in the caret package. I am aware that with cv.glmnet from the glmnet package the predict function allows type = "response", which would solve my problem. However, for personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the Wikipedia definition of the Brier score:
The most common formulation of the Brier score is
BS = (1/N) * sum_{t=1}^{N} (f_t - o_t)^2
where f_t is the probability that was forecast, o_t the actual outcome of the event (0 or 1) and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you just calculate the probability and 0/1 with respect to that. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~.,data = train_data,family = "binomial",method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the probability of the class:
head(pred)
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second factor level, which is the default ordering you get when your response is labelled 0/1. Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
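For concreteness, a minimal sketch of that wiring, using the placeholder names from the question (train_data with a binary factor response Y whose levels are valid R names); classProbs = TRUE is required so the probability columns that BigSummary reads actually exist:
library(caret)
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,            # needed for data[, lev[2]]
                     summaryFunction = BigSummary)
set.seed(123)
model <- train(Y ~ ., data = train_data,
               method = "glmnet", family = "binomial",
               trControl = ctrl,
               metric = "Brier", maximize = FALSE)  # tune on the Brier score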
Suppose I'm doing several runs of the same model, differing only in the complexity parameter, on the same (seed-fixed) cross-validation with the caret package, for example:
library(caret)
data(iris)
# controls are the same for every models
c = trainControl(method = "cv",number=10,verboseIter = TRUE)
d = iris # data is also the same
f = Species ~ . # formula is also the same
m = "rpart" # method is also the same
set.seed(1234)
model1 <- train(form = f, data = d, trControl = c, method = m,
tuneGrid = expand.grid(cp = c(0,0.5)))
set.seed(1234)
model2 <- train(form = f, data = d, trControl = c, method = m,
tuneGrid = expand.grid(cp = c(0.1,0.2)))
set.seed(1234)
model3 <- train(form = f, data = d, trControl = c, method = m,
tuneGrid = expand.grid(cp = c(0,0.5,0.1,0.2)))
Is there a way I could "build up" the model3 train object from only model1 and model2?
Calculations are long, and I didn't run all my different tunings in the same caret call. But having every run in the same train object would make it much easier to compare them (via the plot function, the update function, the resamples function, etc.).
I'm particularly looking for a way to do the same thing plot.train does, but for all of them together.
I perfectly understand your concern, because my computational resources are also very limited. However, instead of "building up" the model3 object, I would approach it as follows.
Suppose what you wish to achieve is the highest accuracy. Then you simply need to evaluate, between model1 and model2, which one shows the highest accuracy, and choose that model's best tuning parameter. For example, we see the following:
> model1$bestTune$cp
[1] 0
> model2$bestTune$cp
[1] 0.2
> model1$results$Accuracy ## Respectively for cp = 0.0 and cp = 0.5
[1] 0.9333 0.3333
> model2$results$Accuracy ## Respectively for cp = 0.1 and cp = 0.2
[1] 0.9267 0.9267
We would choose cp = 0.
Suppose you have broken things down into model1, model2, model3, ... and wish to explore all of the manually entered parameter values across them.
k = 2 ## Here we only have model1 and model2 to compare
evaluate <- list()
for (i in 1:k) {
  model = eval(parse(text = paste0("model", i)))
  evaluate[["cp"]][[paste0("model", i)]] <-
    model$finalModel$tuneValue$cp
  evaluate[["accuracy"]][[paste0("model", i)]] <-
    model$results$Accuracy[[which(model$results$cp == model$bestTune$cp)]]
}
Then in our evaluate list, we have the following:
> evaluate
$cp
model1 model2
0.0 0.2
$accuracy
model1 model2
0.9333 0.9267
Upon this, we can do
> which(evaluate$accuracy == max(evaluate$accuracy))
model1
1
> evaluate$cp[[which(evaluate$accuracy == max(evaluate$accuracy))]]
[1] 0
Now we can happily choose cp = 0 and we also know that the result from the optimal cp is stored in model1.
If you still wish to "build up" model3, you can simply substitute some of the components (e.g. results, where AccuracySD, KappaSD and similar metrics are stored) after having chosen what we evaluated as the best model, model1 in this case; see the sketch below.
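A rough sketch of that substitution (not an officially supported caret workflow; it simply splices model2's tuning results into a copy of model1 so that plot.train shows all four cp values):
# approximate "model3" by combining the tuning results of model1 and model2
model3_approx <- model1
model3_approx$results <- rbind(model1$results, model2$results)
model3_approx$results <- model3_approx$results[order(model3_approx$results$cp), ]
# keep bestTune consistent with the combined table
best_row <- which.max(model3_approx$results$Accuracy)
model3_approx$bestTune <- model3_approx$results[best_row, "cp", drop = FALSE]
plot(model3_approx)  # accuracy profile over cp = 0, 0.1, 0.2, 0.5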
I would like to use random forests to impute missing values. I have read some papers that claim that MICE with random forests performs better than parametric MICE.
In my case, I have already run a model with the default mice method, got the results and played with them. However, when I chose the random forest method, I got an error and I'm not sure why. I've seen some questions relating to errors with random forest and mice, but those are not my case: my variables have more than a single NA.
imp <- mice(data1, m=70, pred=quickpred(data1), method="pmm", seed=71152, printFlag=TRUE)
impRF <- mice(data1, m=70, pred=quickpred(data1), method="rf", seed=71152, printFlag=TRUE)
iter imp variable
1 1 Vac
Error in if (n == 0) stop("data (x) has 0 rows") : argument is of length zero
Any one has any idea why I'm getting this error?
EDIT
I tried changing all variables to numeric instead of having dummy variables, and it returned the same error and some warnings:
impRF <- mice(data, m=70, pred=quickpred(data), method="rf", seed=71152, printFlag=TRUE)
iter imp variable
1 1 Vac CliForm
Error in if (n == 0) stop("data (x) has 0 rows") : argument is of length zero
In addition: There were 50 or more warnings (use warnings() to see the first 50)
50: In randomForest.default(x = xobs, y = yobs, ntree = 1, ... :
The response has five or fewer unique values. Are you sure you want to do regression?
EDIT1
I've also tried with only 5 imputations and a smaller subset of the data, with just 2000 rows, and got a few different errors:
> imp <- mice(data2, m=5, pred=quickpred(data2), method="rf", seed=71152, printFlag=TRUE)
iter imp variable
1 1 Vac Radio Origin Job Alc Smk Drugs Prison Commu Hmless Symp
Error in randomForest.default(x = xobs, y = yobs, ntree = 1, ...) : NAs in foreign
function call (arg 11)
In addition: Warning messages:
1: In randomForest.default(x = xobs, y = yobs, ntree = 1, ...) : invalid mtry: reset to within valid range
2: In max(ncat) : no non-missing arguments to max; returning -Inf
3: In randomForest.default(x = xobs, y = yobs, ntree = 1, ...) : NAs introduced by coercion
I also encountered this error when I had only one fully observed variable, which I'm guessing is the cause in your case too. My colleague Anoop Shah provided me with a fix (below) and Prof van Buuren (mice's author) has said he will include it in the next update of the package.
In R type the following to enable you to redefine the rf impute function.
fixInNamespace("mice.impute.rf", "mice")
The corrected function to paste in is then:
mice.impute.rf <- function (y, ry, x, ntree = 100, ...) {
  ntree <- max(1, ntree)
  # coerce the predictor sets to matrices (also works when x has a single column)
  xobs <- as.matrix(x[ry, ])
  xmis <- as.matrix(x[!ry, ])
  yobs <- y[ry]
  # fit one tree on the observed cases and build a donor pool of observed
  # y values for each missing case
  onetree <- function(xobs, xmis, yobs, ...) {
    fit <- randomForest(x = xobs, y = yobs, ntree = 1, ...)
    leafnr <- predict(object = fit, newdata = xobs, nodes = TRUE)
    nodes <- predict(object = fit, newdata = xmis, nodes = TRUE)
    donor <- lapply(nodes, function(s) yobs[leafnr == s])
    return(donor)
  }
  forest <- sapply(1:ntree, FUN = function(s) onetree(xobs, xmis, yobs, ...))
  # for each missing value, draw one donor at random from across the trees
  impute <- apply(forest, MARGIN = 1, FUN = function(s) sample(unlist(s), 1))
  return(impute)
}
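With the function redefined, the call from the question should run; a quick re-run sketch (data1 is the questioner's dataset, m is lowered to 5 only to keep the example fast, and the randomForest package must be installed since the rf method relies on it):
library(mice)
library(randomForest)  # used by the patched mice.impute.rf
impRF <- mice(data1, m = 5, pred = quickpred(data1), method = "rf",
              seed = 71152, printFlag = TRUE)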
I'm testing the kernlab package on a regression problem. It seems to be a common issue to get 'Error in .local(object, ...) : test vector does not match model !' when passing the ksvm object to the predict function. However, I have only found answers for classification problems or custom kernels that are not applicable to my problem (I'm using a built-in kernel for regression). I'm running out of ideas here; my sample code is:
data <- matrix(rnorm(200*10),200,10)
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(x = tr[,-1],
y = tr[,1],
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod, ts)
You forgot to remove the y variable from the test set, so it fails because the number of predictors doesn't match. This will work:
predict(mod,ts[,-1])
You can use pred <- predict(mod, ts) if ts is a data frame and the model is fitted with the formula interface. It would be:
data <- setNames(data.frame(matrix(rnorm(200*10),200,10)),
c("Y",paste("X", 1:9, sep = "")))
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(as.formula("Y ~ ."), data = tr,
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod, ts)