Random Forest using R

I'm working on building a predictive model for breast cancer data using R. After performing GCRMA normalization, I generated the potential predictor variables. When I run the RF algorithm in the code below, I encounter an error.
Code:
library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)
setwd("E:/kavya's project_work/final")
datafile<-"trainset_gcrma.txt"
clindatafile<-read.csv("mod clinical_details.csv")
outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"
data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=clindatafile
clindata_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clindata_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]
#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names
target= clindata[,"relapse"]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)
tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
error:"Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories."
As I'm new to machine learning, I'm unable to proceed. Kindly help.
Thanks in advance.

It is hard to say without knowing the data. Run class() or summary() on all your predictor variables to ensure that none are accidentally being interpreted as characters or factors. If a predictor really does have more than 53 levels, you will have to convert it to binary variables. Example:
mtcars$automatic <- mtcars$am == 0
mtcars$manual <- mtcars$am == 1
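Applied to the question's objects, a quick audit might look like this (a sketch; predictor_data is the matrix built in the question, and the coercion line is only needed if the audit reveals character columns):
# randomForest treats character/factor predictors as categorical and
# caps them at 53 levels, so the matrix should be numeric throughout.
str(predictor_data)                                  # expect: numeric matrix
table(sapply(as.data.frame(predictor_data), class))  # class of every column
# If expression values were accidentally read as characters, coerce them back:
mode(predictor_data) <- "numeric"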

Related

R: Caret package: Brier Score

I want to perform a logistic regression with the train() function from the caret package. My model looks something like this:
model <- train(Y ~ .,
               data = train_data,
               family = "binomial",
               method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. I know how to obtain the ROC curve and the AUC, but I am also interested in the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate per prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any way to obtain the Brier score within the caret package. I am aware that the predict method for cv.glmnet (from the glmnet package) accepts type = "response", which would solve my problem, but for personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the wiki definition of the Brier score:
The most common formulation of the Brier score is
BS = (1/N) * sum_{t=1}^{N} (f_t - o_t)^2
where f_t is the probability that was forecast, o_t the actual outcome of the event at instance t (0 or 1), and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you just calculate the probability and 0/1 with respect to that. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~ ., data = train_data, family = "binomial", method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the class probabilities:
head(pred)
          o          v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I make sure that the "positive" class is the second factor level, which is the default when the response is labeled "0"/"1". Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
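A minimal sketch of that wiring, reusing the two-class train_data from the iris example above (the resampling settings here are illustrative):
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE,   # probability columns must reach BigSummary
                     summaryFunction = BigSummary)
model <- train(Species ~ ., data = train_data,
               method = "glmnet", family = "binomial",
               trControl = ctrl,
               metric = "Brier", maximize = FALSE)  # lower Brier score is better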

R Catboost to handle categorical variables

I have a question about CatBoost: do I need to preprocess the categorical variables before modeling?
I have 86 variables, including 1 target variable. Among the 85 predictors, there are 2 numeric variables and 83 categorical variables (factor type). The target variable is a binary factor, 1 or 0.
Column 1 and columns 4 to 85 are factors; columns 2 and 3 are numeric.
I am a little confused by cat_features in catboost.train(). In the parameters, I can set a vector of categorical features, but I can also set it in catboost.load_pool().
library(catboost)
library(dplyr)
X_train <- train %>% select(-Target)
y_train <- (as.numeric(unlist(train[c('Target')])) - 1)
X_valid <- test %>% select(-Target)
y_valid <- (as.numeric(unlist(test[c('Target')])) - 1)
train_pool <- catboost.load_pool(data = X_train, label = y_train, cat_features = c(0,3:84))
test_pool <- catboost.load_pool(data = X_valid, label = y_valid, cat_features = c(0,3:84))
params <- list(iterations = 500,
               learning_rate = 0.01,
               depth = 10,
               loss_function = 'RMSE',
               eval_metric = 'RMSE',
               random_seed = 1,
               od_type = 'Iter',
               metric_period = 50,
               od_wait = 20,
               use_best_model = TRUE,
               cat_features = c(0,3:84))
catboost.train(train_pool, test_pool, params = params)
However, after I ran the code above, I got an error:
Error in catboost.train(train_pool, test_pool, params = params) :
catboost/libs/options/plain_options_helper.cpp:339: Unknown option {cat_features} with value "[0,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]"
Any help?
Look at this example: cat_features should not go in the params list, only in catboost.load_pool().
library(catboost)
library(dplyr)  # for glimpse()
countries = c('RUS','USA','SUI')
years = c(1900,1896,1896)
phone_codes = c(7,1,41)
domains = c('ru','us','ch')
dataset = data.frame(countries, years, phone_codes, domains, stringsAsFactors = T)
glimpse(dataset)
label_values = c(0,1,1)
fit_params <- list(iterations = 100,
                   loss_function = 'Logloss',
                   ignored_features = c(4,9),
                   border_count = 32,
                   depth = 5,
                   learning_rate = 0.03,
                   l2_leaf_reg = 3.5)
pool = catboost.load_pool(dataset, label = label_values, cat_features = c(0,3))
model <- catboost.train(pool, params = fit_params)
model
I haven't tried CatBoost in R, but see the example on this page:
https://catboost.ai/docs/concepts/r-reference_catboost-train.html
It appears you only pass the categorical variables in the load_pool() call, and NOT in the train() call.
(This works differently from the Python API, where cat_features is passed in the Python fit() call.)
A suggestion: group all the categorical variables in the left-most columns. That way the index vector is simpler to create. I also have a check in my code to make sure I did it right...
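If you'd rather not hard-code positions like c(0,3:84), here is a hedged sketch that derives them from the column types (catboost.load_pool() expects zero-based indices):
# Find every factor column in the predictor frame and shift to 0-based indexing
cat_idx <- which(sapply(X_train, is.factor)) - 1
train_pool <- catboost.load_pool(data = X_train, label = y_train,
                                 cat_features = cat_idx)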

Problem with Over- and Under-Sampling with ROSE in R

I have a dataset for classifying between won cases (14399) and lost cases (8677). The dataset has 912 predictor variables.
I am trying to oversample the lost cases in order to reach almost the same number as the won cases (so having 14399 cases for each of the won and lost cases).
TARGET is the column with lost (0) and won (1) cases:
table(dat_train$TARGET)
    0     1
 8677 14399
Now I am trying to balance them using ROSE's ovun.sample:
dat_train_bal <- ovun.sample(dat_train$TARGET~., data = dat_train, p=0.5, seed = 1, method = "over")
I get this error:
Error in parse(text = x, keep.source = FALSE) :
<text>:1:17538: unexpected symbol
1: PPER_409030143+BP_RESPPER_9639064007+BP_RESPPER_7459058285+BP_RESPPER_9339059882+BP_RESPPER_9339058664+BP_RESPPER_5209073603+BP_RESPPER_5209061378+CRM_CURRPH_Initiation+Quotation+CRM_CURRPH_Ne
Can anyone help?
Thanks :-)
Reproducing your code with a mock example, I found an error in your formula: dat_train$TARGET~. needs to be corrected to TARGET~.
dframe <- tibble::tibble(val = sample(c("a", "b"), size = 100,
                                      replace = TRUE, prob = c(.1, .9)),
                         xvar = rnorm(100))
# Use oversampling
dframe_os <- ROSE::ovun.sample(formula = val ~ ., data = dframe, p=0.5, seed = 1, method = "over")
table(dframe_os$data$val)
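Applied to the question's data, the corrected call would look like this (a sketch, assuming dat_train is a plain data frame with a TARGET column):
# Use the bare column name in the formula; ovun.sample() resolves it inside `data`
dat_train_bal <- ROSE::ovun.sample(TARGET ~ ., data = dat_train,
                                   p = 0.5, seed = 1, method = "over")
table(dat_train_bal$data$TARGET)  # classes should now be roughly balanced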

How to fix "variable length differ" error in cv.zipath?

Trying to run a cross-validation of a zero-inflated Poisson model using cv.zipath from the mpath package.
Fitting the LASSO
fit.lasso = zipath(estimation_sample_nomiss ~ . | .,
                   data = missings,
                   nlambda = 100,
                   family = "poisson",
                   link = "logit")
Cross validation
n <- dim(docvisits)[1]
K <- 10
set.seed(197)
foldid <- split(sample(1:n), rep(1:K, length = n))
fitcv <- cv.zipath(F_time_unemployed ~ . | .,
                   data = estimation_sample_nomiss, family = "poisson",
                   nlambda = 100, lambda.count = fit.lasso$lambda.count[1:30],
                   lambda.zero = fit.lasso$lambda.zero[1:30], maxit.em = 300,
                   maxit.theta = 1, theta.fixed = FALSE, penalty = "enet",
                   rescale = FALSE, foldid = foldid)
I encounter the following error:
Error in model.frame.default(formula = F_time_unemployed ~ . + ., data = list(: variable lengths differ (found for '(weights)')
I have cleaned the sample of all NA's but still encounter the error message.
The solution turns out to be that the cv.zipath() command does not accept tibbles - at least in this instance (no guarantee as to how far that generalises). Having used dplyr commands, one needs to convert back to a data frame, so the fix is as simple as as.data.frame().
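In other words, something like this before the cv.zipath() call (a sketch, assuming estimation_sample_nomiss is still a tibble after the dplyr pipeline):
# cv.zipath() trips over tibbles here; hand it a plain data.frame instead
estimation_sample_nomiss <- as.data.frame(estimation_sample_nomiss)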

Error: nrow(x) == n is not TRUE when using Train in Caret

I have a training set that looks like this:
Name    Day      Area    X       Y     Month Night
ATTACK  Monday   LA      -122.41 37.78 8     0
VEHICLE Saturday CHICAGO -1.67   3.15  2     0
MOUSE   Monday   TAIPEI  -12.5   3.1   9     1
Name is the outcome/dependent variable. I converted Name, Area and Day into factors, but I wasn't sure whether I should do the same for Month and Night, which only take on integer values 1-12 and 0-1, respectively.
I then convert the data into matrices:
ynn <- model.matrix(~Name , data = trainDF)
mnn <- model.matrix(~ Day+Area +X + Y + Month + Night, data = trainDF)
I then set up the tuning parameters:
nnTrControl = trainControl(method = "repeatedcv", number = 3, repeats = 5,
                           verboseIter = TRUE, returnData = FALSE,
                           returnResamp = "all", classProbs = TRUE,
                           summaryFunction = multiClassSummary,
                           allowParallel = TRUE)
nnGrid = expand.grid(.size = c(1,4,7), .decay = c(0,0.001,0.1))
model <- train(y = ynn, x = mnn, method = 'nnet', linout = TRUE, trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)
However, I get the error Error: nrow(x) == n is not TRUE at the model <- train call. I also get a similar error if I use xgboost instead of nnet.
Anyone know what's causing this?
y should be a numeric or factor vector containing the outcome for each sample, not a matrix. Using
train(y = make.names(trainDF$Name), ...)
helps; make.names modifies the values so that they are valid R variable names.
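Putting that together with the objects from the question, a hedged sketch of the corrected call (linout = TRUE is dropped here because linear output units suit regression rather than classification):
model <- train(y = factor(make.names(trainDF$Name)),  # outcome as a factor vector
               x = mnn,                               # predictors stay as before
               method = 'nnet', trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)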
Even though the help file for train says either a matrix or a data frame is expected, you can try converting the matrix to a data frame:
model <- train(y = ynn, x = as.data.frame(mnn), method = 'nnet', linout = TRUE,
               trace = FALSE, trControl = nnTrControl, metric = "logLoss",
               tuneGrid = nnGrid)
