Using R's Naive Bayes implementation I am trying to build a good model for predicting a feature.
Assume I have 12 columns, a training set of 600K rows, and a test set of 150K rows.
The 12th column (say X) is the one I am trying to predict using the first 11 factor columns.
With the code below
library(e1071)    # naiveBayes()
library(caret)    # confusionMatrix()
nb_model <- naiveBayes(train[, 1:11], train[, 12])   # fit on the 11 factor predictors
prediction <- predict(nb_model, test[, -12])         # predicted classes for the test set
str(prediction)
length(prediction)
table(pred = prediction, test[, 12])
confusionMatrix(prediction, test[, 12])
I get a very low rate of TRUE POSITIVES:
            Actual False | Actual True
Pred False        115442 |       24862
Pred True           4559 |        5137
I have a feeling my TRUE POSITIVES are dominated by the FALSE values in the TRAIN set, since the ratio of positives to all values is about 1/5. But predicted TRUE POSITIVES / all actual positives is even less than 1/5, which is worse than random!
The question is: how can I set a threshold (or something similar) so that I get more TRUE POSITIVE predictions? For now I don't care about the TP/(TP+FP) (precision) rate.
thanks.
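One way to get more TRUE predictions is to classify on the posterior probabilities instead of the default class labels and lower the cut-off for the positive class. A sketch, assuming the e1071 naiveBayes() model fitted above and a positive class labelled "TRUE"; the 0.2 cut-off is only illustrative and would need tuning:
probs <- predict(nb_model, test[, -12], type = "raw")   # matrix of class posteriors, one column per class
threshold <- 0.2                                        # illustrative value; tune e.g. via an ROC curve
prediction_t <- factor(probs[, "TRUE"] > threshold, levels = c(FALSE, TRUE))
table(pred = prediction_t, test[, 12])                  # should trade FPs for more TPs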
Related
When I run my random forest model over my test data I get different results for the same data set and model.
Here are the results, where you can see the difference in the first column:
> table((predict(rfModelsL[[1]],newdata = a)) ,a$earlyR)
FALSE TRUE
FALSE 14 7
TRUE 13 66
> table((predict(rfModelsL[[1]],newdata = a)) ,a$earlyR)
FALSE TRUE
FALSE 15 7
TRUE 12 66
Although the difference is very small, I'm trying to understand what causes it. I'm guessing that predict has a "flexible" classification threshold, although I couldn't find anything about that in the documentation. Am I right?
Thank you in advance.
I will assume that you did not refit the model here, but it is simply the predict call that is producing these results. The answer is probably this, from ?predict.randomForest:
Any ties are broken at random, so if this is undesirable, avoid it by
using odd number ntree in randomForest()
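A sketch of that workaround; the formula, the training data (trainData), and the ntree value below are placeholders, since the original fitting call isn't shown in the question:
library(randomForest)
## Refit with an odd number of trees so the majority vote over trees cannot tie
rfOdd <- randomForest(earlyR ~ ., data = trainData, ntree = 501)
table(predict(rfOdd, newdata = a), a$earlyR)   # repeated calls should now agree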
I would like to perform automated, exhaustive model selection on a dataset with 7 predictors (5 continuous and 2 categorical) in R. I would like all continuous predictors to have the potential for interaction (at least up to 3 way interactions) and also have non-interacting squared terms.
I have been using regsubsets() from the leaps package and have gotten good results; however, many of the models contain interaction terms without the corresponding main effects (e.g., g*h is an included predictor but g is not). Since including the main effect will affect the model score (Cp, BIC, etc.), it is important to include it when comparing against the other models, even if it is not a strong predictor.
I could manually weed through the results and cross off models that include interactions without main effects but I'd prefer to have an automated way to exclude those. I'm fairly certain this isn't possible with regsubsets() or leaps(), and probably not with glmulti either. Does anyone know of another exhaustive model selection function that allows for such specification or have a suggestion for script that will sort the model output and find only models that fit my specs?
Below is simplified output from my model searches with regsubsets(). You can see that models 3 and 4 include interaction terms without all the related main effects. If no other function can run a search with my specs, then suggestions for easily subsetting this output to exclude models missing the necessary main effects would also be helpful.
Model adjR2 BIC CP n_pred X.Intercept. x1 x2 x3 x1.x2 x1.x3 x2.x3 x1.x2.x3
1 0.470344346 -41.26794246 94.82406866 1 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
2 0.437034361 -36.5715963 105.3785057 1 TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
3 0.366989617 -27.54194252 127.5725366 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
4 0.625478214 -64.64414719 46.08686422 2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
You can use the dredge() function from the MuMIn package.
See also Subsetting in dredge (MuMIn) - must include interaction if main effects are present.
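A minimal sketch of that approach, assuming a data frame dat with response y and continuous predictors x1, x2, x3 (dredge() keeps marginality, so interaction terms only appear together with their main effects):
library(MuMIn)
options(na.action = "na.fail")                    # dredge() requires this
global <- lm(y ~ (x1 + x2 + x3)^3, data = dat)    # all main effects plus 2- and 3-way interactions
ms <- dredge(global)                              # exhaustive search over the submodels
head(ms)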
After working with dredge I found that my models have too many predictors and interactions to run dredge in a reasonable period (I calculated that with 40+ potential predictors it might take 300k hours to complete the search on my computer). But it does exclude models where interactions don't match with main effects so I imagine that might still be a good solution for many people.
For my needs I've moved back to regsubsets and have written some code to parse through the search output in order to exclude models that contain interaction terms whose component terms are not also included as main effects. This code seems to work well so I'll share it here. Warning: it was written with human expediency in mind, not computational efficiency, so it could probably be re-coded to be faster. If you've got 100,000s of models to test you might want to make it sleeker. (I've been working on searches with ~50,000 models and up to 40 factors, which take my 2.4 GHz i5 core a few hours to process.)
reg.output.search.with.test<- function (search_object) { ## input an object from a regsubsets search
## First build a df listing model components and metrics of interest
search_comp<-data.frame(R2=summary(search_object)$rsq,
adjR2=summary(search_object)$adjr2,
BIC=summary(search_object)$bic,
CP=summary(search_object)$cp,
n_predictors=row.names(summary(search_object)$which),
summary(search_object)$which)
## Categorize different types of predictors based on whether '.' is present
predictors<-colnames(search_comp)[(match("X.Intercept.",names(search_comp))+1):dim(search_comp)[2]]
main_pred<-predictors[grep(pattern = ".", x = predictors, invert=T, fixed=T)]
higher_pred<-predictors[grep(pattern = ".", x = predictors, fixed=T)]
## Define a variable that indicates whether the model should be rejected; set to FALSE for all models initially.
search_comp$reject_model<-FALSE
for(main_eff_n in 1:length(main_pred)){ ## iterate through main effects
## Find column numbers of higher-level terms containing the main effect
search_cols<-grep(pattern=main_pred[main_eff_n],x=higher_pred)
## Subset models that are not yet flagged for rejection, only test these
valid_model_subs<-search_comp[search_comp$reject_model==FALSE,]
## Subset dfs with only main or higher level predictor columns
main_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%main_pred]
higher_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%higher_pred]
if(length(search_cols)>0){ ## If there are higher level pred, test each one
for(high_eff_n in search_cols){ ## iterate through higher level pred.
## Test if the intxn effect is present without main effect (working with whole column of models)
test_responses<-((main_pred_df[,main_eff_n]==FALSE)&(higher_pred_df[,high_eff_n]==TRUE))
valid_model_subs[test_responses,"reject_model"]<-TRUE ## Set reject to TRUE where appropriate
} ## End high_eff for
## Transfer changes in reject to primary df:
search_comp[row.names(valid_model_subs),"reject_model"]<-valid_model_subs[,"reject_model"]
} ## End if
} ## End main_eff for
## Output resulting table of all models named for original search object and current time/date in folder "model_search_reg"
current_time_date<-format(Sys.time(), "%m_%d_%y at %H_%M_%S")
write.table(search_comp,file=paste("./model_search_reg/",paste(current_time_date,deparse(substitute(search_object)),
"regSS_model_search.csv",sep="_"),sep=""),row.names=FALSE, col.names=TRUE, sep=",")
} ## End reg.output.search.with.test fn
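For reference, a hypothetical call; the regsubsets() search, the data frame dat, and its variables are invented here, and the ./model_search_reg/ folder must exist because the function writes its CSV there:
library(leaps)
search_obj <- regsubsets(y ~ x1 * x2 * x3, data = dat, nbest = 5, nvmax = 8)   # invented example search
dir.create("model_search_reg", showWarnings = FALSE)                           # folder the function writes into
reg.output.search.with.test(search_obj)                                        # writes the flagged model table as a CSV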
Assuming "test" and "train" are two data frames for testing and traininig respectively, and "model" is a classifier that was generated using training data. I can find the number of misclassified examples like this:
n = sum(test$class_label != predict(model, test))
How can I find the number of examples that is predicted as negative but it is actually positive? (i.e. false positive)
NOTE: The above example assumes that the problem is a binary classification problem whose classes are, say, "yes" (positive class) and "no". Additionally, predict is a function from the caret package.
This will get you a 2x2 table showing true positives, false positives, false negatives and true negatives.
> table(Truth = test$class_label, Prediction = predict(model, test))
Prediction
Truth yes no
yes 32 3
no 8 27
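If you want the individual counts rather than the whole table, you can index the cells directly, or let caret's confusionMatrix() do the bookkeeping; a sketch, assuming the positive class is "yes" as above:
cm <- table(Truth = test$class_label, Prediction = predict(model, test))
cm["no", "yes"]    # predicted yes, actually no:  false positives
cm["yes", "no"]    # predicted no,  actually yes: false negatives
library(caret)
confusionMatrix(predict(model, test), test$class_label, positive = "yes")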
I built a decision tree from training data using the rpart package in R. Now I have more data and I want to run it through the tree to validate the model. Logically/iteratively, I want to do the following:
for each datapoint in new data
run point thru decision tree, branching as appropriate
examine how tree classifies the data point
determine if the datapoint is a true positive or false positive
How do I do that in R?
To be able to use this, I assume you split up your training set into a subset training set and a test set.
To create the training model you can use:
model <- rpart(y ~ ., traindata, minbucket = 5) # I suspect you have done this step already.
To apply it to the test set:
pred <- predict(model, testdata)
You then get a vector of predicted results.
In your test data set you also have the "real" answer; let's say it is in the last column.
Simply equating them will yield the result:
pred == testdata[ , last] # where 'last' equals the index of 'y'
When the elements are equal you will get TRUE; a FALSE means your prediction was wrong.
pred + testdata[, last] > 1 # TRUE positives: both the prediction and the actual value are 1 (assumes a 0/1 coding)
pred == testdata[, last] # gives the predictions that are correct
It might be interesting to see what percentage you got correct:
mean(pred == testdata[ , last]) # here TRUE will count as a 1, and FALSE as 0
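If y is a factor (i.e. a classification tree) rather than a 0/1 numeric column, the same idea is easier to read from a confusion table built on class predictions; a sketch, assuming "yes" is the positive class:
pred_class <- predict(model, testdata, type = "class")            # class labels instead of probabilities
conf <- table(Predicted = pred_class, Actual = testdata[, last])
conf                              # diagonal = correct, off-diagonal = errors
conf["yes", "no"]                 # predicted yes but actually no: false positives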
I am working on Random Forest classification.
I found that cforest in the "party" package usually performs better than "randomForest".
However, cforest seems to overfit easily.
A toy example
Here is a random data set with a binary factor response and 10 numerical variables, all generated from rnorm().
# Sorry for redundant preparation.
data <- data.frame(response=rnorm(100))
data$response <- factor(data$response < 0)
data <- cbind(data, matrix(rnorm(1000), ncol=10))
colnames(data)[-1] <- paste("V",1:10,sep="")
Fit cforest with the unbiased parameter set (which seems to be the recommended one).
cf <- cforest(response ~ ., data=data, controls=cforest_unbiased())
table(predict(cf), data$response)
# FALSE TRUE
# FALSE 45 7
# TRUE 6 42
Fairly good prediction performance on meaningless data.
On the other hand, randomForest behaves honestly:
rf <- randomForest(response ~., data=data)
table(predict(rf),data$response)
# FALSE TRUE
# FALSE 25 27
# TRUE 26 22
Where do these differences come from?
I am afraid that I am using cforest in the wrong way.
Let me add some extra observations about cforest:
The number of variables did not much affect the result.
Variable importance values (computed by varimp(cf)) were rather low, compared to those using some realistic explanatory variables.
AUC of ROC curve was nearly 1.
I would appreciate your advice.
Additional note
Some wondered why the training data set was passed to predict().
I did not prepare a separate test data set because I assumed the prediction would be done on the OOB samples, as it is for randomForest; this turned out not to be true for cforest.
cf. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
You cannot learn anything about the true performance of a classifier by studying its performance on the training set. Moreover, since there is no true pattern to find you can't really tell if it is worse to overfit like cforest did, or to guess randomly like randomForest did. All you can tell is that the two algorithms followed different strategies, but if you'd test them on new unseen data both would probably fail.
The only way to estimate the performance of a classifier is to test it on external data, that has not been part of the training, in a situation you do know there is a pattern to find.
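To make that concrete for the toy example above: since the data are pure noise, you can generate a fresh set from the same process and predict on it; both classifiers should then drop to roughly chance-level accuracy. A sketch:
## Fresh data from the same pattern-free process as in the question
newdat <- data.frame(response = factor(rnorm(100) < 0))
newdat <- cbind(newdat, matrix(rnorm(1000), ncol = 10))
colnames(newdat)[-1] <- paste("V", 1:10, sep = "")
table(predict(cf, newdata = newdat), newdat$response)   # expect ~50% accuracy
table(predict(rf, newdata = newdat), newdat$response)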
Some comments:
The number of variables shouldn't matter if none contain any useful information.
Nice to see that the variable importance is lower for meaningless data than meaningful data. This could serve as a sanity check for the method, but probably not much more.
AUC (or any other performance measure) doesn't matter on the training set, since it is trivial to obtain perfect classification results.
The predict methods have different defaults for cforest and randomForest models, respectively. party:::predict.RandomForest gets you
function (object, OOB = FALSE, ...)
{
RandomForest@predict(object, OOB = OOB, ...)
}
so
table(predict(cf), data$response)
gets me
FALSE TRUE
FALSE 45 13
TRUE 7 35
whereas
table(predict(cf, OOB=TRUE), data$response)
gets me
FALSE TRUE
FALSE 31 24
TRUE 21 24
which is a respectably dismal result.