The accuracy limit of n weak classifier combinations - math

For a dataset D (n × m), n is the number of samples and m is the number of features.
There are k weak classifiers, and each has an accuracy of 60%.
How many weak classifiers must be combined to raise the accuracy to 90%?
Can this problem be solved with a mathematical formula?
With 2 classifiers, the accuracy is 60%.
With 3 classifiers, the accuracy is 64.8% (3 * 60%^2 * 40% + 60%^3).
Is that right?
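Assuming the classifiers make independent errors and are combined by simple majority vote (with ties broken by a fair coin flip), the ensemble accuracy is a binomial tail probability, so the question can be answered numerically. The sketch below is only an illustration: the function names are made up for this example, and because classifiers trained on the same data are rarely independent, the result is usually optimistic.

```python
from math import comb

def majority_vote_accuracy(k, p):
    """Accuracy of a majority vote over k independent classifiers,
    each correct with probability p. Ties (even k) are split 50/50."""
    total = 0.0
    for correct in range(k + 1):
        prob = comb(k, correct) * p**correct * (1 - p)**(k - correct)
        if 2 * correct > k:        # strict majority is right
            total += prob
        elif 2 * correct == k:     # tie: right half of the time
            total += 0.5 * prob
    return total

def smallest_odd_ensemble(p, target, max_k=501):
    """Smallest odd k whose majority-vote accuracy reaches target,
    or None if no k up to max_k is enough."""
    for k in range(1, max_k + 1, 2):
        if majority_vote_accuracy(k, p) >= target:
            return k
    return None

k_needed = smallest_odd_ensemble(0.6, 0.9)
print(k_needed, majority_vote_accuracy(k_needed, 0.6))
```

Here `majority_vote_accuracy(3, 0.6)` reproduces the 64.8% figure from the question, and `majority_vote_accuracy(2, 0.6)` gives 60%.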

The idea behind your question is ensembling.
This can work when each of the models you choose performs well on particular classes. You can then assign each model a weight per class and combine the weighted outputs into the final prediction.
For example,
say you have 3 classes (C1, C2, C3).
If model A predicts C1 well, you can set the final probability of C1 as
prob_of_C1 = model_A_prob*0.7 + model_B_prob*0.2 + model_C_prob*0.1
You can apply the same rule to the other classes. You might have to adjust the weights you assign; that is generally done based on the precision of each model for that particular class. This only works if the models perform differently on different classes.
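The weighting rule above can be sketched as follows. All probabilities and weights are hypothetical values chosen for illustration, and the final normalisation step is an assumption added so that the blended scores sum to 1.

```python
def combine_class_probabilities(model_probs, class_weights):
    """Blend per-model class probabilities using per-class model weights.

    model_probs:   {model_name: {class_name: probability}}
    class_weights: {class_name: {model_name: weight}}
    Returns a normalised {class_name: probability} dictionary.
    """
    blended = {}
    for cls, weights in class_weights.items():
        blended[cls] = sum(weights[m] * model_probs[m][cls] for m in weights)
    total = sum(blended.values())
    return {cls: value / total for cls, value in blended.items()}

# Hypothetical predictions for one sample from three models
model_probs = {
    "A": {"C1": 0.8, "C2": 0.1, "C3": 0.1},
    "B": {"C1": 0.4, "C2": 0.5, "C3": 0.1},
    "C": {"C1": 0.3, "C2": 0.2, "C3": 0.5},
}
# Hypothetical weights: A is trusted most for C1, B for C2, C for C3
class_weights = {
    "C1": {"A": 0.7, "B": 0.2, "C": 0.1},
    "C2": {"A": 0.2, "B": 0.7, "C": 0.1},
    "C3": {"A": 0.1, "B": 0.2, "C": 0.7},
}

final = combine_class_probabilities(model_probs, class_weights)
predicted_class = max(final, key=final.get)
print(final, predicted_class)
```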
If you want to learn more, you can look at the XGBoost algorithm; this blog post explains it nicely: XGBoost Algorithm: Long May She Reign!

Related

Why can a Strauss-hardcore model have a gamma greater than 1?

The spatstat book says clearly that a Strauss model is invalid with a gamma greater than 1, and that is true:
multiple.Strauss<-ppm(P1a4.multiple~1, Strauss(r=51),method='ho')
#Warning message:
#Fitted model is invalid - cannot be simulated
Since the L(r) function does have a trough first, I refit the data as a Strauss-hardcore model:
Mo.hybrid<-Hybrid(H=Hardcore(),S=Strauss(51))
multiple.hybrid<-ppm(P1a4.multiple~1,Mo.hybrid,method='ho')
#Hard core distance: 12.65963
#Fitted S interaction parameter gamma: 2.7466492
It is interesting to see that the model fitted successfully, with gamma > 1!
I want to know whether gamma in the Strauss-hardcore model has the same meaning as in the Strauss model, and can therefore be used as an indicator of aggregation.
Yes, the interpretation is similar and indicates some aggregation behaviour. The model with gamma>1 may be less intuitive to understand. Say the hardcore distance is r=12 and the Strauss interaction distance is R=50. Then pairs of points within distance 12 of each other are not permitted at all, while pairs of points separated by between 12 and 50 are encouraged (they have a higher probability of occurring than at random). Pairs of points separated by more than 50 do not change the baseline probability (complete randomness).
Simulations from the Strauss-hardcore model often show strange aggregation behavior, but it may be suitable for your data.
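To make the three distance regimes concrete, here is a minimal sketch of the unnormalised Strauss-hardcore density, beta^n * gamma^s, where s counts pairs whose distance lies between the hardcore distance and R, and the density is zero if any pair is closer than the hardcore distance. The function name, the value beta = 1 and the toy point patterns are assumptions for illustration; this is not spatstat code.

```python
from itertools import combinations
from math import dist

def strauss_hardcore_density(points, beta, gamma, hardcore, R):
    """Unnormalised density of a point pattern under a Strauss-hardcore model."""
    interacting_pairs = 0
    for p, q in combinations(points, 2):
        d = dist(p, q)
        if d < hardcore:
            return 0.0              # forbidden: pair inside the hard core
        if d <= R:
            interacting_pairs += 1  # pair contributes a factor gamma
    return beta ** len(points) * gamma ** interacting_pairs

# Toy patterns with two points at increasing separation
params = dict(beta=1.0, gamma=2.75, hardcore=12, R=50)
too_close = strauss_hardcore_density([(0, 0), (5, 0)], **params)
encouraged = strauss_hardcore_density([(0, 0), (30, 0)], **params)
neutral = strauss_hardcore_density([(0, 0), (100, 0)], **params)
print(too_close, encouraged, neutral)
```

With gamma > 1 the middle pattern gets a higher density than the far-apart one, which is the aggregation effect described above; the hard core keeps the total density bounded, which is why the fitted model is valid.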

Three-step method LCA in R (poLCA). Posterior probabilities from inclusive LCA?

As recommended by Bray, Lanza and Tan (2015), I'd like to use the three-step method to classify individuals into classes using the posterior probabilities of an inclusive LCA (an LCA that includes covariates). However, the inclusive model is very different from the non-inclusive model if I include all variables of interest.
The conditional probabilities are completely different, as is the number of cases per class. As a result, the interpretation of the profiles or patterns changes completely from the non-inclusive model (step 1) when I use the posterior probabilities of the inclusive LCA to assign the cases.
My question is: am I doing something wrong? Is it normal to get these changes? Maybe the procedure isn't correct. The model itself stops making sense when I look at the item conditional probabilities of each class.
These are the steps I took:
First, I performed an LCA to study profiles of sexual risk behaviors (using 6 variables) and to analyze their association with different types of drug use, gender and age (the 4-class model seemed the best choice).
z <- cbind(sexrisk1, sexrisk2, sexrisk3, sexrisk4, sexrisk5, sexrisk6) ~ 1
lc4 <- poLCA(z, MyData, nclass = 4,nrep=10)
Next, I included all variables of interest as covariates for an “appropriate” posterior analysis (as recommended by Bray, Lanza and Tan (2015)):
f <- cbind(sexrisk1, sexrisk2, sexrisk3, sexrisk4, sexrisk5, sexrisk6)~ drug1+drug2+drug3+gender+age
lc4.cov <- poLCA(f, MyData, nclass = 4,nrep=10)
Once the inclusive model was fitted, I used the predicted classes and posterior probabilities to assign cases to classes (I think poLCA does this via maximum-probability assignment, but I am not sure of this):
table(lc4.cov$predclass)
write.csv(cbind(MyData$code, lc4.cov$posterior), 'new.data.csv')
(NOTE: increasing nrep for both models (inclusive and non-inclusive) reduced the differences between their posterior probabilities.)
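For reference, maximum-probability (modal) assignment simply picks, for each case, the class with the largest posterior probability. A minimal sketch, using a small hypothetical posterior matrix rather than real poLCA output:

```python
def modal_assignment(posterior):
    """Return the 1-based index of the most probable class for each row."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in posterior]

def mean_max_posterior(posterior):
    """Average of each case's largest posterior probability.
    Values near 1 mean cases are assigned with little ambiguity."""
    return sum(max(row) for row in posterior) / len(posterior)

# Hypothetical posterior probabilities for 4 cases and 4 classes
posterior = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.85, 0.05, 0.05],
    [0.30, 0.25, 0.40, 0.05],
    [0.10, 0.10, 0.20, 0.60],
]
classes = modal_assignment(posterior)
certainty = mean_max_posterior(posterior)
print(classes, certainty)
```

Comparing this kind of certainty summary between the inclusive and non-inclusive models is one way to see whether the class assignments themselves have become less clear-cut, or whether only the class labels have been permuted between runs.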

How to perform a Multivariate Polynomial Regression when output has stochastic behavior?

I am simulating an experiment. The experiment has 3 parameters (a, b, c), but the result, r, cannot be predicted exactly because it has a stochastic component. To reduce the effect of the stochastic component, I ran the experiment several times (n). In summary, I have n 4-tuples (a, b, c, r) in which a, b, c are the same but r varies. Each batch of experiments is run with different values of a, b, c (k batches), so the complete data set has k times n 4-tuples.
I would like to find the best polynomial fit for this data and a way to compare candidate fits, like:
fit1: with
fit2: with
fit3: some 3rd degree polynomial function and corresponding error
fit4: another 3rd degree (simpler) polynomial function and corresponding error
and so on...
This could be done with R or Matlab®. I've searched and found many examples, but none handled the same input values with different outputs.
I considered doing the multivariate polynomial regression n times, adding some small delta to each parameter, but I'd rather find a cleaner solution before that.
Any help would be appreciated.
Thanks in advance,
Jacques
Polynomial regression should be able to handle stochastic simulations just fine. Simulate r n times for each parameter setting, and perform a multivariate polynomial regression across all the points you've simulated (I recommend polyfitn()).
You'll have multiple r values for the same [a,b,c], but a least-squares fit does not require unique inputs; a well-fit curve should estimate the expected value of r at each setting.
In polyfitn it will look something like this:
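n_rep = 1000;                  % replicates per parameter setting
a = rand(500,1);
b = rand(500,1);
c = rand(500,1);
r = zeros(length(a), n_rep);   % one row per setting, one column per replicate
for k = 1:n_rep
    for i = 1:length(a)
        r(i,k) = foo(a(i),b(i),c(i));   % foo is the stochastic simulation
    end
end
my_functions = {'a^2 b^2 c^2 a b c',...};
X = repmat([a,b,c],[n_rep,1]); % rows ordered to match r(:)
p = cell(size(my_functions));
for fun_id = 1:length(my_functions)
    p{fun_id} = polyfitn(X,r(:),my_functions{fun_id});
end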
It's not hard to iteratively or recursively generate a set of polynomial equations from a basis function, but for three variables there might not be a need to. Unless you have a specific reason for fitting higher-order polynomials (planetary physics, particle physics, etc.), you shouldn't have too many functions to fit. It is generally not good practice to use higher-order polynomials to explain data unless you have a specific reason for doing so: they risk overfitting, they cope poorly with sparse data and inter-variable noise, and more accurate non-linear methods are often available.
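The point about repeated inputs can be checked numerically. The sketch below (plain Python, no external libraries) fits the same polynomial model twice by ordinary least squares: once on every replicate and once on the per-setting means. With an equal number of replicates per setting, the coefficients agree up to rounding. The `simulate` function, the chosen model terms and the sample sizes are all stand-ins invented for this example.

```python
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def least_squares(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)

def features(a, b, c):
    # Hypothetical model: constant, linear terms and pure squares
    return [1.0, a, b, c, a * a, b * b, c * c]

def simulate(a, b, c):
    # Stand-in for the stochastic experiment
    return 1.0 + 2.0 * a - b + 0.5 * c * c + random.gauss(0.0, 0.3)

random.seed(0)
settings = [(random.random(), random.random(), random.random()) for _ in range(30)]
n_rep = 20

X_all, y_all, X_mean, y_mean = [], [], [], []
for a, b, c in settings:
    runs = [simulate(a, b, c) for _ in range(n_rep)]
    X_all.extend([features(a, b, c)] * n_rep)   # same inputs, different outputs
    y_all.extend(runs)
    X_mean.append(features(a, b, c))
    y_mean.append(sum(runs) / n_rep)

beta_all = least_squares(X_all, y_all)
beta_mean = least_squares(X_mean, y_mean)
print(max(abs(u - v) for u, v in zip(beta_all, beta_mean)))
```

Fitting on all replicates still has one advantage over fitting on the means: the residuals then reflect the run-to-run noise, which is useful when comparing candidate models by their error.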

The best way to calculate classification accuracy?

I know one formula to calculate classification accuracy is X = t / n * 100, where t is the number of correct classifications and n is the total number of samples.
But let's say we have 100 samples in total: 80 in class A, 10 in class B, 10 in class C.
Scenario 1: All 100 samples are assigned to class A. Using the formula, the accuracy is 80%.
Scenario 2: The 10 samples belonging to B are correctly assigned to class B; the 10 samples belonging to C are correctly assigned to class C; 30 samples belonging to A are correctly assigned to class A; the remaining 50 samples belonging to A are incorrectly assigned to C. Using the formula, the accuracy is 50%.
My question is:
1: Can we say scenario 1 has a higher accuracy rate than scenario 2?
2: Is there a better way to calculate an accuracy rate for a classification problem?
Many thanks in advance!
Classification accuracy is defined as "percentage of correct predictions". That is the case regardless of the number of classes. Thus, scenario 1 has a higher classification accuracy than scenario 2.
However, it sounds like what you are really asking is for an alternative evaluation metric or process that "rewards" scenario 2 for only making certain types of mistakes. I have two suggestions:
Create a confusion matrix: It describes the performance of a classifier so that you can see what types of errors your classifier is making.
Calculate the precision, recall, and F1 score for each class. The average F1 score might be the single-number metric you are looking for.
The Classification metrics section of the scikit-learn documentation has lots of good information about classifier evaluation, even if you are not a scikit-learn user.
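As an illustration, the two scenarios from the question can be scored with plain Python. This is a sketch rather than the scikit-learn implementation; it adopts the convention that precision is 0 for a class that is never predicted, and the function names are made up for this example.

```python
from collections import Counter

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_metrics(actual, predicted):
    """Return {class: (precision, recall, f1)} from the confusion counts."""
    counts = Counter(zip(actual, predicted))   # confusion matrix entries
    metrics = {}
    for cls in sorted(set(actual) | set(predicted)):
        tp = counts[(cls, cls)]
        n_predicted = sum(n for (a, p), n in counts.items() if p == cls)
        n_actual = sum(n for (a, p), n in counts.items() if a == cls)
        precision = tp / n_predicted if n_predicted else 0.0
        recall = tp / n_actual if n_actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = (precision, recall, f1)
    return metrics

def macro_f1(actual, predicted):
    metrics = per_class_metrics(actual, predicted)
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

actual = ["A"] * 80 + ["B"] * 10 + ["C"] * 10
scenario_1 = ["A"] * 100
scenario_2 = ["A"] * 30 + ["C"] * 50 + ["B"] * 10 + ["C"] * 10

for name, predicted in [("scenario 1", scenario_1), ("scenario 2", scenario_2)]:
    print(name, accuracy(actual, predicted), macro_f1(actual, predicted))
```

Under these definitions scenario 1 has accuracy 0.80 but a macro-averaged F1 of about 0.30, while scenario 2 has accuracy 0.50 and a macro-averaged F1 of about 0.61, so the two metrics rank the scenarios in opposite order.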

setting values for ntree and mtry for random forest regression model

I'm using the R package randomForest to do a regression on some biological data. My training data size is 38772 x 201.
I just wondered: what would be a good value for the number of trees, ntree, and the number of variables tried at each split, mtry? Is there an approximate formula to find such parameter values?
Each row in my input data is a 200-character string representing an amino acid sequence, and I want to build a regression model that uses these sequences to predict the distances between the proteins.
The default for mtry is quite sensible, so there is not really a need to muck with it. There is a function, tuneRF, for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error but not so many that you over-correlate the ensemble, which leads to overfitting.
Here is the caveat: variable interactions stabilize at a slower rate than error, so if you have a large number of independent variables, you need more replicates. I would keep ntree an odd number so ties can be broken.
For the dimensions of your problem, I would start with ntree=1501. I would also recommend looking into one of the published variable selection approaches to reduce the number of your independent variables.
The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from its default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
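One way to make "won't change much after a certain number of trees" concrete is to take the sequence of errors recorded as trees are added and find the point after which the error stays within a chosen tolerance. The sketch below is an illustrative helper, not part of randomForest; the error curve is synthetic and the tolerance is an arbitrary choice.

```python
def trees_needed(errors, tol):
    """errors[i] is the error with i + 1 trees.
    Return the smallest number of trees after which the error
    never varies by more than tol over the rest of the sequence."""
    hi = lo = errors[-1]
    needed = len(errors)
    for i in range(len(errors) - 1, -1, -1):   # scan backwards from the end
        hi = max(hi, errors[i])
        lo = min(lo, errors[i])
        if hi - lo > tol:
            break
        needed = i + 1
    return needed

# Synthetic error curve that decays towards 0.3 as trees are added
synthetic_errors = [0.3 + 0.2 / n for n in range(1, 501)]
print(trees_needed(synthetic_errors, tol=0.001))
```

In R, the per-tree error sequence would come from the fitted forest object (the same values that plotting the object displays), so the same check can be applied there.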
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
Also, somehow I forgot that the randomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value for mtry.
Could this paper help?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that a priori determines a minimum number of classifiers to combine in order to obtain a prediction accuracy level similar to the one obtained with the combination of larger ensembles. The procedure is based on the McNemar non-parametric test of significance. Knowing a priori the minimum size of the classifier ensemble giving the best prediction accuracy, constitutes a gain for time and memory costs especially for huge data bases and real-time applications. Here we applied this procedure to four multiple classifier systems with C4.5 decision tree (Breiman’s Bagging, Ho’s Random subspaces, their combination we labeled ‘Bagfs’, and Breiman’s Random forests) and five large benchmark data bases. It is worth noticing that the proposed procedure may easily be extended to other base learning algorithms than a decision tree as well. The experimental results showed that it is possible to limit significantly the number of trees. We also showed that the minimum number of trees required for obtaining the best prediction accuracy may vary from one classifier combination method to another.
They never use more than 200 trees.
One nice trick that I use is to start by taking the square root of the number of predictors and using that value for mtry. It is usually close to the value that the tuneRF function in randomForest would pick.
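Note that the square-root rule matches the randomForest default for classification only; for regression the package default is the number of predictors divided by 3 (rounded down, at least 1). A small sketch of both starting points, with a hypothetical helper name:

```python
from math import isqrt

def default_mtry(n_predictors, task):
    """Starting value for mtry, following the randomForest defaults:
    floor(sqrt(p)) for classification, max(floor(p / 3), 1) for regression."""
    if task == "classification":
        return isqrt(n_predictors)
    if task == "regression":
        return max(n_predictors // 3, 1)
    raise ValueError("task must be 'classification' or 'regression'")

# 200 predictors, as in the question
print(default_mtry(200, "classification"), default_mtry(200, "regression"))
```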
I use the code below to check for accuracy as I play around with ntree and mtry (change the parameters):
library(randomForest)  # randomForest()
library(ROCR)          # prediction() and performance() for AUC

# Assumes x, y (training predictors and factor response) and the data frames
# dev and val, each with a binary Target column, already exist.
results_df <- data.frame(matrix(ncol = 8))
colnames(results_df) <- c("No. of trees", "No. of variables",
                          "Dev_AUC", "Dev_Hit_rate", "Dev_Coverage_rate",
                          "Val_AUC", "Val_Hit_rate", "Val_Coverage_rate")
trees <- c(50, 100, 150, 250)
variables <- c(8, 10, 15, 20)
for (i in 1:length(trees)) {
  ntree <- trees[i]
  for (j in 1:length(variables)) {
    mtry <- variables[j]
    rf <- randomForest(x, y, ntree = ntree, mtry = mtry)

    # Development sample: predict() without newdata returns out-of-bag predictions
    pred <- as.data.frame(predict(rf, type = "class"))
    class_rf <- cbind(dev$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    # Hit rate is precision and coverage rate is recall for class 1
    dev_hit_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, predicted_values == 1))
    dev_coverage_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, actual_values == 1))
    pred_prob <- as.data.frame(predict(rf, type = "prob"))
    prob_rf <- cbind(dev$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    dev_auc <- as.numeric(auc@y.values)  # ROCR objects are S4: slots use @

    # Validation sample
    pred <- as.data.frame(predict(rf, val, type = "class"))
    class_rf <- cbind(val$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    val_hit_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, predicted_values == 1))
    val_coverage_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, actual_values == 1))
    pred_prob <- as.data.frame(predict(rf, val, type = "prob"))
    prob_rf <- cbind(val$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    val_auc <- as.numeric(auc@y.values)

    # One result row per (ntree, mtry) combination
    results_df <- rbind(results_df, c(ntree, mtry, dev_auc, dev_hit_rate, dev_coverage_rate, val_auc, val_hit_rate, val_coverage_rate))
  }
}

Resources