AUC in Weka vs R

I have received AUCs and predictions from a collaborator, generated in Weka. The statistical model behind them was cross-validated, so my dataset with the predictions includes columns for fold, predicted probability, and true class. Using this data I was unable to replicate the AUCs from the predicted probabilities in R; the values always differ slightly.
Additional details:
Weka was used via GUI, not command line
I checked the AUC in R with packages pROC and ROCR
I first tried calculating the AUC over the pooled predictions (without regard to fold) and got a different AUC
Then I tried calculating the AUCs per fold and averaging; this did not match either
The model was ridge logistic regression and there is a single tie in the predictions
The first fold has one sample more than the others; I tried a weighted average, but that did not work out either
I even tested averaging the AUCs after a logit transformation (for normality)
Taking the median instead of the mean did not help either
I am familiar with how the AUC is calculated in R, but I don't see what Weka could do differently.
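For reference, the AUC can be computed in base R from the Mann-Whitney statistic (this is the value pROC and ROCR report, with midranks handling ties). The fold/prob/class data frame below is a made-up stand-in for the Weka export:

```r
# AUC via the Mann-Whitney statistic; rank() uses midranks, so ties are handled
auc <- function(prob, label) {
  r  <- rank(prob)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical stand-in for the exported predictions
d <- data.frame(
  fold  = rep(1:2, each = 4),
  prob  = c(0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9),
  class = c(0, 0, 1, 1, 0, 1, 0, 1)
)

pooled   <- auc(d$prob, d$class)   # AUC over all predictions, ignoring folds
per_fold <- sapply(split(d, d$fold), function(f) auc(f$prob, f$class))
avg_fold <- mean(per_fold)         # per-fold AUCs, averaged
```

Pooling predictions and averaging per-fold AUCs are genuinely different estimators and will generally disagree; tie handling and fold weighting can shift the result further, so small discrepancies between tools are not surprising.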

Related

Using GAMLSS, the difference between fitDist() and gamlss()

When using the GAMLSS package in R, there are many different ways to fit a distribution to a set of data. My data is a single vector of values, and I am fitting a distribution over these values.
My question is this: what is the main difference between using fitDist() and gamlss() since they give similar but different answers for parameter values, and different worm plots?
Also, using the function confint() works for gamlss() fitted objects but not for objects fitted with fitDist(). Is there any way to produce confidence intervals for parameters fitted with the fitDist() function? Is there an accuracy difference between the two procedures? Thanks!
m1 <- fitDist()
fits many distributions and chooses the best according to a
generalized Akaike information criterion, GAIC(k), with penalty k for each
fitted parameter in the distribution, where k is specified by the user,
e.g. k=2 for AIC,
k = log(n) for BIC,
k=4 for a Chi-squared test (rounded from 3.84, the 5% critical value of a Chi-squared distribution with 1 degree of freedom), which is my preference.
m1$fits
gives the full results from the best to worst distribution according to GAIC(k).
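The criterion itself is easy to reproduce outside gamlss. As an illustration (using MASS::fitdistr rather than gamlss itself, and a simulated vector y as a stand-in for your data), GAIC(k) is just -2*logLik plus k times the number of fitted parameters:

```r
library(MASS)  # for fitdistr(); MASS ships with R

set.seed(1)
y <- rgamma(200, shape = 3, rate = 1.5)  # simulated stand-in data

# GAIC(k) = -2*logLik + k * (number of fitted parameters)
gaic <- function(fit, k) -2 * as.numeric(logLik(fit)) + k * length(fit$estimate)

fits <- list(
  gamma     = fitdistr(y, "gamma"),
  lognormal = fitdistr(y, "lognormal"),
  normal    = fitdistr(y, "normal")
)

n <- length(y)
sapply(fits, gaic, k = 2)       # AIC
sapply(fits, gaic, k = log(n))  # BIC
sapply(fits, gaic, k = 4)       # the penalty preferred above
```

The candidate with the smallest GAIC(k) wins, which is the same ranking fitDist() returns in m1$fits.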

How to calculate X-year survival probability from a Cox regression or random survival forest in R

I want to build a survival model and then calculate the X-year (e.g. 10-year) survival probability.
Is there a way to do this using coxph or survreg? Is this possible using random survival forest (e.g. ranger)?
P.S. not sure if it matters, but the data is wide (~100 features, mostly continuous) with 17k samples.
For anyone else trying to do the same: if you build a Cox model with survival::coxph or rms::cph, you can use the function pec::predictSurvProb.
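If you want to stay within the survival package (which ships with R), survfit on a coxph object gives the same X-year survival probabilities directly. The built-in lung data and the covariate values below are stand-ins for the actual dataset:

```r
library(survival)

# Cox model on the built-in lung data (stand-in for the real 17k-sample data)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Survival probability at t = 365 days for two hypothetical patients
nd   <- data.frame(age = c(50, 70), sex = c(1, 2))
sf   <- survfit(fit, newdata = nd)
p365 <- summary(sf, times = 365)$surv  # one entry per newdata row
risk <- 1 - p365                       # X-year risk = 1 - survival probability
```

For a coxph fit, pec::predictSurvProb(fit, newdata = nd, times = 365) should return the same probabilities, and ranger's survival forests expose per-observation survival curves you can read off at t = 365 in the same spirit.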

How to assess the model and prediction of random forest when doing regression analysis?

I know that when random forest (RF) is used for classification, the AUC is normally used to assess the quality of the classification on test data. However, I have no clue which metric to use to assess the quality of regression with RF. Now I want to use RF for regression analysis, e.g. using a matrix with several hundred samples and features to predict the (numerical) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric. How can I tell whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
Afterwards, if the model is fine, I apply it to the test data and get predicted values. How can I assess whether the predicted values are good? I read online that some people calculate the accuracy (formula: 1 - abs(predicted - actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset; are there other ways to assess the accuracy of the predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function which can be used to determine the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One uses a permutation of out of bag data to test the accuracy of the model. The other uses the GINI index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
One more simple check, really more of a sanity check than anything else, is to compare against the so-called best constant model. The best constant model has a constant output, namely the mean of all responses in the test data set, and can be taken as the crudest model possible. Compare the average performance of your random forest against the best constant model on a given test set: if your RF model does not outperform the constant model by a clear margin (say a factor of 3-5 in error), it is not very good.
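The best-constant comparison takes a few lines of base R; the vectors below are hypothetical stand-ins for the RF predictions and the test responses. Note that 1 - MSE/MSE_const is exactly the test-set R-squared, which also shows how % Var explained can come out negative: it happens whenever the model does worse than predicting the mean.

```r
set.seed(42)
actual <- rnorm(100, mean = 10)            # stand-in for test responses
pred   <- actual + rnorm(100, sd = 0.5)    # stand-in for RF predictions

mse       <- mean((pred - actual)^2)
mse_const <- mean((mean(actual) - actual)^2)  # best constant model: predict the mean

ratio <- mse_const / mse    # > 1 means the model beats the constant baseline
r2    <- 1 - mse / mse_const  # test-set R^2; negative = worse than the mean
```

MSE-based measures like this also sidestep the division-by-zero problem of the 1 - abs(predicted - actual)/actual formula when the actual values contain zeros.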

How to calculate Bias and Variance for SVM and Random Forest Model

I'm working on a classification problem (predicting three classes) and I'm comparing SVM against Random Forest in R.
For evaluation and comparison I want to calculate the bias and variance of the models. I've looked up the two terms in many machine learning books and I'd say I understand the meaning of variance and bias (the bullseye diagram being the easiest explanation). But I can't really figure out how to apply them in my case.
Let's say I predict the results for a test set with 4 SVM-models that were trained with 4 different training sets. Each time I get a total error (meaning all wrong predictions/all predictions).
Do I then get the bias for SVM by calculating this?
which would mean that the bias is more or less the mean of the errors?
I hope you can help me with a not too complicated formula, because I've already seen many of them.
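One common recipe (a Domingos-style decomposition for 0-1 loss) trains the same learner on many independent training sets and evaluates all of them on one fixed test set. A sketch in base R, with logistic regression standing in for the SVM/RF and simulated data as assumptions:

```r
set.seed(1)
# Simulated data generator, so independent training sets are cheap to draw
make_data <- function(n) {
  x <- runif(n, -2, 2)
  y <- rbinom(n, 1, plogis(2 * x))
  data.frame(x, y)
}
test <- make_data(500)  # fixed test set

# Train the same learner on 50 independent training sets,
# predicting the fixed test set each time (one column per model)
preds <- replicate(50, {
  train <- make_data(100)
  fit <- glm(y ~ x, family = binomial, data = train)
  as.numeric(predict(fit, newdata = test, type = "response") > 0.5)
})

main_pred <- round(rowMeans(preds))     # majority vote per test point
bias      <- mean(main_pred != test$y)  # error of the "average" model
variance  <- mean(preds != main_pred)   # how often one model disagrees with the vote
```

So the bias is not simply the mean of the four total errors: it is the error of the aggregated (majority-vote) prediction, while the spread of the individual models around that vote is the variance. (As computed here, bias also absorbs the irreducible noise; a fuller decomposition separates that term out.)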

Can't generate correlated random numbers from binomial distributions in R using rmvbin

I am trying to get a sample of correlated random numbers from binomial distributions in R. I tried to use rmvbin and it worked fine with some probabilities:
> rmvbin(100, margprob = c(0.1,0.1), bincorr=0.5*diag(2)+0.5)
while the next call, which is quite similar, raises an error:
> rmvbin(100, margprob = c(0.01,0.01), bincorr=0.5*diag(2)+0.5)
Error in commonprob2sigma(commonprob, simulvals) :
Extrapolation occurred ... margprob and commonprob not compatible?
I can't find any justification for this.
This is a math/stats "problem" and not an R problem (in the sense that it is not a bug but a consequence of the model).
Short version: for bivariate binary data there is a link between the marginal probabilities and the correlation that can be observed. You can see it if you do a bit of boring juggling with the marginal probabilities $p_A$ and $p_B$ and the simultaneous probability $p_{AB}$. In other words: the marginal probabilities put restrictions on the range of allowed correlations (and vice versa), and you are violating this in your call.
For bivariate Gaussian random variables the marginals and the correlations are separate and can be specified independently of each other.
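The restriction can be made concrete in base R. For Bernoulli variables the correlation is $(p_{AB} - p_A p_B) / \sqrt{p_A(1-p_A)p_B(1-p_B)}$, and the Fréchet bounds $\max(0, p_A + p_B - 1) \le p_{AB} \le \min(p_A, p_B)$ translate into bounds on the correlation. A sketch (note: the range rmvbin can actually reach through its internal interpolation tables may be narrower still, which is presumably what the "Extrapolation occurred" error is signalling):

```r
# Feasible correlation range for bivariate binary data with given marginals,
# derived from the Fréchet bounds on the joint probability p_AB
corr_range <- function(pA, pB) {
  s  <- sqrt(pA * (1 - pA) * pB * (1 - pB))
  lo <- (max(0, pA + pB - 1) - pA * pB) / s
  hi <- (min(pA, pB)         - pA * pB) / s
  c(lower = lo, upper = hi)
}

corr_range(0.1, 0.5)    # upper bound 1/3: a correlation of 0.5 is infeasible here
corr_range(0.01, 0.01)  # equal marginals allow high positive correlation in
                        # principle, but barely any negative correlation
```

The asymmetry of the second range (lower bound near 0, upper bound 1) shows how extreme marginals squeeze the attainable correlations, in contrast to the Gaussian case where marginals and correlation are free to vary independently.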
The question should probably be moved to stats exchange.
