I am running into difficulties when using randomForest (in R) for a classification problem. My R code, an image, and data are here:
http://www.psy.plymouth.ac.uk/research/Wsimpson/data.zip
The observer is presented with either a faint image (contrast=con) buried in noise or just noise on each trial. He rates his confidence (rating) that the face is present. I have categorised the rating into a yes/no judgement (y). The face is either inverted (invert=1) or not in each block of 100 trials (one file). I use the contrast (1st column of predictor matrix x) and the pixels (the rest of the columns) to predict y.
It is critical to my application that I have an "importance image" at the end which shows how much each pixel contributes to the decision y. I have 1000 trials (length of y) and 4248 pixels+contrast=4249 predictors (ncols of x). Using glmnet (logistic ridge regression) on this problem works fine:
fit<-cv.glmnet(x,y,family="binomial",alpha=0)
However, randomForest does not work at all:
fit <- randomForest(x=x, y=y, ntree=100)
and it gets worse as the number of trees increases. For invert=1, the classification error for randomForest is 34.3%, and for glmnet it is 8.9%.
Please let me know what I am doing wrong with randomForest, and how to fix it.
Ridge regression's only tuning parameter, lambda, is chosen via internal cross-validation in cv.glmnet, as pointed out by Hong Ooi, and the error rate you get out of cv.glmnet relates to that cross-validation. randomForest gives you an out-of-bag (OOB) error, which is akin to the error on a dedicated test set (and that is what you are actually interested in).
randomForest requires you to calibrate it manually (i.e. use a dedicated validation set to see which parameters work best), and there are a few parameters to consider: the depth of the trees (via fixing the number of examples in each node, or the number of nodes), the number of randomly chosen attributes considered at each split (mtry), and the number of trees. You can use tuneRF to find a good value of mtry, along the lines of the sketch below.
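A minimal sketch, assuming x is your predictor matrix and y your yes/no response (which must be a factor for classification):

library(randomForest)
set.seed(1)
yf <- as.factor(y)                        # classification requires a factor response
# search over mtry, starting from the default and multiplying by stepFactor,
# keeping a new value only if the OOB error improves by at least `improve`
tuned <- tuneRF(x, yf, ntreeTry = 500, stepFactor = 1.5, improve = 0.01,
                trace = TRUE, doBest = TRUE)   # doBest = TRUE returns the forest fitted with the best mtry
tuned$mtry                                # the mtry that was chosen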
When evaluated on the training set, the more trees you add the better your predictions look. On a test set, however, predictive ability typically stops improving after a certain number of trees; beyond that point extra trees do not pay off. randomForest tracks the OOB error as trees are added and, if you supply one, the error on a test set as well. If rf.mod is your fitted RF model, then plot(rf.mod) lets you see roughly at which point the error curve flattens out, as in the sketch below. Note that predict on a fitted forest uses all of the trees that were grown; if you want fewer, refit with a smaller ntree.
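For example (a sketch, using the same x and factor-coded y as above):

rf.mod <- randomForest(x = x, y = yf, ntree = 1000)
plot(rf.mod)                  # OOB (and per-class) error as a function of the number of trees
head(rf.mod$err.rate)         # the same numbers: one "OOB" column plus one column per class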
In short, you are not comparing the two models' performance on the same footing (as pointed out by Hong Ooi), and in addition your parameters might be off and/or you might be overfitting (although that is unlikely with just 100 trees).
I wish to confirm my understanding of the CV procedure in the glmnet package so that I can explain it to a reviewer of my paper. I will be grateful if someone can add information to clarify it further.
Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test data (and thereby further reducing the training data) I went with the lasso, choosing lambda through cross-validation as a means to minimise overfitting. After training the model with cv.glmnet I tested its classification accuracy on the same dataset (bootstrapped 10,000 times for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but the lasso, with its penalty term chosen by cross-validation, should lessen its effect.
My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is:
In each step of 10-fold cross-validation, data were divided randomly into two groups containing 9/10th of the data for training and 1/10th for internal validation (i.e., measuring the binomial deviance/error of the model developed with that lambda). Lambda vs. deviance was plotted. When the process was repeated 9 more times, 95% confidence intervals of lambda vs. deviance were derived. The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance. High lambda is the factor that minimises overfitting because the regression model is not allowed to improve by assigning large coefficients to the variables. The model is then trained on the entire dataset using a least-squares approximation that minimises the model error penalised by the lambda term. Because the lambda term is chosen through cross-validation (and not from the entire dataset), the choice of lambda is somewhat independent of the data.
I suspect my explanation can be improved considerably, or that the experts reading this can point out flaws in the methodology.
Thanks in advance.
A bit late I guess, but here goes.
By default, predictions and coefficients from a cv.glmnet fit use lambda.1se. It is the largest λ at which the cross-validated error (MSE, or deviance for a binomial model) is within one standard error of the minimal error. In terms of overfitting, this usually reduces overfitting by selecting a simpler model (fewer non-zero terms) whose error is still close to that of the model with the least error. You can also check out this post. I am not entirely sure whether this is what you mean by "The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance." The sketch below shows the difference between the two choices of lambda.
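A minimal sketch, assuming x and y are the predictor matrix and binary response already described:

library(glmnet)
set.seed(1)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)   # lasso
cvfit$lambda.min                          # lambda with the lowest cross-validated deviance
cvfit$lambda.1se                          # largest lambda within one SE of that minimum
sum(coef(cvfit, s = "lambda.min") != 0)   # number of non-zero terms
sum(coef(cvfit, s = "lambda.1se") != 0)   # usually fewer: the simpler model
plot(cvfit)                               # deviance vs log(lambda), with error bars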
The main issue with your approach is that you compute its accuracy on the same data the model was trained on. That does not tell you how well the model will perform on unseen data, and bootstrapping that accuracy does not fix the problem. For an estimate of the error you should use the error from the cross-validation itself. If a model trained on 90% of the data does not work, I don't see why one trained on all of the data would.
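Continuing the sketch above, the cross-validated error at the chosen lambda is already stored in the cv.glmnet object:

cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")   # CV misclassification rate
cvfit$cvm[cvfit$lambda == cvfit$lambda.1se]    # CV error at lambda.1se
cvfit$cvsd[cvfit$lambda == cvfit$lambda.1se]   # its standard error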
I am using R 3.2.0 with lme4 version 1.1-8 to run a mixed-effects logistic regression model on some binomial data (coded as 0 and 1) from a psycholinguistic experiment. There are 2 categorical predictors (one with 2 levels and one with 3 levels) and two random terms (participants and items). I am using sum coding for the predictors (i.e. contr.sum), which gives me the effects and interactions that I am interested in.
I find that the full model (with fixed effects and interactions, plus random intercepts AND slopes for the two random terms) converges ONLY when I specify (optimizer="bobyqa"). If I do not specify the optimizer, the model converges only after simplifying the model drastically. The same thing happens when I use the default treatment coding, even when I specify optimizer="bobyqa".
My first question is: why is this happening, and can I trust the output of the full model?
My second question is whether this might be due to the fact that my data is not fully balanced, in the sense that my conditions do not have exactly the same number of observations. Are there special precautions one must take when the data is not fully balanced? Can anyone suggest any reading on this particular case?
Many thanks
You should take a look at the ?convergence help page of more recent versions of lme4 (or you can read it here). If the two fits using different optimizers give similar estimated parameters (despite one giving convergence warnings and the other not), and the fits with different contrasts give the same log-likelihood, then you probably have a reasonable fit.
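A minimal sketch of that comparison, assuming a data frame dat with a binary response resp, fixed factors A and B, and random factors subj and item (all names hypothetical):

library(lme4)
m_default <- glmer(resp ~ A * B + (1 + A | subj) + (1 + A | item),
                   data = dat, family = binomial)
m_bobyqa  <- update(m_default, control = glmerControl(optimizer = "bobyqa"))
cbind(default = fixef(m_default), bobyqa = fixef(m_bobyqa))   # similar estimates?
c(logLik(m_default), logLik(m_bobyqa))                        # similar log-likelihoods?
# recent lme4 versions also provide allFit(), which refits with every available optimizer:
# summary(allFit(m_default))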
In general, lack of balance lowers statistical power and makes fitting more difficult, but mildly to moderately unbalanced data should present no particular problems.
Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross-validation, and there we can specify the cost function via the cost argument. By specifying the cost function, we can assign different unit costs to the different types of error: predicting Yes when the reference is No, or predicting No when the reference is Yes.
I am wondering if I could achieve the same in an SVM. In other words, is there a way for me to specify a custom cost (loss) function instead of using the built-in loss function?
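For reference, a minimal sketch of the cv.glm usage described above, assuming a data frame dat with a 0/1 response y (names hypothetical); cv.glm passes the observed responses and the cross-validated predicted probabilities to the cost function:

library(boot)
fit <- glm(y ~ ., data = dat, family = binomial)
# a false "No" (truth is Yes) costs 5 times as much as a false "Yes"
asym_cost <- function(obs, pred_p) {
  mean(ifelse(obs == 1 & pred_p < 0.5, 5,
       ifelse(obs == 0 & pred_p >= 0.5, 1, 0)))
}
cv.glm(dat, fit, cost = asym_cost, K = 10)$delta   # cross-validated expected cost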
Besides the answer by Yueguoguo, there are also three more solutions: the standard wrapper approach, shifting the hyperplane threshold, and the built-in class weights in e1071.
The wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model, if trained to optimise accuracy on the resampled data, is then optimal under the original costs. A sketch of the idea follows.
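A minimal R sketch of cost-proportional oversampling, assuming a data frame dat with a factor response y whose positive class "yes" is five times as costly to misclassify (the names and the cost ratio are hypothetical):

pos <- dat[dat$y == "yes", ]
neg <- dat[dat$y == "no", ]
# replicate the costly class in proportion to its misclassification cost
dat_w <- rbind(neg, pos[rep(seq_len(nrow(pos)), times = 5), ])
# an accuracy-optimising classifier trained on dat_w now approximates the cost-sensitive rule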
The second idea is frequently used in text mining. An SVM's classifications are derived from the (signed) distance to the hyperplane. For linearly separable problems this distance is {1, -1} for the support vectors. The classification of a new example is then basically whether its distance is positive or negative. However, one can also shift this threshold: instead of making the decision at 0, move it, for example, towards 0.8. That way the classifications are shifted in one direction or the other, while the data themselves are not altered.
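A minimal e1071 sketch of that shift, again assuming a data frame dat with a factor response y (hypothetical names); which class corresponds to positive decision values depends on the factor level ordering, so check it on the training data first:

library(e1071)
m <- svm(y ~ ., data = dat, kernel = "linear")
pred <- predict(m, newdata = dat, decision.values = TRUE)
d <- attr(pred, "decision.values")[, 1]
# default rule: threshold at 0; shifted rule: require d > 0.8 before predicting the positive side
pred_shifted <- ifelse(d > 0.8, levels(dat$y)[1], levels(dat$y)[2])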
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, such as class.weights in the e1071 implementation. The name is due to the fact that the term "cost" is already taken (it is the C parameter of the SVM).
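For example (a sketch with the same hypothetical dat and class labels as above):

library(e1071)
m_w <- svm(y ~ ., data = dat, class.weights = c(yes = 5, no = 1))   # per-class weights on C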
The loss function over the SVM hyperplane parameters is handled automatically, thanks to the theoretical foundation of the algorithm; what you tune via cross-validation are the hyperparameters. Say an RBF kernel is used: cross-validation then selects the combination of C (cost) and gamma (the kernel parameter) that gives the best performance, measured by some metric (e.g., classification error or mean squared error). In e1071 this can be done with the tune function, where the range of hyperparameters as well as the cross-validation settings (i.e., 5-fold, 10-fold or more) can be specified, as sketched below.
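A minimal sketch, assuming a data frame dat with a factor response y (hypothetical names and parameter ranges):

library(e1071)
set.seed(1)
tuned <- tune(svm, y ~ ., data = dat,
              ranges = list(cost = 2^(-2:4), gamma = 2^(-6:0)),
              tunecontrol = tune.control(cross = 10))
tuned$best.parameters    # chosen C and gamma
tuned$best.performance   # cross-validated error of that configuration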
To compare configurations using an area-under-the-curve (AUC) type of error measurement, one can train models with different hyperparameter configurations and then validate each against held-out, pre-labelled data.
Hope the answer helps.
I'm using the randomForest package in R to predict a binary class based upon 11 numerical predictors. Of the two classes, Hit or Miss, the class Hit is of more importance, i.e. I would like to know how many Hit cases are correctly predicted.
Is there a way to give Hit a higher importance when training the random forest? Currently the trained random forest predicts merely 7% of the Hit cases correctly, and I would definitely like an improvement.
Higher importance? I don't know how to tell any algorithm "I'm not kidding this time: I want this analysis to be accurate."
You're always fighting the variance versus bias battle. If you improve the training accuracy too much, you run the risk of overfitting.
You can adjust a random forest by varying the size of the random sample of predictors. If you have m predictors, the usual recommendation for classification is mtry = sqrt(m), the number of predictors sampled at each split. You can also vary the number of trees. Plot the test (or OOB) classification error versus the number of trees for different values of mtry to see how you do, as in the sketch below.
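A minimal sketch using the OOB error, assuming your 11 predictors are in a data frame X and the Hit/Miss response is a factor y (hypothetical names):

library(randomForest)
mtry_vals <- c(2, 3, 4, 6)
errs <- sapply(mtry_vals, function(p) {
  rf <- randomForest(X, y, mtry = p, ntree = 1000)
  rf$err.rate[, "OOB"]                     # OOB error after 1..ntree trees
})
matplot(errs, type = "l", xlab = "number of trees", ylab = "OOB error")
legend("topright", legend = paste("mtry =", mtry_vals), lty = 1:4, col = 1:4)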
You can also try another algorithm, like gbm (Generalized Boosted Regression Models) or support vector machines.
How does your data look when you plot it? Any obvious groups jumping out at you when you look at them in scatterplots?
Regardless of the algorithm, I'd advise that you do n-fold cross-validation of your model.
I want to compare the curve fits of three models by R-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates R-squared values; they report "residual std error" and "residual sum of squares" instead.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be Poisson distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data is binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least-squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. fitted y, and by looking at a normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and probably a different model is indicated.
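A minimal sketch of those two checks, assuming fit is one of your nls (or drc) model objects:

r <- residuals(fit) / sd(residuals(fit))   # crude standardisation of the residuals
plot(fitted(fit), r, xlab = "fitted values", ylab = "standardised residuals")
abline(h = 0, lty = 2)
qqnorm(r); qqline(r)                       # normal Q-Q plot of the residuals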
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio SSR/SSE, corrected for the degrees of freedom in the regression (R) and the residuals (E). A model with more parameters will generally have a smaller residual SS, but that does not by itself make it a better model. The F-statistic accounts for this: models with more parameters have more regression degrees of freedom and fewer residual degrees of freedom, which makes the F-statistic smaller.
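For nested nls fits, anova() performs exactly this kind of F-test comparison; a sketch with a hypothetical data frame dat, formula and starting values:

fit_reduced <- nls(y ~ a * exp(b * x), data = dat, start = list(a = 1, b = -0.1))
fit_full    <- nls(y ~ a * exp(b * x) + d, data = dat, start = list(a = 1, b = -0.1, d = 0))
anova(fit_reduced, fit_full)   # F statistic and p-value for the extra parameter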
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.