Which are the best metrics to evaluate the fit of a GBM algorithm in R (metrics, graphs, ratios)? And how interpret them?
I think maybe you are overthinking this one! Take a step back and think about what matters... the error. You have forecasted values and you have observed values. the difference tells you most of what you need to know when comparing across models. Basic measures like MSE, MPE, etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary(), to get the relative influence of each of your variables. Additionally, you can find a lot of information in the documentation, so if you haven't taken a look, I would recommend doing so! I have posted the link at the bottom.
-Carmine
gbm_documentation
Related
Colleagues, I have a graph-1 and I need to predict the value of another graph-2 based on its data
graphs have a correlation, that's for sure - using machine learning, I can predict graph-2 according to graph-1, but I would like to have a mathematical formula for which the prediction will be
My plan is just to make an approximation, there are цуи-sites where mathematical formulas are automatically selected and, as a result, take a formula that has the least average approximation error,% well, then use this formula and see
maybe there is a smarter way
Please see the image
I have recently run an ensemble classifier in MLR (R) of a multicenter data set. I noticed that the ensemble over three classifiers (that were trained on different data modalities) was worse than the best classifier.
This seemed to be unexpected to me. I was using logistic regressions (without any parameter optimization) as simple classifier and a Partial Least Squares (PLS) Discriminant Analysis as a superlearner, since the base-learner predictions ought to be correlated. I also tested different superlearners like NB, and logistic regression. The results did not change.
Here are my specific questions:
1) Do you know, whether this can in principle occur?
(I also googled a bit and found this blog that seems to indicate that it can:
https://blogs.sas.com/content/sgf/2017/03/10/are-ensemble-classifiers-always-better-than-single-classifiers/)
2) Especially, if you are as surprised as I was, do you know of any checks I could do in mlr to make sure, that there isnt a bug. I have tried to use a different cross-validation scheme (originally I used leave-center-out CV, but since some centers provided very little data, I wasnt sure, whether this might lead to weird model fits of the super learner), but it still holds. I also tried to combine different data modalities and they give me the same phenomenon.
I would be grateful to hear, whether you have experienced this and if not, whether you know what the problem could be.
Thanks in advance!
Yes, this can happen - ensembles do not always guarantee a better result. More details regarding cases where this can happen are discussed also in this cross-validate question
I have a set of data and I'd like to know whether this data set has a logistic distribution.
When I made a histogram of my data set (see the histogram on http://imageshack.us/photo/my-images/593/histogram.png/) it seems to have a logistic distribution, but to be sure I'd like to test for a logistic distribution in R. So my question is: Is there a way to test your data for a logistic distribution and how do you do this?
Additional information: The data set consists of 8544 items. The data are horizontal distances in km between 2 geographical points.
Thanks for your attention
Sander
In R you can use the ks.test or chisq.test functions (and probably others) to test against a hypothesized distribution. Note that these tests (and others) are all rule out tests, a non-significant result does not guarentee that the data come from the given distribution, just that you cannot rule it out. Also note that with a sample size of 8544 these tests are likely to be way overpowered, meaning that it will have power to find slight meaningless differences and you are likely to reject the null hypothesis even though it is "close enough". Also the fact that you decided on a distribution based on looking at the data first could bias results.
Another approach that may give you a better feel for if a logistic distribution is "close enough" rather than exactly is to use the vis.test function in the TeachingDemos package (be sure to read the paper referenced in the help page to understand the test and what assumptions you are making).
Most importantly is understanding the science that leads to the data, does a logistic distribution make sense scientifically? what other distributions could be reasonble? Also understand what question(s) you are trying to answer with the data and what is the effect on those answers of the distribution (e.g. the CLT will let you use the normal to answer some questions, but not others, using a normal distribution even though the data comes from a logistic or something similar).
Can anyone give me a good explanation for what the parameter "algorithm" does in the nls function in R?
Also, how does the formula work? I know it uses a tilda, but I can't really find a down-to-earth explanation of it.
Also, how important are the start values? Do I need to try multiple start values, or can I still have a guarantee that nls will find the correct parameters regardless of the start values I use?
In brief:
nls() is going to vary parameters to try to minimize the square error between your model and your data. There's several good methods it can try to find the minimum. Reading the details about "method" in ?optim will provide some good info and references.
In general, for nonlinear models, your results can be sensitive to initial guess. You should try several different guesses to make sure that the outputs are close. If your results are very sensitive to your guess, you can try re-parameterizing, using a different algorithm, or rethinking your model.
As for the formula, I'd echo the previous answer. Work through the examples in the bottom of ?nls and then try to ask a more specific question.
I am trying to manually calculate the standard error of the constant in an ARIMA model, if it is included. I have referred to Box and Jenkins (1994) text, specially Section 7.2, but my understanding is that the methods mentioned here calculates the variance-covariance matrix for the ARIMA parameters only, not the constant. Tried searching on the Internet, but couldn't find any theory. Software like Minitab, R etc. calculate this, so I was wondering what is the way? Can someone provide any pointer(s) on this topic?
Thanks.
arima() will fit a regression model with ARMA errors. The constant is treated as the coefficient of a regression variable consisting only of 1s. So you need the covariance matrix of the regression coefficients which is usually calculated separately from the covariance matrix of the ARMA coefficients. Look at Section 8.3 of Hamilton's "Time series analysis"
One of the nicest things about R is that you can access a lot of the source code to R itself from within the environment. If you simply type arima at the command prompt, you get the high-level source code for the arima() function. I got several pages of code here, when I tried it.
You do miss out on anything implemented internally within the R executable in native code, but often the high-level code tells you everything you want to know.
Perhaps a shift of perspective can solve this problem.
Rather than seeing the constant as something special, just consider the problem without constant and with a variable that is a vector of ones.