How can one calculate ROC AUCs in complex designs with clustering in R?

The packages I've found so far for calculating AUCs do not account for sample clustering, which inflates standard errors compared to simple random sampling. I wonder whether the standard errors these packages report could be recalculated to allow for clustering.
Thank you.

Your best bet is probably replicate weights, as long as you can get point estimates of AUC that incorporate weights.
If you convert your design into a replicate-weights design object (using survey::as.svrepdesign()), you can then evaluate any R function or expression under each set of replicate weights with survey::withReplicates(), which returns the estimate together with a standard error.
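A minimal sketch of that approach, under stated assumptions: the data are simulated, the column names (cluster, score, y, wt) are made up for illustration, and weighted_auc() is a hypothetical helper that computes a weighted pairwise AUC (it is not part of the survey package). Any AUC estimator that accepts observation weights could take its place.

```r
library(survey)

# Hypothetical helper: weighted share of (case, control) pairs in which the
# case has the higher score, with ties counted as one half.
weighted_auc <- function(w, score, y) {
  pos <- y == 1
  neg <- y == 0
  higher <- outer(score[pos], score[neg], ">")
  tied   <- outer(score[pos], score[neg], "==")
  pair_w <- outer(w[pos], w[neg])
  sum((higher + 0.5 * tied) * pair_w) / (sum(w[pos]) * sum(w[neg]))
}

# Simulated clustered sample (all names and values are illustrative).
set.seed(1)
n_clusters <- 30
df <- data.frame(cluster = rep(seq_len(n_clusters), each = 10))
df$score <- rnorm(nrow(df)) + rnorm(n_clusters)[df$cluster]  # shared cluster effect
df$y     <- rbinom(nrow(df), 1, plogis(df$score))
df$wt    <- runif(nrow(df), 1, 3)

design     <- svydesign(ids = ~cluster, weights = ~wt, data = df)
rep_design <- as.svrepdesign(design)  # replicate type left at the package default

# theta receives one vector of weights at a time plus the design's data frame.
auc_est <- withReplicates(
  rep_design,
  function(w, data) weighted_auc(w, data$score, data$y)
)
print(auc_est)
```

The pairwise helper uses outer(), so it is only practical for modest sample sizes; a sort-based weighted AUC would scale better.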

Related

R: Evaluate Gradient Boosting Machines (GBM) for Regression

Which are the best metrics to evaluate the fit of a GBM algorithm in R (metrics, graphs, ratios)? And how should they be interpreted?
I think maybe you are overthinking this one! Take a step back and think about what matters: the error. You have forecasted values and you have observed values. The difference between them tells you most of what you need to know when comparing across models. Basic measures like MSE, MPE, etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary() to get the relative influence of each of your variables. You can find a lot more in the documentation, so if you haven't taken a look, I would recommend doing so! I have posted the link at the bottom.
-Carmine
gbm_documentation
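As a small illustration of the error measures mentioned above, here is a base-R sketch. The observed and predicted vectors are hypothetical placeholders; in practice the predictions would come from predict() on a fitted model and a held-out data set.

```r
# Hypothetical observed and predicted values.
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 36)

err <- observed - predicted

metrics <- c(
  MSE  = mean(err^2),                 # mean squared error
  RMSE = sqrt(mean(err^2)),           # root mean squared error
  MAE  = mean(abs(err)),              # mean absolute error
  MPE  = 100 * mean(err / observed)   # mean percentage error (signed)
)
print(metrics)
```

MPE keeps the sign of the errors, so it indicates bias (systematic over- or under-prediction), while MSE and MAE measure the overall size of the errors.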

Quantile Regression with Time-Series Models (ARIMA-ARCH) in R

I am working on quantile forecasting with time-series data. The model I am using is ARIMA(1,1,2)-ARCH(2) and I am trying to get quantile regression estimates of my data.
So far, I have found the "quantreg" package for quantile regression, but I have no idea how to specify an ARIMA-ARCH model as the model formula in the rq function.
The rq function seems to work for regressions with dependent and independent variables, but not for time series.
Is there some other package that accepts time-series models and performs quantile regression in R? Any advice is welcome. Thanks.
I just put an answer on the Data Science forum.
It basically says that most of the ready-made packages use so-called exact tests, which rest on an assumption about the distribution (independent, identically distributed normal (Gaussian) errors, or something wider).
There is also a family of resampling methods, in which you simulate a sample with a distribution similar to that of your observed sample, fit your ARIMA(1,1,2)-ARCH(2) model, and repeat the process a great number of times. Then you analyze this great number of forecasts and measure (as opposed to compute) your confidence intervals.
The resampling methods differ in how they generate the simulated samples. The most used are:
The Jackknife: you "forget" one point at a time, that is, you simulate n samples of size n-1 (where n is the size of the observed sample).
The Bootstrap: you simulate a sample by drawing n values from the original sample with replacement: some will be taken once, some twice or more, some never.
It is a (not easy) theorem that the expectations of the confidence intervals, like those of most of the usual statistical estimators, are the same on the simulated samples as on the original sample. The difference is that you can measure them with a great number of simulations.
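The two resampling schemes can be sketched in base R as follows. This is only an illustration on simulated, independent data, with a 5% sample quantile standing in for the statistic of interest. Plain resampling of the observations ignores serial dependence, so for a real time series one would more likely resample model residuals or blocks; that step is not shown here.

```r
set.seed(1)
x <- rnorm(200)   # simulated stand-in for the observed sample
n <- length(x)

# Statistic of interest (here: the 5% sample quantile).
stat <- function(v) quantile(v, 0.05, names = FALSE)

# Jackknife: n samples of size n - 1, each leaving one point out.
jack <- vapply(seq_len(n), function(i) stat(x[-i]), numeric(1))

# Bootstrap: samples of size n drawn with replacement.
boot <- replicate(2000, stat(sample(x, n, replace = TRUE)))

# "Measured" 95% interval: empirical quantiles of the bootstrap replicates.
ci <- quantile(boot, c(0.025, 0.975), names = FALSE)
print(ci)
```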
Hello and welcome to Stack Overflow. Please take some time to read the help page, especially the sections named "What topics can I ask about here?" and "What types of questions should I avoid asking?". More importantly, please read the Stack Overflow question checklist. You might also want to learn about Minimal, Complete, and Verifiable Examples.
I can try to address your question, although this is hard since you don't provide any code or data. Also, I guess that by "put ARIMA-ARCH models" you actually mean that you want to make an integrated series stationary using an ARIMA(1,1,2) filter plus an ARCH(2) filter.
For an overview of R's time-series capabilities, you can refer to the CRAN Task View on time series.
You can easily apply these filters in R with an appropriate function.
For instance, you could use the Arima() function from the forecast package, then compute the residuals with residuals() from the stats package. Next, you can use this filtered series as input for the garch() function from the tseries package. Other approaches are of course possible. Finally, you can apply quantile regression to the filtered series. For instance, you can check out the dynrq() function from the quantreg package, which allows time-series objects in the data argument.
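The steps above might be sketched as follows. Everything here is an assumption made for illustration: the series is simulated, and the quantile regression of the ARIMA residuals on their own first lag at tau = 0.05 is just one possible specification, not necessarily the one intended in the question.

```r
library(forecast)  # Arima()
library(tseries)   # garch()
library(quantreg)  # dynrq()

set.seed(42)
# Simulated integrated series standing in for the real data.
y <- ts(cumsum(arima.sim(model = list(ar = 0.5, ma = c(0.3, 0.2)), n = 500)))

# Step 1: ARIMA(1,1,2) filter, then take the residuals.
arima_fit <- Arima(y, order = c(1, 1, 2))
res <- residuals(arima_fit)

# Step 2: ARCH(2) on the residuals. tseries::garch() takes
# order = c(GARCH order, ARCH order), so ARCH(2) is c(0, 2).
arch_fit <- garch(res, order = c(0, 2), trace = FALSE)

# Step 3: quantile regression on the filtered series
# (5% conditional quantile given the first lag).
q_fit <- dynrq(res ~ L(res, 1), tau = 0.05)
print(coef(q_fit))
```

Because the simulated residuals are homoskedastic, garch() may report warnings here; with real data showing volatility clustering the ARCH step is more meaningful.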

how to perform semi-supervised k-mean clustering

I am new to R and am trying to perform semi-supervised k-means clustering. I plan to use 2/3 of my data as a training set and 1/3 as a test set. My objective is to train a model using the known clusters and then propagate the trained model to the test set. The propagation result will be compared with the prior clusters, so that I can check the prediction accuracy of k-means clustering. Is there a way to do semi-supervised k-means clustering in R, and is any package needed?
Thank you.
Regards,
Use kmeans(). It comes with the stats package, which is part of every standard R installation. You can read how to use a function by putting a ? before its name, e.g. ?kmeans.
Search online if you're still lost about how to use the function; there are plenty of guides and toy examples.
M
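One possible reading of the workflow in the question, sketched with kmeans() and the built-in iris data: the known labels are used only to seed the cluster centres on the training set, and test points are then assigned to the nearest fitted centroid. Seeding with class means and the nearest-centroid step are assumptions made for this illustration; kmeans() itself has no predict method.

```r
set.seed(1)
x <- as.matrix(iris[, 1:4])
labels <- iris$Species

# 2/3 training, 1/3 test split.
train_idx <- sample(nrow(x), size = round(2 / 3 * nrow(x)))
train_x <- x[train_idx, ]
test_x  <- x[-train_idx, ]
train_y <- labels[train_idx]
test_y  <- labels[-train_idx]

# Seed k-means with the mean of each known class (one row per class),
# so that cluster i corresponds to class level i.
init_centers <- apply(train_x, 2, function(col) tapply(col, train_y, mean))
km <- kmeans(train_x, centers = init_centers)

# Propagate to the test set: squared distance to each fitted centroid,
# then pick the nearest one.
sq_dist <- apply(km$centers, 1, function(ctr) rowSums(sweep(test_x, 2, ctr)^2))
pred <- factor(levels(labels)[apply(sq_dist, 1, which.min)], levels = levels(labels))

accuracy <- mean(pred == test_y)
print(table(predicted = pred, actual = test_y))
```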

Handling case weight in the Random Forest packages in R

I checked both the randomForest and the randomForestSRC (rfsrc) packages in R, but couldn't find an easy way to apply observation/case weights when training the random forest model. Is there any way to do this?
As an alternative I thought about replicating my observations (e.g. replicate once if the observation has a weight of 2), but I think this would be inefficient, and difficult for non-integer case weights.
You could use the tree package which allows you to weight individual observations. This would of course only give you a single tree, so you would have to make the random forest yourself.
It might be a little more work, but it's probably a better solution than replicating observations.
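A rough sketch of what "making the forest yourself" could look like with the tree package. The case weights are invented for illustration, and this version only bags weighted trees on bootstrap samples; it does not sample candidate features at each split, so it is a simplification of a true random forest.

```r
library(tree)

set.seed(1)
dat <- iris
case_w <- runif(nrow(dat), 0.5, 2)  # hypothetical non-integer case weights

# Fit one weighted tree on a bootstrap sample of the rows.
fit_one_tree <- function(data, w) {
  idx  <- sample(nrow(data), replace = TRUE)
  boot <- data[idx, ]
  wb   <- w[idx]
  tree(Species ~ ., data = boot, weights = wb)
}

forest <- lapply(1:25, function(i) fit_one_tree(dat, case_w))

# Majority vote across trees (one column of class predictions per tree).
votes <- sapply(forest, function(fit) {
  as.character(predict(fit, newdata = dat, type = "class"))
})
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
print(mean(pred == dat$Species))  # in-sample agreement, for illustration only
```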

How do you use the rugarch package to include the stable distribution

In R, I would like to use the rugarch and stabledist/fBasics packages together to fit a univariate time series as an ARMA(1,1)-GARCH(1,1) process whose innovation term (conditional distribution) follows a stable distribution. Is there a way to do this, given that the fBasics package provides a dstable() function, which I'm guessing would be used to optimise a maximum-likelihood function?
As a follow-up, how would one go about simulating several thousand iterations of x-day forward returns, assuming they follow the same process? (I'm guessing this would use the rstable() function with the parameters estimated above.)
Any other packages that you think would do the job better would gladly be looked at as well.
Yes, you can use dstable and rstable, but they come from the stabledist package.
If you want to estimate the stable parameters, you can use
fBasics::stableFit(data, type="mle")
to get the MLE estimates, but this usually takes a few minutes to compute.
Faster, but a little less precise, is the quantile method (the default for stableFit, i.e. don't specify the type).
Once you have the fit, you can extract the resulting estimates from result@fit$estimate and use them in rstable to draw random variates.
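A short sketch of the fit-then-simulate step, using simulated data in place of real returns. It assumes the estimates are stored in the order alpha, beta, gamma, delta and that stableFit() and rstable() use compatible parametrizations; both points are worth checking against the package documentation. It covers only the stable distribution itself, not the ARMA-GARCH part of the question.

```r
library(fBasics)     # stableFit()
library(stabledist)  # rstable()

set.seed(1)
# Simulated stand-in for the observed data.
x <- stabledist::rstable(1000, alpha = 1.7, beta = 0, gamma = 1, delta = 0)

# Quantile method (fast); doplot = FALSE suppresses the diagnostic plot.
fit <- stableFit(x, type = "q", doplot = FALSE)
est <- fit@fit$estimate  # assumed order: alpha, beta, gamma, delta

# Draw new variates from the fitted distribution.
sims <- stabledist::rstable(5000,
                            alpha = est[[1]], beta  = est[[2]],
                            gamma = est[[3]], delta = est[[4]])
print(est)
```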
