I want to extract the per-tree predictions for each observation from an rfsrc object. In other words, for i trees and j observations, I want to extract an [i, j] matrix of predictions. My goal is to calculate prediction confidence intervals using the R code found at https://github.com/swager/randomForestCI. My analysis requires a competing risks random forest; otherwise I would have used the randomForest package, which makes the per-tree predictions more obvious to extract.
I appreciate any assistance.
EDIT: I am attempting to follow the procedure outlined here: http://blog.revolutionanalytics.com/2016/03/confidence-intervals-for-random-forest.html
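A minimal sketch of one way this might be done, assuming a recent version of randomForestSRC whose predict.rfsrc supports the get.tree argument (which restricts the ensemble to selected trees). The wihs competing-risk data and the choice of the cause-1 column of $predicted are illustrative assumptions, not from the original post:
library(randomForestSRC)
data(wihs, package = "randomForestSRC")   # bundled competing-risk example data
obj <- rfsrc(Surv(time, status) ~ ., data = wihs, ntree = 100)
# Request the ensemble of one tree at a time via get.tree, then stack the
# results into an [i, j] (tree x observation) matrix. For competing risks the
# per-tree prediction must be summarized somehow; here we take the cause-1
# column of $predicted as that summary.
pred_by_tree <- t(sapply(seq_len(obj$ntree), function(b) {
  predict(obj, get.tree = b)$predicted[, 1]
}))
dim(pred_by_tree)                         # ntree x n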
I have a data set where observations come from highly distinct groups. Each group may have a wildly different distribution, so I am trying to find the best distribution using fitdist from fitdistrplus, then use gamlssML from the gamlss package to find the best parameters.
My issue is with transforming the data after this step. For some of the distributions, like the Box-Cox t, I can find the equation for normalizing the data using the BCT coefficients, but for many of these distributions I cannot.
Does gamlss have a function that normalizes the data after fitting? The documentation only provides the transformations for a small number of distributions: https://www.gamlss.com/wp-content/uploads/2018/01/DistributionsForModellingLocationScaleandShape.pdf
Thanks a lot
The normalised data values (for any distribution) are exactly equal to the normalized quantile residuals from a gamlss fit,
m1 <- gamlss(...)
which can be accessed by
residuals(m1)
or
m1$residuals
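A minimal illustration (the Box-Cox t sample here is simulated, purely for demonstration): fit a distribution with gamlss and check that the residuals look standard normal:
library(gamlss)
set.seed(1)
y <- rBCT(500, mu = 5, sigma = 0.2, nu = 1, tau = 5)  # simulated Box-Cox t data
m1 <- gamlss(y ~ 1, family = BCT)
z <- residuals(m1)        # normalized quantile residuals
qqnorm(z); qqline(z)      # approximately standard normal if the fit is adequate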
In order to compare two survival curves at a fixed point in time, essentially performing a two-sample test, I need to extract the sample variance of the estimate at a given point in time.
For an object created with the svykm function from Thomas Lumley's survey package in R, this should be accessible in the varlog list. Do the entries in this list constitute the transformed variances on the log scale or the untransformed variances?
I have read the documentation for the survey package, but could not come to a firm conclusion. I note that confidence intervals are computed on the log(survival) scale, following the default in the survival package, and that their bounds are given as exp(log(x$surv)+1.96*sqrt(x$varlog)) and exp(log(x$surv)-1.96*sqrt(x$varlog)) in the package documentation.
They are variances on the log scale.
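So to get the variance on the survival scale you can apply the delta method: Var(S(t)) ≈ S(t)^2 * Var(log S(t)). A sketch, using the pbc data purely for illustration (an equal-probability design, and t = 1000 is an arbitrary choice):
library(survey)
library(survival)
data(pbc, package = "survival")
des <- svydesign(ids = ~1, data = pbc)   # equal-probability design, illustration only
s1 <- svykm(Surv(time, status > 0) ~ 1, design = des, se = TRUE)
i <- which.min(abs(s1$time - 1000))      # index of the time closest to t = 1000
s1$varlog[i]                             # variance of log S(t)
s1$surv[i]^2 * s1$varlog[i]              # delta-method variance of S(t) itself
# For a two-sample z test at a fixed time, with two curves fitted via
# svykm(Surv(time, status > 0) ~ group, ...):
#   z <- (log(S1) - log(S2)) / sqrt(varlog1 + varlog2)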
I am new to machine learning and R. I want to fit a statistical model to predict daily hours of electricity supply (y) from several x variables. I have three goals to achieve:
I want to use some sort of regularization to choose the x variables that should go in the model.
y is bounded between 0 and 24. So I want the predictions to also be bounded within this range.
The data has spatial attributes and I want to use spatial cross-validation to re-sample while tuning regularization parameters.
I am planning to use the mlr package in R. Which learner can I use that can achieve the above three goals?
Many thanks.
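One possible setup, sketched under assumptions: a hypothetical data frame dat with response hours and coordinate columns x and y; glmnet as the regularized learner; a logit rescaling of y to keep predictions within (0, 24); and mlr's spatial resampling "SpRepCV". None of these choices are from the original post:
library(mlr)
# assumes a hypothetical data frame `dat` with response `hours` (0-24),
# predictor columns, and coordinates `x`, `y`
eps <- 1e-3   # keep 0 and 24 off the boundary before the logit
dat$hours_logit <- qlogis(pmin(pmax(dat$hours / 24, eps), 1 - eps))
task <- makeRegrTask(data = dat[, setdiff(names(dat), c("hours", "x", "y"))],
                     target = "hours_logit",
                     coordinates = dat[, c("x", "y")])  # enables spatial resampling
lrn <- makeLearner("regr.glmnet")   # elastic net: the penalty does variable selection
ps <- makeParamSet(
  makeNumericParam("s", lower = -5, upper = 0, trafo = function(x) 10^x),
  makeNumericParam("alpha", lower = 0, upper = 1)
)
rdesc <- makeResampleDesc("SpRepCV", folds = 5, reps = 2)  # spatial cross-validation
tr <- tuneParams(lrn, task, rdesc, par.set = ps,
                 control = makeTuneControlRandom(maxit = 20))
# back-transform predictions to hours: 24 * plogis(pred)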
I want to compute an unsupervised random forest classification from a raster stack in R. The stack covers the same extent in different spectral bands, and I want to obtain an unsupervised classification of it.
I am having problems with my code because my dataset is very large. Is it okay to just convert the stack into a data frame in order to run the random forest algorithm, like this:
stack_median <- stack(b1_mosaic_median, b2_mosaic_median, b3_mosaic_median, b4_mosaic_median, b5_mosaic_median, b7_mosaic_median)
stack_median_df <- as.data.frame(stack_median)
Here is the data as a csv file (https://www.dropbox.com/s/gkaryusnet46f0i/stack_median_df.csv?dl=0) - and you can read it in via:
stack_median_df<-read.csv(file="stack_median_df.csv")
stack_median_df<-stack_median_df[,-1]
stack_median_df_na <- na.omit(stack_median_df)
My next step would be the unsupervised classification:
median_rf <- randomForest(x = stack_median_df_na, importance = TRUE, proximity = FALSE, ntree = 500)  # no y supplied, so the forest runs in unsupervised mode (there is no type = unsupervised argument)
Due to my huge dataset a proximity matrix can't be calculated (it would need around 6000 GB). Do you know how I can have a look at the resulting classification? predict(median_rf) and plot(median_rf) don't return anything.
I am happy for any suggestion, improvement, or code snippet for an unsupervised random forest classification with its accuracy measures.
Thanks a lot!
I think you could run the unsupervised classification on a large sample, then fit a supervised classification model that predicts those classes from the raw data (it should fit very well), and apply that model to the entire data set.
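A sketch of that idea, with an assumed sample size and number of classes (5,000 rows and k = 5 are arbitrary choices; the unsupervised forest needs proximity = TRUE on the sample, which is feasible at that size):
library(randomForest)
library(cluster)
set.seed(42)
idx <- sample(nrow(stack_median_df_na), 5000)
samp <- stack_median_df_na[idx, ]
urf <- randomForest(x = samp, ntree = 500, proximity = TRUE)  # no y: unsupervised mode
cl <- pam(as.dist(1 - urf$proximity), k = 5, diss = TRUE)     # cluster the proximities
samp$class <- factor(cl$clustering)
srf <- randomForest(class ~ ., data = samp, ntree = 500)      # supervised model of the clusters
full_pred <- predict(srf, newdata = stack_median_df_na)       # classify every pixel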
I am producing ensemble forecasts for a quantity, with around 20 forecast values at each observation point. An event will be defined by a threshold, e.g. the 95th percentile of the observed values. I am trying to construct an ROC curve using R:
Is ROCR a good package for probability-based ROC scores?
Can you provide an example of how to construct this ROC curve?
Just assume a fake dataset.
I have been reading all sorts of papers, but I am very confused about how to calculate the forecast probabilities.
I would encourage you to look at the caret package. It's wonderful for ensemble learning. It will tune your parameters for you based on RMSE, ROC (AUC), etc. via resampling: that is, it splits your data into resamples (bootstrap samples with replacement or cross-validation folds), runs many models while tuning parameters, and gives you back the best model.
The vignette (listed on the package page) here is excellent, and you'll find examples there showing how to plot ROC curves.
However, if what you're looking for is the simple method to calculate an ROC score from predictions and held out data, check out page 11 of this pdf.
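For the forecast probabilities with an ensemble, a common choice is the fraction of ensemble members exceeding the event threshold. A self-contained sketch on fake data with ROCR (all numbers are simulated):
library(ROCR)
set.seed(1)
n <- 200; m <- 20                                          # 200 points, 20 ensemble members
obs <- rnorm(n)                                            # fake observations
ens <- matrix(rnorm(n * m, mean = rep(obs, m)), nrow = n)  # fake ensemble forecasts
thr <- quantile(obs, 0.95)                                 # event: exceeding the 95th percentile
labels <- as.integer(obs > thr)                            # observed event (0/1)
probs <- rowMeans(ens > thr)                               # forecast probability of the event
pred <- prediction(probs, labels)
plot(performance(pred, "tpr", "fpr")); abline(0, 1, lty = 2)
performance(pred, "auc")@y.values[[1]]                     # area under the ROC curve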