How to perform random forest/cross validation in R - r

I'm unable to find a way of performing cross validation on a regression random forest model that I'm trying to produce.
So I have a dataset containing 1664 explanatory variables (different chemical properties), with one response variable (retention time). I'm trying to produce a regression random forest model in order to be able to predict the retention time of a compound from its chemical properties.
ID RT (seconds) 1_MW 2_AMW 3_Sv 4_Se
4281 38 145.29 5.01 14.76 28.37
4952 40 132.19 6.29 11 21.28
4823 41 176.21 7.34 12.9 24.92
3840 41 174.24 6.7 13.99 26.48
3665 42 240.34 9.24 15.2 27.08
3591 42 161.23 6.2 13.71 26.27
3659 42 146.22 6.09 12.6 24.16
This is an example of the table that I have. I want to basically plot RT against 1_MW, etc (up to 1664 variables), so I can find which of these variables are of importance and which aren't.
I do:-
library(randomForest)
r = randomForest(RT..seconds. ~ ., data = cadets, importance = TRUE, do.trace = 100)
varImpPlot(r)
which tells me which variables are important and which are not, which is great. However, I want to be able to partition my dataset so that I can perform cross validation on it. I found an online tutorial that explained how to do it, but for a classification model rather than regression.
I understand you do:-
k = 10
n = floor(nrow(cadets)/k)
i = 1
s1 = ((i-1) * n+1)
s2 = (i * n)
subset = s1:s2
to define the number of folds, the size of each fold, and the start and end indices of the subset. However, I don't know what to do from here on. I was told to loop through the folds, but I honestly have no idea how to do this. Nor do I know how to then plot the validation set and the test set onto the same graph to depict the level of accuracy/error.
If you could please help me with this I'd be ever so grateful, thanks!

From the source:
The out-of-bag (oob) error estimate
In random forests, there is no need for cross-validation or a separate
test set to get an unbiased estimate of the test set error. It is
estimated internally , during the run...
In particular, predict.randomForest returns the out-of-bag prediction if newdata is not given.
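A minimal sketch of the out-of-bag estimate described above, assuming the randomForest package is installed. The data are simulated stand-ins (the names X and y are hypothetical, not the asker's cadets data); the point is only that predict() without newdata returns out-of-bag predictions, whose error is more honest than the error obtained by predicting the training rows again.

```r
library(randomForest)

set.seed(42)
n <- 200
# Simulated stand-in data: 5 predictors (V1..V5), numeric response
X <- as.data.frame(matrix(rnorm(n * 5), n, 5))
y <- 40 + 3 * X$V1 - 2 * X$V2 + rnorm(n)

rf <- randomForest(x = X, y = y, ntree = 300)

# No newdata: each row is predicted only by trees that did not see it
oob_pred <- predict(rf)
oob_mse  <- mean((y - oob_pred)^2)

# Predicting the training rows again uses trees that saw them: optimistic
resub_mse <- mean((y - predict(rf, newdata = X))^2)

c(oob = oob_mse, resubstitution = resub_mse)
```

The out-of-bag figure is the one to compare against a cross-validated error; the resubstitution figure is shown only for contrast.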

As topchef pointed out, cross-validation isn't necessary as a guard against over-fitting. This is a nice feature of the random forest algorithm.
It sounds like your goal is feature selection, and cross-validation is still useful for this purpose. Take a look at the rfcv() function within the randomForest package. The documentation specifies a data frame and a vector as input, so I'll start by creating those from your data.
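library(randomForest)
set.seed(42)
# predictors: everything except the response column
x <- cadets
x$RT..seconds. <- NULL
# response: retention time
y <- cadets$RT..seconds.
# 10-fold cross-validation over nested, shrinking sets of predictors
rf.cv <- rfcv(x, y, cv.fold=10)
# cross-validated error against the number of predictors used
with(rf.cv, plot(n.var, error.cv))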
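If an explicit k-fold loop is still wanted, as the question asks, it could be sketched as follows. This is only an illustration under assumptions: the data frame below is simulated and merely borrows the names cadets and RT..seconds. from the question, and fold labels are assigned at random rather than as contiguous blocks (contiguous blocks such as s1:s2 are only safe if the rows are already in random order).

```r
library(randomForest)

set.seed(42)
n <- 200
# Simulated stand-in for the real `cadets` data frame
X <- as.data.frame(matrix(rnorm(n * 5), n, 5))
cadets <- cbind(RT..seconds. = 40 + 3 * X$V1 - 2 * X$V2 + rnorm(n), X)

k <- 10
# Random fold label (1..k) for every row
fold <- sample(rep(seq_len(k), length.out = nrow(cadets)))

cv_pred <- numeric(nrow(cadets))
for (i in seq_len(k)) {
  held_out <- fold == i
  # Fit on the other k - 1 folds, predict the held-out fold
  fit <- randomForest(RT..seconds. ~ ., data = cadets[!held_out, ], ntree = 200)
  cv_pred[held_out] <- predict(fit, newdata = cadets[held_out, ])
}

# Cross-validated root mean squared error
cv_rmse <- sqrt(mean((cadets$RT..seconds. - cv_pred)^2))
cv_rmse

# Observed against cross-validated predictions; points near the line are good
plot(cadets$RT..seconds., cv_pred,
     xlab = "Observed RT", ylab = "Cross-validated prediction")
abline(0, 1)
```

Every row is predicted exactly once by a model that never saw it, so a single plot of observed against predicted values summarises the error across all folds.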

Related

Obtaining covariates' estimates in rdrobust package

I am using rdrobust to estimate RDDs, and for a journal submission the journal demands that I report tables with covariates and their estimates. I don't think these should be reported in designs like these and don't really know how informative they are, but anyway: I can't find them anywhere in the output of the rdrobust call, so I was wondering whether there is any way of actually obtaining them.
Here's my code:
library(rdrobust)
rd <- rdrobust(y = full_data$share_female,
x = full_data$running,
c = 0,
cluster = full_data$constituency.name,
covs=cbind(full_data$income, full_data$year_fct,
full_data$population, as.factor(full_data$constituency.name)))
I then call the object
rd
And get:
Call: rdrobust
Number of Obs. 1812
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 1452 360
Eff. Number of Obs. 566 170
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 0.145 0.145
BW bias (b) 0.221 0.221
rho (h/b) 0.655 0.655
Unique Obs. 1452 360
So, as you can see, there seems to be no information on this in the printed output or in the object the function returns. I don't really know what to do.
Thanks!
Unfortunately, I do not believe rdrobust() allows you to recover the coefficients introduced through the covs option.
In your case, running the code as you provided and then running:
rd$coef
will only give you the point estimate for the rd estimator.
Josh McCrain has written-up a nice vignette here to replicate rdrobust using lfe that also allows you to recover the coefficients on covariates.
It involves some modification on your part and is of course not as user friendly, but does allow recovery of covariates.
This might be beside the point by now, but the journal requirement in an RD design is odd.
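As a rough sketch of the kind of manual replication mentioned above, a local linear regression with triangular kernel weights can be fitted with plain lm(), which does expose the covariate coefficients. Everything here is an assumption for illustration: the data are simulated, the column names are hypothetical, and the bandwidth is simply copied from the printed output (h = 0.145). Standard errors from lm() will not match rdrobust's robust, clustered ones.

```r
set.seed(1)
n <- 2000
# Simulated stand-in data; column names are hypothetical
d <- data.frame(running = runif(n, -1, 1), income = rnorm(n))
d$treated <- as.numeric(d$running >= 0)
d$y <- 0.5 * d$treated + 0.3 * d$running + 0.2 * d$income + rnorm(n, sd = 0.5)

cutoff <- 0
h <- 0.145  # bandwidth copied from the rdrobust printout

# Triangular kernel: weight falls linearly to zero at distance h
d$w <- pmax(0, 1 - abs(d$running - cutoff) / h)
local <- d[d$w > 0, ]

# Separate slopes on each side of the cutoff, covariate entered additively
fit <- lm(y ~ treated * running + income, data = local, weights = w)
coef(fit)
```

The coefficient on treated plays the role of the conventional RD point estimate, and the coefficient on income is the covariate estimate the journal is asking for.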
Use summary(rd). This will return the coefficient estimate.

Cointegration: Testing for bounds in VAR and interpretation of results

I am currently working with different pairwise VAR models to analyze cointegration relationships.
I have the following pair of time series: X is I(0), Y is I(1). Because X and Y are not integrated of the same order (i.e. both I(1)), I can't carry out the Johansen and Juselius test (ca.jo) with the vars package. Rather, I have to consider a test by Pesaran et al. (2001) that works for time series that are integrated of different orders.
Here is my reproducible code for a cointegration test of variables of different order of integration with a package named ardl:
install.packages("devtools")
library(devtools)
install_github("fcbarbi/ardl")
library(ardl)
data(br_month)
br_month
m1 <- ardl(mpr~cpi, data=br_month, ylag=1, case=3)
bounds.test(m1)
m2 <- ardl(cpi~mpr, data=br_month, ylag=1, case=3)
bounds.test(m2)
Question here:
Can I test the cointegration of a VAR (with 2 variables) with the ARDL test?
Interpretation of results (case 5 = constant+trend):
bounds.test(m1)
PSS case 5 ( unrestricted intercept, unrestricted trend )
Null hypothesis (H0): No long-run relation exist, ie H0:pi=0
I(0) I(1)
10% 5.59 6.26
5% 6.56 7.30
1% 8.74 9.63
F statistic 11.21852
Existence of a Long Term relation is not rejected at 5%.
bounds.test(m2)
PSS case 5 (unrestricted intercept, unrestricted trend )
Null hypothesis (H0): No long-run relation exist, ie H0:pi=0
I(0) I(1)
10% 5.59 6.26
5% 6.56 7.30
1% 8.74 9.63
F statistic 5.571511
Existence of a Long Term relation is rejected at 5% (even assumming all regressors I(0))
I would conclude that there is a cointegration relationship between cpi and mpr as the F statistic for m2 is smaller than the critical value for I(0) at the 5% level.
However, does it tell me anything that it can be concluded for m2 but not m1?
I think you are confusing the definition of "cointegration": for a set of time series to be cointegrated, they must have the same order of integration.
So your question seems rather to be "what can I do when I have time series with different orders of integration?".
In that case, I would advise you to difference the non-stationary variables to obtain stationarity and leave the stationary ones as they are. Then fit a VAR as usual.
From the 2001 paper behind the ARDL bounds test: "This paper proposes a new approach to testing for the existence of a relationship between variables in levels which is applicable irrespective of whether the underlying regressors are purely I(0), purely I(1) or mutually cointegrated."
So the ARDL bounds test is not normally used to check for cointegration.
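For reading the bounds-test output in the question, the decision rule from Pesaran et al. (2001) can be written down as a small sketch. The function name bounds_decision is hypothetical; the critical values and F statistics are the ones printed in the question.

```r
# Compare an F statistic with the lower (all regressors I(0)) and
# upper (all regressors I(1)) critical bounds
bounds_decision <- function(f_stat, lower, upper) {
  if (f_stat > upper) {
    "reject H0: evidence of a long-run (level) relationship"
  } else if (f_stat < lower) {
    "do not reject H0: no evidence of a long-run relationship"
  } else {
    "inconclusive: depends on the integration order of the regressors"
  }
}

# 5% bounds from the printed table
lower5 <- 6.56
upper5 <- 7.30

m1_result <- bounds_decision(11.21852, lower5, upper5)  # mpr ~ cpi
m2_result <- bounds_decision(5.571511, lower5, upper5)  # cpi ~ mpr
m1_result
m2_result
```

Under this rule the statistic for m1 lies above the upper bound, while the statistic for m2 lies below the lower bound, so only m1 points to a long-run relationship at the 5% level.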

Regression - out-of-sample forecasting

I am trying to figure out how to deal with my forecasting problem, and I am not sure whether my understanding of this field is right, so it would be really nice if someone could help me. First of all, my goal is to forecast a time series with regression. Instead of using an ARIMA model or other heuristic models, I want to focus on machine learning techniques such as random forest regression, k-nearest-neighbour regression, etc. Here is an overview of the dataset:
Timestamp UsageCPU UsageMemory Indicator Delay
2014-01-03 21:50:00 3123 1231 1 123
2014-01-03 22:00:00 5123 2355 1 322
2014-01-03 22:10:00 3121 1233 2 321
2014-01-03 22:20:00 2111 1234 2 211
2014-01-03 22:30:00 1000 2222 2 0
2014-01-03 22:40:00 4754 1599 1 0
The timestamp increases in steps of 10 minutes, and I want to predict the dependent variable UsageCPU from the independent variables UsageMemory, Indicator, etc. At this point I will explain my general understanding of the prediction part. For prediction it is necessary to separate the dataset into training, validation and test sets. My dataset, which covers two whole weeks, is split into 60% training, 20% validation and 20% test. This means the training set contains the first 8 days, and the validation and test sets contain 3 days each. After that I can train a model in SparkR (the settings are not important).
model <- spark.randomForest(train, UsageCPU ~ UsageMemory + Indicator + Delay,
type = "regression", maxDepth = 30, maxBins = 50, numTrees=50,
impurity="variance", featureSubsetStrategy="all")
After this I can validate the results with the validation set and compute the RMSE to see the accuracy of the model and which points have to be tuned in the model-building part. Once that is finished I can predict on the test dataset:
predictions <- predict(model, test)
So the prediction works fine, but this is only an in-sample forecast and cannot be used to predict, for example, the next day. In my understanding, in-sample forecasting can only be used to predict data in the dataset, not future values that may occur tomorrow. I really want to predict, for example, the next day, or only the next 10 minutes or 1 hour, which is only possible with out-of-sample forecasting. I also tried something like this (rolling regression) on the predicted values from the random forest, but in my case the rolling regression is only used for evaluating the performance of different regressors with respect to different parameter combinations. So in my understanding this is not out-of-sample forecasting.
t <- bind(prediction, RollingRegression3 = rollApply(prediction, fun=function(x) mean(UsageCPU), window=6, align='right'))
So in my understanding I need something (maybe lagged values?) before the model-building process starts. I have also read a lot of different papers and books, but there is no clear explanation of how to do it and what the key points are. They only mention something like t+1, t+n, and right now I do not even know how to do that. It would be really nice if someone could help me, because I have been trying to figure this out for three months now. Thank you.
Let’s see if I get your problem right. I suppose that, given a time window, e.g. 144 last observations (one day) of UsageCPU, UsageMemory, Indicator and Delay, you want to forecast the ‘n’ next observations of UsageCPU. One way you could do such a thing, using random forests, is assigning one model for each next observation you want to forecast. So, if you want to forecast the 10 next UsageCPU observations, you should train 10 random forest models.
Using the example I began with, you could split the data you have in chunks of 154 observations. In each, you will use the first 144 observations to forecast the last 10 values of UsageCPU. There are lots of ways in which you could use feature engineering to extract information from these first 144 observations to train your model with, e.g. mean for each variable, last observation of each variable, global mean for each variable. So, for each chunk you will get a vector containing a bunch of predictors and 10 target values.
Bind the vectors you got for each chunk and you’ll have a matrix where the first columns are the predictors and the last 10 columns are the targets. Train each random forest with the n predictors columns and 1 of the targets column. Now you can apply the models on the features you extract from any data chunk containing the 144 observations. The model trained for target column 1 will ‘forecast’ one observation ahead, the model trained for target column 2 will ‘forecast’ two observations ahead, the model trained for target column 3 will ‘forecast’ three observations ahead...
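The one-model-per-horizon idea above could be sketched like this, assuming the randomForest package. The series is simulated, only UsageCPU is used, and the window and horizon (24 and 3) are deliberately smaller than the 144 and 10 in the text so the example runs quickly; the feature set (mean, last value, standard deviation) is just one hypothetical choice.

```r
library(randomForest)

set.seed(1)
n <- 600
# Simulated stand-in for UsageCPU at 10-minute steps (daily cycle of 144 steps)
cpu <- 3000 + 1000 * sin(2 * pi * seq_len(n) / 144) + rnorm(n, sd = 200)

window  <- 24  # observations used as history (144 in the text)
horizon <- 3   # steps ahead to forecast (10 in the text)

# One row per chunk: summary features of the history, then the future values
make_row <- function(s) {
  past   <- cpu[s:(s + window - 1)]
  future <- cpu[(s + window):(s + window + horizon - 1)]
  c(mean = mean(past), last = past[window], sd = sd(past),
    setNames(future, paste0("h", seq_len(horizon))))
}
starts <- seq(1, n - window - horizon + 1, by = 6)
rows <- as.data.frame(t(sapply(starts, make_row)))

features <- c("mean", "last", "sd")

# One random forest per forecast horizon
models <- lapply(seq_len(horizon), function(h) {
  randomForest(x = rows[, features], y = rows[[paste0("h", h)]], ntree = 100)
})

# Forecast beyond the end of the data from the most recent window
latest <- tail(cpu, window)
new_x <- data.frame(mean = mean(latest), last = latest[window], sd = sd(latest))
forecast <- sapply(models, predict, newdata = new_x)
forecast
```

Because the features are built only from values before each target, the final call is a genuine out-of-sample forecast: it uses nothing later than the last observed point.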

Convert mixed model with repeated measures from SAS to R

I have been trying to convert a repeated measures model from SAS to R, since a collaborator will do the analysis but does not have SAS. We are dealing with 4 groups, 8 to 10 animals per group, and then 5 time points for each animal. The mock data file is available here https://drive.google.com/file/d/0B-WfycVUQyhaVGU2MUpuQkg4Mk0/edit?usp=sharing as a Rdata file and here https://drive.google.com/file/d/0B-WfycVUQyhaR0JtZ0V4VjRkTk0/edit?usp=sharing as an excel file:
The original SAS code (1) is :
proc mixed data=essai.data_test method=reml;
class group time mice;
model param = group time group*time / ddfm=kr;
repeated time / type=un subject=mice group=group;
run;
Which gives :
Type 3 Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
group             3     15.8      1.58   0.2344
time              4     25.2     10.11   <.0001
group*time       12     13.6      1.66   0.1852
I know that R does not handle degrees of freedom in the same way as SAS does, so I am first trying to obtain results similar to (2) :
proc mixed data=essai.data_test method=reml;
class group time mice;
model param = group time group*time;
repeated time / type=un subject=mice group=group;
run;
I have found some hints here Converting Repeated Measures mixed model formula from SAS to R and when specifying a compound symmetry correlation matrix this works perfectly. However, I am not able to obtain the same thing for a general correlation matrix.
With (2) in SAS, I obtain the following results :
Type 3 Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
group             3       32      1.71   0.1852
time              4      128     11.21   <.0001
group*time       12      128      2.73   0.0026
Using the following R code :
options(contrasts=c('contr.sum','contr.poly'))
mod <- lme(param~group*time, random=list(mice=pdDiag(form=~group-1)),
correlation = corSymm(form=~1|mice),
weights = varIdent(form=~1|group),
na.action = na.exclude, data = data, method = "REML")
anova(mod,type="marginal")
I obtain:
numDF denDF F-value p-value
(Intercept) 1 128 1373.8471 <.0001
group 3 32 1.5571 0.2189
time 4 128 10.0628 <.0001
group:time 12 128 1.6416 0.0880
The degrees of freedom are similar, but not the tests on fixed effects and I don’t know where this comes from. Would anyone have any idea of what I am doing wrong here?
Your R code differs from the SAS code in multiple ways. Some of them are fixable, but I was not able to fix all the aspects to reproduce the SAS analysis.
The R code fits a mixed effects model with a random mice effect, while the SAS code fits a general linear model that allows correlation between the residuals (fitted by generalized least squares), but there are no random effects (because there is no RANDOM statement). In R you would have to use the gls function from the same nlme package.
In the R code all observations within the same group have the same variance, while in the SAS code you have an unstructured covariance matrix, that is each time-point within each group has its own variance. You can achieve the same effect by using weights=varIdent(form=~1|group*time).
In the R code the correlation matrix is the same for every mouse regardless of group. In the SAS code each group has its own correlation matrix. This is the part that I don't know how to reproduce in R.
I have to note that the R model seems to be more meaningful - SAS estimates way too many variances and correlations (which, by the way, you can see meaningfully arranged using the R and RCORR options to the repeated statement).
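A minimal sketch of the gls approach described above, on simulated data with hypothetical values. It uses an unstructured correlation matrix shared by all mice and a separate variance per time point; it does not reproduce the per-group covariance matrices requested by group=group in the SAS code, and the variance strata are kept to time only so that the small example stays estimable.

```r
library(nlme)

set.seed(1)
# 40 mice, 5 time points each; rows ordered by mouse, then time
d <- expand.grid(time = 1:5, mice = 1:40)
d$group <- factor(rep(c("A", "B", "C", "D"), each = 10)[d$mice])

# A shared mouse effect induces within-mouse correlation;
# the residual spread grows with time
mouse_effect <- rnorm(40, sd = 1)
d$param <- 10 + 0.5 * d$time + mouse_effect[d$mice] +
  rnorm(nrow(d), sd = 0.5 * sqrt(d$time))

d$time <- factor(d$time)
d$mice <- factor(d$mice)

# No random effects: correlated residuals within mouse (like REPEATED / TYPE=UN)
fit <- gls(param ~ group * time, data = d,
           correlation = corSymm(form = ~ 1 | mice),
           weights = varIdent(form = ~ 1 | time),
           method = "REML")

tests <- anova(fit, type = "marginal")
tests
```

Replacing the weights argument with varIdent(form = ~ 1 | group * time) gives the per-group, per-time variances mentioned above, at the cost of many more parameters to estimate.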
"In the R code the correlation matrix is the same for every mouse regardless of group. In the SAS code each group has its own correlation matrix. This is the part that I don't know how to reproduce in R." - Try: correlation=corSymm(~1|group*time)

Error with gls function in nlme package in R

I keep getting an error like this:
Error in `coef<-.corARMA`(`*tmp*`, value = c(18.3113452983211, -1.56626248550284, :
Coefficient matrix not invertible
or like this:
Error in gls(archlogfl ~ co2, correlation = corARMA(p = 3)) : false convergence (8)
with the gls function in nlme.
The former example was with the model gls(archlogflfornma~nma,correlation=corARMA(p=3)) where archlogflfornma is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
and nma is
[1] 138 139 142 148 150 134 137 135
You can see the model in the latter, and archlogfl is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
[9] 2.105556 2.176747
and co2 is
[1] 597.5778 917.9308 1101.0430 679.7803 886.5347 597.0668 873.4995
[8] 816.3483 1427.0190 423.8917
I have R 2.13.1.
@GavinSimpson's comment above, that trying to estimate a model with 5 parameters from 10 observations is very hopeful, is correct. The general rule of thumb is that you should have at least 10 times as many data points as parameters, and that's for standard fixed effect/regression parameters. (Generally variance structure parameters such as AR parameters are even a bit harder/require a bit more data than regression parameters to estimate.)
That said, in a perfect world one could hope to estimate parameters even from overfitted models. Let's just explore what happens though:
archlogfl <- c(2.611840,2.618454,2.503317,
2.305531,2.180464,2.185764,2.221760,2.211320,
2.105556,2.176747)
co2 <- c(597.5778,917.9308,1101.0430,679.7803,
886.5347,597.0668,873.4995,
816.3483,1427.0190,423.8917)
Take a look at the data,
plot(archlogfl~co2,type="b")
library(nlme)
g0 <- gls(archlogfl~co2)
plot(ACF(g0),alpha=0.05)
This is an autocorrelation function of the residuals, with 95% confidence intervals (note that these are pointwise confidence intervals, so we would expect about 1/20 points to fall outside these boundaries in any case).
So there is indeed some (graphical) evidence for some autocorrelation here. We'll fit an AR(1) model with verbose output. (To understand the scale on which these parameters are estimated, you'll probably have to dig around in Pinheiro and Bates 2000: what's presented in the printout are the unconstrained values of the parameters, while what's printed in the summaries are the constrained values.)
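To make the confidence band concrete, here is a small base-R sketch using the data from the question. It uses lm() residuals (the same fitted line as gls without a correlation structure) and the usual large-sample approximation of plus or minus qnorm(0.975)/sqrt(n) for the band, which is an assumption and may differ slightly from what plot(ACF(...)) draws at each lag.

```r
archlogfl <- c(2.611840, 2.618454, 2.503317, 2.305531, 2.180464,
               2.185764, 2.221760, 2.211320, 2.105556, 2.176747)
co2 <- c(597.5778, 917.9308, 1101.0430, 679.7803, 886.5347,
         597.0668, 873.4995, 816.3483, 1427.0190, 423.8917)

res <- residuals(lm(archlogfl ~ co2))
n <- length(res)

# Sample autocorrelations at lags 1 to 4 (lag 0 dropped)
r <- acf(res, lag.max = 4, plot = FALSE)$acf[-1]

# Approximate pointwise 95% band under no autocorrelation
band <- qnorm(0.975) / sqrt(n)

data.frame(lag = 1:4, acf = round(r, 3), outside_band = abs(r) > band)
```

With only 10 observations the band is roughly plus or minus 0.62, so an autocorrelation has to be very large before it stands out, which is another way of seeing how little information there is for estimating several AR parameters.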
g1 <- gls(archlogfl ~co2,correlation=corARMA(p=1),
control=glsControl(msVerbose=TRUE))
Let's see what's left after we fit AR1:
plot(ACF(g1,resType="normalized"),alpha=0.05)
Now fit AR(2):
g2 <- gls(archlogfl ~co2,correlation=corARMA(p=2),
control=glsControl(msVerbose=TRUE))
plot(ACF(g2,resType="normalized"),alpha=0.05)
As you correctly state, trying to go to AR(3) fails.
gls(archlogfl ~co2,correlation=corARMA(p=3))
You can play with tolerances, starting conditions, etc., but I don't think it's going to help much.
gls(archlogfl ~co2,correlation=corARMA(p=3,value=c(0.9,-0.5,0)),
control=glsControl(tolerance=1e-4,msVerbose=TRUE),verbose=TRUE)
If I were absolutely desperate to get these values I would code my own generalized least-squares function, constructing the AR(3) correlation matrix from scratch, and try to run it with some slightly more robust optimizer, but I would really have to have a good reason to work that hard ...
Another alternative would be to use arima to fit to the residuals from a gls or lm fit without autocorrelation: arima(residuals(g0),c(3,0,0)). (You can see that if you do this with arima(residuals(g0),c(2,0,0)) the answers are close to (but not quite equal to) the results from gls with corARMA(p=2).)

Resources