R package leaps: Undefined Mallows Cp in results - r

I am running all possible regression models on 92 variables in the R package 'leaps', predicting one dependent variable. Each of these variables consists of 51 numerical values.
Leaps produces five model statistics (i.e., R-squared, adjusted R-squared, residual sum of squares, Mallows' Cp, and BIC). My results show seemingly normal values for each of these statistics except Mallows' Cp, where all values are negative infinity. Clearly there is a division by zero at some point, but I am not familiar enough with model fit statistics to know whether this is a problem.
Any thoughts?

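I'm only just being introduced to Mallows' Cp, but it sounds like you have more variables than data points (92 predictors, 51 observations). The s^2 in the formula is the MSE of the full model (the one with all variables). With more variables than data points, the full model fits the data perfectly: its residual sum of squares is zero and no residual degrees of freedom are left, so s^2 cannot be estimated. So you should either collect more data or find some means other than Mallows' Cp to reduce the number of variables.
Cp = RSS_p/s^2 - N + 2*p
where RSS_p is the residual sum of squares of the submodel with p parameters and s^2 is the MSE of the full model.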
You can look at this reference here: http://www.public.iastate.edu/~mervyn/Stat401E_Spring2013/Other/mallows.pdf
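To see the degenerate case concretely, here is a minimal sketch on simulated data. The data and variable names are made up for illustration; only the 51 x 92 shape mirrors the question. It fits the full model with lm and checks what is left over to estimate s^2 from:

```r
# Simulated stand-in for the poster's data: 51 observations, 92 predictors.
set.seed(1)
n <- 51
p <- 92
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
X$y <- rnorm(n)

# "Full" model with every predictor; lm() reports NA for the coefficients
# it cannot estimate once the data are already fit exactly.
full <- lm(y ~ ., data = X)

rss_full <- sum(residuals(full)^2)  # numerically zero: the fit is perfect
df_full  <- df.residual(full)       # zero: no residual degrees of freedom

# s^2 = RSS / df has nothing to be estimated from, so any statistic that
# divides by s^2 (such as Cp) is undefined for every submodel.
c(rss_full = rss_full, df_full = df_full)
```

How leaps turns this into negative infinity specifically depends on its internals, which this sketch does not try to reproduce.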

Related

Random forest regression - cumulative MSE?

I am new to Random Forests and I have a question about regression. I am using the R package randomForest to fit RF models.
My final goal is to select sets of variables important for prediction of a continuous trait, and so I am calculating a model, then I remove the variable with lowest mean decrease in accuracy, and I calculate a new model, and so on. This worked with RF classification, and I compared the models using the OOB errors from prediction (training set), development and validation data sets. Now with regression I want to compare the models based on %variation explained and MSE.
I was evaluating the results for MSE and %var explained, and I get exactly the same results when calculating manually using the prediction from model$predicted. But when I do model$mse, the value presented corresponds to the value of MSE for the last tree calculated, and the same happens for % var explained.
As an example you can try this code in R:
library(randomForest)
data("iris")
head(iris)
TrainingX<-iris[1:100,2:4] #creating training set - X matrix
TrainingY<-iris[1:100,1] #creating training set - Y vector
TestingX<-iris[101:150,2:4] #creating test set - X matrix
TestingY<-iris[101:150,1] #creating test set - Y vector
set.seed(2)
model<-randomForest(x=TrainingX, y= TrainingY, ntree=500, #calculating model
xtest = TestingX, ytest = TestingY)
#for prediction (training set)
pred<-model$predicted
meanY<-sum(TrainingY)/length(TrainingY)
varpY<-sum((TrainingY-meanY)^2)/length(TrainingY)
mseY<-sum((TrainingY-pred)^2)/length(TrainingY)
r2<-(1-(mseY/varpY))*100
#for testing (test set)
pred_2<-model$test$predicted
meanY_2<-sum(TestingY)/length(TestingY)
varpY_2<-sum((TestingY-meanY_2)^2)/length(TestingY)
mseY_2<-sum((TestingY-pred_2)^2)/length(TestingY)
r2_2<-(1-(mseY_2/varpY_2))*100
training_set_mse<-c(model$mse[500], mseY)
training_set_rsq<-c(model$rsq[500]*100, r2)
testing_set_mse<-c(model$test$mse[500],mseY_2)
testing_set_rsq<-c(model$test$rsq[500]*100, r2_2)
c<-cbind(training_set_mse,training_set_rsq,testing_set_mse, testing_set_rsq)
rownames(c)<-c("last tree", "by hand")
c
model
After running this code you will obtain a table containing values for MSE and %var explained (also called rsq). The first row is called "last tree" and contains the values of MSE and %var explained for the 500th tree in the forest. The second row is called "by hand" and contains the results calculated in R from the vectors model$predicted and model$test$predicted.
So, my questions are:
1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)
2- Is the last tree to be considered as an average of all the others?
3- Why are MSE and %var explained of the RF model (presented in the main board when you call model) the same as the ones from the 500th tree (see first line of table)? Do the vectors model$mse or model$rsq contain cumulative values?
After the last edit I found this post from Andy Liaw (one of the creators of the package) that says that MSE and %var explained are in fact cumulative!: https://stat.ethz.ch/pipermail/r-help/2004-April/049943.html.
Not sure I understand what your issue is; I'll give it a try nevertheless...
1- Are the predictions of the trees somehow cumulative? Or are they
independent from each other? (I thought they were independent)
You thought correctly; the trees are fit independently of each other, hence their predictions are indeed independent. In fact, this is a crucial advantage of RF models, since it allows for parallel implementations.
2- Is the last tree to be considered as an average of all the others?
No; as clarified above, all trees are independent.
3- If each tree gets a prediction, how can I get the matrix with all the trees, since what I need is the MSE and % var explained for the forest?
Here is where what you ask starts being really unclear, given your code above; the MSE and r2 you say you need are exactly what you are already computing in mseY and r2:
mseY
[1] 0.1232342
r2
[1] 81.90718
which, unsurprisingly, are the very same values reported by model:
model
# result:
Call:
randomForest(x = TrainingX, y = TrainingY, ntree = 500)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 0.1232342
% Var explained: 81.91
so I'm not sure I can really see your issue, or what these values have to do with the "matrix with all the trees"...
But when I do model$mse, the value presented corresponds to the value
of MSE for the last tree calculated, and the same happens for % var
explained.
Most certainly not: model$mse is a vector of length equal to the number of trees (here 500), containing the MSE for each individual tree (but see the UPDATE below); I have never seen any use for this in practice (similarly for model$rsq):
length(model$mse)
[1] 500
length(model$rsq)
[1] 500
UPDATE: Kudos to the OP herself (see comments), who discovered that the quantities in model$mse and model$rsq are indeed cumulative (!); from an old (2004) thread by package maintainer Andy Liaw, Extracting the MSE and % Variance from RandomForest:
Several ways:
Read ?randomForest, especially the `Value' section.
Look at str(myforest.rf).
Look at print.randomForest.
If the forest has 100 trees, then the mse and rsq are vectors with 100
elements each, the i-th element being the mse (or rsq) of the forest
consisting of the first i trees. So the last element is the mse (or
rsq) of the whole forest.
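A quick way to check the cumulative reading is sketched below, reusing the training split from the question's code. It assumes the randomForest package is installed; the object names are only for illustration.

```r
library(randomForest)

set.seed(2)
x <- iris[1:100, 2:4]   # same training predictors as in the question
y <- iris[1:100, 1]     # same training response
model <- randomForest(x = x, y = y, ntree = 500)

# OOB MSE of the whole forest, computed by hand from the OOB predictions
mse_by_hand <- mean((y - model$predicted)^2)

# The last element of model$mse should match it, because element i is the
# MSE of the forest made of the first i trees
same_as_last <- isTRUE(all.equal(model$mse[500], mse_by_hand))
same_as_last

# Earlier elements describe smaller forests, not single trees
model$mse[c(1, 10, 100, 500)]
```

If single-tree errors are really needed, they would have to be computed separately (for example from predict(model, newdata, predict.all = TRUE)); model$mse does not hold them.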

Covariance structure in lme - AR(1)

My response variable is Yijk corresponding to the recovery time of
patient i (i=1,...,I)
with treatment j (j=1,...,J)
and measured at time k (k=1,...,K)
I would like to fit the following model: Yijk = μ + αj + bik + uijk, where:
μ is a global fixed intercept
αj is a fixed effect for the treatment
bik is a random effect with the following covariance structure. Denote bi the K-dimensional vector of effect for the patient i, then its variance-covariance matrix would have the following AR(1) structure.
[Image in the original post: AR(1) variance-covariance matrix R of bi]
uijk is the usual error term with variance σ²
Consider the following line of command:
lme(recovery ~ treatment, method="REML", random=~1|patient, correlation=corAR1(form=~time|patient), data=data)
Several questions:
What does this correlation argument correspond to? The structure of covariance of what? Is that the var-cov matrix which I defined as R?
Does the line actually do what I would like to?
If not, what does it do?
If not, is there a way to do what I would like to?
Thank you in advance!
First, your command is lme; I will assume you mean lme from the nlme package, because correlation isn't an option in lme4.
Second, the nlme documentation describes the correlation argument like this:
an optional corStruct object describing the within-group correlation
structure. See the documentation of corClasses for a description of
the available corStruct classes. Defaults to NULL, corresponding to no
within-group correlations.
and in corClasses it says
corAR1 autoregressive process of order 1.
So, the answers to your first two questions appear to be "Yes".
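For reference, a minimal sketch of such a call that can be run as-is. It uses the Ovary data bundled with nlme as a stand-in for the poster's data, so the variable names (follicles, Time, Mare) are not from the question. With the poster's data, the correlation term would presumably be corAR1(form = ~ time | patient) with an integer-valued time.

```r
library(nlme)  # recommended package, shipped with standard R installations

# Random intercept per mare plus an AR(1) within-group correlation structure.
# form = ~ 1 | Mare uses the order of the observations within each mare
# as the time index.
fit <- lme(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time),
           data = Ovary,
           random = ~ 1 | Mare,
           correlation = corAR1(form = ~ 1 | Mare),
           method = "REML")

# Estimated AR(1) parameter (Phi) of the within-group correlation structure
phi <- coef(fit$modelStruct$corStruct, unconstrained = FALSE)
phi

# Random-intercept and residual standard deviations
VarCorr(fit)
```

Comparing phi and VarCorr(fit) against the model as written down on paper is probably the most direct way to see which covariance matrix the correlation argument is actually parameterising.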

predict and multiplicative variables / interaction terms in probit regressions

I want to determine the marginal effects of each independent variable in a probit regression as follows:
predict the (base) probability with the mean of each variable
for each variable, predict the change in probability compared to the base probability if the variable takes the value of mean + 1x standard deviation of the variable
In one of my regressions, I have a multiplicative variable, as follows:
my_probit <- glm(a ~ b + c + I(b*c), family = binomial(link = "probit"), data=data)
Two questions:
When I determine the marginal effects using the approach above, will the value of the multiplicative term reflect the value of b or c taking the value mean + 1x standard deviation of the variable?
Same question, but with an interaction term (* and no I()) instead of a multiplicative term.
Many thanks
When interpreting the results of models involving interaction terms, the general rule is DO NOT interpret coefficients. The very presence of interactions means that the meaning of coefficients for terms will vary depending on the other variate values being used for prediction. The right way to go about looking at the results is to construct a "prediction grid", i.e. a set of values that are spaced across the range of interest (hopefully within the domain of data support). The two essential functions for this process are expand.grid and predict.
dgrid <- expand.grid(b=fivenum(data$b)[2:4], c=fivenum(data$c)[2:4])
# A grid with the lower and upper hinges and the medians for `b` and `c`.
predict(my_probit, newdata=dgrid)
You may want to have the predictions on a scale other than the default (which is to return the linear predictor), so perhaps this would be easier to interpret if it were:
predict(my_probit, newdata=dgrid, type ="response")
Be sure to read ?predict and ?predict.glm and work with some simple examples to make sure you are getting what you intended.
Predictions from models containing interactions (at least those involving 2 covariates) should be thought of as being surfaces or 2-d manifolds in three dimensions. (And for 3-covariate interactions as being iso-value envelopes.) The reason that non-interaction models can be decomposed into separate term "effects" is that the slopes of the planar prediction surfaces remain constant across all levels of input. Such is not the case with interactions, especially those with multiplicative and non-linear model structures. The graphical tools and insights that one picks up in a differential equations course can be productively applied here.
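To make the grid approach concrete, and to address the I(b*c) versus b*c question directly, here is a small self-contained sketch on simulated data (the data-generating coefficients are arbitrary assumptions). In both formulations the product term is recomputed by predict from whatever b and c are in newdata, so shifting b by one standard deviation automatically shifts the product as well:

```r
set.seed(42)
n <- 500
dat <- data.frame(b = rnorm(n), c = rnorm(n))
dat$a <- rbinom(n, 1, pnorm(0.3 * dat$b - 0.2 * dat$c + 0.5 * dat$b * dat$c))

fit_I    <- glm(a ~ b + c + I(b * c), family = binomial(link = "probit"), data = dat)
fit_star <- glm(a ~ b * c,            family = binomial(link = "probit"), data = dat)

# Row 1: both variables at their means; row 2: b raised by one SD
nd <- data.frame(b = c(mean(dat$b), mean(dat$b) + sd(dat$b)),
                 c = mean(dat$c))

p_I    <- predict(fit_I,    newdata = nd, type = "response")
p_star <- predict(fit_star, newdata = nd, type = "response")

all.equal(unname(p_I), unname(p_star))  # the two formulations agree
diff(p_I)                               # change in probability for b + 1 SD

# Prediction grid over the hinges and medians of b and c
grid <- expand.grid(b = fivenum(dat$b)[2:4], c = fivenum(dat$c)[2:4])
grid$p <- predict(fit_I, newdata = grid, type = "response")
grid
```

Because of the interaction, the change in diff(p_I) depends on where c is held, which is exactly why looking across the grid is more informative than a single number.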

specifying probability weights in R *without* using Lumley survey package

I would really appreciate any help with specifying probability weights in R without using the Lumley survey package. I am conducting mediation analysis in R using the Imai et al mediation package, which does not currently support svyglm.
The code I am currently running is:
olsmediator_basic<-lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
data = unifiedanalysis, weights = designweight)
However, I'm unsure if this is weighting the data correctly. The reason is that this code yields standard errors that differ from those I am getting in Stata. The Stata code I am running is:
reg poledu gateway_strict_alt gender_n spline1 spline2 spline3 [pweight=designweight]
I was wondering if the weights option in R may not be for inverse probability weights, but I was unable to determine this from the documentation, this forum or elsewhere. If I am missing something, I really apologize - I am new to R as well as to this forum.
Thank you in advance for your help.
The R documentation specifies that the weights parameter of the lm function is inversely proportional to the variance of the observations. This is the definition of analytic weights, or aweights in Stata.
Have a look at the ipw package for inverse probability weighting.
To correct a previous answer - I looked up the manual on weights and found the following description for weights in lm
Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).
These are actually frequency weights (fweights in Stata): they replicate each observation as many times as the weight vector specifies. Probability weights, on the other hand, reflect the probability that an observation (or its group) was sampled from the population. They adjust the impact of each observation on the coefficients, but not on the standard errors, since they do not change the number of observations represented in the sample.
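One way to probe how lm treats its weights argument is to compare a weighted fit with a fit on physically replicated rows. The sketch below uses simulated data with integer weights (all names are illustrative):

```r
set.seed(7)
d <- data.frame(x = rnorm(20))
d$y <- 1 + 2 * d$x + rnorm(20)
d$w <- sample(1:4, 20, replace = TRUE)  # integer weights

# Weighted fit versus a fit on a data set where row i appears w[i] times
fit_w   <- lm(y ~ x, data = d, weights = w)
fit_rep <- lm(y ~ x, data = d[rep(seq_len(nrow(d)), d$w), ])

# Point estimates coincide
all.equal(coef(fit_w), coef(fit_rep))

# Standard errors do not: the weighted fit still counts 20 observations
# when computing residual degrees of freedom, the replicated fit counts sum(w)
se_w   <- coef(summary(fit_w))[, "Std. Error"]
se_rep <- coef(summary(fit_rep))[, "Std. Error"]
cbind(weighted = se_w, replicated = se_rep)
```

So the coefficients from lm with weights can be trusted to match a weighted analysis, but the standard errors follow lm's own convention and should not be expected to match Stata's pweight output, which uses robust variance estimates.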

estimating density in a multidimensional space with R

I have two types of individuals, say M and F, each described by six variables (forming a 6D space S). I would like to identify the regions in S where the densities of M and F differ maximally. I first tried a binomial logistic model linking F/M to the six variables, but the result of this GLM is very hard to interpret (in part due to the numerous significant interaction terms). So I am thinking of a "spatial" analysis in which I would separately estimate the density of M and F individuals everywhere in S and then calculate the difference in densities. Finally I would manually look for the largest difference in densities and extract the values of the six variables there.
I found the function sm.density in the package sm, which can estimate densities in a 3D space, but I find nothing for a space with n>3. Would you know of something that can do this in R? Alternatively, would you have a more elegant method to answer my first question (2nd sentence)?
In advance,
Thanks a lot for your help
The function kde in the package ks performs kernel density estimation for multivariate data with dimensions ranging from 1 to 6.
The pdfCluster and np packages provide functions for kernel density estimation in higher dimensions.
If you prefer parametric techniques, you could look at R packages for Gaussian mixture estimation such as mclust or mixtools.
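To illustrate the overall workflow (estimate each group's density, subtract, locate the largest gap) without depending on any of those packages, here is a hand-rolled base R sketch. It is not the ks API: it swaps in a plain product Gaussian kernel with a rule-of-thumb bandwidth per dimension, and the simulated groups are assumptions for illustration only.

```r
# Product-Gaussian-kernel density estimate of sample `x`, evaluated at the
# rows of `at`; bandwidths follow Scott's rule of thumb per dimension.
kde_product <- function(x, at) {
  x  <- as.matrix(x)
  at <- as.matrix(at)
  n <- nrow(x)
  d <- ncol(x)
  h <- apply(x, 2, sd) * n^(-1 / (d + 4))
  apply(at, 1, function(pt) {
    z <- sweep(sweep(x, 2, pt, "-"), 2, h, "/")  # scaled distances to pt
    mean(exp(-0.5 * rowSums(z^2))) / ((2 * pi)^(d / 2) * prod(h))
  })
}

set.seed(3)
grpM <- matrix(rnorm(200 * 6),             ncol = 6)  # simulated "M" group
grpF <- matrix(rnorm(200 * 6, mean = 0.5), ncol = 6)  # simulated "F" group

# Evaluate both densities at every observed point and take the difference
pts <- rbind(grpM, grpF)
dens_diff <- kde_product(grpM, pts) - kde_product(grpF, pts)

# Coordinates (six variables) where the two densities differ most
pts[which.max(abs(dens_diff)), ]
```

Evaluating at the observed points rather than on a full 6D grid keeps the cost manageable; a dedicated package would normally offer better bandwidth selection than the rule of thumb used here.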
The ability to do this with GLM models may be constrained both by the interpretability issues that you already encountered and by numerical stability issues. Furthermore, you don't describe the GLM models, so it's not possible to see whether you considered non-linearity. If you have lots of data, you might consider using 2D crossed spline terms. (These are not really density estimates.) If I were doing initial exploration with facilities in the rms/Hmisc packages in five dimensions, it might look like:
library(rms)
dd <- datadist(dat)
options(datadist="dd")
big.mod <- lrm( MF ~ ( rcs(var1, 3) + # `lrm` is logistic regression in rms
rcs(var2, 3) +
rcs(var3, 3) +
rcs(var4, 3) +
rcs(var5, 3) )^2,# all 2way interactions
data=dat,
maxit=50) # these fits may need more iterations to converge
bplot( Predict(big.mod, var1,var2, n=10) )
That should show the simultaneous functional form of var1's and var2's contribution to the "5 dimensional" model estimates at 10 points each and at the median value of the three other variables.
