Determining the degrees of freedom from 'chisq' result in ROCR package - r

Forgive me for what may be a simple question; I am relatively new to both statistics and R.
I am currently performing multiple logistic regression on the following model: E6 ~ S + FA + FR + DV, where the dependent variable E6 is dichotomous (0/1); S, FA and FR are ordinal categorical independent variables with scales of 0:7, 1:3 and 1:5 respectively; and DV is a dichotomous independent variable (0/1).
As a measure of performance I am using the ROCR package in R: I create a prediction object and then produce a performance object with the measure 'chisq' to evaluate the model's fit by a test of independence.
This all works fine and I receive the X2 value for the model fit; however, the degrees of freedom (df) are not returned. I should mention that the model has 4110 observations.
Is there a simple way of determining the df for X2 for the above-mentioned model? And is there a reason why ROCR returns the X2 value but not the df?
Any guidance would be great. Thanks in advance.
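One way to sanity-check this yourself: assuming the statistic comes from a 2x2 cross-tabulation of predicted class against the observed outcome at a given cutoff, the df is (rows - 1) * (cols - 1) = 1, and base R's chisq.test() returns both the statistic and the df. A minimal sketch with simulated stand-ins for your fitted probabilities and E6 values:

```r
# Sketch: recover both the chi-squared statistic and its df by
# cross-tabulating predicted class against the observed outcome.
# 'pred_prob' and 'labels' are simulated stand-ins for your data.
set.seed(42)
labels     <- rbinom(200, 1, 0.5)          # observed dichotomous outcome
pred_prob  <- runif(200)                   # model's fitted probabilities
pred_class <- as.integer(pred_prob > 0.5)  # classify at a 0.5 cutoff

tab <- table(pred_class, labels)           # 2 x 2 contingency table
res <- chisq.test(tab, correct = FALSE)
res$statistic   # the X^2 value
res$parameter   # the df: (2 - 1) * (2 - 1) = 1 for a 2 x 2 table
```

For a dichotomous prediction against a dichotomous outcome the df is always 1, which may be why ROCR does not bother reporting it.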

Related

Warning in the Matchit package ("Glm.fit: fitted probabilities numerically 0 or 1 occurred") How to deal with that?

I'm running code with around 200,000 observations, of which 10,000 were treated; I'm trying to match the rest using the MatchIt package.
Because of one of these variables, a warning message appears and I don't know whether I should just ignore it. The message is: "glm.fit: fitted probabilities numerically 0 or 1 occurred".
The code that I'm running is similar to the one below:
m.out <- matchit(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5, data = mydata, method = "nearest", exact = c("VAR1", "VAR3", "VAR5"))
For illustration, let's say the variable with the issue is VAR5, a character variable with about 200 distinct values. My question is whether this warning is a real problem, or whether it just means this variable has too many levels for the size of my data, so that a treatment/control prediction cannot be found. Either way, is there something I can do to avoid this warning?
MatchIt by default uses logistic regression through the glm function to estimate propensity scores. This warning means that the logistic regression model has been overfit, with some variables perfectly predicting treatment status. This may indicate a violation of positivity (i.e., your two groups are fundamentally different from each other), but, as you mentioned, it could just be that a relatively unimportant feature has many categories, and some of these perfectly overlap with treatment. There are a few ways to handle this problem; one of them is indeed to drop VAR5, but you can also try to estimate your own propensity scores outside MatchIt using a method that doesn't suffer from this problem and then supply those propensity scores to matchit() through the distance argument.
Two methods come to mind. The first is to use brglm2, a package that implements an alternate method of fitting logistic regression models so that fitted probabilities are never 0 or 1. This method is easy to implement because it just uses a slight variation of the glm function.
A second is to use a machine learning method that performs regularization (i.e., variable selection) so that only the variables and levels of the factors that are important for the analysis are included. You could use glmnet to perform lasso or elastic net logistic regression, you could use gbm or twang to do generalized boosted modeling, or you could use SuperLearner to stack several machine learning methods and take the best predictions from them. You can then supply the predicted values to matchit().
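A sketch of the brglm2 route, reusing the variable names from the question (this assumes the brglm2 and MatchIt packages and is untested against your data):

```r
library(brglm2)   # bias-reduced GLM fitting; fitted probabilities stay off 0 and 1
library(MatchIt)

# Fit the propensity model with bias reduction instead of plain glm()
ps_fit <- glm(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5,
              data = mydata, family = binomial, method = "brglmFit")

# Supply the estimated propensity scores to matchit() via 'distance'
m.out <- matchit(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5,
                 data = mydata, method = "nearest",
                 distance = fitted(ps_fit),
                 exact = c("VAR1", "VAR3", "VAR5"))
```

The key point is that `distance` accepts a numeric vector of externally estimated propensity scores, so any fitting method that produces sensible probabilities can be swapped in.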

Predict Survival using RMS package in R?

I am using the function survest in the rms package to generate survival probabilities. I want to take a subset of my data and pass it through survest, and have written a for loop that does this. It runs and outputs survival probabilities for each set of predictors.
for (i in 1:nrow(df)) {
  row <- df[i, ]
  print(row)
  surv <- survest(fit, row, times = 365)
  print(surv)
}
My first question is whether there is a way to use survest to predict median survival rather than having to specify a particular time frame, or alternatively, whether there is a better function to use.
Secondly, I want to be able to predict survival using only four of the five predictors of my Cox model, for example as below. While I understand this will be less accurate, is it possible to do this using survest?
survest(fit, expand.grid(Years.to.birth = NA, Tumor.stage = 1, Date = 2000,
                         Somatic.mutations = 2, ttype = "brca"), times = 300)
To get the median survival time, use the Quantile function generator in rms, or the summary.survfit function in the survival package. The function created by Quantile can be evaluated at the 0.5 quantile. It is a function of the linear predictor, so you'll need to use the predict function on your subset of observations to get the linear predictor values from which to compute the median.
For your other questions, survest needs to use the full model you fitted (all the variables). You would need to use multiple imputation if a variable is not available, or a quick approximate refit of the model a la fastbw.
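A sketch of the Quantile route, using the lung data from the survival package as a stand-in for your data (assumes the rms package; the argument names are from memory and worth checking against ?Quantile.cph):

```r
library(rms)
library(survival)

# surv = TRUE stores the baseline survival estimates Quantile() needs
fit <- cph(Surv(time, status) ~ age + sex, data = lung,
           surv = TRUE, x = TRUE, y = TRUE)

med <- Quantile(fit)                          # function generator: med(q, lp)
lp  <- predict(fit, lung[1:5, ], type = "lp") # linear predictors for a subset
med(q = 0.5, lp = lp)                         # median survival time per observation
```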
We are trying to do something similar with the missing data.
While MI is a good idea, a simpler approach for a single missing variable is to run the prediction multiple times, replacing the missing variable with values sampled at random from that variable's distribution.
E.g. if we have x1, x2 and x3 as predictors, and we want to predict when x3 is missing, we run predictions using x1, x2 and take_random_sample_from(x3), and then average the survival times over all of the results.
The problem with reformulating the model (e.g. in this case re-modelling so we only consider x1 and x2) is that it doesn't let you explore the impact of x3 explicitly.
For simple cases this should work: it essentially averages the survival prediction over a large range of x3 values, and therefore makes x3 relatively uninformative.
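A minimal sketch of that averaging idea, with a plain lm() standing in for the survival model (the same pattern applies to survest()):

```r
# Average predictions over random draws of a missing predictor x3.
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 1 + df$x1 - 0.5 * df$x2 + 0.2 * df$x3 + rnorm(100, sd = 0.1)
fit <- lm(y ~ x1 + x2 + x3, data = df)

newrow <- data.frame(x1 = 0.5, x2 = -1)        # x3 is missing for this observation
draws  <- sample(df$x3, 200, replace = TRUE)   # draw from x3's observed distribution
preds  <- sapply(draws, function(v) predict(fit, cbind(newrow, x3 = v)))
mean(preds)   # the prediction with x3 averaged out
```

Sampling from the observed values of x3 (rather than a parametric distribution) keeps the draws realistic without any distributional assumptions.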
HTH,
Matt

Logistic Regression Model & Multicolinearity of Categorical Variables in R

I have a training dataset that has 3233 rows and 62 columns. The dependent variable is Happy (train$Happy), which is a binary variable. The other 61 columns are categorical independent variables.
I've created a logistic regression model as follows:
logModel <- glm(Happy ~ ., data = train, family = binomial)
However, I want to reduce the number of independent variables that go into the model, perhaps down to 20 or so. I would like to start by getting rid of collinear categorical variables.
Can someone shed some light on how to determine which categorical variables are collinear, and what threshold I should use when removing a variable from a model?
Thank you!
If your variables were continuous, the obvious solution would be penalized logistic regression (lasso); in R it is implemented in glmnet.
With categorical variables the problem is much more difficult.
I was in a similar situation and I used the importance plot from the randomForest package to reduce the number of variables.
This will not help you find collinearity, only rank the variables by importance.
You have only about 60 variables, and perhaps you have knowledge of the field, so you can try adding to your model some variables that make sense to you (like z = x1 - x3 if you think the difference x1 - x3 is important) and then rank them according to a random forest model.
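The lasso suggestion might look like this (a sketch assuming the `train` data frame from the question; untested):

```r
library(glmnet)

# Dummy-code the categorical predictors into a numeric design matrix
X <- model.matrix(Happy ~ ., data = train)[, -1]  # drop the intercept column
y <- train$Happy

# Cross-validated lasso (alpha = 1) logistic regression
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.1se")  # coefficients shrunk exactly to zero are dropped
```

Note that the lasso selects individual dummy levels rather than whole factors; grouped variants (e.g. a group lasso) keep each factor's levels together.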
You could use Cramer's V, or the related phi or contingency coefficient (see a great paper at http://www.harding.edu/sbreezeel/460%20files/statbook/chapter15.pdf), to measure collinearity among categorical variables. If two or more categorical variables have a Cramer's V value close to 1, they are highly "correlated" and you may not need to keep all of them in your logistic regression model.
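A small base-R sketch of Cramer's V, V = sqrt(X^2 / (n * (min(r, c) - 1))), which you could apply to each pair of factors:

```r
# Cramer's V for two categorical variables, from the chi-squared statistic
cramers_v <- function(x, y) {
  tbl  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE))$statistic
  n    <- sum(tbl)
  as.numeric(sqrt(chi2 / (n * (min(dim(tbl)) - 1))))
}

a <- rep(c("yes", "no"), each = 50)
b <- a                       # a perfectly associated copy
cramers_v(a, b)              # 1: the two factors are redundant
```

Values near 1 flag redundant pairs; values near 0 indicate little association.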

Pseudo R squared for cumulative link function

I have an ordinal dependent variable and am trying to use a number of independent variables to predict it. I use R. The function I use is clm in the ordinal package, which fits a cumulative link model with, to be precise, a probit link.
I tried the function pR2 in the package pscl to get the pseudo R squared with no success.
How do I get pseudo R squareds with the clm function?
Thanks so much for your help.
There are a variety of pseudo-R^2s. I don't like to use any of them because I do not see the results as having a meaning in the real world: they do not estimate effect sizes of any sort, and they are not particularly good for statistical inference. Furthermore, in situations like this with multiple observations per entity, it is debatable which value of "n" (the number of subjects) or which degrees of freedom is appropriate. Some people use McFadden's R^2, which is relatively easy to calculate here, since clm returns a list with a component named "logLik". You just need to know that the log-likelihood is only a multiplicative constant (-2) away from the deviance. With the model from the first example:
library(ordinal)
data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
fm0 <- clm(rating ~ 1, data = wine)
( McF.pR2 <- 1 - fm1$logLik/fm0$logLik )
[1] 0.1668244
I had seen this question on CrossValidated and was hoping to see the more statistically sophisticated participants over there take this one on, but they saw it as a programming question and dumped it over here. Perhaps their opinion of R^2 as a worthwhile measure is as low as mine?
I recommend the nagelkerke function from the rcompanion package to get pseudo-R-squared values.
When your predictor or outcome variables are categorical or ordinal, the R-squared will typically be lower than with truly numeric data. R-squared is only a weak indicator of a model's fit, and you shouldn't choose a model based on it.
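The nagelkerke suggestion applied to the clm example above might look like this (assumes the rcompanion and ordinal packages; untested):

```r
library(ordinal)
library(rcompanion)

data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
nagelkerke(fm1)   # reports McFadden, Cox-Snell, and Nagelkerke pseudo-R^2
```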

How do I plot predictions from new data fit with gee, lme, glmer, and gamm4 in R?

I have fit my discrete count data using a variety of functions for comparison. I fit a GEE model using geepack, a linear mixed effect model on the log(count) using lme (nlme), a GLMM using glmer (lme4), and a GAMM using gamm4 (gamm4) in R.
I am interested in comparing these models and would like to plot the expected (predicted) values for a new set of data (predictor variables). My goal is to compare the predicted effects for each model under particular conditions (x variables). Of particular interest is the comparison between marginal (GEE) and conditional estimates.
I think my main problem might be getting the new data in the correct form with the correct labels and attributes and such. I am still very much an R novice and struggle with this stuff (no course on this at my university unfortunately).
I currently have fitted models
gee1, lme1, lmer1, gamm1
and can extract their fixed effect coefficients and standard errors without a problem. I also don't have a problem converting them from the log scale or estimating confidence intervals accounting for the random effects.
I also have my new dataframe newdat which has 365 observations of 23 variables (average environmental data for each day of the year).
I am stuck on how to predict new count estimates from this. I played around with the model.matrix function but couldn't get it to work. For example, I tried:
mm <- model.matrix(terms(glmm1), newdat)
# Error in model.frame.default(object, data, xlev = xlev) :
#   object is not a matrix
newdat$pcount <- mm %*% fixef(glmm1)
Any suggestions or good references would be greatly appreciated. Can anyone help with the error above?
Getting predictions for lme() and lmer() is documented at http://glmm.wikidot.com/faq
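That FAQ's model.matrix recipe, sketched against lme4's built-in cbpp data (your glmm1 and newdat would take the place of fit and newdat here; untested against your models):

```r
library(lme4)

fit <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

newdat <- data.frame(period = factor(1:4))
mm  <- model.matrix(~ period, newdat)  # fixed-effects design matrix for new data
eta <- mm %*% fixef(fit)               # linear predictor at the population level
newdat$pred <- plogis(eta)             # back-transform to the response scale

# approximate standard errors from the fixed-effect covariance matrix
se <- sqrt(diag(mm %*% vcov(fit) %*% t(mm)))
```

Note that this gives population-level (random effects set to zero) predictions, which is what makes the conditional GLMM estimates comparable to the marginal GEE ones only approximately.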
