Why can't I use cv.glm on the output of bestglm? - r

I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10-fold CV. The code I used is:
cost1 <- function(good, pi = 0) mean(abs(good - pi) > 0.5)
res.best.logistic <-
  bestglm(Xy = winedata,
          family = binomial,    # binomial family for logistic regression
          IC = "AIC",           # information criterion
          method = "exhaustive")
res.best.logistic$BestModels
best.cv.err <- cv.glm(winedata, res.best.logistic$BestModel, cost1, K = 10)
However, this gives the error:
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the fitted glm object that represents the best fit, and that's what the manual also says. If that's the case, then why can't I find the test error on it using 10-fold CV, with the help of cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality; the packages used are boot (for cv.glm) and bestglm.
The data was processed as:
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'

The bestglm fit rearranges your data and names your response variable y, so if you pass the result back into cv.glm, winedata has no column named y and everything crashes after that.
It's always good to check the class:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You could substitute things into the call and so on, but it's too much of a mess. Fitting is not costly, so make a fresh fit on winedata and pass that to cv.glm:
# each row of BestModels flags which predictors a model includes;
# drop the criterion column and find the included predictors per row
best_var <- apply(res.best.logistic$BestModels[, -ncol(res.best.logistic$BestModels)], 1, which)
# take the variable names for the best (first) model
best_var <- names(best_var[[1]])
new_form <- as.formula(paste("good ~", paste(best_var, collapse = " + ")))
fit <- glm(new_form, data = winedata, family = "binomial")
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)
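An equivalent route, if you prefer not to touch BestModels at all, is to read the selected predictors off the coefficient names of the refitted best model. A minimal sketch; it assumes all predictors are numeric (as in winedata), since factor predictors would carry level suffixes in their coefficient names:
# names of the selected predictors, minus the intercept
best_var <- setdiff(names(coef(res.best.logistic$BestModel)), "(Intercept)")
new_form <- reformulate(best_var, response = "good")  # good ~ selected predictors
fit <- glm(new_form, data = winedata, family = binomial)
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)
best.cv.err$delta  # raw and adjusted 10-fold CV estimates of the cost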

Related

How to evaluate a string variable as factor in the emmeans() command in R?

I would like to assign a variable with a custom factor from an ANOVA model to the emmeans() statement. Here I use the oranges dataset from R to make the code reproducible. This is my model and how I would usually calculate the emmeans of the factor store:
library(emmeans)
oranges$store <- as.factor(oranges$store)
model <- lm(sales1 ~ 1 + price1 + store, data = oranges)
means <- emmeans(model, pairwise ~ store, adjust = "tukey")
Now I would like to assign a variable (lsmeanfact) defining the factor for which the lsmeans are calculated.
lsmeanfact<-"store"
However, when I want to evaluate this variable in the emmeans() function it returns an error, it basically does not find the variable lsmeanfact, so it does not evaluate this variable.
means<-emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust="tukey")
Error in emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust = "tukey") :
No variable named lsmeanfact in the reference grid
How should I change my code to be able to evaluate the variable lsmeanfact so that the lsmeans for store are correctly calculated?
You can make use of the reformulate function:
library(emmeans)
lsmeanfact<-"store"
means <- emmeans(model, reformulate(lsmeanfact, 'pairwise'), adjust="tukey")
Or construct a formula with formula/as.formula.
means <- emmeans(model, formula(paste('pairwise', lsmeanfact, sep = '~')), adjust="tukey")
Here both reformulate(lsmeanfact, 'pairwise') and formula(paste('pairwise', lsmeanfact, sep = '~')) return pairwise ~ store.
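Both constructions are easy to verify at the console, using lsmeanfact as defined above:
reformulate(lsmeanfact, "pairwise")
## pairwise ~ store
formula(paste("pairwise", lsmeanfact, sep = "~"))
## pairwise ~ store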
You do not need to do anything special at all. The specs argument to emmeans() can be a character value. You can get the pairwise comparisons in a separate call, which is actually a better way to go anyway.
library(emmeans)
model <- lm(sales1 ~ price1 + store, data = oranges)
lsmeanfact <- "store"
( EMM <- emmeans(model, lsmeanfact) )
## store emmean SE df lower.CL upper.CL
## 1 8.01 2.61 29 2.67 13.3
## 2 9.60 2.30 29 4.89 14.3
## 3 7.84 2.30 29 3.13 12.6
## 4 10.44 2.35 29 5.63 15.2
## 5 10.19 2.28 29 5.53 14.9
## 6 15.22 2.28 29 10.56 19.9
##
## Confidence level used: 0.95
pairs(EMM)
## contrast estimate SE df t.ratio p.value
## 1 - 2 -1.595 3.60 29 -0.443 0.9976
## 1 - 3 0.165 3.60 29 0.046 1.0000
## 1 - 4 -2.428 3.72 29 -0.653 0.9856
## 1 - 5 -2.185 3.50 29 -0.625 0.9882
## 1 - 6 -7.209 3.45 29 -2.089 0.3206
## 2 - 3 1.761 3.22 29 0.546 0.9936
## 2 - 4 -0.833 3.23 29 -0.258 0.9998
## 2 - 5 -0.590 3.23 29 -0.182 1.0000
## 2 - 6 -5.614 3.24 29 -1.730 0.5239
## 3 - 4 -2.593 3.23 29 -0.802 0.9648
## 3 - 5 -2.350 3.23 29 -0.727 0.9769
## 3 - 6 -7.375 3.24 29 -2.273 0.2373
## 4 - 5 0.243 3.26 29 0.075 1.0000
## 4 - 6 -4.781 3.28 29 -1.457 0.6930
## 5 - 6 -5.024 3.23 29 -1.558 0.6314
##
## P value adjustment: tukey method for comparing a family of 6 estimates
Created on 2021-06-29 by the reprex package (v2.0.0)
Moreover, in any case what is needed in specs is the name(s) of the factor(s) involved, not the factors themselves. Note also that it was unnecessary to convert store to a factor before fitting the model.

How to treat negative values in lm(x~y) function in R?

When running my script I get the following error message: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases, and I'm guessing that is due to some negative values?
The script loops through a list of csv files, and for a small selection of them the code works; when I run it on all of them, I get the error message. I checked the data: there are some negative NDVI values (about 2% of the whole data), which are always -99999, and some soil moisture values which are 0.
I found the suggestion to add na.action = na.exclude in the lm function:
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf)
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf, na.action=na.exclude)
But the same error still occurs. Do you have any other solution for this, besides deleting the negative values from the data? Ideally I would exclude these values from the linear regression (lm), or skip a csv file entirely if it contains negative values.
Missing values in R should be coded as NA. You could use replace:
replace(dat, dat == -99999, NA)
# X1 X2 X3
# 1 1.37 1.30 -0.31
# 2 NA 2.29 -1.78
# 3 0.36 -1.39 -0.17
# 4 0.63 -0.28 1.21
# 5 0.40 NA 1.90
# 6 -0.11 0.64 -0.43
# 7 1.51 -0.28 -0.26
# 8 -0.09 -2.66 -1.76
# 9 2.02 -2.44 NA
# 10 -0.06 1.32 -0.64
which actually works directly inside the lm() call, without changing the data:
lm(X1 ~ X2 + X3, replace(dat, dat == -99999, NA))$coefficients
# (Intercept) X2 X3
# 0.61499466 0.06062925 0.25979370
If there is more than one missing-value code, you could do e.g.:
replace(dat, array(unlist(dat) %in% c(-99999, -88888), dim(dat)), NA)
Data:
set.seed(42)
dat <- data.frame(matrix(round(rnorm(30), 2), 10, 3))
dat[2, 1] <- -99999
dat[5, 2] <- -99999
dat[9, 3] <- -99999
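Since your script loops over many csv files, another option is to declare the sentinel value at read time with na.strings and skip any file with no complete cases left, which is exactly what triggers the "0 (non-NA) cases" error. A rough sketch using the question's column names; the folder name and separator are assumptions to adjust for your files:
files <- list.files("ndvi_csvs", pattern = "\\.csv$", full.names = TRUE)  # hypothetical folder
models <- lapply(files, function(f) {
  dat <- read.csv(f, na.strings = c("NA", "-99999"))   # -99999 becomes NA on read
  vars <- c("NDVI", "T", "Prec", "soilM")
  if (nrow(na.omit(dat[, vars])) == 0) return(NULL)    # skip files with no usable rows
  lm(NDVI ~ T + Prec + soilM, data = dat, na.action = na.exclude)
})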

Conditional density distribution, two discrete variables

I have plotted the conditional density distribution of my variables using cdplot (R). My independent variable and my dependent variable are not independent. The independent variable is discrete (it takes only certain values between 0 and 3) and the dependent variable is also discrete (11 levels from 0 to 1 in steps of 0.1).
Some data:
dat <- read.table( text="y x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3 0.3
2 0.3
2.25 0.4
1.75 0.4
1.75 0.5
2 0.5
1.75 0.6
1.75 0.6
1.75 0.7
1 0.7
0.54 0.8
0 0.8
0.54 0.9
0 0.9
0 1.0
0 1.0", header=TRUE, colClasses="factor")
I wonder if my variables are appropriate for this kind of analysis.
Also, I'd like to know how to report these results in an elegant way that makes academic and statistical sense.
This is a run using the rms package's lrm function, which is typically used for binary outcomes but also handles ordered categorical variables:
library(rms)  # also loads Hmisc
# first get the data in the form you described
dat[] <- lapply(dat, ordered)  # makes both columns ordered factor variables
?lrm  # read the help page; also look at the supporting book and citations there
lrm(y ~ x, data = dat)
# --- output------
Logistic Regression Model
lrm(formula = y ~ x, data = dat)
Frequencies of Responses
0 0.54 1 1.75 2 2.25 2.75 3 3.00
4 2 1 5 2 2 4 1 1
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 22 LR chi2 51.66 R2 0.920 C 0.869
max |deriv| 0.0004 d.f. 10 g 20.742 Dxy 0.738
Pr(> chi2) <0.0001 gr 1019053402.761 gamma 0.916
gp 0.500 tau-a 0.658
Brier 0.048
Coef S.E. Wald Z Pr(>|Z|)
y>=0.54 41.6140 108.3624 0.38 0.7010
y>=1 31.9345 88.0084 0.36 0.7167
y>=1.75 23.5277 74.2031 0.32 0.7512
y>=2 6.3002 2.2886 2.75 0.0059
y>=2.25 4.6790 2.0494 2.28 0.0224
y>=2.75 3.2223 1.8577 1.73 0.0828
y>=3 0.5919 1.4855 0.40 0.6903
y>=3.00 -0.4283 1.5004 -0.29 0.7753
x -19.0710 19.8718 -0.96 0.3372
x=0.2 0.7630 3.1058 0.25 0.8059
x=0.3 3.0129 5.2589 0.57 0.5667
x=0.4 1.9526 6.9051 0.28 0.7773
x=0.5 2.9703 8.8464 0.34 0.7370
x=0.6 -3.4705 53.5272 -0.06 0.9483
x=0.7 -10.1780 75.2585 -0.14 0.8924
x=0.8 -26.3573 109.3298 -0.24 0.8095
x=0.9 -24.4502 109.6118 -0.22 0.8235
x=1 -35.5679 488.7155 -0.07 0.9420
There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression; the quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
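For comparison, a minimal polr sketch on the same data. It relies on the ordered factors created above; with only 22 rows and this many response levels, expect convergence warnings:
library(MASS)
polr_fit <- polr(y ~ x, data = dat, Hess = TRUE)  # Hess = TRUE so summary() can report SEs
summary(polr_fit)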

Model multiple imputation with interaction terms

According to the documentation of the mice package, if we want to impute data when we're interested in interaction terms, we need to use passive imputation, which is done the following way:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I((bmi-25)*(chl-200))"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
It is said that
Imputations created in this way preserve the interaction of bmi with chl
Here, a new variable called bmi.chl is created in the original dataset. The meth step tells how this variable is to be imputed from the existing ones. The pred step says we don't want to predict bmi and chl from bmi.chl. But now, if we want to apply a model, how do we proceed? Is the product defined by "~I((bmi-25)*(chl-200))" just a way to control for the imputed values of the main effects, i.e. bmi and chl?
If the model we want to fit is glm(hyp~chl*bmi, family="binomial"), what is the correct way to specify this model from the imputed data? fit1 or fit2?
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
Or do we have to use somehow the imputed values of the new variable created, i.e. bmi.chl?
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
With passive imputation, it does not matter whether you use the passively imputed variable or re-calculate the product term in your call to glm.
The reason that fit1 and fit2 yield different results in your example is that you are not just doing passive imputation for the product term.
Instead, you are transforming the two variables before multiplying (i.e., you calculate bmi-25 and chl-200). As a result, the passively imputed variable bmi.chl does not represent the product term bmi*chl but rather (bmi-25)*(chl-200).
If you just calculate the product term, then fit1 and fit2 yield the same results, as they should:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I(bmi*chl)"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
pred[c("hyp"), "bmi.chl"] <- 1
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
# > round(summary(pool(fit1)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# chl:bmi 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 NA 0.43 0.33
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
# > round(summary(pool(fit2)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# bmi.chl 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 25 0.43 0.33
This is not surprising because the ~I(bmi*chl) in mice and the bmi*chl in glm do the exact same thing. They merely calculate the product of the two variables.
Remark:
Note that I added a line saying that bmi.chl should be used as a predictor when imputing hyp. Without this step, passive imputation has no purpose because the imputation model would neglect the product term, thus being incongruent with the analysis model.
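A quick sanity check on the setup is to inspect the method vector and predictor matrix before imputing; in pred, rows are the variables being imputed and columns are their predictors:
meth["bmi.chl"]    # the passive formula: ~I(bmi*chl)
pred[, "bmi.chl"]  # should be 1 for hyp, and 0 for bmi and chl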

Error in R: multi effects models

I'm having a few issues I'd appreciate some help with.
head(new.data)
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB seasons
1 2 3 1996 1 30.7 0.35 0.5000750 0.75 7.4 0.055 winter
2 6 1 1996 2 24.8 0.25 0.5001375 0.75 6.9 0.200 winter
3 7 4 1996 2 60.4 0.05 0.5001375 0.75 7.1 0.055 winter
4 7 4 1996 2 58.1 0.15 0.5001570 0.75 7.5 0.055 winter
5 7 4 1996 3 62.2 0.20 0.5003881 2.00 7.6 0.055 spring
6 5 2 1996 3 40.3 0.15 0.5003500 2.00 7.7 0.055 spring
library(nlme)
mod3 <- lme(TTHM ~ CL2_FREE, random = ~ 1 | Treatment_Code/WSZ_Code, data = new.data, method = "ML")
mod3
Linear mixed-effects model fit by maximum likelihood
Data: new.data
Log-likelihood: -1401.529
Fixed: TTHM ~ CL2_FREE
(Intercept) CL2_FREE
54.45240 -40.15033
Random effects:
Formula: ~1 | Treatment_Code
(Intercept)
StdDev: 0.004156934
Formula: ~1 | WSZ_Code %in% Treatment_Code
(Intercept) Residual
StdDev: 10.90637 13.52372
Number of Observations: 345
Number of Groups:
Treatment_Code WSZ_Code %in% Treatment_Code
4 8
plot(augPred(mod3))
Error in plot(augPred(mod3)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Error in sprintf(gettext(fmt, domain = domain), ...) :
invalid type of argument[1]: 'symbol'
I'm not sure why I get this error. The ranef plot seems OK:
plot(ranef(mod3))
But that only gives the values of the random intercepts, not TTHM predictions.
I'm looking for a way to plot the predictions as in a typical augPred plot, showing the fit for each zone. Hope that makes sense.
You need a groupedData object to use augPred. I hope this helps.
Best wishes #CSJCampbell
con <- textConnection("
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB seasons
2 3 1996 1 30.7 0.35 0.5000750 0.75 7.4 0.055 winter
6 1 1996 2 24.8 0.25 0.5001375 0.75 6.9 0.200 winter
7 4 1996 2 60.4 0.05 0.5001375 0.75 7.1 0.055 winter
7 4 1996 2 58.1 0.15 0.5001570 0.75 7.5 0.055 winter
7 4 1996 3 62.2 0.20 0.5003881 2.00 7.6 0.055 spring
5 2 1996 3 40.3 0.15 0.5003500 2.00 7.7 0.055 spring
")
new.data <- read.table(con, header = TRUE)
library(nlme)
new.data.grp <- groupedData(TTHM ~ CL2_FREE | Treatment_Code/WSZ_Code, data = new.data)
mod3 <- lme(TTHM ~ CL2_FREE, random= ~ 1| Treatment_Code/WSZ_Code, data=new.data.grp, method ="ML")
mod3
ap3 <- augPred(mod3)
plot(ap3)
I realize most are probably using ggplot2 and lme4 at this point, but I'm a bit crufty.
Here are a couple of things that I've found while working with lists of response variables fit using lme().
So, I've been working with a number of response variables that I want to fit to a particular set of inputs. In short, my code looks something like:
mymodels <- list()
for (resp in my_response_vars) {
  f <- as.formula(paste(resp, paste(my_input_vars, collapse = "+"), sep = "~"))
  mymodels[[resp]] <- lme(fixed = f, random = ~ wave | group, method = "ML",
                          data = mydata, na.action = na.exclude)
}
I've been successful in treating the entries of the resulting list as normal lme() objects. The problem comes when I want to plot predictions via augPred(). Specifically, I get the following error:
Error in tapply(object[[nm]], groups, FUN[["numeric"]], ...) :
arguments must have same length
So, after much searching, I decided to have a look under the hood of augPred() via debug(). Here are some of the insights I came to. I'm not sure whether these qualify as bugs or would require a patch, but I hope they can help others with similar problems.
When called, augPred() looks up the name of the data used in the original lme() call, then inherits that object from the parent.frame() via a call to eval(). I'm not sure whether this defaults to the object's frame or the global environment, but when I change this to data = object$data inside debug(), things work. So, ostensibly, if you fit your model on a subset of the data, augPred() will still pull in the full data set.
This causes issues if one response variable has missing values and you are interested in another that does not: since everything in the data.frame ends up in an eventual call to gsummary(), missing values in a non-response variable will throw a wrench into things.
So, missing values mess things up. I have defaulted to making a temporary data.frame with only the columns of interest, then running complete.cases() on it prior to fitting the lme() model.
mymods <- list()
for (resp in my_response_vars) {
  f <- as.formula(paste(resp, paste(my_input_vars, collapse = "+"), sep = "~"))
  v2keep <- all.vars(f)                                  # grab the model terms
  smdat <- mydata[, unique(c(v2keep, "wave", "group"))]  # keep the random-effects variables too
  smdat <- smdat[complete.cases(smdat), ]                # scrub missing rows
  tmpmod <- lme(fixed = f, random = ~ wave | group,
                method = "ML", data = smdat)
  mymods[[resp]] <- tmpmod
  # include augPred() call here
}
If you do not include a primary argument in your call to augPred(), it will require that your data.frame is a groupedData object.
So, if you are running into the 'arguments must have same length' error: subset your data first under a different name, and make sure to clear out missing rows explicitly prior to fitting your model.
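If you only need the predictions rather than the lattice panels, predict.lme sidesteps augPred() entirely; its level argument accepts a vector, returning the fixed-effects-only fit alongside each grouping level. A small sketch against mod3 from the groupedData example above:
preds <- predict(mod3, level = 0:2)  # level 0 = fixed effects only; 1-2 = grouping levels
head(preds)
new.data.grp$fit_zone <- predict(mod3, level = 2)  # innermost (WSZ within treatment) predictions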

Resources