I cannot get the gam.check function to work - r

I used the model below to predict marine survival in both SAS and R. The R code is below.
gam_mod<-gam(recruits/smolt ~ s(max, k = 6) + s(medB, k = 10),
weights = smolt, data = GAMdata, method = "REML", family = binomial("logit"))
However when I use gam.check(gam_mod) I get the error message below:
Error in dm[, i] <- sort(residuals(object, type = type)) :
number of items to replace is not a multiple of replacement length

Why is gamm returning the error "unused argument (offset= ...)"?

all-
First-time poster, here, so please be forbearing if I've violated some of the conventions for asking questions (like, for example, providing a replicable example).
I'm trying to estimate a Generalized Additive Mixed Model using the "gamm" function with this code:
fit1.1 = gamm(opioidNonFatalOD ~ s(mandatoryReg.l2, k = 3, fx = TRUE, bs = "cr") +
                s(coalitionActive.l2, k = 3, fx = TRUE, bs = "cr") +
                monthsSinceJan2011 +
                everFunded +
                ICD10 +
                spoke5 +
                hub +
                s(monthly2, bs = "cc", fx = FALSE, k = 4) +
                s(county2, bs = "re"),
                #+ offset(log(population / 100000)),
              correlation = corAR1(form = ~ monthsSinceJan2011 | county2),
              data = tsData,
              family = quasipoisson, offset = log(population / 100000),
              niterPQL = 20,
              verbosePQL = TRUE)
For some reason, it looks like the "offset" argument isn't getting passed to gammPQL. I get this error:
iteration 1
Quitting from lines 201-220 (pfs_model_experiments_041520.Rmd)
Error in lme(fixed = fixed, random = random, data = data, correlation = correlation, :
unused argument (offset = log(population/1e+05))
Calls: <Anonymous> ... withVisible -> eval -> eval -> gamm -> eval -> eval -> gammPQL
Execution halted
Here're the traceback messages:
Error in lme(fixed = fixed, random = random, data = data, correlation = correlation, : unused argument (offset = log(population/1e+05))
4.
gammPQL(y ~ X - 1, random = rand, data = strip.offset(mf), family = family, correlation = correlation, control = control, weights = weights, niter = niterPQL, verbose = verbosePQL, mustart = mustart, etastart = etastart, ...) at <text>#1
3.
eval(parse(text = paste("ret$lme<-gammPQL(", deparse(fixed.formula), ",random=rand,data=strip.offset(mf),family=family,", "correlation=correlation,control=control,", "weights=weights,niter=niterPQL,verbose=verbosePQL,mustart=mustart,etastart=etastart,...)", sep = "")))
2.
eval(parse(text = paste("ret$lme<-gammPQL(", deparse(fixed.formula), ",random=rand,data=strip.offset(mf),family=family,", "correlation=correlation,control=control,", "weights=weights,niter=niterPQL,verbose=verbosePQL,mustart=mustart,etastart=etastart,...)", sep = "")))
1.
gamm(opioidNonFatalOD ~ s(mandatoryReg.l2, k = 3, fx = TRUE, bs = "cr") + s(coalitionActive.l2, k = 3, fx = TRUE, bs = "cr") + monthsSinceJan2011 + everFunded + ICD10 + spoke5 + hub + s(monthly2, bs = "cc", fx = FALSE, k = 4) + s(county2, bs = "re"), ...
I've tried using the offset as a term in the model (see commented-out code), but get a similar error.
Just by inspecting the code, does anyone have an idea of what I'm doing wrong?
Thanks,
David
tl;dr;
Create the offset outside the gamm function and then pass it to the formula using ...+offset().
In your example then use:
tsData$off = log(tsData$population/100000)
gamm(opioidNonFatalOD ~ <other variables> + s(county2, bs = "re") + offset(off),
<other stuffs>)
The general syntax for adding an offset to a GAM is to include it in the formula, like y ~ ... + x + offset(offset_variable). However, as the examples below show, gamm seems to struggle to parse function calls (i.e. the log or the division) inside offset().
Some examples:
library(mgcv)
# create some data
set.seed(1)
dat <- gamSim(6,n=200,scale=.2,dist="poisson")
# create an offset
dat$off1 = (dat$y+1)*sample(2:10, 200, TRUE)
Attempt 1: finds off1 but errors, likely due to the large values in off1 (and we really want the offset log-transformed, or transformed by whichever link function is used)
m1 <- gamm(y~s(x0)+s(x1)+s(x2) + offset(off1),
family=poisson,data=dat,random=list(fac=~1))
Maximum number of PQL iterations: 20
iteration 1
iteration 2
Error in na.fail.default(list(Xr.1 = c(-0.00679246534326368, -0.0381904761033802,
:missing values in object
Attempt 2: gamm cannot find off1 once it is log-transformed inside the offset function
m2 <- gamm(y~s(x0)+s(x1)+s(x2) + offset(log(off1)),
family=poisson, data=dat,random=list(fac=~1))
Maximum number of PQL iterations: 20
iteration 1
Error in eval(predvars, data, env) : object 'off1' not found
Attempt 3: define the transformed offset term outside the offset function
# Success
dat$off2 = log(dat$off1)
m3 <- gamm(y~s(x0)+s(x1)+s(x2) + offset(off2),
family=poisson, data=dat, random=list(fac=~1))
So create the (already transformed) offset variable as a column in the data first, then pass it to the gamm formula.
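As a sanity check that precomputing the offset does not change the model itself, here is a small sketch using plain glm(), which does accept an inline offset(log(...)). The data are simulated and the column names exposure and off are made up for illustration; this only shows that the two ways of writing the offset are equivalent, not why gamm fails on the inline form.

```r
# Simulated Poisson counts with a known exposure (hypothetical data)
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n),
                  exposure = sample(2:10, n, replace = TRUE))
dat$y <- rpois(n, lambda = dat$exposure * exp(0.3 + 0.5 * dat$x))

# Offset written inline, with the log taken inside the formula
m_inline <- glm(y ~ x + offset(log(exposure)), family = poisson, data = dat)

# Offset precomputed as a column, as recommended for gamm
dat$off <- log(dat$exposure)
m_pre <- glm(y ~ x + offset(off), family = poisson, data = dat)

# The two fits should agree
all.equal(coef(m_inline), coef(m_pre))
```

Since the two specifications give the same fit, moving the transformation out of the formula costs nothing statistically.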

How to fix "variable length differ" error in cv.zipath?

Trying to run a Cross validation of a zero-inflated poisson model using cv.zipath from the mpath package.
Fitting the LASSO
fit.lasso = zipath(estimation_sample_nomiss ~ .| .,
data = missings,
nlambda = 100,
family = "poisson",
link = "logit")
Cross validation
n <- dim(docvisits)[1]
K <- 10
set.seed(197)
foldid <- split(sample(1:n), rep(1:K, length = n))
fitcv <- cv.zipath(F_time_unemployed~ . | .,
data = estimation_sample_nomiss, family = "poisson",
nlambda = 100, lambda.count = fit.lasso$lambda.count[1:30],
lambda.zero = fit.lasso$lambda.zero[1:30], maxit.em = 300,
maxit.theta = 1, theta.fixed = FALSE, penalty = "enet",
rescale = FALSE, foldid = foldid)
I encounter the following error:
Error in model.frame.default(formula = F_time_unemployed ~ . + ., data = list(: variable lengths differ (found for '(weights)')
I have cleaned the sample of all NA's but still encounter the error message.
The solution turns out to be that cv.zipath() does not accept tibbles, at least in this instance (no guarantee as to how far this generalises). After using dplyr commands, one needs to convert the result back to a plain data frame, so the fix is as simple as calling as.data.frame() on the data before passing it in.
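A minimal sketch of the conversion step. To keep it runnable without dplyr or tibble installed, it fakes a tibble by attaching the tibble class vector to an ordinary data frame; with a real tibble the as.data.frame() call is the same. The object names are made up.

```r
# An ordinary data frame standing in for the estimation sample
df <- data.frame(y = c(0L, 2L, 1L), x = c(1.5, 2.5, 3.5))

# Give it the class vector a tibble carries (stand-in for dplyr output)
tbl_like <- structure(df, class = c("tbl_df", "tbl", "data.frame"))
class(tbl_like)

# Convert back to a plain data frame before calling the modelling function
plain <- as.data.frame(tbl_like)
class(plain)
```

Checking class() on the object right before the model call is a quick way to see whether a tibble has slipped through.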

How to cluster standard error in clubSandwich's vcovCR()?

I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something general about R, but could anyone tell me what's wrong with my code? (I saw a related post here How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem).
My code:
N <- 100;id <- 1:N;id <- c(id,id);gid <- 1:(N/2);
gid <- c(gid,gid,gid,gid);T <- rep(0,N);T = c(T,T+1)
a <- qnorm(runif(N),mean=0,sd=0.005)
gp <- qnorm(runif(N/2),mean=0,sd=0.0005)
u <- qnorm(runif(N*2),mean=0,sd=0.05)
a <- c(a,a);gp = c(gp,gp,gp,gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this
Make your data a pdata.frame first. This is safer, especially if you want to have the time index created automatically (seems to be the case looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster=Data$id,type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2,cluster=Data$gid,type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (from version 0.3.3) for plm objects. It assumes both index variables are in the original data but this is not the case in your example where the time index is created automatically by only specifying the individual dimension by the index argument.
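If converting to a pdata.frame is not convenient, another option that follows from this explanation is to add an explicit time column, so that both index variables exist in the data. A base R sketch, assuming the rows within each id are already in time order; the column name time is made up, and I have not checked this against every clubSandwich version.

```r
# Two periods per individual, laid out like the question's data
N <- 100
Data <- data.frame(id = rep(1:N, times = 2),
                   T  = rep(c(0, 1), each = N))

# Within-id running counter: 1 for the first row of each id, 2 for the second
Data$time <- ave(seq_len(nrow(Data)), Data$id, FUN = seq_along)

# Each (id, time) pair should occur exactly once
all(table(Data$id, Data$time) == 1)

# Both columns could then be named in the index, e.g.
# plm(Y ~ T, data = Data, model = "within", index = c("id", "time"))
```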

Format of newx in Lasso regression gives error in R

I am trying to implement lasso linear regression. I train my model but when I try to make prediction on unknown data it gives me the following error:
Error in cbind2(1, newx) %*% nbeta :
invalid class 'NA' to dup_mMatrix_as_dgeMatrix
Summary of my data is:
I want to predict the unknown percent_gc. I initially train my model using data for which percent_gc is known
set.seed(1)
###training data
data.all <- tibble(description = c('Xylanimonas cellulosilytica XIL07, DSM 15894','Teredinibacter turnerae T7901',
'Desulfotignum phosphitoxidans FiPS-3, DSM 13687','Brucella melitensis bv. 1 16M'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('63.93','69.372','3.493.11','44.393.704'),
Longitude = c('-22.1','88.235','134.082.527','-0.130781'),
genome_size = c(8361599,2469596,2158157,3207552),
percent_gc = c(34,24,55,44),
percent_psuedo = c(0.0032987747,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
###data for prediction
data.prediction <- tibble(description = c('Liberibacter crescens BT-1','Saprospira grandis Lewin',
'Sinorhizobium meliloti AK83','Bifidobacterium asteroides ATCC 25910'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('39.53','69.372','5.493.12','44.393.704'),
Longitude = c('20.1','-88.235','134.082.527','-0.130781'),
genome_size = c(474832,2469837,2158157,3207552),
percent_gc = c(NA,NA,NA,NA),
percent_psuedo = c(0.0074639239,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
x=model.matrix(percent_gc~.,data.all)
y=data.all$percent_gc
cv.out <- cv.glmnet (x, y, alpha = 1,family = "gaussian")
best.lambda= cv.out$lambda.min
fit <- glmnet(x,y,alpha=1)
I then want to make predictions for the rows where percent_gc is not known.
newX = matrix(data = data.prediction %>% select(-percent_gc))
data.prediction$percent_gc <-
predict(object = fit ,type="response", s=best.lambda, newx=newX)
And this generates the error I mentioned above.
I don't understand which format newX should be in to get rid of this error. Insights would be appreciated.
I could not really figure out how to construct an appropriate matrix, but the glmnetUtils package provides functionality to fit a formula directly on a data frame and predict from it. With this I got it to predict values:
library(glmnetUtils)
fit <- glmnet(percent_gc~.,data.all,alpha=1)
cv.out <- cv.glmnet (percent_gc~.,data.all, alpha = 1,family = "gaussian")
best.lambda= cv.out$lambda.min
predict(object = fit,data.prediction,s=best.lambda)
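For readers who want to build newx by hand instead: calling matrix() on a data frame or tibble produces a list-matrix, not a numeric matrix, which is consistent with the error above. The sketch below uses toy data with made-up column names (y, g, z). It relies on a one-sided formula so that the NA response does not cause rows to be dropped, and it builds the design matrix for training and prediction rows together so that both end up with the same columns. Treat it as one possible approach, not the only one.

```r
# Toy training and prediction data (hypothetical columns)
train <- data.frame(y = c(1, 2, 3, 4),
                    g = c("a", "b", "a", "b"),
                    z = c(0.1, 0.2, 0.3, 0.4))
new <- data.frame(y = NA_real_,
                  g = c("b", "a"),
                  z = c(0.5, 0.6))

# What the question did: matrix() on a data frame gives a list-matrix
bad <- matrix(data = new)
is.list(bad)

# Build one design matrix for all rows, then split it
both <- rbind(train, new)
X <- model.matrix(~ g + z, data = both)  # one-sided formula: y is not used
x_train <- X[seq_len(nrow(train)), , drop = FALSE]
x_new   <- X[-seq_len(nrow(train)), , drop = FALSE]

# x_new is numeric and has the same columns as x_train
is.numeric(x_new)
identical(colnames(x_train), colnames(x_new))
```

x_train would then go to glmnet()/cv.glmnet() and x_new to the newx argument of predict().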

Cannot use a Variable Importance Plot when Using Argument 'Importance = TRUE'

When I try to run the following code:
reg <- randomForest(max_orders ~ ., data = df[-c(1:3)], ntree = 100, importance = T)
varImpPlot(reg, sort = T)
I get the error:
Error in plot.window(xlim = xlim, ylim = ylim, log = "") :
need finite 'xlim' values
But if I run:
reg <- randomForest(max_orders ~ ., data = df[-c(1:3)], ntree = 100)
varImpPlot(reg, sort = T)
Everything's fine and dandy!
I'm legitimately about to lose my sanity. I've made the MSE variable importance plots countless times and I don't know what the issue is here. Here's my original regression data (df[-c(1:3)]):
EDIT: R has officially gone full-blown schizophrenic on me:
> # Test Variables
> reg <- randomForest(max_orders ~ release_age, data = df[-c(1:3)], ntree =
100, importance = T)
> varImpPlot(reg, sort = T)
> # Test Variables
> reg <- randomForest(max_orders ~ release_age, data = df[-c(1:3)], ntree =
100, importance = T)
> varImpPlot(reg, sort = T)
Error in plot.window(xlim = xlim, ylim = ylim, log = "") :
need finite 'xlim' values
HOW DOES R RUN THE EXACT SAME CODE WORD FOR WORD AND HAVE AN ERROR ONE TIME AND NOT ANOTHER?! Well, I guess I fixed the problem: I just kept rerunning the code until a plot finally showed up. I'd still like to know the reason behind this enigma.
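One thing that might explain the run-to-run difference: randomForest is randomised, so without set.seed() the importance values change between runs, and a plot can only be drawn when the values are finite. This is a guess, since the post does not show the importance values. The base R sketch below uses a made-up importance matrix (not output from the real model) to show that a single non-finite value is enough to make plotting fail in plot.window.

```r
# Made-up importance matrix with a non-finite entry (hypothetical values)
imp <- matrix(c(NaN, 12.5), nrow = 1,
              dimnames = list("release_age", c("%IncMSE", "IncNodePurity")))

# Quick check to run before plotting
finite_ok <- all(is.finite(imp))
finite_ok

# Plotting the non-finite column fails when the axis limits are computed
pdf(NULL)  # throwaway graphics device, no file is written
res <- tryCatch({
  plot(imp[, "%IncMSE"], 1)
  "ok"
}, error = function(e) conditionMessage(e))
invisible(dev.off())
res
```

With the real model, inspecting importance(reg) before calling varImpPlot(), and calling set.seed() before randomForest(), would show whether this is what is happening.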
