Forward procedure with BIC - r

I'm trying to select variables for a linear model with a forward stepwise algorithm and the BIC criterion. As the help file indicates, and as I have always done, I wrote the following:
model.forward<-lm(y~1,data=donnees)
model.forward.BIC<-step(model.forward,direction="forward", k=log(n), scope=list(lower = ~1, upper = ~x1+x2+x3), data=donnees)
with k=log(n) indicating I'm using BIC. But R returns:
Error in extractAIC.lm(fit, scale, k = k, ...) : object 'n' not found
I never really asked myself the question before, but I thought n was supposed to be defined inside step() (as the number of variables in the model at each iteration). Anyway, this issue never happened to me before! Restarting R doesn't change anything, and I admit I have no idea what could cause this error.
Here is some code to test:
y<-runif(20,0,10)
x1<-runif(20,0,1)
x2<-y+runif(20,0,5)
x3<-runif(20,0,1)-runif(20,0,1)*y
donnees<-data.frame(x1,x2,x3,y)
Any ideas?

step(model.forward,direction="forward",
k=log(nrow(donnees)), scope=list(lower = ~1, upper = ~x1+x2+x3),
data=donnees)
or more generally ...
... k=log(nobs(model.forward)) ...
(for example, if there are NA values in your data, then nobs(model.forward) will be different from nrow(donnees). On the other hand, if you have NA values in your predictors, you're going to run into trouble when doing model selection anyway.)
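Why this fixes it: step() does not define n for you. The expression k=log(n) is evaluated like any other argument, so n has to exist in your workspace, and for BIC it should be the number of observations rather than the number of variables. A minimal sketch using the test data from the question; the seed, the name n_obs and trace = 0 are additions made here only to keep the run reproducible and quiet:

```r
set.seed(1)
y  <- runif(20, 0, 10)
x1 <- runif(20, 0, 1)
x2 <- y + runif(20, 0, 5)
x3 <- runif(20, 0, 1) - runif(20, 0, 1) * y
donnees <- data.frame(x1, x2, x3, y)

# Intercept-only starting model
model.forward <- lm(y ~ 1, data = donnees)

# Number of observations actually used in the fit (rows with NA are dropped)
n_obs <- nobs(model.forward)

# k = log(n_obs) turns the AIC penalty into the BIC penalty
model.forward.BIC <- step(model.forward, direction = "forward",
                          k = log(n_obs),
                          scope = list(lower = ~1, upper = ~x1 + x2 + x3),
                          trace = 0)
formula(model.forward.BIC)
```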

Related

MLE using nlminb in R - understand/debug certain errors

This is my first question here, so I will try to make it as well written as possible. Please be forbearing should I make a silly mistake.
Briefly, I am trying to do a maximum likelihood estimation where I need to estimate 5 parameters. The general form of the problem I want to solve is as follows: A weighted average of three copulas, each with one parameter to be estimated, where the weights are nonnegative and sum to 1 and also need to be estimated.
There are packages in R for doing MLE on single copulas or on a weighted average of copulas with fixed weights. However, to the best of my knowledge, no packages exist to directly solve the problem I outlined above. Therefore I am trying to code the problem myself. There is one particular type of error I am having trouble tracing to its source. Below I have tried to give a minimal reproducible example where only one parameter needs to be estimated.
library(copula)
set.seed(150)
x <- rCopula(100, claytonCopula(250))
# Copula density
clayton_density <- function(x, theta){
  dCopula(x, claytonCopula(theta))
}
# Negative log-likelihood function
nll.clayton <- function(theta){
  theta_trans <- -1 + exp(theta) # admissible theta values for Clayton copula
  nll <- -sum(log(clayton_density(x, theta_trans)))
  return(nll)
}
# Initial guess for optimization
guess <- function(x){
  init <- rep(NA, 1)
  tau.n <- cor(x[,1], x[,2], method = "kendall")
  # Guess using method of moments
  itau <- iTau(claytonCopula(), tau = tau.n)
  # In case itau is negative, we need a conditional statement
  # Use log because it is (almost) inverse of theta transformation above
  if (itau <= 0) {
    init[1] <- log(0.1) # Ensures positive initial guess
  } else {
    init[1] <- log(itau)
  }
  init # Return the initial guess explicitly
}
estimate <- nlminb(guess(x), nll.clayton)
(parameter <- -1 + exp(estimate$par)) # Retrieve estimated parameter
fitCopula(claytonCopula(), x) # Compare with fitCopula function
This works great when simulating data with small values of the copula parameter, and gives almost exactly the same answer as fitCopula() every time.
For large values of the copula parameter, such as 250, the following error shows up when I run the line with nlminb(): "Error in .local(u, copula, log, ...) : parameter is NA
Called from: .local(u, copula, log, ...)
Error during wrapup: unimplemented type (29) in 'eval'"
When I run fitCopula(), the optimization is finished, but this message pops up: "Warning message:
In dlogcdtheta(copula, u) :
dlogcdtheta() returned NaN in column(s) 1 for this explicit copula; falling back to numeric derivative for those columns"
I have been able to find out using debug() that somewhere in the optimization process of nlminb, the parameter of interest is assigned the value NaN, which then yields this error when dCopula() is called. However, I do not know at which iteration it happens, and what nlminb() is doing when it happens. I suspect that perhaps at some iteration, the objective function is evaluated at Inf/-Inf, but I do not know what nlminb() does next. Also, something similar seems to happen with fitCopula(), but the optimization is still carried out to the end, only with the abovementioned warning.
I would really appreciate any help in understanding what is going on, how I might debug it myself and/or how I can deal with the problem. As might be evident from the question, I do not have a strong background in coding. Thank you so much in advance to anyone that takes the time to consider this problem.
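One way to find out where the optimizer goes wrong is to wrap the objective so that every parameter value is recorded before it is evaluated. The sketch below is only an illustration: trace_objective is a hypothetical helper (not part of any package), and toy_nll stands in for nll.clayton so that the example runs without the copula package.

```r
# Hypothetical helper: returns a wrapped objective plus a log of every call
trace_objective <- function(fn) {
  history <- new.env()
  history$par <- list()
  history$value <- numeric(0)
  wrapped <- function(par, ...) {
    # Record the parameter first, so it is logged even if fn() then fails
    history$par[[length(history$par) + 1L]] <- par
    value <- fn(par, ...)
    history$value <- c(history$value, value)
    value
  }
  list(fn = wrapped, history = history)
}

# Toy objective standing in for nll.clayton (assumption for illustration)
toy_nll <- function(theta) (theta - 2)^2

traced <- trace_objective(toy_nll)
fit <- nlminb(0, traced$fn)

# Last parameter value that nlminb() tried
last_par <- traced$history$par[[length(traced$history$par)]]
# Evaluations that returned a non-finite value, if any
bad_calls <- which(!is.finite(traced$history$value))
```

If the objective throws an error, the last entry of traced$history$par is the parameter value that triggered it, and bad_calls points at evaluations that returned Inf or NaN before that.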
Update:
When I run dCopula(x, claytonCopula(-1+exp(guess(x)))) or equivalently clayton_density(x, -1+exp(guess(x))), it becomes apparent that the density evaluates to 0 at several data points. Unfortunately, creating pseudo-observations by using x <- pobs(x) does not solve the problem, which can be seen by repeating dCopula(x, claytonCopula(-1+exp(guess(x)))). The result is that when applying the logarithm function, we get several -Inf evaluations, which of course implies that the whole negative log-likelihood function evaluates to Inf, as can be seen by running nll.clayton(guess(x)). Hence, in addition to the above queries, any tips on handling log(0) when doing MLE numerically are welcome and appreciated.
Second update
Editing the second line in nll.clayton as follows seems to work okay:
nll <- -sum(log(clayton_density(x, theta_trans) + 1e-8))
However, I do not know if this is a "good" way to circumvent the problem, in the sense that it does not introduce potential for large errors (though it would surprise me if it did).
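For what it is worth, adding 1e-8 caps every term at log(1e-8), roughly -18.4, so points where the true density is tiny contribute far less than they should. A common alternative is to compute the log-density directly instead of taking log() of a density that has already underflowed to 0. The base R sketch below shows the effect with dnorm(); the error message above suggests dCopula() also takes a log argument, but whether log = TRUE stays finite for this particular Clayton fit is an assumption to check in ?dCopula.

```r
x_far <- 40

# The density underflows to exactly 0 in double precision
dens <- dnorm(x_far)
log_naive <- log(dens)                    # -Inf

# Asking for the log-density directly avoids the underflow
log_direct <- dnorm(x_far, log = TRUE)    # about -800.92

# The "+ 1e-8" patch replaces the true value by roughly log(1e-8)
log_patched <- log(dens + 1e-8)           # about -18.42
```

The gap between log_direct and log_patched is the error that the patch introduces at a single data point.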

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but I am getting an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there are no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example is:
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL, rnd = list(PTNO=~1),
family = poisson(link = log), data = stackdata, lambda=100,
control = list(print.iter=TRUE,start=c(1,rep(0,29)),q.start=0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on @Kristofersen's comment above: it is indeed the start vector that messes up your analysis.
If I run
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL,
rnd = list(PTNO=~1),
family = binomial(),
data = stackdata,
lambda=100,
control = list(print.iter=TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You have copied the example with poisson regression and you need to tweak the code to your situation. I have no idea about whether the output makes sense.
Quick note: I ran with the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks then poisson may be reasonable (and it also converges), but you need to recode your outcome as the two groups are defined as 1 and 2 and that will certainly mess up the poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.
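A small caveat on that recoding step: the warning in the question ('-' not meaningful for factors) suggests Group may be stored as a factor in some sessions, in which case subtracting 1 directly yields NA. The sketch below uses made-up vectors standing in for stackdata$Group (an assumption, since the real data are not shown here) and handles both cases.

```r
# Hypothetical stand-ins for stackdata$Group, coded 1/2
group_numeric <- c(1, 2, 2, 1, 2)
group_factor  <- factor(group_numeric)

# Numeric outcome: plain subtraction is enough
recoded_numeric <- group_numeric - 1

# Factor outcome: convert the labels back to numbers first
recoded_factor <- as.numeric(as.character(group_factor)) - 1
```

Going through as.character() uses the level labels rather than the internal level codes, which matters when the labels are not simply 1, 2, ...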

Understanding errors from ordinal logistic regression

I am using MASS::polr to run ordinal logistic regressions, but I am getting a lot of errors that I am hoping people can enlighten me about.
First, if I run this, the function fails to find starting values:
MASS::polr(as.ordered(cyl)~mpg+gear,mtcars)
So if I specify starting values, I get an error from optim stating 'non-finite value supplied by optim':
MASS::polr(as.ordered(cyl)~mpg+gear,mtcars,start=c(1,1,1,1))
After reading some R-help, and previous stack overflow questions about this, the response is usually that there is something wrong with the data i.e. the response variable has a category with relatively few values, but in this instance I don't see anything wrong with mtcars.
Any guidance on how to diagnose, and deal with issues in data that will impact MASS::polr would be appreciated.
Regards
Going on a scavenger hunt through ?polr, the starting values are to be specified "in the format c(coefficients, zeta)". Looking lower, we see that zeta is "the intercepts for the class boundaries.". In the Details section, we can see that the zeta values must be ordered:
zeta_0 = -Inf < zeta_1 < ... < zeta_K = Inf
([sic], as that presumably should be a < Inf at the end.)
So you need the second zeta value to be greater than the first. This works, for example:
MASS::polr(as.ordered(cyl) ~ mpg + gear, mtcars, start = c(1, 1, 1, 2))
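To avoid counting by hand, the start vector can be built from the model itself: one value per coefficient, followed by one strictly increasing cut point per class boundary. A minimal sketch for the mtcars example; the names n_coef, n_zeta and start_vals are introduced here for illustration only.

```r
library(MASS)

# Two predictors (mpg, gear) -> two coefficients
n_coef <- 2
# Three ordered levels of cyl -> two cut points (zeta)
n_zeta <- nlevels(as.ordered(mtcars$cyl)) - 1

# Coefficients first, then strictly increasing zeta values: c(1, 1, 1, 2)
start_vals <- c(rep(1, n_coef), seq_len(n_zeta))

fit <- polr(as.ordered(cyl) ~ mpg + gear, data = mtcars, start = start_vals)
fit$zeta
```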

Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10000) gives a rather complex model with 5 interaction terms: Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5. The glm() function could fit this model with 10000 rows of data, but not with the whole dataset (320,000).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5,
data=sqlQuery(myconn,train_dat),family=binomial(link="logit"),
chunksize=1000, maxit=10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms, but it was not able to fit the same model even with a small dataset (10000 rows).
Has anyone run into this problem before? Any other approach to run a complex logistic model with big data?
I've run into this problem many times, and it was always caused by the fact that the chunks processed by bigglm did not contain all the levels of a categorical (factor) variable.
bigglm crunches the data in chunks, and the default chunk size is 5000. If you have, say, 5 levels in your categorical variable, e.g. (a,b,c,d,e), and your first chunk (rows 1:5000) contains only (a,b,c,d) but no "e", you will get this error.
What you can do is increase the size of the "chunksize" argument and/or cleverly reorder your dataframe so that each chunk contains ALL the levels.
hope this helps (at least somebody)
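A quick way to check this before fitting is to split the factor into the same chunks and test each one. The sketch below is base R only; chunks_cover_levels is a hypothetical helper written for this illustration, and the two toy factors stand in for a real column.

```r
# Hypothetical helper: TRUE for each chunk that contains every level of f
chunks_cover_levels <- function(f, chunksize) {
  chunk_id <- ceiling(seq_along(f) / chunksize)
  vapply(split(f, chunk_id),
         function(chunk) all(levels(f) %in% chunk),
         logical(1))
}

# Sorted data: the first chunk only sees "a", the second only "b"
f_sorted <- factor(rep(c("a", "b"), each = 5))
cover_sorted <- chunks_cover_levels(f_sorted, chunksize = 5)

# Interleaved data: both chunks see both levels
f_mixed <- factor(rep(c("a", "b"), times = 5))
cover_mixed <- chunks_cover_levels(f_mixed, chunksize = 5)
```

On real data, shuffling the rows once (for example with df[sample(nrow(df)), ]) usually makes every reasonably large chunk cover all levels, though rare levels can still be missed.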
Ok so we were able to find the cause for this problem:
For one category in one of the interaction terms, there are no observations. The "glm" function was able to run and return "NA" as the estimated coefficient, but "bigglm" doesn't like that. "bigglm" was able to fit the model once I dropped this interaction term.
I'll do more research on how to deal with this kind of situation.
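Empty cells like this can be spotted before fitting by cross-tabulating the two factors in the interaction. The sketch below uses a small made-up data set (the variables a, b and y are hypothetical) in which the combination (r, v) never occurs, and shows that glm() reports NA for the corresponding coefficient.

```r
# Hypothetical data: every combination of a and b occurs except (r, v)
cells <- data.frame(a = c("p", "p", "q", "q", "r"),
                    b = c("u", "v", "u", "v", "u"))
d <- cells[rep(seq_len(nrow(cells)), each = 4), ]
d$a <- factor(d$a)
d$b <- factor(d$b)
d$y <- c(0, 1, 1, 0,  1, 1, 0, 1,  0, 0, 1, 0,  1, 0, 1, 1,  0, 1, 0, 0)

# A zero in the cross-tabulation marks an interaction cell with no data
tab <- table(d$a, d$b)
empty_cells <- which(tab == 0, arr.ind = TRUE)

# glm() still fits, but the coefficient for the empty cell is NA
fit <- glm(y ~ a * b, family = binomial, data = d)
na_coefs <- names(coef(fit))[is.na(coef(fit))]
```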
I have met this error before, though it was from randomForest rather than biglm. The reason could be that the function cannot handle character variables, so you need to convert characters to factors. Hope this helps.

randomForest() machine learning in R

I am exploring the function randomForest() in R, and several articles I found all suggest using logic similar to the code below, where the response variable is column 30 and the independent variables include everything else except column 30:
dat.rf <- randomForest(dat[,-30],
dat[,30],
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
When I try this, I got the following error messages:
Error in randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
NA not permitted in predictors
In addition: Warning message:
In randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
The response has five or fewer unique values. Are you sure you want to do regression?
However, I was able to get it to work when I listed the independent variables one by one while keeping all the other parameters the same.
dat.rf <- randomForest(as.factor(Y) ~X1+ X2+ X3+ X4+ X5+ X6+ X7+ X8+ X9+ X10+......,
data=dat,
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
Could someone help me debug the simpler command, where I don't have to list each predictor one by one?
The error message gives you a clue to two problems:
First, you need to remove any row that has an NA anywhere. Removing NAs should be easy enough, and I'll leave you that one as an exercise.
Second, it looks like you need to do classification (which predicts a response that takes only one of a few discrete levels) rather than regression (which predicts a continuous response). If the response is continuous, randomForest() will automatically apply regression.
So, how do you force randomForest() to use classification? As you noticed in your first try, randomForest() accepts the predictors and the response as separate arguments, not only as a formula. To force classification, make sure that the value you are trying to predict (the response, dat[,30]) is a factor, and name the x and y arguments explicitly. This is easy to do:
randomForest(x = dat[,-30],
y = factor(dat[,30]),
...)
This way your output can only take one of the levels given in y.
This is all buried in the description of the arguments x and y: see ?randomForest.
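Putting the two fixes together might look like the sketch below. It rests on some assumptions: dat here is a small made-up data frame with the response in the last column (not the asker's 30-column data), and the call to randomForest() is guarded so the preparation steps run even if the package is not installed. As far as I can tell, na.action is only honoured by the formula interface, which would explain why na.omit had no effect in the x/y call.

```r
# Hypothetical data standing in for dat; the response is the last column
dat <- data.frame(
  X1 = c(1.2, NA, 3.1, 4.8, 5.0, 2.2, 0.4, 3.9, 2.7, 1.1, 4.2, 3.3),
  X2 = c(0.5, 0.7, NA, 0.1, 0.9, 0.3, 0.8, 0.2, 0.6, 0.4, 0.95, 0.15),
  Y  = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0)
)
resp_col <- ncol(dat)

# Drop every row with an NA in any column
dat_complete <- dat[complete.cases(dat), ]

# Predictors as a data frame, response as a factor to force classification
x <- dat_complete[, -resp_col, drop = FALSE]
y <- factor(dat_complete[, resp_col])

if (requireNamespace("randomForest", quietly = TRUE)) {
  dat.rf <- randomForest::randomForest(x = x, y = y, importance = TRUE)
}
```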

Resources