Why is my glmer model in R taking so long to run?

I had previously been using simple stats in Statistica, but I need R for my master's research. I am trying to run the following code to test for any significant interactions, and it just runs forever. If I simplify the model by taking Month out, it runs, but biologically it makes sense that month is significant, so I would really like to include Month as a factor. Once I start the model, the stop sign in RStudio stays present for hours. What could be the reason for this? As I said, I'm very new to R and it has been really difficult to learn on my own. I am working with presence/absence data (as %), which I combine with cbind as my dependent variable. So far this is what my code looks like:
library(car)
library(languageR)
library(AICcmodavg)
library(lme4)
Scat <- read.csv("Scat2.csv", header=T)
attach(Scat)
names(Scat)
y <- cbind(Present,Absent)
ScatData <- glmer(y ~ Estate * Species * Month * Content * (1|Site) + Min + Max,family=binomial)
summary(ScatData)
Once I get to running the actual model, I don't even get to do the summary because R is not done computing the results of the actual model. I ran the model for approximately 4 hours, and when I clicked on the stop sign, I received this message:
Warning message:
In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf, :
failure to converge in 10000 evaluations
I would really appreciate some input on this matter.

You have a few problems with your model specification. Your model
y ~ Estate * Species * Month * Content * (1|Site) + Min + Max
is asking for all the main effects and every interaction among estate, species, month, content, and site, which is incredibly complex.
Also, you have specified site as a random effect and asked for its interaction with fixed effects. I'm not sure whether that's possible, but it certainly seems wrong. You should decide whether you want site to be a fixed effect or a random effect.
If you post a minimal reproducible example, I can give more specific advice.
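To make the complexity concrete, here is a minimal sketch in base R that counts the fixed-effect coefficients implied by crossing the four factors. The level counts (3 estates, 4 species, 12 months, 5 content categories) are assumptions chosen for illustration, not values from the question. The last formula shows one plausible way to keep Site as a random intercept outside the crossing; which interactions, if any, belong in it is a modelling decision.

```r
# Hypothetical design: the level counts below are assumptions
d <- expand.grid(
  Estate  = paste0("E", 1:3),
  Species = paste0("S", 1:4),
  Month   = month.abb,
  Content = paste0("C", 1:5),
  stringsAsFactors = TRUE
)

# Number of fixed-effect coefficients for three candidate specifications
n_full <- ncol(model.matrix(~ Estate * Species * Month * Content, d))
n_two  <- ncol(model.matrix(~ (Estate + Species + Month + Content)^2, d))
n_main <- ncol(model.matrix(~ Estate + Species + Month + Content, d))
c(full = n_full, two_way = n_two, main = n_main)

# Site kept as a random intercept, separate from the fixed effects.
# Building the formula needs no packages; fitting it would need lme4.
f_suggested <- cbind(Present, Absent) ~ Estate + Species + Month + Content +
  Min + Max + (1 | Site)
# lme4::glmer(f_suggested, data = Scat, family = binomial)  # not run here
```

Under these assumed level counts the full crossing asks for hundreds of coefficients before the random effect is even considered, which is consistent with the optimizer failing to converge.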

Related

A strange case of singular fit in lme4 glmer - simple random structure with variation among groups

Background
I am trying to test for differences in wind speed data among different groups. For the purpose of my question, I am looking only at side wind (wind direction that is 90 deg from the individual), and I only care about the strength of the wind. Thus, I use absolute values. The range of wind speeds is 0.0004-6.8 m/sec, and because I use absolute values, a Gamma distribution describes the data much better than a normal distribution.
My data contain 734 samples from 68 individuals, with each individual having between 1 and 30 repeats. However, even if I reduce my samples to include only individuals with at least 10 repeats (which leaves me with 26 individuals and a total of 466 samples), I still get the problematic error.
The model
The full model is Wind ~ a*b + c*d + (1|individual), but for the purpose of this question, the simple model of Wind ~ 1 + (1|individual) gives the same singularity error, so I do not think that the explanatory variables are the problem.
The complete code line is glmer(Wind ~ 1 + (1|individual), data = X, family = Gamma(log))
The problem and the strange part
When running the model, I get the message boundary (singular) fit: see ?isSingular, although, as you can see, I use a very simple model and random structure. The strange part is that I can avoid this by adding 0.1 to the Wind variable (i.e. glmer(Wind+0.1 ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not give any message). I honestly do not remember why I added 0.1 the first time I did it, but I was surprised to see that it made the message go away.
The question
Is this a problem with lme4? Am I missing something? Any ideas what might cause this, and why adding 0.1 to the variable solves the problem?
Edit following questions
I am not sure what's the best way to add data, so here is a link to a csv file in Google drive
Using glmmTMB does not produce any warnings with the basic formula glmmTMB(Wind ~ 1 + (1|Tag), data = X, family = Gamma(log)), but gives convergence warnings ('non-positive-definite Hessian matrix') with the full model (i.e., Wind ~ a*b + c*d + (1|individual)); these go away if I scale the continuous variables.
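For reference, the scaling step mentioned above might look like the following sketch. The data frame and the predictor names a and c are placeholders that mirror the question's notation; the values are simulated, not the real wind data. Only the continuous predictors are centred and scaled. The response is left alone, because a Gamma model needs strictly positive values.

```r
set.seed(1)

# Simulated stand-in for the real data; column names are placeholders
X <- data.frame(
  Wind = rgamma(20, shape = 2, rate = 1),    # response, left untouched
  a    = rnorm(20, mean = 500, sd = 100),    # predictor on a large scale
  c    = runif(20, min = 0, max = 0.01)      # predictor on a tiny scale
)

# Centre and scale the continuous predictors only
to_scale <- c("a", "c")
X_scaled <- X
X_scaled[to_scale] <- lapply(X[to_scale], function(v) as.numeric(scale(v)))

# Each scaled predictor now has mean 0 and standard deviation 1
sapply(X_scaled[to_scale], mean)
sapply(X_scaled[to_scale], sd)
```

Putting predictors on comparable scales often helps the optimizer, but it also changes the units in which the coefficients are reported.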

CPU requirement to run logistic regression with pairwise interactions

I am trying to fit a logistic regression model with pairwise interactions for 7 variables; however, I have let the code run for as long as 12 hours with no results. My dataset is not terribly large: about 3000 rows. One of my variables has 82 degrees of freedom, and I am wondering if that is the problem. I have no problem running the code with main effects only, but I expect interactions between my variables, so I would like pairwise interactions included. I am using the glmulti package to fit the model, and I included the method = "g" and conseq = 5 arguments in an attempt to make the code run faster, but I still can't get results even after 12 hours. Is there anything else I can do to speed it up? Should I use different code or a different package, or is a basic laptop just not enough to run it?
This is the code I used:
detectmodel <- glmulti::glmulti(outcome ~ bird + year + season + sex + numobs + obsname + month, data = detect, level = 2, fitfunction = glm, crit = "aicc", family = binomial, confsetsize = 10, method = "g")
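A rough way to see the size of the search is to count the candidate terms. This sketch assumes the 7 distinct predictors named in the call and ignores any marginality constraints, so the model count is only an upper bound, not what glmulti actually visits. The 3-df partner factor in the last lines is hypothetical.

```r
predictors <- c("bird", "year", "season", "sex", "numobs", "obsname", "month")

n_main  <- length(predictors)      # main effects
n_pairs <- choose(n_main, 2)       # pairwise interactions
n_terms <- n_main + n_pairs        # candidate terms in total

# Upper bound on the number of candidate models (each term in or out)
n_models_upper <- 2^n_terms
c(terms = n_terms, models = n_models_upper)

# An interaction between an 82-df factor and a hypothetical 3-df factor
# adds 82 * 3 columns to the design matrix of every model that includes it
cols_per_interaction <- 82 * 3
cols_per_interaction
```

With about 3000 rows, interactions involving the 82-df variable add a large number of columns to each fit, so individual glm() calls are likely to be slow as well as numerous.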

Count-process datasets for Non-proportional Hazard (Cox) models with interaction variables

I am trying to run a non-proportional Cox regression model featuring an interaction-with-time variable, as described in Chapter 15 (section 15.3) of Applied Longitudinal Data Analysis by Singer and Willett. However, I cannot seem to get answers that agree with the book.
The data used in this book and source code are supplied at this fantastic website. Unfortunately, no R code is supplied for the final chapter, and the R dataset supplied for the example discussed in the text is incomplete and gives incorrect answers even for the simplest model (which I do know how to run). Instead, to obtain the complete dataset for this example, one must click the 'Download' link in the 'SAS' column (which has the correct dataset) and then, after installing the haven package (which reads foreign data formats), read in the dataset via:
los <- haven::read_sas("alda/lengthofstay.sas7bdat")
This dataset indicates participants' (variable ID) length of stay (variable DAYS) in inpatient treatment in a hospital. The censoring variable is CENSOR. The researchers hypothesised that two different types of treatment (binary variable TREAT) would predict different hazards of checking out of treatment. In addition, they anticipated that the between-group difference in hazard would not be constant over time, which requires an interaction term. I can get the simple main-effect model to work, returning the same hazard coefficients reported in the book (which is how I eventually found out that the .csv file supplied with the R code was incomplete).
summary(modA <- coxph(Surv(DAYS,1-CENSOR) ~ TREAT, data = los))
        coef exp(coef) se(coef)     z Pr(>|z|)
TREAT 0.1457    1.1568   0.1541 0.945    0.345
I tried to follow the procedure laid out here, and here, and the sources listed therein (e.g. Therneau vignette on time-varying covariates in the survival package), and, of course, when I am copy-pasting someone else's code and running that it all works fine. But I am trying to do this for myself from scratch with a dataset whose results I can compare against mine. And I just can't make it work.
First I created an EVENT variable:
los$EVENT <- 1 - los$CENSOR
There is a duplicate ID number in the dataset that causes issues, so we have to change it to a new ID number:
los$ID[which(duplicated(los$ID))] <- 842
Now, based on what I read here and here, the data frame needs to be split so that, for every participant, there is one row indicating the EVENT status at every point prior to their event (or censoring) time when any other participant experienced an event. Therefore we need to create a vector of all the unique event times, then split the dataset on those event times:
cutPoints <- sort(unique(los$DAYS[los$EVENT == 1]))
# now split the dataset
longLOS <- survSplit(Surv(DAYS,EVENT)~ ., data = los, cut = cutPoints)
# and, for clarity, rename the interval upper-bound column (formerly "DAYS")
names(longLOS)[names(longLOS) == "DAYS"] <- "tstop"
When I looked at this dataset it appeared to be what I was after, with (1) as many rows for each participant as there are intervals prior to their event time when anyone else in the dataset experienced an event, (2) two columns indicating the lower and upper bounds of each interval, and (3) an event column with a 0 in every row where the respondent did not experience the event, and a 1 in the final row only if they experienced the event (the final row stays 0 if they were censored).
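The structure described above is easier to see on a tiny example. This is a sketch on made-up toy data, not the Singer and Willett dataset, using survSplit() from the survival package in the same way as the code above.

```r
library(survival)

# Toy data (hypothetical): three subjects, the third is censored
toy <- data.frame(
  ID    = 1:3,
  DAYS  = c(2, 5, 7),
  EVENT = c(1, 1, 0),
  TREAT = c(0, 1, 1)
)

# Split every subject's follow-up at each observed event time
cuts <- sort(unique(toy$DAYS[toy$EVENT == 1]))
long <- survSplit(Surv(DAYS, EVENT) ~ ., data = toy, cut = cuts)

# One row per subject per interval; tstart and DAYS bound the interval
long <- long[order(long$ID, long$tstart), ]
long

# Interaction with time, shifted so that TREAT is the effect on day 1
long$TREATINT <- long$TREAT * (long$DAYS - 1)
```

Subject 1 keeps a single row, subject 2 gets two rows, and the censored subject 3 gets three rows whose EVENT values are all 0.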
Next I created the interaction-with-time variable, subtracting 1 from the 'interval upper bound' column so that main effect of TREAT represents the treatment effect on the first day of hospitalisation.
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
And ran the model
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
But it doesn't work! I got the (fairly unhelpful) error message
Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
routine failed due to numeric overflow.This should never happen. Please contact the author.
What am I doing wrong? I have been slowly working through Singer and Willett for almost three years (I started while still a grad student), and now the final chapter is proving to be by far my greatest challenge. I have thirty pages to go; any help would be incredibly appreciated.
I figured out what I was doing wrong: a silly error when I created the interaction variable TREATINT. Instead of
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
it should have been
longLOS$TREATINT <- longLOS$TREAT*(longLOS$tstop - 1)
Now the model runs:
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
Not only does it run, it yields coefficients that match those reported in the Singer and Willett book.
              coef exp(coef)  se(coef)      z Pr(>|z|)
TREAT     0.706411  2.026705  0.292404  2.416   0.0157
TREATINT -0.020833  0.979383  0.009207 -2.263   0.0237
Given how dumb my mistake was, I was tempted to delete this whole post, but I'll leave it up for others like me who want to know how to fit interaction-with-time Cox models in R.
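For readers who would rather not build the long dataset by hand, coxph() also accepts a tt argument that defines the time transform directly. The following is only a sketch on simulated data (the variable names and the simulation settings are assumptions, and it is not the Singer and Willett dataset), so the coefficients it prints have no connection to the book.

```r
library(survival)

set.seed(7)
n <- 80

# Simulated data: continuous stay lengths, roughly 20% censored
sim <- data.frame(TREAT = rep(0:1, each = n / 2))
sim$DAYS  <- rexp(n, rate = 0.1) + 1
sim$EVENT <- rbinom(n, size = 1, prob = 0.8)

# tt(TREAT) is evaluated at each event time t as TREAT * (t - 1),
# the same interaction-with-time term as in the split dataset
fit_tt <- coxph(
  Surv(DAYS, EVENT) ~ TREAT + tt(TREAT),
  data = sim,
  tt   = function(x, t, ...) x * (t - 1)
)
coef(fit_tt)
```

The first coefficient plays the role of TREAT and the second the role of TREATINT in the model above.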

Fixing a coefficient on variable in MNL [duplicate]

In R, how can I set weights for particular variables, and not observations, in the lm() function?
Context is as follows. I'm trying to build a personal ranking system for particular products, say, phones. I can build a linear model with price as the dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict a phone's real cost (as opposed to its declared price), thus finding the best price/goodness ratio. This is what I have already done.
Now I want to "highlight" some features that are important only to me. For example, I may need a phone with large memory, so I want to give memory a higher weight so that the linear model is optimized for the memory variable.
The lm() function in R has a weights parameter, but these are weights for observations and not variables (correct me if this is wrong). I also tried to play around with the formula, but got only interpreter errors. Is there a way to incorporate weights for variables in lm()?
Of course, lm() is not the only option. If you know how to do it with other similar solutions (e.g. glm()), that is fine too.
UPD. After a few comments I understood that the way I was thinking about the problem is wrong. A linear model obtained by a call to lm() gives the optimal coefficients for the training examples, and there's no way (and no need) to change the weights of variables; sorry for the confusion. What I'm actually looking for is a way to change coefficients in an existing linear model to manually make some parameters more important than others. Continuing the previous example, let's say we've got the following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes the best possible linear model for the dependence between price and phone parameters. However, now I want to manually change the number 30 in front of the memory variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula no longer reflects the optimal relationship between price and phone parameters. Also, the dependent variable no longer shows the actual price, just some value of goodness, taking into account that memory is twice as important to me as to the average person (based on the coefficients from the first formula). But this value of goodness (or, more precisely, the ratio goodness/price) is just what I need: with it I can find the best (in my opinion) phone at the best price.
Hope all of this makes sense. Now I have one (probably very simple) question. How can I manually set coefficients in an existing linear model obtained with lm()? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work, of course, but you should get the idea. Note: it is obviously possible to just double the values in the memory column of the data frame, but I'm looking for a more elegant solution that affects the model, not the data.
The following code is a bit complicated because lm() minimizes the residual sum of squares, and with a fixed, non-optimal coefficient the sum is no longer minimal. Fixing one coefficient therefore goes against what lm() is trying to do, and the only way to keep the other coefficients at their original values is to fix all of them too.
To do that, we first have to know the coefficients of the unrestricted model. All the adjustments have to be made by changing the formula of your model. For example, suppose we have
price ~ memory + screen_size, with, of course, a hidden intercept. Neither changing the data directly nor using I(c*memory) is a good idea: I(c*memory) is just a temporary change of the data, and changing only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size to price ~ offset(c1*memory) + offset(c2*screen_size). But we haven't modified the intercept, which would now be re-estimated to minimize the residual sum of squares and could differ from the one in the original model. The final step is to remove the intercept and add a new, fake variable that holds the constant c0 and has the same number of observations as the other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + rep(c0, length(memory)) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len){
  el <- paste0("offset(", weights[-1], "*",
               unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
  el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
  as.formula(paste(as.character(frml)[2], "~",
                   paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
                 y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Prime coefficients and any modifications. Note that "weights" contains
# intercept value too
weights <- mod$coef
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, you are probably going to use mod2 only for forecasting (I don't know where else it could be used now), so that can be done in a simpler way, without setCoeffs:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(10, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts
rowSums(t(t(mat) * weights))
It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim function, or look into linear or quadratic programming (the linprog and quadprog packages).
If you insist on using modeling tools like lm, then use an offset() term in the formula to specify your own multiplier rather than estimating one.
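A minimal sketch of that offset() idea, on simulated data with made-up variable names: the coefficient of x1 is held at a chosen value while lm() still estimates the intercept and the coefficient of x2. Unlike the setCoeffs approach above, the remaining coefficients are re-estimated here rather than frozen.

```r
set.seed(1)

# Simulated data (hypothetical); the true coefficient of x1 is 2
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 1 + 2 * df$x1 + 0.5 * df$x2 + rnorm(50)

# Hold the coefficient of x1 at a value of our own choosing
fixed_b1 <- 3
mod_fixed <- lm(y ~ x2 + offset(fixed_b1 * x1), data = df)
coef(mod_fixed)   # only the intercept and x2 are estimated

# Same fit written out by hand: move the fixed part to the left-hand side
mod_check <- lm(I(y - fixed_b1 * x1) ~ x2, data = df)
all.equal(coef(mod_fixed), coef(mod_check))
```

Calling predict(mod_fixed, newdata = ...) adds the offset back in, so the predictions use the fixed multiplier for x1.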

"system is computationally singular" error when using `gmm` (GMM Estimation)

Trying to use the GMM package in R to estimate the parameters (a-f) of a linear model:
LEV1 = a*Macro + b*Firm + c*Sector + d*qtr + e*fqtr + f*tax
Macro, Firm and Sector are matrices with n number of rows. qtr, fqtr and tax are vectors with n members.
I have one large data frame called unconstrd that has all of the data. First, I break that data into separate matrices:
v_LEV1 <- as.matrix(unconstrd$LEV1)
Macro <- as.matrix(cbind(unconstrd$Agg_Corp_Prof,unconstrd$R1000_TR, unconstrd$CP_Spread))
Firm <- as.matrix(cbind(unconstrd$ppe_ratio, unconstrd$op_inc_ratio_avg, unconstrd$selling_exp_avg,
                        unconstrd$tax_avg, unconstrd$Mark_to_Bk, unconstrd$mc_ratio))
Sector <- as.matrix(cbind(unconstrd$Sect_Flag03,
                          unconstrd$Sect_Flag04, unconstrd$Sect_Flag05, unconstrd$Sect_Flag06,
                          unconstrd$Sect_Flag07, unconstrd$Sect_Flag08, unconstrd$Sect_Flag12,
                          unconstrd$Sect_Flag13, unconstrd$Sect_Flag14, unconstrd$Sect_Flag15,
                          unconstrd$Sect_Flag17))
v_qtr <- as.matrix(unconstrd$qtr)
v_fqtr <- as.matrix(unconstrd$fqtr)
v_tax <- as.matrix(unconstrd$tax_dummy)
Then, I bind the data together for the x variable called by gmm:
h=cbind(Macro,Firm,Sector,v_qtr, v_fqtr, v_tax)
Then, I invoke gmm:
gmm1 <- gmm(v_LEV1 ~ Macro + Firm + Sector + v_qtr + v_fqtr + v_tax, x=h)
I get the message:
Error in solve.default(crossprod(hm, xm), crossprod(hm, ym)) :
system is computationally singular: reciprocal condition number = 1.10214e-18
I apologize in advance and admit that I'm a neophyte at R and I've never used GMM before. The GMM function is so general, I've looked at the examples available on the web but nothing seems specific enough to help my situation.
You are trying to fit a model on a matrix that does not have full rank. Try excluding some of the variables and/or look for errors. We cannot say much more without your data, or at least a sample.
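One quick way to check this is to compare the rank of the regressor or instrument matrix with its number of columns. The sketch below uses a small simulated matrix with one deliberately redundant column. The names are made up, and with real data the matrix to test would be the one passed to gmm (h in the question).

```r
set.seed(42)
n <- 100

# Simulated matrix: column d is an exact linear combination of a and b
X <- cbind(a = rnorm(n), b = rnorm(n), c = rnorm(n))
X <- cbind(X, d = X[, "a"] + X[, "b"])

# A rank below the number of columns means the matrix is rank deficient
q <- qr(X)
c(rank = q$rank, columns = ncol(X))

# Columns that qr() pivoted to the end are the ones it found redundant
dependent <- colnames(X)[q$pivot[-seq_len(q$rank)]]
dependent
```

Dropping the flagged columns, or one dummy from a complete set of category flags when an intercept is present, is a common remedy for this kind of error.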
That's more of a modelling question for Crossvalidated.com than a programming question for StackOverflow.
I was pretty certain there was no linear dependency between my variables, but I went through the exercise of adding one variable at a time to see what was causing the errors. In the end, I asked a colleague to run GMM in SAS and it ran perfectly, with no error messages. I'm not sure what the problem with the R version is, but at this point I have a solution and have given up on GMM in R.
Thanks to everyone who tried to help.
