adding interaction to GEE - model matrix is rank deficient - r

I am trying to run a GEE model, using geepack. I have done this successfully, using the below call.
Call:
geeglm(formula = pdc1 ~ country + post + time_post +
TIME + age + sex + country * time_post + country * post, family = gaussian("identity"), data = lipid_data,
id = id, waves = ID, corstr = "ar1", std.err = "san.se").
where:
pdc1=numeric
country=factor
post=factor
time_post=numeric
TIME=numeric
I'm trying to run the exact same model on different data, which are in the exact same format as above. I can run the model with main effects, but not with the interactions. This is the error I get:
Error in geeglm(pdc1 ~ STATE + post + time_post + TIME + STATE * post, :
Model matrix is rank deficient; geeglm can not proceed
I have tried recoding STATE (and post) as numeric variables, but this did not help. I don't understand what's going on: the variables hold the same kind of data as in the first model and are coded the same way. Does anyone know what could be going on here?

I recently solved a similar problem. It came down to one of the covariates being constant across the dataset. For example, if STATE has the same value for every observation, that could cause the "rank deficient" error. This might also explain why the error is specific to your new dataset.
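One way to check for this before calling geeglm is to build the design matrix yourself and compare its rank with its number of columns. The sketch below uses made-up data: the `toy` data frame and the `check_rank` helper are hypothetical and not from the question. In it, one STATE has no post-period rows, so the main-effects model is fine but the STATE:post interaction column is all zeros.

```r
# Hypothetical toy data: state B has no observations with post == 1
toy <- data.frame(
  STATE     = factor(rep(c("A", "B"), each = 4)),
  post      = factor(c(0, 0, 1, 1, 0, 0, 0, 0)),
  time_post = c(0, 0, 1, 2, 0, 0, 0, 0),
  TIME      = rep(1:4, times = 2)
)

# Hypothetical helper: rank of the design matrix and any redundant columns
check_rank <- function(formula, data) {
  X   <- model.matrix(formula, data)
  qrX <- qr(X)
  list(
    n_columns = ncol(X),
    rank      = qrX$rank,
    aliased   = colnames(X)[qrX$pivot[seq_len(ncol(X)) > qrX$rank]]
  )
}

main_only        <- check_rank(~ STATE + post + time_post + TIME, toy)
with_interaction <- check_rank(~ STATE + post + time_post + TIME + STATE * post, toy)

# The cross-tabulation shows the empty cell behind the redundant column
table(toy$STATE, toy$post)
with_interaction$aliased
```

If `rank` is smaller than `n_columns`, the names in `aliased` are the columns to look at; a cross-tabulation of the two factors usually shows which combination is missing.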

I am not sure whether you have run into the same problem I did, but I got the same error when fitting a geeglm model in R.
My model was something like:
geeglm(outCC$counts~outCC$partage:as.factor(outCC$contage)-1,
family=poisson(link="log"),id=outCC$partid,data=outCC,subset=sel,weights=outCC$partwe)
This model runs without any problem.
However, when I tried to add another binary covariate, setting (Urban/Rural), to the model:
geeglm(outCC$counts~outCC$partage:as.factor(outCC$contage)+as.factor(setting)-1,
family=poisson(link="log"),id=outCC$partid,data=outCC,subset=sel,weights=outCC$partwe)
R gave me the error:
Model matrix is rank deficient; geeglm can not proceed
Then I decided to create a new binary variable, setting_new (0/1), and put this new variable in the model:
geeglm(outCC$counts~outCC$partage:as.factor(outCC$contage)+setting_new-1,
family=poisson(link="log"),id=outCC$partid,data=outCC,subset=sel,weights=outCC$partwe)
And now the problem is solved.
I got the same error when I tried to add another categorical variable to the model above, with the syntax:
as.factor(Q12_travel)
The model would not run until I created new dummy variables (0/1) and put those into the model instead.
Your problem may be a different one, but I suggest trying this approach to see whether it solves it.
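One plausible reading of why the 0/1 recoding helped, assuming partage is itself a factor (the answer does not say): without an intercept, the interaction columns already sum to one in every row, and as.factor(setting) then contributes a full set of indicator columns that also sum to one. A single numeric 0/1 column avoids that redundancy. The sketch below checks this on a made-up balanced data frame; `toy` and its values are hypothetical.

```r
# Hypothetical balanced toy data; partage and contage are assumed to be factors
toy <- expand.grid(
  partage = c("child", "adult"),
  contage = c("child", "adult"),
  setting = c("Urban", "Rural"),
  stringsAsFactors = TRUE
)
toy$setting_new <- as.numeric(toy$setting == "Urban")  # 0/1 dummy

# Factor version: full indicator coding for setting, no intercept
X_factor <- model.matrix(~ partage:contage + as.factor(setting) - 1, data = toy)

# Dummy version: one numeric column for setting
X_dummy <- model.matrix(~ partage:contage + setting_new - 1, data = toy)

c(columns = ncol(X_factor), rank = qr(X_factor)$rank)
c(columns = ncol(X_dummy),  rank = qr(X_dummy)$rank)
```

If this is what happened, keeping the factor and dropping one redundant level (or keeping an intercept) would be an alternative to hand-made dummies.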

Related

Is using 'nAGQ = 0' in glmer an issue?

I have seen a couple of similar questions to my own without sufficient answers, so posting this here. I am running a generalised linear mixed effect regression model in R using the glmer() function in the lme4 package. My code is as follows:
model <- glmer(Response ~ step * type + duration + StemArousal + StemValency + stemBNCfreq + (1|ParticipantID)+(1|word),family=binomial, data=demodata, nAGQ = 0)
Response is a binary numerical variable (0 & 1). Step and type are both categorical variables (step has 7 levels, type has 3) and the other 4 predictors are numerical. Both random effects are categorical. I have 389 participants and 20 words.
Currently, I also include the argument 'nAGQ = 0' which I found via another post (https://stats.stackexchange.com/questions/77313/why-cant-i-match-glmer-family-binomial-output-with-manual-implementation-of-g). If I don't do this, then the model does not converge.
I found an explanation elsewhere of the difference between 'nAGQ = 1' and 'nAGQ = 0' (https://stat.ethz.ch/pipermail/r-sig-mixed-models/2017q3/025942.html) which suggests that what I have done is less precise as there is less interaction with the random effects.
Is there a general consensus about the acceptability of this approach? And does anyone have a reliable source for a discussion of it?
Apologies if this sounds like a cross-post/repeat question; it's just that nothing clear seemed to have been resolved.

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running the last line gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Fit your model like this:
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
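To see the two behaviours side by side, here is a small sketch using the Standards data from the question. The names `bad_fit`, `good_fit`, `bad_pred` and `good_pred` are mine. With the `Standards[['...']]` formula, predict() cannot match any column of newdata and falls back to the four rows used for fitting; with the data-frame formula it returns one prediction per row of newdata.

```r
Standards <- data.frame(ng_mL    = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))

# Formula built from Standards[["..."]]: newdata columns are never matched
bad_fit  <- lm(log(Standards[["Abs550nm"]]) ~ Standards[["ng_mL"]])
bad_pred <- suppressWarnings(
  predict(bad_fit, newdata = data.frame(ng_mL = 0.5))
)
length(bad_pred)    # one value per row of Standards, not per row of newdata

# Formula with bare column names and a data argument
good_fit  <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
good_pred <- predict(good_fit, newdata = data.frame(ng_mL = c(0.5, 1.2)))
length(good_pred)   # one value per row of newdata
```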
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note that you don't need to include the log() bit in the variable name. R recognises log() as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
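For a straight-line calibration like this one, inverting can be done by hand: if log(Abs550nm) = b0 + b1 * ng_mL, then ng_mL = (log(Abs550nm) - b0) / b1. A minimal sketch follows, using the Standards data and a few of the Abs values from the question; `invert_curve` and `est` are hypothetical names, and this gives point estimates only, with no uncertainty for the inverse prediction.

```r
Standards <- data.frame(ng_mL    = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))

mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
b   <- coef(mod)

# Hypothetical helper: solve the fitted line for ng_mL given an absorbance
invert_curve <- function(abs550) {
  (log(abs550) - b[["(Intercept)"]]) / b[["ng_mL"]]
}

new_abs <- c(1.7812, 1.7309, 1.3537, 0.3903)
est <- data.frame(Abs550nm  = new_abs,
                  ng_mL_est = invert_curve(new_abs))
est
```

Absorbances outside the range of the standards are extrapolations, and values above the zero standard can come out as negative concentrations, so those estimates should be treated with care.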

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but am getting an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there is no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example is:
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL, rnd = list(PTNO=~1),
family = poisson(link = log), data = stackdata, lambda=100,
control = list(print.iter=TRUE,start=c(1,rep(0,29)),q.start=0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on @Kristofersen's comment above: it is indeed the start vector that messes up your analysis.
If I run
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL,
rnd = list(PTNO=~1),
family = binomial(),
data = stackdata,
lambda=100,
control = list(print.iter=TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You have copied the example with poisson regression and you need to tweak the code to your situation. I have no idea about whether the output makes sense.
Quick note: I ran with the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks then poisson may be reasonable (and it also converges), but you need to recode your outcome as the two groups are defined as 1 and 2 and that will certainly mess up the poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.
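The warning in the question ("'-' not meaningful for factors") suggests that Group may be stored as a factor, in which case subtracting 1 directly gives NA. A small sketch of a recoding that works whether the column is a factor or numeric; `to_binary01` is a hypothetical helper and the example vectors are made up.

```r
# Hypothetical helper: map a two-group variable (factor or numeric) to 0/1
to_binary01 <- function(x) {
  f <- factor(x)                      # accepts numeric, character or factor
  if (nlevels(f) != 2L) stop("expected exactly two groups")
  as.integer(f) - 1L                  # first level -> 0, second level -> 1
}

group_factor  <- factor(c(1, 2, 2, 1, 2))
group_numeric <- c(1, 2, 2, 1, 2)

# Arithmetic on a factor only produces NA (with a warning)
suppressWarnings(group_factor - 1)

to_binary01(group_factor)
to_binary01(group_numeric)
```

With the question's data this would be applied as `stackdata$Group <- to_binary01(stackdata$Group)`, assuming Group really has exactly two distinct values.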

"system is computationally singular" error when using `gmm` (GMM Estimation)

Trying to use the GMM package in R to estimate the parameters (a-f) of a linear model:
LEV1 = a*Macro + b*Firm + c*Sector + d*qtr + e*fqtr + f*tax
Macro, Firm and Sector are matrices with n number of rows. qtr, fqtr and tax are vectors with n members.
I have one large data frame called unconstrd that has all of the data. First, I break that data into separate matrices:
v_LEV1 <- as.matrix(unconstrd$LEV1)
Macro <- as.matrix(cbind(unconstrd$Agg_Corp_Prof,unconstrd$R1000_TR, unconstrd$CP_Spread))
Firm <- as.matrix(cbind(unconstrd$ppe_ratio, unconstrd$op_inc_ratio_avg, unconstrd$selling_exp_avg,
unconstrd$tax_avg, unconstrd$Mark_to_Bk, unconstrd$mc_ratio))
Sector <- as.matrix(cbind(unconstrd$Sect_Flag03,
unconstrd$Sect_Flag04, unconstrd$Sect_Flag05, unconstrd$Sect_Flag06,
unconstrd$Sect_Flag07, unconstrd$Sect_Flag08, unconstrd$Sect_Flag12,
unconstrd$Sect_Flag13, unconstrd$Sect_Flag14, unconstrd$Sect_Flag15,
unconstrd$Sect_Flag17))
v_qtr <- as.matrix(unconstrd$qtr)
v_fqtr <- as.matrix(unconstrd$fqtr)
v_tax <- as.matrix(unconstrd$tax_dummy)
Then, I bind the data together for the x variable called by gmm:
h=cbind(Macro,Firm,Sector,v_qtr, v_fqtr, v_tax)
Then, I invoke gmm:
gmm1 <- gmm(v_LEV1 ~ Macro + Firm + Sector + v_qtr + v_fqtr + v_tax, x=h)
I get the message:
Error in solve.default(crossprod(hm, xm), crossprod(hm, ym)) :
system is computationally singular: reciprocal condition number = 1.10214e-18
I apologize in advance and admit that I'm a neophyte at R and I've never used GMM before. The GMM function is so general, I've looked at the examples available on the web but nothing seems specific enough to help my situation.
You are trying to fit a model to a matrix which does not have full rank. Try excluding some of the variables and/or look for errors. We cannot say much more without your data, or at least a sample.
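A tiny reciprocal condition number can come from exact linear dependence (for example, a complete set of sector dummies alongside an intercept) or from columns on wildly different scales. Whether either applies to the data in the question is unknown. The sketch below shows how the two cases look with made-up matrices; all names and values are hypothetical.

```r
set.seed(42)
n <- 200

# Case 1: dummy-variable trap, a full set of sector flags plus an intercept
sector <- rep(1:3, length.out = n)
flags  <- diag(3)[sector, ]
colnames(flags) <- c("Sect_A", "Sect_B", "Sect_C")
X_trap <- cbind(Intercept = 1, flags)
c(columns = ncol(X_trap), rank = qr(X_trap)$rank)   # rank below column count

# Case 2: full rank, but columns differ in scale by many orders of magnitude
big     <- rnorm(n, mean = 1e9, sd = 1e8)
small   <- rnorm(n, mean = 0,   sd = 1e-3)
X_scale <- cbind(Intercept = 1, big, small)
c(columns = ncol(X_scale), rank = qr(X_scale)$rank)
rcond(crossprod(X_scale))    # compare with .Machine$double.eps

# Standardising the non-constant columns improves the conditioning
X_std <- cbind(Intercept = 1, scale(X_scale[, c("big", "small")]))
rcond(crossprod(X_std))
```

In the first case a column has to be dropped; in the second, rescaling the regressors (for example, expressing large quantities in millions) may be enough for solve() to succeed.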
That's more of a modelling question for Crossvalidated.com than a programming question for StackOverflow.
I was pretty certain there was no linear dependency between my variables, but I went through the exercise of adding one variable at a time to see what was causing the errors. In the end, I asked a colleague to run GMM in SAS and it ran perfectly, with no error messages. I'm not sure what the problem with the R version is, but at this point I have a solution and have given up on GMM in R.
Thanks to everyone who tried to help.

Error in fitting a model with gee(): NA/NaN/Inf in foreign function call (arg 3)

I'm fitting a gee model on a dataset of 13,500 observations (here, students). Students are grouped into 52 different schools. There is evidence that students are nested within schools (low ICC), so I should adjust for this nesting effect in the variance-covariance matrix. My plan is to first fit a gee model with an exchangeable var-cov structure and then, on top of that, run the Huber-White sandwich estimator, also known as the robust variance estimator. I wrote my own code for the robust variance estimator and it works perfectly. My gee statement doesn't work and gives the error below:
NA/NaN/Inf in foreign function call (arg 3)
Here is my code:
STMath.OneYr.C1 = gee(postCSTMath1Yr ~ TRT1Yr + preCSTMath + preCSTENG +
post1YrGradeRef + ELLBaseLine + GENDER + ECODIS + ETHNICITY.F +
as.factor(FailedInd1Yr), data = UCI.clone[UCI.clone$COHORT0809 == "C1",],
id = post1YrSchIID, corstr = "exchangeable")
Unfortunately, the code above is not reproducible for you, so it may be difficult to figure out what the issue is.
I would appreciate any help in figuring out how to solve this.
OK, this question is quite old but I ended up here, so this might help someone eventually.
Basically, this error was caused because unlike in other libraries, the id parameter is treated as a numeric vector.
Indeed, the gee function is casting id as a double, which I don't really understand. Here are the implicated lines (l. 119-120 of the function):
if (!(is.double(id)))
id <- as.double(id)
If your id column is a character vector, just cast it to a factor, or use some function (like dplyr::min_rank) to turn it into a numeric variable.
This should do the trick.
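A minimal base-R sketch of that conversion, using made-up school IDs; the `dat` data frame and its column names are hypothetical. It shows why a character id ends up as NA after as.double(), and one way to build a numeric id instead.

```r
# Hypothetical data with a character cluster identifier
dat <- data.frame(
  school = c("SCH-C", "SCH-A", "SCH-B", "SCH-A", "SCH-C"),
  score  = c(310, 295, 330, 301, 322),
  stringsAsFactors = FALSE
)

# What the quoted lines effectively do to a character id: every value is NA
suppressWarnings(as.double(dat$school))

# Convert via a factor so each school gets an integer code
dat$id_num <- as.integer(factor(dat$school))

# Keep each cluster's rows together before fitting
dat <- dat[order(dat$id_num), ]
dat
```

The prepared column would then be passed as `id = id_num` in the gee() call. Sorting matters because, as far as I know, gee() expects the rows of a cluster to be contiguous.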

Resources