How to retrieve the data frame used in a GEE model fit in R?

I have a longitudinal data frame with multiple rows per id.
> data("dietox")
> head(dietox, 5)
   Pig    Evit    Cu Litter Start   Weight      Feed Time
1 4601 Evit000 Cu000      1  26.5 26.50000        NA    1
2 4601 Evit000 Cu000      1  26.5 27.59999  5.200005    2
3 4601 Evit000 Cu000      1  26.5 36.50000 17.600000    3
4 4601 Evit000 Cu000      1  26.5 40.29999 28.500000    4
5 4601 Evit000 Cu000      1  26.5 49.09998 45.200001    5
I am trying to fit a GEE model to predict Weight for each row of the data frame.
library(gee)
library(dplyr)
> model1 <- gee(Weight ~ Start + Feed, id=Pig, data=dietox, corstr="exchangeable")
> model1
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Exchangeable
Call:
gee(formula = Weight ~ Start + Feed, id = Pig, data = dietox,
corstr = "exchangeable")
Number of observations : 789
Maximum cluster size : 11
Coefficients:
(Intercept)       Start        Feed
  5.1539561   0.9384232   0.4294209
I now want to add a new column, prediction, to the data frame, containing the predicted weight value for each row. The idea is that I will then be able to compare the original Weight variable with the prediction variable at different points of the Time variable.
When I try to do this using the mutate and predict functions, I get an error saying that the number of observations used in the model fit (789) is different from the number of observations in the original data frame (861).
> new_df <- dietox %>%
+ mutate(prediction = predict(model1))
Error: Column `prediction` must be length 861 (the number of rows) or one, not 789
My questions are:
1. How do I extract the data frame for the 789 observations that were used in the model fit?
2. Why is the number of observations used in the model fit different from the total number of observations in the original data frame?

The 789 observations used in the model fit are the rows without missing values. You have 72 rows with NA in the Feed column:
sum(is.na(dietox$Feed))
#[1] 72
and 789 + 72 gives the full 861 observations. To add the predicted values back onto the complete data frame you could do:
dietox$Prediction <- NA
dietox$Prediction[!is.na(dietox$Feed)] <- predict(model1)
head(dietox)
#    Weight       Feed Time  Pig Evit Cu Litter Prediction
#1 26.50000         NA    1 4601    1  1      1         NA
#2 27.59999   5.200005    2 4601    1  1      1   31.43603
#3 36.50000  17.600000    3 4601    1  1      1   36.76708
#4 40.29999  28.500000    4 4601    1  1      1   41.45324
#5 49.09998  45.200001    5 4601    1  1      1   48.63296
#6 55.39999  56.900002    6 4601    1  1      1   53.66306
The response values that were actually used in the fit are also available in model1$y.
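If you also need the subset of rows that actually entered the fit (question 1), one option is to take the complete cases of the model variables. This is a sketch, assuming Feed is the only variable with missing values, as shown above:
# Rows of dietox that were used in the fit: complete cases on the
# model variables (Weight, Start, Feed).
used_rows <- complete.cases(dietox[, c("Weight", "Start", "Feed")])
model_df <- dietox[used_rows, ]
nrow(model_df)   # should be 789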

Related

Flexsurvreg with treatment covariate: separating the variance-covariance matrix

I am using flexsurvreg() to fit and extrapolate parametric models on survival data. I use the treatment group as a covariate to make a proportional hazards model. I need a separate variance-covariance matrix for each of the two treatment groups, but I cannot find out how to separate the groups after fitting the parametric model.
weib <- flexsurvreg(Surv(os_mnts, censoring) ~ treat, data = date_ex, dist = "weibull")
An example of the data is below. I do have treat == control as well, although it is not shown here.
#sx_date last_fup_date censoring sex treat os_mnts
# <date> <date> <dbl> <dbl> <chr> <dbl>
# 1 2010-06-03 2013-08-10 0 1 treatment 38.2
# 2 2013-06-10 2014-09-09 1 1 treatment 15.0
# 3 2014-11-05 2015-07-03 0 0 treatment 7.89
# 4 2011-03-07 2014-08-10 1 1 treatment 41.1
# 5 2010-03-06 2013-12-11 0 1 treatment 45.2
# 6 2011-09-08 2015-01-01 0 1 treatment 39.8
# 7 2008-10-09 2016-06-02 1 0 treatment 91.8
# 8 2010-02-11 2015-01-02 1 1 treatment 58.7
# 9 2009-08-06 2014-07-06 0 1 treatment 59.0
#10 2011-07-03 2016-04-03 0 0 treatment 57.0
When I call vcov(weib) to get the variance covariance matrix, I get the following.
# shape scale treattreatment
#shape 0.0218074155 -0.004631324 -0.0001595603
#scale -0.0046313242 0.007912648 -0.0068951896
#treattreatment -0.0001595603 -0.006895190 0.0138593195
However, I need two variance covariance matrices (1 for each treatment group) with shape and scale only.
I have tried searching for a way to separate the matrix itself and to subset the weib object, but I cannot find out how to do either of these things. Does anyone know how I can get separate matrices out of this?
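One possible approach, not taken from the original post: fit the Weibull model separately within each treatment group, so that each fit contains only shape and scale parameters and vcov() returns a 2x2 matrix per group. Note that this drops the shared-shape assumption of the single-covariate model; the object and column names (date_ex, os_mnts, censoring, treat) are taken from the question.
library(flexsurv)

# Sketch: split the data by treatment group and fit an intercept-only
# Weibull model in each group; vcov() then gives a 2x2 shape/scale
# variance-covariance matrix per group.
fits <- lapply(split(date_ex, date_ex$treat), function(d) {
  flexsurvreg(Surv(os_mnts, censoring) ~ 1, data = d, dist = "weibull")
})
vcov_by_group <- lapply(fits, vcov)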

Creating new variables after imputation with the MICE package

I have longitudinal panel data of 1000 individuals measured at two time points. Using the MICE package I have imputed values for those variables with missing data. The imputation itself works fine, generating the required 17 imputed data frames. One of the imputed variables is fitness. I would like to create a new variable of fitness scaled, scale(fitness). My understanding is that I should impute first, and then create the new variable with the imputed data. How do I access each of the 17 imputed datasets and generate a scaled fitness variable in each?
My original data frame looks like (some variables missing):
id age school sex andersen ldl_c_trad pre_post
<dbl> <dbl> <fct> <fct> <int> <dbl> <fct>
1 2 10.7 1 1 951 2.31 1
2 2 11.3 1 1 877 2.20 2
3 3 11.3 1 1 736 2.88 1
4 3 11.9 1 1 668 3.36 2
5 4 10.1 1 0 872 3.31 1
6 4 10.7 1 0 905 2.95 2
7 5 10.5 1 1 925 2.02 1
8 5 11.0 1 1 860 1.92 2
9 8 10.7 1 1 767 3.41 1
10 8 11.2 1 1 709 3.32 2
My imputation code is:
imputed <- mice(imp_vars, method = meth, predictorMatrix = predM, m = 17)
imp_vars are the variables selected for imputation.
I have pre-specified both the method and predictor matrix.
Also, my assumption is that the scaling should be performed separately for each time point, as fitness is likely to have improved over time. Is it possible to perform the scaling filtered by pre_post for each imputed dataset?
Many thanks.
To access each of the imputed datasets, where x is a value from 1 to 17:
data <- complete(imputed, x)
or, if you want to access the fitness variable directly:
complete(imputed, x)$fitness
If you want to filter observations according to a value of another variable in the dataframe, you could use
data[which(data$pre_post==1), "fitness"]
This should return the fitness observations for which pre_post == 1. From there it is simply a matter of scaling these observations for each level of pre_post, assigning them to a new variable, fitness_scaled, and repeating for each of the 17 imputations.
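Putting that together, a loop over all 17 imputations might look like the following sketch (not from the original answer; the use of ave() and the name fitness_scaled are illustrative):
library(mice)

# Sketch: for each imputed dataset, scale fitness within each level of
# pre_post and store the result as fitness_scaled. 'imputed' is the mids
# object from the question.
scaled_list <- lapply(1:17, function(x) {
  d <- complete(imputed, x)
  d$fitness_scaled <- ave(d$fitness, d$pre_post,
                          FUN = function(v) as.numeric(scale(v)))
  d
})

# scaled_list[[1]] is the first completed dataset with the new column, etc.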

covariance structure for multilevel modelling

I have a multilevel repeated-measures dataset of around 300 patients, each with up to 10 repeated measures, predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random slope, random intercept model where effects vary between patients and the effect of time differs between patients. When I try to introduce a first-order covariance structure to allow for the correlation of measurements over time, I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
#baseline model includes only the intercept; the intercept varies across patients (random intercept)
randomintercept <- lme(troponin ~ 1,
data = df, random = ~1|record_id, method = "ML",
na.action = na.exclude,
control = list(opt="optim"))
#random intercept and time as fixed effect
timeri <- update(randomintercept,.~. + day)
#random slopes and intercept: effect of time is different in different people
timers <- update(timeri, random = ~ day|record_id)
#model covariance structure. corAR1() first order autoregressive covariance structure, timepoints equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~day|record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
correlation = corAR1(0, form = ~day|record_id),
control=list(opt="nlminb"))
However, if you look at the fitted model, you'll see you have problems - the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r=0.998.
I think the problem is with the nature of the data. Most of the data seem to be in the range 10-50, but there are excursions of one or two orders of magnitude (e.g. individual 6, up to about 45000). It might be hard to fit a model to data this spiky. I would strongly suggest log-transforming your data; the standard diagnostic plot, plot(randomintercept), shows this (plot not reproduced here), whereas fitting on the log scale
whereas fitting on the log scale
rlog <- update(randomintercept,log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR+random-slopes model fits OK:
ar.rlog <- update(rlog,
random = ~day|record_id,
correlation = corAR1(0, form = ~day|record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(rlog,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))

How can I filter out rows from a linear regression based on another linear regression?

I would like to conduct a linear regression in three steps: 1) running the regression on all data points, 2) removing the 10 outliers identified by the largest absolute values of rstandard, and 3) running the regression again on the new data frame.
I know how to do it manually, but this is very awkward. Is there a way to do it automatically? Can it also be done for removing columns?
Here is my toy data frame and code (I'll remove the top 2 outliers):
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 16
608 0 1 5
123 1 17 7
321 1 8 7
226 0 2 7
556 0 20 3
334 1 6 3
225 0 1 1
999 0 3 11
987 0 30 1 ",header = TRUE)
model<- lm(target~ birds+ wolfs,data=df)
rstandard <- abs(rstandard(model))
df<-cbind(df,rstandard)
g<-subset(df,rstandard > sort(unique(rstandard),decreasing=T)[3])
g
userid target birds wolfs rstandard
4 543 1 2 3 1.189858
13 334 1 6 3 1.122579
modelNew<- lm(target~ birds+ wolfs,data=df[-c(4,13),])
I don't see how you could do this without estimating two models, the first to identify the most influential cases and the second on the data without those cases. You could simplify your code and avoid cluttering the workspace, however, by doing it all in one shot, with the subsetting process embedded in the call to estimate the "final" model. Here's code that does this for the example you gave:
model <- lm(target ~ birds + wolfs,
            data = df[-(as.numeric(names(sort(abs(rstandard(lm(target ~ birds + wolfs, data = df))),
                                              decreasing = TRUE)))[1:2]), ])
Here, the initial model, evaluation of influence, and ensuing subsetting of the data are all built into the code that comes after the first data =.
Also, note that the resulting model will differ from the one your code produced. That's because your g did not correctly identify the two most influential cases, as you can see if you just eyeball the results of abs(rstandard(lm(target ~ birds + wolfs, data=df))). I think it has to do with your use of unique(), which seems unnecessary, but I'm not sure.
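If you need to do this repeatedly, a small helper function along these lines keeps the steps explicit (a sketch, not from the original answer; the function name drop_outliers_refit is made up, and it assumes the data contain no missing values so that residual positions line up with rows):
# Sketch of a helper: fit, drop the k cases with the largest absolute
# standardized residuals, and refit.
drop_outliers_refit <- function(formula, data, k = 2) {
  fit0  <- lm(formula, data = data)                          # initial fit
  worst <- order(abs(rstandard(fit0)), decreasing = TRUE)[seq_len(k)]
  lm(formula, data = data[-worst, ])                         # refit without them
}

modelNew <- drop_outliers_refit(target ~ birds + wolfs, df, k = 2)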

coxph() X matrix deemed to be singular;

I'm having some trouble using coxph(). I have two categorical variables, "tecnologia" and "pais", and I want to evaluate the possible interaction effect of "pais" on "tecnologia". "tecnologia" is a factor with 2 levels: gps and convencional. "pais" has 2 levels: PT and ES. I have no idea why this warning keeps appearing.
Here's the code and the output:
cox_AC<-coxph(Surv(dados_temp$dias_seg,dados_temp$status)~tecnologia*pais,data=dados_temp)
Warning message:
In coxph(Surv(dados_temp$dias_seg, dados_temp$status) ~ tecnologia * :
X matrix deemed to be singular; variable 3
> cox_AC
Call:
coxph(formula = Surv(dados_temp$dias_seg, dados_temp$status) ~
tecnologia * pais, data = dados_temp)
coef exp(coef) se(coef) z p
tecnologiagps -0.152 0.859 0.400 -0.38 7e-01
paisPT 1.469 4.345 0.406 3.62 3e-04
tecnologiagps:paisPT NA NA 0.000 NA NA
Likelihood ratio test=23.8 on 2 df, p=6.82e-06 n= 127, number of events= 64
I'm opening another question about this subject, although I asked a similar one some months ago, because I'm facing the same problem again with other data. And this time I'm sure it's not a data-related problem.
Can somebody help me?
Thank you
UPDATE:
The problem does not seem to be perfect classification:
> xtabs(~status+tecnologia,data=dados)
tecnologia
status conv doppler gps
0 39 6 24
1 30 3 34
> xtabs(~status+pais,data=dados)
pais
status ES PT
0 71 8
1 49 28
> xtabs(~tecnologia+pais,data=dados)
pais
tecnologia ES PT
conv 69 0
doppler 1 8
gps 30 28
Here's a simple example which seems to reproduce your problem:
> library(survival)
> (df1 <- data.frame(t1=seq(1:6),
s1=rep(c(0, 1), 3),
te1=c(rep(0, 3), rep(1, 3)),
pa1=c(0,0,1,0,0,0)
))
t1 s1 te1 pa1
1 1 0 0 0
2 2 1 0 0
3 3 0 0 1
4 4 1 1 0
5 5 0 1 0
6 6 1 1 0
> (coxph(Surv(t1, s1) ~ te1*pa1, data=df1))
Call:
coxph(formula = Surv(t1, s1) ~ te1 * pa1, data = df1)
coef exp(coef) se(coef) z p
te1 -23 9.84e-11 58208 -0.000396 1
pa1 -23 9.84e-11 100819 -0.000229 1
te1:pa1 NA NA 0 NA NA
Now let's look for 'perfect classification' like so:
> (xtabs( ~ s1+te1, data=df1))
te1
s1 0 1
0 2 1
1 1 2
> (xtabs( ~ s1+pa1, data=df1))
pa1
s1 0 1
0 2 1
1 3 0
Note that a value of 1 for pa1 exactly predicts having a status s1 equal to 0. That is to say, based on your data, if you know that pa1 == 1 then you can be sure that s1 == 0. Thus fitting Cox's model is not appropriate in this setting and will result in numerical errors.
This can be seen with
> coxph(Surv(t1, s1) ~ pa1, data=df1)
giving
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; beta may be infinite.
It's important to look at these cross tables before fitting models. Also it's worth starting with simpler models before considering those involving interactions.
If we add the interaction term to df1 manually like this:
> (df1 <- within(df1,
+ te1pa1 <- te1*pa1))
t1 s1 te1 pa1 te1pa1
1 1 0 0 0 0
2 2 1 0 0 0
3 3 0 0 1 0
4 4 1 1 0 0
5 5 0 1 0 0
6 6 1 1 0 0
Then check it with
> (xtabs( ~ s1+te1pa1, data=df1))
te1pa1
s1 0
0 3
1 3
We can see that it's a useless classifier, i.e. it does not help predict status s1.
When combining all 3 terms, the fitter does manage to produce numerical values for te1 and pa1 even though pa1 is a perfect predictor as above. However, a look at the values of the coefficients and their errors shows them to be implausible.
Edit #JMarcelino: If you look at the warnings from the first coxph model in the example, you'll see:
2: In coxph(Surv(t1, s1) ~ te1 * pa1, data = df1) :
X matrix deemed to be singular; variable 3
This is likely the same warning you're getting, and it is due to this classification problem. Also, your third cross table, xtabs(~ tecnologia + pais, data = dados), is not as important as the table of status by the interaction term. You could add the interaction term manually first, as in the example above, and then check the cross table. Or you could say:
> with(df1,
table(s1, pa1te1=pa1*te1))
pa1te1
s1 0
0 3
1 3
That said, I notice that one of the cells in your third table has a zero (conv, PT), meaning you have no observations with this combination of predictors. This is going to cause problems when trying to fit.
In general, the outcome should have some values for all levels of the predictors, and the predictors should not classify the outcome as exactly all-or-nothing or 50/50.
Edit 2 #user75782131: Yes, generally speaking, xtabs or a similar cross table should be examined in models where the outcome and predictors are discrete, i.e. have a limited number of levels. If 'perfect classification' is present then a predictive model / regression may not be appropriate. This is true, for example, for logistic regression (binary outcome) as well as Cox's model.
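Applied to the data from the question, the same check could be written as the following sketch (assuming the dados data frame and the status, tecnologia and pais columns shown above):
# Sketch: cross-tabulate the outcome against the interaction of the two
# predictors to spot empty cells or perfect classification before calling
# coxph().
with(dados, table(status, interaction(tecnologia, pais, drop = TRUE)))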

Resources