I'm having trouble converting an SAS script to the corresponding R script.
The model is a repeated measures analysis of the response (resp) based on treatment (trt) with plot (plot) nested in the treatment.
SAS code:
data data_set;
input trt $ plot time resp;
datalines;
Burn 1 1 27
Burn 1 9 25
Burn 1 12 18
Burn 1 15 21
Burn 2 1 5
Burn 2 9 15
Burn 2 12 10
Burn 2 15 12
...
Unburn 1 1 57
Unburn 1 9 46
Unburn 1 12 49
Unburn 1 15 51
Unburn 2 1 43
Unburn 2 9 59
Unburn 2 12 59
Unburn 2 15 60
;
proc mixed data = data_set;
class trt plot time;
model resp = trt time trt*time / ddfm = kr;
repeated time / subject = trt(plot) type = vc rcorr;
run;
R code attempted (loading the data set from a CSV file):
library(nlme)
data.set <- read.csv( "data_set.csv" )
data.set$plot <- factor( data.set$plot )
data.set$time <- factor( data.set$time )
model1 <- lme( resp ~ trt + time + trt:time, data = data.set, random = ~1 | plot )
This works, but isn't the desired model. Other attempts I've tried have generally resulted in the error:
Error in getGroups.data.frame(dataMix, groups) :
invalid formula for groups
Basically I'm off in the weeds here...
Question 1: how do I specify the same model in R as is already specified in SAS?
Question 2: I want to be able to change the covariance matrix to replicate other work done in SAS. I believe I know how to do this with the correlation argument of the lme function, but please correct me if I'm wrong.
Thanks in advance.
The specification of the model in R would logically be:
model1 <- lme( resp ~ trt + time + trt:time, data = data.set, random = ~1 | trt:plot )
This is because plot is nested within treatment per the coding (or, equivalently, there is an interaction between plot and treatment). However, if specified this way, it generates the error mentioned above:
Error in getGroups.data.frame(dataMix, groups) : invalid formula
for groups
The problem has to do (I think) with the levels introduced by using such an interaction in the grouping formula. Regardless of the exact cause, it can be resolved by creating a combined treatment-plot grouping variable:
data.set$trtplot <- with( data.set, factor( paste( trt, plot, sep = "." ) ) )
And then performing the analysis as follows:
model1 <- lme( resp ~ trt + time + trt:time, data = data.set, random = ~ 1 | trtplot )
For completeness, this could just as easily be written as follows, since trt * time expands to both main effects plus their interaction:
model1 <- lme( resp ~ trt * time, data = data.set, random = ~ 1 | trtplot )
This then matches the results achieved in SAS when a Compound Symmetry (CS) covariance structure is specified (although the AIC criterion differs - not sure why). So it is a little different from the SAS code above, where a Variance Components (VC) covariance structure is specified, but that is just a matter of changing the structure type in the SAS code.
As for comparing different covariance structures, this appears to be more of a challenge. The covariance structures that I would like to investigate are listed below, followed by a rough sketch of possible nlme specifications:
Compound Symmetry (CS) - done
Variance Components (VC)
Unstructured (UN)
Spatial Power (SP)
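A minimal, untested sketch of how these might be specified in nlme, assuming the data.set and trtplot variables created above; I have not verified these fits against the SAS output. Because the SAS model uses only a REPEATED statement, a marginal fit with nlme::gls arguably maps onto it more directly than layering a correlation structure on top of the random intercept:
library(nlme)
# CS: the random intercept per trtplot in model1 above already induces an
# exchangeable (compound-symmetric) within-subject correlation.
# VC: with gls, omitting the correlation argument gives independent residuals,
# the analogue of TYPE = VC on the REPEATED statement.
m_vc <- gls(resp ~ trt * time, data = data.set)
# UN: general within-subject correlation plus a separate variance per time point.
m_un <- gls(resp ~ trt * time, data = data.set,
            correlation = corSymm(form = ~ 1 | trtplot),
            weights = varIdent(form = ~ 1 | time))
# SP(POW): continuous-time AR(1); time is needed on its numeric scale here.
data.set$time_num <- as.numeric(as.character(data.set$time))
m_sp <- gls(resp ~ trt * time, data = data.set,
            correlation = corCAR1(form = ~ time_num | trtplot))
Fits with the same fixed effects can then be compared with AIC(m_vc, m_un, m_sp) or anova().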
Any thoughts would be most welcome!
I am trying to incorporate the prior class proportions of my dependent variable into my logistic regression in R using the glm function. The data set I am using was created to predict churn.
So far I am using the function below:
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset,
              family = binomial(link = 'logit'))
What I am looking for is how the weights argument works and how to include it in the function, or whether there is another way to incorporate this. The dependent variable is a nominal variable with the values 0 or 1. The data set is imbalanced: only 10% of cases have a value of 1 on the dependent variable CH1 and the other 90% have a value of 0. Therefore the weights would be (0.1, 0.9).
My data set is built up in the following manner: the independent variables vary in data type between continuous and class variables.
Although the ratio of 1s to 0s is 1:9, that does not mean the weights are 0.1 and 0.9. The weights decide how much emphasis you want to give one observation compared to the others.
And in your case, if you want to predict something, it is essential that you split your data into training and test sets and see what influence the weights have on prediction.
Below, using the Pima Indian diabetes example, I subsample the Yes type so that the training set has a 1:9 ratio of Yes to No.
set.seed(111)
library(MASS)
# we sample 10 from Yes and 90 from No
idx = unlist(mapply(sample,split(1:nrow(Pima.tr),Pima.tr$type),c(90,10)))
Data = Pima.tr
trn = Data[idx,]
test = Data[-idx,]
table(trn$type)
No Yes
90 10
Let's try regressing it with weight 9 if positive, 1 if negative:
library(caret)
W = 9
lvl = levels(trn$type)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
# we test it on the test set
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 34 26
Yes 8 32
You can see from the output above that it is doing OK, but you are falsely labeling 8 negatives as positive and missing 26 actual positives. Let's say we try W = 3:
W = 3
lvl = levels(trn$type)
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 39 30
Yes 3 28
Now we manage to get almost all of the positive calls correct, but we still miss out on a lot of potential "Yes". The bottom line is that the code above may work, but you need to do some checks to figure out the right weight for your data.
You can also look at the other statistics provided by confusionMatrix in caret to guide your choice.
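For example, a rough sketch (reusing the trn/test split and lvl from above) that scans a few candidate weights and prints the resulting sensitivity and specificity:
# scan a few weights; pick the sensitivity/specificity trade-off you want
for (W in c(1, 3, 5, 9)) {
  wts   <- ifelse(trn$type == lvl[2], W, 1)
  fitW  <- glm(type ~ ., data = trn, weights = wts, family = binomial)
  predW <- factor(ifelse(predict(fitW, test, type = "response") > 0.5, lvl[2], lvl[1]),
                  levels = lvl)
  cmW   <- confusionMatrix(predW, test$type, positive = lvl[2])
  cat(sprintf("W = %d: sensitivity = %.2f, specificity = %.2f\n",
              W, cmW$byClass["Sensitivity"], cmW$byClass["Specificity"]))
}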
In your data set trainingset, create a column called weights_col that contains your weights (0.1, 0.9) and then run:
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset, family = binomial(link='logit'), weights = weights_col)
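For instance, a minimal sketch of building that column, assuming CH1 is coded 0/1 and the rare class should get the larger weight:
# give the rare class (CH1 == 1) the larger weight; note that non-integer
# weights will make glm emit a "non-integer #successes" warning
trainingset$weights_col <- ifelse(trainingset$CH1 == 1, 0.9, 0.1)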
I have been comparing Poisson, negative binomial (NB), and zero-inflated Poisson and NB models in R. My dependent variable is a symptom count for generalized anxiety disorder (GAD), and my predictors are two personality traits (disinhibition [ZDis_winz] and meanness [ZMean_winz]), their interaction, and covariates of age and assessment site (dummy-coded; there are 8 sites so I have 7 of these dummy variables). I have a sample of 1206 with full data (and these are the only individuals included in the data frame).
I am using NB models for this disorder because the variance (~40) far exceeds the mean (~4). I wanted to consider the possibility of a ZINB model as well, given that ~30% of the sample has 0 symptoms.
For other symptom counts (e.g., conduct disorder), I have run ZINB models perfectly fine in R, but I am getting an error when I do the exact same thing with the GAD model. The standard NB model works fine for GAD; it is only the GAD ZINB model that's erroring out.
Here is the error I'm receiving:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 4.80021e-36
Here is the code I'm using for the (working) NB model:
summary(
NB_GAD_uw_int <- glm.nb(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data=eurodata))
Here is the code I'm using for the (not working) ZINB model (which is identical to other ZINB models I've run for other disorders):
summary(
ZINB_GAD_uw_int <- zeroinfl(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data = eurodata,
dist = "negbin",
model = TRUE,
y = TRUE, x = TRUE))
I have seen a few other posts on StackOverflow and other forums about this type of issue. As far as I can tell, people generally say that this is an issue of either 1) collinear predictors or 2) too complex a model for too little data. (Please let me know if I am misinterpreting this! I'm fairly new to Poisson-based models.) However, I am still confused by these answers because: 1) in this case, none of my predictors are correlated more highly than .15, except for the main predictors of interest (ZMean_winz and ZDis_winz), which are correlated at about .45, and the same predictors are used in other ZINB models that have worked; and 2) with 1206 participants, and having run the same ZINB model with similarly distributed count data for other disorders, I am a little confused as to how this could be too complex a model for my data.
If anyone has any explanation for why this version of my model will not run and/or any suggestions for troubleshooting, I would really appreciate it! I am also happy to provide more info if needed.
Thank you so much!
The problem may be that zeroinfl is not converting categorical variables into dummy variables.
You can dummify your variables using model.matrix, which is what glm, glm.nb, etc. call internally to dummify categorical variables. Relying on model.matrix rather than hand-coding the dummies is usually preferable, both to avoid mistakes and to ensure that your model matrix has full rank (a full-rank matrix is non-singular).
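As a quick diagnostic (a sketch only; eurodata and the variable names are taken from the question), you can inspect the rank and condition number of the design matrix built from your formula, which is one way to see whether the "computationally singular" error comes from the design itself:
X <- model.matrix(~ ZMean_winz * ZDis_winz + age_years + Nottingham_dummy +
                    Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy +
                    Paris_dummy + Dresden_dummy,
                  data = eurodata)
qr(X)$rank == ncol(X)  # FALSE indicates a rank-deficient (singular) design
kappa(X)               # very large values indicate near-collinearity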
If you do want to dummify the categorical variables yourself, I would still use model.matrix to transform the input data involving categorical variables (and potentially interactions between categorical variables and other variables) into the correct model matrix.
Here is an example:
set.seed(2017)
df <- data.frame(
DV = rnorm(100),
IV1_num = rnorm(100),
IV2_cat = sample(c("catA", "catB", "catC"), 100, replace = T))
head(df)
# DV IV1_num IV2_cat
#1 1.43420148 0.01745491 catC
#2 -0.07729196 1.37688667 catC
#3 0.73913723 -0.06869535 catC
#4 -1.75860473 0.84190898 catC
#5 -0.06982523 -0.96624056 catB
#6 0.45190553 -1.96971566 catC
mat <- model.matrix(DV ~ IV1_num + IV2_cat, data = df)
head(mat)
# (Intercept) IV1_num IV2_catcatB IV2_catcatC
#1 1 0.01745491 0 1
#2 1 1.37688667 0 1
#3 1 -0.06869535 0 1
#4 1 0.84190898 0 1
#5 1 -0.96624056 1 0
#6 1 -1.96971566 0 1
The manually dummified input data would then be
df.dummified = cbind.data.frame(DV = df$DV, mat[, -1])
# DV IV1_num IV2_catcatB IV2_catcatC
#1 1.43420148 0.01745491 0 1
#2 -0.07729196 1.37688667 0 1
#3 0.73913723 -0.06869535 0 1
#4 -1.75860473 0.84190898 0 1
#5 -0.06982523 -0.96624056 1 0
#6 0.45190553 -1.96971566 0 1
which you'd use in e.g.
glm.nb(DV ~ ., data = df.dummified)
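The same dummified frame could then be handed to zeroinfl. A small sketch with a simulated zero-inflated count outcome (DV_count is made up here purely for illustration, and the zero part is kept intercept-only):
library(pscl)
set.seed(2017)
# simulate an over-dispersed, zero-inflated count response for illustration only
df.dummified$DV_count <- rnbinom(100, mu = 2, size = 1) * rbinom(100, 1, 0.7)
zinb_fit <- zeroinfl(DV_count ~ IV1_num + IV2_catcatB + IV2_catcatC | 1,
                     data = df.dummified, dist = "negbin")
summary(zinb_fit)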
I have a problem when performing a two-way repeated-measures ANOVA in R on the following data (link: https://drive.google.com/open?id=1nIlFfijUm4Ib6TJoHUUNeEJnZnnNzO29):
subjectnbr is the id of the subject, blockType and linesTTL are the independent variables, and RT2 is the dependent variable.
I first performed the rm ANOVA through using ezANOVA with the following code:
ANOVA_RTS <- ezANOVA(
data=castRTs
, dv=RT2
, wid=subjectnbr
, within = .(blockType,linesTTL)
, type = 2
, detailed = TRUE
, return_aov = FALSE
)
ANOVA_RTS
The result is correct (I double-checked using Statistica).
However, when I perform the repeated-measures ANOVA using the lme function, I do not get the same answer, and I have no clue why.
Here is my code:
lmeRTs <- lme(
RT2 ~ blockType*linesTTL,
random = ~1|subjectnbr/blockType/linesTTL,
data=castRTs)
anova(lmeRTs)
Here are the outputs of both ezANOVA and lme.
I hope I have been clear enough and have given you all the information needed.
I'm looking forward to your help, as I have been trying to figure this out for at least 4 hours!
Thanks in advance.
Here is a step-by-step example on how to reproduce ezANOVA results with nlme::lme.
The data
We read in the data and ensure that all categorical variables are factors.
# Read in data
library(tidyverse);
df <- read.csv("castRTs.csv");
df <- df %>%
mutate(
blockType = factor(blockType),
linesTTL = factor(linesTTL));
Results from ezANOVA
As a check, we reproduce the ez::ezANOVA results.
## ANOVA using ez::ezANOVA
library(ez);
model1 <- ezANOVA(
data = df,
dv = RT2,
wid = subjectnbr,
within = .(blockType, linesTTL),
type = 2,
detailed = TRUE,
return_aov = FALSE);
model1;
# $ANOVA
# Effect DFn DFd SSn SSd F p
#1 (Intercept) 1 13 2047405.6654 34886.767 762.9332235 6.260010e-13
#2 blockType 1 13 236.5412 5011.442 0.6136028 4.474711e-01
#3 linesTTL 1 13 6584.7222 7294.620 11.7348665 4.514589e-03
#4 blockType:linesTTL 1 13 1019.1854 2521.860 5.2538251 3.922784e-02
# p<.05 ges
#1 * 0.976293831
#2 0.004735442
#3 * 0.116958989
#4 * 0.020088855
Results from nlme::lme
We now run nlme::lme
## ANOVA using nlme::lme
library(nlme);
model2 <- anova(lme(
RT2 ~ blockType * linesTTL,
random = list(subjectnbr = pdBlocked(list(~1, pdIdent(~blockType - 1), pdIdent(~linesTTL - 1)))),
data = df))
model2;
# numDF denDF F-value p-value
#(Intercept) 1 39 762.9332 <.0001
#blockType 1 39 0.6136 0.4382
#linesTTL 1 39 11.7349 0.0015
#blockType:linesTTL 1 39 5.2538 0.0274
Results/conclusion
We can see that the F test results from both methods are identical. The somewhat complicated structure of the random effect definition in lme arises from the fact that you have two crossed random effects. Here "crossed" means that for every combination of blockType and linesTTL there exists an observation for every subjectnbr.
Some additional (optional) details
To understand the role of pdBlocked and pdIdent, we need to take a look at the corresponding two-level mixed-effects model.
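In generic (Laird-Ware) form (a standard formulation, not copied from the original post), the model for subject $i$ is
$$ y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad b_i \sim N(0, \Psi), \qquad \varepsilon_i \sim N(0, \sigma^2 I), $$
where $X_i$ holds the fixed-effect dummies for blockType and linesTTL, $Z_i$ is the corresponding random-effects design, and $\Psi$ is the random-effects variance-covariance matrix that the pdBlocked/pdIdent construction below constrains.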
The predictor variables are your categorical variables blockType and linesTTL, which are generally encoded using dummy variables.
The variance-covariance matrix for the random effects can take different forms, depending on the underlying correlation structure of your random-effect coefficients. To be consistent with the assumptions of a two-level repeated-measures ANOVA, we must specify a block-diagonal variance-covariance matrix, pdBlocked, where we create diagonal blocks for the intercept ~1 and for the categorical predictor variables blockType, pdIdent(~blockType - 1), and linesTTL, pdIdent(~linesTTL - 1), respectively. Note that we need to drop the intercept from the last two blocks (since it is already accounted for by the first block).
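For what it is worth, the same block-diagonal structure corresponds to three independent random intercepts in lme4; this is an alternative specification (a sketch, not part of the lme answer above), and its tests may use different denominator degrees of freedom:
## equivalent random-effects structure in lme4; may report a singular fit
## if one of the variance components is estimated as zero
library(lme4)
df$subjectnbr <- factor(df$subjectnbr)  # ensure the id is a factor before interacting
model3 <- lmer(RT2 ~ blockType * linesTTL +
                 (1 | subjectnbr) + (1 | subjectnbr:blockType) + (1 | subjectnbr:linesTTL),
               data = df)
summary(model3)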
Some relevant/interesting resources
Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS, Springer (2000)
Potvin and Schutz, Statistical power for the two-factor repeated measures ANOVA, Behavior Research Methods, Instruments & Computers, 32, 347-356 (2000)
Deming Mi, How to understand and apply mixed-effect models, Department of Biostatistics, Vanderbilt University
In R, what is the best way to incorporate an interaction term between a covariate and time when the proportionality test (with cox.zph) shows that the proportional hazards assumption of the Cox model is violated? I know that you can either use strata or an interaction-with-time term; I'm interested in the latter. I haven't been able to find a definitive, clear explanation with examples of how to do this on the internet. In the most common example, using the Rossi dataset, Fox suggested:
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + age:stop + prio, data = Rossi.2)
Is there a difference between modeling with age:stop versus age:start? Does the formula have to use this format? If I use Surv with the two-parameter format, would the following also make sense?
coxph(formula = Surv(week, arrest) ~ fin + age + age:week + prio, data = Rossi)
Or do you have to split the dataset and use the Surv(start, stop, event) method?
Also, there is the time-transform method:
coxph(formula = Surv(week, arrest) ~ fin + age + tt(age) + prio, data = Rossi, tt=function(x,t,...) x*t)
I know that some people would prefer to model with log(t) instead of t here. But which of these is the correct way to model the interaction with time? Do they all refer to the same underlying statistical model, or to different ones? And in the end, are they all modeling (for the interaction term) h(t) = h0(t) exp(b*X*t)?
This is essentially a 3 part question:
How to estimate time-varying effects?
What is the difference between different specifications of time-varying effects using survival::coxph function
How to decide what shape the time-variation has, i.e., linear, logarithmic, ...
I will try to answer these questions in the following using the veteran data example, which is featured in section 4.2 of the vignette on time-dependent covariates and time-dependent coefficients (also known as time-varying effects) in the survival package:
library(dplyr)
library(survival)
data("veteran", package = "survival")
veteran <- veteran %>%
mutate(
trt = 1L * (trt == 2),
prior = 1L * (prior == 10))
head(veteran)
#> trt celltype time status karno diagtime age prior
#> 1 0 squamous 72 1 60 7 69 0
#> 2 0 squamous 411 1 70 5 64 1
#> 3 0 squamous 228 1 60 3 38 0
#> 4 0 squamous 126 1 60 9 63 1
#> 5 0 squamous 118 1 70 11 65 1
#> 6 0 squamous 10 1 20 5 49 0
1. How to estimate time-varying effects
There are different popular methods and implementations, e.g. survival::coxph, timereg::aalen or using GAMs after appropriate data transformation (see below).
Although the specific methods and their implementations differ, the general idea is to create a long-format data set where
the follow-up is partitioned into intervals
for each subject, the status is 0 in all intervals except the last (if an event)
the time variable is updated in each interval
Then, the time (or a transformation of time, e.g. log(t)) is simply a covariate and time-varying effects can be estimated by an interaction between the covariate of interest and the (transformed) covariate of time.
If the functional form of the time-variation is known, you can use the tt() approach:
cph_tt <- coxph(
formula = Surv(time, status) ~ trt + prior + karno + tt(karno),
data = veteran,
tt = function(x, t, ...) x * log(t + 20))
2. What is the difference between different specifications of time-varying effects using survival::coxph function
There is no difference. I assume the tt() function is simply a short-cut for the estimation via transformation to the long-format. You can verify that the two approaches are equivalent using the code below:
First, transform the data to the long format:
veteran_long <- survSplit(Surv(time, status)~., data = veteran, id = "id",
cut = unique(veteran$time)) %>%
mutate(log_time = log(time + 20))
head(veteran_long) %>% select(id, trt, age, tstart, time, log_time, status)
#> id trt age tstart time log_time status
#> 1 1 0 69 0 1 3.044522 0
#> 2 1 0 69 1 2 3.091042 0
#> 3 1 0 69 2 3 3.135494 0
#> 4 1 0 69 3 4 3.178054 0
#> 5 1 0 69 4 7 3.295837 0
#> 6 1 0 69 7 8 3.332205 0
cph_long <- coxph(formula = Surv(tstart, time, status)~
trt + prior + karno + karno:log_time, data = veteran_long)
## models are equivalent, just different specification
cbind(coef(cph_long), coef(cph_tt))
#> [,1] [,2]
#> trt 0.01647766 0.01647766
#> prior -0.09317362 -0.09317362
#> karno -0.12466229 -0.12466229
#> karno:log_time 0.02130957 0.02130957
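To see what these coefficients imply, a quick sketch that evaluates the karno effect at a few (arbitrary) time points:
## log hazard ratio per unit of karno at time t: beta_karno + beta_interaction * log(t + 20)
t_eval <- c(30, 100, 365)
lhr <- coef(cph_long)["karno"] + coef(cph_long)["karno:log_time"] * log(t_eval + 20)
exp(lhr)  # hazard ratios per one-point increase in karno at each time point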
3. How to decide what shape the time-variation has?
As mentioned before, time-varying effects are simply interactions of a covariate x and time t, thus time-varying effects can have different specifications, equivalent to interactions in standard regression models, e.g.
x*t: linear covariate effect, linearly time-varying effect
f(x)*t: non-linear covariate effect, linearly time-varying effect
f(t)*x: linear covariate effect, non-linearly time-varying effect; for categorical x this essentially represents a stratified baseline hazard
f(x, t): non-linear, non-linearly time-varying effect
In each case, the functional form of the effect f can either be estimated from the data or prespecified (e.g. f(t)*x = karno * log(t + 20) above).
In most cases you would prefer to estimate f from the data. Support for the (penalized) estimation of such effects is, to my knowledge, limited in the survival package. However, you can use mgcv::gam to estimate any of the effects specified above (after appropriate data transformation). An example is given below; it shows that the effect of karno goes towards 0 as time progresses, regardless of the Karnofsky score at the beginning of the follow-up (see here for details and also Section 4.2 here):
library(pammtools)
# data transformation
ped <- as_ped(veteran, Surv(time, status)~., max_time = 400)
# model
pam <- mgcv::gam(ped_status ~ s(tend) + trt + prior + te(tend, karno, k = 10),
data = ped, family = poisson(), offset = offset, method = "REML")
p_2d <- gg_tensor(pam)
p_slice <- gg_slice(ped, pam, "karno", tend = unique(tend), karno = c(20, 50, 80), reference = list(karno = 60))
gridExtra::grid.arrange(p_2d, p_slice, nrow = 1)
I am trying to build a model with nested random effects and a random coefficient for an interaction term using glmer() from the lme4 package in R.
As seen in the created data below, I have a binary Response and two explanatory variables. Time is continuous and Binary is a factor. These data are taken from 6 individuals (AAA:FFF) in three StudyAreas (CO, UT, MT). Because individuals only occur at one StudyArea, IndID is nested within StudyArea.
#Make data
Response <- as.factor(round(runif(150, 0, 1)))
Time <- round(runif(150, 2,50))
Binary <- round(runif(150, 0, 1))
IndID <- as.factor(rep(c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"),25))
StudyArea <- as.factor(rep(c("CO", "UT", "MT"),50))
Data <- data.frame(Response, Time, Binary, IndID, StudyArea)
head(Data)
Response Time Binary IndID StudyArea
1 0 44 1 AAA CO
2 1 16 0 BBB UT
3 1 43 0 CCC MT
4 0 13 1 DDD CO
5 0 34 1 EEE UT
6 1 10 1 FFF MT
Because I want to account for differences across IndID and also StudyArea, I have included both terms as random effects with adjustments to the intercept in the model below.
require(lme4)
glmer1 <- glmer(Response ~ Time + Binary + (1|StudyArea) + (1|IndID), data = Data, family = binomial)
summary(glmer1)
Let's say that within a GLM structure the interaction between Time and StudyArea (i.e. Time*StudyArea) is a significant term. Thus, in addition to the adjustments to the intercept, I also need an adjustment to the slope to account for differences in Time as a function of StudyArea.
While I have seen a number of examples in the Bates book (http://lme4.r-forge.r-project.org/book/Ch4.pdf) and other posts for adding a random coefficient, I have not seen a random coefficient for an interaction term. From what I have gleaned from other posts, the model structure should look something like the model below, but I look forward to the feedback and suggestions of others. This code will fit a model, although I am not sure it is correct theoretically:
glmer2 <- glmer(Response ~ Time + Binary + (0 + Time|StudyArea) + (1|StudyArea) + (1|IndID), data = Data, family = binomial)
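For comparison, here is a sketch of the more common correlated intercept-and-slope specification; this is just an alternative way of writing the random part, not a claim that it is the better model, and with only three StudyArea levels either version may report convergence or singularity warnings:
glmer2b <- glmer(Response ~ Time + Binary + (1 + Time | StudyArea) + (1 | IndID),
                 data = Data, family = binomial)
summary(glmer2b)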
Note: These are made up data and the results/p-values are obviously meaningless.
Thanks in advance.