I'm currently struggling with some mixed models for repeated measures in R. I have read a lot of posts and requests about converting code from SAS to R, and I have found some elements, but I am not sure about what I have done so far.
I am trying to model the effect of some products on subjects with different sequence patterns and different visits (one product per visit).
I have some SAS code which is the "ground truth", and I would like to reproduce the SAS results in R (with the nlme package or equivalent) to display them through a Shiny app.
I've tried some models in R, with results close to the ones from SAS, but some parts are still different, especially the AIC, BIC and log-likelihood.
Below are the SAS code that I am trying to convert and my R implementation:
SAS Code
Proc mixed data = data method = reml;
  Class A B C D;
  Model variable = B C D / solution ddfm = kenwardroger;
  Random A(B);
  etc.
  etc.
Run;
R Code
library(nlme)
model <- lme(variable ~ B + C + D,
             random = ~ 1 | A/B, data = data, na.action = na.omit)
A : Subject ID
B : sequence pattern
C : visit number
D : product used
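For what it's worth, SAS's A(B) notation reads "A nested within B", so I also wondered whether the grouping in lme() should put the outer factor first. A sketch of that alternative (I am not sure it is right either):

model2 <- lme(variable ~ B + C + D,
              random = ~ 1 | B/A,   # B (sequence) outer, A (subject) nested within B
              data = data, na.action = na.omit)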
Is my conversion to R correct? If it is, why do I get different results for AIC, BIC and log-likelihood?
Thanks in advance
I am working with a dataset of 3,500 observations that includes a Body Mass Index (BMI) variable. There are around 300 NA values for the BMI variable, which I have imputed using multiple imputation. Apologies that this is not reproducible; I wasn't sure how to do that quickly in this case. Here is the code I used to impute the data.
a1 <- amelia(x = df, m = 30, idvars = c("EDUC_1", "REGION_3"),
             noms = c("REGION_1", "REGION_2", "REGION_4", "SMOKE", "MARRIED",
                      "NON_WHITE", "MOD_SEV_ANX", "HYPERTEN", "DIABETES",
                      "BELOW_100_POVERTY", "IMMIGRANT", "FEMALE",
                      "EDUC_2", "EDUC_3", "EDUC_4", "EDUC_5"),
             p2s = 0)
a1
imp.mod <- zelig(BMICALC ~ AGE + FAMSIZE + factor(BELOW_100_POVERTY) +
                   factor(IMMIGRANT) + factor(FEMALE) + factor(EDUC_2) +
                   factor(EDUC_3) + factor(EDUC_4) + factor(EDUC_5) +
                   factor(REGION_1) + factor(REGION_2) + factor(REGION_4) +
                   factor(SMOKE) + factor(MARRIED) + factor(NON_WHITE) +
                   factor(MOD_SEV_ANX) + factor(HYPERTEN) + factor(DIABETES),
                 model = "ls", data = a1, cite = FALSE)
summary(imp.mod)
And here is the output
From this website I have found information on how to assess whether the imputation needs more investigation or whether I can continue to analyze the data using a regression model. I have included the code where I create the two visuals per the website's instructions, but I am unclear on how to interpret whether the imputation is accurate. Do the two distributions in the first graphic need to be close? Can someone clarify what the y = x line and the blue confidence intervals/dots in the second image mean? Is the output here indicative of whether the imputation will suffice for regression analysis? I've attached the code and output below. Thank you!
compare.density(a1, var = "BMICALC")
overimpute(a1, var = "BMICALC")
I have the following data structure, with approximately i = 50 studies, j = 75 experiments and k = 200 conditions.
On level k I have the dependent measures. For about 20 studies (25 experiments and 65 conditions) I have data at the subject level and calculated the variance-covariance matrix. For the rest I calculated a variance-covariance matrix from estimated correlations (for subjects and conditions). Finally, I have a complete k x k variance-covariance matrix V.
To respect the multilevel structure of the data I let every condition in every experiment in every study have its own covariance, using an unstructured variance-covariance matrix (see Details - Specifying Random Effects). Note that I am not 100% sure about this reasoning, or about reasoning for/against assumed variance-covariance structures in multilevel models in general, so I am happy to receive some thoughts/literature on this...
I now want to fit a multivariate (multilevel) random-effects model with:
rma.mv(yi = yk,
       V = V,
       random = list(~ exp_j | stu_i,
                     ~ con_k | exp_j),
       struct = "UN",
       method = "REML",
       test = "t",        ## slightly mimics knha
       data = dat,
       slab = con_k,
       control = list(optimizer = "optimParallel", ncpus = 32))
When run on the complete dataset, the calculation reaches 128 GB(!) of RAM within a few minutes, and at some point R just terminates without an error message.
1) Is this to be expected with the amount of data I have?
Running the same model with a subset of the original data (i.e. i = 20, j = 25 and k = 65; I just grabbed data without estimated variance-covariance matrices) works fine and peaks at ~20 GB of RAM.
I saw the tips section of the metafor package as well as the optimization options for rma.mv() in the notes. 2) In my scenario, would switching to Microsoft R Open or another algorithm (without parallelisation?!) be reasonable?
Note that the model above is not the final model I want to fit. No moderators are included yet; additional models should include regression terms for moderators. It will become even more complex, I guess...
I am running R version 3.6.3 (2020-02-29) on x86_64-pc-linux-gnu (64-bit) under Ubuntu 18.04.5 LTS. metafor is at version 2.4-0.
Best
Jonas
Probably not every study has 50 experiments and not every experiment has 200 conditions, but yes, 50 * 75 * 200 (i.e., 750,000) rows of data would be a problem. However, before I address this issue, let's start with the model itself, which makes little sense. With 75 experiments within those 50 studies, using ~ exp_j | stu_i with struct="UN" implies that you are trying to estimate the variances and covariances of a 75 x 75 var-cov matrix. That's already 2,850 parameters (75 * 76 / 2). The ~ con_k | exp_j part adds yet another 20,000+ parameters by my calculation. This is never going to work.
Based on your description, you have a multilevel structure, but there is no inherent link between what experiment 1 in study 1 stands for and what experiment 1 in study 2 stands for. So the experiment identifier is just used here to distinguish the different experiments within studies, but carries no further meaning. Compare this with the situation where you have, for example, outcomes A and B in study 1, outcome A in study 2, outcome B in study 3, and so on. 'A' really stands for 'A' in all studies and is not just used to distinguish the elements.
Another issue is that ~ con_k | exp_j will not automatically be nested within studies. The rma.mv() function also allows for crossed random effects, so if you want to add random effects for conditions that are in turn nested within experiments within studies, you should create a new variable, for example exp.in.study, that reflects this. You could do this with dat$exp.in.study <- paste0(dat$stu_i, ".", dat$exp_j). Then you can use ~ con_k | exp.in.study to reflect this nesting.
However, based on your description, what I think you really should use is a much simpler model structure, namely random = ~ 1 | stu_i / exp_j / con_k (in that case, the struct argument is not relevant).
Still, if your dataset has 100,000+ rows, then the default way rma.mv() works will become a memory issue, because internally the function will be juggling matrices of those dimensions. A simple solution is to use sparse=TRUE, in which case matrices are stored internally as sparse structures. You probably don't even need any parallel processing then, but you could try whether optimizer="optimParallel" speeds things up (note that ncpus=3 is then all you need, because that is the number of variance components the model will estimate if it is specified as suggested above).
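Putting those suggestions together, a sketch of the simpler specification (untested, using the variable names from the question):

library(metafor)
res <- rma.mv(yi = yk,
              V = V,
              # one variance component per level: studies,
              # experiments within studies, conditions within experiments
              random = ~ 1 | stu_i/exp_j/con_k,
              method = "REML",
              test = "t",
              data = dat,
              slab = con_k,
              sparse = TRUE)   # store internal matrices sparsely to save memory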
I am working in R with the lme4 package and in MPlus, and I have the following situation:
I want to predict variable B (which is dichotomous) from variable A (continuous), controlling for random effects at the level of a) subjects and b) tasks.
A -> B (1)
The problem is that when I use the model to predict values of B from A, values below a probability of 0.5 get predicted, and in my case that doesn't make sense because, if you guess at random, the probability of a correct answer on B would be 0.5.
I want to know how I can constrain model (1) in R or in MPlus so that it doesn't predict values lower than 0.5 for variable B.
Thank you!
I found a solution to the question thanks to Mr Kenneth Knoblauch. Basically, you need the psyphy package to use the mafc.logit() function.
For example, the code then looks like this:
mod <- glm(B ~ A, data = df, family = binomial(mafc.logit(.m = 2)))
The .m = 2 argument supplies the guessing parameter for two-choice tasks.
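In other words, mafc.logit(.m = 2) builds a link whose lower asymptote is the guessing rate 1/m = 0.5, so fitted probabilities cannot fall below chance. A quick sanity check on the fitted model (a sketch):

range(predict(mod, type = "response"))   # fitted probabilities; the lower bound should be >= 0.5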
Cheers!
I built an ARIMA model with regressors in SAS and in R, but the results are totally different, and I cannot figure out why the two packages give different outputs.
The following is the SAS code
proc arima data=TSDATA;
   identify var=LOG_Sale
      crosscorr=(Log_Var1 Log_Var2 Log_Var3)
      nlag=12 alpha=0.05 whitenoise=ignoremiss scan;
run;
   estimate q=(4)(10)
      input=(Log_Var1 Log_Var2 Log_Var3)
      method=ml plot;
run;
The following is the R code:
finalmodel <- arima(LOG_Sale, order = c(0, 0, 4),
                    seasonal = list(order = c(0, 0, 1), period = 10),
                    include.mean = TRUE, xreg = xinput,
                    fixed = c(0, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA,
                              NA, NA, NA, NA, NA),
                    method = "ML")
summary(finalmodel)
As you can see, the model includes MA(4)(10) terms and 3 regressors; I defined a matrix xinput to hold the three regressors (Log_Var1, Log_Var2, Log_Var3).
The coefficients in the two outputs (SAS and R) are totally different, and I don't know why. Please help me out if you can point out what's wrong in the R code. I think the SAS code is quite typical and should be right, but I am new to R, so I guess the R code may be wrong...
Thanks.
The data is typical weekly time series data:
Date Log_Var1 Log_Var2 Log_Var3
3-Jan-11 13.47487027 8.65886635 9.096499556
10-Jan-11 14.1688108 9.182043773 9.096499556
17-Jan-11 14.3192497 9.175024027 9.096499556
24-Jan-11 14.54051181 9.1902397 9.096499556
31-Jan-11 14.33370089 9.1902397 9.096499556
7-Feb-11 13.76581591 9.431962767 9.326321786
14-Feb-11 14.09526221 9.29844282 9.326321786
21-Feb-11 14.61994905 9.29844282 9.326321786
28-Feb-11 14.94652204 8.700680735 9.326321786
7-Mar-11 14.71066636 9.026056892 9.348993004
As you can see from the SAS code, the model is ARIMA(0,0,4)(0,0,10) with three input series. It's straightforward in SAS, but I have read many R materials and cannot find any useful documents or examples showing how to build an ARIMA model with specific high orders of p, q or P, Q (subset ARIMA) together with external regressors.
The R code you see here actually runs and the output looks all right, but the coefficients are different from the SAS output, so I guess the ARIMA algorithms in R could be different from SAS, even though both use the ML method...
So the point is whether the R code is correct for the model ARIMA(0,0,4)(0,0,10). Note that q=(4)(10) means only lags 4 and 10, not lags 1 through 4 or 1 through 10; these are subset orders.
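Based on my reading of arima()'s fixed argument, here is how I would try to mimic SAS's multiplicative q=(4)(10), i.e. the factor (1 - theta1*B^4)(1 - theta2*B^10) with MA terms at lags 4, 10 and the cross-product lag 14. The R specification order = c(0, 0, 4) with seasonal order c(0, 0, 1) and period = 10 is also multiplicative, so fixing ma1..ma3 at zero should leave only the lag-4 and lag-10 factors; note the fixed vector then needs exactly one entry per coefficient (4 MA + 1 seasonal MA + intercept + 3 regressors = 9). A sketch, not verified against SAS:

finalmodel2 <- arima(LOG_Sale,
                     order = c(0, 0, 4),
                     seasonal = list(order = c(0, 0, 1), period = 10),
                     include.mean = TRUE,
                     xreg = xinput,              # matrix holding Log_Var1..Log_Var3
                     fixed = c(0, 0, 0, NA,      # ma1..ma3 fixed at 0, ma4 free
                               NA,               # sma1 (lag 10)
                               NA,               # intercept
                               NA, NA, NA),      # the three regressors
                     method = "ML")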
thanks.
I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10,000 rows) gives a rather complex model with 5 interaction terms: Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5. The glm() function could fit this model with 10,000 rows of data, but not with the whole dataset (320,000 rows).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5,
              data = sqlQuery(myconn, train_dat),
              family = binomial(link = "logit"),
              chunksize = 1000, maxit = 10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms, but it was not able to fit the same complex model even with the small dataset (10,000 rows).
Has anyone run into this problem before? Any other approach to run a complex logistic model with big data?
I've run into this problem many times, and it was always caused by the chunks processed by bigglm not containing all the levels of a categorical (factor) variable.
bigglm crunches data in chunks, and the default chunk size is 5000. If your categorical variable has, say, 5 levels, e.g. (a, b, c, d, e), and your first chunk (rows 1:5000) contains only (a, b, c, d) but no "e", you will get this error.
What you can do is increase the size of the "chunksize" argument and/or cleverly reorder your dataframe so that each chunk contains ALL the levels.
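For example (a sketch with the formula from the question and a made-up chunk size; shuffling the rows makes it likely that every chunk sees all levels):

set.seed(1)
df <- df[sample(nrow(df)), ]    # random row order spreads factor levels across chunks
fit <- bigglm(Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5,
              data = df,
              family = binomial(link = "logit"),
              chunksize = 10000,   # larger chunks also reduce the risk of a missing level
              maxit = 10)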
Hope this helps (at least somebody).
OK, so we were able to find the cause of this problem:
For one category in one of the interaction terms there are no observations. The glm() function was able to run and reported NA as the estimated coefficient, but bigglm doesn't like that. bigglm was able to run the model once I dropped this interaction term.
I'll do more research on how to deal with this kind of situation.
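A quick way to spot such empty cells before fitting, assuming the data has been pulled into a data frame df (a sketch):

with(df, table(X2, X3))        # a zero cell here means glm() returns NA for that term
xtabs(~ X4 + X5, data = df)    # check each interaction pair the same way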
I met this error before, though mine came from randomForest instead of biglm. The reason could be that the function cannot handle character variables, so you need to convert characters to factors. Hope this can help you.
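If that is the cause, something like this converts every character column to a factor (a sketch):

df[] <- lapply(df, function(x) if (is.character(x)) factor(x) else x)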