One-way repeated measures ANOVA with unbalanced data - r

I'm new to R, and I've read these forums (for help with R) for awhile now, but this is my first time posting. After googling each error here, I still can't figure out and fix my mistakes.
I am trying to run a one-way repeated measures ANOVA with unequal sample sizes. Here is a toy version of my data and the code that I'm using. (If it matters, my real data have 12 bins with up to 14 to 20 values in each bin.)
## the data: average probability for a subject, given reaction time bin
bin1=c(0.37,0.00,0.00,0.16,0.00,0.00,0.08,0.06)
bin2=c(0.33,0.21,0.000,1.00,0.00,0.00,0.00,0.00,0.09,0.10,0.04)
bin3=c(0.07,0.41,0.07,0.00,0.10,0.00,0.30,0.25,0.08,0.15,0.32,0.18)
## creating the data frame
# dependent variable column
probability=c(bin1,bin2,bin3)
# condition column
bin=c(rep("bin1",8),rep("bin2",11),rep("bin3",12))
# subject column (in the order that will match them up with their respective
# values in the dependent variable column)
subject=c("S2","S3","S5","S7","S8","S9","S11","S12","S1","S2","S3","S4","S7",
"S9","S10","S11","S12","S13","S14","S1","S2","S3","S5","S7","S8","S9","S10",
"S11","S12","S13","S14")
# putting together the data frame
dataFrame=data.frame(cbind(probability,bin,subject))
## one-way repeated measures anova
test=aov(probability~bin+Error(subject/bin),data=dataFrame)
These are the errors I get:
Error in qr.qty(qr.e, resp) :
invalid to change the storage mode of a factor
In addition: Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors
3: In aov(probability ~ bin + Error(subject/bin), data = dataFrame) :
Error() model is singular
Sorry for the complexity (assuming it is complex; it is to me). Thank you for your time.

For an unbalanced repeated-measures design, it might be easiest to
use lme (from the nlme package):
## this should be the same as the data you constructed above, just
## a slightly more compact way to do it.
datList <- list(
bin1=c(0.37,0.00,0.00,0.16,0.00,0.00,0.08,0.06),
bin2=c(0.33,0.21,0.000,1.00,0.00,0.00,0.00,0.00,0.09,0.10,0.04),
bin3=c(0.07,0.41,0.07,0.00,0.10,0.00,0.30,0.25,0.08,0.15,0.32,0.18))
subject=c("S2","S3","S5","S7","S8","S9","S11","S12",
"S1","S2","S3","S4","S7","S9","S10","S11","S12","S13","S14",
"S1","S2","S3","S5","S7","S8","S9","S10","S11","S12","S13","S14")
d <- data.frame(probability=do.call(c,datList),
bin=paste0("bin",rep(1:3,sapply(datList,length))),
subject)
library(nlme)
m1 <- lme(probability~bin,random=~1|subject/bin,data=d)
summary(m1)
The only real problem is that some aspects of the interpretation etc.
are pretty far from the classical sum-of-squares-decomposition approach
(e.g. it's fairly tricky to do significance tests of variance components).
Pinheiro and Bates (Springer, 2000) is highly recommended reading if you're
going to head in this direction.
It might be a good idea to simulate/make up some balanced data and do the
analysis with both aov() and lme(), look at the output, and make sure
you can see where the correspondences are/know what's going on.

Related

Why am I getting different statistical outputs than my partner using the same code?

I'm trying to fit a classification tree to the OJ dataset using the ISLR2 textbook. The response variable is "Purchase" which takes one of two values: "MM" or "CH".
library(ISLR2)
library(tree)
# (a) Create a training set containing a random sample of 800 obs
# and a test set containing the remaining observations.
set.seed(1)
train <- sample(1:nrow(OJ), 800)
# (b) Fit a regression tree to the training set. Plot the tree, and
# calculate the test MSE.
OJ.test <- OJ[-train, ]
tree.oj <- tree(Purchase~., OJ, subset = train) ##Produces error "NAs introduced by coercion"
plot(tree.oj) ##Produces error "Cannot plot singlenode tree"
My question is, does something seem wrong with my code or could I have an issue with R studios on my computer? I have the same code as my class partner and she can run the code fine. We also had the same code for our last assignment, but when I ran the code I produced different statistics in repeated runs. This has happened a few other times and it leads me to believe there's an issue with my computer and R, rather than the code. Any suggestions on where to start to resolve this?

Imputing missing observation

I am analysing a dataset with over 450k rows about 100k rows in one of the columns I am looking at (pa1min_) has NA values, due to non-responses and other random factors. This column deals with workout times in minutes.
I don't think it makes sense to fill the NA values with the mean or median given that it's nearly a quarter of the data and the biases that could potentially create. I would like to impute the missing observations with a linear regression. However, I receive an error message:
Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This is my code:
# imputing using multiple imputation deterministic regression
imp_model <- mice(brfss2013, method="norm.predict", m=1)
# store data
data_imp <- complete(imp_model)
# multiple imputation
imp_model <- mice(brfss2013, m=5)
# building predictive mode
fit <- with(data=imp_model, lm(y ~ x + z))
# combining results
combined <- pool(fit)
Here is a link to the data (compressed)
Data
Note: I really just want to fill impute for one column...the other columns in the dataframe are a mixture of characters, integers and factors, some with more than 2 levels.
Similar to what MrFlick mentioned, you are somewhat short in RAM.
Try running the algorithm on 1% of your data, and if you succeed, you should try checking out the bigmemory package for doing in-disk computations.
I also encourage you to check if the model you fit on your data is actually good without bayesian imputation, because the fact of trying to have perfect data could not be much more beneficial than just imputating mean/median/first/last values on your data.
Hope this helps.

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but am coming up with an error and a warning, that I can not resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; no missing data. My dependent variable is binary, all other variables (beside the ID variable PTNO) are continous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example is:
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL, rnd = list(PTNO=~1),
family = poisson(link = log), data = stackdata, lambda=100,
control = list(print.iter=TRUE,start=c(1,rep(0,29)),q.start=0.7))
The error message is displayed below. Specficially, I do not believe there are any NAs in the dataset and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downladed here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on #Kristofersen comment above. It is indeed the start vector that messes your analysis up.
If I run
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL,
rnd = list(PTNO=~1),
family = binomial(),
data = stackdata,
lambda=100,
control = list(print.iter=TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You have copied the example with poisson regression and you need to tweak the code to your situation. I have no idea about whether the output makes sense.
Quick note: I ran with the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks then poisson may be reasonable (and it also converges), but you need to recode your outcome as the two groups are defined as 1 and 2 and that will certainly mess up the poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.

Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10000) gives a rather complex model with 5 interaction terms: Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5. The glm() function could fit this model with 10000 rows of data, but not with the whole dataset (320,000).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5,
data=sqlQuery(myconn,train_dat),family=binomial(link="logit"),
chunksize=1000, maxit=10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms. but bigglm was not able to fit the same model with a small dataset (10000 rows).
Has anyone run into this problem before? Any other approach to run a complex logistic model with big data?
I've run into this problem many times and it was always caused by the fact that the the chunks processed by the bigglm did not contain all the levels in a categorical (factor) variable.
bigglm crunches data by chunks and the default size of the chunk is 5000. If you have, say, 5 levels in your categorical variable, e.g. (a,b,c,d,e) and in your first chunk (from 1:5000) contains only (a,b,c,d), but no "e" you will get this error.
What you can do is increase the size of the "chunksize" argument and/or cleverly reorder your dataframe so that each chunk contains ALL the levels.
hope this helps (at least somebody)
Ok so we were able to find the cause for this problem:
for one category in one of the interaction terms, there's no observation. "glm" function was able to run and provide "NA" as the estimated coefficient, but "bigglm" doesn't like it. "bigglm" was able to run the model if I drop this interaction term.
I'll do more research on how to deal with this kind of situation.
I met this error before, thought it was from randomForest instead of biglm. The reason could be the function cannot handle character variables, so you need to convert characters to factors. Hope this can help you.

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is around step 3. What I would like to do here is to be able to use information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves: i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt to do this: I'd appreciate any help that you can give! Once I can get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!

Resources