I want to fit a linear mixed model to a dependent variable DV that is collected under two different levels of Condition at three different levels of Timepoint. The data are structured as follows:
## dput(head(RawData,5))
structure(list(Participant = structure(c(2L, 2L, 2L, 2L, 4L),
.Label = c("Jessie", "James", "Gus", "Hudson", "Flossy",
"Bobby", "Thomas", "Alfie", "Charles", "Will", "Mat", "Paul", "Tim",
"John", "Toby", "Blair"), class = "factor"),
xVarCondition = c(1, 1, 0, 0, 1),
Measure = structure(c(1L, 2L, 3L, 4L, 1L),
.Label = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12"), class = "factor"),
Sample = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("1", "2"), class = "factor"),
Condition = structure(c(2L, 2L, 1L, 1L, 2L),
.Label = c("AM", "PM"), class = "factor"),
Timepoint = structure(c(2L, 2L, 2L, 2L, 1L),
.Label = c("Baseline", "Mid", "Post"), class = "factor"),
DV = c(83.6381348645853, 86.9813802115179, 69.2691666620429,
71.3949807856125, 87.8931998204771)),
.Names = c("Participant", "xVarCondition", "Measure",
"Sample", "Condition", "Timepoint", "DV"),
row.names = c(NA, 5L), class = "data.frame")
Each Participant performs two trials per Condition at each of three Timepoints, as indexed by Measure; however, there are missing data, so a participant does not necessarily have all 12 levels. The column xVarCondition is simply a dummy variable that is 1 for each entry of PM in Condition and 0 for each entry of AM (this is the coding shown in the sample rows above). The column Sample refers to the 2 trials for each Condition at each Timepoint.
I am an R user but the statistician is a SAS user who believes the code for the model should be:
proc mixed data=RawData covtest cl alpha=&alpha;
class Participant Condition Timepoint Measure Sample;
model &dep=Condition Timepoint/s ddfm=sat outp=pred residual noint;
random int xVarCondition xVarCondition*TimePoint*Sample
TimePoint/subject=Participant s;
The above SAS code gives sensible answers and is working perfectly. We believe the resulting lme4 syntax for the above model to be:
TestModel = lmer(DV ~ Condition + Timepoint +
(1 | Participant/Timepoint) +
(0 + xVarCondition | Participant) +
(1 | Participant:xVarCondition:Measure), data = RawData)
However, I get the following error when running this model:
Error: number of levels of each grouping factor must be < number of observations
Are the random effects specified correctly?
I can't quite tell from your description, but most likely your Participant:xVarCondition:Measure term constructs a grouping variable that has no more than one observation in each level, which makes the (1|Participant:xVarCondition:Measure) term redundant with the residual error term that is always included in an lmer model. You can override the error if you really want to by including
control=lmerControl(check.nobs.vs.nlev = "ignore")
in your function call, but (if I've diagnosed the problem correctly) this will lead to the residual variance and the Participant:xVarCondition:Measure variance being jointly unidentifiable. Such unidentifiability usually doesn't cause any problems with the rest of the model, but I am more comfortable with an identifiable model (there's always the possibility that such unidentifiability will lead to numerical issues).
There's a similar example here.
You can check my conjecture as follows:
ifac <- with(RawData,
interaction(Participant,xVarCondition,Measure,drop=TRUE))
length(levels(ifac)) == nrow(RawData)
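To see what that check reports when the conjecture holds, here is a minimal sketch on made-up data (the toy data frame is hypothetical, not your RawData): every Participant/xVarCondition/Measure combination occurs exactly once, so the number of levels equals the number of rows.

```r
# Hypothetical toy data: 2 participants x 2 conditions x 3 measures, one row each
toy <- expand.grid(Participant = c("A", "B"),
                   xVarCondition = 0:1,
                   Measure = factor(1:3))

# Same construction as the check above, applied to the toy data
ifac <- with(toy, interaction(Participant, xVarCondition, Measure, drop = TRUE))

# Number of rows falling into each level of the combined grouping factor
obs_per_level <- table(ifac)

max(obs_per_level)            # largest number of rows sharing one level
nlevels(ifac) == nrow(toy)    # TRUE means one observation per level
```

If the comparison is TRUE on your real data as well, the random intercept for that term cannot be separated from the residual variance.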
Related
I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is the best choice for proportional data. But it doesn't seem to be working. The model runs but gives the above warning, and the outputs don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer values without the sample totals from which the proportions were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
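As a rough illustration of the weights approach, here is a sketch using base R's glm() as a stand-in for gam() (an assumption made so the example has no package dependencies; the weights argument plays the same role). The counts are copied from the sample in the question, SST_mean is rounded, and the object names fit_w and fit_c are made up.

```r
# Counts copied from the question's sample; SST_mean rounded for brevity
dat <- data.frame(
  WC.Strandings    = c(18, 10, 0, 3, 5, 25),
  Total.Strandings = c(23, 35, 5, 49, 23, 35),
  SST_mean         = c(7.41, 7.49, 9.28, 10.87, 7.41, 7.49)
)
dat$prop <- dat$WC.Strandings / dat$Total.Strandings

# Proportion as the response, with the totals supplied as weights
fit_w <- glm(prop ~ SST_mean, family = binomial,
             weights = Total.Strandings, data = dat)

# Equivalent two-column response: successes and failures
fit_c <- glm(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ SST_mean,
             family = binomial, data = dat)

# The two parameterisations should give the same coefficients
all.equal(coef(fit_w), coef(fit_c))
```

Either form tells the model how many trials sit behind each proportion, which is what the warning is complaining about.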
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but then you won't be able to use AIC and related tools that expect a real likelihood.
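A small sketch of that trade-off, again with base R's glm() standing in for gam() and proportions taken from the question's sample (SST_mean rounded; fit_q is a made-up name): the quasibinomial fit accepts non-integer proportions and estimates a dispersion parameter, but it has no likelihood, so AIC is not available.

```r
# Proportions from the question's sample (successes / totals)
dat <- data.frame(
  prop     = c(18/23, 10/35, 0/5, 3/49, 5/23, 25/35),
  SST_mean = c(7.41, 7.49, 9.28, 10.87, 7.41, 7.49)
)

# Quasi-likelihood fit: no integer-count requirement on the response
fit_q <- glm(prop ~ SST_mean, family = quasibinomial, data = dat)

summary(fit_q)$dispersion   # estimated from the data rather than fixed at 1
AIC(fit_q)                  # NA, because there is no true likelihood
```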
If your proportions are true proportions then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1) (but technically not 0 or 1; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
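I also found that, rather than calculating a proportion, using cbind() on the two count columns removed the warning. Note that with a binomial family the second column must hold the failures (total minus successes), not the total, e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")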
I have a dataset with age as continuous and as a factor, sex as a factor and 4 groups.
structure(list(Age = c(9, 12, 16, 57), Age_1 = structure(c(2L,
3L, 3L, 7L), .Label = c("8", "1", "2", "3", "4", "5", "6", "7"
), class = "factor"), Sex = structure(c(2L, 1L, 2L, 1L), .Label = c("M",
"F", "U"), class = "factor"), N = structure(c(2L, 2L, 2L,
2L), .Label = c("0", "1"), class = "factor"), G = structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), L_1 =
structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), C_1 =
structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), G_1 =
structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), m = structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), A = c(1,
1, 1, 1)), row.names = c(NA, 4L), class = "data.frame")
I want to do logistic regression for each variable (Age, Age_1 and Sex) for each of the groups (N, G, L_1, C_1, G_1, m).
eg.
mylogit <- glm(N ~ Sex, data = logistic_s, family = "binomial")
mylogit <- glm(N ~ Age, data = logistic_s, family = "binomial")
I am using gtsummary for combining the variables in a table.
library(gtsummary)
tbl_n <-
tbl_uvregression(
logistic_s[c("N", "Age", "Sex", "Age_1")],
method = glm,
y = N,
method.args = list(family = binomial),
exponentiate = TRUE
)
tbl_n
This produces the output for one group (eg. N) with the variables Age, Age_1, Sex.
I want to repeat this with each of the groups (eg N, G, L_1 etc) and then combine the tables to make one combined table.
I am open to using different packages if there are other options that suit this better. I want to make a table that can be exported to Word.
I agree that some reproducible code would be helpful. I am not 100% certain what kind of output you're looking to get. You want to build some univariate logistic regression models, separately for 2 or more groups?
If that is correct, here is one way to go about it:
I will use the trial data set in the gtsummary package as an example. I'll make the grouping variable treatment (trt).
library(gtsummary)
library(tidyverse)
trial_subset <-
trial %>%
select(trt, response, age, marker, grade)
We'll begin by constructing univariate regression tables stratified by trt using the tbl_uvregression() function from the gtsummary package. They will be stored in a new column of the data frame called tbl_uv.
df_uv <-
trial_subset %>%
# group by trt, and nest data within group
group_by(trt) %>%
nest() %>%
# build univariate logistic regression models separately within grouping variable
mutate(
tbl_uv = map(
data,
~tbl_uvregression(
data = .x,
y = response,
method = glm,
method.args = list(family = binomial),
exponentiate = TRUE
)
)
)
#> # A tibble: 2 x 3
#> # Groups: trt [2]
#> trt data tbl_uv
#> <chr> <list> <list>
#> 1 Drug A <tibble [98 x 4]> <tbl_vrgr>
#> 2 Drug B <tibble [102 x 4]> <tbl_vrgr>
We can now use the tables saved in df_uv to merge them into a single table using the tbl_merge() function.
tbl_merge(
tbls = df_uv$tbl_uv, # list of the univariate logistic regression tables
tab_spanner = paste0("**", df_uv$trt, "**") # adding stars to bold the header
)
This produces a single table with one set of columns per treatment group. I hope this is helpful!
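If what you actually need is one set of univariate models per outcome column (N, G, L_1, ...) rather than per stratum, the same looping idea applies. Here is a minimal base-R sketch on simulated data; the toy data frame, the helper fit_one and the result or_table are all made up for illustration, and this is only one way to arrange it.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the real data: two binary outcomes, two predictors
toy <- data.frame(
  Age = round(runif(n, 5, 60)),
  Sex = factor(sample(c("M", "F"), n, replace = TRUE)),
  N   = rbinom(n, 1, 0.4),
  G   = rbinom(n, 1, 0.3)
)
outcomes   <- c("N", "G")
predictors <- c("Age", "Sex")

# Fit one univariate logistic regression and return its odds ratio(s)
fit_one <- function(outcome, predictor) {
  fit <- glm(reformulate(predictor, response = outcome),
             data = toy, family = binomial)
  est <- coef(summary(fit))[-1, , drop = FALSE]   # drop the intercept row
  data.frame(outcome = outcome, predictor = predictor, term = rownames(est),
             odds_ratio = exp(est[, "Estimate"]),
             p_value = est[, "Pr(>|z|)"], row.names = NULL)
}

# Every outcome/predictor pair, stacked into one long table
grid <- expand.grid(outcome = outcomes, predictor = predictors,
                    stringsAsFactors = FALSE)
or_table <- do.call(rbind, unname(Map(fit_one, grid$outcome, grid$predictor)))
or_table
```

The resulting data frame has one row per model term, so it can be reshaped or passed to whatever table package you use for the Word export.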
I have the following long dataset (R dataframe called "dat"). It has around 10,000 observations of 1200 children, with a maximum of 10 observations per child. These observations are weight of the child collected at different time points in early childhood, starting from birth until age 5 years. Age in the dataset is given in days (variable = agedays) and weight in kilograms in variable wtkg.
Dataset
structure(list(subjid = c(1001L, 1001L, 1001L, 1001L), sex = structure(c(2L, 2L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"),
agedays = c(0L, 2L, 30L, 107L), wtkg = c(3.78, 3.64, 4.71,
6.5), Uobservations = c(10L, 10L, 10L, 10L), BMI_group = structure(c(2L,
2L, 2L, 2L), .Label = c("normal", "obese", "overweight",
"true_NA", "underweight"), class = "factor"), GWG_T3_cat2 = structure(c(2L,
2L, 2L, 2L), .Label = c("T3notexcessive", "T3excessive"), class = "factor")), .Names = c("subjid", "sex", "agedays", "wtkg", "Uobservations", "BMI_group", "GWG_T3_cat2"), row.names = c(NA, 4L), class = "data.frame")
I want to use the SITAR package to study the velocity, tempo and size of the children according to excessive gestational weight gain in trimester 3 (versus not excessive weight gain in trimester 3 ) (variable = GWG_T3_cat2) among obese mothers (variable = BMI_group).
I tried running a model as seen in this link: https://www.rdocumentation.org/packages/sitar/versions/1.0.9
m1 <- sitar(x=agedays, y=wtkg, id=subjid, data=dat, df=2)
But I get an error:
Error in nlme.formula(y ~ fitnlme(x, s1, s2, a, b, c), fixed = s1 + s2 +:
Singularity in backsolve at level 0, block 1
I would really appreciate it if anyone could help solve this issue.
I want to make a function that calculates some pre-determined summary statistic measures that I can apply to any dataset. I'll start off with an example here, but this is for datasets that could have a variety of datatypes - such as character, factor, numerical, dates, containing null values, etc.
I can do this easily enough if the data is all numeric, but handling the IF scenarios with apply, sapply, etc. is where I run into trouble with the syntax.
When it's all numeric I'm fine, since I can just do new_df = data.frame(min = sapply(mydf, min).....etc....etc). I just can't get the syntax right when it's more complicated, as in my example below.
In the example below I have a data frame of 3 columns:
all numerical
numerical with a null
categorical column of data coded as a factor
I want to calculate the:
type...(character, factor, date, numeric, etc)
mean...when the data type is numeric, obviously, and excluding nulls
number of null values in the dataset
I think this is simple enough and I can run with it from here..
copy and paste this code and name as a variable for the data frame:
structure(list(allnumeric = c(10, 20, 30, 40), char_or_factor = structure(c(2L,
3L, 3L, 1L), .Label = c("bird", "cat", "dog"), class = "factor"),
num_with_null = c(10, 100, NA, NA)), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c(NA, -4L), class = "data.frame")
expected solution data frame (copy and assign to a variable):
structure(list(allnumeric = structure(c(3L, 2L, 1L), .Label = c("0",
"25", "numeric"), class = "factor"), char_or_factor = structure(c(2L,
NA, 1L), .Label = c("0", "character"), class = "factor"), num_with_null = structure(c(3L,
2L, 1L), .Label = c("2", "55", "numeric"), class = "factor")), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c("type", "mean",
"num_nulls"), class = "data.frame")
We can use sapply to loop over the columns, get the class, mean and number of NA elements, concatenate them with c(), and convert the result to a data.frame
as.data.frame(sapply(df1, function(x) c(class(x), mean(x, na.rm=TRUE),
sum(is.na(x)))), stringsAsFactors=FALSE)
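Since the question is really about branching on column type, here is a hedged variant that only computes the mean for numeric columns, which avoids the warning mean() gives on a factor. The helper name summarise_column is made up, and df1 is rebuilt from the question's dput so the sketch is self-contained.

```r
# Data frame rebuilt from the question
df1 <- data.frame(
  allnumeric     = c(10, 20, 30, 40),
  char_or_factor = factor(c("cat", "dog", "dog", "bird")),
  num_with_null  = c(10, 100, NA, NA)
)

# One column in, three named summary values out (all coerced to character)
summarise_column <- function(x) {
  c(type      = class(x)[1],                                    # first class only
    mean      = if (is.numeric(x)) mean(x, na.rm = TRUE) else NA,
    num_nulls = sum(is.na(x)))
}

result <- as.data.frame(sapply(df1, summarise_column), stringsAsFactors = FALSE)
result
```

Extra branches (for example for dates) can be added inside summarise_column in the same if/else style. Note that the type row reports "factor" for the categorical column, not "character" as in the expected output.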
I am running a GLMM using glmer() in R:
glmer(survive ~ fyear + site + fyear * site.x + (1|fyear),
family = binomial(link = logexp(shaffer.sub$exposure)),
data = shaffer.sub)
where survive is 0 or 1 depending if the nest was successful or not. Here you can see what the data looks like:
structure(list(id = structure(1:7, .Label = c("1", "2", "3",
"4", "5", "6", "7"), class = "factor"), year.x = structure(c(1L,
1L, 2L, 3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1999"), class = "factor"),
survive = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), fyear = structure(c(1L, 1L, 2L,
3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1999"), class = "factor"),
site.x = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 1L), .Label = c("N",
"S"), class = "factor")), .Names = c("id", "year.x", "survive",
"fyear", "site.x"), row.names = c(NA, -7L), class = "data.frame")
but I get this warning message:
*Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0299425 (tol = 0.001, component 12)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?*
I was told I should not use the same random factor as a fixed effect on the same model.
In the end, I would like to have an output where I can see the year, site and year:site interaction effects, as in an ANOVA table. Is this possible? I've been trying to use summary(aov(model)) but this doesn't work; anova(model) doesn't either.
I get this error for the aov() command:
*Error in summary`(aov(syearXsite))` :
error in evaluating the argument 'object' in selecting a method for function 'summary': Error in if (fixed.only) { : argument is not interpretable as logical*
How can I see the effect of these variables on survival?
Whoever told you not to use a categorical input variable (fyear) as both a fixed and a random effect was correct. It's hard to know exactly what to recommend, because it depends on the number of years and sites you have in your data set. Is the data you linked to all of your data (I hope not), or just the first few rows? How many years, how many sites and how many total observations do you have?
If you want to treat year as random and site as fixed (which would be sensible if you have only two sites (N vs S as seen in your data) and quite a few years, e.g. more than 5) then you could fit:
g1 <- glmer(survive~site.x+(site.x|fyear),
family=binomial(link=logexp(shaffer.sub$exposure)),
data=shaffer.sub)
I don't know what site vs site.x are: I only see site.x in your data snippet.
To get information, try summary(g1). (That will only give you variances for random effects, not for fixed effects; GLMMs don't operate in the same "variance explained" mode as ANOVA does, in particular because the variances explained by different terms usually do not add up to the total variance.)
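If an ANOVA-style table is what you're after, the closest analogue for this kind of model is an analysis-of-deviance table. Below is a minimal base-R sketch on simulated data using a plain glm(), so it has no random effect and no logexp link and is only a stand-in for the glmer fit above; the toy data frame and its effect sizes are made up.

```r
set.seed(42)
n <- 300
# Simulated nests: three years, two sites, survival more likely at site S
toy <- data.frame(
  fyear  = factor(sample(c("1994", "1995", "1999"), n, replace = TRUE)),
  site.x = factor(sample(c("N", "S"), n, replace = TRUE))
)
toy$survive <- rbinom(n, 1, plogis(0.3 + 0.8 * (toy$site.x == "S")))

# Fixed-effects-only logistic regression with the year-by-site interaction
fit <- glm(survive ~ fyear * site.x, family = binomial, data = toy)

# Sequential analysis of deviance: one row per term, with chi-square tests
dev_table <- anova(fit, test = "Chisq")
dev_table
```

For a mixed model the usual equivalent is a likelihood ratio test between nested fits, e.g. anova() on two glmer models that differ by the term of interest; I haven't run that on your data.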