I am running a GLMM using glmer() in R:
glmer(survive ~ fyear + site + fyear * site.x + (1|fyear),
family = binomial(link = logexp(shaffer.sub$exposure)),
data = shaffer.sub)
where survive is 0 or 1 depending on whether the nest was successful. Here is what the data look like:
structure(list(id = structure(1:7, .Label = c("1", "2", "3",
"4", "5", "6", "7"), class = "factor"), year.x = structure(c(1L,
1L, 2L, 3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1999"), class = "factor"),
survive = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), fyear = structure(c(1L, 1L, 2L,
3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1999"), class = "factor"),
site.x = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 1L), .Label = c("N",
"S"), class = "factor")), .Names = c("id", "year.x", "survive",
"fyear", "site.x"), row.names = c(NA, -7L), class = "data.frame")
but I get this warning message:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0299425 (tol = 0.001, component 12)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
I was told I should not use the same random factor as a fixed effect on the same model.
In the end, I would like an output where I can see the year, site, and year:site interaction effects, as in an ANOVA table. Is this possible? I've been trying summary(aov(model)), but this doesn't work, and neither does anova(model).
I get this error for the aov() command:
Error in summary(aov(syearXsite)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': Error in if (fixed.only) { : argument is not interpretable as logical
How can I see the effect of these variables on survival?
Whoever told you not to use a categorical input variable (fyear) as both a fixed and a random effect was correct. It's hard to know exactly what to recommend, because it depends on the number of years and sites you have in your data set. Is the data you linked to all of your data (I hope not), or just the first few rows? How many years, how many sites, and how many total observations do you have?
If you want to treat year as random and site as fixed (which would be sensible if you have only two sites (N vs S as seen in your data) and quite a few years, e.g. more than 5) then you could fit:
g1 <- glmer(survive~site.x+(site.x|fyear),
family=binomial(link=logexp(shaffer.sub$exposure)),
data=shaffer.sub)
I don't know what site vs site.x are: I only see site.x in your data snippet.
To get information, try summary(g1). (That will only give you variances for random effects, not for fixed effects; GLMMs don't operate in the same "variance explained" mode as ANOVA does, in particular because the variances explained by different terms usually do not add up to the total variance.)
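If you want a test of the site effect rather than a variance decomposition, one common option is a likelihood ratio test between nested models. The following is only a sketch under stated assumptions: it uses simulated data (the data frame sim and the models g_full and g_null are made up for illustration), an ordinary logit link instead of logexp(), which is not defined in this thread, and a random intercept for year only. It runs only if lme4 is installed.

```r
# Sketch: likelihood ratio test for a fixed effect in a binomial GLMM.
# All data here are simulated; names are hypothetical.
lrt <- NULL
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(1)
  n_years <- 10
  sim <- data.frame(
    fyear  = factor(rep(seq_len(n_years), each = 40)),
    site.x = factor(rep(c("N", "S"), times = n_years * 20))
  )
  # Year-level random intercepts plus a site effect on the logit scale
  year_effect <- rnorm(n_years, sd = 0.5)
  eta <- -0.2 + 0.8 * (sim$site.x == "S") + year_effect[as.integer(sim$fyear)]
  sim$survive <- rbinom(nrow(sim), size = 1, prob = plogis(eta))

  # Full model with site, and the nested model without it
  g_full <- lme4::glmer(survive ~ site.x + (1 | fyear),
                        family = binomial, data = sim)
  g_null <- lme4::glmer(survive ~ 1 + (1 | fyear),
                        family = binomial, data = sim)

  # Compare the two fits; the table has one row per model
  lrt <- anova(g_null, g_full)
  print(lrt)
}
```

The same comparison should carry over to a model with your custom link, provided both models are fitted to the same rows of data.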
I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is the best choice for proportional data. But it doesn't seem to be working. It runs, but gives the above warning and outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer values without the sample totals from which the proportions were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
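As a minimal sketch of the weights approach, the example below uses the six sample rows from the question and base R's glm() with a linear SST_mean term as a stand-in, so that it runs without extra packages; gam() accepts a weights argument in the same way. The object names dat, m_weights and m_cbind are made up for illustration.

```r
# Sample rows from the question (counts, totals, and one covariate)
dat <- data.frame(
  WC.Strandings    = c(18, 10, 0, 3, 5, 25),
  Total.Strandings = c(23, 35, 5, 49, 23, 35),
  SST_mean         = c(7.4066298185553, 7.49153086390094, 9.28247524767124,
                       10.8654859624361, 7.4066298185553, 7.49153086390094)
)
dat$prop <- dat$WC.Strandings / dat$Total.Strandings

# Proportion as response, with the totals supplied as weights
m_weights <- glm(prop ~ SST_mean, family = binomial,
                 weights = Total.Strandings, data = dat)

# Equivalent two-column form: cbind(successes, failures)
m_cbind <- glm(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ SST_mean,
               family = binomial, data = dat)

# Both specifications describe the same model, so the coefficients agree
all.equal(coef(m_weights), coef(m_cbind), tolerance = 1e-6)
```

Either form gives the model the number of trials behind each proportion, which is what the warning is asking for.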
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but then you'll not be able to use AIC and related tools that expect a real likelihood.
If your proportions are true proportions, then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1) (but technically not 0 or 1; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
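I also found that, rather than computing a proportion, passing the two count columns through cbind() removed the warning. Note that with a binomial family the two columns must be successes and failures (not successes and totals), e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")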
I am using the semTools package to carry out EFA using categorical data. The efaUnrotate() function requires variables as ordered factors.
I am trying to convert all of my existing factor variables into ordered factors using some simple code, which unfortunately does not seem to work. Does anyone have an explanation for this?
My data:
test <- structure(list(fp_weightloss = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), fp_gripstrength = structure(c(1L,
2L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
fp_walktime = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), fp_metmins = structure(c(2L, 1L,
1L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
My code:
test_ord <- as.data.frame(sapply(test, as.ordered))
sapply(test_ord, class)
Results in no change:
fp_weightloss fp_gripstrength fp_walktime fp_metmins
"factor" "factor" "factor" "factor"
When I would expect:
class(as.ordered(test$fp_weightloss))
[1] "ordered" "factor"
The problem is sapply: best avoid it entirely, since its implicit conversions often invisibly mess with data, and they do here. Use lapply instead:
test_ord <- as.data.frame(lapply(test, as.ordered))
In general I prefer using vapply since it handles non-list return values, but getting vapply to work with S3 classes doesn’t seem possible.
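To make the difference concrete, here is a small self-contained sketch on a cut-down, hypothetical version of the data (two columns only, built as a plain data frame). It shows that sapply() collapses the columns into a matrix, which cannot hold factors, while lapply() keeps each column as an ordered factor.

```r
# Cut-down stand-in for the question's data (hypothetical values)
test <- data.frame(
  fp_weightloss   = factor(c(0, 0, 0, 0, 0, 0), levels = c(0, 1)),
  fp_gripstrength = factor(c(0, 1, 0, 0, 0, 0), levels = c(0, 1))
)

# sapply() simplifies its result to a matrix, so the factor class is lost
as_matrix <- sapply(test, as.ordered)
is.matrix(as_matrix)

# lapply() returns a list of ordered factors, which converts cleanly
test_ord <- as.data.frame(lapply(test, as.ordered))
sapply(test_ord, is.ordered)

# Variant that modifies the columns in place and keeps the object's class
test[] <- lapply(test, as.ordered)
```

The in-place form with test[] may be preferable when the object is a tibble, since it avoids the round trip through as.data.frame().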
I have the following long dataset (R dataframe called "dat"). It has around 10,000 observations of 1200 children, with a maximum of 10 observations per child. These observations are weight of the child collected at different time points in early childhood, starting from birth until age 5 years. Age in the dataset is given in days (variable = agedays) and weight in kilograms in variable wtkg.
Dataset
structure(list(subjid = c(1001L, 1001L, 1001L, 1001L), sex = structure(c(2L, 2L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"),
agedays = c(0L, 2L, 30L, 107L), wtkg = c(3.78, 3.64, 4.71,
6.5), Uobservations = c(10L, 10L, 10L, 10L), BMI_group = structure(c(2L,
2L, 2L, 2L), .Label = c("normal", "obese", "overweight",
"true_NA", "underweight"), class = "factor"), GWG_T3_cat2 = structure(c(2L,
2L, 2L, 2L), .Label = c("T3notexcessive", "T3excessive"), class = "factor")), .Names = c("subjid", "sex", "agedays", "wtkg", "Uobservations", "BMI_group", "GWG_T3_cat2"), row.names = c(NA, 4L), class = "data.frame")
I want to use the SITAR package to study the velocity, tempo and size of the children according to excessive gestational weight gain in trimester 3 (versus not excessive weight gain in trimester 3 ) (variable = GWG_T3_cat2) among obese mothers (variable = BMI_group).
I tried running a model as seen in this link: https://www.rdocumentation.org/packages/sitar/versions/1.0.9
m1 <- sitar(x=agedays, y=wtkg, id=subjid, data=dat, df=2)
But I get an error:
Error in nlme.formula(y ~ fitnlme(x, s1, s2, a, b, c), fixed = s1 + s2 +:
Singularity in backsolve at level 0, block 1
I would really appreciate it if anyone could help solve this issue.
I am wishing to run a linear mixed model on a dependent variable DV that is collected under two different Condition at three different Timepoint. The data is structured as follows:
## dput(head(RawData,5))
structure(list(Participant = structure(c(2L, 2L, 2L, 2L, 4L),
.Label = c("Jessie", "James", "Gus", "Hudson", "Flossy",
"Bobby", "Thomas", "Alfie", "Charles", "Will", "Mat", "Paul", "Tim",
"John", "Toby", "Blair"), class = "factor"),
xVarCondition = c(1, 1, 0, 0, 1),
Measure = structure(c(1L, 2L, 3L, 4L, 1L),
.Label = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12"), class = "factor"),
Sample = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("1", "2"), class = "factor"),
Condition = structure(c(2L, 2L, 1L, 1L, 2L),
.Label = c("AM", "PM"), class = "factor"),
Timepoint = structure(c(2L, 2L, 2L, 2L, 1L),
.Label = c("Baseline", "Mid", "Post"), class = "factor"),
DV = c(83.6381348645853, 86.9813802115179, 69.2691666620429,
71.3949807856125, 87.8931998204771)),
.Names = c("Participant", "xVarCondition", "Measure",
"Sample", "Condition", "Timepoint", "DV"),
row.names = c(NA, 5L), class = "data.frame")
Each Participant performs two trials per Condition across three Timepoints as depicted by Measure; however, there are missing data so not necessarily 12 levels per participant. The column xVarCondition is simply a dummy variable that includes a 1 for each entry of AM in Condition. The column Sample refers to the 2 trials for each Condition at each Timepoint.
I am an R user but the statistician is a SAS user who believes the code for the model should be:
proc mixed data=RawData covtest cl alpha=&alpha;
class Participant Condition Timepoint Measure Sample;
model &dep=Condition Timepoint/s ddfm=sat outp=pred residual noint;
random int xVarCondition xVarCondition*TimePoint*Sample
TimePoint/subject=Participant s;
The above SAS code gives sensible answers and is working perfectly. We believe the resulting lme4 syntax for the above model to be:
TestModel = lmer(DV ~ Condition + Timepoint +
(1 | Participant/Timepoint) +
(0 + xVarCondition | Participant) +
(1 | Participant:xVarCondition:Measure), data = RawData)
However, I get the following error when running this model:
Error: number of levels of each grouping factor must be < number of observations
Are the random effects specified correctly?
I can't quite tell from your description, but most likely your Participant:xVarCondition:Measure term constructs a grouping variable that has no more than one observation in each level of classification, which makes the (1|Participant:xVarCondition:Measure) term redundant with the residual error term that is always included in an lmer model. You can override the error if you really want to by including
control=lmerControl(check.nobs.vs.nlev = "ignore")
in your function call, but (if I've diagnosed the problem correctly) this will lead to the residual variance and the Participant:xVarCondition:Measure variance being jointly unidentifiable. Such unidentifiability usually doesn't cause any problems with the rest of the model, but I am more comfortable with an identifiable model (there's always the possibility that such unidentifiability will lead to numerical issues).
There's a similar example here.
You can check my conjecture as follows:
ifac <- with(RawData,
interaction(Participant,xVarCondition,Measure,drop=TRUE))
length(levels(ifac)) == nrow(RawData)
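To see what this check reports in each situation, here is a toy sketch with made-up data (the data frames toy and toy_rep are hypothetical, not the poster's RawData). With one row per Participant/xVarCondition/Measure combination the comparison is TRUE, which is the case that triggers the error; with replicated rows it is FALSE.

```r
# One row per combination: 3 participants x 2 conditions x 4 measures
toy <- expand.grid(Participant   = c("James", "Hudson", "Gus"),
                   xVarCondition = c(0, 1),
                   Measure       = factor(1:4))

ifac <- with(toy, interaction(Participant, xVarCondition, Measure, drop = TRUE))
one_per_level <- length(levels(ifac)) == nrow(toy)  # as many levels as rows
max_per_level <- max(table(ifac))                   # largest group size

# Duplicate every row so each combination is observed twice
toy_rep <- rbind(toy, toy)
ifac_rep <- with(toy_rep,
                 interaction(Participant, xVarCondition, Measure, drop = TRUE))
one_per_level_rep <- length(levels(ifac_rep)) == nrow(toy_rep)
```

Looking at table(ifac) directly can also help when only some levels have a single observation.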
I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with the column 18. The column 18 is a factor and contains data which can be used as indicator of outlier.
My data frame is ufo, and I remove column 18 as follows:
ufo2 <- ufo[,1:17]
and then convert the 3 non-numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of outlier.scores are NA!
Is there a mistake in this code?
Is there another way to find outliers in such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
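As one possible preprocessing step, and only a sketch, the example below standardizes a few columns taken from the posted dput(head(ufo2)) output so that no single column dominates a Euclidean distance. The object names ufo_head, keep and ufo_scaled are made up for illustration, and whether standardizing is appropriate at all depends on what each column means.

```r
# A few columns from the posted sample rows
ufo_head <- data.frame(
  Month        = c(1L, 1L, 1L, 1L, 1L, 1L),
  Weight       = c(900, 850, 483, 110000, 5900, 1000),
  InvoiceValue = c(637, 775, 2896, 48812, 1459, 77),
  Score        = c(0.56366535234262, 0.33775439984787, 0.46825476121676,
                   1.414092583904, 0.69101737288291, 0.87827342721894)
)

# Drop constant columns: they carry no information for a distance,
# and scale() would turn them into NaN because their sd is 0
keep <- vapply(ufo_head, function(col) sd(col) > 0, logical(1))

# Center each remaining column to mean 0 and rescale to sd 1
ufo_scaled <- scale(ufo_head[, keep])
colnames(ufo_scaled)
```

This does not decide which columns belong in the distance in the first place; identifier-like columns still need to be considered separately.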
Take ProduceCode, for example: are you sure this should be part of your similarity measure?
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First, it comes with a much wider choice of algorithms; second, it is much faster than R; and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.