Lmer for longitudinal design - r

I have a longitudinal dataset where I have the following variables for each subject:
IV: 4 factors (factorA, factorB, factorC, factorD), each measured twice, at the beginning and at the end of an intervention.
DV: one outcome variable (behavior), also measured twice, at the beginning and at the end of the intervention.
I would like to create a model that uses the change in factorA, factorB, factorC, factorD (change from beginning to end of the intervention) to predict the change in behavior (again from beginning to end).
I thought I would compute the delta values of factorA, factorB, factorC, factorD (from pre to post intervention) and use them to predict the delta value of the DV (behavior). I would also like to covary out the absolute value of each factor (A, B, C and D), e.g. using only the value at the beginning of the intervention, to make sure I account for the effect that the absolute values (rather than the change) of these IVs may have on the DV.
Here is my dataset: (image not reproduced)
Here is my model so far:
Model <- lmer(Delta_behavior ~ Absolute_factorA + Absolute_factorB +
                Absolute_factorC + Absolute_factorD + Delta_factorA +
                Delta_factorB + Delta_factorC + Delta_factorD +
                (1 | Subject), data = a)
I think I am doing something wrong because I get this error:
Error: number of levels of each grouping factor must be < number of observations
What am I doing wrong? Is the data set structured weirdly? Should I not use the delta values? Should I use another test (not lmer)?

Because you have reduced your data to a single observation per subject, you don't need to use a multi-level/mixed model. The reason that lmer is giving you an error is that in this situation the between-subject variance is confounded with the residual variance.
You can probably go ahead and use a linear model (lm) for this analysis.
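A minimal sketch of what that could look like, assuming one row per subject and the column names from the question. The data here are simulated purely for illustration; the effect size and sample size are made up.

```r
# Simulated stand-in for the questioner's data frame `a` (hypothetical values):
# one row per subject, with baseline ("Absolute_") and change ("Delta_") columns.
set.seed(42)
n <- 60
a <- data.frame(Subject = factor(seq_len(n)))
for (f in c("A", "B", "C", "D")) {
  a[[paste0("Absolute_factor", f)]] <- rnorm(n)
  a[[paste0("Delta_factor", f)]] <- rnorm(n)
}
a$Delta_behavior <- 0.5 * a$Delta_factorA + rnorm(n)

# Same fixed effects as the lmer call, but without the (1 | Subject) term,
# because there is only one observation per subject.
fit <- lm(Delta_behavior ~ Absolute_factorA + Absolute_factorB +
            Absolute_factorC + Absolute_factorD + Delta_factorA +
            Delta_factorB + Delta_factorC + Delta_factorD,
          data = a)
summary(fit)
```

With real data, only the lm call is needed; the simulation is there so the snippet runs on its own.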
More technical detail
The equation for the distribution of the ith observation is something like [fixed-effect predictors] + eps(subject(i)) + eps(i) where eps(subject(i)) is the Normal error term of the subject associated with the ith observation, and eps(i) is the Normal residual error associated with the ith observation. If we only have one observation per subject, then each observation has two error terms that are unique to it. The sum of two Normal variables with zero means and variances of V1 and V2 is also Normal with mean zero and variance V1+V2 ... therefore V1 and V2 are jointly unidentifiable. You can use lmerControl to override the error if you really want to; lmer will return some arbitrary combination of V1, V2 estimates that sum to the total variance.
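Written out, the argument above looks roughly like this. The symbols x_i, beta, b and s(i) are notation introduced here for illustration; V_1 and V_2 are the two variances from the paragraph above.

```latex
% Model for observation i, belonging to subject s(i):
y_i = x_i^\top \beta + b_{s(i)} + \varepsilon_i,
\qquad b_{s(i)} \sim \mathcal{N}(0, V_1),
\qquad \varepsilon_i \sim \mathcal{N}(0, V_2)

% With exactly one observation per subject, the two error terms
% always appear together, so only their sum matters:
b_{s(i)} + \varepsilon_i \sim \mathcal{N}(0,\, V_1 + V_2)
```

Any pair (V_1, V_2) with the same sum gives the same likelihood, which is why the two variances cannot be separated.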
There's a similar example illustrated here.

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individual. Many of the observed balances are zero, but the lm still tries to calculate a non-zero value for them.
To overcome this I created a new variable that results in zero if X and Y are true.
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so it correctly results in zeros when the proposition is true and the calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
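A small sketch of this specification on simulated data, since no reproducible example was given. The variable names x and y and the true slope of 3 are assumptions made up for illustration.

```r
# Simulated data (hypothetical): a numeric predictor x and a 0/1 switch Balzero.
set.seed(1)
n <- 200
sim <- data.frame(
  x = runif(n, 1, 10),
  Balzero = rbinom(n, 1, 0.6)
)
sim$y <- 3 * sim$Balzero * sim$x + rnorm(n, sd = 0.5)

# No intercept, single term: the product of Balzero and x.
fit <- lm(y ~ -1 + Balzero:x, data = sim)
coef(fit)

# Predictions are exactly zero wherever Balzero is 0.
pred <- predict(fit, newdata = sim)
all(pred[sim$Balzero == 0] == 0)
```

Because both variables are numeric, Balzero:x contributes a single column to the model matrix, so the fit has exactly one coefficient.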
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

How does the stratum function work in the clusrank package in R?

I'm working with the clusrank package in R to analyse insect abundance data, by using the clusWilcox.test function for clustered data. As far as I understand, this package allows you to add both a 'cluster' and a 'stratum' function when using the rgl method to cluster by multiple factors.
When adding a single factor as either only a cluster or only a stratum function to my code, the Z- and p-value is the same for both codes, which seems to indicate that the stratum function works. However, when I take the first factor as a cluster, and add a second, different one as stratum, the output is still identical to the cluster-only model. This makes me think only the cluster is taken into account, and the stratum function is ignored.
This problem should be reproducible by making a random test dataset (in this example called df) with four columns: the dependent variable (in my case 'abundance'), the grouping factor whose effect I want to know (in my case 'treatment'), and two factors to add as cluster/stratum, let's call them 'factorA' and 'factorB'. In my own test dataset the factors have 2 levels each, in my real dataset 6 levels each, and the problem arises in both datasets.
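A minimal sketch of such a test dataset, using base R only. The sample size, the Poisson counts and the 0/1 and 1/2 codings are assumptions for illustration and have not been checked against what clusrank expects.

```r
# Hypothetical random test data with the four columns described above.
set.seed(123)
n <- 80
df <- data.frame(
  abundance = rpois(n, lambda = 5),             # dependent variable
  treatment = rep(c(0, 1), times = n / 2),      # grouping factor of interest
  factorA   = sample(rep(1:2, each = n / 2)),   # first cluster/stratum candidate
  factorB   = sample(rep(1:2, each = n / 2))    # second cluster/stratum candidate
)
str(df)
```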
My code is then as follows:
clusWilcox.test(abundance ~ treatment + cluster(factorA), data = df, method = "rgl")
This gives the same Z- and p-value as adding factorA as stratum; the only difference is that the number of clusters is now the number of rows in the test dataset instead of the number of factor levels.
clusWilcox.test(abundance ~ treatment + stratum(factorA), data = df, method = "rgl")
And both exactly the same Z- and p-values as:
clusWilcox.test(abundance ~ treatment + cluster(factorA) + stratum(factorB), data = df, method = "rgl")
Which makes me think that the stratum function is ignored in this third line of code. If you switch factorA and factorB, the same problem arises, though with different output values, as the calculation is now based on factorB instead of factorA.
Does anyone know what happens here? Is my code wrong, or is the stratum function indeed not taken into account?

Generalized linear model vs Generalized additive model

I'm trying to follow this paper: Using a data science approach to predict cocaine use frequency from depressive symptoms, where they use glm and gam with the Beck Depression Inventory. I found a similar dataset to test those models, but I'm having a hard time with both of them. For example, I have two variables, d64a and d64b, coded 1, 2, 3, 4, meaning that they're ordinal. Also, in the paper y2 only takes the value 1, but I also have an extra variable (the proportion of consumption) that could serve as the dependent variable.
For the GAM model I have:
b<-gam(y2~s(d64a)+s(d64b),data=DATOS2)
but I have the following error:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
A term has fewer unique covariate combinations than specified maximum degrees of freedom
Meanwhile for the glm, I have the following:
d<-glm(y2~d64a+d64b,data=DATOS2)
Since d64a and d64b are ordinal, do I have to use factor()?
The error message tells you that one or both of d64a and d64b do not have 9 (nine) unique values.
By default s(...) will create a basis with nine functions. You get this error if there are fewer than nine unique values in the covariate.
Check which covariates are affected using:
length(unique(DATOS2$d64a))
length(unique(DATOS2$d64b))
and see what the number of unique values is for each of the covariates you wish to include. Then set the k argument to the number returned above if it is less than nine. For example, assume that the above checks returned 5 and 7 unique values, then you would indicate this by setting k as follows:
b <- gam(y2 ~ s(d64a, k = 5) + s(d64b, k = 7), data = DATOS2)
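The same check can be automated. The sketch below uses base R only and a simulated stand-in for DATOS2 (values are made up); the cap of 10 is an assumed upper bound, so adjust it to whatever default applies in your mgcv version. It builds the formula but leaves the actual gam call commented out.

```r
# Hypothetical stand-in for DATOS2 with two ordinal covariates coded 1-4.
set.seed(7)
DATOS2 <- data.frame(
  d64a = sample(rep(1:4, 25)),
  d64b = sample(rep(1:4, 25)),
  y2   = rnorm(100)
)

# Count unique values per covariate and cap k at that count.
covs <- c("d64a", "d64b")
n_unique <- sapply(DATOS2[covs], function(x) length(unique(x)))
k_vals <- pmin(n_unique, 10L)

# Assemble y2 ~ s(d64a, k = ...) + s(d64b, k = ...) from the counts.
smooth_terms <- sprintf("s(%s, k = %d)", names(k_vals), k_vals)
form <- as.formula(paste("y2 ~", paste(smooth_terms, collapse = " + ")))
form

# b <- mgcv::gam(form, data = DATOS2)   # requires the mgcv package
```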

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1, family=binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
#sample data
set.seed(16)
data <- data.frame(
  mating = sample(0:1, 200, replace = TRUE),
  pop = sample(letters[1:4], 200, replace = TRUE),
  behv = scale(rpois(200, 10)),
  condition = scale(rnorm(200, 5))
)
data1 <- data[1:150, ]    # for model fitting
data2 <- data[51:200, -1] # for predicting
Then this will fit the model using data1 and predict into data2
model<-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1,
family=binomial(logit))
predict(model, newdata=data2, type="response")
Using type="response" will give you the predicted probabilities.
Now to make predictions, you don't have to use a subset from the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that population, and since I have to supply condition as well, I fix that at the mean of the original data. (I could have just put in 0 since the data is centered and scaled.) Now I specify a range of behv values I'm interested in. Here I just took the range of the original data and split it into 100 evenly spaced points. This will give me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
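The same idea extends to all populations at once. This is a sketch under the same fake-data assumptions as above; the names grid, behv_seq and prob_mat are made up here, and the model is refit on the full sample data so the snippet runs on its own.

```r
# Recreate fake data and fit the same model form.
set.seed(16)
data <- data.frame(
  mating = sample(0:1, 200, replace = TRUE),
  pop = sample(letters[1:4], 200, replace = TRUE),
  behv = as.numeric(scale(rpois(200, 10))),
  condition = as.numeric(scale(rnorm(200, 5)))
)
model <- glm(mating ~ behv * pop + I(behv^2) * pop + condition,
             data = data, family = binomial(logit))

# Every combination of a behv value and a population, condition held at its mean.
behv_seq <- seq(min(data$behv), max(data$behv), length.out = 100)
grid <- expand.grid(
  behv = behv_seq,
  pop = sort(unique(data$pop)),
  condition = mean(data$condition)
)
grid$prob <- predict(model, newdata = grid, type = "response")

# One column of predicted probabilities per population (behv varies fastest).
prob_mat <- matrix(grid$prob, nrow = length(behv_seq))
colnames(prob_mat) <- sort(unique(data$pop))
# matplot(behv_seq, prob_mat, type = "l", xlab = "behv", ylab = "P(mating)")
```

Uncommenting the matplot line draws one curve per population on a shared axis, which makes the interaction easier to see than four separate plots.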

Explanation of the formula object used in the coxph function in R

I am a complete novice when it comes to survival analysis. I am working on a project that requires I use the coxph function in the "survival" package, but I am running into trouble because I do not understand what is required by the formula object.
Most descriptions I can find about the function are as follows:
"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function. "
I know what needs to be on the left of the operator, the issue is what the function expects from the right-hand side.
Here is a link of what my data looks like (The actual data set is much larger, I'm only displaying the first 20 data points for brevity):
Short explanation of data:
-Row 1 is the header
-Each row after that is a separate patient
-The first column is the age of the patient at the time of the study
-columns 2 through 14 (headed by x2-x13), and 19 (x18) and 20 (x19) are covariates such as race, relationship status, medical conditions that take on either true (1) or false (0) values.
-columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take on whole number values greater than 0.
-The second to last column "sur" is the number of months survived, and "index" is whether or not that is a right-censored time (1 for true, 0 for false).
Given this data I need to plot a Cox Proportional hazard curve, but I end up with an incorrect plot because the right hand side of the formula object is wrong.
Here is my code, "temp4" is the name I gave to the data table:
library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))
I should also note that I have tried replacing the right hand side that you're seeing with the number 1, a period, leaving it blank. These methods produce a kaplan-meier curve.
The following is the console output:
Each new line is an example of the error produced depending on how I filter the data. (ie if I only include patients with ages greater than 85, etc.)
If someone could explain how it works, it would be greatly appreciated.
PS- I have searched for over a week for a solution, and I am asking for help here as a last resort.
You should not be using the prefix temp4$ if you are also using a data argument. The whole purpose of supplying a data argument is to allow dropping those prefixes in the formula.
seerCox <- coxph( Surv(sur, index) ~ . , data=temp4, singular.ok=TRUE)
The above would use all of the x-variables in your temp4 data.frame. This will use just the first 3:
seerCox <- coxph( Surv(sur, index) ~ x1+x2+x3 , data=temp4)
Exactly what the warnings signify depends on the data (as you have in one sense already exemplified by producing different sorts of collinearity with different subsets.) If you have collinear columns, then you get singularities in the inversion of the model matrix and the software will attempt to drop aliased columns with a warning. This is really telling you that you do not have enough data to build the large models you are attempting. Exploring that possibility with table calls is often informative.
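As a rough illustration of that kind of check, here is a toy example in base R. The data frame toy and its columns are invented for this sketch; x3 is deliberately constructed to be aliased with x2.

```r
# Toy binary covariates (hypothetical): x3 is fully determined by x2.
set.seed(1)
n <- 40
toy <- data.frame(x2 = rbinom(n, 1, 0.5))
toy$x3 <- 1 - toy$x2
toy$x4 <- rbinom(n, 1, 0.5)

# Cross-tabulation: empty cells on one diagonal reveal the perfect dependence.
table(x2 = toy$x2, x3 = toy$x3)

# The design matrix has fewer independent columns than columns in total,
# which is the situation that leads to a singularity warning.
X <- model.matrix(~ x2 + x3 + x4, data = toy)
qr(X)$rank
ncol(X)
```

On real data, running table on pairs of suspect covariates within each filtered subset shows which combinations never occur.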
Bottom line: This is not a problem with your formula construction, so much as it is a problem of not understanding the limitations of the chosen method with the dataset you have assembled. You need to be more careful about defining your goals. What is the highest priority in this research? Do you really need every variable? Is it possible to aggregate some of these anonymous variables into clinically meaningful categories such as diagnostic categories or comorbidities?
