Designating numeric variables as factors within lme4 - r

Is there any way to specify contrast matrices within an lmer call if the variable in question is not a factor outside the call? As an example, here is toy data for a 2 x 4 mixed design with group as a between-subjects factor and time as a within-subjects factor. I am testing the between-group differences in score at each level of time, but this is more of a technical question about whether numeric variables can be specified as factors within an lmer call (i.e. while keeping them numeric outside it).
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- rep(0:3, each = 8, length = 32)
group <- rep(c("A","B"), times =2, each = 4, length = 32)
df <- data.frame(id = id, group = group, time = time, score = score)
Note that time, which is a numeric variable, has been included in the model wrapped in a factor() call.
summary(lmer(score ~ group*factor(time) + (1|id), data = df))
This works fine on its own. However when I specify contrasts using the same factor(time) syntax:
summary(lmer(score ~ group*factor(time) + (1|id), data = df, contrasts = list(factor(time) = contr.helmert(4), group = contr.helmert(2))))
I get the error:
Error: unexpected '=' in "summary(lmer(score ~ group*factor(time) + (1|id), data = df, contrasts = list(factor(time) ="
...which I assume is down to my specifying the contrast matrices for the time variable within a factor() command.
And if I do not call time a factor in the contrasts = call...
summary(lmer(score ~ group*factor(time) + (1|id), data = df, contrasts = list(time = contr.helmert(4), group = contr.helmert(2))))
it returns a result but ignores that factor, with the warning messages....
1: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
2: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
3: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
4: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
5: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
6: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
7: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
8: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
9: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
10: In model.matrix.default(fixedform, fr, contrasts) :
variable 'time' is absent, its contrast will be ignored
I am aware that I am violating the syntax of lme4 here but I am trying to demonstrate what I would like to achieve, while acknowledging that I am ignorant of how to do so or if what I would like to do is even possible.
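One possible way around this, sketched here in base R rather than taken from the original post: the warning above shows that the contrasts list is handed to model.matrix.default(), which matches list names against the column names of the model frame. For a term written as factor(time), that column is literally named "factor(time)", so quoting the name makes it a valid list element. The toy data frame below is a simplified stand-in for the question's df, and applying the same list to lmer via contrasts = is an assumption based on that warning, not something verified here.

```r
# Simplified stand-in for the question's data (scores are arbitrary).
df <- data.frame(
  id    = rep(1:8, times = 4),
  group = rep(rep(c("A", "B"), each = 4), times = 4),
  time  = rep(0:3, each = 8),
  score = seq_len(32)
)

# Name the list element after the model-frame column, i.e. the literal
# string "factor(time)"; quoting it makes it a syntactically valid name.
ctr <- list("factor(time)" = contr.helmert(4), group = contr.helmert(2))

# model.matrix() matches the names in contrasts.arg against the model frame.
mm <- model.matrix(score ~ group * factor(time), data = df, contrasts.arg = ctr)

# The contrasts actually used for each factor are stored as an attribute.
attr(mm, "contrasts")
```

If lmer accepts the same list through its contrasts argument, time can stay numeric in df while being treated as a Helmert-coded factor inside the model.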

Related

Estimating Dynamic Difference in Difference in R

I've been trying to estimate the above regression using a multiple cross-section dataset, and I tried the did library without success. I have a large dataset and have already formatted the data so that I have an event-time dummy, but it gives an error. Treatment is in 2018, the outcome is emp, and the base period should be 2017.
I tried:
df4<-df1[complete.cases(df1$treat),]
df4<-df4[complete.cases(df4$emp),]
df4<-df4[(df4$year>=2014),]
df4$g<-ifelse(df4$treat==1,2018,0)
att1 <- att_gt(yname = "emp",
tname = "period",
gname = "G",
xformla = ~treat+factor(month)+factor(year),
data = df4,
panel=FALSE
)
and it gives me
Error in DRDID::drdid_rc(y = Y, post = post, D = G, covariates = covariates, :
Outcome regression model coefficients have NA components.
Multicollinearity (or lack of variation) of covariates is a likely reason.
In addition: Warning messages:
1: glm.fit: algorithm did not converge
2: In DRDID::drdid_rc(y = Y, post = post, D = G, covariates = covariates, :
glm algorithm did not converge
I also ran a regression using lm only, but it gave insignificant results, which should not be the case for my assignment:
ols1 <- lm(emp ~ relevel(factor(year), ref = "2017") * treat + factor(month),
data = df4)
summary(ols1)

How can I compare 3 binary variables in R?

I'm looking at debris ingestion in gulls. Each gull is listed by row. Columns contain the sex (0=male, 1=female), whether the bird ate debris (0=no, 1=yes), and whether I found any of a number of other items in its stomach. For this problem I'd like to see whether sex and presence of debris influence the number of birds with shells in their stomach (0=no shells, 1=shells). Debris prevalence is likely overdispersed and zero-inflated, but I'm not sure that matters if I'm using it as a factor to evaluate shell prevalence. Shell prevalence might be overdispersed and zero-inflated as well.
I've plotted the data and want to test whether the differences seen in the plot are significant.
But when I try to run a zero-inflated negative binomial model, I get many different errors depending on how I set it up.
library (aod)
library(MASS)
library (ggplot2)
library(gridExtra)
library(pscl)
library(boot)
library(reshape2)
mydata1 <- read.csv('D:/mp paper/analysis wkshts/stats files/FOdata.csv')
mydata1 <- within(mydata1, {
debris <- factor(debris)
sex <- factor(sex)
Shell_frags <- factor(Shell_frags)
})
summary(mydata1)
ggplot(mydata1, aes(Shell_frags, fill=debris)) +
stat_count() +
facet_grid(debris ~ sex, margins=TRUE, scales="free_y")
m1 <- zeroinfl((Shell_frags ~ sex + debris), data = mydata1, dist = "negbin", EM = TRUE)
summary(m1)
Error message:
Error in if (all(Y > 0)) stop("invalid dependent variable, minimum count is not zero") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(Y, 0) : ‘>’ not meaningful for factors
> summary(m1)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': object 'm1'
not found
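A hedged reading of the first error, not part of the original post: Shell_frags was converted to a factor, and comparing a factor with 0 yields NA, which is what the check if (all(Y > 0)) trips over. Because the outcome is coded 0/1 rather than as a count, one alternative worth considering is a logistic regression on the numeric 0/1 column. The sketch below uses simulated data as a hypothetical stand-in for FOdata.csv; the effect sizes are invented purely for illustration.

```r
# Comparing an unordered factor with 0 gives NA (with a warning), so
# all(Y > 0) is NA and `if (NA)` raises "missing value where TRUE/FALSE needed".
Y <- factor(c(0, 1, 1, 0))
cmp <- suppressWarnings(all(Y > 0))
is.na(cmp)

# Hypothetical stand-in for FOdata.csv: three 0/1 columns, one row per gull.
set.seed(42)
gulls <- data.frame(
  sex    = rbinom(200, 1, 0.5),
  debris = rbinom(200, 1, 0.4)
)
gulls$Shell_frags <- rbinom(200, 1, plogis(-0.5 + 0.8 * gulls$debris))

# Logistic regression: the binary response stays numeric (0/1),
# while the predictors are treated as factors.
m_logit <- glm(Shell_frags ~ factor(sex) + factor(debris),
               data = gulls, family = binomial)
summary(m_logit)
```

Whether a binomial model suits the real data is a judgment call that depends on the study design; the point here is only that a 0/1 response does not fit what zeroinfl() expects.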

R code: Error in model.matrix.default(mt, mf, contrasts) : Variable 1 has no levels

I am trying to build a logistic regression model with diagnosis as the response (a two-level factor: B, M).
I get an error when building the model:
Error in model.matrix.default(mt, mf, contrasts) :
variable 1 has no levels
I am not able to figure out how to solve this issue.
R Code:
Cancer <- read.csv("Breast_Cancer.csv")
## Logistic Regression Model
lm.fit <- glm(diagnosis~.-id-X, data = Cancer, family = binomial)
summary(lm.fit)
Dataset Reference: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Your problem is similar to the one reported here on the randomForest classifier.
Apparently glm checks through the variables in your data and throws an error because X contains only NA values.
You can fix that error in one of two ways:
either drop X completely from your dataset by setting Cancer$X <- NULL before handing it to glm, and leave X out of your formula (glm(diagnosis~.-id, data = Cancer, family = binomial));
or add na.action = na.pass to the glm call (which tells glm to keep rows containing NA rather than dropping them) while still excluding X in the formula itself (glm(diagnosis~.-id-X, data = Cancer, family = binomial, na.action = na.pass))
However, please note that you still have to provide the diagnosis variable in a form digestible by glm, meaning either a numeric vector with values 0 and 1, a logical vector, or a factor:
"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" - from the glm-doc
Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).
On my end, this still leaves some warnings, but I think those are coming from the data or your feature selection. It clears the blocking errors :)
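The first fix can be sketched on simulated data. The data frame below is a hypothetical stand-in for Breast_Cancer.csv with only two predictors and random labels; the all-NA column X mimics the empty trailing column described above.

```r
# Hypothetical stand-in for Breast_Cancer.csv, including an all-NA column X.
set.seed(1)
n <- 100
Cancer <- data.frame(
  id           = seq_len(n),
  diagnosis    = sample(c("B", "M"), n, replace = TRUE),
  radius_mean  = rnorm(n, mean = 14, sd = 3),
  texture_mean = rnorm(n, mean = 19, sd = 4),
  X            = NA
)

# Drop the empty column so that NA handling does not remove every row.
Cancer$X <- NULL

# Give glm a factor response: the first level ("B") counts as failure.
Cancer$diagnosis <- as.factor(Cancer$diagnosis)

lm.fit <- glm(diagnosis ~ . - id, data = Cancer, family = binomial)
summary(lm.fit)
```

With the real data, the remaining warnings mentioned above may still appear, since they depend on the predictors rather than on X.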

Running a linear model in R with spreadsheet data

I have a dataset consisting of 106 individuals of two types - a and b with various variables, for example age and gender. I want to run a linear model which predicts whether each individual is of type a or type b based on the co-variates.
I read in the values for age, gender and the type label for each individual using:
data = read.xlsx("spreadsheet.xlsx",2, as.is = TRUE)
age = data$age
gender = data$gender
type = data$type
where each is of the form:
age = [28, 30, 19, 23 etc]
gender = [male, male, female, male etc]
type = [a b b b]
Then I try to set up the model using:
model1 = lm(type ~ age + gender)
but I get this error message:
Warning messages:
1: In model.response(mf, "numeric") :
using type="numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors
I've tried changing the format of type, age and gender using:
age = as.numeric(as.character(age))
gender = as.character(gender)
type = as.character(type)
But this doesn't work!
You can't use a linear regression model with a factor as your response variable, which is what you are attempting to do here (type is your response variable). Linear regression requires a numeric response. You should instead look at classification models.
As Roland points out, you may wish to start by restating your "type" variable as a logical (binary) variable. Rather than a factor called "type" with two levels "a" and "b", you might create a new variable called "is.type.a", which would contain TRUE or FALSE.
You could then try a logistic regression based on a binomial distribution
model <- glm(is.type.a ~ age + gender,data=data,family="binomial")
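Put together, the suggestion might look like the following minimal sketch. The data frame is simulated as a stand-in for the spreadsheet (106 rows, random values), so the fitted coefficients mean nothing; only the mechanics are illustrated.

```r
# Simulated stand-in for the spreadsheet: 106 individuals.
set.seed(7)
data <- data.frame(
  age    = sample(18:60, 106, replace = TRUE),
  gender = sample(c("male", "female"), 106, replace = TRUE),
  type   = sample(c("a", "b"), 106, replace = TRUE)
)

# Restate the two-level label as a logical response.
data$is.type.a <- data$type == "a"

# Logistic regression; TRUE is treated as success.
model <- glm(is.type.a ~ age + gender, data = data, family = "binomial")

# Fitted probabilities of being type a.
head(predict(model, type = "response"))
```

Passing data = data to glm, rather than copying columns into separate vectors, also keeps the variables aligned for later calls to predict().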

Day-ahead using GLM model in R

I have the following code to get a day-ahead prediction for load consumption in 15 minute interval using outside air temperature and TOD(96 categorical variable, time of the day). When I run the code below, I get the following errors.
i = 97:192
formula = as.formula(load[i] ~ load[i-96] + oat[i])
model = glm(formula, data = train.set, family=Gamma(link=vlog()))
I get the following error after the last line using glm(),
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
And the following error shows up after the last line using predict(),
Warning messages:
1: In if (!se.fit) { :
the condition has length > 1 and only the first element will be used
2: 'newdata' had 96 rows but variable(s) found have 1 rows
3: In predict.lm(object, newdata, se.fit, scale = residual.scale, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
4: In if (se.fit) list(fit = predictor, se.fit = se, df = df, residual.scale = sqrt(res.var)) else predictor :
the condition has length > 1 and only the first element will be used
You're doing things in a rather roundabout fashion, and one that doesn't translate well to making out-of-sample predictions. If you want to model on a subset of rows, then either subset the data argument directly, or use the subset argument.
train.set$load_lag <- c(rep(NA, 96), train.set$load[1:96])
mod <- glm(load ~ load_lag*TOD, data=train.set[97:192, ], ...)
You also need to rethink exactly what you're doing with TOD. If it has 96 levels, then you're fitting (at least) 96 degrees of freedom on 96 observations, which won't give you a sensible outcome.
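A minimal sketch of the lag-column approach, under several assumptions: the data are simulated (192 quarter-hour readings, i.e. two days), the family is Gamma with a log link, and TOD is left out because of the degrees-of-freedom problem just described. The new.day data frame is hypothetical and simply reuses the column names from the fit.

```r
# Simulated stand-in for train.set: two days of 15-minute readings.
set.seed(1)
n <- 192
train.set <- data.frame(oat = runif(n, min = 10, max = 30))
train.set$load <- rgamma(n, shape = 50, rate = 50 / (20 + 0.5 * train.set$oat))

# Lag the load by one day (96 intervals) as a proper column.
train.set$load_lag <- c(rep(NA, 96), head(train.set$load, -96))

# Fit on the second day only, via the subset argument.
mod <- glm(load ~ load_lag + oat, data = train.set, subset = 97:192,
           family = Gamma(link = "log"))

# Day-ahead prediction: newdata must carry the same column names as the fit.
new.day <- data.frame(
  load_lag = train.set$load[97:192],    # today's load becomes tomorrow's lag
  oat      = runif(96, min = 10, max = 30)  # hypothetical temperature forecast
)
pred <- predict(mod, newdata = new.day, type = "response")
```

Because the formula refers to named columns instead of indexed vectors such as load[i], predict() can find matching variables in newdata, which avoids the "'newdata' had 96 rows but variable(s) found have 1 rows" warning.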
