Cannot fit multilevel ordinal logit model using clmm - r

I'm trying to fit a multilevel (random effects) ordered logit model using the ordinal package, but I keep running into this error:
Error in region:country1 : NA/NaN argument
Here's my simplified model. I'm regressing an indicator of happiness on a number of variables, including class, gender, age, etc. There are two nested levels: regions within countries.
library(ordinal)
# Set as factor
data$happiness <- as.factor(data$happiness)
# Remove NA
missing_country1 <- is.na(data$country1)
data <- data[!missing_country1, ]
missing_region <- is.na(data$region)
data <- data[!missing_region, ]
# Model
model1 <- clmm(happiness ~ age + gender + class + (1 | country1 / region),
data = data,
na.action = na.omit
)
I have removed all NA and NaN from both country1 and region.
Thanks,

Figured it out: the ordinal package doesn't automatically convert the grouping variables to factors, so country1 and region have to be converted manually (e.g. with as.factor()) before calling clmm.
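The error message itself hints at the cause: the nested term appears to be expanded into an interaction written with the : operator, and on anything other than factors : is R's sequence operator. Here is a minimal base-R sketch on made-up data (the values "DE", "FR", "north" and "south" are hypothetical). It only illustrates the operator's behaviour and does not call clmm itself.

```r
# Toy grouping variables stored as character vectors (hypothetical data)
d <- data.frame(
  country1 = c("DE", "DE", "FR", "FR"),
  region   = c("north", "south", "north", "south"),
  stringsAsFactors = FALSE
)

# On character vectors, `:` tries to build a numeric sequence and fails
err <- tryCatch(
  suppressWarnings(with(d, region:country1)),
  error = function(e) conditionMessage(e)
)
print(err)

# After converting to factors, `:` builds the region-within-country grouping
d$country1 <- as.factor(d$country1)
d$region   <- as.factor(d$region)
nested <- with(d, region:country1)
print(levels(nested))
```

In the question's code, the same two as.factor() calls would go right before the clmm call.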


Loop mixed linear model longitudinal time data assessing groups effect on the continuous y variable

EDITED:
I'm trying to assess the effect of variables (e.g. presence of severe trauma) on a continuous variable (here energy expenditure (= REE) in calories) over time (Day). The dataframe is called my_data. Amongst the variables
Next, I would like to display the mixed linear model results for each assessed variable in one large file.
General concept:
REE ~ Time*predictor + (1 + Time | Case identifier)
(1) Starting creating the lmer model:
library(tidyverse)
library(ggpmisc)
library(sjPlot)
library(lme4)
mixed.modelloop <- function(x) {
lmer(REE ~ Day*(x) + (1 + Day | Studynumber),
data=my_data,
REML=FALSE,
na.action=na.omit,
control = lmerControl(check.nobs.vs.nRE = "ignore"))
}
(2) Then creating the predictors (x)
cols <- c(colnames(my_data))
(3) And then generating the overall purrr function:
output <- purrr::map(cols, ~ mixed.modelloop(.x) %>% tab_model)
(4) generating the file which should include all separate univariate mixed model analyses:
pdf(file="mixed linear models.pdf" )
output
dev.off()
Unfortunately currently after step (3) I'm getting the following error message:
Error in model.frame.default(data = my_data, na.action = na.omit, drop.unused.levels = TRUE, :
variable lengths differ (found for 'x')
Any idea on how to adapt the function to resolve this issue?
Thanks!
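Formulas have special rules; you can't insert a string into one and expect it to work. Inside your function, x in the formula is not replaced by the column name it holds. lmer finds the length-one character vector x itself, which is why it complains that variable lengths differ.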
This should work, although you haven't given a reproducible example to test with ...
mixed.modelloop <- function(x) {
form <- reformulate(c(sprintf("Day*%s", x), "(1 + Day | Studynumber)"),
response = "REE")
lmer(form,
data=my_data,
REML=FALSE,
na.action=na.omit,
control = lmerControl(check.nobs.vs.nRE = "ignore"))
}
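To check what reformulate() builds without fitting anything, here is a minimal base-R sketch. The predictor names Trauma and Age and the helper build_formula are hypothetical. The setdiff() step, which keeps the response, time and ID columns out of the loop, is an assumption about which columns should be iterated over, not something stated in the question.

```r
# Hypothetical helper: build the model formula for one predictor name
build_formula <- function(x) {
  reformulate(c(sprintf("Day*%s", x), "(1 + Day | Studynumber)"),
              response = "REE")
}

form <- build_formula("Trauma")  # "Trauma" is a made-up column name
print(form)

# Assumed column layout; only true predictors should be looped over,
# otherwise REE, Day and Studynumber would also be used as predictors
all_cols <- c("Studynumber", "Day", "REE", "Trauma", "Age")
cols <- setdiff(all_cols, c("Studynumber", "Day", "REE"))
forms <- lapply(cols, build_formula)
```

With real data, cols would be derived from colnames(my_data) in the same way and then passed to purrr::map as in step (3).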

Stratified Cox Model, Unusual Output For Stratified Factor Variable

I'm estimating a Cox PH model in R with time-varying stratification on a few of the variables. I'm using the following code to create the dataset and run the estimation
sepsis_working_fac_strat7 <- survSplit(Surv(LOS, facility) ~ ., data = sepsis_working, cut = 7, episode = "tgroup", id = "id")
cox_facility2 <- coxph(Surv(tstart, LOS, facility) ~ Age_10 + Gender + Insurance + LowInc + AbxBeforeCulture+ MorningDC+ AttendingAffilGrp+
CentralLine:strata(tgroup)+ Consults+ CountIVAbx:strata(tgroup)+ DischargeUnit+ FSSAdmit+ OralAbxBeforeDC+
OrderedToDischarge+ TimeToAbx+ TimeToBC+ UrineCulture+ Vasopressor+ Ventilator+
Behavioral_Dx + BradenGroup+ CCI+ Diabetes_Dx+ Dialysis+ PVD_Dx+ SUD_Dx, data = sepsis_working_fac_strat7, cluster = PersPersObjId)
In the output, there are four rows for the covariate "CentralLine" where there should only be two:
I have run the same estimation for other event types in the data and have not encountered this problem for CentralLine, which is a factor variable with two levels, where the base level is set to "No". Normally I see two rows for the strata, one for above and below the cut point. In the data, there are 184 events with a fairly even distribution between CentralLine == "Yes" and CentralLine == "No".
I'd like to know what is causing this output to appear as it does.
EDIT: This may be the issue: Why coxph() results some of the coefficient as NA when using survSplit() in R?

R function with separating data and finding linear regression

I want to calculate the impact that height has on earnings given the gender. I divided my data into data for male and female but when I run the lm(earnings~height+education+age, data = data_female) function it gives me an error saying: Error in model.frame.default(formula = earnings ~ height + education + :
variable lengths differ (found for 'education')
Would you be able to help in either suggesting a better way to refine my model or helping to fix this particular error? Please let me know.
setwd("~/Google Drive/R Data")
data <- read.csv('data_ass5.csv')
height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ
multiple_regression <- lm(earnings~height+age+gender+education,data = data)
lm(earnings~height+age+gender+education,data = data)
summary(multiple_regression)
summary(linear_regression)
multiple_regression_redefined <- lm(earnings~age+gender+education,data = data)
# Now I wish to particularly assess the impact of gender on earnings
# therefore trying to refine my model doing the following:
# but the last lm line is causing an error. Would you be able to advise on
# whether this is the correct way to refine it and/or why I am getting the error.
# I even tried putting na.rm=TRUE after the lm code, but error still.
data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)
lm(earnings~height+education+age, data = data_female)
Per docs of lm, the data argument handles variables in formula in two ways that are NOT mutually exclusive:
data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
Specifically, all your vector assignments are redundant and overlap with column names in the data frame except for gender and education:
height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ
multiple_regression <- lm(earnings~height+age+gender+education,data = data)
When the above is run, all referenced names except gender and education come from the data frame. gender and education are not columns of data, so they are pulled from the global environment, where you assigned them as vectors. Had you used sex and educ, the values would have been pulled from the data frame like all the others.
Relatedly, your subset calls use the gender vector and not the sex column. Fortunately the two are identical, so no errors or undesired results occurred.
data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)
Therefore, once you subset your data, lm pulls every variable from the subsetted data frame except one, education, which it takes from the global environment. But education was built from the full data frame, so it is longer than the columns of the subsetted data frame, and that mismatch triggers the error.
Altogether, simply avoid assigning the redundant vectors and use the columns of the full and subsetted data frames.
# NO LONGER NEEDED (REFER TO THE COLUMNS DIRECTLY)
# height <- data$height
# earnings <- data$earnings
# gender <- data$sex
# age <- data$age
# education <- data$educ
# REPLACE gender WITH sex AND education WITH educ (RENAME COLS IF NEEDED)
multiple_regression <- lm(earnings ~ height + age + sex + educ, data = data)
# REPLACE gender WITH sex
data_female <- subset(data, sex==0)
data_male <- subset(data, sex==1)
# REPLACE education WITH educ
lm(earnings ~ height + educ + age, data = data_female)
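The lookup rule can be reproduced on a small simulated data set. This is only a sketch: the values are random and the column names simply mirror the ones in the question.

```r
set.seed(1)

# Simulated stand-in for the question's data (values are made up)
data <- data.frame(
  earnings = rnorm(20, mean = 50, sd = 10),
  height   = rnorm(20, mean = 170, sd = 8),
  sex      = rep(0:1, each = 10),
  educ     = sample(10:18, 20, replace = TRUE)
)

# Free-standing copy of a column, with a name that is NOT a column name
education <- data$educ

data_female <- subset(data, sex == 0)

# 'education' is not found in data_female, so lm falls back to the
# 20-element vector in the calling environment and the lengths clash
msg <- tryCatch(
  {
    lm(earnings ~ height + education, data = data_female)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Using the actual column name keeps everything inside the subset
fit <- lm(earnings ~ height + educ, data = data_female)
print(nobs(fit))
```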

Adding a vector of dummy variables in logistic regression

I am currently trying to conduct logistic regression where one of the variables is a vector of 32 dummy variables. Each dummy represents a type of crime. For example:
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "NARCOTICS", 1,0)
Then the vector is created:
crime.type <- c(narcotics, theft, other.offense, burglary, motor.vehicle.theft, battery, robbery, assault, criminal.damage, deceptive.practice, kidnapping, etc.)
The logistic model is as follows:
logit.mod.train <- lm(street1 ~ BEAT+WARD+X.COORDINATE+Y.COORDINATE+LATITUDE+LONGITUDE+crime.type, data = train, family = "binomial")
It's important to note that street1 is actually a dummy variable for the location of the crime being on the street. So the column is LOCATION.DESCRIPTION and the element is street.
street1 <- ifelse(train$LOCATION.DESCRIPTION == "STREET", 1,0)
It yields this error:
Error in model.frame.default(formula = street1 ~ BEAT + WARD + X.COORDINATE + :
variable lengths differ (found for 'crime.type')
I thought this would work because they are derived from the same data set and the dummies represent each unique element of one of the columns. When I input each dummy variable separately it's successful but I want to condense the regression and make it more efficient.
Thank you in advance
If you intend for each type of crime to be its own predictor, you'll need to bind them to train, and then specify the variables in your lm formula. (Actually for logit it should be glm().)
For a more compact formula, subset train in the data= argument of glm() to include only your response variable and your intended design matrix. Then use street1 ~ . as your formula.
# street1 must be a column of train as well, because model.vars selects it
train <- cbind(train, narcotics, theft, street1)
model.vars <- c("narcotics", "theft", "street1")
logit.mod.train <- glm(street1 ~ ., data = train[,model.vars], family = "binomial")
More explanation:
Using ifelse as you've done produces a 1 or 0 for every element in train.
When you define crime.type as narcotics (which has the length of train) plus any additional elements, crime.type is longer than the number of rows in train.
Then you're asking lm() to process a lopsided design matrix, where one predictor (crime.type) has more elements in it than the other predictors. That's why you're getting the error.
Here's a replication of the issue:
N <- 100
train <- data.frame(PRIMARY.DESCRIPTION=sample(c("A","B"), replace = T, size = N),
response = rbinom(n=N, prob=0.7, size=1))
dim(train) # 100 2
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
length(narcotics) # 100
theft <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)
length(theft) # 100
crime.type <- c(narcotics, theft)
length(crime.type) # 200
logit.mod.train <- glm(response ~ PRIMARY.DESCRIPTION+crime.type, data = train, family = "binomial")
Error in model.frame.default(formula = response ~ PRIMARY.DESCRIPTION + :
variable lengths differ (found for 'crime.type')
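For contrast, here is a minimal sketch of the suggested fix on simulated data. The three crime categories and the probabilities are made up, and only two of the dummies are included so that the remaining category serves as the baseline.

```r
set.seed(42)
N <- 100

# Simulated stand-in for train (hypothetical categories and outcome)
train <- data.frame(
  PRIMARY.DESCRIPTION = sample(c("NARCOTICS", "THEFT", "BATTERY"),
                               size = N, replace = TRUE),
  street1 = rbinom(n = N, size = 1, prob = 0.7)
)

# Each dummy becomes its own column, so every predictor has N rows
train$narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "NARCOTICS", 1, 0)
train$theft     <- ifelse(train$PRIMARY.DESCRIPTION == "THEFT", 1, 0)

# Keep only the response and the intended predictors, then use `.`
model.vars <- c("street1", "narcotics", "theft")
fit <- glm(street1 ~ ., data = train[, model.vars], family = binomial)
print(names(coef(fit)))
```

With all 32 dummies the pattern is the same: add each one as a column and list its name in model.vars.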

Comparing nested models with NAs in R

I am trying to compare nested regression models using the anova() function in R, but am running into problems because the level 1 and level 2 models differ in the number of observations due to missing cases. Here is a simple example:
# Create dataframe with multiple predictors with different number of NAs
dep <- c(45,46,45,48,49)
basevar <- c(10,12,10,16,17)
pred1 <- c(NA,20,NA,19,21)
dat <- data.frame(dep,basevar,pred1)
# Define level 1 of the nested models
basemodel <- lm(dep ~ basevar, data = dat)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = dat)
# Compare the models (uh oh!)
anova(basemodel, model1)
I have seen 2 suggestions to similar problems, but both are problematic.
Suggestion 1: Impute the missing data. The problem with this is that the missing cases in my data were removed because they were outliers, and thus are not "missing at random," and imputing may overfit the data.
Suggestion 2: Make a separate data frame containing only the complete cases for the variable with missing cases, and use that for regressions. This is also problematic if you are creating multiple nested models sharing the same level 1 variable, but in which the level 2 variables differ in the number of missing cases. Here is an example of this:
# Create a new predictor variable with a different number of NAs from pred1
pred2 <- c(23,21,NA,10,11)
dat <- cbind(dat,pred2)
# Create dataframe containing only completed cases of pred1
nonadat1 <- subset(dat, subset = !is.na(pred1))
# Do the same for pred2
nonadat2 <- subset(dat, subset = !is.na(pred2))
# Define level 1 of the nested models within dataframe of pred1 complete cases
basemodel1 <- lm(dep ~ basevar, data = nonadat1)
# Check values of the model
summary(basemodel1)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = nonadat1)
# Compare the models (yay it runs!)
anova(basemodel1, model1)
# Define level 1 of the nested models within dataframe of pred2 complete cases
basemodel2 <- lm(dep ~ basevar, data = nonadat2)
# Values are different from those in basemodel1
summary(basemodel2)
# Add level 2
model2 <- lm(dep ~ basevar + pred2, data = nonadat2)
# Compare the models
anova(basemodel2, model2)
As you can see, creating individual data frames creates differences at level 1 of the nested models, which makes interpretation problematic.
Does anyone know how I can compare these nested models while circumventing these problems?
Could this work? See here for more information. It doesn't exactly deal with the fact that models are fitted on different datasets, but it does allow for a comparison.
A<-logLik(basemodel)
B<-logLik(model1)
(teststat <- -2 * (as.numeric(A)-as.numeric(B)))
(p.val <- pchisq(teststat, df = 1, lower.tail = FALSE))
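One caveat with the approach above: log-likelihoods computed on different sets of observations are not directly comparable. A common alternative is to refit the base model on exactly the rows the larger model used. The sketch below does this with model.frame() on simulated data (the data are made up and larger than the question's five-row example, and the name basemodel_same is hypothetical). It does not remove the asker's second concern, since the base model will still differ between predictors that have different missing cases.

```r
set.seed(123)
n <- 30

# Simulated data with a few missing values in the extra predictor
dat <- data.frame(basevar = rnorm(n), pred1 = rnorm(n))
dat$dep <- 45 + 2 * dat$basevar + dat$pred1 + rnorm(n)
dat$pred1[c(2, 7, 19)] <- NA

# The larger model silently drops the rows where pred1 is NA
model1 <- lm(dep ~ basevar + pred1, data = dat)

# Refit the base model on the rows model1 actually used
basemodel_same <- lm(dep ~ basevar, data = model.frame(model1))

# Both models now rest on the same observations, so anova() runs
cmp <- anova(basemodel_same, model1)
print(cmp)
```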
