Passing strings into 'contrasts' argument of lme/lmer - r

I am writing a sub-routine to return the output of longitudinal mixed-effects models. I want to be able to pass elements from lists of variables into lme/lmer as the outcome and predictor variables. I would also like to be able to specify contrasts within these mixed-effects models; however, I am having trouble getting the contrasts argument to recognise the strings as the variable names referred to in the model specification within the same lme/lmer call.
Here's some toy data:
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now the following call to lme works just fine, with contrasts specified (I know these are the default so this is all purely pedagogical).
library(nlme)
mod <- lme(score ~ group*time, random = ~1|id, data = df, contrasts = list(group = contr.treatment(2), time = contr.treatment(4)))
The following also works, passing strings as variable names into lme using the reformulate() function.
t <- "time"
g <- "group"
dv <- "score"
mod1R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df)
But if I want to specify contrasts, like in the first example, it doesn't work
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df, contrasts = list(g = contr.treatment(2), t = contr.treatment(4)))
# Error in `contrasts<-`(`*tmp*`, value = contrasts[[i]]) : contrasts apply only to factors
How do I get lme to recognise that the strings specified in the contrasts argument refer to the variables passed into the reformulate() function?

You should be able to use setNames() on the list of contrasts to apply the full names to the list:
# Using a %>% pipe so need to load magrittr
library(magrittr)
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"),
             random = ~1|id,
             data = df,
             contrasts = list(g = contr.treatment(2), t = contr.treatment(4)) %>%
               setNames(c(g, t))
)
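A minimal base-R sketch of why the naming matters, without the pipe. The names wrong and right are only illustrative: list(g = ...) uses the literal name "g", whereas setNames() uses the string stored in g.

```r
g <- "group"
t <- "time"

# The argument name is taken literally, so the elements are called "g" and "t"
wrong <- list(g = contr.treatment(2), t = contr.treatment(4))

# setNames() applies the values held in g and t as the element names
right <- setNames(list(contr.treatment(2), contr.treatment(4)), c(g, t))

names(wrong)
names(right)
```

The list in right can then be passed as the contrasts argument in place of the piped version.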

Related

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
1. It loops over all imputed datasets.
2. It uses complete to get one dataset.
3. It runs the expression with the dataset as the environment.
The third step is the problem. Functions that take a formula (like lm, glm and multinom) look up the variables of that formula in the environment the formula was created in. If the variables are not in that environment (but rather in e.g. a data frame), you can point the function to them by setting the data argument.
The problems
This is where both your problems derive from:
In your multinom call, you set the data argument to df. Hence, you are actually running multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument; it only supplies the environment. That isn't sufficient for dfidx.
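The first problem can be reproduced without mice. The following is a base-R sketch of the same mechanism (eval() with a data frame as the environment), not the actual with.mids source; the data frames original and imputed are made up for illustration.

```r
# Two hypothetical data frames standing in for the original and one imputed dataset
original <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))
imputed  <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 80))

# No data argument: x and y are looked up in the evaluation environment
expr_env <- quote(lm(y ~ x))
# Explicit data argument: it takes precedence over the evaluation environment
expr_data <- quote(lm(y ~ x, data = original))

fit_env  <- eval(expr_env, imputed)   # fitted on 'imputed'
fit_data <- eval(expr_data, imputed)  # fitted on 'original' despite the environment

coef(fit_env)
coef(fit_data)
```

The two sets of coefficients differ, which is the same effect as passing data = df inside with().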
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients: you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), in that it applies a function to each element of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})
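To see why referring to an outer data frame gives identical coefficients for every imputation, here is a small base-R sketch using lapply() and lm() instead of map() and mlogit(). The list imputations and the object mldata are hypothetical stand-ins.

```r
set.seed(1)
# Three made-up "imputed" data sets
imputations <- lapply(1:3, function(i) data.frame(x = 1:10, y = 1:10 + rnorm(10)))
mldata <- imputations[[1]]  # stand-in for a data frame defined outside the loop

# Ignores the function argument: every fit uses mldata
same_fits <- lapply(imputations, function(imp_n.x) coef(lm(y ~ x, data = mldata)))

# Uses the function argument: each fit uses its own data set
own_fits <- lapply(imputations, function(imp_n.x) coef(lm(y ~ x, data = imp_n.x)))
```

Every element of same_fits is identical, while the elements of own_fits differ from one imputation to the next.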

How to create a new data set based on rows of one variable from an existing dataset while each row has multiple observations

I have a dataset with the following structure:
Variable "Class" = 1,..,50
each class has multiple observations: from 2000 (number of observations in class 1) to 200 (number of observations in class 50)
variables Age, Sex, HIV for each individual in each class
What I have to do is create a new dataset from this original dataset in which each row represents one value of the variable "Class" (so 50 rows, instead of the roughly 10000 rows in the original dataset), together with the variables listed above.
I'm new to R, so I'm not sure how I can squeeze(?!) the data so that, for example, row 1 represents class 1 but carries the Age, Sex and HIV information for its 2000 individuals!
I need this new dataset because I am writing a function (a glm) and the source of data for that function should not be the original data; it should be based on classes!
But the predictions of this glm will be on the individual level! (on the original data)
Can anyone kindly give me a hand or a hint on this?
Here is what a small-scale version of the data looks like:
library(simstudy)
Class <- defData(varname = "Class", dist = "categorical", formula = "0.8;0.2", id="Class1")
Class <- defData(Class, varname = "Classic", dist = "categorical", formula = "0.8;0.2")
Class <- defData(Class, varname = "clustersize",dist = "normal", formula = "5", variance = 0)
d1 <- genData(1, Class)
d1
dF1 <- genCluster(d1, cLevelVar = "Class", numIndsVar = "clustersize", level1ID = "Class1")
dF1
Class2<- defData(varname = "Class", dist = "categorical", formula = "0.3;0.2;0.1;0.3;0.1", id="Class1")
Class2 <- defData(Class2, varname = "Classic", dist = "categorical", formula = "0.3;0.2;0.1;0.3;0.1")
Class2 <- defData(Class2, varname = "clustersize",dist = "noZeroPoisson", formula = "3")
d2 <- genData(3, Class2)
d2
dF2 <- genCluster(d2, cLevelVar = "Class", numIndsVar = "clustersize", level1ID = "Class1")
dF2
d<-rbind(dF1,dF2)
v <- defDataAdd( varname = "Age", dist = "normal", formula = "20", variance = 10)
v <- defDataAdd(v, varname = "Sex", dist = "binary", formula = "0.4", link = "logit")
v <- defDataAdd(v, varname = "HIV", dist = "binary", formula = "0.7", link = "logit")
d <- addColumns(v, d)
Y<- defDataAdd( varname = "Y", dist = "binary", formula = "0.1*Age+0.2*Sex+0.5*HIV", link = "logit")
d <- addColumns(Y, d)
d
Let's put it this way. "d" is the original dataset I have, with 16 rows (individuals) according to the code I gave. Now I want to model Y by Age, Sex and HIV, but the data this model uses should not be "d"; it has to be a new data set extracted from "d" such that I end up with 3 rows (because I have 3 classes). So my confusion is how to create this new dataset from d when I have 11 individuals in class 1, 2 individuals in class 2 and 3 individuals in class 3. I will then fit the model on this new data set and predict on the original dataset "d".
Thanks for updating the question. However, I can't reproduce your example. The code gives an error. In case you would like to estimate a GLM, you can first create factors, and then fit the GLM. It is not clear to me what you mean by classes.
Let's say you have the following data mtcars, and would like to model cyl by vs and gear. Then you can first create factors for vs and gear, and then use the new data in a glm.
library(dplyr)
# Change vs and gear to factors
mtcars1 <- mtcars %>%
mutate(across(c(vs,gear), as.factor))
Compare the following two:
glm(cyl ~ vs + gear, data = mtcars1)
glm(cyl ~ vs + gear, data = mtcars)
The first one uses factors and the second one numerical values.
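If "classes" means collapsing d to one row per Class, one possible reading of the question can be sketched in base R with aggregate(). This is only a guess at what is wanted: the data frame below is a made-up stand-in for d with the class sizes given in the question, class means are an assumed summary, and only one predictor is used because three rows cannot support more.

```r
set.seed(42)
# Hypothetical stand-in for d: 16 individuals in 3 classes (11, 2 and 3 members)
d <- data.frame(
  Class = rep(1:3, times = c(11, 2, 3)),
  Age   = rnorm(16, mean = 20, sd = sqrt(10)),
  Sex   = rbinom(16, 1, 0.4),
  HIV   = rbinom(16, 1, 0.7),
  Y     = rbinom(16, 1, 0.5)
)

# One row per class: the class mean of every variable
by_class <- aggregate(cbind(Age, Sex, HIV, Y) ~ Class, data = d, FUN = mean)
by_class$n <- as.vector(table(d$Class))  # class sizes, in the same Class order

# Fit on the class-level rows, weighting each class by its size
fit <- glm(Y ~ Age, family = quasibinomial, weights = n, data = by_class)

# Predict back on the individual-level data
d$pred <- predict(fit, newdata = d, type = "response")
```

Whether means are the right class-level summary depends on the model you have in mind.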

R Catboost to handle categorical variables

I have a question about Catboost: do I need to preprocess the categorical variables before modeling?
I have 86 variables including 1 target variable. Of the other 85 variables, 2 are numeric and 83 are categorical (factor type). The target variable is a binary factor, 1 or 0.
Column 1 and Columns 4 to 85 are of factor type.
Column 2 and 3 are numeric.
I am a little confused with cat_features in catboost.train(). In the parameters, I can set a vector of categorical features. Also, I can set in the catboost.load_pool.
library(catboost)
library(dplyr)
X_train <- train %>% select(-Target)
y_train <- (as.numeric(unlist(train[c('Target')])) - 1)
X_valid <- test %>% select(-Target)
y_valid <- (as.numeric(unlist(test[c('Target')])) - 1)
train_pool <- catboost.load_pool(data = X_train, label = y_train, cat_features = c(0,3:84))
test_pool <- catboost.load_pool(data = X_valid, label = y_valid, cat_features = c(0,3:84))
params <- list(iterations=500,
learning_rate=0.01,
depth=10,
loss_function='RMSE',
eval_metric='RMSE',
random_seed = 1,
od_type='Iter',
metric_period = 50,
od_wait=20,
use_best_model=TRUE,
cat_features = c(0,3:84))
catboost.train(train_pool, test_pool, params = params)
However, after I ran the code above, I got an error:
Error in catboost.train(train_pool, test_pool, params = params) :
catboost/libs/options/plain_options_helper.cpp:339: Unknown option {cat_features} with value "[0,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]"
Any help?
Look at this example: cat_features should not go in the params list, only in catboost.load_pool().
library(catboost)
library(dplyr)  # for glimpse()
countries = c('RUS','USA','SUI')
years = c(1900,1896,1896)
phone_codes = c(7,1,41)
domains = c('ru','us','ch')
dataset = data.frame(countries, years, phone_codes, domains, stringsAsFactors = T)
glimpse(dataset)
label_values = c(0,1,1)
fit_params <- list(iterations = 100,
loss_function = 'Logloss',
ignored_features = c(4,9),
border_count = 32,
depth = 5,
learning_rate = 0.03,
l2_leaf_reg = 3.5)
pool = catboost.load_pool(dataset, label = label_values, cat_features = c(0,3))
model <- catboost.train(pool, params = fit_params)
model
I haven't tried CatBoost in R, but see the example on this page:
https://catboost.ai/docs/concepts/r-reference_catboost-train.html
It appears you only pass the categorical variables in the load_pool() call, and NOT in the train() call.
(This works differently from the Python API, where cat_features is passed in the Python fit() call.)
A suggestion: group all the categorical variables in the leftmost columns. That way the index vector is simpler to create. I also have a check in my code to make sure I did it right...
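Such a check could look like the following base-R sketch, which derives the indices from the column types instead of typing them by hand. It assumes zero-based indices, as in the question's c(0,3:84) for columns 1 and 4 to 85; the small X_train here is made up.

```r
# Hypothetical feature table: factor columns at positions 1 and 4
X_train <- data.frame(
  f1 = factor(c("a", "b", "a")),
  n1 = c(1.5, 2.0, 3.0),
  n2 = c(0, 1, 0),
  f2 = factor(c("x", "y", "x"))
)

# One-based positions of the factor columns, shifted to zero-based indices
cat_idx <- unname(which(vapply(X_train, is.factor, logical(1)))) - 1L
cat_idx
```

The resulting vector can be compared against, or used instead of, a hand-written cat_features vector in catboost.load_pool().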

Linear Model error or data type error?

The program generates a matrix with location and treatment. It generates some data from a normal distribution. It then tries to fit a linear model predicting yield based on treatment and location. The linear model does not work. Why?
trts = paste(0:6)
locs = paste(1:6)
reps = paste(1:4)
plotsize = 4
DF = expand.grid(locs, reps, trts, plotsize, stringsAsFactors = TRUE)
colnames(DF) = c("Location","Replicate","Treatment")
vector = rnorm(1000000, mean=138.2, sd=54.89)
DF$Treatment = as.numeric(DF$Treatment)
DF$Location = as.numeric(DF$Location)
#This approach takes one set of "plotsize" values from "vector" and adds for 5 for each treatment.
DF$Yield = apply(DF, 1, function(x) (5*DF$Treatment)+mean(sample vector,plotsize)))
DF<-t(DF)
Yield<-DF$Yield
trt=as.factor(DF$Treatment)
loc=as.factor(DF$Location)
summary(fm1 <- aov(Yield ~ loc*trt))
result1<-TukeyHSD(fm1, "trtm", ordered = TRUE)
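Several things in the code above stop it before the model is fitted: sample vector,plotsize is a syntax error, t(DF) turns the data frame into a matrix so DF$Yield no longer works, as.numeric() on a factor returns the level codes 1 to 7 rather than 0 to 6, and "trtm" is not a term in the model. A hedged sketch of what the code appears to be aiming for (the names pool and trt_num are introduced here, and the intended design is an assumption):

```r
set.seed(1)
trts <- paste(0:6)
locs <- paste(1:6)
reps <- paste(1:4)
plotsize <- 4

# One row per location x replicate x treatment; columns are factors
DF <- expand.grid(Location = locs, Replicate = reps, Treatment = trts,
                  stringsAsFactors = TRUE)
pool <- rnorm(1000000, mean = 138.2, sd = 54.89)

# Convert via character to recover the treatment values 0..6
trt_num <- as.numeric(as.character(DF$Treatment))

# Each plot: mean of 'plotsize' draws from the pool, plus 5 per treatment level
DF$Yield <- 5 * trt_num +
  vapply(seq_len(nrow(DF)), function(i) mean(sample(pool, plotsize)), numeric(1))

# Keep DF as a data frame and refer to model terms by their actual names
fm1 <- aov(Yield ~ Location * Treatment, data = DF)
summary(fm1)
result1 <- TukeyHSD(fm1, "Treatment", ordered = TRUE)
```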

Difference between "xvar=x1" and "xvar = ~x1" in wp() from gamlss package in R

I'm using the gamlss package in R to implement wormplots for the residuals study.
The function wp() has an argument xvar which is used for bucketing.
Assume I have a "numeric" vector x1 which if passed as "xvar = x1" behaves differently than "xvar = ~x1". Basically the second case is treated as a formula. The buckets created for both cases will be different from each other.
Code:
library(gamlss)
glc<-gamlss.control(n.cyc = 200)
myseed <- 12345
set.seed(myseed) #this will make results reproducible
# generate data
N<-10000 # this is the sample size
dd<-data.frame(x1=rpois(N,1)
,x2=rnorm(N,.7,.3)
,x3=log(rgamma(N,shape=6,scale=10))
,x4=sample(letters[1:3], N, replace = T)
,x5=sample(letters[3:6], N, replace = T)
,ind = rbinom(N,size=1,prob=0.5)
)
#Generate distributions
dd$y_wei1<-rweibull(N,scale=exp(.3*dd$x1+.8*dd$x3),shape=5)
m1 <- gamlss(formula = y_wei1 ~ x1 + x3 + x4 + x5,
data = dd ,
family = "WEI" ,
K = 2,
control = glc
)
# Case 1.
wp(object = m1, xvar = x1, n.inter = 4)
# Case 2.
wp(object = m1, xvar = ~x1, n.inter = 4)
Edit:
I observed that this happens only when the overlap argument is set to 0, because when overlap = 0 another function (check.overlap) is called internally. Why is this function called?
The function has been written such that xvar = ~x1 indicates that x1 is a factor/character variable, so grouping occurs based on its unique values. When the user calls it with xvar = x1, bins are created based on the range of x1, and those bins are used to generate the worm plots.
The difference arises because internally there is a check.overlap function that is applied only if x1 is numeric. In case of overlap, it clips the intervals so that they do not overlap. This step is missing if the user calls wp() with xvar = ~x1.
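The two grouping strategies described above can be illustrated in base R. This is an analogy only, not the gamlss internals: split() by value stands in for the factor-style grouping, and cut() into four intervals stands in for range-based binning.

```r
set.seed(12345)
x1 <- rpois(1000, 1)

# Grouping by unique values: one group per distinct count
by_value <- split(seq_along(x1), x1)

# Grouping by range: four equal-width intervals over the range of x1
by_range <- split(seq_along(x1), cut(x1, breaks = 4))

length(by_value)
length(by_range)
```

For a discrete variable such as a Poisson count, the two groupings generally contain different observations, which is consistent with the different worm plots.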
