I am attempting to fit a model with a large number of predictors, such that it would be tedious to enumerate them in a model formula. This is straightforward to do with lm():
indicatorMatrix <- data.frame(matrix(rbinom(26000, 1, 1/3), ncol = 26))
colnames(indicatorMatrix) <- LETTERS
someDV <- rnorm(nrow(indicatorMatrix))
head(indicatorMatrix)
# One method, enumerating variables by name:
olsModel1 <- lm(someDV ~ A + B + C + D, # ...etc.
data = indicatorMatrix)
# Preferred method, including the matrix of predictors:
olsModel2 <- lm(someDV ~ as.matrix(indicatorMatrix))
summary(olsModel2)
Since I have a very large number of predictors (more than the 26 in this invented example), I don't want to list them individually as in the first example (someDV ~ A + B + C + D...), and I can avoid this by just including the predictors as.matrix.
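For the lm() case, a related shorthand that may be worth noting is the `.` placeholder, which expands to every column of `data` that is not on the left-hand side. A minimal sketch, reusing the invented data from above (the fixed seed and the name `olsModel3` are assumptions added for illustration):

```r
# Same invented data as above, with a fixed seed for reproducibility
set.seed(1)
indicatorMatrix <- data.frame(matrix(rbinom(26000, 1, 1/3), ncol = 26))
colnames(indicatorMatrix) <- LETTERS
someDV <- rnorm(nrow(indicatorMatrix))

# "." stands for all columns of `data`; someDV is looked up in the
# environment of the formula because it is not a column of `data`
olsModel3 <- lm(someDV ~ ., data = indicatorMatrix)
names(coef(olsModel3))  # "(Intercept)" followed by the 26 column names
```

This shorthand only covers plain fixed-effect terms; it does not by itself produce (1 | A)-style random-effects terms.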
However, I want to fit a mixed effects model, like this:
library(lme4)
meModel1 <- lmer(someDV ~ (1 | A) + (1 | B) + (1 | C), # ...etc.
data = indicatorMatrix)
summary(meModel1)
Except that I want to include a large number of random effects terms. Rather than having to type (1 | A) ... (1 | ZZZ), I would like to include each predictor in a manner analogous to the matrix approach used for olsModel2 above. The following, obviously, does not work:
meModel2 <- lmer(someDV ~ (1 | as.matrix(indicatorMatrix)))
Do you have any suggestions for how I can best replicate the matrix-predictor approach for random effects with lmer()? I am very willing to consider "pragmatic" solutions (i.e. hacks), so long as they are "programmatic," and don't require me to copy & paste, etc. etc.
Thanks in advance for your time.
I think that constructing the formula as a string and then using as.formula, something along the lines of
restring1 <- paste0("(1 | ",colnames(indicatorMatrix),")",collapse="+")
form <- as.formula(paste0("someDV ~",restring1))
meModel1 <- lmer(form, data = data.frame(someDV,indicatorMatrix))
should work (it runs without complaining on my system, anyway ...)
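A variant of the same idea, sketched with base R's reformulate(), which pastes the term labels together with "+" and returns a formula object directly. The smaller data size, the seed and the names `reTerms` and `form2` are assumptions for illustration; the lmer() call is left commented out so that the snippet runs without lme4 installed.

```r
set.seed(1)
indicatorMatrix <- data.frame(matrix(rbinom(2600, 1, 1/3), ncol = 26))
colnames(indicatorMatrix) <- LETTERS
someDV <- rnorm(nrow(indicatorMatrix))

# One "(1 | X)" term per column name
reTerms <- paste0("(1 | ", colnames(indicatorMatrix), ")")
form2 <- reformulate(reTerms, response = "someDV")
form2

# Fitting would then look like this (requires lme4):
# meModel2 <- lme4::lmer(form2, data = data.frame(someDV, indicatorMatrix))
```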
I fit a lot of GLMs in R. Usually I use revoScaleR::rxGlm() for this because I work with large data sets and use quite complex model formulae, and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1)) # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. The docs for rxLogit() (https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit) state that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
data = df_reprex,
pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many others seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly
Two things to try:
1. Expand the data and get rid of the weights statement.
2. Use cbind(y,x-y)~1 in either rxLogit or rxGlm, without weights and without expanding the data.
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to each 1 or 0 response and then this expanded data is run in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
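For larger data, the expansion could be done programmatically rather than by hand. A minimal base-R sketch, assuming every row has x trials and y successes with y <= x (the names `expanded` and `fit_expanded` are made up here, and glm() stands in for rxLogit(), which I can't run):

```r
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes

# For each original row, emit y ones followed by (x - y) zeros
expanded <- data.frame(
  orig_row = rep(seq_len(nrow(df_reprex)), times = df_reprex$x),
  y = unlist(Map(function(s, n) rep(c(1, 0), times = c(s, n - s)),
                 df_reprex$y, df_reprex$x))
)

# Binary response, one row per trial, no weights
fit_expanded <- glm(y ~ 1, family = binomial, data = expanded)
plogis(coef(fit_expanded)[[1]])  # inverse logit of the intercept
```

The expanded data has sum(x) rows, so this trades memory for a strictly binary response.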
Does rxLogit allow a cbind representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size. From the rxLogit page, I'm guessing not for rxLogit, but rxGlm might allow it, given the following note in the formula page:
A formula typically consists of a response, which in most RevoScaleR
functions can be a single variable or multiple variables combined
using cbind, the "~" operator, and one or more predictors, typically
separated by the "+" operator. The rxSummary function typically
requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1), # number of successes
trial=c("first", "second", "third", "fourth")) # trial label
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
df_reprex_expanded <- data.frame(y=c(0,1,0,0,1,0),
trial=c("first","second","third", "third", "fourth", "fourth"))
## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
family = binomial,
data = df_reprex_expanded)
exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1])) # overall fitted average 0.333 - correct
## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
glm_2a <- rxLogit(y ~ 1,
data = df_reprex_expanded)
exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1])) # overall fitted average ???
# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y,x-y)~1,
data=df_reprex)
exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1])) # overall fitted average ???
# cbind() + rxGlm + family=binomial FTW(?)
glm_2c <- rxGlm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1])) # overall fitted average ???
I have a set of a couple of dozen numeric variables and am trying to figure out how to compactly express a quadratic form in those variables. I also want to include the variables themselves. The idea here is that we are fitting a response surface, rather than interacting a group of treatments, as the standard R formula notation seems to assume. I am trying to get appropriate expressions turned into an R formula, suitable for estimation by different techniques, with different data sets, or over different periods.
If there is an explicit statement of how R's formula notation works, anywhere, I have not been able to find it. There is an ancient paper from which R supposedly copied the notation, but it is by no means identical to current R usage. Every other description I have found just gives examples, that do not cover every case -- not even close to every case.
So, just as an example, here I try to construct a quadratic form in three variables, without writing out all the pairs by hand with an I() around each pair.
library(tidyverse)
A <- B <- C <- 1:10
LHS <- 1:10 * 600
tb <- tibble(LHS, A, B, C)
my_eq <- as.formula(LHS ~ I(A + B + C)*I(A + B + C))
I have not found any way to tell whether I have succeeded.
Neither
my_eq nor
terms(my_eq)
seems at all enlightening.
For example, can one predict whether
identical(as.formula(LHS ~ I(A + B + C)*I(A + B + C)), as.formula(LHS ~ I((A + B + C)*(A + B + C))))
is true or false? I can not even guess. Or to take an even simpler case, is ~ A * I(A) equal to A, I(A^2), or something else? And how would you know?
To restate my question, I would like either a full statement of how R's formula notation works, adequate to cover every case and predict what each would mean, or, failing that, a straightforward way of producing an expansion of any existing formula into all the atomic terms for which coefficients will be estimated.
This may not answer your question, but I'll post this anyway since I think it may help a little.
The I function inhibits the interpretation of operators such as "+", so your formula is probably not going to do what you expect it to do. For example, the results of lm(my_eq) will be the same as the results of doing the following:
D <- A + B + C
lm(LHS ~ D * D)
And then you may as well just do lm(LHS ~ D).
For your question, I believe John Maindonald wrote a good book that explains R formulas for many situations. But it's in my office and today is a Sunday.
Edit: For the expansion, I believe you have to fit the model and then look at the call or the terms:
> my_eq <- as.formula(LHS ~ (A + B + C) * (A + B + C))
> my_formula <- lm(my_eq)
> attr(terms(my_formula), "term.labels")
[1] "A" "B" "C" "A:B" "A:C" "B:C"
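As far as I know, terms() can also be applied to the formula itself, without fitting anything, as long as the formula contains no `.` placeholder. A small sketch under that assumption (the names `quad_form`, `odd_form` and `dat` are made up): crossing with ^2 gives the pairwise products, and the pure squares have to be added with I() because A:A collapses to A.

```r
# Full quadratic form in A, B, C: main effects, pairwise products, squares
quad_form <- LHS ~ (A + B + C)^2 + I(A^2) + I(B^2) + I(C^2)
attr(terms(quad_form), "term.labels")

# The "~ A * I(A)" case from the question: I(A) is treated as its own variable
odd_form <- ~ A * I(A)
attr(terms(odd_form), "term.labels")

# With data, the columns that would receive coefficients can be listed directly
dat <- data.frame(LHS = rnorm(10), A = rnorm(10), B = rnorm(10), C = rnorm(10))
colnames(model.matrix(quad_form, data = dat))
```

For factors the model matrix columns depend on the contrasts in use, so model.matrix() on real data is the more reliable check of what will actually be estimated.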
I am trying to build a model-based tree with a type of "two-layer interaction" where the models in the nodes of the tree are segmented again.
I am using the mob() function to this aim but I could not manage to make the argument for the fit function work with the lmtree() function.
In the following example a is a function of b and the relationship between a and b depends on d and on b | d.
library("partykit")
set.seed(321)
b <- runif(200)
d <- sample(1:2, 200, replace = TRUE)
a <- jitter(ifelse(d == 1, 2 * b - 1, 4 * b - 1.2), amount = .1)
a[b < .5 & d == 1] <- jitter(rep(0, length(a[b < .5 & d == 1])))
a[b < .3 & d == 2] <- jitter(rep(0, length(a[b < .3 & d == 2])))
fit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ..., estfun = FALSE, object = FALSE)
{
x <- x[, 2]
l <- lmtree(y ~ x | b)
return(l)
}
m <- mob(a ~ b | d, fit = fit) # not working
Of course, with this simple example I could use lmtree(a ~ b | d + b) to find every interaction, but is there a way to use lmtree() as the fit function of mob()?
No but yes ;-)
No, lmtree() cannot be used easily as a fitter for a mob().
The dimension of the inner tree (lmtree()) is not fixed, i.e., you may get a tree without any partition or with many subgroups, and this would be confusing for the outer tree (mob()).
Even if one worked around the dimension issue or fixed it by always forcing one break, one would need more work to set up the right coefficient vector, matrix of estimating functions, etc. This is also not straightforward because the convergence rate (and hence the inference) is different if breakpoints are given (e.g., for a binary factor) or have to be estimated (such as for your numeric variables b).
The way you set up your fit() function, the inner lmtree() does not know where to find b. All it has is a numeric vector y and a numeric matrix x but not the original data.
But yes, I think that all of these issues can be addressed if changing the view from fitting a "two-layer" tree to fitting a "segmented" model inside a tree. My impression is that you want to fit a model y ~ x (or a ~ b in your example) where a piecewise linear function is used with an additional breakpoint in x. If the piecewise linear function is supposed to be continuous in x, then the segmented package can be easily used. If not, then strucchange could be leveraged. Assuming you want the former (as you have simulated your data like this), I include a worked segmented example below (and also slightly modified your question to reflect this).
Changing the names and code a little bit, your data d has a segmented piecewise linear relationship of y ~ x with coefficients depending on a group variable g.
set.seed(321)
d <- data.frame(
x = runif(200),
g = factor(sample(1:2, 200, replace = TRUE))
)
d$y <- jitter(ifelse(d$g == "1",
pmax(0, 2 * d$x - 1),
pmax(0, 4 * d$x - 1.2)
), amount = 0.1)
Within every node of a tree I can then fit a model segmented(lm(y ~ x)) which comes with suitable extractors for coef(), logLik(), estfun() etc. Thus, the mobster function is simply:
segfit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...)
{
x <- as.numeric(x[, 2])
segmented::segmented(lm(y ~ x))
}
(Note: I haven't tried whether segmented() would also support lm() objects with weights and offset.)
With this we can obtain the full tree which simply splits in g in this basic example:
library("partykit")
segtree <- mob(y ~ x | g, data = d, fit = segfit)
plot(segtree, terminal_panel = node_bivplot, tnex = 2)
A hands-on introduction to segmented is available in: Muggeo VMR (2008). "segmented: An R Package to Fit Regression Models with Broken-Line Relationships." R News, 8(1), 20-25. https://CRAN.R-project.org/doc/Rnews/
For the underlying methodological background see: Muggeo VMR (2003). "Estimating Regression Models with Unknown Break-Points." Statistics in Medicine, 22(19), 3055-3071. doi:10.1002/sim.1545
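To make the "continuous piecewise linear" idea above concrete, here is a rough base-R sketch of a broken-line fit when the breakpoint is known in advance. This is not what segmented() does internally (it estimates the breakpoint); the fixed value psi = 0.5, the noiseless data and the names `hinge` and `fit_known` are assumptions for illustration only.

```r
x <- seq(0, 1, by = 0.01)
y <- pmax(0, 2 * x - 1)    # flat until x = 0.5, then slope 2

psi <- 0.5                 # assumed known breakpoint
hinge <- pmax(0, x - psi)  # zero left of psi, linear right of it

# Continuous broken line: the coefficient of hinge is the change in slope at psi
fit_known <- lm(y ~ x + hinge)
round(coef(fit_known), 6)
```

When psi is unknown, the hinge term depends nonlinearly on it, which is the part that segmented() handles.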
I'm trying to fit a nonlinear model with nearly 50 variables (since there are year fixed effects). The problem is I have so many variables that I cannot write the complete formula down like
nl_exp = as.formula(y ~ t1*year.matrix[,1] + t2*year.matrix[,2]
+... +t45*year.matrix[,45] + g*(x^d))
nl_model = gnls(nl_exp, start=list(t=0.5, g=0.01, d=0.1))
where y is the binary response variable, year.matrix is a matrix of 45 columns (indicating 45 different years) and x is the independent variable. The parameters to be estimated are t1, t2, ..., t45, g, d.
I have good starting values for t1, ..., t45, g, d. But I don't want to write a long formula for this nonlinear regression.
I know that if the model is linear, the expression can be simplified using
l_model = lm(y ~ factor(year) + ...)
I tried factor(year) in gnls function but it does not work.
Besides, I also tried
nl_exp2 = as.formula(y ~ t*year.matrix + g*(x^d))
nl_model2 = gnls(nl_exp2, start=list(t=rep(0.2, 45), g=0.01, d=0.1))
It also returns an error message.
So, is there any easy way to write down the nonlinear formula and the starting values in R?
Since you have not provided any example data, I wrote my own - it is completely meaningless and the model actually doesn't work because it has bad data coverage but it gets the point across:
library(nlme) # provides gnls()
y <- 1:100
x <- 1:100
year.matrix <- matrix(runif(4500, 1, 10), ncol = 45)
start.values <- c(rep(0.5, 45), 0.01, 0.1) #you could also use setNames here and do this all in one row but that gets really messy
names(start.values) <- c(paste0("t", 1:45), "g", "d")
start.values <- as.list(start.values)
nl_exp2 <- as.formula(paste0("y ~ ", paste(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"), collapse = " + "), " + g*(x^d)"))
gnls(nl_exp2, start=start.values)
This may not be the most efficient way to do it, but since you can pass a string to as.formula it's pretty easy to use paste commands to construct what you are trying to do.
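The same construction can be parameterised by the number of year columns, so that the formula and the starting values always stay in sync. A sketch with three columns instead of 45 (the names `n_years`, `t_names` and `rhs` are made up for illustration; nothing is fitted here):

```r
n_years <- 3
t_names <- paste0("t", seq_len(n_years))

# "t1*year.matrix[,1] + t2*year.matrix[,2] + t3*year.matrix[,3]"
rhs <- paste(sprintf("%s*year.matrix[,%d]", t_names, seq_len(n_years)),
             collapse = " + ")
nl_exp <- as.formula(paste("y ~", rhs, "+ g*(x^d)"))

# Named list of starting values, built in one step with setNames()
start.values <- as.list(setNames(c(rep(0.5, n_years), 0.01, 0.1),
                                 c(t_names, "g", "d")))

nl_exp
names(start.values)
```

Setting n_years to ncol(year.matrix) would reproduce the 45-column case.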
I'm running an LMEM (linear mixed effects model) on some data and comparing the models (in pairs) with the anova function. However, on a particular subset of data, I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
(1 + cond | subj) + (1 + psource + cond | object), data, REML=FALSE)
And this is the model I'm comparing it to: (basically dropping out one of the main effects)
m3_psource <- lmer (totfix ~ psource + cond + psource:cond -
psource + (1 + cond | subj) + (1 + psource + cond | object),
data, REML=FALSE)
Running anova(m3_full, m3_psource) returns Chisq = 0, Pr(>Chisq) = 1.
I'm doing the same for a few other LMEMs and everything seems fine; it's just this particular response value that gives me the weird chi-square and probability values. Does anyone have an idea why, and how I can fix it? Any help will be much appreciated!
This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
We get the same number of contrasts.
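One way to check the equivalence numerically is to fit both formulas to the same made-up response with lm() and compare the fits. A sketch, assuming a replicated 2x2 design with a random response (the data and the names `dd2`, `fit1` and `fit2` are invented for illustration):

```r
set.seed(101)
dd2 <- expand.grid(psource = c("A", "B"), cond = c("a", "b"), rep = 1:3)
dd2$y <- rnorm(nrow(dd2))

fit1 <- lm(y ~ psource + cond + psource:cond, data = dd2)
fit2 <- lm(y ~ psource + cond + psource:cond - psource, data = dd2)

# Same residual degrees of freedom and same residual sum of squares,
# so a comparison between the two fits has nothing left to test
c(df.residual(fit1), df.residual(fit2))
all.equal(deviance(fit1), deviance(fit2))
all.equal(unname(fitted(fit1)), unname(fitted(fit2)))
```

The two fits are reparameterisations of each other, which is consistent with anova() reporting Chisq = 0 in the question.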
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for this being difficult is that the original authors of R and S before it were from school of thought #1, and figured that generally when people were trying to do this it was a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.