Defining an infix operator for use within a formula

I'm trying to create a more parsimonious version of this solution, which entails specifying the RHS of a formula in the form d1 + d1:d2.
Given that * in the context of a formula is a pithy stand-in for full interaction (i.e. d1 * d2 gives d1 + d2 + d1:d2), my approach has been to try to define an alternative operator, say %+:%, using the infix approach I've grown accustomed to in other applications, a la:
"%+:%" <- function(d1, d2) d1 + d1:d2
However, this predictably fails because I haven't been careful about evaluation; let's introduce an example to illustrate my progress:
set.seed(1029)
v1 <- runif(1000)
v2 <- runif(1000)
y <- .8*(v1 < .3) + .2 * (v2 > .25 & v2 < .8) -
.4 * (v2 > .8) + .1 * (v1 > .3 & v2 > .8)
With this example, hopefully it's clear why simply writing out the two terms might be undesirable:
y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3)
One workaround which is close to my desired output is to define the whole formula as a function:
plus.times <- function(outvar, d1, d2){
as.formula(paste0(quote(outvar), "~", quote(d1),
"+", quote(d1), ":", quote(d2)))
}
This gives the expected coefficients when passed to lm, but with names that are harder to interpret directly (especially in the real data where we take care to give d1 and d2 descriptive names, in contrast to this generic example):
out1 <- lm(y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3))
out2 <- lm(plus.times(y, cut(v2, breaks = c(0, .25, .8, 1)), I(v1 < .3)))
any(out1$coefficients != out2$coefficients)
# [1] FALSE
names(out2$coefficients)
# [1] "(Intercept)" "d1(0.25,0.8]" "d1(0.8,1]" "d1(0,0.25]:d2TRUE"
# [5] "d1(0.25,0.8]:d2TRUE" "d1(0.8,1]:d2TRUE"
So this is less than optimal.
Is there any way to adjust the code so that the infix operator I mentioned above works as expected? How about altering the form of plus.times so that the variables are not renamed?
I've been poking around (?formula, ?"~", ?":", getAnywhere(formula.default), this answer, etc.) but haven't seen how exactly R interprets * when it's encountered in a formula so that I can make my desired minor adjustments.

You do not need to define a new operator in this case: in a formula d1/d2 expands to d1 + d1:d2. In other words d1/d2 specifies that d2 is nested within d1. Continuing your example:
out3 <- lm(y ~ cut(v2,breaks=c(0,.25,.8,1))/I(v1 < .3))
all.equal(coef(out1), coef(out3))
# [1] TRUE
Further comments
Factors may be crossed or nested. Two factors are crossed if it is possible to observe every combination of levels of the two factors, e.g. sex and treatment, temperature and pH, etc. A factor is nested within another if each level of that factor can only be observed within one of the levels of the other factor, e.g. town and country, staff member and store, etc.
These relationships are reflected in the parametrization of the model. For crossed factors we use d1*d2 or d1 + d2 + d1:d2, to give the main effect of each factor, plus the interaction. For nested factors we use d1/d2 or d1 + d1:d2 to give a separate submodel of the form 1 + d2 for each level of d1.
The idea of nesting is not restricted to factors, for example we may use sex/x to fit a separate linear regression on x for males and females.
In a formula, %in% is equivalent to :, but it may be used to emphasize the nested, or hierarchical structure of the data/model. For example, a + b %in% a is the same as a + a:b, but reading it as "a plus b within a" gives a better description of the model being fitted. Even so, using / has the advantage of simplifying the model formula at the same time as emphasizing the structure.
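If you want to see exactly how R expands these operators (which the original question asked about), terms() reports the expanded term labels. A small sketch, separate from the answer above:
attr(terms(y ~ d1 * d2), "term.labels")  # "d1"  "d2"  "d1:d2"  -- crossed
attr(terms(y ~ d1 / d2), "term.labels")  # "d1"  "d1:d2"        -- nested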

Related

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. I usually use revoScaleR::rxGlm() for this because I work with large data sets and use quite complex model formulae - and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1)) # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
data = df_reprex,
pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many others seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly.
Two things to try:
Expand the data and get rid of the weights statement
Use cbind(y, x - y) ~ 1 in either rxLogit or rxGlm, without weights and without expanding the data
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to a single 1 or 0 response, and this expanded data is then run through a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
Does rxLogit allow a cbind representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size… From the rxLogit page I'm guessing not for rxLogit, but rxGlm might allow it, given the following note on the formula page:
A formula typically consists of a response, which in most RevoScaleR functions can be a single variable or multiple variables combined using cbind, the "~" operator, and one or more predictors, typically separated by the "+" operator. The rxSummary function typically requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1), # number of successes
trial=c("first", "second", "third", "fourth")) # trial label
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
df_reprex_expanded <- data.frame(y=c(0,1,0,0,1,0),
trial=c("first","second","third", "third", "fourth", "fourth"))
## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
family = binomial,
data = df_reprex_expanded)
exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1])) # overall fitted average 0.333 - correct
## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
glm_2a <- rxLogit(y ~ 1,
data = df_reprex_expanded)
exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1])) # overall fitted average ???
# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y,x-y)~1,
data=df_reprex)
exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1])) # overall fitted average ???
# cbind() + rxGlm + family=binomial FTW(?)
glm_2c <- rxGlm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1])) # overall fitted average ???
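Since I can't test the RevoScaleR side myself, here is a base-R-only sketch of how the one-row-per-trial expansion could be built programmatically from the df_reprex defined above (expand_binomial is a hypothetical helper, not a RevoScaleR function), so the idea scales beyond this reprex:
expand_binomial <- function(successes, trials, labels) {
  # one row per Bernoulli trial: the 1s first, then the 0s, for each original row
  data.frame(
    y     = unlist(Map(function(s, n) rep(c(1, 0), c(s, n - s)), successes, trials)),
    trial = rep(labels, trials)
  )
}
df_reprex_expanded2 <- expand_binomial(df_reprex$y, df_reprex$x, df_reprex$trial)
identical(df_reprex_expanded2$y, df_reprex_expanded$y)  # expected TRUE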

I() function and its use cases [duplicate]

I haven't been able to find an answer to this question, largely because googling anything with a standalone letter (like "I") causes issues.
What does the "I" do in a model like this?
data(rock)
lm(area~I(peri - mean(peri)), data = rock)
Considering that the following does NOT work:
lm(area ~ (peri - mean(peri)), data = rock)
and that this does work:
rock$peri - mean(rock$peri)
Any key words on how to research this myself would also be very helpful.
I isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.
For example:
y ~ x + x^2
would, to R, mean "give me:
x = the main effect of x, and
x^2 = the main effect and the second order interaction of x",
not the intended x plus x-squared:
> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
y x
1 -1.4355144 -1.85374045
2 0.3620872 -0.07794607
3 -1.7590868 0.96856634
4 -0.3245440 0.18492596
5 -0.6515630 -1.37994358
This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
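To see that expansion directly (a quick sketch, not part of the original answer), terms() reports which term labels R actually keeps:
attr(terms(y ~ x + x^2), "term.labels")
# [1] "x"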
To get the usual operator, you need to use I() to isolate the call from the formula code:
> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
y x I(x^2)
1 -0.02881534 1.0865514 1.180593....
2 0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458 0.8344739 0.696346....
5 0.65522764 -0.9676520 0.936350....
(that last column is correct, it just looks odd because it is of class AsIs.)
In your example, - when used in a formula would indicate removal of a term from the model, whereas you wanted - to have its usual binary operator meaning of subtraction:
> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5), :
variable lengths differ (found for 'mean(x)')
This fails because mean(x) is a length-1 vector and model.frame() quite rightly tells you that this doesn't match the length of the other variables. A way around this is I():
> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
y I(x - mean(x))
1 1.1727063 1.142200....
2 -1.4798270 -0.66914....
3 -0.4303878 -0.28716....
4 -1.0516386 0.542774....
5 1.5225863 -0.72865....
Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).
From the docs:
Function I has two main uses.
In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.
To address this point:
df1 <- data.frame(stringi = I("dog"))
df2 <- data.frame(stringi = "dog")
str(df1)
str(df2)
In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.
To address this point:
lm(mpg ~ disp + drat, mtcars)
lm(mpg ~ I(disp + drat), mtcars)
The second line "creates a new predictor" that is the literal sum of disp and drat.
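As a quick check (a sketch using the built-in mtcars data), comparing the fitted coefficients makes the difference concrete:
coef(lm(mpg ~ disp + drat, mtcars))     # separate coefficients for disp and drat
coef(lm(mpg ~ I(disp + drat), mtcars))  # a single coefficient on the sum disp + drat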

Nonlinear model with many independent variables (fixed effects) in R

I'm trying to fit a nonlinear model with nearly 50 variables (since there are year fixed effects). The problem is that I have so many variables that I cannot write the complete formula down, like
nl_exp = as.formula(y ~ t1*year.matrix[,1] + t2*year.matrix[,2]
+ ... + t45*year.matrix[,45] + g*(x^d))
nl_model = gnls(nl_exp, start=list(t=0.5, g=0.01, d=0.1))
where y is the binary response variable, year.matrix is a matrix of 45 columns (indicating 45 different years), and x is the independent variable. The parameters to be estimated are t1, t2, ..., t45, g, and d.
I have good starting values for t1, ..., t45, g, d. But I don't want to write a long formula for this nonlinear regression.
I know that if the model is linear, the expression can be simplified using
l_model = lm(y ~ factor(year) + ...)
I tried factor(year) in the gnls function, but it does not work.
Besides, I also tried
nl_exp2 = as.formula(y ~ t*year.matrix + g*(x^d))
nl_model2 = gnls(nl_exp2, start=list(t=rep(0.2, 45), g=0.01, d=0.1))
It also returns an error message.
So, is there any easy way to write down the nonlinear formula and the starting values in R?
Since you have not provided any example data, I wrote my own - it is completely meaningless, and the model actually doesn't work because it has bad data coverage, but it gets the point across:
y <- 1:100
x <- 1:100
year.matrix <- matrix(runif(4500, 1, 10), ncol = 45)
start.values <- c(rep(0.5, 45), 0.01, 0.1) # you could also use setNames() here and do this all in one line, but that gets really messy
names(start.values) <- c(paste0("t", 1:45), "g", "d")
start.values <- as.list(start.values)
nl_exp2 <- as.formula(paste0("y ~ ", paste(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"), collapse = " + "), " + g*(x^d)"))
gnls(nl_exp2, start=start.values)
This may not be the most efficient way to do it, but since you can pass a string to as.formula it's pretty easy to use paste commands to construct what you are trying to do.
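As an alternative sketch (not from the original answer), stats::reformulate() assembles a formula from character term labels in much the same way, so it should give an equivalent formula here:
term.labels <- c(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"), "g*(x^d)")
nl_exp3 <- reformulate(term.labels, response = "y")
gnls(nl_exp3, start = start.values)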

How to express membership in multiple categories in R?

How does one express a linear model where observations can belong to multiple categories and the number of categories is large?
For example, using time dummies as the categories, here is a problem that is easy to set up since the number of categories (time periods) is small and known:
tmp <- "day 1, day 2
0,1
1,0
1,1"
periods <- read.csv(text = tmp)
y <- rnorm(3)
print(lm(y ~ day.1 + day.2 + 0, data=periods))
Now suppose that instead of two days there were 100. Would I need to create a formula like the following?
y ~ day.1 + day.2 + ... + day.100 + 0
Presumably such a formula would have to be created programmatically. This seems inelegant and un-R-like.
What is the right R way to tackle this? For example, aside from the formula problem, is there a better way to create the dummies than creating a matrix of 1s and 0s (as I did above)? For the sake of concreteness, say that the actual data consists (for each observation) of a start and end date (so that tmp would contain a 1 in each column between start and end).
Update:
Based on the answer of @jlhoward, here is a larger example:
num.observations <- 1000
# Manually create 100 columns of dummies called x1, ..., x100
periods <- data.frame(1*matrix(runif(num.observations*100) > 0.5, nrow = num.observations))
y <- rnorm(num.observations)
print(summary(lm(y ~ ., data = periods)))
It illustrates the manual creation of a data frame of dummies (1s and 0s). I would be interested in learning whether there is a more R-like way of dealing with this "multiple dummies per observation" issue.
You can use the . notation to include all variables other than the response in a formula, and -1 to remove the intercept. Also, put everything in your data frame; don't make y a separate vector.
set.seed(1) # for reproducibility
df <- data.frame(y=rnorm(3),read.csv(text=tmp))
fit.1 <- lm(y ~ day.1 + day.2 + 0, df)
fit.2 <- lm(y ~ -1 + ., df)
identical(coef(fit.1),coef(fit.2))
# [1] TRUE
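For the start/end-date case mentioned in the update, the dummy matrix itself can also be built programmatically rather than by hand. A sketch, where start, end and n.days are made-up stand-ins for the real dates:
set.seed(1)
n.obs  <- 100
n.days <- 10
start  <- sample(1:8, n.obs, replace = TRUE)                        # first day each observation is "on"
end    <- pmin(start + sample(0:3, n.obs, replace = TRUE), n.days)  # last day it is "on"
# 1 if day j falls between start[i] and end[i], else 0
periods <- as.data.frame(1 * (outer(start, 1:n.days, "<=") &
                              outer(end,   1:n.days, ">=")))
names(periods) <- paste0("day.", 1:n.days)
periods$y <- rnorm(n.obs)
lm(y ~ . - 1, data = periods)  # same "-1 + ." idea as in the answer above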


Resources