It may look like an easy question, but is there a fast and robust way to expand a formula like
f=formula(y ~ a * b )
to
y~a+b+ab
I'd try this:
f = y ~ a * b
reformulate(labels(terms(f)), f[[2]])
# y ~ a + b + a:b
It works on more complicated formulas as well, and relies on R's own formula machinery (terms()) rather than on string manipulation. (I'm assuming you want a useful formula object out, so in the result a:b is nicer than the ab in the question or a*b in d.b's answer.)
f = y ~ a + b * c
reformulate(labels(terms(f)), f[[2]])
# y ~ a + b + c + b:c
f = y ~ a + (b + c + d)^2
reformulate(labels(terms(f)), f[[2]])
# y ~ a + b + c + d + b:c + b:d + c:d
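The same idea can be wrapped in a small helper. This is only a sketch and not part of the answer above: `expand_formula` is a hypothetical name, and the handling of one-sided formulas and of a removed intercept (`- 1`) is my own addition. It assumes the formula contains no `.` placeholder, since `terms()` cannot expand that without data.

```r
# Hypothetical helper: expand a formula into its individual term labels.
expand_formula <- function(f) {
  tt <- terms(f)
  # A two-sided formula has length 3 (`~`, lhs, rhs); a one-sided one has length 2.
  response <- if (length(f) == 3L) f[[2]] else NULL
  reformulate(
    labels(tt),
    response  = response,
    intercept = attr(tt, "intercept") == 1  # keep "- 1" if the input had it
  )
}

expand_formula(y ~ a * b)
expand_formula(y ~ a * b - 1)
expand_formula(~ (a + b + c)^2)
```

Taking the response from `f[[2]]` only when the formula is two-sided avoids silently treating the right-hand side of a one-sided formula as the response.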
vec = all.vars(f)
reformulate(c(vec[2:3], paste(vec[2:3], collapse = "*")), vec[1])
#y ~ a + b + a * b
Function:
(x^3 - 3*x^2 - 2*x + 7) + (y^2 + 2*y)
Output on Wolfram (which I understand):
x^3 - 3 x^2 - 2 x + y^2 + 2 y + 7
Output of the code in R: (using the polynomial function in package polynom)
7 - 2*x^2 + x^3
R code:
library(polynom)
p <- polynomial(c(7,-2,-3, 1))
q <- polynomial(c(0, 2, 1))
p + q
Entered the code as above in R.
You can use mpoly to manipulate multivariate polynomials.
library(mpoly)
p <- as.mpoly(c(7, -2, -3, 1), 'x')
q <- as.mpoly(c(0, 2, 1), 'y')
reorder(p + q)
# x^3 - 3 x^2 - 2 x + y^2 + 2 y + 7
More functionality
https://dkahle.github.io/mpoly/
I have a situation where I need to train a regression model that will have 100 features. I want to look for interaction effects between all 100 features and one other feature. I would like to find a way to do this programmatically, since this analysis is going to be recurring and I don't want to have to write a new formula each time it is run. I want it to be automated. So how can I get a model like this
Y~a*b + a*c + .... a*z
But for 100 terms? How do I get the R formula to do this? Note I will be using statsmodels in python but I think the syntax is the same.
lm(Y ~ a * ., df)
For example:
lm(Sepal.Width ~ Sepal.Length * ., iris)
Call:
lm(formula = Sepal.Width ~ Sepal.Length * ., data = iris)

Coefficients:
                    (Intercept)                   Sepal.Length                   Petal.Length                    Petal.Width
                       -0.91350                        0.82954                        0.29569                        0.85334
              Speciesversicolor               Speciesvirginica      Sepal.Length:Petal.Length       Sepal.Length:Petal.Width
                        0.05894                       -0.89244                       -0.05394                       -0.04654
 Sepal.Length:Speciesversicolor  Sepal.Length:Speciesvirginica
                       -0.32823                       -0.21910
Here is an example of how to construct the desired string and then convert it to a formula:
paste("a", letters[2:26], sep = "*") |>
paste(collapse = " + ") |>
sprintf(fmt = "Y ~ %s") |>
as.formula()
##> Y ~ a * b + a * c + a * d + a * e + a * f + a * g + a * h + a *
##> i + a * j + a * k + a * l + a * m + a * n + a * o + a * p +
##> a * q + a * r + a * s + a * t + a * u + a * v + a * w + a *
##> x + a * y + a * z
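When the feature names come from a data frame rather than from `letters`, the same construction can be driven by the column names. The following is a sketch under assumptions: `interaction_formula` is a hypothetical helper, and it assumes all column names are syntactically valid R names (no spaces or operators), because the strings are parsed by `reformulate()`.

```r
# Hypothetical helper: response ~ focal*x1 + focal*x2 + ... for every other column.
interaction_formula <- function(data, response, focal) {
  others <- setdiff(names(data), c(response, focal))
  reformulate(paste(focal, others, sep = "*"), response = response)
}

f <- interaction_formula(iris, "Sepal.Width", "Sepal.Length")
f
fit <- lm(f, data = iris)
names(coef(fit))
```

Because the formula is rebuilt from `names(data)` on every call, nothing needs to be rewritten when columns are added or removed.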
A solution in Python that builds the formula string from the column names:
# these would be the columns of a dataframe
effects_list = ['regressor_col', 'A', 'B', 'C', 'D', 'E', 'F']
interaction = effects_list[3]
regressor = effects_list[0]
# pair every other column with the interaction term,
# skipping the response and the interaction term itself
terms = [effect + '*' + interaction
         for effect in effects_list
         if effect not in (interaction, regressor)]
formula = regressor + ' ~ ' + ' + '.join(terms)
print(formula)
Say there is a formula:
f1 = as.formula(y~ var1 + var2 + var3)
f1
## y ~ var1 + var2 + var3
Then I want to update the formula by adding a term whose name is stored as a string in the variable a.
a = 'aaabbbccc'
f2 = update(f1, ~ . + a)
f2
## y ~ var1 + var2 + var3 + a
This is not what I expected. I want a to be evaluated in the formula. Then I tried this:
f3 = update(f1, ~ . + get(a))
f3
## y ~ var1 + var2 + var3 + get(a)
Also failed. What I expected is this:
y ~ var1 + var2 + var3 + aaabbbccc
Any help will be highly appreciated!
If you are evaluating these statements in your global environment, then you could do:
f <- y ~ var1 + var2 + var3
a <- as.name("aaabbbccc")
update(f, substitute(~ . + a, env = list(a = a)))
## y ~ var1 + var2 + var3 + aaabbbccc
Otherwise, you could do:
update(f, substitute(~ . + a, env = environment()))
## y ~ var1 + var2 + var3 + aaabbbccc
The important thing is that the value of a in env is a symbol, not a string: as.name("aaabbbccc") or quote(aaabbbccc), but not "aaabbbccc".
Somewhat unintuitively, substitute(expr, env = .GlobalEnv) is equivalent to substitute(expr, env = NULL). That is the only reason why it is necessary to pass list(a = a) (or similar) in the first case.
I should point out that, in this situation, it is not too difficult to create the substitute result yourself, "from scratch":
update(f, call("~", call("+", quote(.), a)))
## y ~ var1 + var2 + var3 + aaabbbccc
This approach has the advantage of being environment-independent, and, for that reason, is probably the one I would use.
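If the new terms arrive as character strings, another option is to let `reformulate()` build the one-sided update formula. This is a sketch of an alternative, not the method described above; `add_terms` is a hypothetical name, and each string must be valid term text because it is parsed.

```r
# Hypothetical helper: append the terms named in a character vector to a formula.
add_terms <- function(f, new_terms) {
  # reformulate(c(".", "aaabbbccc")) yields the one-sided formula ~ . + aaabbbccc
  update(f, reformulate(c(".", new_terms)))
}

f <- y ~ var1 + var2 + var3
a <- "aaabbbccc"
add_terms(f, a)
add_terms(f, c(a, "ddd"))
```

Like the `call()` construction, this does not depend on the environment in which it is evaluated, and it extends to several new terms at once.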
I'm trying to run a simple OLS regression with the restriction that the coefficients of two variables sum to 1.
I want:
Y = α + β1 * x1 + β2 * x2 + β3 * x3,
where β1 + β2 = 1
I have found how to make a relation between coefficients like:
β1 = 2* β2
But I haven't found how to make restrictions like:
β1 = 1 - β2
How would I do it in this simple example?
data <- data.frame(
A = c(1,2,3,4),
B = c(3,2,2,3),
C = c(3,3,2,3),
D = c(5,3,3,4)
)
lm(formula = 'D ~ A + B + C', data = data)
Thanks!
β1 + β2 = 1
To have β1 + β2 = 1 the model you have to fit is
fit <- lm(Y ~ offset(x1) + I(x2 - x1) + x3, data = df)
That is
Y = α + x1 + β2 * (x2 - x1) + β3 * x3
after substituting β1 = 1 - β2. The new regressor is x2 - x1, and offset(x1) fixes the coefficient of x1 at 1.
β1 + β2 + β3 = 1
fit <- lm(Y ~ offset(x1) + I(x2 - x1) + I(x3 - x1), data = df)
Y = α + x1 + β2 * (x2 - x1) + β3 * (x3 - x1)
after substituting β1 = 1 - β2 - β3
β1 + β2 + β3 + ... = 1
I think the pattern is clear: subtract one variable, x1, from each of the remaining constrained variables (x2, x3, ...) and fix the coefficient of x1 at 1 with offset().
Example β1 + β2 = 1
# Data
df <- iris[, 1:4]
colnames(df) <- c("Y", paste0("x", 1:3))
# β1 + β2 = 1
fit <- lm(Y ~ offset(x1) + I(x2 - x1) + x3, data = df)
coef_2 <- coef(fit)
beta_1 <- 1 - coef_2[2]
beta_2 <- coef_2[2]
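The general pattern (subtract one constrained variable from the others and give it a fixed coefficient of 1) can be sketched as a function. `fit_sum_to_one` and its arguments are hypothetical names introduced for illustration; the function assumes syntactic column names and at least one constrained variable.

```r
# Hypothetical helper: OLS in which the coefficients of the variables in
# `constrained` sum to 1; variables in `free` are unrestricted.
fit_sum_to_one <- function(data, response, constrained, free = character()) {
  base  <- constrained[1]
  rest  <- constrained[-1]
  diffs <- sprintf("I(%s - %s)", rest, base)
  rhs   <- c(sprintf("offset(%s)", base), diffs, free)
  fit   <- lm(reformulate(rhs, response = response), data = data)

  # Coefficient order: intercept, the difference terms, then the free terms.
  b          <- unname(coef(fit))
  beta_rest  <- b[1 + seq_along(rest)]
  beta_const <- c(1 - sum(beta_rest), beta_rest)
  names(beta_const) <- constrained
  list(fit = fit, constrained = beta_const)
}

df  <- setNames(iris[, 1:4], c("Y", "x1", "x2", "x3"))
res <- fit_sum_to_one(df, "Y", constrained = c("x1", "x2"), free = "x3")
res$constrained
sum(res$constrained)
```

The coefficient of the base variable is not estimated directly; it is recovered as one minus the sum of the others, which is why the constraint holds by construction.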
1) CVXR We can compute the coefficients using CVXR directly by specifying the objective and constraint. We assume that D is the response, the coefficients of A and B must sum to 1, b[1] is the intercept and b[2], b[3] and b[4] are the coefficients of A, B and C respectively.
library(CVXR)
b <- Variable(4)
X <- cbind(1, as.matrix(data[-4]))
obj <- Minimize(sum((data$D - X %*% b)^2))
constraints <- list(b[2] + b[3] == 1)
problem <- Problem(obj, constraints)
soln <- solve(problem)
bval <- soln$getValue(b)
bval
## [,1]
## [1,] 1.6428605
## [2,] -0.3571428
## [3,] 1.3571428
## [4,] -0.1428588
The objective is the residual sum of squares and it equals:
soln$value
## [1] 0.07142857
2) pracma We can also use the pracma package to compute the coefficients. We specify the X matrix, the response vector, the constraint matrix Aeq (in this case the vector passed as Aeq is regarded as a one-row matrix) and the right-hand side of the constraint, beq.
library(pracma)
lsqlincon(X, data$D, Aeq = c(0, 1, 1, 0), beq = 1) # X is from above
## [1] 1.6428571 -0.3571429 1.3571429 -0.1428571
3) limSolve This package can also solve for the coefficients of regression problems with constraints. The arguments are the same as in (2).
library(limSolve)
lsei(X, data$D, c(0, 1, 1, 0), 1)
giving:
$X
                    A          B          C
 1.6428571 -0.3571429  1.3571429 -0.1428571
$residualNorm
[1] 0
$solutionNorm
[1] 0.07142857
$IsError
[1] FALSE
$type
[1] "lsei"
4) nls This can be formulated as a problem for nls with the B coefficient equal to one minus the A coefficient.
nls(D ~ b0 + b1 * A + (1-b1) * B + b2 * C, data,
start = list(b0 = 1, b1 = 1, b2 = 1))
## D ~ b0 + b1 * A + (1 - b1) * B + b2 * C
## data: data
## b0 b1 b2
## 1.6429 -0.3571 -0.1429
## residual sum-of-squares: 0.07143
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 2.803e-08
Check
We can double check the above by using the lm approach in the other answer:
lm(D ~ I(A-B) + C + offset(B), data)
giving:
Call:
lm(formula = D ~ I(A - B) + C + offset(B), data = data)
Coefficients:
(Intercept) I(A - B) C
1.6429 -0.3571 -0.1429
The I(A-B) coefficient equals the coefficient of A in the original formulation and one minus it is the coefficient of B. We see that all approaches do lead to the same coefficients.
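As a further cross-check that needs no extra packages, the equality-constrained least-squares problem can be solved directly from its KKT (Lagrange multiplier) system in base R. This sketch is not one of the approaches listed above; it assumes X has full column rank so that the system is solvable, and it rebuilds `data` and `X` so that it runs on its own.

```r
data <- data.frame(
  A = c(1, 2, 3, 4),
  B = c(3, 2, 2, 3),
  C = c(3, 3, 2, 3),
  D = c(5, 3, 3, 4)
)
X   <- cbind(1, as.matrix(data[c("A", "B", "C")]))
y   <- data$D
Aeq <- matrix(c(0, 1, 1, 0), nrow = 1)  # coefficient of A + coefficient of B
beq <- 1

# KKT system for: minimise ||y - X b||^2 subject to Aeq b = beq
#   [ 2 X'X  Aeq' ] [ b      ]   [ 2 X'y ]
#   [ Aeq    0    ] [ lambda ] = [ beq   ]
K   <- rbind(cbind(2 * crossprod(X), t(Aeq)),
             cbind(Aeq, 0))
rhs <- c(2 * crossprod(X, y), beq)
b   <- solve(K, rhs)[1:4]
names(b) <- c("(Intercept)", "A", "B", "C")
b
```

The first four entries of the solution are the regression coefficients; the fifth is the Lagrange multiplier and is discarded here.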
I have a formula in R for example
y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
or even more complicated (y~x*z*xx*zz*tt)
Note that the names on the right-hand side of the formula are intentionally selected to be somehow similar to at least one other term.
The question is now how to remove the interaction terms that are related to a specific main effect. For example, if I remove the main effect x, I also want to remove the interaction terms that include x, here x:xx and x:zz.
I have tried grepl(), but it removes any term whose label contains the pattern anywhere. In my example it removes x, xx, x:xx, x:zz, xx:z, zz:xx and xx:zz:tt.
Any ideas for a function that does this?
Update:
What I have already tried:
f = y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
modelTerms = attr(terms(f) , which = 'term.labels')
modelTerms[!grepl(pattern = 'x', x = modelTerms)]
Use update.formula:
f <- y~x*z*xx*zz*tt
update(f, . ~ . - x - x:.)
#y ~ z + xx + zz + tt + z:xx + z:zz + xx:zz + z:tt + xx:tt + zz:tt +
# z:xx:zz + z:xx:tt + z:zz:tt + xx:zz:tt + z:xx:zz:tt
f <- y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
update(f, . ~ . - x - x:.)
#y ~ z + xx + zz + tt + z:xx + xx:zz + xx:zz:tt
Are you looking for this?
> modelTerms[!grepl(pattern='^x\\:x+', x=modelTerms)]
[1] "x" "z" "xx" "zz" "tt" "x:zz" "z:xx" "xx:zz"
[9] "xx:zz:tt"
Simple:
f = y~x*z*xx*zz*tt
modelTerms = attr(terms(f) , which = 'term.labels')
l = sapply(
  # split each label into its component variables, then test for an exact match
  strsplit(x = modelTerms, split = '[:*]'),
  FUN = function(x) {
    'x' %in% x
  }
)
modelTerms[!l]
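The exact-match idea can be turned into a function that returns a formula rather than a vector of labels. This is a sketch: `drop_effect` is a hypothetical name, it assumes a two-sided formula, and it matches variable names exactly, so a term such as log(x) would not count as containing x. `reformulate()` raises an error if no terms remain.

```r
# Hypothetical helper: remove a main effect and every interaction containing it.
drop_effect <- function(f, var) {
  labs  <- labels(terms(f))
  # Term labels join their variables with ":", e.g. "xx:zz:tt".
  parts <- strsplit(labs, ":", fixed = TRUE)
  uses_var <- vapply(parts, function(p) var %in% p, logical(1))
  reformulate(labs[!uses_var], response = f[[2]])
}

f <- y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
drop_effect(f, "x")
drop_effect(y ~ x * z * xx * zz * tt, "x")
```

Comparing whole components with `%in%` is what keeps xx and zz:xx in the model when x is dropped, which is where a plain `grepl('x', ...)` goes wrong.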