Removing the interaction terms when the main effect is removed - r

I have a formula in R, for example
y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
or an even more complicated one (y ~ x*z*xx*zz*tt).
Note that the names on the right-hand side of the formula are intentionally chosen so that each one resembles at least one other term.
The question is how to remove the interaction terms that involve a specific main effect. For example, if I remove the term x (a main effect), I also want to remove the interaction terms that include x, here x:xx and x:zz.
I have tried grepl(), but it removes any term whose name contains the pattern, fully or partially. In my example it removes x, xx, x:xx, x:zz, xx:z, zz:xx and xx:zz:tt.
Any ideas for a function that does this?
Update:
What I have already tried:
f = y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
modelTerms = attr(terms(f) , which = 'term.labels')
modelTerms[!grepl(pattern = 'x', x = modelTerms)]

Use update.formula:
f <- y~x*z*xx*zz*tt
update(f, . ~ . - x - x:.)
#y ~ z + xx + zz + tt + z:xx + z:zz + xx:zz + z:tt + xx:tt + zz:tt +
# z:xx:zz + z:xx:tt + z:zz:tt + xx:zz:tt + z:xx:zz:tt
f <- y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
update(f, . ~ . - x - x:.)
#y ~ z + xx + zz + tt + z:xx + xx:zz + xx:zz:tt

Are you looking for this?
> modelTerms[!grepl(pattern='^x\\:x+', x=modelTerms)]
[1] "x" "z" "xx" "zz" "tt" "x:zz" "z:xx" "xx:zz"
[9] "xx:zz:tt"
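
If the goal is to drop every term that involves x (not only x:xx), one option is to anchor the pattern on the ':' separators so that xx no longer matches. This is only a sketch, and it assumes that no term label contains ':' inside a function call such as I(a:b):

```r
f <- y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
modelTerms <- attr(terms(f), "term.labels")

# Match "x" only when it is a whole component of the label:
# preceded by the start of the string or ':', followed by ':' or the end.
involves_x <- grepl("(^|:)x(:|$)", modelTerms)

kept <- modelTerms[!involves_x]
kept
```

With this pattern, x, x:xx and x:zz are flagged, while xx, z:xx, xx:zz and xx:zz:tt are kept.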

Simple:
f = y~x*z*xx*zz*tt
modelTerms = attr(terms(f) , which = 'term.labels')
l = sapply(
  strsplit(x = modelTerms, split = '[:*]'),
  FUN = function(x) {
    'x' %in% x
  }
)
modelTerms[!l]
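
The snippet above returns term labels rather than a formula. A possible way to get a formula back is to wrap the same idea in a small function and rebuild with reformulate(). The name drop_effect is hypothetical, and the sketch assumes a two-sided formula with at least one term left after the removal:

```r
# Hypothetical helper: drop a main effect and every interaction containing it.
drop_effect <- function(f, var) {
  labs <- attr(terms(f), "term.labels")
  # Split each label into its component variables and test for exact membership
  parts <- strsplit(labs, ":", fixed = TRUE)
  has_var <- vapply(parts, function(p) var %in% p, logical(1))
  # Rebuild the formula from the remaining labels, keeping the response
  reformulate(labs[!has_var], response = f[[2]])
}

f <- y ~ x + z + xx + zz + tt + x:xx + x:zz + xx:z + zz:xx + xx:zz:tt
g <- drop_effect(f, "x")
g
```

Because the comparison is done with %in% on whole components, removing x leaves xx and zz untouched.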

Related

Shortening the formula syntax of a regression model

I was wondering if the syntax of the regression model below could be made more concise (shorter) than it currently is?
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/v/main/bv1.csv')
library(nlme)
model <- lme(achieve ~ 0 + D1 + D2 +
               D1:time + D2:time +
               D1:schcontext + D2:schcontext +
               D1:female + D2:female +
               D1:I(female*time) + D2:I(female*time) +
               D1:I(schcontext*time) + D2:I(schcontext*time),
             correlation = corSymm(),
             random = ~0 + D1:time | schcode/id, data = dat,
             weights = varIdent(form = ~1|factor(math)),
             na.action = na.omit,
             control = lmeControl(maxIter = 200, msMaxIter = 200, niterEM = 50,
                                  msMaxEval = 400))
coef(summary(model))
Focusing on the fixed-effect component only.
Original formula:
form1 <- ~ 0 + D1 + D2+
D1:time + D2:time+
D1:schcontext + D2:schcontext +
D1:female + D2:female+
D1:I(female*time) + D2:I(female*time)+
D1:I(schcontext*time) + D2:I(schcontext*time)
X1 <- model.matrix(form1, data=dat)
I think this is equivalent
form2 <- ~0 +
D1 + D2 +
(D1+D2):(time + schcontext + female + female:time+schcontext:time)
X2 <- model.matrix(form2, data=dat)
(Unfortunately ~ 0 + (D1 + D2):(1 + time + ...) doesn't work as I would have liked/expected.)
For a start, the model matrix has the right dimensions. Staring at the column names of the model matrices and reordering the columns manually:
X2o <- X2[,c(1:3,6,4,7,5,8,9,11,10,12)]
all.equal(c(X1),c(X2o)) ##TRUE
(For numerical predictors, you don't need I(A*B): A:B is equivalent.)
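The equivalence of I(A*B) and A:B for numeric predictors can be checked on a small example. This is a sketch on made-up data (toy, A and B are not part of the question's dataset):

```r
set.seed(1)
toy <- data.frame(A = rnorm(5), B = rnorm(5))

# Both formulas should produce a single column holding the product A * B
m1 <- model.matrix(~ 0 + I(A * B), data = toy)
m2 <- model.matrix(~ 0 + A:B, data = toy)

# c() drops the dimensions and column names, so only the values are compared
same_values <- isTRUE(all.equal(c(m1), c(m2)))
same_values
```

Only the column names differ ("I(A * B)" versus "A:B"). The equivalence does not carry over to factors, where ':' builds indicator columns instead of a product.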
Actually you can do a little better using the * operator
form3 <- ~0 +
D1 + D2 +
(D1+D2):(time*(schcontext+female))
X3 <- model.matrix(form3, data=dat)
X3o <- X3[,c(1:3,6,4,7,5,8,10,12,9,11)]
all.equal(c(X1),c(X3o)) ## TRUE
Compare formula length:
sapply(list(form1,form2,form3),
function(x) nchar(as.character(x)[[2]]))
## [1] 183 84 54
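
The manual column reordering used above could also be replaced by matching columns on their values. A minimal sketch, with a hypothetical helper same_columns() and made-up toy data standing in for the real dataset:

```r
# Hypothetical helper: TRUE if X2 has the same columns as X1 in any order
same_columns <- function(X1, X2, tol = 1e-8) {
  if (!identical(dim(X1), dim(X2))) return(FALSE)
  used <- logical(ncol(X2))
  for (j in seq_len(ncol(X1))) {
    # X2 - X1[, j] subtracts column j of X1 from every column of X2
    hit <- which(!used & colSums(abs(X2 - X1[, j])) < tol)
    if (length(hit) == 0) return(FALSE)
    used[hit[1]] <- TRUE   # each column of X2 may be matched only once
  }
  TRUE
}

toy <- data.frame(D1 = c(1, 0, 1, 0, 1, 0), D2 = c(0, 1, 0, 1, 0, 1),
                  time = c(0, 0, 1, 1, 2, 2), female = c(0, 1, 1, 0, 1, 1))
Xa <- model.matrix(~ 0 + D1 + D2 + D1:time + D2:time +
                     D1:I(female*time) + D2:I(female*time), data = toy)
Xb <- model.matrix(~ 0 + D1 + D2 + (D1 + D2):(time + female:time), data = toy)
same_columns(Xa, Xb)
```

The check ignores column names, so it does not care that one matrix labels a column with I(female * time) and the other with a ':' interaction.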

R: Dynamically update formula

How can I dynamically update a formula?
Example:
myvar <- "x"
update(y ~ 1 + x, ~ . -x)
# y ~ 1 (works as intended)
update(y ~ 1 + x, ~ . -myvar)
# y ~ x (doesn't work as intended)
update(y ~ 1 + x, ~ . -eval(myvar))
# y ~ x (doesn't work as intended)
You can use paste() within the update() call.
myvar <- "x"
update(y ~ 1 + x, paste(" ~ . -", myvar))
# y ~ 1
Edit
As @A.Fischer noted in the comments, this won't work if myvar is a vector of length > 1:
myvar <- c("k", "l")
update(y ~ 1 + k + l + m, paste(" ~ . -", myvar))
# y ~ l + m
# Warning message:
# Using formula(x) is deprecated when x is a character vector of length > 1.
# Consider formula(paste(x, collapse = " ")) instead.
Only "k" gets removed; "l" remains in the formula.
In this case we could convert the formula to strings, add or remove the terms we want to change, and rebuild the formula using reformulate(), something like:
FUN <- function(fo, x, negate = FALSE) {
  foc <- as.character(fo)  # c("~", response, right-hand side)
  s <- strsplit(foc[3], " + ", fixed = TRUE)[[1]]  # individual terms
  if (negate) {
    reformulate(s[!s %in% x], foc[2], env = .GlobalEnv)
  } else {
    reformulate(c(s, x), foc[2], env = .GlobalEnv)
  }
}
fo <- y ~ 1 + k + l + m
FUN(fo, c("n", "o")) ## add variables
# y ~ 1 + k + l + m + n + o
FUN(fo, c("k", "l"), negate = TRUE) ## remove variables
# y ~ 1 + m
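
A lighter alternative for the vector case might be to collapse myvar into a single string first, so that update() receives one formula instead of several. A sketch under the assumption that every element of myvar is a plain term name:

```r
myvar <- c("k", "l")

# Build one string, "~ . - k - l", before handing it to update()
rhs <- paste("~ . -", paste(myvar, collapse = " - "))
rhs

res <- update(y ~ 1 + k + l + m, rhs)
res
```

This keeps update()'s own simplification of the formula, whereas the reformulate() approach above works on the deparsed text of the right-hand side.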

how to insert more densely sampled auxiliary variables into the georob package in Rstudio

I need to make a prediction of a soil variable as a function of auxiliary variables in the georob package.
My soil dataset has 200 observations and my auxiliary-variable dataset has 19940 points; however, in the code I can't enter the coordinates of the auxiliary variables as prediction points.
dat= read.csv("malhas amostrais/solo_200.csv", sep = ",")
covar = read.csv("../dados/csv/variaveis_auxiliares.csv", sep = ";")
ku_georob_cpeso <- georob(argila ~ CV + CH + dist_bebedouros + Eca_0.5m + Eca_1m + elevacao + IH_0.5m + sd_ndvi_01 + sd_ndvi_02 + twi + S_P_T + sd_b4 +sd_b5 + sd_b6+ sd_b7,
data= dat,
locations= ~ x + y,
variogram.model="RMexp",
param=c(variance=200, nugget=600, scale=150),
verbose = 3,
psi.func = "huber")
ku_georob_cpeso <- georob(argila ~ CV + CH + dist_bebedouros + Eca_0.5m + Eca_1m + elevacao + IH_0.5m + sd_ndvi_01 + sd_ndvi_02 + twi + S_P_T + sd_b4 +sd_b5 + sd_b6+ sd_b7,
data= dat1,
subset = cova,
locations= ~ x + y,
variogram.model="RMexp",
param=c(variance=200, nugget=600, scale=150),
verbose = 3,
psi.func = "huber")
I receive the error:
Error in xj[i] : invalid subscript type 'list'

Expand the R formula

It may look like an easy question but is there any fast and robust way to expand a formula like
f=formula(y ~ a * b )
to
y~a+b+ab
I'd try this:
f = y ~ a * b
reformulate(labels(terms(f)), f[[2]])
# y ~ a + b + a:b
It works on more complicated formulas as well, and relies on more internals. (I'm assuming you want a useful formula object out, so in the result a:b is nicer than the ab in the question or a*b in d.b's answer.)
f = y ~ a + b * c
reformulate(labels(terms(f)), f[[2]])
# y ~ a + b + c + b:c
f = y ~ a + (b + c + d)^2
reformulate(labels(terms(f)), f[[2]])
# y ~ a + b + c + d + b:c + b:d + c:d
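Note that f[[2]] is the response only for two-sided formulas, and that reformulate() adds an intercept by default. A sketch of a wrapper that also handles one-sided formulas and a removed intercept (expand_formula is a hypothetical name, and formulas with no terms at all, such as y ~ 1, are not handled):

```r
# Hypothetical helper: expand a formula into its individual term labels
expand_formula <- function(f) {
  tt <- terms(f)
  # attr(tt, "response") is 1 for two-sided formulas and 0 for one-sided ones
  resp <- if (attr(tt, "response") == 1) f[[2]] else NULL
  reformulate(labels(tt), response = resp,
              intercept = attr(tt, "intercept") == 1)
}

e1 <- expand_formula(y ~ a * b)
e2 <- expand_formula(~ 0 + a * b)
e1
e2
```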
# for the simple two-variable formula from the question
f = y ~ a * b
vec = all.vars(f)
reformulate(c(vec[2:3], paste(vec[2:3], collapse = "*")), vec[1])
#y ~ a + b + a * b

Different colours for values above / below a linear trend line

I'm using ggplot to plot a time series with a linear regression line. I would like to have different colours for my time series depending on whether it is above or below the trend line.
Here is a code example to plot the series and the corresponding trend line with different colours for the series and the line:
library(ggplot2)
x <- seq(as.Date("2000/1/1"), as.Date("2010/1/1"), "years")
y <- rnorm(length(x),0,10)
df <- data.frame(x,y)
ggplot(df, aes(x, y)) +
stat_smooth(method = 'lm', aes(colour = 'Trend'), se = FALSE) +
geom_line(aes(colour = 'Observation') ) +
theme_bw() +
xlab("x") +
ylab("y") +
scale_colour_manual(values = c("blue","red"))
Have a nice day!
I got rid of the dates, since they were driving me nuts. Perhaps someone can add a solution for that. Otherwise it seems quite doable, with some basic high school maths.
df <- data.frame(x = 2000:2010,
y = rnorm(11, 0, 10))
fm <- lm(y ~ x, data = df)
co <- coef(fm)
df$under_over <- sign(fm$residuals)
for (i in 1:(nrow(df) - 1)) {
  # Get slope and intercept for line segment
  slope <- (df$y[i + 1] - df$y[i]) / (df$x[i + 1] - df$x[i])
  int <- df$y[i] - slope * df$x[i]
  # find where they would cross
  x <- (co[1] - int) / (slope - co[2])
  y <- slope * x + int
  # if that is in the range of the segment it is a crossing, add to the data
  if (x > df$x[i] & x < df$x[i + 1])
    df <- rbind(df, c(x = x, y = y, under_over = NA))
}
# order by x
df <- df[order(df$x), ]
# find color for intersections
for (i in 1:nrow(df))
  if (is.na(df$under_over[i]))
    df$under_over[i] <- df$under_over[i + 1]
ggplot(df) +
geom_abline(intercept = co[1], slope = co[2]) +
geom_path(aes(x, y, col = as.factor(under_over), group = 1)) +
theme_bw()
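
For the dates, one possibility is to do the crossing arithmetic on as.numeric(dates) (days since 1970-01-01) and convert back afterwards. The following is only a sketch: crossings() is a hypothetical helper, the data are made up, and it finds the crossing points by interpolating the residuals instead of intersecting slopes as in the loop above.

```r
# Hypothetical helper: points where the polyline (x, y) crosses y = a + b * x
crossings <- function(x, y, a, b) {
  r <- y - (a + b * x)                      # signed distance to the trend line
  i <- which(r[-length(r)] * r[-1] < 0)     # segments whose ends lie on opposite sides
  w <- r[i] / (r[i] - r[i + 1])             # fraction of the segment where r reaches 0
  data.frame(x = x[i] + w * (x[i + 1] - x[i]),
             y = y[i] + w * (y[i + 1] - y[i]))
}

dates <- seq(as.Date("2000-01-01"), as.Date("2004-01-01"), by = "year")
y <- c(1, -1, 2, 3, -4)
fit <- lm(y ~ as.numeric(dates))
a <- coef(fit)[[1]]
b <- coef(fit)[[2]]

cr <- crossings(as.numeric(dates), y, a, b)
cr$x <- as.Date(cr$x, origin = "1970-01-01")  # back to Date for plotting
cr
```

The rows of cr could then be appended to the data and coloured like the intersection points in the loop above.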
