Model-based partitioning with "two-layer interaction" (segmented models) - r

I am trying to build a model-based tree with a type of "two-layer interaction" where the models in the nodes of the tree are segmented again.
I am using the mob() function to this aim but I could not manage to make the argument for the fit function work with the lmtree() function.
In the following example a is function of b and the relationship between a and b depends on d and on b | d.
library("partykit")
set.seed(321)
b <- runif(200)
d <- sample(1:2, 200, replace = TRUE)
a <- jitter(ifelse(d == 1, 2 * b - 1, 4 * b - 1.2), amount = .1)
a[b < .5 & d == 1] <- jitter(rep(0, length(a[b < .5 & d == 1])))
a[b < .3 & d == 2] <- jitter(rep(0, length(a[b < .3 & d == 2])))
fit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ..., estfun = FALSE, object = FALSE)
{
x <- x[, 2]
l <- lmtree(y ~ x | b)
return(l)
}
m <- mob(a ~ b | d, fit = fit) # not working
Of course with this simple example I could use lmtree(a ~ b | d + b) to find every interaction but is there a way to use as fit function of mob() a lmtree()?

No but yes ;-)
No, lmtree() cannot be used easily as a fitter for a mob().
The dimension of the inner tree (lmtree()) is not fixed, i.e., you may get a tree without any partition or with many subgroups, and this would be confusing for the outer tree (mob()).
Even if one worked around the dimension issue or fixed it by always forcing one break, one would need more work to set up the right coefficient vector, matrix of estimating functions, etc. This is also not straightforward because the convergence rate (and hence the inference) is different if breakpoints are given (e.g., for a binary factor) or have to be estimated (such as for your numeric variables b).
The way you set up your fit() function, the inner lmtree() does not know where to find b. All it has is a numeric vector y and a numeric matrix x but not the original data.
But yes, I think that all of these issues can be addressed if changing the view from fitting a "two-layer" tree to fitting a "segmented" model inside a tree. My impression is that you want to fit a model y ~ x (or a ~ b in your example) where a piecewise linear function is used with an additional breakpoint in x. If the piecewise linear function is supposed to be continuous in x, then the segmented package can be easily used. If not, then strucchange could be leveraged. Assuming you want the former (as you have simulated your data like this), I include a worked segmented example below (and also slightly modified your question to reflect this).
Changing the names and code a little bit, your data d has a segmented piecewise linear relationship of y ~ x with coefficients depending on a group variable g.
set.seed(321)
d <- data.frame(
x = runif(200),
g = factor(sample(1:2, 200, replace = TRUE))
)
d$y <- jitter(ifelse(d$g == "1",
pmax(0, 2 * d$x - 1),
pmax(0, 4 * d$x - 1.2)
), amount = 0.1)
Within every node of a tree I can then fit a model segmented(lm(y ~ x)) which comes with suitable extractors for coef(), logLik(), estfun() etc. Thus, the mobster function is simply:
segfit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...)
{
x <- as.numeric(x[, 2])
segmented::segmented(lm(y ~ x))
}
(Note: I haven't tried whether segmented() would also support lm() objects with weights and offset.)
With this we can obtain the full tree which simply splits in g in this basic example:
library("partykit")
segtree <- mob(y ~ x | g, data = d, fit = segfit)
plot(segtree, terminal_panel = node_bivplot, tnex = 2)
A hands-on introduction to segmented is available in: Muggeo VMR (2008). "segmented: An R Package to Fit Regression Models with Broken-Line Relationships." R News, 8(1), 20-25. https://CRAN.R-project.org/doc/Rnews/
For the underlying methodological background see: Muggeo VMR (2003). "Estimating Regression Models with Unknown Break-Points." Statistics in Medicine, 22(19), 3055-3071. doi:10.1002/sim.1545

Related

Constraints in MARSS (R package)

I would like to estimate (MLE) this model using MARSS (or another package in R)
x_t=x_{t-1}+w_t , with w_t ~ N(0,q)
y_t= d1_t + \alpha d2_t + \beta (d3_t -x_{t-1}) + v_t, with v_t ~ N(0,6*q)
where the first line is the transition equation and the second, the observation one.
I managed to write it in form accepted by MARSS (R-package), as below:
[x1_t,x2_{t-1}]= [1,0;1,0][x1_{t-1},x2_{t-2}]+[w1_t,w2_t], with w1_t ~ N(0,q) and w2_t ~ N(0,0)
y_t= D d_t+Z x_t , with v_t ~ N(0,6*q)
where
x_t=[x1_t,x2_{t-1}]
D=[1,\alpha,\beta]
Z=[0,\beta]
d_t=[d1_t,d2_t, d3_t]
The problem is that I couldn't make the constraint work properly. When I run this system, R considers the \beta in Z matrix separately of the \beta in D matrix. All the examples that I saw on internet show a linear restriction using Z matrix only (or just D only). The same occurs in the variances that I would like to be multiples.
Anyone could help me with this?
Here's a toy data:
B <- matrix(list(1,0,1,0),2,2,byrow=TRUE)
U <- matrix(0,2,1)
C <- matrix(0,2,1)
G <- matrix(list(1,0,0,0),2,2,byrow=TRUE)
Q <- matrix(list('d',0,0,0),2,2,byrow=TRUE)
Z <- matrix(list(0,'b'),1,2)
A <- matrix(0)
D <- matrix(list(1,'a','b'),1,3)
H <- matrix(1)
R=matrix(list('6*d'))
dt<-matrix(rnorm(300),3,100)
y<-rnorm(100)
x0=matrix(list(0.094,0.094),2,1)
V0=matrix(list(0.001,0,0,0.001),2,2)
model.list = list(B=B, U=U, C=C, Q=Q, Z=Z, A=A, D=D, d=dt, H=H, R=R,x0=x0,V0=V0)
kemfit = MARSS(y, model=model.list, control=list(maxit=100,conv.test.slope.tol=0.1,abstol=0.1),method='kem')
The EM algorithm in MARSS only allows constraints (like setting values equal) within the same matrices. Setting constraints across A & D or U & C is easy but across D & Z or R & Q requires rewriting your model in a weird way where your covariates (dt) appears as dummy states (x's). So you don't want to do that.
You can just write a function to return the negative log-likelihood of your state-space model and then minimize that with optim(). I would do this with the KFAS package using the SSCustom() function because that will be fast. However, here is how to do this with the MARSS package just to show you the concept. As the author of MARSS, I can write this down immediately whereas with the KFAS package (which I also use), I'd need to look up how to do the covariates.
# Set up the parts that don't change
dt<-matrix(rnorm(300),3,100)
y<-rnorm(100)
x0=matrix(list(0.094,0.094),2,1)
V0=matrix(list(0.001,0,0,0.001),2,2)
B <- matrix(list(1,0,1,0),2,2,byrow=TRUE)
U <- A <- "zero"
# Put the parameters you will estimate into a vector
pars <- c(a=0.1624, b=-0.1, d=sqrt(0.2))
# Write a function to return the negative log-likelihood
negloglik <- function(pars){
Q <- matrix(list(pars["d"]^2,0,0,0),2,2,byrow=TRUE)
Z <- matrix(list(0, pars["b"]),1,2)
D <- matrix(list(1, pars["a"], pars["b"]),1,3)
R <- matrix(6*pars["d"]^2)
model.list = list(B=B, U=U, Q=Q, Z=Z, A=A, D=D, d=dt, R=R, x0=x0, V0=V0)
-1*MARSS(y, model=model.list, control=list(maxit=100,conv.test.slope.tol=0.1,abstol=0.1),method='kem', silent=TRUE)$logLik
}
optim(pars, negloglik, method="BFGS")
Using the MARSS() function to get the logLik is a bit silly here since that is a fitting function but with all the parameters fixed, it will just return the logLik without fitting.
If you want to see what your KFAS model should look like, you can do this:
kfas.model <- MARSSkfas(kemfit, return.kfas.model=TRUE, return.lag.one=FALSE)$kfas.model
Then
library(KFAS)
logLik(kfas.model)
will get you the log-likelihood. But how the covariates are entering the KFAS model is a little non-intuitive. They appear in the kfas.model$Z element as a time-varying Z. I am sure the KFAS package has some helper function to construct models with covariates. I always construct KFAS models from matrices (no helper functions) so I am not familiar with those, but I know they exist.

R function to find which of 3 variables correlates most with another value?

I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship for these F2 values with 3 other variables, and if there is, which one is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes" "no" answers or whether responses are positive, neutral or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a T-Test but am unsure if this will give me what I need.
A quick solution might look like this. Here, the cor function is used. Read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below return the variable with the highest Pearson correlation with respect to the reference variable.
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
val <- sapply(vars, cor, y = ref)
val[which.max(val)]
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)

Nonlinear model with many independent variables (fixed effects) in R

I'm trying to fit a nonlinear model with nearly 50 variables (since there are year fixed effects). The problem is I have so many variables that I cannot write the complete formula down like
nl_exp = as.formula(y ~ t1*year.matrix[,1] + t2*year.matrix[,2]
+... +t45*year.matirx[,45] + g*(x^d))
nl_model = gnls(nl_exp, start=list(t=0.5, g=0.01, d=0.1))
where y is the binary response variable, year.matirx is a matrix of 45 columns (indicating 45 different years) and x is the independent variable. The parameters need to be estimated are t1, t2, ..., t45, g, d.
I have good starting values for t1, ..., t45, g, d. But I don't want to write a long formula for this nonlinear regression.
I know that if the model is linear, the expression can be simplified using
l_model = lm(y ~ factor(year) + ...)
I tried factor(year) in gnls function but it does not work.
Besides, I also tried
nl_exp2 = as.formula(y ~ t*year.matrix + g*(x^d))
nl_model2 = gnls(nl_exp2, start=list(t=rep(0.2, 45), g=0.01, d=0.1))
It also returns me error message.
So, is there any easy way to write down the nonlinear formula and the starting values in R?
Since you have not provided any example data, I wrote my own - it is completely meaningless and the model actually doesn't work because it has bad data coverage but it gets the point across:
y <- 1:100
x <- 1:100
year.matrix <- matrix(runif(4500, 1, 10), ncol = 45)
start.values <- c(rep(0.5, 45), 0.01, 0.1) #you could also use setNames here and do this all in one row but that gets really messy
names(start.values) <- c(paste0("t", 1:45), "g", "d")
start.values <- as.list(start.values)
nl_exp2 <- as.formula(paste0("y ~ ", paste(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"), collapse = " + "), " + g*(x^d)"))
gnls(nl_exp2, start=start.values)
This may not be the most efficient way to do it, but since you can pass a string to as.formula it's pretty easy to use paste commands to construct what you are trying to do.

R: multicollinearity issues using glib(), Bayesian Model Averaging (BMA-package)

I am experiencing difficulties estimating a BMA-model via glib(), due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij);
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both via levels and factors, but of course not in the same model. Since I can decide to include i and j as factors, levels, or not include them at all and for k either to include as level, or exclude, there are a total of 18 (3x3x2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
Next, via the following declaration I specify which variables to consider in each of the 18 models. According to me, no linear dependence exists in these specifications.
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function first checks whether the matrix is f.c.r, before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there any other way to include all 18 models in the glib() function?
Thank you in advance.

How to drop some interaction terms from R formula

I would like to drop some interaction terms from an R formula. My situation is that I have one factor variable with lots of levels (call this A, and it takes values from 1-50), and another continuous variable that I would like to interact it with (call this B).
A*B
creates terms A1:B, A2:B, A3:B,... I want a simple way to get rid of the first A1:B term.
Note: I saw some previous answers for the lm case that called update and then dropped some terms. This will not work for me as I am trying to estimate a multinomial logit model with the mlogit package, and I cannot make the first estimation without dropping some interactions.
Edit: Although I am not attempting to use lm, if I could the following to occur, then I think it would solve my problem.
dd<-data.frame(A=sample(letters[1:10], 100, replace=T),B = runif(100),z=rexp(100))
#need to drop B term below
reg1 <- lm(z~A*B, dd)
#or need to drop Aa:B term here
reg2 <- lm(z~A*B - B, dd)
#but this doesn't work (I realize why, but this is an
#example of what I would like to have happen)
reg3 <- lm(z~A*B - B - Aa:B, dd)
I think you should be able to work with contrasts her to make this happen. Here we create our own contrast that adjusts the default contrast.treament behavior to skip the first two variables.
contr.skip2 <- function (n, contrasts = TRUE, sparse = FALSE)
{
contr <- contr.treatment(n, 1, contrasts, sparse)
contr[2, ] <- 0
contr[, -1]
}
and then we can fit the model and pass along our special contrast
lm(z~A*B, dd, contrasts=list(A="contr.skip2"))
# Call:
# lm(formula = z ~ A * B, data = dd, contrasts = list(A = "contr.skip2"))
#
# Coefficients:
# (Intercept) Ac Ad Ae Af Ag Ah
# 1.09981 -0.14541 -0.86334 -0.18478 -0.77302 0.19681 0.23845
# Ai Aj B Ac:B Ad:B Ae:B Af:B
# -0.74962 -0.49014 0.09729 0.14705 1.09606 0.14706 0.88919
# Ag:B Ah:B Ai:B Aj:B
# -0.62796 -0.70155 1.60253 -0.20564
and as you can see we no longer have Ab terms in the model.

Resources