Defining gam function type - r

I would like to fit a GAM to a dataset while specifying the types of functions to use.
It would be something like:
y ~ cst1 * (s(var1)-s(var2)) * (1 - exp(var3*cst2))
s has to be the same function for both var1 and var2, and I don't have a prior idea of the family of s. To summarize, the model would find the constants (cst1 and cst2) plus the function s.
Is it possible? If not, is there another type of model I can use to do what I'm looking for?
Thanks in advance for any replies.

This model could be fit with nls(), R's non-linear least squares function. This will allow you to model the formula you want directly. The splines will need to be constructed manually, though. This question gets at what you would be trying to do.
As far as getting the splines to be the same for var1 and var2, you can do this by subtracting the basis matrices. Basically you want to compute a single coefficient vector A where the term is A * s(var1) - A * s(var2) = A * (s(var1) - s(var2)). You wouldn't want to just do s(var1 - var2); in general, f(x) - f(y) != f(x - y). To do this in R, you would:
1. Compute the spline basis matrices with ns() for var1 and var2, giving them the same knots. You need to specify both the knots and the Boundary.knots arguments so that the two splines share the same basis.
2. Subtract the two spline basis matrices (the output of the ns() function).
3. Adapt the resulting subtracted basis matrix for the nls formula, as they do in the question I linked earlier; a sketch of these steps follows below.
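Here is a minimal sketch of those steps, assuming a data frame dat with columns y, var1, var2 and var3 (the names are placeholders). Note that cst1 is not separately identifiable from the spline coefficients A, so it is folded into A:

library(splines)

# shared knots and boundary knots so var1 and var2 use the same basis
kn <- quantile(dat$var1, probs = c(0.25, 0.50, 0.75))
bk <- range(c(dat$var1, dat$var2))

B1 <- ns(dat$var1, knots = kn, Boundary.knots = bk)  # basis for s(var1)
B2 <- ns(dat$var2, knots = kn, Boundary.knots = bk)  # same basis for s(var2)
D  <- B1 - B2                                        # basis for s(var1) - s(var2)

# A is the vector of spline coefficients (cst1 is absorbed into A)
fit <- nls(
  y ~ drop(D %*% A) * (1 - exp(var3 * cst2)),
  data  = list(y = dat$y, var3 = dat$var3, D = D),
  start = list(A = rep(0.1, ncol(D)), cst2 = -0.1)
)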

Related

Model Syntax for Simple Moderation Model in Lavaan (with bootstrapping)

I am a social scientist currently running a simple moderation model in R, in the form of y ~ x + m + m * x. My moderator is a binary categorical variable (two separate groups).
I started out with lm(), bootstrapped estimates with boot(), and obtained BCa confidence intervals with boot.ci(). Since there is no automated way of doing this for all parameters (at my coding level, at least), this is a bit tedious. However, I now saw that the lavaan package offers bootstrapping as part of the regular sem() function, and also BCa CIs as part of parameterEstimates(). So I was wondering (since I am using lavaan in other analyses) whether I could just replace lm() with lavaan for the sake of keeping my work more consistent.
Doing this, I was wondering about what the equivalent model for lavaan would be to test for moderation in the same way. I saw this post where Jeremy Miles proposes the code below, which I follow mostly.
mod.1 <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y # This step needed for exact equivalence
y ~ c(int1, int2) * 1
modEff := a - b
mEff := int1 - int2"
But it would be great if you could help me figure out some final things.
1) What does the y ~~ c(v1, v1) * y part mean, and why is it needed for "exact equivalence" to the lm model? From the output it seems this constrains the variances of the outcome to the same value for both groups?
2) From the post, am I right to understand that either including the interaction effect as calculated above OR constraining (only) the slope between models and comparing model fit with anova() would be the same test for moderation?
3) The lavaan page says that adding test = "bootstrap" to the sem() function allows for bootstrap-adjusted p-values. However, I read a lot about p-values conflicting with the BCa CIs at times, and this has happened to me. Searching around, I understand that this conflict comes from the assumptions about the distribution of the data under H0 that p-values require but CIs do not (CIs just give the range of the most likely values). I was therefore wondering what it exactly means that the p-values given here are "bootstrap-adjusted"? Is it technically more sound to report these for my SEM models than the CIs?
Many questions, but I would be very grateful for any help you can provide.
Best,
Alex
I think I can answer at least questions 1 and 2, but it is probably easier not to use SEM and instead write a function that conveniently gives you CIs for all coefficients of your model.
So first, to answer your questions:
What is proposed in the code you gave is called multigroup comparison. Essentially this means that you fit the same SEM to two different groups of cases in your dataset. It is equivalent to a moderated regression with a binary moderator because in both cases you get two slopes (often called "simple slopes") for the scalar predictor, one slope per group of the moderator.
Now, in your lavaan code you only see the scalar predictor x. The binary moderator is implied by group="m" when you fit the model with fit.1 <- sem(mod.1, data = df, group = "m") (took this from the page you linked).
The two-element vectors (c( , )) in the lavaan code specify named parameters for the first and second group, respectively. With y ~~ c(v1, v1) * y, the residual variances of y are set equal in both groups because they have the same name. In contrast, the slopes c(a, b) and the intercepts c(int1, int2) are allowed to vary between groups.
Yes. With the SEM approach, you would fit the model a second time adding the constraint a == b and compare this constrained model to the first version, in which the slopes can differ. This is the same as comparing lm() models with and without the interaction term x:m (or x*m) in the formula; see the sketch below.
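A minimal sketch of that comparison, reusing mod.1 and assuming the data frame df and grouping variable m from the linked post:

library(lavaan)

# constrained version: a == b forces the slope of x to be equal across groups
mod.0 <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y
y ~ c(int1, int2) * 1
a == b"

fit.1 <- sem(mod.1, data = df, group = "m")  # slopes free
fit.0 <- sem(mod.0, data = df, group = "m")  # slopes constrained equal
anova(fit.0, fit.1)                          # likelihood-ratio test for moderation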
Here I cannot provide a definitive answer. I suspect that if you want BCa CIs like those you get from applying boot.ci to an lm fit, this may not be implemented. In the lavaan documentation, BCa confidence intervals are only mentioned once: in the section about the parameterEstimates function, which can also perform bootstrapping (see p. 89). However, it does not produce actual BCa (bias-corrected and accelerated) CIs, only bias-corrected ones.
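For reference, a hedged sketch of that route (boot.ci.type = "bca.simple" gives the bias-corrected, but not accelerated, intervals mentioned above):

fit.boot <- sem(mod.1, data = df, group = "m",
                se = "bootstrap", bootstrap = 1000)
parameterEstimates(fit.boot, boot.ci.type = "bca.simple", level = 0.95)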
As mentioned above, I guess the simplest solution would be to use lm() and either repeat the boot.ci procedure for each coefficient or write a wrapper function that does this for you. I suggest this also because a reviewer may be quite puzzled to see you do multigroup SEM instead of a simple moderated regression, which is much more common.
You probably did something like this already:
lm_fit <- function(dat, idx) coef( lm(y ~ x*m, data=dat[idx, ]) )  # statistic: coefficients on one resample
bs_out <- boot::boot(mydata, statistic=lm_fit, R=1000)             # 1000 bootstrap replicates
ci_out <- boot::boot.ci(bs_out, conf=.95, type="bca", index=1)     # BCa CI for the first coefficient
Now, either you repeat the last line for each coefficient, i.e., varying index from 1 to 4. Or you get fancy and let R do the repeating with a function like this:
all_ci <- function(bs) {
  est   <- bs$t0
  lower <- vector("numeric", length(bs$t0))
  upper <- lower
  for (i in seq_along(bs$t0)) {
    # the last two entries of the bca row are the lower and upper CI limits
    ci <- tail(boot::boot.ci(bs, type = "bca", index = i)$bca[1, ], 2)
    lower[i] <- ci[1]
    upper[i] <- ci[2]
  }
  cbind(est, lower, upper)
}
all_ci(bs_out)
I am sure this could be written more concisely but it should work fine for bootstraps of simple lm() models.

Estimation of a state-space model with lags in the measurement equation in R

I'm trying to estimate an SS model from this paper that has the following form:
Setting the order of the first lag polynomial to zero and the second one to one, we can reformulate it using terms from the MARSS package guide when applicable (x is the state, y is the observed variable, d is exogenous):
The MARSS package allows for estimation of a simpler model that doesn't include lagged variables in the measurement equation. Is there a way to estimate this one using MARSS or any other package without rewriting the estimation routine for this special case? Maybe there is a way to reformulate it so it could be "fed" to MARSS or some other package?
Take a look at how, say, the BSM structural time series model or an ARMA model is formulated as a MARSS model, i.e. a multivariate state-space model. That'll give you an idea of how to rewrite your model in multivariate state-space form.
Basically, your x will be a 2x1 state vector with
x_1(t) = b*x_1(t-1) + w(t)
x_2(t) = x_1(t-1)
See how x_2 is just a dummy that is forced to be x_1(t-1)?
Now the y equation is
y(t) = z*x_1(t) + c*x_2(t) + d*d(t) + a*y(t-1) + v(t)
The d and a are your D and A. I wrote them in lower case to indicate that they are scalars, but they can be matrices in general (if y is multivariate, say). Your inputs are d(t) and y(t-1); you prepare them as a 2-row input matrix with one column per time step.
Be careful with your initial condition specification. It is probably best/easiest to set it at t=1 and either estimate it or use a diffuse prior.
You can fit this model with MARSS. You can also fit it with any Kalman filter function that lets you pass inputs into the y equation (some do, some don't); the KFAS package allows that (build the model with SSModel() and SSMcustom(), then filter with KFS()).
In MARSS the model list will look like so
mod.list=list(
  B = matrix(list("b", 1, 0, 0), 2, 2),  # x1 is AR(1); x2 copies x1(t-1)
  U = matrix(0, 2, 1),
  Q = matrix(list("q", 0, 0, 0), 2, 2),  # process noise on x1 only
  Z = matrix(c("z", "c"), 1, 2),         # y loads on x1(t) and x2(t) = x1(t-1)
  A = matrix(0),
  R = matrix("r"),
  D = matrix(c("d", "a"), 1, 2),         # coefficients on the inputs d(t) and y(t-1)
  x0 = matrix(c("x1", "x2"), 2, 1),
  tinitx = 1,
  d = rbind(dt[2:TT], y[1:(TT-1)])       # the 2-row input matrix
)
dat <- y[2:TT] # since you need y_{t-1} in the d (inputs)
fit <- MARSS(dat, model=mod.list)
It'll probably complain that it wants initial conditions for x0. Anything will work; the EM algorithm isn't sensitive to that the way a BFGS or Newton algorithm is. But method="BFGS" is actually often better for this type of structural time-series model, and in that case pick a reasonable initial condition for x (reasonable = close to your data, in this case, I think).
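For example (a sketch; the inits value is just an illustration, anything sensible close to the data works):

fit <- MARSS(dat, model = mod.list, method = "BFGS",
             inits = list(x0 = matrix(c(y[2], y[2]), 2, 1)))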

Prediction at a new value using lowess function in R

I am using the lowess function to fit a regression between two variables x and y. Now I want to know the fitted value at a new value of x; for example, how do I find the fitted value at x = 2.5 in the following example? I know loess can do that, but I want to reproduce someone's plot and he used lowess.
set.seed(1)
x <- 1:10
y <- x + rnorm(x)
fit <- lowess(x, y)
plot(x, y)
lines(fit)
Local regression (lowess) is a non-parametric statistical method; it's not like linear regression, where you can use the fitted model directly to estimate new values.
You'll need to take the fitted values the function returns (that's why it only gives you a list of x and y), choose your own interpolation scheme, and use that scheme to predict your new points.
A common technique is spline interpolation (but there are others):
https://www.r-bloggers.com/interpolation-and-smoothing-functions-in-base-r/
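For instance, reusing fit from the example above (a sketch; which interpolation scheme is appropriate is up to you):

approx(fit$x, fit$y, xout = 2.5)$y  # linear interpolation at x = 2.5
spline(fit$x, fit$y, xout = 2.5)$y  # cubic spline interpolation at x = 2.5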
EDIT: I'm pretty sure the predict function for loess objects does the interpolation for you. I can't find any information about what exactly predict uses, so I've tried to trace the source code.
https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/stats/R/loess.R
else { ## interpolate
## need to eliminate points outside original range - not in pred_
I'm sure the R code calls the underlying C implementation, but it's not well documented, so I don't know what algorithm it uses.
My suggestion is: either trust the predict function (with loess) or roll your own interpolation over the lowess output.

How does one extract hat values and Cook's Distance from an `nlsLM` model object in R?

I'm using the nlsLM function to fit a nonlinear regression. How does one extract the hat values and Cook's Distance from an nlsLM model object?
With objects created using the nls or nlreg functions, I know how to extract the hat values and the Cook's distances of the observations, but I can't figure out how to get them using nlsLM.
Can anyone help me out on this? Thanks!
So, it's not Cook's distance or based on hat values, but you can use the function nlsJack in the nlstools package to jackknife your nls model: it removes each observation, one by one, and refits the model to see, roughly speaking, how much the model coefficients change with or without a given observation in there.
Reproducible example:
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
for (i in 1:100) {
  xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data=df1, start=c(a=3, b=2, d=-0.5))
require(nlstools)
plot(nlsJack(nls1))
The plot shows the percentage change in each model coefficient as each individual observation is removed, and it marks influential points above a certain threshold as "influential" in the resulting plot. The documentation for nlsJack describes how this threshold is determined:
An observation is empirically defined as influential for one parameter if the difference between the estimate of this parameter with and without the observation exceeds twice the standard error of the estimate divided by sqrt(n). This empirical method assumes a small curvature of the nonlinear model.
My impression so far is that this is a fairly liberal criterion; it tends to mark a lot of points as influential.
nlstools is a pretty useful package overall for diagnosing nls model fits though.
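A couple of other nlstools diagnostics, as a sketch reusing nls1 from above:

plot(nlsResiduals(nls1))             # standard residual diagnostics for the nls fit
boot1 <- nlsBoot(nls1, niter = 999)  # nonparametric bootstrap of the coefficients
summary(boot1)                       # bootstrap medians and percentile CIs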

Pseudo R squared for cumulative link function

I have an ordinal dependent variable and am trying to use a number of independent variables to predict it. I use R. The function I use is clm in the ordinal package, to fit a cumulative link model with a probit link, to be precise.
I tried the function pR2 in the package pscl to get the pseudo R squared with no success.
How do I get pseudo R squareds with the clm function?
Thanks so much for your help.
There are a variety of pseudo-R^2 measures. I don't like to use any of them because I do not see the results as having a meaning in the real world: they do not estimate effect sizes of any sort, and they are not particularly good for statistical inference. Furthermore, in situations like this, with multiple observations per entity, I think it is debatable which value of "n" (the number of subjects) or which degrees of freedom is appropriate. Some people use McFadden's R^2, which is relatively easy to calculate, since clm returns a list with one of its elements named "logLik". You just need to know that the log-likelihood is only a multiplicative constant (-2) away from the deviance. Taking the model from the first example (the wine data in the ordinal package):
library(ordinal)
data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
fm0 <- clm(rating ~ 1, data = wine)
( McF.pR2 <- 1 - fm1$logLik/fm0$logLik )
[1] 0.1668244
I had seen this question on CrossValidated and was hoping to see the more statistically sophisticated participants over there take this one on, but they saw it as a programming question and dumped it over here. Perhaps their opinion of R^2 as a worthwhile measure is as low as mine?
I recommend using the nagelkerke function from the rcompanion package to get pseudo R-squared values.
When your predictor or outcome variables are categorical or ordinal, the R-squared will typically be lower than with truly numeric data. R-squared is merely a weak indicator of a model's fit, and you can't choose a model based on it alone.
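A sketch of that, reusing fm1 and fm0 from above (nagelkerke() reports McFadden, Cox and Snell, and Nagelkerke pseudo R-squared values; passing the null model explicitly is optional but makes the comparison clear):

library(rcompanion)
nagelkerke(fit = fm1, null = fm0)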

Resources