Restricted Cubic Spline output in R rms package after cph - r

I am developing a Cox regression model in R.
The model I am currently using is as follows:
fh <- cph(S ~ rcs(MPV,4) + rcs(age,3) + BMI + smoking + hyperten + gender +
rcs(FVCPP,3) + TLcoPP, x=TRUE, y=TRUE, surv=TRUE, time.inc=2*52)
If I then want to look at this with
print(fh, latex = TRUE)
I get 3 coefficients/SEs/Wald statistics etc. for MPV (MPV, MPV' and MPV'') and 2 for age (age, age').
Could someone please explain to me what these outputs are? i.e. I believe they are to do with the restricted cubic splines I have added.

When you write rcs(MPV,4), you define the number of knots to use in the spline; in this case 4. Similarly, rcs(age,3) defines a spline with 3 knots. Due to identifiability constraints, one knot from each spline is absorbed; you can think of this as defining an intercept for each spline. So rcs(age,3) is a linear combination of 2 basis functions (one linear, one non-linear) plus an intercept, while rcs(MPV,4) is a linear combination of 3 basis functions (one linear, two non-linear) plus an intercept, i.e.,
$$f(\text{age}) = \alpha_{\text{age}} + \beta_1\,\text{age} + \beta_2\,\text{age}'$$
and
$$f(\text{MPV}) = \alpha_{\text{MPV}} + \gamma_1\,\text{MPV} + \gamma_2\,\text{MPV}' + \gamma_3\,\text{MPV}''.$$
In the notation above, what you get from the print statement are the regression coefficients $\beta_i$ and $\gamma_i$, with corresponding standard errors, p-values etc. The intercepts $\alpha_{\text{age}}$ and $\alpha_{\text{MPV}}$ are typically set to zero, but they are important, because without them the model fitting routine would have no idea of where on the y-axis to constrain the splines.
As a final note, you might actually be more interested in the output of summary(fh).
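For example, to see hazard-ratio summaries and visualize the fitted spline for MPV, you could do something like the following sketch (it assumes your data frame is called d; rms needs a datadist set before summary/Predict will work):
library(rms)
dd <- datadist(d); options(datadist = "dd")
summary(fh)            # hazard ratios over default (interquartile-range) contrasts
plot(Predict(fh, MPV)) # fitted restricted cubic spline for MPV, with confidence band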

Related

How to define random effects in the linear mixed effects model?

I read a paper which applied linear mixed-effects model for data analysis. I am confused about defining random effects in the equations.
First, how do I define a combined random effect, such as $\varepsilon_{\text{field-stability}}$, where field indicates the plot number and stability indicates a classification result?
Second, how do I include a random effect in the slope term, as in intercept + slope * (var1 + random effect) + residuals?
I do not know how to write code that represents these equations.
I am looking for R expressions for these equations.
Like Nate mentioned, the lme4 package will do all that you need. Its vignette here has the examples for your answer, particularly section 2.2.
A simple random effect can be written as (1 | group), which adds a group-specific random intercept; a random slope for a fixed effect x, varying by group, can be written as (1 + x | group).
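A minimal sketch of both cases, assuming a data frame d with response y, covariate x, and grouping factors field and stability (all placeholder names, not taken from the paper):
library(lme4)
# combined random effect: a random intercept for every field-by-stability combination
m1 <- lmer(y ~ x + (1 | field:stability), data = d)
# random intercept plus a random slope for x, varying by field
m2 <- lmer(y ~ x + (1 + x | field), data = d)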

High (or very high) order polynomial regression in R (or alternatives?)

I would like to fit a (very) high order regression to a set of data in R, however the poly() function has a limit of order 25.
For this application I need an order on the range of 100 to 120.
model <- lm(noisy.y ~ poly(q,50))
# Error in poly(q, 50) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,30))
# Error in poly(q, 30) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,25))
# OK
Polynomials and orthogonal polynomials
poly(x) has no hard-coded limit for degree. However, there are two numerical constraints in practice.
Basis functions are constructed at the unique x values. A polynomial of degree k has k + 1 basis functions and coefficients, but poly generates the basis without the intercept term, so degree = k implies k basis functions and k coefficients. If there are n unique x values, we need k < n, otherwise there is simply not enough information to construct the polynomial. Inside poly(), the following line checks this condition:
if (degree >= length(unique(x)))
stop("'degree' must be less than number of unique points")
The correlation between x ^ k and x ^ (k+1) gets closer and closer to 1 as k increases; how quickly depends on the x values. poly first generates the ordinary polynomial basis, then performs a QR factorization to find an orthogonal span. If numerical rank-deficiency occurs between x ^ k and x ^ (k+1), poly will also stop and complain:
if (QR$rank < degree)
stop("'degree' must be less than number of unique points")
But the error message is not informative in this case. Furthermore, this does not have to be an error; it could be a warning, after which poly could reset degree to the rank and proceed. Maybe R core can improve on this bit?
Your trial-and-error shows that you can't construct a polynomial of degree greater than 25. You can first check length(unique(q)). If your degree is smaller than that but still triggers the error, you know for sure it is due to numerical rank-deficiency.
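A quick way to see which of the two constraints you are hitting, using the q from your example (the degree of 30 here is arbitrary):
length(unique(q))          # the degree must be strictly smaller than this
X <- outer(q, 1:30, "^")   # raw polynomial basis of degree 30 (no intercept column)
qr(X)$rank                 # if this is < 30, you have numerical rank-deficiency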
But what I want to say is that a polynomial of degree higher than 3-5 is never useful! The critical reason is Runge's phenomenon; in statistical terminology, a high-order polynomial always badly overfits the data! Don't naively think that because orthogonal polynomials are numerically more stable than raw polynomials, Runge's effect can be eliminated. No: the polynomials of degree k form a vector space, so whatever basis you use for the representation, the span is the same!
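A minimal simulated sketch of Runge's phenomenon (none of these numbers come from your data): a degree-20 fit chases the noise and oscillates wildly near the boundaries.
set.seed(1)
x <- seq(-1, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)
fit <- lm(y ~ poly(x, 20))
newx <- seq(-1, 1, length.out = 500)
plot(x, y)
lines(newx, predict(fit, newdata = data.frame(x = newx)))  # note the wild wiggles at the edges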
Splines: piecewise cubic polynomials and its use in regression
Polynomial regression is indeed helpful, but we often want piecewise polynomials. The most popular choice is the cubic spline. Just as there are different representations for polynomials, there are several representations for splines:
truncated power basis
natural cubic spline basis
B-spline basis
The B-spline basis is the most numerically stable, as it has compact support. As a result, the cross-product matrix X'X is banded, so solving the normal equations (X'X) b = X'y is very stable.
In R, we can use the bs function from the splines package (one of R's base packages) to get a B-spline basis. For bs(x), the only numerical constraint on the degrees of freedom df is that we can't have more basis functions than length(unique(x)).
I am not sure what your data look like, but perhaps you can try
library(splines)
model <- lm(noisy.y ~ bs(q, df = 10))
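If you are curious about the bandedness claim above, you can inspect the cross-product of the basis directly (sorting is only for a tidier display):
B <- bs(sort(q), df = 10)
round(crossprod(B), 3)   # non-zero entries are confined to a band around the diagonal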
Penalized smoothing / regression splines
A regression spline is still likely to overfit your data if you keep increasing the degrees of freedom; model selection largely comes down to choosing the best degrees of freedom.
A better approach is to use a penalized smoothing spline or penalized regression spline, so that model estimation and selection of the degrees of freedom (i.e., "smoothness") are integrated.
The smooth.spline function in the stats package can do both. Despite what its name suggests, most of the time it fits a penalized regression spline rather than a smoothing spline. Read ?smooth.spline for more details. For your data, you may try
fit <- smooth.spline(q, noisy.y)
(Note, smooth.spline has no formula interface.)
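Fitted values on a new grid can then be obtained with its predict method (newx is just a placeholder name):
newx <- seq(min(q), max(q), length.out = 200)
pred <- predict(fit, x = newx)$y   # predict.smooth.spline returns a list with components x and y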
Additive penalized splines and Generalized Additive Models (GAM)
Once we have more than one covariate, we need additive models to overcome the "curse of dimensionality" while remaining sensible. Depending on the representation of the smooth functions, GAMs come in various forms. The most popular implementation, in my opinion, is the mgcv package, one of R's recommended packages.
You can still fit a univariate penalized regression spline with mgcv:
library(mgcv)
fit <- gam(noisy.y ~ s(q, bs = "cr", k = 10))
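The penalty chooses the effective degrees of freedom for you; you can check and visualize the result with:
summary(fit)   # reports the effective degrees of freedom of s(q)
plot(fit)      # plots the estimated smooth with a confidence band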

Mixed Modelling - Different Results between lme and lmer functions

I am currently working through Andy Field's book, Discovering Statistics Using R. Chapter 14 is on Mixed Modelling and he uses the lme function from the nlme package.
The model he creates, using speed dating data, is such:
speedDateModel <- lme(dateRating ~ looks + personality +
                        gender + looks:gender + personality:gender +
                        looks:personality,
                      data = speedData,
                      random = ~1|participant/looks/personality)
I tried to recreate a similar model using the lmer function from the lme4 package; however, my results are different. I thought I had the proper syntax, but maybe not?
speedDateModel.2 <- lmer(dateRating ~ looks + personality + gender +
looks:gender + personality:gender +
(1|participant) + (1|looks) + (1|personality),
data = speedData, REML = FALSE)
Also, when I look at the coefficients of these models, I notice that they only produce random intercepts for each participant. I was trying to create a model that produces both random intercepts and slopes, but I can't seem to get the syntax right in either function. Any help would be greatly appreciated.
The only difference between the lme and the corresponding lmer formula should be that the random and fixed components are aggregated into a single formula:
dateRating ~ looks + personality +
gender + looks:gender + personality:gender +
looks:personality+ (1|participant/looks/personality)
using (1|participant) + (1|looks) + (1|personality) is only equivalent if looks and personality have unique values at each nested level.
It's not clear which continuous variable you want to use to define your slopes: if you have a continuous variable x and groups g, then (x|g), or equivalently (1+x|g), will give you a random-slopes model (x should also be included in the fixed-effects part of the model, i.e. the full formula should be y ~ x + (x|g) ...)
update: I got the data, or rather a script file that allows one to reconstruct the data, from here. Field makes a common mistake in his book, which I have made several times in the past: since there is only a single observation in the data set for each participant/looks/personality combination, the three-way interaction has one level per observation. In a linear mixed model, this means the variance at the lowest level of nesting will be confounded with the residual variance.
You can see this in two ways:
lme appears to fit the model just fine, but if you try to calculate confidence intervals via intervals(), you get
intervals(speedDateModel)
## Error in intervals.lme(speedDateModel) :
## cannot get confidence intervals on var-cov components:
## Non-positive definite approximate variance-covariance
If you try this with lmer you get:
## Error: number of levels of each grouping factor
## must be < number of observations
In both cases, this is a clue that something's wrong. (You can overcome this in lmer if you really want to: see ?lmerControl.)
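For completeness, a sketch of how you might disable that check in lmer (this only suppresses the error; the lowest-level variance is still confounded with the residual variance):
sd3 <- lmer(dateRating ~ looks + personality +
              gender + looks:gender + personality:gender +
              looks:personality +
              (1|participant/looks/personality),
            data = speedData,
            control = lmerControl(check.nobs.vs.nlev = "ignore",
                                  check.nobs.vs.nRE = "ignore"))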
If we leave out the lowest grouping level, everything works fine:
sd2 <- lmer(dateRating ~ looks + personality +
gender + looks:gender + personality:gender +
looks:personality+
(1|participant/looks),
data=speedData)
Compare lmer and lme fixed effects:
all.equal(fixef(sd2),fixef(speedDateModel)) ## TRUE
The starling example here gives another example and further explanation of this issue.

predict and multiplicative variables / interaction terms in probit regressions

I want to determine the marginal effects of each dependent variable in a probit regression as follows:
predict the (base) probability with the mean of each variable
for each variable, predict the change in probability compared to the base probability if the variable takes the value of mean + 1x standard deviation of the variable
In one of my regressions, I have a multiplicative variable, as follows:
my_probit <- glm(a ~ b + c + I(b*c), family = binomial(link = "probit"), data=data)
Two questions:
When I determine the marginal effects using the approach above, will the value of the multiplicative term reflect the value of b or c taking the value mean + 1x standard deviation of the variable?
Same question, but with an interaction term (* and no I()) instead of a multiplicative term.
Many thanks
When interpreting the results of models involving interaction terms, the general rule is: DO NOT interpret coefficients. The very presence of interactions means that the meaning of a term's coefficient varies with the values of the other covariates used for prediction. The right way to look at the results is to construct a "prediction grid", i.e. a set of values spaced across the range of interest (ideally within the domain of data support). The two essential functions for this process are expand.grid and predict.
dgrid <- expand.grid(b = fivenum(data$b)[2:4], c = fivenum(data$c)[2:4])
# A grid with the lower hinges, medians and upper hinges of `b` and `c`.
predict(my_probit, newdata=dgrid)
You may want to have the predictions on a scale other than the default (which is to return the linear predictor), so perhaps this would be easier to interpret if it were:
predict(my_probit, newdata=dgrid, type ="response")
Be sure to read ?predict and ?predict.glm and work with some simple examples to make sure you are getting what you intended.
Predictions from models containing interactions (at least those involving 2 covariates) should be thought of as being surfaces or 2-d manifolds in three dimensions. (And for 3-covariate interactions as being iso-value envelopes.) The reason that non-interaction models can be decomposed into separate term "effects" is that the slopes of the planar prediction surfaces remain constant across all levels of input. Such is not the case with interactions, especially those with multiplicative and non-linear model structures. The graphical tools and insights that one picks up in a differential equations course can be productively applied here.
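To address the original mean-plus-one-SD question directly: predict() rebuilds the I(b*c) (or b:c) term from the b and c values you supply in newdata, so you never set the product yourself. A sketch under that assumption:
base <- data.frame(b = mean(data$b), c = mean(data$c))
b_up <- transform(base, b = b + sd(data$b))   # shift b by one SD; the product term is recomputed by predict()
p0 <- predict(my_probit, newdata = base, type = "response")
p1 <- predict(my_probit, newdata = b_up, type = "response")
p1 - p0   # change in predicted probability for a one-SD increase in b, holding c at its mean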

How do you fit a linear mixed model with an AR(1) random effects correlation structure in R?

I am trying to use R to rerun someone else's project, so we need to use some macros in R.
Here comes a very basic question:
m1.nlme <- lme(log.bp.dia ~ M25.9to9.ma5iqr + temp.c.9to9.ma4iqr + o3.ma5iqr +
                 sea_spring + sea_summer + sea_fall + BMI + male + age_ini,
               data = barbara.1.clean, random = ~ 1 | study_id)
Since the model uses an AR(1) [first-order autoregressive] covariance structure in SAS for the within-person variance, I am not sure how to do this in R.
Also, where can I see the list of available covariance structures, like unstructured?
Thanks
I don't know what you mean by "index" for different models, but to specify an AR(1) covariance structure for the residuals, you can add corr=corAR1() to your lme call.
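Applied to the model in the question, that would look something like the sketch below (the form argument makes the AR(1) structure apply within each study_id; whether observation order within a subject is the right time index is an assumption here):
m1.nlme.ar1 <- lme(log.bp.dia ~ M25.9to9.ma5iqr + temp.c.9to9.ma4iqr + o3.ma5iqr +
                     sea_spring + sea_summer + sea_fall + BMI + male + age_ini,
                   data = barbara.1.clean, random = ~ 1 | study_id,
                   correlation = corAR1(form = ~ 1 | study_id))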
The correlation at lag $1$ is, say, $r$, where $-1 < r < 1$ for a stationary AR(1) model. The correlation at lag $k \geq 1$ is then $r^k$. This gives you the autocovariance matrix: just multiply by the variance of $X_t$.
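Written out for $n$ equally spaced observations on one person, the implied covariance matrix is:
$$
\operatorname{Var}(X_t)
\begin{pmatrix}
1 & r & r^2 & \cdots & r^{n-1} \\
r & 1 & r & \cdots & r^{n-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r^{n-1} & r^{n-2} & r^{n-3} & \cdots & 1
\end{pmatrix}
$$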
