How to write interactions in regressions in R?

DF <- data.frame(factor1 = rep(1:4, 1000), factor2 = rep(1:4, each = 1000),
                 base = rnorm(4000, 0, 1), dep = rnorm(4000, 400, 5))
DF$f1_1 = DF$factor1 == 1
DF$f1_2 = DF$factor1 == 2
DF$f1_3 = DF$factor1 == 3
DF$f1_4 = DF$factor1 == 4
DF$f2_1 = DF$factor2 == 1
DF$f2_2 = DF$factor2 == 2
DF$f2_3 = DF$factor2 == 3
DF$f2_4 = DF$factor2 == 4
I want to run the following regression:
Dep = (f1_1 + f1_2 + f1_3 + f1_4) * (f2_1 + f2_2 + f2_3 + f2_4) * (base + base^2 + base^3 + base^4 + base^5)
Is there a smarter way to do it?

You should code factor1 and factor2 as real factor variables, and it is better to use poly for the polynomial. Here is what we can do:
DF <- data.frame(factor1 = rep(1:4, 1000), factor2 = rep(1:4, each = 1000),
                 base = rnorm(4000, 0, 1), dep = rnorm(4000, 400, 5))
DF$factor1 <- as.factor(DF$factor1)
DF$factor2 <- as.factor(DF$factor2)
fit <- lm(dep ~ factor1 * factor2 * poly(base, degree = 5), data = DF)
By default, poly generates an orthogonal basis for numerical stability. If you want ordinary polynomials like base + base^2 + base^3 + ..., use poly(base, degree = 5, raw = TRUE).
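Both parameterizations span the same column space, so they give identical fitted values; a quick check (a sketch reusing the DF built above):
fit_raw  <- lm(dep ~ factor1 * factor2 * poly(base, degree = 5, raw = TRUE), data = DF)
fit_orth <- lm(dep ~ factor1 * factor2 * poly(base, degree = 5), data = DF)
all.equal(fitted(fit_raw), fitted(fit_orth))  # TRUE: same fit, different basis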
Be aware that you will get a lot of parameters from this model, as you are fitting a separate degree-5 curve (6 coefficients) for each of the 4 × 4 = 16 pairs of levels of factor1 and factor2.
Consider a small example.
set.seed(0)
f1 <- sample(gl(3, 20, labels = letters[1:3])) ## randomized balanced factor
f2 <- sample(gl(3, 20, labels = LETTERS[1:3])) ## randomized balanced factor
x <- runif(3 * 20) ## numerical covariate
y <- rnorm(3 * 20) ## toy response
fit <- lm(y ~ f1 * f2 * poly(x, 2))
#Call:
#lm(formula = y ~ f1 * f2 * poly(x, 2))
#
#Coefficients:
#  (Intercept)            -0.5387
#  f1b                     0.8776
#  f1c                     0.1572
#  f2B                     0.5113
#  f2C                     1.0139
#  poly(x, 2)1             5.8345
#  poly(x, 2)2             2.4373
#  f1b:f2B                 1.0666
#  f1c:f2B                 0.1372
#  f1b:f2C                -1.4951
#  f1c:f2C                -1.4601
#  f1b:poly(x, 2)1        -6.2338
#  f1c:poly(x, 2)1       -11.0760
#  f1b:poly(x, 2)2        -2.3668
#  f1c:poly(x, 2)2         1.9708
#  f2B:poly(x, 2)1        -3.7127
#  f2C:poly(x, 2)1        -5.8253
#  f2B:poly(x, 2)2         5.6227
#  f2C:poly(x, 2)2        -7.3582
#  f1b:f2B:poly(x, 2)1    20.9179
#  f1c:f2B:poly(x, 2)1    11.6270
#  f1b:f2C:poly(x, 2)1     1.2897
#  f1c:f2C:poly(x, 2)1    11.2041
#  f1b:f2B:poly(x, 2)2    12.8096
#  f1c:f2B:poly(x, 2)2    -9.8476
#  f1b:f2C:poly(x, 2)2    10.6664
#  f1c:f2C:poly(x, 2)2     4.5582
Note that even with just 3 levels per factor and only a degree-2 polynomial, we already end up with a great number of coefficients.
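Counting the coefficients directly confirms the blow-up; for the original 4 × 4 × degree-5 model, the same arithmetic gives 4 * 4 * (5 + 1) = 96 coefficients:
length(coef(fit))  # 27 = 3 levels x 3 levels x (2 + 1) polynomial coefficients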

Using I() forces the formula to treat + - * / as arithmetic rather than model operators. Example: lm(y ~ I(x1 + x2))
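For instance, a toy sketch (x1, x2 and y here are made up for illustration):
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- x1 + x2 + rnorm(100)
coef(lm(y ~ I(x1 + x2)))  # one slope for the arithmetic sum x1 + x2
coef(lm(y ~ x1 + x2))     # two slopes, one per predictor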

Related

crr output list - remove df$ from coefficients?

I am using the cmprsk package to create a series of regressions. In the real models I used, I specified my models the same way as in the example that produces mel2 below. My problem is that I want the Melanoma$ in front of the coefficient names to go away, as happens when the model is specified as in mel1. Is there a way to remove that data-frame prefix from the object without re-running the model?
library(cmprsk)
data(Melanoma, package = "MASS")
head(Melanoma)
mel1 <- crr(ftime = Melanoma$time, fstatus = Melanoma$status, cov1 = Melanoma[, c("sex", "age")], cencode = 2)
covs2 <- model.matrix(~ Melanoma$sex + Melanoma$age)[, -1]
mel2 <- crr(ftime = Melanoma$time, fstatus = Melanoma$status, cov1 = covs2, cencode = 2)
What I want: coefficients named sex and age (as produced by mel1). What I have: coefficients named Melanoma$sex and Melanoma$age.
You could use the data argument in model.matrix, and wrap the crr call in with(Melanoma, ...)
covs2 <- model.matrix(~ sex + age, data = Melanoma)[, -1]
mel2 <- with(Melanoma, crr(ftime = time, fstatus = status,
cov1 = covs2, cencode = 2))
mel2$coef
#> sex age
#> 0.58838573 0.01259388
If you are stuck with existing models like this:
covs2 <- model.matrix(~ Melanoma$sex + Melanoma$age)[, -1]
mel2 <- crr(ftime = Melanoma$time, fstatus = Melanoma$status,
cov1 = covs2, cencode = 2)
You could simply rename the coefficients like this
names(mel2$coef) <- c("sex", "age")
mel2
#> convergence: TRUE
#> coefficients:
#> sex age
#> 0.58840 0.01259
#> standard errors:
#> [1] 0.271800 0.009301
#> two-sided p-values:
#> sex age
#> 0.03 0.18

Extract linear equations from lm

Assume I have data with a dependency y(t) and parameters p1, p2 and p3 which might influence the value y(t). I create 3 linear equations that depend on combinations of the parameters p1 and p2; p3 has no impact on y(t), i.e. it is assigned at random. You can find a reproducible example at the end of the question.
The 3 equations are
p1 p2 Equation
1  1  5 + 3t
2  2  3 + t
2  3  1 - t
A plot of the 3 lines with the random data overlaid can be reproduced with the plotting code in the example below. Now, if I call lm() (for formulae see here) on my random data, I get the following result:
lm(formula = y ~ .^2, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.14707 -0.22785 0.00157 0.23099 1.10528
Coefficients: (6 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.83711 0.17548 27.565 <2e-16 ***
t 2.97316 0.02909 102.220 <2e-16 ***
p12 -3.86697 0.21487 -17.997 <2e-16 ***
p22 2.30617 0.20508 11.245 <2e-16 ***
p23 NA NA NA NA
p32 0.16518 0.21213 0.779 0.4375
p33 0.23450 0.22594 1.038 0.3012
t:p12 -4.00574 0.03119 -128.435 <2e-16 ***
t:p22 2.01230 0.03147 63.947 <2e-16 ***
t:p23 NA NA NA NA
t:p32 0.01155 0.03020 0.383 0.7027
t:p33 0.02469 0.03265 0.756 0.4508
p12:p22 NA NA NA NA
p12:p23 NA NA NA NA
p12:p32 -0.10368 0.21629 -0.479 0.6325
p12:p33 -0.11728 0.21386 -0.548 0.5843
p22:p32 -0.20871 0.19633 -1.063 0.2896
p23:p32 NA NA NA NA
p22:p33 -0.44250 0.22322 -1.982 0.0495 *
p23:p33 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4112 on 136 degrees of freedom
Multiple R-squared: 0.9988, Adjusted R-squared: 0.9987
F-statistic: 8589 on 13 and 136 DF, p-value: < 2.2e-16
If I only want to consider parameters with high significance, I would argue to ignore the parameters whose estimates are close to zero. If I understand correctly, zero parameters do not lead to "new lines". I then obtain the following simplified model (values are rounded for readability):
Estimate
(Intercept) 5 ***
t 3 ***
p12 -4 ***
p22 2 ***
t:p12 -4 ***
t:p22 2 ***
I would then reconstruct the theoretical model as follows from the estimate
above (only highly significant parameters!):
p1 p2 Equation                       Result
1  1  5+3t                           5+3t
1  2  5+3t+p22+t:p22*t               7+5t
2  2  5+3t+p12+p22+t:p12*t+t:p22*t   3+t
2  3  5+3t+p12+t:p12*t               1-t
(For p2 = 3, the p23 and t:p23 terms are NA and drop out.)
Now, 7 + 5t is obviously wrong, but I am not sure about the reason.
My guess: lm adds the parameters successively, so the corresponding model y ~ t:p2 is not contained in the model above?
This question and the references therein might be related, but they do not look at the lm result, so there is nothing about that.
Reproducible example:
r <- generate_3lines(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1))
t_m <- r$t_m; y_m <- r$y_m; y_t <- r$y_t; rm(r)
mydata <- generate_randomdata(t_m, y_m, y_t)
# What the raw data looks like:
plot(t_m[[1]], y_t[[1]], type = "l", lty = 3, col = "black", main = "Raw data",
xlim = c(0, 10), ylim = c(min(mydata$y), max(mydata$y)), xlab = "t", ylab = "y")
lines(t_m[[2]], y_t[[2]], col = "black", lty = 3)
lines(t_m[[3]], y_t[[3]], col = "black", lty = 3)
points(x = mydata$t, y = mydata$y)
fit <- lm(y ~ .^2, data = mydata) # Not all levels / variables are linearly independent
print(summary(fit))
and the functions
generate_3lines <- function(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1)) {
  t <- seq(0, 10, length.out = 1000)  # large sample of x values
  t_m <- list()
  y_m <- list()
  y_t <- list()
  for (i in 1:3) {
    set.seed(33 * i)
    t_m[[i]] <- sort(sample(t, 50, replace = FALSE))
    set.seed(33 * i)
    noise <- rnorm(10, 0, sigma)  # length 10, recycled over the 50 points
    y_m[[i]] <- slopes[i] * t_m[[i]] + offsets[i] + noise
    y_t[[i]] <- slopes[i] * t_m[[i]] + offsets[i]
  }
  return(list(t_m = t_m, y_m = y_m, y_t = y_t))
}
generate_randomdata <- function(t_m, y_m, y_t) {
  # Final data set
  df1 <- data.frame(t = t_m[[1]], y = y_m[[1]], p1 = rep(1), p2 = rep(1),
                    p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace = TRUE))
  df2 <- data.frame(t = t_m[[2]], y = y_m[[2]], p1 = rep(2), p2 = rep(2),
                    p3 = sample(c(1, 2, 3), length(t_m[[2]]), replace = TRUE))
  df3 <- data.frame(t = t_m[[3]], y = y_m[[3]], p1 = rep(2), p2 = rep(3),
                    p3 = sample(c(1, 2, 3), length(t_m[[3]]), replace = TRUE))
  mydata <- rbind(df1, df2, df3)
  mydata$p1 <- factor(mydata$p1)
  mydata$p2 <- factor(mydata$p2)
  mydata$p3 <- factor(mydata$p3)
  mydata <- mydata[sample(nrow(mydata)), ]  # shuffle row order
  return(mydata)
}
Edit after input from @MrFlick: the question is now also on Cross Validated.
Comment: it seems the fit is not really automated in ggplot; see here.
In brief, everything is OK with the model and the result from lm. As explained in this answer on Cross Validated, 7 + 5t is just an extrapolation into a range without data. Furthermore, the synthetic data suffer from collinearity.
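You can see the missing cell directly (a quick check using mydata from the reproducible example): the (p1 = 1, p2 = 2) combination never occurs, so 7 + 5t is evaluated where the model has seen no data.
with(mydata, table(p1, p2))
#    p2
# p1   1  2  3
#   1 50  0  0
#   2  0 50 50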

`nlme` with crossed random effects

I am trying to fit a crossed non-linear random-effects model, like the linear random-effects models mentioned in this question and in this mailing list post, using the nlme package. However, I get an error regardless of what I try. Here is an example:
library(nlme)
#####
# simulate data
set.seed(18112003)
na <- 30
nb <- 30
sigma_a <- 1
sigma_b <- .5
sigma_res <- .33
n <- na*nb
a <- gl(na,1,n)
b <- gl(nb,na,n)
u <- gl(1,1,n)
x <- runif(n, -3, 3)
y_no_noise <- x + sin(2 * x)
y <-
x + sin(2 * x) +
rnorm(na, sd = sigma_a)[as.integer(a)] +
rnorm(nb, sd = sigma_b)[as.integer(b)] +
rnorm(n, sd = sigma_res)
#####
# works in the linear model where we know the true parameter
fit <- lme(
# somehow we found the right values
y ~ x + sin(2 * x),
random = list(u = pdBlocked(list(pdIdent(~ a - 1), pdIdent(~ b - 1)))))
vv <- VarCorr(fit)
vv2 <- vv[c("a1", "b1"), ]
storage.mode(vv2) <- "numeric"
print(vv2,digits=4)
#R Variance StdDev
#R a1 1.016 1.0082
#R b1 0.221 0.4701
#####
# now try to do the same with `nlme`
fit <- nlme(
y ~ c0 + sin(c1),
fixed = list(c0 ~ x, c1 ~ x - 1),
random = list(u = pdBlocked(list(pdIdent(~ a - 1), pdIdent(~ b - 1)))),
start = c(0, 0.5, 1))
#R Error in nlme.formula(y ~ c0 + sin(c1), fixed = list(c0 ~ x, c1 ~ :
#R   'random' must be a formula or list of formulae
The lme example is similar to the one on pages 163-166 of "Mixed-Effects Models in S and S-PLUS", with only 2 random effects instead of 3.
I should have used a two-sided formula, as written in help("nlme"):
fit <- nlme(
y ~ c0 + c1 + sin(c2),
fixed = list(c0 ~ 1, c1 ~ x - 1, c2 ~ x - 1),
random = list(u = pdBlocked(list(pdIdent(c0 ~ a - 1), pdIdent(c1 ~ b - 1)))),
start = c(0, 0.5, 1))
# fixed effects estimates
fixef(fit)
#R c0.(Intercept) c1.x c2.x
#R -0.1788218 0.9956076 2.0022338
# covariance estimates
vv <- VarCorr(fit)
vv2 <- vv[c("c0.a1", "c1.b1"), ]
storage.mode(vv2) <- "numeric"
print(vv2,digits=4)
#R Variance StdDev
#R c0.a1 0.9884 0.9942
#R c1.b1 0.2197 0.4688
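As a sanity check (not part of the original answer), the fixed-effect estimates can be compared with the simulated truth y = x + sin(2 * x), which corresponds to c0 = 0, c1 = 1 and c2 = 2:
rbind(truth = c(0, 1, 2), estimate = fixef(fit))
# the variance estimates 0.9884 and 0.2197 likewise track sigma_a^2 = 1 and sigma_b^2 = 0.25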

Slopes for lme linear b-spline model

I was wondering how to obtain slope estimates, with SEs and p-values, for each segment of an lme model using linear B-splines.
I can get slope estimates using predict, but not SE and p-values.
Here is an example:
rm(list = ls())
library(splines)
library(nlme)
getY <- function(x) ifelse(x < 7, x * 1.3, x * 0.6) + rnorm(length(x))
set.seed(123)
data <- data.frame(Id = numeric(0), X = numeric(0), Y = numeric(0))
for (i in 1:10) {
X <- sample(1:10, 4)
Y <- getY(X) + rnorm(1, 0.5)
Id <- rep(i, 4)
data <- rbind(data, cbind(Id = Id, X = X, Y = Y))
}
gdata <- groupedData(Y ~ X | Id, data)
mod <- lme(fixed = Y ~ bs(X, degree = 1, knots = 7), data = gdata, random = ~ 1 | Id)
summary(mod)
Linear mixed-effects model fit by REML
Data: gdata
AIC BIC logLik
158.2 166.2 -74.09
Random effects:
Formula: ~1 | Id
(Intercept) Residual
StdDev: 1.217 1.389
Fixed effects: Y ~ bs(X, degree = 1, knots = 7)
Value Std.Error DF t-value p-value
(Intercept) 3.098 0.5817 28 5.326 0e+00
bs(X, degree = 1, knots = 7)1 4.031 0.7714 28 5.225 0e+00
bs(X, degree = 1, knots = 7)2 3.253 0.7258 28 4.481 1e-04
Correlation:
(Intr) b(X,d=1,k=7)1
bs(X, degree = 1, knots = 7)1 -0.597
bs(X, degree = 1, knots = 7)2 -0.385 0.233
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.469915 -0.628202 0.005586 0.541398 1.748387
Number of Observations: 40
Number of Groups: 10
plot(augPred(mod))
pred1 <- predict(mod, data.frame(X = 1:2), level = 0)
pred2 <- predict(mod, data.frame(X = 8:9), level = 0)
(slope1 <- diff(pred1))
1 0.6718
(slope2 <- diff(pred2))
1 -0.2594
Wouldn't you just take the differences of a predict result?
predict(mod, newdata = data.frame(X = 1:10, Id = 1))
#        1        1        1        1        1        1        1        1        1        1
# 3.449572 4.121362 4.793152 5.464941 6.136731 6.808521 7.480311 7.220928 6.961544 6.702161
# attr(,"label")
# [1] "Predicted values"
So:
plot(predict(mod, newdata = data.frame(X = 1:10, Id = 1)), ylim = c(-2, 8))
lines(1:9, diff(predict(mod, newdata = data.frame(X = 1:10, Id = 1))))
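To also get SEs and p-values for each segment (which predict alone does not give), one option, sketched here rather than taken from the original answer, is to express each slope as a linear contrast of the fixed effects: the fit is piecewise linear, so a slope is the difference between two design rows one X-unit apart, and its variance follows from the fixed-effects covariance matrix. This assumes mod and gdata from the question; bs() must be rebuilt with the boundary knots the model used (range(gdata$X) by default):
bb   <- bs(c(1, 2, 8, 9), degree = 1, knots = 7, Boundary.knots = range(gdata$X))
Xmat <- cbind("(Intercept)" = 1, bb)           # design rows at X = 1, 2, 8, 9
cmat <- rbind(slope1 = Xmat[2, ] - Xmat[1, ],  # segment below the knot
              slope2 = Xmat[4, ] - Xmat[3, ])  # segment above the knot
est  <- drop(cmat %*% fixef(mod))              # slope estimates (match diff of predictions)
se   <- sqrt(diag(cmat %*% vcov(mod) %*% t(cmat)))
pval <- 2 * pnorm(-abs(est / se))              # normal approximation
cbind(estimate = est, se = se, p = pval)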

Printing p-values with <0.001

I wonder how to print a <0.001 symbol when a p-value is smaller than 0.001, for use in Sweave.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
summary(lm.D9)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8465 0.1557174 31.12368 4.185248e-17
group1 -0.1855 0.1557174 -1.19126 2.490232e-01
Desired Output
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8465 0.1557174 31.12368 <0.001
group1 -0.1855 0.1557174 -1.19126 0.249
There are two main functions that I use: format.pval, and this one, which I ripped from gforge and tweaked.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
tmp <- data.frame(summary(lm.D9)$coef)
tmp <- setNames(tmp, colnames(summary(lm.D9)$coef))
tmp[ , 4] <- format.pval(tmp[ , 4], eps = .001, digits = 2)
tmp
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 5.032 0.2202177 22.85012 <0.001
# groupTrt -0.371 0.3114349 -1.19126 0.25
I like this one because it removes precision from p-values > .1 (or whatever threshold you like; regardless of digits, it keeps only two decimal places when the value is > .1), keeps trailing zeros (see example below), and adds in the < like you want below some level of precision (here 0.001).
pvalr <- function(pvals, sig.limit = .001, digits = 3, html = FALSE) {
  roundr <- function(x, digits = 1) {
    res <- sprintf(paste0('%.', digits, 'f'), x)
    zzz <- paste0('0.', paste(rep('0', digits), collapse = ''))
    res[res == paste0('-', zzz)] <- zzz  # avoid "-0.00"-style output
    res
  }
  sapply(pvals, function(x, sig.limit) {
    if (x < sig.limit) {
      # html = TRUE emits the HTML entity for "<"
      if (html)
        return(sprintf('&lt; %s', format(sig.limit)))
      return(sprintf('< %s', format(sig.limit)))
    }
    if (x > .1)
      return(roundr(x, digits = 2))
    roundr(x, digits = digits)
  }, sig.limit = sig.limit)
}
And examples:
pvals <- c(.133213, .06023, .004233, .000000134234)
pvalr(pvals, digits = 3)
# [1] "0.13" "0.060" "0.004" "< 0.001"
