Extract linear equations from lm - r

Assume I have data with a dependency y(t) and parameters p1, p2 and p3
which might influence the value y(t).
I create 3 linear equations that depend on the following combinations of the
parameters p1 and p2; p3 has no impact on y(t), i.e. it is assigned at random.
You can find a reproducible example at the end of the question.
The 3 equations are

p1  p2  Equation
 1   1  5 + 3t
 2   1  1 - t
 2   2  3 + t
A plot of the 3 equations together with the random data looks like the following:
Now, if I call lm() (for the formula syntax see here) on my random data, I get the following result.
Call:
lm(formula = y ~ .^2, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.14707 -0.22785  0.00157  0.23099  1.10528 

Coefficients: (6 not defined because of singularities)
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  4.83711    0.17548   27.565   <2e-16 ***
t            2.97316    0.02909  102.220   <2e-16 ***
p12         -3.86697    0.21487  -17.997   <2e-16 ***
p22          2.30617    0.20508   11.245   <2e-16 ***
p23               NA         NA       NA       NA    
p32          0.16518    0.21213    0.779   0.4375    
p33          0.23450    0.22594    1.038   0.3012    
t:p12       -4.00574    0.03119 -128.435   <2e-16 ***
t:p22        2.01230    0.03147   63.947   <2e-16 ***
t:p23             NA         NA       NA       NA    
t:p32        0.01155    0.03020    0.383   0.7027    
t:p33        0.02469    0.03265    0.756   0.4508    
p12:p22           NA         NA       NA       NA    
p12:p23           NA         NA       NA       NA    
p12:p32     -0.10368    0.21629   -0.479   0.6325    
p12:p33     -0.11728    0.21386   -0.548   0.5843    
p22:p32     -0.20871    0.19633   -1.063   0.2896    
p23:p32           NA         NA       NA       NA    
p22:p33     -0.44250    0.22322   -1.982   0.0495 *  
p23:p33           NA         NA       NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4112 on 136 degrees of freedom
Multiple R-squared:  0.9988,  Adjusted R-squared:  0.9987
F-statistic:  8589 on 13 and 136 DF,  p-value: < 2.2e-16
If I only want to consider parameters with high significance, I would argue that the non-significant parameters are close to zero and can be ignored. If I understand correctly, zero parameters do not lead to "new lines". I then obtain the following simplified model (values are rounded for readability):
            Estimate
(Intercept)    5 ***
t              3 ***
p12           -4 ***
p22            2 ***
t:p12         -4 ***
t:p22          2 ***
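As an aside, this simplified table can be pulled straight from the fitted model (the fit object from the reproducible example below); a minimal sketch that filters the coefficient matrix by p-value (summary() already drops the NA rows):
cf <- coef(summary(fit))
round(cf[cf[, "Pr(>|t|)"] < 0.001, "Estimate"])  # keep only the *** terms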
I would then reconstruct the theoretical model from the estimates above
(only the highly significant parameters!) as follows:
p1  p2  Equation                                  Result
 1   1  5 + 3t                                    5 + 3t
 1   2  5 + 3t + p22 + t:p22*t                    7 + 5t
 2   1  5 + 3t + p12 + t:p12*t                    1 - t
 2   2  5 + 3t + p22 + t:p22*t + p12 + t:p12*t    3 + t
Now, 7 + 5t is obviously wrong, but I am not sure about the reason.
My guess is that lm adds the parameters successively, so a corresponding model
such as y ~ t:p2 is not contained in the model above?
This question and the references therein might be related, but it does not look at the lm result, so it does not cover this.
Reproducible example:
r <- generate_3lines(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1))
t_m <- r$t_m; y_m <- r$y_m; y_t <- r$y_t; rm(r)
mydata <- generate_randomdata(t_m, y_m, y_t)
# What the raw data looks like:
plot(t_m[[1]], y_t[[1]], type = "l", lty = 3, col = "black", main = "Raw data",
     xlim = c(0, 10), ylim = c(min(mydata$y), max(mydata$y)), xlab = "t", ylab = "y")
lines(t_m[[2]], y_t[[2]], col = "black", lty = 3)
lines(t_m[[3]], y_t[[3]], col = "black", lty = 3)
points(x = mydata$t, y = mydata$y)
fit <- lm(y ~ .^2, data = mydata) # Not all levels / variables are linearly independent
print(summary(fit))
and the functions
generate_3lines <- function(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1)) {
  t <- seq(0, 10, length.out = 1000) # large sample of x values
  t_m <- list()
  y_m <- list()
  y_t <- list()
  for (i in 1:3) {
    set.seed(33 * i)
    t_m[[i]] <- sort(sample(t, 50, replace = FALSE))
    set.seed(33 * i)
    noise <- rnorm(10, 0, sigma) # length 10; R recycles it over the 50 points
    y_m[[i]] <- slopes[i] * t_m[[i]] + offsets[i] + noise
    y_t[[i]] <- slopes[i] * t_m[[i]] + offsets[i]
  }
  return(list(t_m = t_m, y_m = y_m, y_t = y_t))
}
generate_randomdata <- function(t_m, y_m, y_t) {
  # Final data set
  df1 <- data.frame(t = t_m[[1]], y = y_m[[1]], p1 = rep(1), p2 = rep(1),
                    p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace = TRUE))
  df2 <- data.frame(t = t_m[[2]], y = y_m[[2]], p1 = rep(2), p2 = rep(2),
                    p3 = sample(c(1, 2, 3), length(t_m[[2]]), replace = TRUE))
  df3 <- data.frame(t = t_m[[3]], y = y_m[[3]], p1 = rep(2), p2 = rep(3),
                    p3 = sample(c(1, 2, 3), length(t_m[[3]]), replace = TRUE))
  mydata <- rbind(df1, df2, df3)
  mydata$p1 <- factor(mydata$p1)
  mydata$p2 <- factor(mydata$p2)
  mydata$p3 <- factor(mydata$p3)
  mydata <- mydata[sample(nrow(mydata)), ]
  return(mydata)
}
Edit after input from @MrFlick: The question is now also on Cross Validated
Comment: It seems the fit is not fully automated in ggplot; see here

In brief, everything is fine with the model and the result from lm. As explained in this answer on Cross Validated, 7 + 5t is simply an extrapolation into a range that contains no data. Furthermore, the synthetic data suffers from collinearity.
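A quick way to see that collinearity is to cross-tabulate the factor levels of the example data; a minimal sketch:
# p1 and p2 are perfectly confounded: only 3 of the 6 (p1, p2) cells
# contain data, which is why lm() reports 6 coefficients as NA, and why
# the combination p1 = 1, p2 = 2 (the 7 + 5t line) is pure extrapolation
with(mydata, table(p1, p2))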

Related

Trying to annotate a plot with regression results in R markdown

I am trying to add a footnote to the bottom of my plot with betas, standard errors, and p-values taken directly from the model summary I saved. However, it keeps telling me there is unexpected input in the parsed text. Any help would be greatly appreciated!
exact error:
Error in parse(text = text[[i]]) : <text>:1:26: unexpected input
1: 'Main effect of age: ' $
                            ^
minimal reproducible example:
library(ggeffects)  # for ggpredict()
library(ggplot2)
id <- rep(1:50)
tst <- c(sample(7:9, 50, replace = T))
mydf <- data.frame(id, tst)
mydf$age <- sample(40:90, 50, replace = T)
mydf$bmi <- sample(20:30, 50, replace = T)
mydf$sex <- sample(1:2, 50, replace = T)
##Overall model##
model <- lm(tst ~ age*sex + bmi, data = mydf)
summary(model)
model.df <- ggpredict(model, terms = c("age", "sex"))
model.plot <- plot(model.df) + theme(legend.position = "none") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 0, y = 0.05, parse = TRUE, size = 4,
           label = " 'Main effect of age: ' $\beta == %.2g ",
           coef(model)[2])
(model.plot)
It seems the syntax of the string you are parsing is wrong. Also, your code would add the text to each facet - not sure if that was the intended outcome (if so, just use parse = FALSE and the paste0(...) expression from below for your annotation). If you wish a global footnote, you could add a caption like so:
library(ggiraphExtra)
library(ggplot2)
set.seed(1234)
mydf <- data.frame(
  id = 1:50,
  tst = sample(7:9, 50, replace = T),
  age = sample(40:90, 50, replace = T),
  bmi = sample(20:30, 50, replace = T),
  sex = sample(1:2, 50, replace = T)
)
##Overall model##
model <- lm(tst ~ age*sex + bmi, data = mydf)
summary(model)
#> 
#> Call:
#> lm(formula = tst ~ age * sex + bmi, data = mydf)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.24030 -0.67286 -0.07152  0.62489  1.27281 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  4.99309    1.66745   2.994  0.00446 **
#> age          0.01975    0.02389   0.827  0.41274   
#> sex          1.21860    0.95986   1.270  0.21077   
#> bmi          0.06602    0.03805   1.735  0.08955 . 
#> age:sex     -0.01852    0.01532  -1.209  0.23307   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7507 on 45 degrees of freedom
#> Multiple R-squared:  0.125,  Adjusted R-squared:  0.04721
#> F-statistic: 1.607 on 4 and 45 DF,  p-value: 0.189
model.df <- ggPredict(model, terms = c("age", "sex"))
model.plot <- model.df +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(caption = paste0(
    "Main effect of age: β = ",
    sprintf("%.2g", coef(model)[2])))
model.plot
Created on 2022-06-30 by the reprex package (v2.0.1)
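If a parsed (plotmath) annotation is wanted instead of a caption, note that plotmath understands neither $ nor \beta: the Greek letter is written as beta, and pieces are joined with *. A minimal sketch, with the x and y coordinates picked arbitrarily:
# build a valid plotmath string; beta renders as the Greek letter
lbl <- sprintf("'Main effect of age: ' * beta == %.2g", coef(model)[2])
model.plot +
  annotate("text", x = 60, y = 9, parse = TRUE, size = 4, label = lbl)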

How to plot the recursive partitioning from the rpart package

I want to plot a partition of a two-dimensional covariate space constructed by recursive binary splitting. To be more precise, I would like to write a function that replicates the following graph (taken from The Elements of Statistical Learning, p. 306):
Displayed above is a two-dimensional covariate space and a partition obtained by recursively splitting the space with axis-aligned binary splits (what the CART algorithm does). What I want to implement is a function that takes the output of the rpart function and generates such a plot.
Some example code follows:
library(rpart)
## Generating data.
set.seed(1975)
n <- 5000
p <- 2
X <- matrix(sample(seq(0, 1, by = 0.01), n * p, replace = TRUE), ncol = p)
Y <- X[, 1] + 2 * X[, 2] + rnorm(n)
## Building tree.
tree <- rpart(Y ~ ., data = data.frame(Y, X), method = "anova",
              control = rpart.control(cp = 0, maxdepth = 2))
Navigating SO I found this function:
rpart_splits <- function(fit, digits = getOption("digits")) {
  splits <- fit$splits
  if (!is.null(splits)) {
    ff <- fit$frame
    is.leaf <- ff$var == "<leaf>"
    n <- nrow(splits)
    nn <- ff$ncompete + ff$nsurrogate + !is.leaf
    ix <- cumsum(c(1L, nn))
    ix_prim <- unlist(mapply(ix, ix + c(ff$ncompete, 0), FUN = seq, SIMPLIFY = FALSE))
    type <- rep.int("surrogate", n)
    type[ix_prim[ix_prim <= n]] <- "primary"
    type[ix[ix <= n]] <- "main"
    left <- character(nrow(splits))
    side <- splits[, 2L]
    for (i in seq_along(left)) {
      left[i] <- if (side[i] == -1L)
        paste("<", format(signif(splits[i, 4L], digits)))
      else if (side[i] == 1L)
        paste(">=", format(signif(splits[i, 4L], digits)))
      else {
        catside <- fit$csplit[splits[i, 4L], 1:side[i]]
        paste(c("L", "-", "R")[catside], collapse = "", sep = "")
      }
    }
    cbind(data.frame(var = rownames(splits),
                     type = type,
                     node = rep(as.integer(row.names(ff)), times = nn),
                     ix = rep(seq_len(nrow(ff)), nn),
                     left = left),
          as.data.frame(splits, row.names = FALSE))
  }
}
Using this function, I am able to recover all the splitting variables and points:
splits <- rpart_splits(tree)[rpart_splits(tree)$type == "main", ]
splits
# var type node ix left count ncat improve index adj
# 1 X2 main 1 1 < 0.565 5000 -1 0.18110662 0.565 0
# 3 X2 main 2 2 < 0.265 2814 -1 0.06358597 0.265 0
# 6 X1 main 3 5 < 0.645 2186 -1 0.07645851 0.645 0
The column var tells me the splitting variable for each non-terminal node, and the column left gives the associated splitting point. However, I do not know how to use this information to produce my desired plot.
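For what it's worth, with only three main splits the partition can be drawn by hand from the table above; a base-R sketch, with the split points hard-coded from that output (x axis = X2, y axis = X1, both on [0, 1]):
plot(NULL, xlim = c(0, 1), ylim = c(0, 1), xlab = "X2", ylab = "X1")
segments(0.565, 0, 0.565, 1)               # root split:  X2 < 0.565
segments(0.265, 0, 0.265, 1, lty = 2)      # left child:  X2 < 0.265
segments(0.565, 0.645, 1, 0.645, lty = 2)  # right child: X1 < 0.645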
Of course, if you have any alternative strategy that does not involve the use of rpart_splits, feel free to suggest it.
You could use the (unpublished) parttree package, which you can install from GitHub via:
remotes::install_github("grantmcdermott/parttree")
This allows:
library(parttree)
library(ggplot2)
ggplot() +
  geom_parttree(data = tree, aes(fill = path)) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  scale_fill_brewer(palette = "Pastel1", name = "Partitions") +
  theme_bw(base_size = 16) +
  labs(x = "X2", y = "X1")
Incidentally, this package also contains the function parttree, which returns something very similar to your
rpart_splits function:
parttree(tree)
  node         Y                          path  xmin  xmax  ymin  ymax
1    4 0.7556079   X2 < 0.565 --> X2 <  0.265   -Inf 0.265  -Inf   Inf
2    5 1.3087679   X2 < 0.565 --> X2 >= 0.265  0.265 0.565  -Inf   Inf
3    6 1.8681143  X2 >= 0.565 --> X1 <  0.645  0.565   Inf  -Inf 0.645
4    7 2.4993361  X2 >= 0.565 --> X1 >= 0.645  0.565   Inf 0.645   Inf
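Those rectangles can also be fed straight to base graphics; a small sketch, assuming (as in the printed output) that xmin/xmax bound X2 and ymin/ymax bound X1, and clamping the infinite bounds to the unit square:
pt <- parttree(tree)
plot(NULL, xlim = c(0, 1), ylim = c(0, 1), xlab = "X2", ylab = "X1")
rect(pmax(pt$xmin, 0), pmax(pt$ymin, 0),
     pmin(pt$xmax, 1), pmin(pt$ymax, 1))
# label each leaf with its predicted value
text((pmax(pt$xmin, 0) + pmin(pt$xmax, 1)) / 2,
     (pmax(pt$ymin, 0) + pmin(pt$ymax, 1)) / 2,
     labels = round(pt$Y, 2))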

multinomial-hmm using R

I have the following kind of data:
df <- data.frame(id = rep(1, each = 40),
                 year = seq(1961, 2000),
                 X1 = rbinom(40, size = 1, prob = 1 - 0.6) * rpois(40, lambda = 4),
                 X2 = rbinom(40, size = 1, prob = 1 - 0.7) * rpois(40, lambda = 4),
                 X3 = rbinom(40, size = 1, prob = 1 - 0.6) * rpois(40, lambda = 5),
                 X4 = rbinom(40, size = 1, prob = 1 - 0.7) * rpois(40, lambda = 6))
As you can see, my data contain four count variables.
I want to estimate a multinomial HMM, as I expect there is a latent variable C with 3 possible states that affects Pr(X_t = x_t) (each vector X_t is assumed to be mutually independent conditional on the Markov chain C_t). For example, I expect that if C_t = 1 we would observe a vector X_t like (4, 1, 0, 0), if C_t = 2 a vector more like (0, 1, 1, 0), and if C_t = 3 a vector more like (0, 0, 1, 5).
I haven't found a package that can estimate this type of model, so currently I am using the depmixS4 package.
library(depmixS4)
mod <- depmix(list(X1 ~ 1, X2 ~ 1, X3 ~ 1, X4 ~ 1), data = df, nstates = 3,
              family = list(poisson(), poisson(), poisson(), poisson()))
However, I am not sure this model matches my theoretical expectations. Can I use depmix differently so that it suits my model better?
You could simply use a multinomial distribution for the response. I'm assuming that you mean to let X1, ..., X4 refer to four levels of a single categorical variable, and that each of these variables contains the count of a level in a particular year? One option is then to fit the following model:
> library(depmixS4)
Loading required package: nnet
Loading required package: MASS
Loading required package: Rsolnp
Loading required package: nlme
> set.seed(1)
> df <- data.frame(id = rep(1, each = 40),
+ year = seq(1961,2000),
+ X1 = rbinom(40, size = 1, prob = 1 - 0.6) * rpois(40, lambda = 4),
+ X2 = rbinom(40, size = 1, prob = 1 - 0.7) * rpois(40, lambda = 4),
+ X3 = rbinom(40, size = 1, prob = 1 - 0.6) * rpois(40, lambda = 5),
+ X4 = rbinom(40, size = 1, prob = 1 - 0.7) * rpois(40, lambda = 6))
> # matrix for single multinomial response variable
> X <- as.matrix(df[,c("X1", "X2", "X3", "X4")])
>
> # formulate model
> mod<-depmix(X ~ 1, data=df, nstates=3,
+ family=multinomial("identity"))
>
> # fit model
> fmod <- fit(mod)
converged at iteration 22 with logLik: -161.5714
>
> # show results
> summary(fmod)
Initial state probabilities model 
pr1 pr2 pr3 
  0   0   1 

Transition matrix 
        toS1  toS2  toS3
fromS1 0.290 0.355 0.356
fromS2 0.132 0.328 0.540
fromS3 0.542 0.191 0.267

Response parameters 
Resp 1 : multinomial 
    Re1.pr1 Re1.pr2 Re1.pr3 Re1.pr4
St1   0.092    0.56   0.288   0.061
St2   0.033    0.00   0.423   0.544
St3   0.608    0.00   0.392   0.000
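To see which state the model assigns to each year, the posterior state sequence can be extracted; a sketch (depending on the depmixS4 version, posterior() may ask for a type argument such as type = "viterbi"):
# most likely state per year, plus the posterior state probabilities
post <- posterior(fmod)
head(data.frame(year = df$year, state = post$state))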

simulating a simple linear model

I'm trying to simulate a simple linear model 100 times and find the LS estimate of B1 from each fitted model.
set.seed(123498)
x <- rnorm(z, 0, 1)
e <- rnorm(z, 0, 2)
y <- 0.5 + 2*x + e
model <- lm(y ~ x)
simulaten = 100
z = 10
for (i in 1:simulaten){
  e <- rnorm(n, 0, 2)
  x <- rnorm(n, 0, 1)
  y <- 0.5 + 2*x + e
  model <- lm(y ~ x)
}
summary(model)
Is that what my for loop is achieving, or have I missed the mark?
Here is a replicate solution. I have set n (forgotten in the question) and used a smaller simulaten.
n <- 100
simulaten <- 4
set.seed(123498)
model_list <- replicate(simulaten, {
e <- rnorm(n, 0, 2)
x <- rnorm(n, 0, 1)
y <- 0.5 + 2*x + e
lm(y ~ x)
}, simplify = FALSE)
model_list
Edit
Several statistics can be obtained from the list of models. The coefficients are extracted by applying the function coef to each model.
Done with sapply, the returned object is a matrix with 2 rows (the str output below comes from a run with simulaten = 1000, to have enough estimates for the histograms).
betas <- sapply(model_list, coef)
str(betas)
# num [1:2, 1:1000] 0.671 1.875 0.374 2.019 0.758 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "(Intercept)" "x"
# ..$ : NULL
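A quick sanity check on the simulated estimates - their averages should sit near the true values 0.5 and 2 used to generate y:
rowMeans(betas)      # mean intercept and slope estimates
apply(betas, 1, sd)  # simulation spread of the estimates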
As for the graph, here is an example. Note that in order for the x axis to cover all the estimates, the first call to hist sets xlim = range(betas).
lgd <- c(expression(beta[0]), expression(beta[1]))
hist(betas[1, ], freq = FALSE, col = "lightblue", xlim = range(betas), ylim = c(0, 2.5), xlab = "betas", main = "")
hist(betas[2, ], freq = FALSE, col = "blue", add = TRUE)
legend("top", legend = lgd, fill = c("lightblue", "blue"), horiz = TRUE)
The model is overwritten in each iteration, so summary is returning the summary output of the last model only. We could store each fit in a list instead.
# initialize an empty list of length simulaten
modellst <- vector('list', simulaten)
for (i in seq_len(simulaten)) {
  e <- rnorm(n, 0, 2)
  x <- rnorm(n, 0, 1)
  y <- 0.5 + 2*x + e
  # assign the model output to the corresponding list element
  modellst[[i]] <- lm(y ~ x)
}

Slopes for lme linear b-spline model

I was wondering how to obtain slope estimates, with SEs and p-values, for each segment of an lme model using linear b-splines.
I can get slope estimates using predict, but not SEs and p-values.
Here is an example:
rm(list = ls())
library(splines)
library(nlme)
getY <- function(x) ifelse(x < 7, x * 1.3, x * 0.6) + rnorm(length(x))
set.seed(123)
data <- data.frame(Id = numeric(0), X = numeric(0), Y = numeric(0))
for (i in 1:10) {
  X <- sample(1:10, 4)
  Y <- getY(X) + rnorm(1, 0.5)
  Id <- rep(i, 4)
  data <- rbind(data, cbind(Id = Id, X = X, Y = Y))
}
gdata <- groupedData(Y ~ X | Id, data)
mod <- lme(fixed = Y ~ bs(X, degree = 1, knots = 7), data = gdata,
           random = ~ 1 | Id)
summary(mod)
summary(mod)
Linear mixed-effects model fit by REML
 Data: gdata 
   AIC   BIC logLik
 158.2 166.2 -74.09

Random effects:
 Formula: ~1 | Id
        (Intercept) Residual
StdDev:       1.217    1.389

Fixed effects: Y ~ bs(X, degree = 1, knots = 7) 
                              Value Std.Error DF t-value p-value
(Intercept)                   3.098    0.5817 28   5.326   0e+00
bs(X, degree = 1, knots = 7)1 4.031    0.7714 28   5.225   0e+00
bs(X, degree = 1, knots = 7)2 3.253    0.7258 28   4.481   1e-04
 Correlation: 
                              (Intr) b(X,d=1,k=7)1
bs(X, degree = 1, knots = 7)1 -0.597              
bs(X, degree = 1, knots = 7)2 -0.385  0.233       

Standardized Within-Group Residuals:
      Min        Q1       Med        Q3       Max 
-1.469915 -0.628202  0.005586  0.541398  1.748387 

Number of Observations: 40
Number of Groups: 10
plot(augPred(mod))
pred1 <- predict(mod, data.frame(X = 1:2), level = 0)
pred2 <- predict(mod, data.frame(X = 8:9), level = 0)
(slope1 <- diff(pred1))
     1 
0.6718 
(slope2 <- diff(pred2))
      1 
-0.2594 
Wouldn't you just take the differences of a predict result?
predict(mod, newdata = data.frame(X = 1:10, Id = 1))
       1        1        1        1        1        1        1        1        1        1 
3.449572 4.121362 4.793152 5.464941 6.136731 6.808521 7.480311 7.220928 6.961544 6.702161 
attr(,"label")
[1] "Predicted values"
So:
plot(predict(mod, newdata = data.frame(X = 1:10, Id = 1)), ylim = c(-2, 8))
lines(1:9, diff(predict(mod, newdata = data.frame(X = 1:10, Id = 1))))
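To also get the standard errors and p-values asked for, note that each segment slope is a linear combination of the fixed effects, so its SE follows from vcov(mod). A sketch under the assumption that X spans 1 to 10 in the data, so that the bs() boundary knots are 1 and 10 (they must match the ones implied by the fit):
# basis values at two points inside each segment
X0 <- data.frame(X = c(1, 2, 8, 9))
M  <- model.matrix(~ bs(X, degree = 1, knots = 7, Boundary.knots = c(1, 10)),
                   data = X0)
c1 <- M[2, ] - M[1, ]  # contrast for the slope of segment 1 (X < 7)
c2 <- M[4, ] - M[3, ]  # contrast for the slope of segment 2 (X > 7)
beta <- fixef(mod)
V    <- vcov(mod)
est  <- c(sum(c1 * beta), sum(c2 * beta))
se   <- sqrt(c(c1 %*% V %*% c1, c2 %*% V %*% c2))
cbind(estimate = est, se = se, t = est / se,
      p = 2 * pt(-abs(est / se), df = 28))  # df taken from the summary above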
