How to run stratified bootstrapped linear regression in R?

In my model, x is a categorical variable with 3 categories: 0, 1, and 2, where 0 is the reference category. However, the 0 category is larger than the others (1 and 2), so to avoid a biased sample I want to do stratified bootstrapping, but I could not find any relevant method for that.
df <- data.frame(x = c(0, 0, 0, 0, 0, 1, 1, 2, 2),
                 y = c(10, 11, 10, 10, 12, 17, 16, 20, 19),
                 m = c(6, 5, 6, 7, 2, 10, 14, 8, 11))
df$x <- as.factor(df$x)
df$x <- relevel(df$x,ref = "0")
fit <- lm(y ~ x*m, data = df)
summary(fit)

Expanding on Roland's answer in the comments, you can harvest the confidence intervals from bootstrapping using boot.ci:
library(boot)
b <- boot(df, \(DF, i) coef(lm(y ~ x*m, data = DF[i, ])), strata = df$x, R = 999)
result <- do.call(rbind, lapply(seq_along(b$t0), function(i) {
  m <- boot.ci(b, type = 'norm', index = i)$normal
  data.frame(estimate = b$t0[i], lower = m[2], upper = m[3])
}))
result
#>               estimate      lower       upper
#> (Intercept) 12.9189189 10.7166127 15.08403731
#> x1           6.5810811  2.0162637  8.73184665
#> x2           9.7477477  6.9556841 11.37390826
#> m           -0.4459459 -0.8010925 -0.07451434
#> x1:m         0.1959459 -0.1842914  0.55627896
#> x2:m         0.1126126 -0.2572955  0.48352616
And even plot the results like this:
library(ggplot2)
ggplot(within(result, var <- rownames(result)), aes(estimate, var)) +
  geom_vline(xintercept = 0, color = 'gray') +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.1) +
  geom_point(color = 'red') +
  theme_light()
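For intuition, strata = df$x makes boot resample within each level of x, so every bootstrap sample keeps the original group sizes (5/2/2 here). A minimal base-R sketch of the difference:
set.seed(1)
# plain bootstrap: group sizes vary from replicate to replicate
i_plain <- sample(nrow(df), replace = TRUE)
table(df$x[i_plain])
# stratified bootstrap: resample within each level, so counts stay 5/2/2
# (fine here; beware sample()'s scalar behavior if a stratum had one element)
i_strat <- unlist(lapply(split(seq_len(nrow(df)), df$x),
                         function(idx) sample(idx, replace = TRUE)))
table(df$x[i_strat])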

Related

R Different Prediction Result for Formula Containing "%>%"

In R, when I use predict to get a confidence interval for a certain x (x = 42) under the model y = (centered x) + (centered x)^2, I found two possible ways:
model1 = lm(y ~ scale(x, center=T, scale=F) + I( (scale(x, center=T, scale=F))^2 ), data=data)
model2 = lm(y ~ (x %>% scale(center=T, scale=F)) + I( (x %>% scale(center=T, scale=F))^2 ), data=data)
The summary results for the two models are the same. But when I run:
predict(model1, data.frame(x=42), interval="confidence", level=0.95)
predict(model2, data.frame(x=42), interval="confidence", level=0.95)
The results are different, and I am wondering why. Does R treat the above two formulas differently because of the use of "%>%"?
The dataset is a practice dataset from Kutner's textbook SENIC.txt, y is the 11th column, x is the 12th column.
The issue here is with scale and not %>%: scale returns a one-column matrix rather than a plain numeric vector, which seems to affect the outcome.
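A quick way to see this in plain base R:
s <- scale(1:5, center = TRUE, scale = FALSE)
class(s) # "matrix" "array" -- not a plain numeric vector
dim(s)   # 5 1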
One solution is to write a vector-to-vector equivalent of scale and use it:
library("magrittr") # for `%>%`
set.seed(1)
dataset = data.frame(x = rnorm(30))
dataset[["y"]] = 1 + (3 * dataset[["x"]]) + rnorm(30, mean = 0, sd = 0.1)
scale_vector = function(x, ...) {
  stopifnot(inherits(x, "numeric"))
  scale(x, ...)[, 1]
}
lm(y ~ scale_vector(x, center=T, scale=F) + I( (scale_vector(x, center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ scale_vector(x, center = T, scale = F) + I((scale_vector(x,
#> center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> scale_vector(x, center = T, scale = F)
#> 2.99296
#> I((scale_vector(x, center = T, scale = F))^2)
#> -0.01645
lm(y ~ (x %>% scale_vector(center=T, scale=F)) + I( (x %>% scale_vector(center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ (x %>% scale_vector(center = T, scale = F)) +
#> I((x %>% scale_vector(center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> x %>% scale_vector(center = T, scale = F)
#> 2.99296
#> I((x %>% scale_vector(center = T, scale = F))^2)
#> -0.01645
Besides, if you do not mind using the tidyverse, this might be cleaner:
library("magrittr") # for `%>%`
set.seed(1)
dataset = tibble::tibble(x = rnorm(30),
                         y = 1 + (3 * x) + rnorm(30, mean = 0, sd = 0.1))
dataset %>%
dplyr::mutate(x_scaled = scale(x, center = TRUE, scale = FALSE)[, 1]) %>%
lm(y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Call:
#> lm(formula = y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Coefficients:
#> (Intercept) x_scaled I(x_scaled^2)
#> 1.27423 2.99296 -0.01645

Plotting with GLMMadaptive for zero-inflated, semi-continuous data?

I'm trying to utilize the effectPlotData as described here: https://cran.r-project.org/web/packages/GLMMadaptive/vignettes/Methods_MixMod.html
But, I'm trying to apply it to a model (two-part mixed model for zero-inflated semi-continuous data) that includes random/fixed effects for both a linear and logistic portion (hurdle lognormal). I get the following error:
'Error in Qs[1, ] : incorrect number of dimensions'
This, I think, comes from having more than one set of random/fixed effects, but if anyone else has come across this error or can advise, it would be appreciated! I've tried changing the terms in the new data frame, and I've tried a couple of different options with length.out (first the number of subjects, then the total number of observations across all subjects), but I get the same error each time.
The code below specifies the model (m) and the new data frame (nDF):
m = mixed_model(Y ~ X, random = ~ 1 | Subject,
                data = data_combined_temp_Fix_Num3,
                family = hurdle.lognormal,
                n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X), max(X), length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)
It seems to work with the following example:
library("GLMMadaptive")
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we construct a data frame with the design:
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
                 time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
                 sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)
betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients non-zero part
sigma <- 0.5 # standard deviation error terms non-zero part
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.1 # variance of random slopes non-zero part
D33 <- 0.4 # variance of random intercepts zero part
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)), rnorm(n, sd = sqrt(D33)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1:2, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 3, drop = FALSE]))
# we simulate log-normal longitudinal data
DF$y <- exp(rnorm(n * K, mean = eta_y, sd = sigma))
# we set the zeros from the logistic regression
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
###############################################################################
km1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
family = hurdle.lognormal(),
zi_fixed = ~ sex)
km1
nDF <- with(DF, expand.grid(time = seq(min(time), max(time), length.out = 15),
sex = levels(sex)))
plot_data <- effectPlotData(km1, nDF)
library("lattice")
xyplot(pred + low + upp ~ time | sex, data = plot_data,
type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
xlab = "Follow-up time", ylab = "")
local({
km1$Funs$mu_fun <- function (eta) {
pmax(exp(eta + 0.5 * exp(2 * km1$phis)), .Machine$double.eps)
}
km1$family$linkfun <- function (mu) log(mu)
plot_data <- effectPlotData(km1, nDF)
xyplot(exp(pred) + exp(low) + exp(upp) ~ time | sex, data = plot_data,
type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
xlab = "Follow-up time", ylab = "")
})
In case someone comes across the same error: I was filtering data from my data frame within the model call, which caused the dimensions of the model and of the variables in the data frame not to match. I applied the same filtering to the new data frame. (I've since moved to a completely new data frame that only includes the trials actually used by the model, so that no filtering has to be done at any step.)
m = mixed_model(Y ~ X, random = ~ 1 | Subject,
                data = data_combined_temp_Fix_Num3[data_combined_temp_Fix_Num3$Z >= 4 &
                                                     data_combined_temp_Fix_Num3$ZZ >= 4, ],
                family = hurdle.lognormal,
                n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                        data_combined_temp_Fix_Num3$ZZ >= 4]),
                                max(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                        data_combined_temp_Fix_Num3$ZZ >= 4]),
                                length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)
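A pattern that guards against this class of error (a sketch, reusing the Z/ZZ filtering variables from the code above) is to filter once into a dedicated data frame and pass that same object to both mixed_model() and expand.grid():
# filter once, then use the same object everywhere
dat_model <- subset(data_combined_temp_Fix_Num3, Z >= 4 & ZZ >= 4)
m <- mixed_model(Y ~ X, random = ~ 1 | Subject,
                 data = dat_model,
                 family = hurdle.lognormal,
                 n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                 na.action = na.exclude)
nDF <- with(dat_model,
            expand.grid(X = seq(min(X), max(X), length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)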

How to display standardized y-scores in interaction plot

I am trying to plot a two-way interaction of standardized data in R using the package "interplot". However, the displayed y-scores are not standardized anymore. Why is that and how can I fix that?
I have tried to change the y-limits and to use the "scale_y_continuous()" function.
# generate data
x <- rnorm(100, 0, 1)
y <- x + rnorm(100, 0, 1)
z <- y + rnorm(100, 0, 1)
df <- as.data.frame(cbind(x,y,z))
# build model with interaction term
model1 <- glm(y ~ x*z, data=df)
# plot interaction
require(interplot)
interplot(model1, var1 = "x", var2 = "z", ci = 0.95, predPro = TRUE,
          var2_vals = c(-1, 1), hist = F) +
  xlim(-3, 3) +
  theme_classic()
I expect the y-scale to display values between -3 and +3, since the scores are standardized. However, the displayed y-values are between 20 and 80.
With the help of the example in ?interplot:
set.seed(123)
# generate data
x <- rnorm(100, 0, 1)
y <- x + rnorm(100, 0, 1)
z <- y + rnorm(100, 0, 1)
df <- as.data.frame(cbind(x,y,z))
# build model with interaction term
model1 <- glm(y ~ x*z, data=df)
# equivalent to lm(y ~ x*z, data = df): glm with the default gaussian family is a linear model
# plot interaction
require(interplot, quietly = TRUE, warn.conflicts = FALSE)
interplot(model1, var1 = "x", var2 = "z", ci = 0.95,
          predPro = TRUE, var2_vals = c(-1, 1)) +
  xlim(-3, 3) +
  xlab("x values") +
  ylab("Estimated Coefficient for z") +
  ggtitle('Estimated Coefficient of z by x conditionally to y in c(-1,1)') +
  theme_classic()
interplot(model1, var1 = "x", var2 = "z", ci = 0.95) +
  xlim(-3, 3) +
  xlab("x values") +
  ylab("Estimated Coefficient for z") +
  ggtitle('Estimated Coefficient of z by x') +
  theme_classic()
#> Warning: Removed 28 rows containing missing values (geom_path).
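As the comment in the code above notes, glm() with its default gaussian family fits the same model as lm(); a quick check:
fit_glm <- glm(y ~ x*z, data = df) # default family = gaussian
fit_lm  <- lm(y ~ x*z, data = df)
all.equal(coef(fit_glm), coef(fit_lm)) # TRUE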

How do I propagate the error of a linear regression when projecting from Y to X?

I'm trying to figure out how to propagate errors in the following case:
I am calibrating a machine with a couple of standards (a, b, c) with
accepted values x. My machine measures y for these standards, with a
certain error (standard deviation of 1 in this example).
Then I measure replicates of a sample, yielding ynew. Now I want to
convert these values to the accepted measurement scale (the x-axis).
To do this, I can of course do some linear algebra and convert the slope and
intercept that I got from my standard measurements into a reversed slope and
intercept, i.e. x = (y - intercept) / slope.
This works nicely to convert the input values, but how do I get proper estimates of the errors?
In R, I've tried the following:
library(broom) # for tidy lm
library(ggplot2) # for plotting
library(dplyr) # to allow piping
# find confidence value
cv <- function(x, level = 95) {
  qt(1 - ((100 - level) / 100) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
}
# find confidence interval
ci <- function(x, level = 95) {
  xbar <- mean(x)
  xci <- cv(x, level = level)
  c(fit = xbar, lwr = xbar - xci, upr = xbar + xci)
}
set.seed(1337)
# create fake data
dat <- data.frame(id = rep(letters[1:3], 20),
                  x = rep(c(1, 7, 10), 20)) %>%
  mutate(y = rnorm(n(), -20 + 1.25 * x, 1))
# generate linear model
mod <- lm(y ~ x, dat)
# tidy
mod_aug <- augment(mod)
# these are the new samples that my machine measures
ynew <- rnorm(10, max(dat$y) + 3)
# predict new x-value based on y-value that is outside of range
## predict(mod, newdata = data.frame(y = ynew), interval = "predict")
# Error in eval(predvars, data, env) : object 'x' not found
# or tidy
## augment(mod, newdata = data.frame(y = ynew))
# 50 row df that doesn't make sense
# found this function that should do the job, but it doesn't extrapolate
## approx(x = mod$fitted.values, y = dat$x, xout = ynew)$y
# [1] NA NA NA NA NA NA NA NA NA NA
# this one from Hmisc does allow for extrapolation
with_approx <- Hmisc::approxExtrap(x = mod_aug$.fitted, y = mod_aug$x, xout = ynew)$y
# but in case of lm, isn't using the slope and intercept of a model okay too?
with_itc_slp <- (- coef(mod)[1] / coef(mod)[2]) + (1 / coef(mod)[2] * ynew)
# this would be the 95% prediction interval of the model at the average
# sample position. Could also use "confidence" but this is more correct?
avg_prediction <- predict(mod,
                          newdata = data.frame(x = mean(with_itc_slp)),
                          interval = "prediction")
# plot it
ggplot(dat, aes(x = x, y = y, col = id)) +
  geom_point() +
  geom_hline(yintercept = ynew, col = "gray") +
  geom_smooth(aes(group = 1), method = "lm", se = F, fullrange = T,
              col = "lightblue") +
  geom_smooth(aes(group = 1), method = "lm") +
  # 95% CI of the new sample
  annotate("pointrange", x = 1, y = mean(ynew),
           ymin = ci(ynew)[2], ymax = ci(ynew)[3], col = "green") +
  # 95% prediction interval of the linear model at the average transformed
  # x-position
  annotate("pointrange", x = mean(with_approx), y = mean(ynew),
           ymin = avg_prediction[2], ymax = avg_prediction[3], col = "green") +
  # transformed using approx
  annotate("point", x = with_approx, y = ynew, size = 3, col = "blue",
           shape = 1) +
  # transformed using intercept and slope
  annotate("point", x = with_itc_slp, y = ynew, size = 3, col = "red",
           shape = 2) +
  # it's pretty
  coord_fixed()
resulting in this plot:
Now how do I go from these 95% CIs in the y-direction to transformed sample
x-values with a confidence interval in the x-direction?
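One simple (approximate) route, a sketch rather than a full error propagation: since the inverse mapping x = (y - intercept) / slope is monotone when the slope is positive, you can push the y-interval endpoints through it. Note that this treats the fitted intercept and slope as fixed, so it understates the true uncertainty:
# map the y-direction interval endpoints through the inverse calibration line
# (treats the fitted slope/intercept as known, so the interval is too narrow)
b0 <- coef(mod)[1]
b1 <- coef(mod)[2]
y_ci <- ci(ynew)         # fit / lwr / upr in the y-direction
x_ci <- (y_ci - b0) / b1 # the same interval mapped onto the x-axis
x_ci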

2 polynomial regressions in a ggplot() graph

This is my Dataset:
As you can see, there are two quantitative variables (X, Y) and one categorical variable (molar, with two levels: M1 and M2).
I would like to represent in a single graph two polynomial regressions and their respective prediction intervals: one for the M1 factor and one for the M2 factor. Each polynomial regression has its own degree (M1 uses a degree-4 polynomial, and M2 a degree-6).
I want to use the ggplot() function (from the ggplot2 package in R). I have actually produced this figure, but with all the data merged (I mean, with no distinction between factors). This is the code I used:
# Fit a linear model
m <- lm(Y ~ X + I(X^2) + I(X^3) + I(X^4), data = Dataset)
# cbind the predictions to Dataset
mpi <- cbind(Dataset, predict(m, interval = "prediction"))
ggplot(mpi, aes(x = X)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              fill = "blue", alpha = 0.2) +
  geom_point(aes(y = Y)) +
  geom_line(aes(y = fit), colour = "blue", size = 1)
With this result:
So, I would like to have two polynomial regressions of different degrees (one for M1 and one for M2), taking into account their respective prediction intervals. What would be the exact code?
UPDATE - New code! I ran this code with no success:
M1=subset(Dataset,Dataset$molar=="M1",select=X:Y)
M2=subset(Dataset,Dataset$molar=="M2",select=X:Y)
M1.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M1",select=X:Y))
M2.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M2",select=X:Y))
newdf <- data.frame(x = seq(0, 1, c(408,663)))
M1.P <- cbind(data=subset(Dataset,Dataset$molar=="M1",select=X:Y), predict(M1.R, interval = "prediction"))
M2.P <- cbind(data=subset(Dataset,Dataset$molar=="M2",select=X:Y), predict(M2.R, interval = "prediction"))
p = cbind(as.data.frame(rbind(M1.P, M2.P)), f = factor(rep(1:2, c(408,663)), x = rep(newdf$x, 2))
mdf = with(Dataset, data.frame(x = rep(x, 2), y = c(subset(Dataset,Dataset$molar=="M1",select=Y), subset(Dataset,Dataset$molar=="M2",select=Y),
f = factor(rep(1:2, c(408,663))))
ggplot(mdf, aes(x = x, y = y, colour = f)) + geom_point() +
geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
fill = f, y = NULL, colour = NULL),
alpha = 0.2) +
geom_line(data = p, aes(x = x, y = fit))
These are the messages I get now:
[98] WARNING: Warning in if (n < 0L) stop("wrong sign in 'by' argument") :
the condition has length > 1 and only the first element will be used
Warning in if (n > .Machine$integer.max) stop("'by' argument is much too small") :
the condition has length > 1 and only the first element will be used
Warning in 0L:n :
numerical expression has 2 elements: only the first used
Warning in if (by > 0) pmin(x, to) else pmax(x, to) :
the condition has length > 1 and only the first element will be used
[99] WARNING: Warning in predict.lm(M1.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[100] WARNING: Warning in predict.lm(M2.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[101] ERROR: <text>
I think I am closer but still can't see it. Help!
Here is one way. If you have more than two models/levels in the factor, you should look into code that iterates over the levels of the factor and fits the models that way (there is a sketch of this at the end of the answer).
Anyway, first some dummy data:
set.seed(100)
x <- runif(100)
y1 <- 2 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) + rnorm(100)
y2 <- -1 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) +
(-0.3 * x^5) + (2.4 * x^6) + rnorm(100)
df <- data.frame(x, y1, y2)
Fit our two models:
m1 <- lm(y1 ~ poly(x, 4), data = df)
m2 <- lm(y2 ~ poly(x, 6), data = df)
Now predict at some new locations of x and stick the predictions together with x and f, a factor indexing the model, into a tidy format:
newdf <- data.frame(x = seq(0, 1, length = 100))
p1 <- predict(m1, newdata = newdf, interval = "prediction")
p2 <- predict(m2, newdata = newdf, interval = "prediction")
p <- cbind(as.data.frame(rbind(p1, p2)), f = factor(rep(1:2, each = 100)),
           x = rep(newdf$x, 2))
Melt the original data into a tidy form:
mdf <- with(df, data.frame(x = rep(x, 2), y = c(y1, y2),
                           f = factor(rep(1:2, each = 100))))
Draw the plot, using colour to distinguish the models/data:
ggplot(mdf, aes(x = x, y = y, colour = f)) +
  geom_point() +
  geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
                            fill = f, y = NULL, colour = NULL),
              alpha = 0.2) +
  geom_line(data = p, aes(x = x, y = fit))
This gets us the combined plot.
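To generalize to more factor levels, as suggested at the top of this answer, one sketch is to fit one model per level with lapply() and row-bind the predictions. Dataset, molar, X and Y are the (hypothetical) column names from the question, and the per-level polynomial degrees are supplied in a named vector:
# fit one polynomial per level of `molar` and collect tidy predictions
degrees <- c(M1 = 4, M2 = 6) # per-level polynomial degrees
p <- do.call(rbind, lapply(names(degrees), function(lev) {
  d   <- subset(Dataset, molar == lev)
  fit <- lm(Y ~ poly(X, degrees[[lev]]), data = d)
  new <- data.frame(X = seq(min(d$X), max(d$X), length.out = 100))
  cbind(new, as.data.frame(predict(fit, newdata = new, interval = "prediction")),
        f = lev)
}))
# p now holds X, fit, lwr, upr and f, and drops straight into the ggplot code above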
