Fast post hoc computation using R - r

I have a large dataset which I would like to perform post hoc computation:
dat = as.data.frame(matrix(runif(10000*300), ncol = 10000, nrow = 300))
dat$group = rep(letters[1:3], 100)
Here is my code:
start <- Sys.time()
vars <- names(dat)[-ncol(dat)]
aov.out <- lapply(vars, function(x) {
lm(substitute(i ~ group, list(i = as.name(x))), data = dat)})
TukeyHSD.out <- lapply(aov.out, function(x) TukeyHSD(aov(x)))
Sys.time() - start
Time difference of 4.033335 mins
It takes about 4 min, are there more efficient and elegant ways to perform post hoc using R?
Thanks a lot

Your example is too big. For illustration of the idea I use a small one.
set.seed(0)
dat = as.data.frame(matrix(runif(2*300), ncol = 2, nrow = 300))
dat$group = rep(letters[1:3], 100)
Why do you call aov on a fitted "lm" model? That basically refits the same model.
Have a read on Fitting a linear model with multiple LHS first. lm is the workhorse of aov, so you can pass a multiple LHS formula to aov. The model has class c("maov", "aov", "mlm", "lm").
response_names <- names(dat)[-ncol(dat)]
form <- as.formula(sprintf("cbind(%s) ~ group", toString(response_names)))
fit <- do.call("aov", list(formula = form, data = quote(dat)))
Now the issue is: there is no "maov" method for TuckyHSD. So we need a hacking.
TuckyHSD relies on the residuals of the fitted model. In c("aov", "lm") case the residuals is a vector, but in c("maov", "aov", "mlm", "lm") case it is a matrix. The following demonstrates the hacking.
aov_hack <- fit
aov_hack[c("coefficients", "fitted.values")] <- NULL ## don't need them
aov_hack[c("contrasts", "xlevels")] <- NULL ## don't need them either
attr(aov_hack$model, "terms") <- NULL ## don't need it
class(aov_hack) <- c("aov", "lm") ## drop "maov" and "mlm"
## the following elements are mandatory for `TukeyHSD`
## names(aov_hack)
#[1] "residuals" "effects" "rank" "assign" "qr"
#[6] "df.residual" "call" "terms" "model"
N <- length(response_names) ## number of response variables
result <- vector("list", N)
for (i in 1:N) {
## change response variable in the formula
aov_hack$call[[2]][[2]] <- as.name(response_names[i])
## change residuals
aov_hack$residuals <- fit$residuals[, i]
## change effects
aov_hack$effects <- fit$effects[, i]
## change "terms" object and attribute
old_tm <- terms(fit) ## old "terms" object in the model
old_tm[[2]] <- as.name(response_names[i]) ## change response name in terms
new_tm <- terms.formula(formula(old_tm)) ## new "terms" object
aov_hack$terms <- new_tm ## replace `aov_hack$terms`
## replace data in the model frame
aov_hack$model[1] <- data.frame(fit$model[[1]][, i])
names(aov_hack$model)[1] <- response_names[i]
## run `TukeyHSD` on `aov_hack`
result[[i]] <- TukeyHSD(aov_hack)
}
result[[1]] ## for example
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
#Fit: aov(formula = V1 ~ group, data = dat)
#
#$group
# diff lwr upr p adj
#b-a -0.012743870 -0.1043869 0.07889915 0.9425847
#c-a -0.022470482 -0.1141135 0.06917254 0.8322109
#c-b -0.009726611 -0.1013696 0.08191641 0.9661356
I have used a "for" loop. Replace it with a lapply if you want.

Related

Get standard errors from the output of the "ols" function

How do you get standard errors of the coefficients from the output of the "ols" function (package "rms") in R? I know that "coef" gets the coefficients of the ols object but did not find a way to get the standard errors of those coefficients.
You should use summary
# Example Data
Fact1 <- runif(200)
Fact2 <- sample(0:3, 200, TRUE)
distance <- (Fact1 + Fact2/3 + rnorm(200))^2
d <- rms::datadist(Fact1, Fact2)
# Model
ols_model <- rms::ols(sqrt(distance) ~ rms::rcs(Fact1,4) + rms::scored(Fact2), x = TRUE)
#Summary
model_summary <- summary(ols_model)
# Isolate SEs
model_se <- model_summary[,5]

Dummies not included in summary

I want to create a function which will perform panel regression with 3-level dummies included.
Let's consider within model with time effects :
library(plm)
fit_panel_lr <- function(y, x) {
x[, length(x) + 1] <- y
#adding dummies
mtx <- matrix(0, nrow = nrow(x), ncol = 3)
mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
colnames(mtx) <- paste0("dummy_", 1:3)
#converting to pdataframe and adding dummy variables
x <- pdata.frame(x)
x <- cbind(x, mtx)
#performing panel regression
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste0(names(y), "~", form,'-1'))
params <- list(
formula = form, data = x_copy, model = "within",
effect = "time"
)
pglm_env <- list2env(params, envir = new.env())
model_plm <- do.call("plm", params, envir = pglm_env)
model_plm
}
However, if I use data :
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
In output I will get :
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
How come that in formula dummies are considered and in coefficients are not ? Is there any rational explanation or I did something wrong ?
One point why you do not see the dummies on the output is because they are linear dependent to the other data after the fixed-effect time transformation. They are dropped so what is estimable is estimated and output.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep is included in package plm) has some more nice examples on linear dependence before and after FE transformation.
A suggestion:
As for constructing dummy variables, I suggest to use R's factor with three levels and not have the dummy matrix constructed yourself. Using a factor is typically more convinient and less error prone. It is converted to the binary dummies (treatment style) by your typical estimation function using the model.frame/model.matrix framework.

How to use the replicate function in R to repeat the function

I have a problem when using replicate to repeat the function.
I tried to use the bootstrap to fit
a quadratic model using concentration as the predictor and Total_lignin as the response and going to report an estimate of the maximum with a corresponding standard error.
My idea is to create a function called bootFun that essentially did everything within one iteration of a for loop. bootFun took in only the data set the predictor, and the response to use (both variable names in quotes).
However, the SD is 0, not correct. I do not know where is the wrong place. Could you please help me with it?
# Load the libraries
library(dplyr)
library(tidyverse)
# Read the .csv and only use M.giganteus and S.ravennae.
dat <- read_csv('concentration.csv') %>%
filter(variety == 'M.giganteus' | variety == 'S.ravennae') %>%
arrange(variety)
# Check the data
head(dat)
# sample size
n <- nrow(dat)
# A function to do one iteration
bootFun <- function(dat, pred, resp){
# Draw the sample size from the dataset
sample <- sample_n(dat, n, replace = TRUE)
# A quadratic model fit
formula <- paste0('resp', '~', 'pred', '+', 'I(pred^2)')
fit <- lm(formula, data = sample)
# Derive the max of the value of concentration
max <- -fit$coefficients[2]/(2*fit$coefficients[3])
return(max)
}
max <- bootFun(dat = dat, pred = 'concentration', resp = 'Total_lignin' )
# Iterated times
N <- 5000
# Use 'replicate' function to do a loop
maxs <- replicate(N, max)
# An estimate of the max of predictor and corresponding SE
mean(maxs)
sd(maxs)
Base package boot, function boot, can ease the job of calling the bootstrap function repeatedly. The first argument must be the data set, the second argument is an indices argument, that the user does not set and other arguments can also be passed toit. In this case those other arguments are the predictor and the response names.
library(boot)
bootFun <- function(dat, indices, pred, resp){
# Draw the sample size from the dataset
dat.sample <- dat[indices, ]
# A quadratic model fit
formula <- paste0(resp, '~', pred, '+', 'I(', pred, '^2)')
formula <- as.formula(formula)
fit <- lm(formula, data = dat.sample)
# Derive the max of the value of concentration
max <- -fit$coefficients[2]/(2*fit$coefficients[3])
return(max)
}
N <- 5000
set.seed(1234) # Make the bootstrap results reproducible
results <- boot(dat, bootFun, R = N, pred = 'concentration', resp = 'Total_lignin')
results
#
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = dat, statistic = bootFun, R = N, pred = "concentration",
# resp = "Total_lignin")
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* -0.4629808 -0.0004433889 0.03014259
#
results$t0 # this is the statistic, not bootstrapped
#concentration
# -0.4629808
mean(results$t) # bootstrap value
#[1] -0.4633233
Note that to fit a polynomial, function poly is much simpler than to explicitly write down the polynomial terms one by one.
formula <- paste0(resp, '~ poly(', pred, ',2, raw = TRUE)')
Check the distribution of the bootstrapped statistic.
op <- par(mfrow = c(1, 2))
hist(results$t)
qqnorm(results$t)
qqline(results$t)
par(op)
Test data
set.seed(2020) # Make the results reproducible
x <- cumsum(rnorm(100))
y <- x + x^2 + rnorm(100)
dat <- data.frame(concentration = x, Total_lignin = y)

Predict segmented lm outside of package

I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
o<-## S3 method for class 'lm':
segmented(out.lm,seg.Z=~x,psi=list(x=c(30,60)),
control=seg.control(display=FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using the predict.segment version to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base r predict() function (not the segmented.predict() function) with the coefficients and break points saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). Through some reverse engineering of the segmented.predict() function, I produced the design matrix and use that to predict values instead of directly using the predict() function. I do not consider this a full answer of the original question yet because predict() can also produce confidence intervals for the prediction, and I have not yet implemented that--question still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# obj: the segmented lm object
# nameU: names (class character) of 3rd and 4th coef, which are "U1.x" "U2.x" for lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
# nameV: names (class character) of 5th and 6th coef, which are "psi1.x" "psi2.x" for lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for lm with two breaks. Example: c(o$coef[3], o$coef[4])
# est.psi: the estimated break points (class numeric); these are the estimated breakpoints from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer

R: Creating a loop for two models?

I currently have following code with two functions that calculate the model fit for two distinct models. The difference is in the lm function, where + log(v2) has been added in model 2.
R code
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100), v2 = runif(30, 1, 100))
p0 <- 1 # number of parameters in lm()
p1 <- 2 # number of parameters in lm()
n <- nrow(dat) - 1
## Model 1 Loop
model1 <- function(x) {
fit <- lm(log(clicks) ~ log(v1), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ])
c(summary(fit)$r.squared)
}
## Model 1 Regression
result_m1 <- t(sapply(p0:n, model1))
data.frame(result_m1)
## Model 2 Loop
model2 <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v2), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ])
c(summary(fit)$r.squared)
}
## Model 2 Regression
result_m2 <- t(sapply(p1:n, model2))
data.frame(result_m2)
Question: Can I somehow create a function that implements a loop for the different models only, instead of repeating the calculation for every model?
I have something like this in mind but weren't able to implement it .http://www.ats.ucla.edu/stat/r/pages/looping_strings.htm
I don't see a point in recreating a function that can be easily done with model-selection functions in available packages.
library(leaps)
library(dplyr)
b <- regsubsets(clicks ~ ., data=dat, nbest=10, nvmax=2) # carries out exhaustive model selection (10 best models; 2 maximum predictors)
coef(b, 1:3) # returns coefficient for the 3 models in this case
[[1]]
(Intercept) v1
60.8067570 -0.2665699
[[2]]
(Intercept) v2
49.96974177 -0.05227489
[[3]]
(Intercept) v1 v2
62.02323816 -0.26422966 -0.02676747
summary(b)$rsq #provide r.squared value for 3 models
[1] 0.067952759 0.002366681 0.068568059
To run prediction is a tad more complicated.
all.mods <- summary(b)$which[,-1] # gives logic output of predictors combination
all.mods
v1 v2
1 TRUE FALSE
1 FALSE TRUE
2 TRUE TRUE
RHS <- lapply(seq(nrow(all.mods)), function(m) summary(b)$which[m,-1] %>% which %>% names %>% paste(., collapse="+"))
RHS
[[1]]
[1] "v1"
[[2]]
[1] "v2"
[[3]]
[1] "v1+v2"
lm.form <- lapply(RHS, function(m)parse(text=paste("lm(clicks ~", m, ", data=dat)")))
lm.mods <- lapply(lm.form, eval) # return list of all lm.mods generated
The list of lm.mods can subsequently be used for predict with new.data.

Resources