Caret: customizing feature selection using matrix-wise operations

Short question: is it possible to use matrix-wise operations in the caretSBF$score function?
Motivation: When working with big matrices in R, operations that work natively matrix-wise [e.g. rowMeans(X)] are often much faster than one-row-at-a-time approaches [e.g. apply(X, 1, mean)]. Here is a benchmark example, using a matrix with a million rows and 100 columns:
rows = 1000000
cols = 100
X <- matrix(rnorm(rows*cols),nrow = rows)
ptm <- proc.time()
tt <- apply(X, 1, function(x) { t.test(x[1:50],x[51:100], var.equal = FALSE)$p.value })
proc.time() - ptm
# user system elapsed
# 312.420 0.685 313.633
library(genefilter)
ptm <- proc.time()
ftt <- rowFtests(X, fac = factor(c(rep(0,50), rep(1,50))), var.equal=FALSE)
proc.time() - ptm
# user system elapsed
# 21.400 1.336 23.257
Details: In the caret package, the caretSBF functions score and filter can be used to select features for cross-validated modeling. I want to use a custom scoring function in place of caretSBF$score (this part I can do), but I want it to be matrix-wise (like above -- this part I can't do). When I first looked at the functions, I couldn't see an obvious reason why this wouldn't work. I want to do something like this:
mySBF$score <- function(x, y) {
  genefilter::rowFtests(x, fac = y)$p.value
}
In place of the default:
$score
function (x, y)
{
    if (is.factor(y))
        anovaScores(x, y)
    else gamScores(x, y)
}
<environment: namespace:caret>
But I can't make it work. Are matrix-wise operations just not supported by caretSBF?

Are matrix-wise operations just not supported by caretSBF?
No, not really. The score function is only served one predictor at a time.
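So within sbf, a custom score has to stay column-wise. A minimal sketch of what a supported scorer looks like (the 0.05 cutoff is just an illustrative threshold):
mySBF <- caret::caretSBF
mySBF$score <- function(x, y) caret::anovaScores(x, y)  # x is a single predictor column here
mySBF$filter <- function(score, x, y) score <= 0.05
ctrl <- caret::sbfControl(functions = mySBF, method = "cv")
## then e.g. sbf(x, y, sbfControl = ctrl)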
However, you can get there using custom models in train, by conducting the feature selection inside the model-fitting function, prior to modeling proper. You can adapt this with a multivariate filter and use the resulting subset to fit the model. Here is a really crappy example:
> library(caret)
> set.seed(1)
> training <- LPH07_1(200)
>
> crappy <- getModelInfo("lm", regex = FALSE)[[1]]
> crappy$fit <- function (x, y, wts, param, lev, last, classProbs, ...) {
+ dat <- if (is.data.frame(x)) x else as.data.frame(x)
+ ## randomly filter all but 3 predictors
+ dat <- dat[, sample(1:ncol(dat), 3)]
+ dat$.outcome <- y
+ lm(.outcome ~ ., data = dat, ...)
+ }
> crappy$predict <- function (modelFit, newdata, submodels = NULL) {
+ if (!is.data.frame(newdata))
+ newdata <- as.data.frame(newdata)
+ ## make sure to apply the subsetting part here too
+ predict(modelFit, newdata[, predictors(modelFit$terms)])
+ }
>
>
> mod <- train(y ~ ., data = training,
+ method = crappy)
> mod
Linear Regression
200 samples
10 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
3.08 0.077 0.258 0.0864
> predictors(mod)
[1] "Var08" "Var03" "Var04"
Max

Related

How do I calculate Cronbach's alpha on multiply imputed data?

I have run a multiple imputation (m = 45, 10 iterations) using the mice package, and want to calculate Cronbach's alpha for a number of ordinal scales in the data. Is there a function in R that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
  x <- replicate(ncol(data), rbinom(nrow(data), 1, prob))
  for(k in 1:ncol(data)) {
    data[, k] <- ifelse(x[, k] == 1, NA, data[, k])
  }
  data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function to compute Cronbach's alpha and to obtain an estimate of the standard error of alpha, which can be used in a call to pool.scalar() later on. Since there is no readily available closed-form formula for the standard error of alpha, we also deploy a bootstrapping procedure to estimate it.
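For reference, the quantity estimated inside the function below is the standard Cronbach's alpha for p items,
alpha = (p/(p - 1)) * (1 - sum(item variances) / variance(total score)),
where the total score is the row sum over the p items; this is exactly what the total_variance, item_variance and alpha lines in the code compute.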
The function cronbach_fun() takes the following arguments:
list_compl_data: a data frame containing the completed data for one imputation (taken from a mids object).
boot: a logical indicating whether a non-parametric bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
  n <- nrow(list_compl_data); p <- ncol(list_compl_data)
  total_variance <- var(rowSums(list_compl_data))
  item_variance <- sum(apply(list_compl_data, 2, sd)^2)
  alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
  out <- list(alpha = alpha)
  boot_alpha <- numeric(B)
  if (boot) {
    for (i in seq_len(B)) {
      boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
      total_variance <- var(rowSums(boot_dat))
      item_variance <- sum(apply(boot_dat, 2, sd)^2)
      boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
    }
    out$var <- var(boot_alpha)
  }
  if (ci) {
    out$ci <- quantile(boot_alpha, c(.025, .975))
  }
  return(out)
}
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
  set.seed(i) # fix random number generator
  sub <- implong[implong$.imp == i, -c(1, 2)]
  boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
  out <- c(
    alpha = x$qbar,
    lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
    upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
  )
  return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913

How to use the replicate function in R to repeat the function

I have a problem when using replicate to repeat a function.
I tried to use the bootstrap to fit a quadratic model using concentration as the predictor and Total_lignin as the response, and I am going to report an estimate of the maximum with a corresponding standard error.
My idea is to create a function called bootFun that essentially does everything within one iteration of a for loop. bootFun takes in only the data set, the predictor, and the response to use (both variable names in quotes).
However, the SD is 0, which is not correct. I do not know where the mistake is. Could you please help me with it?
# Load the libraries
library(dplyr)
library(tidyverse)
# Read the .csv and only use M.giganteus and S.ravennae.
dat <- read_csv('concentration.csv') %>%
filter(variety == 'M.giganteus' | variety == 'S.ravennae') %>%
arrange(variety)
# Check the data
head(dat)
# sample size
n <- nrow(dat)
# A function to do one iteration
bootFun <- function(dat, pred, resp){
  # Draw the sample size from the dataset
  sample <- sample_n(dat, n, replace = TRUE)
  # A quadratic model fit
  formula <- paste0('resp', '~', 'pred', '+', 'I(pred^2)')
  fit <- lm(formula, data = sample)
  # Derive the max of the value of concentration
  max <- -fit$coefficients[2]/(2*fit$coefficients[3])
  return(max)
}
max <- bootFun(dat = dat, pred = 'concentration', resp = 'Total_lignin' )
# Iterated times
N <- 5000
# Use 'replicate' function to do a loop
maxs <- replicate(N, max)
# An estimate of the max of predictor and corresponding SE
mean(maxs)
sd(maxs)
The reason the SD is 0 is that replicate(N, max) just evaluates the object max N times: it repeats the value already stored in max rather than re-running bootFun, so every element of maxs is identical. (Note also that the formula in bootFun is built from the literal strings 'resp' and 'pred' instead of the supplied variable names; that is fixed below.) Package boot, which ships with R, and its function boot can ease the job of calling the bootstrap function repeatedly. The first argument of the statistic function must be the data set; the second argument is an indices argument that the user does not set; other arguments can also be passed to it. In this case those other arguments are the predictor and the response names.
library(boot)
bootFun <- function(dat, indices, pred, resp){
  # Take the bootstrap sample of the rows
  dat.sample <- dat[indices, ]
  # A quadratic model fit, with the formula built from the supplied names
  formula <- paste0(resp, '~', pred, '+', 'I(', pred, '^2)')
  formula <- as.formula(formula)
  fit <- lm(formula, data = dat.sample)
  # Derive the predictor value at which the fitted quadratic is maximal
  max <- -fit$coefficients[2]/(2*fit$coefficients[3])
  return(max)
}
N <- 5000
set.seed(1234) # Make the bootstrap results reproducible
results <- boot(dat, bootFun, R = N, pred = 'concentration', resp = 'Total_lignin')
results
#
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = dat, statistic = bootFun, R = N, pred = "concentration",
# resp = "Total_lignin")
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* -0.4629808 -0.0004433889 0.03014259
#
results$t0 # this is the statistic, not bootstrapped
#concentration
# -0.4629808
mean(results$t) # bootstrap value
#[1] -0.4633233
Note that to fit a polynomial, function poly is much simpler than writing down the polynomial terms one by one explicitly.
formula <- paste0(resp, '~ poly(', pred, ',2, raw = TRUE)')
Check the distribution of the bootstrapped statistic.
op <- par(mfrow = c(1, 2))
hist(results$t)
qqnorm(results$t)
qqline(results$t)
par(op)
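If you would rather keep the replicate() approach from the question, it also works once two things are fixed: build the formula from the values of pred and resp rather than the literal strings, and have replicate() call the function instead of repeating a stored value. A sketch (renamed bootFun2 so it does not clash with the boot version above):
bootFun2 <- function(dat, pred, resp){
  # Draw a bootstrap sample of the rows
  boot_sample <- dat[sample(nrow(dat), replace = TRUE), ]
  # A quadratic model fit, with the formula built from the supplied names
  fmla <- as.formula(paste0(resp, ' ~ ', pred, ' + I(', pred, '^2)'))
  fit <- lm(fmla, data = boot_sample)
  # Location of the maximum of the fitted quadratic
  -fit$coefficients[2]/(2*fit$coefficients[3])
}
set.seed(1234)
maxs <- replicate(N, bootFun2(dat, 'concentration', 'Total_lignin'))
mean(maxs)
sd(maxs)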
Test data
set.seed(2020) # Make the results reproducible
x <- cumsum(rnorm(100))
y <- x + x^2 + rnorm(100)
dat <- data.frame(concentration = x, Total_lignin = y)

Fast post hoc computation using R

I have a large dataset on which I would like to perform post hoc computations:
dat = as.data.frame(matrix(runif(10000*300), ncol = 10000, nrow = 300))
dat$group = rep(letters[1:3], 100)
Here is my code:
start <- Sys.time()
vars <- names(dat)[-ncol(dat)]
aov.out <- lapply(vars, function(x) {
lm(substitute(i ~ group, list(i = as.name(x))), data = dat)})
TukeyHSD.out <- lapply(aov.out, function(x) TukeyHSD(aov(x)))
Sys.time() - start
Time difference of 4.033335 mins
It takes about 4 minutes; are there more efficient and elegant ways to perform post hoc tests using R?
Thanks a lot
Your example is too big. For illustration of the idea I use a small one.
set.seed(0)
dat = as.data.frame(matrix(runif(2*300), ncol = 2, nrow = 300))
dat$group = rep(letters[1:3], 100)
Why do you call aov on a fitted "lm" model? That basically refits the same model.
Have a read on Fitting a linear model with multiple LHS first. lm is the workhorse of aov, so you can pass a multiple LHS formula to aov. The model has class c("maov", "aov", "mlm", "lm").
response_names <- names(dat)[-ncol(dat)]
form <- as.formula(sprintf("cbind(%s) ~ group", toString(response_names)))
fit <- do.call("aov", list(formula = form, data = quote(dat)))
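For the two-response toy data, form evaluates to
form
# cbind(V1, V2) ~ group
and exactly the same construction scales to the 10000 responses of the original data set.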
Now the issue is: there is no "maov" method for TukeyHSD, so we need a hack.
TukeyHSD relies on the residuals of the fitted model. In the c("aov", "lm") case the residuals are a vector, but in the c("maov", "aov", "mlm", "lm") case they form a matrix. The following demonstrates the hack.
aov_hack <- fit
aov_hack[c("coefficients", "fitted.values")] <- NULL ## don't need them
aov_hack[c("contrasts", "xlevels")] <- NULL ## don't need them either
attr(aov_hack$model, "terms") <- NULL ## don't need it
class(aov_hack) <- c("aov", "lm") ## drop "maov" and "mlm"
## the following elements are mandatory for `TukeyHSD`
## names(aov_hack)
#[1] "residuals" "effects" "rank" "assign" "qr"
#[6] "df.residual" "call" "terms" "model"
N <- length(response_names) ## number of response variables
result <- vector("list", N)
for (i in 1:N) {
  ## change response variable in the formula
  aov_hack$call[[2]][[2]] <- as.name(response_names[i])
  ## change residuals
  aov_hack$residuals <- fit$residuals[, i]
  ## change effects
  aov_hack$effects <- fit$effects[, i]
  ## change "terms" object and attribute
  old_tm <- terms(fit)  ## old "terms" object in the model
  old_tm[[2]] <- as.name(response_names[i])  ## change response name in terms
  new_tm <- terms.formula(formula(old_tm))  ## new "terms" object
  aov_hack$terms <- new_tm  ## replace `aov_hack$terms`
  ## replace data in the model frame
  aov_hack$model[1] <- data.frame(fit$model[[1]][, i])
  names(aov_hack$model)[1] <- response_names[i]
  ## run `TukeyHSD` on `aov_hack`
  result[[i]] <- TukeyHSD(aov_hack)
}
result[[1]] ## for example
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
#Fit: aov(formula = V1 ~ group, data = dat)
#
#$group
# diff lwr upr p adj
#b-a -0.012743870 -0.1043869 0.07889915 0.9425847
#c-a -0.022470482 -0.1141135 0.06917254 0.8322109
#c-b -0.009726611 -0.1013696 0.08191641 0.9661356
I have used a "for" loop. Replace it with a lapply if you want.
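For example, a sketch of the lapply version; each call works on its own local copy of the stripped-down template prepared above:
result <- lapply(seq_along(response_names), function(i) {
  aov_i <- aov_hack  # local copy of the template
  aov_i$call[[2]][[2]] <- as.name(response_names[i])
  aov_i$residuals <- fit$residuals[, i]
  aov_i$effects <- fit$effects[, i]
  old_tm <- terms(fit)
  old_tm[[2]] <- as.name(response_names[i])
  aov_i$terms <- terms.formula(formula(old_tm))
  aov_i$model[1] <- data.frame(fit$model[[1]][, i])
  names(aov_i$model)[1] <- response_names[i]
  TukeyHSD(aov_i)
})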

R: Clustered robust standard errors using miceadds lm.cluster - error with subset and weights

I am trying to use the lm.cluster function in the package miceadds to get robust clustered standard errors for a multiply imputed dataset.
I am able to get the standard version of it to run but I get the following error when I try to add a subset or weights:
Error in eval(substitute(subset), data, env) :
..1 used in an incorrect context, no ... to look in
Example that works without subset or weights:
require("mice")
require("miceadds")
data(data.ma01)
# imputation of the dataset: use six imputations
dat <- data.ma01[ , - c(1:2) ]
imp <- mice::mice( dat , maxit=3 , m=6 )
datlist <- miceadds::mids2datlist( imp )
# linear regression with cluster robust standard errors
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool )} )
# extract parameters and covariance matrix
betas <- lapply( mod , FUN = function(rr){ coef(rr) } )
vars <- lapply( mod , FUN = function(rr){ vcov(rr) } )
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
Example that breaks with subset:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool, subset=
(data.ma01$urban==1))} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
Example that breaks with weights:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool,
weights=data.ma01$studwgt)} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
From searching, I think I am encountering similar issues as others when passing these commands through an lm or glm wrapper (such as: Passing Argument to lm in R within Function or R : Pass argument to glm inside an R function or Passing the weights argument to a regression function inside an R function)
However, I am not sure how to address the issue with the imputed datasets & existing lm.cluster command.
Thanks
This works fine with the estimatr package, which is on CRAN, and its estimatr::lm_robust() function. Two notes: (1) you can change the type of standard errors using the se_type argument, and (2) I keep idschool in the data because we like the clusters to be in the same data.frame that we fit the model on.
library(mice)
library(miceadds)
library(estimatr)
# imputation of the dataset: use six imputations
data(data.ma01)
dat <- data.ma01[, -c(1)] # note I keep idschool in data
imp <- mice::mice( dat , maxit = 3, m = 6)
datlist <- miceadds::mids2datlist(imp)
# linear regression with cluster robust standard errors
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool)
  }
)
# subset
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, subset = urban == 1)
  }
)
# weights
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, weights = studwgt)
  }
)
# note that you can use the `se_type` argument of lm_robust()
# to change the vcov estimation
# extract parameters and covariance matrix
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
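For example, to get Stata-style clustered standard errors instead of the CR2 default (one of the available se_type options; see ?lm_robust):
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, se_type = "stata")
  }
)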
I'm no expert, but there is an issue with the passing of the weights to lm(). I know this is not an ideal situation, but I managed to get it to work by modifying the lm.cluster() function to hard-code the passing of the weights, and then just used my own version.
lm.cluster <- function (data, formula, cluster, wgts = NULL, ...)
{
  TAM::require_namespace_msg("multiwayvcov")
  if (is.null(wgts)) {
    mod <- stats::lm(data = data, formula = formula)
  } else {
    data$.weights <- wgts
    mod <- stats::lm(data = data, formula = formula, weights = data$.weights)
  }
  if (length(cluster) > 1) {
    v1 <- cluster
  } else {
    v1 <- data[, cluster]
  }
  dfr <- data.frame(cluster = v1)
  vcov2 <- multiwayvcov::cluster.vcov(model = mod, cluster = dfr)
  res <- list(lm_res = mod, vcov = vcov2)
  class(res) <- "lm.cluster"
  return(res)
}
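With that modification in place, the weighted call from the question becomes (a sketch, passing the weights through the new wgts argument):
mod <- lapply(datlist, FUN = function(data){
  lm.cluster(data = data, formula = read ~ paredu + female,
             cluster = data.ma01$idschool, wgts = data.ma01$studwgt)
})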

formula error inside function

I want to use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem? Thanks for your help. The following code leads to the error:
library(survival)
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
set.seed(13)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
Z2=sample(0:1,n,replace=TRUE),
Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
mean(event)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
  start.coxph <- coxph(x, phmmd)
  print(start.coxph)
  betahat <- start.coxph$coefficient
  print(betahat)
  print(333)
  print(survfit(start.coxph))
  m <- basehaz(start.coxph)
  print(m)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following code works:
fit <- coxph(fmla, phmmd)
basehaz(fit)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
meanwhile:
environment(BaseFun)
<environment: R_GlobalEnv>
That is why basehaz cannot find the variable that is local to the function.
A possible solution is to send x to the global environment using assign:
BaseFun <- function(x){
  assign('x', x, pos = .GlobalEnv)
  start.coxph <- coxph(x, phmmd)
  print(start.coxph)
  betahat <- start.coxph$coefficient
  print(betahat)
  print(333)
  print(survfit(start.coxph))
  m <- basehaz(start.coxph)
  print(m)
  rm(x)
}
BaseFun(fmla)
Other solutions may involve dealing with the environments more directly.
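One such route, in line with the model=TRUE point made in the follow-up answer below: ask coxph to keep the model frame with the fit, so that survfit() and basehaz() do not have to rebuild it from objects that only exist inside the function. A sketch:
BaseFun <- function(x){
  ## keep the model frame in the fitted object
  start.coxph <- coxph(x, phmmd, model = TRUE)
  print(survfit(start.coxph))
  basehaz(start.coxph)
}
BaseFun(fmla)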
I'm following up on @moli's comment on @aatrujillob's answer. They were helpful, so I thought I would explain how this solved things for me and a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
Trt <- rbinom(N, 1, trtprob)
outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
Outcome <- rbinom(N,1,outprob)
rm(outprob, trtprob)
})
I want to split the data into training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the following example shows the issue: when I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function. Again, train_data is only created inside the function:
badFunc <- function(data, my_formula){
  train_data <- data[1:100, ]
  ct_train <- rpart::rpart(
    data = train_data,
    formula = my_formula,
    method = "class")
  ct_party <- partykit::as.party(ct_train)
}
Trying to run this function throws an error similar to OP's.
library(rpart)
library(partykit)
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in is.data.frame(data) : object 'train_data' not found
# 10. is.data.frame(data)
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3. as.party.rpart(ct_train)
# 2. partykit::as.party(ct_train)
# 1. badFunc(data = data, my_formula = my_formula)
print(bad_out)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
  train_data <- data[1:100, ]
  ct_train <- rpart::rpart(
    data = train_data,
    ## This solved it for me
    model = TRUE,
    ##
    formula = my_formula,
    method = "class")
  ct_party <- partykit::as.party(ct_train)
}
good_out <- goodFunc(data=data, my_formula = my_formula)
print(good_out)
# Model formula:
# Trt ~ W + X
#
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
documentation for model argument in rpart():
model:
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!
