How can I use `I()` with `paste0`? - r

I have multiple dataframes and I would like to evaluate (multiple) different models on each. MWE
df1 <- data.frame(A3 = c(-5, 5, 1),
B3 = c(0, 10, 1))
df2 <- data.frame(A4 = c(5, 15, 1),
B4 = c(10, 20, 1))
myfun <- function(arg1, arg2){ # arg1 =1 or 2
if (arg2 == 1){
eqn <- paste0("A", arg1+2) ~ paste0("B", arg1+2) + I(as.name(paste0("B", arg1+2))^2)
} else {
eqn <- paste0("A", arg1+2) ~ paste0("B", arg1+2) + I(as.name(paste0("B", arg1+2))^2) +I(as.name(paste0("B", arg1+2))^3)
}
return (lm(formula = eqn, data = eval(as.name(paste0("df", arg1)))
)
)
}
For example, if I run myfun(2, 2) I would like to get lm(A4 ~ B4 + I(B4^2) + I(B4^3), data = df2). But whatever I try, I get the following error message: Error in (paste0("B", arg1 +2))^2 : non-numeric argument to binary operator. From what I read in ?I, I imagine this is because R isolates whatever is passed into I(), so it doesn't realize I am trying to transform a variable. Is that what is going on, and is it something I can fix? Also, is there a better way to estimate multiple models quickly? All the similar questions I found used the same data.frame across models, while I have to account for the response (and predictor) variables coming from different dataframes for different models.

Maybe this is what you are looking for:
The issue is that you are doing a math operation on a string, i.e. with (paste0("B", arg1 +2))^2 you try to square a string, which is why you get the error. Instead, you can simply glue your formula together as a string and convert it to a formula via as.formula:
df1 <- data.frame(A3 = c(-5, 5, 1),
B3 = c(0, 10, 1))
df2 <- data.frame(A4 = c(5, 15, 1),
B4 = c(10, 20, 1))
myfun <- function(arg1, arg2){ # arg1 =1 or 2
if (arg2 == 1){
eqn <- paste0("A", arg1+2, " ~ B", arg1+2," + I(B", arg1+2, "^2)")
} else {
eqn <- paste0("A", arg1+2, " ~ B", arg1+2," + I(B", arg1+2, "^2) + I(B", arg1+2, "^3)")
}
return (lm(formula = as.formula(eqn), data = eval(as.name(paste0("df", arg1)))
)
)
}
myfun(2, 1)
#>
#> Call:
#> lm(formula = as.formula(eqn), data = eval(as.name(paste0("df",
#> arg1))))
#>
#> Coefficients:
#> (Intercept) B4 I(B4^2)
#> 0.84795 0.12281 0.02924

An option is also to construct the formula with glue
myfun <- function(arg1, arg2){
eqn <- switch(arg2,
`1` = glue::glue("A{arg1 + 2}~ B{arg1+2} + I(B{arg1+2}^2)"),
glue::glue("A{arg1 + 2}~ B{arg1+2}",
"+ I(B{arg1+2}^2) + I(B{arg1+2}^3)")
)
model <- lm(eqn, data = get(paste0('df', arg1), envir = .GlobalEnv))
model$call <- as.formula(eqn)
return(model)
}
myfun(2, 1)
#Call:
#A4 ~ B4 + I(B4^2)
#Coefficients:
#(Intercept) B4 I(B4^2)
# 0.84795 0.12281 0.02924
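On the follow-up question about estimating many models quickly, one possible pattern is to keep the data frames in a list and build each formula from strings, so that no eval() or get() lookup is needed. This is only a sketch: the names dfs, build_eqn and fit_poly are hypothetical, and it assumes the columns follow the A<n>/B<n> naming of the example.

```r
df1 <- data.frame(A3 = c(-5, 5, 1), B3 = c(0, 10, 1))
df2 <- data.frame(A4 = c(5, 15, 1), B4 = c(10, 20, 1))

# Hypothetical container: index the data frames by position instead of by name
dfs <- list(df1, df2)

# Build "A<n> ~ B<n> + I(B<n>^2) + ... + I(B<n>^degree)" as a formula
build_eqn <- function(idx, degree) {
  resp <- paste0("A", idx + 2)
  pred <- paste0("B", idx + 2)
  rhs <- pred
  if (degree >= 2) rhs <- c(rhs, sprintf("I(%s^%d)", pred, 2:degree))
  as.formula(paste(resp, "~", paste(rhs, collapse = " + ")))
}

fit_poly <- function(idx, degree) lm(build_eqn(idx, degree), data = dfs[[idx]])

# One model per (data frame, degree) combination
grid <- expand.grid(idx = seq_along(dfs), degree = 1:2)
fits <- Map(fit_poly, grid$idx, grid$degree)

formula(fits[[4]])  # the model for df2 with degree 2
```

Because the formula is assembled as text, I() only ever sees a column name such as B4, never a string to be squared.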

Related

Solving for an input value of an R function

In my attached R function, I was wondering how to solve for mdes (suppose it is unknown) which is currently one of the input values IF everything else is known?
Is it also possible to solve for mdes and power (both currently input values) IF everything else is known?
foo <- function(A = 200, As = 15, B = 100,Bs = 10,iccmax = 0.15,mdes = .25,SD = 1.2,power = 80)
{
tail <- 2
alpha <- 5
inv_d <- function(mdes) {
c(mean_dif = 1, Vmax = 2/mdes^2)
}
SDr <- 1/SD
pars <- inv_d(mdes)
mean_dif <- pars[[1]]
Vmax <- pars[[2]]
zbeta <- qnorm((power/100))
zalpha <- qnorm(1-(alpha/(100*tail)))
maxvarmean_difhat <- (mean_dif / (zbeta + zalpha))**2
ntreat <- sqrt((A/As)*((1-iccmax)/iccmax))
ncont <- sqrt((B/Bs)*((1-iccmax)/iccmax))
costpertreatcluster <- A + (As*ntreat)
costperconcluster <- B + (Bs*ncont)
gtreat <- (sqrt(A*iccmax) + sqrt(As*(1-iccmax)))**2
gcon <- (sqrt(B*iccmax) + sqrt(Bs*(1-iccmax)))**2
pratio <- sqrt(gtreat/gcon)
budgetratio <- 99999
budgetratio <- ifelse( ((pratio <= SD) & (pratio >= SDr)), pratio**2, ifelse((pratio > SD), pratio*SD, pratio*SDr))
fraction <- budgetratio/(1 + budgetratio)
mmvnumer <- 99999
mmvnumer <- ifelse( ((pratio <= SD) & (pratio >= SDr)),
gcon*Vmax*(1+(pratio**2)),
ifelse((pratio > SD),
gcon*Vmax*(((pratio*SD)+1)**2/((SD**2)+1)),
gcon*Vmax*(((pratio*SDr)+1)**2/((SDr**2) + 1))) )
budget <- mmvnumer/maxvarmean_difhat
treatbudget <- fraction*budget
conbudget <- (1-fraction)*budget
ktreat <- treatbudget/costpertreatcluster
kcont <- conbudget/costperconcluster
ktreatrup <- ceiling(ktreat)
kcontrup <- ceiling(kcont)
ktreatplus <- ifelse(pmin(ktreatrup,kcontrup) < 8, ktreatrup + 3, ktreatrup + 2)
kcontplus <- ifelse(pmin(ktreatrup,kcontrup) < 8, kcontrup + 3, kcontrup + 2)
budgetplus <- (ktreatplus*costpertreatcluster) + (kcontplus*costperconcluster)
return(c(ncont = ncont, kcont = kcontplus,
ntreat = ntreat, ktreat = ktreatplus, budget = budgetplus))
}
#--------------------------------------------------------------------------------
# EXAMPLE OF USE:
foo()
ncont kcont ntreat ktreat budget
7.527727 73.000000 8.692270 62.000000 33279.051347
Define a function of one variable as
p0 = foo()
fn1 = function(x) sum((foo(mdes=x) - p0)^2)
and find a minimum that should be 0, and which corresponds to your mdes = 0.25 input!
optimize(fn1, c(0.0, 1.0))
## $minimum
## [1] 0.2497695
## $objective
## [1] 0
For two variables, this is more difficult, as the function has many local minima and is ill-defined outside certain regions. Applying optim() you will need well-chosen starting points.
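The inversion idea can be illustrated on a much smaller stand-in. The sketch below does not call foo(); it uses only the Vmax = 2/mdes^2 relation from inv_d() as a toy function (the names vmax and target are hypothetical). It also shows uniroot() as an alternative technique, which is not part of the answer above and only applies when the output is a continuous, monotone scalar. foo() itself rounds with ceiling(), so root finding may not carry over directly.

```r
# Toy stand-in: a smooth, monotone function of mdes (taken from inv_d() in foo)
vmax <- function(mdes) 2 / mdes^2

# Pretend only the output is known
target <- vmax(0.25)

# Least-squares inversion, the same idea as fn1 in the answer
by_optimize <- optimize(function(x) (vmax(x) - target)^2, c(0.01, 1))

# Alternative: solve vmax(x) - target = 0 directly (sign change on the interval)
by_uniroot <- uniroot(function(x) vmax(x) - target, c(0.01, 1), tol = 1e-9)

by_optimize$minimum
by_uniroot$root
```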

How to pass the response variable to lm which is inside an expression within my own function

I am trying to pass the response variable tv as a function argument into lm within an expression. I hope the code below makes it clearer what I am trying to achieve.
Preferably, I would like to do that using tidy evaluation.
Furthermore, I tried to replace expression from base R with tidyeval terminology, but I did not succeed.
library(tidyverse)
library(mice)
data <- boys[boys$age >= 8, -4]
imp <- mice(data, seed = 28382, m = 10, print = FALSE)
choose_vars <- function(predictor_vars) {
predictors <- predictor_vars %>%
str_c(collapse = " + ") %>%
str_c("~", .) %>%
rlang::parse_expr(.)
scope <- list(upper = predictors, lower = ~1)
my_expression <- expression(
f1 <- lm(tv ~ 1),
f2 <- step(f1, scope = scope))
fit <- with(imp, my_expression)
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
}
my_vars <- c("age", "hgt", "wgt", "hc", "gen", "phb", "reg")
choose_vars(predictor_vars = my_vars)
I would like to be able to pass tv via my own function.
choose_vars(predictor_vars = my_vars, response_var = tv)
The original code derives from Stef van Buuren's book Flexible Imputation of Missing Data.
data <- boys[boys$age >= 8, -4]
imp <- mice(data, seed = 28382, m = 10, print = FALSE)
scope <- list(upper = ~ age + hgt + wgt + hc + gen + phb + reg,
lower = ~1)
expr <- expression(f1 <- lm(tv ~ 1),
f2 <- step(f1, scope = scope))
fit <- with(imp, expr)
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
Not exactly what I wanted but I found a way to pass the response variable into the function. The result is the same as in the example from the book.
library(tidyverse)
library(mice)
data <- boys[boys$age >= 8, -4]
imp <- mice(data, seed = 28382, m = 10, print = FALSE)
My code
choose_vars <- function(imp_data, predictor_vars, response_var) {
predictors <- predictor_vars %>%
str_c(collapse = " + ") %>%
str_c("~", .) %>%
rlang::parse_expr(.)
scope <- list(upper = predictors, lower = ~1)
form <- str_c(response_var, " ~ 1")
fit <- imp_data %>%
mice::complete("all") %>%
lapply(function(x) { step(lm(formula = as.formula(form), data = x), scope = scope) } )
formulas <- lapply(fit, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
}
my_vars <- c("age", "hgt", "wgt", "hc", "gen", "phb", "reg")
my_table <- choose_vars(imp_data = imp, predictor_vars = my_vars, response_var = "tv")
Book example
scope <- list(upper = ~ age + hgt + wgt + hc + gen + phb + reg,
lower = ~1)
expr <- expression(f1 <- lm(tv ~ 1),
f2 <- step(f1, scope = scope))
fit <- with(imp, expr)
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
stefs_table <- table(votes)
Compare results
identical(my_table, stefs_table)
[1] TRUE
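As a possible base R alternative to pasting strings and parsing them, stats::reformulate() builds formulas directly from character vectors. The sketch below only constructs the formulas and the scope list; it does not run mice or step(), and the variable names upper and start are hypothetical.

```r
my_vars <- c("age", "hgt", "wgt", "hc", "gen", "phb", "reg")
response_var <- "tv"

# One-sided formula for the upper scope: ~age + hgt + ...
upper <- reformulate(my_vars)

# Intercept-only starting model: tv ~ 1
start <- reformulate("1", response = response_var)

scope <- list(upper = upper, lower = ~1)
```

Inside choose_vars(), start could take the place of as.formula(form), assuming response_var is passed as a string as in the answer above.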

Why does R's update function change the glmer fit when called with one argument?

Below is a simplified code reproducing what looks to me like a problem with R update function:
library('lme4')
f <- function(formula) {
data <- data.frame(a = c(4, 5), rowi = c(1, 2), b = c(2, 2))
fit0 <- glmer(formula, data = data, family = poisson(log))
fit1 <- update(fit0)
cat('f likelihoods: ', logLik(fit0), logLik(fit1), '\n')
}
g <- function() {
f(a ~ -1 + (1|rowi) + offset(b))
data <- data.frame(a = c(4, 5), rowi = c(1, 2), b = c(20, 40))
f(a ~ -1 + (1|rowi) + offset(b))
cat('g likelihood: ', logLik(glmer(a ~ -1 + (1|rowi) + offset(b),
data = data, family = poisson(log))), '\n')
}
g()
data <- data.frame(a = c(4, 5), rowi = c(1, 2), b = c(50, 80))
g()
cat('global likelihood: ', logLik(glmer(a ~ -1 + (1|rowi) + offset(b),
data = data, family = poisson(log))), '\n')
This code outputs:
f likelihoods: -4.712647 -4.712647
f likelihoods: -4.712647 -12.6914
g likelihood: -12.6914
f likelihoods: -4.712647 -14.22997
f likelihoods: -4.712647 -12.6914
g likelihood: -12.6914
global likelihood: -14.22997
The surprising (to me) thing is that update(fit0) operation changes the model when data is defined in the environment of the formula. Why is that? How to use update properly to avoid pitfalls like this?
I've run into this, too. The short answer is that update.merMod(model) employs environment(formula(model)) to determine which environment to use when refitting the model (if that fails, then it'll try the enclosing environment, and so on). The result is that update() refits the model using the environment that the formula was created in, not the environment that the original merMod object was created in. This is consistent with the behavior of the example you cooked up.
My clumsy way around this issue would be to pass the formulas around as strings, and be sure to convert to formulas inside the same function body where the model is originally fit; e.g.
f <- function(formula_string) {
formula <- as.formula(formula_string)
data <- data.frame(a = c(4, 5), rowi = c(1, 2), b = c(2, 2))
fit0 <- glmer(formula, data = data, family = poisson(log))
fit1 <- update(fit0)
cat('f likelihoods: ', logLik(fit0), logLik(fit1), '\n')
}
g <- function() {
f("a ~ -1 + (1|rowi) + offset(b)")
data <- data.frame(a = c(4, 5), rowi = c(1, 2), b = c(20, 40))
f("a ~ -1 + (1|rowi) + offset(b)")
cat('g likelihood: ', logLik(glmer(a ~ -1 + (1|rowi) + offset(b),
data = data, family = poisson(log))), '\n')
}
I'm not sure if the current behavior is desirable for some reason (that's a question for #benbolker and the other lme4 developers), or what a lower-level fix might look like ... aside from either explicitly setting/saving the environment of the merMod object at creation, or a recursive frame search that uses identical() (ala where() in pryr). There are probably good arguments against these.
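The environment capture behind this can be seen without lme4. The following base R sketch (all function names are hypothetical) shows that a formula written with ~ remembers the frame where it was created, while as.formula() applied to a string attaches the frame of the function that calls it. That is why the string workaround makes update() see the intended data.

```r
# A formula literal keeps the environment in which it was written
inspect_env <- function(fml) {
  data <- "local to inspect_env"
  get("data", envir = environment(fml))
}
caller <- function() {
  data <- "local to caller"
  inspect_env(a ~ b)   # the formula is created in caller()'s frame
}
from_formula <- caller()

# as.formula() on a string uses the frame of the function calling it
inspect_env_string <- function(fml_string) {
  data <- "local to inspect_env_string"
  fml <- as.formula(fml_string)
  get("data", envir = environment(fml))
}
from_string <- inspect_env_string("a ~ b")

from_formula
from_string
```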

R: using bootstrap prediction on mixed model

library(nlme)
library(bootstrap)
y = Loblolly$height
x = Loblolly
theta.fit = function(x, y){
nlme(height ~ SSasymp(age, Asym, R0, lrc),
data = x,
fixed = Asym + R0 + lrc ~ 1,
random = Asym ~ 1,
start = c(Asym = 103, R0 = -8.5, lrc = -3.3))
}
theta.predict = function(fit, x){
(fit$fitted)[,1]
}
sq.err <- function(y,yhat) { (y-yhat)^2}
results <- bootpred(x,y,20,theta.fit,theta.predict,
err.meas=sq.err)
I am using the bootpred function to obtain estimates of prediction error. However, when I run the last line, I get the following error:
Error in model.frame.default(formula = ~height + age, data = c(" 4.51", :
'data' must be a data.frame, not a matrix or an array
I then tried x = data.frame(x) but that did not solve my problem.
The problem comes about because the example dataset used is a groupedData:
library(nlme)
library(bootstrap)
y = Loblolly$height
x = Loblolly
class(x)
[1] "nfnGroupedData" "nfGroupedData" "groupedData" "data.frame"
And inside the bootpred function, it is converted into a matrix again. It can be quite a mess converting back and forth, especially when you need the factor column for linear mixed models.
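A small sketch of why the matrix conversion hurts, using a made-up two-row data frame rather than Loblolly: as soon as one column is a factor, as.matrix() returns a character matrix, which appears consistent with the c(" 4.51", ...) shown in the error message.

```r
# Hypothetical miniature of a data frame with a numeric and a factor column
df <- data.frame(height = c(4.51, 10.89), Seed = factor(c("301", "303")))

m <- as.matrix(df)
is.character(m)          # the numeric column has been turned into text

# Converting back does not restore the numeric type
back <- as.data.frame(m)
is.numeric(back$height)
```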
What you can do is write theta.fit and theta.predict to take in a data.frame:
theta.fit = function(df){
nlme(height ~ SSasymp(age, Asym, R0, lrc),
data = df,
fixed = Asym + R0 + lrc ~ 1,
random = Asym ~ 1,
start = c(Asym = 103, R0 = -8.5, lrc = -3.3))
}
theta.predict = function(fit, df){
predict(fit,df)
}
sq.err <- function(y,yhat) { (y-yhat)^2}
Now alter the bootpred function to use df. You can either provide y again or specify the column to use in the data.frame:
bootpred_df = function (df,y,nboot, theta.fit, theta.predict, err.meas, ...)
{
call <- match.call()
n <- length(y)
saveii <- NULL
fit0 <- theta.fit(df, ...)
yhat0 <- theta.predict(fit0, df)
app.err <- mean(err.meas(y, yhat0))
err1 <- matrix(0, nrow = nboot, ncol = n)
err2 <- rep(0, nboot)
for (b in 1:nboot) {
ii <- sample(1:n, replace = TRUE)
saveii <- cbind(saveii, ii)
fit <- theta.fit(df[ii, ], ...)
yhat1 <- theta.predict(fit, df[ii, ])
yhat2 <- theta.predict(fit, df)
err1[b, ] <- err.meas(y, yhat2)
err2[b] <- mean(err.meas(y[ii], yhat1))
}
optim <- mean(apply(err1, 1, mean,na.rm=TRUE) - err2)
junk <- function(x, i) {
sum(x == i)
}
e0 <- 0
for (i in 1:n) {
o <- apply(saveii, 2, junk, i)
if (sum(o == 0) == 0)
cat("increase nboot for computation of the .632 estimator",
fill = TRUE)
e0 <- e0 + (1/n) * sum(err1[o == 0, i])/sum(o == 0)
}
err.632 <- 0.368 * app.err + 0.632 * e0
return(list(app.err, optim, err.632, call = call))
}
We can run it now, but because of the nature of this data, some resamples will have an uneven distribution of the group (Seed), making some of the parameters hard to estimate. This problem is probably better addressed by refining the code. In any case, if you are lucky it works as below:
bootpred_df(Loblolly,Loblolly$height,20,theta.fit,theta.predict,err.meas=sq.err)
[[1]]
[1] 0.4337236
[[2]]
[1] 0.1777644
[[3]]
[1] 0.6532417
$call
bootpred_df(df = Loblolly, y = Loblolly$height, nboot = 20, theta.fit = theta.fit,
theta.predict = theta.predict, err.meas = sq.err)

Running existing function with non-default option

The code pasted below from ResourceSelection::hoslem.test performs a Hosmer and Lemeshow goodness of fit test. While investigating why its output does not agree exactly with that of another software package (Stata), I found that the difference relates to the default argument of R's quantile function (type=7). I would like to use this function with a different quantile type (type=6).
FWIW, the reference to the 9 possible methods used by R can be found at:
https://www.amherst.edu/media/view/129116/original/Sample+Quantiles.pdf
The Stata manual for pctile refers to a default method and an 'altdef' method. I found it difficult to map these two methods to corresponding R types.
However,
hoslem.test(y, yhat, type=6)
Produces:
> hl <- hoslem.test(y, yhat, type=6)
Error in hoslem.test(y, yhat, type = 6) : unused argument (type = 6)
Is there a way to run the function below with a non-default argument for the quantile function?
I.e., is there a way to change the following line by adding ', type=6':
qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g), type=6))
The function in question is:
> ResourceSelection::hoslem.test
function (x, y, g = 10)
{
DNAME <- paste(deparse(substitute(x)), deparse(substitute(y)),
sep = ", ")
METHOD <- "Hosmer and Lemeshow goodness of fit (GOF) test"
yhat <- y
y <- x
qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g)))
cutyhat <- cut(yhat, breaks = qq, include.lowest = TRUE)
observed <- xtabs(cbind(y0 = 1 - y, y1 = y) ~ cutyhat)
expected <- xtabs(cbind(yhat0 = 1 - yhat, yhat1 = yhat) ~
cutyhat)
chisq <- sum((observed - expected)^2/expected)
PVAL = 1 - pchisq(chisq, g - 2)
PARAMETER <- g - 2
names(chisq) <- "X-squared"
names(PARAMETER) <- "df"
structure(list(statistic = chisq, parameter = PARAMETER,
p.value = PVAL, method = METHOD, data.name = DNAME, observed = observed,
expected = expected), class = "htest")
}
We can modify pieces of functions. Look at the body of the function
as.list(body(hoslem.test))
See that the element we want to modify is the 6th element in the body
[[1]]
`{`
[[2]]
DNAME <- paste(deparse(substitute(x)), deparse(substitute(y)),
sep = ", ")
[[3]]
METHOD <- "Hosmer and Lemeshow goodness of fit (GOF) test"
[[4]]
yhat <- y
[[5]]
y <- x
[[6]]
qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g)))
Modify the 6th element to what you want
body(hoslem.test)[[6]] = substitute(qq <- unique(quantile(yhat,
probs = seq(0, 1, 1/g), type = 6)))
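The same body-patching technique can be tried on a toy function without installing ResourceSelection. In this sketch, summarise_q and patched are hypothetical names; the copy is patched so the original stays untouched.

```r
# Toy function with a quantile() call using the default type (7)
summarise_q <- function(v, g = 4) {
  qq <- quantile(v, probs = seq(0, 1, 1/g))
  unname(qq)
}

# Element 1 of the body is `{`, element 2 is the quantile() assignment
as.list(body(summarise_q))

# Patch a copy so that the second element uses type = 6
patched <- summarise_q
body(patched)[[2]] <- quote(qq <- quantile(v, probs = seq(0, 1, 1/g), type = 6))

v <- c(1, 2, 4, 8, 16)
summarise_q(v)
patched(v)
```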
The easiest way would be to reenter the function as your own:
myhoslem.test<-function(x, y, g = 10, mytype = 6){
DNAME <- paste(deparse(substitute(x)), deparse(substitute(y)),
sep = ", ")
METHOD <- "Hosmer and Lemeshow goodness of fit (GOF) test"
yhat <- y
y <- x
qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g), type = mytype))
cutyhat <- cut(yhat, breaks = qq, include.lowest = TRUE)
observed <- xtabs(cbind(y0 = 1 - y, y1 = y) ~ cutyhat)
expected <- xtabs(cbind(yhat0 = 1 - yhat, yhat1 = yhat) ~
cutyhat)
chisq <- sum((observed - expected)^2/expected)
PVAL = 1 - pchisq(chisq, g - 2)
PARAMETER <- g - 2
names(chisq) <- "X-squared"
names(PARAMETER) <- "df"
structure(list(statistic = chisq, parameter = PARAMETER,
p.value = PVAL, method = METHOD, data.name = DNAME, observed = observed,
expected = expected), class = "htest")
}
The key change here is:
qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g), type = mytype))
and allowing mytype as an argument to the function, with a default of 6.
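Combining the two answers gives a wrapper function that flexibly modifies hoslem.test:
myhoslem.test<-function(x, y, g = 10, mytype = 6){
# bquote() inlines the value of mytype; the assignment creates a patched local copy
body(hoslem.test)[[6]] = bquote(qq <- unique(quantile(yhat,
probs = seq(0, 1, 1/g), type = .(mytype))))
hoslem.test(x, y, g = g) # pass g through instead of hard-coding 10
}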
