Let's say I have the following formula:
myformula<-formula("depVar ~ Var1 + Var2")
How to reliably get dependent variable name from formula object?
I failed to find any built-in function that serves this purpose.
I know that as.character(myformula)[[2]] works, as do
sub("^(\\w*)\\s~\\s.*$","\\1",deparse(myform))
It just seems to me, that these methods are more a hackery, than a reliable and standard method to do it.
Does anyone know perchance what exactly method the e.g. lm use? I've seen it's code, but it is a little to cryptic to me... here is a quote for your convenience:
> lm
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
{
ret.x <- x
ret.y <- y
cl <- match.call()
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())
if (method == "model.frame")
return(mf)
else if (method != "qr")
warning(gettextf("method = '%s' is not supported. Using 'qr'",
method), domain = NA)
mt <- attr(mf, "terms")
y <- model.response(mf, "numeric")
w <- as.vector(model.weights(mf))
if (!is.null(w) && !is.numeric(w))
stop("'weights' must be a numeric vector")
offset <- as.vector(model.offset(mf))
if (!is.null(offset)) {
if (length(offset) != NROW(y))
stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
length(offset), NROW(y)), domain = NA)
}
if (is.empty.model(mt)) {
x <- NULL
z <- list(coefficients = if (is.matrix(y)) matrix(, 0,
3) else numeric(), residuals = y, fitted.values = 0 *
y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
0) else if (is.matrix(y)) nrow(y) else length(y))
if (!is.null(offset)) {
z$fitted.values <- offset
z$residuals <- y - offset
}
}
else {
x <- model.matrix(mt, mf, contrasts)
z <- if (is.null(w))
lm.fit(x, y, offset = offset, singular.ok = singular.ok,
...)
else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
...)
}
class(z) <- c(if (is.matrix(y)) "mlm", "lm")
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt
if (model)
z$model <- mf
if (ret.x)
z$x <- x
if (ret.y)
z$y <- y
if (!qr)
z$qr <- NULL
z
}
Try using all.vars:
all.vars(myformula)[1]
I suppose you could also cook your own function to work with terms():
getResponse <- function(formula) {
tt <- terms(formula)
vars <- as.character(attr(tt, "variables"))[-1] ## [1] is the list call
response <- attr(tt, "response") # index of response var
vars[response]
}
R> myformula <- formula("depVar ~ Var1 + Var2")
R> getResponse(myformula)
[1] "depVar"
It is just as hacky as as.character(myformyula)[[2]] but you have the assurance that you get the correct variable as the ordering of the call parse tree isn't going to change any time soon.
This isn't so good with multiple dependent variables:
R> myformula <- formula("depVar1 + depVar2 ~ Var1 + Var2")
R> getResponse(myformula)
[1] "depVar1 + depVar2"
as they'll need further processing.
I found an useful package 'formula.tools' which is suitable for your task.
code Example:
f <- as.formula(a1 + a2~a3 + a4)
lhs.vars(f) #get dependent variables
[1] "a1" "a2"
rhs.vars(f) #get independent variables
[1] "a3" "a4"
Based on your edit to get the actual response, not just its name, we can use the nonstandard evaluation idiom employed by lm() and most other modelling functions with a formula interface in base R
form <- formula("depVar ~ Var1 + Var2")
dat <- data.frame(depVar = rnorm(10), Var1 = rnorm(10), Var2 = rnorm(10))
getResponse <- function(form, data) {
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())
y <- model.response(mf, "numeric")
y
}
> getResponse(form, dat)
1 2 3 4 5
-0.02828573 -0.41157817 2.45489291 1.39035938 -0.31267835
6 7 8 9 10
-0.39945771 -0.09141438 0.81826105 0.37448482 -0.55732976
As you see, this gets the actual response variable data from the supplied data frame.
How this works is that the function first captures the function call without expanding the ... argument as that contains things not needed for the evaluation of the data for the formula.
Next, the "formula" and "data" arguments are matched with the call. The line mf[c(1L, m)] selects the function name from the call (1L) and the locations of the two matched arguments. The drop.unused.levels argument of model.frame() is set to TRUE in the next line, and then the call is updated to switch the function name in the call from lm to model.frame. All the above code does is takes the call to lm() and processes that call into a call to the model.frame() function.
This modified call is then evaluated in the parent environment of the function - which in this case is the global environment.
The last line uses the model.response() extractor function to take the response variable from the model frame.
This should always give you all dependent vars:
myformula<-formula("depVar1 + depVar2 ~ Var1 + Var2")
as.character(myformula[[2]])[-1]
#[1] "depVar1" "depVar2"
And I wouldn't consider this particularly "hacky".
Edit:
Something strange happens with 3 dependents:
myformula<-formula("depVar1 + depVar2 + depVar3 ~ Var1 + Var2")
as.character(myformula[[2]])
#[1] "+" "depVar1 + depVar2" "depVar3"
So this might not be as reliable as I thought.
Edit2:
Okay, myformula[[2]] is a language object and as.character seems to do something similar as languageEl.
length(myformula[[2]])
#[1] 3
languageEl(myformula[[2]],which=1)
#`+`
languageEl(myformula[[2]],which=2)
#depVar1 + depVar2
languageEl(myformula[[2]],which=3)
#depVar3
languageEl(languageEl(myformula[[2]],which=2),which=2)
#depVar1
If you check the length of each element, you could create your own extraction function. But this is probably too much of a hack.
Edit3:
Based on the answer by #seancarmody all.vars(myformula[[2]]) is the way to go.
Using all.vars is very tricky as it won't detect the response from a one-sided formula. For example
all.vars(~x+1)
[1] "x"
that is wrong.
Here is the most reliable way of getting the response:
getResponseFromFormula = function(formula) {
if (attr(terms(as.formula(formula)) , which = 'response'))
all.vars(formula)[1]
else
NULL
}
getResponseFromFormula(~x+1)
NULL
getResponseFromFormula(y~x+1)
[1] "y"
Note that you can replace all.vars(formula)[1] in the function with formula[2] if the formula contains more than one variable for the response.
I know this question is quite old, but I thought I'd add a base R answer which doesn't require indexing, doesn't depend on the order of the variables listed in a call to all.vars, and which gives the response variables as separate elements when there is more than one:
myformula <- formula("depVar1 + depVar2 ~ Var1 + Var2")
all_vars <- all.vars(myformula)
response <- all_vars[!(all_vars %in% labels(terms(myformula)))]
> response
[1] "depVar1" "depVar2"
Related
I have a global function of roughly the form:
demo_fcn <- function(f, x1,x2){
r = x1 - x2
return(f(r))
}
I want to create this function in a general way so that users can add their own f with their own custom inputs, so long as there is an input slot for r. Say we take f to be the following function
f <- function(input, factor){
out = input^factor
return(out)
}
In this case, input = r, so that the user is able to call
demo_fcn(f(factor=2),x1=2,x2=3)
I get the error
Error in f(factor = 2) : argument "input" is missing, with no default
The desired outcome here should be the following code running
r = 2-3
f(input=r, factor=2)
The end goal is to implement this in a more complicated function, with multiple arguments for both demo_fcn and f
demo_fcn <- function(f, x1,x2){
r1 = x1 - x2
r2 = x1+x2
return(f(r1,r2))
}
f <- function(input1, input2, factor1,factor2){
out = input^factor1 + input2^factor2
return(out)
}
One way is to pass a function (not a function call), and use ... in the top function to pass additional arguments.
demo_fcn <- function(f, x1, x2, ...) {
r = x1 - x2
f(r, ...)
}
f <- function(input, factor){
out = input^factor
out
}
demo_fcn(f, x1=2, x2=5, factor=2)
# [1] 9
If you want to have multiple such functions, then you can do:
demo_fcn <- function(f1, f2, x1, x2, f1opts = NULL) {
r = x1 - x2
do.call(f, c(list(r), f1opts))
}
demo_fcn(f, x1=2, x2=5, f1opts=list(factor=2))
Yet another alternative, taking from curve, which may match more closely what you're hoping for.
demo_fcn <- function(expr, x1, x2, xname = "x") {
r = x1 - x2
sexpr <- substitute(expr)
if (is.name(sexpr)) {
expr <- call(as.character(sexpr), as.name(xname))
} else {
if (!((is.call(sexpr) || is.expression(sexpr)) && xname %in%
all.vars(sexpr)))
stop(gettextf("'expr' must be a function, or a call or an expression containing '%s'",
xname), domain = NA)
expr <- sexpr
}
ll <- list(x = r)
names(ll) <- xname
eval(expr, envir = ll, enclos = parent.frame())
}
demo_fcn(f(x, factor=2), x1=2, x2=5)
# [1] 9
See ?curve for more explanation of xname=, but in short: use x in your call to f(.) though it does not use any object named x in the local or other environment, it is just a placeholder. If you prefer, you can change to xname="input" and demo_fcn(f(input,factor=2),...) for the same effect, but realize that in that call, input is still a placeholder, not a reference to an object.
library(GLMsData)
data(fluoro)
lambda <- seq(-2,2,0.5)
lm.out <- list()
for(i in length(lambda)){
if(i != 0){
y <- (fluoro$Dose^lambda-1)/lambda
} else {
y <- log(fluoro$Dose)
}
lm.out[[i]] <- lm(y[i]~Time, data = fluoro, na.exclude = T)
}
print(lm.out)
Error in model.frame.default(formula = y[i] ~ Time, data = fluoro, drop.unused.levels = TRUE) : variable lengths differ (found for 'Time')
I am trying to use various transformations of the response variable and fit these corresponding models, and obtain residual plots for each model.
I need a help. Thanks
Here is a corrected version of the for loop in the question.
data(fluoro, package = "GLMsData")
lambda <- seq(-2, 2, 0.5)
lm.out <- list()
for(i in 1:length(lambda)){
if(lambda[i] != 0){
y <- (fluoro$Dose^lambda[i]-1)/lambda[i]
} else {
y <- log(fluoro$Dose)
}
lm.out[[i]] <- lm(y ~ Time, data = fluoro, na.action = na.exclude)
}
print(lm.out)
And a version with a boxcox function defined and used in a lapply loop.
boxcox <- function(x, lambda, na.rm = FALSE){
if(na.rm) x <- x[!is.na(x)]
if(lambda == 0){
log(x)
} else {
(x^lambda - 1)/lambda
}
}
lm_out2 <- lapply(lambda, \(l){
lm(boxcox(Dose, lambda = l) ~ Time, data = fluoro, na.action = na.exclude)
})
Check that both ways above produce the same results.
coef_list <- sapply(lm.out, coef)
coef_list2 <- sapply(lm_out2, coef)
identical(coef_list, coef_list2)
#[1] TRUE
smry_list <- lapply(lm.out, summary)
smry_list2 <- lapply(lm_out2, summary)
pval_list <- sapply(smry_list, \(fit) fit$coefficients[, "Pr(>|t|)"])
pval_list2 <- sapply(smry_list2, \(fit) fit$coefficients[, "Pr(>|t|)"])
identical(pval_list, pval_list2)
#[1] TRUE
R2_list <- sapply(smry_list, "[[", "r.squared")
R2_list2 <- sapply(smry_list2, "[[", "r.squared")
identical(R2_list, R2_list2)
#[1] TRUE
I've inherited some code that uses what I think is a common R idiom for libraries, but I'm not sure what is achieved by writing in such a verbose way. Ultimately I intend to re-write but I would first like to know why before I do something stupid.
ecd <-
function(formula, data, subset, weights, offset, ...) {
cl = match.call()
mf = match.call(expand.dots = FALSE)
m =
match(c("formula", "data", "subset", "weights", "offset"),
names(mf),
0L)
mf = mf[c(1L, m)]
mf$drop.unused.levels = TRUE
mf[[1L]] = quote(stats::model.frame)
mf = eval(mf, parent.frame())
mt = attr(mf, "terms")
y = stats::model.response(mf, "numeric")
w = as.vector(stats::model.weights(mf))
offset = as.vector(stats::model.offset(mf))
x = stats::model.matrix(mt, mf, contrasts)
z = ecd.fit(x, y, w, offset, ...)
My current understanding is that it constructs a function call object (type?) from the original arguments to the function and then manually calls it, rather than just calling stats::model.frame directly. Any insights would be appreciated.
I think this should answer everything, explanations are in the code :
# for later
FOO <- function(x) 1000 * x
y <- 1
foo <- function(...) {
cl = match.call()
message("cl")
print(cl)
message("as.list(cl)")
print(as.list(cl))
message("class(cl)")
print(class(cl))
# we can modify the call is if it were a list
cl[[1]] <- quote(FOO)
message("modified call")
print(cl)
y <- 2
# now I want to call it, if I call it here or in the parent.frame might
# give a different output
message("evaluate it locally")
print(eval(cl))
message("evaluate it in the parent environment")
print(eval(cl, parent.frame()))
message("eval.parent is equivalent and more idiomatic")
print(eval.parent(cl))
invisible(NULL)
}
foo(y)
# cl
# foo(y)
# as.list(cl)
# [[1]]
# foo
#
# [[2]]
# y
#
# class(cl)
# [1] "call"
# modified call
# FOO(y)
# evaluate it locally
# [1] 2000
# evaluate it in the parent environment
# [1] 1000
# eval.parent is equivalent and more idiomatic
# [1] 1000
I would like to perform Deming regression (or any equivalent of a regression method with uncertainties in both X and Y variables, such as York regression).
In my application, I have a very good scientific justification to deliberately set the intercept as zero. However, I can't find way of setting it to zero, either in R package deming, it gives an error when I use -1 in the formula:
df=data.frame(x=rnorm(10), y=rnorm(10), sx=runif(10), sy=runif(10))
library(deming)
deming(y~x-1, df, xstd=sy, ystd=sy)
Error in lm.wfit(x, y, wt/ystd^2) : 'x' must be a matrix
In other packages (like mcr::mcreg or IsoplotR::york or MethComp::Deming), the input are two vectors x and y, so there is no way I can input a model matrix or modify the formula.
Do you have any idea on how to achieve that? Thanks.
There is a bug in the function when you remove the intercept. I'm going to report it.
It is easy to fix, you just have to change 2 lines in the original function.
The print does not work correctly, but it is possible to interpret the output.
deming.aux <-
function (formula, data, subset, weights, na.action, cv = FALSE,
xstd, ystd, stdpat, conf = 0.95, jackknife = TRUE, dfbeta = FALSE,
x = FALSE, y = FALSE, model = TRUE)
{
deming.fit1 <- getAnywhere(deming.fit1)[[2]][[1]]
deming.fit2 <- getAnywhere(deming.fit2)[[2]][[1]]
Call <- match.call()
indx <- match(c("formula", "data", "weights", "subset", "na.action", "xstd", "ystd"), names(Call), nomatch = 0)
if (indx[1] == 0)
stop("A formula argument is required")
temp <- Call[c(1, indx)]
temp[[1]] <- as.name("model.frame")
mf <- eval(temp, parent.frame())
Terms <- terms(mf)
n <- nrow(mf)
if (n < 3)
stop("less than 3 non-missing observations in the data set")
xstd <- model.extract(mf, "xstd")
ystd <- model.extract(mf, "ystd")
Y <- as.matrix(model.response(mf, type = "numeric"))
if (is.null(Y))
stop("a response variable is required")
wt <- model.weights(mf)
if (length(wt) == 0)
wt <- rep(1, n)
usepattern <- FALSE
if (is.null(xstd)) {
if (!is.null(ystd))
stop("both of xstd and ystd must be given, or neither")
if (missing(stdpat)) {
if (cv)
stdpat <- c(0, 1, 0, 1)
else stdpat <- c(1, 0, 1, 0)
}
else {
if (any(stdpat < 0) || all(stdpat[1:2] == 0) || all(stdpat[3:4] ==
0))
stop("invalid stdpat argument")
}
if (stdpat[2] > 0 || stdpat[4] > 0)
usepattern <- TRUE
else {
xstd <- rep(stdpat[1], n)
ystd <- rep(stdpat[3], n)
}
}
else {
if (is.null(ystd))
stop("both of xstd and ystd must be given, or neither")
if (!is.numeric(xstd))
stop("xstd must be numeric")
if (!is.numeric(ystd))
stop("ystd must be numeric")
if (any(xstd <= 0))
stop("xstd must be positive")
if (any(ystd <= 0))
stop("ystd must be positive")
}
if (conf < 0 || conf >= 1)
stop("invalid confidence level")
if (!is.logical(dfbeta))
stop("dfbeta must be TRUE or FALSE")
if (dfbeta & !jackknife)
stop("the dfbeta option only applies if jackknife=TRUE")
X <- model.matrix(Terms, mf)
if (ncol(X) != (1 + attr(Terms, "intercept")))
stop("Deming regression requires a single predictor variable")
xx <- X[, ncol(X), drop = FALSE]
if (!usepattern)
fit <- deming.fit1(xx, Y, wt, xstd, ystd, intercept = attr(Terms, "intercept"))
else fit <- deming.fit2(xx, Y, wt, stdpat, intercept = attr(Terms, "intercept"))
yhat <- fit$coefficients[1] + fit$coefficients[2] * xx
fit$residuals <- Y - yhat
if (x)
fit$x <- X
if (y)
fit$y <- Y
if (model)
fit$model <- mf
na.action <- attr(mf, "na.action")
if (length(na.action))
fit$na.action <- na.action
fit$n <- length(Y)
fit$terms <- Terms
fit$call <- Call
fit
}
deming.aux(y ~ x + 0, df, xstd=sy, ystd=sy)
$`coefficients`
[1] 0.000000 4.324481
$se
[1] 0.2872988 0.7163073
$sigma
[1] 2.516912
$residuals
[,1]
1 9.19499513
2 2.13037957
3 3.00064886
4 2.16751905
5 0.00168729
6 4.75834265
7 3.44108236
8 6.40028085
9 6.63531039
10 -1.48624851
$model
y x (xstd) (ystd)
1 2.1459817 -1.6300251 0.48826221 0.48826221
2 1.3163362 -0.1882407 0.46002166 0.46002166
3 1.5263967 -0.3409084 0.55771660 0.55771660
4 -0.9078000 -0.7111417 0.81145673 0.81145673
5 -1.6768719 -0.3881527 0.01563191 0.01563191
6 -0.6114545 -1.2417205 0.41675425 0.41675425
7 -0.7783790 -0.9757150 0.82498713 0.82498713
8 1.1240046 -1.2200946 0.84072712 0.84072712
9 -0.3091330 -1.6058442 0.35926078 0.35926078
10 0.7215432 0.5105333 0.23674788 0.23674788
$n
[1] 10
$terms
y ~ x + 0
...
To adapt the function I have performed, these steps:
1 .- Load the internal functions of the package.
deming.fit1 <- getAnywhere(deming.fit1)[[2]][[1]]
deming.fit2 <- getAnywhere(deming.fit2)[[2]][[1]]
2 .- Locate the problem and solve it, executing the function step by step with an example.
Y <- as.matrix(model.response(mf, type = "numeric"))
...
xx <- X[, ncol(X), drop = FALSE]
3 .- Fix other possible error generated by the changes.
In this case, delete the class of the output to avoid an error in the print.
Bug report:
Terry Therneau (the author of deming) uploaded a new version to CRAN, with this problem solved.
I know the theoretical answer to the question of my title, which is discussed here or in this previous question on Stack Overflow. My problem is that, even considering some numerical roundings, the probability weights I compute using the coefficients fitted in the R function multinom are quite different from the weights directly obtained from the same function (through predict(fit, newdata = dat, "probs")). I tried to numerically compute these weights in Java and R, and in both implementations I obtain the same results, which are in fact different from the values returned from predict.
Do you know how I may discover the implementation of the function predict(..., "probs") for the R function multinom?
I first install nnet and open the help page for nnet function. I see that the function creates a nnet object.
I trypredict.nnet but nothing comes up. This means either the package is not loaded, the function doesn't exist or it's hidden. methods("predict") reveals that the object is actually hidden (indicated by the *).
> methods("predict")
[1] predict.ar* predict.Arima* predict.arima0* predict.glm
[5] predict.HoltWinters* predict.lm predict.loess* predict.mlm
[9] predict.multinom* predict.nls* predict.nnet* predict.poly
[13] predict.ppr* predict.prcomp* predict.princomp* predict.smooth.spline*
[17] predict.smooth.spline.fit* predict.StructTS*
Calling this function explicitly reveals its code.
> nnet:::predict.nnet
function (object, newdata, type = c("raw", "class"), ...)
{
if (!inherits(object, "nnet"))
stop("object not of class \"nnet\"")
type <- match.arg(type)
if (missing(newdata))
z <- fitted(object)
else {
if (inherits(object, "nnet.formula")) {
newdata <- as.data.frame(newdata)
rn <- row.names(newdata)
Terms <- delete.response(object$terms)
m <- model.frame(Terms, newdata, na.action = na.omit,
xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses")))
.checkMFClasses(cl, m)
keep <- match(row.names(m), rn)
x <- model.matrix(Terms, m, contrasts = object$contrasts)
xint <- match("(Intercept)", colnames(x), nomatch = 0L)
if (xint > 0L)
x <- x[, -xint, drop = FALSE]
}
else {
if (is.null(dim(newdata)))
dim(newdata) <- c(1L, length(newdata))
x <- as.matrix(newdata)
if (any(is.na(x)))
stop("missing values in 'x'")
keep <- 1L:nrow(x)
rn <- rownames(x)
}
ntr <- nrow(x)
nout <- object$n[3L]
.C(VR_set_net, as.integer(object$n), as.integer(object$nconn),
as.integer(object$conn), rep(0, length(object$wts)),
as.integer(object$nsunits), as.integer(0L), as.integer(object$softmax),
as.integer(object$censored))
z <- matrix(NA, nrow(newdata), nout, dimnames = list(rn,
dimnames(object$fitted.values)[[2L]]))
z[keep, ] <- matrix(.C(VR_nntest, as.integer(ntr), as.double(x),
tclass = double(ntr * nout), as.double(object$wts))$tclass,
ntr, nout)
.C(VR_unset_net)
}
switch(type, raw = z, class = {
if (is.null(object$lev)) stop("inappropriate fit for class")
if (ncol(z) > 1L) object$lev[max.col(z)] else object$lev[1L +
(z > 0.5)]
})
}
<bytecode: 0x0000000009305fd8>
<environment: namespace:nnet>