I'm trying to make a small R package with my limited knowledge in R programming. I am trying to use the following argument:
formula=~a+b*X
where X is a vector and 'a' and 'b' are constants, in a function call.
What I'm wondering is once I input the formula, I want to extract (a,b) and X separately and use them for other data manipulations inside the function call. Is there a way to do it in R?
I would really appreciate any guidance.
Note: Edited my question for clarity
I'm looking for something similar to model.matrix() output. The above mentioned formula can be more generalized to accommodate 'n' number of variables, say,
~2+3*X +4*Y+...+2*Z
In the output, I need the coefficients (2 3 4 ...2) as a vector and [1 X Y ... Z] as a covariate matrix.
The question is not completely clear, so we will assume it asks: given a formula in standard formula syntax, how do we parse out the variable names (or, in the second solution, the variable names and constants) as a character vector?
1) all.vars Try this:
fo <- ~ a + b * X # input
all.vars(fo)
giving:
[1] "a" "b" "X"
2) strapplyc Also we could do it with string manipulation. In this case it also parses out the constants.
library(gsubfn)
fo <- ~ 25 + 35 * X # input
strapplyc(gsub(" ", "", format(fo)), "-?[0-9.]+|[a-zA-Z0-9._]+", simplify = unlist)
giving:
[1] "25" "35" "X"
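The same tokenization can be sketched in base R, without gsubfn. This is only an illustration: the pattern below is an assumption that covers unsigned numbers and plain variable names, not every valid formula.

```r
fo <- ~ 25 + 35 * X                      # input
# Deparse the right-hand side and drop spaces: "25+35*X"
s <- gsub(" ", "", paste(deparse(fo[[2]]), collapse = ""))
# Match either an unsigned number or an identifier (assumed token shapes)
pat <- "[0-9.]+|[A-Za-z._][A-Za-z0-9._]*"
tokens <- regmatches(s, gregexpr(pat, s))[[1]]
tokens
# [1] "25" "35" "X"
```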
Note: If all you are trying to do is to evaluate the RHS of the formula as an R expression then it is just:
X <- 1:3
fo <- ~ 1 + 2 * X
eval(fo[[2]])
giving:
[1] 3 5 7
Update: Fixed and added second solution and Note.
A call is a list of symbols and/or other calls and its elements can be accessed through normal indexing operations, e.g.
f <- ~a+bX
f[[1]]
#`~`
f[[2]]
#a + bX
f[[2]][[1]]
#`+`
f[[2]][[2]]
#a
However notice that in your formula bX is one symbol, you probably meant b * X instead.
f <- ~a + b * X
Then a and b typically would be stored in an unevaluated list.
vars <- call('list', f[[2]][[2]], f[[2]][[3]][[2]])
vars
#list(a, b)
and vars would be passed to eval at some point.
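Indexing into the call can be combined with eval to get what the question asks for: a coefficient vector and a covariate matrix. The sketch below is a hedged illustration, not a general parser. The helpers split_terms and parse_linear are hypothetical names, and the code assumes every term is either a bare non-negative constant or constant * variable, joined by +.

```r
# Hypothetical helper: flatten nested `+` calls into a list of terms.
split_terms <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(split_terms(e[[2]]), split_terms(e[[3]]))
  } else {
    list(e)
  }
}

# Hypothetical helper: return coefficients and the covariate matrix.
parse_linear <- function(fo, data = NULL) {
  parts <- split_terms(fo[[length(fo)]])   # right-hand side of the formula
  env <- environment(fo)                   # where the variables are looked up
  coefs <- numeric(length(parts))
  cols <- vector("list", length(parts))
  labs <- character(length(parts))
  for (i in seq_along(parts)) {
    term <- parts[[i]]
    if (is.numeric(term)) {                # bare constant, e.g. 2
      coefs[i] <- term
      cols[[i]] <- 1
      labs[i] <- "(Intercept)"
    } else if (is.call(term) && identical(term[[1]], as.name("*")) &&
               is.numeric(term[[2]])) {    # constant * variable, e.g. 3 * X
      coefs[i] <- term[[2]]
      cols[[i]] <- eval(term[[3]], data, env)
      labs[i] <- deparse(term[[3]])
    } else {
      stop("unsupported term: ", deparse(term))
    }
  }
  mat <- do.call(cbind, cols)              # the constant column is recycled
  colnames(mat) <- labs
  list(coef = coefs, X = mat)
}

X <- 1:3
Y <- c(10, 20, 30)
res <- parse_linear(~ 2 + 3 * X + 4 * Y)
res$coef
# [1] 2 3 4
```

Here res$X has the columns (Intercept), X and Y, so res$X %*% res$coef reproduces the value of the right-hand side. Negative constants parse as a call to unary minus and are deliberately rejected by this sketch.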
Related
When does R evaluate the formula argument passed into lm()? For example:
tmp <- rnorm(1000)
y <- rnorm(1000) + tmp
df = data.frame(x=tmp)
lm(y~x, data=df)$coefficients
returns
(Intercept) x
0.01713098 0.98073687
This suggests that y~x is evaluated after control is passed to lm() because there's no x variable in the calling context. However, it also suggests it is evaluated before control is passed to lm() because there's no y column in df.
In most languages the arguments are fully evaluated before being passed down the stack. I'm probably missing something subtle here. Any help would be appreciated.
The formula indicator ~ is really just a function that captures unevaluated symbols. It's a lot like quote() where you can run quote(f(x)+1) and get back the unevaluated R expression for the syntax you pass in. The main difference is that ~ allows you to have two sets of unevaluated expressions, and that the formula will keep track of the environment where it was created. So these are the same
a ~ b
`~`(a, b)
So you can pass any valid R syntax into a formula, it doesn't have to be a variable name. The values are stored as un-evaluated symbols or calls. So the formula call itself is evaluated, but its parameters are not.
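A short sketch of what a formula object holds may help here. The function make_formula and the variable z are made up for the illustration.

```r
f <- y ~ log(x) + 1      # neither y nor x needs to exist at this point
class(f)                 # "formula"
length(f)                # 3: `~`, left-hand side, right-hand side
is.name(f[[2]])          # TRUE: y is stored as a symbol
is.call(f[[3]])          # TRUE: log(x) + 1 is stored as a call

# The formula remembers the environment in which it was created.
make_formula <- function() {
  z <- 42                # local to this (hypothetical) function
  y ~ z
}
g <- make_formula()
get("z", envir = environment(g))
# [1] 42
```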
Other functions can use these formulas however they like. The lm() function for example will pass the formula to model.matrix. Then model.matrix will attempt to evaluate the variables found in the formula in the data.frame that was passed to the function, and, if not present there, the environment where the formula was created. Different functions may choose to evaluate formulas in different ways so it's really not possible to say with certainty when the evaluation will happen.
It usually happens something like this
myformula <- a~b
b <- 4
dd <- data.frame(a=1:3)
eval(myformula[[2]], dd, environment(myformula)) + # a
eval(myformula[[3]], dd, environment(myformula)) # b
# [1] 5 6 7
The symbols are looked up in the environment/list passed to the evaluator.
The environment part is also relevant because you can also have situations like this
b <- 10
foo <- function(myformula = a~b) {
b <- 4
dd <- data.frame(a=1:3)
eval(myformula[[2]], dd, environment(myformula)) +
eval(myformula[[3]], dd, environment(myformula))
}
foo()
# [1] 5 6 7
foo(a~b)
# [1] 11 12 13
If you call foo without a parameter, the default formula is created inside the function body, so R looks for the value of b there. But if you call foo with a formula, that formula is evaluated in the calling environment, so R will look for the value of b outside the function. This is how most modeling functions behave. Again, this is just a convention, and a non-base R package may choose to do something else.
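The lookup order described above (data frame first, then the formula's environment) can be seen with model.frame, which lm() uses internally. This is a sketch mirroring the example in the question, with a fixed seed and a smaller sample chosen only for the illustration.

```r
set.seed(1)
tmp <- rnorm(100)
y <- 2 * tmp + rnorm(100)      # y exists only in the calling environment
df <- data.frame(x = tmp)      # x exists only in the data frame

fo <- y ~ x                    # nothing is looked up yet
mf <- model.frame(fo, data = df)
names(mf)
# [1] "y" "x"
```

The column x comes from df and the column y comes from the environment where fo was created, which is why the lm() call in the question works even though neither place holds both variables.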
I am trying to do the following but can't figure it out. Could someone please help me?
f <- expression(x^3+4*y)
df <- D(f,'x')
x <-0
df0 <- eval(df)
df0 should be a function of y!
If you take the derivative of f with respect to x you get 3 * x^2. The 4*y is a constant as far as x is concerned. So you don't have a function of y as such, your df is a constant as far as y is concerned (although it is a function of x).
Assigning to x doesn't change df; it remains the expression 3 * x^2 and is still a function of x if you wanted to treat it as such.
If you want to substitute a variable in an expression, then substitute() is what you are looking for.
> substitute(3 * x^2, list(x = 0))
3 * 0^2
It is a blind substitute with no simplification of the expression--we probably expected zero here, but we get zero times 3--but that is what you get.
Unfortunately, substituting in an expression you have in a variable is a bit cumbersome, since substitute() thinks its first argument is the verbatim expression, so you get
> substitute(df, list(x = 0))
df
The expression is df, there is no x in that so nothing is substituted, and you just get df back.
You can get around that with two substitutions and an eval:
> df0 <- eval(
+ substitute(substitute(expr, list(x = 0)),
+ list(expr = df)))
> df0
3 * 0^2
> eval(df0)
[1] 0
The outermost substitute() puts the value of df into expr, so you get the right expression there, and the inner substitute() changes the value of x.
There are nicer functions for manipulating expressions in the Tidyverse, but I don't remember them off the top of my head.
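A shorter base R route to the same result is to let do.call() evaluate the variable before substitute() sees it. The sketch below uses the name dfdx instead of df, purely to avoid confusion with the stats function of that name.

```r
f <- expression(x^3 + 4 * y)
dfdx <- D(f, "x")                      # the call 3 * x^2

# do.call() evaluates dfdx first, so substitute() receives the call itself
df0 <- do.call(substitute, list(dfdx, list(x = 0)))
df0
# 3 * 0^2
eval(df0)
# [1] 0

# If only the value is needed, evaluate with x supplied in a list
eval(dfdx, list(x = 2))
# [1] 12
```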
I'm trying to override binary operators such as + or - when they receive two integers or two numerics, without setting a class attribute.
First, I tried setMethod, but it cannot redefine a sealed operator.
Second, I tried to write Ops.{class} as in this link.
But that didn't work without setting a class on the S3 objects.
So I want to know how to override the + and - methods for integers or numerics that have no class attribute.
If you just want to override + and - for numerics you can do that. Here's an example:
`+` <- function(x,y) x * y
2 + 3
[1] 6
Of course, after you've done this, you can't use + in the normal way anymore (but for reasons beyond me it seems that's what you want).
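One way to limit the damage is to keep the override inside a private environment and evaluate only selected expressions there. This is a sketch under that assumption, and with_override is a hypothetical helper name.

```r
# Hypothetical helper: evaluate one expression with `+` redefined,
# leaving the global `+` untouched.
with_override <- function(expr) {
  env <- new.env(parent = parent.frame())
  env$`+` <- function(x, y) x * y      # same override as above, but local
  eval(substitute(expr), env)
}

with_override(2 + 3)
# [1] 6
2 + 3                                  # ordinary arithmetic still works
# [1] 5
```

If + has already been overridden at top level, rm("+") removes that definition and the base operator becomes visible again.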
If you need some special arithmetic for numerics, it is easier to define infix operators with the %<operator>% notation. Here's an example defining the operations from max-plus algebra:
`%+%` <- function(x,y) pmax(x,y) #(use pmax for vectorization)
`%*%` <- function(x,y) x + y
2 %+% 3
[1] 3
2 %*% 3
[1] 5
Another option is to define a special number class. (I'll call it tropical in the following example since max-plus algebra is a variant of tropical algebra)
setClass("tropical",slots = c(x="numeric"))
# a show method always comes in handy
setMethod("show","tropical",function(object){
cat("tropical vector\n")
print(object@x)
})
# it's also nice to have an extractor
setMethod("[","tropical",function(x,i,j,...,drop) new("tropical",x=x@x[i]) )
setMethod("+",c("tropical","tropical")
, function(e1,e2) new("tropical", x=pmax(e1@x,e2@x)))
setMethod("*",c("tropical","tropical")
, function(e1,e2) new("tropical", x= e1@x + e2@x))
# try it out
tr1 <- new("tropical",x=c(1,2,3))
tr2 <- new("tropical",x=c(3,2,1))
tr1 + tr2
tr1 * tr2
# this gives a warning about recycling
tr1[1:2] + tr2
The following function is used to multiply a sequence 1:x by y
f1<-function(x,y){return (lapply(1:x, function(a,b) b*a, b=y))}
It looks like a represents the current element of the sequence 1:x, but I do not understand the parameter-passing mechanism. Other OO languages, such as Java or C++, have call by reference or call by value.
Short answer: R is call by value. Long answer: it can do both.
Call By Value, Lazy Evaluation, and Scoping
You'll want to read through: the R language definition for more details.
R mostly uses call by value but this is complicated by its lazy evaluation:
So you can have a function:
f <- function(x, y) {
x * 3
}
If you pass two big matrices as x and y, only x is ever evaluated inside f, because y is never used. Neither is copied just by being passed: R copies an argument only when the function modifies it (copy-on-modify).
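Both points can be checked with a small sketch. The function set_first is made up for the illustration.

```r
f <- function(x, y) {
  x * 3
}
# y is never used, so its argument is never evaluated and stop() never runs
f(2, stop("y was evaluated"))
# [1] 6

# Modifying an argument inside a function does not change the caller's object
set_first <- function(v) {
  v[1] <- 100
  v
}
a <- c(1, 2, 3)
b <- set_first(a)
a
# [1] 1 2 3
b
# [1] 100   2   3
```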
But you can also access variables in parent environments of f:
y <- 5
f <- function(x) {
x * y
}
f(3) # 15
Or even:
y <- 5
f <- function() {
x <- 3
g <- function() {
x * y
}
}
f() # returns function g()
f()() # returns 15
Call By Reference
There are two ways for doing call by reference in R that I know of.
One is by using Reference Classes, one of the three object oriented paradigms of R (see also: Advanced R programming: Object Oriented Field Guide)
The other is to use the bigmemory package and its big.matrix objects (see The bigmemory project). This allows you to create matrices in memory (underlying data is stored in C), returning a pointer to the R session. This allows you to do fun things like accessing the same matrix from multiple R sessions.
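As a minimal sketch of the first option, here is a Reference Class whose object is changed in place by a function that receives it. The class name Counter and the function bump are made up for the illustration.

```r
library(methods)

# Hypothetical reference class with a single numeric field
Counter <- setRefClass(
  "Counter",
  fields = list(n = "numeric"),
  methods = list(
    inc = function() {
      n <<- n + 1          # modifies the object in place
      invisible(.self)
    }
  )
)

bump <- function(counter) counter$inc()   # the caller captures no return value

cnt <- Counter$new(n = 0)
bump(cnt)
bump(cnt)
cnt$n
# [1] 2
```

With an ordinary numeric vector the same bump pattern would leave the caller's value unchanged.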
To multiply a vector x by a constant y just do
x * y
The (some prefix)apply functions all work in much the same way: you map a function over every element of your vector, list, matrix and so on:
x = 1:10
x.squared = sapply(x, function(elem)elem * elem)
print(x.squared)
[1] 1 4 9 16 25 36 49 64 81 100
It gets better with matrices and data frames because you can now apply a function over all rows or columns, and collect the output. Like this:
m = matrix(1:9, ncol = 3)
# The 1 below means apply over rows, 2 would mean apply over cols
row.sums = apply(m, 1, function(some.row) sum(some.row))
print(row.sums)
[1] 12 15 18
If you're looking for a simple way to multiply a sequence by a constant, definitely use @Fernando's answer or something similar. I'm assuming you're just trying to determine how parameters are being passed in this code.
lapply calls its second argument (in your case function(a, b) b*a) with each of the values of its first argument 1, 2, ..., x. Those values will be passed as the first parameter to the second argument (so, in your case, they will be argument a).
Any additional parameters to lapply after the first two, in your case b=y, are passed to the function by name. So if you called your inner function fxn, then your invocation of lapply is making calls like fxn(1, b=4), fxn(2, b=4), .... The parameters are passed by value.
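The equivalence described above can be checked directly. In this sketch fxn is the name the answer uses for the inner function, and the values 3 and 4 are arbitrary.

```r
f1 <- function(x, y) lapply(1:x, function(a, b) b * a, b = y)

# The calls lapply makes, written out by hand
fxn <- function(a, b) b * a
by_hand <- list(fxn(1, b = 4), fxn(2, b = 4), fxn(3, b = 4))

identical(f1(3, 4), by_hand)
# [1] TRUE
unlist(f1(3, 4))
# [1]  4  8 12
```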
You should read the help of lapply to understand how it works. Read this excellent answer for a good explanation of the different functions in the xxpply family.
From the help of lapply:
lapply(X, FUN, ...)
Here FUN is applied to each element of X, and ... refers to:
... optional arguments to FUN.
Since FUN has an optional argument b, we replace the ... with b=y.
You can see it as syntactic sugar that emphasizes that argument b is optional compared to argument a. If the two arguments are symmetric, it may be better to use mapply.
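A small sketch of the mapply alternative, with arbitrary input values:

```r
# Both arguments vary together: a takes 1, 2, 3 and b takes 4, 5, 6
mapply(function(a, b) b * a, 1:3, 4:6)
# [1]  4 10 18

# A constant second argument can be supplied through MoreArgs
mapply(function(a, b) b * a, 1:3, MoreArgs = list(b = 4))
# [1]  4  8 12
```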
Consider the following simple function:
f <- function(x, value){print(x);print(substitute(value))}
Argument x will eventually be evaluated by print, but value never will. So we can get results like this:
> f(a, a)
Error in print(x) : object 'a' not found
> f(3, a)
[1] 3
a
> f(1+1, 1+1)
[1] 2
1 + 1
> f(1+1, 1+"one")
[1] 2
1 + "one"
Everything as expected.
Now consider the same function body in a replacement function:
'g<-' <- function(x, value){print(x);print(substitute(value))}
(the single quotes should be backticks)
Let's try it:
> x <- 3
> g(x) <- 4
[1] 3
[1] 4
Nothing unusual so far...
> g(x) <- a
Error: object 'a' not found
This is unexpected. Name a should be printed as a language object.
> g(x) <- 1+1
[1] 4
1 + 1
This is ok, as x's former value is 4. Notice the expression passed unevaluated.
The final test:
> g(x) <- 1+"one"
Error in 1 + "one" : non-numeric argument to binary operator
Wait a minute... Why did it try to evaluate this expression?
Well the question is: bug or feature? What is going on here? I hope some guru users will shed some light about promises and lazy evaluation on R. Or we may just conclude it's a bug.
We can reduce the problem to a slightly simpler example:
g <- function(x, value) x
'g<-' <- function(x, value) x
x <- 3
# Works
g(x, a)
`g<-`(x, a)
# Fails
g(x) <- a
This suggests that R is doing something special when evaluating a replacement function: I suspect it evaluates all arguments. I'm not sure why, but the comments in the C code (https://github.com/wch/r-source/blob/trunk/src/main/eval.c#L1656 and https://github.com/wch/r-source/blob/trunk/src/main/eval.c#L1181) suggest it may be to make sure other intermediate variables are not accidentally modified.
Luke Tierney has a long comment about the drawbacks of the current approach, and illustrates some of the more complicated ways replacement functions can be used:
There are two issues with the approach here:
A complex assignment within a complex assignment, like
f(x, y[] <- 1) <- 3, can cause the value temporary
variable for the outer assignment to be overwritten and
then removed by the inner one. This could be addressed by
using multiple temporaries or using a promise for this
variable as is done for the RHS. Printing of the
replacement function call in error messages might then need
to be adjusted.
With assignments of the form f(g(x, z), y) <- w the value
of z will be computed twice, once for a call to g(x, z)
and once for the call to the replacement function g<-. It
might be possible to address this by using promises.
Using more temporaries would not work as it would mess up
replacement functions that use substitute and/or
nonstandard evaluation (and there are packages that do
that -- igraph is one).
I think the key may be found in this comment beginning at line 1682 of "eval.c" (and immediately followed by the evaluation of the assignment operation's RHS):
/* It's important that the rhs get evaluated first because
assignment is right associative i.e. a <- b <- c is parsed as
a <- (b <- c). */
PROTECT(saverhs = rhs = eval(CADR(args), rho));
We expect that if we do g(x) <- a <- b <- 4 + 5, both a and b will be assigned the value 9; this is in fact what happens.
Apparently, the way that R ensures this consistent behavior is to always evaluate the RHS of an assignment first, before carrying out the rest of the assignment. If that evaluation fails (as when you try something like g(x) <- 1 + "a"), an error is thrown and no assignment takes place.
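The difference between the two call forms can be captured without stopping the script by wrapping each in tryCatch. This is a sketch, and no_such_object_here is a deliberately undefined name.

```r
`g<-` <- function(x, value) x          # never touches value
x <- 3

# Direct call: value stays an unevaluated promise, so no error
direct <- tryCatch({
  `g<-`(x, no_such_object_here)
  "ok"
}, error = function(e) "error")

# Replacement syntax: the right-hand side is evaluated before `g<-` runs
sugar <- tryCatch({
  g(x) <- no_such_object_here
  "ok"
}, error = function(e) "error")

c(direct = direct, sugar = sugar)
```

Here direct is "ok" and sugar is "error", matching the behaviour reported in the question.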
I'm going to go out on a limb here, so please, folks with more knowledge feel free to comment/edit.
Note that when you run
'g<-' <- function(x, value){print(x);print(substitute(value))}
x <- 1
g(x) <- 5
a side effect is that 5 is assigned to x. Hence, both must be evaluated. But if you then run
'g<-'(x,10)
both the values of x and 10 are printed, but the value of x remains the same.
Speculation:
So the parser is distinguishing between whether you call g<- in the course of making an actual assignment, and when you simply call g<- directly.
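The R Language Definition describes the replacement form as an ordinary call to the replacement function followed by an assignment of its result. The sketch below shows that equivalence with a made-up replacement function named second<-.

```r
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

x <- c(1, 2, 3)
second(x) <- 10                       # replacement syntax
y <- c(1, 2, 3)
y <- `second<-`(y, value = 10)        # explicit call plus ordinary assignment
identical(x, y)
# [1] TRUE

z <- c(1, 2, 3)
`second<-`(z, 10)                     # result is discarded, z is unchanged
z
# [1] 1 2 3
```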