Base R defines an identity function, a trivial identity function returning its argument (quoting from ?identity).
It is defined as:
identity <- function (x){x}
Why would such a trivial function ever be useful? Why would it be included in base R?
Don't know about R, but in a functional language one often passes functions as arguments to other functions. In such cases, the constant function (which returns the same value for any argument) and the identity function play a similar role as 0 and 1 in multiplication, so to speak.
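To make the "0 and 1 in multiplication" analogy concrete, here is a minimal sketch in R. The helper compose2() is hypothetical (it is not part of base R); the point is only that identity works as the neutral starting value when folding a list of functions into one.

```r
# compose2() is a hypothetical helper, not a base R function.
compose2 <- function(f, g) {
  force(f); force(g)            # evaluate both arguments now, not lazily
  function(x) g(f(x))           # apply f first, then g
}

steps <- list(sqrt, function(x) x + 1)

# identity is the initial value, much like 1 when multiplying a list of numbers
pipeline <- Reduce(compose2, steps, identity)
pipeline(9)                     # sqrt(9) + 1

# with no steps at all, the fold simply returns identity
empty <- Reduce(compose2, list(), identity)
empty(9)
```

With an empty list of steps the result is identity itself, so callers do not need a special case for "no transformation".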
I use it from time to time with the apply family of functions.
For instance, you could write t() as:
dat <- data.frame(x=runif(10),y=runif(10))
apply(dat,1,identity)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
x 0.1048485 0.7213284 0.9033974 0.4699182 0.4416660 0.1052732 0.06000952
y 0.7225307 0.2683224 0.7292261 0.5131646 0.4514837 0.3788556 0.46668331
[,8] [,9] [,10]
x 0.2457748 0.3833299 0.86113771
y 0.9643703 0.3890342 0.01700427
One use that appears on a simple code base search is as a convenience for the most basic type of error handling function in tryCatch.
tryCatch(...,error = identity)
which is identical (ha!) to
tryCatch(...,error = function(e) e)
So this handler would catch the error condition and then simply return it.
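A small sketch of what that looks like in practice (the error message here is made up for illustration): the value returned by tryCatch is the condition object itself, which can then be inspected.

```r
# The error is caught and returned as a condition object instead of stopping
res <- tryCatch(stop("something went wrong"), error = identity)

inherits(res, "error")     # is it an error condition?
conditionMessage(res)      # extract the message text
```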
For whatever it's worth, it is located in funprog.R (the functional programming stuff) in the source of the base package, and it was added as a "convenience function" in 2008: I can imagine (but can't give an immediate example!) that there would be some contexts in the functional programming approach (i.e. using Filter, Reduce, Map etc.) where it would be convenient to have an identity function ...
r45063 | hornik | 2008-04-03 12:40:59 -0400 (Thu, 03 Apr 2008) | 2 lines
Add higher-order functions Find() and Position(), and convenience
function identity().
Stepping away from functional programming, identity is also used in another context in R, namely statistics. Here, it is used to refer to the identity link function in generalized linear models. For more details about this, see ?family or ?glm. Here is an example:
> x <- rnorm(100)
> y <- rpois(100, exp(1+x))
> glm(y ~x, family=quasi(link=identity))
Call: glm(formula = y ~ x, family = quasi(link = identity))
Coefficients:
(Intercept) x
4.835 5.842
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 6713
Residual Deviance: 2993 AIC: NA
However, in this case parsing it as a string instead of a function will achieve the same: glm(y ~x, family=quasi(link="identity"))
EDIT: As noted in the comments below, the function base::identity is not what is used by the link constructor, and it is just used for parsing the link name. (Rather than deleting this answer, I'll leave it to help clarify the difference between the two.)
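To see the distinction, one can look at the link object that the stats package builds from the name "identity". This is only a sketch of how to inspect it, using stats::make.link(); it does not show how glm uses the link internally.

```r
# Build the link object by name; this does not involve base::identity
lnk <- make.link("identity")

lnk$name                   # the link's name
lnk$linkfun(c(1, 2.5))     # the link function returns its input unchanged
lnk$linkinv(c(1, 2.5))     # and so does its inverse
```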
Here is usage example:
Map<Integer, Long> m = Stream.of(1, 1, 2, 2, 3, 3)
.collect(Collectors.groupingBy(Function.identity(),
Collectors.counting()));
System.out.println(m);
output:
{1=2, 2=2, 3=2}
Here we are grouping ints into an int/count map. Collectors.groupingBy accepts a Function. In our case we need a function which returns its argument. Note that we could use the lambda e -> e instead.
I just used it like this:
fit_model <- function(lots, of, parameters, error_silently = TRUE) {
purrr::compose(ifelse(test = error_silently, yes = tryNA, no = identity),
fit_model_)(lots, of, parameters)
}
tryNA <- function(expr) {
suppressWarnings(tryCatch(expr = expr,
error = function(e) NA,
finally = NA))
}
As this question has already been viewed 8k times, it may be worth updating even 9 years after it was written.
In a blog post called "Simple tricks for Debugging Pipes (within magrittr, base R or ggplot2)" the author points out how identity() can be very useful at the end of different kinds of pipes. The blog post with examples can be found here: https://rstats-tips.net/2021/06/06/simple-tricks-for-debugging-pipes-within-magrittr-base-r-or-ggplot2/
If pipe chains are written so that each "pipe" symbol is at the end of a line, you can exclude any line from execution by commenting it out, except for the last line. If you add identity() as the last line, there will never be a need to comment that one out. So you can temporarily exclude any line that changes the data by commenting it out.
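A minimal sketch of the trick, assuming R 4.1 or later for the native |> pipe (the same layout works with magrittr's %>%). The data and steps are made up for illustration.

```r
result <- c(16, 4, 9) |>
  sort() |>
  # rev() |>        # this step is temporarily disabled by commenting it out
  sqrt() |>
  identity()        # harmless last step, so every line above can end in a pipe

result
```

Because identity() changes nothing, any of the lines above it, including the last "real" step, can be commented out without leaving a dangling pipe.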
I've seen a couple of people using [<- as a function with Polish notation, for example
x <- matrix(1:4, nrow = 2)
`[<-`(x, 1, 2, 7)
which returns
[,1] [,2]
[1,] 1 7
[2,] 2 4
I've tried playing around with [<- a little, and it looks like using it this way prints the result of something like x[1,2] <- 7 without actually performing the assignment. But I can't figure out for sure what this function actually does, because the documentation given for ?"[" only mentions it in passing, and I can't search google or SO for "[<-".
And yes, I know that actually using it is probably a horrible idea, I'm just curious for the sake of a better understanding of R.
This is what you would need to do to get the assignment to stick:
`<-`( `[`( x, 1, 2), 7) # or x <- `[<-`( x, 1, 2, 7)
x
[,1] [,2]
[1,] 1 7
[2,] 2 4
Essentially what is happening is that [ is creating a pointer into row-col location of x and then <- (which is really a synonym for assign that can also be used in an infix notation) is doing the actual "permanent" assignment. Do not be misled into thinking this is a call-by-reference assignment. I'm reasonably sure there will still be a temporary value of x created.
Your version did make a subassignment (as can be seen by what it returned) but that assignment was only in the local environment of the call to [<- which did not encompass the global environment.
Since `[`(x, y) slices an object, and `<-`(x, z) performs assignment, it seems like `[<-`(x,y,z) would perform the assignment x[y] <- z. @42-'s answer is a great explanation of what [<- actually does, and the top answer to `levels<-`( What sorcery is this? provides some insight into why R works this way.
To see what [<- actually does under the hood, you have to go to the C source code, which for [<- can be found at http://svn.r-project.org/R/trunk/src/main/subassign.c (the relevant parts start at around line 1470). You can see that x, the object being "assigned to" is protected so that only the local version is mutated. Instead, we're using VectorAssign, MatrixAssign, ArrayAssign, etc. to perform assignment locally and then returning the result.
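The behavior described above can be checked directly with the matrix from the question. This sketch only demonstrates the observable result (a modified copy is returned, x is left alone); it says nothing about what happens internally.

```r
x <- matrix(1:4, nrow = 2)

# Calling the replacement function directly returns a modified copy
y <- `[<-`(x, 1, 2, 7)

x[1, 2]    # x itself is unchanged
y[1, 2]    # the returned value carries the new element

# To make the change stick, assign the result back
x <- `[<-`(x, 1, 2, 7)
x[1, 2]
```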
I saw:
“To understand computations in R, two slogans are helpful:
• Everything that exists is an object.
• Everything that happens is a function call.”
— John Chambers
But I just found:
a <- 2
is.object(a)
# FALSE
Actually, if a variable is a pure base type, the result of is.object() is FALSE. So it should not be an object.
So what is the real meaning of 'Everything that exists is an object' in R?
The function is.object seems only to check whether the object has a "class" attribute. So it does not have the same meaning as in the slogan.
For instance:
x <- 1
attributes(x) # it does not have a class attribute
NULL
is.object(x)
[1] FALSE
class(x) <- "my_class"
attributes(x) # now it has a class attribute
$class
[1] "my_class"
is.object(x)
[1] TRUE
Now, trying to answer your real question, about the slogan, this is how I would put it. Everything that exists in R is an object in the sense that it is a kind of data structure that can be manipulated. I think this is better understood with functions and expressions, which are not usually thought of as data.
Taking a quote from Chambers (2008):
The central computation in R is a function call, defined by the
function object itself and the objects that are supplied as the
arguments. In the functional programming model, the result is defined
by another object, the value of the call. Hence the traditional motto
of the S language: everything is an object—the arguments, the value,
and in fact the function and the call itself: All of these are defined
as objects. Think of objects as collections of data of all kinds. The data contained and the way the data is organized depend on the class from which the object was generated.
Take this expression for example: mean(rnorm(100), trim = 0.9). Until it is evaluated, it is an object very much like any other. So you can change its elements just like you would with a list. For instance:
call <- substitute(mean(rnorm(100), trim = 0.9))
call[[2]] <- substitute(rt(100,2 ))
call
mean(rt(100, 2), trim = 0.9)
Or take a function, like rnorm:
rnorm
function (n, mean = 0, sd = 1)
.Call(C_rnorm, n, mean, sd)
<environment: namespace:stats>
You can change its default arguments just like a simple object, like a list, too:
formals(rnorm)[2] <- 100
rnorm
function (n, mean = 100, sd = 1)
.Call(C_rnorm, n, mean, sd)
<environment: namespace:stats>
Taking one more time from Chambers (2008):
The key concept is that expressions for evaluation are themselves
objects; in the traditional motto of the S language, everything is an
object. Evaluation consists of taking the object representing an
expression and returning the object that is the value of that
expression.
So going back to our call example, the call is an object which represents another object. When evaluated, it becomes that other object, which in this case is the numeric vector with one number: -0.008138572.
set.seed(1)
eval(call)
[1] -0.008138572
And that would take us to the second slogan, which you did not mention, but usually comes together with the first one: "Everything that happens is a function call".
Taking again from Chambers (2008), he actually qualifies this statement a little bit:
Nearly everything that happens in R results from a function call.
Therefore, basic programming centers on creating and refining
functions.
So what that means is that almost every transformation of data that happens in R is a function call. Even a simple thing, like a parenthesis, is a function in R.
So taking the parenthesis like an example, you can actually redefine it to do things like this:
`(` <- function(x) x + 1
(1)
[1] 2
Which is not a good idea but illustrates the point. So I guess this is how I would sum it up: Everything that exists in R is an object because they are data which can be manipulated. And (almost) everything that happens is a function call, which is an evaluation of this object which gives you another object.
I love that quote.
In another (as of now unpublished) write-up, the author continues with
R has a uniform internal structure for representing all objects. The evaluation process keys off that structure, in a simple form that is essentially
composed of function calls, with objects as arguments and an object as the
value. Understanding the central role of objects and functions in R makes
use of the software more effective for any challenging application, even those where extending R is not the goal.
but then spends several hundred pages expanding on it. It will be a great read once finished.
Objects. For x to be an object means that it has a class; thus class(x) returns a class for every object. Even functions have a class, as do environments and other objects one might not expect:
class(sin)
## [1] "function"
class(.GlobalEnv)
## [1] "environment"
I would not pay too much attention to is.object. is.object(x) has a slightly different meaning than what we are using here -- it returns TRUE if x has a class name internally stored along with its value. If the class is stored then class(x) returns the stored value and if not then class(x) will compute it from the type. From a conceptual perspective it matters not how the class is stored internally (stored or computed) -- what matters is that in both cases x is still an object and still has a class.
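The difference between a stored class and a computed class can be seen with oldClass(), which returns only the stored class attribute. A short sketch:

```r
a <- 2

is.object(a)     # FALSE: no class attribute is stored
oldClass(a)      # NULL: nothing stored
class(a)         # "numeric": computed from the type

oldClass(sin)    # NULL as well
class(sin)       # "function"

class(a) <- "my_class"
is.object(a)     # TRUE: the class is now stored
oldClass(a)      # "my_class"
```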
Functions. That all computation occurs through functions refers to the fact that even things that you might not expect to be functions are actually functions. For example when we write:
{ 1; 2 }
## [1] 2
if (pi > 0) 2 else 3
## [1] 2
1+2
## [1] 3
we are actually making invocations of the {, if and + functions:
`{`(1, 2)
## [1] 2
`if`(pi > 0, 2, 3)
## [1] 2
`+`(1, 2)
## [1] 3
I have an example where I am not sure I understand scoping in R, nor do I think it's doing the Right Thing. The example is modified from "An R and S-PLUS Companion to Applied Regression" by J. Fox
> make.power = function(p) function(x) x^p
> powers = lapply(1:3, make.power)
> lapply(powers, function(p) p(2))
What I expected in the list powers were three functions that compute the identity, square and cube functions respectively, but they all cube their argument. If I don't use lapply, it works as expected.
> id = make.power(1)
> square = make.power(2)
> cube = make.power(3)
> id(2)
[1] 2
> square(2)
[1] 4
> cube(2)
[1] 8
Am I the only person to find this surprising or disturbing? Is there a deep satisfying reason why it is so? Thanks
PS: I have performed searches on Google and SO, but, probably due to the generality of the keywords related to this problem, I've come out empty handed.
PPS: The example is motivated by a real bug in the package quickcheck, not by pure curiosity. I have a workaround for the bug, thanks for your concern. This is about learning something.
After posting the question of course I get an idea for a different example that could clarify the issue.
> p = 1
> id = make.power(p)
> p = 2
> square = make.power(p)
> id(2)
[1] 4
p has the same role as the loop variable hidden in an lapply. p is passed to make.power by a method that in this case looks like pass-by-reference. make.power doesn't evaluate it, it just keeps a pointer to it. Am I on the right track?
This fixes the problem
make.power = function(p) {force(p); function(x) x^p}
powers = lapply(1:3, make.power)
lapply(powers, function(p) p(2))
The issue is that function parameters are passed as "promises" that aren't evaluated until they are actually used. Here, because you never actually use p when calling make.power(), it remains in the newly created environment as a promise that points to the variable passed to the function. When you finally call one of the functions in powers, that promise is evaluated, and the most recent value of p is the one from the last iteration of the lapply. Hence all your functions are cubic.
The force() here forces the evaluation of the promise. This allows each newly created function to have its own reference to a specific value of p.
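The effect of force() can be reproduced without lapply, using the p = 1 / p = 2 example from the question. The two function names below are hypothetical variants of make.power. (As far as I know, recent R versions, 3.2.0 and later, also force arguments inside lapply itself, so the original lapply example may already behave as expected there; the lazy promise shown below still applies to ordinary calls.)

```r
make.power.lazy   <- function(p) function(x) x^p
make.power.forced <- function(p) { force(p); function(x) x^p }

p <- 1
lazy_id   <- make.power.lazy(p)     # p is still an unevaluated promise
forced_id <- make.power.forced(p)   # p is evaluated right here, as 1
p <- 2

lazy_id(2)      # the promise is evaluated now and sees p = 2, so this is 2^2
forced_id(2)    # uses the value captured at creation time, so this is 2^1
```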
I have two lists of lists, humanSplit and ratSplit. humanSplit has elements of the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the data without name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
ratSplit is essentially similar to humanSplit, with a difference in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the results of my Fisher's tests
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For Fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use geneList, which is a data.frame made by reading a file and has the form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in other part of the code. Also x and y are integers :
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this I make the matrix eg to use in apply. Here I am basically trying to do something similar to a double for loop and then apply fishertest to each pair
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is because the arguments you provided actually resolved. Try to do eg$i by itself, and you'll see that it is returning a vector: the corresponding column in the eg data.frame. You are passing this vector as an index in the i argument. The primary reason your function erred out is because double-bracket indexing ([[) only works with singles, not vectors of length greater than 1. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code but would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find the pvalue assigned outside the function but will not update it as you want.) You are defeating one purpose of writing this into a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate it?" Try very hard to have your function not change anything outside of its own environment.
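A minimal sketch of that "return a value, aggregate outside" pattern. Here compare_pair() is a hypothetical stand-in for fishertest(), and the contingency-table counts are made up purely so the example runs; Map() is used to walk over the index pairs, which is one option among several.

```r
# compare_pair() is a hypothetical stand-in for fishertest(); counts are made up.
compare_pair <- function(x, y) {
  tab <- matrix(c(x + 2, y + 1, y + 3, x + 4), nrow = 2)
  # Return one row; nothing outside the function is modified
  data.frame(i = x, j = y, pvalue = fisher.test(tab)$p.value)
}

eg <- expand.grid(i = 1:2, j = 1:3)

# Call the function once per pair, then bind the rows together outside of it
rows   <- Map(compare_pair, eg$i, eg$j)
result <- do.call(rbind, rows)
result
```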
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply in this sense, it is parsing the third argument and, in this example, evaluates it. Since it is simply a call to fishertest(eg$i, eg$j) which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, whereas your error above arises inside fishertest() itself
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)), which will fail for a couple of reasons: (1) because mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (in a "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
1. Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following options do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
2. Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, à la:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
As a side note, there are some gotchas with using apply and family that, if you don't understand them, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
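The transposition gotcha is easy to see on a toy matrix (made up for illustration): when the function returns a vector per row, apply stores each result as a column.

```r
m <- matrix(1:6, nrow = 3)        # 3 rows, 2 columns

# The function returns a length-2 vector for every row
res <- apply(m, 1, function(r) r * 2)

dim(m)        # 3 2
dim(res)      # 2 3: each row's result became a column
dim(t(res))   # 3 2: transposing restores the original orientation
```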
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.
I am trying to optimise my likelihood function of R_j and R_m using optim to estimate al_j, au_j, b_j and sigma_j. This is what I did.
a = read.table("D:/ff.txt",header=T)
attach(a)
a
R_j R_m
1 2e-03 0.026567295
2 3e-03 0.009798475
3 5e-02 0.008497274
4 -1e-02 0.012464578
5 -9e-04 0.002896023
6 9e-02 0.000879473
7 1e-02 0.003194435
8 6e-04 0.010281122
The parameters al_j, au_j, b_j and sigma_j need to be estimated.
llik=function(R_j,R_m)
if(R_j< 0)
{
sum[log(1/(2*pi*(sigma_j^2)))-(1/(2*(sigma_j^2))*(R_j+al_j-b_j*R_m))^2]
}else if(R_j>0)
{
sum[log(1/(2*pi*(sigma_j^2)))-(1/(2*(sigma_j^2))*(R_j+au_j-b_j*R_m))^2]
}else if(R_j==0)
{
sum(log(pnorm(au_j,mean=b_j*R_m,sd=sigma_j)-pnorm(al_j,mean=b_j*R_m,sd=sigma_j)))
}
start.par=c(al_j=0,au_j=0,sigma_j=0.01,b_j=1)
out1=optim(llik,par=start.par,method="Nelder-Mead")
Error in pnorm(au_j, mean = b_j * R_m, sd = sigma_j) :
object 'au_j' not found
It is difficult to tell where to start on this.
As @mac said, your code is difficult to read. It also contains errors.
For example, if you try sum[c(1,2)] you will get an error: you should use sum(c(1,2)). In any case, you seem to be taking the sum in the wrong place. You cannot use if and else if on vectors, and need to use ifelse. You have nothing to stop the standard deviation going negative. There is more.
The following code runs without errors or warnings. You will still have to decide whether it does what you want.
a <- data.frame( R_j = c(0.002,0.003,0.05,-0.01,-0.0009,0.09,0.01,0.0006),
R_m = c(0.026567295,0.009798475,0.008497274,0.012464578,
0.002896023,0.000879473,0.003194435,0.010281122) )
llik = function(x)
{
al_j=x[1]; au_j=x[2]; sigma_j=x[3]; b_j=x[4]
sum(
ifelse(a$R_j< 0, log(1/(2*pi*(sigma_j^2)))-
(1/(2*(sigma_j^2))*(a$R_j+al_j-b_j*a$R_m))^2,
ifelse(a$R_j>0 , log(1/(2*pi*(sigma_j^2)))-
(1/(2*(sigma_j^2))*(a$R_j+au_j-b_j*a$R_m))^2,
log(pnorm(au_j,mean=b_j*a$R_m,sd=sqrt(sigma_j^2))-
pnorm(al_j,mean=b_j*a$R_m,sd=sqrt(sigma_j^2)))))
)
}
start.par = c(0, 0, 0.01, 1)
out1 = optim(llik, par=start.par, method="Nelder-Mead")
Let's start with the error message:
Error in pnorm(au_j, mean = b_j * R_m, sd = sigma_j) :
object 'au_j' not found
So R is telling you that when it got to the pnorm call, it couldn't find anything called 'au_j' to use in that call. Your next step should be to look at your function, llik, and try to identify how you expect the variable 'au_j' to be defined within that function.
At this point, the answer should be fairly clear (maybe!). Nowhere in llik is the variable 'au_j' assigned a value. So it won't be 'created' inside the function. R's scoping rules will then cause it to look outside the function in the global environment for something called 'au_j'.
And you might say that here is where things should work, since you assigned 'au_j' a value within start.par. But that's a named vector, and R doesn't look for an object called 'au_j' among the elements of a vector like that.
So the solution here is most likely to rework your function llik so that it takes as arguments everything that it will use, so you're going to add everything in start.par to the arguments of llik. Something like:
llik <- function(par=c(al_j,au_j,sigma_j,b_j),R_j,R_m){...}
and then within llik you'll refer to al_j using par[1] and so forth. Then the optim call should look something like:
optim(start.par,llik,R_j=a$R_j,R_m=a$R_m)
Since you've attached your data, in a, you probably don't have to explicitly pass the arguments R_j and R_m in the optim call, but it's probably good practice to do so.
I think I've reconstructed what you're trying to accomplish here (modulo the math, which I haven't even glanced at), but I confess that your code is a bit hard to parse. I would suggest spending some time with the examples in ?optim to make sure you understand how that function is called.
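For reference, here is a runnable sketch of that calling pattern. It deliberately uses a much simpler stand-in model (a plain normal regression of R_j on R_m, with hypothetical parameters a, b and log_sigma) rather than the likelihood from the question, so it only illustrates how the parameter vector and the data reach the function. Note that optim minimizes by default, so the function returns the negative log-likelihood.

```r
dat <- data.frame(
  R_j = c(0.002, 0.003, 0.05, -0.01, -0.0009, 0.09, 0.01, 0.0006),
  R_m = c(0.026567295, 0.009798475, 0.008497274, 0.012464578,
          0.002896023, 0.000879473, 0.003194435, 0.010281122)
)

# Stand-in model, NOT the one from the question: R_j ~ N(a + b * R_m, sigma^2)
negll <- function(par, R_j, R_m) {
  a     <- par[1]
  b     <- par[2]
  sigma <- exp(par[3])     # optimize log(sigma) so sigma stays positive
  -sum(dnorm(R_j, mean = a + b * R_m, sd = sigma, log = TRUE))
}

start <- c(a = 0, b = 1, log_sigma = log(0.01))

# Extra named arguments (R_j, R_m) are passed through optim to negll
fit <- optim(start, negll, R_j = dat$R_j, R_m = dat$R_m, method = "Nelder-Mead")
fit$par
fit$value
```

All parameters arrive through the first argument as a single vector, and the data are passed explicitly, so nothing depends on attach() or on objects in the global environment.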