Why subset doesn't mind missing subset argument for dataframes? - r

Normally I wonder where mysterious errors come from but now my question is where a mysterious lack of error comes from.
Let
numbers <- c(1, 2, 3)
frame <- as.data.frame(numbers)
If I type
subset(numbers, )
(so I want to take some subset but forget to specify the subset-argument of the subset function) then R reminds me (as it should):
Error in subset.default(numbers, ) :
argument "subset" is missing, with no default
However when I type
subset(frame,)
(so the same thing with a data.frame instead of a vector), it doesn't give an error but instead just returns the (full) dataframe.
What is going on here? Why don't I get my well deserved error message?

tl;dr: The subset function calls different functions (has different methods) depending on the type of object it is fed. In the example above, subset(numbers, ) uses subset.default while subset(frame, ) uses subset.data.frame.
R has a couple of object-oriented systems built-in. The simplest and most common is called S3. This OO programming style implements what Wickham calls a "generic-function OO." Under this style of OO, an object called a generic function looks at the class of an object and then applies the proper method to the object. If no direct method exists, then there is always a default method available.
To get a better idea of how S3 works and the other OO systems work, you might check out the relevant portion of the Advanced R site. The procedure of finding the proper method for an object is referred to as method dispatch. You can read more about this in the help file ?UseMethod.
As noted in the Details section of ?subset, the subset function "is a generic function." This means that subset examines the class of the object in the first argument and then uses method dispatch to apply the appropriate method to the object.
The methods of a generic function are encoded as
< generic function name >.< class name >
and can be found using methods(<generic function name>). For subset, we get
methods(subset)
[1] subset.data.frame subset.default subset.matrix
see '?methods' for accessing help and source code
which indicates that if the object has a data.frame class, then subset calls the subset.data.frame the method (function). It is defined as below:
subset.data.frame
function (x, subset, select, drop = FALSE, ...)
{
r <- if (missing(subset))
rep_len(TRUE, nrow(x))
else {
e <- substitute(subset)
r <- eval(e, x, parent.frame())
if (!is.logical(r))
stop("'subset' must be logical")
r & !is.na(r)
}
vars <- if (missing(select))
TRUE
else {
nl <- as.list(seq_along(x))
names(nl) <- names(x)
eval(substitute(select), nl, parent.frame())
}
x[r, vars, drop = drop]
}
Note that if the subset argument is missing, the first lines
r <- if (missing(subset))
rep_len(TRUE, nrow(x))
produce a vector of TRUES of the same length as the data.frame, and the last line
x[r, vars, drop = drop]
feeds this vector into the row argument which means that if you did not include a subset argument, then the subset function will return all of the rows of the data.frame.
As we can see from the output of the methods call, subset does not have methods for atomic vectors. This means, as your error
Error in subset.default(numbers, )
that when you apply subset to a vector, R calls the subset.default method which is defined as
subset.default
function (x, subset, ...)
{
if (!is.logical(subset))
stop("'subset' must be logical")
x[subset & !is.na(subset)]
}
The subset.default function throws an error with stop when the subset argument is missing.

Related

combining do.call() and debug() prints all arguments contents

I am using R and trying to debug a function that I call with do.call() for convenience.
Combining do.call() and browser() is problematic. Basically, all the elements of the list of arguments passed to do.call() are printed, which, if the list contains for example a very large data table, is not sustainable.
Here is a reprex. I create a simple getsum() function that sums elements of a vector. I create a func() function that calls getsum() for a list of vectors.
#getsum returns the sum of a vector's elements
#func returns the vector of the sum of a list of vectors
func <- function(vec_list){
browser()
sums = lapply(FUN=getsum, X=vec_list)
sum = unlist(sums)
return(sum)
}
getsum <- function(vec){
sum = sum(vec)
return(sum)
}
args = list(vec_list=list(rnorm(5), rnorm(5)))
do.call(func, args)
That's the output I get :
Called from: (function(vec_list){
browser()
sums = lapply(FUN=getsum, X=vec_list)
sum = unlist(sums)
return(sum)
})(vec_list = list(c(-0.0801864335418185, 0.448324209935905,
-2.86518616779484, -0.359284963520417, -0.620062639582574), c(1.74835180362954,
-0.904288222154223, 0.746007117029027, 0.625889703799832, -0.908748727897187
)))
Browse[1]>
One might tell me "why are you using do.call()?". Indeed, if I simply call the function myself, the problem does not arise (see below). In this example I don't need to use do.call(), but sometimes it's extremely convenient.
#instead of do.call() use :
func(vec_list=args$vec_list)
The output is then :
Called from: func(vec_list = args$vec_list)
Browse[1]>
EDIT :
I have tried the argument browser(skipCalls=TRUE) that solves the problem but defies the purpose of browser(). It makes R executing all the function's command at once. Other suggestions welcome.

R: How do you pass the value of an expression to a function that rejects an expression argument?

MY code, below, now fixed, suffered from a different defect than I thought, unrelated to my title question. The real problem, and its solution, was supplied by #Roland, below.
I have a pair of functions, shown below, which together are supposed to return a list of the values of the column attribute named in attrC. When run, R objects that " Error in attr(x, which = get("attrC"), exact = TRUE): 'which' must be of mode character".
I have tried replacing attrC with get("attrC") and eval(attrC). Neither works.
I have three closely related questions. A good answer would answer all three.
How do I make this particular function work?
How can I tell, from the form of an R function or its documentation, when an argument which is required to be of a given type will accept a variable or expression that evaluates to that type?
If the answer to (1.) has not already supplied it: Is there a generic way to provide a function that requires an argument of given type and does not accept a name or expression which evaluates to that type, with the value from such a name or expression.
.
ColAttr <- function(x, attrC, ifIsNull) {
# Returns column attribute named in attrC, if present, else isNullC.
atr <- attr(x, attrC, exact = TRUE)
atr <- if (is.null(atr)) {ifIsNull} else {atr}
atr
}
AtribLst <- function(df, attrC, isNullC){
# Returns list of values of the col attribute attrC, if present, else isNullC
lapply(df, ColAttr, attrC=attrC, ifIsNull=isNullC)
}
stub93 <- AtribLst(cps_00093.df, attrC="label", isNullC=NA)

input variables in R functions from third party libraries

I am a reasonably proficient python programmer messing around with some R.
On this website, for the third party library ICC, I'm confused about input variables for the function ICCest.
Located here:
http://www.inside-r.org/packages/cran/ICC/docs/ICCest
I can use:
ICCest(Chick, weight, data=ChickWeight, CI.type="S")
And I got this to work. Chick and weight are column names for the data frame variable called ChickWeight. All is well and good.
Except, that, what type of variables are "Chick" and "weight"?? They aren't in my R namespace. They aren't strings because they don't have quotes around them.
Doing:
ICCest(Chick, "weight", data=ChickWeight, CI.type="S")
yields:
In ICCest(Chick, "weight", data = ChickWeight, CI.type = "S") :
passing a character string to 'y' is deprecated since ICC vesion 2.3.0 and will not be supported in future versions. The argument to 'y' should either be an unquoted column name of 'data' or an object
So again in my nice friendly python land you can't pass in unquoted characters strings that are not objects in your namespace so I am quite confused.
What is happening here?
You can take a look at the function's code by typing ICCest (without the parantheses):
> ICCest
Object with tracing code, class "functionWithTrace"
Original definition:
function (x, y, data = NULL, alpha = 0.05, CI.type = c("THD", "Smith")){
square <- function(z) {
z^2
}
icall <- list(y = substitute(y), x = substitute(x))
if (is.character(icall$y)) {
warning("passing a character string to 'y' is deprecated since ICC vesion 2.3.0 and will not be supported in future versions. The argument to 'y' should either be an unquoted column name of 'data' or an object")
if (missing(data))
stop("Supply either the unquoted name of the object containing 'y' or supply both 'data' and then 'y' as an unquoted column name to 'data'")
icall$y <- eval(as.name(y), data, parent.frame())
} ...
what happens after the square function block, is that the input is stored in icall in a parse tree, which you can think of as a set of unevaluated expressions. So there's no error when you pass plain weight without the quotation marks, because at this point, there hasn't been an attempt to evaluate the expressions yet. (I'm a bit unsure about this last statement. I hope someone can confirm if it is technically correct)
Inside the if block (where your warning is raised), you can see that they are using eval to update the local variable icall$y. What eval does is essentially evaluating an expression within an environment. Specifically, in the environment of a dataframe, the column names are considered part of the environment.
Now it says in the documentation, that eval takes an expression as its first input. This is why y is cast to an object with as.name before being passed to eval (remember that we are in the if block for string input y)
eval(expr, envir = parent.frame(),...)
And expressions and strings are different in R. So in the last line of code shown above, the y input (here, weight) is being evaluated in the data environment --which, here, is ChickWeight.
To get a better feeling, try this:
> eval(weight, ChickWeight)
Error in eval(weight, ChickWeight) : object 'weight' not found
But if you make an unevaluated expression first, it will work:
> expr <- quote(weight)
> eval(expr, ChickWeight)
Here, quote is doing roughly the same thing as substitute in the 4th line of the function. Check here for more on quote and substitute\.
Why are you passing your y as a quoted string. The function doesn't appear to require quoted strings for variable names. Doing
str(ChickWeight)
will give you the types for the variables. They aren't in a 'name space' because they are variable names in the data.frame ChickWeight.

R: passing by parameter to function and using apply instead of nested loop and recursive indexing failed

I have two lists of lists. humanSplit and ratSplit. humanSplit has element of the form::
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the data without name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
RatSplit is also essentially similar to humanSplit with difference in column order. I want to apply fisher's test to every possible pairing of replicates from humanSplit and ratSplit. Now I defined the following empty vector which I will use to store the informations of my fisher's test
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use `geneList' which is a data.frame made by reading a file and has form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in other part of the code. Also x and y are integers :
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this I do the make matrix eg to use in apply. Here I am basically trying to do something similar to double for loop and then using fisher
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is because the arguments you provided actually resolved. Try to do eg$i by itself, and you'll see that it is returning a vector: the corresponding column in the eg data.frame. You are passing this vector as an index in the i argument. The primary reason your function erred out is because double-bracket indexing ([[) only works with singles, not vectors of length greater than 1. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code but would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and the aggregate it outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find pvalue assigned outside the function but will not update it as you want. You are defeating one purpose of writing this into a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply without having to overwrite variables in the parent environment should you try to answer the question "how do I apply this function to all pairs and aggregate it?" Try very hard to have your function not change anything outside of its own environment.
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply in this sense, it it parsing the third argument and, in this example, evaluates it. Since it is simply a call to fishertest(eg$i, eg$j) which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now that you see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)) will fail for a couple of reasons: (1) because mean requires an argument and is not getting, and (2) regardless, it does not return a function itself (in a "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Bothing of the following options do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, ala:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
As a side note, there are some gotchas with using apply and family that, if you don't understand, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregrate.
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.

Error message in R: "Arguments to methods() must be named, or one named list"

I'm new to creating classes and methods in R and am running into a problem that I haven't found much documentation on. I have created a class, 'DataImport', and am trying to add the method below:
DataImport$methods(reducedImport <- function(filePathOne, dataFrame)
{
}
)
When I run this code I'm getting the following error:
Error in DataImport$methods(reducedImport <- function(filePathOne, :
Arguments to methods() must be named, or one named list
I was able to add a method directly before this one and it worked fine but this one doesn't. I don't quite understand why that would be the case or how to fix it.
As Dason mentioned in the comment, your problem is with assignment. Let's create a simple example:
c1 = setRefClass("c1", fields = list( data = "numeric"))
c1$methods(m1 = function(a) a)
Now a quick test:
x = c1$new(data=10)
x$m1(1)
However,
R> c1$methods(m2 <- function(a) a)
Error in c1$methods(m2 <- function(a) a) :
Arguments to methods() must be named, or one named list
gives the error you see. The reason for this is that the <- operator is slightly different from the = operator. This in general doesn't matter (but it does here).

Resources