make function detect nonexistent column when specified as df$x - r

I have functions that operate on a single vector (for example, a column in a data frame). I want users to be able to use $ to specify the columns that they pass to these functions; for example, I want them to be able to write myFun(df$x), where df is a data frame. But in such cases, I want my functions to detect when x isn't in df. How may I do this?
Here is a minimal illustration of the problem:
myFun <- function (x) sum(x)
data(iris)
myFun(iris$Petal.Width) # returns 180
myFun(iris$XXX) # returns 0
I don't want the last line to return 0. I want it to throw an error message, as XXX isn't a column in iris. How may I do this?
One way is to run as.character(match.call()) inside the function. I could then use the parts of the resulting string to determine the name of df, and in turn, I could check for the existence of x. But this seems like a not–so–robust solution.
It won't suffice to throw an error whenever x has length 0: I want to detect whether the vector exists, not whether it has length 0.
I searched for related posts on Stack Overflow, but I didn't find any.

The iris$XXX returns NULL and NULL is passed to sum
sum(NULL)
#[1] 0
Note that either iris$XXX or iris[['XXX']] returns NULL as value. If we need to get an error either subset or dplyr::select gives that
iris %>%
select(XXX)
Error: Can't subset columns that don't exist.
✖ Column XXX doesn't exist.
Run rlang::last_error() to see where the error occurred.
Or with pull
iris %>%
pull(XXX)
Error: object 'XXX' not found Run rlang::last_error() to see where
the error occurred.
subset(iris, select = XXX)
Error in eval(substitute(select), nl, parent.frame()) :
object 'XXX' not found
>
We could make the function to return an error if NULL is passed. Based on the way the function takes arguments, it is taking the value and not any info about the object.
myFun <- function (x) {
stopifnot(!is.null(x))
sum(x)
}
However, this would be non-specific error because NULL values can be passed to the function from other cases as well i.e. consider if the column exists and the value is NULL.
If we need to check if the column is valid, then the data and the column name should be passed into
myFun2 <- function(data, colnm) {
stopifnot(exists(colnm, data))
sum(data[[colnm]])
}
myFun2(iris, 'XXX')
#Error in myFun2(iris, "XXX") : exists(colnm, data) is not TRUE

Related

Why does using paste in for loop return error?

I have a few problems concerning the same topic.
(1) I am trying to loop over:
premium1999 <- as.data.frame(coef(summary(data1999_mod))[c(19:44), 1])
for 10 years, in which I wrote:
for (year in seq(1999,2008)) {
paste0('premium',year) <- as.data.frame(coef(summary(paste0('data',year,'_mod')))[c(19:44), 1])
}
Note:
for data1999_mod is regression results that I want extract some of its estimators as a dataframe vector.
The coef(summary(data1999_mod)) looks like this:
#A matrix: ... of type dbl
Estimate Std. Error t value Pr(>|t|)
age 0.0388573570 2.196772e-03 17.6883885 3.362887e-6
age_sqr -0.0003065876 2.790296e-05 -10.9876373 5.826926e-28
relation 0.0724525759 9.168118e-03 7.9026659 2.950318e-15
sex -0.1348453659 8.970138e-03 -15.0326966 1.201003e-50
marital 0.0782049161 8.928773e-03 8.7587533 2.217825e-18
reg 0.1691004469 1.132230e-02 14.9351735 5.082589e-50
...
However, it returns Error: $ operator is invalid for atomic vectors, even if I did not use $ operator here.
(2) Also,
I want to create a column 'year' containing repeated values of the associated year and am trying to loop over this:
premium1999$year <- 1999
In which I wrote:
for (i in seq(1999,2008)) {
assign(paste0('premium',i)[['year']], i)
}
In this case, it returns Error in paste0("premium", i)[["year"]]: subscript out of bounds
(3) Moreover, I'd like to repeat some rows and loop over:
premium1999 <- rbind(premium1999, premium1999[rep(1, 2),])
for 10 years again and I wrote:
for (year in seq(1999,2008)) {
paste0('premium',year) <- rbind(paste0('premium',year), paste0('premium',year)[rep(1, 2),])
}
This time it returns Error in paste0("premium", year)[rep(1, 2), ]: incorrect number of dimensions
I also tried to loop over a few other similar things but I always get Error.
Each code works fine individually.
I could not find what I did wrong. Any help or suggestions would be very highly appreciated.
The problem with the code is that the paste0() function returns the character and not calling the object that is having the name as this character. For example, paste0('data',year,'_mod') returns a character vector of length 1, i.e., "data1999_mod" and not calling the object data1999_mod.
For easy understanding, there is huge a difference between, "data1999_mod"["Estimate"] and data1999_mod["Estimate"]. Subsetting as data frame merely by paste0() function returns the former, however, the expected output will be given by the latter only. That is why you are getting, Error: $ operator is invalid for atomic vectors.
The same error is found in all of your codes. On order to call the object by the output of a paste0() function, we need to enclose is by get().
As, you have not supplied the reproducible sample, I couldn't test it. However, you can try running these.
#(1)
for (year in seq(1999,2008)) {
paste0('premium',year) <- as.data.frame(coef(summary(get(paste0('data',year,'_mod'))))[c(19:44), 1])
}
#(2)
for (i in seq(1999,2008)) {
assign(get(paste0('premium',i))[['year']], i)
}
#(3)
for (year in seq(1999,2008)) {
paste0('premium',year) <- rbind(get(paste0('premium',year)), get(paste0('premium',year))[rep(1, 2),])
}

Dynamic scoping questions in R

I'm reading the AdvancedR by Hadley and am testing the following code on this URL
subset2 = function(df, condition){
condition_call = eval(substitute(condition),df )
df[condition_call,]
}
df = data.frame(a = 1:10, b = 2:11)
condition = 3
subset2(df, a < condition)
Then I got the following error message:
Error in eval(substitute(condition), df) : object 'a' not found
I read the explanation as follows but don't quite understand:
If eval() can’t find the variable inside the data frame (its second argument), it looks in the environment of subset2(). That’s obviously not what we want, so we need some way to tell eval() where to look if it can’t find the variables in the data frame.
In my opinion, while "eval(substitute(condition),df )", the variable they cannot find is condition, then why object "a" cannot be found?
On the other hand, why the following code won't make any error?
subset2 = function(df, condition){
condition_call = eval(substitute(condition),df )
df[condition_call,]
}
df = data.frame(a = 1:10, b = 2:11)
y = 3
subset2(df, a < y)
This more stripped down example may make it easier for you to see what's going on in Hadley's example. The first thing to note is that the symbol condition appears here in four different roles, each of which I've marked with a numbered comment.
## Role of symbol `condition`
f <- function(condition) { #1 -- formal argument
a <- 100
condition + a #2 -- symbol bound to formal argument
}
condition <- 3 #3 -- symbol in global environment
f(condition = condition + a) #4 -- supplied argument (on RHS)
## Error in f(condition = condition + a) (from #1) : object 'a' not found
The other important thing to understand is that symbols in supplied arguments (here the right hand side part of condition = condition + a at #4) are searched for in the evaluation frame of the calling function. From Section 4.3.3 Argument Evaluation of the R Language Definition:
One of the most important things to know about the evaluation of arguments to a function is that supplied arguments and default arguments are treated differently. The supplied arguments to a function are evaluated in the evaluation frame of the calling function. The default arguments to a function are evaluated in the evaluation frame of the function.
In the example above, the evaluation frame of the call to f() is the global environment, .GlobalEnv.
Taking this step by step, here is what happens when you call (condition = condition + a). During function evaluation, R comes across the expression condition + a in the function body (at #2). It searches for values of a and condition, and finds a locally assigned symbol a. It finds that the symbol condition is bound to the formal argument named condition (at #1). The value of that formal argument, supplied during the function call, is condition + a (at #4).
As noted in the R Language Definition, the values of the symbols in the expression condition + a are searched for in the environment of the calling function, here the global environment. Since the global environment contains a variable named condition (assigned at #3) but no variable named a, it is unable to evaluate the expression condition + a (at #4), and fails with the error that you see.
I want to add some details in case someone stumbles on this question. The problematic line is
condition_call = eval(substitute(condition),df )
The condition object in substitute() function is a promise object, its expression slot is "a < condition" and substitute(condition) takes expression and returns a call object with expression as "a < condition".
Then eval() function start to evaluate the "a < condition" in the df environment. Its target is finding both a and condition.
a is found in df successfully, and this is not where the bug generated.
Then R starts searching condition in df and cannot find it.
So R goes up to the execution environment of subset2, and finds condition in the execution environment.
The variable it finds is actually the promise object mentioned before with expression slot as "a < condition".
To evaluate this expression, R has to find a again, and now it cannot find a any more because it has passed the df environment. This is the part that really generates the error.
To summarize the problem here:
R does find a in the df for once.
The bug arises when R tries to look for condition and then R takes the promise object condition instead of the 4 assigned outside as the argument and tries to evaluate it.
Then R runs into the problem:
it tries to evaluate "a < condition" and it cannot find a either in the execution environment of subset2() or global environment.
For my second example, R cannot find y in the execution environment and then finds y in the calling environment of subset2() as 4, generating no errors. In this case, the name of y is different from the promise object condition and R won't try to evaluate "a < y" and no bugs generated.

R: How do you pass the value of an expression to a function that rejects an expression argument?

MY code, below, now fixed, suffered from a different defect than I thought, unrelated to my title question. The real problem, and its solution, was supplied by #Roland, below.
I have a pair of functions, shown below, which together are supposed to return a list of the values of the column attribute named in attrC. When run, R objects that " Error in attr(x, which = get("attrC"), exact = TRUE): 'which' must be of mode character".
I have tried replacing attrC with get("attrC") and eval(attrC). Neither works.
I have three closely related questions. A good answer would answer all three.
How do I make this particular function work?
How can I tell, from the form of an R function or its documentation, when an argument which is required to be of a given type will accept a variable or expression that evaluates to that type?
If the answer to (1.) has not already supplied it: Is there a generic way to provide a function that requires an argument of given type and does not accept a name or expression which evaluates to that type, with the value from such a name or expression.
.
ColAttr <- function(x, attrC, ifIsNull) {
# Returns column attribute named in attrC, if present, else isNullC.
atr <- attr(x, attrC, exact = TRUE)
atr <- if (is.null(atr)) {ifIsNull} else {atr}
atr
}
AtribLst <- function(df, attrC, isNullC){
# Returns list of values of the col attribute attrC, if present, else isNullC
lapply(df, ColAttr, attrC=attrC, ifIsNull=isNullC)
}
stub93 <- AtribLst(cps_00093.df, attrC="label", isNullC=NA)

Why subset doesn't mind missing subset argument for dataframes?

Normally I wonder where mysterious errors come from but now my question is where a mysterious lack of error comes from.
Let
numbers <- c(1, 2, 3)
frame <- as.data.frame(numbers)
If I type
subset(numbers, )
(so I want to take some subset but forget to specify the subset-argument of the subset function) then R reminds me (as it should):
Error in subset.default(numbers, ) :
argument "subset" is missing, with no default
However when I type
subset(frame,)
(so the same thing with a data.frame instead of a vector), it doesn't give an error but instead just returns the (full) dataframe.
What is going on here? Why don't I get my well deserved error message?
tl;dr: The subset function calls different functions (has different methods) depending on the type of object it is fed. In the example above, subset(numbers, ) uses subset.default while subset(frame, ) uses subset.data.frame.
R has a couple of object-oriented systems built-in. The simplest and most common is called S3. This OO programming style implements what Wickham calls a "generic-function OO." Under this style of OO, an object called a generic function looks at the class of an object and then applies the proper method to the object. If no direct method exists, then there is always a default method available.
To get a better idea of how S3 works and the other OO systems work, you might check out the relevant portion of the Advanced R site. The procedure of finding the proper method for an object is referred to as method dispatch. You can read more about this in the help file ?UseMethod.
As noted in the Details section of ?subset, the subset function "is a generic function." This means that subset examines the class of the object in the first argument and then uses method dispatch to apply the appropriate method to the object.
The methods of a generic function are encoded as
< generic function name >.< class name >
and can be found using methods(<generic function name>). For subset, we get
methods(subset)
[1] subset.data.frame subset.default subset.matrix
see '?methods' for accessing help and source code
which indicates that if the object has a data.frame class, then subset calls the subset.data.frame the method (function). It is defined as below:
subset.data.frame
function (x, subset, select, drop = FALSE, ...)
{
r <- if (missing(subset))
rep_len(TRUE, nrow(x))
else {
e <- substitute(subset)
r <- eval(e, x, parent.frame())
if (!is.logical(r))
stop("'subset' must be logical")
r & !is.na(r)
}
vars <- if (missing(select))
TRUE
else {
nl <- as.list(seq_along(x))
names(nl) <- names(x)
eval(substitute(select), nl, parent.frame())
}
x[r, vars, drop = drop]
}
Note that if the subset argument is missing, the first lines
r <- if (missing(subset))
rep_len(TRUE, nrow(x))
produce a vector of TRUES of the same length as the data.frame, and the last line
x[r, vars, drop = drop]
feeds this vector into the row argument which means that if you did not include a subset argument, then the subset function will return all of the rows of the data.frame.
As we can see from the output of the methods call, subset does not have methods for atomic vectors. This means, as your error
Error in subset.default(numbers, )
that when you apply subset to a vector, R calls the subset.default method which is defined as
subset.default
function (x, subset, ...)
{
if (!is.logical(subset))
stop("'subset' must be logical")
x[subset & !is.na(subset)]
}
The subset.default function throws an error with stop when the subset argument is missing.

Simple if-else loop in R

Can someone tell me what is wrong with this if-else loop in R? I frequently can't get if-else loops to work. I get an error:
if(match('SubjResponse',names(data))==NA) {
observed <- data$SubjResponse1
}
else {
observed <- data$SubjResponse
}
Note that data is a data frame.
The error is
Error in if (match("SubjResponse", names(data)) == NA) { :
missing value where TRUE/FALSE needed
This is not a full example as we do not have the data but I see these issues:
You cannot test for NA with ==, you need is.na()
Similarly, the output of match() and friends is usually tested for NULL or length()==0
I tend to write } else { on one line.
As #DirkEddelbuettel noted, you can't test NA that way. But you can make match not return NA:
By using nomatch=0 and reversing the if clause (since 0 is treated as FALSE), the code can be simplified. Furthermore, another useful coding idiom is to assign the result of the if clause, that way you won't mistype the variable name in one of the branches...
So I'd write it like this:
observed <- if(match('SubjResponse',names(data), nomatch=0)) {
data$SubjResponse # match found
} else {
data$SubjResponse1 # no match found
}
By the way if you "frequently" have problems with if-else, you should be aware of two things:
The object to test must not contain NA or NaN, or be a string (mode character) or some other type that can't be coerced into a logical value. Numeric is OK: 0 is FALSE anything else (but NA/NaN) is TRUE.
The length of the object should be exactly 1 (a scalar value). It can be longer, but then you get a warning. If it is shorter, you get an error.
Examples:
len3 <- 1:3
if(len3) 'foo' # WARNING: the condition has length > 1 and only the first element will be used
len0 <- numeric(0)
if(len0) 'foo' # ERROR: argument is of length zero
badVec1 <- NA
if(badVec1) 'foo' # ERROR: missing value where TRUE/FALSE needed
badVec2 <- 'Hello'
if(badVec2) 'foo' # ERROR: argument is not interpretable as logical

Resources