How to report error indices in R? [closed] - r

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Update: Question here is closed, now discussed on RStudio Community Platform.
I'm trying to program defensively in my package development, using a lot of input validation.
In particular, I'm relying on a lot of the ready-made assertions in checkmate, testthat and the like, which makes life a lot easier (and code shorter).
Hadley Wickham's tidyverse style guide for error messages suggests that error messages should point users to the exact source of the problem, like so:
#> Error: Can't find column `b` in `.data`
(Columns are just an example, sometimes it might be rows, or some other index).
I'm now wondering how this can be implemented elegantly and consistently in a package, given that a lot of the existing assertions (from the above packages, but also base R) don't give you any indices back in their errors.
Here's an example:
m <- matrix(data = c(0, 1, 5, -2), nrow = 2)
# arbitrary assertion
assert_positive <- function(x) {
  if (any(x < 0)) {
    stop(call. = FALSE,
         "All numbers must be non-negative")
  } else {
    return(invisible(x))
  }
}
# (there are *lots* of these in packages such as checkmate, testthat or assertr that should be reused)
assert_positive(m)
gives:
## Error: All numbers must be non-negative
So far so good, but this does not give the desired indices of the errors.
Yes, I know that I could just change the above assert_positive() function to do that, but I would like to reuse a lot of the functions in checkmate, testthat and friends, so I can't touch them, and there's too many of them anyway.
So I should probably wrap something around these existing tests, such as a simple for loop:
# via for-loops
assert_positive2 <- function(x) {
  for (r in 1:nrow(x)) {
    res <- try(expr = assert_positive(x[r, ]), silent = TRUE)
    if (inherits(x = res, what = "try-error")) {
      stop(
        call. = FALSE,
        paste0(
          "in row ", r, ": ",
          attr(x = res, which = "condition")$message, "."
        )
      )
    }
  }
}
assert_positive2(m)
gives:
## Error: in row 2: All numbers must be non-negative.
That gets the job done, but it's a lot of clutter and the code is not very expressive.
I've also thought about Reduce() with try(), but that won't give indices, and neither would any apply() action.
I guess, finally, a closure or function factory would be helpful to generalise this to many assertions.
This just feels like a problem that many other people (crafting better error messages) must have already run into, so:
What's an elegant/canonical way to do this?
I know this isn't the place for discussions and opinions; but it's still the best forum for such a problem, so please don't shut this down.

I don't see how wrapping many functions would be less work than just changing them / writing your own versions. Plus, like you say, the way you've wrapped the example is anything but cute.
As a short answer, I could imagine using the assertthat package (which you have not mentioned explicitly) and in particular the functions assert_that() (for basic cases) and on_failure() (for broader user-defined assertion functions).
I don't think the assert_positive example does what you want, so maybe you should not try to recycle it. Similarly, assert_positive2 might also not do what you want in other cases, because you may want to report the specific indices within a row that are in violation, not just the rows. But with your own functions, you can write something more flexible that covers multiple cases.
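As a hedged sketch of that idea (the predicate all_positive() and its failure message are made up, not part of assertthat itself), on_failure() lets a reusable assertion report the offending indices:
library(assertthat)
# reusable predicate
all_positive <- function(x) all(x >= 0)
# custom failure message that reports which positions violate the assertion
on_failure(all_positive) <- function(call, env) {
  x <- eval(call$x, env)
  paste0("negative values at position(s): ",
         paste(which(x < 0), collapse = ", "))
}
m <- matrix(data = c(0, 1, 5, -2), nrow = 2)
assert_that(all_positive(m))
## Error: negative values at position(s): 4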

Related

good practice to use "$" and run a function in one line in R [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
Today I have seen a "strange" thing and am wondering if this is good practice. Basically there is a list:
testList <- list("columnA" = c(1, 2, 3),
                 "columnB" = c(11, 22, 33))
and then a function:
calculateMean <- function(input){
  out <- lapply(input, mean)
  return(out)
}
and then this:
resultTest <- calculateMean(testList)$columnA
Question: Is this a good practice to refer to functions result without storing the results of a function in an intermediate step?
We may use sapply to return a named vector, store it as a single object, and use it for other cases, i.e. if we want to take the max of that vector, it can be applied directly instead of having to unlist the list first.
calculateMean <- function(input){
  out <- sapply(input, mean)
  return(out)
}
Output:
calculateMean(testList)
# columnA columnB
#       2      22
Regarding storing the output, it depends: if we also want to extract the output for 'columnB', we would need to run the function again and use $ a second time. Instead, save the result as a single object and extract from it as needed.
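For example, a small sketch building on the sapply version above: run the function once, then extract or aggregate from the stored result as needed.
res <- calculateMean(testList)   # run the function once
res["columnA"]
# columnA
#       2
max(res)
# [1] 22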
You ask if this is good practice. I'd say there are good and bad aspects to it.
On the positive side, it keeps your code simpler than if you defined a new variable to hold calculateMean(testList) when all you are interested in is one element of it. In some cases (probably not yours though) that could save a lot of memory: that variable might hold a lot of stuff that is of no interest, and it takes up space.
On the negative side, it makes your code harder to debug. Keeping expressions simple makes it easier to see when and why things aren't working. Each line of
temp <- calculateMean(testList)
resultTest <- temp$columnA
is simpler than the one line
resultTest <- calculateMean(testList)$columnA
In some situations you could use an informative name in the two-line version to partially document what you had in mind here (not temp!), making your code easier to understand.
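For instance, a hedged example using the original list-returning calculateMean(); the name columnMeans is just illustrative:
columnMeans <- calculateMean(testList)   # informative name instead of temp
resultTest <- columnMeans$columnA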
If you were trying to single step through the calculation in a debugger, it would be more confusing, because you'd jump from the calculateMean source to the source for $ (or more likely, to the final result, since that's a primitive function).
Since the one-line version is relatively simple in your case, I'd probably use it, but in other situations I might split it into two lines.

Errors in Executing While loop [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I am trying to read random data from a dataset in a while loop, but I'm getting errors. Can anyone here help me?
How to calculate the percentage of points in the sample that are greater than 100?
I tried following method
dataset = 1:100
i=0
while(dataset[i] > condition) #compare every value in dataset
{
  percent_age= dataset[i] + percent_age
  i=i+1
  if(i=100)
  {break}
}
But it gives me only errors.
The while condition is evaluated before anything in the body, so the first time it is evaluated i is equal to 0, and dataset[i] is dataset[0], which is an empty object (a vector of length 0). You also have not defined condition in the code that you give us. So while is looking for a single logical value, but you are giving it the result of comparing a zero-length vector to an undefined value; that is going to give at least one error.
You can fix that by starting i at 1 and defining condition before the while.
In your if statement you have i=100; that is setting i to 100. To compare (and return a logical) it should be i == 100.
Because R can be used interactively, it tries to evaluate code as early as possible; therefore it is best to put the opening curly bracket { on the same line as keywords like if and while.
A couple of nit-picky things that probably will not resolve errors, but could help for better programming in the future:
Use more whitespace within lines: i = i + 1 can be easier to read than i=i+1 and mistakes like i=100 vs i == 100 are easier to catch when whitespace is used appropriately.
I find the arrow assignment in R i <- 1 reads easier and lessens chances of confusing different uses of =, so I would recommend using it for all assignments.
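Putting those fixes together, a minimal sketch of one way the loop could look (the threshold value and the percentage calculation are my reading of what the question intends):
dataset <- 1:100
condition <- 100                    # define the threshold before the loop
i <- 1                              # start indexing at 1, not 0
count <- 0
while (i <= length(dataset)) {
  if (dataset[i] > condition) {     # > / == compare; = would assign
    count <- count + 1
  }
  i <- i + 1
}
percent_age <- 100 * count / length(dataset)
percent_age
# [1] 0   (no values in 1:100 are greater than 100)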

unit tests and checks in package function: do we do checks in both?

I'm new to R and package development, so bear with me. I am writing test cases to keep the package in line with standard practices. But I'm confused: if I do the checks in testthat, should I not perform if/else checks in the package function?
my_function <- function(dt_genetic, dt_gene, dt_snpBP){
  if((is.data.table(dt_genetic) & is.data.table(dt_gene) & is.data.table(dt_snpBP)) == FALSE){
    stop("data format unacceptable")
  }
  ## similarly more checks on column names and such
} ## function ends
In my test-data_integrity.R
## create sample data.table
test_gene_coord<-data.table(GENE=c("ABC","XYG","alpha"),"START"=c(10,200,320),"END"=c(101,250,350))
test_snp_pos<-data.table(SNP=c("SNP1","SNP2","SNP3"),"BP"=c(101,250,350))
test_snp_gene<-data.table(SNP=c("SNP1","SNP2","SNP3"),"GENE"=c("ABC","BRCA1","gamma"))
## check data type
test_that("data types correct works", {
expect_is(test_data_table,'data.table')
expect_is(test_gene_coord,'data.table')
expect_is(test_snp_pos,'data.table')
expect_is(test_snp_gene,'data.table')
expect_is(test_gene_coord$START, 'numeric')
expect_is(test_gene_coord$END, 'numeric')
expect_is(test_snp_pos$BP, 'numeric')
})
## check column names
test_that("column names works", {
expect_named(test_gene_coord, c("GENE","START","END"))
expect_named(test_snp_pos, c("SNP","BP"))
expect_named(test_snp_gene, c("SNP","GENE"))
})
when I run devtools::test() all tests are passed, but does it mean that I should not test within my function?
Pardon me if this seems naive but this is confusing as this is completely alien to me.
Edited: data.table if check.
(This is an expansion on my comments on the question. My comments are from a quasi-professional programmer; some of what I say here may be good "in general" but not perfectly complete from a theoretical standpoint.)
There are many "types" of tests, but I'll focus on distinguishing between "unit-tests" and "assertions". For me, the main difference is that unit-tests are typically run by the developer(s) only, and assertions are run at run-time.
Assertions
When you mention adding tests to your function, that to me sounds like assertions: a programmatic statement that an object meets specific property assumptions. This is often necessary when the data is provided by the user or from an external source (database), where the size or quality of the data is previously unknown.
There are "formal" packages for assertions, including assertthat, assertr, and assertive; while I have little experience with any of them, there is also sufficient support in base R that these aren't strictly required. The most basic method is
if (!inherits(mtcars, "data.table")) {
  stop("'obj' is not 'data.table'")
}
# Error: 'obj' is not 'data.table'
which gives you absolute control at the expense of several lines of code. There's another function which shortens this a little:
stopifnot(inherits(mtcars, "data.table"))
# Error: inherits(mtcars, "data.table") is not TRUE
Multiple conditions can be provided, all must be TRUE to pass. (Unlike many R conditionals such as if, this statement must resolve to exactly TRUE: stopifnot(3) does not pass.) In R < 4.0, the error messages were uncontrolled, but starting in R-4.0 one can now name them:
stopifnot(
  "mtcars not data.frame" = inherits(mtcars, "data.frame"),
  "mtcars data.table error" = inherits(mtcars, "data.table")
)
# Error: mtcars data.table error
In some programming languages, these assertions are more declarative/deliberate so that compilation can optimize them out of a production executable. In this sense, they are useful during development, but for production it is assumed that some steps that worked before no longer need validation. I believe there is not an automatic way to do this in R (especially since it is generally not "compiled into an executable"), but one could fashion a function in a way to mimic this behavior:
myfunc <- function(x, ..., asserts = getOption("run_my_assertions", FALSE)) {
  # this one only runs when the user explicitly says "asserts=TRUE"
  if (asserts) stopifnot("'x' not a data.frame" = inherits(x, "data.frame"))
  # this assertion runs all the time
  stopifnot("'x' not a data.table" = inherits(x, "data.table"))
}
I have not seen that logic or flow often in R packages.
Regardless, my assumption of assertions is that those not optimized out (due to compilation or user arguments) execute every time the function runs. This tends to ensure a "safer" flow, and is a good idea especially for less-experienced developers who do not have the experience ("have not been burned enough") to know how many ways certain calls can go wrong.
Unit Tests
These are a bit different, both in their purpose and runtime effect.
First and foremost, unit-tests are not run every time a function is used. They are typically defined in a completely different file, not within the function at all[^1]. They are deliberate sets of calls to your functions, testing/confirming specific behaviors given certain inputs.
With the testthat package, R scripts (that match certain filename patterns) in the package's ./tests/testthat/ sub-directory will be run on command as unit-tests. (Other unit-test packages exist.) (Unit-tests do not require that they operate on a package; they can be located anywhere, and run on any set of files or directories of files. I'm using a "package" as an example.)
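For orientation, a minimal sketch of that layout (any file names beyond testthat's conventions are hypothetical):
# mypackage/
#   R/func.R                # the function under test
#   tests/
#     testthat.R            # typically: library(testthat); test_check("mypackage")
#     testthat/
#       test-func.R         # files matching "test-*.R" hold the test_that() calls
# Run them with:
devtools::test()            # or testthat::test_dir("tests/testthat")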
Side note: it is certainly feasible to include some of the testthat tools within your function for runtime validation as well. For instance, one might replace stopifnot(inherits(x, "data.frame")) with expect_is(x, "data.frame"), and it will fail with non-frames, and pass with all three types of frames tested above. I don't know that this is always the best way to go, and I haven't seen its use in packages I use. (Doesn't mean it isn't there. If you see testthat in a package's "Imports:", then it's possible.)
The premise here is not validation of runtime objects. The premise is validation of your function's performance given very specific inputs[^2]. For instance, one might define a unit-test to confirm that your function operates equally well on frames of class "data.frame", "tbl_df", and "data.table". (This is not a throw-away unit-test, btw.)
Consider a meek function that one would presume can work equally well on any data.frame-like object:
func <- function(x, nm) head(x[nm], n = 2)
To test that this accepts various types, one might simply call it on the console with:
func(mtcars, "cyl")
# cyl
# Mazda RX4 6
# Mazda RX4 Wag 6
When a colleague complains that this function isn't working, you might wonder whether they're using either the tidyverse (and tibble) or data.table, so you can quickly test on the console:
func(tibble::as_tibble(mtcars), "cyl")
# # A tibble: 2 x 1
# cyl
# <dbl>
# 1 6
# 2 6
func(data.table::as.data.table(mtcars), "cyl")
# Error in `[.data.table`(x, nm) :
# When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
So now you know where the problem lies (if not yet how to fix it). If you test this "as is" with data.table, one might think to try something like this (obviously wrong) fix:
func <- function(x, nm) head(x[,..nm], n = 2)
func(data.table::as.data.table(mtcars), "cyl")
# cyl
# 1: 6
# 2: 6
While this works, unfortunately it now fails for the other two frame-like objects.
The answer to this dilemma is to write tests so that, when you make a change to your function, you will know immediately if previously-successful property assumptions no longer hold. Had all three of those tests been incorporated into a unit-test, one might have done something such as
library(testthat)
test_that("func works with all frame-like objects", {
  expect_silent(func(mtcars, "cyl"))
  expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
  expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
# Error: Test failed: 'func works with all frame-like objects'
Given some research, you find one method that you think will satisfy all three frame-like objects:
func <- function(x, nm) head(subset(x, select = nm), n = 2)
And then run your unit-tests again:
test_that("func works with all frame-like objects", {
  expect_silent(func(mtcars, "cyl"))
  expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
  expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
(No output ... silence is golden.)
Similar to many things in programming, there are many opinions on how to organize, fashion, or even when to create these unit-tests. Many of these opinions are right for somebody. One strategy that I tend to start with is this:
since I know that my functions can be used on all three frame-like objects, I often preemptively set up a test given one object of each type (you'd be surprised at some of the lurking differences between them);
when I find or receive a bug report, one of the first things I do after confirming the bug is write a test that triggers that bug, given the minimum inputs required to do so; then I fix the bug, and run my unit-tests to ensure that this new test now passes (and no other test now fails)
Experience will dictate types of tests to write preemptively before the bugs even come.
Tests don't always have to be about "no errors", by the way. They can test for a lot of things:
silence (no errors)
expected messages, warnings, or stop errors (whether internally generated or passed from another function)
output class (matrix or numeric), dimensions, attributes
expected values (returning 3 vice 3.14 might be a problem)
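A hedged sketch of those expectation types, using the func() defined above (the "no_such_column" case is just an illustrative expected failure):
test_that("func returns the expected structure and values", {
  expect_silent(func(mtcars, "cyl"))                  # no errors, warnings, or output
  expect_error(func(mtcars, "no_such_column"))        # an expected failure
  expect_s3_class(func(mtcars, "cyl"), "data.frame")  # output class
  expect_equal(nrow(func(mtcars, "cyl")), 2)          # dimensions
  expect_equal(func(mtcars, "cyl")$cyl, c(6, 6))      # expected values
})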
Some will say that unit-tests are no fun to write, and abhor efforts on them. While I don't disagree that unit-tests are not fun, I have burned myself countless times when making a simple fix to a function inadvertently broke several other things ... and since I deployed the "simple fix" without applicable unit-tests, I just shifted the bug reports from "this title has "NA" in it" to "the app crashes and everybody is angry" (true story).
For some packages, unit-testing can be done in moments; for others, it may take minutes or hours. Due to complexity in functions, some of my unit-tests deal with "large" data structures, so a single test takes several minutes to reveal its success. Most of my unit-tests are relatively instantaneous with inputs of vectors of length 1 to 3, or frames/matrices with 2-4 rows and/or columns.
This is by far not a complete document on testing. There are books, tutorials, and countless blogs about different techniques. One good reference is Hadley's book on R Packages, Testing chapter: http://r-pkgs.had.co.nz/tests.html. I like that, but it is far from the only one.
[^1] Tangentially, I believe that one power that the roxygen2 package affords is the convenience of storing a function's documentation in the same file as the function itself. Its proximity "reminds" me to update the docs when I'm working on code. It would be nice if we could determine a sane way to similarly add formal testthat (or similar) unit-tests to the function file itself. I've seen (and at times used) informal unit-tests by including specific code in the roxygen2 @examples section: when the file is rendered to an .Rd file, any errors in the example code will alert me on the console. I know that this technique is sloppy and hasty, and in general I only suggest it when more formal unit-testing will not be done. It does tend to make help documentation a lot more verbose than it needs to be.
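A hasty sketch of that informal technique (reusing the toy func() from above; the documentation text is made up):
#' Select the first two rows of one column
#'
#' @examples
#' # informal check: rendering/checking the examples will error if this breaks
#' stopifnot(nrow(func(mtcars, "cyl")) == 2)
func <- function(x, nm) head(subset(x, select = nm), n = 2)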
[^2] I said above "given very specific inputs": an alternative is something called "fuzzing", a technique where functions are called with random or invalid input. I believe this is very useful for searching for stack overflow, memory-access, or similar problems that cause a program to crash and/or execute the wrong code. I've not seen this used in R (ymmv).

Efficiently packing and unpacking function arguments in R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have some R code where I'm starting to get too many arguments in my functions, like this
f <- function(a, b, c, d, e, f, g, ...){
  # do stuff with a, b, c, d, e, f, g
  return(list(q = q, r = r, s = s, ...))
}
I was thinking of collapsing arguments into lists of related parameters and then extracting out the parameters from the lists inside the function. This is annoying though since I have to use a lot of boilerplate code
list_of_params <- list(a = a, b = b, ...)
f <- function(list_of_params){
  a <- list_of_params[["a"]]
  b <- list_of_params[["b"]]
  c <- list_of_params[["c"]]
  ...
  # do stuff with a, b, c, ...
  return(list(q = q, r = r, s = s, ...))
}
I was thinking about using something like list2env to automatically extract the variables from the list into the environment of the function. Does anyone have opinions about whether that is a reasonable approach? I read somewhere that using assign is a bad idea and this seems similar. My proposed function would look like this:
f <- function(list_of_params){
  list2env(list_of_params, envir = as.environment(-1)) # -1 means current environment
  # do stuff with a, b, c, ...
  return(list(q = q, r = r, s = s, ...))
}
I have never used assign() or list2env() before. I am concerned they may have treacherous pitfalls I should watch out for, in the same manner as attach(). Is the use of list2env() here appropriate? If not, what is the appropriate use of this function?
A long list of parameters is probably a code-smell.
The easiest thing to do is to stop and think about what type of object should encapsulate your parameters. It's probably not just a simple list.
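A minimal sketch of that idea (the class name sim_params and its fields are made up): bundle related arguments into a small constructor that can validate them once, and have the function accept that object.
new_sim_params <- function(a, b, c, d) {
  stopifnot(is.numeric(a), is.numeric(b))        # validate when the object is built
  structure(list(a = a, b = b, c = c, d = d), class = "sim_params")
}

f <- function(params, ...) {
  stopifnot(inherits(params, "sim_params"))
  # do stuff with params$a, params$b, ...
  with(params, list(q = a + b, r = c * d))       # illustrative only
}

f(new_sim_params(a = 1, b = 2, c = 3, d = 4))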
Another option applies if many of the function parameters are held fixed in terms of procedural or lexical scope. Then you could use the fact that functions in R are closures. Example:
make_f <- function(object, params){
  e <- calculate_e(object, params)
  f <- calculate_f(object, params)
  g <- calculate_g(object, params)
  # return the function directly; it captures e, f and g from this environment
  # (assigning it to a variable named f would shadow the value computed above)
  function(a, b, c, d, ...){
    # do stuff with a, b, c, d and the captured e, f, g
    return(list(q = q, r = r, s = s, ...))
  }
}
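To make the closure idea concrete with something runnable, here is a toy version (the names are made up): the returned function captures scale and offset, so callers pass fewer arguments.
make_scaler <- function(scale, offset) {
  function(x) {
    scale * x + offset        # scale and offset come from the enclosing environment
  }
}

rescale <- make_scaler(scale = 2, offset = 1)
rescale(c(1, 2, 3))
# [1] 3 5 7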

Approaches to preserving object's attributes during extract/replace operations

Recently I encountered the following problem in my R code. In a function accepting a data frame as an argument, I needed to add (or replace, if it already exists) a column with data calculated from values of the data frame's original column. I wrote the code, but testing revealed that the data frame extract/replace operations I used resulted in a loss of the object's special (user-defined) attributes.
After realizing that and confirming that behavior by reading R documentation (http://stat.ethz.ch/R-manual/R-patched/library/base/html/Extract.html), I decided to solve the problem very simply - by saving the attributes before the extract/replace operations and restoring them thereafter:
myTransformationFunction <- function (data) {
  # save object's attributes
  attrs <- attributes(data)

  # <data frame transformations; involves extract/replace operations on `data`>

  # restore the attributes
  attributes(data) <- attrs
  return (data)
}
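A concrete, hedged illustration of the problem and the save/restore fix (the "source" attribute is a made-up user-defined attribute; only that attribute is restored here, since restoring all saved attributes would also reset row.names after a row subset):
d <- data.frame(x = 1:3, y = c(10, 20, 30))
attr(d, "source") <- "sensor A"

attrs <- attributes(d)               # save the attributes up front
d <- d[d$x > 1, , drop = FALSE]      # an extract operation drops "source"
attr(d, "source")
# NULL
attr(d, "source") <- attrs$source    # restore the custom attribute
attr(d, "source")
# [1] "sensor A"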
This approach worked. However, accidentally, I ran across another piece of R documentation (http://stat.ethz.ch/R-manual/R-patched/library/base/html/Extract.data.frame.html), which offers IMHO an interesting (and, potentially, a more generic?) alternative approach to solving the same problem:
## keeping special attributes: use a class with a
## "as.data.frame" and "[" method:
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x, i, ...) {
  r <- NextMethod("[")
  mostattributes(r) <- attributes(x)
  r
}
d <- data.frame(i = 0:7, f = gl(2, 4),
                u = structure(11:18, unit = "kg", class = "avector"))
str(d[2:4, -1]) # 'u' keeps its "unit"
I would really appreciate if people here could help by:
Comparing the two above-mentioned approaches, if they are comparable (I realize that the second approach as defined is for data frames, but I suspect it can be generalized to any object);
Explaining the syntax and meaning in the function definition in the second approach, especially as.data.frame.avector, as well as what is the purpose of the line as.data.frame.avector <- as.data.frame.vector.
I'm answering my own question, since I have just found an SO question (How to delete a row from a data.frame without losing the attributes), answers to which cover most of my questions posed above. However, additional explanations (for R beginners) for the second approach would still be appreciated.
UPDATE:
Another solution to this problem has been proposed in an answer to the following SO question: indexing operation removes attributes. Personally, however, I better like the approach, based on creating a new class, as it's IMHO semantically cleaner.
