This question may seem basic but this has bothered me quite a while. The help document for many functions has ... as one of its argument, but somehow I can never get my head around this ... thing.
For example, suppose I have created a model say model_xgboost and want to make a prediction based on a dataset say data_tbl using the predict() function, and I want to know the syntax. So I look at its help document which says:
?predict
**Usage**
predict (object, ...)
**Arguments**
object a model object for which prediction is desired.
... additional arguments affecting the predictions produced.
To me the syntax and its examples didn't really enlighten me as I still have no idea what the valid syntax/arguments are for the function. In an online course it uses something like below, which works:
data_tbl %>%
predict(model_xgboost, new_data = .)
However, looking across the help doc I cannot find the new_data argument. Instead it mentioned newdata argument in its Details section, which actually didn't work if I displace the new_data = . with newdata = .:
Error in `check_pred_type_dots()`:
! Did you mean to use `new_data` instead of `newdata`?
My questions are:
How do I know exactly what argument(s) / syntax can be used for a function like this?
Why new_data but not newdata in this example?
I might be missing something here, but is there any reference/resource about how to use/interpret a help document, in plain English? (a lot of document, including R help file seem just give a brief sentence like "additional arguments affecting the predictions produced" etc)
#CarlWitthoft's answer is good, I want to add a little bit of nuance about this particular function. The reason the help page for ?predict is so vague is an unfortunate consequence of the fact that predict() is a generic method in R: that is, it's a function that can be applied to a variety of different object types, using slightly different (but appropriate) methods in each case. As such, the ?predict help page only lists object (which is required as the first argument in all methods) and ..., because different predict methods could take very different arguments/options.
If you call methods("predict") in a clean R session (before loading any additional packages) you'll see a list of 16 methods that base R knows about. After loading library("tidymodels"), the list expands to 69 methods. I don't know what class your object is (class("model_xgboost")), but assuming that it's of class model_fit, we look at ?predict.model_fit to see
predict(object, new_data, type = NULL, opts = list(), ...)
This tells us that we need to call the new data new_data (and, reading a bit farther down, that it needs to be "A rectangular data object, such as a data frame")
The help page for predict says
Most prediction methods which are similar to those for linear
models have an argument ‘newdata’ specifying the first place to
look for explanatory variables to be used for prediction
(emphasis added). I don't know why the parsnip authors (the predict.model_fit method comes from the parsnip package) decided to use new_data rather than newdata, presumably in line with the tidyverse style guide, which says
Use underscores (_) (so called snake case) to separate words within a name.
In my opinion this might have been a mistake, but you can see that the parsnip/tidymodels authors have realized that people are likely to make this mistake and added an informative warning, as shown in your example and noted e.g. here
Among other things, the existence of ... in a function definition means you can enter any arguments (values, functions, etc) you want to. There are some cases where the main function does not even use the ... but passes them to functions called inside the main function. Simple example:
foo <- function(x,...){
y <- x^2
plot(x,y,...)
}
I know of functions which accept a function as an input argument, at which point the items to include via ... are specific to the selected input function name.
I'm a new to R and package development so bear with me. I am writing test cases to keep package is line with standard practices. But I'm confused if I do the checks in testthat, should I not perform if/else checks in the package function?
my_function<-function(dt_genetic, dt_gene, dt_snpBP){
if((is.data.table(dt_genetic) & is.data.table(dt_gene) & is.data.table(dt_snpBP))== FALSE){
stop("data format unacceptable")
}
## similary more checks on column names and such
} ## function ends
In my test-data_integrity.R
## create sample data.table
test_gene_coord<-data.table(GENE=c("ABC","XYG","alpha"),"START"=c(10,200,320),"END"=c(101,250,350))
test_snp_pos<-data.table(SNP=c("SNP1","SNP2","SNP3"),"BP"=c(101,250,350))
test_snp_gene<-data.table(SNP=c("SNP1","SNP2","SNP3"),"GENE"=c("ABC","BRCA1","gamma"))
## check data type
test_that("data types correct works", {
expect_is(test_data_table,'data.table')
expect_is(test_gene_coord,'data.table')
expect_is(test_snp_pos,'data.table')
expect_is(test_snp_gene,'data.table')
expect_is(test_gene_coord$START, 'numeric')
expect_is(test_gene_coord$END, 'numeric')
expect_is(test_snp_pos$BP, 'numeric')
})
## check column names
test_that("column names works", {
expect_named(test_gene_coord, c("GENE","START","END"))
expect_named(test_snp_pos, c("SNP","BP"))
expect_named(test_snp_gene, c("SNP","GENE"))
})
when I run devtools::test() all tests are passed, but does it mean that I should not test within my function?
Pardon me if this seems naive but this is confusing as this is completely alien to me.
Edited: data.table if check.
(This is an expansion on my comments on the question. My comments are from a quasi-professional programmer; some of what I say here may be good "in general" but not perfectly complete from a theoretical standpoint.)
There are many "types" of tests, but I'll focus on distinguishing between "unit-tests" and "assertions". For me, the main difference is that unit-tests are typically run by the developer(s) only, and assertions are run at run-time.
Assertions
When you mention adding tests to your function, which to me sounds like assertions: a programmatic statement that an object meets specific property assumptions. This is often necessary when the data is provided by the user or from an external source (database), where the size or quality of the data is previously unknown.
There are "formal" packages for assertions, including assertthat, assertr, and assertive; while I have little experience with any of them, there is also sufficient support in base R that these aren't strictly required. The most basic method is
if (!inherits(mtcars, "data.table")) {
stop("'obj' is not 'data.table'")
}
# Error: 'obj' is not 'data.table'
which gives you absolute control at the expense of several lines of code. There's another function which shortens this a little:
stopifnot(inherits(mtcars, "data.table"))
# Error: inherits(mtcars, "data.table") is not TRUE
Multiple conditions can be provided, all must be TRUE to pass. (Unlike many R conditionals such as if, this statement must resolve to exactly TRUE: stopifnot(3) does not pass.) In R < 4.0, the error messages were uncontrolled, but starting in R-4.0 one can now name them:
stopifnot(
"mtcars not data.frame" = inherits(mtcars, "data.frame"),
"mtcars data.table error" = inherits(mtcars, "data.table")
)
# Error: mtcars data.table error
In some programming languages, these assertions are more declarative/deliberate so that compilation can optimize them out of a production executable. In this sense, they are useful during development, but for production it is assumed that some steps that worked before no longer need validation. I believe there is not an automatic way to do this in R (especially since it is generally not "compiled into an executable"), but one could fashion a function in a way to mimic this behavior:
myfunc <- function(x, ..., asserts = getOption("run_my_assertions", FALSE)) {
# this one only runs when the user explicitly says "asserts=TRUE"
if (asserts) stopifnot("'x' not a data.frame" = inherits(x, "data.frame"))
# this assertion runs all the time
stopifnot("'x' not a data.table" = inherits(x, "data.table"))
}
I have not seen that logic or flow often in R packages.
Regardless, my assumption of assertions is that those not optimized out (due to compilation or user arguments) execute every time the function runs. This tends to ensure a "safer" flow, and is a good idea especially for less-experienced developers who do not have the experience ("have not been burned enough") to know how many ways certain calls can go wrong.
Unit Tests
These are a bit different, both in their purpose and runtime effect.
First and foremost, unit-tests are not run every time a function is used. They are typically defined in a completely different file, not within the function at all[^1]. They are deliberate sets of calls to your functions, testing/confirming specific behaviors given certain inputs.
With the testthat package, R scripts (that match certain filename patterns) in the package's ./tests/testthat/ sub-directory will be run on command as unit-tests. (Other unit-test packages exist.) (Unit-tests do not require that they operate on a package; they can be located anywhere, and run on any set of files or directories of files. I'm using a "package" as an example.)
Side note: it is certainly feasible to include some of the testthat tools within your function for runtime validation as well. For instance, one might replace stopifnot(inherits(x, "data.frame")) with expect_is(x, "data.frame"), and it will fail with non-frames, and pass with all three types of frames tested above. I don't know that this is always the best way to go, and I haven't seen its use in packages I use. (Doesn't mean it isn't there. If you see testthat in a package's "Imports:", then it's possible.)
The premise here is not validation of runtime objects. The premise is validation of your function's performance given very specific inputs[^2]. For instance, one might define a unit-test to confirm that your function operates equally well on frames of class "data.frame", "tbl_df", and "data.table". (This is not a throw-away unit-test, btw.)
Consider a meek function that one would presume can work equally well on any data.frame-like object:
func <- function(x, nm) head(x[nm], n = 2)
To test that this accepts various types, one might simply call it on the console with:
func(mtcars, "cyl")
# cyl
# Mazda RX4 6
# Mazda RX4 Wag 6
When a colleague complains that this function isn't working, you might wonder that they're using either the tidyverse (and tibble) or data.table, so you can quickly test on the console:
func(tibble::as_tibble(mtcars), "cyl")
# # A tibble: 2 x 1
# cyl
# <dbl>
# 1 6
# 2 6
func(data.table::as.data.table(mtcars), "cyl")
# Error in `[.data.table`(x, nm) :
# When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
So now you know where the problem lies (if not yet how to fix it). If you test this "as is" with data.table, one might think to try something like this (obviously wrong) fix:
func <- function(x, nm) head(x[,..nm], n = 2)
func(data.table::as.data.table(mtcars), "cyl")
# cyl
# 1: 6
# 2: 6
While this works, unfortunately it now fails for the other two frame-like objects.
The answer to this dilemma is to make tests so that when you make a change to your function, if previously-successful property assumptions now change, you will know immediately. Had all three of those tests been incorporated into a unit-test, one might have done something such as
library(testthat)
test_that("func works with all frame-like objects", {
expect_silent(func(mtcars, "cyl"))
expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
# Error: Test failed: 'func works with all frame-like objects'
Given some research, you find one method that you think will satisfy all three frame-like objects:
func <- function(x, nm) head(subset(x, select = nm), n = 2)
And then run your unit-tests again:
test_that("func works with all frame-like objects", {
expect_silent(func(mtcars, "cyl"))
expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
(No output ... silence is golden.)
Similar to many things in programming, there are many opinions on how to organize, fashion, or even when to create these unit-tests. Many of these opinions are right for somebody. One strategy that I tend to start with is this:
since I know that my functions can be used on all three frame-like objects, I often preemptively set up a test given one object of each type (you'd be surprised at some of the lurking differences between them);
when I find or receive a bug report, one of the first things I do after confirming the bug is write a test that triggers that bug, given the minimum inputs required to do so; then I fix the bug, and run my unit-tests to ensure that this new test now passes (and no other test now fails)
Experience will dictate types of tests to write preemptively before the bugs even come.
Tests don't always have to be about "no errors", by the way. They can test for a lot of things:
silence (no errors)
expected messages, warnings, or stop errors (whether internally generated or passed from another function)
output class (matrix or numeric), dimensions, attributes
expected values (returning 3 vice 3.14 might be a problem)
Some will say that unit-tests are no fun to write, and abhor efforts on them. While I don't disagree that unit-tests are not fun, I have burned myself countless times when making a simple fix to a function inadvertently broke several other things ... and since I deployed the "simple fix" without applicable unit-tests, I just shifted the bug reports from "this title has "NA" in it" to "the app crashes and everybody is angry" (true story).
For some packages, unit-testing can be done in moments; for others, it may take minutes or hours. Due to complexity in functions, some of my unit-tests deal with "large" data structures, so a single test takes several minutes to reveal its success. Most of my unit-tests are relatively instantaneous with inputs of vectors of length 1 to 3, or frames/matrices with 2-4 rows and/or columns.
This is by far not a complete document on testing. There are books, tutorials, and countless blogs about different techniques. One good reference is Hadley's book on R Packages, Testing chapter: http://r-pkgs.had.co.nz/tests.html. I like that, but it is far from the only one.
[^1] Tangentially, I believe that one power the roxygen2 package affords is the convenience of storing a function's documentation in the same file as the function itself. Its proximity "reminds" me to update the docs when I'm working on code. It would be nice if we could determine a sane way to similarly add formal testthat (or similar) unit-tests to the function file itself. I've seen (and at times used) informal unit-tests by including specific code in the roxygen2 #examples section: when the file is rendered to an .Rd file, any errors in the example code will alert me on the console. I know that this technique is sloppy and hasty, and in general I only suggest it when more formal unit-testing will not be done. It does tend to make help documentation a lot more verbose than it needs to be.
[^2] I said above "given very specific inputs": an alternative is something called "fuzzing", a technique where functions are called with random or invalid input. I believe this is very useful for searching for stack overflow, memory-access, or similar problems that cause a program to crash and/or execute the wrong code. I've not seen this used in R (ymmv).
I've been trying to dust off my R skills recently and I've been having some difficulty with variable scoping in this particular code.
So my function loop here calls other functions within the program that currently work without any problem calling years and trials (both ints), and simMat (a matrix of numeric). Where my main question lies is with the matrix simMat. I want to be able to call it from the command line and see the values but whenever I do that I'll get a matrix of NAs and I don't know why. I am nearly positive that is something to do with the variable scoping but I am not very familiar with that. Also, the suppressWarnings are to get rid of messages about coercion (don't know a lot about that either, any recommendation is appreciated)
I want to be able to call simMat form the command line and pass it to another function to do some arithmetic. I would greatly appreciate any help here on how I can accomplish this!!!
#This looks the same for the func asking for the num of years and trials
numTrials <- function()
{
trials<- readline(prompt="How many trials? ")
trials<- as.integer(trials)
if (is.na(trials)){
trials<- readinteger()
}
return(trials)
}
#Do the simple cash flow simulation
loop<-function(trials, years)
{
trials<-suppressWarnings(numTrials())
years<-suppressWarnings(numYears())
simMat<-matrix(nrow=trials, ncol=years)
for (i in 1:trials){
sim <- newCashFlow[1]
for (j in 1:years){
simMat[i,j]<-sim
random<-randomRates(cholMat2)
sim = sim + sum(random*newCashFlow[j]*weights)
}
}
simMat
plotSimulation(simMat,years,i)
}
If you intend to access something from the console of R which is acting within the global environment, then you need to create the variable OUTSIDE of the function, in that environment you will be working in. As such it will persist when the loop function has completed its tasks.
To be able to use the matrix simMat outside of the loop create it there.
Also, before doing so, run the following script in your code to see where each variable lives. This will help you understand what happens as you make changes.
Sys.getenv(c("simMat", "trials", "years", "sim"))
or simply call the environment with parent.env(simMat)
This website is a very good one to explain these environment issues.
Hadley Wickham...R Genius!
More Hadley Wickha Genius on Lexical Scoping & Functions
These two sites should get you past anything!
I was going through some examples in hadley's guide to functionals, and came across an unexpected problem.
Suppose I have a list of model objects,
x=1:3;y=3:1; bah <- list(lm(x~y),lm(y~x))
and want to extract something from each (as suggested in hadley's question about a list called "trials"). I was expecting one of these to work:
lapply(bah,`$`,i='call') # or...
lapply(bah,`$`,call)
However, these return nulls. It seems like I'm not misusing the $ function, as these things work:
`$`(bah[[1]],i='call')
`$`(bah[[1]],call)
Anyway, I'm just doing this as an exercise and am curious where my mistake is. I know I could use an anonymous function, but think there must be a way to use syntax similar to my initial non-solution. I've looked through the places $ is mentioned in ?Extract, but didn't see any obvious explanation.
I just realized that this works:
lapply(bah,`[[`,i='call')
and this
lapply(bah,function(x)`$`(x,call))
Maybe this just comes down to some lapply voodoo that demands anonymous functions where none should be needed? I feel like I've heard that somewhere on SO before.
This is documented in ?lapply, in the "Note" section (emphasis mine):
For historical reasons, the calls created by lapply are unevaluated,
and code has been written (e.g. bquote) that relies on this. This
means that the recorded call is always of the form FUN(X[[0L]],
...), with 0L replaced by the current integer index. This is not
normally a problem, but it can be if FUN uses sys.call or
match.call or if it is a primitive function that makes use of the
call. This means that it is often safer to call primitive functions
with a wrapper, so that e.g. lapply(ll, function(x) is.numeric(x))
is required in R 2.7.1 to ensure that method dispatch for is.numeric
occurs correctly.
This post (Lazy evaluation in R – is assign affected?) covers some common ground but I am not sure it answers my question.
I stopped using assign when I discovered the apply family quite a while back, albeit, purely for reasons of elegance in situations such as this:
names.foo <- letters
values.foo <- LETTERS
for (i in 1:length(names.foo))
assign(names.foo[i], paste("This is: ", values.foo[i]))
which can be replaced by:
foo <- lapply(X=values.foo, FUN=function (k) paste("This is :", k))
names(foo) <- names.foo
This is also the reason this (http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f) R-faq says this should be avoided.
Now, I know that assign is generally frowned upon. But are there other reasons I don't know? I suspect it may mess with the scoping or lazy evaluation but I am not sure? Example code that demonstrates such problems will be great.
Actually those two operations are quite different. The first gives you 26 different objects while the second gives you only one. The second object will be a lot easier to use in analyses. So I guess I would say you have already demonstrated the major downside of assign, namely the necessity of then needing always to use get for corralling or gathering up all the similarly named individual objects that are now "loose" in the global environment. Try imagining how you would serially do anything with those 26 separate objects. A simple lapply(foo, func) will suffice for the second strategy.
That FAQ citation really only says that using assignment and then assigning names is easier, but did not imply it was "bad". I happen to read it as "less functional" since you are not actually returning a value that gets assigned. The effect looks to be a side-effect (and in this case the assign strategy results in 26 separate side-effects). The use of assign seems to be adopted by people that are coming from languages that have global variables as a way of avoiding picking up the "True R Way", i.e. functional programming with data-objects. They really should be learning to use lists rather than littering their workspace with individually-named items.
There is another assignment paradigm that can be used:
foo <- setNames( paste0(letters,1:26), LETTERS)
That creates a named atomic vector rather than a named list, but the access to values in the vector is still done with names given to [.
As the source of fortune(236) I thought I would add a couple examples (also see fortune(174)).
First, a quiz. Consider the following code:
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
After running the above 2 lines of code, what is the value of x?
The assign function is used to commit "Action at a distance" (see http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming) or google for it). This is often the source of hard to find bugs.
I think the biggest problem with assign is that it tends to lead people down paths of thinking that take them away from better options. A simple example is the 2 sets of code in the question. The lapply solution is more elegant and should be promoted, but the mere fact that people learn about the assign function leads people to the loop option. Then they decide that they need to do the same operation on each object created in the loop (which would be just another simple lapply or sapply if the elegant solution were used) and resort to an even more complicated loop involving both get and apply along with ugly calls to paste. Then those enamored with assign try to do something like:
curname <- paste('myvector[', i, ']')
assign(curname, i)
And that does not do quite what they expected which leads to either complaining about R (which is as fair as complaining that my next door neighbor's house is too far away because I chose to walk the long way around the block) or even worse, delve into using eval and parse to get their constructed string to "work" (which then leads to fortune(106) and fortune(181)).
I'd like to point out that assign is meant to be used with environments.
From that point of view, the "bad" thing in the example above is using a not quite appropriate data structure (the base environment instead of a list or data.frame, vector, ...).
Side note: also for environments, the $ and $<- operators work, so in many cases the explicit assign and get isn't necessary there, neither.