Making a package in R that depends on data.table

I have to make an R package that depends on the data.table package. However, if I define a function such as the following in the package
randomdt <- function() {
  dt <- data.table(random = rnorm(10))
  dt[dt$random > 0]
}
then [ will use the method for data.frame rather than the one for data.table, and therefore the error
Error in `[.data.frame`(x, i) : undefined columns selected
will appear. Usually this would be solved by using get('[.data.table') or a similar approach (package::function is the simplest), but that does not appear to work here. After all, [ is a primitive function, and I don't know how method dispatch works for it.
So, how can I call the data.table [ function from my package?

Updated based on some feedback from MichaelChirico and comments by Arun and Soheil.
Roughly speaking, there are two approaches you might consider. The first is building the dependency into your package itself; the second is including lines in your R code that test for the presence of data.table (and possibly even install it automatically if it is not found).
The data.table FAQ specifically addresses this in 6.9, and states that you can ensure that data.table is appropriately loaded by your package by:
Either i) include data.table in the Depends: field of your DESCRIPTION file, or ii) include data.table in the Imports: field of your DESCRIPTION file AND import(data.table) in your NAMESPACE file.
As noted in the comments, this is common R behavior that is in numerous packages.
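As a concrete sketch, option ii) amounts to lines like these (the rest of each file is omitted):
In DESCRIPTION:
Imports: data.table
In NAMESPACE:
import(data.table)
If you document with roxygen2, adding #' @import data.table to any roxygen block will generate that NAMESPACE directive for you.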
An alternative approach is to create specific lines of code which test for and import the required packages as part of your code. This is, I would contend, not the ideal solution given the elegance of using the option provided above. However, it is technically possible.
A simple way of doing this would be to use either require or library to check for the existence of data.table, with an error thrown if it could not be attached. You could even use a simple set of conditional statements to run install.packages to install what you need if loading them fails.
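A minimal sketch of that check (note that installing packages from within a package's own code is discouraged for CRAN submissions; this pattern is better suited to standalone scripts):
if (!requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table")  # attempt to install the missing package
}
library(data.table)  # attach it; throws an error if it still cannot be loaded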
Yihui Xie (of knitr fame) has a great post about the difference between library and require here and makes a strong case for just using library in cases where the package is absolutely essential for the upcoming code.

Related

How can I use data.table in a package without importing all functions?

I'm building an R package in which I would like to use dtplyr to perform various bits of data manipulation. My issue is that dtplyr seems to work only if I import the whole of data.table (i.e. using the roxygen tag #' @import data.table). Without this I get errors like:
Error in .(x = sum(x), y = sum(y), :
could not find function "."
If I can solve this problem by importing only certain functions from data.table, that would be great, but there seems to be no function .() in the package. My knowledge of data.table is limited, but I can only assume it uses .() to manipulate parsed code (similar to base R's bquote()), and that dtplyr for some reason needs data.table to be loaded for this to work.
I've tried various things such as withr::with_package("data.table", code) and requireNamespace("data.table"), but so far importing the whole package is the only thing that seems to work. This is not a viable solution because it completely ruins the well-maintained namespace in the package I'm working on by importing so many functions from data.table.
NB, this package houses a project which will be worked on by many other analysts well into the future. While simply writing data.table code may be preferable in terms of performance and general good-practice, using dtplyr to translate dplyr code gives a boost in readability and ease-of-use that is far more important in this context.
The (documented) solution I found is to set .datatable.aware <- TRUE somewhere in the package source code. According to the documentation, if you're using data.table in a package without importing the whole thing, you should do this so that [.data.table() does not revert to calling [.data.frame(). From the docs:
...please define .datatable.aware = TRUE anywhere in your R source code (no need to export). This tells data.table that you as a package developer have designed your code to intentionally rely on data.table functionality even though it may not be obvious from inspecting your NAMESPACE file.
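Concretely, this is a single line anywhere in the package's R/ directory; the file name below is just a common convention:
# R/zzz.R -- not exported; signals to data.table that this package
# deliberately relies on data.table semantics
.datatable.aware <- TRUE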

Use `[` method from data.table package in package development

We are creating a package where one of our functions uses functions of the data.table package.
Instead of importing entire packages through our roxygen header, we try to use :: as much as possible in our code.
For a function, this is easy. For example:
data.table::setkey(our_data_1, our_variable)
Yet, we do not know how to do this for a method. For example:
our_data_3 <- our_data_1[our_data_2, roll = "nearest"]
where [ has a specific method for data.tables, which is indicated by:
methods(`[`)
I have tried multiple approaches. Multiple combinations using @importFrom failed. For example, adding the following line to our roxygen header...
#' @importFrom data.table `[.data.table`
...returned the following when running devtools::document():
Warning message:
object ‘[.data.table’ is not exported by 'namespace:data.table'
I have also tried things like [.data.table within our code, but those failed as well...
Importing the entire data.table package in our roxygen header worked (#' @import data.table), but this is not preferred since we want to refer to the package of each function within our code (or at least use @importFrom).
Is there a way to use the [ method of data.table within the code of a function without importing the entire data.table package? Or is it at least possible to import only the method, for example by using @importFrom in our roxygen header?
Thank you in advance!
There is no need to import S3 methods; they are dispatched automatically based on the class of the object.
In the case of data.table's [ method, there is a trick we use to ensure that a data.table passed to a library that expects a data.frame will be handled properly, as a data.frame. The decision is made based on the NAMESPACE file: if you don't import data.table in your NAMESPACE, the data.table method assumes you want your object treated as a data.frame.
You can state your intent explicitly by defining the extra variable .datatable.aware = TRUE in any of your R script files.
You should read the Importing data.table vignette, where this is well described.
I have also put up an example package which you can run and debug in case your code still does not work for some reason: https://gitlab.com/jangorecki/useDTmethod
I don't think you need to import the S3 method or use :: as we do for ordinary functions.
In my opinion you just need to add data.table as a dependency in DESCRIPTION and it should work.
R will know that you are applying [ to a data.table object and will use the correct method.
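Putting the answer together, a minimal sketch of the question's function (reusing the question's object names, and assuming data.table is listed under Imports: in DESCRIPTION, with .datatable.aware <- TRUE defined somewhere in R/ if data.table is not imported in NAMESPACE):
roll_nearest <- function(our_data_1, our_data_2) {
  # ordinary functions are referenced with ::, as the question prefers
  data.table::setkey(our_data_1, our_variable)
  data.table::setkey(our_data_2, our_variable)
  # plain [ dispatches to [.data.table via S3, based on the objects' class
  our_data_1[our_data_2, roll = "nearest"]
}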

Determining if there are unused packages in an R script [duplicate]

As my code evolves from version to version, I'm aware that there are some packages for which I've found better/more appropriate packages for the task at hand or whose purpose was limited to a section of code which I've now phased out.
Is there any easy way to tell which of the loaded packages are actually used in a given script? My header is beginning to get cluttered.
Update 2020-04-13
I've now updated the referenced function to use the abstract syntax tree (AST) instead of using regular expressions as before. This is a much more robust way of approaching the problem (it's still not completely ironclad). This is available from version 0.2.0 of funchir, now on CRAN.
I've just got around to writing a quick-and-dirty function to handle this which I call stale_package_check, and I've added it to my package (funchir).
e.g., if we save the following script as test.R:
library(data.table)
library(iotools)
DT = data.table(a = 1:3)
Then, if we run funchir::stale_package_check('test.R') from the directory containing that script, we'll get:
Functions matched from package data.table: data.table
**No exported functions matched from iotools**
Have you considered using packrat?
packrat::clean() would remove unused packages, for example.
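A sketch of that workflow (this assumes you are willing to move the project onto a packrat-managed private library):
packrat::init()   # one-time setup of a project-specific library
packrat::clean()  # remove installed packages that the project code does not use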
I've written a command-line script to accomplish this task. You can find it in this Github gist. I'm sure there are edge cases that it misses, but it works pretty well, on both R scripts and Rmd files.
My approach is always to close my R script or IDE (e.g. RStudio) and then start it again.
After this I run my function without loading any dependencies/packages beforehand.
This should result in various warning and error messages telling you which functions couldn't be found. These in turn give you hints about which packages need to be loaded beforehand and which ones you can leave out.

R: importing data.table package namespace, unexplainable jump in memory consumption

I use the data.table package inside my own package, and I import the data.table namespace in my NAMESPACE and DESCRIPTION files.
In one of my functions I use the data.table() function to convert a data.frame into a data.table:
dt <- data.table(df)
But when I call my function, at the point of calling data.table() memory usage jumps instantly and R just stops responding.
The code within the function works fine when I run it line by line and with low memory consumption.
Also, if I put library(data.table) within my function, everything is fine. I was trying to avoid putting library(data.table) in my function and to declare the dependency instead. However, it seems something is going wrong that way. I am running R 2.14.0 on Mac OS X 10.6.8.
Can anybody explain what the reason could be, and how I can fix it (without using library(data.table) within my function)?
Some random guesses, in no particular order:
Try using the Imports or Depends field in DESCRIPTION only. I don't think you need to import in NAMESPACE as well, but I might be wrong. Why that would explain the memory use, though, I don't know.
What is df? Is it big or somehow recursive or strange in some way? Please provide str(df) to tell us something about it, if possible.
Try as.data.table(df), which is faster than data.table(df). But it sounds like your problem is different from that.
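For illustration, the two conversions side by side (df stands in for the question's data.frame):
library(data.table)
df <- data.frame(x = 1:5, y = rnorm(5))
dt1 <- data.table(df)     # builds a new data.table from df's columns
dt2 <- as.data.table(df)  # dedicated coercion method; the faster route suggested above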
Is your function being called repeatedly? I can see why repeatedly converting df to dt would use up memory, but not why just calling library(data.table) would make that fast.
Try starting R with R --vanilla to ensure no .Rdata (which may include functions masking data.table's) is being loaded on startup, amongst other things. If you have developed your own package, then some kind of function name conflict, or an issue with the order of packages on the search() path, sounds plausible.
Otherwise we'll need more information, please. I don't recall anything similar to this happening to me, or being reported before.
And which version of data.table are you using? There is this bug fix in v1.8.1 on R-Forge (not yet on CRAN):
Moved data.table setup code from .onAttach to .onLoad so that it is also run when data.table is simply imported from within a package, fixing #1916 related to missing data.table options.
But if you are using 1.8.0 from CRAN, and are Importing (only) rather than Depending, then I'd expect you to get an error about missing options rather than a jump in memory consumption.
