R: importing data.table package namespace, unexplained jump in memory consumption

I use the data.table package inside my own package, and I import the data.table namespace in my NAMESPACE and DESCRIPTION files.
In one of my functions I use the data.table() function to convert a data.frame into a data.table:
dt <- data.table(df)
But when I call my function, memory usage jumps instantly at the point where data.table() is called and R just stops responding.
The code within the function works fine, with low memory consumption, when I run it line by line.
Also, if I put library(data.table) within my function everything is fine. I was trying to avoid putting library(data.table) in my function and to declare the dependency instead. However, it seems something is going wrong that way. I am running R 2.14.0 on Mac OS X 10.6.8.
Can anybody explain what the reason could be, and how I can fix it (without using library(data.table) within my function)?

Some random guesses, in no particular order:
Try using the Imports or Depends field in DESCRIPTION only. I don't think you need to import in NAMESPACE as well, but I might be wrong (see the sketch at the end of this answer). Why that would explain the memory use, though, I don't know.
What is df? Is it big or somehow recursive or strange in some way? Please provide str(df) to tell us something about it, if possible.
Try as.data.table(df) which is faster than data.table(df). But it sounds like your problem is different to that.
Is your function being called repeatedly? I can see why repeatedly converting df to dt would use up memory, but not why just calling library(data.table) would make that fast.
Try starting R with R --vanilla to ensure no .RData (which may include functions masking data.table's) is loaded on startup, amongst other things. If you have developed your own package, then some kind of function name conflict, or the order of packages on the search() path, sounds plausible.
Otherwise we'll need more information please. I don't recall anything similar to this happening to me, or being reported before.
And, which version of data.table are you using? There is this bug fix in v1.8.1 on R-Forge (not yet on CRAN):
Moved data.table setup code from .onAttach to .onLoad so that it is also run when data.table is simply imported from within a package, fixing #1916 related to missing data.table options.
But if you are using 1.8.0 from CRAN, and are Importing (only) rather than Depending, then I'd expect you to get an error about missing options rather than a jump in memory consumption.
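For reference, a minimal sketch of the two declaration styles mentioned above (standard R package file layout assumed):
In DESCRIPTION (option 1; puts data.table on the search path when your package is attached):
Depends: data.table
In DESCRIPTION plus NAMESPACE (option 2; keeps the user's search path clean):
Imports: data.table
import(data.table)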

Related

How can I use data.table in a package without importing all functions?

I'm building an R package in which I would like to use dtplyr to perform various bits of data manipulation. My issue is that dtplyr seems to only work if I import the whole of data.table (i.e. using the roxygen tag #' @import data.table). Without this I get errors like:
Error in .(x = sum(x), y = sum(y), :
could not find function "."
If I can solve this problem by only importing certain functions from data.table that would be great, but there seems to be no function .() in the package. My knowledge of data.table is limited, but I can only assume it uses .() to edit parsed code (similar to the base R bquote()), but that dtplyr for some reason needs data.table to be loaded for this to work.
I've tried various things such as withr::with_package("data.table", code) and requireNamespace("data.table"), but so far importing the whole package is the only thing that seems to work. This is not a viable solution because it completely ruins the well-maintained namespace in the package I'm working on by importing so many functions from data.table.
NB, this package houses a project which will be worked on by many other analysts well into the future. While simply writing data.table code may be preferable in terms of performance and general good-practice, using dtplyr to translate dplyr code gives a boost in readability and ease-of-use that is far more important in this context.
The (documented) solution I found is to set .datatable.aware <- TRUE somewhere in the package source code. According to the documentation, if you're using data.table in a package without importing the whole thing, you should do this so that [.data.table() does not revert to calling [.data.frame(). From the docs:
...please define .datatable.aware = TRUE anywhere in your R source code (no need to export). This tells data.table that you as a package developer have designed your code to intentionally rely on data.table functionality even though it may not be obvious from inspecting your NAMESPACE file.
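For illustration, a minimal sketch of this approach; the file names, the helper sum_cols(), and the columns x and y are hypothetical (any R source file in the package works):
# R/zzz.R -- the flag just needs to exist somewhere in the package's R code
.datatable.aware <- TRUE
# R/utils.R -- data.table itself is only referenced via ::, never imported wholesale
sum_cols <- function(df) {
  dt <- data.table::as.data.table(df)
  dt[, .(x = sum(x), y = sum(y))]  # data.table [ semantics (including .()) now apply inside this package
}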

Determining if there are unused packages in an R script [duplicate]

As my code evolves from version to version, I'm aware that there are some packages for which I've found better/more appropriate packages for the task at hand or whose purpose was limited to a section of code which I've now phased out.
Is there any easy way to tell which of the loaded packages are actually used in a given script? My header is beginning to get cluttered.
Update 2020-04-13
I've now updated the referenced function to use the abstract syntax tree (AST) instead of using regular expressions as before. This is a much more robust way of approaching the problem (it's still not completely ironclad). This is available from version 0.2.0 of funchir, now on CRAN.
I've just got around to writing a quick-and-dirty function to handle this which I call stale_package_check, and I've added it to my package (funchir).
e.g., if we save the following script as test.R:
library(data.table)
library(iotools)
DT = data.table(a = 1:3)
Then (from the directory with that script) running funchir::stale_package_check('test.R') gives:
Functions matched from package data.table: data.table
**No exported functions matched from iotools**
Have you considered using packrat?
packrat::clean() would remove unused packages, for example.
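For example, a minimal sketch (packrat works per project, so it must be initialised once first):
packrat::init()   # one-time setup: gives the project its own private library
packrat::clean()  # removes library packages not referenced by the project's code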
I've written a command-line script to accomplish this task. You can find it in this Github gist. I'm sure there are edge cases that it misses, but it works pretty well, on both R scripts and Rmd files.
My approach is always to close my R script or IDE (e.g. RStudio) and then start it again.
After this I run my function without loading any dependencies/packages beforehand.
This should result in various warning and error messages telling you which functions couldn't be found and executed. These in turn give you hints on which packages are necessary to load beforehand and which ones you can leave out.

Use data.table in functions/packages (With roxygen)

I am quite new to R, but it seems this question is closely related to the following posts 1, 2, 3 and the slightly different topic 4. Unfortunately, I don't have enough reputation to comment there. My problem is that after going through all the suggestions there, the code still does not work:
I included "Depends" in the DESCRIPTION file
I tried the second method, including a change of NAMESPACE (not reproducible)
I created an example package here containing a very small part of the code, which showed a slightly different error ("J" not found in routes[J(lat1, lng1, lat2, lng2), .I, roll = "nearest", by = .EACHI] instead of 'lat1' not found in routes[order(lat1, lng1, lat2, lng2, time)])
I tested all scripts from the console and as plain R scripts; there, the code ran without problems.
Thank you very much for your support!
Edit: @Roland
You are right. Roxygen overwrites the namespace. You have to add #' @import data.table to the function. Do you understand why only inserting Depends: data.table in the DESCRIPTION file does not work? This might be a useful hint for the documentation, or did I miss it?
It was misleading that changing to routes <- routes[order("lat1", "lng1", "lat2", "lng2", "time")] helped at least a bit, as this line suddenly was no longer a problem. Is it correct that in this case the data.frame order is used? I will see how far I get now. I will let you know the final result...
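For reference, a minimal sketch of the roxygen approach @Roland describes (the column names come from the question; the function name order_routes() is hypothetical):
#' Order routes by coordinates and time
#' @import data.table
#' @export
order_routes <- function(routes) {
  routes[order(lat1, lng1, lat2, lng2, time)]
}
After devtools::document() is run, roxygen writes import(data.table) into NAMESPACE, so [.data.table and J() resolve inside the package.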
Answering your questions (after edit).
Quoting R exts manual:
Almost always packages mentioned in ‘Depends’ should also be imported from in the NAMESPACE file: this ensures that any needed parts of those packages are available when some other package imports the current package.
So you should still have the import in NAMESPACE regardless of whether you Depend on or Import data.table.
The order call doesn't seem to do what you expect; try the following:
order("lat1", "lng1", "lat2", "lng2", "time")  # returns 1: each argument is a length-one character vector, so there is nothing to sort
library(data.table)
data.table(a = 2:1, b = 1:2)[order("a", "b")]  # i is therefore 1, so this returns just the first row
In case of issues I recommend starting to debug by writing unit tests for your expected results. The most basic way to put unit tests in a package is a plain R script in the tests directory containing stopifnot(...) calls. Be aware that you need to library()/require() your package at the start of the script.
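A minimal sketch of such a test script (the question's package name isn't given, so this example tests data.table behaviour directly):
# tests/basic.R
library(data.table)
dt <- data.table(a = 2:1, b = 1:2)
stopifnot(identical(dt[order(a, b)]$a, 1:2))  # ordered by column values, not by name strings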
This is more of an addition to the answers above; I found this really useful...
From the docs [Hadley-description](http://r-pkgs.had.co.nz/description.html):
Imports: packages listed here must be present for your package to work. In fact, any time your package is installed, those packages will, if not already present, be installed on your computer (devtools::load_all() also checks that the packages are installed). Adding a package dependency here ensures that it'll be installed. However, it does not mean that it will be attached along with your package (i.e., library(x)). The best practice is to explicitly refer to external functions using the syntax package::function(). This makes it very easy to identify which functions live outside of your package. This is especially useful when you read your code in the future.
If you use a lot of functions from other packages this is rather verbose. There's also a minor performance penalty associated with :: (on the order of 5 µs, so it will only matter if you call the function millions of times).
From the docs Hadley-namespace
NAMESPACE also controls which external functions can be used by your package without having to use ::. It's confusing that both DESCRIPTION (through the Imports field) and NAMESPACE (through import directives) seem to be involved in imports. This is just an unfortunate choice of names. The Imports field really has nothing to do with functions imported into the namespace: it just makes sure the package is installed when your package is. It doesn't make functions available. You need to import functions in exactly the same way regardless of whether or not the package is attached.
... this is what I recommend: list the package in DESCRIPTION so that it's installed, then always refer to it explicitly with pkg::fun(). Unless there is a strong reason not to, it's better to be explicit. It's a little more work to write, but a lot easier to read when you come back to the code in the future. The converse is not true. Every package mentioned in NAMESPACE must also be present in the Imports or Depends fields.
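As a concrete sketch of that recommendation (the helper to_dt() is hypothetical; data.table sits in the Imports: field of DESCRIPTION but is never attached):
# R/convert.R
to_dt <- function(df) {
  data.table::as.data.table(df)  # explicit :: reference, no library() call needed
}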

Making a package in R that depends on data.table

I have to make an R package that depends on the package data.table. However, if I write a function such as the following in the package
randomdt <- function() {
  dt <- data.table(random = rnorm(10))
  dt[dt$random > 0]
}
the function [ will use the method for data.frame, not for data.table, and therefore the error
Error in `[.data.frame`(x, i) : undefined columns selected
will appear. Usually this would be solved by using get('[.data.table') or a similar method (package::function is the simplest), but that appears not to work. After all, [ is a primitive function, and I don't know how its methods are dispatched.
So, how can I call the data.table [ function from my package?
Updated based on some feedback from MichaelChirico and comments by Arun and Soheil.
Roughly speaking, there are two approaches you might consider. The first is building the dependency into your package itself, while the second is including lines in your R code that test for the presence of data.table (and possibly even install it automatically if it is not found).
The data.table FAQ specifically addresses this in 6.9, and states that you can ensure that data.table is appropriately loaded by your package by:
Either i) include data.table in the Depends: field of your DESCRIPTION file, or ii) include data.table in the Imports: field of your DESCRIPTION file AND import(data.table) in your NAMESPACE file.
As noted in the comments, this is common R behavior that is in numerous packages.
An alternative approach is to create specific lines of code which test for and import the required packages as part of your code. This is, I would contend, not the ideal solution given the elegance of using the option provided above. However, it is technically possible.
A simple way of doing this would be to use either require or library to check for the existence of data.table, with an error thrown if it could not be attached. You could even use a simple set of conditional statements to run install.packages to install what you need if loading them fails.
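A minimal sketch of that conditional approach (suitable for standalone scripts; inside a package, the DESCRIPTION route above is preferred):
if (!requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table")  # install on first use
}
library(data.table)  # attach; throws an error if installation failed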
Yihui Xie (of knitr fame) has a great post about the difference between library and require here and makes a strong case for just using library in cases where the package is absolutely essential for the upcoming code.
