I am trying to write a test for a package function in R.
Let's say we have a function that simply writes a string x to disk using writeLines():
exporting_function <- function(x, file) {
writeLines(x, con = file)
invisible(NULL)
}
One way of testing it would be to check whether the file exists. It should not exist at first, but it should after the exporting function has run. You might also want to check that the file size is greater than 0:
library(testthat)
test_that("file is written to disk", {
file = 'output.txt'
expect_false(file.exists(file))
exporting_function("This is a test",
file = file)
expect_true(file.exists(file))
expect_gt(file.info('output.txt')$size, 0)
})
Is this a good way to test it? The CRAN Repository Policy states that "Packages should not write in the user's home filespace (including clipboards), nor anywhere else on the file system apart from the R session's temporary directory." Would this test violate that constraint?
There is also an expect_output_file function. From the documentation and examples I am not sure whether it is a more appropriate expectation for testing this function. Among other things, it requires an object argument, which should be the object to test. What is the object to test in my case?
That looks as if it violates CRAN policy. Why not simply write to the temporary directory, using
file <- tempfile()
in place of
file = 'output.txt'
?
As to whether it is a good test: wouldn't it be better to try reading the file back in, and confirming that what was read matches what was written? That's easy in your toy example. It might be harder in the real one, but having an import function paired with your export function is always a good idea.
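The two suggestions above can be combined into a single test. This is a minimal sketch that assumes testthat is installed and reuses exporting_function from the question: the file is created with tempfile(), and the content is read back and compared with what was written.

```r
library(testthat)

# Function under test, as defined in the question
exporting_function <- function(x, file) {
  writeLines(x, con = file)
  invisible(NULL)
}

test_that("content written to disk can be read back", {
  file <- tempfile(fileext = ".txt")  # lives in the session's temporary directory
  expect_false(file.exists(file))

  exporting_function("This is a test", file = file)

  expect_true(file.exists(file))
  expect_identical(readLines(file), "This is a test")

  unlink(file)  # clean up after the test
})
```

Because the path comes from tempfile(), nothing is written outside the R session's temporary directory.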
I have a function in an R script file that, when called, initializes a temporary log file locally using logger and logs specific things from the function.
function_name <- function(source, dimensions, metrics, filters){
tmp_log_file <- init_logger()
log_info('Currently in Function -function_name- from -myscript.R-')
cat("Location of log file:", tmp_log_file)
... # Do more things in function while still logging certain things
results <- second_function()
return(results)
}
Now the function where logger gets initialized is the following,
init_logger <- function(create_temp_file = TRUE) {
  if (!requireNamespace("logger", quietly = TRUE)) {
    stop("Package logger needed. Please install it.",
         call. = FALSE)
  }
  library(logger)
  # tmp stays NULL when no temp file is requested, so return(tmp) is always defined
  tmp <- NULL
  if (isTRUE(create_temp_file)) {
    # create temp file in temp dir; it is deleted automatically when the session ends
    tmp <- tempfile(fileext = '.txt')
    log_appender(appender_file(tmp))
  }
  log_threshold(TRACE)
  log_formatter(formatter_paste)
  log_layout(layout_simple)
  return(tmp)
}
Now I want to expand this to other functions. Some are called inside my first function, function_name(), but others are not. If the user calls one of those other functions, I would like to append the logging information to the same temp file if it already exists; otherwise I want to create and initialize a new temp log file.
Is there a way to check if logger has been initialized and is currently saving logging information in a temp file?
Some general information:
R 4.1.3
Logger package 0.2.2
RStudio 2022.02.3+492
My current solution is to check logger::log_appender() and see whether the log entries are being appended to the console or to a file, since the default is the console, but this does not seem to be a great way to do it.
Perhaps even trying to do this is bad practice (I would like to know if that is the case, and why). I only need the log file if something goes wrong, in which case the user will send it to me. Otherwise, since it is a temp file created with tempfile(), it will be deleted automatically when the user ends the session.
I don't have much experience with logging in general, so bear with me.
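One way to avoid inspecting logger's appender is to have your own code remember the file it created. The following is a minimal sketch, not something taken from the logger documentation: `.log_state` and `get_log_file()` are hypothetical names, and the only logger calls used are log_appender() and appender_file(), which already appear in the question.

```r
# Hypothetical session-level store for the current log file path
.log_state <- new.env(parent = emptyenv())

# Return the current temp log file, creating and registering it on first use
get_log_file <- function() {
  path <- .log_state$path
  if (is.null(path) || !file.exists(path)) {
    path <- tempfile(fileext = ".txt")
    file.create(path)
    # Point logger at the file only if the package is available
    if (requireNamespace("logger", quietly = TRUE)) {
      logger::log_appender(logger::appender_file(path))
    }
    .log_state$path <- path
  }
  path
}

first  <- get_log_file()  # creates the file
second <- get_log_file()  # reuses the same file
identical(first, second)
```

Any function that wants to log can call get_log_file() first; it only creates a new file when none has been registered in this session.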
I have a flexdashboard which is used by multiple users. They read, modify and write the same (csv) file. I haven't been able to figure out how to do this with a SQL connection, so in the meantime (I need a working app) I would like to use a simple .csv file as a database. This should be fine since the users aren't likely to work on it at the exact same time, and loading and writing the full file is almost instant.
My strategy is therefore:
1-load file,
2-edit (edits are done in rhandsontable which is backconverted to a dataframe)
3-save:
(a)-loads file again (to get the latest data),
(b)-appends the edits from the rhandsontable and keeps the latest data (indicated by a timestamp)
(c)-write.csv
I'm thinking I should add something in (1) that checks whether the file is already in use/open (because another user is at (3)). So: check if open; if not, continue; otherwise Sys.sleep(3) and try again.
Any ideas about how to do this in R? In Delphi it would be something like:
if fileinuse(filename) then sleep(3) else df<-read.csv
What's the R way?
Edit:
I'm starting with my edited answer, as it's more elegant. It uses a shell command to test whether a file is available, as discussed in this question:
How to check in command-line if a given file or directory is locked (used by any process)?
This avoids loading and saving the file, and is therefore more efficient. Note that shell() and the command used here are Windows-only.
# Function to test availability
IsInUse <- function(fName) {
shell(paste("( type nul >> ", fName, " ) 2>nul && echo available || echo in use", sep=""), intern = TRUE)=="in use"
}
# Test availability
IsInUse("test.txt")
Original answer:
Interesting question! I did not find a way to check if a file is in use before trying to write to it. The solution below is far from elegant. It relies on a tryCatch function, and on reading and writing to a file to check if it is available (which can be quite slow depending on your file size).
# Function to check if the file is in use (relies on reading and writing which is inefficient)
IsInUse <- function(fName) {
rData <- read.csv(fName)
tryCatch(
{
write.csv(rData, file=fName, row.names = FALSE)
return(FALSE)
},
error=function(cond) {
return(TRUE)
}
)
}
# Loop to check if file is in use
while(IsInUse(fName)) {
print("Still in use")
Sys.sleep(0.1)
}
# Your action here
I also found the answers to the question "How to write trycatch in R" useful for making sense of the tryCatch function.
I'd be interested to see if anyone else has a more elegant suggestion!
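A different technique from the two above, offered only as a sketch: instead of probing whether the CSV is open, all writers can agree on a lock directory next to the file. dir.create() returns FALSE when the directory already exists, so only one process succeeds in taking the lock. `with_file_lock()`, the `.lock` suffix, and the retry settings are assumptions for illustration, and the scheme only works if every process that writes the file follows the same convention.

```r
with_file_lock <- function(path, action, wait = 0.1, max_tries = 50) {
  lock <- paste0(path, ".lock")
  for (i in seq_len(max_tries)) {
    # dir.create() returns TRUE only for the process that actually created it
    if (dir.create(lock, showWarnings = FALSE)) {
      on.exit(unlink(lock, recursive = TRUE), add = TRUE)  # always release the lock
      return(action(path))
    }
    Sys.sleep(wait)  # someone else holds the lock; try again shortly
  }
  stop("Could not acquire lock for ", path)
}

# Usage: read, modify, and write while holding the lock
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2), csv, row.names = FALSE)
with_file_lock(csv, function(p) {
  df <- read.csv(p)
  df <- rbind(df, data.frame(id = 3L))
  write.csv(df, p, row.names = FALSE)
})
```

If a process crashes while holding the lock, the directory is left behind and has to be removed by hand, which is the main weakness of this approach.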
Interesting question, indeed! Curious about an elegant solution too...
I have a 370MB zip file and the content is a 4.2GB csv file.
I did:
unzip("year2015.zip", exdir = "csv_folder")
And I got this message:
1: In unzip("year2015.zip", exdir = "csv_folder") :
possible truncation of >= 4GB file
Have you experienced that before? How did you solve it?
I agree with @Sixiang.Hu's answer: R's unzip() won't work reliably with files of 4GB or more.
As for how to solve it: I've tried a few different tricks, and in my experience anything using R's built-ins (almost) invariably misidentifies the end-of-file (EOF) marker and stops before the actual end of the file.
I deal with this issue in a set of files I process on a nightly basis. To handle it consistently and in an automated fashion, I wrote the function below to wrap the UNIX unzip. This is basically what you're doing with system(unzip()), but it gives you a bit more flexibility in its behavior and lets you check for errors more systematically.
decompress_file <- function(directory, file, .file_cache = FALSE) {
  if (isTRUE(.file_cache)) {
    print("decompression skipped")
  } else {
    # Set working directory for decompression
    # simplifies unzip directory location behavior
    wd <- getwd()
    # Reset working directory on exit, even if unzip fails
    on.exit(setwd(wd), add = TRUE)
    setwd(directory)
    # Run decompression, capturing stdout as a character vector
    decompression <-
      system2("unzip",
              args = c("-o", # overwrite existing files without prompting
                       file),
              stdout = TRUE)
    # uncomment to delete archive once decompressed
    # file.remove(file)
    # Test for success criteria
    # change the search depending on
    # your implementation
    # any() guards against empty output, where grepl() returns logical(0)
    if (any(grepl("Warning message", tail(decompression, 1)))) {
      print(decompression)
    }
  }
}
Notes:
The function does a few things, which I like and recommend:
- uses system2 over system because the documentation says "system2 is a more portable and flexible interface than system"
- separates the directory and file arguments, and moves the working directory to the directory argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
  - it's not pure, but resetting the working directory is a nice step toward the function having fewer side effects
  - you can technically do it without this, but in my experience it's easier to make the function more verbose than have to deal with generating filepaths and remembering unzip CLI flags
- I set it to use the -o flag to automatically overwrite when rerun, but you could supply any number of arguments
- includes a .file_cache argument which allows you to skip decompression
  - this comes in handy if you're testing a process which runs on the decompressed file, since 4GB+ files tend to take some time to decompress
- commented out in this instance, but if you know you don't need the archive after decompressing, you can remove it inline
- the system2 command redirects the stdout to decompression, a character vector
  - an if + grepl check at the end looks for warnings in the stdout, and prints the stdout if it finds that expression
Checking ?unzip, I found the following comment in the Note section:
It does have some support for bzip2 compression and > 2GB zip files
(but not >= 4GB files pre-compression contained in a zip file: like
many builds of unzip it may truncate these, in R's case with a warning
if possible).
You can try to unzip it outside of R (using 7-Zip for example).
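If a command-line unzip is installed, utils::unzip() can also delegate to it through its `unzip` argument instead of using the internal method. This is a minimal sketch that assumes an `unzip` executable is on the PATH; `unzip_external()` is a hypothetical wrapper name, and whether files of 4GB or more are handled correctly still depends on the build of that external program.

```r
unzip_external <- function(zipfile, exdir = ".") {
  tool <- Sys.which("unzip")  # "" if no unzip program is found
  if (!nzchar(tool)) {
    stop("No external 'unzip' program found on the PATH")
  }
  # Passing a program path as `unzip` makes utils::unzip() call that program
  utils::unzip(zipfile, exdir = exdir, unzip = unname(tool))
}

# Usage (file names taken from the question):
# unzip_external("year2015.zip", exdir = "csv_folder")
```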
To add to the list of possible solutions, in case you have Java (JDK) available on your machine, you can wrap jar xf into an R function similar to utils::unzip() in interface, a very simple example:
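unzipLarge <- function(zipfile, exdir = getwd()) {
  # Resolve the archive path before changing directory,
  # so that a relative zipfile path still works
  zipfile <- normalizePath(zipfile, mustWork = TRUE)
  oldWd <- getwd()
  on.exit(setwd(oldWd))
  setwd(exdir)
  system2("jar", args = c("xf", shQuote(zipfile)))
}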
And then use:
unzipLarge("year2015.zip", exdir = "csv_folder")
This is a minimal example of my current issue:
At the moment I have a project containing several R scripts, all in the same directory named DIR. I have a main script in DIR that sources all the R files and contains a call to basicConfig:
basicConfig()
I take two scripts in DIR, dog.r and cat.r. I have currently only one function in these scripts. In dog.r :
feedDog <- function(){
loginfo("The dog is happy to eat!", logger="dog.r")
}
And in cat.r :
feedCat <- function(){
loginfo("The cat is voracious", logger="cat.r")
}
This works fine for this example. But in reality I have something like 20 scripts, with 20 possible error messages in each. So instead of writing:
loginfo("some message", logger="name of script")
I would like to write:
loginfo("some message", logger=logger)
And configure different loggers.
The issue is that if I declare a logger in each R script, only one will be taken into account when I source all the files from my main script, and I don't know how to work around this.
PS: in Python it is possible to define a logger in each file that automatically takes the name of the script, like this:
logger = logging.getLogger(__name__)
But I am afraid this is not possible in R?
If you source() a file, the functions created in that file will have an attribute called srcref that stores the location in the sourced file that the function came from. If you have a name that points to that function, you can use getSrcFilename to get the filename the function came from. For example, create a file that we can source:
# -- testthis.R --
testthis <- function() {
loginfo("Hello")
}
Now if we enter R, we can run
loginfo <-function(msg) {
fnname <- sys.call(-1)[[1]]
fnfile <- getSrcFilename(eval(fnname))
paste(msg, "from", deparse(fnname), "in", fnfile)
}
source("testthis.R")
testthis()
# [1] "Hello from testthis in testthis.R"
The function loginfo uses sys.call(-1) to see what function it was called from. Then it extracts the name from that call (with [[1]]) and then we use eval() to turn that "name" object into the actual function. Once we have the function, we can get the source file name. This is the same as running
getSrcFilename(testthis)
if you already knew the name of the function. So it is possible, it's just a bit tricky. I believe this special attribute is only added to functions. Other than that, each source file doesn't get its own namespace or anything, so they can't each have their own logger.
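Note that srcref is only recorded when source references are kept. That is the default in interactive sessions but not under Rscript, so passing keep.source = TRUE to source() makes the approach independent of that setting. The sketch below applies the same idea to derive a logger name per file. `log_here()` is a hypothetical wrapper that only builds the message; in a real project you would pass `logger_name` to your logging function instead of calling paste0().

```r
# Hypothetical wrapper: use the file that defined the calling function as logger name
log_here <- function(msg) {
  logger_name <- "unknown"
  if (sys.nframe() > 1) {
    caller <- sys.function(-1)             # the function that called log_here()
    file <- utils::getSrcFilename(caller)  # character(0) if no srcref was kept
    if (length(file) == 1) logger_name <- file
  }
  paste0("[", logger_name, "] ", msg)
}

# Write a small script to the temporary directory and source it with srcrefs kept
script <- file.path(tempdir(), "dog.r")
writeLines(c("feedDog <- function() {",
             "  log_here('The dog is happy to eat!')",
             "}"), script)
source(script, keep.source = TRUE)
feedDog()
```

Functions defined without source references, or calls made directly from the top level, fall back to the name "unknown".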
I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (inherits(t.sf, 'try-error'))
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
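The master-script idea can be sketched as a small runner. `run_project()` and the script names are illustrative assumptions; the point is only that the scripts run in a fixed order and the run stops early if one of them is missing.

```r
run_project <- function(scripts, dir = ".") {
  paths <- file.path(dir, scripts)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  env <- new.env(parent = globalenv())
  for (p in paths) {
    message("Running ", p)
    source(p, local = env)  # all scripts share one environment, in order
  }
  invisible(env)
}

# Usage with two throwaway scripts in the temporary directory
proj <- tempfile()
dir.create(proj)
writeLines("x <- 1:10", file.path(proj, "read-data.R"))
writeLines("y <- x * 2", file.path(proj, "clean-data.R"))
result <- run_project(c("read-data.R", "clean-data.R"), dir = proj)
result$y
```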
There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, such as "Workflow of statistical data analysis" by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.