How to test graphical output of functions? - r

I am wondering how to test functions that produce graphics. I have a simple
plotting function img:
img <- function() {
  plot(1:10)
}
In my package I would like to create a unit test for this function using testthat.
Because plot and its friends in base graphics just return NULL, a simple
expect_identical does not work:
library("testthat")
## example for a successful test
expect_identical(plot(1:10), img()) ## equal (as expected)
## example for a test failure
expect_identical(plot(1:10, col="red"), img()) ## DOES NOT FAIL!
# (because both return NULL)
First I thought about plotting into a file and comparing the MD5 checksums to
ensure that the output of the functions is equal:
md5plot <- function(expr) {
  file <- tempfile(fileext = ".pdf")
  on.exit(unlink(file))
  pdf(file)
  expr
  dev.off()
  unname(tools::md5sum(file))
}
## example for a successful test
expect_identical(md5plot(img()),
                 md5plot(plot(1:10))) ## equal (as expected)
## example for a test failure
expect_identical(md5plot(img()),
                 md5plot(plot(1:10, col="red"))) ## not equal (as expected)
That works well on Linux but not on Windows. Surprisingly,
md5plot(plot(1:10)) results in a new md5sum at each call.
Aside from this problem, I need to create a lot of temporary files.
Next I used recordPlot (first create a null device, call the plotting
function, and record its output). This works as expected:
recPlot <- function(expr) {
  pdf(NULL)
  on.exit(dev.off())
  dev.control(displaylist = "enable")
  expr
  recordPlot()
}
## example for a successful test
expect_identical(recPlot(plot(1:10)),
                 recPlot(img())) ## equal (as expected)
## example for a test failure
expect_identical(recPlot(plot(1:10, col="red")),
                 recPlot(img())) ## not equal (as expected)
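For reference, wrapping the recPlot helper in an actual testthat test could look like this (a minimal sketch; the test description is made up):
library("testthat")
test_that("img() draws the same plot as plot(1:10)", {
  expect_identical(recPlot(img()), recPlot(plot(1:10)))
})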
Does anybody know a better way to test the graphical output of functions?
EDIT: regarding the points @josilber raised in his comments.
While the recordPlot approach works well, you have to rewrite the whole plotting function in the unit test. That becomes complicated for complex plotting functions. It would be nice to have an approach that allows storing a file (*.RData, *.pdf, ...) containing an image against which you could compare in future tests. The md5sum approach doesn't work because the md5sums differ across platforms. Via recordPlot you could create an *.RData file, but you cannot rely on its format (from the recordPlot manual page):
The format of recorded plots may change between R versions.
Recorded plots can not be used as a permanent storage format for
R plots.
Maybe it would be possible to store an image file (*.png, *.bmp, etc.), import it and compare it pixel by pixel ...
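A rough sketch of that pixel idea, using the png package to read the rendered files back as arrays (the helper names here are made up, and anti-aliasing differences across platforms may still break exact equality):
library("png")
## render a plotting expression to a temporary PNG and return the pixel array
plot_pixels <- function(expr, width = 480, height = 480) {
  file <- tempfile(fileext = ".png")
  on.exit(unlink(file))
  png(file, width = width, height = height)
  expr
  dev.off()
  readPNG(file) # numeric array: height x width x channels
}
## compare two plots pixel by pixel (a tolerance could be added for anti-aliasing)
identical_pixels <- function(expr1, expr2) {
  isTRUE(all.equal(plot_pixels(expr1), plot_pixels(expr2)))
}
identical_pixels(img(), plot(1:10))             ## TRUE expected
identical_pixels(img(), plot(1:10, col="red"))  ## FALSE expected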
EDIT2: The following code illustrates the desired reference-file approach using svg as output. First, the needed helper functions:
## plot to svg and return file content as character
plot_image <- function(expr) {
  file <- tempfile(fileext = ".svg")
  on.exit(unlink(file))
  svg(file)
  expr
  dev.off()
  readLines(file)
}
## the IDs differ at each `svg` call, that's why we simply remove them
ignore_svg_id <- function(lines) {
  gsub(pattern = "(xlink:href|id)=\"#?([a-z0-9]+)-?(?<![0-9])[0-9]+\"",
       replacement = "\\1=\"\\2\"", x = lines, perl = TRUE)
}
## compare svg character vs reference
expect_image_equal <- function(object, expected, ...) {
  stopifnot(is.character(expected) && file.exists(expected))
  expect_equal(ignore_svg_id(plot_image(object)),
               ignore_svg_id(readLines(expected)), ...)
}
## create reference image
create_reference_image <- function(expr, file) {
  svg(file)
  expr
  dev.off()
}
A test would be:
create_reference_image(img(), "reference.svg")
## create tests
library("testthat")
expect_image_equal(img(), "reference.svg") ## equal (as expected)
expect_image_equal(plot(1:10, col="red"), "reference.svg") ## not equal (as expected)
Sadly this is not working across different platforms. The order (and the names)
of the svg elements differ completely between Linux and Windows.
Similar problems exist for png, jpeg and recordPlot: the resulting files
differ on all platforms.
Currently the only working solution is the recPlot approach above. But for that
I need to rewrite the whole plotting function in my unit tests.
P.S.:
I am completely confused about the different md5sums on Windows. It seems they depend on the creation time of the temporary files:
# on Windows
table(sapply(1:100, function(x)md5plot(plot(1:10))))
#4693c8bcf6b6cb78ce1fc7ca41831353 51e8845fead596c86a3f0ca36495eacb
# 40 60

Mango Solutions have published an open source package, visualTest, that does fuzzy matching of plots, to address this use case.
The package is on github, so install using:
devtools::install_github("MangoTheCat/visualTest")
library(visualTest)
Then use the function getFingerprint() to extract a fingerprint for each plot, and compare using the function isSimilar(), specifying a suitable threshold.
First, create some plots on file:
png(filename = "test1.png")
img()
dev.off()
png(filename = "test2.png")
plot(1:11, col="red")
dev.off()
The fingerprint is a numeric vector:
> getFingerprint(file = "test1.png")
[1] 4 7 4 4 10 4 7 7 4 7 7 4 7 4 5 9 4 7 7 5 6 7 4 7 4 4 10
[28] 4 7 7 4 7 7 4 7 4 3 7 4 4 3 4 4 5 5 4 7 4 7 4 7 7 7 4
[55] 7 7 4 7 4 7 5 6 7 7 4 8 6 4 7 4 7 4 7 7 7 4 4 10 4 7 4
> getFingerprint(file = "test2.png")
[1] 7 7 4 4 17 4 7 4 7 4 7 7 4 5 9 4 7 7 5 6 7 4 7 7 11 4 7
[28] 7 5 6 7 4 7 4 14 4 3 4 7 11 7 4 7 5 6 7 7 4 7 11 7 4 7 5
[55] 6 7 7 4 8 6 4 7 7 4 4 7 7 4 10 11 4 7 7
Compare using isSimilar():
> isSimilar(file = "test2.png",
+ fingerprint = getFingerprint(file = "test1.png"),
+ threshold = 0.1
+ )
[1] FALSE
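To use this inside a testthat suite, the comparison could be wrapped in an expectation (a minimal sketch building on the files created above; the test description is made up):
library("testthat")
test_that("img() still matches the stored reference plot", {
  path <- tempfile(fileext = ".png")
  png(filename = path)
  img()
  dev.off()
  expect_true(isSimilar(file = path,
                        fingerprint = getFingerprint(file = "test1.png"),
                        threshold = 0.1))
})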
You can read more about the package at http://www.mango-solutions.com/wp/products-services/r-services/r-packages/visualtest/

It's worth noting that the vdiffr package also supports comparing plots. A nice feature is that it integrates with the testthat package -- it's actually used for testing in ggplot2 -- and it has an add-in for RStudio to help manage your test suite.
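A minimal sketch of how that integration looks (expect_doppelganger() is vdiffr's documented entry point; the test and snapshot names here are made up):
library("testthat")
library("vdiffr")
test_that("img() renders consistently", {
  ## on the first run this stores an SVG snapshot; on later runs it compares
  ## the new output against it -- base-graphics plots can be passed as a function
  expect_doppelganger("img-base-plot", img)
})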

Related

R - Print list in file and recover list

I have a list that looks like this:
> indices
$`48-168`
$`48-168`$`1`
[1] 1 2 3 4 5 6 7 8 9 10
$`60-180`
$`60-180`$`1`
[1] 1 2 3 4 5 6 7 8 9 10
$`180-300`
$`180-300`$`1`
[1] 1 2
$`180-300`$`4`
[1] 4 5 6 7 8 9 10
$`180-300`$`3`
[1] 3
I want to print it somehow to a file and then recover the same list later.
I thought about printing the object given by unlist(as.relistable(obj)) and using relist later, but then I do not know how to recover the information from the file.
Given that your data is not particularly well structured, you might want to just use save() here, and save the original R list object:
save(indices, file="/path/to/your/file.txt")
When you want to load indices again, use the load() function:
load(file="/path/to/your/file.txt")
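If you specifically want a plain-text file that you can print into and read back, as the question suggests, dput()/dget() would be a possible alternative to save()/load() (a sketch, not part of the original answer; the file name is illustrative):
## write the list out as parseable R code ...
dput(indices, file = "indices.R")
## ... and read it back in as the same nested list
indices_restored <- dget("indices.R")
identical(indices, indices_restored) ## TRUE expected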

How does is.null work on list elements in R? [duplicate]

I found a very surprising and unpleasant feature of R - it completes list item names!!! See the following code:
a <- list(cov_spring = "spring")
a$cov <- c()
a$cov
# spring ## I expect it to be empty!!! I've set it empty!
a$co
# spring
a$c
I don't know what to do with that.... I need to be able to set $cov to NULL and have $cov_spring there at the same time!!! And use $cov separately!! This is annoying!
My question:
What is going on here? How is this possible, what is the logic behind?
Is there some easy fix, how to turn this completion off? I need to use list items cov_spring and cov independently as if they are normal variables. No damn completion please!!!
From help("$"):
'x$name' is equivalent to 'x[["name", exact = FALSE]]'
When you scroll back and read up on exact=:
exact: Controls possible partial matching of '[[' when extracting by
a character vector (for most objects, but see under
'Environments'). The default is no partial matching. Value
'NA' allows partial matching but issues a warning when it
occurs. Value 'FALSE' allows partial matching without any
warning.
So this provides you partial matching capability in both $ and [[ indexing:
mtcars$cy
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[["cy"]]
# NULL
mtcars[["cy", exact=FALSE]]
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
There is no way I can see to disable the exact=FALSE default for $ (unless you want to mess with formals, which I do not recommend for the sake of reproducibility and consistent behavior).
Programmatic use of frames and lists (for defensive purposes) should prefer [[ over $ for precisely this reason. (It's rare, but I have been bitten by this permissive behavior.)
Edit:
For clarity on that last point:
mtcars$cyl becomes mtcars[["cyl"]]
mtcars$cyl[1:3] becomes mtcars[["cyl"]][1:3]
mtcars[,"cy"] is not a problem, nor is mtcars[1:3,"cy"]
You can use [ or [[ instead.
a["cov"] will return a list with a NULL element.
a[["cov"]] will return the NULL element directly.


Avoid memory increase in foreach loop in R

I am trying to create summary statistics combining two different spatial data sets: a big raster file and a polygon file. The idea is to get summary statistics of the raster values within each polygon.
Since the raster is too big to process at once, I try to create subtasks and process them in parallel, i.e. process each polygon from the SpatialPolygonsDataFrame one at a time.
The code works fine, however after around 100 iterations I run into memory problems. Here is my code and what I intend to do:
# session setup
library("raster")
library("rgdal")
# multicore processing
library("foreach")
library("doSNOW")
# assign three clusters to be used for the current R session
cluster <- makeCluster(3, type = "SOCK", outfile = "")
registerDoSNOW(cluster)
getDoParWorkers() # check if it worked
# load base data
r.terra.2008 <- raster("~/terra.tif")
spodf.malha.2007 <- readOGR("~/", "composed")
# bring both data sets to a common CRS
proj4string(r.terra.2008)
proj4string(spodf.malha.2007)
spodf.malha.2007 <- spTransform(spodf.malha.2007, CRSobj = CRS(projargs = proj4string(r.terra.2008)))
proj4string(r.terra.2008) == proj4string(spodf.malha.2007) # should be TRUE
# create a function to extract areas
function.landcover.sum <- function(r.landuse, spodf.pol) {
  return(table(extract(r.landuse, spodf.pol)))
}
# apply it to one subset to see if it is working
function.landcover.sum(r.terra.2008, spodf.malha.2007[1, ])
## parallel loop
# define package(s) to be used in the parallel loop
l.packages <- c("raster", "sp")
# try a parallel loop for the first 6 polygons
l.results <- foreach(i = 1:6,
                     .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ", i, ".", sep = ""))
  return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i, ]))
}
Here the output is a list that looks like this:
l.results
[[1]]
9 10
193159 2567
[[2]]
7 9 10 12 14 16
17 256 1084 494 67 15
[[3]]
3 5 6 7 9 10 11 12
2199 1327 8840 8579 194437 1061 1073 1834
14 16
222 1395
[[4]]
3 6 7 9 10 12 16
287 102 728 329057 1004 1057 31
[[5]]
3 5 6 7 9 12 16
21 6 20 495 184261 4765 28
[[6]]
6 7 9 10 12 14
161 161 386 943 205 1515
So the result is rather small and should not be the source of the memory allocation problem. But the following loop over the whole polygon dataset, which has >32,000 rows, creates memory allocation that exceeds 8 GB after around 100 iterations.
# apply the parallel loop to the whole dataset
l.results <- foreach(i = 1:nrow(spodf.malha.2007),
                     .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ", i, ".", sep = ""))
  return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i, ]))
  # gc(reset=TRUE) # does not resolve the problem
  # closeAllConnections() # does not resolve the problem
}
What am I doing wrong?
Edit:
I tried (as suggested in the comments) to remove the object after each iteration in the internal loop, but it did not resolve the problem. I furthermore tried to rule out problems from repeated data imports by exporting the objects to the cluster environment up front:
clusterExport(cl = cluster,
              varlist = c("r.terra.2008", "function.landcover.sum", "spodf.malha.2007"))
without major changes. My R version is 3.4 on a Linux platform, so supposedly the patch from the link in the first comment should already be included in this version. I also tried the parallel package, as suggested in the first comment, but no differences appeared.
You can try exact_extract from the exactextractr package. It is the fastest and most memory-safe function to extract values from a raster. The main function is implemented in C++ and usually does not need parallelization. Since you do not provide any example data, I post an example with real data:
library(raster)
library(sf)
library(exactextractr)
# Pull municipal boundaries for Brazil
brazil <- st_as_sf(getData('GADM', country='BRA', level=2))
# Pull gridded precipitation data
prec <- getData('worldclim', var='prec', res=10)
# transform precipitation data into a dummy land use map
lu <- prec[[1]]
values(lu) <- sample(1:10, ncell(lu), replace = TRUE)
plot(lu)
# extract land use class for each pixel inside each polygon
ex <- exact_extract(lu, brazil)
# apply table to the resulting list. Here I use just the first 5 elements to avoid long output
lapply(ex[1:5], function(x) {
  # use x[,1] because by default exact_extract returns the coverage fraction
  # of each pixel by each polygon in the second column
  table(x[, 1])
})
Here is the example output:
[[1]]
1 2 4 6 7 9 10
1 1 1 2 3 1 1
[[2]]
2 3 4 5 6 7 8 10
2 4 3 2 1 2 2 2
[[3]]
1 2 4 6 7 8 9 10
4 5 1 1 4 2 5 5
[[4]]
1 2 3 4 5 6 7 8 9 10
2 2 4 2 2 4 1 4 1 2
[[5]]
3 4 5 6 8 10
2 3 1 1 2 3

Is R able to compute contingency tables on big file without putting the whole file in RAM?

Let me explain the question:
I know the functions table or xtabs compute contingency tables, but they expect a data.frame, which is always stored in RAM. It's really painful when trying to do this on a big file (say 20 GB, the maximum I have to tackle).
On the other hand, SAS is perfectly able to do this, because it reads the file line by line, and updates the result in the process. Hence there is ever only one line in RAM, which is much more acceptable.
I have done the same as SAS with ad-hoc Python programs on occasion, when I had to do more complicated stuff that either I didn't know how to do in SAS or thought was too cumbersome. Python syntax and integrated features (dictionaries, regular expressions...) compensate for its weaknesses (speed, mainly, but when reading 20 GB, speed is limited by the hard drive anyway).
My question, then: I would like to know if there are packages to do this in R. I know it's possible to read a file line by line, like I do in Python, but computing simple statistics (contingency tables for instance) on a big file is such a basic task that I feel there should be some more or less "integrated" feature to do it in a statistical package.
Please tell me if this question should be asked on "Cross Validated". I had a doubt, since it's more about software than statistics.
You can use the ff package for this, which uses the hard disk drive instead of RAM, but it is implemented in a way that doesn't make it (significantly) slower than the normal way R uses RAM.
This is from the package description:
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory.
I think this will solve your problem of loading a 20GB file in RAM. I have used it myself for such purposes and it worked great.
See a small example below as well, based on the example in the xtabs documentation:
Base R
#example from ?xtabs
d.ergo <- data.frame(Type = paste0("T", rep(1:4, 9*4)),
                     Subj = gl(9, 4, 36*4))
> print(xtabs(~ Type + Subj, data = d.ergo)) # 4 replicates each
Subj
Type 1 2 3 4 5 6 7 8 9
T1 4 4 4 4 4 4 4 4 4
T2 4 4 4 4 4 4 4 4 4
T3 4 4 4 4 4 4 4 4 4
T4 4 4 4 4 4 4 4 4 4
ff package
# convert to ff (requires the ff package)
library("ff")
d.ergoff <- as.ffdf(d.ergo)
> print(xtabs(~ Type + Subj, data = d.ergoff)) # 4 replicates each
Subj
Type 1 2 3 4 5 6 7 8 9
T1 4 4 4 4 4 4 4 4 4
T2 4 4 4 4 4 4 4 4 4
T3 4 4 4 4 4 4 4 4 4
T4 4 4 4 4 4 4 4 4 4
You can check here for more information on memory manipulation.
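For the strictly line-by-line (or chunk-by-chunk) updating that the question describes, a base-R sketch without ff could look like the following; it is not part of the ff answer and assumes a simple comma-separated file with a header and two categorical columns named x and y (all names here are illustrative):
## read the file in chunks and accumulate counts, so only one chunk is in RAM
count_chunked <- function(path, chunk_size = 100000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  pieces <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
      error = function(e) NULL) # read.csv errors once no lines remain
    if (is.null(chunk) || nrow(chunk) == 0) break
    ## count this chunk and keep only the (small) count table
    pieces[[length(pieces) + 1]] <- as.data.frame(table(x = chunk$x, y = chunk$y))
  }
  ## sum the per-chunk counts into one contingency table
  xtabs(Freq ~ x + y, data = do.call(rbind, pieces))
}
## usage (file name illustrative):
## tab <- count_chunked("big_file.csv")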
