Run a code snippet before and after tests with testthat - r

I have a set of R functions which get called from a Ruby application, and I would like to test them. The R functions access data in a Postgres database, so to write tests I need to populate a test database with some sample data.
I would like to avoid cleaning up the database within every test case. It would be nice if I could start a transaction before every test and roll the transaction back afterwards.
The last paragraph of the "Writing Tests" section of this page http://r-pkgs.had.co.nz/tests.html makes me think that it's not possible to execute a block of code before every test case.
Does anyone know of any creative workarounds? I'm considering forking the project and adding the functionality, but wanted to make sure I'm not reinventing the wheel.

For anyone who stumbles upon this: I ended up creating a wrapper function for test_that(), which I use whenever a test gets data from Postgres. Below is my solution:
db.test_that <- function(description, test_code) {
  # Open a transaction so nothing the test writes is ever committed
  dbGetQuery(database_connection, 'BEGIN TRANSACTION')
  # Run the testthat block as usual; test_code is evaluated lazily here
  test_that(description, test_code)
  # Undo everything the test did to the database
  dbRollback(database_connection)
}
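For reference, a call to the wrapper might look like the following sketch. The fixture helper insert_sample_rows() and the function under test monthly_totals() are made-up names for illustration only; expect_equal() is the usual testthat expectation, and database_connection is assumed to be the open Postgres connection used above.
db.test_that("monthly_totals() sees the rows inserted in the transaction", {
  insert_sample_rows(database_connection)        # hypothetical fixture helper
  result <- monthly_totals(database_connection)  # hypothetical function under test
  expect_equal(nrow(result), 12)
})
Because the whole block runs inside the transaction opened by the wrapper, the inserted rows disappear again when dbRollback() is called.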

Related

Version control of big data tables (Iceberg)

I'm building Iceberg tables on top of a data lake. These tables are used by reporting tools. I'm trying to figure out the best way to version-control and deploy changes to these tables in a CI/CD process. For example, I would like to add a column to an Iceberg table. To do that I have to write an ALTER TABLE statement, save it to the git repository, and deploy it via the CI/CD pipeline. The tables are accessible via the AWS Glue Catalog.
I couldn't find much info about this on Google, so if anyone could share some knowledge, it would be much appreciated.
Cheers.
I agree with @Fokko Driesprong; this is only a supplement.
Sometimes table changes are treated as part of a task's version change; that is, the table change statements (ALTER TABLE) are bound to task upgrades.
Tasks are sometimes deployed automatically, so the pipeline often executes the table change statement first and then deploys the new task. If the change is a breaking one, we need to stop the old task first and then deploy the new one.
Alongside each upgrade we also keep a rollback script which, of course, contains the corresponding table change statement to undo it.
Thanks for asking this question. I don't think there is a definitive way of doing this. In practice I see most people bundling this as part of the job that writes to the Iceberg table. This way you can make sure that new columns are populated right away by the new version of the job. If you don't make any breaking changes (such as deleting a column), the downstream jobs won't break. Hope this helps!

Is there any way to execute repeatable flyway scripts first?

We have used Flyway for years to maintain our DB scripts, and it does a wonderful job.
However, there is one situation where I am not really happy; perhaps someone out there has a solution:
In order to reduce the number of scripts required (and also to keep an overview of where our procedures are defined), I'd like to implement our functions/procedures in one script. Every time a procedure changes (or a new one is developed) this script gets updated. Repeatable scripts sound perfect for this purpose, but unfortunately they are not.
The drawback is that a new procedure cannot be used by versioned (non-repeatable) scripts, because repeatable scripts are executed last, so the procedure does not yet exist when the versioned script runs.
I hoped I could control this by specifying different locations (e.g. loc_first containing the repeatables I want executed first, and loc_normal for the standard scripts and the repeatables to be executed last).
Unfortunately, the order of locations has no impact on execution order. :-(
What's the proper way to deal with this situation? Right now I have to create the corresponding procedures in versioned scripts, but that's exactly what I'd like to avoid...
I found a workaround on my own: I'm using Flyway directly with Maven (the same would work if you use the API, of course). Each stage of my Maven build has its own profile (specifying the URL etc.).
Now I create two profiles for every stage, so I have e.g. dev and devProcs.
The difference between these two Maven profiles is that the "[stage]Procs" profile operates on a different location (where only the repeatable scripts maintaining procedures are kept). Then I need to execute Flyway twice: first with [stage]Procs, then with [stage].
To me this looks a bit messy, but at least I can maintain my procedures in a repeatable script this way.
According to the Flyway docs, repeatable migrations always execute after versioned migrations.
But I guess you can use Flyway callbacks. It looks like the beforeMigrate.sql callback is exactly what you need.

Bokeh and Joblib don't play together

I have a Bokeh script which loads its data through a function wrapped with joblib's @memory.cache decorator. When I run it as a plain Python script, the get_data function is fast (cached). When I run it with bokeh serve --show code.py, the cache seems to be lost and the function is re-evaluated, making data retrieval slow. How can I make Bokeh work nicely with joblib?
It's hard to say for certain without being able to run an example that reproduces what you are seeing, but my guess is that it has something to do with the way the Bokeh server code runner executes the app script on every session.
So, I can think of a few possible things to try.
First, as of 0.12.4 there are examples and guidance for embedding a Bokeh server as a library, e.g. in a standalone Python script, or in a Flask or Tornado app. The examples there all also use FunctionHandler, which does not exec the app script. My hunch is that this is closer to the standard single-process/single-namespace Python execution model, and will play better with your joblib decorator.
(If you try this route and it works, please let us know somehow; it's probably worth documenting better.)
Otherwise, another option that might work better is to use lifecycle hooks to provide your wrapped function in a way that is sure to be shared across sessions. You can see this technique in the spectrogram example (cf. audio.py).
Finally, just some gentle advice for SO: including a minimal example that reproduces the problem greatly increases the odds of getting working code back in an answer. For instance, if there were example code here that I could try to run, I'd be able to post complete working code in the answer.

How do I show a message when an SQL query is running in R?

I'm working on some code in R that makes heavy use of the RPostgreSQL package. Some of the queries I'm using take quite a while to return, so I would like to periodically show a message to the user (something like "Download in progress, please be patient...") so that it is clear the program hasn't crashed.
Ideally, I'd like to write this as an all-purpose wrapper for functions, so that it gives feedback to the user along the way (download in progress, time elapsed, etc.).
For example, if we have an example function:
x <- runif(10)
I would like to make a wrapper of the form
some_wrapper_function(x <- runif(10))
where some_wrapper_function gives periodic updates while the wrapped code is running.
It seems like this requires parallel coding, but with the hitch that the clusters need to talk to each other at least once.
Any thoughts? Or is there an existing function that does this?
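One minimal sketch of such a wrapper (an illustration only, not from the original thread): it prints a message before the wrapped expression is evaluated and reports the elapsed time afterwards. Truly periodic updates while dbGetQuery() is blocking would indeed need a second process (e.g. via the parallel or callr packages), because a single R session cannot print while it is waiting on the query.
some_wrapper_function <- function(expr) {
  message("Download in progress, please be patient...")
  started <- Sys.time()
  result <- expr  # lazy evaluation: the wrapped expression runs here
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  message(sprintf("Done in %.1f seconds.", elapsed))
  invisible(result)
}
x <- some_wrapper_function(runif(10))
The call works because R only forces the expr argument when it is first used, so the "in progress" message appears before the wrapped code starts running.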

I'm working with Drupal and I want to diff two databases to see which tables have been updated in a given session. How do I do this?

I'm working with Drupal on a project, trying to find a way to speed up tests (we're using Cucumber and Selenium), and I'm trying to see which tables have been changed in a given series of steps, so I can just dump and reset those tables between test cases.
Right now Simpletest, the Drupal testing framework, works by installing and setting up the tables for every module needed for a test, which makes for slow tests, and I'm emulating a similar approach by loading a DB dump for each test.
Given that a site, if you're doing integration testing, has a 'known good' state to start from, I think it would be faster to just revert back to that point each time, instead of waiting twenty seconds or so to drop the database and then pipe the dump file back in between test runs.
However, when I try diffing two dump files (i.e. before.I.create.a.node.sql and after.I.create.a.node.sql), the output is an unreadable load of serialised PHP that I can't make sense of.
Are there any tools I can use to help work out which tables I need to drop and rebuild between test cases, so I don't incur the 20-second hit on each test, short of reading the schema and code of every module I'm working with?
I'm following the ideas outlined here for getting Cucumber to work with PHP, and yes, I have seen this question here on a similar subject.
Thanks!
Drupal does store a lot of serialized PHP in the database, but the bulk of it is kept in the cache tables (cache, cache_field, cache_menu, etc.), and you can safely truncate these before dumping the database.
If you have any simpletest tables you could drop those too. They are all temporary and are used only for running your Simpletest test suite.
That should reduce the dump size a lot. If it's not enough, I can recommend reading up on the tables in the book Pro Drupal Development, or you could skim through the .install files to read the modules' schema definitions. Most of what remains will probably be real data you'd want to revert between tests anyway.
Because of the relational nature of the database, be sure to either know exactly what you're doing or dump/revert all the remaining tables together.
