How can I check that new data extracts have the same structure? - r

I work for a research consortium with a web-based data management system that's managed by another agency. I can download the underlying data from that system as a collection of CSV files. Using R and knitr, I've built a moderately complex reporting system on top of those files. But every so often, the other agency changes the format of the data extracts and blows up my reports (or worse, changes it in a subtle yet nefarious way that I don't notice for weeks).
They'll probably never notify me when these things happen, so I suppose I should be testing more. I'd like to start by testing that those CSV files have the same structure each time (but allowing different numbers of rows as we collect more data). What's the best way to do that? R is my preferred tool but I'm interested in hearing about others that are free and on Windows.

If your files are just CSVs, here is an example (assuming you keep a reference file around):
reference.file <- read.csv("ref.csv")
new.file <- read.csv("new.file")
struct.extract <- function(df) {
list(
vapply(df, class, character(1L)),
attributes(df)[names(attributes(df)) != "row.names"]
)
}
identical(struct.extract(reference.file), struct.extract(new.file))
This compares attributes of the data frame, as well as classes of the columns. If you need to get more detailed on column format you can extend this easily. This assumes the reports are not changing # of rows (or columns), but if that's the case, that should be easy to modify as well.

Related

File organization - how to handle different combinations of filters on one data.frame efficiently?

I currently do a lot of descriptive analysis in R. I always work with a data.table like df
net <- seq(1,20,by=2)
gross <- seq(2,20,by=2)
color <- c("green", "blue", "white")
height <- c(170,172,180,188)
library(data.table)
df <- data.table(net,gross,color,height)
In order to obtain results, I do apply a lot of filters.
Sometimes I use one filter, sometimes I use a combination of multiple filters, e.g.:
df[color=="green" & height>175]
In my real data.table, I have 7 columns and all kind of filter-combinations.
Since I always address the same data.table, I'd like to find the most efficient way to filter the data.
So far, my files are organized like this (bottom-up):
execution level: multiple R-scripts with a very specific job (no interaction between them) that calculate and write the results to an excel file using XL Connect
source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is necessary in case I add/remove files on the execution level.
filter files: read the data.table and apply one or multiple filters, as shown above with df_green_high. By filtering, filter files create a
new data.table and source the "source file" with this new filtered table.
I am currently challenged, since I have too many filter files. Having 7 variables, there is such a large number of combinations of filter, so I'll get lost sooner or later.
How can I do my analysis more efficient (reduce the number of "filter files"?)
How can I conveniently name the exported files according to the filters used?
I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case, I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On SOF, I also read about creating a package, but I am not sure if this reasonable.
I usually do it like this:
create a list called say "my_case_list"
filter data, do computation on the filtered data
add a column called "case" to each filtered dataset. Fill this column with some string i.e. "case 1: color=="green" & height>175"
put this data to my_case_list
convert list to data.frame like object
export results to sql server
import results from sql server to Excel Pivot table
make sense of results
Automate the process as much as possible.

Importing and accessing large data files in Shiny

I have an app where I want to pull out values from a lookup table based on user inputs. The reference table is a statistical test, based on a calculation that'd be too slow to do for all the different combinations of user inputs. Hence, a lookup table for all the possibilities.
But... right now the table is about 60 MB (as .Rdata) or 214 MB (as .csv), and it'll get much larger if I expand the possible user inputs. I've already reduced the number of significant figures in the data (to 3) and removed the row/column names.
Obviously, I can preload the lookup table outside the reactive server function, but it'll still take a decent chunk of time to load in that data. Does anyone have any tips on dealing with large amounts of data in Shiny? Thanks!
flaneuse, we are still working with a smaller set that you but we have been experimenting with:
Use rds for our data
As #jazzurro mentioned rds above, and you seem to know how to do this, but the syntax for others is below.
Format .rds allows you to bring in a single R object so you can rename it if needs be.
In your prep data code, for example:
mystorefile <- file.path("/my/path","data.rds")
# ... do data stuff
# Save down (assuming mydata holds your data frame or table)
saveRDS(mydata, file = mystorefile)
In your shiny code:
# Load in my data
x <- readRDS(mystorefile)
Remember to copy your data .rds file into your app directory when you deploy. We use a data directory /myapp/data and then file.path for store file is changed to "./data" in our shiny code.
global.R
We have placed our readRDS calls to load in our data in this global file (instead of in server.R before shinyServer() call), so that is run once, and is available for all sessions, with the added bonus it can be seen by ui.R.
See this scoping explanation for R Shiny.
Slice and dice upfront
The standard daily reports use the most recent data. So I make a small latest.dt in my global.R of a smaller subset of my data. So the landing page with the latest charts work with this smaller data set to get faster charts.
The custom data tab which uses the full.dt then is on a separate tab. It is slower but at that stage the user is more patient, and is thinking of what dates and other parameters to choose.
This subset idea may help you.
Would be interested in what others (with more demanding data sets have tried)!

Strategies for repeating large chunk of analysis

I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.
The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.
I considered:
Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or parent environments. This additional effort seems excessive, makes it harder to debug and may introduce side-effects.
Wrap it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
Creating some pre-amble code, wrapping the analysis in a separate file and source it. This works, but seems very ugly and sub-optimal.
The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.
What is a good strategy for dealing with this type of problem?
Making code reusable takes some time, effort and holds a few extra challenges like you mention yourself.
The question whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually.
The answer, I believe, is highly personal and even then, different case by case. If you are easy on the programming, you may sooner decide to go the reuse route, as the effort for you will be relatively low (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).
That said, in your particular case: I'd go with the sourcing option: since you plan to reuse the code only 2 times more, a greater effort would probably go wasted (you indicate the analysis to be rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
If it turns out in a year or so that the reuse is higher than expected, you can then still invest. And by that time, you will also have (at least) three cases for which you can compare the results from the rewritten and funky reusable version of your code with your current results.
If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.
If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.
For instance, JSON works very well for this and the RJSONIO and rjson packages allow you to load the file into a list. Suppose you load it into a list called parametersNN.json. An example is as follows:
{
"Version": "20110701a",
"Initialization":
{
"indices": [1,2,3,4,5,6,7,8,9,10],
"step_size": 0.05
},
"Stopping":
{
"tolerance": 0.01,
"iterations": 100
}
}
Save that as "parameters01.json" and load as:
library(RJSONIO)
Params <- fromJSON("parameters.json")
and you're off and running. (NB: I like to use unique version #s within my parameters files, just so that I can identify the set later, if I'm looking at the "parameters" list within R.) Just call your script and point to the parameters file, e.g.:
Rscript --vanilla MyScript.R parameters01.json
then, within the program, identify the parameters file from the commandArgs() function.
Later, you can break out code into functions and packages, but this is probably the easiest way to make a vanilla script generalizeable in the short term, and it's a good practice for the long-term, as code should be separated from the specification of run/dataset/experiment-dependent parameters.
Edit: to be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?
Update:
Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
I realize this answer is overkill for the original question - how to repeat work only a couple of times, with a few parameter changes, but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)
I like to work with combination of a little shell script, a pdf cropping program and Sweave in those cases. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :) . I have a separate file for the data juggling and separate files for different types of analysis, such as descriptiveStats.R, regressions.R for example.
btw here's my little shell script,
#!/bin/sh
R CMD Sweave docSweave.Rnw
for file in `ls pdfs`;
do pdfcrop pdfs/"$file" pdfs/"$file"
done
pdflatex docSweave.tex
open docSweave.pdf
The Sweave file typically sources the R files mentioned above when needed. I am not sure whether that's what you looking for, but that's my strategy so far. I at least I believe creating transparent, reproducible reports is what helps to follow at least A strategy.
Your third option is not so bad. I do this in many cases. You can build a bit more structure by putting the results of your pre-ample code in environments and attach the one you want to use for further analysis.
An example:
setup1 <- local({
x <- rnorm(50, mean=2.0)
y <- rnorm(50, mean=1.0)
environment()
# ...
})
setup2 <- local({
x <- rnorm(50, mean=1.8)
y <- rnorm(50, mean=1.5)
environment()
# ...
})
attach(setup1) and run/source your analysis code
plot(x, y)
t.test(x, y, paired = T, var.equal = T)
...
When finished, detach(setup1) and attach the second one.
Now, at least you can easily switch between setups. Helped me a few times.
I tend to push such results into a global list.
I use Common Lisp but then R isn't so different.
Too late for you here, but I use Sweave a lot, and most probably I'd have used a Sweave file from the beginning (e.g. if I know that the final product needs to be some kind of report).
For repeating parts of the analysis a second and third time, there are then two options:
if the results are rather "independent" (i.e. should produce 3 reports, comparison means the reports are inspected side by side), and the changed input comes in the form of new data files, that goes into its own directory together with a copy of the Sweave file, and I create separate reports (similar to source, but feels more natural for Sweave than for plain source).
if I rather need to do the exactly same thing once or twice again inside one Sweave file I'd consider reusing code chunks. This is similar to the ugly for-loop.
The reason is that then of course the results are together for the comparison, which would then be the last part of the report.
If it is clear from the beginning that there will be some parameter sets and a comparison, I write the code in a way that as soon as I'm fine with each part of the analysis it is wrapped into a function (i.e. I'm acutally writing the function in the editor window, but evaluate the lines directly in the workspace while writing the function).
Given that you are in the described situation, I agree with Nick - nothing wrong with source and everything else means much more effort now that you have it already as script.
I can't make a comment on Iterator's answer so I have to post it here. I really like his answer so I made a short script for creating the parameters and exporting them to external JSON files. And I hope someone finds this useful: https://github.com/kiribatu/Kiribatu-R-Toolkit/blob/master/docs/parameter_configuration.md

Undo command in R

I can't find something to the effect of an undo command in R (neither on An Introduction to R nor in R in a Nutshell). I am particularly interested in undoing/deleting when dealing with interactive graphs.
What approaches do you suggest?
You should consider a different approach which leads to reproducible work:
Pick an editor you like and which has R support
Write your code in 'snippets', ie short files for functions, and then use the facilities of the editor / R integration to send the code to the R interpreter
If you make a mistake, re-edit your snippet and run it again
You will always have a log of what you did
All this works tremendously well in ESS which is why many experienced R users like this environment. But editors are a subjective and personal choice; other people like Eclipse with StatET better. There are other solutions for Mac OS X and Windows too, and all this has been discussed countless times before here on SO and on other places like the R lists.
In general I do adopt Dirk's strategy. You should aim for your code to be a completely reproducible record of how you have transformed your raw data into output.
However, if you have complex code it can take a long time to re-run it all. I've had code that takes over 30 minutes to process the data (i.e., import, transform, merge, etc.).
In these cases, a single data-destroying line of code would require me to wait 30 minutes to restore my workspace.
By data destroying code I mean things like:
x <- merge(x, y)
df$x <- df$x^2
e.g., merges, replacing an existing variable with a transformation, removing rows or columns, and so on. In these cases, it's easy, especially when first learning R to make a mistake.
To avoid having to wait this 30 minutes, I adopt several strategies:
If I'm about to do something where there's a risk of destroying my active objects, I'll first copy the result into a temporary object. I'll then check that it worked with the temporary object and then rerun replacing it with the proper object.
E.g., first run temp <- merge(x, y); check that it worked str(temp); head(temp); tail(temp) and if everything looks good x <- merge(x, y)
As is common in psychological research, I often have large data frames with hundreds of variables and different subsets of cases. For a given analysis (e.g., a table, a figure, some results text), I'll often extract just the subset of cases and variables that I need into a separate object for the analysis and work with that object when preparing and finalising my analysis code. That way, I'm less likely to accidentally damage my main data frame. This assumes that the results of the analysis does not need to be fed back into the main data frame.
If I have finished performing a large number of complex data transformations, I may save a copy of the core workspace objects. E.g., save(x, y, z , file = 'backup.Rdata') That way, If I make a mistake, I only have to reload these objects.
df$x <- NULL is a handy way of removing a variable in a data frame that you did not want to create
However, in the end I still run all the code from scratch to check that the result is reproducible.

Efficiency of operations on R data structures

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.
For example:
I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.
The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
Also, I know the main R performance hint is to use vectored operations whenever possible as opposed to loops.
what about the various flavors of apply?
are those just hidden loops?
what about matrices vs. data frames?
Data IO was one of the features i looked into before i committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:
1. That R doesn't handle big data (>2 GB?) To me this is a misnomer. By default, the common data input functions load your data into RAM. Not to be glib, but to me, this is a feature not a bug--anytime my data will fit in my available RAM, that's where i want it. Likewise, one of SQLite's most popular features is the in-memory option--the user has the easy option of loading the entire dB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it, via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via systems that current technology/practices (for instance, i can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
2. The performance of read.table (read.csv, read.delim, et al.), the most common means for getting data into R, can be improved 5x (and often much more in my experience) just by opting out of a few of read.table's default arguments--the ones having the greatest effect on performance are mentioned in the R's Help (?read.table). Briefly, the R Developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.
Here are the snippets i use for those parameters:
To get the number of rows in your data file (supply this snippet as an argument to the parameter, 'nrows', in your call to read.table):
as.numeric((gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep=""), intern=T))))
To get the classes for each column:
function(fname){sapply(read.table(fname, header=T, nrows=5), class)}
Note: You can't pass this snippet in as an argument, you have to call it first, then pass in the value returned--in other words, call the function, bind the returned value to a variable, and then pass in the variable as the value to to the parameter 'colClasses' in your call to read.table:
3. Using Scan. With only a little more hassle, you can do better than that (optimizing 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2,....)).
4. Use R's Containers for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file '.RData' is a binary format that a little smaller than a compressed ('.gz') txt data file. You create them using save(, ). You load it back into the R namespace with load(). The difference in load times compared with 'read.table' is dramatic. For instance, w/ a 25 MB file (uncompressed size)
system.time(read.table("tdata01.txt.gz", sep=","))
=> user system elapsed
6.173 0.245 **6.450**
system.time(load("tdata01.RData"))
=> user system elapsed
0.912 0.006 **0.912**
5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful in getting data out of R. The key point to keep in mind here is that by default, numbers in R expressions are interpreted as double-precision floating point, e.g., > typeof(5) returns "double." Compare the object size of a reasonable-sized array of each and you can see the significance (use object.size()). So coerce to integer when you can.
Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C--big difference performance-wise. [edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.
These things do pop up on the lists, in particular on r-devel. One fairly well-established nugget is that e.g. matrix operations tend to be faster than data.frame operations. Then there are add-on packages that do well -- Matt's data.table package is pretty fast, and Jeff has gotten xts indexing to be quick.
But it "all depends" -- so you are usually best adviced to profile on your particular code. R has plenty of profiling support, so you should use it. My Intro to HPC with R tutorials have a number of profiling examples.
I will try to come back and provide more detail. If you have any question about the efficiency of one operation over another, you would do best to profile your own code (as Dirk suggests). The system.time() function is the easiest way to do this although there are many more advanced utilities (e.g. Rprof, as documented here).
A quick response for the second part of your question:
What about the various flavors of apply? Are those just hidden loops?
For the most part yes, the apply functions are just loops and can be slower than for statements. Their chief benefit is clearer code. The main exception that I have found is lapply which can be faster because it is coded in C directly.
And what about matrices vs. data frames?
Matrices are more efficient than data frames because they require less memory for storage. This is because data frames require additional attribute data. From R Introduction:
A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes

Resources