Can a single R/Markdown file contain all exercises for R/exams? - r-exams

Is it possible to have all exercises of an exam in the same R/Markdown file for R/exams? From what I have read, each question has to be in its own file. Sure, I could use R to split an R/Markdown file with several questions into files each containing a single question, but wouldn't it be easier to have exams2xyz do that?

It is a fundamental design principle in R/exams that 1 exercise = 1 file (either .Rmd or .Rnw).
At first this might seem a bit overcomplicated, especially if you are used to putting together written exams from static exercises in a particular order. Users who work that way often apply some manual randomization by including small variations when they re-use exercises from old exams.
In contrast, R/exams is designed for a different workflow. The goal is to write exercises that already include some level of randomization so that they can be re-used without modification in different contexts, e.g., in different combinations of exercises in written exams or as one of many similar exercises in an item pool in a learning management system etc.
Therefore, requiring that every exercise is a separate file facilitates:
Building a large pool of independent exercises.
Working on the exercise pool collaboratively, e.g., in a version control system.
Thorough stress testing of dynamic exercises to make sure that they work in many random variations.
Re-using the same exercise file in different contexts.
If you feel that this does not suit your workflow well, that's fair enough. In that case I would choose some separator line between the different exercises in the same file. Then you can write a small function that splits the file at these separators and writes the individual exercises to files in a temporary directory (see the sketch below). Feel free to ask for feedback if you're struggling with that. But before you do, I would encourage you to try working with separate files.
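A minimal sketch of such a helper function, assuming the exercises in the combined file are separated by lines starting with a marker like "%% ----" (both the marker and the file names are arbitrary choices, not an R/exams convention):

    ## Split a multi-exercise .Rmd into one temporary file per exercise.
    split_exercises <- function(file, sep = "^%% ----", dir = tempdir()) {
      lines <- readLines(file)
      keep <- !grepl(sep, lines)               # drop the separator lines themselves
      idx <- cumsum(!keep)[keep] + 1L          # exercise number for each kept line
      parts <- split(lines[keep], idx)
      out <- file.path(dir, sprintf("exercise%02d.Rmd", seq_along(parts)))
      for (i in seq_along(parts)) writeLines(parts[[i]], out[i])
      out                                      # file paths to pass on to exams2xyz()
    }

    ## Usage (hypothetical input file name):
    ## ex_files <- split_exercises("all_exercises.Rmd")
    ## exams2pdf(ex_files)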

Related

Autograding of NOPS exams: Implementation & extension to string questions?

We are using R/exams to create tests in Canvas and TestVision.
We have other forms and other software to perform written exams.
I know R/exams has a great NOPS feature and was wondering:
What software is used to autograde the NOPS forms?
Can that software also evaluate string questions?
At the moment it looks like the NOPS form doesn't make it easy for software to read certain parts. Ideally the software would be adapted so that modified NOPS forms (changes marked in blue) would allow the student name and string questions to be read more easily:
NOPS format
The NOPS forms have not been designed by us but they follow the format that our university has been using. We simply mimicked their format because we initially just generated the PDF files ourselves but used the commercial scanning software of our university.
Scanning
However, over the years we have written our own scanner implementation in R in exams::nops_scan(). The basic approach is to convert the PDF pages to PNG images, read these into R, convert them to black-and-white pixel matrices, find the scanner markings in the corners, and then extract just the boxes relative to these markings. The boxes either contain printed digits in a fixed font, for which a simple decision tree yields a reliable classification, or they are empty/filled vs. checked, which can also be classified reasonably reliably. The result is stored in a simple text format that was, again, not developed by us but chosen to be fully compatible with the commercial system our university used.
Grading
Based on the scan results the function exams::nops_eval() computes points and grades. Various evaluation strategies can be plugged in and starting from version 2.4-0 the reports generated by the function can be customized.
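In practice the two steps boil down to something like this (a rough sketch: the file names are placeholders and the exact arguments and defaults should be checked against ?nops_scan and ?nops_eval in the installed exams version):

    library("exams")

    ## Step 1: scan the PDF/PNG images of the filled-in answer sheets in the
    ## current working directory; the results are collected in a ZIP file.
    nops_scan()

    ## Step 2: compute points and grades, matching the scan results against
    ## the exam solutions (.rds written by exams2nops) and the registration CSV.
    nops_eval(
      register  = "registration.csv",      # placeholder file name
      solutions = "exam.rds",              # placeholder file name
      scans     = "nops_scan_results.zip", # placeholder: ZIP produced in step 1
      eval      = exams_eval(partial = FALSE, negative = FALSE)
    )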
Extension to OCR
At the moment no OCR (optical character recognition) is used, except for the simple task of recognizing printed numbers in a fixed font. But no hand-written characters or digits are ever evaluated automatically. I had played around with this a little bit using tesseract but the results were not reliable enough for our purposes.
The string questions that are currently supported are intended for open-ended questions. Hence students get a reasonable amount of space to write something down. The teacher can then grade the answer sheet manually, again by ticking boxes only, which can be read rather reliably. The scanned images of the full sheet are included in the report for the students so that they can also see any hand-written feedback/corrections included in the answer form.
Tutorial
A hands-on guide to using the NOPS approach is available at: http://www.R-exams.org/tutorials/exams2nops/
Misc
Unfortunately, the system is not implemented in a very modular fashion. The reasons for this were two-fold: (1) We followed very closely the format our university had been using. (2) The bulk of the implementation was written under a lot of time pressure (see the anecdote below). So while the features you propose would be nice to have, they are unlikely to fit well into the current setup. If you want to have a stab at this, I would recommend writing a new, modular implementation, re-using only those bits and pieces from the existing code that are useful enough.
Anecdote: Scanning of about 400-500 exam sheets had failed on the university system due to a mistake by the copy shop that had printed the sheets. It was mid-July and everybody was already on vacation, including myself. So I sat on my parents' porch for two days to write the scanner tool and evaluate the exams that the students were waiting for.

Documentation tool for Ada software

Brief: I'm looking for some kind of tool to produce a software description from the comments in existing software source code.
In more detail: I've got existing source code written in Ada. Changes need to be made to this source code, and I also need to generate a document containing a description of the software as a whole and of all of its packages, routines, etc. (if possible as PDF). For the existing routines, source code comments already exist and contain sufficient detail for my needs.
The description shall include at least
overall software design
textual description of packages, routines, variables, constants etc.
call and caller graphs
For projects based on C I'd do this using Doxygen. Doxygen itself, however, does not cope with software written in Ada. My thought was to (automatically) convert the existing comments in the source code so that Doxygen can read them. The conversion itself was no problem (using Doxygen's filter mechanism), but as keywords and syntax differ a lot between C and Ada, this did not produce any usable output.
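For reference, hooking such a converter into Doxygen only takes a few configuration options; the ada_filter script below is a placeholder for whatever rewrites the Ada comments into a C-like form:

    # Hypothetical Doxyfile fragment
    FILE_PATTERNS     = *.ads *.adb
    EXTENSION_MAPPING = ads=C++ adb=C++
    FILTER_PATTERNS   = *.ads=./ada_filter *.adb=./ada_filter
    EXTRACT_ALL       = YES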
I then had a look at Understand from SciTools. While this analyses the software in good detail and generates nice metrics, I was not able to get anything out of it that resembles the document I need.
I want to avoid (manually) writing a separate document, but would instead like to generate this from the code base. I will have to put all the necessary information (perhaps with the exception of a general overview) there anyhow, so why not use it for documentation purposes as well.
Is there any tool that is able to do what I need?
There's a tool called "AdaDoc", which seems to do a part of what you're asking for. You can of course use "a2ps" for the textual part of your needs (I like that better than what AdaDoc generates).
There are several UML tools ("Umbrello" is one name I remember), which offer to create graphs of inter-package relations, but for a seriously sized project, the best option is to use the original design documents, and simply verify that the source text actually matches that design.
For languages not supported by Doxygen, I've written my own "general purpose" filter.
It's very basic, but useful for me.
https://github.com/malkev/doxphp

'make'-like dependency-tracking library?

There are many nice things to like about Makefiles, and many pains in the butt.
In the course of doing various projects (I'm a research scientist, "data scientist", or whatever) I often find myself starting out with a few data objects on disk, generating various artifacts from those, generating artifacts from those artifacts, and so on.
It would be nice if I could just say "this object depends on these other objects", and "this object is created in the following manner from these objects", and then ask a Make-like framework to handle the details of actually building them, figuring out which objects need to be updated, farming out work to multiple processors (like Make's -j option), and so on. Makefiles can do all this - but the huge problem is that all the actions have to be written as shell commands. This is not convenient if I'm working in R or Perl or another similar environment. Furthermore, a strong assumption in Make is that all targets are files - there are some exceptions and workarounds, but if my targets are e.g. rows in a database, that would be pretty painful.
To be clear, I'm not after a software-build system. I'm interested in something that (more generally?) deals with dependency webs of artifacts.
Anyone know of a framework for these kinds of dependency webs? It seems like it could be a nice tool for doing data science, visually showing how results were generated, etc.
One extremely interesting example I saw recently was IncPy, but it looks like it hasn't been touched in quite a while, and it's very closely coupled with Python. It's probably also much more ambitious than I'm hoping for, which is why it has to be so closely coupled with Python.
Sorry for the vague question, let me know if some clarification would be helpful.
A new system called "Drake" was announced today that targets this exact situation: http://blog.factual.com/introducing-drake-a-kind-of-make-for-data . Looks very promising, though I haven't actually tried it yet.
This question is several years old, but I thought adding a link to remake here would be relevant.
From the GitHub repository:
The idea here is to re-imagine a set of ideas from make but built for R. Rather than having a series of calls to different instances of R (as happens if you run make on R scripts), the idea is to define pieces of a pipeline within an R session. Rather than being language agnostic (like make must be), remake is unapologetically R focussed.
It is not on CRAN yet, and I haven't tried it, but it looks very interesting.
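For a flavour of how pipelines are declared, a remake configuration looks roughly like this (reconstructed from memory of the remake README, so the exact keys may differ; process_data() and make_plot() would be your own functions defined in code.R):

    sources:
      - code.R                        # your own R functions live here

    targets:
      all:
        depends: plot.pdf

      processed:
        command: process_data("raw.csv")

      plot.pdf:
        command: make_plot(processed)
        plot: true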
I would give Bazel a try for this. It is primarily a software build system, but with its genrule type of artifacts it can perform pretty arbitrary file generation, too.
Bazel is very extensible via its Python-like Starlark language, which should be far easier to use for complicated tasks than make. You can start by writing simple genrule steps by hand, then refactor common patterns into macros, and if things become more complicated even write your own rules. So you should be able to express your individual transformations at a high level that models how you think about them, and then turn that representation into lower-level constructs using something that feels like a proper programming language.
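A minimal sketch of two chained data-processing steps in a BUILD file (clean.R and summarize.R are hypothetical helper scripts, and the target and file names are made up):

    # Each genrule declares its inputs (srcs), outputs (outs) and the shell
    # command producing the outputs; Bazel re-runs a step only when the
    # fingerprints of its inputs change.
    genrule(
        name = "clean",
        srcs = ["raw.csv", "clean.R"],
        outs = ["clean.rds"],
        cmd = "Rscript $(location clean.R) $(location raw.csv) $@",
    )

    genrule(
        name = "summary",
        srcs = [":clean", "summarize.R"],
        outs = ["summary.csv"],
        cmd = "Rscript $(location summarize.R) $(location :clean) $@",
    )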
Where make depends on timestamps, Bazel checks fingerprints. So if any one step produces the same output even though one of its inputs changed, subsequent steps won't need to be re-computed. If some of your data-processing steps project or filter data, there might be a high probability of this kind of thing happening.
I see your question is tagged for R, even though it doesn't mention it much. Under the hood, R computations in Bazel would still boil down to R CMD invocations on the shell, but you could have complicated multi-line commands assembled in flexible ways to read your inputs, process them, and store the outputs. If the cost of initializing the R binary is a concern, Rserve might help, although using it would make the setup depend on a locally accessible Rserve instance, I believe. Even with that, I see nothing that would avoid the cost of storing the data to file and loading it back from file. If you want something that avoids that cost by keeping things in memory between steps, then you'd be looking at a very R-specific tool, not a generic tool like you requested.
In terms of “visually showing how results were generated”, bazel query --output graph can be used to generate a graphviz dot file of the dependency graph.
Disclaimer: I'm currently working at Google, which internally uses a variant of Bazel called Blaze. Actually Bazel is the open-source released version of Blaze. I'm very familiar with using Blaze, but not with setting up Bazel from scratch.
Red-R has a concept of data flow programming. I have not tried it yet.

What is a good localization strategy for knitr reports?

I would like to create aggregation reports based on ggplot2 and knitr. Unfortunately I want to do this in four languages: English, German, French, and Italian. So far the labels for plots and figures basically come from the data itself, i.e. they are generated from data.frame headers or factor levels.
Given that I have more than 100 categorical variables with different levels, I wonder what an efficient translation strategy might be. There are .po files and Portable Object editors for other languages, and even for R and its own messages. Given that the number of languages might grow, it becomes more likely that other people need to be involved in the translation. Obviously these people are typically not R users and might not even like text editors.
Has anybody faced the same problem and developed a good strategy or experience to share? Could you imagine something XLIFF-like?
EDIT: I have seen this thread in the meantime, but I believe gettext only works for packages. I wonder if the domain in this post is really valid.
No direct experience myself, but I wonder if a Google spreadsheet might provide a good workflow for collaborating with translators, e.g., then get it into R using RCurl solutions such as the answer here:
read.csv fails to read a CSV file from google docs
I used this technique to analyse survey data.
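A rough sketch of that approach, assuming the translators fill in one row per label and one column per language, and that the sheet is published as CSV at some URL (the URL, the column names, and the lookup() helper are made up for illustration):

    library("RCurl")

    ## Read the published spreadsheet: one row per label, one column per language.
    csv <- getURL("https://docs.google.com/<published-sheet>&output=csv",
                  ssl.verifypeer = FALSE)
    translations <- read.csv(textConnection(csv), stringsAsFactors = FALSE)

    ## Hypothetical helper: map a vector of English labels to a target language.
    lookup <- function(x, lang = "de") {
      translations[[lang]][match(x, translations$en)]
    }

    ## e.g. translate factor levels before plotting:
    ## levels(data$region) <- lookup(levels(data$region), lang = "fr")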

Automatic documentation of datasets

I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.
Does such a tool exist?
If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:
A markup language should be used (YAML?)
All sub-directories should be scanned
To facilitate (2), a standard extension for a dataset descriptor should be used
Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.
Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) There are a bunch of resources that you'll find, such as the surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.
That's a conceptual start, now to address practical issues:
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Welcome to everyone's nightmare. :)
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
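For instance, something along these lines collects all descriptor files below the original_data directory (the .json extension and the jsonlite package are assumptions, matching the JSON suggestion further below):

    library("jsonlite")   # assumption: descriptors are stored as JSON

    ## Find every descriptor file below original_data/, one per data source.
    desc_files <- list.files("original_data", pattern = "\\.json$",
                             recursive = TRUE, full.names = TRUE)

    ## Read them all into a list, named after their sub-directories.
    descriptors <- setNames(lapply(desc_files, fromJSON),
                            basename(dirname(desc_files)))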
It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results are a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)
Does such a tool exist?
Yes. What are such tools? Mon dieu... it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.
For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...
1. A markup language should be used (YAML?)
Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
2. All sub-directories should be scanned
Done: listDirectory()
3. To facilitate (2), a standard extension for a dataset descriptor should be used
Trivial: ".json". ;-) Or ".SecretSauce" works, too.
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
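As a concrete illustration of that last point, a small lookup table plus a completeness check is all it takes (the file name, column names, and the analysis_data data frame are made up):

    ## Hypothetical descriptor table: one row per variable in the final data set,
    ## e.g. variables.csv with columns "name" and "description".
    meta <- read.csv("variables.csv", stringsAsFactors = FALSE)

    ## Check that every variable in the analysis data has a descriptor; keep
    ## editing variables.csv until nothing is reported missing.
    missing <- setdiff(names(analysis_data), meta$name)
    if (length(missing)) stop("No descriptor for: ", paste(missing, collapse = ", "))

    ## The descriptors can then be re-used as plot labels, report text, etc.
    descriptions <- setNames(meta$description, meta$name)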
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
That's where others' suggestions of roxygen and roxygen2 can apply.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
Hmm, I'm stumped here. :)
(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.
Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These names are stored in a descriptor file for each binary file. That process encourages the habit of matching variable names (and matrices) to descriptors. If you store your data in a database or in other external files supporting random access and multiple R/W access (e.g. memory-mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.
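For illustration, this is roughly what that looks like with bigmemory (the column names, file names, and path are placeholders):

    library("bigmemory")

    ## A file-backed matrix: the binary data go into the backing file, the
    ## dimnames (i.e. the variable names) into the descriptor file.
    x <- filebacked.big.matrix(
      nrow = 1e6, ncol = 3,
      dimnames = list(NULL, c("age", "income", "score")),   # placeholder names
      backingpath = "original_data/source1",                 # placeholder path
      backingfile = "vars.bin",
      descriptorfile = "vars.desc"
    )

    ## Later sessions (or a documentation script) re-attach via the descriptor:
    ## x <- attach.big.matrix("original_data/source1/vars.desc")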
