Bilingual (English and Portuguese) documentation in an R package - r

I am writing a package to facilitate importing Brazilian socio-economic microdata sets (Census, PNAD, etc).
I foresee two distinct groups of users of the package:
Users in Brazil, who may feel more at ease with documentation in Portuguese. They can probably understand English to some extent, but a foreign language would probably make the package feel less "ergonomic".
The broader international user community, for whom English documentation may be a necessary condition.
Is it possible to write a package in a way that the documentation is "bilingual" (English and Portuguese), and that the language shown to the user will depend on their country/language settings?
Also,
Is that doable within the roxygen2 documentation framework?
I realise there is a tradeoff between making the package more user-friendly by making it bilingual and the increased complexity and maintenance effort. General comments on this tradeoff from previous experience are also welcome.
EDIT: following the comment's suggestion I cross-posted to the r-package-devel mailing list, HERE; then follow the answers at the bottom. Duncan Murdoch posted an interesting answer covering some of what Brandon's answer (below) covers, but also including two additional suggestions that I think are useful:
keep the package in one language, but provide the vignettes in different languages (see the sketch below). I will follow this advice.
have two versions of the package, say 1.1 and 1.2, one in each language

According to rOpenSci, there is no standard mechanism for translating package documentation into non-English languages. They describe the typical process of internationalization/localization as follows:
To create non-English documentation requires manual creation of
supplemental .Rd files or package vignettes.
Packages supplying
non-English documentation should include a Language field in the
DESCRIPTION file.
And some more info on the Language field:
A ‘Language’ field can be used to indicate if the package
documentation is not in English: this should be a comma-separated list
of standard (not private use or grandfathered) IETF language tags as
currently defined by RFC 5646 (https://www.rfc-editor.org/rfc/rfc5646,
see also https://en.wikipedia.org/wiki/IETF_language_tag), i.e., use
language subtags which in essence are 2-letter ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) or 3-letter ISO 639-3
(https://en.wikipedia.org/wiki/ISO_639-3) language codes.
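As a concrete illustration (the package name and other field values are made up), a DESCRIPTION declaring Brazilian Portuguese documentation could contain:

    Package: meupacote
    Title: Import Brazilian Socio-Economic Microdata
    Version: 0.1.0
    Encoding: UTF-8
    Language: pt-BR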
Care is needed if your package contains non-ASCII text, and in particular if it is intended to be used in more than one locale. It is possible to mark the encoding used in the DESCRIPTION file and in .Rd files.
Regarding encoding...
First, consider carefully if you really need non-ASCII text. Many
users of R will only be able to view correctly text in their native
language group (e.g. Western European, Eastern European, Simplified
Chinese) and ASCII. Other characters may not be rendered at all,
rendered incorrectly, or cause your R code to give an error. For .Rd
documentation, marking the encoding and including ASCII
transliterations is likely to do a reasonable job. The set of
characters which is commonly supported is wider than it used to be
around 2000, but non-Latin alphabets (Greek, Russian, Georgian, …) are
still often problematic and those with double-width characters
(Chinese, Japanese, Korean) often need specialist fonts to render
correctly.
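For what it's worth, with roxygen2 the encoding of an individual help page can be marked with the @encoding tag, which should end up as \encoding{} in the generated .Rd file (the function name and text below are made up; double-check the tag against the roxygen2 documentation you are using):

    #' Lê microdados do Censo (reads Census microdata)
    #'
    #' @encoding UTF-8
    #' @param ano Ano do Censo (Census year).
    #' @export
    le_censo <- function(ano) {
      # implementation omitted
    }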
On a related note, R does, however, provide support for errors and warnings in different languages: "There are mechanisms to translate the R- and C-level error and warning messages. These are only available if R is compiled with NLS support (which is requested by configure option --enable-nls, the default)."
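In practice that means wrapping run-time messages in gettext()/gettextf() inside the package code and shipping .po translations; tools::update_pkg_po() extracts the strings into the po/ directory. A rough sketch, with a made-up function and package name:

    # R/read_census.R inside the (hypothetical) package 'meupacote'
    read_census <- function(year) {
      if (!is.numeric(year)) {
        stop(gettextf("'year' must be numeric, not %s", class(year)[1]))
      }
      # ... actual import code ...
    }

    # Run once from the package source directory to create/refresh po/R-meupacote.pot,
    # which translators then turn into e.g. po/R-pt_BR.po:
    # tools::update_pkg_po(".")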

Besides bilingual documentation, please allow me the following comment: given your two "target" groups, it may be assumed that some of your users will be running a non-English OS (typically, Windows in Portuguese). When importing time series data (or any date entries, for that matter), different date formatting (English vs. non-English) can give you different "results" (i.e. misinterpreted date entries) on English vs. non-English machines. I have some experience with those issues (I often work with Czech-language-based OSs) and, other than ad-hoc coding, I have not found a simple solution.
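A small illustration of the kind of problem meant here (the file content is made up): parsing month abbreviations with %b depends on the LC_TIME locale, so the same string can parse fine on an English machine and come back as NA on a Portuguese or Czech one. Pinning LC_TIME while parsing (or using purely numeric date formats) sidesteps it.

    x <- "15-Dec-2010"                      # month abbreviation as written in the raw file

    # On a non-English locale %b expects the local month names, so this may return NA:
    as.Date(x, format = "%d-%b-%Y")

    # Workaround: temporarily switch LC_TIME to a fixed locale while parsing
    old <- Sys.getlocale("LC_TIME")
    Sys.setlocale("LC_TIME", "C")
    as.Date(x, format = "%d-%b-%Y")         # "2010-12-15" regardless of the OS language
    Sys.setlocale("LC_TIME", old)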
(If you find this off-topic, please feel free to delete)

Related

Autograding of NOPS exams: Implementation & extension to string questions?

We are using R/exams to create tests in Canvas and TestVision.
We have other forms and other software to perform written exams.
I know R/exams has a great NOPS feature and was wondering:
What software is used to autograde the NOPS forms?
Can that software also evaluate string questions?
Right now it looks like the NOPS form doesn't make it easy for software to read certain parts. Ideally the software would be adapted so that adapted NOPS forms (changes in blue) could more easily read the student name and the string questions:
NOPS format
The NOPS forms have not been designed by us but they follow the format that our university has been using. We simply mimicked their format because we initially just generated the PDF files ourselves but used the commercial scanning software of our university.
Scanning
However, over the years we have written our own scanner implementation in R in exams::nops_scan(). The basic approach is to convert PDF pages to PNG images, read these into R, convert them to black-and-white pixel matrices, find the scanner markings in the corners, and then extract just the boxes relative to these markings. The boxes either contain printed digits in a fixed font, for which a simple decision tree yields a reliable classification, or they are empty/filled vs. checked, which can also be classified reasonably reliably. The result is stored in a simple text format that was again not developed by us but designed to be fully compatible with the commercial system our university used.
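Not the actual exams code, but a toy sketch of the idea described above, assuming the png package and a made-up file name and box position: read the scan into a matrix, threshold it to black/white, and decide whether a given box is ticked from the share of dark pixels.

    library("png")

    img  <- readPNG("sheet-001.png")                 # one scanned page (hypothetical file)
    gray <- if (length(dim(img)) == 3) img[, , 1] else img
    bw   <- gray < 0.5                               # TRUE = dark pixel

    # share of dark pixels inside one (made-up) box region, rows x columns
    box     <- bw[410:440, 120:150]
    checked <- mean(box) > 0.15                      # more than 15% dark -> treat as ticked
    checked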
Grading
Based on the scan results the function exams::nops_eval() computes points and grades. Various evaluation strategies can be plugged in and starting from version 2.4-0 the reports generated by the function can be customized.
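Put together, the pipeline sketched above is roughly the following (file names are placeholders and the argument names are quoted from memory, so check ?nops_scan / ?nops_eval and the tutorial linked below):

    library("exams")

    # scan the PDF/PNG files of the filled-in sheets in the working directory
    nops_scan()

    # compute points and grades from the scan results
    nops_eval(
      register  = "registration.csv",              # student registration data
      solutions = "exam.rds",                      # metadata written by exams2nops()
      scans     = Sys.glob("nops_scan_*.zip"),
      eval      = exams_eval(partial = FALSE, negative = FALSE)
    )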
Extension to OCR
At the moment no OCR (optical character recognition) is used, except for the simple task of recognizing printed numbers in a fixed font. But no hand-written characters or digits are ever evaluated automatically. I had played around with this a little bit using tesseract but the results were not reliable enough for our purposes.
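For anyone who wants to experiment in the same direction, the kind of quick check meant here would be something like the following (the image file is hypothetical); in our experiments the output for hand-written answers was simply not dependable:

    library("tesseract")

    eng <- tesseract("eng")
    txt <- ocr("string-answer-crop.png", engine = eng)   # cropped hand-written answer box
    cat(txt)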
The string questions that are currently supported are intended for open-ended questions. Hence students get a reasonable amount of space to write something down. The teacher can then grade the answer sheet manually, again by ticking boxes only, which can be read rather reliably. The scanned images of the full sheet are included in the report for the students so that they can also see any hand-written feedback/corrections included in the answer form.
Tutorial
A hands-on guide to using the NOPS approach is available at: http://www.R-exams.org/tutorials/exams2nops/
Misc
Unfortunately, the system is not implemented in a very modular fashion. The reasons for this were two-fold: (1) we followed very closely the given format our university had been using, and (2) the bulk of the implementation was written under a lot of time pressure (see the anecdote below). So while the features you propose would be nice to have, they are unlikely to fit well into the current setup. If you want to have a stab at this, I would recommend writing a new, modular implementation, reusing the bits and pieces from the existing code that are useful enough.
Anecdote: Scanning of about 400-500 exam sheets had failed on the university system due to a mistake by the copy shop that had printed the sheets. It was mid-July and everybody was on vacation already, including myself. So I sat on my parents' porch for two days to write the scanner tool and evaluate the exams that the students were waiting for.

Grakn: how can I construct a knowledge graph from a collection of texts?

I have several documents (pdf and txt) in my notebook and I want to construct a knowledge graph using Grakn.
Through Google I found the blog post, but there is no documentation or README explaining how to do that.
The blog post also says "The script to mine text can be found on our GitHub repo here", but I am failing to understand what I have to do.
Can someone here advise me how to construct a knowledge graph from text using Grakn?
Grakn is a knowledge engine/network which understands knowledge through well-defined entities and relations (ontologies), so you need to use NLP (Natural Language Processing) to make human language accessible to a graph network. You also need OCR (Optical Character Recognition) to convert text embedded in images into plain text, and you should teach the network basic ontologies so that it can understand the texts. You are actually heading towards the Singularity era.
To give an example of how to go from a collection of text to a knowledge graph, let us assume that all of your text is concerned with a certain domain of knowledge - in the example of the blog post you mention, we are dealing with biomedical research publications.
A first step could be to find entities, or defined "things", in the text. To stick with the biomedical example, we could look for drugs and genes mentioned in the publications. This is called named-entity-recognition (NER), a technique applied in text-mining.
If a certain drug is often mentioned in the same publication as a particular gene, they "co-occur" and are likely related in some way. This would be an example of a relationship. The automated extraction of exactly how they are related is a difficult problem and is called relationship-extraction (RE).
Solutions for both NER and RE are usually domain-specific (ranging from simple matching of dictionary terms to AI models).
If you are interested in text-mining, a good place to start in python is NLTK.
The idea of a knowledge graph is to put defined things, called entities, into defined relationships to one another to create context. After you have a list of entities that you have found in all your documents, as well as their relationships (as in the example above, co-occurrence in a document or even in a single sentence), you can define a schema, upload the entities and relationships into Grakn, and use all of its functionality to analyze your data.
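To make the co-occurrence step concrete, here is a small sketch in R (the mention table is made up; in practice it would come from your NER step). It counts how often two entities appear in the same document, which gives you candidate relationships to load into the schema:

    # one row per entity mention: source document plus the entity found by NER
    mentions <- data.frame(
      doc    = c("paper1", "paper1", "paper1", "paper2", "paper2"),
      entity = c("drugA",  "geneX",  "geneY",  "drugA",  "geneX")
    )

    # all pairs of distinct entities that co-occur within a document
    pairs <- merge(mentions, mentions, by = "doc")
    pairs <- subset(pairs, entity.x < entity.y)

    # count co-occurrences: frequent pairs are candidate relationships
    co_occurrence <- aggregate(doc ~ entity.x + entity.y, data = pairs, FUN = length)
    names(co_occurrence)[3] <- "n_docs"
    co_occurrence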
For a tutorial on how to use Grakn with already extracted data, see here

Documentation tool for Ada software

Brief: I'm looking for some kind of tool to produce a software description from the comments in existing software source code.
In more detail: I've got existing source code written in Ada. Changes need to be made to this source code and I also need to generate a document containing a description of the software as a whole and all of its packages, routines etc. (if possible as PDF). For the existing routines these source code comments already exist and contain sufficient detail for my needs.
The description shall include at least
overall software design
textual description of packages, routines, variables, constants etc.
call and caller graphs
For projects based on C I'd do this using Doxygen. Doxygen itself, however, does not cope with software written in Ada. My thought was to (automatically) convert the existing comments in the source code so that Doxygen can read them. The conversion itself was no problem (using Doxygen's filter mechanism), but as keywords and syntax differ a lot between C and Ada, this did not produce any usable output.
I then had a look at Understand from SciTools. While it analyses the software in good detail and generates nice metrics, I was not able to get anything out of it that resembles a document with what I need.
I want to avoid (manually) writing a separate document and would instead like to generate it from the code base. I will have to put all the necessary information (perhaps with the exception of a general overview) there anyhow, so why not use it for documentation purposes as well.
Is there any tool that is able to do what I need?
There's a tool called "AdaDoc", which seems to do a part of what you're asking for. You can of course use "a2ps" for the textual part of your needs (I like that better than what AdaDoc generates).
There are several UML tools ("Umbrello" is one name I remember), which offer to create graphs of inter-package relations, but for a seriously sized project, the best option is to use the original design documents, and simply verify that the source text actually matches that design.
For languages not supported by Doxygen, I've written my own "general purpose" filter.
It's very basic, but useful for me.
https://github.com/malkev/doxphp

What's the definition of Package in Julia-Lang?

I have frequently used packages in Julia. There are many articles that describe how to work with them, but I don't know what the exact definition of a package is.
EDIT
Following is a general definition from Wikipedia:
Package (package management system), in which individual files or
resources are packed together as a software collection that provides
certain functionality as part of a larger system
I would like to know the specific view of a package that Julia takes, e.g. look at this definition from Wikipedia of a Java package.
I would say that a Julia package is a module (similar to a namespace in other languages) containing a collection of related functions that provide new functionality for Julia, and that will be useful for other people.
This definition is not unambiguous though. For example, I suggested recently that several image format packages could belong inside a single ImageFormats package, but the replies were that there was a good reason (code size and binary dependencies) for certain kinds of formats to be in separate packages.
If you follow the discussion of the pull requests for new packages on METADATA.jl, you will have a good idea about the community's feeling about what packages should be for / look like. My takeaway from following those discussions is that a more-or-less unified vision is starting to emerge.

What is a good localization strategy for knitr reports?

I would like to create aggregation reports based on ggplot2 and knitr. Unfortunately I want to do it in four languages, namely English, German, French, Italian. So far labels for plots and figures are basically coming from data itself, i.e. they are generated from data.frame headers or factor levels.
Given that I have more than 100 categorical variables with different levels, I wonder what an efficient translation strategy might be. There are .po files and Portable Object editors for other languages, and even for R and its own messages. Given that the number of languages might grow, it becomes more likely that other people need to be involved in the translation. Obviously these people are typically not R users and might not even like text editors.
Has anybody faced the same problem and developed a good strategy, or has experience to share? Could you imagine something XLIFF-like?
EDIT: I have seen this thread in the meantime, but I believe gettext only works for packages. I wonder if the domain in this post is really valid.
No direct experience myself, but I wonder whether a Google spreadsheet might provide a good workflow for collaborating with translators, e.g. then get the translations into R using RCurl solutions such as the answer here (a rough sketch follows below):
read.csv fails to read a CSV file from google docs
I used this technique to analyse survey data.
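A sketch of how such a spreadsheet could feed the reports (column names and the helper function are made up): one row per label key, one column per language, read once at the top of the knitr document, and a tiny lookup function used wherever a label is needed.

    # translations.csv, maintained by the translators in a shared spreadsheet:
    # key,en,de,fr,it
    # income,Income,Einkommen,Revenu,Reddito
    # region,Region,Region,Région,Regione

    dict <- read.csv("translations.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
    lang <- "de"                                  # set once per report run

    tr <- function(key) {
      out <- dict[[lang]][match(key, dict$key)]
      ifelse(is.na(out), key, out)                # fall back to the untranslated key
    }

    # usage with ggplot2, for example:
    # ggplot(df, aes(region, income)) + geom_col() +
    #   labs(x = tr("region"), y = tr("income"))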
