Function to check (English language) grammar in R

Are there any R functions that receive a sentence/paragraph and return a boolean indicating whether that document is grammatically correct?
i.e.
grammar_check("I've had a good day!")
# TRUE
grammar_check("ive have a good day")
# FALSE
I have found some very advanced text-processing capabilities in R (mostly used in machine learning / bag-of-words approaches), for example in the caret and tm packages, but I have not been able to find any function that checks grammar.
Note that this answer points to some Python packages and web services, but those are for extracting and stemming words and are not in R.
Also note: an informative comment on that answer points out that writing a rules-based function would take a whole team of people a very long time (since English is so idiosyncratic), so this is not something that can be hacked together by an individual in a short amount of time.
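There is no grammar checker in base R, but one possible approach is to wrap an external checker such as LanguageTool, which exposes an HTTP API. The following is only a minimal sketch, assuming a LanguageTool server is already running locally (the URL/port and the decision to treat "no matches" as "grammatically correct" are assumptions, not a tested solution):
library(httr)

# Hypothetical wrapper around a locally running LanguageTool server.
# It returns TRUE when the checker reports no matches (i.e., no issues found);
# note that LanguageTool also flags spelling, not just grammar.
grammar_check <- function(text, server = "http://localhost:8081/v2/check") {
  res <- POST(server, body = list(text = text, language = "en-US"), encode = "form")
  out <- content(res)           # parsed JSON response as a list
  length(out$matches) == 0
}

grammar_check("I've had a good day!")
grammar_check("ive have a good day")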


How to grade exams/questions manually?

What I'd like to do:
I would like to use r-exams in the following procedure:
Providing electronic exams in PDF format to students (using exams2pdf(...))
Let the students upload excel file with their answers
Grade the answers (using nops_eval(...))
My Question:
Is calling the function nops_eval() the preferred way to manually grade questions in r-exams?
If not, which way is preferred?
What I have tried:
I'm aware of the exams2nops() function, and I know that it gives back an .rds file where the correct answers are stored. Hence, I basically have what I need. However, I found that procedure to be not very straightforward, as the correct answers are buried rather deeply inside the .rds file.
Overview
You are right that there is no readily available system for administering/grading exams outside of a standard learning management system (LMS) like Moodle or Canvas, etc. R/exams does provide some building blocks for the grading, though, especially exams_eval(). This can be complemented with tools like Google forms etc. Below I start with the "hard facts" regarding exams_eval() even though this is a bit technical. But then I also provide some comments regarding such approaches.
Using exams_eval()
Let us consider a concrete example
eval <- exams_eval(partial = TRUE, negative = FALSE, rule = "false2")
indicating that you want partial credit for multiple-choice exercises but the overall points per item must not become negative. A correctly ticked box yields 1/#correct points and an incorrectly ticked box -1/#false points. The only exception is when there is only one false item (which would then cancel all points); in that case -1/2 is used.
The resulting object eval is a list with the input parameters (partial, negative, rule) and three functions checkanswer(), pointvec(), pointsum(). Imagine that you have the correct answer pattern
cor <- "10100"
The associated points for correctly and incorrectly ticked boxes would be:
eval$pointvec(cor)
## pos neg
## 0.5000000 -0.3333333
Thus, for the following answer pattern you get:
ans <- "11100"
eval$checkanswer(cor, ans)
## [1] 1 -1 1 0 0
eval$pointsum(cor, ans)
## [1] 0.6666667
The latter would still need to be multiplied by the overall points assigned to that exercise. For numeric answers you can only get 100% or 0%:
eval$pointsum(1.23, 1.25, tolerance = 0.05)
## [1] 1
eval$pointsum(1.23, 1.25, tolerance = 0.01)
## [1] 0
Similarly, string answers are either correct or false:
eval$pointsum("foo", "foo")
## [1] 1
eval$pointsum("foo", "bar")
## [1] 0
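To connect this with the workflow from the question, here is a rough sketch of grading several uploaded answer patterns against a stored correct pattern (the objects answers and correct are made up for illustration; eval and the patterns are the ones defined above):
# answers: submitted patterns, one per student; correct: the stored solution
answers <- c("11100", "10100")
correct <- "10100"
points <- sapply(answers, function(a) eval$pointsum(correct, a))
points
##     11100     10100
## 0.6666667 1.0000000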
Exercise metainformation
To obtain the relevant pieces of information for a given exercise, you can access the metainformation from the nested list that all exams2xyz() interfaces return:
x <- exams2xyz(...)
For example, you can then extract the metainfo for the i-th random replication of the j-th exercise as:
x[[i]][[j]]$metainfo
This contains the correct $solution, the $type, and also the $tolerance etc. Sure, this is somewhat long and inconvenient to type interactively, but it should be easy enough to cycle through programmatically. This is, for example, what nops_eval() does based on the .rds file containing exactly the information in x.
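For illustration, a minimal sketch of such a loop (x is the return value of an exams2xyz() call as above; the object name meta is made up):
# Sketch: collect type, solution, and tolerance for every exercise
# in every random replication
meta <- lapply(x, function(exam) {
  lapply(exam, function(exercise) {
    m <- exercise$metainfo
    list(type = m$type, solution = m$solution, tolerance = m$tolerance)
  })
})

# e.g., metainformation of the 2nd exercise in the 1st replication
str(meta[[1]][[2]])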
Administering exams without a full LMS
My usual advice here is to try to leverage your university's services (if available, of course). Yes, there can be problems with bandwidth/stability etc., but you can have all of the same problems if you're running your own system (been there, done that). Specifically, a discussion of Moodle vs. PDF exams mailed around is available here:
Create fillable PDF form with exams2nops
http://www.R-exams.org/general/distancelearning/#replacing-written-exams
If I were to provide my exams outside of an LMS, I would use HTML, though, and not PDF. In HTML it is much easier to embed additional information (data, links, etc.) than in PDF. Also, HTML can be viewed on mobile devices much more easily.
For collecting the answers, some R/exams users employ Google forms, see e.g.:
https://R-Forge.R-project.org/forum/forum.php?thread_id=34076&forum_id=4377&group_id=1337. Others have been interested in using learnr or webex for that:
http://www.R-exams.org/general/distancelearning/#going-forward.
Regarding privacy, though, I would be very surprised if any of these are better than using the university's LMS.

How do I insert text before a group of exercises in an exam?

I am very new to R and to R/exams. I've finally figured out basic things like compiling a simple exam with exams2pdf and exams2canvas, and I've figured out how to arrange exercises such that this group of X exercises gets randomized in the exam and others don't.
In my normal written exams, sometimes I have a group of exercises that require some introductory text (e.g., a brief case study on which the next few questions are based, or a specific set of instructions for the questions that follow).
How do I create this chunk of text using R/exams and Rmd files?
I can't figure out if it's a matter of creating a particular Rmd file and then simply adding that to the list when creating the exam (like a dummy file of sorts that only shows text, but isn't numbered), or if I have to do something with the particular tex template I'm using.
There's a post on R-forge about embedding a plain LaTeX file between exercises that seems to get at what I'm asking, but I'm using Rmd files to create exercises, not Rnw files, and so, frankly, I just don't understand it.
Thank you for any help.
There are two strategies for this:
1. Separate exercise files in the same sequence
Always use the same sequence of exercises, say, ex1.Rmd, ex2.Rmd, ex3.Rmd, where ex1.Rmd creates and describes the setting and ex2.Rmd and ex3.Rmd simply re-use the variables created internally by ex1.Rmd. In the exams2xyz() interface you have to ensure that all exercises are processed in the same environment, e.g., the global environment:
exams2pdf(c("ex1.Rmd", "ex2.Rmd", "ex3.Rmd"), envir = .GlobalEnv)
For .Rnw exercises this is not necessary because they are always processed in the global environment anyway.
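As an illustration, a minimal sketch of what the data-generating chunks of such exercises might contain (the objects below are made up; the point is only that ex2.Rmd can rely on objects created by ex1.Rmd when both are processed in the same environment):
# In the data chunk of ex1.Rmd: create and describe the setting
d <- data.frame(x = rnorm(20), y = rnorm(20))

# In the data chunk of ex2.Rmd: simply re-use the objects from ex1.Rmd
fit <- lm(y ~ x, data = d)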
2. Cloze exercises
Instead of separate exercise files, combine everything into a single "cloze" exercise ex123.Rmd with three sub-items. For a simple exercise with two sub-items, see: http://www.R-exams.org/templates/lm/
Which strategy to use?
For exams2pdf() both strategies work and it is more a matter of taste whether one prefers all exercises together in a single file or split across separate files. However, for other exams2xyz() interfaces only one or none of these strategies work:
exams2pdf(): 1 + 2
exams2html(): 1 + 2
exams2nops(): 1
exams2moodle(): 2
exams2openolat(): 2
exams2blackboard(): -
exams2canvas(): -
Basically, strategy 1 is only guaranteed to work for interfaces that generate separate files for separate exams, like exams2pdf(), exams2nops(), etc. However, for interfaces that create pools of exercises for learning management systems, like exams2moodle(), exams2canvas(), etc., it typically cannot be assured that the same random replication is drawn for all three exercises. (Thus, if there are two random replications per exercise, A and B, participants might not get A/A/A or B/B/B but A/B/A.)
Hence, if ex1/2/3 are multiple-choice exercises that you want to print and scan automatically, then you could use exams2nops() in combination with strategy 1. However, strategy 2 would not work because cloze exercises cannot be scanned automatically in exams2nops().
In contrast, if you want to use Moodle, then exams2moodle() could be combined with strategy 2. Strategy 1, however, would not work (see above).
As you are interested in Canvas export: in Canvas neither of the two strategies works. It does not support cloze exercises, and to the best of my knowledge it is not straightforward to ensure that exercises are sampled "in sync".

How to replace English abbreviated forms with their dictionary forms

I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for some English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task to tackle: is there some similar work or whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text and try to match words that do not exist in the dictionary against existing dictionary words.
II) The key feature of many of your examples is that they are only one character away from the correct spelling. So, for the words that you fail to match with a dictionary entry, try adding each English character to the front or back and look up the resulting word in the dictionary. This is very expensive in the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spelling (a minimal sketch of such a table follows this list).
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g., newspaper articles), then run it over your entire corpus and, for the words the language model considers unknown (i.e., it has not seen them during training), check which word is most probable according to the model. Most probably the correctly spelled word will be among the model's top-10 predictions.
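Regarding idea II, a minimal sketch of such a lookup table in R (the table contents and the normalize() helper are made up for illustration):
# A small, hand-maintained lookup table of non-standard forms
lookup <- c("'re" = "are", "'s" = "is", "havin" = "having", "sayin'" = "saying")

# Hypothetical helper: replace tokens that appear in the lookup table
normalize <- function(tokens) {
  hit <- tokens %in% names(lookup)
  tokens[hit] <- lookup[tokens[hit]]
  tokens
}

normalize(c("havin", "a", "good", "day"))
## [1] "having" "a" "good" "day"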

FAQ markup to R data structure

I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it was parsed as an R structure. There are several existing examples related to this:
the fortunes package
bibtex entries
Rd files
each with some desirable features.
In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command line (i.e., through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from the bibtex and fortunes packages, we could conceive a new system where:
FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), or faq() # surprise me!, faq(51), faq(package = "ggplot2") (a rough sketch of such an interface is given after this list).
Packages can provide their own FAQ.rda, the format of which is not clear yet (see below)
Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
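For concreteness, here is a rough sketch of what such a structure and search interface could look like; everything below (the field names, the faq_db object, and the behaviour of faq()) is purely illustrative and not an existing package:
# Hypothetical FAQ storage: one entry per row
faq_db <- data.frame(
  id       = 1:2,
  title    = c("How can I sort a data frame?",
               "Why doesn't R think these numbers are equal?"),
  entry    = c("Use order(), e.g. df[order(df$x), ].",
               "Floating-point representation; see all.equal()."),
  category = c("data manipulation", "numerics"),
  package  = c("base", "base"),
  stringsAsFactors = FALSE
)

# Hypothetical search interface in the spirit of fortunes::fortune()
faq <- function(query = NULL, package = NULL, db = faq_db) {
  if (!is.null(package)) db <- db[db$package == package, ]
  if (is.null(query)) return(db[sample(nrow(db), 1), ])   # faq(): surprise me!
  if (is.numeric(query)) return(db[db$id == query, ])     # faq(51)
  hits <- grepl(query, paste(db$title, db$entry, db$category), ignore.case = TRUE)
  db[hits, ]
}

faq("sort")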
QUESTION
I'm not sure, however, what the best input format is, either for converting the existing FAQ or for adding new entries.
It is rather cumbersome to use R syntax for this, with a tree of nested lists (or an ad hoc S3/S4/reference class or structure), e.g.,
\list(title = "Something to be \\escaped", entry = "long text with quotes, links and broken characters", category = c("windows", "mac", "test"))
Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions, not FAQ entries. Its syntax is not ideal either; I think a more modern markup, something like Markdown, would be more readable.
Is there something else out there, maybe examples of parsing markdown files into R structures? An example of deviating Rd files away from their intended purpose?
To summarise
I would like to come up with:
1- a good design for an R structure (a class, perhaps) that would extend the fortunes package to more general entries such as FAQ items
2- a more convenient format in which to enter new FAQs (rather than the current texinfo format)
3- a parser, written either in R or in some other language (bison?), to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.
Update 2: in the last two days of the bounty period I got two answers, both interesting but completely different. Because the question is quite broad (arguably ill-posed), neither answer provides a complete solution, so I will not (for now, anyway) accept an answer. As for the bounty, I'll award it to the answer most up-voted before the bounty expires, wishing there were a way to split it more equally.
(This addresses point 3.)
You can convert the texinfo file to XML
wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi
and then read it with the XML package.
library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply(doc, "//node", function(u) {
  list(list(
    title = xpathSApply(u, "nodename", xmlValue),
    contents = as(u, "character")
  ))
})
free(doc)
But it is much easier to convert it to text
makeinfo --plaintext R-FAQ.texi > R-FAQ.txt
and parse the result manually.
doc <- readLines("R-FAQ.txt")
# Split the document into questions
# i.e., around lines like ****** or ======.
i <- grep("[*=]{5}", doc) - 1
i <- c(1,i)
j <- rep(seq_along(i)[-length(i)], diff(i))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)
# Clean the result: since the questions
# are in the subsections, we can discard the sections.
faq <- faq[ sapply(faq, function(u) length(grep("[*]", u[2])) == 0) ]
# Use the result
cat(faq[[ sample(seq_along(faq),1) ]], sep="\n")
I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format that R can manipulate, presumably so that one can write R routines to extract information from the documentation more effectively.
There seem to be three assumptions here.
1) That it will be easy to convert these different document formats (texinfo, Rd files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type, and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.
2) That R is the right language in which to write such document-processing tools. I suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development environment to get better information about working with R. I'm not an R expert, but I think R is mainly a numerical language and does not offer any special help for string handling, pattern recognition, natural-language parsing, or inference, all of which I'd expect to play an important part in extracting information from the converted documents, which largely contain natural language. I'm not suggesting a specific alternative language (Prolog?), but if you succeed with the conversion to a normal form (task 1), you might be better off carefully choosing the target language for processing.
3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "information retrieval" and "data fusion" methods. But in fact, reasoning about informal documents has defeated most attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception, but even there it isn't clear what Watson "knows"; would you want Watson to answer the question "Should the surgeon open you with a knife?" no matter how much raw text you gave it?). The point is that you might succeed in converting the data, but it isn't clear what you can successfully do with it.
All that said, most markup systems applied to text have markup structure plus raw text. One can "parse" those into tree-like structures (or graph-like structures if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing such trees or graphs. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but this doesn't change the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed to building separate O(N) parsers to cover all the N markup styles; how otherwise would a tool know what the markup (and therefore the parse) should be? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup.) Once this parsing into the uniform notation is done, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or, if R isn't the right answer, parse it with whatever the right answer is.
There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody has already built. Our DMS Software Reengineering Toolkit is another; after parsing, DMS will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.
But I think your real problem will be deciding what you want to extract and do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, building all the up-front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend", but those words can hide a lot) that's more doable.

Efficiency of operations on R data structures

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.
For example:
I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.
The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
Also, I know the main R performance hint is to use vectorized operations whenever possible as opposed to loops.
What about the various flavors of apply?
Are those just hidden loops?
What about matrices vs. data frames?
Data I/O was one of the features I looked into before I committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:
1. That R doesn't handle big data (> 2 GB?). To me this is a misnomer. By default, the common data-input functions load your data into RAM. Not to be glib, but to me this is a feature, not a bug: anytime my data will fit in my available RAM, that's where I want it. Likewise, one of SQLite's most popular features is the in-memory option: the user has the easy option of loading the entire DB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it, via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via packages that exploit current technology/practices (for instance, I can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
2. The performance of read.table (read.csv, read.delim, et al.), the most common means of getting data into R, can be improved 5x (and often much more, in my experience) just by opting out of a few of read.table's default arguments; the ones having the greatest effect on performance are mentioned in R's help (?read.table). Briefly, the R developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' for comment.char if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.
Here are the snippets i use for those parameters:
To get the number of rows in your data file (supply this snippet as an argument to the parameter, 'nrows', in your call to read.table):
as.numeric(gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep = ""), intern = TRUE)))
To get the classes for each column:
function(fname) { sapply(read.table(fname, header = TRUE, nrows = 5), class) }
Note: you can't pass this snippet in as an argument; you have to call it first and then pass in the value returned. In other words, call the function, bind the returned value to a variable, and then pass the variable as the value of the 'colClasses' parameter in your call to read.table.
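For example, a sketch combining the snippets above (file_name and the comma separator are assumptions about your data file):
n <- as.numeric(gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep = ""), intern = TRUE)))
classes <- sapply(read.table(file_name, header = TRUE, nrows = 5), class)
dat <- read.table(file_name, header = TRUE, sep = ",", comment.char = "",
                  nrows = n, colClasses = classes)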
3. Using scan. With only a little more hassle, you can do better than that (optimizing 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually and then build my data.frame inside R, i.e., df <- data.frame(col1, col2, ...).
4. Use R's containers for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file format, '.RData', is a binary format that is a little smaller than a compressed ('.gz') txt data file. You create such files with save() and load them back into the R namespace with load(). The difference in load times compared with 'read.table' is dramatic. For instance, with a 25 MB file (uncompressed size):
system.time(read.table("tdata01.txt.gz", sep = ","))
##    user  system elapsed
##   6.173   0.245   6.450
system.time(load("tdata01.RData"))
##    user  system elapsed
##   0.912   0.006   0.912
5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably most useful when getting data out of R. The key point to keep in mind is that, by default, numbers in R expressions are interpreted as double-precision floating point; e.g., typeof(5) returns "double". Compare the object size of a reasonably sized array of each type and you can see the significance (use object.size()). So coerce to integer when you can.
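For instance, a small illustration (the sizes in the comments are approximate):
x_dbl <- rep(5, 1e6)    # doubles by default
x_int <- rep(5L, 1e6)   # integers via the L suffix (or as.integer())
object.size(x_dbl)      # roughly 8 MB
object.size(x_int)      # roughly 4 MB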
Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C, which makes a big difference performance-wise. [Edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.]
These things do pop up on the lists, in particular on r-devel. One fairly well-established nugget is that, e.g., matrix operations tend to be faster than data.frame operations. Then there are add-on packages that do well: Matt's data.table package is pretty fast, and Jeff has gotten xts indexing to be quick.
But it "all depends", so you are usually best advised to profile your particular code. R has plenty of profiling support, so you should use it. My Intro to HPC with R tutorials have a number of profiling examples.
I will try to come back and provide more detail. If you have any question about the efficiency of one operation over another, you would do best to profile your own code (as Dirk suggests). The system.time() function is the easiest way to do this, although there are many more advanced utilities (e.g., Rprof, as documented here).
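For example, a minimal sketch of both approaches (the toy computation is arbitrary):
# Quick timing of a single expression
system.time(replicate(100, sort(runif(1e5))))

# Sampling profiler over a larger chunk of code
Rprof("profile.out")
invisible(replicate(100, sort(runif(1e5))))
Rprof(NULL)
summaryRprof("profile.out")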
A quick response for the second part of your question:
What about the various flavors of apply? Are those just hidden loops?
For the most part, yes: the apply functions are just loops and can be slower than for statements. Their chief benefit is clearer code. The main exception I have found is lapply, which can be faster because it is coded in C directly.
And what about matrices vs. data frames?
Matrices are more efficient than data frames because they require less memory for storage. This is because data frames carry additional attribute data. From "An Introduction to R":
A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes
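As a quick, purely illustrative check of that overhead:
m <- matrix(runif(1e6), ncol = 10)
d <- as.data.frame(m)
object.size(m)   # the numeric data itself
object.size(d)   # the same data plus the data frame's extra attributes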

Resources