Different types of tasks in a single R/exams exercise

In some questions I need to include different types of tasks or sub-items, e.g., multiple-choice and numeric, in a single exercise. Is this possible in R/exams?
In the example below, assume that Part A and Part B are the tasks and must be within the same exercise. Clearly, they could easily be written as two separate exercises, but can they also be combined into a single exercise?
Question 01
Part A
What is the capital of Germany?
a. Bonn
b. Berlin
c. Munich
d. Hamburg
Part B
What is the population of Germany's capital?
##ANSWER1##

Cloze exercises
Such questions are supported in R/exams and are called cloze exercises. For worked examples see boxhist, boxhist2, fourfold, fourfold2, and lm, among others.
The caveat is that cloze exercises are not supported by all exams2xyz interfaces. Most importantly, though, they are supported by exams2moodle and exams2openolat.
Illustrative example: German capital (Rmd version)
To turn your illustrative example into an Rmd exercise, you can write it as follows:
Question
========
What is the capital of Germany? ##ANSWER1##
What is the population of Germany's capital (in millions)? ##ANSWER2##
Answerlist
----------
* Bonn
* Berlin
* Munich
* Hamburg
*
Meta-information
================
exname: German capital
extype: cloze
exclozetype: schoice|num
exsolution: 0100|3.669495
extol: 0.1
exshuffle: TRUE
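To preview and render the exercise, assuming it has been saved as capital.Rmd (the file name is just an illustration), something like the following should work with the exams package:
library("exams")
exams2html("capital.Rmd")      ## quick preview in the browser
exams2moodle("capital.Rmd")    ## Moodle XML import file
exams2openolat("capital.Rmd")  ## OpenOlat (QTI) import file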
Rendered in Moodle the exercise looks like this: [screenshot of the cloze exercise rendered in Moodle]
Details and variations
The ##ANSWERi## placeholders are replaced by the corresponding interaction elements in the final exercise. The answer list, in combination with the exclozetype, then provides the information necessary to build the interaction elements. Note that the fifth answer list element, corresponding to the population size, is empty.
It is also possible to write cloze exercises without ##ANSWERi## placeholders, but these are then somewhat more limited in how the interaction elements can be controlled. See boxhist vs. boxhist2 and fourfold vs. fourfold2.
Rnw version
The Rnw version of the same exercise is as follows:
\begin{question}
What is the capital of Germany? ##ANSWER1##
What is the population of Germany's capital (in millions)? ##ANSWER2##
\begin{answerlist}
\item Bonn
\item Berlin
\item Munich
\item Hamburg
\item
\end{answerlist}
\end{question}
\exname{German capital}
\extype{cloze}
\exclozetype{schoice|num}
\exsolution{0100|3.669495}
\extol{0.1}
\exshuffle{TRUE}

Related

Matching statutory provisions of two laws in R

In advance: sorry for all the Norwegian references, but I hope I've explained my problem well enough for them to still make sense.
So, in 2005 Norway got a new criminal law. The old one was somewhat unstructured (only three chapters), while the statutory provisions in the 2005 version have been structured into 31 chapters, depending on the area of the offense (see https://lovdata.no/dokument/NL/lov/2005-05-20-28). I call these "areas of law". For example, in the 2005 version, laws regarding sexual offenses are in chapter 26. Logically, the statutory provisions that belong to this chapter are then categorized as belonging to the area of law called "sexual offenses".
Some of the old laws have been restructured into the new chapters, some new ones have been added, and some have been repealed. I have what is called a "law mirror": a list where you can find where each old provision is in the new law, if it hasn't been repealed. The new law came into force for offenses committed from 1 October 2015.
An example of a law mirror: https://no.wikipedia.org/wiki/Straffeloven_(lovspeil). I've pivoted the list to a longer format, so that it looks like this:
Law mirror: "Seksuallovbrud" means sexual offense, "kap_2005" says which chapter of the 2005 law the statutory provision (Norwegian: "paragraf") falls under, and "straffelov" specifies whether the provision comes from the 2005 or the 1902 version of the law.
The data I have consist of two separate data frames. Df1 is the law mirror. Df2 consists of cases from the Norwegian court of appeals between 1993 and 2019 in which the criminal law was the basis of the verdict. I've made a dummy (strl1902) in Df2 for whether the verdict in the case came before or after the new law came into force; it equals 1 if the case falls under the old law. I've also extracted the number of the statutory provision.
On the basis of this I want to categorize the cases using statutory provisions from the old criminal law into the areas of law from the new law.
This is where I need help:
Does anyone have an idea of how I can distinguish between the provisions from the old and the new law, so that I can also make dummies for the provisions from the 1902 law and separate them into the areas of law of the 2005 law?
Hope this makes sense.
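One way this could be approached (this is only a minimal sketch, not from the original thread) is to join the case data against the law mirror and then build dummies from the resulting 2005 chapter. The column names (paragraf, straffelov, kap_2005 in the mirror, here df1; paragraf and strl1902 in the case data, here df2) and the coding of straffelov as 1902/2005 are assumptions based on the description above.
library("dplyr")
## map each case to the 2005 chapter ("area of law") via the law mirror
df2_mapped <- df2 %>%
  mutate(law_version = if_else(strl1902 == 1, 1902, 2005)) %>%  ## which version of the law applies
  left_join(
    df1 %>% select(paragraf, straffelov, kap_2005),
    by = c("paragraf" = "paragraf", "law_version" = "straffelov")
  )
## one 0/1 dummy per area of law, e.g. chapter 26 (sexual offenses)
df2_mapped <- df2_mapped %>%
  mutate(kap26_seksuallovbrudd = as.integer(kap_2005 == 26))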

How could you remove the similar portion from two large strings?

I am working on classification of some documents and a number of the documents have large sections of similar (and usually irrelevant) text. I would like to identify and remove those similar sections, as I believe I may be able to make a better model.
An example would be proposals by an organization, each of which contains the same paragraph regarding the organization's mission statement and purpose.
A couple of points make this difficult:
similar sections are not known ahead of time, making a fixed pattern inappropriate
could be located anywhere in the documents, documents do not have consistent structure
the pattern could be many characters long, e.g. 3000+ characters
I don't want to remove every similar word, just large sections
I don't want to identify which strings are similar, rather I want to remove the similar sections.
I've considered regex and looked through some packages like stringr, stringdist, and the base string functions, but these utilities seem useful if you already know the pattern and the pattern is much shorter, or if the documents have a similar structure. In my case the text could be structured differently and the pattern is not predefined; rather, it is whatever is similar between the documents.
I considered making and comparing lists of 3000-grams for each document but this didn't seem feasible or easy to implement.
Below is an example of what a complete solution would look like, but really I am not even sure how to approach this problem, so information in that direction would be useful as well.
Example code
doc_a <- "this document discusses african hares in the northern sahara. african hares
are the most common land dwelling mammal in the northern sahara. crocodiles eat
african hares. this text is from a book written for the foundation for education
in northern africa."
doc_b <- "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and
alligators are different. crocodiles eat african hares. crocodiles are the most common
land dwelling reptile in egypt. this text is from a book written for the foundation
for education in northern africa."
# this function would trim similar sections of 6 or more words in length
# (length in characters is also acceptable)
trim_similar(doc_a, doc_b, 6)
Output
[1] "this document discusses african hares in the northern sahara. african hares
mammal in the northern sahara. crocodiles eat african hares."
[2] "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and alligators
are different. crocodiles eat african hares. crocodiles reptile in egypt."
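One direction that could be explored (only a rough sketch, not the trim_similar() asked for above): collect the word n-grams that the two documents share and drop every word covered by one of them. It splits on whitespace and lower-cases, so punctuation stays attached to words and the example output above is not reproduced exactly, but it illustrates the n-gram idea.
## drop all words covered by a word n-gram that occurs in both documents
shared_ngram_filter <- function(doc_a, doc_b, n = 6) {
  tokenize <- function(x) strsplit(tolower(x), "\\s+")[[1]]
  ngrams <- function(w, n) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }
  wa <- tokenize(doc_a)
  wb <- tokenize(doc_b)
  shared <- intersect(ngrams(wa, n), ngrams(wb, n))
  drop_covered <- function(w) {
    covered <- rep(FALSE, length(w))
    for (i in which(ngrams(w, n) %in% shared)) covered[i:(i + n - 1)] <- TRUE
    paste(w[!covered], collapse = " ")
  }
  c(drop_covered(wa), drop_covered(wb))
}
shared_ngram_filter(doc_a, doc_b, 6)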

semantic matching strings - using word2vec or s-match?

I have this problem of matching two strings for 'more general', 'less general', 'same meaning', 'opposite meaning' etc.
The strings can be from any domain. Assume that the strings can be from people's emails.
To give an example,
String 1 = "movies"
String 2 = "Inception"
Here I should know that Inception is less general than movies (sort of is-a relationship)
String 1 = "Inception"
String 2 = "Christopher Nolan"
Here I should know that Inception is less general than Christopher Nolan
String 1 = "service tax"
String 2 = "service tax 2015"
At a glance it appears to me that S-match will do the job. But I am not sure if S-match can be made to work on knowledge bases other than WordNet or GeoWordNet (as mentioned in their page).
If I use word2vec or dl4j, I guess it can give me the similarity scores. But does it also support telling a string is more general or less general than the other?
But I do see that word2vec can be trained on a training set or a large corpus such as Wikipedia.
Can someone shed light on the way forward?
The current machine learning methods for modelling words, such as word2vec and dl4j, are based on the distributional hypothesis. They train models of words and phrases based on their context. There are no ontological aspects in these word models. At best, a well-trained model based on these tools can say whether two words can appear in similar contexts. That is how their similarity measure works.
The Mikolov papers (a, b and c), which suggest that these models can learn "linguistic regularities", do not include any ontological test analysis; they only suggest that these models are capable of predicting "similarity between members of the word pairs". This kind of prediction doesn't help your task. These models are even incapable of distinguishing similarity from relatedness (see, e.g., the SimLex test set).
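A tiny illustration of that last point (the three-dimensional "embeddings" below are made up, not real word2vec vectors): cosine similarity is symmetric, so even a perfect similarity score cannot tell you which of two strings is the more general one.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
v_movies    <- c(0.9, 0.1, 0.3)  ## made-up vector for "movies"
v_inception <- c(0.8, 0.2, 0.4)  ## made-up vector for "Inception"
cosine(v_movies, v_inception)    ## same value ...
cosine(v_inception, v_movies)    ## ... in both directions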
I would say that you need an ontological database to solve your problem. More specifically, regarding your examples: for String 1 and String 2 of the form
String 1 = "a"
String 2 = "b"
You are trying to check entailment relations in sentences:
(1) "c is b"
(2) "c is a"
(3) "c is related to a".
Where:
(1) entails (2)
or
(1) entails (3)
In your first two examples, you can probably use semantic knowledge bases to solve the problem. But your third example will probably need syntactic parsing before the difference between the two phrases can be understood. For example, consider these phrases:
"men"
"all men"
"tall men"
"men in black"
"men in general"
Solving your problem needs logical understanding. However, you can exploit the economy of language: adding more words to a phrase usually makes it less general, so longer phrases tend to be less general than shorter ones. This doesn't give you a precise tool to solve the problem, but it can help to analyse phrases that do not contain special words such as all, general, or every.
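A rough sketch of that heuristic (the quantifier word list is only an illustration): when neither phrase contains quantifier-like words, guess that the phrase with fewer words is the more general one.
more_general <- function(s1, s2, quantifiers = c("all", "every", "general")) {
  w1 <- strsplit(tolower(s1), "\\s+")[[1]]
  w2 <- strsplit(tolower(s2), "\\s+")[[1]]
  if (any(c(w1, w2) %in% quantifiers)) return(NA_character_)  ## heuristic not applicable
  if (length(w1) == length(w2)) return("similar generality")
  if (length(w1) < length(w2)) s1 else s2
}
more_general("service tax", "service tax 2015")  ## "service tax"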

Replicate Postgres pg_trgm text similarity scores in R?

Does anyone know how to replicate the (pg_trgm) Postgres trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these on a matrix of text strings in a .csv file than run a bunch of PostgreSQL queries.
Running similarity(string1, string2) in Postgres gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.
stringdist(string1, string2, method="qgram",q = 3 )
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre like science fiction. For example, suppose I have two book descriptions and a genre description:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a similarity score for the description of each book against the description of the science fiction genre like pg_trgm using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
  list(
    text1 = "hello there",
    text2 = "why hello there",
    text3 = "totally different"
  ),
  method = "cosine"
), 3)
#       text1 text2 text3
# text1 0.000 0.078 0.731
# text2 0.078 0.000 0.739
# text3 0.731 0.739 0.000
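Closer to pg_trgm specifically: its similarity() is essentially a Jaccard similarity over trigram sets (after lower-casing and some word padding), so stringdist can get close, although the result is not identical because stringdist does not pad words the way pg_trgm does. Assuming the two book descriptions and the genre description above are stored as character strings book1, book2, and genre1:
library("stringdist")
## Jaccard similarity on 3-grams, roughly comparable to pg_trgm's similarity()
stringsim(book1, genre1, method = "jaccard", q = 3)
stringsim(book2, genre1, method = "jaccard", q = 3)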

R documentation formatting for references

When documenting a reference in a manual, what escaping and formatting are needed to display the reference correctly when the package is compiled? For instance, the following is an attempt to format an APA6-style reference for R documentation.
\references{
Heylighen, F., \\& Dewaele, J.M. (2002). Variation in the contextuality of language: An
empirical measure. Context in Context, Special issue of Foundations of Science, 7 (3),
293-340.
}
What would I need to do to this so that it comes out looking roughly like this:
Heylighen, F., & Dewaele, J.M. (2002). Variation in the contextuality of language:
An empirical measure. Context in Context, Special issue of Foundations of Science,
7 (3), 293-340.
Maybe this information is located in some R documentation somewhere. If so, please kindly direct me to read that passage.
Based on an .Rd file that has a references section, I think you can simply not escape the ampersand.
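If that's right, the \references{} block would simply use a plain ampersand, e.g. (treat this as a suggestion rather than a verified answer, as it has not been checked against every output format):
\references{
Heylighen, F., & Dewaele, J.M. (2002). Variation in the contextuality of language: An
empirical measure. Context in Context, Special issue of Foundations of Science, 7(3),
293-340.
}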
