How to remove questions with repeated items? - r-exams

Frequently, when generating questions based on parameters, some of the generated questions have to be eliminated because some of their answer items turn out to be identical.
My code is the following.
The question:
```{r, include = FALSE}
a <- sample(1:1,1)
b <- sample(1:1,1)
```
Question
========
Let $z = `r (a^2)*(b^2)`$. Hence, $\sqrt{z}$ is equal to:
Answerlist
----------
* $`r a*b`$
* $`r -a*b`$
* $`r 2*a*b`$
* $`r 3*a*b`$
* $`r 4*a*b`$
Meta-information
================
exname: My question
extype: schoice
exsolution: 10000
exshuffle: TRUE
The code to generate the several versions of the question:
library(exams)
setwd("/tmp/")
expargrid <- function(file, ...) {
  df <- expand.grid(...)
  stopifnot(nrow(df) >= 1L)
  sapply(1L:nrow(df), function(i) {
    args <- as.list(df[i,])
    args <- c(list(file = file), args)
    do.call(exams::expar, args)
  })
}
n <- 1
myquestions <- expargrid(paste0("question",sprintf("%02d", n),".Rmd"), a = 0:1, b = 0:1)
exams2moodle(myquestions,dir = "/tmp/", schoice=list(answernumbering="none"), name="PM")
Can the bad questions be eliminated automatically from the Moodle xml file generated by R/Exams?

In general it is hard to catch such problems outside of the Rmd exercise files. Instead it would be better to write the R code in the exercise in such a way that it assures that the question list only contains unique items - and if that is not the case to keep on re-sampling the parameters until there is a version that works.
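A minimal sketch of such a re-sampling loop, as it might appear in the R chunk at the top of the exercise. The sampling range 0:3 and the `items` vector are assumptions for illustration; they simply mirror the five answer items of the question above.

```r
# Keep drawing parameters until all five answer items are distinct.
# The range 0:3 is an assumed example, not taken from the original exercise.
repeat {
  a <- sample(0:3, 1)
  b <- sample(0:3, 1)
  items <- c(a * b, -a * b, 2 * a * b, 3 * a * b, 4 * a * b)
  # anyDuplicated() returns 0 when every element is unique
  if (anyDuplicated(items) == 0L) break
}
```

With this particular item list the loop only stops once a * b is non-zero, because a zero product makes all five items equal.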
Catching the problem in general is also hard because it may depend on the random number generator and its seed, so it might occur only very rarely.
In your special case, this is a bit different because you make a full grid of all possible combinations so that each exercise file does not have any random elements anymore. Here the best solution is to run an exams2xyz() interface (or the underlying xexams() function) once, inspect the output, eliminate the problematic exercise, and then run the desired exams2xyz() interface again.
Relying on your myquestions vector with four static variants of the dynamic question01.Rmd exercise template you could do:
myq_check <- xexams(myquestions)
## Warning messages:
## 1: In driver$read(file_tex[idj]) :
## duplicated items in question list in '_tmp_Rtmpjhh2Lb_question01+60A32D0E+C9FE1'
## 2: In driver$read(file_tex[idj]) :
## duplicated items in question list in '_tmp_Rtmpjhh2Lb_question01+60A32D0E+CA2C6'
## 3: In driver$read(file_tex[idj]) :
## duplicated items in question list in '_tmp_Rtmpjhh2Lb_question01+60A32D0E+CA4CF'
This is similar to running exams2moodle(myquestions) but only weaves the Rmd files and reads them into R - without converting them to HTML and writing a Moodle XML file. Hence it's a bit faster and does not produce any files that would need to be cleaned up afterwards.
The output myq_check is a nested list with
* 1 list for n = 1 random replication,
* each containing 4 lists pertaining to the four exercise files from myquestions,
* each containing 6 list elements with
  * question text,
  * questionlist with the question items,
  * solution text (if any, empty in this case),
  * solutionlist with the solution explanations (if any, empty here),
  * metainfo with the meta-information,
  * supplements with the file paths of supplementary files (if any, empty here).
Running xexams(myquestions) already warns about problems in three (out of the four) exercise files. (You need to use R/exams version >= 2.4-0 to get these warnings.) By inspecting the number of unique items in the questionlist we can find out which exercises are affected:
ok <- sapply(myq_check[[1]], function(x) {
  length(x$questionlist) == length(unique(x$questionlist)) })
ok
## exercise1 exercise2 exercise3 exercise4
##     FALSE     FALSE     FALSE      TRUE
Thus, only the last exercise in myquestions is really suitable so you should subset before proceeding with exams2moodle():
myquestions <- myquestions[ok]
exams2moodle(myquestions)
As already pointed out above, this is sufficient to make the selection in this case. If there is remaining randomness in the exercise files, it might not be enough to catch all problems; then it would be better to program a custom solution into the Rmd exercise.
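If some randomness does remain, one partial mitigation is to check several replications and keep only the exercise files whose items were unique in every one of them. The sketch below uses a hand-built stand-in list (`mock_check`, a made-up name) shaped like the nested output described above (replications, then exercises, then elements), so it runs without the exams package. In practice the list would come from calling xexams() with n greater than 1. This only lowers the chance of a bad draw; it does not rule one out.

```r
# Stand-in for the nested list returned by xexams(): 2 replications x 2 exercises.
# Only the questionlist element is mocked here.
mock_check <- list(
  list(exercise1 = list(questionlist = c("$0$", "$0$", "$0$")),
       exercise2 = list(questionlist = c("$1$", "$-1$", "$2$"))),
  list(exercise1 = list(questionlist = c("$1$", "$-1$", "$2$")),
       exercise2 = list(questionlist = c("$2$", "$-2$", "$4$")))
)

# TRUE if an exercise has no repeated items
items_unique <- function(ex) anyDuplicated(ex$questionlist) == 0L

# one column per replication, one row per exercise
per_rep <- do.call(cbind, lapply(mock_check, function(r) sapply(r, items_unique)))

# keep an exercise only if it was fine in every replication
ok_all <- apply(per_rep, 1, all)
ok_all
```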

Related

Is there a knitr strategy for passing R content to LaTeX commands?

I'm creating a small R package that will allow a user to create exams with R code for tables and figures, multiple question types, randomly ordered questions, and randomly ordered responses on multiple choice items. Inserting R code into LaTeX is not problematic, but one issue I've run into is the need to "slap" together text ingested via R with LaTeX commands. Consider this example:
I have this in a LaTeX file:
\newcommand{\question}[1]{
\begin{minipage}{\linewidth}
\item
{#1}
\end{minipage}
}
I read the content with readr::read_file and store it in a variable. I then have the contents of the questions in a .json file:
...
{
"type" : "mc",
"section_no" : 1,
"points" : 2,
"question" : "What is the best beer?",
"correct_answer" : "Hamm's",
"lure_1" : "Miller Lite",
"lure_2" : "PBR",
"lure_3" : "Naturdays",
"lure_4" : "Leine's"
},
...
which I read with jsonlite::fromJSON (which converts to a dataframe), do some massaging, and store in a variable. Let's call the questions and their available options questions. What I've been doing is putting the necessary LaTeX content together with the character string manually with
question.tex <- paste0("\\question{", question[i], "\\\\}")
to achieve this in the knitted .tex file:
\question{What is the best beer?\\A. PBR\\B. Naturdays\\C. Miller Lite\\D. Leine's\\E. Hamm's\\}
but I'm thinking there has to be a better way to do this. I'm looking for a function that will allow for a more seamless passing of arguments to my LaTeX command, something like knitr::magic_func(latex.command, question[i]) to achieve the result above. Does this exist?
Maybe I am asking for an extra level of abstraction that knitr doesn't have (or wasn't designed to have)? Or perhaps there's a better way? I guess at this point I'm not far away from being able to create a function that reads the LaTeX command name, number of arguments, and inserts text appropriately, but better to not reinvent the wheel! Also, I think this question could be generalized to simpler commands like \title, \documentclass, etc.
Small MWE:
## (more backslashes since we need to escape in R)
tex.command <- "\\newcommand{\\question}[1]{
\\begin{minipage}{\\linewidth}
\\item
{#1}
\\end{minipage}
}"
q <- "What is the best beer?\\\\A. PBR\\\\B. Naturdays\\\\C. Miller Lite\\\\D. Leine's\\\\E. Hamm's\\\\"
## some magic function here?
magic_func(tex.command, q)
## desired result
"\\question{What is the best beer?\\\\A. PBR\\\\B. Naturdays\\\\C. Miller Lite\\\\D. Leine's\\\\E. Hamm's\\\\}"
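For what it is worth, something close to the hypothetical magic_func can be had with a few lines of base R. The sketch below is not a knitr function. `tex_cmd` is a made-up helper name, and it assumes every argument should be wrapped in its own pair of braces.

```r
# tex_cmd() is a hypothetical helper, not part of knitr:
# it builds "\name{arg1}{arg2}..." from a command name and its arguments.
tex_cmd <- function(name, ...) {
  args <- c(...)
  paste0("\\", name, paste0("{", args, "}", collapse = ""))
}

q <- "What is the best beer?\\\\A. PBR\\\\B. Naturdays\\\\C. Miller Lite\\\\D. Leine's\\\\E. Hamm's\\\\"
question.tex <- tex_cmd("question", q)
cat(question.tex, "\n")
```

The same helper covers simpler one-argument commands such as tex_cmd("title", "Midterm"), which is why the command name is passed as a string rather than parsed out of the \newcommand definition.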

Extracting Body of Text from Research Articles; Several Attempted Methods

I need to extract the body of texts from my corpus for text mining as my code now includes references, which bias my results. All coding is performed in R using RStudio. I have tried many techniques.
I have text mining code (of which only the first bit is included below), but recently found out that simply text mining a corpus of research articles is insufficient as the reference section will bias results; reference sections alone may provide another analysis, which would be a bonus.
EDIT: perhaps there is an R package that I am not aware of
My initial response was to clean the text formats after converting from pdf to text using Regex commands within quanteda. As a reference I was intending to follow: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005962&rev=1 . Their method confuses me not just in coding a parallel regex code, but in how to implement recognizing the last reference section to avoid cutting off portions of the text when "reference" appears prior to that section; I have been in contact with their team, but am waiting to learn more about their code since it appears they use a streamlined program now.
PubChunks and LAPDF-text were my next two options, the latter of which is referenced in the paper above. In order to utilize the PubChunks package I need to convert all of my pdf (now converted to text) files into XML. This should be straightforward, but the packages I found (fileToPDF, pdf2xml, trickypdf) did not appear to work; this seems to be a within-R concern. (Code relating to trickypdf is included below.)
For LAPDF-text, ...[see edit]... the code did not seem to run properly. There are also very limited resources out there for this package in terms of guides etc and they have shifted their focus to a larger package using different language that does happen to include LAPDF-text.
EDIT: I installed java 1.6 (SE 6) and Maven 2.0 then ran the LAPDF-text installer, which seemed to work. That being said, I am still having issues with this process and mvn commands recognizing folders though am continuing to work through it.
I am guessing there is someone else out there, as there are related research papers with similarly vague processes, who has done this before and has also got their hands dirty. Any recommendations are greatly appreciated.
Cheers
library(quanteda)
library(pdftools)
library(tm)
library(methods)
library(stringi) # regex pattern
library(stringr) # simpler than stringi ; uses stringi on backend
setwd('C:\\Users\\Hunter S. Baggen\\Desktop\\ZS_TestSet_04_05')
files <- list.files(pattern = 'pdf$')
summary(files)
files
# Length 63
corpus_tm <- Corpus(URISource(files),
readerControl = list(reader = readPDF()))
corpus_tm
# documents 63
inspect(corpus_tm)
meta(corpus_tm[[1]])
# convert tm::Corpus to quanteda::corpus
corpus_q <- corpus(corpus_tm)
summary(corpus_q, n = 2)
# Add Doc-level Variables here *by folder and meta-variable year
corpus_q
head(docvars(corpus_q))
metacorpus(corpus_q)
#_________
# extract segments ~ later to remove segments
# corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)
corpus_q_refA <- corpus_reshape(corpus_q, to = "paragraphs", showmeta = TRUE)
corpus_q_refA
# Based upon Westergaard et al (15 Million texts; removing references)
corpus_q_refB <- corpus_trim(corpus_q, what = c('sentences'), exclude_pattern = '^\\[\\d+\\]\\s[A-Za-z]')
corpus_q_refB # the single-backslash form '^\[\d+\]\s[A-Za-z]' errored: backslashes must be doubled inside an R string
corpus_tm[1]
sum(str_detect(corpus_q, '^Referen'))
corpus_qB <- corpus_q
RemoveRef_B <- corpus_segment(corpus_q, pattern = 'Reference', valuetype = 'regex')
cbind(texts(RemoveRef_B), docvars(corpus_qB))
# -------------------------
# Idea taken from guide (must reference guide)
setGeneric('removeCitations', function(object, ...) standardGeneric('removeCitations'))
setMethod('removeCitations', signature(object = 'PlainTextDocument'),
  function(object, ...) {
    c <- content(object)
    # remove citations starting with '>'
    # EG for > : citations <- grep('^[[:blank:]]*>.*', c) if (length(citations) > 0) c <- c[-citations]
    # EG for -- : signatureStart <- grep('^-- $', c) if (length(signatureStart) > 0) c <- c[-(signatureStart:length(c))]
    # using 15 mil removal guideline: drop lines that look like "[12] Author ..."
    citations <- grep('^\\[\\d+\\]\\s[A-Za-z]', c, perl = TRUE)
    if (length(citations) > 0) c <- c[-citations]
    content(object) <- c
    object
  })
# TRICKY PDF download from github
library(pubchunks)
library(polmineR)
library(githubinstall)
library(devtools)
library(tm)
githubinstall('trickypdf') # input Y then 1 if want all related packages
# library(trickypdf)
# This time suggested I install via 'PolMine/trickypdf'
# Second attempt issue with RPoppler
install_github('PolMine/trickypdf')
library(trickypdf) # Not working
# Failed to install package 'Rpoppler' is not available for R 3.6.0
Short of the RPoppler issue above the initial description should be sufficient.
UPDATE: Having reached out to several research groups the TALN-UPF researchers got back to me and provided me with a pdfx java program that has allowed me to convert my pdfs easily into xml. Of course, now I learn that PubChunks is created with its sister program that extracts xmls from search engines and therefore is of little use to me. That being said, the TALN-UPF group will hopefully advise whether I can extract the body of each text via their other programs (Dr Inventor and Grobid). If this is possible then everything will be accomplished. Of course if not I will be back at RegEx.
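For the pure-regex route, one heuristic is to cut each document at the last line that looks like a reference-section heading, so that earlier mentions of "reference" in the running text are left alone. A rough base-R sketch follows. `strip_references` is a made-up helper, and the heading pattern is an assumption that would need tuning to the journals in the corpus.

```r
# Hypothetical helper: drop everything from the last heading-like
# "References" / "Bibliography" / "Literature cited" line onwards.
strip_references <- function(text) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  # a heading is assumed to be a line holding only the word,
  # optionally preceded by a section number such as "7." or "7"
  heading <- "^\\s*(\\d+\\.?\\s*)?(references|bibliography|literature cited)\\s*$"
  hits <- grep(heading, lines, ignore.case = TRUE, perl = TRUE)
  if (length(hits) == 0L) return(text)   # nothing that looks like a heading
  cut <- max(hits)                       # use the LAST match only
  paste(lines[seq_len(cut - 1L)], collapse = "\n")
}

txt <- "Intro\nSee the reference list.\nMethods\nReferences\n[1] A. Author"
body <- strip_references(txt)
```

Applied to a corpus, this would be run on each document's text before building the quanteda corpus; the removed tail could be kept separately if the reference sections are to be analysed on their own.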

suppress line/index numbers in R output

Can I systematically suppress the index of the first element in the line of the output in R's output in the console?
I am looking for an option to prettify the output, without having to type anything extra. I imagine that if such a feat is possible, it would be set up as an option in the .renviron file (or similar). An RStudio-specific answer would be acceptable. Apologies if I have overlooked something obvious in the settings (I would have expected that option to be in Preferences --> Code --> Display).
Currently the R console and RStudio consoles display:
1+1
[1] 2
I would like to see:
1+1
2
I know I can get the above with cat(1+1), but what I'm looking for is a systematic change in the display style. Something like the typical Python output (open a terminal, type Python followed by 1+1. I want that)
Edit: Another example. In RStudio, if I define x=1:5, it appears as int [1:5] 1 2 3 4 5 in the environment: that's informative and I don't mind it. But in the R console, it looks like [1] 1 2 3 4 5, which I do not find informative, especially when there are multiple lines.
I have personally got used to these numbers, as I imagine everyone has, but that doesn't make them right: (1) they serve no purpose: if you widen the console, the lines get wider and the line numbers change (if they marked the 80-character width, ok, maybe they would serve a purpose), (2) when I copy-paste output into lecture notes, these line numbers interfere with clarity and confuse the novice.
I have not found an answer to this question, which is surprising, so please let me know if I have missed it. The following question is related but not a duplicate: https://stackoverflow.com/questions/3271939. Is there a duplicate I have missed?
Edit: As pointed out by Adiel Loinger in the comments section, these are not "line numbers", as I had called them, but "the index of the first element of the line being printed in the console". Thanks for the correction. I have tried to edit my question accordingly.
I believe the only way to do that is to modify the sources. R is open source, so that's not impossible, but it's not easy.
It's easier to change the print format for particular classes of objects. For example, if you don't like the way lm objects print, you can create your own print.lm method to do it yourself:
print.lm <- function (x, ...)
{
cat("My new version!")
}
Then
> lm(rnorm(10) ~ I(1:10))
My new version!
This doesn't work for things like 1+1, because for efficiency reasons, R always uses the internal version of the print method for auto-printing.
By the way, the printed indices do serve a purpose: if you print a long vector and wonder what the index is for some particular element, you only need to count from the start of the line, not from the start of the vector, to find it.
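A small check of that claim, using only base R. It prints a vector in a deliberately narrow console, then confirms that the bracketed number at the start of each output line is the index of the first value shown on that line. The width of 40 and the vector 101:130 are arbitrary choices for the illustration.

```r
old <- options(width = 40)          # narrow console so the vector wraps
x <- 101:130
out <- capture.output(print(x))
options(old)                        # restore the previous width

# bracketed index at the start of each printed line
idx <- as.integer(sub("^\\s*\\[(\\d+)\\].*$", "\\1", out))
# first value printed on each line
first_val <- as.integer(sub("^\\s*\\[\\d+\\]\\s+(\\d+).*$", "\\1", out))

# the index on each line points at that line's first value
identical(x[idx], first_val)
```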
You can work around indexes and row names by converting the answers to data frames. It's not perfect, but not too hard and depending on your application, maybe an improvement. Functions below.
Base function with the slightly annoying index:
my_fun <- function(foo){
  paste0("The answer is ", foo, "bar")
}
my_fun("foo")
[1] "The answer is foobar"
Improvement with data frame:
Note: For data frames with multiple rows, instead of just df, use print.data.frame(df, row.names = FALSE)
my_funner <- function(foo){
df <- data.frame("The_answer_is" = paste0(foo, "bar"), row.names = "")
df
}
my_funner("foo")
The_answer_is
foobar
Another option:
my_funnest <- function(foo){
df <- data.frame("Sorry_about" = "The_answer_is", "the_col_names" = paste0(foo, "bar"), row.names = "")
df
}
my_funnest("foo")
Sorry_about the_col_names
The_answer_is foobar
But those gaps are annoying, so one more option:
my_most_funnest <- function(foo){
df <- data.frame("Sorry_about_the_col_names" = paste0("The answer is ", foo, "bar"), row.names = "")
df
}
my_most_funnest("foo")
Sorry_about_the_col_names
The answer is foobar

Is there a way to page break the output in console [duplicate]

Is there an equivalent to the unix less command that can be used within the R console?
There is also page() which displays a representation of an object in a pager, like less.
dat <- data.frame(matrix(rnorm(1000), ncol = 10))
page(dat, method = "print")
Not really. There are the commands
* head() and tail() for showing the beginning and end of objects
* print() for explicitly showing an object, and just its name followed by return does the same
* summary() for concise summary that depends on the object
* str() for its structure
and more. An equivalent for less would be a little orthogonal to the language and system. Where the Unix shell offers you less to view the content of a file (which is presumed to be ascii-encoded), it cannot know about all types.
R is different in that it knows about the object types which is why summary() -- as well as the whole modeling framework -- are more appropriate.
Follow-up edit: Another possibility is provided by edit() as well as edit.data.frame().
I save the print output to a file and then read it using an editor or less.
Type the following in R
sink("Routput.txt")
print(varname)
sink()
Then in a shell:
less Routput.txt
If the file is already on disk, then you can use file.show
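The two ideas can be combined without leaving R: write the printed output to a temporary file and hand that file to file.show(), which opens it in the pager set in getOption("pager"). A minimal sketch; the data frame is just example data, and the interactive() guard is there so the snippet also runs in a batch session.

```r
dat <- data.frame(matrix(rnorm(1000), ncol = 10))

# write the printed representation to a temporary text file
tmp <- tempfile(fileext = ".txt")
writeLines(capture.output(print(dat)), tmp)

# page through it; skipped when R is not running interactively
if (interactive()) file.show(tmp)
```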
You might like my little toy here:
short <- function(x = seq(1, 20), numel = 4, skipel = 0, ynam = deparse(substitute(x))) {
  ynam <- as.character(ynam)
  # clean up spaces
  ynam <- gsub(" ", "", ynam)
  # unlist goes by columns, so transpose to get what's expected
  if (is.list(x)) x <- unlist(t(x))
  if (2 * numel >= length(x)) {
    print(x)
  } else {
    first <- 1 + skipel
    last <- numel + skipel
    cat(paste(ynam, '[', first, '] thru ', ynam, '[', last, ']\n', sep = ""))
    print(x[first:last])
    cat(' ... \n')
    cat(paste(ynam, '[', length(x) - numel - skipel + 1, '] thru ', ynam, '[', length(x) - skipel, ']\n', sep = ""))
    print(x[(length(x) - numel - skipel + 1):(length(x) - skipel)])
  }
}
blahblah copyright by me, not Disney blahblah free for use, reuse, editing, sprinkling on your Wheaties, etc.

Preserving long comments in console output. Not falling victim to ".... [TRUNCATED]"

I am trying to run a script that has lots of comments to explain each table, statistical test and graph. I am using RStudio IDE as follows
source(filename, echo=T)
That ensures that the script outputs everything to the console. If I run the following sequence it will send all the output to a txt file and then switch off the output diversion
sink("filenameIwantforoutput.txt")
source(filename, echo=T)
sink()
Alas, I am finding that a lot of my comments are not being outputted. Instead I get
"...but only if we had had an exclusively b .... [TRUNCATED]".
Once before I learned where to preserve the output but that was a few months ago and now I cannot remember. Can you?
Set the max.deparse.length= argument to source. You probably need something greater than the default of 150. For example:
source(filename, echo=TRUE, max.deparse.length=1e3)
And note the last paragraph in the Details section of ?source reads:
If ‘echo’ is true and a deparsed expression exceeds ‘max.deparse.length’, that many characters are output followed by ‘ .... [TRUNCATED] ’.
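A self-contained way to see the effect, assuming nothing beyond base R: write a throwaway script with one very long comment, source it twice, and compare the echoed output. The file contents and variable names are made up for the demonstration. keep.source = TRUE is passed explicitly because comments are only echoed when source references are kept, which is not the default in non-interactive sessions.

```r
# throwaway script with a comment well over 150 characters
script <- tempfile(fileext = ".R")
long_comment <- paste("#", strrep("a long explanatory comment ", 10))
writeLines(c(long_comment, "x <- 1 + 1"), script)

# default max.deparse.length (150): the echoed comment is cut off
out_default <- capture.output(
  source(script, echo = TRUE, keep.source = TRUE)
)

# no limit: the comment is echoed in full
out_full <- capture.output(
  source(script, echo = TRUE, keep.source = TRUE, max.deparse.length = Inf)
)

any(grepl("TRUNCATED", out_default, fixed = TRUE))
any(grepl("TRUNCATED", out_full, fixed = TRUE))
```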
You can make this behavior the default by overriding the source() function in your .Rprofile.
This seems like a reasonable case for overriding a function because, in theory, the change should only affect the screen output. We could contrive an example where this is not the case, e.g., capturing the screen output and using it as a variable, as in capture.output(source("somefile.R")), but that seems unlikely. Changing a function in a way that alters its return value will likely come back to bite you or whoever you share your code with (e.g., if you change the default of a function's na.rm argument).
.source_modified <- source
formals(.source_modified)$max.deparse.length <- Inf
# Use 'nms' because we don't want to do it for all because some were already
# unlocked. Thus if we lock them again, then we are changing the previous
# state.
# (https://stackoverflow.com/a/62563023/1376404)
rlang::env_binding_unlock(env = baseenv(), nms = "source")
assign(x = "source", value = .source_modified, envir = baseenv())
rlang::env_binding_lock(env = baseenv(), nms = "source")
rm(.source_modified)
An alternative is to create your own 'alias'-like function. I use the following in my .Rprofile:
s <- source
formals(s)$max.deparse.length <- Inf
formals(s)$echo <- TRUE
