`nops_eval` is not reading the exam ID, how can I control the exam ID reading? - r-exams

I have about 400 exams in which I defined a custom ID (using the function exams2nops()). However, when I use nops_scan() none of the IDs is recognized. Example (the scanned ID field and an exam sample were shown as images in the original post):
Is it due to the number of characters in the exam ID?

Yes, the exam ID needs to have exactly 11 digits. I will add a warning about this to exams2nops().
The "culprit" is this line from the internal read_nops_digits() function:
body(exams:::read_nops_digits)[[6]]
## n <- switch(type, type = 3L, id = 11L, scrambling = 2L)
Thus, when reading the id, the function expects 11 digits. However, I was pleasantly surprised that if you change this 11L to 5L then everything seems to work. You can do so programmatically by making a copy f of this function, changing the 11L to 5L, and overwriting the function in the exams package namespace:
library("exams")
f <- exams:::read_nops_digits
body(f)[[c(6, 3, 4)]] <- 5L
assignInNamespace("read_nops_digits", f, ns = "exams")
After that running nops_scan() should work as needed in your case.
Additional comment: Instead of overwriting the read_nops_digits() function programmatically as above, you can also modify the function "by hand" using an editor via:
fixInNamespace("read_nops_digits", ns = "exams")
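A further hedged option (not part of the original answer), more useful when generating new exams than when scanning existing ones: zero-pad the custom IDs to the 11 digits that read_nops_digits() expects before passing them to exams2nops().
# hypothetical short IDs, padded to the 11-digit format the scanner expects
my_ids <- c(12345, 407, 2)
padded_ids <- formatC(my_ids, width = 11, flag = "0", format = "d")
padded_ids
## [1] "00000012345" "00000000407" "00000000002"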

Related

"This R function is purely indicative and should never be called": Why am I getting this error?

I am returning to some old code. I am sure it worked in the past. Since I last used it, I've upgraded dplyr to version 1.0.0. Unfortunately, devtools::install_version("dplyr", version = "0.8.5") gives me an error whilst compiling, so I can't perform a regression test.
I am trying to create a tidy version of the mcmc class from the runjags package. The mcmc class is essentially a (large) two-dimensional matrix of arbitrary size. It's likely to have several (tens of) thousands of rows and the column names are relevant and (as in my toy data below) potentially awkward. There is also useful information in the attributes of the mcmc object. Hence the somewhat convoluted approach I've taken. It needs to be completely generic.
* Toy data *
# load("./data/oCRMPosteriorShort.rda")
# x <- head(oCRMPosteriorShort$mcmc[[1]])
# dput(x)
x <- structure(c(7.27091686833247, 5.72764789439587, 5.72103479848012,
7.43825337823404, 8.59970106873194, 8.03081445451, 9.16248677241767,
3.09793571064081, 4.66492638321819, 3.19480526258532, 5.1159808007229,
6.08361682213139, 5.05973067601884, 4.14556598358942, 0.95900563867179,
0.88584483221691, 0.950304627720881, 1.13467524314569, 1.44963882689823,
1.19907577185321, 1.15968445234753), .Dim = c(7L, 3L), .Dimnames = list(
c("5001", "5003", "5005", "5007", "5009", "5011", "5013"),
c("alpha[1]", "alpha[2]", "beta")), mcpar = c(5001, 5013,
2), class = "mcmc")
* Stage 1: Code that works: *
a <- attributes(x)
colNames <- a$dimnames[[2]]
#Get the sample IDs (from the attributes of x) and add the chain index
base <- tibble::enframe(a$dimnames[[1]], value="Sample") %>%
tibble::add_column(Chain=1, .before=1) %>%
dplyr::select(-.data$name)
# Create a list of tibbles, defining each as the contents of base plus the contents of the ith column of x,
# plus the name of the ith column in Temp.
t <- lapply(1:length(colNames), function(i) d <- cbind(base) %>% tibble(Temp=colNames[i], Value=x[,colNames[i]]))
At this point, I have a list of tibbles. Each tibble contains columns named Chain (with the value 1 for each observation in each tibble in this case), Sample (with values taken from the first dimension of the dimnames attribute of x), Temp (with values of beta, alpha[1] and alpha[2] in elements 3, 1 and 2 of the list), and Value (the value of the mcmc object in cell [Sample, Temp]).
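(For what it's worth, here is a hedged sketch, not the code that produces the error below, of an alternative route to the same tidy layout, assuming tibble/tidyr/dplyr are available:)
library(tibble)
library(tidyr)
library(dplyr)
# drop the mcmc class, keep the matrix, and reshape it to long format
tidy_x <- as_tibble(unclass(x), rownames = "Sample") %>%
  mutate(Chain = 1, .before = 1) %>%
  pivot_longer(-c(Chain, Sample), names_to = "Temp", values_to = "Value")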
* Stage 2: Here's the problem... *
It should be a simple matter to bind_rows() the list into a single tidy tibble containing (some of) the information I need. But:
# bind_rows the list of tibbles into a single object
rv <- dplyr::bind_rows(t)
# Error: `vec_ptype2.double.double()` is implemented at C level.
# This R function is purely indicative and should never be called.
# Run `rlang::last_error()` to see where the error occurred.
* Questions *
I can't see what I'm doing wrong here. (And even if I were doing something wrong, I'd expect a more user-friendly, higher level sort of error message.) I can't find any references to this error anywhere on the web.
Does anyone have any idea what's going on?
Would someone run the code using dplyr v0.8.x and report what they see?
I'd appreciate your thoughts.
* Update *
It looked as if the problem had been resolved by a reboot, but it has now returned. Even when these tibbles cause the error, a related example from the online docs works:
one <- starwars[1:4, ]
two <- starwars[9:12, ]
bind_rows(list(one, two))
runs without problems.
Context:
> # Context
> R.Version()$version.string
[1] "R version 3.6.3 (2020-02-29)"
> packageVersion("dplyr")
[1] ‘1.0.0’
> Sys.info()["version"]
version
"Darwin Kernel Version 18.7.0: Mon Apr 27 20:09:39 PDT 2020; root:xnu-4903.278.35~1/RELEASE_X86_64"

Most efficient way (fastest) to modify a data.frame using indexing

Little introduction to the question :
I am developing an ecophysiological model, and I use a Reference Class object called S that stores every object the model needs for input/output (e.g. meteo data, physiological parameters, etc.).
This list contains 5 objects (see example below):
- two data.frames, S$Table_Day (the outputs from the model) and S$Met_c (the input meteo data), which both have variables in columns and observations (input or output) in rows.
- a list of parameters S$Parameters.
- a matrix
- a vector
The model runs many functions with a daily time step. Each day is computed in a for loop that runs from the first day i=1 to the last day i=n. This object is passed to the functions, which often take data from S$Met_c and/or S$Parameters as input and compute something that is stored in S$Table_Day, using indexes (the ith day). S is a Reference Class object because Reference Classes avoid copy-on-modification, which is very important considering the number of computations.
The question itself :
As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two solutions to store my data: storing data by indexing into one of the preallocated data.frames is slower than storing it into an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs more operations when modifying by index (probably checking the length, type, etc.).
My question is : is there a better way to perform such operations? In other words, is there a way for me to use/store the inputs/outputs more efficiently (in a data.frame, a list of vectors or something else) to keep track of all the computations of each day? For example, would it be better to use many vectors (one for each variable) and regroup them into more complex objects (e.g. a list of data.frames) at the end?
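To make that last idea concrete, here is a minimal sketch (with hypothetical variable names, not code from the model) of the "many plain vectors, regroup at the end" pattern:
n <- 10000
bud_dd <- numeric(n)                 # one bare, preallocated vector per output variable
for (i in seq_len(n)) {
  bud_dd[i] <- i * 0.1               # placeholder for the daily computation
}
out <- data.frame(Bud_dd = bud_dd)   # regroup into a data.frame once, after the loop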
By the way, am I right to use Reference Classes to avoid copying the big objects in S while passing it to functions and modifying it from within them?
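For reference, a minimal sketch (not from the original post) of the reference semantics in question: a Reference Class field modified inside a function stays modified, without the object being copied or returned:
Counter <- setRefClass("Counter", fields = list(n = "numeric"))
cnt <- Counter$new(n = 0)
bump <- function(obj) obj$n <- obj$n + 1   # modifies the object in place
bump(cnt)
cnt$n   # 1: the change made inside bump() persists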
Reproducible example for the comparison:
library(microbenchmark)
library(ggplot2)
SimulationClass <- setRefClass("Simulation",
fields = list(Table_Day = "data.frame",
Met_c= "data.frame",
PerCohortFruitDemand_c="matrix",
Parameters= "list",
Zero_then_One="vector"))
S= SimulationClass$new()
# Initializing the table with dummy numbers :
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
f1= function(i){
a= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
f2= function(i){
S$Table_Day$Bud_dd[(i-1000):i]= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
res= microbenchmark(f1(1000),f2(1000),times = 10000)
autoplot(res)
And the result (autoplot not reproduced here; it shows f2, the indexed data.frame assignment, taking much longer than f1, the plain vector):
Also if someone has any experience in programming such models, I am deeply interested in any advice for model development.
I read more about the question, and I'll just write down here for posterity some of the solutions that were proposed in other posts.
Apparently, reading and writing are both worth considering when trying to reduce the computation time of assignment to a data.frame by index.
The sources are all found in other discussions:
How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)
Faster i, j matrix cell fill
Time in getting single elements from data.table and data.frame objects
Several solutions appeared relevant :
Use a matrix instead of a data.frame if possible to leverage in place modification (Advanced R).
Use a list instead of a data.frame, because [<-.data.frame is not a primitive function (Advanced R).
Write functions in C++ and use Rcpp (from this source)
Use .subset2 to read instead of [ (third source)
Use data.table as recommended by @JulienNavarre and @Emmanuel-Lin and the different sources, and use either set for a data.frame or := if using a data.table is not a problem.
Use [[ instead of [ when possible (index by one value only). This one is not very effective, and very restrictive, so I removed it from the following comparison.
Here is the analysis of performance using the different solutions :
The code :
# Loading packages :
library(data.table)
library(microbenchmark)
library(ggplot2)
# Creating dummy data :
SimulationClass <- setRefClass("Simulation",
fields = list(Table_Day = "data.frame",
Met_c= "data.frame",
PerCohortFruitDemand_c="matrix",
Parameters= "list",
Zero_then_One="vector"))
S= SimulationClass$new()
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
# Transforming data objects into simpler forms :
mat= as.matrix(S$Table_Day)
Slist= as.list(S$Table_Day)
Metlist= as.list(S$Met_c)
MetDT= as.data.table(S$Met_c)
SDT= as.data.table(S$Table_Day)
# Setting up the functions for the tests :
f1= function(i){
S$Table_Day$Bud_dd[i]= cumsum(S$Met_c$DegreeDays[i])
}
f2= function(i){
mat[i,4]= cumsum(S$Met_c$DegreeDays[i])
}
f3= function(i){
mat[i,4]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f4= function(i){
Slist$Bud_dd[i]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f5= function(i){
Slist$Bud_dd[i]= cumsum(Metlist$DegreeDays[i])
}
f6= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", cumsum(S$Met_c$DegreeDays[i]))
}
f7= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", MetDT[i,cumsum(DegreeDays)])
}
f8= function(i){
SDT[i,Bud_dd := MetDT[i,cumsum(DegreeDays)]]
}
i= 6000:6500
res= microbenchmark(f1(i),f3(i),f4(i),f5(i),f7(i),f8(i), times = 10000)
autoplot(res)
And the resulting autoplot (not reproduced here; it shows the list-based f5 as the fastest and the data.table-based solutions as the slowest):
With f1 the reference base assignment, f2 using a matrix instead of a data.frame, f3 using the combination of .subset2 and a matrix, f4 using a list and .subset2, f5 using two lists (for both reading and writing), f6 using data.table::set, f7 using data.table::set and data.table for the cumulative sum, and f8 using data.table :=.
As we can see, the best solution is to use lists for both reading and writing. It is pretty surprising to see that data.table gives the worst results here; I believe I did something wrong with it, because it is supposed to be the best. If you can improve it, please tell me.

R loop through association rule mining (arules) with changing supports and item appearances

Fairly new to R here. I am looking to mine association rules in R for specific items, but I want to vary the minimum support target for these rules for each item (i.e. 10% of the item's total frequency in the transaction list). Each item has a different number of transactions, so I believe there is value in varying the support.
I've calculated the support targets for each item in a separate spreadsheet using Excel.
I can do this by manually writing the arules code and inputting the minimum support and item appearances, but the process is slow, especially with many different items.
ex.
arules <- apriori(trans, parameter = list(sup = 0.001, conf = 0.25,target="rules"),
appearance=list(rhs= c("Apples")))
arules2 <- apriori(trans, parameter = list(sup = 0.002, conf = 0.25,target="rules"),
appearance=list(rhs= c("Oranges")))
combined <- c(arules,arules2)
How can I do this using a for loop in R that will calculate the rules for each specified item at a specific support minimum, and also save those generated rules to a new variable each time the loop runs? I intend to later group these rules depending on their type.
I tried something like this, which looped through too many times (the nested loops pair every item with every support). I also couldn't figure out a way to save the result of each iteration to a new variable (i.e. arules1, arules2, arules3, ...).
min_supp <- c(0.001,0.002,0.003)
names <- c("Apples","Oranges","Grape")
for (inames in names) {
for (supports in min_supp) {
apriori(trans, parameter = list(sup = supports, conf = 0.25,target="rules"),
appearance=list(rhs= inames))
}}
Thanks in advance!
Consider Map (the simplified wrapper to mapply), which can iterate elementwise through same-length vectors for a multiple-apply method. Additionally, Map will output a list of the returned items, which can be named with setNames. Lists are always preferred, as you avoid separate, similarly structured objects flooding the global environment.
min_supp <- c(0.001,0.002,0.003)
names <- c("Apples","Oranges","Grape")
arules_fun <- function(n, s) apriori(trans, parameter = list(sup = s, conf = 0.25, target="rules"),
appearance=list(rhs= n))
# PROCESS FUNCTION ELEMENTWISE
arules_list <- Map(arules_fun, names, min_supp)
# NAME LIST ITEMS
arules_list <- setNames(arules_list, paste0("arules", 1:length(arules_list)))
arules_list$arules1
arules_list$arules2
arules_list$arules3
...
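If the per-item rule sets later need to be grouped back into a single object (as with combined <- c(arules, arules2) in the question), the list elements can be combined in one go. A hedged sketch, assuming c() on rules objects behaves as in the question's manual code:
# combine the per-item rule sets into one rules object
combined <- do.call(c, unname(arules_list))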

speed up large result set processing using rmongodb

I'm using rmongodb to get every document in a particular collection. It works, but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website: cnub.org/rmongodb.ashx
count <- mongo.count(mongo, ns, query)
cursor <- mongo.find(mongo, query)
name <- vector("character", count)
age <- vector("numeric", count)
i <- 1
while (mongo.cursor.next(cursor)) {
b <- mongo.cursor.value(cursor)
name[i] <- mongo.bson.value(b, "name")
age[i] <- mongo.bson.value(b, "age")
i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))
This works fine for hundreds or thousands of results but that while loop is VERY VERY slow. Is there some way to speed this up? Maybe an opportunity for multiprocessing? Any suggestions would be appreciated. I'm averaging 1M per hour and at this rate I'll need a week just to build the data frame.
EDIT:
I've noticed that the more vectors in the while loop the slower it gets. I'm now trying to loop separately for each vector. Still seems like a hack though, there must be a better way.
Edit 2:
I'm having some luck with data.table. It's still running, but it looks like it will finish the 12M (this is my current test set) in 4 hours. That's progress, but far from ideal.
dt <- data.table(uri=rep("NA",count),
time=rep(0,count),
action=rep("NA",count),
bytes=rep(0,count),
dur=rep(0,count))
i <- 1L
while (mongo.cursor.next(cursor)) {
b <- mongo.cursor.value(cursor)
set(dt, i, 1L, mongo.bson.value(b, "cache"))
set(dt, i, 2L, mongo.bson.value(b, "path"))
set(dt, i, 3L, mongo.bson.value(b, "time"))
set(dt, i, 4L, mongo.bson.value(b, "bytes"))
set(dt, i, 5L, mongo.bson.value(b, "elaps"))
i <- i + 1L
}
You might want to try the mongo.find.exhaust option
cursor <- mongo.find(mongo, query, options=mongo.find.exhaust)
This would be the easiest fix if it actually works for your use case.
However, the rmongodb driver seems to be missing some extra features available in other drivers. For example, the JavaScript driver has a Cursor.toArray method, which directly dumps all the find results to an array. The R driver has a mongo.bson.to.list function, but a mongo.cursor.to.list is probably what you want. It's probably worth pinging the driver developer for advice.
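As a rough illustration, such a helper could be built from the functions already shown above; this is a hypothetical sketch and still a plain R loop, so it describes the interface rather than a speed-up:
cursor_to_list <- function(cursor) {
  out <- list()
  i <- 1L
  while (mongo.cursor.next(cursor)) {
    # convert each BSON document to a plain R list and collect it
    out[[i]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
    i <- i + 1L
  }
  out
}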
A hacky solution could be to create a new collection whose documents are data "chunks" of 100,000 of the original documents each. Then each of these could be read efficiently with mongo.bson.to.list. The chunked collection could be constructed using the mongo server's MapReduce functionality.
I know of no faster way to do this in a general manner. You are importing data from a foreign application and working with an interpreted language and there's no way rmongodb can anticipate the structure of the documents in the collection. The process is inherently slow when you are dealing with thousands of documents.

`levels<-`( What sorcery is this?

In an answer to another question, @Marek posted the following solution:
https://stackoverflow.com/a/10432263/636656
dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L,
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA, -20L), class = "data.frame")
`levels<-`(
factor(dat$product),
list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
)
Which produces as output:
[1] Generic Generic Bayer Bayer Advil Tylenol Generic Advil Bayer Generic Advil Generic Advil Tylenol
[15] Generic Bayer Generic Advil Bayer Bayer
This is just the printout of a vector; so to store it, you can do the even more confusing:
res <- `levels<-`(
factor(dat$product),
list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
)
Clearly this is some kind of call to the levels function, but I have no idea what's being done here. What is the term for this kind of sorcery, and how do I increase my magical ability in this domain?
The answers here are good, but they are missing an important point. Let me try and describe it.
R is a functional language and does not like to mutate its objects. But it does allow assignment statements, using replacement functions:
levels(x) <- y
is equivalent to
x <- `levels<-`(x, y)
The trick is, this rewriting is done by <-; it is not done by levels<-. levels<- is just a regular function that takes an input and gives an output; it does not mutate anything.
One consequence of that is that, according to the above rule, <- must be recursive:
levels(factor(x)) <- y
is
factor(x) <- `levels<-`(factor(x), y)
is
x <- `factor<-`(x, `levels<-`(factor(x), y))
It's kind of beautiful that this pure-functional transformation (up until the very end, where the assignment happens) is equivalent to what an assignment would be in an imperative language. If I remember correctly this construct in functional languages is called a lens.
But then, once you have defined replacement functions like levels<-, you get another, unexpected windfall: you don't just have the ability to make assignments, you have a handy function that takes in a factor, and gives out another factor with different levels. There's really nothing "assignment" about it!
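A minimal illustration of that reading (not from the original answer): calling levels<- directly just returns a new factor and leaves the original untouched:
f <- factor(c("a", "b"))
g <- `levels<-`(f, c("low", "high"))
levels(f)  # "a" "b"       -- unchanged
levels(g)  # "low" "high"  -- the returned factor has the new levels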
So, the code you're describing is just making use of this other interpretation of levels<-. I admit that the name levels<- is a little confusing because it suggests an assignment, but this is not what is going on. The code is simply setting up a sort of pipeline:
Start with dat$product
Convert it to a factor
Change the levels
Store that in res
Personally, I think that line of code is beautiful ;)
No sorcery, that's just how (sub)assignment functions are defined. levels<- is a little different because it is a primitive to (sub)assign the attributes of a factor, not the elements themselves. There are plenty of examples of this type of function:
`<-` # assignment
`[<-` # sub-assignment
`[<-.data.frame` # sub-assignment data.frame method
`dimnames<-` # change dimname attribute
`attributes<-` # change any attributes
Other binary operators can be called like that too:
`+`(1,2) # 3
`-`(1,2) # -1
`*`(1,2) # 2
`/`(1,2) # 0.5
Now that you know that, something like this should really blow your mind:
Data <- data.frame(x=1:10, y=10:1)
names(Data)[1] <- "HI" # How does that work?!? Magic! ;-)
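A hedged sketch (simplified; R actually routes this through a `*tmp*` variable) of roughly what that line expands to:
# names(Data)[1] <- "HI"   is roughly equivalent to:
Data <- `names<-`(Data, `[<-`(names(Data), 1, "HI"))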
The reason for that "magic" is that the "assignment" form must have a real variable to work on. And the factor(dat$product) wasn't assigned to anything.
# This works since its done in several steps
x <- factor(dat$product)
levels(x) <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
x
# This doesn't work although it's the "same" thing:
levels(factor(dat$product)) <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
# Error: could not find function "factor<-"
# and this is the magic work-around that does work
`levels<-`(
factor(dat$product),
list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
)
For user code, I do wonder why such language manipulations are used. You ask what magic this is, and others have pointed out that you are calling the replacement function that has the name levels<-. For most people this is magic, and really the intended use is levels(foo) <- bar.
The use case you show is different because product doesn't exist in the global environment; it only ever exists in the local environment of the call to levels<-, so the change you want to make does not persist - there was no reassignment of dat.
In these circumstances, within() is the ideal function to use. You would naturally wish to write
levels(product) <- bar
in R, but of course product doesn't exist as an object. within() gets around this because it sets up the environment you wish to run your R code against and evaluates your expression within that environment. Assigning the return object from the call to within() thus gives you the properly modified data frame.
Here is an example (you don't need to create the new datX objects - I just do that so the intermediate steps remain visible at the end):
## one or t'other
#dat2 <- transform(dat, product = factor(product))
dat2 <- within(dat, product <- factor(product))
## then
dat3 <- within(dat2,
levels(product) <- list(Tylenol=1:3, Advil=4:6,
Bayer=7:9, Generic=10:12))
Which gives:
> head(dat3)
product
1 Generic
2 Generic
3 Bayer
4 Bayer
5 Advil
6 Tylenol
> str(dat3)
'data.frame': 20 obs. of 1 variable:
$ product: Factor w/ 4 levels "Tylenol","Advil",..: 4 4 3 3 2 1 4 2 3 4 ...
I struggle to see how constructs like the one you show are useful in the majority of cases - if you want to change the data, change the data; don't create another copy and change that (which is all the levels<- call is doing, after all).
