Is it possible to use exams2pdf to obtain the solution of the exact same version generated with exams2nops? - r-exams

When generating exams using the function exams2nops we randomly generate data for each of the produced exams (let's say 5 different versions). We would like to use the exact same version of each exam to produce the solutions version (using exams2pdf). Is it possible to create the solution version on the go when generating exams with exams2nops? By exact same version I mean the same order of the multiple-choice answers and the same wrong values (using the marvelous num_to_schoice function). We save the .rds objects used in each exercise, which lets us import them when generating solutions; however, the wrong options and their order differ because they are random. Should we also save a specific seed in the .rds object? Inside each exercise, we have several randomly generated values.

When you set the same random seed prior to calling exams2pdf() and exams2nops(), you should get the same random versions of the exercises.
Illustration: n = 2 versions of an exam with 3 exercises.
library("exams")
exm <- c("capitals.Rmd", "deriv2.Rmd", "tstat2.Rmd")
set.seed(1)
exm1 <- exams2pdf(exm, n = 2)
set.seed(1)
exm2 <- exams2nops(exm, n = 2)
Compare the question list of all three exercises in the second random version of the exams:
all.equal(exm1[[2]][[1]]$questionlist, exm2[[2]][[1]]$questionlist)
## [1] TRUE
all.equal(exm1[[2]][[2]]$questionlist, exm2[[2]][[2]]$questionlist)
## [1] TRUE
all.equal(exm1[[2]][[3]]$questionlist, exm2[[2]][[3]]$questionlist)
## [1] TRUE
Both have to be called separately, though; there is currently no option to produce both in one go.
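The same principle can be checked without building any PDFs. A minimal base-R sketch, in which make_version() is a hypothetical stand-in for one randomized exercise (it is not part of the exams package) and the seed value is arbitrary:

```r
# Hypothetical stand-in for a randomized exercise: draws a correct value
# and four wrong options, then shuffles their order.
make_version <- function() {
  correct <- round(runif(1, 10, 20), 2)
  wrong <- round(correct + sample(c(-3, -2, -1, 1, 2, 3), 4), 2)
  sample(c(correct, wrong))
}

seed <- 1  # the only thing that needs to be stored

set.seed(seed)
exam_versions <- replicate(2, make_version(), simplify = FALSE)

set.seed(seed)
solution_versions <- replicate(2, make_version(), simplify = FALSE)

# Same seed, same sequence of random calls: identical values and order
print(identical(exam_versions, solution_versions))
```

Storing the seed is enough to regenerate both outputs later, provided the exercise files and the random number generator settings do not change between the two runs.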

Related

interpreting R code function

I would like to perform pathway enrichment analyses.
I have 21 lists of significant genes and multiple types of pathways I would like to check (i.e. check for enrichment in KEGG pathways, GO terms, complexes, etc.).
I found this example of code, on an old BioC post. However, I am having trouble adapting it for myself.
Firstly,
1- what does this mean? I don't know this multiple colon syntax.
hyperg <- Category:::.doHyperGInternal
2 - I don't understand how this line works. hyperg.test is a function that needs 3 variables passed to it, correct? Is this line somehow passing genes.by.pathway, significant.genes, and all.geneIDs to hyperg.test?
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
Code that I would like to adapt
library(KEGGREST)
library(org.Hs.eg.db)
# created named list, length 449, eg:
# path:hsa00010: "Glycolysis / Gluconeogenesis"
pathways <- keggList("pathway", "hsa")
# make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))
# for demonstration, just use the first ten pathways
demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)
genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
demo.pathway$GENE[c(TRUE, FALSE)]
})
all.geneIDs <- keys(org.Hs.eg.db)
# chose one of these for demonstration. the first (a whole genome random
# set of 100 genes) has very little enrichment, the second, a random set
# from the pathways themselves, has very good enrichment in some pathways
set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)
# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also
hyperg <- Category:::.doHyperGInternal
hyperg.test <-
function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
white.balls.in.urn <- length(pathway.genes)
total.balls.in.urn <- length(all.genes)
black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
balls.pulled.from.urn <- length(significant.genes)
hyperg(white.balls.in.urn, black.balls.in.urn,
balls.pulled.from.urn, white.balls.drawn, over)
}
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
print(pVals.by.pathway)
The reason you are getting your error is that you don't appear to have the Category package installed from Bioconductor. I suspect this because of the triple colon operator :::. This operator is very similar to the double colon operator ::. Whereas :: lets you access exported objects from a package without loading it, ::: also allows access to non-exported objects (in this case the internal .doHyperGInternal function from Category, which the code assigns to hyperg). If you install the Category package the code runs without error.
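If relying on a non-exported function is a concern, an over-representation p-value can also be sketched with base R's phyper(). This is a swapped-in alternative, not the internal Category function from the post; hyperg.base is a hypothetical name, and the sketch assumes the gene vectors contain no duplicates and that the pathway genes are part of all.genes:

```r
# P(X >= overlap) under the hypergeometric distribution, using the same
# urn vocabulary as the original code.
hyperg.base <- function(pathway.genes, significant.genes, all.genes) {
  white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
  white.balls.in.urn <- length(pathway.genes)
  black.balls.in.urn <- length(all.genes) - white.balls.in.urn
  balls.pulled.from.urn <- length(significant.genes)
  # phyper(q, ...) with lower.tail = FALSE gives P(X > q), hence the "- 1"
  phyper(white.balls.drawn - 1, white.balls.in.urn, black.balls.in.urn,
         balls.pulled.from.urn, lower.tail = FALSE)
}

# Toy data: a 10-gene pathway in a universe of 100 genes, 5 significant
# genes that all fall inside the pathway.
p <- hyperg.base(pathway.genes = 1:10, significant.genes = 1:5,
                 all.genes = 1:100)
print(p)
```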
With regard to the sapply statement:
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
You can break this down into separate parts to understand it. First, sapply iterates over the elements of genes.by.pathway and passes each one as the first argument of hyperg.test. The arguments that follow (significant.genes and all.geneIDs) are forwarded unchanged to every call and are matched by position to the second and third parameters. This is a little unclear, so I personally recommend naming the parameters explicitly: it avoids surprises and removes the dependence on argument order. It is a little repetitive in this case, but it is a good way to avoid a silly bug (e.g. putting significant.genes after all.geneIDs).
Rewritten:
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes=significant.genes, all.genes=all.geneIDs))
Once this loop completes, sapply simplifies the output into a matrix with one column per pathway. Taking the transpose with t() gives one row per pathway, which is much more user-friendly.
Generally speaking, when trying to understand complex apply statements I find it best to break them apart into smaller parts and see what the objects themselves look like.
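A small self-contained sketch of that advice, using made-up toy data and a hypothetical helper count.overlap in place of hyperg.test, shows how the extra arguments travel through sapply and what the transpose does:

```r
# Hypothetical helper with the same argument layout as hyperg.test
count.overlap <- function(pathway.genes, significant.genes, all.genes) {
  c(pathway.size = length(pathway.genes),
    overlap = length(intersect(pathway.genes, significant.genes)),
    universe = length(all.genes))
}

toy.pathways <- list(p1 = c("a", "b", "c"), p2 = c("c", "d"))

# Each list element becomes pathway.genes; the named arguments are passed
# unchanged to every call.
raw <- sapply(toy.pathways, count.overlap,
              significant.genes = c("a", "c"), all.genes = letters)
print(dim(raw))   # one column per pathway

res <- t(raw)     # one row per pathway
print(res)
```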

R: conditional expand.grid function

I would like to find all combinations of vector elements that match a specific condition. The function expand.grid returns all possible combinations without checking for a specific condition. It is possible to test for the condition after calling expand.grid, but in some situations the number of possible combinations is too large to generate with expand.grid. Is there a function that lets me check a condition while generating the combinations?
This is a simplified version of the problem:
A <- seq.int(12, from=0, by=1)*15
B <- seq.int(27, from=0, by=1)*23
C <- seq.int(18, from=0, by=1)*18
D <- seq.int(33, from=0, by=1)*10
out<-expand.grid(A,B,C,D) #out is a dataframe with 235144 x 4 as dimensions
idx<-which(rowSums(out)<=400 & rowSums(out)>=300) #Only a small fraction of 'out' is needed
results <- out[idx, ]
In a word, no. After all, if you knew a priori which combinations were desirable/undesirable, you could exclude them from the expansion, e.g. expand.grid(A[A<20],B[B<15],...) . In the general case, which I'm assuming is your real question, you have no simple way to exclude portions of the input vectors.
You might just want to write a multilevel loop which tests each combination in turn and saves or rejects it. This will be slow (again, unless you come up with some clever algorithm to predict regions which are all TRUE or FALSE). So, in the long run, you may be better off using some of the R-packages which partition large calculations (and datasets) so as to avoid exceeding your memory limits.
Now that I've said all that, someone's going to post a link to a package which does exactly that :-(
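For the specific example in the question there is a way to prune early, because every value is non-negative: once a partial sum exceeds the upper bound, no extension of it can come back into range. A minimal base-R sketch under that assumption (it would not be valid with negative values or a non-monotone condition), adding one vector at a time:

```r
A <- seq.int(from = 0, to = 12) * 15
B <- seq.int(from = 0, to = 27) * 23
C <- seq.int(from = 0, to = 18) * 18
D <- seq.int(from = 0, to = 33) * 10
lower <- 300
upper <- 400
vecs <- list(A = A, B = B, C = C, D = D)

# Start with the first vector, then add one vector at a time and drop every
# partial combination whose running total already exceeds the upper bound.
res <- data.frame(A = A, total = A)
res <- res[res$total <= upper, , drop = FALSE]
for (nm in names(vecs)[-1]) {
  v <- vecs[[nm]]
  i <- rep(seq_len(nrow(res)), times = length(v))  # row of the partial result
  j <- rep(seq_along(v), each = nrow(res))         # element of the new vector
  res <- cbind(res[i, , drop = FALSE], setNames(data.frame(v[j]), nm))
  res$total <- res$total + res[[nm]]
  res <- res[res$total <= upper, , drop = FALSE]
}

# The lower bound can only be checked once all vectors have been added
res <- res[res$total >= lower, c(names(vecs), "total")]
rownames(res) <- NULL
print(nrow(res))
```

The intermediate tables never hold more rows than the surviving partial combinations times the length of the next vector, which is what keeps memory use below that of the full grid.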

permutation with repetition

In R, how can I produce all the permutations of a group that contains some repeated elements?
Example :
A = {1,1,2,2,3}
solution :
1,1,2,2,3
1,1,2,3,2
1,1,3,2,2
1,2,1,2,3
1,2,2,1,3
1,2,2,3,1
.
.
using the gtools package,
library(gtools)
x <- c(1,1,2,2,3)
permutations(5, 5, x, set = FALSE)
Just use the combinat package:
A = c(1,1,2,2,3)
library(combinat)
permn(A)
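As far as I can tell, both of the calls above treat the five positions as distinct and therefore return all 120 orderings, duplicates included. If only the distinct arrangements are wanted (30 for this example, i.e. 5!/(2! 2! 1!)), a small recursive base-R sketch can generate them directly; unique_perms is a hypothetical helper name, not a function from any package:

```r
# Distinct permutations of a vector with repeated elements.
# At each position only the unique remaining values are tried, so no
# duplicate rows are ever produced.
unique_perms <- function(x) {
  if (length(x) <= 1) return(matrix(x, nrow = 1))
  out <- list()
  for (v in unique(x)) {
    rest <- x[-match(v, x)]            # remove one occurrence of v
    out[[length(out) + 1]] <- cbind(v, unique_perms(rest))
  }
  unname(do.call(rbind, out))
}

A <- c(1, 1, 2, 2, 3)
perms <- unique_perms(A)
print(dim(perms))
print(head(perms))
```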
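If a single random permutation is enough, you can do it with built-in R (note that this returns one shuffled ordering, not the full list):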
permute <- function(vec,n=length(vec)) {
permute.index <- sample.int(length(vec),n)
return(vec[permute.index])
}
permute(A)
Using the permute package:
x <- c(1,1,2,2,3)
require(permute)
allPerms(x, observed = TRUE)
I have done extensive research on combinations and permutations. The result I found is written up in a book known as Junction (an art of counting combinations and permutations). To view my site, go to https://sites.google.com/site/junctionslpresentation/home
I also have a solution for your question: a way to order permutations of multiple identical objects. I call it CON of MSNO, which stands for Combination Order Number of Multiple Same Number of Objects.
To view this method of ordering, go to https://sites.google.com/site/junctionslpresentation/proof-for-advance-permutation
At the bottom of that page I have attached some Word documents. The solution you need is in the documents 12 Proof (CON of MSNO) and 13 Proof (Converse of CON of MSNO). Download them to view the material properly.

How to perform basic Multiple Sequence Alignments in R?

(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)
The task I'm trying to achieve is to align several sequences.
I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I have had missing values introduced to them at random points.
Here is an example of such sequences, where on the left we see the real location of the missing values, and on the right we see the sequence that we are able to observe.
My goal is to reconstruct the left column using only the sequences in the right column (based on the fact that many of the letters in each position are the same).
Real_sequence The_sequence_we_see
1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
Here is an example code to reproduce the above example:
ATCG <- c("A","T","C","G")
set.seed(40)
original.seq <- sample(ATCG, 30, T)
seqS <- matrix(original.seq,200,30, T)
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
number.of.changes <- sample(seq_len(number.of.changes), 1)
new.letters <- sample(letters.to.change.with , number.of.changes, T)
where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F)
x[where.to.change.the.letters] <- new.letters
return(x)
}
change.letters(original.seq)
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)
seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))
seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
require(stringr)
# library(help=stringr)
all.seqS <- str_replace_all(seqS4, "-", "")  # remove every gap, not just the first
# how do we align this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
I understand that if all I had was a string and a pattern I would be able to use
library(Biostrings)
pairwiseAlignment(...)
But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).
Is there a known method for doing this in R?
Writing an alignment algorithm in R looks like a bad idea to me, but there is an R interface to the MUSCLE algorithm in the bio3d package (function seqaln()). Be aware of the fact that you have to install this algorithm first.
Alternatively, you can use any of the available algorithms (e.g. ClustalW, MAFFT, T-COFFEE) and import the multiple sequence alignments into R using Bioconductor functionality. See e.g. here.
Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found on http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.
You can perform multiple alignment in R with the DECIPHER package.
Following your example, it would look something like:
library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)
It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!
You are looking for a global alignment algorithm on multiple sequences.
Did you look at Wikipedia before asking ?
First learn what global alignment is, then look for multiple sequence alignment.
Wikipedia doesn't give a lot of details about algorithms, but this paper is better.
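Whichever aligner is used, the final step the question asks for, recovering the most likely original letter at each position, comes down to a per-column majority vote over the aligned sequences. A minimal base-R sketch, assuming the aligned sequences are plain strings of equal length with "-" marking gaps (the three toy strings below are made up for illustration):

```r
aligned <- c("CGCA-T", "CGCAAT", "C-CAAT")

# One row per sequence, one column per alignment position
chars <- do.call(rbind, strsplit(aligned, ""))

consensus <- vapply(seq_len(ncol(chars)), function(j) {
  letters.j <- chars[chars[, j] != "-", j]   # ignore gaps in this column
  if (length(letters.j) == 0) return("-")    # column made of gaps only
  names(which.max(table(letters.j)))         # most frequent letter
}, character(1))

consensus.seq <- paste(consensus, collapse = "")
print(consensus.seq)
```

Ties are broken alphabetically here because table() sorts its names; a real analysis may want a more careful rule.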

Plotting of very large data sets in R

How can I plot a very large data set in R?
I'd like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and calculate the summaries needed to make these plots? If so how?
In supplement to my comment on Dmitri's answer, here is a function to calculate quantiles using the ff big-data handling package:
ffquantile<-function(ffv,qs=c(0,0.25,0.5,0.75,1),...){
stopifnot(all(qs<=1 & qs>=0))
ffsort(ffv,...)->ffvs
j<-(qs*(length(ffv)-1))+1
jf<-floor(j);ceiling(j)->jc
rowSums(matrix(ffvs[c(jf,jc)],length(qs),2))/2
}
This is an exact algorithm, so it uses sorting -- and thus may take a lot of time.
The problem is that you can't load all the data into memory, so you could sample the data, as indicated earlier by @Marek. On such huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Exact progressive calculation of quantiles is not possible, but sampling should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.
If you can't subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame in text format when it's getting too big. If you play a bit with the connection, you could easily convert this to a socketConnection or other to read it from a server, a database, whatever. Just make sure you open the connection in the correct mode.
Take a simple .csv file; the following function then samples a fraction p of the data:
sample.df <- function(f,n=10000,split=",",p=0.1){
con <- file(f,open="rt")
on.exit(close(con,type="rt"))
y <- data.frame()
#read header
x <- character(0)
while(length(x)==0){
x <- strsplit(readLines(con,n=1),split)[[1]]
}
Names <- x
#read and process data
repeat{
x <- tryCatch(read.table(con,nrows=n,sep=split),error = function(e) NULL )
if(is.null(x)) {break}
names(x) <- Names
nn <- nrow(x)
id <- sample(1:nn,round(nn*p))
y <- rbind(y,x[id,])
}
rownames(y) <- NULL
return(y)
}
An example of the usage :
#Make a file
Df <- data.frame(
X1=1:10000,
X2=1:10000,
X3=rep(letters[1:10],1000)
)
write.csv(Df,file="test.txt",row.names=F,quote=F)
# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)
#clean up
unlink("test.txt")
All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), which is all easily precomputed. Take a look at the boxplot.stats function.
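To illustrate, the plot itself can be drawn from summaries alone with bxp(), which takes the same list structure that boxplot() computes internally. In this sketch the two simulated groups merely stand in for data that would be summarised elsewhere (in chunks, in a database, and so on); only the small summ list is needed for plotting:

```r
set.seed(42)
# Stand-ins for two groups whose raw data would normally live elsewhere
g1 <- rnorm(1000)
g2 <- rnorm(1000, mean = 1)

s1 <- boxplot.stats(g1)
s2 <- boxplot.stats(g2)

# Precomputed summaries in the layout bxp() expects
summ <- list(
  stats = cbind(s1$stats, s2$stats),   # 5 rows: whisker, hinge, median, hinge, whisker
  n     = c(s1$n, s2$n),
  out   = c(s1$out, s2$out),           # outlier values
  group = rep(1:2, c(length(s1$out), length(s2$out))),  # which box each outlier belongs to
  names = c("g1", "g2")
)

bxp(summ)
```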
You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large datasets it can be useful to store the data in a database and pull only pieces into R. The databases can also do the sorting for you, and computing quantiles on sorted data is much simpler (then just use the quantiles to do the plots).
There is also the hexbin package (bioconductor) for doing scatterplot equivalents with very large datasets (probably still want to use a sample of the data, but works with a large sample).
You could put the data into a database and calculate the quantiles using SQL. See : http://forge.mysql.com/tools/tool.php?id=149
This is an interesting problem.
Boxplots require quantiles. Computing quantiles on very large datasets is tricky.
The simplest solution that may or may not work in your case is to downsample the data first, and produce plots of the sample. In other words, read a bunch of records at a time, and retain a subset of them in memory (choosing either deterministically or randomly.) At the end, produce plots based on the data that's been retained in memory. Again, whether or not this is viable very much depends on the properties of your data.
Alternatively, there exist algorithms that can economically and approximately compute quantiles in an "online" fashion, meaning that they are presented with one observation at a time, and each observation is shown exactly once. While I have some limited experience with such algorithms, I have not seen any readily-available R implementations.
The following paper presents a brief overview of some relevant algorithms: Quantiles on Streams.
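One simple approximate approach that fits the "read a chunk, update a summary" pattern is to accumulate counts in fixed bins and read quantiles off the cumulative counts. This is a rough base-R sketch, not one of the algorithms from the paper; the value range, the bin width, and the simulated chunks (standing in for data read from disk) are all assumptions:

```r
set.seed(1)
breaks <- seq(-6, 6, by = 0.01)           # assumed value range and bin width
counts <- numeric(length(breaks) - 1)

for (chunk in 1:10) {
  x <- rnorm(1e4)                          # stand-in for one chunk read from disk
  bin <- findInterval(x, breaks, rightmost.closed = TRUE)
  # Values outside the assumed range fall in bin 0 or beyond the last bin
  # and are silently dropped by tabulate()
  counts <- counts + tabulate(bin, nbins = length(counts))
}

cum <- cumsum(counts) / sum(counts)

# Upper edge of the first bin whose cumulative share reaches p
approx.quantile <- function(p) breaks[which(cum >= p)[1] + 1]

q <- sapply(c(0.25, 0.5, 0.75), approx.quantile)
print(q)
```

Only the counts vector is kept between chunks, so memory use depends on the number of bins rather than on the size of the data; the error is bounded by the bin width as long as the range assumption holds.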
You could make plots from a manageable sample of your data. E.g. if you use only 10% of the rows, chosen at random, then a boxplot of this sample shouldn't differ much from the all-data boxplot.
If your data are in a database, you may be able to create a random flag there (as far as I know, almost every database engine has some kind of random number generator).
The second question is how large your dataset is. For a boxplot you need two columns: a value variable and a group variable. This example:
N <- 1e6
x <- rnorm(N)
b <- sapply(1:100, function(i) paste(sample(letters,40,TRUE),collapse=""))
g <- factor(sample(b,N,TRUE))
boxplot(x~g)
needs 100MB of RAM. If N=1e7 then it uses <1GB of RAM (which is still manageable on a modern machine).
Perhaps you can think about using disk.frame to summarise the data down first before running the plotting?
The problem with R (and other languages like Python and Julia) is that you have to load all your data into memory to plot it. As of 2022, the best solution is to use DuckDB (there is an R connector), it allows you to query very large datasets (CSV, parquet, among others), and it comes with many functions to compute summary statistics. The idea is to use DuckDB to compute those statistics, load such statistics into R/Python/Julia, and plot.
Computing a boxplot with SQL + R
You need a bunch of statistics to plot a boxplot. If you want a complete reference, you can look at matplotlib's code. The code is in Python, but the code is pretty straightforward, so you'll get it even if you don't know Python.
The most critical pieces are the percentiles; you can compute those in DuckDB like this (just change the placeholders):
SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY "{{column}}") AS q1,
percentile_disc(0.50) WITHIN GROUP (ORDER BY "{{column}}") AS med,
percentile_disc(0.75) WITHIN GROUP (ORDER BY "{{column}}") AS q3,
AVG("{{column}}") AS mean,
COUNT(*) AS N
FROM "{{path/to/data.parquet}}"
You need some other statistics to create the boxplot with all its details. For full implementation, check this (note: it's written in Python). I had to implement this for a package I wrote called JupySQL, which allows plotting very large datasets in Jupyter by leveraging SQL engines such as DuckDB.
Once you compute the statistics, you can use R to generate the boxplot.
