Multiple feature vectors in Weka

I am working on a text categorization project using Weka, and I have 12 classes.
I need to find the text keywords for each class that distinguish it from the others,
so I am thinking of building a feature vector (FV) for each class independently and storing the 12 FVs in 12 separate ARFF files.
The question is: how can I combine 12 different feature vectors into one feature vector?

Depending on whether the classes overlap, I propose two different approaches instead of joining the feature vectors:
If the classes do not overlap (that is, no document belongs to two or
more classes at the same time), you should build a single ARFF
file and then use the AttributeSelection filter (Ranker
search with the InfoGainAttributeEval evaluator is suggested) to determine which
features discriminate best among all the classes.
If the classes overlap, you could build twelve one-against-the-rest
classifiers, each one with its own vocabulary. You could apply
attribute selection to each independent problem as well, finding the
features that best discriminate a single class from all of the rest.
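To make the ranking idea concrete, here is a minimal sketch of information-gain ranking over binary "term present" features. It is written in plain Python rather than against the Weka API, and it is not how InfoGainAttributeEval is implemented internally; the function names (entropy, info_gain, rank_terms, one_vs_rest) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(docs, labels, term):
    """Entropy reduction from splitting the documents on presence of `term`.

    `docs` is a list of sets of terms, `labels` the matching class labels.
    """
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_term, without_term) if part)
    return entropy(labels) - remainder


def rank_terms(docs, labels):
    """All terms, most discriminative first (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    return sorted(vocabulary, key=lambda t: (-info_gain(docs, labels, t), t))


def one_vs_rest(labels, target):
    """Relabel for the overlapping case: `target` against everything else."""
    return [lab if lab == target else "rest" for lab in labels]
```

For the non-overlapping case you would call rank_terms once on the full label list. For the one-against-the-rest case you would call it twelve times, each time on the labels returned by one_vs_rest for a different target class, which gives each class its own ranked vocabulary.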

walk through all components in model

I have a fairly large model with components grouped hierarchically about 3 levels deep. It would be useful for me to be able recursively iterate through my components and list inputs and outputs, as well as all the option values, and format all that data to my liking so I can make a nice report with it.
Calling list_inputs() and list_outputs() on a given group sort of does what I want, in that it prints the inputs and outputs, but if you call it on a large group you can't get the inputs and outputs of a single component next to each other on the page.
I could probably reverse engineer how list_inputs() is working itself but was wondering if there is an easy way to do it.
As you noted, list_inputs and list_outputs are both methods defined on the System class. Though these methods do group their print-outs by component, the challenge is that you get all the inputs first, then all the outputs. You can't easily see the inputs and outputs for a single component together.
Both of these methods can have their printing shut off by setting out_stream=None, and each of them returns a list of variable data that you can manually parse through. That may not give you the format you want though.
If you want to manually recurse over the hierarchy and write your own custom report method, then you should look at the following methods on System (i.e. components and groups):
get_io_metadata
system_iter
Those, combined with the data returned from list_inputs and list_outputs should give you what you need.
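As a rough sketch of how these pieces could fit together, the function below walks the hierarchy with system_iter and asks each system for its inputs and outputs through get_io_metadata, so that both end up side by side per component. The names collect_component_report and format_report are hypothetical, the choice of metadata keys ("units", "shape") is only an example, and the sketch assumes the model has already been set up.

```python
def collect_component_report(model, component_type=None):
    """Gather options, inputs and outputs per system, keyed by pathname.

    `model` is assumed to be an OpenMDAO System (e.g. prob.model).
    `component_type` is passed to system_iter as `typ`; leave it as None
    to visit groups as well as components.
    """
    report = {}
    for system in model.system_iter(include_self=True, recurse=True,
                                    typ=component_type):
        report[system.pathname] = {
            # Copy the option values into a plain dict.
            "options": {name: system.options[name] for name in system.options},
            "inputs": system.get_io_metadata(
                iotypes=("input",), metadata_keys=("units", "shape")),
            "outputs": system.get_io_metadata(
                iotypes=("output",), metadata_keys=("units", "shape")),
        }
    return report


def format_report(report):
    """Render the collected data with inputs and outputs next to each other."""
    lines = []
    for path, info in report.items():
        lines.append(path or "<model>")
        for section in ("options", "inputs", "outputs"):
            lines.append(f"  {section}:")
            for name, value in info[section].items():
                lines.append(f"    {name}: {value}")
    return "\n".join(lines)
```

Passing the Component base class (from openmdao.core.component) as component_type should restrict the walk to components, which avoids listing the same variables again at every enclosing group level. From there, format_report can be replaced with whatever layout the report needs.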

Is it possible to remove the letter prefix ("a. ", "b. ", etc.) from the possible answers in multiple choice tests?

When producing multiple choice questions, exams prefixes the possible answers with lower case letters. Is it possible to change this behaviour when using exams2qti21 so that the answers are displayed without this prefix?
e.g. to go from
a. 12
b. 35
c. 15
d. 25
to simply,
12
35
15
25
I would like to do this because our content management system, "itsLearning" can randomise the possible answers (per student) and the inclusion of the letter prefixes messes this up.
You can do this by setting the enumerate argument to FALSE for the mchoice and/or schoice questions. By default, the setting of mchoice is also propagated to schoice. So this should do what you want:
exams2qti21(..., mchoice = list(enumerate = FALSE))
As an additional comment:
Letting the learning management system do the randomization is more efficient if the exercises and choice lists are otherwise static. Then you just need to upload one exercise and re-use it because the learning management system does the shuffling.
Letting the exams2xyz() interface from R/exams do the shuffling, on the other hand, gives you far more options than most learning management systems support. In particular you can generate the choice lists fully dynamically (as in deriv2 or tstat2) or you can do subsampling from a large static list (as in capitals). In both cases I would switch off the shuffling in the learning management system.

Checking for similarity of text in two text strings

I have two strings of text (typically two paragraphs). I am looking to check for the "similarity" between them, e.g. check if one paragraph is a plagiarised version of the other. Ideally I need a similarity score, as well as an indication of where the similarities are. I prefer to do this fully in R. Any suggestions please?
The difference between strings can be measured with the Levenshtein distance (or concepts that build on top of it). The main idea is to quantify the "editing distance" between strings: how many letters need to be inserted, deleted, or changed (depending on the algorithm, more or fewer types of edits are allowed). A package in R for this task would be fuzzyjoin.
To locate the similarities, you could split both texts (the original and the suspected plagiarism) into sentences and build the fuzzy joins on those. Then you can filter for the best matches. The topic is a bit tricky, so I recommend trying out different algorithms (Jaccard distance, Damerau-Levenshtein, etc.). An introduction to the topic can be found here: https://cran.r-project.org/web/packages/fuzzyjoin/readme/README.html
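To illustrate the idea independently of any package, here is a minimal sketch of the sentence-by-sentence approach: a plain Levenshtein distance, a normalised similarity score, and a best-match lookup per sentence. It is written in Python purely to show the algorithm and is not what fuzzyjoin does internally; all function names are hypothetical, and the sentence splitter is deliberately naive.

```python
import re


def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def split_sentences(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def best_matches(original, suspect):
    """For each sentence in `suspect`, the closest sentence in `original`
    and its similarity score, as (suspect, original, score) tuples."""
    originals = split_sentences(original)
    if not originals:
        return []
    matches = []
    for sentence in split_sentences(suspect):
        best = max(originals, key=lambda o: similarity(sentence, o))
        matches.append((sentence, best, similarity(sentence, best)))
    return matches
```

The per-sentence scores show where the similarities are, and an average (or the share of sentences above some threshold) could serve as the overall similarity score. Swapping similarity for a Jaccard score on word sets would be one way to compare algorithms.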

How do I insert text before a group of exercises in an exam?

I am very new to R and to R/exams. I've finally figured out basic things like compiling a simple exam with exams2pdf and exams2canvas, and I've figured out how to arrange exercises such that this group of X exercises gets randomized in the exam and others don't.
In my normal written exams, sometimes I have a group of exercises that require some introductory text (e.g., a brief case study on which the next few questions are based, or a specific set of instructions for the questions that follow).
How do I create this chunk of text using R/exams and Rmd files?
I can't figure out if it's a matter of creating a particular Rmd file and then simply adding that to the list when creating the exam (like a dummy file of sorts that only shows text, but isn't numbered), or if I have to do something with the particular tex template I'm using.
There's a post on R-forge about embedding a plain LaTeX file between exercises that seems to get at what I'm asking, but I'm using Rmd files to create exercises, not Rnw files, and so, frankly, I just don't understand it.
Thank you for any help.
There are two strategies for this:
1. Separate exercise files in the same sequence
Always use the same sequence of exercises, say, ex1.Rmd, ex2.Rmd, ex3.Rmd where ex1.Rmd creates and describes the setting and ex2.Rmd and ex3.Rmd simply re-use the variables created internally by ex1.Rmd. In the exams2xyz() interface you have to ensure that all exercises are processed in the same environment, e.g., the global environment:
exams2pdf(c("ex1.Rmd", "ex2.Rmd", "ex3.Rmd"), envir = .GlobalEnv)
For .Rnw exercises this is not necessary because they are always processed in the global environment anyway.
2. Cloze exercises
Instead of separate exercise files, combine all exercises in a single "cloze" exercise ex123.Rmd that combines three sub-items. For a simple exercise with two sub-items, see: http://www.R-exams.org/templates/lm/
Which strategy to use?
For exams2pdf() both strategies work and it is more a matter of taste whether one prefers all exercises together in a single file or split across separate files. However, for other exams2xyz() interfaces only one or neither of these strategies works:
exams2pdf(): 1 + 2
exams2html(): 1 + 2
exams2nops(): 1
exams2moodle(): 2
exams2openolat(): 2
exams2blackboard(): -
exams2canvas(): -
Basically, strategy 1 is only guaranteed to work for interfaces that generate separate files for separate exams like exams2pdf(), exams2nops(), etc. However, for interfaces that create pools of exercises for learning management systems like exams2moodle(), exams2canvas(), etc. it typically cannot be ensured that the same random replication is drawn for all three exercises. (Thus, if there are two random replications per exercise, A and B, participants might not get A/A/A or B/B/B but A/B/A.)
Hence, if ex1/2/3 are multiple-choice exercises that you want to print and scan automatically, then you could use exams2nops() in combination with strategy 1. However, strategy 2 would not work because cloze exercises cannot be scanned automatically in exams2nops().
In contrast, if you want to use Moodle, then exams2moodle() could be combined with strategy 2. Strategy 1, however, would not work (see above).
As you are interested in Canvas export: In Canvas neither of the two strategies works. It doesn't support cloze exercises. And to the best of my knowledge it is not straightforward to ensure that exercises are sampled "in sync".

Properly define the members and invariants of a class in R

We have an R package for a certain purpose. The basic data structure is a correlation function, which is a real/complex valued function for a smallish (100) number (T) of time slices. We have multiple measurements (N) of it, so at its core it is an N×T matrix. But then there are more things that it can become:
One can bootstrap it with R samples such that it becomes an R×T matrix. However we want to keep the original data, so there is a field for the R×T matrix and another for the N×T matrix.
It can be symmetrized which will cut T in half and also alter various other functions that work with those objects.
Also it can be shifted which takes the difference between consecutive elements and therefore drops one time slice. The first column in the matrix then corresponds to t = 1 and not t = 0 any more, which becomes important in fits to the data.
Correlation functions may have an imaginary part, this is stored as a second real matrix. But they might not.
When doing non-linear operations with the data, we do that once with the average of the original data and each bootstrap sample. If the result is another correlation function, that object will not have “original data” but only the average.
So basically we have a class that can have various fields and only the average of the original data is really common.
To make things worse, there is no formal documentation for the possible members and the invariants associated with them. Coming from C++, where a concise class definition allows me to do encapsulation, the S3 class system in R seems like an invitation for inconsistencies.
This surfaced a few times when some function took such a correlation function as an argument and expected some field to be present when it was not. The code is riddled with lines that just add another field to the class when performing an operation.
Long story short: Is there some automatically enforceable way in the S3 class system to have an exhaustive list of all the fields that a class can have? Right now I only see the possibility to document them (in English) in the constructor function and just hope nobody missed a line where fields were added.
