Large DFA examples? - fsm

I am doing some research on DFA optimization, which requires several large DFA examples with inputs.
I tried programming a DFA generator, but the resulting DFAs would be artificial.
Do you, by any chance, know of some large DFAs or FSMs that I can refer to?


How can I generate exams that contain randomly-generated single-choice answers using R/exams package?

I am interested in using the R/exams package in order to generate tests composed of 'single-choice' questions. The three most important things that I am looking for are:
- being able to randomly select one (or more) out of a set of exercises for each participant
- being able to randomly shuffle the answer alternatives
- being able to randomly select numbers, text blocks, and graphics using the R programming language
I have followed the basic R/exams tutorials and was able to generate the demo exams, but I have not yet been able to find a full tutorial on how to achieve these goals. I am a beginner R programmer and would therefore need a step-by-step tutorial.
If there are any suggestions for such tutorials, I would really appreciate the help.
Thank you
Everything you are looking for can be accomplished with R/exams. There is no single step-by-step tutorial that illustrates everything, but there are quite a few bits and pieces that should get you started.
Do you want to generate written single-choice exams, or do you want to conduct your tests in a learning management system such as Moodle? If you're looking for written exams, then exams2nops() is the most complete solution; see:
http://www.R-exams.org/tutorials/exams2nops/
For setting up single-choice exercises based on numeric questions, a step-by-step tutorial is: http://www.R-exams.org/tutorials/static_num_schoice/
If you prefer an arithmetic illustration rather than one from economics, there is:
http://www.R-exams.org/general/user2019/
For selecting one out of a set of exercises for each participant, you need to define an exam with a list of exercises, e.g.,
exm <- list(
  c("a.Rmd", "b.Rmd", "c.Rmd"),
  c("d.Rmd", "e.Rmd")
)
When you call exams2xyz(exm), you get an exam with two exercises: the first is drawn at random from a.Rmd-c.Rmd and the second from d.Rmd-e.Rmd.
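For instance, a minimal sketch (the .Rmd file names above are just placeholders) that turns this list into five randomized written exams:

library(exams)
set.seed(1)   # reproducible exercise sampling and answer shuffling
exams2nops(exm, n = 5, dir = "nops_pdf", name = "demo")

Each of the five generated PDFs then contains one randomly selected exercise from each sub-list.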
I suggest you get started with these, keeping it simple in the beginning. That is, instead of trying to accomplish all the tasks at once, take them one by one.

NLP - Combining Words of same meaning into One

I am quite new to NLP. My question is: can I combine words with the same meaning into one using NLP? For example, consider the following rows:
1. It’s too noisy here
2. Come on people whats up with all the chatter
3. Why are people shouting like crazy
4. Shut up people, why are you making so much noise
As one can see, the common aspect here is that people are complaining about the noise.
noisy, chatter, shouting, noise -> Noise
Is it possible to group these words under a common entity using NLP? I am using R to come up with a solution to this problem.
I have used a sample Twitter data set, and my expected output is a table that contains:
Noise
It’s too noisy here
Come on people whats up with all the chatter
Why are people shouting like crazy
Shut up people, why are you making so much noise
I did search the web for references before posting here. Any suggestions or valuable input would be of much help.
Thanks
The problem you mention is better known as paraphrasing, and it is not completely solved. If you want a fast solution, you can start by replacing synonyms; WordNet can help with that.
Another idea is to calculate sentence similarity: get a vector representation of each sentence and use cosine distance to measure how similar the sentences are to each other.
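A minimal base-R sketch of the vector idea, reusing the four example sentences from the question: build bag-of-words vectors and compare them with cosine similarity. Note that plain lexical overlap will not connect synonyms such as noisy/chatter, which is why synonym replacement (e.g. via WordNet) or word embeddings usually come first.

sents <- c("It's too noisy here",
           "Come on people whats up with all the chatter",
           "Why are people shouting like crazy",
           "Shut up people, why are you making so much noise")
# Lowercase, strip punctuation, split into words.
toks  <- lapply(tolower(sents), function(s) strsplit(gsub("[[:punct:]]", "", s), "\\s+")[[1]])
vocab <- unique(unlist(toks))
# Term-frequency matrix: one row per sentence, one column per word.
tf <- t(sapply(toks, function(w) table(factor(w, levels = vocab))))
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
# Pairwise sentence-similarity matrix.
outer(1:4, 1:4, Vectorize(function(i, j) cosine(tf[i, ], tf[j, ])))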
I think this paper could provide a good introduction for your problem.

Best algorithm suited for finding most similar town

I have a dataset full of towns and their geographical data, and some input data. This input data is almost always a town as well. But because the towns are scraped off the internet, they can be slightly misspelled or spelled differently, e.g. Saint Petersburg <-> St. Petersburg.
During my research I came across a couple of algorithms and tried out two. First I tried the Sørensen–Dice coefficient. This gave me some promising results, until I tried to match short strings against longer ones. The algorithm works really well when all strings are roughly the same length, but when they differ a lot in length you get mixed results, e.g. when matching Saint against the set, it gives Sail as the best match, while I want Saint-Petersburg.
The second algorithm I tried was the Levenshtein distance, but for the same reasons it didn't fare well.
I came across some other algorithms, such as cosine similarity and longest common subsequence, but those seem a bit more complicated and I would like to keep the computational cost down.
Are there any algorithms that prioritize length of match over the percentage matched?
Anyone have any experience with matching oddly spelled town names? Please let me know!
EDIT:
I thought this SO question was a possible duplicate, but it turns out it describes Sørensen–Dice.
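For what it's worth, a minimal sketch assuming the stringdist package (not mentioned in the question): the Jaro-Winkler metric rewards shared prefixes and tends to degrade more gracefully than Dice or plain Levenshtein when string lengths differ, so it is worth trying on examples like the one above.

library(stringdist)

towns <- c("Saint-Petersburg", "Sail", "Salzburg", "Santander")  # toy candidate set
match_town <- function(query, candidates) {
  # stringsim() maps distances to similarities in [0, 1];
  # method "jw" is Jaro-Winkler, p is the prefix-bonus weight.
  sims <- stringsim(tolower(query), tolower(candidates), method = "jw", p = 0.1)
  candidates[which.max(sims)]   # closest candidate wins
}
match_town("Saint", towns)
match_town("St. Petersburg", towns)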

How can I efficiently best fit large data with large numbers of variables

I have a data set with 10 million rows and 1,000 variables, and I want to find the best fit of those variables so I can estimate a new row's value. I am using Jama's QR decomposition to do it (better suggestions are welcome, but I think this question applies to any implementation). Unfortunately, that takes too long.
It appears I have two choices. Either I can split the data into, say, 1,000 chunks of 10,000 rows each and then average the results, or I can add up every, say, 100 rows and feed those combined rows into the QR decomposition.
One or both ways may be mathematical disasters, and I'm hoping someone can point me in the right direction.
For such big datasets I'd have to say you need to use HDF5, the Hierarchical Data Format v5. It has C/C++ implementation APIs and bindings for other languages. HDF uses B-trees to index its datasets.
HDF5 is supported by Java, MATLAB, Scilab, Octave, Mathematica, IDL, Python, R, and Julia.
Unfortunately I don't know more than this about it. However, I'd suggest you begin your research with a simple exploratory internet search!
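On the chunking idea in the question itself, here is a minimal R sketch (read_chunk() is a hypothetical reader; the question uses Jama in Java): rather than averaging per-chunk fits, accumulate the sufficient statistics X'X and X'y across chunks and solve the normal equations once. This yields exactly the coefficients of a single fit over all rows, although the normal equations are less numerically robust than QR, so treat this as a sketch of the accumulation idea rather than a drop-in replacement.

p   <- 1000
XtX <- matrix(0, p, p)                 # running sum of X'X
Xty <- numeric(p)                      # running sum of X'y
for (i in 1:1000) {
  chunk <- read_chunk(i)               # hypothetical: one block of 10,000 rows
  X <- as.matrix(chunk[, 1:p])
  y <- chunk$y
  XtX <- XtX + crossprod(X)            # add this block's X'X
  Xty <- Xty + drop(crossprod(X, y))   # add this block's X'y
}
beta <- solve(XtX, Xty)                # least-squares coefficients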

Genetic Algorithms Introduction

Starting off, let me clarify that I have seen this Genetic Algorithm Resource question and it does not answer my question.
I am doing a project in bioinformatics. I have to take data about the NMR spectrum of a cell (E. coli) and find out which molecules (metabolites) are present in the cell.
To do this I am going to be using genetic algorithms in the R language. I DO NOT have the time to go through huge books on genetic algorithms. Heck, I don't even have time to go through little books. (That is what the linked question does not answer.)
So I need to know of resources that will help me quickly understand what genetic algorithms do and how they do it. I have read the Wikipedia entry, this webpage, and a couple of IEEE papers on the subject.
Any working code in R (or even C), or pointers to which R modules (if any) to use, would be helpful.
A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R The "documentation" is in 'S Poetry' http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf and the code.
I assume from your question that you have some function F(metabolites) which yields a spectrum, but you do not have the inverse function F'(spectrum) to get the metabolites back. The search space of metabolites is large, so rather than brute-forcing it you want to try an approximate method (such as a genetic algorithm) that performs a more efficient random search.
In order to apply any such approximate method, you will have to define a score function that compares the similarity between the target spectrum and a trial spectrum. The smoother this function is, the better the search will work. If it can only yield true/false, the search will be purely random and you'd be better off with brute force.
Given F and your score (aka fitness) function, all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectra, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain-specific, because you can speed the process up greatly by avoiding the creation of nonsense genomes. The best mutation rate is going to be very small but will also require tuning for your domain.
Without knowing about your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genes (for example it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense rather than merely assigning it low fitness.
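To make that concrete, here is a minimal base-R sketch of the boolean-string representation just described; simulate_spectrum() (playing the role of F) and target are hypothetical stand-ins for your domain:

n_metab  <- 50                 # size of the candidate metabolite pool
pop_size <- 40
mut_rate <- 1 / n_metab        # small mutation rate, as noted above

# Fitness: negated squared error between the trial and target spectrum.
score <- function(genome) -sum((simulate_spectrum(genome) - target)^2)

# Random initial population: one genome per row (TRUE = metabolite present).
pop <- matrix(runif(pop_size * n_metab) < 0.1, pop_size, n_metab)
for (gen in 1:100) {
  fit  <- apply(pop, 1, score)
  pick <- function() {         # tournament selection: fitter of two random rows
    i <- sample(pop_size, 2)
    pop[i[which.max(fit[i])], ]
  }
  pop <- t(replicate(pop_size, {
    p1 <- pick(); p2 <- pick()
    cut   <- sample(n_metab - 1, 1)              # one-point crossover
    child <- c(p1[1:cut], p2[(cut + 1):n_metab])
    xor(child, runif(n_metab) < mut_rate)        # bit-flip mutation
  }))
}
best <- pop[which.max(apply(pop, 1, score)), ]   # best genome found

Real implementations usually also add elitism (carrying the best genome over unchanged each generation) so that the top score never regresses.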
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
I've found the article "The science of computing: genetic algorithms" by Peter J. Denning (American Scientist, vol. 80, no. 1, pp. 12-14) helpful. It is simple and useful if you want to understand what genetic algorithms do, and it is only three pages long!
