I have a large corpus of text, in italian, to analyze using the R-language.
Almost all the preprocessing method is easily writable to adapt to my native language, with a couple of default libraries.
Problem is I can't find a way to implement a stem-completion function or method, because the default dictionaries aren't in italian. Also (mea culpa) I can't find a custom italian stem-completion dictionary.
Does anyone know how to solve this problem in any way?
Thanks a lot in advance! Have a nice day!
Related
Is there any way in R to write a macro like one would in SAS? That is, I want to write a macro with some input variable (corresponding to a row in a dataset) so I can quickly make a plot of certain characteristics from said row. Any information regarding a package/method to do so would be greatly appreciated.
R will generate some very, very, very basic code for you. If you have RStudio installed, you can click File > Import Dataset > From ... point to your file and click 'Open'. R will automatically create the code to do the import. Again, this is very basic. You really need to know how to code to do anything useful.
You get out of it what you put into it, so spend some time learning this stuff, and inevitably you'll learn a ton. I've found that it's very helpful to read through people's questions that are posted here, and try to solve the problem yourself. You'll learn a lot that way and you'll see what the current trends are. Reading books is great, of course, but sometimes I feel like some authors are too academic, and in the real world, sometimes it's done differently than what you see in textbooks.
I'm trying to analyze the texts in Italian in R.
As you do in a textual analysis I have eliminated all the punctuation, special characters and Italian stopwords.
But I have got a problem with Stemming: there is only one Italian stemmer (Snowball), but it is not very precise.
To do the stemming I used the tm library and in particular the stemDocument function and I also tried to use the SnowballC library and both lead to the same result.
stemDocument(content(myCorpus[[1]]),language = "italian")
The problem is that the resulting stemming is not very precise. Are there other more precise Italian stemmers?
or is there a way to implement the stemming, already present in the TM library, by adding new terms?
Another alternative you can check out is the package from this person, he has it for many different languages. Here is the link for Italian.
Whether it will help your case or not is another debate but it can also be implemented via the corpus package. A sample example (for English use case, tweak it for Italian) is also given in their documentation if you move down to the Dictionary Stemmer section
Alternatively, similar to the above way, you can also consider the stemmers or lemmatizers (if you havent considered lemmatizers, they are worth considering) from Python libraries such as NLTK or Spacy and check if you are getting better resutls. After all, they are just files containing mappings of root word vs child words. Download them, fine tune the file to your requirement, and use the mappings as per your convenience by passing it via a custom made function.
Is there a ready template to run evolutionary\genetic algorithms in R?
I am interested in a code that would allow me to add graphical output and user input between iterations.
Thanks for reading!
p.s. found this posting Is there any Genetic Programming code written R
I don't know about genetic programming code written in R, but there is a program called HeuristicLab you can use.
There you can add an External Evaluator in R code and there you can add your graphical output.
Here is a link on how to do it:
http://dev.heuristiclab.com/trac/hl/core/wiki/UsersHowtosOptimizingExternalApplications
Its an open source program and the staff that wrote it usually answers any question you have very quickly.
Here is the download page: http://dev.heuristiclab.com/trac/hl/core/wiki
good luck!
Yep, i think http://rsymbolic.org/projects/show/rgp is exactly what you're looking for.
Basically I have created two MATLAB functions which involve some basic signal processing and I need to describe how these functions work in a written report. It specifically requires me to describe the algorithms using mathematical notation.
Maths really isn't my strong point at all, in fact I'm quite surprised I've even been able to develop the functions in the first place. I'm quite worried about the situation at the moment, it's the last section of writing I need to complete but it is crucially important.
What I want to know is whether I'm going to have to grab a book and teach myself mathematical notation in a very short space of time or is there possibly an easier/quicker way to learn? (Yes I know reading a book should be simple enough, but maths + short time frame = major headache + stress)
I've searched through some threads on here already but I really don't know where to start!
Although your question is rather vague, and I have no idea what sorts of algorithms you have coded that you are trying to describe in equation form, here are a few pointers that may help:
Check the MATLAB documentation: If you are using built-in MATLAB functions, they will sometimes give an equation in the documentation that describes what they are doing internally. Some examples are the functions CONV, CORRCOEF, and FFT. If the function is rather complicated, it may not have an equation but instead have links to some papers describing the algorithm, which may themselves have equations for the algorithm. An example is the function HILBERT (which you can also find equations for on Wikipedia).
Find some lists of common mathematical symbols: Some standard symbols used to represent common mathematical operations can be found here.
Look at some sample pseudocode to see how it's done: For algorithms you yourself have coded up, you'll have to write them out in equation or pseudocode form. A paper that I've used often in my work is Templates for the Solution of Linear Systems, and it has some examples of pseudocode that may be helpful to you. I would suggest first looking at the list of symbols used in that paper (on page iv) to see some typical notations used to represent various mathematical operations. You can then look at some of the examples of pseudocode throughout the rest of the document, such as in the box on page 8.
I suggest that you learn a little bit of LaTeX and investigate Matlab's publish feature. You only need to learn enough LaTeX to write mathematical expressions. Then you have to write Matlab comments in your source file in LaTeX, but only for the bits you want to look like high-quality maths. Finally, open the Matlab editor on your .m file, and select File | Publish.
See Very Quick Intro to LaTeX and check your Matlab documentation for publish.
In addition to the answers already here, I would strongly advise using words in addition to forumlae in your report to describe the maths that you are presenting.
If I were marking a student's report and they explained the concepts of what they were doing correctly, but had poor or incorrect mathematical notation to back it up: this would lose them some marks, but would hopefully not impede my understanding of the hard work they've put in.
If they had poor/wrong maths, with no explanation of what they meant to say, this could jeapordise my understanding of their entire project and cost them a passing grade.
The reason you haven't found any useful threads is because most of the time, people are trying to turn maths into algorithms, not vice versa!
Starting from an arbitrary algorithm, sometimes pseudo-code, along with suitable comments, is the clearest (and possibly only) representation.
I've been looking for some time for a word processor to use for writing technical papers and I haven't really found one. What would really be nice to have is an editor that can handle mathematical expressions, code, and pseudo-code fairly well. I have yet to find one that works very well.
Does anyone have any recommendations?
I personally believe in LaTeX.
Benefits:
You can focus on content over form.
Use logical rather than semantic formatting (e.g., \methodname vs. just italic).
Easier to assemble large documents from multiple files.
Use text-based version control (CVS/SVN/etc.)
Widely used
Much more stable even on super-weak machines
Programmable. For example, I use macros to hide stuff, highlight stuff, obfuscate names by using a macro name with the real name but an obfuscated replacements.
See all the tips and tricks available on SO.
Output looks the same no matter which platform you compile on. Never had that luck with word, each version and each machine produces something slightly different.
My answer's long, so I want to say up front: I think you want OpenOffice Writer (I use v2.4, haven't tried 3.0 yet).
I've used Word with equation editor and LaTeX heavily in the past and OpenOffice Writer
more recently. I used the former two while writing my thesis.
LaTeX may still have advantages in quality of the output and in the ability to use text-based version control, but they're sharply diminished by OO Writer at this point.
Microsoft with equation editor, even the most recent versions, seems very weak still.
What I like about OpenOffice is that you can use the equation formatting mechanisms
in a mode where the window is split between the document you're writing and another
area where you can type very LaTeX-like formatting instructions. One of the big
strengths of LaTeX is that you get to type up something like $x \in S$ for "x is an element of S". OO Writer lets you do this and see the result.
Back when I wrote my thesis, LaTeX was preferable to Word with Eqn. Editor because of the length of my document (over 200 pages), the quality of the results, and the ease of specifying equations. LaTeX does have a disadvantage in simplicity of use that is made more acute by OO Writer.
That said, I'm sure I'd use OO Writer for conference to journal length articles (~8-15 pages v. ~15-40 pages) and also for shorter work. For thesis-length work, I'm not sure which I'd end up using: Word never worked so well for me on longer matter; I suspect OO Writer is better behaved but I don't have enough experience of it to make a firm judgement.
I like LyX (http://www.lyx.org/) -- it's a good tradeoff between "spending all your time writing your document" and "spending all your time writing markup". The most recent versions are even useable!
Apart from that, Word 2008 is actually pretty darn good, provided you use the styles and other "advanced" features.
I fully agree that LATEX is a good choice. I've used for paper in univ, including my master thesis. For LATEX I've been using Kile.
But nowadays there is interesting alternative which is DocBook with MathML extension.
LaTeX with TexMaker got me through grad school.
Depends on what you mean by "Word Processor". If you don't mind not having a WYSIWYG interface, I'd recommend LaTeX (http://www.latex-project.org/).
I wrote my final year Master's dissertation using it, which contained a lot of pseudocode, formulas, etc. Also outputs in a format fairly typical of technical papers.
I use FrameMaker.
MS Word with Mathtype. It has a number of advantages over the default Equation editor, including, but not limited to:
keyboard shortcuts
writing equations in tex mode then converting them
converting equations from "normal" to "linear" mode (the one you can use in your programs, you know a=b/c and such)
templates
no more latex. I can concentrate on the material, not the writing
Word with MS Equation for the mathematical sections.
I like DocBook and use FOP to create PDFs from it.
I use reStructuredText because it can be used in Trac, converted to PDF and HTML, have little markup overhead, and looks nice in its plain form too.
Microsoft Word is considered as the market standard word editor.
My suggestion is for you to use Authorea.
As a former postdoc (Astrophysics) and Ph.D. (Informatics) with 12+ years research experience (Harvard, CERN, UCLA), I have written technical papers for a long time. I have loved and hated LaTeX. For the past 2 years, I have worked with friends and colleagues at developing the next generation platform for writing technical/research documents collaboratively. It is called Authorea. From a technical standpoint Authorea is built on Git and takes LaTeX, Markdown, HTML (even JS, to include fancy d3.js in your papers). Bonus: you don't need to know LaTeX (or any other format) but you can easily add equations, tables, citations, and data to your papers. I hope you'll find it useful.