I'm trying to analyze the texts in Italian in R.
As you do in a textual analysis I have eliminated all the punctuation, special characters and Italian stopwords.
But I have got a problem with Stemming: there is only one Italian stemmer (Snowball), but it is not very precise.
To do the stemming I used the tm library and in particular the stemDocument function and I also tried to use the SnowballC library and both lead to the same result.
stemDocument(content(myCorpus[[1]]),language = "italian")
The problem is that the resulting stemming is not very precise. Are there other more precise Italian stemmers?
or is there a way to implement the stemming, already present in the TM library, by adding new terms?
Another alternative you can check out is the package from this person, he has it for many different languages. Here is the link for Italian.
Whether it will help your case or not is another debate but it can also be implemented via the corpus package. A sample example (for English use case, tweak it for Italian) is also given in their documentation if you move down to the Dictionary Stemmer section
Alternatively, similar to the above way, you can also consider the stemmers or lemmatizers (if you havent considered lemmatizers, they are worth considering) from Python libraries such as NLTK or Spacy and check if you are getting better resutls. After all, they are just files containing mappings of root word vs child words. Download them, fine tune the file to your requirement, and use the mappings as per your convenience by passing it via a custom made function.
Related
I hope those text mining gurus, that are also Non-Koreans can help me with my very specific question.
I'm currently trying to create a Document Term Matrxi (DTM) on a free text variable that contains mixed English words and Korean words.
First of all, I have used cld3::detect_language function to remove those obs with non-Koreans from the data.
Second of all, I have used KoNLP package to extract nouns only from the filtered data (Korean text only)
Third of all, I know that by using tm package, I can create DTM rather easily.
The issue is that when I use tm pakcage to create DTM, it doesn't allow only nouns to be recognized. This is not an issue if you're dealing with English words, but Korean words is a different story. For example, if I use KoNLP to extract nouns only, I can extract "훌륭" from "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc.. and tm package doesn't recognize this as treats all these terms separately, when creating a DTM.
Is there any way I can create a DTM based on nouns that were extracted from KoNLP package?
I've noticed that if you're non-Korean, you may have a difficulty understanding my question. I'm hoping someone can give me a direction here.
Much appreciated in advance.
My version of R is 3.4.1 platform x86_64-w64_mingw32/x64
I am using R to find the most popular words in a document.
I would like to stem the words and then Complete them. This means I need to use the SAME dictionary for both the stemming and the completion. I am confused by the TM package I am using.
Q1) The stemDocument function seems to work fine without a dictionary defined explicitly. However I would like to define one or at least get hold of the one it uses if it is inbuilt into R. Can I download it anywhere? Apparently I cannot do this
dfCorpus <- tm_map(dfCorpus, stemDocument, language = "english")
Q2) I would like to use the SAME dictionary to complete the words and if they aren't in the dictionary keep the original. Can't do this so just need to know what format the dictionary should be in to work because it currently just giving me NA for all the answers. It is two columns stem and word. This is just an example I found online.
dict.data = fread("Z:/Learning/lemmatization-en.txt")
I'm expecting the code to be something like
dfCorpus <- stemCompletion_modified(dfCorpus, dictionary="dict.data", type="prevalent")`
thanks
Edit. I see that I am trying to solve my problem with a hammer. Just because documentation says to do it one way I was trying to get it to work. So now what I need is just a lookup between all English words and their base not stem. I know I'm not allowed to ask that here but I'm sure I will find it. Have a good weekend.
I am developing a package in R and when I run devtools::check() I am getting the following note.
checking DESCRIPTION meta-information ... NOTE
Malformed Description field: should contain one or more complete sentences.
I am not using my name of the package or the word package in the description. I am also using complete sentence for the description yet I am getting this NOTE repeatedly. So I am wondering what does a complete sentence mean in this case.
Try adding periods to the ends of the sentences, that is, turn your existing Description field into Functions to analyze methylation data can be found here. Highlight of this workflow is the comprehensive quality control report.
I've always found it a bit curious that the Description field shouldn't contain the word "package" or the name of the package ...and also requires complete sentences (which require both a subject and a verb, leading to very grammatically awkward constructs if you want your introductory sentence to have a subject without using the no-no words.) I'm pretty sure that grammatically speaking, very few packages have truly "complete" sentences in their Description fields.
I'm pretty sure that it just checks for capital letters and periods, and I'd avoid any special characters, just to be on the safe side.
The Description field of my package on CRAN is Reads river network shape files and computes network distances. Also included are a variety of computation and graphical tools designed for fisheries telemetry research, such as minimum home range, kernel density estimation, and clustering analysis using empirical k-functions with a bootstrap envelope. Tools are also provided for editing the river networks, meaning there is no reliance on external software. Definitely not the greatest, but apparently it worked and CRAN liked it!
Adding a full stop symbol/period/dot (.) at the end of the description details will remove this warning.
I'm in the process of documenting some of my functions for an R package I'm making.
I'm using roxygen markup, though that is largely irrelevant to my question.
I have put equations into my documentation using \deqn{...}. My question is:
Is there a way to cross-reference this equation later on?
For example, in my Rd file:
\deqn{\label{test}
y = mx + b
}
Can I later do something like:
Referring to equation \ref{test}, ...
I've tried \eqref{test}, \ref{test} (which both get "unknown macro" and don't get linked ), and also \link{test} (which complains it can't find function test because it's really just for linking to other functions).
Otherwise I fear I may have to do something hacky and add in the -- (1) and Refer to equation (1) manually within the \deqn etc in the Rd file...
Update
General answer appears to be "no". (awww...)
However, I can write a vignette and use "normal" latex/packages there. In any case, I've just noticed that the matrix equations I spent ages putting into my roxygen/Rd file look awful in the ?myFunction version of the help (they show up as just-about literal latex source). Which is a shame, because they look beautiful in the pdf version of the help.
#Iterator has pointed out the existence of conditional text, so I'll do ASCII maths in the .Rd files, but Latex maths in the pdf manual/vignette.
I'm compiling my comments above into an answer, for the benefit of others.
First, I do not actually know whether or not .Rd supports tagging of equations. However, the .Rd format is such a strict subset of LaTeX, and produces very primitive text output, that shoehorning extensive equations into its format could be a painful undertaking without much benefit to the user.
The alternative is to use package vignettes, or even externally hosted documentation (as is done by Hadley Wickham, for some of his packages). This will allow you to use PDFs or other documentation, to your heart's content. In this way, you can include screenshots, plots, all of the funkiest LaTeX extensions that only you have, and, most significantly, the AMS extensions that we all know and love.
Nonetheless, one can specify different rendering of a given section of documentation (in .Rd) based on the interface, such as text for the console, nice characters for HTML, etc., and conditional text supports that kind of format variation.
It's a good question. I don't know the answer regarding feasibility, but I had similar questions about documenting functions and equations together, and this investigation into what's feasible with .Rd files has convinced me to use PDF vignettes rather than .Rd files.
I have a problem with LaTeX – I don't know how to put mathematical equations and symbols in listings. I use the listings package and it's offers great looking listings, but it doesn't allow math symbols in $ ... $. Another package, algorithms, allows math, but listings doesn't look as good as with listings (the problem is that algorithms demands to get new line after every if, then, etc.)
You can use the option mathescape for your environment which gives you the ability to use the normal latex behavior of the $-signs. Try:
\begin{lstlisting}[mathescape]
...
\end{lstlisting}
For more info, take a look into the listings package manual.