Text Mining in R - openNLP and tm packages

I have been trying to extract 'words that leaders use to describe themselves', using LinkedIn summaries as the data set.
1) I have cleaned the data using the 'tm' package in R
2) I extracted adjectives making use of 'POS Tagging' in the 'openNLP' package.
My first problem: it extracts all adjectives, but I only need adjectives such as loyal, innovative, and passionate (adjectives of quality).
My second problem: is there a way to make the program understand what it is reading?
E.g. the word 'mobile' gets tagged as an adjective, whereas it is usually a noun, as in 'mobile application', etc.
I am coding using R. Please help!
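A minimal sketch of that pipeline (the example sentence is made up; JJ, JJR, and JJS are the Penn Treebank adjective tags) might look like this:
library(NLP)
library(openNLP)
text <- as.String("I am a passionate and innovative leader building mobile applications.")
# annotate sentences and words, then POS-tag each token
sent_ann <- Maxent_Sent_Token_Annotator()
word_ann <- Maxent_Word_Token_Annotator()
pos_ann <- Maxent_POS_Tag_Annotator()
a <- annotate(text, list(sent_ann, word_ann))
a <- annotate(text, pos_ann, a)
# keep only word tokens tagged as adjectives
words <- a[a$type == "word"]
tags <- sapply(words$features, `[[`, "POS")
adjectives <- text[words][tags %in% c("JJ", "JJR", "JJS")]
adjectives
For the first problem, the tagger on its own cannot tell 'quality' adjectives apart from the rest; a common workaround is to intersect the extracted adjectives with a curated list of trait words. For the second problem, statistical taggers use sentence context, so feeding them full sentences rather than cleaned, stopword-stripped text usually reduces mis-tags.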

Related

R - how to create DocumentTermMatrix for Korean words

I hope that text mining gurus, including those who are not Korean, can help me with my very specific question.
I'm currently trying to create a Document Term Matrix (DTM) from a free-text variable that contains a mix of English and Korean words.
First of all, I used the cld3::detect_language function to remove observations with non-Korean text from the data.
Second of all, I used the KoNLP package to extract only nouns from the filtered data (Korean text only).
Third of all, I know that with the tm package I can create a DTM rather easily.
The issue is that when I use the tm package to create the DTM, it doesn't allow only the nouns to be recognized. This is not an issue if you're dealing with English words, but Korean is a different story. For example, with KoNLP I can extract the noun "훌륭" from "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc., but the tm package doesn't recognize this and treats all these forms as separate terms when creating the DTM.
Is there any way I can create a DTM based on nouns that were extracted from KoNLP package?
I realize that if you're not Korean, you may have difficulty understanding my question. I'm hoping someone can give me a direction here.
Much appreciated in advance.
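One direction that may help: since KoNLP already gives you the nouns, you can rebuild each document from its extracted nouns and hand that to tm, so the DTM only ever sees the noun forms. A minimal sketch (docs is a hypothetical character vector holding your Korean-only texts):
library(KoNLP)
library(tm)
useNIADic()  # load a Korean dictionary for noun extraction
# collapse each document to its extracted nouns only
noun_docs <- sapply(docs, function(x) paste(extractNoun(x), collapse = " "))
corpus <- VCorpus(VectorSource(noun_docs))
# wordLengths = c(1, Inf) keeps short Korean nouns that tm would otherwise drop
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
This way "훌륭히", "훌륭한", "훌륭하게", etc. all enter the DTM as the single extracted form "훌륭".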

Italian Stemmer alternative to Snowball

I'm trying to analyze Italian texts in R.
As is standard in textual analysis, I have removed all punctuation, special characters, and Italian stopwords.
But I have a problem with stemming: there is only one Italian stemmer (Snowball), and it is not very precise.
To do the stemming I used the tm library, in particular the stemDocument function; I also tried the SnowballC library, and both lead to the same result.
stemDocument(content(myCorpus[[1]]),language = "italian")
The problem is that the resulting stemming is not very precise. Are there other more precise Italian stemmers?
Or is there a way to extend the stemming already present in the tm library by adding new terms?
Another alternative you can check out is the package from this person; he has it for many different languages. Here is the link for Italian.
Whether it will help your case or not is another debate, but it can also be implemented via the corpus package. A sample example (for an English use case; tweak it for Italian) is given in their documentation if you move down to the Dictionary Stemmer section.
Alternatively, similar to the above, you can also consider the stemmers or lemmatizers (if you haven't considered lemmatizers, they are worth considering) from Python libraries such as NLTK or spaCy and check whether you get better results. After all, they are just files containing mappings of root words to child words. Download them, fine-tune the file to your requirements, and use the mappings as you see fit by passing them through a custom function.
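For the corpus package route, a minimal sketch of a dictionary-based stemmer (assuming a hypothetical two-column file it_lemmas.txt that maps inflected Italian forms to their roots; the column names term and stem are made up) might look like this:
library(corpus)
# read the hypothetical term -> stem mapping
dict <- read.delim("it_lemmas.txt", header = TRUE, stringsAsFactors = FALSE)
# build a stemmer from the mapping and use it when tokenizing
my_stemmer <- new_stemmer(dict$term, dict$stem)
text_tokens("analizzare i testi in italiano", stemmer = my_stemmer)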

Tokenize Text and Analyze with Dictionary in Quanteda

I am trying to do a text analysis using the quanteda package in R and have been successful in getting the desired output without doing anything to my texts. However, I am interested in removing stopwords and other common phrases and rerunning the analysis (from what I am learning in other sources, this process is called "tokenizing"(?)). (The instructions are from https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda/)
I was able to process the text using those instructions and the quanteda package. However, I am now interested in applying a dictionary for analyzing the text. How can I do that? Since it is hard to attach all my documents here, any hints or examples that I can apply would be helpful and greatly appreciated.
Thank you!
I have used this library with great success and then merged by word to get the score or sentiment; merge by word:
library(tidytext)
get_sentiments("afinn")
get_sentiments("bing")
You can save it as a table:
table <- get_sentiments("afinn")
total <- merge(dataFrameA, dataFrameB, by = "ID")
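If you want to stay within quanteda itself, a minimal sketch (the category words and the my_texts vector are made up; replace them with your own dictionary and documents) might be:
library(quanteda)
# a small made-up dictionary with two categories
dict <- dictionary(list(positive = c("good", "great", "excellent"),
                        negative = c("bad", "poor", "terrible")))
# tokenize, then drop punctuation and stopwords
toks <- tokens(my_texts, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
# count how often each dictionary category occurs in each document
dfm_lookup(dfm(toks), dictionary = dict)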

R TM Package stemDocument. What dictionary does it use

My version of R is 3.4.1 platform x86_64-w64_mingw32/x64
I am using R to find the most popular words in a document.
I would like to stem the words and then complete them. This means I need to use the SAME dictionary for both the stemming and the completion. I am confused by the tm package I am using.
Q1) The stemDocument function seems to work fine without a dictionary defined explicitly. However, I would like to define one, or at least get hold of the one it uses if it is built into R. Can I download it anywhere? Apparently I cannot. This is what I currently have:
dfCorpus <- tm_map(dfCorpus, stemDocument, language = "english")
Q2) I would like to use the SAME dictionary to complete the words, and if they aren't in the dictionary, keep the original. I can't get this to work, so I just need to know what format the dictionary should be in, because it currently just gives me NA for all the answers. It has two columns, stem and word. This is just an example I found online.
dict.data = fread("Z:/Learning/lemmatization-en.txt")
I'm expecting the code to be something like
dfCorpus <- stemCompletion_modified(dfCorpus, dictionary = dict.data, type = "prevalent")
Thanks.
Edit: I see that I am trying to solve my problem with a hammer. Just because the documentation says to do it one way, I was trying to get it to work that way. What I actually need is just a lookup between all English words and their base form, not their stem. I know I'm not allowed to ask for that here, but I'm sure I will find it. Have a good weekend.
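For what it's worth, stemDocument does not use a lookup dictionary at all: it applies the Snowball (Porter) algorithm via the SnowballC package, so there is nothing to download. stemCompletion, on the other hand, expects its dictionary argument to be a plain character vector of full words (or a Corpus), not a two-column table. A minimal sketch with made-up text:
library(tm)
docs <- c("running runners ran", "innovative innovation")
dfCorpus <- VCorpus(VectorSource(docs))
# the completion dictionary is just a character vector of full words,
# here taken from the original, unstemmed text
completion_dict <- unique(unlist(strsplit(docs, "\\s+")))
dfCorpus <- tm_map(dfCorpus, stemDocument, language = "english")
# map stems back to their most frequent completion; stems with no match
# in the dictionary are the ones that come back as NA
stemCompletion(c("run", "innov"), dictionary = completion_dict, type = "prevalent")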

How to create Newick tree format from raw morphology data in R on Mac OSX

I'm trying to teach myself how to do phylogenetics for historical linguistics in R. I've found a public data set (https://www.cs.rice.edu/~nakhleh/CPHL/IEDATA_112603), and I want to get a Newick format tree from it, so that I can visualize it following these instructions: https://www.r-phylo.org/wiki/HowTo/InputtingTrees. I'm running R 3.4.1 on Mac OS 10.12.6.
Here's what I've done so far. I copied the data and used R and a text editor to transform it into a Nexus data file. Since Nexus (as I understand it) can't distinguish between the individual characters 1 and 2, and the combined character 12, I turned all values in the original data set over 9 into letters of the alphabet, in sequence (a-q). Anyone can download it from here: https://ucla.box.com/s/i4fbeagcw8lombg3xuhczfk3h0y7v54m
The problem is, I can't find any instructions or code or guidance to interpret the raw data as a tree. I've found one Python script (Convert csv to Newick tree), but I don't know Python. Can anyone point me in the direction of the right software/library/tutorial, or otherwise help me figure out what my next step should be?
I finally found a colleague who could help me. I did not need to convert the data to Newick or Nexus format to make a tree from it; I needed to convert it to phyDat format (see the phangorn package for R). What I did was use the as.phyDat() function in the phangorn package to convert the linguistic data into "phylogenetic data." I did this by specifying type = "USER" in the function, which let me define my own levels for the data. There's a more detailed example at cran.r-project.org/web/packages/phangorn/vignettes/…. Then I could create trees from it using the regular phangorn functions.
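For anyone trying to reproduce this, a minimal sketch of that conversion plus a quick tree (the file name ie_matrix.txt and its layout, rows = languages and columns = recoded characters, are assumptions) might be:
library(phangorn)
# read the recoded character matrix
char_matrix <- as.matrix(read.table("ie_matrix.txt", row.names = 1))
# custom state alphabet: digits 0-9 plus the letters a-q used to recode values above 9
states <- c(0:9, letters[1:17])
lang_data <- phyDat(char_matrix, type = "USER", levels = states)
# quick distance-based tree; write.tree() (from ape, loaded with phangorn) writes Newick format
tree <- NJ(dist.hamming(lang_data))
write.tree(tree, file = "ie_tree.nwk")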
Using Phangorn might be a good approach in R (have a look at the "Constructing phylogenetic trees" vignette).
browseVignettes(package = "phangorn")
However, to properly infer the tree, I would advise you to use a "proper" phylogenetic inference software package with more options (phangorn is excellent for exploratory analysis but can be limited).
I suggest you use the BEAST software, which has an entire tutorial dedicated to phylogenetic linguistics (https://www.luke.maurits.id.au/files/research/papers/beastling.pdf). Luke Maurits' tutorial on GitHub is really well explained (https://github.com/lmaurits/BEASTling/blob/master/docs/tutorial.rst).
Also, regarding your problem with ambiguous character states in your NEXUS file (i.e. state 12 for 1 and 2), you can code them in the NEXUS file as (12). For example, this is a valid NEXUS format:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=2 NCHAR=3;
MATRIX
t1 1(12)2
t2 111
;
END;
