R - how to create DocumentTermMatrix for Korean words

I hope the text mining gurus out there, including non-Koreans, can help me with my very specific question.
I'm currently trying to create a Document Term Matrix (DTM) on a free-text variable that contains a mix of English and Korean words.
First, I used the cld3::detect_language function to remove the non-Korean observations from the data.
Second, I used the KoNLP package to extract only the nouns from the filtered (Korean-only) data.
Third, I know that the tm package lets me create a DTM rather easily.
The issue is that when I use the tm package to create the DTM, there is no way to restrict it to nouns. This is not an issue if you're dealing with English words, but Korean is a different story. For example, with KoNLP I can extract the noun "훌륭" from "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc., but the tm package doesn't recognize this and treats all of these terms separately when creating the DTM.
Is there any way I can create a DTM based on the nouns that were extracted with the KoNLP package?
I realize that if you don't read Korean, my question may be hard to follow. I'm hoping someone can give me a direction here.
Much appreciated in advance.
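One direction that may work (a minimal sketch, not a tested solution, assuming KoNLP's extractNoun with a loaded dictionary such as useNIADic()): pass KoNLP's noun extractor to tm as a custom tokenizer, so the DTM is built from the extracted nouns rather than from the raw word forms.
library(KoNLP)   # assumes KoNLP is installed with a Korean dictionary available
library(tm)
useNIADic()      # load a dictionary so noun extraction works
docs <- c("훌륭한 제품입니다", "서비스가 훌륭하게 좋았습니다")   # toy documents
corpus <- VCorpus(VectorSource(docs))
# custom tokenizer: keep only the nouns KoNLP extracts from each document
noun_tokenizer <- function(x) extractNoun(as.character(x))
dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = noun_tokenizer,
                                         wordLengths = c(1, Inf)))
inspect(dtm)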

Related

Italian Stemmer alternative to Snowball

I'm trying to analyze Italian texts in R.
As is usual in textual analysis, I have removed all punctuation, special characters, and Italian stopwords.
But I have a problem with stemming: there is only one Italian stemmer (Snowball), and it is not very precise.
For the stemming I used the tm library, in particular the stemDocument function; I also tried the SnowballC library, and both lead to the same result.
stemDocument(content(myCorpus[[1]]), language = "italian")
The problem is that the resulting stemming is not very precise. Are there other, more precise Italian stemmers?
Or is there a way to extend the stemmer already present in the tm library by adding new terms?
Another alternative you can check out is the package from this person; he has it for many different languages. Here is the link for Italian.
Whether it will help your case or not is another debate, but it can also be implemented via the corpus package. A sample example (for an English use case; tweak it for Italian) is also given in their documentation if you move down to the Dictionary Stemmer section.
Alternatively, in a similar spirit, you can also consider the stemmers or lemmatizers (if you haven't considered lemmatizers, they are worth considering) from Python libraries such as NLTK or spaCy and check whether you get better results. After all, they are just files containing mappings of root words to inflected forms. Download them, fine-tune the file to your requirements, and use the mappings as you see fit by passing them through a custom-made function, as sketched below.
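A minimal sketch of that last idea (the file lemma_it.txt is a hypothetical two-column form/lemma mapping you have downloaded and tuned; the lookup replaces stemDocument in an existing tm pipeline):
library(tm)
# hypothetical mapping file: column 1 = inflected form, column 2 = lemma
lemma_tab <- read.delim("lemma_it.txt", header = FALSE,
                        col.names = c("form", "lemma"),
                        stringsAsFactors = FALSE)
lookup <- setNames(lemma_tab$lemma, lemma_tab$form)
# map every word to its lemma if it is in the table, otherwise keep it as is
lemmatize <- function(text) {
  words <- unlist(strsplit(text, "\\s+"))
  paste(ifelse(words %in% names(lookup), lookup[words], words), collapse = " ")
}
myCorpus <- tm_map(myCorpus, content_transformer(lemmatize))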

Tokenize Text and Analyze with Dictionary in Quanteda

I am trying to do a text analysis using the quanteda package in R and have been successful in getting the desired output without doing anything to my texts. However, I am interested in removing stopwords and other common phrases and rerunning the analysis (from what I am learning in other sources, this process is called "tokenizing"(?)). (The instructions are from https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda/)
I was able to process the text using those instructions and the quanteda package. However, I am now interested in applying a dictionary to analyze the text. How can I do that? Since it is hard to attach all my documents here, any hints or examples that I can apply would be helpful and greatly appreciated.
Thank you!
I have used this library with great success and then merged by word to get the score or sentiment. Merge by word:
library(tidytext)
get_sentiments("afinn")
get_sentiments("bing")
You can save it as a table:
table <- get_sentiments("afinn")
total <- merge(data_frameA, data_frameB, by = "ID")
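For the quanteda route the question actually asks about, a minimal sketch (toy documents and a toy dictionary, not taken from the answer above): tokenize, drop stopwords, then apply a custom dictionary with dfm_lookup.
library(quanteda)
txt <- c(doc1 = "I love this innovative and passionate team",
         doc2 = "The team was loyal but the delivery was slow")   # toy documents
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
# small custom dictionary; in practice load or build your own
my_dict <- dictionary(list(positive = c("love", "innovative", "passionate", "loyal"),
                           negative = c("slow")))
dfm_lookup(dfm(toks), dictionary = my_dict)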

Using GloVe's pretrained glove.6B.50.txt as a basis for word embeddings in R

I'm trying to convert textual data into vectors using GloVe in R. My plan was to average the word vectors of a sentence, but I can't seem to get to the word vectorization stage. I've downloaded the glove.6B.50.txt file and its parent zip file from https://nlp.stanford.edu/projects/glove/, and I have visited text2vec's website and tried running through their example where they load Wikipedia data, but I don't think it's what I'm looking for (or perhaps I am not understanding it). I'm trying to load the pretrained embeddings into a model so that if I have a sentence (say "I love lamp") I can iterate through it and turn each word into a vector that I can then average (turning unknown words into zeros), with a function like vectorize(word). How do I load the pretrained embeddings into a GloVe model as my corpus (and is that even what I need to do to accomplish my goal)?
I eventually figured it out. The embeddings matrix is all I needed. It already has the words in their vocab as rownames, so I use those to determine the vector of each word.
Now I need to figure out how to update those vectors!
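A minimal sketch of that approach (the path assumes the file Stanford ships as glove.6B.50d.txt; adjust it to whatever you saved): parse each line into a word plus a 50-dimensional vector, store the vectors in a matrix with the words as rownames, then average the rows for a sentence, using zeros for unknown words.
lines <- readLines("glove.6B.50d.txt")   # adjust to your local file name/path
parts <- strsplit(lines, " ", fixed = TRUE)
words <- vapply(parts, `[`, character(1), 1)
emb   <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(50)))
rownames(emb) <- words
# average a sentence's word vectors; unknown words contribute zero vectors
vectorize_sentence <- function(sentence, emb) {
  toks <- tolower(unlist(strsplit(sentence, "\\s+")))
  vecs <- t(vapply(toks,
                   function(w) if (w %in% rownames(emb)) emb[w, ] else numeric(ncol(emb)),
                   numeric(ncol(emb))))
  colMeans(vecs)
}
vectorize_sentence("I love lamp", emb)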

R TM Package stemDocument. What dictionary does it use

My version of R is 3.4.1, platform x86_64-w64-mingw32/x64.
I am using R to find the most popular words in a document.
I would like to stem the words and then Complete them. This means I need to use the SAME dictionary for both the stemming and the completion. I am confused by the TM package I am using.
Q1) The stemDocument function seems to work fine without a dictionary defined explicitly. However, I would like to define one, or at least get hold of the one it uses if it is built into R. Can I download it anywhere? Apparently I cannot do this:
dfCorpus <- tm_map(dfCorpus, stemDocument, language = "english")
Q2) I would like to use the SAME dictionary to complete the words, and if they aren't in the dictionary, keep the original. I can't get this to work, so I just need to know what format the dictionary should be in, because it is currently giving me NA for all the answers. Mine has two columns, stem and word. This is just an example I found online:
dict.data = fread("Z:/Learning/lemmatization-en.txt")
I'm expecting the code to be something like
dfCorpus <- stemCompletion_modified(dfCorpus, dictionary = dict.data, type = "prevalent")
Thanks.
Edit: I see that I am trying to solve my problem with a hammer. Just because the documentation says to do it one way, I was trying to force it to work. What I actually need now is just a lookup between all English words and their base form, not their stem. I know I'm not allowed to ask for that here, but I'm sure I will find it. Have a good weekend.
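For what Q2 asks, a minimal sketch (toy words only, assuming tm's stemCompletion and SnowballC's wordStem): the dictionary argument is a character vector (or a corpus) of complete words, not a two-column stem/word table.
library(tm)
library(SnowballC)
dict_words <- c("running", "runs", "runner", "easily", "easier")   # toy dictionary
stems <- wordStem(dict_words, language = "english")                # Snowball stems
# complete each stem back to the most frequent matching word in the dictionary
stemCompletion(stems, dictionary = dict_words, type = "prevalent")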

Text Mining in R - openNLP and tm packages

I have been trying to extract 'words that leaders use to describe themselves', using LinkedIn summaries as the data set.
1) I have cleaned the data using the 'tm' package in R
2) I extracted adjectives making use of 'POS Tagging' in the 'openNLP' package.
My first problem is that:
It extracts all adjectives, but I just need adjectives such as loyal, innovative, passionate (adjectives of quality).
My second problem:
Is there a way to make the program understand what it is reading?
E.g., the word 'mobile' gets tagged as an adjective, whereas it is usually a noun, as in 'mobile application', etc.
I am coding using R. Please help!
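A minimal sketch of the openNLP tagging step described above (toy sentence; requires the openNLPmodels.en model package), filtering the annotations down to the adjective tags JJ, JJR, and JJS:
library(NLP)
library(openNLP)
txt <- as.String("I am a passionate and innovative mobile developer")
sent_ann <- Maxent_Sent_Token_Annotator()
word_ann <- Maxent_Word_Token_Annotator()
pos_ann  <- Maxent_POS_Tag_Annotator()
anns  <- annotate(txt, list(sent_ann, word_ann, pos_ann))
words <- subset(anns, type == "word")
tags  <- sapply(words$features, `[[`, "POS")
# keep only the tokens tagged as adjectives
txt[words[tags %in% c("JJ", "JJR", "JJS")]]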
