Capturing emoticons - Julia

Does anyone know if there is a package in Julia that allows capturing a wide range of emoticons/emojis in text? I am trying to do sentiment analysis on a disaster-related Twitter dataset. There are a lot of emoticons with the prayer symbol, as well as the horse symbol and others (it's related to the Taal disaster, in which a lot of animals died).
I just want to know if there is a way to capture a wide variety of emojis.
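For what it's worth, the core trick does not need a dedicated package in any language: emoji live in a handful of Unicode blocks, so a regular expression over those codepoint ranges will pull them out, and Julia's built-in PCRE2 regexes accept the same kind of \x{...} character class. Below is a minimal sketch of the idea written in R with stringi; the exact ranges are an assumption and only cover the most common emoji blocks.

    # Sketch only: extract emoji from tweet text by matching common emoji
    # Unicode blocks; the ranges are an approximation, not an exhaustive list.
    library(stringi)

    extract_emoji <- function(x) {
      stri_extract_all_regex(
        x,
        paste0(
          "[",
          "\\x{1F300}-\\x{1FAFF}",  # pictographs, emoticons, transport, extended symbols
          "\\x{2600}-\\x{27BF}",    # miscellaneous symbols and dingbats
          "\\x{FE0F}",              # variation selector often attached to emoji
          "]"
        )
      )
    }

    extract_emoji(c("Pray for Taal \U0001F64F", "So many horses \U0001F40E\U0001F40E"))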

Related

R function to return average Google ngram frequency?

I'm trying to look at a series of unusual words taken from OCR and determine which ones merit further investigation to see where/if the OCR screwed up. I've tried a few different approaches like comparing the words to existing dictionaries and so on, but I still have a large number of words to look at manually.
What I would like to do:
Send my list of words to Google's ngrams, starting from a particular year (say 1950), and get the average ngram frequency to reduce my list to the real outliers.
What I have tried:
I thought I could use the ngramr package for this, but it can only send 12 words at a time, and I have a list of thousands of words. I was hoping to save myself some time this way. Perhaps there is some way of dividing my dataset into separate groups of no more than 12 and then sending them all to this package, but it doesn't seem to be intended for that kind of use.
Is there a better/smarter way of doing this? Ideally I would like to avoid having to export the data into Python because I want my code to be fully reproducible by someone who is only familiar with R.
Any help would be hugely appreciated.
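One workaround for the 12-phrase limit is simply to chunk the word list and query ngramr once per chunk, then combine and average. A rough sketch under stated assumptions: the file name is made up, and it assumes ngram() accepts a character vector of phrases plus a year_start argument and returns a data frame with Phrase and Frequency columns (check ?ngram, and expect Google to throttle thousands of requests).

    # Sketch only: work around the 12-phrase limit by querying ngramr in chunks.
    library(ngramr)

    words  <- readLines("unusual_words.txt")            # made-up file of candidate words
    chunks <- split(words, ceiling(seq_along(words) / 12))

    results <- lapply(chunks, function(w) {
      Sys.sleep(1)                                       # be gentle; Google throttles heavy use
      tryCatch(ngram(w, year_start = 1950), error = function(e) NULL)
    })

    freq <- do.call(rbind, results)

    # Average frequency per word since 1950; the lowest averages are the real outliers.
    avg_freq <- aggregate(Frequency ~ Phrase, data = freq, FUN = mean)
    avg_freq[order(avg_freq$Frequency), ]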

R - how to create DocumentTermMatrix for Korean words

I hope the text-mining gurus out there, even the non-Korean ones, can help me with my very specific question.
I'm currently trying to create a Document-Term Matrix (DTM) from a free-text variable that contains a mix of English and Korean words.
First, I used the cld3::detect_language function to remove the non-Korean observations from the data.
Second, I used the KoNLP package to extract only the nouns from the filtered (Korean-only) text.
Third, I know that with the tm package I can create a DTM rather easily.
The issue is that when I use the tm package to create the DTM, it doesn't let me restrict recognition to those nouns. This is not a problem if you're dealing with English words, but Korean is a different story. For example, with KoNLP I can extract "훌륭" from "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc., but the tm package doesn't recognize this and treats all these terms separately when creating a DTM.
Is there any way I can create a DTM based on the nouns that were extracted with the KoNLP package?
I realize that if you're not Korean, you may have difficulty understanding my question. I'm hoping someone can point me in the right direction here.
Much appreciated in advance.
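One way to keep only the KoNLP nouns is to hand tm a custom tokenizer, so the DTM is built from extractNoun() output rather than tm's own tokenization. A minimal sketch, assuming KoNLP is installed with a dictionary loaded; the example sentences are placeholders and the exact nouns returned will depend on the dictionary.

    # Sketch only: build the DTM from KoNLP nouns by giving tm a custom tokenizer,
    # so tm never re-tokenizes the raw Korean text itself.
    library(KoNLP)
    library(tm)

    useNIADic()                                    # or whichever dictionary you already use

    texts <- c("훌륭한 제품입니다", "훌륭하게 작동합니다")   # placeholder Korean-only free text

    corpus <- VCorpus(VectorSource(texts))

    noun_tokenizer <- function(doc) {
      unlist(extractNoun(as.character(doc)))
    }

    dtm <- DocumentTermMatrix(
      corpus,
      control = list(
        tokenize    = noun_tokenizer,
        wordLengths = c(1, Inf)    # tm drops terms shorter than 3 characters by default
      )
    )

    inspect(dtm)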

How to bundle related words with text mining in R

I have data of advertisements posted on a secondhand site to sell used smartphones. Each ad describes the product that is being sold. I want to know which parameters are most often described by sellers. For example: brand, model, colour, memory capacity, ...
By text mining all the text from the advertisements, I would like to bundle similar words together into one category. For example: black, white, red, ... should be linked to each other, as they all describe the colour of the phone.
Can this be done with clustering or categorisation and which text mining algorithms are equipped to do this?
Your best bet is something based on word2vec.
Clustering algorithms will not be able to discover the human-language concept of color reliably. So either you choose some supervised approach, or you need methods that first infer such concepts.
Word2vec is trained on the substitutability of words. Since in a sentence such as "I like the red color" you can substitute red with other colors, word2vec could in theory help find such concepts in an unsupervised way, given lots and lots of data. But I'm sure you can also find counterexamples that break these concepts... Good luck... I doubt you'll manage to do this unsupervised.
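As a concrete illustration of that suggestion, here is a rough sketch that trains embeddings with the word2vec package on the ad texts, clusters the word vectors, and probes the neighbours of a seed word. The file name, parameter values, and number of clusters are assumptions; whether colour terms actually end up in one cluster depends heavily on how much data you have.

    # Sketch only: learn embeddings from the ad texts, cluster the word vectors,
    # and inspect which words land together (hopefully colours, brands, ...).
    library(word2vec)

    ads <- tolower(readLines("ads.txt"))           # made-up file: one advertisement per line

    model <- word2vec(x = ads, type = "skip-gram", dim = 50, iter = 10, min_count = 5)
    emb   <- as.matrix(model)                      # one row of 50 numbers per vocabulary word

    set.seed(1)
    clusters <- kmeans(emb, centers = 30)          # number of clusters is a guess to tune

    # Words grouped by cluster; look for one containing black/white/red/...
    split(rownames(emb), clusters$cluster)[1:3]

    # Or probe directly: nearest neighbours of a seed word
    predict(model, newdata = "black", type = "nearest", top_n = 10)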

How to determine negative trigrams

I am trying to search through a load of medical reports. I want to determine the type of language used to express the absence of a finding, e.g. "There is no blabla" or "Neither blabla nor bleble is seen here". There are many variations.
My original idea was to divide the text into trigrams and then perform sentiment analysis on the trigrams, and then have a look at the negative trigrams and manually select the ones that denote an absence of something. Then I wanted to design some regular expressions around these absence trigrams.
I get the feeling, however, that I am not really looking for sentiment analysis but rather for a negation search. I could, I suppose, just look for all sentences with 'not', 'neither', 'nor' or 'no', but I'm sure I'll fall into some kind of linguistic trap.
Does anyone have a comment on my sentiment approach? And if it is sound, could I have some guidance on sentiment analysis of trigrams (or bigrams, I suppose), since all the R tutorials I have found demonstrate unigram sentiment analysis?
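If you go the negation-search route, it is quick to prototype with tidytext: build trigrams and keep the ones whose first or second word is a negation cue, then review the frequent ones by hand before writing regular expressions. A minimal sketch, using the example sentences from the question and a made-up starter list of cue words.

    # Sketch only: list trigrams that contain a negation cue in first or second
    # position, then review the frequent ones before writing regular expressions.
    library(dplyr)
    library(tidyr)
    library(tidytext)

    texts <- c("There is no blabla.",
               "Neither blabla nor bleble is seen here.")   # stand-ins for the real reports

    negation_cues <- c("no", "not", "without", "neither", "nor", "absent")   # starter list

    reports <- tibble(report_id = seq_along(texts), text = texts)

    negated_trigrams <- reports %>%
      unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
      separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
      filter(word1 %in% negation_cues | word2 %in% negation_cues) %>%
      count(word1, word2, word3, sort = TRUE)

    negated_trigrams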

Workflow for integrating Syntaxnet into the analysis of long(er) documents

I am trying to figure out what improvements in the text analysis of long documents can be obtained by using Syntaxnet, rather than something "dumb" like word counts, sentence length, etc.
The goal would be to get more accurate linguistic measures (such as "tone" or "sophistication"), for quantifying attributes of long(er) documents like newspaper articles or letters/memos.
What I am trying to figure out is what to do with Syntaxnet output once the POS tagging is concluded. What types of things do people use to process Syntaxnet output?
Ideally I am looking for an example workflow that transforms Syntaxnet output into something quantitative that can be used in statistical analysis.
Also, can someone point me to sources that show how the inferences drawn from a "smart" analysis with Syntaxnet compare to those that can be attained by word counts, sentence length, etc.?
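One common way to make the output quantitative, sketched here purely as an illustration: flatten the CoNLL-style parse (one token per row) into per-document features such as POS-tag proportions, mean sentence length, and a crude passive-voice proxy from the dependency labels, and feed those into whatever statistical model you already use for word counts. The file name, column order, and tag/label names below are assumptions to adapt to your export.

    # Sketch only: flatten CoNLL-style Syntaxnet output (one token per row) into
    # per-document features. Column positions assume CoNLL-X order
    # (ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL, ...); adjust to your export,
    # and note that the tag and label names depend on the model you ran.
    library(dplyr)

    conll <- read.delim("document1.conll", header = FALSE, sep = "\t", quote = "",
                        stringsAsFactors = FALSE, blank.lines.skip = TRUE)   # made-up file name
    names(conll)[c(1, 2, 5, 7, 8)] <- c("id", "form", "pos", "head", "deprel")

    features <- conll %>%
      summarise(
        n_tokens      = n(),
        n_sentences   = sum(id == 1),                # token IDs restart at 1 each sentence
        mean_sent_len = n() / sum(id == 1),
        prop_noun     = mean(pos %in% c("NN", "NNS", "NNP", "NNPS")),
        prop_adj      = mean(pos %in% c("JJ", "JJR", "JJS")),
        prop_passive  = mean(deprel %in% c("nsubjpass", "auxpass"))   # crude passive proxy
      )

    features    # repeat over files with lapply() to get one feature row per document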
