Generating random sentences using specific words in R

I need to generate n random sentences in R from an array of keywords or tokens, preferably using dplyr.
Each sentence may have a different number of words.
The array of words is:
(Helsinki, Town, big, pollution, a, much, Not, is, nice)
and the sentences should be such that the grammatical order of the words makes sense, for example:
big Helsinki,
Helsinki is a nice town,
nice Helsinki
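
One possible approach is to define a handful of sentence templates by hand and fill their slots by sampling from the word list. The sketch below uses base R only; the template functions and the adjective/noun groupings are assumptions, not part of the question, and the result can be wrapped in a tibble for further dplyr work.

# The question's word list (the groups below are drawn from it)
words <- c("Helsinki", "Town", "big", "pollution", "a", "much", "Not", "is", "nice")
adjectives <- c("big", "nice")
nouns <- c("Town", "pollution")

# Each template is a function that draws random words into a fixed grammatical frame
templates <- list(
  function() paste(sample(adjectives, 1), "Helsinki"),
  function() paste("Helsinki is a", sample(adjectives, 1), "town"),
  function() paste("Not much", sample(nouns, 1), "is nice")
)

generate_sentences <- function(n) {
  vapply(seq_len(n), function(i) sample(templates, 1)[[1]](), character(1))
}

set.seed(1)
generate_sentences(3)

Each call picks a template at random, so the sentences vary in length while staying grammatical.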

Related

How to combine multiwords in a dfm?

I created a corpus of 233 rows and 3 columns (Date, Title, Article), where the last column, Article, contains text (so I have 233 texts). The final aim is to apply topic models and, to do so, I need to convert my corpus into a dfm. However, I would first like to combine words into bigrams and trigrams to make the analysis more rigorous.
The problem is that when I use textstat_collocation or tokens_compound, I am forced to tokenize the corpus and, in doing so, I lose the structure (233 by 4) that is crucial for applying topic models. In fact, once I apply those functions, I just get one row of bigrams and trigrams, which is useless to me.
So my question is: do you know any other way to look for bigrams and trigrams in a dfm without necessarily tokenizing the corpus?
Or, in other words, what do you usually do to look for multiwords in a dfm?
Thanks a lot for your time!
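
For what it's worth, tokenizing with quanteda does not normally collapse the documents: a tokens object keeps one element per document, so the 233-document structure survives into the dfm. A hedged sketch of one possible pipeline (assuming quanteda v3, where textstat_collocations lives in quanteda.textstats; the data frame and column names follow the question, and the thresholds are arbitrary):

library(quanteda)
library(quanteda.textstats)

# 'df' is assumed to be the 233-row data frame with Date, Title, Article
corp <- corpus(df, text_field = "Article")     # one document per row

toks <- tokens(corp, remove_punct = TRUE)      # still one element per document
colls <- textstat_collocations(toks, size = 2:3, min_count = 5)

# Fuse the detected bigrams/trigrams into single tokens such as "climate_change"
toks <- tokens_compound(toks, pattern = phrase(colls$collocation))

mat <- dfm(toks)
ndoc(mat)                                      # should still be 233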

Grouping towns/villages into cities/communities

I have this problem in R where I have a list of Spanish communities and inside each community there is a list of towns/municipalities.
For example, this is a list of municipalities inside the community of Catalonia.
https://en.wikipedia.org/wiki/Municipalities_of_Catalonia
So, Catalonia is one community, and within it there is a list of towns/cities that I would like to group together by assigning them the new value 'Catalonia'.
I have a list of all the municipalities/towns/cities in my dataset and I would like to group them into communities such as Andalusia, Catalonia, the Basque Country, Madrid, etc.
Firstly, how can I go about grouping these rows into the list of communities?
For example, el prat de llobregat is a municipality within Catalonia, so I would like to assign it to the region of Catalonia. Getafe is a municipality of Madrid, so I would like to assign it the value Madrid. Alicante is a municipality of Valencia, so I would like to assign it the value Valencia. And so on.
That was my first question, and if you are able to help with just that, I would be very thankful.
However, my dataset is not that clean. I did my best to remove Spanish accents and unnecessary code identifiers in the municipality names, but it still contains some small errors. For example, castellbisbal is a municipality of Catalonia, yet some entries have very small spelling mistakes, e.g. one 'l' instead of two: 'castelbisbal'.
These errors are human errors and are very small; is there a way I can work around this?
I was thinking of a vector of all correctly spelt names, and then renaming the incorrectly spelt names based on a percentage of incorrectness. Could this work? For instance, castellbisbal is 13 characters long and has an error of 1 character, i.e. less than an 8% error rate. Can I rename values based on an error rate?
Do you have any suggestions on how I can proceed with the second part?
Any tips/suggestions would be great.
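
For the first (grouping) part, one hedged sketch: keep a lookup table that maps every municipality to its community and join it onto the data. The lookup table, example data, and column names below are assumptions for illustration only.

library(dplyr)

# Hypothetical lookup table; in practice it would hold all municipalities,
# e.g. built from the Wikipedia lists per community
lookup <- data.frame(
  municipality = c("el prat de llobregat", "getafe", "alicante"),
  community    = c("Catalonia", "Madrid", "Valencia")
)

# 'towns' stands in for the real dataset with a 'municipality' column
towns <- data.frame(municipality = c("getafe", "el prat de llobregat", "alicante"))

towns %>% left_join(lookup, by = "municipality")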
As for the spelling errors, have you tried the soundex algorithm? It was meant for that and at least two R packages implement it.
library(stringdist)
phonetic("barradas")
#> [1] "B632"
phonetic("baradas")
#> [1] "B632"
And the soundex codes for the same words are also identical with the phonics package.
library(phonics)
soundex("barradas")
#> [1] "B632"
soundex("baradas")
#> [1] "B632"
All you would have to do is compare the soundex codes rather than the words themselves. Note that soundex was designed for the English language, so it can only handle English-language characters, not accented ones. But you say you are already taking care of those, so it might work with the words you have to process.
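
Combining the two parts, a small sketch that matches the observed (possibly misspelt) names to a canonical list by soundex code and then carries the community over; the canonical list here is a hypothetical stand-in:

library(stringdist)

canonical <- data.frame(
  municipality = c("castellbisbal", "el prat de llobregat", "getafe"),
  community    = c("Catalonia", "Catalonia", "Madrid")
)

observed <- c("castelbisbal", "getafe")   # note the misspelling

# Match on soundex codes instead of the raw strings
idx <- match(phonetic(observed), phonetic(canonical$municipality))
data.frame(observed,
           matched   = canonical$municipality[idx],
           community = canonical$community[idx])

For misspellings that soundex does not catch, stringdist::amatch() with a small maxDist is an alternative along the lines of the error-rate idea in the question.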

How to remove/separate conjoint words from tweets

I am mining Twitter data, and one of the problems I come across while cleaning the text is being unable to remove or separate conjoint words, which usually come from hashtags. After removing special characters and symbols like '#', I am left with strings that make no sense. For instance:
1) Meaningless words: I have words like 'spillwayjfleck', 'bowhunterva', etc., which make no sense and need to be removed from my corpus. Is there any function in R which can do this?
2) Conjoint words: I need a method to separate joined words like 'flashfloodwarn' into 'flash', 'flood', 'warn' in my corpus.
Any help would be appreciated.
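
For point 2 (and, indirectly, point 1), one possible sketch is a greedy dictionary-based splitter: peel the longest known word off the front of the string until nothing is left; strings that cannot be fully segmented are candidates for removal. The tiny dictionary below is purely illustrative; a real one would be a full English word list.

split_conjoint <- function(x, dict) {
  out <- character(0)
  while (nchar(x) > 0) {
    hits <- dict[startsWith(x, dict)]
    if (length(hits) == 0) return(NULL)        # not segmentable -> meaningless word
    w <- hits[which.max(nchar(hits))]          # take the longest matching prefix
    out <- c(out, w)
    x <- substring(x, nchar(w) + 1)
  }
  out
}

dict <- c("flash", "flood", "warn", "storm", "watch")   # illustrative only
split_conjoint("flashfloodwarn", dict)   # "flash" "flood" "warn"
split_conjoint("spillwayjfleck", dict)   # NULL -> drop from the corpus

Greedy splitting can fail on ambiguous strings; a probabilistic segmenter or backtracking would be more robust, but this is usually enough for hashtag-style compounds.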

How to replace English abbreviated forms with their dictionary form

I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for some English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task to carry out: is there some similar work or whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: match the words that do not exist in the dictionary against existing dictionary entries by edit distance (see the sketch after this list).
II) The key feature of many of your examples is that they are only one character away from the correct spelling. So, for those words that you fail to match with a dictionary entry, try adding every English character to the front or back and look the resulting word up in the dictionary. This is very expensive at the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table together with their correct spellings.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus and, for the words that the language model treats as unknown (i.e. it has not seen them during training), check which word the model considers most probable in that position. Most probably the model's top-10 predictions will include the correctly spelled word.
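
A hedged sketch of ideas I and II, in R only because the rest of this page is R-centric; the same logic ports to Java. The lookup table and mini-dictionary are assumptions, and stringdist::amatch does the edit-distance matching:

library(stringdist)

# Hand-maintained lookup for frequent contractions/clippings (idea II)
lookup <- c("'re" = "are", "'s" = "is", "havin" = "having", "sayin" = "saying")

# Stand-in dictionary; a real one would be a full English word list
dictionary <- c("are", "is", "have", "having", "say", "saying")

normalise_token <- function(tok) {
  if (tok %in% names(lookup)) return(lookup[[tok]])
  if (tok %in% dictionary)    return(tok)
  # Idea I: fall back to the closest dictionary word within edit distance 2
  i <- amatch(tok, dictionary, method = "osa", maxDist = 2)
  if (!is.na(i)) dictionary[i] else tok
}

sapply(c("'re", "havin", "sayin", "dog"), normalise_token)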

How can a sentence or a document be converted to a vector?

We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, using perhaps the vectors learnt for the individual words?
1) Skip-gram method: see the paper here and the tool that uses it, Google's word2vec.
2) Using LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents. The Paragraph vector is introduced in this paper. It is basically an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents.
4) Though this paper does not form sentence/paragraph vectors, it is simple enough to do so: one can just plug in the individual word vectors (GloVe word vectors are found to give the best performance) and form a vector representation of the whole sentence/paragraph.
5) Using a CNN to summarize documents.
It all depends on:
which vector model you're using
what is the purpose of the model
your creativity in combining word vectors into a document vector
If you've generated the model using Word2Vec, you can either try:
Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html
Wiki2Vec: https://github.com/idio/wiki2vec
Or you can do what some people do, i.e. sum the vectors of all the content words in the document and normalise the result, e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: lines 17-18 are a hack to reduce noise):
import numpy as np

def sent_vectorizer(sent, model):
    # Sum the word vectors of all words in the sentence...
    sent_vec = np.zeros(400)        # 400 = dimensionality of the word vectors
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:            # skip out-of-vocabulary words
            pass
    # ...and L2-normalise the sum (numw is counted but not used here)
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))
A solution that is slightly less off the shelf, but probably hard to beat in terms of accuracy if you have a specific thing you're trying to do:
Build an RNN (with LSTM or GRU memory cells; a comparison is here) and optimize the error function of the actual task you're trying to accomplish. You feed it your sentence and train it to produce the output you want. The activations of the network after it has been fed your sentence are a representation of the sentence (although you might only care about the network's output).
You can represent the sentence as a sequence of one-hot encoded characters, as a sequence of one-hot encoded words, or as a sequence of word vectors (e.g. GloVe or word2vec). If you use word vectors, you can keep backpropagating into the word vectors, updating their weights, so you also get custom word vectors tweaked specifically for the task you're doing.
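
As a rough illustration of this idea (the answer names no framework, so the choice of the keras R package is an assumption), here is a minimal sketch: an LSTM encoder trained on a placeholder sentiment task, whose hidden activations are then read out as sentence vectors. All layer sizes, data names, and the task head are placeholders.

library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 64) %>%  # word ids -> word vectors
  layer_lstm(units = 128, name = "sentence_encoder") %>%   # final state = sentence vector
  layer_dense(units = 1, activation = "sigmoid")           # placeholder task: sentiment

model %>% compile(optimizer = "adam", loss = "binary_crossentropy", metrics = "accuracy")
# model %>% fit(padded_word_id_sequences, labels, epochs = 10)   # hypothetical training data

# After training, read the encoder's activations as sentence representations
encoder <- keras_model(inputs = model$input,
                       outputs = get_layer(model, "sentence_encoder")$output)
# sent_vecs <- predict(encoder, padded_word_id_sequences)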
There are a lot of ways to answer this question. The answer depends on your interpretation of phrases and sentences.
Distributional models such as word2vec, which provide a vector representation for each word, can only show how a word is usually used in a window-based context in relation to other words. Based on this interpretation of context-word relations, you can take the average vector of all the words in a sentence as the vector representation of the sentence. For example, in this sentence:
vegetarians eat vegetables .
We can take the normalised average of the word vectors as the sentence's vector representation.
The problem is the compositional nature of sentences. If you take the average of the word vectors as above, these two sentences have the same vector representation:
vegetables eat vegetarians .
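
A tiny numeric illustration of that problem, with made-up 3-dimensional "embeddings" (the numbers are arbitrary): averaging ignores word order, so both sentences collapse to the same vector.

emb <- rbind(
  vegetarians = c(0.2, 0.7, 0.1),
  eat         = c(0.5, 0.1, 0.4),
  vegetables  = c(0.3, 0.6, 0.2)
)

sent_vec <- function(words) {
  v <- colMeans(emb[words, , drop = FALSE])
  v / sqrt(sum(v^2))                  # normalise
}

all.equal(sent_vec(c("vegetarians", "eat", "vegetables")),
          sent_vec(c("vegetables", "eat", "vegetarians")))   # TRUE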
There is a lot of research in the distributional tradition on learning tree structures through corpus processing, for example Parsing with Compositional Vector Grammars. This video also explains the method.
Again, I want to emphasise interpretation. These sentence vectors probably have their own meaning in your application. For instance, in the Stanford sentiment-analysis project, the meaning they are after is the positive/negative sentiment of a sentence. Even if you find a perfect vector representation for a sentence, there are philosophical arguments that these are not the actual meanings of sentences if you cannot judge their truth conditions (David Lewis, "General Semantics", 1970). That's why there are lines of work focusing on computer vision (this paper or this paper). My point is that it depends entirely on your application and your interpretation of the vectors.
Hope you welcome an implementation. I faced a similar problem when converting movie plots for analysis; after trying many other solutions I stuck with an implementation that made my job easier. The code snippet is below.
Install spaCy and its English model first.
import spacy

# 'en' was a model shortcut in older spaCy releases; newer versions load a
# named model that ships word vectors, e.g. spacy.load("en_core_web_md")
nlp = spacy.load('en')

doc = nlp(YOUR_DOC_HERE)   # YOUR_DOC_HERE: the text of your document, as a string
vec = doc.vector           # average of the token vectors in the document
Hope this helps.
