I am mining Twitter data, and one of the problems I come across while cleaning the text is being unable to remove or separate conjoined words, which usually come from hashtag data. After removing special characters and symbols like '#', I am left with phrases that make no sense. For instance:
1) Meaningless words: I have words like 'spillwayjfleck', 'bowhunterva', etc., which make no sense and need to be removed from my corpus. Is there a function in R that can do this?
2) Conjoined words: I need a method to separate joined words like 'flashfloodwarn' into
'flash', 'flood', 'warn' in my corpus.
Any help would be appreciated.
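For what it's worth, one rough approach to both problems (a sketch, not a ready-made function) is to check every token against an English word list and greedily segment the ones that fail the check, dropping anything that still cannot be resolved. The word list used here, qdapDictionaries::GradyAugmented, and the example tokens are assumptions for illustration; any character vector of valid English words would work.

# A rough sketch: filter tokens against an English word list and greedily
# segment the leftovers; GradyAugmented is just one convenient word list.
library(qdapDictionaries)
words <- tolower(GradyAugmented)
toks  <- c("spillwayjfleck", "bowhunterva", "flashfloodwarn", "flood")  # example tokens

# Greedy left-to-right segmentation: always take the longest dictionary prefix.
segment <- function(x, dict, min_len = 3) {
  out <- character(0)
  while (nchar(x) >= min_len) {
    hit <- FALSE
    for (n in nchar(x):min_len) {          # try the longest prefix first
      if (substr(x, 1, n) %in% dict) {
        out <- c(out, substr(x, 1, n))
        x   <- substring(x, n + 1)
        hit <- TRUE
        break
      }
    }
    if (!hit) return(NULL)                 # unsegmentable: signal failure
  }
  if (nchar(x) == 0) out else NULL
}

# Keep real words, split conjoined ones, drop the rest (NULLs vanish in unlist()).
clean <- unlist(lapply(toks, function(tok) {
  if (tok %in% words) tok else segment(tok, words)
}))
clean  # e.g. "flash" "flood" "warn" "flood", with the meaningless tokens dropped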
I am working on analyzing some text data from a ticketing system. I am pulling long text fields out of the tickets and need to analyze which words are used and which are used the most, and I need the full list of words.
The data is in an Excel file. Using tm, I have made some edits to it, removing stop words and other words that aren't really important for what I am looking for, and I have already turned it into a corpus.
The following code gives me roughly what I need, but it does not return all of the words. I know the list is going to be long, but that is fine.
dtm <- DocumentTermMatrix(hardwareCN.Clean)
dtmDataFrame1 <- as.data.frame(inspect(dtm))
colSums(dtmDataFrame1)
This gives me only about 10 words, but I know there are many more than that. I also need to be able to export the result so I can share it.
Thanks
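One hedged suggestion, since the question leaves the cause open: in newer versions of tm, inspect() is meant for displaying a sample of the matrix, so building the counts from as.matrix(dtm) instead should cover every term, and write.csv() handles the export. The object name hardwareCN.Clean is taken from the question; the output file name is made up.

# A sketch: take the counts from the full matrix rather than from inspect(),
# which only displays a sample of terms in newer tm versions.
library(tm)
dtm  <- DocumentTermMatrix(hardwareCN.Clean)
m    <- as.matrix(dtm)                       # full document-term matrix
freq <- sort(colSums(m), decreasing = TRUE)  # every term with its total count
head(freq, 20)
write.csv(data.frame(word = names(freq), count = freq),
          "term_frequencies.csv", row.names = FALSE)  # export to share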
I created a corpus of 233 rows and 3 columns (Date, Title, Article), where the last column, Article, is text (so I have 233 texts). The final aim is to apply topic models; to do so, I need to convert my corpus into a dfm. However, I would first like to combine words into bigrams and trigrams to make the analysis more rigorous.
The problem is that when I use textstat_collocations or tokens_compound, I am forced to tokenize the corpus, and in doing so I lose the 233-document structure that is crucial for applying topic models. In fact, once I apply those functions, I just get one row of bigrams and trigrams, which is useless to me.
So my question is: do you know any other way to look for bigrams and trigrams in a dfm without necessarily tokenizing the corpus?
Or, in other words, what do you usually do to look for multiword expressions in a dfm?
Thanks a lot for your time!
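For reference, a sketch of the usual quanteda workflow, assuming my_corpus stands for the 233-document corpus (the name is a placeholder): tokenizing keeps one element per document, so the compounded tokens still yield a 233-document dfm.

# A sketch only; `my_corpus` is a placeholder for the 233-document corpus.
library(quanteda)
library(quanteda.textstats)   # textstat_collocations() lives here in recent releases

toks  <- tokens(my_corpus, remove_punct = TRUE)                  # one element per document
colls <- textstat_collocations(toks, size = 2:3, min_count = 5)  # candidate bi-/trigrams
toks2 <- tokens_compound(toks, pattern = colls)                  # join them with "_"
mydfm <- dfm(toks2)
ndoc(mydfm)                                                      # still 233 documents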
I would like to analyse a big folder of texts for the presence of names, addresses and telephone numbers in several languages.
These will usually be preceded by a word such as "address", "telephone number", "name", "company", "hospital", or "deliverer". I will have a dictionary of these words.
I am wondering whether text mining tools would be a good fit for the job.
I would like to create a corpus for all these documents and then find text that meets specific criteria (I am thinking of regex criteria) to the right of or below the given dictionary entry.
Is there such a facility in the text mining packages in R, i.e. a way to get the strings to the right of or below a word-list entry that match a specific pattern?
If not, what would be a more suitable tool in R for the job?
Two options with quanteda come to mind:
Use kwic with your list of target patterns, with a window big enough to capture the amount of text after the term that you want. This will return a data.frame whose keyword and post columns you can use for your analysis. You can also construct a corpus directly from this object (corpus(mykwic)) and then focus on the new post docvar, which will contain the text you want (see the sketch after this list).
Use corpus_segment, where the target word list defines a "tag" pattern; anything following a tag, up to the next tag, will be reshaped into a new document. This works well but is a bit trickier to configure, since you will need to get the regex right for the tag.
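A rough sketch of both options, where docs_corpus and keys are placeholder names and the window size and regex are things you would tune:

# A rough sketch of both options; `docs_corpus` and `keys` are placeholders.
library(quanteda)
keys <- c("address", "telephone number", "name", "company", "hospital", "deliverer")

# Option 1: kwic with a generous window after each trigger word
toks <- tokens(docs_corpus)
hits <- kwic(toks, pattern = phrase(keys), window = 10, valuetype = "fixed")
hits$post                  # the text to the right of each match
kwcorp <- corpus(hits)     # or treat the kwic results as their own corpus

# Option 2: corpus_segment, treating each trigger word as a tag that opens a segment
segs <- corpus_segment(docs_corpus,
                       pattern   = "\\b(address|telephone|name|company|hospital|deliverer)\\b",
                       valuetype = "regex")
docvars(segs, "pattern")   # which tag introduced each new document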
I have a corpus that has words such as 5k, 50k, 7.5k, 75k, 10K, 100K.
So when I create a TDM using the tm package, terms such as 10k and 100k are extracted separately. However, 5k and 7.5k are not extracted as separate terms.
Now, I understand that after punctuation removal "7.5k" might be folded into the "75k" term, but what is going on with "5k"? Why is it not extracted as a term?
Basically, I would like to know whether there is a way to FORCE the tm package to look for specific words and extract them as key terms.
Any pointers would help!
Are you breaking words at punctuation? That is, is '.' a word-break character? If so, then the split of '7.5k' is ('7', '5k'), the second of which matches '5k'.
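Two tm control options may also be worth checking; this is a guess offered as a sketch, not a confirmed diagnosis. The default minimum term length is 3 characters, which would silently drop a two-character token like "5k", and a dictionary can be supplied to force specific terms to be counted (my_corpus is a placeholder name).

# A hedged sketch; `my_corpus` stands for the cleaned tm corpus.
library(tm)
tdm <- TermDocumentMatrix(my_corpus,
                          control = list(
                            wordLengths = c(1, Inf),              # keep short terms such as "5k"
                            dictionary  = c("5k", "7.5k", "10k",  # force exactly these terms
                                            "50k", "75k", "100k")
                          ))
inspect(tdm)

Note that supplying dictionary restricts the matrix to exactly those terms, so use it only when that is what you want.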
I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for an English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task: is there any similar work or existing project that I could use?
Ideas:
I) Use string edit distance: on a subset of your text, take the words that do not exist in the dictionary and match each one against existing dictionary words by edit distance (a minimal sketch follows this list).
II) The key feature of many of your examples is that they are only one character away from the correct spelling. So, for the words that you fail to match with a dictionary entry, try adding each English character to the front or back and look the resulting word up in the dictionary. This is very expensive at first, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table together with their correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus and, for the words the language model considers unknown (meaning it never saw them during training), check which word it considers most probable in that context. Most likely the language model's top-10 predictions will include the correctly spelled word.
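A minimal sketch of idea (I), written in R like the rest of the examples here purely for illustration; dict and toks are placeholder names, and base adist() supplies the edit distance.

# A minimal sketch of idea (I); `dict` and `toks` are placeholders.
dict <- c("are", "is", "having", "saying", "the", "and")   # your English word list
toks <- c("'re", "havin", "sayin", "the")

normalize <- function(w, dict, max_dist = 2) {
  if (w %in% dict) return(w)                  # already a dictionary word
  d    <- adist(w, dict, ignore.case = TRUE)  # Levenshtein distance to every entry
  best <- which.min(d)
  if (d[best] <= max_dist) dict[best] else w  # only substitute close matches
}

sapply(toks, normalize, dict = dict)
# Caching the resulting (word -> replacement) pairs gives the lookup table of idea (II).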