Hi: I have a dictionary of negative terms that was prepared by others. I am not sure how they went about the stemming, but it looks like they used something other than the Porter Stemmer. The dictionary has a wildcard character (*) that I think is supposed to enable stem matching. But I don't know how to make use of that with grep() or the tm package in R, so I stripped it out, hoping to find a way to grep the partial match.
So the original dictionary looks like this:
#load libraries
library(tm)
#sample dictionary terms for polarize and outlaw
negative<-c('polariz*', 'outlaw*')
#strip out wildcard
negative <- gsub('*', '', negative, fixed = TRUE)
#test corpus
test<-c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')
#Here is how R's porter stemmer stems the text
stemDocument(test)
So, if I stemmed my corpus with R's stemmer, terms like 'outlaw' would be found in the dictionary, but it wouldn't match terms like 'polarized' and such because they would be stemmed differently than what is found in the dictionary.
So, what I would like is some way to have the tm package match exact partial strings within each word. So, without stemming my documents, I would like it to be able to pick out 'outlaw' in the terms 'outlawing' and 'outlaws', and to pick out 'polariz' in 'polarized', 'polarizing', and 'polarizes'. Is this possible?
#Define corpus
test.corp<-Corpus(VectorSource(test))
#make Document Term Matrix
dtm <- DocumentTermMatrix(test.corp, control = list(dictionary = negative))
#inspect
inspect(dtm)
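For reference, the partial match asked about above can also be checked directly in base R, without tm; this is only a sketch using the stripped-down dictionary and grepl(), and the matches object is a hypothetical name, not from the original post.
# which stripped dictionary stems appear inside each test word?
matches <- sapply(negative, function(stem) grepl(stem, test, fixed = TRUE))
rownames(matches) <- test
matches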
I haven't seen any tm answers, so here's one using the quanteda package as an alternative. It allows you to use "glob" wildcard values in your dictionary entries, which is the default valuetype for quanteda's dictionary functions. (See ?dictionary.) With this approach, you do not need to stem your text.
library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.6.2’
# create a quanteda dictionary, essentially a named list
negative <- dictionary(list(polariz = 'polariz*', outlaw = 'outlaw*'))
negative
## Dictionary object with 2 key entries.
## - polariz: polariz*
## - outlaw: outlaw*
test <- c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')
dfm(test, dictionary = negative, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 7 documents, 2 features.
## 7 x 2 sparse Matrix of class "dfmSparse"
## features
## docs polariz outlaw
## text1 1 0
## text2 1 0
## text3 1 0
## text4 1 0
## text5 0 1
## text6 0 1
## text7 0 1
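As a side note, in more recent quanteda releases (2.0 and later) the dictionary is applied to a tokens object rather than passed to dfm() directly; a rough equivalent of the call above, assuming a current version, would be:
toks <- tokens(test)
# tokens_lookup() uses "glob" matching by default, so the wildcards work as-is
dfm(tokens_lookup(toks, dictionary = negative))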
Related
I would like to know if it is possible to delete documents from a corpus when the text is in fact "empty". I am building a corpus of texts in order to subsequently run some text models using the quanteda package in R. The texts are in a column of a CSV file and are imported as follows:
> mycorpus <- corpus(readtext("tablewithdocuments.csv", text_field = "textcolumn"))
> mycorpus
Corpus consisting of 25 documents and 14 docvars.
I know how to remove empty texts from the dfm of the corpus, but I want a new corpus that is a subset of the original one, excluding documents with a missing cell in the CSV column "textcolumn".
In practice, from something as the following corpus:
library("quanteda")
text <- c(
doc1 = "",
doc2 = "pinapples and pizzas taste good",
doc3 = "but please do not mix them together"
)
mycorpus <- corpus(text)
mycorpus
## Corpus consisting of 3 documents and 0 docvars.
summary(mycorpus)
## Corpus consisting of 3 documents:
## Text Types Tokens Sentences
## doc1 0 0 0
## doc2 4 4 1
## doc3 5 5 1
I would like to obtain a new corpus with only doc2 and doc3 in it.
Thank you in advance for your help.
Best wishes,
Michele
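One possible approach (a sketch, not part of the original post) is to keep only the documents that contain at least one token, using quanteda's corpus_subset() and ntoken():
# drop documents with zero tokens (i.e. the empty texts)
mycorpus_nonempty <- corpus_subset(mycorpus, ntoken(mycorpus) > 0)
# mycorpus_nonempty now contains only doc2 and doc3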
This code gives the output as a matrix. But here, common words like 'is', 'am', and 'i' should be excluded; I just want a matrix containing 'cool', 'mark', and 'neo4j'. I have tried grep("cool", tdm), but it does not work here. Is there an alternative method?
output: tdm
Docs
Terms 1 2
am 2 0
cool 0 2
i 2 0
is 0 2
mark 2 0
neo4j 0 2
Here is a small code example based on yours.
library(tm)
text <- c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus <- VCorpus(VectorSource(text))
# wordLengths lower bound set to 3 (basically the default), which removes all words of length 1 and 2
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf)))
as.matrix(tdm)
# only words cool and mark
# create a dictionary
my_dict <- c("cool", "mark")
tdm <- TermDocumentMatrix(corpus, control = list(dictionary = my_dict))
as.matrix(tdm)
Docs
Terms 1 2
cool 0 2
mark 2 0
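If the goal is simply to drop common function words rather than whitelist specific terms, an alternative (not in the answer above) is tm's built-in English stopword list; a minimal sketch:
# lower-case first so that "I", "Am", etc. match the stopword list
corpus_nostop <- tm_map(corpus, content_transformer(tolower))
corpus_nostop <- tm_map(corpus_nostop, removeWords, stopwords("en"))
tdm_nostop <- TermDocumentMatrix(corpus_nostop)
as.matrix(tdm_nostop)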
Be careful with just transforming document term matrices into a normal matrix. That can eat up a lot of memory if you have a lot of text.
But judging by your question, you should read up on text mining.
Here is a start with tidy text-mining
Here is info about text mining with quanteda
And read the vignette of tm
And of course search SO for examples. A lot has already been answered in one way or another.
When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, much as in Chinese, English also has certain fixed phrases, such as "semantic distance" or "machine learning"; if you segment them into single words, they have totally different meanings. I want to know how to match pre-defined dictionaries whose values consist of white-space-separated terms, for example a dictionary containing "semantic distance" and "machine learning". If a document is "we could use machine learning method to calculate the words semantic distance", then applying this document to the dictionary ["semantic distance", "machine learning"] should return a 1x2 matrix: [semantic distance, 1; machine learning, 1].
It's possible to do this with quanteda, although it requires the construction of a dictionary for each phrase, and then pre-processing the text to convert the phrases into tokens. To become a "token", the phrases need to be joined by something other than whitespace -- here, the "_" character.
Here are some example texts, including the phrase in the OP. I added two additional texts for the illustration -- below, the first row of the document-feature matrix produces the requested answer.
txt <- c("We could use machine learning method to calculate the words semantic distance.",
"Machine learning is the best sort of learning.",
"The distance between semantic distance and machine learning is machine driven.")
The current signature for compounding phrases into tokens requires the phrases to be supplied as a dictionary or a collocations object. Here we will make it a dictionary:
mydict <- dictionary(list(machine_learning = "machine learning",
semantic_distance = "semantic distance"))
Then we pre-process the text to convert the dictionary phrases to their keys:
toks <- tokens(txt) %>%
tokens_compound(mydict)
toks
# tokens from 3 documents.
# text1 :
# [1] "We" "could" "use" "machine_learning"
# [5] "method" "to" "calculate" "the"
# [9] "words" "semantic_distance" "."
#
# text2 :
# [1] "Machine_learning" "is" "the" "best"
# [5] "sort" "of" "learning" "."
#
# text3 :
# [1] "The" "distance" "between" "semantic_distance"
# [5] "and" "machine_learning" "is" "machine"
# [9] "driven" "."
Finally, we can construct the document-feature matrix, keeping all phrases using the default "glob" pattern match for any feature that includes the underscore character:
mydfm <- dfm(toks, select = "*_*")
mydfm
## Document-feature matrix of: 3 documents, 2 features.
## 3 x 2 sparse Matrix of class "dfm"
## features
## docs machine_learning semantic_distance
## text1 1 1
## text2 1 0
## text3 1 1
(Answer updated for >= v0.9.9)
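A small note for newer quanteda versions, where dfm() no longer accepts a select argument: roughly the same result can be obtained with dfm_select(), for example:
mydfm <- dfm(toks) %>%
  dfm_select(pattern = "*_*", valuetype = "glob")
mydfm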
I am trying to implement a text classification program in R that classifies input text (args) into 3 different classes. I have successfully tested the sample program by dividing the input data into training and test data.
I would now like to build something that would allow me to classify custom text.
My input data has the following structure:
So if I enter a custom text: "games studies time", I would like to get a matrix that looks like the following:
Please tell me the best way to do this.
This sounds a lot like the application of a "dictionary" to text following the tokenization of that text. What you have as the matrix result in your question, however, makes no use of the categories in the input data.
So here are two solutions: one that produces the matrix you state that you want, and one that produces a matrix counting the input text according to the categories to which your input data maps each term.
This uses the quanteda package in R.
require(quanteda)
mymap <- dictionary(list(school = c("time", "games", "studies"),
college = c("time", "games"),
office = c("work")))
dfm("games studies time", verbose = FALSE)
## Document-feature matrix of: 1 document, 3 features.
## 1 x 3 sparse Matrix of class "dfmSparse"
## features
## docs games studies time
## text1 1 1 1
dfm("games studies time", dictionary = mymap, verbose = FALSE)
## Document-feature matrix of: 1 document, 3 features.
## 1 x 3 sparse Matrix of class "dfmSparse"
## features
## docs school college office
## text1 3 2 0
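For newer quanteda versions, where dfm() no longer takes raw character input or a dictionary argument directly, roughly the same two results can be produced via tokens() and tokens_lookup(); a sketch, assuming a current release:
toks <- tokens("games studies time")
dfm(toks)                                    # plain feature counts
dfm(tokens_lookup(toks, dictionary = mymap)) # counts mapped to the dictionary keys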
I have a data set of Facebook posts (exported via Netvizz), and I use the quanteda package in R. Here is my R code.
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read File
# Facebooks posts could be generated by FB Netvizz
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep=";")
# Define the relevant column(s)
fb_test <- as.character(fbpost$comment_message) # one column with 2700 entries
# Define as corpus
fb_corp <-corpus(fb_test)
class(fb_corp)
# LIWC Application
fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
View(fb_liwc)
Everything works until:
> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
... indexing 2,760 documents
... tokenizing texts, found 77,923 total tokens
... cleaning the tokens, 1584 removed entirely
... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", :
invalid 'dimnames' given for data frame
How would you interpret the error message? Are there any suggestions to solve the problem?
There was a bug in quanteda version 0.7.2 that caused dfm() to fail when using a dictionary when one of the documents contains no features. Your example fails because, in the cleaning stage, some of the Facebook post "documents" end up having all of their features removed.
This is not only fixed in 0.8.0, but also we changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and the regular expressions still mean that it is much slower to use than simply indexing tokens. We will work on optimising this further.)
devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
## ... indexing 57 documents
## ... lowercasing
## ... tokenizing
## ... shaping tokens into data.table, found 134,024 total tokens
## ... applying a dictionary consisting of 68 key entries
## ... summing dictionary-matched features by document
## ... indexing 68 feature types
## ... building sparse matrix
## ... created a 57 x 68 sparse dfm
## ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
## Fillers Nonfl Swear TV Eating Sleep Groom Death Sports Sexual
## 0 0 0 42 47 49 53 76 81 100
It will also work if a document contains zero features after tokenization and cleaning, which is probably what was breaking the older dfm() you were using with your Facebook texts.
mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams
## 3
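If you then want to drop such empty documents from the dfm itself, one sketch (not from the original answer) is to subset on the row sums:
# keep only documents with at least one matched feature
mydfm_nonempty <- mydfm[rowSums(mydfm) > 0, ]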