I'm just getting to grips with the tm package in R.
Probably a simple question, but I'm trying to use the findAssocs function to get an idea of the word associations in my customer enquiries insight document, and I can't seem to get it to work correctly.
When I use the following:
findAssocs(dtm, words, corlimit = 0.30)
$population
numeric(0)
$migration
numeric(0)
What does this mean? words is a character vector of 667 words, so surely there must be some correlative relationships?
Consider the following example:
library(tm)
corp <- VCorpus(VectorSource(
c("hello world", "hello another World ", "and hello yet another world")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
# Docs
# Terms 1 2 3
# and 0 0 1
# another 0 1 1
# hello 1 1 1
# world 1 1 1
# yet 0 0 1
Now consider
findAssocs(x=tdm, terms=c("hello", "yet"), corlimit=.4)
# $hello
# numeric(0)
#
# $yet
# and another
# 1.0 0.5
From what I understand, findAssocs looks at the correlations of hello with every term other than hello and yet, and likewise of yet with every term other than hello and yet. yet and and have a correlation coefficient of 1.0, which is above the lower limit of 0.4. yet and another have a correlation coefficient of 0.5, which is also above our 0.4 limit.
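You can verify these numbers by hand from the term-document matrix (a quick check, assuming the tdm built above):
m <- as.matrix(tdm)
cor(m["yet", ], m["and", ])     # 1.0
cor(m["yet", ], m["another", ]) # 0.5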
Here's another example showcasing this:
findAssocs(x=tdm, terms=c("yet", "another"), corlimit=0)
# $yet
# and
# 1
#
# $another
# and
# 0.5
Note that hello (and world) don't yield any results because they occur in every document. Their term frequencies therefore have zero variance, and cor under the hood yields NA (like cor(rep(1, 3), 1:3), which gives NA plus a zero-standard-deviation warning).
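The degenerate case is easy to reproduce directly:
cor(rep(1, 3), 1:3)
# [1] NA
# Warning message:
# In cor(rep(1, 3), 1:3) : the standard deviation is zero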
Similarly to this post, I'm trying to use the Affective Norms for English Words (in French) for a sentiment analysis with Quanteda. I ultimately want to create a "mean sentiment" per text in my corpus.
First, I load in the ANEW dictionary (FAN in French) and create a named vector of weights. ANEW differs from other dictionaries in that it does not use a key: value pair format, but rather assigns a numerical score to each word. The goal is to select features and then score them using weighted counts.
The ANEW file looks like this (MOT / VALENCE): cancer: 1.01, potato: 3.56, love: 6.56
#### FAN DATA ####
# read in the FAN data
df_fan <- read.delim("fan_anew.txt", stringsAsFactors = FALSE)
# construct a vector of weights with the term as the name
vector_fan <- df_fan$valence
names(vector_fan) <- df_fan$mot
Then I tried to apply dfm_weight() to my corpus of 27 documents.
# create a dfm selecting on the FAN words
dfm_fan <- dfm(my_corpus, select = df_fan$mot, language = "French")
dfm_fan_weighted <- dfm_fan %>%
dfm_weight(scheme = "prop") %>%
dfm_weight(weights = vector_fan)
## Warning messages:
## 1: dfm_weight(): ignoring 696 unmatched weight features
## 2: In diag(weight) : NAs introduced by coercion
Here is what I get: only 6 documents are included in the generated dfm object, and the code doesn't estimate the ANEW mean score for each document in the original corpus.
tail(dfm_fan_weighted)
## Document-feature matrix of: 6 documents, 335 features (72.6% sparse).
tail(dfm_fan_weighted)[, c("absent", "politique")]
## Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : invalid character indexing
tail(rowSums(dfm_fan_weighted))
## text22 text23 text24 text25 text26 text27
## NA NA NA NA NA NA
tail(dfm_fan_weighted)[, c("beau")]
## Document-feature matrix of: 6 documents, 1 feature (100% sparse).
## 6 x 1 sparse Matrix of class "dfm"
## features
## docs beau
## text22 0
## text23 0
## text24 0
## text25 0
## text26 0
## text27 0
Any ideas on how to fix it? I think the code needs just some small changes to work properly.
Edit: I edited the code following Ken Benoit's comment.
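For reference, here is a minimal sketch of one way to get a mean valence per text, using the current dfm_select/tokens API instead of the older dfm(select = ...) shortcut. It assumes df_fan$mot is lowercase (matching dfm()'s default tolower = TRUE) and, importantly, that df_fan$valence is numeric: if the French file uses decimal commas, read.delim will return character values, which would explain the "NAs introduced by coercion" warning.
library(quanteda)
stopifnot(is.numeric(df_fan$valence))                   # guard against decimal-comma parsing
toks <- tokens(my_corpus)
dfm_fan <- dfm_select(dfm(toks), pattern = df_fan$mot)  # keep only the FAN words
dfm_prop <- dfm_weight(dfm_fan, scheme = "prop")        # each word's share within its text
w <- vector_fan[featnames(dfm_prop)]                    # align valence weights to the dfm's features
sentiment <- as.numeric(dfm_prop %*% w)                 # weighted mean valence per text
names(sentiment) <- docnames(dfm_prop)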
I have a list of keywords:
library(stringr)
words <- c("decomposed", "no diagnosis", "decomposition", "autolysed", "maggots", "poor body", "poor", "not suitable", "not possible")
I want to match these keywords against the text in a data frame column (df$text) and record how many times each keyword occurs in a different data.frame (matchdf):
matchdf <- data.frame(Keywords = words)
m_match <- sapply(seq_along(words), function(x) sum(str_count(tolower(df$text), words[[x]])))
matchdf$matchs <- m_match
However, I've noticed that this method counts EACH occurrence of a keyword within a column, e.g.:
"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"
would return a count of 2, whereas I only want to count the first instance of "decomposed" within a field.
I thought there would be a way to only count the first instance using str_count but there doesn't seem to be one.
The stringr package isn't strictly necessary in this example; grepl from base R will suffice. That said, use str_detect instead of grepl if you prefer the package function (as pointed out by @Chi-Pak in a comment).
library(stringr)
words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots",
"poor body", "poor","not suitable", "not possible")
df <- data.frame( text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time")
matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)
# Base R grepl
matchdf$matches1 <- sapply(seq_along(words), function(x) as.numeric(grepl(words[x], tolower(df$text))))
# Stringr function
matchdf$matches2 <- sapply(seq_along(words), function(x) as.numeric(str_detect(tolower(df$text), words[[x]])))
matchdf
Result
Keywords matches1 matches2
1 decomposed 1 1
2 no diagnosis 0 0
3 decomposition 0 0
4 autolysed 0 0
5 maggots 0 0
6 poor body 0 0
7 poor 0 0
8 not suitable 0 0
9 not possible 0 0
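If df$text has many rows, the same idea scales directly: str_detect returns one logical per row, so summing it counts each keyword at most once per field. A small sketch (fixed() is used so the keywords are matched literally rather than as regular expressions):
matchdf$matches_total <- sapply(words, function(w) sum(str_detect(tolower(df$text), fixed(w))))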
Is there a method to concatenate two dfm matrices containing different numbers of columns and rows at the same time? It can be done with some additional coding, so I am not interested in ad hoc code but in a general and elegant solution, if one exists.
An example:
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
gives an error.
The 'tm' package can concatenate its document-term matrices out of the box, but it is too slow for my purposes. Also recall that 'dfm' from 'quanteda' is an S4 class.
Should work "out of the box", if you are using the latest version:
packageVersion("quanteda")
## [1] ‘0.9.6.9’
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## is one sample surprise text this
## doc1 1 1 2 0 1 1
## doc2 1 1 2 1 1 1
See also ?selectFeatures where features is a dfm object (there are examples in the help file).
Added:
Note that this will correctly align the two texts in a common feature set, unlike the normal rbind methods for matrices, whose columns must match. For the same reasons, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:
require(tm)
dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
rbind(dtm1, dtm2)
## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.
This almost gets it, but seems to duplicate the repeated feature:
as.matrix(rbind(c(dtm1, dtm2)))
## Terms
## Docs one sample sample. text this surprise!
## 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1
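If you are stuck with plain matrices (or an older tm), the feature alignment that quanteda performs can be imitated by padding both matrices out to the union of their terms. A minimal sketch:
m1 <- as.matrix(dtm1); m2 <- as.matrix(dtm2)
terms <- union(colnames(m1), colnames(m2))
pad <- function(m) {
  out <- matrix(0, nrow(m), length(terms), dimnames = list(rownames(m), terms))
  out[, colnames(m)] <- m
  out
}
rbind(pad(m1), pad(m2))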
Text mining with package tm, using removeWords(). I have a list of about 500 relevant words out of several thousand total. Can I use removeWords() to reverse the logic and remove the words from the Corpus that are NOT in the list?
With Perl, I could do something like this:
$diminishedText = ($fullText =~ s/$wordlist//g); # not tested
In R, this removes the words in the word list:
text <- tm_map(text, removeWords, wordList)
What would be the correct syntax for doing something like this?
text <- tm_map(text, removeWords, not in wordList)
This feels pretty clunky, but might work. A different possibility is at the end.
library(tm)
library(qdap); library(gtools)
library(stringr)
docs <- c("cat", "dog", "mouse", "oil", "crude", "tanker") # starting documents
keepWords <- c("oil", "crude", "tanker") # choose the words to keep from the starting documents
keeppattern <- paste0(keepWords, collapse = "|") # create a regex pattern of the keepWords
Text <- unlist(str_extract_all(string = docs, pattern = keeppattern)) # extract only the keepWords, as a vector
Text.tdm <- TermDocumentMatrix(Corpus(VectorSource(Text))) # create the tdm based on the keepWords only
Here is another possibility, but I did not work it through:
R remove stopwords from a character vector using %in%
EDIT
I ran across this approach, which subsets the tdm rows directly:
tdm.keep <- Text.tdm[rownames(Text.tdm) %in% keepWords, ]
EDIT
Another method:
'%nin%' <- Negate('%in%') # assign to an operator the opposite of %in%
Text <- tm_map(crude, removeWords(crude %nin% keepWords))
# Error because removeWords can't take a logical argument
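For completeness, here is a tm-only sketch that works on the corpus directly, wrapping a keep-only filter in content_transformer (keepOnlyWords is a hypothetical helper, and splitting on non-letters is an assumption about the tokenization you want):
library(tm)
keepOnlyWords <- content_transformer(function(x, words) {
  toks <- unlist(strsplit(x, "[^[:alpha:]]+"))   # split the document text on non-letters
  paste(toks[tolower(toks) %in% words], collapse = " ")  # keep only listed words, in order
})
text <- tm_map(text, keepOnlyWords, wordList)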
The text analysis package quanteda has functions for feature selection that are both positive (keep) and negative (remove). Here is an example where we want to keep just a set of economic words, from the US presidential inaugural corpus:
require(quanteda)
dfm(inaugTexts[50:57], keptFeatures = c("tax*", "econom*", "mone*"), verbose = FALSE)
# Document-feature matrix of: 8 documents, 5 features.
# 8 x 5 sparse Matrix of class "dfmSparse"
# features
# docs economic taxes tax economy money
# 1985-Reagan 4 2 4 5 1
# 1989-Bush 0 0 0 0 1
# 1993-Clinton 0 0 0 3 0
# 1997-Clinton 0 0 0 2 0
# 2001-Bush 0 1 0 2 0
# 2005-Bush 1 0 0 0 0
# 2009-Obama 0 0 0 3 0
# 2013-Obama 2 0 1 1 0
Here the match was using the default "glob" format, but fixed and regular expression matches for feature selection are also possible. See ?dfm and ?selectFeatures.
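In more recent quanteda releases the same positive selection is done with dfm_select() (or tokens_select()). A rough equivalent of the call above, assuming the inaugural corpus shipped as data_corpus_inaugural:
library(quanteda)
d <- dfm(tokens(data_corpus_inaugural[50:57]))  # newer API: dfm() takes tokens
dfm_select(d, pattern = c("tax*", "econom*", "mone*"))  # glob matching by default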
Maybe you can brute-force it: download some dictionary, remove from it the words that are in wordList, and pass the result to removeWords() inside tm_map().
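A sketch of that idea, building the vocabulary from the corpus itself instead of a downloaded dictionary (note that removeWords() pastes its words into a single regular expression, so a very long list may need to be removed in chunks):
allWords <- unique(unlist(strsplit(tolower(unlist(lapply(text, as.character))), "[^[:alpha:]]+")))
text <- tm_map(text, removeWords, setdiff(allWords, wordList))  # remove everything not in wordList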
I have a large data set in the following format, where each line is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is to create a little search engine where the documents (in the same format) matching a query are ranked; I thought about using TfIdf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch is that I already have these data indexed in this format (and I'd rather use them, unless the format is truly alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0, x, y)) by fetching the appropriate values from the word:freq pairs (would that in itself be a good way to construct the matrix?). But this way of converting does not seem like a good approach: the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
      doc1 doc2 doc3 doc4 ...
aword    3    0    0    0
bword    2    4    0    0
cword   15   20    0    0
dword    2    0    0    0
fword    0    1    0    0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day on themes like "term document vector/array/pairs", "two-dimensional array", "list into matrix", etc.).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution is obvious or doable with built-in functions: what is the actual term for the format I described above, where there are term:frequency pairs on each line and each line is a document?
Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on a spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x)
cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
# document word count
# 1 document1 aword 3
# 2 document1 bword 2
# 3 document1 cword 15
# 4 document1 dword 2
# 5 document2 bword 4
# 6 document2 cword 20
# 7 document2 fword 1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
# document
# word document1 document2
# aword 3 0
# bword 2 4
# cword 15 20
# dword 2 0
# fword 0 1
Note: The answer was edited to use matrices in the creation of "out" to minimize the number of calls to read.table, which would be a major bottleneck with bigger data.
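From here, the contingency table can be handed to tm for the tutorial. A sketch of the conversion, routed through slam's simple triplet format (which tm uses internally); weightTf is the plain term-frequency weighting:
library(tm)
library(slam)
tab <- as.matrix(xtabs(count ~ word + document, out))   # terms x documents
tdm <- as.TermDocumentMatrix(as.simple_triplet_matrix(tab), weighting = weightTf)
inspect(tdm)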