I want to analyze answers from open-ended questions. A single-word word cloud worked fine, but I ran into a problem when I tried to count the frequencies of 2-3 word phrases.
Here is my code:
library(tm)
library(tau)  # textcnt() comes from the tau package, not tm
# tokenizer returning every 2-gram that tau::textcnt() finds in a document
tokenize_ngrams <- function(x, n = 2)
  rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))
corpus <- Corpus(VectorSource(texts))  # texts: character vector of answers
matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tokenize_ngrams))
inspect(matrix[1:4, 1:3])
I expected the rows to be 2-word phrases with their frequencies, but I got single words instead:
          Docs
Terms      1 2 3
  document 1 0 0
  first    1 0 0
  the      1 1 1
  this     1 1 1
I don't know how to do this using tm, but this will work fine:
require(quanteda)
matrix <- dfm(texts, ngrams = 2)
head(matrix)
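Note that recent quanteda versions (v3 and later) removed the ngrams shortcut from dfm(), so if the call above errors, a small sketch of the current equivalent (assuming texts is your character vector) is:
library(quanteda)
toks <- tokens(texts)                          # tokenize first
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2))  # bigrams joined by "_"
topfeatures(bigram_dfm)                        # most frequent bigrams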
I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.
For example:
strings <- data.frame(text = c('This is a great movie from yesterday',
                               'I went to the movies',
                               'Great movie time at the theater',
                               'I went to the theater yesterday'))
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
Thank you!
You could use the quanteda framework for this:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
       features
docs    great movie yesterday great_movie went theater
  text1     1     1         1           1    0       0
  text2     0     0         0           0    1       0
  text3     1     1         0           1    0       1
  text4     0     0         1           0    1       1
# use convert function to turn this into a data.frame
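For example (a short sketch reusing the objects above):
df <- convert(dfm_trim(my_dfm, min_termfreq = 2), to = "data.frame")
And if you would rather have the list-per-document output shown in the question than a matrix, one option is to keep only the tokens whose features pass the threshold (note this keeps both the unigrams and the joined bigram, unlike gensim's Phraser, which replaces them):
keep <- featnames(dfm_trim(my_dfm, min_termfreq = 2))
as.list(tokens_select(bigs, keep, valuetype = "fixed"))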
Alternatively you could use the tidytext package, tm, tokenizers, etc. It all depends a bit on the output you are expecting.
An example using tidytext / dplyr looks like this:
library(tidytext)
library(dplyr)
strings %>%
  unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_",
                stopwords = stopwords::stopwords()) %>%
  count(bigs) %>%
  filter(n >= 2)
         bigs n
1       great 2
2 great_movie 2
3       movie 2
4     theater 2
5        went 2
6   yesterday 2
Both quanteda and tidytext have a lot of online help available. See the vignettes shipped with both packages on CRAN.
This is a follow-up question to this one. There, I asked whether it's possible to split up ngram features in a document-feature matrix (the dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
library(quanteda)
eg.txt <- c('increase in_the great plenary',
'great plenary emission_reduction',
'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices as the one above. However, as soon as the matrix is bigger, I'm constantly running into the following memory error.
> # turn the dfm into a dense data frame
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory-efficient way to solve this ngram problem, or to deal with large (sparse) matrices/data frames? Thank you in advance!
The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.
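As a quick check before densifying anything, quanteda can report how sparse the object is (a small sketch on the example dfm; sparsity(), ndoc() and nfeat() are quanteda functions):
sparsity(eg.dfm)              # proportion of zero cells
ndoc(eg.dfm) * nfeat(eg.dfm)  # number of cells a dense copy must allocate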
Note that this is both a better solution than the one proposed in the linked question and one that should work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
# separate the unigrams
x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
# separate the ngrams
x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
# split into components
split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
# get a repeated index for the ngram feature names
index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
# subset the ngram matrix using the (repeated) ngram feature names
x_split_ngrams <- x_ngrams[, index_split_ngrams]
# assign the ngram dfm the feature names of the split ngrams
colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
# return the column concatenation of unigrams and split ngrams
suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
So:
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1
Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1
I want to match a list of words against a list of sentences and form a data frame with the matched words (comma-separated) in one column and the corresponding sentences in another column. I want the words to exactly match the words in the sentence. For example:
Sample sentences and words:
sentences <- c("This is crap","You are awesome","A great app",
"My advice would be to improve the look and feel of the app")
words <- c("crap","awesome","great","vice","advice","awe","prove","improve")
Expected Result:
sentences                                                    words
This is crap                                                 "crap"
You are awesome                                              "awesome"
A great app                                                  "great"
My advice would be to improve the look and feel of the app   "advice","improve"
I have thousands of sentences (28k) like this to be matched against thousands of words (65k). I follow the approach below to achieve this, but the problem is that I am unable to get an exact word match.
library(stringi)
df <- data.frame(sentences)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))
I followed different approaches, but nothing seems to be faster than this one. However, I couldn't use this approach as-is, because it does not match the exact word; instead it matches any string containing the word. Can someone suggest a solution that matches exact words without losing much performance?
You can use stri_extract_all_regex from the stringi package:
library(stringi)
data.frame(sentences = sentences,
           words = sapply(stri_extract_all_regex(sentences, paste(words, collapse = '|')),
                          paste, collapse = ','),
           stringsAsFactors = FALSE)
#   sentences                                                    words
# 1 This is crap                                                 crap
# 2 You are awesome                                              awesome
# 3 A great app                                                  great
# 4 My advice would be to improve the look and feel of the app   advice,improve
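One caveat: without word boundaries the alternation can match inside longer words; here only the ordering of words (e.g. "awesome" before "awe") keeps shorter entries from truncating longer matches. A hedged variant that enforces whole-word matches:
pattern <- paste0('\\b(', paste(words, collapse = '|'), ')\\b')
data.frame(sentences = sentences,
           words = sapply(stri_extract_all_regex(sentences, pattern), paste, collapse = ','),
           stringsAsFactors = FALSE)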
I think @Sotos answers the question well, but I'd like to add another way of representing the result which might be more helpful for further analysis.
sentences <- c("This is crap","You are awesome","A great app",
"My advice would be to improve the look and feel of the app")
words <- c("crap","awesome","great","vice","advice","awe","prove","improve")
library(stringr)
mat <- do.call(rbind, lapply(words, function(x) str_count(sentences, x)))
dimnames(mat) <- list(words, NULL)
data.frame(mat)
Output:
        X1 X2 X3 X4
crap     1  0  0  0
awesome  0  1  0  0
great    0  0  1  0
vice     0  0  0  1
advice   0  0  0  1
awe      0  1  0  0
prove    0  0  0  1
improve  0  0  0  1
Just adding a slight adjustment to the neat approach of prateek1592, as it has the flaw of returning a match for the string "awe" in the second sentence: the second sentence contains that pattern, but not that particular word. Changing the input of str_count() from x to paste0("\\b", x, "\\b"), we get:
sentences <- c("This is crap","You are awesome","A great app",
"My advice would be to improve the look and feel of the app")
words <- c("crap","awesome","great","vice","advice","awe","prove","improve")
library(stringr)
mat <- do.call(rbind, lapply(words, function(x) str_count(sentences, paste0("\\b", x, "\\b"))))
dimnames(mat) <- list(words, NULL)
data.frame(mat)
Output:
        X1 X2 X3 X4
crap     1  0  0  0
awesome  0  1  0  0
great    0  0  1  0
vice     0  0  0  0
advice   0  0  0  1
awe      0  0  0  0
prove    0  0  0  0
improve  0  0  0  1
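If you still need the two-column layout from the original question, the matrix collapses back per sentence (a small sketch reusing mat from above):
data.frame(sentences = sentences,
           words = apply(mat, 2, function(col) paste(rownames(mat)[col > 0], collapse = ",")),
           stringsAsFactors = FALSE)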
I am implementing LDA for some simple data sets. I am able to do the topic modelling, but the issue is that when I try to organise the top 6 terms according to their topics, I get numerical values (maybe their indexes) instead.
library(tm)
library(topicmodels)
# docs is the dataset, formatted and cleaned properly
dtm <- TermDocumentMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
# note: topicmodels::LDA expects a DocumentTermMatrix; a TermDocumentMatrix is its transpose
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
# 6 top terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 6))
write.csv(ldaOut.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))
The TopicsToTerms file is generated like this:
  Topic 1 Topic 2 Topic 3
1       1       5       3
2       2       1       4
3       3       2       1
4       4       3       2
5       5       4       5
While I want the terms (the top words for each topic) in the table, like the following:
  Topic 1 Topic 2 Topic 3
1     Hat     Cat    Food
You just need one line of code to fix your problem:
> text = read.csv("~/Desktop/your_data.csv") #your initial dataset
> docs = Corpus(VectorSource(text)) #converting to corpus
> docs = tm_map(docs, content_transformer(tolower)) #cleaning
> ... #cleaning
> dtm = DocumentTermMatrix(docs) #creating a document term matrix
> rownames(dtm) = text
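As a quick sanity check (a small sketch, assuming the tm objects above), you can confirm that the matrix carries word terms before re-running LDA:
> head(Terms(dtm))  # should print words, not numbers
> dim(dtm)          # dimensions of the matrix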
After adding that last line, you can proceed with the remaining code, and you'll get the terms rather than their indexes. Hope that helps.
This question is related to my earlier question: Treat words separated by space in the same manner.
Posting it as a separate one since it might help other users find it easily.
The question is regarding the way the term-document matrix is currently calculated by the tm package. I want to tweak this behaviour a little, as explained below.
Currently, any term-document matrix gets created by looking for a word, say 'milky', as a separate word (and not as a substring) in a document. For example, let us assume 2 documents:
document 1: "this is a milky way galaxy"
document 2: "this is a milkyway galaxy"
As per the way the current algorithm (tm package) works, 'milky' would get found in the first document but not in the second, since the algorithm looks for the term 'milky' as a separate word. But if the algorithm had looked for the term 'milky' as a substring, the way the function grepl does, it would have found it in the second document as well.
grepl('milky', 'this is a milkyway galaxy')
TRUE
Can someone please help me create a term-document matrix that meets my requirement (to find the term 'milky' in both documents)? Please note that I don't want a solution specific to the word 'milky'; I want a general solution that I can apply at a larger scale to cover all such cases. The solution does not have to use the tm package; in the end I just need a term-document matrix in which each term is looked for as a substring (grepl-like matching) inside every document.
The current code I use to get the term-document matrix is:
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
# note: newer tm versions expect DataframeSource input with doc_id and text columns
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df<-as.data.frame(as.matrix(tmpDTM))
tmp.df
         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1
I am not sure that tm makes it easy (or possible) to select or group features based on regular expressions, but the text package quanteda does, through a thesaurus argument that groups terms according to a dictionary when constructing its document-feature matrix.
(quanteda uses the generic term "feature", since here your category is terms containing the phrase "milky" rather than the original "terms".)
The valuetype argument can be the "glob" format (the default), a regular expression ("regex"), or fixed, matched as-is ("fixed"). Below I show the versions with glob and with regular expressions.
require(quanteda)
myDictGlob <- dictionary(list(containsMilky = c("milky*")))
myDictRegex <- dictionary(list(containsMilky = c("^milky")))
(plainDfm <- dfm(c(doc1, doc2)))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 9 feature types
## ... created a 2 x 9 sparse dfm
## ... complete.
## Elapsed time: 0.008 seconds.
## Document-feature matrix of: 2 documents, 9 features.
## 2 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    this is a document about milkyway milky way huge
##   text1    1  1 1        1     1        1     0   0    0
##   text2    0  1 0        0     0        0     1   1    1
dfm(c(doc1, doc2), thesaurus = myDictGlob, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##         this is a document about way huge CONTAINSMILKY
##   text1    1  1 1        1     1   0    0             1
##   text2    0  1 0        0     0   1    1             1
dfm(c(doc1, doc2), thesaurus = myDictRegex, valuetype = "regex")
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##         this is a document about way huge CONTAINSMILKY
##   text1    1  1 1        1     1   0    0             1
##   text2    0  1 0        0     0   1    1             1
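Note for newer quanteda versions: the thesaurus argument has since moved out of dfm(). A hedged equivalent uses dfm_lookup() with exclusive = FALSE, which behaves like a thesaurus by keeping the unmatched features:
plain <- dfm(tokens(c(doc1, doc2)))
dfm_lookup(plain, myDictRegex, valuetype = "regex", exclusive = FALSE)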