I am implementing LDA for some simple data sets. I am able to do the topic modelling, but when I try to organise the top 6 terms by topic, I get numerical values (maybe their indexes) instead of the terms.
# docs is the dataset formatted and cleaned properly
dtm <- TermDocumentMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
# 6 top terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 6))
write.csv(ldaOut.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))
The TopicsToTerms file is generated like this:
  Topic 1 Topic 2 Topic 3
1       1       5       3
2       2       1       4
3       3       2       1
4       4       3       2
5       5       4       5
What I want instead is the terms (the top words for each topic) in the table, like the following:
  Topic 1 Topic 2 Topic 3
1     Hat     Cat    Food
You just need one line of code to fix your problem:
text <- read.csv("~/Desktop/your_data.csv")          # your initial dataset
docs <- Corpus(VectorSource(text))                   # converting to a corpus
docs <- tm_map(docs, content_transformer(tolower))   # cleaning
...                                                  # further cleaning steps
dtm <- DocumentTermMatrix(docs)                      # creating a document-term matrix
rownames(dtm) <- text
After adding that last line you can proceed with the rest of your code, and you'll get the terms rather than their indexes. Hope that helps.
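If that doesn't do it, the top terms can also be read straight off the fitted model's posterior instead of via terms(); a minimal hedged sketch, assuming the ldaOut and k objects from the question:
library(topicmodels)
# posterior()$terms is a k x V matrix of per-topic term probabilities,
# with the actual words as column names
phi <- posterior(ldaOut)$terms
top6 <- apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[1:6])
top6   # a 6 x k character matrix: one column of top terms per topic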
To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1-grams and combine those with all of the 2- through 5-grams to form a single quanteda::dfmSparse object that will be used in natural language processing (NLP) predictions. Carrying around all the 1-grams would be pointless because only the top ten (or twenty) will ever get used with the simple back-off model I'm using.
I wasn't able to find a quanteda::dfm(corpusText, ...) parameter that instructs it to return only the top N features. So, based on comments from package author @KenB in other threads, I'm using the dfm_select/dfm_remove functions to extract the top ten 1-grams, and based on the "quanteda dfm join" search result "concatenate dfm matrices in 'quanteda' package", I'm using the rbind.dfmSparse function (I think) to join those results.
So far everything looks right from what I can tell. I thought I'd bounce this game plan off the SO community to see if I'm overlooking a more efficient route to this result, or some flaw in the solution I've arrived at so far.
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
library(data.table)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50],
           frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
           keep.rownames = FALSE, stringsAsFactors = FALSE)
For extracting the top 10 unigrams, this strategy will work just fine:
sort the dfm in the (default) decreasing order of overall feature frequency, which you have already done, then add a step to slice out the first 10 columns;
combine this with the 2- to 5-gram dfm using cbind() (not rbind()).
That should do it:
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
# features
# docs some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
# text1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1
Your example code also makes some use of data.table. In v0.99 we have added a new function textstat_frequency(), which produces a "long"/"tidy" format of frequencies in a data.frame that might be helpful:
head(textstat_frequency(dfmCombined), 10)
# feature frequency rank docfreq
# 1 some 2 1 1
# 2 corpus 2 2 1
# 3 text 2 3 1
# 4 of 2 4 1
# 5 to 2 5 1
# 6 very 2 6 1
# 7 large 2 7 1
# 8 top 2 8 1
# 9 ten 2 9 1
# 10 some_corpus 2 10 1
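For what it's worth, topfeatures() offers another route to the same top-ten slice; a minimal sketch, assuming the objects from the question:
# topfeatures() returns the n most frequent features as a named vector,
# so its names can drive dfm_select() directly
topTen <- names(topfeatures(dfm1gramsSorted, 10))
dfmCombined2 <- cbind(dfm_select(dfm1gramsSorted, pattern = topTen), dfm2to5grams)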
I want to analyze answers from open-ended questions. I made a single-word cloud first, and then ran into a problem when I wanted to count the frequency of 2-3 word phrases.
Here is my code:
library(tm)
library(tau)   # textcnt() comes from tau, not tm

tokenize_ngrams <- function(x, n = 2)
  return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))

corpus <- Corpus(VectorSource(texts))
matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tokenize_ngrams))
inspect(matrix[1:4, 1:3])
The results should be the 2-word phrases and their frequencies, but I got the following instead:
          Docs
Terms      1 2 3
  document 1 0 0
  first    1 0 0
  the      1 1 1
  this     1 1 1
I don't know the answer using tm, but this will work fine:
require(quanteda)
matrix <- dfm(texts, ngrams = 2)
head(matrix)
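In more recent quanteda releases the ngrams argument has moved to the tokens stage; a hedged equivalent sketch, assuming texts is a character vector:
library(quanteda)
toks <- tokens(texts)
matrix2 <- dfm(tokens_ngrams(toks, n = 2))   # two-word phrases
topfeatures(matrix2, 10)                     # the most frequent phrases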
I'm struggling with inputs for arulesSequences in R.
My data (let's call the data frame df) looks like this:
  sequenceID eventID SIZE event
1          1       1    1 E_351-
2          1       2    1 1-
3          2       1    1 30006+
4          2       2    1 20198+
5          2       3    1 111+
6          2       4    1 610-
7          2       5    1 26+
8          2       6    1 30006-
9          2       7    2 11+, 11
The next step, as(df, "transactions"), gives the following error:
Error in asMethod(object) :
  can not coerce list with transactions with duplicated items
Calls: as ... .nextMethod -> callNextMethod -> .nextMethod -> as -> asMethod
I just spent two days trying to get my data into cspade, without success!
After much trial and error I managed to convert the file to a transactions object. Tricks for those who struggle with the same thing (a sketch follows below):
I had to remove the commas (use paste rather than toString).
I wrote the table to a csv file. BE AWARE: no header and no row names, or the import with read_baskets will fail. Hope this helps future users.
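A minimal sketch of those tricks, assuming a data frame df shaped like the one in the question (the file name is illustrative):
library(arulesSequences)
# collapse multi-item events with paste() so no commas land in the item field
df$event <- sapply(strsplit(as.character(df$event), ",\\s*"), paste, collapse = " ")
# write with no header and no row names, or read_baskets() will fail
write.table(df, "baskets.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)
trans <- read_baskets("baskets.txt", info = c("sequenceID", "eventID", "SIZE"))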
I did it similar to how you did it. I also included a size column; I saw it in another example, though I'm not sure what it does.
My data is built like this, but with > 200,000 unique IDs.
mytxt <- data.frame(ID    = c(1, 1, 1, 2, 2),
                    Time  = c(1, 2, 3, 1, 2),
                    Size  = 1,
                    Event = c("A", "B", "E", "B", "A"))
I simply save it as a txt file with no column or row names:
write.table(mytxt, "C:\\mytxt.txt", sep=" ", row.names = FALSE, col.names = FALSE, quote = FALSE)
And then I read it with the following line:
data <- read_baskets(con = "C:\\mytxt.txt", info = c("sequenceID","eventID","SIZE"))
So it is similar to what you describe in the comment.
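From there, a hedged sketch of actually mining the sequences with cspade (the support value is illustrative only):
library(arulesSequences)
seqs <- cspade(data, parameter = list(support = 0.4))
as(seqs, "data.frame")   # the mined sequences with their support values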
This question is related to my earlier question, Treat words separated by space in the same manner. I am posting it separately since it might help other users find it easily.
The question concerns the way the term-document matrix is currently calculated by the tm package. I want to tweak this calculation a little, as explained below.
Currently, a term-document matrix gets created by looking for a word, say 'milky', as a separate word (and not as a substring) in a document. For example, assume two documents:
document 1: "this is a milky way galaxy"
document 2: "this is a milkyway galaxy"
As the current algorithm (tm package) works, 'milky' would be found in the first document but not in the second, since the algorithm looks for the term as a separate word. Had the algorithm looked for 'milky' as a substring, the way grepl does, it would have found the term in the second document as well:
grepl('milky', 'this is a milkyway galaxy')
# [1] TRUE
Can someone please help me create a term-document matrix meeting this requirement, i.e. one that finds the term 'milky' in both documents? Note that I don't want a solution specific to the word 'milky'; I want a general solution that I can apply at larger scale to cover all such cases. The solution need not use the tm package; in the end I just need a term-document matrix in which each term is looked up as a substring (grepl-like matching) inside every document.
The current code I use to get the term-document matrix is:
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"

library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df
         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1
I am not sure that tm makes it easy (or possible) to select or group features based on regular expressions. But the text package quanteda does, through a thesaurus argument that groups terms according to a dictionary when constructing its document-feature matrix.
(quanteda uses the generic term "feature" since here your category is terms containing the string milky, rather than the original terms.)
The valuetype argument can be the "glob" format (the default), a regular expression ("regex"), or as-is fixed matching ("fixed"). Below I show the versions with glob and regular expressions.
require(quanteda)
myDictGlob <- dictionary(list(containsMilky = c("milky*")))
myDictRegex <- dictionary(list(containsMilky = c("^milky")))
(plainDfm <- dfm(c(doc1, doc2)))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 9 feature types
## ... created a 2 x 9 sparse dfm
## ... complete.
## Elapsed time: 0.008 seconds.
## Document-feature matrix of: 2 documents, 9 features.
## 2 x 9 sparse Matrix of class "dfmSparse"
## features
## docs this is a document about milkyway milky way huge
## text1 1 1 1 1 1 1 0 0 0
## text2 0 1 0 0 0 0 1 1 1
dfm(c(doc1, doc2), thesaurus = myDictGlob, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
## this is a document about way huge CONTAINSMILKY
## text1 1 1 1 1 1 0 0 1
## text2 0 1 0 0 0 1 1 1
dfm(c(doc1, doc2), thesaurus = myDictRegex, valuetype = "regex")
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
## this is a document about way huge CONTAINSMILKY
## text1 1 1 1 1 1 0 0 1
## text2 0 1 0 0 0 1 1 1
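(In newer quanteda versions the thesaurus argument is gone; a roughly equivalent hedged sketch applies the dictionary at the tokens stage with tokens_lookup(), where exclusive = FALSE keeps the unmatched tokens alongside the dictionary key:)
toks <- tokens(c(doc1, doc2))
dfm(tokens_lookup(toks, myDictRegex, valuetype = "regex", exclusive = FALSE))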
I am trying to get the number of occurrences of each word in a csv file with R.
My dataset looks like this:
TITLE
1 My first Android app after a year
2 Unmanned drone buzzes French police car
3 Make anything editable with HTML5
4 Predictive vs Reactive control
5 What was it like to move to San Antonio and go through TechStars Cloud?
6 Health-care sector vulnerable to hackers, researchers say
And I have tried using the function from 'Machine Learning for Hackers':
library(tm)

get.tdm <- function(doc.vec) {
  doc.corpus <- Corpus(VectorSource(doc.vec))
  control <- list(stopwords = TRUE, removePunctuation = TRUE,
                  removeNumbers = TRUE, minDocFreq = 2)
  doc.dtm <- TermDocumentMatrix(doc.corpus, control)
  return(doc.dtm)
}
But I get an error I don't understand:
Error: is.Source(s) is not TRUE
In addition: Warning message:
In is.Source(s) : vectorized sources must have a positive length entry
What could possibly be the problem?
This works for me (calling your data frame df):
library(tm)
doc.corpus <- Corpus(VectorSource(df))
freq <- data.frame(count=termFreq(doc.corpus[[1]]))
freq
# count
# after 1
# and 1
# android 1
# antonio 1
# anything 1
# ...
# unmanned 1
# vulnerable 1
# was 1
# what 1
# with 1
# year 1
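For the record, the original error usually points to VectorSource() receiving an empty or wrongly typed object rather than a plain character vector. A hedged sketch of feeding get.tdm() the TITLE column directly and totalling each term (assuming the data frame is df; slam is the sparse-matrix package tm builds on):
library(tm)
doc.tdm <- get.tdm(as.character(df$TITLE))                # coerce the column explicitly
freqs <- sort(slam::row_sums(doc.tdm), decreasing = TRUE) # total count per term
head(freqs)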