Remove words from a dtm - r

I have created a dtm:
library(tm)
corpus = Corpus(VectorSource(dat$Reviews))
dtm = DocumentTermMatrix(corpus)
Then I used removeSparseTerms() to remove rare terms:
dtm = removeSparseTerms(dtm, 0.98)
After removeSparseTerms() there are still some terms in the dtm that are useless for my analysis.
The tm package has a function to remove words. However, this function can only be applied to a corpus or a vector.
How can I remove defined terms from a dtm?
Here is a small sample of the input data:
library(dplyr) # for %>%, select(), sample_n()
samp = dat %>%
  select(Reviews) %>%
  sample_n(20)
dput(samp)
structure(list(Reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"work perfect time", "amaze buy phone smoothly update charm glte yet comparably fast several different provider sims perfectly small size definitely replacemnent simple",
"phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend",
"perfect", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy",
"", "phone verizon contract phone buyer beware", "good phone",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"
)), row.names = c(12647L, 10088L, 14055L, 3720L, 6588L, 10626L,
10362L, 1428L, 12580L, 5381L, 10431L, 2803L, 6644L, 12969L, 348L,
10582L, 3215L, 13358L, 12708L, 7049L), class = "data.frame")

You should try quanteda, which calls a DocumentTermMatrix a "dfm" (document feature matrix) and has more options to trim it to reduce sparsity, including a function dfm_remove() for removing specific features (terms).
If we rename your samp object as dat, then:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
corp <- corpus(dat, text_field = "Reviews")
corp
## Corpus consisting of 20 documents and 0 docvars.
tail(texts(corp), 2)
## 12708 7049
## "good phone price fine" "phone star battery little soon yes"
dtm <- dfm(corp)
dtm
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).
Now we can trim this. For this small example, the sparsity setting of 0.98 has no effect, but we can trim based on frequency thresholds instead.
# does not actually have an effect
dfm_trim(dtm, sparsity = 0.98, verbose = TRUE)
## Note: converting sparsity into min_docfreq = 1 - 0.98 = NULL .
## No features removed.
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).
# trimming based on rare terms
dtm <- dfm_trim(dtm, min_termfreq = 3, verbose = TRUE)
## Removing features occurring:
## - fewer than 3 times: 119
## Total features removed: 119 (93.0%).
head(dtm)
## Document-feature matrix of: 6 documents, 9 features (83.3% sparse).
## 6 x 9 sparse Matrix of class "dfm"
## features
## docs phone screen sim card work good perfect buy never
## 12647 0 0 0 0 0 0 0 0 0
## 10088 0 0 0 0 0 0 0 0 0
## 14055 0 0 0 0 0 0 0 0 0
## 3720 1 0 0 0 0 0 0 0 0
## 6588 1 1 1 1 1 1 0 0 0
## 10626 0 0 0 0 1 0 1 0 0
Anyway, to answer your question directly: you want dfm_remove() to get rid of specific features.
# removing from a specific list of terms
dtm <- dfm_remove(dtm, c("screen", "buy", "sim", "card"), verbose = TRUE)
## removed 4 features
##
dtm
## Document-feature matrix of: 20 documents, 5 features (75.0% sparse).
head(dtm)
## Document-feature matrix of: 6 documents, 5 features (80.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs phone work good perfect never
## 12647 0 0 0 0 0
## 10088 0 0 0 0 0
## 14055 0 0 0 0 0
## 3720 1 0 0 0 0
## 6588 1 1 1 0 0
## 10626 0 1 0 1 0
And finally, if you still really want to, you can convert the dtm into the tm format using quanteda's convert() function:
convert(dtm, to = "tm")
## <<DocumentTermMatrix (documents: 20, terms: 5)>>
## Non-/sparse entries: 25/75
## Sparsity : 75%
## Maximal term length: 7
## Weighting : term frequency (tf)
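As an aside, if you would rather stay entirely in tm, a DocumentTermMatrix can be subset like a matrix, so you can also drop specific terms by column indexing. A minimal sketch (my own illustration, assuming dtm is the tm object from the question), using tm's Terms() accessor:
library(tm)
drop_terms <- c("screen", "buy", "sim", "card")
# keep only the columns whose term names are not in drop_terms
dtm <- dtm[, !(Terms(dtm) %in% drop_terms)]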

Related

How to compute a numeric sentiment score using quanteda from a custom dictionary

I have been using the awesome quanteda library for text analysis lately and it has been quite a joy. Recently I stumbled on a task: using a dictionary that relates words to a numeric sentiment score to compute a per-document measure called NetSentScore, calculated in the following manner:
NetSentScore per document = sum(Positive_wordscore) + sum(Negative_wordscore)
I have the following dictionary:
ScoreDict <- tibble::tibble(
  score = c(-5, -9, 1, 8, 9, -10),
  word = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
)
My corpus:
text<-c("this is a bad movie very bad","horrible movie, just awful","im open to new dreams",
"awesome place i loved it","she is gorgeous","that is trash")
By design, quanteda will not allow numeric data in a dictionary, but I can have this:
> text %>%
+ corpus() %>%
+ tokens(remove_punct = TRUE) %>%
+ tokens_remove(stopwords("en")) %>%
+ dfm()
Document-feature matrix of: 6 documents, 14 features (82.14% sparse) and 0 docvars.
features
docs bad movie horrible just awful im open new dreams awesome
text1 2 1 0 0 0 0 0 0 0 0
text2 0 1 1 1 1 0 0 0 0 0
text3 0 0 0 0 0 1 1 1 1 0
text4 0 0 0 0 0 0 0 0 0 1
text5 0 0 0 0 0 0 0 0 0 0
text6 0 0 0 0 0 0 0 0 0 0
[ reached max_nfeat ... 4 more features ]
This gives me the number of times a word was found in each document. I only need to "join" or "merge" this with my dictionary so that I have the score for each word and can then compute the NetSentScore. Is there a way to do this in quanteda?
Please keep in mind that I have quite a large corpus, so converting my dfm to a data frame will exhaust RAM: I have over 500k documents and approx. 800 features.
To illustrate, the NetSentScore of text1 will be 2 * -5 + 0 = -10, because the word "bad" appears two times and according to the dictionary it has a score of -5.
As @stomper suggests, you can do this with the quanteda.sentiment package, by setting the numeric values as "valences" for the dictionary. Here's how to do it.
This ought to work on 500k documents but of course this will depend on your machine's capacity.
library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.sentiment")
#>
#> Attaching package: 'quanteda.sentiment'
#> The following object is masked from 'package:quanteda':
#>
#> data_dictionary_LSD2015
dict <- dictionary(list(
  sentiment = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
))
valence(dict) <- list(
  sentiment = c(bad = -5,
                horrible = -9,
                open = 1,
                awesome = 8,
                gorgeous = 9,
                trash = -10)
)
print(dict)
#> Dictionary object with 1 key entry.
#> Valences set for keys: sentiment
#> - [sentiment]:
#> - bad, horrible, open, awesome, gorgeous, trash
text <- c("this is a bad movie very bad",
"horrible movie, just awful",
"im open to new dreams",
"awesome place i loved it",
"she is gorgeous",
"that is trash")
Now, to compute the document scores, you use textstat_valence(), but you set the normalisation to "none" in order to sum the valences rather than average them. Normalisation is the default because raw sums are affected by documents having different lengths, but as this package is still in a developmental stage, it's easy to imagine that other choices might be preferable to the default.
textstat_valence(tokens(text), dictionary = dict, normalization = "none")
#> doc_id sentiment
#> 1 text1 -10
#> 2 text2 -9
#> 3 text3 1
#> 4 text4 8
#> 5 text5 9
#> 6 text6 -10
Created on 2023-01-11 with reprex v2.0.2
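If you prefer to stay in base quanteda, here is an alternative sketch (my own illustration, not part of the answer above): align the dfm columns to the dictionary words with dfm_match(), then use sparse matrix multiplication to sum each word's count times its score. This never converts the dfm to a dense data frame, so it should scale to large corpora. It assumes text and ScoreDict as defined in the question.
library("quanteda")
toks <- tokens(text, remove_punct = TRUE)
dfmat <- dfm(toks)
# align columns to the scored words (words absent from the corpus become all-zero columns)
dfmat_matched <- dfm_match(dfmat, features = ScoreDict$word)
# sparse matrix product: counts %*% scores = NetSentScore per document
net_sent <- as.numeric(dfmat_matched %*% ScoreDict$score)
names(net_sent) <- docnames(dfmat)
net_sent
# text1 comes out as -10, matching the worked example in the question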

Identify WHICH words in a document have been matched by dictionary lookup and how many times

Quanteda question.
For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much.
Put differently, I want to get a matrix of the features in each dictionary category that have been matched using the tokens_lookup and dfm_lookup functions, and their frequency per document. So not the aggregated frequency of all words in the category, but of each of them separately.
Is there an easy way to get this?
The easiest way to do this is to iterate over your dictionary "keys" (what you call "categories") and select the matches to create one dfm per key. There are a few steps needed to deal with the non-matches and the compound dictionary values (such as "not fail").
I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
The loop iterates over the dictionary keys to build up a list, each time doing the following:
select the tokens but leave a pad for ones not selected;
compound the multi-word tokens into single tokens;
rename the pad ("") to OTHER, so that we can count non-matches; and
create the dfm.
library("quanteda")
## Package version: 2.1.0
toks <- tokens(tail(data_corpus_inaugural, 3))
dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
    tokens_compound(data_dictionary_LSD2015[key]) %>%
    tokens_replace("", "OTHER") %>%
    dfm(tolower = FALSE)
  dfm_list <- c(dfm_list, this_dfm)
}
names(dfm_list) <- names(data_dictionary_LSD2015)
Now we have all of the dictionary matches for each key in a list of dfm objects:
dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
## features
## docs clouds raging storms crisis war against violence hatred badly
## 2009-Obama 1 1 2 4 2 1 1 1 1
## 2013-Obama 0 1 1 1 3 1 0 0 0
## 2017-Trump 0 0 0 0 0 1 0 0 0
## features
## docs weakened
## 2009-Obama 1
## 2013-Obama 0
## 2017-Trump 0
## [ reached max_nfeat ... 170 more features ]
##
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
## features
## docs grateful trust mindful thank well generosity cooperation
## 2009-Obama 1 2 1 1 2 1 2
## 2013-Obama 0 0 0 0 4 0 0
## 2017-Trump 1 0 0 1 0 0 0
## features
## docs prosperity peace skill
## 2009-Obama 3 4 1
## 2013-Obama 1 3 1
## 2017-Trump 1 0 0
## [ reached max_nfeat ... 246 more features ]
##
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
## features
## docs not_apologize OTHER
## 2009-Obama 1 2687
## 2013-Obama 0 2317
## 2017-Trump 0 1660
##
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
## features
## docs not_fight not_sap not_grudgingly not_fail OTHER
## 2009-Obama 0 0 1 0 2687
## 2013-Obama 1 1 0 0 2313
## 2017-Trump 0 0 0 1 1658

merging two document term matrices by row

I have customer queries and answers from customer services in a csv file. I need to identify the subject of each question and then later develop a classification model on this. I have created two document term matrices (after cleaning the documents), one for questions and the other for the answers. I have reduced the size by only taking those terms that occur more than 400 times in the whole document (about 40k questions and answers).
I want to create a data frame that merges these two matrices by rows and retains only the words that are common to the question and answer dtms (and adds up their frequencies). How should I do this in R? I'll use the highest-frequency word to label the question.
Any help/ suggestion on the approach is highly appreciated.
> str(inspect(dtmaf))
<<DocumentTermMatrix (documents: 38697, terms: 237)>>
Non-/sparse entries: 326124/8845065
Sparsity : 96%
Maximal term length: 13
Weighting : term frequency (tf)
Sample :
Terms
Docs booking card change check confirm confirmation email make port wish
12316 3 1 0 0 0 0 0 0 1 1
137 4 1 2 0 1 0 0 0 0 0
17618 4 1 0 0 0 0 0 2 0 2
18082 2 1 3 1 1 0 0 0 1 0
19141 3 0 2 0 1 0 0 0 1 0
21862 2 0 0 0 0 0 0 1 0 0
2756 1 0 2 0 0 0 0 1 0 1
27578 2 1 5 0 0 0 0 0 0 1
30312 4 1 2 0 0 0 0 2 0 2
9019 1 1 1 0 0 0 0 0 0 0
num [1:10, 1:10] 3 4 4 2 3 2 1 2 4 1 ...
- attr(*, "dimnames")=List of 2
..$ Docs : chr [1:10] "12316" "137" "17618" "18082" ...
..$ Terms: chr [1:10] "booking" "card" "change" "check" ...
> str(inspect(dtmc))
<<DocumentTermMatrix (documents: 38697, terms: 189)>>
Non-/sparse entries: 204107/7109626
Sparsity : 97%
Maximal term length: 13
Weighting : term frequency (tf)
Sample :
Terms
Docs booking car change confirmation like number possible reservation return ticket
14091 0 0 0 0 2 0 0 2 0 0
18220 6 0 0 2 0 0 0 0 0 0
20103 1 0 1 0 0 1 0 0 0 0
20184 0 3 0 0 0 1 0 4 1 0
21005 3 5 0 1 2 0 1 0 0 0
24877 0 1 1 0 0 0 0 2 0 1
26135 0 0 0 0 0 0 0 1 0 0
28200 5 2 1 0 0 0 0 1 0 0
2979 12 7 2 0 1 0 0 0 0 0
680 0 0 1 2 0 1 0 0 0 0
num [1:10, 1:10] 0 6 1 0 3 0 0 5 12 0 ...
- attr(*, "dimnames")=List of 2
..$ Docs : chr [1:10] "14091" "18220" "20103" "20184" ...
..$ Terms: chr [1:10] "booking" "car" "change" "confirmation" ...
The expected output is a matrix with 38697 rows and up to (237 + 189) columns: terms present in both dtms get a single column with their frequencies summed, and non-matching terms are kept as they are.
Here is a reproducible example with 10 documents:
> dput(datamsg)
structure(list(cmessage = c("No answer ?", "Hello the third number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number can not be found in the system. Therefore I request to return money. It was not my fault !",
"Hi I forget probably choose items on the How can I do this now. ",
"Hi I forget probably choose items How can i do this now. ",
"Hello I tell if I have booked . If not is it possible and what would it cost? ",
"First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ",
"Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you. But rather ask more questions. ",
"Dear booked everything again. Also the journey through In my previous message I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ",
"Thank you. When will the new registration show ?...as it still shows the . Thanks",
"So my phone number is .Please tell me how this works."), afreply = c("Hello afraid there is no space on the September. I have also checked but are all fully booked. Would you like us to check any other dates for you? ",
"Hello As far as we can see the booking No was a valid reservation. We have however contacted and can confirm that administration fee was refunded back to your card. ",
"Good afternoon You are currently booked as high plane. You have requested an amendment to change the height which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request please submit a new one with an accurate height ofreply to this message. ",
"Hello thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking please contact us.",
"Hello you booked any In order to make a change to your booking kindly send us a amendment request via",
"Dear Mr. what dimensions you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested if you call us an alternative travel date.",
"Dear Sir or Madam we will send you the address ", "Hello your crossing with was already refunded. As my colleague told you your with was still valid. In case you have booked a second ticket with please send us the new booking reference number but we cannot guarantee that you will be entitle to a refund. ",
"if you can authorise us to take the payment from the card you used to make the we can then make the change.",
"Good morning we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. "
)), .Names = c("cmessage", "afreply"), class = "data.frame", row.names = c(NA,
-10L))
library(tm)
corpus1 <- Corpus(VectorSource(datamsg$cmessage))
corpus2 <- Corpus(VectorSource(datamsg$afreply))
dtmc <- DocumentTermMatrix(corpus1, control = list(weighting = weightTf))
dtmaf <- DocumentTermMatrix(corpus2, control = list(weighting = weightTf))
Here's a simpler way using the quanteda package.
library("quanteda")
packageVersion("quanteda")
# [1] ‘0.99.9’
First, we create two document-feature matrices and figure out their common terms:
dfm_c <- dfm(datamsg$cmessage, remove_punct = TRUE)
dfm_af <- dfm(datamsg$afreply, remove_punct = TRUE)
common_feature_names <- intersect(featnames(dfm_c), featnames(dfm_af))
Then we can combine them using cbind(), which (correctly) issues a warning that you now have duplicated features. The second line selects just the common features, and the third line combines the identically named features in the dfm by summing them, which is what you want.
combined_dfm <- cbind(dfm_c, dfm_af) %>%
  dfm_select(pattern = common_feature_names) %>%
  dfm_compress()
head(combined_dfm)
# Document-feature matrix of: 6 documents, 6 features (41.7% sparse).
# 6 x 6 sparse Matrix of class "dfmSparse"
# features
# docs no hello the number is i
# text1 2 1 1 0 1 1
# text2 1 2 6 2 1 2
# text3 0 0 3 0 0 2
# text4 0 1 0 0 0 3
# text5 0 2 0 0 1 2
# text6 0 0 3 0 1 2
If you really, really want it back in tm, you can convert this using:
convert(combined_dfm, to = "tm")
# <<DocumentTermMatrix (documents: 10, terms: 49)>>
# Non-/sparse entries: 189/301
# Sparsity : 61%
# Maximal term length: 8
# Weighting : term frequency (tf)
Note: you have not specified whether you need to merge dfms with different documents, so here I have assumed (from the example) that the documents are the same. If they differ, that is also easily solved, but it was not specified in the question.
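And if you actually want the union described in the question (common terms with frequencies summed, non-matching terms kept as they are), a sketch: simply skip the dfm_select() step, since dfm_compress() alone sums the duplicated features and leaves the rest untouched.
combined_all <- cbind(dfm_c, dfm_af) %>%
  dfm_compress(margin = "features")
nfeat(combined_all)  # number of distinct terms across both dfms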
Your code:
#dput(datamsg)
datamsg <-
structure(
list(
cmessage = c(
"No answer ?",
"Hello the third number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number can not be found in the system. Therefore I request to return money. It was not my fault !",
"Hi I forget probably choose items on the How can I do this now. ",
"Hi I forget probably choose items How can i do this now. ",
"Hello I tell if I have booked . If not is it possible and what would it cost? ",
"First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ",
"Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you. But rather ask more questions. ",
"Dear booked everything again. Also the journey through In my previous message I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ",
"Thank you. When will the new registration show ?...as it still shows the . Thanks",
"So my phone number is .Please tell me how this works."
),
afreply = c(
"Hello afraid there is no space on the September. I have also checked but are all fully booked. Would you like us to check any other dates for you? ",
"Hello As far as we can see the booking No was a valid reservation. We have however contacted and can confirm that administration fee was refunded back to your card. ",
"Good afternoon You are currently booked as high plane. You have requested an amendment to change the height which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request please submit a new one with an accurate height ofreply to this message. ",
"Hello thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking please contact us.",
"Hello you booked any In order to make a change to your booking kindly send us a amendment request via",
"Dear Mr. what dimensions you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested if you call us an alternative travel date.",
"Dear Sir or Madam we will send you the address ",
"Hello your crossing with was already refunded. As my colleague told you your with was still valid. In case you have booked a second ticket with please send us the new booking reference number but we cannot guarantee that you will be entitle to a refund. ",
"if you can authorise us to take the payment from the card you used to make the we can then make the change.",
"Good morning we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. "
)
),
.Names = c("cmessage", "afreply"),
class = "data.frame",
row.names = c(NA,-10L)
)
corpus1<-Corpus(VectorSource(datamsg$cmessage)) # 10 docs
corpus2<-Corpus(VectorSource(datamsg$afreply)) # 10 docs
dtmc<-DocumentTermMatrix(corpus1, control = list(weighting = weightTf))
dtmaf<-DocumentTermMatrix(corpus2, control = list(weighting = weightTf))
My code continues:
library(tm)
library(dplyr)
library(tibble) # for rownames_to_column() / column_to_rownames()
# rename anonymous document ids:
rownames(dtmc) <- dtmc %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .)
rownames(dtmaf) <- dtmaf %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .)
# transpose to term-document matrices
tdmc <- dtmc %>% t()
tdmaf <- dtmaf %>% t()
# introduce new first column "word"
tdmc_df <- tdmc %>% as.matrix() %>% as.data.frame() %>% rownames_to_column( var = "word")
tdmaf_df <- tdmaf %>% as.matrix() %>% as.data.frame() %>% rownames_to_column( var = "word")
# find common words
tdm_df <- tdmc_df %>% inner_join(tdmaf_df, by=c("word"))
tdm_df <- tdm_df %>% arrange(word)
dtm_df <- tdm_df %>% column_to_rownames("word") %>% t()
# count occurrences of matching words
colSums(dtm_df)
# find nonmatching words
dtm_df_nonmatching <- tdmc_df %>% anti_join(tdmaf_df, by=c("word")) %>% arrange(word) %>% column_to_rownames("word")
# count occurrences of nonmatching words
rowSums(dtm_df_nonmatching)
Common words, count:
colSums(dtm_df)
address also and booked but can card dear for from have hello message
4 2 5 7 3 13 3 3 4 2 12 8 3
more new not number pay please possible request still thanks that the then
2 3 8 4 2 5 2 3 2 2 3 32 3
this told travel was what will with would you
6 2 2 5 2 4 7 2 25

Feature extraction using Chi2 with Quanteda

I have a dataframe df with this structure :
Rank Review
5 good film
8 very good film
..
Then I tried to create a DocumentTermMatrix using the quanteda package:
mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)
I would like to know how to calculate, for each feature (term), the Chi2 value with the documents, in order to extract the best features in terms of Chi2 value.
Can you help me to resolve this problem please?
EDIT :
> head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
features
docs bon accueil conseillèr efficac écout répond
text1 0 0 0 0 0 0
text2 1 1 1 1 1 1
text3 0 0 0 0 0 0
text4 0 0 0 0 0 0
text5 0 0 1 0 0 0
text6 0 0 0 0 1 0
...
text60300 0 0 1 1 1 1
Here I have my dfm matrix; then I create my tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
I would like to determine the Chi2 value between these features and the documents (here I have 60,300 documents):
textstat_keyness(mydfm, target = 2)
But since I have 60,300 targets, I don't know how to do this automatically.
I see in the quanteda manual that the groups option in the dfm function may resolve this problem, but I don't see how to use it. :(
EDIT 2 :
Rank Review
10 always good
1 nice film
3 fine as usual
Here I try to group documents with dfm:
mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)
But it fails to group the documents.
Can you please help me to resolve this problem? Thank you.
See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to set a particular document's frequencies against all other frequencies. e.g.
textstat_keyness(mydfm, target = 1)
for the first document against the frequencies of all others, or
textstat_keyness(mydfm, target = 2)
for the second against all others, etc.
If you want to compare frequencies across groups of documents, you would need to use the groups = option in dfm() with a supplied variable or one in the docvars. See the example in ?textstat_keyness.
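For example, a rough sketch with the current API (assuming df has the Rank and Review columns from the question; in quanteda >= 3, textstat_keyness() lives in the quanteda.textstats package, and grouping is done with dfm_group() rather than a groups argument to dfm()):
library("quanteda")
library("quanteda.textstats")
corp <- corpus(df, text_field = "Review")
mydfm <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem() %>%
  dfm() %>%
  dfm_group(groups = Rank)
# chi-squared keyness of the Rank == 5 group against all other ranks
textstat_keyness(mydfm, target = "5")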

using findAssocs in R to find words that frequently occur with a central term

As I was working with findAssocs in R, I realised that the function doesn't actually pick up the words that occur together with the searched term across documents, but rather the words that occur when the searched term appears frequently.
I've tried using a simple test script below:
test <- list("housekeeping bath towel housekeeping room","housekeeping dirty","housekeeping very dirty","housekeeping super dirty")
test <-Corpus(VectorSource(test))
test_dtm<-DocumentTermMatrix(test)
test_dtms<-removeSparseTerms(test_dtm,0.99)
findAssocs(test_dtms,"housekeeping",corlimit = 0.1)
And the returning result from R is:
$housekeeping
bath room towel
1 1 1
Notice that the word "dirty" occurs in 3 out of the 4 documents, whereas the returned keywords each occur in only one document.
Anyone has any idea what went wrong in my script or if there is a better way to do this?
The result I want to achieve is for the model to reflect the words that occur frequently with the search term across all documents, not within a specific document. I've tried combining the 4 documents into 1, but that doesn't work, as findAssocs doesn't work on a single document.
Any advice?
How about an alternative, using the quanteda package? It imposes no mystery restrictions on the correlations returned, and has many other options (see ?similarity).
require(quanteda)
testDfm <- dfm(unlist(test), verbose = FALSE)
## Document-feature matrix of: 4 documents, 7 features.
## 4 x 7 sparse Matrix of class "dfmSparse"
## features
## docs housekeeping bath towel room dirty very super
## text1 2 1 1 1 0 0 0
## text2 1 0 0 0 1 0 0
## text3 1 0 0 0 1 1 0
## text4 1 0 0 0 1 0 1
similarity(testDfm, "housekeeping", margin = "features")
## similarity Matrix:
## $housekeeping
## bath towel room very super dirty
## 1.0000 1.0000 1.0000 -0.3333 -0.3333 -1.0000
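A side note: in current quanteda releases, similarity() has been replaced by textstat_simil() in the quanteda.textstats package. A rough modern equivalent of the call above (assuming test is still the original list, before it was converted to a tm corpus) would be:
library("quanteda")
library("quanteda.textstats")
testDfm <- dfm(tokens(unlist(test)))
# correlation of every feature with the "housekeeping" counts
textstat_simil(testDfm, testDfm[, "housekeeping"],
               margin = "features", method = "correlation")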
