Using DocumentTermMatrix on a Vector of First and Last Names - r

I have a column in my data frame (df) as follows:
> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
The column contains 4k+ unique first/last/nick names, stored as a comma-separated list of full names in each row, as shown above. I would like to create a DocumentTermMatrix for this column in which matches are made on full names and only the most frequently occurring names are used as columns. I have tried the following code:
> people_list = strsplit(people, ", ")
> corp = Corpus(VectorSource(people_list))
> dtm = DocumentTermMatrix(corp, people_dict)
where people_dict is a list of the most commonly occurring people (~150 full names) taken from people_list, as follows:
> people_dict[1:3]
[[1]]
[1] "Christian Slater"

[[2]]
[1] "Tara Reid"

[[3]]
[1] "Stephen Dorff"
However, the DocumentTermMatrix function does not seem to be using people_dict at all, because I end up with far more columns than there are entries in people_dict. Also, I think the DocumentTermMatrix function is splitting each name string into multiple strings. For example, "Danny Devito" becomes two columns, "Danny" and "Devito".
> inspect(actors_dtm[1:5, 1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity           : 100%
Maximal term length: 9
Weighting          : term frequency (tf)

    Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
   1   0      0     0        0         0       0        0    0    0       0
   2   0      0     0        0         0       0        0    0    0       0
   3   0      0     0        0         0       0        0    0    0       0
   4   0      0     0        0         0       0        0    0    0       0
   5   0      0     0        0         0       0        0    0    0       0
I have read through all the tm documentation that I can find, and I have spent hours searching Stack Overflow for a solution. Please help!

The default tokenizer splits the text into individual words, so you need to provide a custom tokenizer function:
commasplit_tokenizer <- function(x)
    unlist(strsplit(as.character(x), ", "))
Note that you do not separate the actors before creating the corpus.
people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")
The control options didn't work with just Corpus, so I used VCorpus:
corp = VCorpus(VectorSource(people))
dtm = DocumentTermMatrix(corp, control = list(tokenize = commasplit_tokenizer,
                                              dictionary = people_dict,
                                              tolower = FALSE))
All of the options are passed within control, including:
tokenize: the custom tokenizer function
dictionary: the terms to keep
tolower = FALSE: so the capitalised names are preserved
Results:
> as.matrix(dtm)
    Terms
Docs Nia LOng Stephen Dorff Uma Thurman
   1        0             1           0
   2        0             0           0
   3        0             0           1
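If you also need to build people_dict automatically from the most frequent full names (the question mentions keeping roughly the 150 most common), a minimal base-R sketch, not part of the original answer, could look like this; the cutoff of 150 is taken from the question and the object names are illustrative:
# count how often each full name occurs across all rows
name_counts <- sort(table(unlist(strsplit(people, ", "))), decreasing = TRUE)
# keep at most the 150 most frequent full names as the dictionary
people_dict <- names(name_counts)[seq_len(min(150, length(name_counts)))]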
I hope this helps

Related

How to compute a numeric sentiment score using quanteda from a custom dictionary

I have been using the awesome quanteda library for text analysis lately and it has been quite a joy. Recently I stumbled on a task: using a dictionary that relates words to a numeric sentiment score, I need to compute a per-document measure called NetSentScore, which is calculated in the following manner:
NetSentScore per document = sum(Positive_wordscore) + sum(Negative_wordscore)
I have the following dictionary:
ScoreDict <- tibble::tibble(
  score = c(-5, -9, 1, 8, 9, -10),
  word  = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
)
My corpus
text <- c("this is a bad movie very bad", "horrible movie, just awful",
          "im open to new dreams", "awesome place i loved it",
          "she is gorgeous", "that is trash")
By definition, quanteda will not allow numeric data in a dictionary, but I can have this:
> text %>%
+   corpus() %>%
+   tokens(remove_punct = TRUE) %>%
+   tokens_remove(stopwords("en")) %>%
+   dfm()
Document-feature matrix of: 6 documents, 14 features (82.14% sparse) and 0 docvars.
       features
docs    bad movie horrible just awful im open new dreams awesome
  text1   2     1        0    0     0  0    0   0      0       0
  text2   0     1        1    1     1  0    0   0      0       0
  text3   0     0        0    0     0  1    1   1      1       0
  text4   0     0        0    0     0  0    0   0      0       1
  text5   0     0        0    0     0  0    0   0      0       0
  text6   0     0        0    0     0  0    0   0      0       0
[ reached max_nfeat ... 4 more features ]
This gives me the number of times each word was found in each document. I only need to "join" or "merge" this with my dictionary so that I have the score for each word and can then compute the NetSentScore. Is there a way to do this in quanteda?
Please keep in mind that I have quite a large corpus, so converting my dfm to a data frame will exhaust the RAM: I have over 500k documents and approx. 800 features.
To illustrate, the NetSentScore of text1 will be:
2*-5 + 0 = -10, because the word "bad" appears two times and according to the dictionary it has a score of -5.
As @stomper suggests, you can do this with the quanteda.sentiment package, by setting the numeric values as "valences" for the dictionary. Here's how to do it.
This ought to work on 500k documents but of course this will depend on your machine's capacity.
library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.sentiment")
#>
#> Attaching package: 'quanteda.sentiment'
#> The following object is masked from 'package:quanteda':
#>
#> data_dictionary_LSD2015
dict <- dictionary(list(
  sentiment = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
))
valence(dict) <- list(
  sentiment = c(bad = -5,
                horrible = -9,
                open = 1,
                awesome = 8,
                gorgeous = 9,
                trash = -10)
)
print(dict)
#> Dictionary object with 1 key entry.
#> Valences set for keys: sentiment
#> - [sentiment]:
#> - bad, horrible, open, awesome, gorgeous, trash
text <- c("this is a bad movie very bad",
"horrible movie, just awful",
"im open to new dreams",
"awesome place i loved it",
"she is gorgeous",
"that is trash")
Now, to compute the document scores, you use textstat_valence(), but you set the normalisation to "none" in order to sum the valences rather than average them. Normalisation is the default because raw sums are affected by documents having different lengths, but as this package is still at a developmental stage, it's easy to imagine that other choices might be preferable to the default.
textstat_valence(tokens(text), dictionary = dict, normalization = "none")
#>   doc_id sentiment
#> 1  text1       -10
#> 2  text2        -9
#> 3  text3         1
#> 4  text4         8
#> 5  text5         9
#> 6  text6       -10
Created on 2023-01-11 with reprex v2.0.2
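If you would rather stay in base quanteda (without quanteda.sentiment), a rough alternative sketch is to line the dfm up with the scored words and take a sparse matrix-vector product, which avoids converting the large dfm to a data frame. This assumes the dfm can be coerced to a Matrix sparse class (dfm objects inherit from dgCMatrix in recent quanteda versions); the object names scores, dfmat and NetSentScore are illustrative only:
library(quanteda)
# named score vector built from the ScoreDict in the question
scores <- c(bad = -5, horrible = -9, open = 1, awesome = 8, gorgeous = 9, trash = -10)
dfmat <- dfm(tokens(text, remove_punct = TRUE))
# keep only the scored words, in the same order as the score vector
dfmat_scored <- dfm_match(dfmat, features = names(scores))
# sparse matrix-vector product: word counts weighted by their scores, summed per document
NetSentScore <- as.vector(as(dfmat_scored, "dgCMatrix") %*% scores)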

Feature extraction using Chi2 with Quanteda

I have a dataframe df with this structure:
Rank  Review
   5  good film
   8  very good film
  ..
Then I tried to create a document-term matrix using the quanteda package:
mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)
I would like to know how to calculate, for each feature (term), the chi-squared value against the documents, in order to extract the best features in terms of chi-squared value.
Can you help me resolve this problem, please?
EDIT:
> head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
          features
docs       bon accueil conseillèr efficac écout répond
  text1      0       0          0       0     0      0
  text2      1       1          1       1     1      1
  text3      0       0          0       0     0      0
  text4      0       0          0       0     0      0
  text5      0       0          1       0     0      0
  text6      0       0          0       0     1      0
...
  text60300  0       0          1       1     1      1
Here I have my dfm; then I create my tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
I would like to determine the chi-squared value between these features and the documents (here I have 60,300 documents):
textstat_keyness(mydfm, target = 2)
But since I have 60,300 potential targets, I don't know how to do this automatically.
I see in the quanteda manual that the groups option of the dfm function may resolve this problem, but I don't see how to use it. :(
EDIT 2:
Rank  Review
  10  always good
   1  nice film
   3  fine as usual
Here I try to group the documents with dfm:
mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)
But it fails to group the documents.
Can you please help me resolve this problem?
Thank you
See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to set a particular document's frequencies against all other frequencies, e.g.
textstat_keyness(mydfm, target = 1)
for the first document against the frequencies of all others, or
textstat_keyness(mydfm, target = 2)
for the second against all others, etc.
If you want to compare categories of frequencies that group documents, you would need to use the groups = option in dfm(), either for a supplied variable or for one in the docvars. See the example in ?textstat_keyness.
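A rough sketch of that grouping route, assuming quanteda version 3 or later, where textstat_keyness lives in the quanteda.textstats package and grouping is done with dfm_group() rather than a groups argument to dfm(); the object names are illustrative and df is the data frame from the question:
library(quanteda)
library(quanteda.textstats)   # textstat_keyness in quanteda >= 3

corp  <- corpus(df, text_field = "Review")   # Rank is carried along as a docvar
toks  <- tokens_remove(tokens(corp, remove_punct = TRUE), stopwords("english"))
dfmat <- dfm_wordstem(dfm(toks), language = "english")

# collapse the 60k+ documents into one document per Rank
dfmat_rank <- dfm_group(dfmat, groups = Rank)

# chi-squared keyness of one Rank (here "10") against all the others
textstat_keyness(dfmat_rank, target = docnames(dfmat_rank) == "10")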

Stemming a text column in a dataframe with R

I have a dataframe with this structure:
#Load lexicon
Lexicon_DF <- read.csv("LexiconFrancais.csv", header = F, sep = ";")
The structure of "LexiconFrancais.csv" is like this:
French Translation (Google Translate);Positive;Negative
un dos;0;0
abaque;0;0
abandonner;0;1
abandonné;0;1
abandon;0;1
se calmer;0;0
réduction;0;0
abba;1;0
abbé;0;0
abréger;0;0
abréviation;0;0
> Lexicon_DF
                                       V1       V2       V3
1  French Translation (Google Translate) Positive Negative
2                                  un dos        0        0
3                                  abaque        0        0
4                              abandonner        0        1
5                               abandonné        0        1
6                                 abandon        0        1
7                               se calmer        0        0
8                               réduction        0        0
9                                    abba        1        0
10                                   abbé        0        0
11                                abréger        0        0
12                            abréviation        0        0
I am trying to stem the first column of the dataframe; for this I did:
Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
But after this command only the first column remains in Lexicon_DF; the other two columns disappear.
> Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
> Lexicon_DF
 [1] "French Translation (Google Translate)" "un dos"   "abaqu"
 [4] "abandon"                               "abandon"  "abandon"
 [7] "se calm"                               "réduct"   "abba"
[10] "abbé"                                  "abreg"    "abrévi"
How can I do the stemming without losing the other two columns?
Thank you
You are trying to replace the whole content of Lexicon_DF with the output of wordStem.
Try this:
Lexicon_DF$V1 <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
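One side note, not part of the question itself: because the file was read with header = F, the header text "French Translation (Google Translate)" ended up as the first data row and also got stemmed. If that is unintended, a small sketch like this would drop it before stemming (assuming the same Lexicon_DF as above):
# drop the header row that was read in as data, then stem the first column in place
Lexicon_DF <- Lexicon_DF[-1, ]
Lexicon_DF$V1 <- SnowballC::wordStem(Lexicon_DF$V1, language = 'fr')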

r string parsing challenge

I am dealing with a column that contains strings as follows
Col1
------------------------------------------------------------------
Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery
What I am trying to do is separate out the strings that start with either Department, Division, or Center, up to the comma (,). The final output should look like this:
Dept_Mechanical_Eng  Dept_Computer_Science  Div_Adv_Machining  Cntr_Mining_Metallurgy  Dept_Aerospace  Cntr_Science_Delivery
                  1                      1                  0                       0               0                      0
                  0                      0                  1                       1               0                      0
                  0                      0                  1                       1               1                      1
I have butchered the actual names just for aesthetic purposes in the expected output. Any help on parsing this string is much appreciated.
This is very similar to a question I just did tabulating another text example. Are you in the same class as the questioner here? Count the number of times (frequency) a string occurs
inp <- "Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery"
inp2 <- factor(scan(text = inp, what = "", sep = ","))
#Read 6 items
inp3 <- readLines(textConnection(inp))
as.data.frame(setNames(lapply(levels(inp2), function(ll) as.numeric(grepl(ll, inp3))),
                       trimws(levels(inp2))))
  Department.of.Aerospace Division.of.Advanced.Machining
1                       0                              0
2                       0                              1
3                       1                              0
  Center.for.Mining.and.Metallurgy Center.for.Science.and.Delivery
1                                0                               0
2                                1                               0
3                                0                               1
  Department.of.Computer.Science Department.of.Mechanical.Engineering
1                              1                                    1
2                              0                                    0
3                              0                                    0
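If the strings are already sitting in a data frame column, as in the question, the same idea can be applied to that column directly. A small sketch, assuming the column is df$Col1 (the object names parts, lvls and out are illustrative):
# split each row on the commas and collect the unique unit names
parts <- strsplit(df$Col1, ",\\s*")
lvls  <- sort(unique(unlist(parts)))
# one 0/1 indicator column per unique unit name
out <- as.data.frame(lapply(lvls, function(l) as.integer(grepl(l, df$Col1, fixed = TRUE))))
names(out) <- make.names(lvls)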

R text mining how to segment document into phrases not terms

When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, as in Chinese, English also has certain fixed phrases, such as "semantic distance" or "machine learning"; if you segment them into single words they take on totally different meanings. I want to know how to segment a document into phrases rather than words (terms).
You can do this in R using the quanteda package, which can detect multi-word expressions as statistical collocations; these are probably the multi-word expressions you are referring to in English. To remove the collocations containing stop words, you would first tokenise the text, then remove the stop words, leaving a "pad" in place to prevent false adjacencies in the results (two words that were not actually adjacent before the removal of the stop words between them).
require(quanteda)
pres_tokens <-
    tokens(data_corpus_inaugural) %>%
    tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%
    tokens_remove(stopwords("english"), padding = TRUE)
pres_collocations <- textstat_collocations(pres_tokens, size = 2)
head(pres_collocations)
#          collocation count count_nested length   lambda        z
# 1      united states   157            0      2 7.893307 41.19459
# 2             let us    97            0      2 6.291128 36.15520
# 3    fellow citizens    78            0      2 7.963336 32.93813
# 4    american people    40            0      2 4.426552 23.45052
# 5          years ago    26            0      2 7.896626 23.26935
# 6 federal government    32            0      2 5.312702 21.80328
# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500])
tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon" "shall_endeavor" "high_sense" "official_act"
Using this "compounded" token set, we can now turn this into a document-feature matrix where the features consist of a mixture of original terms (those not found in a collocation) and the collocations. As can be seen below, "united" occurs alone and as part of the collocation "united_states".
pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
#                  features
# docs              united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
#   1789-Washington      4      2         0             0        0        0          0         0                   0             0
#   1793-Washington      1      0         0             0        0        0          0         0                   0             0
#   1797-Adams           3      9         0             0        0        0          0         0                   0             0
#   1801-Jefferson       0      0         0             0        0        0          0         0                   0             0
#   1805-Jefferson       1      4         0             0        0        0          0         0                   0             0
If you want a more brute-force approach, it's possible simply to create a document-by-bigram matrix this way:
# just form all bigrams
head(dfm(data_corpus_inaugural, ngrams = 2))
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
##                  features
## docs              fellow-citizens_of of_the the_senate senate_and and_of the_house
##   1789-Washington                  1     20          1          1      2         2
##   1797-Adams                       0     29          0          0      2         0
##   1793-Washington                  0      4          0          0      1         0
##   1801-Jefferson                   0     28          0          0      3         0
##   1805-Jefferson                   0     17          0          0      1         0
##   1809-Madison                     0     20          0          0      2         0
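Note that the ngrams argument to dfm() comes from an older quanteda API; in more recent versions (3.x) the same brute-force bigram matrix would be built from a tokens object, roughly like this sketch:
# form all bigrams from the tokens, then tabulate them (quanteda >= 3 style)
dfm(tokens_ngrams(tokens(data_corpus_inaugural), n = 2))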
