Compute chi-square value between ngrams and documents with Quanteda

I use the Quanteda R package to extract ngrams (here 1-grams and 2-grams) from the text Data_clean$Review, but I am looking for a way in R to compute the chi-square value between each document and the extracted ngrams.
Here is the R code I used to clean up the text (Review) and generate the n-grams.
Any idea, please? Thank you.
library(quanteda)  # tokens(), dfm()
library(magrittr)  # for %>%
# delete rows with empty-value columns
Data_clean <- Data[Data$Note != "" & Data$Review != "", ]
Data_clean$id <- seq.int(nrow(Data_clean))
train.index <- 1:50000
test.index <- 50001:nrow(Data_clean)
#clean up
# remove grammar/punctuation
Data_clean$Review.clean <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))
train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]
temp.tf <- Data_clean$Review.clean %>%
  tokens(ngrams = 1:2) %>% # generate 1- and 2-gram tokens
  dfm()                    # generate the dfm

You would not use ngrams for this, but rather a function called textstat_collocations().
It's a bit hard to follow your exact example since none of those objects are explained or supplied, but let's try it with some of quanteda's built-in data. I'll get the texts from the inaugural corpus and apply some filters similar to what you have above.
So to score bigrams for chi^2, you would use:
# create the corpus, subset on some conditions (could be Note != "" for instance)
corp_example <- data_corpus_inaugural
corp_example <- corpus_subset(corp_example, Year > 1960)
# this will remove punctuation and numbers
toks_example <- tokens(corp_example, remove_punct = TRUE, remove_numbers = TRUE)
# find and score chi^2 bigrams
coll2 <- textstat_collocations(toks_example, method = "chi2", max_size = 2)
head(coll2, 10)
# collocation count X2
# 1 reverend clergy 2 28614.00
# 2 Majority Leader 2 28614.00
# 3 Information Age 2 28614.00
# 4 Founding Fathers 3 28614.00
# 5 distinguished guests 3 28614.00
# 6 Social Security 3 28614.00
# 7 Chief Justice 9 23409.82
# 8 middle class 4 22890.40
# 9 Abraham Lincoln 2 19075.33
# 10 society's ills 2 19075.33
Added:
# needs to be a list of the collocations as separate character elements
coll2a <- sapply(coll2$collocation, strsplit, " ", USE.NAMES = FALSE)
# compound the tokens using top 100 collocations
toks_example_comp <- tokens_compound(toks_example, coll2a[1:100])
toks_example_comp[[1]][1:20]
# [1] "Vice_President" "Johnson" "Mr_Speaker" "Mr_Chief" "Chief_Justice"
# [6] "President" "Eisenhower" "Vice_President" "Nixon" "President"
# [11] "Truman" "reverend_clergy" "fellow_citizens" "we" "observe"
# [16] "today" "not" "a" "victory" "of"

Related

How to keep the text id of removed text in lda

I have a dataframe like this
dtext <- data.frame(id = c(1,2,3,4), text = c("here","This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),stringsAsFactors = F)
I perform the text cleaning for LDA with this:
library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
  stopwords("en"),
  stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = c(2, 3)) %>%
  dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")
dtm <- convert(myDfm, to = "topicmodels")
lda <- LDA(dtm, k = 2, control = list(seed = 1234))
However, I noticed that when a text does not contain anything, the conversion to dtm removes it.
gammaDF <- as.data.frame(lda@gamma)
toptopics <- as.data.frame(cbind(
  document = row.names(gammaDF),
  topic = apply(gammaDF, 1, function(x) names(gammaDF)[which(x == max(x))])
))
However, this causes a problem when I want to match each topic with the related id from the first dataframe. What can I do to get the right result, namely:
id  topic
2   1
3   2
4   1
The problem here is that LDA() removes the rownames from your document-term matrix and replaces them with a simple serial number. This no longer corresponds to your original dtext$id. But you can replace the LDA id with the document name, and then link this back to your input text.
To make this more clear, we are first going to replace your dtext$id with something that can be more clearly distinguished from the serial number that LDA() returns.
# to distinguish your id from those from LDA()
dtext$id <- paste0("doc_", dtext$id)
# this takes the document name from "id"
toks <- corpus(dtext, docid_field = "id") %>%
  tokens()
Then run your other steps exactly as above.
We can see that the first document is empty (has zero feature counts). This is the one that is dropped in the conversion of the dfm to the "topicmodels" format.
ntoken(myDfm)
## text1 text2 text3 text4
## 0 49 63 201
as.matrix(dtm[, 1:3])
## Terms
## Docs dataset_contain contain_movi movi_review
## text2 1 1 1
## text3 1 0 0
## text4 0 0 0
These document names are obliterated by LDA(), however.
toptopics
## document topic
## 1 1 V2
## 2 2 V2
## 3 3 V1
But we can (re)assign them from the rownames of dtm, which will correspond 1:1 to the documents returned by LDA().
toptopics$docname <- rownames(dtm)
toptopics
## document topic docname
## 1 1 V2 text2
## 2 2 V2 text3
## 3 3 V1 text4
And now, toptopics$docname can be merged with dtext$id, solving your problem.
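For instance, a minimal sketch of that merge, using the objects created above:
# join each document's topic back to the original data by document name
merge(dtext, toptopics[, c("docname", "topic")], by.x = "id", by.y = "docname")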
You can grab the ids of any texts with 0 words prior to converting to a dtm using apply and which:
library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
  stopwords("en"),
  stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = c(2, 3)) %>%
  dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")
removed <- which(apply(myDfm, 1, sum) == 0)
Result:
> removed
text1
1
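To recover the original ids rather than the row indices, you can index into dtext (a small addition, assuming the same dtext as above):
> dtext$id[removed]
[1] 1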

Using the French ANEW dictionary for sentiment analysis

Similarly to this post, I'm trying to use the Affective Norms for English Words (in French) for a sentiment analysis with Quanteda. I ultimately want to create a "mean sentiment" per text in my corpus.
First, I load in the ANEW dictionary (FAN in French) and create a named vector of weights. ANEW differs from other dictionaries in that it does not use a key: value pair format, but rather assigns a numerical score to each word. The goal is to select features and then score them using weighted counts.
The ANEW file looks like this:
mot     valence
cancer  1.01
potato  3.56
love    6.56
#### FAN DATA ####
# read in the FAN data
df_fan <- read.delim("fan_anew.txt", stringsAsFactors = FALSE)
# construct a vector of weights with the term as the name
vector_fan <- df_fan$valence
names(vector_fan) <- df_fan$mot
Then I tried to apply dfm_weight() to my corpus of 27 documents.
# create a dfm selecting on the FAN words
dfm_fan <- dfm(my_corpus, select = df_fan$mot, language = "French")
dfm_fan_weighted <- dfm_fan %>%
  dfm_weight(scheme = "prop") %>%
  dfm_weight(weights = vector_fan)
## Warning messages:
## 1: dfm_weight(): ignoring 696 unmatched weight features
## 2: In diag(weight) : NAs introduced by coercion
Here is what I get: only 6 documents are included in the generated dfm object, and the code does not estimate the ANEW mean score for each document in the original corpus.
tail(dfm_fan_weighted)
## Document-feature matrix of: 6 documents, 335 features (72.6% sparse).
tail(dfm_fan_weighted)[, c("absent", "politique")]
## Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : invalid character indexing
tail(rowSums(dfm_fan_weighted))
## text22 text23 text24 text25 text26 text27
## NA NA NA NA NA NA
tail(dfm_fan_weighted)[, c("beau")]
## Document-feature matrix of: 6 documents, 1 feature (100% sparse).
## 6 x 1 sparse Matrix of class "dfm"
## features
## docs beau
## text22 0
## text23 0
## text24 0
## text25 0
## text26 0
## text27 0
Any idea how to fix it? I think the code needs just some small changes to work properly.
Edit: I edited the code following Ken Benoit's comment.
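One way that should avoid the NA weights, sketched here as an untested suggestion rather than a confirmed fix (it assumes the my_corpus and vector_fan objects built above): restrict the dfm to the FAN words first, then order the weight vector to match the dfm's features exactly, so that dfm_weight() receives no unmatched weights.
library(quanteda)
# sketch: align the weights to the dfm's features before weighting
dfm_fan <- dfm(my_corpus) %>%
  dfm_keep(names(vector_fan))          # keep only the FAN words
w <- vector_fan[featnames(dfm_fan)]    # weights in feature order
# mean valence per document: weighted counts over matched word counts
score_fan <- rowSums(dfm_weight(dfm_fan, weights = w)) / rowSums(dfm_fan)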

R: find ngram using dfm when there are multiple sentences in one document

I have a big dataset (>1 million rows) where each row is a multi-sentence text. For example, the following is a sample of 2 rows:
mydat <- data.frame(text=c('I like apple. Me too','One two. Thank you'),stringsAsFactors = F)
What I am trying to do is extract the bigram terms in each row (the "." should separate the n-gram terms). If I simply use the dfm function:
mydfm <- dfm(mydat$text, toLower = TRUE, removePunct = FALSE, ngrams = 2)
dtm = as.DocumentTermMatrix(mydfm)
txt_data = as.data.frame(as.matrix(dtm))
These are the terms I got:
"i_like" "like_apple" "apple_." "._me" "me_too" "one_two" "two_." "._thank" "thank_you"
This is what I expect; basically "." is skipped and used to separate the terms:
"i_like" "like_apple" "me_too" "one_two" "thank_you"
I believe writing slow loops could solve this as well, but given that it is a huge dataset I would prefer an efficient way similar to dfm() in quanteda. Any suggestions would be appreciated!
@Jota's answer works, but there is a way to control the tokenisation more finely while calling it only once:
(toks <- tokenize(toLower(mydat$text), removePunct = TRUE, ngrams = 2))
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "apple_me" "me_too"
##
## Component 2 :
## [1] "one_two" "two_thank" "thank_you"
dfm(toks)
## Document-feature matrix of: 2 documents, 7 features.
## 2 x 7 sparse Matrix of class "dfmSparse"
## features
## docs i_like like_apple apple_me me_too one_two two_thank thank_you
## text1 1 1 1 1 0 0 0
## text2 0 0 0 0 1 1 1
Added:
Then to remove any ngram containing the "." punctuation, tokenise again without removing punctuation and drop the matching features with the following (pattern matching defaults to valuetype = "glob"):
toks2 <- tokenize(toLower(mydat$text), ngrams = 2)
removeFeatures(toks2, "*.*")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "me_too"
##
## Component 2 :
## [1] "one_two" "thank_you"
If your goal is just to extract those bigrams, then you could use tokens twice. Once to tokenize to sentences, then again to make the ngrams for each sentence.
library("quanteda")
mydat$text %>%
  tokens(what = "sentence") %>%
  as.character() %>%
  tokens(ngrams = 2, remove_punct = TRUE) %>%
  as.character()
#[1] "I_like" "like_apple" "Me_too" "One_two" "Thank_you"
Insert a tokens_tolower() after the first tokens() call if you like, or use char_tolower() at the end.
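For example, a sketch of the lowercased variant, assuming the same mydat and quanteda session as above:
mydat$text %>%
  tokens(what = "sentence") %>%
  tokens_tolower() %>%
  as.character() %>%
  tokens(ngrams = 2, remove_punct = TRUE) %>%
  as.character()
# [1] "i_like" "like_apple" "me_too" "one_two" "thank_you"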

Concatenate dfm matrices in 'quanteda' package

Is there a method to concatenate two dfm matrices that have different numbers of both columns and rows? It can be done with some additional coding, so I am not interested in ad hoc code but in a general and elegant solution, if one exists.
An example:
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
gives an error.
The 'tm' package can concatenate its document-term matrices out of the box, but it is too slow for my purposes. Also recall that a 'dfm' from 'quanteda' is an S4 class.
Should work "out of the box", if you are using the latest version:
packageVersion("quanteda")
## [1] ‘0.9.6.9’
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## is one sample surprise text this
## doc1 1 1 2 0 1 1
## doc2 1 1 2 1 1 1
See also ?selectFeatures where features is a dfm object (there are examples in the help file).
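For instance, a sketch against this same 0.9.x API (in current quanteda this role is played by dfm_select() and dfm_match()):
# keep only the features of dfm1 that also occur in dfm2
selectFeatures(dfm1, dfm2, selection = "keep")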
Added:
Note that this will correctly align the two texts in a common feature set, unlike the normal rbind methods for matrices, whose columns must match. For the same reasons, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:
require(tm)
dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
rbind(dtm1, dtm2)
## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.
This almost gets it, but seems to duplicate the repeated feature:
as.matrix(rbind(c(dtm1, dtm2)))
## Terms
## Docs one sample sample. text this surprise!
## 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1

Map vectors into integers in a bijective way

I have 100,000 5-length vectors (the list VECTORS below) whose elements are chosen from one million values.
# dictionary of one million values
dictionary <- 1:1e6
# generate 100,000 5-length vectors whose elements are chosen from dictionary
VECTORS <- lapply(1:1e5, sample, x = dictionary, size = 5)
My problem is to map each vector to one integer, with identical vectors mapping to the same integer, i.e. I need a mappy function that takes a vector and yields an integer:
mappy(c(58431, 976854, 661294, 460685, 341123)) = 15, for example. Do you know how to do this in an efficient way?
Subsidiary question: what if my vectors aren't all the same length?
I assume here you want a bijection between the vectors you have in your list and integers. One approach would be to create a factor variable out of character representations of your vectors. Let's start with a reproducible version of your code (I'll make it a smaller vector):
set.seed(144)
VECTORS <- replicate(1e2, sample(seq_len(1e6), 5), FALSE)
Now you can create a factor variable from the character representation of each vector:
fvar <- factor(sapply(VECTORS, paste, collapse=" "))
Now we have a bijection between string representations of elements of VECTORS and integers:
vec <- c(894025, 153892, 98596, 218401, 36616) # 15th element of VECTORS
which(levels(fvar) == paste(vec, collapse=" "))
# [1] 90
levels(fvar)[90]
# [1] "894025 153892 98596 218401 36616"
as.numeric(strsplit(levels(fvar)[90], " ")[[1]])
# [1] 894025 153892 98596 218401 36616
If you wanted to wrap them up into nice functions:
id.from.vec <- function(vec) which(levels(fvar) == paste(vec, collapse=" "))
id.from.vec(c(894025, 153892, 98596, 218401, 36616))
# [1] 90
vec.from.id <- function(id) as.numeric(strsplit(levels(fvar)[id], " ")[[1]])
vec.from.id(90)
# [1] 894025 153892 98596 218401 36616
Note that this works out of the box even if the vectors are different lengths.
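A quick check of that claim, with a small hypothetical list of unequal-length vectors:
VECTORS2 <- list(c(1, 2), c(3, 4, 5), c(1, 2))  # hypothetical example
fvar2 <- factor(sapply(VECTORS2, paste, collapse = " "))
as.integer(fvar2)  # identical vectors receive the same integer
# [1] 1 2 1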
A keyed data.table has nice lookup properties:
library(data.table)
set.seed(1)
VECTORS <- lapply(seq(1e5), sample, x = 1e6, size = 5)
VECmap <- setkey(rbindlist(lapply(unique(VECTORS), as.list)))[, ID := .I]
# V1 V2 V3 V4 V5 ID
# 1: 13 897309 366563 678873 6571 1
# 2: 15 557977 640484 732531 848939 2
# 3: 48 18120 911805 188728 805726 3
# 4: 48 830301 862433 506297 877432 4
# 5: 52 873436 824165 86251 576173 5
# ---
# 99996: 999911 583599 803402 240910 931996 99996
# 99997: 999931 146505 287431 180259 230904 99997
# 99998: 999937 175888 266336 874987 982951 99998
# 99999: 999950 960139 455084 586956 875504 99999
# 100000: 999993 191750 258982 518519 78087 100000
mapVEC <- function(...) VECmap[.(...)]$ID
mapID <- function(id) unlist(VECmap[ID==id,!"ID",with=FALSE], use.names=FALSE)
# example usage
mapVEC(52, 873436, 824165, 86251, 576173)
# 5
mapID(5)
# 52 873436 824165 86251 576173
Comments: As mentioned by @Roland, a bijection between (a) 1..1e5 and (b) all 5-length sequences of distinct numbers from 1..1e6 is not possible, so I'm just guessing that this is what the OP is after.
When you write a function with ... as an argument, that means an arbitrary number of unnamed arguments are accepted. Within the function, these arguments can be referred to with ..., but are often also seen with c(...) and list(...). Within a data.table, .(...) is an alias for list(...). To see documentation for writing functions, type help.start() and click through to the "R Language Definition."
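For instance, a minimal illustration (count_args is a hypothetical helper, not part of any package):
count_args <- function(...) length(list(...))  # list(...) collects the arguments
count_args(52, 873436, 824165, 86251, 576173)
# [1] 5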
