Implementing N-grams in my corpus: Quanteda error (R)

I am trying to implement quanteda on my corpus in R, but I am getting:
Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, :
duplicate row.names: character(0)
I don't have much experience with this. Here is a download of the dataset: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0
Here is the code:
tweets = read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)
quanteda.corpus <- corpus(corpus)

The processing you're doing with tm prepares an object for tm, and quanteda doesn't know what to do with it. quanteda performs all of these steps itself, as you can see from the options listed in help("dfm").
If you try the following you can move ahead:
dfm(tweets$Tweet, verbose = TRUE, toLower = TRUE, removeNumbers = TRUE,
    removePunct = TRUE, removeTwitter = TRUE, language = "english",
    ignoredFeatures = stopwords("english"), stem = TRUE)
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing documents: 6,943 documents
... indexing features: 15,164 feature types
... removed 161 features, from 174 supplied (glob) feature types
... stemming features (English), trimmed 2175 feature variants
... created a 6943 x 12828 sparse dfm
... complete.
Elapsed time: 0.756 seconds.
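Note that the dfm() arguments above are from an older quanteda release. In current quanteda (version 3 and later), the same preprocessing happens at the tokens() stage; a rough modern equivalent, sketched here and not run against this dataset, would be:
library(quanteda)
toks <- tokens(tweets$Tweet, remove_numbers = TRUE, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem(language = "english")
myDfm <- dfm(toks, verbose = TRUE)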
HTH

No need to start with the tm package, or even to use read.csv() at all - this is what the quanteda companion package readtext is for.
So to read in the data, you can send the object created by readtext::readtext() straight to the corpus constructor:
myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv", text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
##
## Text Types Tokens Sentences Sentiment Sentiment_Confidence
## text1 19 21 1 2 0.7579
## text2 18 20 2 2 0.8775
## text3 23 24 1 -1 0.6805
## text5 17 19 2 0 1.0000
## text4 18 19 1 -1 0.8820
##
## Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
From there, you can perform all of the pre-processing steps directly in the dfm() call, including the choice of ngrams:
# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete.
## Elapsed time: 0.662 seconds.
# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete.
## Elapsed time: 1.419 seconds.
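(The ngrams argument to dfm() has since been deprecated; in quanteda 3.x the equivalent, sketched here without the verbose log, is to form the n-grams on a tokens object with tokens_ngrams():)
toks <- tokens(myCorpus, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem()
dfm1 <- dfm(toks)                        # unigrams
dfm2 <- dfm(tokens_ngrams(toks, n = 2))  # bigrams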

Related

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
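Note that documents without the target term come back empty, like text3 above. If you want to drop those before scoring, one option is tokens_subset() with ntoken():
# keep only documents that still contain tokens after the window selection
toks_nonempty <- tokens_subset(toks, ntoken(toks) > 0)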
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
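If you then want a single score per document, one illustrative (not canonical) choice is net sentiment, computed from those dictionary counts:
# convert the dictionary counts to a data frame and add a simple net score
sent <- dfm(tokens_lookup(toks, data_dictionary_LSD2015))
sent_df <- convert(sent, to = "data.frame")
sent_df$net_sentiment <- sent_df$positive - sent_df$negative
sent_df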
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
          pattern = "stackoverflow",
          window = 3)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic(); in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), here a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first).
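One way to score those windows, sketched here with the same LSD2015 dictionary used in the answer above, is to paste pre and post together and run the dictionary lookup on the result:
kw <- as.data.frame(x)
window_text <- paste(kw$pre, kw$post)   # the +/- 3 token context, keyword excluded
dfm(tokens_lookup(tokens(window_text), data_dictionary_LSD2015))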

Clean corpus using Quanteda

What's the Quanteda way of cleaning a corpus like shown in the example below using tm (lowercase, remove punct., remove numbers, stem words)? To be clear, I don't want to create a document-feature matrix with dfm(), I just want a clean corpus that I can use for a specific downstream task.
# This is what I want to do in quanteda
library("tm")
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
PS: I am aware that I could just do quanteda_corpus <- quanteda::corpus(crude) to get what I want, but I would much prefer being able to do everything in quanteda.
I think what you want to do is deliberately impossible in quanteda.
You can, of course, do the cleaning quite easily without losing the order of words using the tokens* set of functions:
library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_wordstem()
print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> [1] "Diamond" "Shamrock" "Corp" "said" "that" "effect"
#> [7] "today" "it" "had" "cut" "it" "contract"
#> [ ... and 78 more ]
#>
#> reut-00002.xml :
#> [1] "OPEC" "may" "be" "forc" "to" "meet" "befor"
#> [8] "a" "schedul" "June" "session" "to"
#> [ ... and 427 more ]
#>
#> reut-00004.xml :
#> [1] "Texaco" "Canada" "said" "it" "lower" "the"
#> [7] "contract" "price" "it" "will" "pay" "for"
#> [ ... and 40 more ]
#>
#> [ reached max_ndoc ... 17 more documents ]
But it is not possible to turn this tokens object back into a corpus. It would, however, be possible to write a new function to do this:
corpus.tokens <- function(x, ...) {
  quanteda:::build_corpus(
    unlist(lapply(x, paste, collapse = " ")),
    docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
  )
}
corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#>
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#>
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#>
#> [ reached max_ndoc ... 17 more documents ]
But this object, while technically being a corpus class object, is not what a corpus is supposed to be. From ?corpus [emphasis added]:
Value
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.
The object above does not meet this description as the original texts have been processed already. Yet the class of the object communicates otherwise. I don't see a reason to break this logic as all subsequent analyses steps should be possible using either tokens* or dfm* functions.
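If the downstream task only needs cleaned text strings rather than a corpus object, a simpler sketch (using the same collapse trick as the function above) sidesteps the class question entirely:
# collapse each tokens vector back into a single cleaned string
cleaned <- unlist(lapply(toks, paste, collapse = " "))
head(cleaned, 2)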

How to maintain ngrams in a quanteda dfm?

I'm using quanteda to create a document-feature matrix (dfm) from a tokens object. My tokens object contains many ngrams (e.g., "united_states"). When I create a dfm using the dfm() function, my ngrams are split at the underscore ("united_states" gets split into "united" "states"). How can I create a dfm while keeping my ngrams intact?
Here's my process:
my_tokens <- tokens(my_corpus, remove_symbols = TRUE, remove_punct = TRUE, remove_numbers = TRUE)
my_tokens <- tokens_compound(my_tokens, pattern = phrase(my_ngrams))
my_dfm <- dfm(my_tokens, stem = FALSE, tolower = TRUE)
I see "united_states" in my_tokens, but in the dfm it becomes "united" and "states" as separate tokens.
Thank you for any help you can offer!
It's not clear which version of quanteda you are using, but basically this should work, since the default tokenizer (from tokens()) will not split words containing an inner _.
Demonstration:
library("quanteda")
## Package version: 2.1.1
# tokens() will not separate _ words
tokens("united_states")
## Tokens consisting of 1 document.
## text1 :
## [1] "united_states"
Here's a reproducible example for the phrase "United States":
my_corpus <- tail(data_corpus_inaugural, 3)
# show that the phrase exists
head(kwic(my_corpus, phrase("united states"), window = 2))
##
## [2009-Obama, 2685:2686] bless the | United States | of America
## [2013-Obama, 13:14] of the | United States | Congress,
## [2013-Obama, 2313:2314] bless these | United States | of America
## [2017-Trump, 347:348] , the | United States | of America
## [2017-Trump, 1143:1144] to the | United States | of America
my_tokens <- tokens(my_corpus,
                    remove_symbols = TRUE,
                    remove_punct = TRUE,
                    remove_numbers = TRUE)
my_tokens <- tokens_compound(my_tokens, pattern = phrase("united states"))
my_dfm <- dfm(my_tokens, stem = FALSE, tolower = TRUE)
dfm_select(my_dfm, "*_*")
## Document-feature matrix of: 3 documents, 1 feature (0.0% sparse) and 4 docvars.
## features
## docs united_states
## 2009-Obama 1
## 2013-Obama 2
## 2017-Trump 2
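If the compounds still come out split, it is worth checking that they exist in the tokens object at all, and that the tokens object (not the original corpus) is what gets passed to dfm(). A quick check, assuming my_tokens from above:
# list a few token types containing an underscore, i.e. the surviving compounds
head(grep("_", types(my_tokens), value = TRUE))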

Stopwords Remaining in Corpus After Cleaning

I am attempting to remove the stopword "the" from my corpus, however not all instances are being removed.
library(RCurl)
library(tm)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list <- inspect(
  DocumentTermMatrix(shakespeare, list(dictionary = c("the", "thee")))
)
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
tm_map(shakespeare, content_transformer(tolower))
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list <- inspect(
  DocumentTermMatrix(shakespeare, list(dictionary = c("the", "thee")))
)
The first inspect call reveals:
Terms
Docs the thee
1 11665 752
2 11198 660
3 4866 382
And the second, after cleaning:
Terms
Docs the thee
1 1916 1298
2 1711 1140
3 760 740
What am I missing here about the logic of removeWords that it would ignore all these instances of "the"?
EDIT
I was able to get the instances of "the" down to below 1000 with a slight change to the call, and by making the removeWords call the very first cleaning step:
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"), "the", "The"))
Which gets me down to:
Docs the thee
1 145 752
2 130 660
3 71 382
Still though, I'd like to know why I can't seem to eliminate them all.
Here is reproducible code that leads to 0 instances of "the". I fixed your typo and used your code from before the edit.
library(RCurl)
library(tm)
library(SnowballC)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list <- inspect(
  DocumentTermMatrix(shakespeare, list(dictionary = c("the", "thee")))
)
leads to:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 6/0
Sparsity : 0%
Maximal term length: 4
Weighting : term frequency (tf)
Sample :
Terms
Docs the thee
1 11665 752
2 11198 660
3 4866 382
and after cleaning and fixing the typo:
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
shakespeare = tm_map(shakespeare, content_transformer(tolower)) ## FIXED TYPO
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list <- inspect(
  DocumentTermMatrix(shakespeare, list(dictionary = c("the", "thee")))
)
it leads to:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 3/3
Sparsity : 50%
Maximal term length: 4
Weighting : term frequency (tf)
Sample :
Terms
Docs the thee
1 0 1298
2 0 1140
3 0 740
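As a cross-check, the same cleaning in quanteda (the package used throughout the rest of this page) avoids the case issue, because tokens_remove() matches stopwords case-insensitively by default; a sketch reusing file1-file3 from above:
library(quanteda)
toks <- tokens(c(file1, file2, file3), remove_punct = TRUE) %>%
  tokens_remove(stopwords("english"))   # case-insensitive by default, so "The" is caught too
dfm_select(dfm(toks), pattern = c("the", "thee"))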

Create a corpus from df including doc names

I'm reading all my text files into a data frame with the readtext package.
df <- readtext(directory, "*.txt")
The .txt files get stored in a data frame with doc_id (the name of the document) and text (the content).
Before I upgraded to the newest version of quanteda, the doc_id was stored in the corpus object when I created my corpus using:
corpus <- corpus(df)
But now this doesn't work anymore: the 'documents' data frame of the corpus object only stores the 'texts' values, not the doc_id values.
How do I get back my doc_id into my corpus object?
That's because of a bug that we fixed prior to v1.2.0. When constructing a corpus from a data.frame, some field is required for a document id, and by default this is the readtext doc_id.
If you want it also as a document variable, you can do it this way. First, I read in some texts from the system files of the readtext package, for a reproducible example.
library("readtext")
library("quanteda")
packageVersion("readtext")
## [1] ‘0.50’
packageVersion("quanteda")
## [1] ‘1.2.0’
DATA_DIR <- system.file("extdata/", package = "readtext")   # the example files bundled with readtext
df <- readtext(paste0(DATA_DIR, "txt/EU_manifestos/*.txt"), encoding = "LATIN1")
df
## readtext object consisting of 17 documents and 0 docvars.
## # data.frame [17 × 2]
## doc_id text
## <chr> <chr>
## 1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..."
## 2 EU_euro_2004_de_V.txt "\"Gemeinsame\"..."
## 3 EU_euro_2004_en_PSE.txt "\"PES · PSE \"..."
## 4 EU_euro_2004_en_V.txt "\"Manifesto\n\"..."
## 5 EU_euro_2004_es_PSE.txt "\"PES · PSE \"..."
## 6 EU_euro_2004_es_V.txt "\"Manifesto\n\"..."
When we create the corpus from this, we see no document variables.
crp <- corpus(df)
crp
## data frame with 0 columns and 17 rows
But it's trivial to add them:
docvars(crp, "doc_id") <- df$doc_id
head(docvars(crp))
## doc_id
## EU_euro_2004_de_PSE.txt EU_euro_2004_de_PSE.txt
## EU_euro_2004_de_V.txt EU_euro_2004_de_V.txt
## EU_euro_2004_en_PSE.txt EU_euro_2004_en_PSE.txt
## EU_euro_2004_en_V.txt EU_euro_2004_en_V.txt
## EU_euro_2004_es_PSE.txt EU_euro_2004_es_PSE.txt
## EU_euro_2004_es_V.txt EU_euro_2004_es_V.txt
Note that you are strongly discouraged from accessing the internals of the corpus object through its data.frame element crp$documents. Using the accessor docvars() and the replacement function docvars()<- will continue to work in future versions, but the internals of the corpus object are likely to change.
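For just the document ids, docnames() is another supported accessor that does not touch the object internals:
head(docnames(crp))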
