I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm:
dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE)
I am trying to extract bigrams (e.g. "climate change", "global warming") but keep getting an error message when I type the following, saying the ngrams argument is not used.
dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE, ngrams = 2)
I have installed the tokenizer, tidyverse, dplyr, ngram, readtext, quanteda and stm libraries.
In my corpus, the Doc_iD column holds the article titles; I need the bigrams to be extracted from the "texts" column.
Do I need to extract the ngrams from the corpus first or can I do it from the dfm? Am I missing some piece of code that allows me to extract the bigrams?
Strictly speaking, if ngrams are what you want, then you can use tokens_ngrams() to form them. But it sounds like you would rather get more interesting multi-word expressions than "of the" etc. For that, I would use textstat_collocations(). You will want to do this on tokens, not on a dfm: the dfm will have already split your tokens into bag-of-words features, from which ngrams or MWEs can no longer be formed.
Here's an example from the built-in inaugural corpus. It removes stopwords but leaves a "pad" so that words that were not adjacent before the stopword removal will not appear as adjacent after their removal.
library("quanteda")
## Package version: 2.0.1
toks <- tokens(data_corpus_inaugural) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 united states 157 0 2 7.893348 41.19480
## 2 let us 97 0 2 6.291169 36.15544
## 3 fellow citizens 78 0 2 7.963377 32.93830
## 4 american people 40 0 2 4.426593 23.45074
## 5 years ago 26 0 2 7.896667 23.26947
## 6 federal government 32 0 2 5.312744 21.80345
These are by default scored and sorted in order of descending score.
To "extract" them, just take the collocation column:
head(colls$collocation, 50)
## [1] "united states" "let us" "fellow citizens"
## [4] "american people" "years ago" "federal government"
## [7] "almighty god" "general government" "fellow americans"
## [10] "go forward" "every citizen" "chief justice"
## [13] "four years" "god bless" "one another"
## [16] "state governments" "political parties" "foreign nations"
## [19] "solemn oath" "public debt" "religious liberty"
## [22] "public money" "domestic concerns" "national life"
## [25] "future generations" "two centuries" "social order"
## [28] "passed away" "good faith" "move forward"
## [31] "earnest desire" "naval force" "executive department"
## [34] "best interests" "human dignity" "public expenditures"
## [37] "public officers" "domestic institutions" "tariff bill"
## [40] "first time" "race feeling" "western hemisphere"
## [43] "upon us" "civil service" "nuclear weapons"
## [46] "foreign affairs" "executive branch" "may well"
## [49] "state authorities" "highest degree"
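If the goal is then to treat these collocations as single features, a natural follow-up (a sketch of mine, not part of the original answer; the 1:50 cutoff is arbitrary) is to compound them into the tokens before forming a dfm:
library("quanteda")
# compound the top-scoring collocations into single underscore-joined tokens
toks_comp <- tokens_compound(toks, pattern = phrase(colls$collocation[1:50]))
# remove_padding = TRUE drops the "" pads left by tokens_remove(padding = TRUE)
dfmat <- dfm(toks_comp, remove_padding = TRUE)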
I think you need to create the ngrams directly from the tokens of the corpus. This is an example adapted from the quanteda tutorial website:
library(quanteda)
corp <- corpus(data_corpus_inaugural)
toks <- tokens(corp)
tokens_ngrams(toks, n = 2)
Tokens consisting of 58 documents and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens_of" "of_the" "the_Senate" "Senate_and" "and_of" "of_the" "the_House"
[8] "House_of" "of_Representatives" "Representatives_:" ":_Among" "Among_the"
[ ... and 1,524 more ]
EDIT: this example from the dfm help page may be useful.
library(quanteda)
# You say you're already creating the corpus?
# where it says "data_corpus_inaugural", put your corpus name
# where it says "the_senate", put "climate_change"
# where it says "the_house", put "global_warming"
tokens(data_corpus_inaugural) %>%
tokens_ngrams(n = 2) %>%
dfm(stem = TRUE, select = c("the_senate", "the_house"))
#> Document-feature matrix of: 58 documents, 2 features (89.7% sparse) and 4 docvars.
#> features
#> docs the_senat the_hous
#> 1789-Washington 1 2
#> 1793-Washington 0 0
#> 1797-Adams 0 0
#> 1801-Jefferson 0 0
#> 1805-Jefferson 0 0
#> 1809-Madison 0 0
#> [ reached max_ndoc ... 52 more documents ]
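Note that if you are on a newer quanteda release (v3 or later), dfm() no longer accepts stem or select directly; in that case an equivalent pipeline (my sketch, under that assumption) would be roughly:
library("quanteda")
# tokens_wordstem() stems each component of an ngram separately,
# so "the_senate" should become "the_senat"
tokens(data_corpus_inaugural) %>%
  tokens_ngrams(n = 2) %>%
  tokens_wordstem() %>%
  dfm() %>%
  dfm_select(pattern = c("the_senat", "the_hous"))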
I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation, and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame() %>%
rbind(c("missing", "miss")) %>%
rbind(c("data", "datum")) %>%
`colnames<-`(c("inflected_form", "lemma"))
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) Preprocess the text, identify collocations, and lemmatize for downstream tasks.
# I used a blank space as the concatenator and the phrase() function as explained
# in the documentation, following the multi-word substitution example there:
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) test results
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature        count
this               1
column             1
has                1
a                  2
lot                1
of                 1
almost             1
i                  2
am                 1
interested         1
in                 1
problems           1
is                 1
headache           1
how                1
do                 1
you                1
handle             1
missing data       4
"missing data" should be "miss datum".
This only works if each document in df is a single word. I can make the process work if I generate my collocations using a tokens object from the get-go, but that's not what I want.
The problem is that you have already compounded the elements of the collocations into a single "token" containing a space, but by supplying the phrase() wrapper to tokens_replace(), you are telling it to look for two sequential tokens, not the one token containing a space.
The way to get what you want is by making the lemmatised replacement match the collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
Alternatives would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed listing of sequences you want to match to lemmatised sequences. E.g.,
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"
In a dfm, how is it possible to detect non-English words and remove them?
dftest <- data.frame(id = 1:3,
text = c("Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"))
For example, the construction of the dfm is this:
testDfm <- dftest$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_wordstem() %>%
  dfm()
I found the textcat package as an alternative solution, but in a real dataset there are many cases where a whole row that is in English gets recognized as another language just because of a single character. Is there any alternative way to find non-English rows in a data frame, or non-English tokens in the dfm, using quanteda?
You can do this using a word list of all English words. One place where such a list exists is in the hunspell package, which is meant for spell checking.
library(quanteda)
# find the path in which the right dictionary file is stored
hunspell::dictionary(lang = "en_US")
#> <hunspell dictionary>
#> affix: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.aff
#> dictionary: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic
#> encoding: UTF-8
#> wordchars: ’
#> added: 0 custom words
# read this into a vector
english_words <- readLines("/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic") %>%
# the vector contains extra information on the words, which is removed
gsub("/.+", "", .)
# let's display a sample of the words
set.seed(1)
sample(english_words, 50)
#> [1] "furnace" "steno" "Hadoop" "alumna"
#> [5] "gonorrheal" "multichannel" "biochemical" "Riverside"
#> [9] "granddad" "glum" "exasperation" "restorative"
#> [13] "appropriate" "submarginal" "Nipponese" "hotting"
#> [17] "solicitation" "pillbox" "mealtime" "thunderbolt"
#> [21] "chaise" "Milan" "occidental" "hoeing"
#> [25] "debit" "enlightenment" "coachload" "entreating"
#> [29] "grownup" "unappreciative" "egret" "barre"
#> [33] "Queen" "Tammany" "Goodyear" "horseflesh"
#> [37] "roar" "fictionalization" "births" "mediator"
#> [41] "resitting" "waiter" "instructive" "Baez"
#> [45] "Muenster" "sleepless" "motorbike" "airsick"
#> [49] "leaf" "belie"
Armed with this vector, which should in theory contain all English words but only words in English, we can remove non-English tokens:
testDfm <- dftest$text %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_keep(english_words, valuetype = "fixed") %>%
tokens_wordstem() %>%
dfm()
testDfm
#> Document-feature matrix of: 3 documents, 9 features (66.7% sparse).
#> features
#> docs this a spanish word english onlin here comment va
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 0 0
#> text3 0 0 0 0 0 0 0 1 1
As you can see, this works pretty well but isn't perfect. The "va" from "ça va" has been picked up as an English word, as has "comment". What you want to do is thus a matter of finding the right word list and/or cleaning it. You can also think about removing texts in which too many words have been removed.
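A minimal sketch of that last idea (the 0.5 threshold and the object names are mine, not from the answer): compare token counts before and after the English-word filter, and drop documents that lose too many tokens.
library("quanteda")
toks_all <- tokens(dftest$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks_eng <- tokens_keep(toks_all, english_words, valuetype = "fixed")
# share of each document's tokens that survived the English-word filter
share_english <- ntoken(toks_eng) / ntoken(toks_all)
# keep only documents where at least half the tokens look English (arbitrary cutoff)
toks_english_docs <- tokens_subset(toks_eng, share_english >= 0.5)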
The question is not entirely clear as to whether you want to remove non-English "rows" first, or remove non-English words later. There are a lot of cognates between European languages (words that are homographs appearing in more than one language) so the tokens_keep() strategy will be imperfect.
You could remove the non-English documents after detecting the language, using the cld3 library:
dftest <- data.frame(
id = 1:3,
text = c(
"Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"
)
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
## id text
## 1 1 Holla this is a spanish word
## 2 2 English online here
And then input that into quanteda::dfm().
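Putting the two steps together (a sketch; the object names are mine):
library("quanteda")
library("cld3")
# keep only the rows detected as English, then tokenize and build the dfm
dftest_en <- subset(dftest, detect_language(dftest$text) == "en")
dfmat <- dftest_en$text %>%
  tokens(remove_punct = TRUE) %>%
  dfm()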
In my text of news articles I would like to convert several different ngrams that refer to the same political party to an acronym. I would like to do this because I would like to avoid any sentiment dictionaries confusing the words in the party's name (Liberal Party) with the same word in different contexts (liberal helping).
I can do this below with str_replace_all and I know about the tokens_compound() function in quanteda, but it doesn't seem to do exactly what I need.
library(stringr)
text <- c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
text1 <- str_replace_all(text, '(liberal party)|liberals', 'olp')
text2 <- str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')
Should I somehow just preprocess the text before turning it into a corpus? Or is there a way to do this after turning it into a corpus in quanteda.
Here is some expanded sample code that specifies the problem a little better:
text <- c('a text about some political parties called the new democratic party
the new democrats and the liberal party and the liberals. I would like the
word democratic to be counted in the dfm but not the words new democratic.
The same goes for liberal helpings but not liberal party')
partydict <- dictionary(list(
  olp = c("liberal party", "liberals"),
  ndp = c("new democrats", "new democratic party"),
  sentiment = c('liberal', 'democratic')
))
dfm(text, dictionary = partydict)
This example counts democratic in both the new democratic and the democratic sense, but I would like those counted separately.
You want the function tokens_lookup(), after having defined a dictionary that defines the canonical party labels as keys, and lists all the ngram variations of the party names as values. By setting exclusive = FALSE it will keep the tokens that are not matched, in effect acting as a substitution of all variations with the canonical party names.
In the example below, I've modified your input text a bit to illustrate the ways that the party names will be combined to be different from the phrases using "liberal" but not "liberal party".
library("quanteda")
text <- c('a text about some political parties called the new democratic party
which is conservative the new democrats and the liberal party and the
liberals which are liberal helping poor people')
toks <- tokens(text)
partydict <- dictionary(list(
olp = c("liberal party", "the liberals"),
ndp = c("new democrats", "new democratic party")
))
(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
## [1] "a" "text" "about" "some" "political" "parties"
## [7] "called" "the" "NDP" "which" "is" "conservative"
## [13] "the" "NDP" "and" "the" "OLP" "and"
## [19] "OLP" "which" "are" "liberal" "helping" "poor"
## [25] "people"
So that has replaced the party name variants with the party keys.
Constructing a dfm now operates on these new tokens, preserving the uses of (e.g.) "liberal" that might be linked to sentiment, but having already combined "liberal party" and replaced it with "OLP". Applying a dictionary to the dfm will now work for your example of "liberal" in "liberal helping" without confusing it with the use of "liberal" in the party name.
sentdict <- dictionary(list(
left = c("liberal", "left"),
right = c("conservative", "")
))
dfm(toks2) %>%
dfm_lookup(dictionary = sentdict, exclusive = FALSE)
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
## features
## docs olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
## text1 2 2 1 1 1 1 1 1 1 3 2 1 1 2 1 1 1
## features
## docs poor people
## text1 1 1
Two additional notes:
If you do not want the keys uppercased in the replacement tokens, set capkeys = FALSE.
You can set different matching types using the valuetype argument, including valuetype = "regex". (Also note that the scope of the | operator in your regular expressions is easy to get wrong; parenthesising each alternative, as in the sketch below, makes the intent explicit. But with tokens_lookup() you won't need to worry about that at all!)
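For illustration, here are the question's str_replace_all() calls with each alternative parenthesised so the scope of | is unambiguous:
library(stringr)
text1 <- str_replace_all(text, "(liberal party)|(liberals)", "olp")
text2 <- str_replace_all(text1, "(new democrats)|(new democratic party)", "ndp")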
I am currently running an stm (structural topic model) of a series of articles from the french newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for doing things like removing words, removing numbers...etc...
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article -le- elides to -l'- before vowels. I've tried to remove -l'- (and similar things like -d'-) as words with removeWords:
lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))
but it only works with words that stand apart from the rest of the text, not with articles that are attached to a word, as in -l'arbre- (the tree).
Frustrated, I've tried to give it a simple gsub
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, and possibly through a c(...) vector so that I can give it a series of expressions all together?
Just as context, lmt67 is a "large character" with 30,000 elements/articles, obtained by using the texts() function on data imported from txt files.
Thanks to anyone willing to help.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method uses pattern matches for the simple ASCII 39 apostrophe plus a number of Unicode variants, matched through the Unicode category "Pf" ("Punctuation: Final Quote"). Note that quanteda does its best to normalize the quotes at the tokenization stage - see the "l'indépendance" in the second document, for instance.
The second way uses a French part-of-speech tagger integrated with quanteda that allows a similar selection after recognizing and separating the prefixes, and then removing determiners (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then we apply the pattern to match l', d', or s', using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
toks,
types(toks),
stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
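A minimal sketch of that hand-off (K = 10 is an arbitrary placeholder, and stm() takes many more arguments than shown):
library("quanteda")
library("stm")
dfmat <- dfm(toks)
# convert() produces the documents/vocab/meta list in the format stm() expects
stm_input <- convert(dfmat, to = "stm")
mod <- stm(documents = stm_input$documents,
           vocab = stm_input$vocab,
           K = 10)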
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a scrape from the current page at Le Monde's website. Notice that the apostrophe they use is not the same character as the single-quote here "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a little angle and is not actually "straight down" when I view it. You need to copy that character into your gsub command:
sub("l’", "", text)
[1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
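Note that sub() replaces only the first match, which is why "l’indépendance" survives in the output above; gsub() with the same curly apostrophe replaces every occurrence:
gsub("l’", "", text)
[1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de indépendance catalane, actuellement en prison pour sédition."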
I have the following code:
rm(list = ls(all = TRUE)) # clear workspace
setwd("~/UCSB/14 Win 15/Issy/text.fwt") # set working directory
files <- list.files(); head(files) # list & check files in the working directory
fw1 <- scan(what = "c", sep = "\n", file = "fw_chp01.fwt")
library(tm)
corpus2 <- Corpus(VectorSource(c(fw1)))
skipWords <- function(x) removeWords(x, stopwords("english"))
# remove punctuation, numbers, stopwords, etc.
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(1, 110))) # create document-term matrix
I'm trying to use some of the operations detailed in the tm reference manual (http://cran.r-project.org/web/packages/tm/tm.pdf) with little success. For example, when I try to use findFreqTerms, I get the following error:
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
Can anyone clue me in as to why this isn't working and what I can do to fix it?
Edited for @lawyeR:
head(fw1) produces the first six lines of the text (Episode 1 of Finnegans Wake by James Joyce):
[1] "003.01 riverrun, past Eve and Adam's, from swerve of shore to bend"
[2] "003.02 of bay, brings us by a commodius vicus of recirculation back to"
[3] "003.03 Howth Castle and Environs."
[4] "003.04 Sir Tristram, violer d'amores, fr'over the short sea, had passen-"
[5] "003.05 core rearrived from North Armorica on this side the scraggy"
[6] "003.06 isthmus of Europe Minor to wielderfight his penisolate war: nor"
inspect(corpus2) outputs each line of the text in the following format (this is the final line of the text):
[[960]]
<<PlainTextDocument (metadata: 7)>>
029.36 borough. #this part differs by line of course
inspect(corpus2a.dtm) returns a table of all the types (there are 4,163 in total) in the text, in the following format:
Docs youths yoxen yu yurap yutah zee zephiroth zine zingzang zmorde zoom
1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
Here is a simplified form of what you provided and did, and tm does its job. It may be that one or more of your cleaning steps caused a problem.
> library(tm)
> fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend
+ of bay, brings us by a commodius vicus of recirculation back to
+ Howth Castle and Environs.
+ Sir Tristram, violer d'amores, fr'over the short sea, had passen-
+ core rearrived from North Armorica on this side the scraggy
+ isthmus of Europe Minor to wielderfight his penisolate war: nor")
>
> corpus<-Corpus(VectorSource(c(fw1)))
> inspect(corpus)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
riverrun, past Eve and Adam's, from swerve of shore to bend
of bay, brings us by a commodius vicus of recirculation back to
Howth Castle and Environs.
Sir Tristram, violer d'amores, fr'over the short sea, had passen-
core rearrived from North Armorica on this side the scraggy
isthmus of Europe Minor to wielderfight his penisolate war: nor
> dtm <- DocumentTermMatrix(corpus)
> findFreqTerms(dtm)
[1] "adam's," "and" "armorica" "back" "bay," "bend"
[7] "brings" "castle" "commodius" "core" "d'amores," "environs."
[13] "europe" "eve" "fr'over" "from" "had" "his"
[19] "howth" "isthmus" "minor" "nor" "north" "passen-"
[25] "past" "penisolate" "rearrived" "recirculation" "riverrun," "scraggy"
[31] "sea," "shore" "short" "side" "sir" "swerve"
[37] "the" "this" "tristram," "vicus" "violer" "war:"
[43] "wielderfight"
As another point, I find it useful at the start to load a few other packages complementary to tm.
library(SnowballC); library(RWeka); library(rJava); library(RWekajars)
For what it's worth, as compared to your somewhat complicated cleaning steps, I usually trudge along like this (replace comments$comment with your text vector):
library(stringr) # for str_replace_all() below
comments$comment <- tolower(comments$comment)
comments$comment <- removeNumbers(comments$comment)
comments$comment <- stripWhitespace(comments$comment)
# replace all double spaces internally with a single space
comments$comment <- str_replace_all(comments$comment, "  ", " ")
# better to remove punctuation with str_replace_all() because the tm function doesn't insert a space
comments$comment <- str_replace_all(comments$comment, pattern = "[[:punct:]]", " ")
comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
From another ticket: tm 0.6.0 has a bug, and it can be addressed with this statement.
corpus_clean <- tm_map(corp_stemmed, PlainTextDocument)
Hope this helps.