I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data.
My corpus contains texts with full German sentences, and I want to work with the nouns of each text only. One trick in German is to keep only the upper-case words (since German nouns are capitalized), but that fails at the beginning of a sentence.
Text1 <- c("Halle an der Saale ist die grünste Stadt Deutschlands")
Text2 <- c("In Hamburg regnet es immer, das ist also so wie in London.")
Text3 <- c("James Bond trinkt am liebsten Martini")
myCorpus <- corpus(c(Text1, Text2, Text3))
metadoc(myCorpus, "language") <- "german"
summary(myCorpus, showmeta = TRUE)
myDfm <- dfm(myCorpus, tolower = FALSE, remove_numbers = TRUE,
             remove = stopwords("german"), remove_punct = TRUE,
             remove_separators = TRUE)
topfeatures(myDfm, 20)
From this minimal example, I would like to retrieve:
"Halle", "Saale", "Stadt", "Deutschland", "Hamburg", "London", "Martini", "James", "Bond".
I assume I need a dictionary that defines verbs/nouns/etc. and the proper names (James Bond, Hamburg, etc.), or is there a built-in function/dictionary?
Bonus Question: Does the solution work for English texts too?
You need some help from a part-of-speech tagger. Fortunately there is a great one, with a German language model, in the form of spaCy, and a package we wrote as a wrapper around it, spacyr. Installation instructions are at the spacyr page.
This code will do what you want:
txt <- c("Halle an der Saale ist die grünste Stadt Deutschlands",
"In Hamburg regnet es immer, das ist also so wie in London.",
"James Bond trinkt am liebsten Martini")
library("spacyr")
spacy_initialize(model = "de")
txtparsed <- spacy_parse(txt, tag = TRUE, pos = TRUE)
head(txtparsed, 20)
#    doc_id sentence_id token_id        token        lemma   pos   tag entity
# 1   text1           1        1        Halle        halle PROPN    NE  LOC_B
# 2   text1           1        2           an           an   ADP  APPR  LOC_I
# 3   text1           1        3          der          der   DET   ART  LOC_I
# 4   text1           1        4        Saale        saale PROPN    NE  LOC_I
# 5   text1           1        5          ist          ist   AUX VAFIN
# 6   text1           1        6          die          die   DET   ART
# 7   text1           1        7      grünste      grünste   ADJ  ADJA
# 8   text1           1        8        Stadt        stadt  NOUN    NN
# 9   text1           1        9 Deutschlands deutschlands PROPN    NE  LOC_B
# 10  text2           1        1           In           in   ADP  APPR
# 11  text2           1        2      Hamburg      hamburg PROPN    NE  LOC_B
# 12  text2           1        3       regnet       regnet  VERB VVFIN
# 13  text2           1        4           es           es  PRON  PPER
# 14  text2           1        5        immer        immer   ADV   ADV
# 15  text2           1        6            ,            ,  PUNCT    $,
# 16  text2           1        7          das          das  PRON   PDS
# 17  text2           1        8          ist          ist   AUX VAFIN
# 18  text2           1        9         also         also   ADV   ADV
# 19  text2           1       10           so           so   ADV   ADV
# 20  text2           1       11          wie          wie  CONJ KOKOM
(nouns <- with(txtparsed, subset(token, pos == "NOUN")))
# [1] "Stadt"
(propernouns <- with(txtparsed, subset(token, pos == "PROPN")))
# [1] "Halle" "Saale" "Deutschlands" "Hamburg" "London"
# [6] "James" "Bond" "Martini"
Here, you can see that the nouns you wanted are marked as proper nouns ("PROPN") in the simpler pos field. The tag field is a more detailed, German-specific tagset that you could also select from.
The lists of selected nouns can then be used in quanteda:
library("quanteda")
myDfm <- dfm(txt, tolower = FALSE, remove_numbers = TRUE,
             remove = stopwords("german"), remove_punct = TRUE)
head(myDfm)
# Document-feature matrix of: 3 documents, 14 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale grünste Stadt Deutschlands Hamburg
# text1 1 1 1 1 1 0
# text2 0 0 0 0 0 1
# text3 0 0 0 0 0 0
head(dfm_select(myDfm, pattern = propernouns))
# Document-feature matrix of: 3 documents, 8 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale Deutschlands Hamburg London James
# text1 1 1 1 0 0 0
# text2 0 0 0 1 1 0
# text3 0 0 0 0 0 1
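As for the bonus question: yes, the same approach works for English; you just need an English language model instead of the German one. A minimal sketch (my addition, not part of the answer above; it assumes the small English model has been installed, e.g. via spacy_download_langmodel(), and that the German session is closed first):
spacy_finalize()                            # close the German session first
spacy_initialize(model = "en_core_web_sm")  # load an English model
txtparsed_en <- spacy_parse("James Bond drinks his Martini in London.", pos = TRUE)
with(txtparsed_en, subset(token, pos %in% c("NOUN", "PROPN")))
# should contain tokens such as "James", "Bond", "Martini", "London"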
In order to distinguish nonsense text (e.g. djsarejslslasdfhsl) from real (German) words, I would like to do an analysis of letter frequencies.
My idea is to calculate the relative frequencies of two-letter combinations ("te", "ex", "xt", "is" etc.) using a long text. Based on that information, I would like to calculate the probability that a given word (or sentence) is real German.
But my first problem is how to extract and count all the two-letter combinations. I fear that using substring(string, start, stop) and increasing start and stop in a loop is not a very efficient solution. Do you have any ideas?
# A short sample text
text <- 'Es ist ein Freudentag – ohne Zweifel. Gesundheitsminister Alain Berset und der Bundesrat gehen weiter, als man annehmen durfte. Die Zertifikatspflicht wird aufgehoben, die Maskenpflicht gilt nur noch im ÖV und in Gesundheitseinrichtungen.
Die beste Meldung des Tages aber ist: Die Covid-19-Task-Force, inzwischen als «Task-Farce» verballhornt, wird auf Ende März aufgehoben – zwei Monaten früher als geplant. Die Dauerkritik war wohl mit ein Grund, dass dieses Gremium sich jetzt rasch auflösen will.
Keine Rosen ohne Dornen: Einzelne Punkte von Bersets Ausführungen geben zu denken.
Die «Isolationshaft» für positiv Getestete bleibt zwingend. Das ist Unsinn und steht in einem scharfen Kontrast zu den übrigen Öffnungsschritten. Die Grundimmunität der Bevölkerung beträgt über 90 Prozent, das Virus ist nicht mehr gefährlich, warum will man weiter Leute zu Hause einsperren? Wer schwer krank ist, geht von sich aus nicht zur Arbeit. Die krankheitsbedingte Bettruhe muss man den Menschen nicht vorschreiben.
Gesundheitsminister Berset findet, das Modell Task-Force habe eine interessante Möglichkeit aufgezeigt für die Zusammenarbeit zwischen Regierung und Wissenschaft. Unter Umständen eigne sich dieses Modell auch für andere Bereiche.
Nein danke, Herr Berset.
Die Task-Force war mit ihrem öffentlichen Dauer-Alarmismus und ihren haarsträubenden Falsch-Prognosen vor allem eine Manipulationsmaschine.
Und dann noch dies: Irgendwann während der heutigen Pressekonferenz gab Alain Berset zu verstehen, man habe mit all diesen Massnahmen die Bevölkerung schützen wollen. Vielleicht hatte man diese hehre Absicht einmal im Hinterkopf. Alle Massnahmen ab der zweiten Welle erfolgten nicht zum Schutz der Bevölkerung, sondern, um einen Zusammenbruch des Spital-Systems zu verhindern.
Doch jetzt stossen wir erst einmal auf das Ende der Apartheit an.'
# Some cleaning:
library(stringr)
text <- str_replace_all(text, "[^[:alnum:]]", " ")
text <- tolower(text)
words <- strsplit(text, "\\s+")[[1]]
words
for(word in words){
???
}
Clean the text, replacing any sequence of non-alphanumeric characters with a single space:
text <- tolower(gsub("[^[:alnum:]]+", " ", text))
Find all pairs of sequential characters:
twos <- substring(text, 1:(nchar(text) - 1), 2:nchar(text))
Then keep only those that do not overlap a space:
twos[nchar(trimws(twos)) == 2L]
Here's the result:
> twos[nchar(trimws(twos)) == 2L] |> table()
19 90 aa ab af ag äg ah äh ai al am an än ap ar är as at ät au äu ba be bl br
1 1 1 6 2 2 1 2 2 2 14 2 16 1 1 10 1 15 6 1 12 1 1 24 1 2
bs bt bu ce ch co da de dh di do du dw eb ed ef eg eh ei ek el em en ep er es
1 1 1 4 34 1 9 23 3 18 2 2 1 1 1 1 1 9 32 1 7 5 54 1 42 19
et eu ev ez fa fä fe ff fg fi fl fn fo fr ft fü ga ge gi gl gn gr gs gt ha he
12 3 3 1 2 1 4 2 3 2 3 1 4 2 3 4 1 19 2 1 2 3 1 4 8 17
hi hk hl hm hn ho hr ht hu hü hw ib ic id ie if ig ih ik il im in io ip ir is
3 1 1 3 2 3 9 11 1 1 1 2 16 1 18 2 4 2 2 3 3 28 2 1 5 12
it iu iv je ka ke kh ko kr kt la ld le lg lh li lk ll ln lö ls lt ma mä me mi
19 1 1 2 1 8 1 3 3 1 6 1 7 1 1 5 3 11 1 1 4 1 12 1 8 7
mm mo mö ms mu na nb nd ne nf ng ni nk nm nn no np nr ns nt nu nz ob oc od öf
3 3 1 2 3 4 1 23 13 1 10 8 5 2 4 3 1 1 6 10 2 3 2 3 2 2
og ög oh ol öl on op or os ös ov öv oz pa pe pf pi pl po pr pu ra rä rb rc rd
1 1 3 3 3 8 1 7 4 1 1 1 1 1 1 3 1 1 1 3 2 5 2 3 4 2
re rf rg rh ri rk rl rm rn ro rr rs rt ru rü rz sa sb sc se sf sh si sk sm sn
14 3 1 1 4 2 1 1 4 3 2 9 2 11 1 1 3 1 13 17 1 1 6 5 4 2
so sp sr ss st su sy ta tä te th ti tl to tr ts tt tu tz ub üb uc ud ue uf uh
2 3 1 9 17 3 1 7 2 24 1 6 1 1 4 6 3 1 4 1 2 2 1 2 6 1
üh ul um un ur ür us ut üt ve vi vo vö wa wä we wi wo ys ze zt zu zw
2 1 5 24 3 3 8 3 1 3 3 4 3 4 1 8 9 2 1 5 2 9 6
The approach generalizes to sequences of any number of letters n by separating words with n - 1 spaces:
chartuples <- function(text, n = 2) {
    n0 <- n - 1
    text <- tolower(gsub(
        "[^[:alnum:]]+", paste(rep(" ", n0), collapse = ""), text
    ))
    tuples <- substring(text, 1:(nchar(text) - n0), n:nchar(text))
    tuples[nchar(trimws(tuples)) == n]
}
This is also easy to use for looking up the values of any 'word':
counts <- table(chartuples(text))
counts[chartuples("djsarejslslasdfhsl")] |> as.vector()
(The NAs in the resulting vector are letter pairs not present in your original corpus.)
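To get from the counts back to the probability question in the post, one rough option (a sketch of my own, not a rigorous language model; it assumes the chartuples() function and the text object from above) is to score a word by the mean log relative frequency of its letter pairs, so that real German words score higher than keyboard mashing:
counts <- table(chartuples(text))
probs  <- counts / sum(counts)              # relative frequencies of letter pairs
score_word <- function(word, probs, unseen = 1e-6) {
    p <- as.vector(probs[chartuples(word)])
    p[is.na(p)] <- unseen                   # pairs never seen in the reference text
    mean(log(p))                            # higher (less negative) = more German-looking
}
score_word("Freudentag", probs)
score_word("djsarejslslasdfhsl", probs)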
words <- unlist(strsplit(text, '[^[:alnum:]]+'))
cmbs2 <- sapply(words, function(x) substring(x, len <- seq(nchar(x) - 1), len + 1), USE.NAMES = TRUE)
head(cmbs2) ## Just to show a few words.
$Es
[1] "Es"
$ist
[1] "is" "st"
$ein
[1] "ei" "in"
$Freudentag
[1] "Fr" "re" "eu" "ud" "de" "en" "nt" "ta" "ag"
$ohne
[1] "oh" "hn" "ne"
$Zweifel
[1] "Zw" "we" "ei" "if" "fe" "el"
If I'm not wrong, this should be pretty efficient:
tokens_char <- function(str, window = 2) {
    # remove non-word characters
    str <- stringi::stri_replace_all_regex(str, "\\W", "")
    # lowercase
    str <- tolower(str)
    # prep window variable
    win <- window - 1
    len1 <- seq_len(nchar(str) - win)
    # split into strings of length window
    stringi::stri_sub(str, from = len1, to = len1 + win)
}
The key is stringi::stri_sub, which is a vectorised version of substr. A string is split by moving the window one character at a time, so "This text" is turned into "th" "hi" "is" "st" "te" "ex" "xt". After doing this, we can use some tidyverse code to count occurrences of tokens:
library(tidyverse)
tibble(
token = tokens_char(text, window = 2)
) %>%
count(token, sort = TRUE)
#> # A tibble: 308 × 2
#> token n
#> <chr> <int>
#> 1 en 55
#> 2 er 43
#> 3 ei 35
#> 4 ch 34
#> 5 nd 34
#> 6 in 28
#> 7 te 28
#> 8 be 24
#> 9 un 24
#> 10 de 23
#> # … with 298 more rows
Note that I also included a window argument, which I believe might be useful for your analysis.
tibble(
token = tokens_char(text, window = 3)
) %>%
count(token, sort = TRUE)
#> # A tibble: 851 × 2
#> token n
#> <chr> <int>
#> 1 die 16
#> 2 ich 16
#> 3 ein 15
#> 4 end 13
#> 5 sch 13
#> 6 und 12
#> 7 eit 11
#> 8 nde 10
#> 9 cht 9
#> 10 der 9
#> # … with 841 more rows
And finally, you can also first split your string into words so that letters following each other over word boundaries do not count. For example, "This text" is turned into "th" "hi" "is" "te" "ex" "xt":
tokens_char_words <- function(str, window = 2) {
    str <- unlist(tokenizers::tokenize_words(str))
    # prep window variable
    win <- window - 1
    len1 <- lapply(nchar(str) - win, seq_len)
    # split into strings of length window
    unlist(stringi::stri_sub_all(str = str, from = len1, to = lapply(len1, function(x) x + win)))
}
tokens_char_words("This text", window = 2)
#> [1] "th" "hi" "is" "te" "ex" "xt"
Created on 2022-02-18 by the reprex package (v2.0.1)
Quanteda question.
For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much.
Put differently, I want to get a matrix of the features in each dictionary category that have been matched using the tokens_lookup and dfm_lookup functions, and their frequency per document. So not the aggregated frequency of all words in the category, but of each of them separately.
Is there an easy way to get this?
The easiest way to do this is to iterate over your dictionary "keys" (what you call "categories") and select the matches to create one dfm per key. There are a few steps needed to deal with the non-matches and the compound dictionary values (such as "not fail").
I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
The loop iterates over the dictionary keys to build up a list, each time doing the following:
select the tokens but leave a pad for ones not selected;
compound the multi-word tokens into single tokens;
rename the pad ("") to OTHER, so that we can count non-matches; and
create the dfm.
library("quanteda")
## Package version: 2.1.0
toks <- tokens(tail(data_corpus_inaugural, 3))
dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
    this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
        tokens_compound(data_dictionary_LSD2015[key]) %>%
        tokens_replace("", "OTHER") %>%
        dfm(tolower = FALSE)
    dfm_list <- c(dfm_list, this_dfm)
}
names(dfm_list) <- names(data_dictionary_LSD2015)
Now we have all of the dictionary matches for each key in a list of dfm objects:
dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
## features
## docs clouds raging storms crisis war against violence hatred badly
## 2009-Obama 1 1 2 4 2 1 1 1 1
## 2013-Obama 0 1 1 1 3 1 0 0 0
## 2017-Trump 0 0 0 0 0 1 0 0 0
## features
## docs weakened
## 2009-Obama 1
## 2013-Obama 0
## 2017-Trump 0
## [ reached max_nfeat ... 170 more features ]
##
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
## features
## docs grateful trust mindful thank well generosity cooperation
## 2009-Obama 1 2 1 1 2 1 2
## 2013-Obama 0 0 0 0 4 0 0
## 2017-Trump 1 0 0 1 0 0 0
## features
## docs prosperity peace skill
## 2009-Obama 3 4 1
## 2013-Obama 1 3 1
## 2017-Trump 1 0 0
## [ reached max_nfeat ... 246 more features ]
##
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
## features
## docs not_apologize OTHER
## 2009-Obama 1 2687
## 2013-Obama 0 2317
## 2017-Trump 0 1660
##
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
## features
## docs not_fight not_sap not_grudgingly not_fail OTHER
## 2009-Obama 0 0 1 0 2687
## 2013-Obama 1 1 0 0 2313
## 2017-Trump 0 0 0 1 1658
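If you would rather have the per-document counts as a long data frame instead of one dfm per key, a possible follow-up (a sketch of my own, assuming quanteda's convert() and the tidyr package) is:
library(tidyr)
freq_by_key <- lapply(dfm_list, function(x) {
    df <- convert(x, to = "data.frame")   # one row per document, one column per feature
    pivot_longer(df, -1, names_to = "feature", values_to = "count")
})
freq_by_key$negative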
I found a very useful piece of code on Stack Overflow, Finding 2 & 3 word Phrases Using R TM Package (credit @patrick perry), to show the frequency of 2- and 3-word phrases within a corpus:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
How do you ensure that frequency counts of phrases like "the tin" are not also included in the frequency count of "the tin woodman" or the "tin woodman"?
Thanks
Removing stopwords can remove noise from the data and helps with issues such as the one you are having above:
library(tm)
library(corpus)
library(dplyr)
library(stringr)
corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
term_stats(corpus, ngrams = 2:3) %>%
arrange(desc(count)) %>%
group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>%
mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>%
ungroup() %>%
select(-grp)
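An alternative, more literal take on the original question (my own sketch, not from the answer above; it rebuilds the gutenberg_corpus(55) object and term_stats() call from the question) is to subtract from each 2-gram's count the occurrences already covered by a 3-gram that extends it, e.g. count("the tin") minus count("the tin woodman"):
library(corpus)
oz <- gutenberg_corpus(55)
text_filter(oz)$drop_punct <- TRUE
stats <- term_stats(oz, ngrams = 2:3)
stats$term <- as.character(stats$term)
n_words  <- lengths(strsplit(stats$term, " "))
bigrams  <- stats[n_words == 2, ]
trigrams <- stats[n_words == 3, ]
# count of each 2-gram not already covered by a seen 3-gram that starts with it
bigrams$count_standalone <- bigrams$count - vapply(
    bigrams$term,
    function(b) sum(trigrams$count[startsWith(trigrams$term, paste0(b, " "))]),
    numeric(1)
)
head(bigrams[order(-bigrams$count_standalone), ])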
In the following code, my aim is to reduce words with the same stem to a single form.
For example, kompis in Swedish means friend in English, and words with the same root include kompisar and kompiserna.
rm(list=ls())
Sys.setlocale("LC_ALL","sv_SE.UTF-8")
library(tm)
library(SnowballC)
kompis <- c("kompisar", "kompis", "kompiserna")
stem_doc <- stemDocument(kompis, language="swedish")
stem_doc
1] "kompis" "kompis" "kompis"
I created a sample text including the words kompis, kompisar, and kompiserna.
Then I did some preprocessing on the corpus via the following code:
text <- c("TV och vara med kompisar.",
"Jobba på kompis huset",
"Ta det lugnt, umgås med kompisar.",
"Umgås med kompisar, vänner ",
"kolla anime med kompiserna")
corpus.prep <- Corpus(VectorSource(text), readerControl =list(reader=readPlain, language="swe"))
corpus.prep <- tm_map(corpus.prep, PlainTextDocument)
corpus.prep <- tm_map(corpus.prep, stemDocument,language = "swedish")
head(content(corpus.prep[[1]]))
The results are as follows. However, they include the original words rather than the common stem kompis.
1] "TV och vara med kompisar."
2] "Jobba på kompi huset"
3] "Ta det lugnt, umgå med kompisar."
4] "Umgås med kompisar, vänner"
5] "kolla anim med kompiserna"
Do you know how to fix it?
You are almost there, but using PlainTextDocument is interfering with your goal.
The following code will return your expected result. I'm using removePunctuation because otherwise the stemming will not work on words at the end of a sentence. Also, you will see warning messages after both tm_map calls; you can ignore these.
corpus.prep <- Corpus(VectorSource(text), readerControl =list(reader=readPlain, language="swe"))
corpus.prep <- tm_map(corpus.prep, removePunctuation)
corpus.prep <- tm_map(corpus.prep, stemDocument, language = "swedish")
head(content(corpus.prep))
[1] "TV och var med kompis" "Jobb på kompis huset" "Ta det lugnt umgås med kompis" "Umgås med kompis vänn"
[5] "koll anim med kompis"
For this kind of work I tend to use quanteda. It is better supported and works a lot better than tm.
library(quanteda)
# remove_punct not really needed as quanteda treats the "." as a separate token.
my_dfm <- dfm(text, remove_punct = TRUE)
dfm_wordstem(my_dfm, language = "swedish")
Document-feature matrix of: 5 documents, 15 features (69.3% sparse).
5 x 15 sparse Matrix of class "dfm"
features
docs tv och var med kompis jobb på huset ta det lugnt umgås vänn koll anim
text1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
text2 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
text3 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0
text4 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0
text5 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
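Note that in more recent quanteda versions (3.0 and later), dfm() no longer accepts raw character input directly, so the equivalent would be roughly the following (hedged; the exact deprecation path depends on your version):
my_dfm <- dfm(tokens(text, remove_punct = TRUE))   # tokenize first, then build the dfm
dfm_wordstem(my_dfm, language = "swedish")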
Using tidytext, see issue #17
library(dplyr)
library(tidytext)
library(SnowballC)
txt <- c("TV och vara med kompisar.",
"Jobba på kompis huset",
"Ta det lugnt, umgås med kompisar.",
"Umgås med kompisar, vänner ",
"kolla anime med kompiserna")
data_frame(txt = txt) %>%
unnest_tokens(word, txt) %>%
mutate(word = wordStem(word, "swedish"))
The wordStem function is from the SnowballC package, which supports multiple languages; see getStemLanguages().
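If you are unsure whether your language is covered, you can check directly (small illustrative snippet, my addition):
library(SnowballC)
getStemLanguages()   # names of all stemmers bundled with SnowballC, including "swedish"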
I am working on a text mining assignment and am stuck at the moment. The following is based on Zhao's Text Mining with Twitter. I cannot get it to work; maybe one of you has a good idea?
Goal: I would like to remove all terms from the corpus with a word count of one instead of using a stopword list.
What I did so far: I have downloaded the tweets and converted them into a data frame.
tf1 <- Corpus(VectorSource(tweets.df$text))
tf1 <- tm_map(tf1, content_transformer(tolower))
removeUser <- function(x) gsub("#[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeUser))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeNumPunct))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeURL))
tf1 <- tm_map(tf1, stripWhitespace)
# Using TermDocumentMatrix in order to find terms with count 1; don't know any other way
tdmtf1 <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))
ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)
tf1Copy <- tf1
tf1List <- setdiff(tf1Copy, ones)
tf1CList <- paste(unlist(tf1List),sep="", collapse=" ")
tf1Copy <- tm_map(tf1Copy, removeWords, tf1CList)
tdmtf1Test <- TermDocumentMatrix(tf1Copy, control = list(wordLengths = c(1, Inf)))
#Just to test success...
ones2 <- findFreqTerms(tdmtf1Test, lowfreq = 1, highfreq = 1)
(ones2)
The Error:
Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(senior data scientist global strategy firm
25.0010230541229 48 17 6 6 115 1 186 0 1 en kdnuggets poll primary programming language for analytics data mining data scienc
25.0020229816437 48 17 6 6 115 1 186 0 2 en iapa canberra seminar mining the internet of everything official statistics in the information age anu june 25.0020229816437 48 17 6 6 115 1 186 0 3 en handling and processing strings in r an ebook in pdf format pages
25.0020229816437 48 17 6 6 115 1 186 0 4 en webinar getting your data into r by hadley wickham am edt june th
25.0020229816437 48 17 6 6 115 1 186 0 5 en before loading the rdmtweets dataset please run librarytwitter to load required package
25.0020229816437 48 17 6 6 115 1 186 0 6 en an infographic on sas vs r vs python datascience via
25.0020229816437 48 17 6 6 115 1 186 0 7 en r is again the kdnuggets poll on top analytics data mining science software
25.0020229816437 48 17 6 6 115 1 186 0 8 en i will run
In Addition:
Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error
'regular expression is too large'
at ''
PS: sorry for the bad formatting at the end; I could not get it fixed.
Here's a way to remove all terms from the corpus with a word count of one:
library(tm)
mytweets <- c("This is a doc", "This is another doc")
corp <- Corpus(VectorSource(mytweets))
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is another doc
## ^^^
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
# Terms
# Docs another doc this
# 1 0 1 1
# 2 1 1 1
(stopwords <- findFreqTerms(dtm, 1, 1))
# [1] "another"
corp <- tm_map(corp, removeWords, stopwords)
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is doc
## ^ 'another' is gone
(As a side note: The token 'a' from 'This is a...' is gone, too, because DocumentTermMatrix cuts out tokens with a length < 3 by default.)
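If you want to keep those short tokens as well, you can override that default when building the matrix (a small sketch of my own, using the same control option as in your TermDocumentMatrix call above):
dtm_all <- DocumentTermMatrix(Corpus(VectorSource(mytweets)),
                              control = list(wordLengths = c(1, Inf)))
inspect(dtm_all)
# now "a" and "is" show up as terms as well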
Here's a simpler method using the dfm() and trim() functions from the quanteda package:
require(quanteda)
mydfm <- dfm(c("This is a doc", "This is another doc"), verbose = FALSE)
mydfm
## Document-feature matrix of: 2 documents, 5 features.
## 2 x 5 sparse Matrix of class "dfmSparse"
## features
## docs a another doc is this
## text1 1 0 1 1 1
## text2 0 1 1 1 1
trim(mydfm, minCount = 2)
## Features occurring less than 2 times: 2
## Document-feature matrix of: 2 documents, 3 features.
## 2 x 3 sparse Matrix of class "dfmSparse"
## features
## docs doc is this
## text1 1 1 1
## text2 1 1 1
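In current versions of quanteda, trim() has been replaced by dfm_trim(), so the equivalent call would be roughly:
dfm_trim(mydfm, min_termfreq = 2)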