Create dfm step by step with quanteda - r

I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures.
I would like this sequence to be implemented:
1) remove the punctuation and numbers
2) remove stopwords (i.e. before the tokenization to avoid useless tokens)
3) tokenize using unigrams and bigrams
4) create the dfm
My attempt:
> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))
> class(text.corpus)
[1] "corpus" "list"
> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"
# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))
Bonus question
How do I remove sparse tokens in quanteda? (i.e. the equivalent of removeSparseTerms() in tm.)
UPDATE
In light of Ken's answer, here is the code to proceed step by step with quanteda:
library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’
1) Remove custom punctuation and numbers. E.g. notice the "\n" characters in the ie2010 corpus:
text.corpus <- ie2010Corpus
texts(text.corpus)[1] # Use texts() to extract the text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is
texts(text.corpus)[1] <- gsub("\\s"," ",text.corpus[1]) # replace all whitespace (incl. \n, \t, \r) with a single space
texts(text.corpus)[1]
2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
A further note on the reason why one may prefer to preprocess. My present corpus is in Italian, a language in which articles are joined to words with an apostrophe. Thus, a straight dfm() call can lead to inexact tokenization.
e.g.:
broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi")), removePunct=TRUE)
will produce two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional step with gsub(), sketched below.
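A minimal sketch of that gsub() step on a plain character vector (with a corpus, the result would be assigned back via texts(); txt here is just the example sentence from above):
txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
txt <- gsub("'", " ", txt)   # detach the article from the word
txt
# [1] "L abile presidente Renzi. Un abile mossa di Berlusconi"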
2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example, "l" and "un" have to be removed so that they do not produce misleading bigrams. This can be handled in tm with tm_map(..., removeWords); a regex-based equivalent is sketched below.
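Continuing the sketch above, a gsub()-based stand-in for tm_map(..., removeWords), with \b word boundaries around an assumed stopword list:
stopw <- c("l", "un")
txt <- gsub(paste0("\\b(", paste(stopw, collapse="|"), ")\\b"), "", txt, ignore.case=TRUE)
txt <- gsub("\\s+", " ", txt)   # squeeze the double spaces left behind
txt
# [1] " abile presidente Renzi. abile mossa di Berlusconi"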
3) Tokenization
token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
4) Create the dfm:
dfm <- dfm(token)
5) Remove sparse features
dfm <- trim(dfm, minCount = 5)

We designed dfm() not as a "black box" but more as a Swiss army knife that combines many of the options that typical users want to apply when converting their texts to a matrix of documents and features. However all of these options are also available through lower-level processing commands, should you wish to exert a finer level of control.
However one of the design principles of quanteda is that text only becomes "features" through the process of tokenisation. If you have a set of tokenised features that you wish to exclude, you must first tokenise your text, or you cannot exclude them. Unlike other text packages for R (e.g. tm), these steps are applied "downstream" from a corpus, so that the corpus remains an unprocessed set of texts to which manipulations will be applied (but will not itself be a transformed set of texts). The purpose of this is to preserve generality but also to promote reproducibility and transparency in text analysis.
In response to your questions:
You can, however, override our encouraged behaviour using the texts(myCorpus) <- replacement function, where what is assigned to the texts will override the existing texts. So you could use regular expressions to remove punctuation and numbers -- for example the stringi commands, using the Unicode classes for punctuation and numerals to identify patterns (a sketch follows).
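For instance, a hedged sketch with stringi, where \p{P} matches punctuation and \p{N} matches numerals (myCorpus stands for any quanteda corpus):
library(stringi)
texts(myCorpus) <- stri_replace_all_regex(texts(myCorpus), "[\\p{P}\\p{N}]+", " ")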
I would recommend you tokenise before removing stopwords. Stop "words" are tokens, so there is no way to remove these from the text before you tokenise the text. Even applying regular expressions to substitute them for "" involves specifying some form of word boundary in the regex - again, this is tokenisation.
To tokenise into unigrams and bigrams:
tokens(myCorpus, ngrams = 1:2)
To create the dfm, simply call dfm(myTokens). (You could also have applied step 3, for ngrams, at this stage; a combined sketch follows.)
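Put together, steps 2-4 might look like this, as a sketch against the current tokens-based API (on 0.9.8 the rough equivalents are tokenize() and removeFeatures()):
toks <- tokens(myCorpus, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))   # drop stopwords after tokenising
toks <- tokens_ngrams(toks, n = 1:2)           # unigrams and bigrams
myDfm <- dfm(toks)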
Bonus 1: collocations with n = 2 produce the same list as bigrams, just in a different format. Did you intend something else? (Separate SO question, perhaps?)
Bonus 2: See dfm_trim(x, sparsity = ). The removeSparseTerms() options are quite confusing to most people, but these were included for migrants from tm. See this post for a full explanation.
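For example, a sketch on the dfm from the previous step (min_termfreq is the current name for the older minCount/min_count argument):
myDfm <- dfm_trim(myDfm, min_termfreq = 5)   # keep features occurring at least 5 times overall
myDfm <- dfm_trim(myDfm, sparsity = 0.99)    # or drop the sparsest features, as removeSparseTerms(x, 0.99) would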
BTW: Use texts() instead of ie2010Corpus$documents$texts -- we will rewrite the object structure of a corpus soon, so you should not access its internals this way when there is an extractor function. (Also, this step is unnecessary - here you have simply recreated the corpus.)
Update 2018-01:
The new name for the corpus object is data_corpus_irishbudget2010, and the collocation scoring function is textstat_collocations().

Related

How can I dynamically get words surrounding a keyword?

I have a sentence that may contain keywords. I search for them, and if one is found, I want the word before and after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names() function, but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to function without directly inputting the keywords?
Final note: I do not completely understand the syntax used to get the surrounding words. Basic R books have not covered this. Can someone explain, please?
You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward; a sketch of doing so follows after this answer.
Hover the cursor over each element of the expression at this demo to obtain an explanation of its function.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.
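A sketch of forming that expression programmatically from cont and reading off the capture groups with regmatches() and gregexec() (gregexec() requires R >= 4.1; the pattern assumes lower-case words, as in the original expression):
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "she could not believe that production did not improve"
alts <- paste(gsub(" ", " +", cont), collapse = "|")          # "could +not|would +not|..."
pattern <- sprintf("\\b([a-z]+) +(?:%s) +([a-z]+)\\b", alts)
m <- regmatches(text, gregexec(pattern, text, perl = TRUE))[[1]]
m[2, ]   # preceding words: "she" "production"
m[3, ]   # following words: "believe" "improve"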

remove/replace/gsub all words matching list in df (size 6001) from df (size 29175) in r, different size dataframes

I have been trying to remove from dfmedia (size 29,175) any words matching those contained in dfvocab (size 6,001).
dfmedia: each row is a sentence of words in chinese.
我喜歡吃蘋果; 我愛吃饅頭; 我不喜歡菠菜; 我最討厭蘋果!;我很愛菠菜啊;哪個中國人敢不喜歡饅頭?;哎呀饅頭蘋果菠菜都是食物管人家喜歡否?
dfvocab: 蘋果,饅頭,菠菜
desired result: 我喜歡吃; 我愛吃; 我不喜歡; 我最討厭!;我很愛啊;哪個中國人敢不喜歡?;哎呀都是食物管人家喜歡否?
I don't think the results will be any different in Chinese or English since it is a simple match and remove/replace, but I'm including the Chinese here just in case, since my original data is Chinese.
I have tried gsub(), mapply(), and using stringr to bind dfmedia and dfvocab together into one data frame and remove from there. However, since dfvocab and dfmedia are different sizes, I am unsure how to approach this with the methods suggested online.
Any help would be really appreciated!
It's pretty straightforward with gsub(). Just paste0() all the vocab together with the regex OR operator (|) and replace with "".
> gsub(paste0(dfvocab, collapse="|"), "", dfmedia)
[1] "我喜歡吃" " 我愛吃" " 我不喜歡" " 我最討厭!" "我很愛啊" "哪個中國人敢不喜歡?"
[7] "哎呀都是食物管人家喜歡否"
(I do not speak or read Chinese.) I imagine that with such a large vocab set to be deleted you might need to break the 6,000 vocab words into chunks, and I suspect it will be slow. You might want to look at the tm package, since text mining is a task that would require such operations to be optimized; a chunking sketch follows.
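A sketch of that chunking idea (the batch size of 500 is arbitrary; if any vocab entries contained regex metacharacters they would need escaping first):
chunks <- split(dfvocab, ceiling(seq_along(dfvocab) / 500))
for (ch in chunks) {
  dfmedia <- gsub(paste0(ch, collapse = "|"), "", dfmedia)   # one alternation per batch
}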
Here's a way to build a reproducible example:
> dfmedia <- scan(text="我喜歡吃蘋果; 我愛吃饅頭; 我不喜歡菠菜; 我最討厭蘋果!;我很愛菠菜啊;哪個中國人敢不喜歡饅頭?;哎呀饅頭蘋果菠菜都是食物管人家喜歡否", what="", sep=";")
Read 7 items
>
> dfvocab <- scan(text="蘋果,饅頭,菠菜", what="", sep=",")
Read 3 items

Replacing all semicolons with a space pt2

I'm trying to run text analysis on a list of 2,000+ rows of keywords, but they are listed like
"Strategy;Management Styles;Organizations"
So when I use tm to remove punctuation it becomes
"StrategyManagement StylesOrganizations"
and I assume this somehow breaks my frequently-used-terms analysis.
I've tried using
vector<-gsub(';', " ",vector)
but this takes my vector data ("List of 2000") and makes it a value with the description "Large character (3 elements)"; when I inspected this value it gave me a really long list of words and stuff which took forever to load! Any ideas what I'm doing wrong?
Should I use gsub on my vector or on my corpus? They are just
vector<-VectorSource(dataset$Keywords)
corpus<-VCorpus(vector)
I tried using
inspect(corpus[[1]])
on my corpus after using gsub to make it a value, but I got the error "no applicable method for 'inspect' applied to an object of class "character"".
You need to split the data into a vector of strings; one way to do this is with the stringr package, as follows:
library(tm)
library(stringr)
vector <- c("Strategy;Management Styles;Organizations")
keywords <- unlist(stringr::str_split(vector, ";"))
vector <- VectorSource(keywords)
corpus <- VCorpus(vector)
inspect(corpus[[1]])
#<<PlainTextDocument>>
# Metadata: 7
#Content: chars: 8
#Strategy
Maybe you can try strsplit
X <- c("Global Mindset;Management","Auditor;Accounting;Selection Process","segmantation;banks;franchising")
res <- Map(function(v) unlist(strsplit(v,";")),X)
such that
> res
$`Global Mindset;Management`
[1] "Global Mindset" "Management"
$`Auditor;Accounting;Selection Process`
[1] "Auditor" "Accounting" "Selection Process"
$`segmantation;banks;franchising`
[1] "segmantation" "banks" "franchising"

Alignment of multiple (non-biological, discrete state) sequences

I have some data that describes an ordered set of discrete events (or states). There are 34 possible states, which may occur in any order and may repeat. Each sequence of events can contain any number of events, and crucially there are more than 2 sequences of events. My eventual aim is to cluster these sequences into similar subsets, but my hunch is that this cannot be meaningful unless these sequences are aligned such that equivalent events occupy the same position in all sequences.
I'm very familiar with multiple alignment of biological sequences, but all the software I've come across for this (MUSCLE, MAFFT, T-COFFEE, Clustal*, etc) require DNA, RNA or AA sequences, and I have more states than any of these, so I can't get them to work.
I've found various implementations of the pairwise alignment algorithms such as Needleman-Wunsch in R, but so far haven't come across any generic (non-biological) implementations of any multiple sequence alignment algorithms.
For example, say my data looks like this:
1: ABCDEFG
2: ACDGH
3: BDEFEGI
4: AH
5: DEGHI
My aim is to have it look like this:
1: ABCDEF-G--
2: A-CD---GH-
3: -B-DEFE--I
4: A-------H-
5: ---DE--GHI
Where the - symbol denotes the absence of an event in this sequence. This is a simplified example, in reality I'm looking for something that penalises the opening of gaps (-) in the same way that biological sequence MSA algorithms do.
The only piece of software I've found that seems to possibly do this is Alphamalig (http://alggen.lsi.upc.es/recerca/align/alphamalig/intro-alphamalig.html) but it's old and I can't get it working on my machine. Ideally I'd like something that can be implemented in R.
I would advise using MAFFT sequence alignment. Typically, this is used to align biological sequences, but it has the option to align text using --anysymbol. Note that MAFFT is a bash script and requires an input/output file.
input file (mafft_anysymbol_input.txt):
>Seq1
ABCDEFG
>Seq2
ACDGH
>Seq3
BDEFEGI
>Seq4
AH
>Seq5
DEGHI
R code to run bash script:
#Be sure that input/output and R files share the same path, otherwise you'll have to specify the path in the mafft script call.
x <- 'mafft --anysymbol mafft_anysymbol_input.txt > mafft_anysymbol_output.txt'
system(x)
Contents of output file (mafft_anysymbol_output.txt):
>Seq1
ABCDEFG--
>Seq2
-ACDGH---
>Seq3
--BDEFEGI
>Seq4
----AH---
>Seq5
---DEGHI-
Edit - I see now that you are familiar with biological alignment tools. If you want to make a customized scoring matrix for your text alignments, check out the mafft options --text and --textmatrix. They require ASCII-code input (extra data-type conversions), but you would have the option of associating similar letters (however you choose to define similar) by score.
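A sketch of driving this from R end to end: write the FASTA-style input from a named character vector, call mafft, and read the alignment back (this assumes the mafft executable is on the PATH and that the short output sequences are not line-wrapped):
seqs <- c(Seq1 = "ABCDEFG", Seq2 = "ACDGH", Seq3 = "BDEFEGI", Seq4 = "AH", Seq5 = "DEGHI")
writeLines(paste0(">", names(seqs), "\n", seqs), "mafft_anysymbol_input.txt")   # ">name" header, then the sequence
system("mafft --anysymbol mafft_anysymbol_input.txt > mafft_anysymbol_output.txt")
aln <- readLines("mafft_anysymbol_output.txt")
aln[!startsWith(aln, ">")]   # the aligned sequences, with "-" marking gaps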
Assuming that we need to match against LETTERS, one option is str_match(), then change the NAs to "-" and paste:
library(stringr)
library(tidyr)   # replace_na() used below comes from tidyr
f1 <- Vectorize(function(x) str_match(x, LETTERS))
out1 <- f1(v1)
do.call(paste0, as.data.frame(t(replace_na(out1[!!rowSums(!is.na(out1)),], '-'))))
#[1] "ABCDEFG--" "A-CD--GH-" "-B-DEFG-I" "A------H-" "---DE-GHI"
It can also be done with match() after splitting:
lst <- strsplit(v1, "")
mx <- match(max(sapply(lst, tail, 1)), LETTERS)
sapply(lst, function(x) paste(replace_na(x[match(LETTERS[seq_len(mx)], x)], '-'), collapse=""))
data
v1 <- c("ABCDEFG", "ACDGH", "BDEFEGI", "AH", "DEGHI")

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

I have a huge corpus, and I'm interested in only appearance of a handful of terms that I know up front. Is there a way to create a term document matrix from the corpus using the tm package, where only terms I specify up front are to be used and included?
I know I can subset the resultant TermDocumentMatrix of the corpus, but I want to avoid building the full term document matrix to start with, due to memory size constraint.
You can modify a corpus to keep only the terms you want by building a custom transformation function. See the Vignette for the tm package and the help for the content_transformer function for more information:
library(tm)
# Create a corpus from the text listed below
corp = VCorpus(VectorSource(doc))
# Custom function to keep only the terms in "pattern" and remove everything else
(f <- content_transformer(function(x, pattern)
regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))
(FYI, the second line of code just above is adapted from this SO answer.)
# The pattern we'll search for
keep = "sleep|dream|die"
# Run the transformation function using the pattern above
tm_map(corp, f, keep)[[1]]
Here's the result of running the transformation function:
<<PlainTextDocument (metadata: 7)>>
c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
Here's the original text I used to create the corpus:
doc = "To be, or not to be, that is the question—
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub"
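If the goal is only to restrict which terms end up in the matrix, the term-document matrix constructors in tm also accept a dictionary entry in their control list, which tabulates only the supplied terms; a minimal sketch on the same corp as above:
tdm <- TermDocumentMatrix(corp, control = list(dictionary = c("sleep", "dream", "die")))
inspect(tdm)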
Another way of filtering a corpus:
First assign a value to the meta part, say language, by looping over the elements of the corpus with a variable i and checking whatever you want; then filter using this meta attribute.
corpusz[[i]]$meta["language"] <- 'tur'
idx <- meta(corpusz, "language") == 'tur'
filtered <- corpusz[idx]
Now filtered contains only the corpus elements we want.
