Quanteda: applying Yoshikoder dictionary with multiple levels - r

I use quanteda for my quantitative text analysis with a dictionary-based approach. I am building my own dictionary with Lowe's Yoshikoder. I can apply my Yoshikoder dictionary with quanteda (see below) - yet, the function only accounts for the first level of the dictionary. I need to see all values of each category, including all subcategories (at least 4 levels). How can I do this?
# load my Yoshikoder dictionary with multiple levels
mydict <- dictionary(file = "mydictionary.ykd",
                     format = "yoshikoder", concatenator = "_", tolower = TRUE, encoding = "auto")
# apply dictionary
mydfm <- dfm(mycorpus, dictionary = mydict)
mydfm
# problem: shows only results for the first level of the dictionary

dfm_lookup (and tokens_lookup) have a levels argument whose default is 1:5. Try applying the lookup separately:
mydfm <- dfm(mycorpus)
dfm_lookup(mydfm, dictionary = mydict)
or:
mytoks <- tokens(mycorpus)
mytoks <- tokens_lookup(mytoks, dictionary = mydict)
dfm(mytoks)
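If you need to control how many nesting levels are collapsed, the levels argument can also be passed explicitly. A minimal sketch reusing the objects above:
# keep keys from the first four nesting levels (the default, levels = 1:5,
# already covers dictionaries up to five levels deep)
dfm_lookup(dfm(mycorpus), dictionary = mydict, levels = 1:4)
# or on tokens, before building the dfm:
tokens_lookup(tokens(mycorpus), dictionary = mydict, levels = 1:4)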
Update:
Fixed now in v0.9.9.55.
> library(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.55
# Using 7 of 8 cores for parallel computing
> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd")
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE)
# Creating a dfm from a corpus ...
# ... tokenizing texts
# ... lowercasing
# ... found 14 documents, 5,058 features
# ... applying a dictionary consisting of 19 keys
# ... created a 14 x 19 sparse dfm
# ... complete.
# Elapsed time: 0.422 seconds.
> mydict
# Dictionary object with 9 primary key entries and 2 nested levels.
# - Economy:
# - +State+:
# - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow*
# - =State=:
# - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work*
# - -State-:
# - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher*
# - Institutions:
# - Radical:
# - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice*
# - Neutral:
# - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote*
# - Conservative:
# - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike*
# - Values:
# - Liberal:
# - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex*
# - Conservative:
# - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition*
# - Law and Order:
# - Liberal:
# - harassment, non-custodial
# - Conservative:
# - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan*
# - Environment:
# - Pro:
# - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
#
# ...

While that fix makes its way into a quanteda release, you can try this replacement function, which collapses the nested categories into single keys:
library(xml2)
read_dict_yoshikoder <- function(path, sep = ">") {
  doc <- xml2::read_xml(path)
  # each <pnode> is a dictionary pattern; its ancestors are the (nested) categories
  pats <- xml2::xml_find_all(doc, ".//pnode")
  pnode_names <- xml2::xml_attr(pats, "name")
  # build "Level1>Level2>..." key paths from each pattern's ancestor names
  get_pnode_path <- function(pn) {
    pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
    paste0(rev(na.omit(pars)), collapse = sep)
  }
  pnode_paths <- lapply(pats, get_pnode_path)
  # group patterns by their flattened key path and build a quanteda dictionary
  lst <- split(pnode_names, unlist(pnode_paths))
  dictionary(lst)
}
Usage:
read_dict_yoshikoder("laver-garry-ajps.ykd")
Dictionary object with 19 key entries.
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre*
- Laver and Garry>Culture>Popular: media
- Laver and Garry>Culture>Sport: angler*
- Laver and Garry>Environment>Con: produc*
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci*
...
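The flattened dictionary can then be applied with dfm_lookup() like any other; a short sketch using the same file and corpus as above:
mydict_flat <- read_dict_yoshikoder("laver-garry-ajps.ykd")
mydfm <- dfm(tokens(data_corpus_irishbudget2010))
dfm_lookup(mydfm, dictionary = mydict_flat)
# one feature per flattened key, e.g. "Laver and Garry>Culture>High"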

Related

how to calculate R1 (lexical richness index) in R?

Hi, I need to write a function to calculate R1, which is defined as follows:
R1 = 1 - (F(h) - h^2 / (2N))
where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequency up to that point. Using the quanteda package I managed to calculate the Hirsch point:
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", " The US president accused Palestinians of lacking “appreciation or respect.", "To create my data I had to chunk each text in an increasing manner.", "Therefore, the input is a list of chunked texts within another list.")
a3 <- c("We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
txt <-list(a,a1,a2,a3)
library(quanteda)
DFMs <- lapply(txt, dfm)
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
s_p <- function(x) {split(x, x$group)}
tstat_by <- lapply(Fs, s_p)
h_values <- lapply(tstat_by, vapply, get_h_point, double(1))
To calculate F(h), the cumulative relative frequencies up to h_point, to put in R1, I need two values: one of them needs to come from Fs$rank and the other from h_values. Consider the first three original texts (tstat_by[[1]], tstat_by[[2]], and tstat_by[[3]]) and their respective h_values (h_values[[1]], h_values[[2]], and h_values[[3]]):
fh_txt1 <- tail(prop.table(cumsum(tstat_by[[1]][["text1"]]$rank:h_values[[1]][["text1"]])), n=1)
fh_txt2 <-tail(prop.table(cumsum(tstat_by[[1]][["text2"]]$rank:h_values[[1]][["text2"]])), n=1)
...
tail(prop.table(cumsum(tstat_by[[4]][["text2"]]$rank:h_values[[4]][["text2"]])), n=1)
[1] 1
tail(prop.table(cumsum(tstat_by[[4]][["text3"]]$rank:h_values[[4]][["text3"]])), n=1)
[1] 0.75
As you can see, the grouping is the same: the docnames for each chunk of the original character vectors are the same (text1, text2, text3, etc.). My question is how to write a function for fh_txt(s) so that using lapply becomes an option to calculate F(h) for R1.
Please note that the goal is to write a function to calculate R1, and what I've put here is what has been done in this regard.
I've simplified your inputs below, and used the groups argument in textstat_frequency() instead of your approach to creating lists of dfm objects.
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.")
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfmat <- c(a, a1, a2) %>%
  tokens() %>%
  dfm()
tstat <- quanteda.textstats::textstat_frequency(dfmat, groups = docnames(dfmat), ties = "first")
tstat_by <- split(tstat, tstat$group)
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
h_values <- vapply(tstat_by, get_h_point, double(1))
h_values
## text1 text2 text3
## 2.000014 1.500000 2.000024
tstat_by <- lapply(
  names(tstat_by),
  function(x) subset(tstat_by[[x]], cumsum(rank) <= h_values[[x]])
)
do.call(rbind, tstat_by)
## feature frequency rank docfreq group
## 1 the 2 1 1 text1
## 29 the 2 1 1 text2
## 48 the 3 1 1 text3
You didn't specify what you wanted for the output, but with this result you should be able to compute your own R1, either on the list using lapply() or on the combined data.frame using, for instance, dplyr.
Created on 2021-04-05 by the reprex package (v1.0.0)
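From here, one possible way to finish the R1 calculation, reading the formula as R1 = 1 - (F(h) - h^2/(2N)); this is only a sketch using the objects above (tstat, h_values), and you should check that F(h) is computed the way you intend:
get_R1 <- function(freqtab, h) {
  N <- sum(freqtab$frequency)                          # total tokens in the document
  Fh <- sum(freqtab$frequency[freqtab$rank <= h]) / N  # cumulative relative frequency up to h
  1 - (Fh - h^2 / (2 * N))
}
tstat_by <- split(tstat, tstat$group)  # re-split, since tstat_by was overwritten above
mapply(get_R1, tstat_by, h_values[names(tstat_by)])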

tokens_compound() in quanteda changes the order of features

I found that tokens_compound() in quanteda changes the order of tokens across different R sessions. That is, the result varies every time after restarting a session, even if a seed value is fixed, though it does not change within a single session.
Here is the replication procedure:
Find collocations, compound tokens, and save them.
library(quanteda)
set.seed(12345)
data(data_corpus_inaugural)
toks <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE,
         remove_symbol = TRUE,
         padding = TRUE) %>%
  tokens_tolower()
col <- toks %>%
  textstat_collocations()
toks.col <- toks %>%
  tokens_compound(pattern = col[col$z > 3])
write(attr(toks.col, "types"), "col1.txt")
End and restart R session and run the above code again with "col1.txt" replaced by "col2.txt".
Compare the two sets of tokens and find they are different.
col1 <- read.table("col1.txt")
col2 <- read.table("col2.txt")
identical(col1$V1, col2$V1) # This should return FALSE.
col1$V1[head(which(col1$V1 != col2$V1))]
col2$V1[head(which(col1$V1 != col2$V1))]
This does not matter in many cases, but the result of LDA (via {topicmodels}) changes across sessions. I suspect the token order is the cause, because the LDA result is constant if I reset the order of features in the tokens via as.list() and then as.tokens() (dfm_sort() does not work for this).
I wonder whether this happens only for me (Ubuntu 18.04.5, R 4.0.4, and quanteda 2.1.2) and would be happy to hear of another (easier) solution.
Updated on Feb 20
For example, the output of LDA is not reproducible:
library(topicmodels)
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  lis[[i]] <- tokens_compound(toks, pattern = col[col$z > 3]) %>%
    dfm() %>%
    convert(to = "topicmodels") %>%
    LDA(k = 5,
        method = "Gibbs",
        control = list(seed = 12345,
                       iter = 100))
}
head(lis[[1]]@gamma)
head(lis[[2]]@gamma)
An interesting investigation, but this is neither an error nor anything to be concerned with. Within a quanteda tokens object, the types are not determinate in order after a processing step such as tokens_compound(). This is because that function is parallelised in C++, and how those threads operate is not fixed by set.seed() from R. But this will not affect the important part, which is the set of types, or anything about the tokens themselves. If you want the order of the types that you extract to be the same, then you should sort them upon extraction.
library("quanteda")
## Package version: 2.1.2
toks <- data_corpus_inaugural %>%
  tokens(
    remove_punct = TRUE,
    remove_symbol = TRUE,
    padding = TRUE
  ) %>%
  tokens_tolower()
col <- quanteda.textstats::textstat_collocations(toks)
It turns out that you do not need to save the output or restart R - this happens within a single session.
# types are differently indexed, but are the same set
lis <- list()
for (i in seq_len(2)) {
set.seed(123)
toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
lis <- c(lis, list(types = types(toks.col)))
}
dframe <- data.frame(lis)
sum(dframe$types != dframe$types.1)
## [1] 19898
head(dframe[dframe$types != dframe$types.1, ])
## types types.1
## 8897 at_this_second my_fellow_citizens
## 8898 to_take_the_oath_of_the_presidential_office no_people
## 8899 there_is on_earth
## 8900 occasion_for cause_to_be_thankful
## 8901 an_extended this_is_said
## 8902 there_was spirit_of
However the (unordered) set of types is identical:
# but
setequal(dframe$types, dframe$types.1)
## [1] TRUE
More importantly, when we compare the values of each token, which are ordered, these are identical:
# tokens are the same
lis <- list()
for (i in seq_len(2)) {
set.seed(123)
toks.col <- tokens_compound(toks, pattern = col[col$z > 3])
lis <- c(lis, list(toks = as.character(toks.col)))
}
dframe <- data.frame(lis)
all.equal(dframe$toks, dframe$toks.1)
## [1] TRUE
Created on 2021-02-18 by the reprex package (v1.0.0)
An additional comment, whose importance is underscored by this analysis: We strongly discourage direct access to object attributes. Use types(x) as above, not attr(x, "types"). The former will always work. The latter relies on our implementation of the object, which may change as we improve the package.
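To illustrate those last two points, a small sketch (not from the original answer): sorting the types upon extraction makes the vector identical across runs, and the alphabetical reordering of the dfm columns at the end is my own assumption about one way to get a reproducible input for LDA.
types_run1 <- sort(types(tokens_compound(toks, pattern = col[col$z > 3])))
types_run2 <- sort(types(tokens_compound(toks, pattern = col[col$z > 3])))
identical(types_run1, types_run2)  # should now be TRUE
# for a dfm with a stable feature order, reorder the columns alphabetically
dfmat <- dfm(tokens_compound(toks, pattern = col[col$z > 3]))
dfmat <- dfmat[, order(featnames(dfmat))]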

randomly split dictionary into n parts

I have a quanteda dictionary I want to randomly split into n parts.
dict <- dictionary(list(positive = c("good", "amazing", "best", "outstanding", "beautiful", "wonderf*"),
                        negative = c("bad", "worst", "awful", "atrocious", "deplorable", "horrendous")))
I have tried using the split function like this: split(dict, f = factor(3)), but was not successful.
I would like to get three dictionaries back, but instead I get:
$`3`
Dictionary object with 2 key entries.
- [positive]:
- good, amazing, best, outstanding, beautiful, wonderf*
- [negative]:
- bad, worst, awful, atrocious, deplorable, horrendous
EDIT
I have included a different entry containing * in the dictionary. The solution suggested by Ken Benoit throws an error in this case but works perfectly fine otherwise.
The desired output is something like this:
> dict_1
Dictionary object with 2 key entries.
- [positive]:
- good, wonderf*
- [negative]:
- deplorable, horrendous
> dict_2
Dictionary object with 2 key entries.
- [positive]:
- amazing, best
- [negative]:
- bad, worst
> dict_3
Dictionary object with 2 key entries.
- [positive]:
- outstanding, beautiful
- [negative]:
- awful, atrocious
In case the number of entries cannot be divided by n without a remainder, I have no fixed specification, but ideally I would be able to decide between (i) keeping the 'remainder' separately or (ii) distributing all values (which results in some splits being slightly larger).
There is a lot unspecified in the question since with dictionary keys of different lengths it's unclear how this should be handled, and since there is no pattern to the pairs in your expected answer.
Here, I've assumed you have keys of equal length, divisible by the split without a remainder, and that you want to split it in running, adjacent intervals for each dictionary key.
This should do it.
library("quanteda")
## Package version: 1.5.1
dict <- dictionary(
  list(
    positive = c("good", "amazing", "best", "outstanding", "beautiful", "delightful"),
    negative = c("bad", "worst", "awful", "atrocious", "deplorable", "horrendous")
  )
)
dictionary_split <- function(x, len) {
  maxlen <- max(lengths(x)) # change to minimum to avoid recycling
  subindex <- split(seq_len(maxlen), ceiling(seq_len(maxlen) / len))
  splitlist <- lapply(subindex, function(y) lapply(x, "[", y))
  names(splitlist) <- paste0("dict_", seq_along(splitlist))
  lapply(splitlist, dictionary)
}
dictionary_split(dict, 2)
## $dict_1
## Dictionary object with 2 key entries.
## - [positive]:
## - good, amazing
## - [negative]:
## - bad, worst
##
## $dict_2
## Dictionary object with 2 key entries.
## - [positive]:
## - best, outstanding
## - [negative]:
## - awful, atrocious
##
## $dict_3
## Dictionary object with 2 key entries.
## - [positive]:
## - beautiful, delightful
## - [negative]:
## - deplorable, horrendous
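Since the question asked for a random split, here is one possible variant, only a sketch: it shuffles each key's values once and then deals them out round-robin into n parts. It assumes every key has at least n values, and I have not checked whether it avoids the error with glob patterns mentioned in the edit.
dictionary_split_random <- function(x, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # shuffle the values of each key once
  shuffled <- lapply(x, function(vals) sample(unlist(vals)))
  # deal the shuffled values round-robin into n groups
  splitlist <- lapply(seq_len(n), function(i) {
    lapply(shuffled, function(vals) vals[seq(i, length(vals), by = n)])
  })
  names(splitlist) <- paste0("dict_", seq_len(n))
  lapply(splitlist, dictionary)
}
dictionary_split_random(dict, 3, seed = 42)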

Slow wordcloud in R

I'm trying to create a word cloud from a 300MB .csv file of text, but it's taking hours on a decent laptop with 16GB of RAM. I'm not sure how long this should typically take... but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow yet with a small subset to figure out which steps are taking the most time? (See the "Profiling with RStudio" article.)
If not, that should be your first step.
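For example, a rough way to time each stage on a small subset; the object names follow your script, the 5,000-row cut is arbitrary, and the VectorSource(data.frame) call simply mirrors the question's code:
dfSmall <- dfTemplate[1:5000, , drop = FALSE]
system.time(corp <- Corpus(VectorSource(dfSmall)))
system.time(corp <- tm_map(corp, removeWords, stopwords("english")))
system.time(corp <- tm_map(corp, stripWhitespace))
system.time(corp <- tm_map(corp, removePunctuation))
system.time(dtm  <- TermDocumentMatrix(corp))
system.time(m    <- as.matrix(dtm))   # converting to a dense matrix is often the memory hog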
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do contribute to some additional processing time, but I feel like that's an acceptable trade-off for the sanity of not having dozens of nested parentheses.
library(data.table)
library(stringi)
library(parallel)
library(magrittr)  # for the %>% pipe used below
## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
  InputText %>%
    stri_enc_toascii(.) %>%
    toupper(.) %>%
    stri_replace_all_regex(., "[[:cntrl:]]", " ") %>%
    stri_replace_all_regex(., "[[:punct:]]", " ") %>%
    stri_replace_all_regex(., "[[:space:]]+", " ") %>% ## Replace multiple spaces with a single space
    stri_replace_all_regex(., "^[[:space:]]+|[[:space:]]+$", "") %>% ## Remove leading and trailing spaces
    stri_replace_all_regex(., "\\b" %s+% StopWords %s+% "\\b", "", vectorize_all = FALSE) %>% ## Stopwords
    stri_replace_all_regex(., "\\b" %s+% Words %s+% "\\b", NewWords, vectorize_all = FALSE) ## Replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able to realize some big time savings with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single threaded. I'm not sure if this was the same for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
  cl <- parallel::makeCluster(NCores)
  clusterEvalQ(cl, library(magrittr))
  clusterEvalQ(cl, library(stringi))
  clusterExport(cl, list("textCleaner",
                         "GlobalStopWords",
                         "GlobalOldWords",
                         "GlobalNewWords"))
  Text <- as.character(unlist(parallel::parLapply(cl, Text,
                                                  fun = function(x) textCleaner(x,
                                                                                GlobalStopWords,
                                                                                GlobalOldWords,
                                                                                GlobalNewWords))))
  parallel::stopCluster(cl)
  return(Text)
}
## Run it Parallel
system.time({
DT[,CleanedText := parallelCleaner(Text = Text,
NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: I mentioned below that Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN recently that might be worth checking out: ngram Github Repository. It turns out I should have just tried it before explaining the really cumbersome process of building a command line tool from source; it would have saved me a lot of time had it been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, the ngram package clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)
system.time({
ng1 <- ngram::ngram(DT[["CleanedText"]],n = 1)
ng2 <- ngram::ngram(DT[["CleanedText"]],n = 2)
ng3 <- ngram::ngram(DT[["CleanedText"]],n = 3)
pt1 <- setDT(ngram::get.phrasetable(ng1))
pt1[,Ngrams := 1L]
pt2 <- setDT(ngram::get.phrasetable(ng2))
pt2[,Ngrams := 2L]
pt3 <- setDT(ngram::get.phrasetable(ng3))
pt3[,Ngrams := 3L]
pt <- rbindlist(list(pt1,pt2,pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on Github here) by Zheyuan Yu, a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6, that takes pre-processed text files off disk and generates a .csv output with ngram frequencies.
You'll need to build it from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was able to generate 5-grams from ~700MB of text input in well under an hour; the CSV result with all the output was a 2.9 GB file with 93 million rows.
Continuing the example above, I have a folder, ngrams-master, in my working directory that contains the ngrams executable built with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I may have made a couple of tweaks to get the output format how I wanted it. If you're interested, I can try to find the changes I made to generate .csv outputs that differ from the default and upload them to Github. (I did that project before I was familiar with the platform, so I don't have a good record of the changes I made; live and learn.)
Update 2: I created a fork on Github, msummersgill/ngrams, that reflects the slight tweaks I made to output results in a .CSV format. If someone were so inclined, I have a hunch that this could be wrapped up in an Rcpp-based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how Ternary Search Trees work, but they seem to be significantly more memory efficient and faster than any other N-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN; I haven't used it personally, but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but @Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]],GramSubset[["Frequency"]],color = GramSubset[["Frequency"]],
rot.per = 0.3,font.main=1, col.main="cornsilk3", cex.main=1.5)

Partial string matching & replacement in R

I have a dataframe like this
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Invezzstment LLC
4 Investment_LLC
5 Haiperloop LLC
6 Inwestment LLC
I need to match all these fuzzy strings, so the end result should look like this:
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Investment LLC
4 Investment LLC
5 Hyperloop LLC
6 Investment LLC
So, actually, I must solve a partial match-and-replace task for a categorical variable. There are a lot of great functions in base R and in packages for string matching, but I'm stuck finding a single solution for this kind of match-and-replace.
I don't care which occurrence replaces the other; for example, "Investment LLC" or "Invezzstment LLC" are both equally fine. I just need them to be consistent.
Is there any single all-in-one function or a loop for this?
If you have a vector of correct spellings, agrep makes this reasonably easy:
myDataFrame$company <- sapply(myDataFrame$company,
                              function(val){agrep(val,
                                                  c('Investment LLC', 'Hyperloop LLC'),
                                                  value = TRUE)})
myDataFrame
# company
# 1 Investment LLC
# 2 Hyperloop LLC
# 3 Investment LLC
# 4 Investment LLC
# 5 Hyperloop LLC
# 6 Investment LLC
If you don't have such a vector, you can likely make one with clever application of adist or even just table if the correct spelling is repeated more than the others, which it likely will be (though isn't here).
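For instance, a sketch of that adist idea: cluster the distinct spellings by edit distance and take the most frequent spelling in each cluster as the canonical one. The cutoff h = 4 is an arbitrary choice you would need to tune, and ties go to the first most frequent spelling.
vals <- myDataFrame$company
d <- adist(unique(vals))                   # pairwise edit distances between distinct spellings
rownames(d) <- unique(vals)
cl <- cutree(hclust(as.dist(d)), h = 4)    # group spellings within edit distance 4 (tune h)
canonical <- sapply(split(names(cl), cl), function(grp) {
  counts <- table(vals[vals %in% grp])     # most frequent spelling within the group
  names(counts)[which.max(counts)]
})
myDataFrame$company <- unname(canonical[as.character(cl[vals])])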
So, after some time I ended up with this dumb code. Attention: it does not fully automate the replacement process, because each time the proposed matches should be verified by a human, and each time the agrep max.distance argument needs fine-tuning. I am totally sure there are ways to make it better and quicker, but this can help to get the job done.
##########
# Manual renaming with partial matches
##########
# a) Take a look at the desired column of factor variables
sort(unique(MYDATA$names)) # take a look
# ****
Sensthreshold <- 0.2 # sensitivity of agrep, usually 0.1-0.2 get it right
Searchstring <- "Invesstment LLC" # what should I search?
# ****
# User-defined function: returns similar string on query in column
Searcher <- function(input, similarity = 0.1) {
  unique(agrep(input,
               MYDATA$names, # <-- define your column here
               ignore.case = TRUE, value = TRUE,
               max.distance = similarity))
}
# b) Make a search of desired string
Searcher(Searchstring, Sensthreshold) # using user-def function
### PLEASE INSPECT THE OUTPUT OF THE SEARCH
### Did it get it right?
#=============================================================================#
## ACTION! This changes your dataframe!
## Please make backup before proceeding
## Please execute this code as a whole to avoid errors
# c) Make a vector of cells indexes after checking output
vector_of_cells <- agrep(Searchstring,
                         MYDATA$names, ignore.case = TRUE,
                         max.distance = Sensthreshold)
# d) Apply the changes
MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING
# e) Check result
unique(agrep(Searchstring, MYDATA$names,
             ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
#=============================================================================#
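For what it's worth, steps (b)-(d) could also be wrapped into a single helper so the search and the replacement use the same settings. This is only a sketch that still leaves the inspection step to you, and fuzzy_rename is a made-up name:
fuzzy_rename <- function(column, target, similarity = 0.2) {
  hits <- agrep(target, column, ignore.case = TRUE, max.distance = similarity)
  message("Replacing: ", paste(unique(column[hits]), collapse = ", "))
  column[hits] <- target
  column
}
# inspect with Searcher() first, then apply:
MYDATA$names <- fuzzy_rename(MYDATA$names, Searchstring, similarity = Sensthreshold)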
