Let me explain what I want to do. I have corpus data (15 M words) about a political debate, and I want to find the co-occurrence of two terms within, say, 10,000 words of each other.
I create two vectors with the positions of the two terms "false" and "law":
false.v <- c(133844, 133880, 145106, 150995, 152516, 152557, 153697, 155507)
law.v <- c(48064, 155644, 251315, 297303, 323417, 349576, 368052, 543487)
Then I want to gather them into a matrix to see the co-occurrences, using the outer() function. The positions are taken from the same corpus, so I'm creating a matrix of differences:
distances <- outer(false.v, law.v, "-")
To make this easier to read, let's name the rows and columns:
rownames(distances) <- paste0("False", false.v)
colnames(distances) <- paste0("Law", law.v)
Okay, so the matrix is ready. To find which pairs of positions are within 10,000 words of each other, I just run:
abs(distances) <= 10000
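For example, the row and column indices of the qualifying pairs can be listed with:
which(abs(distances) <= 10000, arr.ind = TRUE)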
So I have to identify those moments in the political debate where there is a greater frequency of these co-occurrences. Here comes the problem: I have to do it with more than one pair of words (in fact with 5 pairs or so), so it would be great if I could search for multiple words at once instead of one pair at a time. So instead of searching "false" and "law", I would search "false OR lie OR whatever" against "law OR money OR whatever". I guess I have to use regex for this task, don't I? I have tried everything and nothing worked.
The example I just gave is a simplification. The command I use to find a word's positions works on a vector created from the corpus:
positions.law.v <- which(C1.corpus.v == "law")
So it would be great if I could just use something like:
which(C1.corpus.v == "law OR money OR prison OR ...")
which(C1.corpus.v == "false OR lie OR country OR ...")
It's like telling R: "hey, give me the co-occurrence positions of any possible combination between the first group of words (law or money or prison...) and the second one (false or lie or country...)". I hope I'm explaining it clearly. I'm sorry for the language mistakes. Thank you!
I have an extended answer here as well, but it could be as simple as:
library(dplyr)
mywords = c("law", "money", "prison", "false", "lie", "country")
which(C1.corpus.v %in% mywords)
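From there, the outer() trick from your question carries over directly; a minimal sketch, with the two word groups below just as placeholders:
group1 <- c("false", "lie", "country")
group2 <- c("law", "money", "prison")
pos1 <- which(C1.corpus.v %in% group1)
pos2 <- which(C1.corpus.v %in% group2)
distances <- outer(pos1, pos2, "-")
which(abs(distances) <= 10000, arr.ind = TRUE)  # pairs of positions within 10,000 words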
Try:
library(quanteda)
I'll use the election manifestos of 9 UK political parties from 2010:
data_char_ukimmig2010
Create a tokens object (there are lots of settings - check out https://quanteda.io/)
mytoks <- data_char_ukimmig2010 %>%
  char_tolower() %>%
  tokens()
mywords = c("law", "money", "prison", "false", "lie", "country")
kwic "return[s] a list of a keyword supplied by the user in its immediate context, identifying the source text and the word index number within the source text" source
mykwic <- kwic(mytoks, mywords)
kwic() builds a data frame with various features, one of which is the integer starting position of each keyword match (a starting position, because you can also use kwic to look for multi-word phrases):
mykwic$from
Gives us:
> mykwic$from
[1] 130 438 943 1259 1281 1305 1339 1356 1743 1836 1859 2126 2187 2443 2546 2640 2763 2952 3186 3270 179 8 201
[24] 343 354 391 498 16 131 552 14 29 388 80 306 487 507
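Those from positions can be plugged straight into the outer() approach from your question; a sketch, assuming two illustrative keyword groups:
kwic_a <- kwic(mytoks, c("false", "lie", "country"))
kwic_b <- kwic(mytoks, c("law", "money", "prison"))
dist_ab <- outer(kwic_a$from, kwic_b$from, "-")
which(abs(dist_ab) <= 10000, arr.ind = TRUE)  # co-occurrences within 10,000 tokens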
I think your problem is slightly more sophisticated than a regex can handle. For instance, you may want to include law, legal and legislation in one group but not lawless. A regex like \blaw.*\b wouldn't help you much. In effect, you are interested in:
Creating a feature co-occurrence matrix
Incorporating the semantic proximity of the words
Feature co-occurrence matrix
This is a well-established task, and I would encourage you to use a tested solution like quanteda's fcm() function. An example from the documentation:
txt <- "A D A C E A D F E B A C E D"
fcm(txt, context = "window", window = 2)
fcm(txt, context = "window", count = "weighted", window = 3)
fcm(txt, context = "window", count = "weighted", window = 3,
weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
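Applied to your own data, the idea would look something like the sketch below; the word list and window size are placeholders, as.tokens() converts your pre-tokenized vector, and fcm_select() trims the matrix to the features of interest. (For a window as wide as 10,000 tokens, the position-based approach above may be more practical.)
toks <- as.tokens(list(C1.corpus.v))
myfcm <- fcm(toks, context = "window", window = 10)
fcm_select(myfcm, pattern = c("law", "money", "prison", "false", "lie", "country"))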
Your regex
To suggest a solution to your particular problem: instead of this:
which(C1.corpus.v == "law OR money OR prison OR ...")
where
C1.corpus.v <- c("law", "word", "something","legal", "stuff")
you could do
grep(
pattern = paste("legal", "law", "som.*", sep = "|"),
x = C1.corpus.v,
perl = TRUE,
value = FALSE
)
where sep = "|" serves as your ...OR.... IMHO, this is not what you want as it does not address semantic similarity. I would suggest you have a look at some of the good tutorials that are available on the net 1,2.
[1] Taylor Arnold and Lauren Tilton, Basic Text Processing in R.
[2] Islam, Aminul, and Diana Inkpen (2008). "Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity." ACM Transactions on Knowledge Discovery from Data (TKDD), 2. doi:10.1145/1376815.1376819.
I'm trying to create a word cloud from a 300 MB .csv file of text, but it's taking hours on a decent laptop with 16 GB of RAM. I'm not sure how long this should typically take... but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow on a small subset of the data to figure out which steps are taking the most time? See Profiling with RStudio here.
If not, that should be your first step.
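A minimal sketch of what that could look like with base R's Rprof() on a small subset (assuming the text lives in the first column of dfTemplate; the RStudio profiler linked above works just as well):
## Profile the tm pipeline on the first 1,000 rows only
small <- Corpus(VectorSource(head(dfTemplate[[1]], 1000)))  # assumes the text is in column 1
Rprof("tm_profile.out")
small <- tm_map(small, removeWords, stopwords("english"))
small <- tm_map(small, stripWhitespace)
small <- tm_map(small, removePunctuation)
dtm_small <- TermDocumentMatrix(small)
Rprof(NULL)
head(summaryRprof("tm_profile.out")$by.total, 10)  # slowest calls by total time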
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do contribute some additional processing time, but I feel that's an acceptable trade-off for the sanity of not having dozens of nested parentheses.
library(data.table)
library(stringi)
library(parallel)
library(magrittr) ## needed for the %>% pipe used below
## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
  InputText %>%
    stri_enc_toascii(.) %>%
    toupper(.) %>%
    stri_replace_all_regex(.,"[[:cntrl:]]"," ") %>%
    stri_replace_all_regex(.,"[[:punct:]]"," ") %>%
    stri_replace_all_regex(.,"[[:space:]]+"," ") %>% ## Replace multiple spaces with a single space
    stri_replace_all_regex(.,"^[[:space:]]+|[[:space:]]+$","") %>% ## Remove leading and trailing spaces
    stri_replace_all_regex(.,"\\b"%s+%StopWords%s+%"\\b","",vectorize_all = FALSE) %>% ## Remove stopwords
    stri_replace_all_regex(.,"\\b"%s+%Words%s+%"\\b",NewWords,vectorize_all = FALSE) ## Apply word replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
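A quick single-string sanity check before running on the full table (a sketch; the expected result follows from the stopwords and replacements defined above):
textCleaner("Lorem ipsum, dolor sit amet!", GlobalStopWords, GlobalOldWords, GlobalNewWords)
# [1] "I DONT KNOW LATIN AMET"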
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able to get some big time savings with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes it take twice as long with 8 cores as it does single threaded. I'm not sure whether this was also true for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimizing can lead to more trouble than it's worth.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
cl <- parallel::makeCluster(NCores)
clusterEvalQ(cl, library(magrittr))
clusterEvalQ(cl, library(stringi))
clusterExport(cl, list("textCleaner",
"GlobalStopWords",
"GlobalOldWords",
"GlobalNewWords"))
Text <- as.character(unlist(parallel::parLapply(cl, Text,
fun = function(x) textCleaner(x,
GlobalStopWords,
GlobalOldWords,
GlobalNewWords))))
parallel::stopCluster(cl)
return(Text)
}
## Run it Parallel
system.time({
DT[,CleanedText := parallelCleaner(Text = Text,
NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: I mentioned below that Drew Schmidt and Christian Heckendorf recently submitted an R package named ngram to CRAN that might be worth checking out: ngram GitHub Repository. It turns out I should have just tried it before explaining the rather cumbersome process of building a command line tool from source -- it would have saved me a lot of time had it been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, the ngram package clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)
system.time({
ng1 <- ngram::ngram(DT[["CleanedText"]],n = 1)
ng2 <- ngram::ngram(DT[["CleanedText"]],n = 2)
ng3 <- ngram::ngram(DT[["CleanedText"]],n = 3)
pt1 <- setDT(ngram::get.phrasetable(ng1))
pt1[,Ngrams := 1L]
pt2 <- setDT(ngram::get.phrasetable(ng2))
pt2[,Ngrams := 2L]
pt3 <- setDT(ngram::get.phrasetable(ng3))
pt3[,Ngrams := 3L]
pt <- rbindlist(list(pt1,pt2,pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
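Side note: if all the word cloud needs is unigram frequencies, the phrase table above already has them, so you could skip TermDocumentMatrix() entirely (a sketch; get.phrasetable() returns ngrams, freq, and prop columns):
d <- pt1[order(-freq)]
wordcloud(d$ngrams, d$freq, max.words = 500, random.order = FALSE)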
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on GitHub here) by Zheyuan Yu, a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6, to take pre-processed text files off disk and generate a .csv output with ngram frequencies.
You'll need to build it from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was able to generate 5-grams from ~700 MB of text input in well under an hour; the resulting CSV with all the output was a 2.9 GB file with 93 million rows.
Continuing the example above, I have a folder, ngrams-master, in my working directory that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I made a couple of tweaks to get the output format the way I wanted it; if you're interested, I can try to find the changes I made to generate .csv outputs that differ from the defaults and upload them to GitHub. (I did that project before I was familiar with the platform, so I don't have a good record of the changes I made. Live and learn.)
Update 2: I created a fork on GitHub, msummersgill/ngrams, that reflects the slight tweaks I made to output results in .CSV format. If someone were so inclined, I have a hunch that this could be wrapped up in an Rcpp-based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how ternary search trees work, but they seem to be significantly more memory efficient and faster than any other n-gram implementation currently available in R.
As noted in the update above, Drew Schmidt and Christian Heckendorf have also submitted an R package named ngram to CRAN that is worth checking out: ngram GitHub Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpora I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but @Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]], GramSubset[["Frequency"]], colors = GramSubset[["Frequency"]],
          rot.per = 0.3, font.main = 1, col.main = "cornsilk3", cex.main = 1.5)
I use quanteda for my quantitative text analysis with a dictionary-based approach. I am building my own dictionary with Lowe's Yoshikoder. I can apply my Yoshikoder dictionary with quanteda (see below), yet the function only accounts for the first level of the dictionary. I need to see the values of every category, including all subcategories (at least 4 levels). How can I do this?
# load my Yoshikoder dictionary with multiple levels
mydict <- dictionary(file = "mydictionary.ykd",
format = "yoshikoder", concatenator = "_", tolower = TRUE, encoding = "auto")
# apply dictionary
mydfm <- dfm(mycorpus, dictionary = mydict)
mydfm
# problem: shows only results for the first level of the dictionary
dfm_lookup (and tokens_lookup) have a levels argument whose default is 1:5. Try applying the lookup separately:
mydfm <- dfm(mycorpus)
dfm_lookup(mydfm, dictionary = mydict)
or:
mytoks <- tokens(mycorpus)
mytoks <- tokens_lookup(mytoks, dictionary = mydict)
dfm(mytoks)
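If you only want the keys at one particular depth, you can also pass that level explicitly via the same levels argument (a sketch; pick whichever level holds your subcategories):
dfm_lookup(mydfm, dictionary = mydict, levels = 4)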
Update:
Fixed now in v0.9.9.55.
> library(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.55
# Using 7 of 8 cores for parallel computing
> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd")
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE)
# Creating a dfm from a corpus ...
# ... tokenizing texts
# ... lowercasing
# ... found 14 documents, 5,058 features
# ... applying a dictionary consisting of 19 keys
# ... created a 14 x 19 sparse dfm
# ... complete.
# Elapsed time: 0.422 seconds.
> mydict
# Dictionary object with 9 primary key entries and 2 nested levels.
# - Economy:
# - +State+:
# - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow*
# - =State=:
# - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work*
# - -State-:
# - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher*
# - Institutions:
# - Radical:
# - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice*
# - Neutral:
# - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote*
# - Conservative:
# - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike*
# - Values:
# - Liberal:
# - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex*
# - Conservative:
# - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition*
# - Law and Order:
# - Liberal:
# - harassment, non-custodial
# - Conservative:
# - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan*
# - Environment:
# - Pro:
# - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
#
# ...
While I've fixed this in quanteda, if you are on a version without the fix you can try this replacement function, which collapses over categories:
library(xml2)
# Read a Yoshikoder .ykd file and flatten its nested categories into
# "Top>Sub>SubSub" keys, one dictionary entry per path
read_dict_yoshikoder <- function(path, sep=">"){
  doc <- xml2::read_xml(path)
  # pnode elements hold the dictionary patterns (terms)
  pats <- xml2::xml_find_all(doc, ".//pnode")
  pnode_names <- xml2::xml_attr(pats, "name")
  # Build the full category path of a pattern node by walking up its parents
  get_pnode_path <- function(pn) {
    pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
    paste0(rev(na.omit(pars)), collapse = sep)
  }
  pnode_paths <- lapply(pats, get_pnode_path)
  # Group terms by their collapsed path and turn the result into a quanteda dictionary
  lst <- split(pnode_names, unlist(pnode_paths))
  dictionary(lst)
}
Usage:
read_dict_yoshikoder("laver-garry-ajps.ykd")
Dictionary object with 19 key entries.
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre*
- Laver and Garry>Culture>Popular: media
- Laver and Garry>Culture>Sport: angler*
- Laver and Garry>Environment>Con: produc*
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci*
...
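The resulting dictionary can then be applied in the usual way; a sketch using the corpus from the earlier example:
mydict2 <- read_dict_yoshikoder("laver-garry-ajps.ykd")
dfm_lookup(dfm(data_corpus_irishbudget2010), dictionary = mydict2)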