how to calculate R1 (lexical richness index) in R? - r

Hi I need to write a function to calculate R1 which is defined as follows :
R1 = 1 - ( F(h) - h*h/2N) )
where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequencies up to that point. Using quanteda package I managed to calculate the Hirsch point
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", " The US president accused Palestinians of lacking “appreciation or respect.", "To create my data I had to chunk each text in an increasing manner.", "Therefore, the input is a list of chunked texts within another list.")
a3 <- c("We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
txt <-list(a,a1,a2,a3)
library(quanteda)
DFMs <- lapply(txt, dfm)
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)
get_h_point <- function(DATA) {
fn_interp <- approxfun(DATA$rank, DATA$frequency)
fn_root <- function(x) fn_interp(x) - x
uniroot(fn_root, range(DATA$rank))$root
}
s_p <- function(x){split(x,x$group)}
tstat_by <- lapply(Fs, s_p)
h_values <-lapply(tstat_by, vapply, get_h_point, double(1))
To calculate F(h)—the cumulative relative frequencies up to h_point— to put in R1, I need two values; one of them needs to be from Fs$rank and the other must be from h_values. Consider the first original texts (tstat_by[[1]], tstat_by[[2]], and tstat_by[[3]]) and their respective h_values(h_values[[1]], h_values[[2]], and h_values[[3]]):
fh_txt1 <- tail(prop.table(cumsum(tstat_by[[1]][["text1"]]$rank:h_values[[1]][["text1"]])), n=1)
fh_txt2 <-tail(prop.table(cumsum(tstat_by[[1]][["text2"]]$rank:h_values[[1]][["text2"]])), n=1)
...
tail(prop.table(cumsum(tstat_by[[4]][["text2"]]$rank:h_values[[4]][["text2"]])), n=1)
[1] 1
tail(prop.table(cumsum(tstat_by[[4]][["text3"]]$rank:h_values[[4]][["text3"]])), n=1)
[1] 0.75
As you can see, the grouping is the same— docnames for each chunk of the original character vectors are the same (text1, text2, text3, etc.). my question is how to write a function for fh_txt(s) so that using lapply can be an option to calculate F(h) for R1.
Please note that the goal is to write a function to calculate R1, and what I`ve put here is what has been done in this regard.

I've simplified your inputs below, and used the groups argument in textstat_frequency() instead of your approach to creating lists of dfm objects.
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.")
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfmat <- c(a, a1, a2) %>%
tokens() %>%
dfm()
tstat <- quanteda.textstats::textstat_frequency(dfmat, groups = docnames(dfmat), ties = "first")
tstat_by <- split(tstat, tstat$group)
get_h_point <- function(DATA) {
fn_interp <- approxfun(DATA$rank, DATA$frequency)
fn_root <- function(x) fn_interp(x) - x
uniroot(fn_root, range(DATA$rank))$root
}
h_values <- vapply(tstat_by, get_h_point, double(1))
h_values
## text1 text2 text3
## 2.000014 1.500000 2.000024
tstat_by <- lapply(
names(tstat_by),
function(x) subset(tstat_by[[x]], cumsum(rank) <= h_values[[x]])
)
do.call(rbind, tstat_by)
## feature frequency rank docfreq group
## 1 the 2 1 1 text1
## 29 the 2 1 1 text2
## 48 the 3 1 1 text3
You didn't specify what you wanted for output, but with this result, you should be able to compute your own either on the list using lapply(), or on the combined data.frame using for instance dplyr.
Created on 2021-04-05 by the reprex package (v1.0.0)

Related

R: How to use RegEx to search multiple words using a disjunction

Let me explain what I want to do. I have a corpus data (15 M words) about a political debate and I want to find the co-ocurrence of two terms within, say, 10k words.
I create two vectors of positions of two terms: "false" and "law".
false.v <- c(133844, 133880, 145106, 150995, 152516, 152557, 153697, 155507)
law.v <- c(48064, 155644, 251315, 297303, 323417, 349576, 368052, 543487)
Then I want to gather them on a matrix to see the co-ocurrence using the 'outer' function. The positions are taken from the same corpus, so I'm creating a matrix of differences:
distances <- outer(false.v, law.v, "-")
To make this easier to read lets name them:
rownames(distances) <- paste0("False", false.v)
colnames(distances) <- paste0("Law", law.v)
Okay, so we have the matrix ready. To find which pairs of positions were within 10000 words of each other I just run:
abs(distances) <= 10000
So I have to identify those moments in the political debate where there is a greater frequency of those co-occurences. Here comes the problem. I have to do it with more than a pair of words (In fact with 5 pair of words or so), so it would be great if I just could search multiple words instead of two pair of words at a time. So instead searching "false" and "law", search "false OR lie OR whatever" and "law OR money OR whatever". I guess I have to use RegEx for this task, isn't it? I just tried everything and nothing worked.
The example I just gave is a simplification. The command I use to search words is creating a vector out of the corpus:
positions.law.v <- which(C1.corpus.v == "law")
Soo it would be great if I can just use something like
which(C1.corpus.v == "law OR money OR prison OR ...")
which(C1.corpus.v == "false OR lie OR country OR ...")
It's like telling R "hey, give me the co-ocurrence positions of any possible combination between the first row of words (law or money or prison...) and the second one (false or lie or country...). I hope I'm explaining it in a clear way. I'm sorry for the language mistakes. Thank you!!
library(dplyr)
I have an extended answer here as well, but it could be as simple as:
mywords = c("law", "money", "prison", "false", "lie", "country")
which(C1.corpus.v %in% mywords)
Try:
library(quanteda)
I'll use the election manifestos of 9 UK political parties from 2010:
data_char_ukimmig2010
Create a tokens object (there are lots of settings - check out https://quanteda.io/)
mytoks <- data_char_ukimmig2010 %>%
char_tolower() %>%
tokens()
mywords = c("law", "money", "prison", "false", "lie", "country")
kwic "return[s] a list of a keyword supplied by the user in its immediate context, identifying the source text and the word index number within the source text" source
mykwic <- kwic(mytoks, mywords)
A kwic builds a data frame with various features, one of which is the integer value starting position of your keywords (because you can use it to look for phrases):
mykwic$from
Gives us:
> mykwic$from
[1] 130 438 943 1259 1281 1305 1339 1356 1743 1836 1859 2126 2187 2443 2546 2640 2763 2952 3186 3270 179 8 201
[24] 343 354 391 498 16 131 552 14 29 388 80 306 487 507
I think your problem is slightly more sophisticated than using regex. For instance, you may be willing to include law, legal and legislation in one group but do not include lawless. Regex like \blaw.*\b wouldn't help you much. In effect, you are interested in:
Creating feature co-occurrence matrix
Incorporating the semantic proximity of the words
Feature co-occurrence matrix
This is a well-established task and I would encourage you to use a tested solution like the fcm function. To introduce an example from the documentation:
txt <- "A D A C E A D F E B A C E D"
fcm(txt, context = "window", window = 2)
fcm(txt, context = "window", count = "weighted", window = 3)
fcm(txt, context = "window", count = "weighted", window = 3,
weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
Your regex
To suggest a solution to your particular problem. This:
which(C1.corpus.v == "law OR money OR prison OR ...")
where
C1.corpus.v <- c("law", "word", "something","legal", "stuff")
you could do
grep(
pattern = paste("legal", "law", "som.*", sep = "|"),
x = C1.corpus.v,
perl = TRUE,
value = FALSE
)
where sep = "|" serves as your ...OR.... IMHO, this is not what you want as it does not address semantic similarity. I would suggest you have a look at some of the good tutorials that are available on the net 1,2.
1 Taylor Arnold and Lauren Tilton Basic Text Processing in R
2 Islam, Aminul & Inkpen, Diana. (2008). Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity. TKDD. 2. 10.1145/1376815.1376819.

Slow wordcloud in R

Trying to create a word cloud from a 300MB .csv file with text, but its taking hours on a decent laptop with 16GB of RAM. Not sure how long this should typically take...but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow yet with a small subset to figure out which steps are taking the most time? Profiling with RStudio here
If not, that should be your first step.
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do contribute to some additional processing time, but I feel like that's an acceptable trade-off for the sanity of not having dozens of nested parenthesis.
library(data.table)
library(stringi)
library(parallel)
## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
InputText %>%
stri_enc_toascii(.) %>%
toupper(.) %>%
stri_replace_all_regex(.,"[[:cntrl:]]"," ") %>%
stri_replace_all_regex(.,"[[:punct:]]"," ") %>%
stri_replace_all_regex(.,"[[:space:]]+"," ") %>% ## Replaces multiple spaces with
stri_replace_all_regex(.,"^[[:space:]]+|[[:space:]]+$","") %>% ## Remove leading and trailing spaces
stri_replace_all_regex(.,"\\b"%s+%StopWords%s+%"\\b","",vectorize_all = FALSE) %>% ## Stopwords
stri_replace_all_regex(.,"\\b"%s+%Words%s+%"\\b",NewWords,vectorize_all = FALSE) ## Replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able some big time savings possible with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single threaded. I'm not sure if this was the same for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
cl <- parallel::makeCluster(NCores)
clusterEvalQ(cl, library(magrittr))
clusterEvalQ(cl, library(stringi))
clusterExport(cl, list("textCleaner",
"GlobalStopWords",
"GlobalOldWords",
"GlobalNewWords"))
Text <- as.character(unlist(parallel::parLapply(cl, Text,
fun = function(x) textCleaner(x,
GlobalStopWords,
GlobalOldWords,
GlobalNewWords))))
parallel::stopCluster(cl)
return(Text)
}
## Run it Parallel
system.time({
DT[,CleanedText := parallelCleaner(Text = Text,
NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: I mentioned Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN recently that might be worth checking out: ngram Github Repository. Turns out I should have just tried it before explaining the really cumbersome process of building a command line tool from source-- this would have saved me a lot of time had been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, ngrams clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)
system.time({
ng1 <- ngram::ngram(DT[["CleanedText"]],n = 1)
ng2 <- ngram::ngram(DT[["CleanedText"]],n = 2)
ng3 <- ngram::ngram(DT[["CleanedText"]],n = 3)
pt1 <- setDT(ngram::get.phrasetable(ng1))
pt1[,Ngrams := 1L]
pt2 <- setDT(ngram::get.phrasetable(ng2))
pt2[,Ngrams := 2L]
pt3 <- setDT(ngram::get.phrasetable(ng3))
pt3[,Ngrams := 3L]
pt <- rbindlist(list(pt1,pt2,pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on github here) by Zheyuan Yu- partial implementation of Dr. Vlado Keselj 's Text-Ngrams 1.6 to take pre-processed text files off disk and generate a .csv output with ngram frequencies.
You'll need to build from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was was able generate 5-grams from ~700MB of text input in well under an hour, the CSV result with all the output was 2.9 GB file with 93 million rows.
Continuing the example above, In my working directory, I have a folder, ngrams-master, in my working directory that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I may have made a couple tweaks to get the output format how I wanted it, if you're interested I can try to find the changes I made to generate a .csvoutputs that differ from the default and upload to Github. (I did that project before I was familiar with the platform so I don't have a good record of the changes I made, live and learn.)
Update 2: I created a fork on Github, msummersgill/ngrams that reflects the slight tweaks I made to output results in a .CSV format. If someone was so inclined, I have a hunch that this could be wrapped up in a Rcpp based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how Ternary Search Trees work, but they seem to be significantly more memory efficient and faster than any other N-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN, I haven't used it personally but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but #Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]],GramSubset[["Frequency"]],color = GramSubset[["Frequency"]],
rot.per = 0.3,font.main=1, col.main="cornsilk3", cex.main=1.5)

Quanteda: applying Yoshikoder dictionary with multiple levels

I use quanteda for my quantitative text analysis with a dictionary-based approach. I am building my own dictionary with Lowe's Yoshikoder. I can apply my Yoshikoder dictionary with quanteda (see below) - yet, the function only accounts for the first level of the dictionary. I need to see all values of each category, including all subcategories (at least 4 levels). How can I do this?
# load my Yoshikoder dictionary with multiple levels
mydict <- dictionary(file = "mydictionary.ykd",
format = "yoshikoder", concatenator = "_", tolower = TRUE, encoding = "auto")
# apply dictionary
mydfm <- dfm(mycorpus, dictionary = mydict)
mydfm
# problem: shows only results for the first level of the dictionary
dfm_lookup (and tokens_lookup) have a levels argument whose default is 1:5. Try applying the lookup separately:
mydfm <- dfm(mycorpus)
dfm_lookup(mydfm, dictionary = mydict)
or:
mytoks <- tokens(mycorpus)
mytoks <- tokens_lookup(mytoks, dictionary = mydict)
dfm(mytoks)
Update:
Fixed now in v0.9.9.55.
> library(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.55
# Using 7 of 8 cores for parallel computing
> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd")
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE)
# Creating a dfm from a corpus ...
# ... tokenizing texts
# ... lowercasing
# ... found 14 documents, 5,058 features
# ... applying a dictionary consisting of 19 keys
# ... created a 14 x 19 sparse dfm
# ... complete.
# Elapsed time: 0.422 seconds.
> mydict
# Dictionary object with 9 primary key entries and 2 nested levels.
# - Economy:
# - +State+:
# - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow*
# - =State=:
# - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work*
# - -State-:
# - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher*
# - Institutions:
# - Radical:
# - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice*
# - Neutral:
# - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote*
# - Conservative:
# - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike*
# - Values:
# - Liberal:
# - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex*
# - Conservative:
# - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition*
# - Law and Order:
# - Liberal:
# - harassment, non-custodial
# - Conservative:
# - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan*
# - Environment:
# - Pro:
# - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
#
# ...
While I'm fixed it in Quanteda, try this replacement function that collapses over categories:
library(xml2)
read_dict_yoshikoder <- function(path, sep=">"){
doc <- xml2::read_xml(path)
pats <- xml2::xml_find_all(doc, ".//pnode")
pnode_names <- xml2::xml_attr(pats, "name")
get_pnode_path <- function(pn) {
pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
paste0(rev(na.omit(pars)), collapse = sep)
}
pnode_paths <- lapply(pats, get_pnode_path)
lst <- split(pnode_names, unlist(pnode_paths))
dictionary(lst)
}
Usage:
read_dict_yoshikoder("laver-garry-ajps.ykd")
Dictionary object with 19 key entries.
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre*
- Laver and Garry>Culture>Popular: media
- Laver and Garry>Culture>Sport: angler*
- Laver and Garry>Environment>Con: produc*
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci*
...

Partial string matching & replacement in R

I have a dataframe like this
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Invezzstment LLC
4 Investment_LLC
5 Haiperloop LLC
6 Inwestment LLC
I need to match all these fuzzy strings, so the end result should look like this:
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Investment LLC
4 Investment LLC
5 Hyperloop LLC
6 Investment LLC
So, actually, I must solve a partial match-and-replace task for categorical variable. There are a lot great functions in base R and packages to solve string matching, but I'm stuck to find a single solution for this kind of match-and-replace.
I don't care which occurrence will replace other, for example "Investment LLC" or "Invezzstment LLC" are both equally fine. Just need them to be consistent.
Is there any single all-in-one function or a loop for this?
If you have a vector of correct spellings, agrep makes this reasonably easy:
myDataFrame$company <- sapply(myDataFrame$company,
function(val){agrep(val,
c('Investment LLC', 'Hyperloop LLC'),
value = TRUE)})
myDataFrame
# company
# 1 Investment LLC
# 2 Hyperloop LLC
# 3 Investment LLC
# 4 Investment LLC
# 5 Hyperloop LLC
# 6 Investment LLC
If you don't have such a vector, you can likely make one with clever application of adist or even just table if the correct spelling is repeated more than the others, which it likely will be (though isn't here).
So, after some time I ended up with this dumb code. Attention: It is not fully automating the process of replacement, because every time the proper matches should be verified by human, and every time we need a fine tune of agrep max.distance argument. I am totally sure there are ways to make it better and quicker, but this can help to get the job done.
##########
# Manual renaming with partial matches
##########
# a) Take a look at the desired column of factor variables
sort(unique(MYDATA$names)) # take a look
# ****
Sensthreshold <- 0.2 # sensitivity of agrep, usually 0.1-0.2 get it right
Searchstring <- "Invesstment LLC" # what should I search?
# ****
# User-defined function: returns similar string on query in column
Searcher <- function(input, similarity = 0.1) {
unique(agrep(input,
MYDATA$names, # <-- define your column here
ignore.case = TRUE, value = TRUE,
max.distance = similarity))
}
# b) Make a search of desired string
Searcher(Searchstring, Sensthreshold) # using user-def function
### PLEASE INSPECT THE OUTPUT OF THE SEARCH
### Did it get it right?
=============================================================================#
## ACTION! This changes your dataframe!
## Please make backup before proceeding
## Please execute this code as a whole to avoid errors
# c) Make a vector of cells indexes after checking output
vector_of_cells <- agrep(Searchstring,
MYDATA$names, ignore.case = TRUE,
max.distance = Sensthreshold)
# d) Apply the changes
MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING
# e) Check result
unique(agrep(Searchstring, MYDATA$names,
ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
=============================================================================#

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
NYT WSJ
2011 2 3
2012 10 7
I found this tutorial for the NYT but is not working for me :_(. When I get to line 30 I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (A NYT api key is needed http://developer.nytimes.com/apps/register)
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
# concatenate URL for each offset
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F")
res <- fromJSON(raw.data)
dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# cant' seem to be able to compare Posix objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is a RNYTimes package created by Duncan Temple-Lang https://github.com/omegahat/RNYTimes - but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, but not relevant for you.
I'm rewiring RNYTimes right now...Install from github. You need to install devtools first to get install_github
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g,
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you number of articles found
moocs$response$meta$hits
[1] 121
You could get word counts for each article by
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880

Resources