randomly split dictionary into n parts - r

I have a quanteda dictionary I want to randomly split into n parts.
dict <- dictionary(list(positive = c("good", "amazing", "best", "outstanding", "beautiful", "wonderf*"),
                        negative = c("bad", "worst", "awful", "atrocious", "deplorable", "horrendous")))
I have tried using the split function like this: split(dict, f = factor(3)), but was not successful.
I would like to get three dictionaries back, but instead I get:
$`3`
Dictionary object with 2 key entries.
- [positive]:
- good, amazing, best, outstanding, beautiful, wonderf*
- [negative]:
- bad, worst, awful, atrocious, deplorable, horrendous
EDIT
I have included an entry containing * in the dictionary ("wonderf*"). The solution suggested by Ken Benoit throws an error in this case but works perfectly fine otherwise.
The desired output is something like this:
> dict_1
Dictionary object with 2 key entries.
- [positive]:
- good, wonderf*
- [negative]:
- deplorable, horrendous
> dict_2
Dictionary object with 2 key entries.
- [positive]:
- amazing, best
- [negative]:
- bad, worst
> dict_3
Dictionary object with 2 key entries.
- [positive]:
- outstanding, beautiful
- [negative]:
- awful, atrocious
In case the number of entries cannot be divided by n without a remainder, I have no fixed specification, but ideally I would be able to decide whether I want (i) the 'remainder' as a separate split or (ii) all values distributed (which results in some splits being slightly larger).

There is a lot left unspecified in the question, since it's unclear how dictionary keys of different lengths should be handled, and since there is no pattern to the pairs in your expected answer.
Here, I've assumed that your keys are of equal length, divisible by the split size without a remainder, and that you want to split each key into running, adjacent intervals.
This should do it.
library("quanteda")
## Package version: 1.5.1
dict <- dictionary(
  list(
    positive = c("good", "amazing", "best", "outstanding", "beautiful", "delightful"),
    negative = c("bad", "worst", "awful", "atrocious", "deplorable", "horrendous")
  )
)
dictionary_split <- function(x, len) {
  # len = number of values per key in each split
  maxlen <- max(lengths(x)) # change to min() to avoid recycling
  subindex <- split(seq_len(maxlen), ceiling(seq_len(maxlen) / len))
  splitlist <- lapply(subindex, function(y) lapply(x, "[", y))
  names(splitlist) <- paste0("dict_", seq_along(splitlist))
  lapply(splitlist, dictionary)
}
dictionary_split(dict, 2)
## $dict_1
## Dictionary object with 2 key entries.
## - [positive]:
## - good, amazing
## - [negative]:
## - bad, worst
##
## $dict_2
## Dictionary object with 2 key entries.
## - [positive]:
## - best, outstanding
## - [negative]:
## - awful, atrocious
##
## $dict_3
## Dictionary object with 2 key entries.
## - [positive]:
## - beautiful, delightful
## - [negative]:
## - deplorable, horrendous
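To address the "randomly" part of the question and option (ii) for remainders, here is a minimal sketch (my own addition, not part of the answer above) that assigns each value of each key to one of n groups at random, with group sizes as equal as possible. Glob patterns such as "wonderf*" pass through untouched, since the values are never reparsed:
dictionary_split_random <- function(x, n) {
  x <- as.list(x) # work on a plain named list of key values
  # one random group assignment per key; group sizes as equal as possible
  # (keys with fewer than n values will yield some empty groups)
  groups <- lapply(x, function(words) sample(rep_len(seq_len(n), length(words))))
  splitlist <- lapply(seq_len(n), function(i) {
    mapply(function(words, g) words[g == i], x, groups, SIMPLIFY = FALSE)
  })
  names(splitlist) <- paste0("dict_", seq_len(n))
  lapply(splitlist, dictionary)
}
dictionary_split_random(dict, 3)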

how to calculate R1 (lexical richness index) in R?

Hi, I need to write a function to calculate R1, which is defined as follows:
R1 = 1 - (F(h) - h^2 / (2N))
where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequency up to that point. Using the quanteda package I managed to calculate the Hirsch point:
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", " The US president accused Palestinians of lacking “appreciation or respect.", "To create my data I had to chunk each text in an increasing manner.", "Therefore, the input is a list of chunked texts within another list.")
a3 <- c("We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
txt <-list(a,a1,a2,a3)
library(quanteda)
DFMs <- lapply(txt, dfm)
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
s_p <- function(x) split(x, x$group)
tstat_by <- lapply(Fs, s_p)
h_values <- lapply(tstat_by, vapply, get_h_point, double(1))
To calculate F(h), the cumulative relative frequencies up to h_point, to put in R1, I need two values: one of them needs to come from Fs$rank and the other from h_values. Consider the first few original texts (tstat_by[[1]], tstat_by[[2]], and tstat_by[[3]]) and their respective h_values (h_values[[1]], h_values[[2]], and h_values[[3]]):
fh_txt1 <- tail(prop.table(cumsum(tstat_by[[1]][["text1"]]$rank:h_values[[1]][["text1"]])), n=1)
fh_txt2 <- tail(prop.table(cumsum(tstat_by[[1]][["text2"]]$rank:h_values[[1]][["text2"]])), n=1)
...
tail(prop.table(cumsum(tstat_by[[4]][["text2"]]$rank:h_values[[4]][["text2"]])), n=1)
[1] 1
tail(prop.table(cumsum(tstat_by[[4]][["text3"]]$rank:h_values[[4]][["text3"]])), n=1)
[1] 0.75
As you can see, the grouping is the same: the docnames for each chunk of the original character vectors are the same (text1, text2, text3, etc.). My question is how to write a function for fh_txt(s) so that lapply can be used to calculate F(h) for R1.
Please note that the goal is to write a function to calculate R1, and what I've put here is what has been done in this regard.
I've simplified your inputs below, and used the groups argument in textstat_frequency() instead of your approach to creating lists of dfm objects.
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.")
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfmat <- c(a, a1, a2) %>%
  tokens() %>%
  dfm()
tstat <- quanteda.textstats::textstat_frequency(dfmat, groups = docnames(dfmat), ties = "first")
tstat_by <- split(tstat, tstat$group)
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
h_values <- vapply(tstat_by, get_h_point, double(1))
h_values
## text1 text2 text3
## 2.000014 1.500000 2.000024
tstat_by <- lapply(
  names(tstat_by),
  function(x) subset(tstat_by[[x]], cumsum(rank) <= h_values[[x]])
)
do.call(rbind, tstat_by)
## feature frequency rank docfreq group
## 1 the 2 1 1 text1
## 29 the 2 1 1 text2
## 48 the 3 1 1 text3
You didn't specify what you wanted as output, but with this result you should be able to compute your F(h) and R1 either on the list using lapply(), or on the combined data.frame using, for instance, dplyr.
Created on 2021-04-05 by the reprex package (v1.0.0)
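For completeness, a minimal sketch of the final R1 computation, assuming F(h) is the cumulative relative frequency of all features ranked at or below the h-point (the helper name r1_score is mine, not from the answer above):
r1_score <- function(tstat_group, h) {
  N <- sum(tstat_group$frequency) # total tokens in the group
  # cumulative relative frequency up to the h-point
  Fh <- sum(tstat_group$frequency[tstat_group$rank <= h]) / N
  1 - (Fh - h^2 / (2 * N))
}
mapply(r1_score, split(tstat, tstat$group), h_values)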

R: How to use RegEx to search multiple words using a disjunction

Let me explain what I want to do. I have corpus data (15M words) about a political debate and I want to find the co-occurrence of two terms within, say, 10k words.
I create two vectors of positions of two terms: "false" and "law".
false.v <- c(133844, 133880, 145106, 150995, 152516, 152557, 153697, 155507)
law.v <- c(48064, 155644, 251315, 297303, 323417, 349576, 368052, 543487)
Then I want to gather them in a matrix to see the co-occurrence, using the outer function. The positions are taken from the same corpus, so I'm creating a matrix of differences:
distances <- outer(false.v, law.v, "-")
To make this easier to read lets name them:
rownames(distances) <- paste0("False", false.v)
colnames(distances) <- paste0("Law", law.v)
Okay, so we have the matrix ready. To find which pairs of positions were within 10000 words of each other I just run:
abs(distances) <= 10000
So I have to identify those moments in the political debate where there is a greater frequency of those co-occurrences. Here comes the problem: I have to do it with more than one pair of words (in fact, with 5 pairs of words or so), so it would be great if I could just search multiple words instead of two pairs of words at a time. So instead of searching "false" and "law", I'd search "false OR lie OR whatever" and "law OR money OR whatever". I guess I have to use RegEx for this task, don't I? I just tried everything and nothing worked.
The example I just gave is a simplification. The command I use to search words is creating a vector out of the corpus:
positions.law.v <- which(C1.corpus.v == "law")
So it would be great if I could just use something like
which(C1.corpus.v == "law OR money OR prison OR ...")
which(C1.corpus.v == "false OR lie OR country OR ...")
It's like telling R "hey, give me the co-occurrence positions of any possible combination between the first row of words (law or money or prison...) and the second one (false or lie or country...)". I hope I'm explaining it in a clear way. I'm sorry for the language mistakes. Thank you!!
I have an extended answer here as well, but it could be as simple as:
library(dplyr)
mywords <- c("law", "money", "prison", "false", "lie", "country")
which(C1.corpus.v %in% mywords)
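If you need the two groups from the question separately to build the distance matrix, a minimal sketch along the same lines (the group vectors here are assumptions based on your example):
group1 <- c("law", "money", "prison")
group2 <- c("false", "lie", "country")
pos1 <- which(C1.corpus.v %in% group1)
pos2 <- which(C1.corpus.v %in% group2)
# same co-occurrence test as in the question, now over whole groups
distances <- outer(pos2, pos1, "-")
which(abs(distances) <= 10000, arr.ind = TRUE)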
Try:
library(quanteda)
I'll use the election manifestos of 9 UK political parties from 2010:
data_char_ukimmig2010
Create a tokens object (there are lots of settings - check out https://quanteda.io/)
mytoks <- data_char_ukimmig2010 %>%
  char_tolower() %>%
  tokens()
mywords = c("law", "money", "prison", "false", "lie", "country")
kwic "return[s] a list of a keyword supplied by the user in its immediate context, identifying the source text and the word index number within the source text" source
mykwic <- kwic(mytoks, mywords)
A kwic builds a data frame with various features, one of which is the integer value starting position of your keywords (because you can use it to look for phrases):
mykwic$from
Gives us:
> mykwic$from
[1] 130 438 943 1259 1281 1305 1339 1356 1743 1836 1859 2126 2187 2443 2546 2640 2763 2952 3186 3270 179 8 201
[24] 343 354 391 498 16 131 552 14 29 388 80 306 487 507
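From there you can plug the kwic start positions into the same distance-matrix approach as in the question; a minimal sketch (the group vectors are assumptions, and with multiple documents you would want to do this per docname):
group1 <- c("law", "money", "prison")
group2 <- c("false", "lie", "country")
pos1 <- mykwic$from[mykwic$keyword %in% group1]
pos2 <- mykwic$from[mykwic$keyword %in% group2]
# pairs of positions within 10,000 tokens of each other
which(abs(outer(pos2, pos1, "-")) <= 10000, arr.ind = TRUE)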
I think your problem is slightly more sophisticated than using regex. For instance, you may be willing to include law, legal and legislation in one group but do not include lawless. Regex like \blaw.*\b wouldn't help you much. In effect, you are interested in:
Creating feature co-occurrence matrix
Incorporating the semantic proximity of the words
Feature co-occurrence matrix
This is a well-established task and I would encourage you to use a tested solution like the fcm function. To introduce an example from the documentation:
txt <- "A D A C E A D F E B A C E D"
fcm(txt, context = "window", window = 2)
fcm(txt, context = "window", count = "weighted", window = 3)
fcm(txt, context = "window", count = "weighted", window = 3,
weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
Your regex
To suggest a solution to your particular problem. This:
which(C1.corpus.v == "law OR money OR prison OR ...")
where
C1.corpus.v <- c("law", "word", "something","legal", "stuff")
you could do
grep(
  pattern = paste("legal", "law", "som.*", sep = "|"),
  x = C1.corpus.v,
  perl = TRUE,
  value = FALSE
)
where sep = "|" serves as your ...OR.... IMHO, this is not what you want as it does not address semantic similarity. I would suggest you have a look at some of the good tutorials that are available on the net 1,2.
1 Taylor Arnold and Lauren Tilton Basic Text Processing in R
2 Islam, Aminul & Inkpen, Diana. (2008). Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity. TKDD. 2. 10.1145/1376815.1376819.

Quanteda: applying Yoshikoder dictionary with multiple levels

I use quanteda for my quantitative text analysis with a dictionary-based approach. I am building my own dictionary with Lowe's Yoshikoder. I can apply my Yoshikoder dictionary with quanteda (see below) - yet, the function only accounts for the first level of the dictionary. I need to see all values of each category, including all subcategories (at least 4 levels). How can I do this?
# load my Yoshikoder dictionary with multiple levels
mydict <- dictionary(file = "mydictionary.ykd", format = "yoshikoder",
                     concatenator = "_", tolower = TRUE, encoding = "auto")
# apply dictionary
mydfm <- dfm(mycorpus, dictionary = mydict)
mydfm
# problem: shows only results for the first level of the dictionary
dfm_lookup (and tokens_lookup) have a levels argument whose default is 1:5. Try applying the lookup separately:
mydfm <- dfm(mycorpus)
dfm_lookup(mydfm, dictionary = mydict)
or:
mytoks <- tokens(mycorpus)
mytoks <- tokens_lookup(mytoks, dictionary = mydict)
dfm(mytoks)
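If you want to control how many nested levels are applied, you can also set the levels argument explicitly; a minimal sketch, reusing mydict and mycorpus from the question:
mydfm <- dfm(tokens(mycorpus))
# request the first four nested levels of the dictionary explicitly
dfm_lookup(mydfm, dictionary = mydict, levels = 1:4)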
Update:
Fixed now in v0.9.9.55.
> library(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.55
# Using 7 of 8 cores for parallel computing
> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd")
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE)
# Creating a dfm from a corpus ...
# ... tokenizing texts
# ... lowercasing
# ... found 14 documents, 5,058 features
# ... applying a dictionary consisting of 19 keys
# ... created a 14 x 19 sparse dfm
# ... complete.
# Elapsed time: 0.422 seconds.
> mydict
# Dictionary object with 9 primary key entries and 2 nested levels.
# - Economy:
# - +State+:
# - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow*
# - =State=:
# - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work*
# - -State-:
# - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher*
# - Institutions:
# - Radical:
# - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice*
# - Neutral:
# - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote*
# - Conservative:
# - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike*
# - Values:
# - Liberal:
# - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex*
# - Conservative:
# - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition*
# - Law and Order:
# - Liberal:
# - harassment, non-custodial
# - Conservative:
# - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan*
# - Environment:
# - Pro:
# - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
#
# ...
While that fix makes its way into quanteda, try this replacement function, which collapses over categories:
library(xml2)
read_dict_yoshikoder <- function(path, sep = ">") {
  doc <- xml2::read_xml(path)
  pats <- xml2::xml_find_all(doc, ".//pnode")
  pnode_names <- xml2::xml_attr(pats, "name")
  get_pnode_path <- function(pn) {
    pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
    paste0(rev(na.omit(pars)), collapse = sep)
  }
  pnode_paths <- lapply(pats, get_pnode_path)
  lst <- split(pnode_names, unlist(pnode_paths))
  dictionary(lst)
}
Usage:
read_dict_yoshikoder("laver-garry-ajps.ykd")
Dictionary object with 19 key entries.
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre*
- Laver and Garry>Culture>Popular: media
- Laver and Garry>Culture>Sport: angler*
- Laver and Garry>Environment>Con: produc*
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci*
...
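The collapsed dictionary can then be applied like any other; a minimal sketch, reusing mycorpus from the question (mydict2 is just a name for the result):
mydict2 <- read_dict_yoshikoder("laver-garry-ajps.ykd")
dfm_lookup(dfm(tokens(mycorpus)), dictionary = mydict2)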

Give a new variable value 0 or 1 based on the distance between two words in another variable

I am new to R. In my dataset, I have a variable called Reason. I want to create a new column called Price. If any of the following conditions is met:
- the words "price" and "high" are both mentioned in Reason and the distance between them is less than 6 words
- the words "price" and "expensive" are both mentioned in Reason and the distance between them is less than 6 words
- the words "price" and "increase" are both mentioned in Reason and the distance between them is less than 6 words
then Price = 1. Otherwise, Price = 0.
I found the following user-defined function to get the distance between 2 words:
distance <- function(string, term1, term2) {
  words <- strsplit(string, "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  abs(indices[term1] - indices[term2])
}
but I don't know how to apply it to the whole column to get the expected results. I tried the following code, but it only gives me logical(0) as the result:
for (j in seq(Survey$Reason)) {
  Survey$Price[[j]] <- distance(Survey$Reason[[j]], " price ", " high ") <= 6
}
Any help is highly appreciated.
Thanks
Starting from your sample data:
survey <- structure(list(Reason = c("Their price are extremely high.", "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.", "Just because the intial workings were fine, but after we realised it would affect our contract, it left a sour taste in our mouth.", "Problems with the repair", "They did not handle my complaint as well I would have liked.", "Bad service overall.")), .Names = "Reason", row.names = c(NA, 6L), class = "data.frame")
First, I updated your function to remove punctuation and directly return the result of the position test:
distanceOK <- function(string, term1, term2, n = 6) {
  words <- strsplit(gsub("[[:punct:]]", "", string), "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  dist <- abs(indices[term1] - indices[term2])
  ifelse(is.na(dist) | dist > n, 0, 1)
}
Then we apply:
survey$Price <- sapply(survey$Reason, FUN=function(str) distanceOK(str, "price","high"))
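To cover all three conditions from the question, you can take the maximum over the alternative second terms; a minimal sketch (terms2 is my name, and tolower() is added because the function matches exact tokens):
terms2 <- c("high", "expensive", "increase")
survey$Price <- sapply(survey$Reason, function(str) {
  # 1 if "price" occurs within 6 words of any alternative, else 0
  max(sapply(terms2, function(t2) distanceOK(tolower(str), "price", t2)))
})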

Efficiently match multiple strings/keywords to multiple texts in R

I am trying to efficiently map exact peptides (short sequences of amino acids in the 26-character alphabet A-Z [1]) to proteins (longer sequences of the same alphabet). The most efficient way to do this I'm aware of is an Aho-Corasick trie (where peptides are the keywords). Unfortunately I can't find a version of AC in R that will work with a non-nucleotide alphabet (Biostrings' PDict and Starr's match_ac are both hard-coded for DNA).
As a crutch I've been trying to parallelize a basic grep approach. But I'm having trouble figuring out a way to do so without incurring significant IO overhead. Here is a brief example:
peptides = c("FSSSGGGGGGGR","GAHLQGGAK","GGSGGSYGGGGSGGGYGGGSGSR","IISNASCTTNCLAPLAK")
if (!exists("proteins"))
{
biocLite("biomaRt", ask=F, suppressUpdates=T, suppressAutoUpdate=T)
library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
proteins = getBM(attributes=c('peptide', 'refseq_peptide'), filters='refseq_peptide', values=c("NP_000217", "NP_001276675"), mart=ensembl)
row.names(proteins) = proteins$refseq_peptide
}
library(snowfall)
library(Biostrings)
library(plyr)
sfInit(parallel = T, cpus = parallel::detectCores() - 1) # detectCores() is in the parallel package
allPeptideInstances = NULL
i=1
increment=100
count=nrow(proteins)
while (T) {
  print(paste(i, min(count, i + increment), sep = ":"))
  text_source = proteins[i:min(count, i + increment), ]
  text = text_source$peptide
  # peptideInstances = sapply(peptides, regexpr, text, fixed=T, useBytes=T)
  peptideInstances = sfSapply(peptides, regexpr, text, fixed = T, useBytes = T)
  dimnames(peptideInstances) = list(text_source$refseq_peptide, colnames(peptideInstances))
  sparsePeptideInstances = alply(peptideInstances, 2, .fun = function(x) {x[x > 0]}, .dims = T)
  allPeptideInstances = c(allPeptideInstances, sparsePeptideInstances, recursive = T)
  if (i == count | nrow(text_source) < increment)
    break
  i = i + increment
}
sfStop()
There are a few issues here:
- peptideInstances here is a dense matrix, so returning it from each worker is very verbose. I have broken it up into blocks so that I'm not dealing with a 40,000 (proteins) x 60,000 (peptides) matrix.
- Parallelizing over peptides, when it would make more sense to parallelize over the proteins because they're bigger. But I got frustrated with trying to do it by protein because this code breaks if there is only one protein in text_source.
Alternatively, if anyone is aware of a better solution in R, I'm happy to use that. I've spent enough time on this I probably would have been better served implementing Aho-Corasick.
[1] Some of those are ambiguity codes, but for simplicity, ignore that.
I learned Rcpp and implemented an Aho-Corasick trie myself. Now CRAN has a good general-purpose multiple-keyword search package.
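The examples below assume the package is installed and loaded; the CRAN name is, I believe, AhoCorasickTrie:
# package name assumed to be AhoCorasickTrie; it provides the
# AhoCorasickSearch() and AhoCorasickSearchList() functions used below
# install.packages("AhoCorasickTrie")
library(AhoCorasickTrie)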
Here are some usage examples:
listEquals = function(a, b) { is.null(unlist(a)) && is.null(unlist(b)) || !is.null(a) && !is.null(b) && all(unlist(a) == unlist(b)) }
# simple search of multiple keywords in a single text
keywords = c("Abra", "cadabra", "is", "the", "Magic", "Word")
oneSearch = AhoCorasickSearch(keywords, "Is Abracadabra the Magic Word?")
stopifnot(listEquals(oneSearch[[1]][[1]], list(keyword="Abra", offset=4)))
stopifnot(listEquals(oneSearch[[1]][[2]], list(keyword="cadabra", offset=8)))
stopifnot(listEquals(oneSearch[[1]][[3]], list(keyword="the", offset=16)))
stopifnot(listEquals(oneSearch[[1]][[4]], list(keyword="Magic", offset=20)))
stopifnot(listEquals(oneSearch[[1]][[5]], list(keyword="Word", offset=26)))
# search a list of lists
# * sublists are accessed by index
# * texts are accessed by index
# * non-matched texts are kept (to preserve index order)
listSearch = AhoCorasickSearchList(keywords, list(c("What in", "the world"), c("is"), "secret about", "the Magic Word?"))
stopifnot(listEquals(listSearch[[1]][[1]], list()))
stopifnot(listEquals(listSearch[[1]][[2]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[2]][[1]][[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(listSearch[[3]], list()))
stopifnot(listEquals(listSearch[[4]][[1]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[4]][[1]][[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(listSearch[[4]][[1]][[3]], list(keyword="Word", offset=11)))
# named search of a list of lists
# * sublists are accessed by name
# * matched texts are accessed by name
# * non-matched texts are dropped
namedSearch = AhoCorasickSearchList(keywords, list(subject=c(phrase1="What in", phrase2="the world"),
                                                   verb=c(phrase1="is"),
                                                   predicate1=c(phrase1="secret about"),
                                                   predicate2=c(phrase1="the Magic Word?")))
stopifnot(listEquals(namedSearch$subject$phrase2[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$verb$phrase1[[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(namedSearch$predicate1, list()))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[3]], list(keyword="Word", offset=11)))
# named search of multiple texts in a single list with keyword grouping and aminoacid alphabet
# * all matches to a keyword are accessed by name
# * non-matched keywords are dropped
proteins = c(protein1="PEPTIDEPEPTIDEDADADARARARARAKEKEKEKEPEPTIDE",
             protein2="DERPADERPAPEWPEWPEEPEERAWRAWWARRAGTAGPEPTIDEKESEQUENCE")
peptides = c("PEPTIDE", "DERPA", "SEQUENCE", "KEKE", "PEPPIE")
peptideSearch = AhoCorasickSearch(peptides, proteins, alphabet="aminoacid", groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(list(keyword="protein1", offset=1),
                                                 list(keyword="protein1", offset=8),
                                                 list(keyword="protein1", offset=37),
                                                 list(keyword="protein2", offset=38))))
stopifnot(listEquals(peptideSearch$DERPA, list(list(keyword="protein2", offset=1),
                                               list(keyword="protein2", offset=6))))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(list(keyword="protein2", offset=47))))
stopifnot(listEquals(peptideSearch$KEKE, list(list(keyword="protein1", offset=29),
                                              list(keyword="protein1", offset=31),
                                              list(keyword="protein1", offset=33))))
stopifnot(listEquals(peptideSearch$PEPPIE, NULL))
# grouping by keyword without text names: offsets are given without reference to the text
names(proteins) = NULL
peptideSearch = AhoCorasickSearch(peptides, proteins, groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(1, 8, 37, 38)))
stopifnot(listEquals(peptideSearch$DERPA, list(1, 6)))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(47)))
stopifnot(listEquals(peptideSearch$KEKE, list(29, 31, 33)))
