How to sub matching words with bracketed words? - r

Trying to create a function to bracket reserved words in Access for a SQL query:
library(dplyr)
tester <- data.frame(names=c("Add", "Date", "Test", "DOB"))
bracket_access <- function(x) {x %>% gsub(c("ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|
AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|
COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|
CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|
CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|
DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|
DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|
FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|
GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|
IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|
INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|
JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|
LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|
NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|
OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|
PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|
Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|
Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|
SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|
StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|
TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|
VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO"), paste0("[",.,"]"), ignore.case = T)
}
bracket_access(tester)
I get a numeric output and I don't really understand why:
> bracket_access(tester)
[1] "[c(1, 2, 4, 3)]"

You may fix the current approach by matching and capturing the strings equal to one of the alternatives you provided, and then replacing each match with [\\1] in gsub. (The numeric output in the question comes from passing the whole data frame to gsub: the factor column names gets coerced to the string "c(1, 2, 4, 3)" along the way, so pass the character vector tester$names instead.)
bracket_access <- function(x) {
gsub("^(ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO)$",
"[\\1]",
x,
ignore.case = T)
}
bracket_access(tester$names)
## => [1] "[Add]" "[Date]" "Test" "DOB"
Here, the gsub pattern looks like ^(word1|word2|...|wordN)$ and once there is a match, the whole string is wrapped with [...] and put back (the \\1 is a placeholder for the capturing group #1 in the pattern, and there is one defined with a pair of unescaped parentheses).
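Since the question builds the names inside a data frame with dplyr loaded, here is a hedged usage sketch applying the fixed function to the column. It assumes the column is character (e.g. via stringsAsFactors = FALSE); a factor column would need as.character() first.
library(dplyr)
tester <- data.frame(names = c("Add", "Date", "Test", "DOB"), stringsAsFactors = FALSE)
tester %>% mutate(names = bracket_access(names))
#    names
# 1  [Add]
# 2 [Date]
# 3   Test
# 4    DOB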

Related

paste specific text to strings that do not have it

I would like to paste "miR" onto strings that do not have "miR" already, and skip those that already have it.
paste("miR", ....)
in
c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
out
c("miR-26b", "miR-26a", "miR-1297", "miR-4465", "miR-26b", "miR-26a")
One way is to remove "miR-" if it is present at the beginning of the string using sub, and then paste it onto every string unconditionally.
paste0("miR-", sub("^miR-","", x))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
data
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
sub("^(?!miR)(.*)$", "miR-\\1", vec, perl = T)
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
If you want to learn more:
type ?sub into the R console
learn regex, in particular negative lookahead and capturing groups
I've used perl = T because base R's default (TRE) regex engine does not support lookahead; without it the pattern throws an "invalid regular expression" error.
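If you prefer to avoid lookahead entirely, a lookahead-free sketch with grepl/ifelse gives the same result (an alternative added here, not part of the original answers):
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
ifelse(grepl("^miR-", x), x, paste0("miR-", x))
#[1] "miR-26b"  "miR-26a"  "miR-1297" "miR-4465" "miR-26b"  "miR-26a"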

How to concatenate strings with different separator every n-th element

I would like to concatenate words (strings) with a different separator every 10th element, so that each word is separated by a comma, except every 10th word, which is followed by a comma and a line break. The ultimate purpose is to print a list of words neatly into a table.
I could write a loop, but I am hoping for a more elegant solution using gsub and regular expressions, as proposed in related questions (here and here) that involve inserting/replacing a string after every n-th character; in my case, though, the words have variable length (in characters).
Edit: I am looking for a solution I can apply to any vector with a variable number of words.
For reproducible data, I generate a vector of 40 random words using code from this source
MHmakeRandomString <- function(n, length) {
  randomString <- c(1:n)
  for (i in 1:n) {
    randomString[i] <- paste(sample(c(0:9, letters, LETTERS), length, replace=TRUE),
                             collapse="")
  }
  return(randomString)
}
set.seed(4)
word_vector <- MHmakeRandomString(n=40, length=5)
word_vector
# [1] "A0ihO" "gIUW4" "Kh6Xp" "sYAXL" "IZvuE" "PtQvw" "zeSEt" "YsCo0" "WfzbU" "5TTIz"
# [11] "oKTOO" "qaaTK" "y4QUd" "C4vNY" "lDplP" "Gjrg8" "UHzUT" "32ZcV" "c7xgl" "5Lr2H"
# [21] "fDgxt" "zFdYO" "hohuK" "vrNU4" "8oRg5" "IYcyl" "pblbO" "SHhq0" "yFjWa" "rzYLr"
# [31] "m2AXf" "QdhtM" "TWpkh" "4499K" "5Bcv8" "0DeqI" "6BdTy" "fJgKX" "tUZeh" "HPso5"
I usually do paste(x, collapse = ", ") and then print it to a table using gridExtra:
word_sep <- paste(word_vector, collapse=", ")
# [1] "z6LHb, 1ubB1, o9TZ2, 8s8bV, sZmcB, blirI, gMfo1, xXkkt, gFMrA, hXdaO,
# lNP2Q, p9B9G, JXTsJ, qVsWS, ntiT8, d0QRv, uoR1D, L99Bg, THWQo, meuev,
# IO0Au, 0yWmh, 72d3g, FJRDS, PtbJT, JaXVK, OPo9m, i0678, 6BpXZ, b6hzT,
# BDQBk, ANC5h, 7QPgM, JJSxf, nnX7Z, rbEfm, XXl4Z, kHMuI, wFLyM, P8rlp"
library(gridExtra)
library(cowplot)  # plot_grid() comes from cowplot; tableGrob() from gridExtra
plot_grid(tableGrob(word_sep))
Current table output (image omitted): in this case I have a really long list of words and a specified table width, so I need line breaks.
My desired output would look like this hacked version:
word_sep2 <- paste(c(paste(MHmakeRandomString(n=10, length=5), collapse=", "), ",\n",
paste(MHmakeRandomString(n=10, length=5), collapse=", "), ",\n",
paste(MHmakeRandomString(n=10, length=5), collapse=", "), ",\n",
paste(MHmakeRandomString(n=10, length=5), collapse=", ")), collapse="")
word_sep2
# [1] "0ahiL, 2pA5c, dKWuR, 79sw5, MeL1I, KpB1w, UNLSo, LlDlN, jNOcI, tv8R5,
# \norf60, avKFo, jZFxE, U7RQW, SSmxD, czlMt, 75zEB, 2jLwG, 08dmN, H3sVW,
# \nCZwQt, ggumo, wHUpj, Z7WGR, BHYLE, eWksX, Lbt3D, P1Brf, OpEvk, 1WFVa,
# \nEeFd4, afX7B, nyBzF, vbNLz, U7MU0, H4rx4, AKgv8, Kbzri, KKajp, Yg6EW"
plot_grid(tableGrob(word_sep2))
Desired table output (image omitted):
You may use
gsub("((?:[^,]*,){10}) ", "\\1\n", word_sep)
Details
((?:[^,]*,){10}) - Group 1 (referred to with \1 in the replacement pattern) that matches 10 consecutive occurrences of:
[^,]* - any 0+ chars other than ,
, - a comma
" " - the literal space after the 10th comma; it is consumed by the match and replaced by the newline in "\1\n"
See the R demo:
MHmakeRandomString <- function(n, length) {
  randomString <- c(1:n)
  for (i in 1:n) {
    randomString[i] <- paste(sample(c(0:9, letters, LETTERS), length, replace=TRUE),
                             collapse="")
  }
  return(randomString)
}
set.seed(4)
word_vector <- MHmakeRandomString(n=40, length=5)
word_sep <- paste(word_vector, collapse=", ")
f <- gsub("((?:[^,]*,){10}) ", "\\1\n", word_sep)
cat(f)  # cat() has no collapse argument; the newlines are already in f
I guess you can do it with paste:
paste(word_vector, rep(c(", ", ",\n"), c(9,1)), collapse = "", sep = "")
[1] "A0ihO, gIUW4, Kh6Xp, sYAXL, IZvuE, PtQvw, zeSEt, YsCo0, WfzbU, 5TTIz,\noKTOO, qaaTK, y4QUd, C4vNY, lDplP, Gjrg8, UHzUT, 32ZcV, c7xgl, 5Lr2H,\nfDgxt, zFdYO, hohuK, vrNU4, 8oRg5, IYcyl, pblbO, SHhq0, yFjWa, rzYLr,\nm2AXf, QdhtM, TWpkh, 4499K, 5Bcv8, 0DeqI, 6BdTy, fJgKX, tUZeh, HPso5,\n"
Here's what it looks like when printing it with cat:
res <- paste(word_vector, rep(c(", ", ",\n"), c(9,1)), collapse = "", sep = "")
cat(res)
# A0ihO, gIUW4, Kh6Xp, sYAXL, IZvuE, PtQvw, zeSEt, YsCo0, WfzbU, 5TTIz,
# oKTOO, qaaTK, y4QUd, C4vNY, lDplP, Gjrg8, UHzUT, 32ZcV, c7xgl, 5Lr2H,
# fDgxt, zFdYO, hohuK, vrNU4, 8oRg5, IYcyl, pblbO, SHhq0, yFjWa, rzYLr,
# m2AXf, QdhtM, TWpkh, 4499K, 5Bcv8, 0DeqI, 6BdTy, fJgKX, tUZeh, HPso5,
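Since the question asks for something that works on any vector with a variable number of words, here is a hedged sketch of a small helper (wrap_every_n is a hypothetical name introduced here) that generalises the paste/rep idea to any chunk size n and strips the trailing separator:
wrap_every_n <- function(x, n = 10) {
  seps <- rep(c(", ", ",\n"), times = c(n - 1, 1))            # comma for n-1 words, newline after the n-th
  out  <- paste(x, rep_len(seps, length(x)), sep = "", collapse = "")
  sub(",[ \n]?$", "", out)                                    # drop the separator after the last word
}
cat(wrap_every_n(word_vector, 10))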

Efficiently match multiple strings/keywords to multiple texts in R

I am trying to efficiently map exact peptides (short sequences of amino acids in the 26-character alphabet A-Z [1]) to proteins (longer sequences of the same alphabet). The most efficient way to do this I'm aware of is an Aho-Corasick trie (where peptides are the keywords). Unfortunately I can't find a version of AC in R that will work with a non-nucleotide alphabet (Biostrings' PDict and Starr's match_ac are both hard-coded for DNA).
As a crutch I've been trying to parallelize a basic grep approach. But I'm having trouble figuring out a way to do so without incurring significant IO overhead. Here is a brief example:
peptides = c("FSSSGGGGGGGR","GAHLQGGAK","GGSGGSYGGGGSGGGYGGGSGSR","IISNASCTTNCLAPLAK")
if (!exists("proteins"))
{
biocLite("biomaRt", ask=F, suppressUpdates=T, suppressAutoUpdate=T)
library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
proteins = getBM(attributes=c('peptide', 'refseq_peptide'), filters='refseq_peptide', values=c("NP_000217", "NP_001276675"), mart=ensembl)
row.names(proteins) = proteins$refseq_peptide
}
library(snowfall)
library(Biostrings)
library(plyr)
sfInit(parallel=T, cpus=detectCores()-1)
allPeptideInstances = NULL
i=1
increment=100
count=nrow(proteins)
while(T)
{
print(paste(i, min(count, i+increment), sep=":"))
text_source = proteins[i:min(count, i+increment),]
text = text_source$peptide
#peptideInstances = sapply(peptides, regexpr, text, fixed=T, useBytes=T)
peptideInstances = sfSapply(peptides, regexpr, text, fixed=T, useBytes=T)
dimnames(peptideInstances) = list(text_source$refseq_peptide, colnames(peptideInstances))
sparsePeptideInstances = alply(peptideInstances, 2, .fun = function(x) {x[x > 0]}, .dims = T)
allPeptideInstances = c(allPeptideInstances, sparsePeptideInstances, recursive=T)
if (i==count | nrow(text_source) < increment)
break
i = i+increment
}
sfStop()
There are a few issues here:
1. peptideInstances here is a dense matrix, so returning it from each worker is very verbose. I have broken it up into blocks so that I'm not dealing with a 40,000 (proteins) x 60,000 (peptides) matrix.
2. I am parallelizing over the peptides, when it would make more sense to parallelize over the proteins because they're bigger. But I got frustrated with trying to do it by protein because this code breaks if there is only one protein in text_source.
Alternatively, if anyone is aware of a better solution in R, I'm happy to use that. I've spent enough time on this that I probably would have been better served implementing Aho-Corasick.
[1] Some of those are ambiguity codes, but for simplicity, ignore that.
I learned Rcpp and implemented an Aho-Corasick myself. Now CRAN has a good general purpose multiple-keyword search package.
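Assuming that package is AhoCorasickTrie (the CRAN package exporting AhoCorasickSearch() and AhoCorasickSearchList()), it can be loaded first so the examples below run:
# install.packages("AhoCorasickTrie")   # if not already installed
library(AhoCorasickTrie)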
Here are some usage examples:
listEquals = function(a, b) { is.null(unlist(a)) && is.null(unlist(b)) || !is.null(a) && !is.null(b) && all(unlist(a) == unlist(b)) }
# simple search of multiple keywords in a single text
keywords = c("Abra", "cadabra", "is", "the", "Magic", "Word")
oneSearch = AhoCorasickSearch(keywords, "Is Abracadabra the Magic Word?")
stopifnot(listEquals(oneSearch[[1]][[1]], list(keyword="Abra", offset=4)))
stopifnot(listEquals(oneSearch[[1]][[2]], list(keyword="cadabra", offset=8)))
stopifnot(listEquals(oneSearch[[1]][[3]], list(keyword="the", offset=16)))
stopifnot(listEquals(oneSearch[[1]][[4]], list(keyword="Magic", offset=20)))
stopifnot(listEquals(oneSearch[[1]][[5]], list(keyword="Word", offset=26)))
# search a list of lists
# * sublists are accessed by index
# * texts are accessed by index
# * non-matched texts are kept (to preserve index order)
listSearch = AhoCorasickSearchList(keywords, list(c("What in", "the world"), c("is"), "secret about", "the Magic Word?"))
stopifnot(listEquals(listSearch[[1]][[1]], list()))
stopifnot(listEquals(listSearch[[1]][[2]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[2]][[1]][[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(listSearch[[3]], list()))
stopifnot(listEquals(listSearch[[4]][[1]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[4]][[1]][[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(listSearch[[4]][[1]][[3]], list(keyword="Word", offset=11)))
# named search of a list of lists
# * sublists are accessed by name
# * matched texts are accessed by name
# * non-matched texts are dropped
namedSearch = AhoCorasickSearchList(keywords, list(subject=c(phrase1="What in", phrase2="the world"),
verb=c(phrase1="is"),
predicate1=c(phrase1="secret about"),
predicate2=c(phrase1="the Magic Word?")))
stopifnot(listEquals(namedSearch$subject$phrase2[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$verb$phrase1[[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(namedSearch$predicate1, list()))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[3]], list(keyword="Word", offset=11)))
# named search of multiple texts in a single list with keyword grouping and aminoacid alphabet
# * all matches to a keyword are accessed by name
# * non-matched keywords are dropped
proteins = c(protein1="PEPTIDEPEPTIDEDADADARARARARAKEKEKEKEPEPTIDE",
protein2="DERPADERPAPEWPEWPEEPEERAWRAWWARRAGTAGPEPTIDEKESEQUENCE")
peptides = c("PEPTIDE", "DERPA", "SEQUENCE", "KEKE", "PEPPIE")
peptideSearch = AhoCorasickSearch(peptides, proteins, alphabet="aminoacid", groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(list(keyword="protein1", offset=1),
list(keyword="protein1", offset=8),
list(keyword="protein1", offset=37),
list(keyword="protein2", offset=38))))
stopifnot(listEquals(peptideSearch$DERPA, list(list(keyword="protein2", offset=1),
list(keyword="protein2", offset=6))))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(list(keyword="protein2", offset=47))))
stopifnot(listEquals(peptideSearch$KEKE, list(list(keyword="protein1", offset=29),
list(keyword="protein1", offset=31),
list(keyword="protein1", offset=33))))
stopifnot(listEquals(peptideSearch$PEPPIE, NULL))
# grouping by keyword without text names: offsets are given without reference to the text
names(proteins) = NULL
peptideSearch = AhoCorasickSearch(peptides, proteins, groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(1, 8, 37, 38)))
stopifnot(listEquals(peptideSearch$DERPA, list(1, 6)))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(47)))
stopifnot(listEquals(peptideSearch$KEKE, list(29, 31, 33)))

order strings according to some characters

I have a vector of strings, each of which contains a number, and I would like to sort the vector according to that number.
MWE:
> str = paste0('N', sample(c(1,2,5,10,11,20), 6, replace = FALSE), 'someotherstring')
> str
[1] "N11someotherstring" "N5someotherstring" "N2someotherstring" "N20someotherstring" "N10someotherstring" "N1someotherstring"
> sort(str)
[1] "N10someotherstring" "N11someotherstring" "N1someotherstring" "N20someotherstring" "N2someotherstring" "N5someotherstring"
while I'd like to have
[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"
I have thought of using something like:
num = sapply(strsplit(str, split = NULL), function(s) {
as.numeric(paste0(head(s, -15)[-1], collapse = ""))
})
str = str[sort(num, index.return=TRUE)$ix]
but I guess there might be something simpler
There is an easy way to do this via the gtools package:
gtools::mixedsort(str)
#[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"

Use grep() to select character strings with "XXX-0000" syntax

Given a character vector:
id.data = c("XXX-2355",
"XYz-03",
"XYU-3",
"ABC-1234",
"AX_2356",
"AbC234")
What is the appropriate way to grep for ONLY the entries that DON'T follow an "XXX-0000" pattern? In the example above I'd want to end up with only "XXX-2355" and "ABC-1234". There are tens of thousands of records.
I tried selecting by individual issue. For example,
id.error = rep(NA, length(id.data))
id.error[-grep("-", id.data)] = "hyphen"
This was obviously really inefficient and I have no way of knowing every possible error. strsplit was useful up to a point, but only when I know where to split.
Thanks!
You seem to be looking for invert (from ?grep):
invert: logical. If TRUE, return indices or values for elements that do not match.
> id.data = c("XXX-2355",
+ "XYz-03",
+ "XYU-3",
+ "ABC-1234",
+ "AX_2356",
+ "AbC234")
> grep("[A-Z]{3}-[0-9]{4}", id.data)
[1] 1 4
> grep("[A-Z]{3}-[0-9]{4}", id.data, value = TRUE)
[1] "XXX-2355" "ABC-1234"
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE)
[1] 2 3 5 6
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE, value = TRUE)
[1] "XYz-03" "XYU-3" "AX_2356" "AbC234"
Not sure whether you want strings that match the said pattern, or those that don't match. The above example lists both options.
One way, using stringr (note [A-Za-z] rather than [A-z]; the latter also matches the punctuation characters that sit between Z and a in ASCII):
library(stringr)
id.data[str_detect(id.data, "[A-Za-z]{3}-[0-9]{4}")]
# [1] "XXX-2355" "ABC-1234"
