Number followed by hyphen - r

I have a data frame which contains strings. Refer to the code below -
mydf = data.frame(x=c("ads 1-x as", "sda 1-xxaa sad", "sda a-x sad"))
I want each word that matches the pattern: a number followed by a hyphen followed by a single letter, to be replaced with the number only.
Expected Output -
"ads 1 as", "sda 1-xxaa sad", "sda a-x sad"

We can use sub (or gsub, if a string can contain more than one instance) to match one or more digits captured as a group ((\\d+), preceded by a word boundary, \\b), followed by a hyphen (-) and a single letter ([A-Za-z]). The trailing word boundary (\\b) prevents a match when more letters follow. The whole match is then replaced with the backreference (\\1) to the captured group.
gsub("\\b(\\d+)-[A-Za-z]\\b", "\\1", mydf$x)
#[1] "ads 1 as" "sda 1-xxaa sad" "sda a-x sad"
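To illustrate the difference between sub and gsub, here is a small sketch. The string y is made up for this example and is not part of the question; it contains two instances of the pattern.

```r
# Hypothetical input with two "digits-hyphen-single letter" words in one string
y <- "ads 1-x as 22-b"
pattern <- "\\b(\\d+)-[A-Za-z]\\b"

first_only <- sub(pattern, "\\1", y)   # replaces the first match only
all_hits   <- gsub(pattern, "\\1", y)  # replaces every match

print(first_only)
# [1] "ads 1 as 22-b"
print(all_hits)
# [1] "ads 1 as 22"
```

With one instance per string, as in mydf$x, both functions give the same result.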

Related

How to match binomial expressions in R?

I want to match binomials, that is, bisyllabic words, sometimes hyphenated, with slightly varied syllable reduplication; the variation always concerns the first (and, possibly, second) letter in the reduplicated syllable:
x <- c("pow-wow", "pickwick", "easy-peasy", "nitty-gritty", "bzzzzzzz", "mmmmmm", "shish", "wedged", "yaaaaaa")
Here, we have said syllable reduplication in pow-wow, pickwick, easy-peasy, and nitty-gritty (which are then the expected output) but not in bzzzzzzz, mmmmmm, shish, wedged and yaaaaaa.
This regex at least manages to get rid of wedged (which is pronounced as one syllable) as well as the monosyllabic words, by requiring the presence of a vowel in the capturing group:
grep("\\b\\w?((?!ed)(?=[aeiou])\\w{2,})-?\\w\\w?\\1\\b$", x, value = TRUE, perl = TRUE)
[1] "pow-wow" "pickwick" "easy-peasy" "nitty-gritty" "yaaaaaa"
However, yaaaaaa is getting matched too. To avoid matching it, my feeling is that the capturing group should be disallowed from containing two identical vowels in immediate succession, but I don't know how to implement that restriction.
Any ideas?
It looks as though you want to match words where the repeated chunk cannot start with ed, and cannot start with 2 or more identical chars unless that same run is found again farther in the string. Also, the allowed "difference" window is 0 to 2 characters at the start and 1 or 2 characters in the middle.
You may use
\b\w{0,2}(?!((.)\2+)(?!.*\1)|ed)([aeiou]\w+)-?\w\w?\3\b
See the regex demo
Details
\b - a word boundary (you may use ^ if your "words" are equal to whole string)
\w{0,2} - zero to two word chars (replace \w with \p{L} to only match letters)
(?!((.)\2+)(?!.*\1)|ed) - no ed or two or more identical chars that do not repeat later in the string are allowed immediately to the right of the current location
([aeiou]\w+) - Group 3: a vowel and 1+ word chars (replace \w with \p{L} to only match letters)
-? - an optional hyphen
\w\w? - 1 or 2 word chars
\3 - same value as captured in Group 3
\b - a word boundary (you may use $ if your "words" are equal to whole string)
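As a quick check, the pattern can be run against the vector from the question. This is only a sketch: it assumes perl = TRUE (PCRE), as in the question's own grep call, and the backslashes are doubled for the R string literal.

```r
x <- c("pow-wow", "pickwick", "easy-peasy", "nitty-gritty",
       "bzzzzzzz", "mmmmmm", "shish", "wedged", "yaaaaaa")

# Same pattern as above, written as an R string
pattern <- "\\b\\w{0,2}(?!((.)\\2+)(?!.*\\1)|ed)([aeiou]\\w+)-?\\w\\w?\\3\\b"

binomials <- grep(pattern, x, value = TRUE, perl = TRUE)
print(binomials)
# expected: pow-wow, pickwick, easy-peasy, nitty-gritty
```

yaaaaaa is rejected because the run of a's that would have to start Group 3 does not occur again later in the string.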

Function for regular expression in R

I need to extract certain sequences from a string of text.
Something like 93085k82 will be embedded in text.
Is there a script that can identify when 5 numbers, a letter, and then 2 numbers occur?
We can use a pattern starting with a word boundary (\\b) followed by five digits (\\d{5}), a lowercase letter ([a-z]{1}), and two digits (\\d{2}), followed by another word boundary (\\b). grep returns the indices of the elements of v1 that match:
grep("\\b\\d{5}[a-z]{1}\\d{2}\\b", v1)
If we need to extract the matches themselves:
library(stringr)
str_extract_all(v1, "\\b\\d{5}[a-z]{1}\\d{2}\\b")
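The question does not show v1, so here is a sketch with a made-up vector. It uses base R's regmatches and gregexpr in place of stringr, only so that the example runs without extra packages; {1} is dropped from the pattern because it is redundant.

```r
# Hypothetical input; only the first and third elements contain a valid code
v1 <- c("order 93085k82 shipped", "no code here", "x12345a67 and 11111b22")
pattern <- "\\b\\d{5}[a-z]\\d{2}\\b"

has_code <- grepl(pattern, v1)                  # logical vector, one value per element
codes <- regmatches(v1, gregexpr(pattern, v1))  # list of matches per element

print(has_code)
# [1]  TRUE FALSE  TRUE
print(unlist(codes))
# [1] "93085k82" "11111b22"
```

Note that 12345a67 inside x12345a67 is not extracted, because there is no word boundary between x and 1.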

Replace matched string with its sub-groups

I have some DNA sequences to process, they look like:
>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA
>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT
I coded a regex in order to detect the title line of each sequence:
(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n
What function should I use to replace this whole match with its group3 + group4? In addition, I've got 72 matches in one txt file, how can I replace them in one run?
Your current regex won't work for lines where Group 3 or 4 contains a single-letter word, because [a-zA-Z]+\\-?[a-zA-Z]+ matches 1+ letters, then an optional hyphen, and then again 1+ letters (that means there must be at least 2 letters). With [a-zA-Z]+(?:-[a-zA-Z]+)?, you can match 1+ letters followed by an optional sequence of - and then 1+ letters.
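A small sketch of the two-letter minimum; the test words are made up for illustration, and perl = TRUE is assumed for the non-capturing group.

```r
# Hypothetical words to compare the two sub-patterns
words <- c("a", "ab", "bio-material", "x-ray")

old_part <- "^[a-zA-Z]+-?[a-zA-Z]+$"        # needs at least 2 letters
new_part <- "^[a-zA-Z]+(?:-[a-zA-Z]+)?$"    # 1 letter is enough

old_ok <- grepl(old_part, words, perl = TRUE)
new_ok <- grepl(new_part, words, perl = TRUE)

print(old_ok)
# [1] FALSE  TRUE  TRUE  TRUE
print(new_ok)
# [1] TRUE TRUE TRUE TRUE
```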
Also, \s matches line breaks too, so if a title line is shorter than you assume, the pattern can run on into the next line and .* may grab a sequence line by mistake. You may use \h or [ \t] instead.
Note that \n is not necessary at the end of the pattern: with the ICU regex library (which str_replace_all in your current code uses), .* matches any 0+ chars other than line break chars, so it stops at the end of the line.
In general, you should only capture with (...) what you need to keep; everything else can just be matched. Removing the extra capturing parentheses also saves some performance.
If you add (?m)^ at the start, you will make sure you only match > that is at the start of a line.
You may use
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"
See the regex demo.
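To try the pattern without any packages, here is a sketch that swaps in base R's gsub with perl = TRUE (PCRE also supports (?m) and \h). The input is built from the two header lines in the question, with the sequence lines shortened.

```r
# Two header lines and shortened sequence lines (sequences truncated for brevity)
fasta <- paste(
  ">KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast",
  "TAAATTATGTGTCAGAGCTATTAATACC",
  ">KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast",
  "GATATATTAATACCTTACCCCGCTCATC",
  sep = "\n"
)

pattern <- "(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"

# gsub replaces every header in one run; sequence lines are left untouched
cleaned <- gsub(pattern, "\\1 \\2", fasta, perl = TRUE)
cat(cleaned, "\n")
```

Each header line becomes just the genus and species, for example Acalypha australis. The replacement "\\1 \\2" drops the leading >; use ">\\1 \\2" if the marker should be kept.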
Code:
Sequence <- str_replace_all(SequenceRaw,
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*",
"\\1 \\2")
I figured it out myself, with tidyverse packages:
library(tidyverse)
SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta
Sequence <- str_replace_all(SequenceRaw,
"(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n",
">\\3 \\4\n") ## Keep '>' and add a new line with '\n'
write_file(Sequence, "YOUR PATH\\sequence.fasta")

How to extract a substring by inverse pattern with R?

I am trying to extract a substring by pattern using the gsub() R function.
# Example: extracting "7 years" substring.
string <- "Psychologist - 7 years on the website, online"
gsub(pattern="[0-9]+\\s+\\w+", replacement="", string)
[1] "Psychologist - on the website, online"
As you can see, it's easy to exclude the needed substring using gsub(), but I need to invert the result and get "7 years" only.
I thought about using "^", something like this:
gsub(pattern="[^[0-9]+\\s+\\w+]", replacement="", string)
Please, could anyone help me with correct regexp pattern?
You may use
sub(pattern=".*?([0-9]+\\s+\\w+).*", replacement="\\1", string)
See this R demo.
Details
.*? - any 0+ chars, as few as possible
([0-9]+\\s+\\w+) - Capturing group 1:
[0-9]+ - one or more digits
\\s+ - 1 or more whitespaces
\\w+ - 1 or more word chars
.* - the rest of the string (any 0+ chars, as many as possible)
The \1 in the replacement replaces with the contents of Group 1.
You could use the opposite of \d, which is \D in R:
string <- "Psychologist - 7 years on the website, online"
sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", string)
# [1] "7 years"
\D* matches zero or more non-digit characters, as many as possible. The part that follows is captured in a group, and the whole string is then replaced with the contents of that group.
See a demo on regex101.com.
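As an alternative sketch (a different technique from the sub() calls above), base R's regexpr and regmatches can pull out the match directly. The second string is made up to show what happens when nothing matches; perl = TRUE is assumed throughout.

```r
strings <- c("Psychologist - 7 years on the website, online",
             "Coach - new on the website")   # hypothetical string without a number
pattern <- "[0-9]+\\s+\\w+"

# regmatches keeps only the strings that actually contain a match
extracted <- regmatches(strings, regexpr(pattern, strings, perl = TRUE))

# sub leaves a non-matching string unchanged instead of dropping it
via_sub <- sub(paste0(".*?(", pattern, ").*"), "\\1", strings, perl = TRUE)

print(extracted)
# [1] "7 years"
print(via_sub)
# [1] "7 years"                    "Coach - new on the website"
```

Which behaviour is preferable depends on whether non-matching inputs should be dropped or passed through.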

text manipulation in R

I am trying to add parentheses around certain parts of book title character strings, and I want to be able to paste them with the paste0 function. I want to take this vector:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)", "My Life 1993 07e pdfDrama (amazon.com)")
and wrap certain substrings in parentheses:
a
[1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
[2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"
I have tried but can't figure out a way to replace them within the string:
paste0("(",str_extract(a, "\\d{4}"),")")
paste0("(",str_extract(a, "[0-9]+.e"),")")
Help?
I can suggest a regex for a fixed number of words of specific type:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)","My Life 1993 07e pdfDrama (amazon.com)")
sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))", "(\\1)\\2(\\3)\\4(\\5)\\6", a)
See the R demo
And here is the regex demo. In short,
\\b(\\d{4}) - captures 4 digits as a whole word into Group 1
(\\s+) - Group 2: one or more whitespaces
(\\d+e) - Group 3: one or more digits and e
(\\s+) - Group 4: same as Group 2, one or more whitespaces
([a-zA-Z]+) - Group 5: one or more letters
(\\s+\\([^()]*\\)) - Group 6: one or more whitespaces, (, zero or more chars other than ( and ), ).
The contents of the groups are inserted back into the result with the help of backreferences.
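Here is a sketch of what the call returns. The third string is made up for illustration: it has an extra word before the parenthesised part, so the fixed-shape pattern does not match and sub() leaves it unchanged.

```r
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
       "My Life 1993 07e pdfDrama (amazon.com)",
       "Other Book 2001 03e pdfDrama extra (amazon.com)")  # hypothetical extra word

wrapped <- sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))",
               "(\\1)\\2(\\3)\\4(\\5)\\6", a)
print(wrapped)
# [1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
# [2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"
# [3] "Other Book 2001 03e pdfDrama extra (amazon.com)"
```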
If there are more words, and you need to wrap words starting with a letter/digit/underscore after a 4-digit word in the string, use
gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl=TRUE)
See the R demo and a regex demo
Details:
(?:(?=\\b\\d{4}\\b)|\\G(?!\\A)) - either the location before a 4-digit whole word (see the positive lookahead (?=\\b\\d{4}\\b)) or the end of the previous successful match
\\s* - 0+ whitespaces
\\K - omitting the text matched so far
\\b(\\S+) - Group 1 capturing 1 or more non-whitespace symbols that are preceded with a word boundary.
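A sketch of the \G-based call on a made-up string with one more word after the 4-digit number. The string b is hypothetical, and the expected result is what the pattern description above implies.

```r
# Hypothetical string with an extra word ("extra") before the parenthesised part
b <- "Other Book 2001 03e pdfDrama extra (amazon.com)"

wrapped_all <- gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)",
                    "(\\1)", b, perl = TRUE)
print(wrapped_all)
# expected: "Other Book (2001) (03e) (pdfDrama) (extra) (amazon.com)"
```

The chain of matches stops at (amazon.com) because there is no word boundary between the space and the opening parenthesis, so the existing parentheses are not doubled.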
