Regular Expression using R Programming Language

String<- "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"
The value I want to extract is t(1;19)(p32;q13.3), t(6;9)(p22;q34), t(12;16)(p12;q21)
The regex I'm using
ABC<-str_extract(String, regex("t.{1,16}"))
Output I get: t(1;19)(p32;q13.3
I know my code is incomplete, but I'm unable to figure out a way to extract this information.
Thank you in advance

Assuming your String is :
String<- "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"
We can use str_extract_all as :
stringr::str_extract_all(String, "t\\(.*?\\)\\(.*?\\)")[[1]]
#[1] "t(1;19)(p32;q13.3)" "t(6;9)(p22;q34)" "t(12;16)(p12;q21)"
This matches a literal "t", followed by a parenthesized group, followed immediately by a second parenthesized group. The lazy quantifier .*? stops each group at the first closing parenthesis it meets.
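If stringr is not available, roughly the same extraction can be sketched in base R. This is only an illustration, not part of the original answer: the variable names pattern and translocations are made up here, and the negated class [^)]* is swapped in for the lazy .*? so that each group stops at its own closing parenthesis.

```r
# Same input as in the question
String <- "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"

# [^)]* matches anything except a closing parenthesis, so each group
# ends at its own ")" without relying on a lazy quantifier
pattern <- "t\\([^)]*\\)\\([^)]*\\)"

# gregexpr() locates every match; regmatches() pulls out the matched text
translocations <- regmatches(String, gregexpr(pattern, String))[[1]]
print(translocations)
# [1] "t(1;19)(p32;q13.3)" "t(6;9)(p22;q34)"    "t(12;16)(p12;q21)"
```

Note that del(32) is skipped because it does not start with "t(".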

Related

Extract string within first two quotation marks using regular expressions?

There is a vector of strings that looks like the following (text with two or more substrings in quotation marks):
vec <- 'ab"cd"efghi"j"kl"m"'
The text within the first pair of quotation marks (cd) contains a useful identifier (cd is the desired output). I have been studying how to use regular expressions but I haven't learned how to find the first and second occurrences of something like quotation marks.
Here's how I have been getting cd:
tmp <- strsplit(vec,split="")[[1]]
paste(tmp[(which(tmp=='\"')[1]+1):(which(tmp=='\"')[2]-1)],collapse="")
"cd"
My question: is there another way to find "cd" using regular expressions? I am asking in order to learn more about how to use them. I prefer base R solutions but will accept an answer using packages if that's the only way. Thanks for your help.
Match everything up to the first ", capture everything up to the next ", and replace the whole string with the captured group.
gsub( '[^"]*"([^"]*).*', '\\1', vec)
[1] "cd"
For detailed explanation of regex you can see this demo
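As an alternative to replacing the whole string, the match itself can be extracted. The following is a minimal base R sketch, assuming the same vec as in the question; the names first_quoted and identifier are hypothetical.

```r
vec <- 'ab"cd"efghi"j"kl"m"'

# regexpr() returns only the FIRST match, which is exactly the
# first pair of quotation marks and the text between them
first_quoted <- regmatches(vec, regexpr('"[^"]*"', vec))

# Strip the surrounding quotation marks from the match
identifier <- gsub('"', "", first_quoted, fixed = TRUE)
print(identifier)
# [1] "cd"
```

Because regexpr() stops at the first match, the later quoted substrings ("j" and "m") never come into play.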

Extract numerical value before a string in R

I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of html documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. i.e. '732,234 people own these' - I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always surrounded by an HTML tag. I tried using XPath, but that seemed even harder than a regular expression. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811
Could you please try the following?
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
Output will be as follows.
[1] "742,811"
Explanation: gsub (global substitution) replaces every occurrence of a space followed by one or more lowercase or uppercase letters with an empty string, leaving only the number in val.
Try using str_extract_all from the stringr library:
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
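The lookahead idea also works in base R when perl = TRUE is set. The sketch below is an assumption-laden illustration: html stands in for one line of the real documents, and count_text and count_value are hypothetical names.

```r
# Example string from the question, including the surrounding > and <
html <- ">742,811 people own these<"

# (?= people own these) is a lookahead: it requires the phrase to follow
# but keeps it out of the match; lookaheads need perl = TRUE in base R
count_text <- regmatches(
  html,
  regexpr("[0-9][0-9,]*(?= people own these)", html, perl = TRUE)
)

# Drop the thousands separator to get a number
count_value <- as.numeric(gsub(",", "", count_text, fixed = TRUE))
print(count_text)
# [1] "742,811"
print(count_value)
# [1] 742811
```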

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex seems fine to me. You are getting the error because lookaheads such as (?=:) require perl=TRUE. Even with that fixed, I don't think grep is the right function here.
I would recommend using :
stringr::str_extract( time, "\\d+?(?=:)")
grep works a little differently from how it is being used here: it is good for filtering a vector down to the elements that match a pattern, but it cannot pluck a substring out of a string.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and filter out the first string like below:
strsplit(time, ":")[[1]][1]
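Two further base R options that avoid lookaheads altogether are sketched here. The name time_str is used instead of time (an assumption made only to avoid shadowing the base function time()), and before_colon and trimmed are likewise hypothetical.

```r
time_str <- "12:05:41"

# Option 1: match from the start up to, but not including, the first colon
before_colon <- regmatches(time_str, regexpr("^[^:]+", time_str))

# Option 2: delete everything from the first colon onward
trimmed <- sub(":.*$", "", time_str)

print(before_colon)
# [1] "12"
print(trimmed)
# [1] "12"
```

Both rely on the default regex engine, so perl = TRUE is not needed.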

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can keep a pattern and remove everything outside it in two steps. Step 1 is to put the pattern we'd like to keep in parentheses, which makes it a capture group. Step 2 is to reference that group by number in the replacement, since there may be several parenthesized groups. For example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* is a wildcard, by the way: it matches any run of characters. If you'd like to learn more about using regex in R, here's a reference that can get you started.
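One caveat with the str_extract answer: "SS[^T]*" stops at the first "T" of any kind, not specifically at "T0". A lazy match with a lookahead can be sketched in base R as follows; the third element of Target_extra is a made-up input added only to show the difference, and the variable names are hypothetical.

```r
# First two values are from the question; the third is a hypothetical
# identifier containing a "T" before the final "T0"
Target_extra <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021", "tes_9_SS1T_340T01")

# SS.*? expands one character at a time until the lookahead (?=T0) succeeds
kept <- regmatches(Target_extra, regexpr("SS.*?(?=T0)", Target_extra, perl = TRUE))
print(kept)
# [1] "SS1G_340" "SS2G_340" "SS1T_340"

# For comparison, the negated class cuts the third value short at its first "T"
cut_short <- regmatches(Target_extra, regexpr("SS[^T]*", Target_extra))
print(cut_short)
# [1] "SS1G_340" "SS2G_340" "SS1"
```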

R gsub regular expression syntax error

If I have a string such as 2017-01-12T19:00:00.000+000 and I want 2017-01-12, that is, delete everything from the "T" onward, how do I proceed?
gsub("$.*T"," ","2017-01-12T19:00:00.000+000")
Would this not work? I am referring to: http://www.endmemo.com/program/R/gsub.php
Thank you!
One approach is to match and capture the date portion of your string using gsub() and then replace the entire string with what was captured.
gsub("(\\d{4}-\\d{2}-\\d{2}).*","\\1","2017-01-12T19:00:00.000+000")
[1] "2017-01-12"
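Your original approach, corrected. The $ anchor matches the end of the string, so "$.*T" can never match; match from the T to the end instead and replace it with an empty string: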
gsub("T.*","","2017-01-12T19:00:00.000+000")
[1] "2017-01-12"
As others have said, if the need for this format exceeds the scope of this particular timestamp string, then you should consider using a date API instead.
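A minimal sketch of the date-based route in base R, assuming only the calendar date is needed; the names stamp and day are hypothetical. as.Date() processes the string only as far as the format requires and ignores trailing characters.

```r
stamp <- "2017-01-12T19:00:00.000+000"

# Parse just the leading year-month-day part; everything after it
# (the "T19:00:00.000+000" tail) is ignored by as.Date()
day <- as.Date(stamp, format = "%Y-%m-%d")

print(format(day))
# [1] "2017-01-12"
print(class(day))
# [1] "Date"
```

Unlike the gsub result, day is a Date object, so date arithmetic such as day + 1 works directly.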
