Extract the strings that follow a regex pattern in R [duplicate] - r

This question already has answers here:
Extract a regular expression match
(12 answers)
Closed 3 years ago.
The original input is a vector of free-text fields. The task is to extract a pattern like "234-5678" from each string.
For example, given the following vector:
text <- c("abced 156-8790","kien 3578-562839 bewsd","$nietl 66320-98703","789-55340")
What I would like to extract is:
return <- c("156-8790","578-5628","320-9870","789-5534")
I was considering using gsub("^[([:digit:]{3})[-]([:digit:]{4})]", replacement = "", text), but the regex does not work the way I wanted. Could anyone please help with this? Many thanks in advance!

We can use str_extract to match 3 digits (\\d{3}) followed by a -, followed by 4 digits (\\d{4}):
library(stringr)
str_extract(text, "\\d{3}-\\d{4}")
#[1] "156-8790" "578-5628" "320-9870" "789-5534"
Or using base R with regmatches/regexpr
regmatches(text, regexpr("\\d{3}-\\d{4}", text))
#[1] "156-8790" "578-5628" "320-9870" "789-5534"
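One difference between the two approaches matters when some elements contain no match: str_extract returns NA for those elements, while regmatches with regexpr drops them, so the result can be shorter than the input. Below is a minimal base R sketch that keeps the output aligned with the input. extract_first is a hypothetical helper name, not part of any package, and the second input string is made up for illustration.

```r
# Hypothetical helper: first match per element, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  hit <- !is.na(m) & m != -1        # elements that contain a match
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)      # regmatches returns only the hits
  out
}

# Assumed example input; the middle element has no match on purpose.
text <- c("abced 156-8790", "no digits here", "789-55340")
res <- extract_first(text, "\\d{3}-\\d{4}")
print(res)
```

Keeping the result the same length as the input makes it safe to assign back into a data frame column.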

Related

String split to remove everything after _ [duplicate]

This question already has answers here:
How to extract everything until first occurrence of pattern
(4 answers)
Closed 1 year ago.
I have a list of file names and want to extract just the part of each name before the _.
I tried the following, but was unsuccessful:
condition <- strsplit(count_files, "_*")
I also tried:
condition <- strsplit(count_files, "_*.[c,t]sv")
Any suggestions?
Just use trimws from base R
trimws(count_files, whitespace = "_.*")
[1] "Fibroblast" "Fibroblast"
The output of strsplit is a list, so it may need to be unlisted. Also, the regex _* means zero or more _ characters. It should instead be _.*, i.e. a _ followed by zero or more other characters (.*):
unlist(strsplit(count_files, "_.*"))
data
count_files <- c("Fibroblast_1.csv", "Fibroblast_2.csv")
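Since the goal is only to drop everything from the first underscore onward, sub can also do this directly, and it returns a character vector rather than a list. Below is a minimal base R sketch. The third file name is an assumed example, added to show that a name without an underscore comes back unchanged.

```r
# The first two names come from the question; "Control.csv" is made up.
count_files <- c("Fibroblast_1.csv", "Fibroblast_2.csv", "Control.csv")

# "_.*" matches the first underscore and everything after it.
condition <- sub("_.*", "", count_files)
print(condition)
```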

Replace "$" in a string in R [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 1 year ago.
I would like to replace $ in my R strings. I have tried:
mystring <- "file.tree.id$HASHd15962267-44c21f1cee1057d95d6840$HASHe92451fece3b3341962516acfa962b2f$checked"
stringr::str_replace(mystring, pattern="$",
replacement="!")
However, it fails and my replacement character is put as the last character in my original string:
[1] "file.tree.id$HASHd15962267-44c21f1cee1057d95d6840$HASHe92451fece3b3341962516acfa962b2f$checked!"
I tried some variations using pattern="/$", but they fail as well. Can someone point me to a strategy for doing this?
In base R, you could use:
chartr("$","!", mystring)
[1] "file.tree.id!HASHd15962267-44c21f1cee1057d95d6840!HASHe92451fece3b3341962516acfa962b2f!checked"
Or even
gsub("$","!", mystring, fixed = TRUE)
In stringr, the pattern needs to be wrapped in fixed(), because by default the pattern is interpreted as a regex, and in a regex $ matches the end of the string:
stringr::str_replace_all(mystring, pattern = stringr::fixed("$"),
replacement = "!")
Alternatively, we could escape it (\\$) or place it in square brackets ([$]), but fixed should be faster.
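The options above can be compared side by side. This is a sketch in base R only, run on a shortened, made-up string rather than the original one:

```r
mystring <- "id$HASHa$HASHb$checked"   # assumed short stand-in for the real string

# Unescaped "$" is the end-of-string anchor, so "!" is appended instead.
anchor_res  <- gsub("$", "!", mystring)

# Four ways to treat "$" as a literal character.
fixed_res   <- gsub("$", "!", mystring, fixed = TRUE)
escaped_res <- gsub("\\$", "!", mystring)
class_res   <- gsub("[$]", "!", mystring)
chartr_res  <- chartr("$", "!", mystring)

print(anchor_res)
print(fixed_res)
```

All four literal variants give the same result. chartr only works for single-character swaps, while gsub can replace "$" with a longer string.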

Split string after last underscore in R [duplicate]

This question already has answers here:
Separate string after last underscore
(2 answers)
Closed 2 years ago.
I have a string like "ABC_Something_Filename". How can I split it into "ABC_Something" and "Filename" in R?
I do not want to remove anything. I want both components - before and after last underscore.
Edit: I tried using what's mentioned for column separation, but that is too extensive for my use case. Hence, I am looking for a regex alternative that simply splits a string.
One option would be to use strsplit with a negative lookahead which asserts that the underscore on which to split is the final one in the input:
input <- "ABC_Something_Filename"
parts <- strsplit(input, "_(?!.*_)", perl=TRUE)[[1]]
parts
[1] "ABC_Something" "Filename"
You can use str_match and capture data in two groups.
x <- 'ABC_Something_Filename'
stringr::str_match(x, '(.*)_(.*)')[, -1]
#[1] "ABC_Something" "Filename"
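A third possibility, sketched here in base R, is to take the two halves with two separate sub calls. It relies on .* being greedy and works on a whole vector at once. The second input string is an assumed example showing that a string without an underscore is returned unchanged in both halves.

```r
x <- c("ABC_Something_Filename", "NoUnderscore")  # second element is made up

# Drop the last underscore and everything after it.
before <- sub("_[^_]*$", "", x)
# Greedy ".*_" consumes everything up to and including the last underscore.
after  <- sub(".*_", "", x)

print(before)
print(after)
```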

Replace a string with first few characters [duplicate]

This question already has answers here:
Regex group capture in R with multiple capture-groups
(9 answers)
Closed 2 years ago.
Let's say I have a string like:
Str = "#sometext_any_character_including_&**(_etc_blabla\\s"
Now I want to replace above text with
"#some\\s"
i.e. I just want to retain the leading #, the first 4 characters after it, and the trailing \\s. Is there any way to do this in R?
Any pointer will be highly appreciated.
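I would extract the pieces using a regex. To keep the leading # with its first four letters together with the trailing \\s, I would capture both, for example in Python:
import re
# Input from the question; it ends in a literal backslash followed by "s"
my_string = "#sometext_any_character_including_&**(_etc_blabla\\s"
# Match "#" plus four lowercase letters, or a literal backslash followed by "s"
pattern = re.compile(r"(#[a-z]{4}|\\s)")
my_match = "".join(pattern.findall(my_string))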
An option with sub
sub("^(#.{4}).*(\\\\s)$", "\\1\\2", Str)
#[1] "#some\\s"
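If the positions are always fixed (the first five characters, counting the #, and the last two), the same result can be obtained without a regex. This is a minimal base R sketch under that assumption:

```r
Str <- "#sometext_any_character_including_&**(_etc_blabla\\s"

head_part <- substr(Str, 1, 5)               # "#" plus the first 4 characters
tail_part <- substring(Str, nchar(Str) - 1)  # last 2 characters: backslash and "s"
res <- paste0(head_part, tail_part)
print(res)
```

Unlike the sub version, this does not check that the string actually starts with # or ends with \\s, so it only suits inputs known to have that shape.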
You can use
str_replace(string, pattern, replacement)
or
str_replace_all(string, pattern, replacement)

How to use two conditions on string search using gsub in R? [duplicate]

This question already has answers here:
Using regex in R to find strings as whole words (but not strings as part of words)
(2 answers)
Closed 3 years ago.
I have a character vector and I want to find every occurrence of "RR" and replace it with "" (an empty string), but without touching "ANRR". I was thinking of something like:
gsub("RR|!ANRR", "",charvector$vector)
But it doesn't work. How can I include "OR" and "NOT" in the same expression?
Perhaps we need a word boundary (\\b) or a space (\\s) to make sure that only the standalone 'RR' is matched and not the 'RR' in 'ANRR':
gsub("\\bRR\\b", "",charvector$vector)
Or, if we want to replace 'RR' inside a word whenever it is not preceded by 'AN', we can use a negative lookbehind:
gsub("(?<!AN)RR", "", charvector$vector, perl = TRUE)
data
charvector <- data.frame(vector = c('hello RR sds ANRR dss RR',
'RR dds ANRR CNRR'), stringsAsFactors = FALSE)
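The two patterns differ on 'CNRR'. The word-boundary version leaves it alone, while the lookbehind version strips its 'RR' because the preceding letters are not 'AN'. The sketch below runs both on the sample data. perl = TRUE is set on both calls here for consistency, although only the lookbehind requires it.

```r
charvector <- data.frame(vector = c("hello RR sds ANRR dss RR",
                                    "RR dds ANRR CNRR"),
                         stringsAsFactors = FALSE)

# Remove "RR" only when it stands alone as a word.
whole_word <- gsub("\\bRR\\b", "", charvector$vector, perl = TRUE)
# Remove any "RR" that is not immediately preceded by "AN".
not_after_an <- gsub("(?<!AN)RR", "", charvector$vector, perl = TRUE)

print(whole_word)
print(not_after_an)
```

Both versions leave the surrounding spaces in place, so a follow-up cleanup of doubled or leading spaces may be wanted.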
