Extract Text Starting and Ending with Punctuation in R [duplicate]

I want to extract the text between two punctuation marks using RStudio.
I tried the str_extract command, but whenever I used anchors (^ for the start of the string and $ for the end), it failed.
Here is the sample problem:
> text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"
Here is the sample code I used:
> str_extract(text,"(Name : )(.+)?( ;)")
> str_match(str_extract(text,"(Name : )(.+)?( ;)"),"(Name : )(.+)?( ;)")[3]
But this seems too verbose and not flexible.
I only want to extract "Dr. CHARLES DOWNING MAP".
Can anyone help with my problem?
Can I tell the regex to start with any non-whitespace character after "Name : " and end before " ; POB"?

This seems to work, although the result keeps a leading space because the pattern has no space after the colon:
> gsub(".*Name :(.*) ;.*", "\\1", text)
[1] " Dr. CHARLES DOWNING MAP"

With str_match
stringr::str_match(text, "^Name : (.*) ;")[, 2]
#[1] "Dr. CHARLES DOWNING MAP"
The [, 2] index selects the contents of the capture group.
There is also qdapRegex::ex_between, which extracts the string between left and right markers:
qdapRegex::ex_between(text, "Name : ", ";")[[1]]
#[1] "Dr. CHARLES DOWNING MAP"
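To address the last part of the question more directly, a lookaround-based pattern can start at the first non-whitespace character after "Name : " and stop just before " ;". The following is a minimal base R sketch, assuming PCRE is enabled with perl = TRUE; the names `pattern` and `extracted` are illustrative and not from the original post.

```r
text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"

# (?<=Name : )  lookbehind: the match must be preceded by "Name : "
# \\S.*?        one non-whitespace character, then as few characters as possible
# (?= ;)        lookahead: the match must be followed by " ;"
pattern <- "(?<=Name : )\\S.*?(?= ;)"

# `extracted` is an illustrative name (assumption, not from the original post)
extracted <- regmatches(text, regexpr(pattern, text, perl = TRUE))
print(extracted)
```

Because the lookarounds are not part of the match, no capture group or indexing is needed to isolate the name.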

Related

R: Regex for Phone Numbers [duplicate]

I am working with the R programming language.
I have a column of data that looks something like this:
string = c("a1 123-456-7899 hh", "b 124-123-9999 b3")
I would like to remove the "phone numbers" so that the final result looks like this:
[1] "a1 hh" "b b3"
I tried to apply the answer from "Regular expression to match standard 10 digit phone number" to my problem:
gsub("^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$", "", string, fixed = TRUE)
But I get the following error: Error: '\+' is an unrecognized escape in character string starting ""^(\+"
Can someone please show me how to fix this?
Thanks!
Try:
library(stringr)
s <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")
result <- str_replace(s, "\\d+[-]\\d+[-]\\d+\\s", "")
print(result)
OUTPUT:
[1] "a1 hh" "b b3"
This looks for:
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
\\s : a whitespace character
and replaces the match with "" (an empty string).
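The error in the question comes from the single backslashes: inside an R string literal, every regex backslash has to be doubled. Two further points would still prevent a match even after that fix: fixed = TRUE makes gsub treat the pattern as literal text, and the ^ and $ anchors require the number to span the whole string. The following base R sketch keeps the separator class from the question's pattern under those assumptions; the name `cleaned` is illustrative.

```r
string <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")

# Doubled backslashes, no fixed = TRUE, and no ^/$ anchors,
# because the number sits in the middle of each string.
# The trailing \\s also removes the whitespace after the number.
# `cleaned` is an illustrative name (assumption, not from the original post).
cleaned <- gsub("\\d{3}[\\s.-]\\d{3}[\\s.-]\\d{4}\\s", "", string, perl = TRUE)
print(cleaned)
```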

remove returns in excel read in

I am reading a couple of Excel files and merging them into one data frame. Some of the address fields contain line breaks (returns). I came up with this to remove them, but it does not work, and RStudio says that there are invalid tokens in the line.
df$Primary.Street <- gsub("\r\n", " ", df$Primary.Street)
Any help would be much appreciated.
Sample of input row of how it looks in Excel:
"123 Main St
"Sam Jones" Apt A" "New York" "NY" "12345"
Desired output to csv:
"Sam Jones","123 Main St Apt A","New York","NY","12345"
Put the carriage return and newline characters in square brackets to create a character class, which matches any single character in the class:
> samp <- "120 Main st\nApt A"
> gsub("[\r\n]+", " ", samp)
[1] "120 Main st Apt A"
Without the brackets, your pattern matches only a \r immediately followed by a \n. The character class with the + quantifier matches any run of one or more of either character.
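The difference between the two patterns can be seen on a small vector that mixes the three common line-break styles. This is only a sketch; the extra sample strings and the names `seq_only` and `any_break` are assumptions added for illustration.

```r
# Three line-break styles: LF only, CR+LF, and CR only (illustrative data)
samp <- c("120 Main st\nApt A", "120 Main st\r\nApt A", "120 Main st\rApt A")

# Matches only the exact two-character sequence CR followed by LF
seq_only <- gsub("\r\n", " ", samp)

# Matches any run of CR and/or LF characters
any_break <- gsub("[\r\n]+", " ", samp)

print(seq_only)
print(any_break)
```

Only the second element changes with the first pattern, while the character class normalizes all three.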

Delete initial (matching) pattern from a string [duplicate]

I have the following vector of character strings:
v<-c("RT #name1: hello world", "Hi guys, how are you?", "Hello RT I have no text", "RT #name2: Hello!")
I would like to delete only those RT that are positioned at the beginning of strings and store the results in another vector, e.g., w:
> w
"#name1: hello world" "Hi guys, how are you?" "Hello RT I have no text" "#name2: Hello!"
Maybe I could use function str_extract_all from the package stringr, but I can't apply it to my problem.
Use gsub and the anchor ^, which matches the beginning of a string:
w <- gsub("^RT\\s", "", v)
Or, with stringr (including \\s so that no leading space is left behind):
w <- str_replace(v, "^RT\\s", "")
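As a quick check that the anchor leaves an "RT" in the middle of a string untouched, here is a small base R sketch. It uses sub rather than gsub, on the assumption that an anchored pattern can match at most once per string, and \\s+ to tolerate more than one space after "RT".

```r
v <- c("RT #name1: hello world", "Hi guys, how are you?",
       "Hello RT I have no text", "RT #name2: Hello!")

# ^ anchors the match to the start of each string, so the "RT"
# inside the third element is not removed.
w <- sub("^RT\\s+", "", v)
print(w)
```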

Problems in a regular expression to extract names using stringr

I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,
It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
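Applied in base R, the pattern above can be tried as follows. This is a sketch only; the name `starts` is illustrative. Note that the pattern by itself also picks up other entries that begin with a capital letter, such as "Senator" and "THIS IS A DOCUMENT", so further filtering may be needed to keep only town names.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
          "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# regexpr finds the first match per element; regmatches returns only
# the elements that matched, so "rule 46.1, declares" is dropped.
# `starts` is an illustrative name (assumption, not from the original post).
starts <- regmatches(text, regexpr("^[[:upper:]][^,]+", text))
print(starts)
```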
I have solved the problem thanks to @Dmitry Egorov's demo posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!
You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"

R: compare and subset two strings [duplicate]

Is there a function in R that meets this requirement:
if string1 exists in string2, then remove string1 from string2?
I spent a day searching for such a function, so any help would be appreciated.
Edit:
I have a dataframe. Here's a part of it:
mark      name                                  ChekMark
Caudalie  Caudalie Eau démaquillante 200ml      TRUE
Mustela   Mustela Bébé lait hydra corps 300ml   TRUE
Lierac    Lierac Phytolastil gel prévention     TRUE
I want to create a new data frame in which the mark does not appear in the product name.
That is my final goal.
You can use gsub and work with regular expressions:
gsub(" this part ", " ", "A Text where this part should be removed")
# [1] "A Text where should be removed"
gsub(" this part ", " ", "A Text where this 1 part should be removed")
# [1] "A Text where this 1 part should be removed"
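For the data frame in the question, the pattern differs per row, and gsub takes a single pattern, so one option is to pair each mark with its name using mapply. The following is a base R sketch under stated assumptions: the data frame is rebuilt by hand from the sample rows, the column name `product` is hypothetical, and fixed = TRUE is used so the mark is treated as literal text rather than a regex.

```r
df <- data.frame(
  mark = c("Caudalie", "Mustela", "Lierac"),
  name = c("Caudalie Eau démaquillante 200ml",
           "Mustela Bébé lait hydra corps 300ml",
           "Lierac Phytolastil gel prévention"),
  stringsAsFactors = FALSE
)

# Remove the first literal occurrence of each row's mark from that row's name,
# then trim the whitespace left behind.
# `product` is a hypothetical column name (assumption, not from the original post).
df$product <- trimws(mapply(
  function(m, n) sub(m, "", n, fixed = TRUE),
  df$mark, df$name,
  USE.NAMES = FALSE
))
print(df$product)
```

If the stringr package is available, its str_remove function is vectorized over both the string and the pattern and could be used for the same purpose.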
Are you looking for something like string2.replace(string1, '')? Note that this is Python syntax, not R. As a function:
>>> R = lambda string1, string2: string2.replace(string1, '')
>>> R('abc', 'AAAabcBBB')
'AAABBB'
