R: separate words from numbers in a string

I need to clean up some data strings that contain words and numbers, or just numbers.
Below is a toy sample:
library(tidyverse)
c("555","Word 123", "two words 123", "three words here 123") %>%
sub("(\\w+) (\\d*)", "\\1|\\2", .)
The result is this:
[1] "555" "Word|123" "two|words 123" "three|words here 123"
But I want to place the '|' before the last set of numbers, as shown below:
[1] "|555" "Word|123" "two words|123" "three words here|123"

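We can use sub to match zero or more spaces (\\s*) followed by a digit, which we capture as a group ((\\d)), and in the replacement use | followed by the backreference (\\1) of the captured group. Note that sub replaces only the first match, so the | lands before the first digit in the string, which coincides with the last set of numbers in this sample: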
sub("\\s*(\\d)", "|\\1", v1)
#[1] "|555" "Word|123"
#[3] "two words|123" "three words here|123"
data
v1 <- c("555","Word 123", "two words 123", "three words here 123")
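The pattern above keys on the first digit, which happens to start the last number in the sample data. As a minimal sketch for strings that may contain an earlier number, one option is to anchor the digits to the end of the string with $. The vector v2 and its last element are made up for illustration, and perl = TRUE is an assumption to get PCRE matching.

```r
# Hypothetical input: the last element contains two separate numbers.
v2 <- c("555", "Word 123", "two words 123", "room 12 floor 3")

# First-digit approach: sub() replaces only the first match,
# so the split happens at the first number in the string.
first_digit <- sub("\\s*(\\d)", "|\\1", v2)

# End-anchored variant: optional spaces, then the trailing digits, then end of string.
last_number <- sub("\\s*(\\d*)$", "|\\1", v2, perl = TRUE)

print(first_digit)
print(last_number)
```

The two results differ only for the last element, where the end-anchored pattern places the | before the final number.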

You may use
^(.*?)\s*(\d*)$
Replace with \1|\2.
In R:
sub("^(.*?)\\s*(\\d*)$", "\\1|\\2", .)
Details
^ - start of string
(.*?) - Capturing group 1: any 0+ chars, as few as possible
\s* - zero or more whitespaces
(\d*) - Capturing group 2: zero or more digits
$ - end of string.
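To see what each capturing group described above actually holds, the groups can be pulled out with regexec and regmatches. This is only an illustrative sketch: the variable names m, parts and rebuilt are not from the original answer, and perl = TRUE is an assumption to get PCRE handling of the lazy quantifier.

```r
v1 <- c("555", "Word 123", "two words 123", "three words here 123")

# regexec() reports the full match plus every capturing group.
m <- regmatches(v1, regexec("^(.*?)\\s*(\\d*)$", v1, perl = TRUE))

# One row per input string: full match, group 1 (words), group 2 (digits).
parts <- do.call(rbind, m)
colnames(parts) <- c("match", "words", "digits")
print(parts)

# Rebuild the requested output from the two groups.
rebuilt <- paste(parts[, "words"], parts[, "digits"], sep = "|")
print(rebuilt)
```

For "555" the words group is empty, which is why the result starts with the | character.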

Related

str_replace: replacement depending on wildcard value [A-Z]

I have a number of strings containing the pattern "of" followed by an uppercase letter without spaces (in regex: "of[A-Z]"). I want to add spaces, e.g. "PrinceofWales" should become "Prince of Wales" etc.). However, I couldn't find how to add the value of [A-Z] that was matched into the replacement value:
library(tidyverse)
str_replace("PrinceofWales", "of[A-Z]", " of [A-Z]")
# Gives: Prince of [A-Z]ales
# Expected: Prince of Wales
str_replace("DukeofEdinburgh", "of[A-Z]", " of [A-Z]")
# Gives: Duke of [A-Z]dinburgh
# Expected: Duke of Edinburgh
Can someone enlighten me? :)
The letter needs to be captured as a group (([A-Z])) and reinserted with the backreference (\\1) of that group. Regex syntax is interpreted in the pattern, not in the replacement, which is why [A-Z] appeared literally in the output.
stringr::str_replace("PrinceofWales", "of([A-Z])", " of \\1")
[1] "Prince of Wales"
According to ?str_replace
replacement - A character vector of replacements. Should be either length one, or the same length as string or pattern. References of the form \1, \2, etc will be replaced with the contents of the respective matched group (created by ()).
Another option is a regex lookahead, which checks for the uppercase letter without consuming it:
stringr::str_replace("PrinceofWales", "of(?=[A-Z])", " of ")
[1] "Prince of Wales"
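For comparison, here is a rough base R sketch of both approaches applied to a whole vector with gsub. The vector titles and its third element are hypothetical, added to show that an "of" followed by a lowercase letter is left alone; perl = TRUE is needed for the lookahead and is used for both calls here for consistency.

```r
# Hypothetical inputs; "Professor" contains "of" followed by a lowercase letter.
titles <- c("PrinceofWales", "DukeofEdinburgh", "ProfessorofMath")

# Capture the uppercase letter and put it back with \\1.
with_group <- gsub("of([A-Z])", " of \\1", titles, perl = TRUE)

# Lookahead: assert the uppercase letter without consuming it.
with_lookahead <- gsub("of(?=[A-Z])", " of ", titles, perl = TRUE)

print(with_group)
print(with_lookahead)
```

Both calls give the same strings; the lookahead version simply avoids having to re-emit the matched letter.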

Regex: extract a number after a string that contains a number

Suppose I have a string:
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"
How can I grab the number of discharged cases in England?
I have tried
sub("(?i).*England has [\\d] cases(.*?(\\d+).*", "\\1", str),
It's returning the original string. Many Thanks!
We can use regmatches/gregexpr to match one or more digits (\\d+) followed by a space and 'discharged' (checked with a lookahead), which extracts the number of discharges for every region:
as.integer(regmatches(str, gregexpr("\\d+(?= discharged)", str, perl = TRUE))[[1]])
#[1] 1 2
If it is specific only to 'England', start with 'England' followed by characters that are not a ( ([^(]+) and then a literal (, capture the digits (\\d+) as a group, and in the replacement specify the backreference (\\1) of the captured group:
sub("England[^(]+\\((\\d+).*", "\\1", str)
#[1] "1"
Or, if we go by the OP's attempt, the ( after 'cases' should be escaped, because an unescaped ( is a metacharacter that opens a capture group. Also, \\d+ can be placed outside the square brackets so that it matches both digits of 90:
sub("(?i)England has \\d+ cases\\((\\d+).*", "\\1", str)
#[1] "1"
We can use str_match to capture the number before "discharged".
stringr::str_match(str, "England.*?(\\d+) discharged")[, 2]
#[1] "1"
Another option is the regex \d+(?= discharged), taking only the first match.
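If more than one figure is needed, a sketch along the following lines parses every region into a small data frame and then looks up England. It assumes the regions are separated by ";" and always follow the "X has N cases(A discharged, B died)" layout of the example; the names txt, pieces and counts are illustrative (txt is used instead of str to avoid masking the base function str).

```r
txt <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"

# Split into one piece per region (assumes ";" separates regions).
pieces <- strsplit(txt, ";\\s*")[[1]]

# Groups: region, cases, discharged, died. The literal parentheses are escaped.
pattern <- "(\\w+) has (\\d+) cases\\((\\d+) discharged, (\\d+) died\\)"
m <- do.call(rbind, regmatches(pieces, regexec(pattern, pieces, perl = TRUE)))

counts <- data.frame(
  region     = m[, 2],
  cases      = as.integer(m[, 3]),
  discharged = as.integer(m[, 4]),
  died       = as.integer(m[, 5]),
  stringsAsFactors = FALSE
)
print(counts)

# Number of discharged cases in England.
england_discharged <- counts$discharged[counts$region == "England"]
print(england_discharged)
```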

Finding second space after each comma

This is a follow up to this question: Concatenate previous and latter words to a word that match a condition in R
I am looking for a regex which splits the string at the second space that happens after comma. Look at the example below:
vector <- c("Paulsen", "Kehr,", "Diego",
"Schalper", "Sepúlveda,", "Alejandro",
"Von Housen", "Kush,", "Terry")
X <- paste(vector, collapse = " ")
X
## this is the string I am looking to split:
"Paulsen Kehr, Diego Schalper Sepúlveda, Alejandro Von Housen Kush, Terry"
Second space after each comma is the criterion for my regex. So, my output will be:
"Paulsen Kehr, Diego"
"Schalper Sepúlveda, Alejandro"
"Von Housen Kush, Terry"
I came up with a pattern but it is not quite working.
[^ ]+ [^ ]+, [^ ]+( )
Using it with strsplit removes all the words instead of splitting only at group 1 (i.e. [^ ]+ [^ ]+, [^ ]+(group-1)). I think I just need to exclude the rest of the full match and split on the trailing space only.
strsplit(X, "[^ ]+ [^ ]+, [^ ]+( )")
# [1] "" [2] "" [3] "Von Housen Kush, Terry"
Can anyone think of a regex for finding the second space after each comma?
You may use
> strsplit(X, ",\\s+\\S+\\K\\s+", perl=TRUE)
[[1]]
[1] "Paulsen Kehr, Diego" "Schalper Sepúlveda, Alejandro" "Von Housen Kush, Terry"
Details
, - a comma
\s+ - 1+ whitespaces
\S+ - 1+ non-whitespaces
\K - match reset operator discarding all text matched so far
\s+ - 1+ whitespaces
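An alternative to splitting is to extract each "surname(s), first name" chunk directly, which avoids \K altogether. The following is a sketch under the assumption that every record contains exactly one comma and that the first name is a single word; the names hits and full_names are illustrative.

```r
X <- "Paulsen Kehr, Diego Schalper Sepúlveda, Alejandro Von Housen Kush, Terry"

# Everything up to a comma, the comma, whitespace, then one non-whitespace word.
hits <- regmatches(X, gregexpr("[^,]+,\\s+\\S+", X, perl = TRUE))[[1]]

# Matches after the first one start with the separating space, so trim it.
full_names <- trimws(hits)
print(full_names)
```

Because the matches are extracted rather than split on, the separating space ends up at the start of the next match and is removed by trimws.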

Remove string after first number using r regex

How to remove everything contained after the first number of a string?
x <- c("Hubert 208 apt 1", "Mass Av 300, block 3")
After this question, I succeeded in removing everything before the first number, the first number inclusive:
gsub( "^\\D*\\d+", "", x )
[1] " apt 1" ", block 3"
But the desired output looks like this:
[1] "Hubert 208" "Mass Av 300"
In the OP's current code, a minor change makes it work: capture the matching pattern as a group ((...)), match the rest of the string with .*, and replace everything with the backreference (\\1)
sub("^(\\D*\\d+).*", "\\1", x)
#[1] "Hubert 208" "Mass Av 300"
Here, the OP's pattern ("^\\D*\\d+") matches zero or more non-digit characters (\\D*) from the start (^) of the string, followed by one or more digits (\\d+). That part is captured as a group with parentheses ((...)), and the trailing .* consumes the rest of the string so that it is dropped.
Also, instead of gsub (global substitution) we need only sub as we need to match only a single instance (from the beginning)
This expression might be slightly safer,
^\s*(.+?)([0-9]+)
Another option, instead of replacing, is to take your expression and extract the match.
Your pattern matches up to the end of the first number: from the start of the string (^), zero or more non-digits (\D*), followed by one or more digits (\d+):
^\\D*\\d+
If you use sub with perl=TRUE you could make use of \K to forget what was matched.
Then you might use:
^\\D*\\d+\\K.*
In the replacement use an empty string.
sub("^\\D*\\d+\\K.*", "", x, perl=TRUE)
You could also use your current regex pattern with stringr::str_extract:
x <- c("Hubert 208 apt 1", "Mass Av 300, block 3")
stringr::str_extract(x, "^\\D*\\d+")
[1] "Hubert 208" "Mass Av 300"
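The approaches above can be compared side by side in base R. The sketch below adds a hypothetical third string without any digits to show where they differ: the sub variants leave such a string unchanged, while an extraction-style approach reports a missing value. The names x2, via_group, via_reset and via_match are illustrative.

```r
# Hypothetical extra element with no digits at all.
x2 <- c("Hubert 208 apt 1", "Mass Av 300, block 3", "No number here")

# Capture group plus backreference.
via_group <- sub("^(\\D*\\d+).*", "\\1", x2)

# \K resets the match start, so only the tail is replaced (needs perl = TRUE).
via_reset <- sub("^\\D*\\d+\\K.*", "", x2, perl = TRUE)

# Extraction with regexpr()/regmatches(); non-matching strings become NA.
m <- regexpr("^\\D*\\d+", x2)
via_match <- rep(NA_character_, length(x2))
via_match[m != -1] <- regmatches(x2, m)

print(via_group)
print(via_reset)
print(via_match)
```

Which behavior is preferable for strings without a number depends on the data; the extraction variant makes those cases easy to spot with is.na.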

Remove others in a string except a needed word including certain patterns in R

I have a vector of strings, and I would like to remove everything in each string except the word containing a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the idea: use gsub() with a pattern for a word beginning with (or including) mir.
But I have no idea how to construct such a pattern.
Or other idea?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can save the unlist step by using sapply instead of lapply, as suggested by @Rich Scriven in the comments (this simplifies to a character vector when every string yields exactly one match)
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*) followed by a word boundary (\\b), then the string mir and one or more characters that are not whitespace (\\S+), captured as a group by placing them inside (...), followed by any other characters (.*). In the replacement, use the backreference of the captured group (\\1)
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
Update
If a string contains multiple 'mir.*' substrings, we can restrict the match to the one that has a numeric part
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")
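As a sketch of an extraction-based alternative, regmatches with gregexpr returns every qualifying token per string rather than one substitution result. The pattern below is an assumption that tightens the one above slightly: [^0-9\\s]* keeps the match inside a single whitespace-delimited word. The names hits and mirs are illustrative.

```r
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
        "mir_23_5p r 27", "a mir_23-5p 1 mir-net")

# "mir", optional non-digit non-space characters, at least one digit, rest of the word.
hits <- regmatches(s1, gregexpr("\\bmir[^0-9\\s]*[0-9]+\\S*", s1, perl = TRUE))

# hits is a list with one character vector per input string; flatten it.
mirs <- unlist(hits)
print(mirs)
```

In the last element, "mir-net" is skipped because it contains no digit.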

Resources