R: extract the first pattern from the end of a string

I want to extract sizes from strings, which can be:
a <- c("xxxxxxx 2.5 oz (23488)",
"xxxxx /1.36oz",
"xxxxx/7 days /20 ml")
Result I want: 2.5 oz /1.36oz /20 ml
Because the strings vary, I want to extract the pattern backward. That is, I want to extract the first occurrence of \\/*(\\d+\\.*\\d*)\\s*[[:alpha:]]+ counting from the end of the string. This keeps R from taking 23488 from the first string and /7 days from the third string.
Does anyone know how I can achieve this?
Thanks!

You may use
> a <- c("xxxxxxx 2.5 oz (23488)",
+ "xxxxx /1.36oz",
+ "xxxxx/7 days /20 ml")
> regmatches(a, regexpr("/?\\d+(?:\\.\\d+)?\\s*\\p{L}+(?!.*\\d(?:\\.\\d+)?\\s*\\p{L}+)", a, perl=TRUE))
[1] "2.5 oz" "/1.36oz" "/20 ml"
Details
/? - an optional /
\\d+ - 1+ digits
(?:\\.\\d+)? - an optional . and 1+ digits sequence
\\s* - 0+ whitespaces
\\p{L}+ - 1+ letters
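(?!.*\\d(?:\\.\\d+)?\\s*\\p{L}+) - a negative lookahead that fails the match if another number followed by letters occurs anywhere to the right, i.e. the match must not be followed by: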
.* - any 0+ chars, as many as possible
\\d - a digit
(?:\\.\\d+)? - an optional . and 1+ digits sequence
\\s* - 0+ whitespaces
\\p{L}+ - 1+ letters
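As a cross-check on the lookahead approach, here is a minimal sketch of an alternative: collect every match of the core pattern with gregexpr and keep the last one per string. The helper name last_match is hypothetical, and the sketch assumes every input contains at least one match (vapply stops with an error otherwise).

```r
a <- c("xxxxxxx 2.5 oz (23488)",
       "xxxxx /1.36oz",
       "xxxxx/7 days /20 ml")

# Core pattern from the answer, without the trailing negative lookahead
core <- "/?\\d+(?:\\.\\d+)?\\s*\\p{L}+"

# Hypothetical helper: find all matches per string, then keep the last one.
# Assumes each string has at least one match.
last_match <- function(x, pattern) {
  all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(all_matches, function(m) utils::tail(m, 1), character(1))
}

res <- last_match(a, core)
print(res)
# expected: "2.5 oz", "/1.36oz", "/20 ml"
```

Taking the last of all matches is easier to read than the lookahead, at the cost of scanning each string for every match.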

If you know the names of the units (oz, ml, etc.), you could try something like this:
((\d*|\d*\.\d{0,2})\s?(ml|oz|etc))
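In an R string the backslashes in that pattern have to be doubled. A minimal sketch, assuming oz and ml are the only units of interest (the etc placeholder is dropped) and reusing the vector a from the question:

```r
a <- c("xxxxxxx 2.5 oz (23488)",
       "xxxxx /1.36oz",
       "xxxxx/7 days /20 ml")

# Pattern from the answer with R string escaping; the unit list is an assumption
unit_pattern <- "((\\d*|\\d*\\.\\d{0,2})\\s?(ml|oz))"

# regexpr returns the first (leftmost) match in each string
res <- regmatches(a, regexpr(unit_pattern, a, perl = TRUE))
print(res)
# expected: "2.5 oz", "1.36oz", "20 ml"
```

With this pattern the leading slash is not part of the match, and 7 days is skipped only because days is not one of the listed units.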

Related

cleaning phonenumbers using regex

I have the following phone number formats, where 33 is the area code:
+331234567
+3301234567
00331234567
003301234567
0331234567
033-123-456-7
0033.1234567
where I'm expecting only 331234567 in every case.
What I have tried so far to clean those numbers using R:
stringr::str_replace_all(c("+331234567", "033-123-456-7", "0033.1234567"), pattern = "[^0-9]", replacement = "")  # remove non-numeric characters
stringr::str_replace_all("0331234567", pattern = "^0", replacement = "")  # remove the leading 0
stringr::str_replace_all("00331234567", pattern = "^00", replacement = "")  # remove the leading 00
My question is how to remove the zero in between, as in 3301234567, 003301234567, +3301234567 or 03301234567.
Appreciate any help
You can use
gsub("^(?:00?|\\+)330?|\\W", "", x, perl=TRUE)
If there can be more 0s after 33 before the number you need to extract, replace 0? with 0*.
Details
^ - start of string
(?:00?|\+) - 00, 0 or +
330? - 33 or 330
| - or
\W - any non-word char.
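Note that, as written, the pattern also consumes the 33 itself, so the result is 1234567 rather than the 331234567 asked for in the question. If the 33 should stay, one possible variant (an assumption, not part of the answer above) captures it and puts it back with \\1. The sample vector x below is built from the numbers listed in the question.

```r
x <- c("+331234567", "+3301234567", "00331234567", "003301234567",
       "0331234567", "033-123-456-7", "0033.1234567")

# Pattern from the answer: removes the whole prefix, including 33
stripped <- gsub("^(?:00?|\\+)330?|\\W", "", x, perl = TRUE)

# Assumed variant: capture 33 and restore it via \\1.
# When the \\W alternative matches, group 1 is unset and is replaced
# with an empty string, so separators are still removed.
kept <- gsub("^(?:00?|\\+)(33)0?|\\W", "\\1", x, perl = TRUE)

print(unique(stripped))  # expected: "1234567"
print(unique(kept))      # expected: "331234567"
```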
You can use ^\+?0*3*0*|[^\s\d]
Pattern explanation:
^ - match beginning of the string
\+? - match + literally, zero or one time.
0* - match zero or more 0
3* - match zero or more 3
| - alternation
[^\s\d] - negated character class: matches any character other than whitespace and digits (you could remove \s if you handle one number at a time; it only prevents matching the newlines in the demo)
It will match unwanted parts separately. First part will clean beginning of a number if it starts with + or 0, second part will clean non-digits inside the number.
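For reference, a sketch of how this second pattern could be applied in R, where each backslash must be doubled. The vector x is again taken from the question, and the last input is made up to show one caveat: 3* is unbounded, so a number whose remaining digits start with 3 loses that digit too.

```r
x <- c("+331234567", "+3301234567", "00331234567", "003301234567",
       "0331234567", "033-123-456-7", "0033.1234567")

# Second answer's pattern with R string escaping
cleaned <- gsub("^\\+?0*3*0*|[^\\s\\d]", "", x, perl = TRUE)
print(unique(cleaned))  # expected: "1234567"

# Made-up input: the third 3 belongs to the number but is stripped as well
caveat <- gsub("^\\+?0*3*0*|[^\\s\\d]", "", "+33312345", perl = TRUE)
print(caveat)  # expected: "12345"
```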

Extract substring between two colons / special characters

I want to extract "SUBSTRING" with sub() from the following string:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
I used the following code, but unfortunately it didn't work:
sub(".*: (.*) : .*", "\\1", attribute)
Does anyone know an answer for that?
You may use
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
You need to rely on a negated character class, [^:], which matches any char but :, because .* greedily matches any 0 or more chars and runs past the first colon. Also, your pattern contains a space before the second :, which is missing from the string, so the pattern never matches.
Details
^ - start of string
[^:]* - any 0+ chars other than :
: - a colon with a space
([^:]*) - Capturing group 1 (\1 refers to this value): any 0+ chars other than :
.* - the rest of the string.
R Demo:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
## => [1] "SUBSTRING"
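To make the two problems with the original attempt concrete, here is a small sketch comparing three patterns on the same string. The variable names are illustrative, and perl = TRUE is added only so that all three calls run on the same engine.

```r
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."

# Original attempt: " : " never occurs in the string, so sub() finds no
# match and returns the input unchanged
no_match <- sub(".*: (.*) : .*", "\\1", attribute, perl = TRUE)

# Without the stray space the pattern matches, but the leading greedy .*
# runs as far right as it can, so group 1 holds a later field
greedy <- sub(".*: (.*): .*", "\\1", attribute, perl = TRUE)

# Negated character classes stop at the first colon
fixed <- sub("^[^:]*: ([^:]*).*", "\\1", attribute, perl = TRUE)

print(identical(no_match, attribute))  # expected: TRUE
print(greedy)                          # expected: "some explanation"
print(fixed)                           # expected: "SUBSTRING"
```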

Extract digits after matching the certain string second time

I want to extract the digits after the second occurrence of an underscore (_) in a string.
Following similar posts here:
Matching different digits after a lookahead
regex - return all before the second occurrence
I tried
library(stringr)
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
str_extract(pattern, "(^_){2}(\\d+\\.*\\d*)")
which outputs
[1] NA
instead of 1400. Could you help?
You may use a base R solution with regexpr/regmatches:
x <- "1/2/3_500k/855kk_1400k/AVBB"
regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl=TRUE))
Or, with sub:
sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x)
The regex is
^(?:[^_]*_){2}[^_0-9]*\K\d*\.?\d+
Details
^ - start of string
(?:[^_]*_){2} - 2 repetitions of
[^_]* - any 0+ chars other than _
_ - an underscore
[^_0-9]* - any 0+ chars other than _ and digits
\K - match reset operator discarding all text matched so far
\d*\.?\d+ - a float or integer number pattern (0+ digits, an optional . and then 1+ digits).
In the sub variation, \K is not necessary: the number pattern is captured into a capturing group and the rest of the string is matched with the .* pattern. The result is the contents of Group 1, referred to with the \1 placeholder.
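A short sketch running both variations side by side. The second element of x is a made-up input with a decimal number, and perl = TRUE is added to the sub call here (an assumption, the answer omits it) so that both calls use the same engine.

```r
# First element is from the question; the second is a made-up decimal case
x <- c("1/2/3_500k/855kk_1400k/AVBB", "a_b_x2.5y")

# \\K drops everything matched so far, so only the number is returned
via_K <- regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl = TRUE))

# Capture-group variation: match the whole string, keep only group 1
via_sub <- sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x, perl = TRUE)

print(via_K)    # expected: "1400", "2.5"
print(via_sub)  # expected: "1400", "2.5"
```

One practical difference: regmatches silently drops strings with no match, while sub returns them unchanged, so the sub form keeps the result aligned with the input vector.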
One option could be as:
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
sub(".*_.*_(\\d+).*","\\1", pattern, perl = TRUE)
[1] "1400"
The regex is:
".*_.*_(\\d+).*"
Details:
.*_ anything up to and including the first _
.*_ anything after the first _, up to and including the 2nd _
(\\d+) captures the digits that follow as group 1.
.* anything afterwards.
\\1 replaces the whole match with the contents of the 1st group.

Replace some text after a string with Regex and Gsub in R

It's a simple question, but I'm not good with regex (I tried many expressions without success).
I want to remove all the text after a pattern (replace it with nothing).
My pattern is something like this:
/canais/*/
My data is:
/canais/b3/conheca-o-pai-dos-indices-da-b3/
/canais/cpbs/cvm-abre-audiencia-publica-de-instruc
/canais/stocche-forbes/dividendo-controverso/
The desired result is:
/canais/b3/
/canais/cpbs/
/canais/stocche-forbes/
How can I do it with gsub?
Thanks
You may use the following sub:
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/","/canais/cpbs/cvm-abre-audiencia-publica-de-instruc","/canais/stocche-forbes/dividendo-controverso/")
sub("^(/canais/[^/]+/).*", "\\1", x)
Details:
^ - start of string
(/canais/[^/]+/) - Group 1 (later referred to with \1) capturing:
/canais/ - a substring /canais/
[^/]+ - 1 or more chars other than /
/ - a slash
.* - any 0+ chars up to the end of string.
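A runnable sketch of the call above with its result. The fourth element is a made-up path that does not start with /canais/, added to show that sub leaves non-matching strings untouched.

```r
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/",
       "/canais/cpbs/cvm-abre-audiencia-publica-de-instruc",
       "/canais/stocche-forbes/dividendo-controverso/",
       "/outros/b3/exemplo/")  # made-up non-matching input

# Keep group 1 (/canais/<segment>/) and drop everything after it
res <- sub("^(/canais/[^/]+/).*", "\\1", x)
print(res)
# expected: "/canais/b3/", "/canais/cpbs/", "/canais/stocche-forbes/", "/outros/b3/exemplo/"
```

Since only the first match per string matters here, sub is enough and gsub is not needed.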

Remove all characters after the 2nd occurrence of "-" in each element of a vector

I would like to remove all characters after the 2nd occurrence of "-" in each element of a vector.
Initial string
aa-bbb-cccc => aa-bbb
aa-vvv-vv => aa-vvv
aa-ddd => aa-ddd
Any help?
Judging by the sample input and expected output, I assume you need to remove everything starting from the 2nd hyphen.
You may use
sub("^([^-]*-[^-]*).*", "\\1", x)
Details:
^ - start of string
([^-]*-[^-]*) - Group 1 capturing 0+ chars other than -, - and 0+ chars other than -
.* - any 0+ chars (in a TRE regex like this, a dot matches line break chars, too.)
The \\1 (\1) is a backreference to the text captured into Group 1.
R demo:
x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
sub("^([^-]*-[^-]*).*", "\\1", x)
## => [1] "aa-bbb" "aa-vvv" "aa-ddd"
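The remark about line breaks only matters for multi-line strings. A small sketch with a made-up input containing a newline, comparing the default TRE engine with perl = TRUE (PCRE), where the dot does not match a newline unless (?s) is set:

```r
# Made-up input with a line break after the third field
x <- "aa-bbb-cccc\ndd"

# Default engine (TRE): . matches the newline, so the whole tail is removed
tre <- sub("^([^-]*-[^-]*).*", "\\1", x)

# PCRE: .* stops at the newline, so the second line survives
pcre <- sub("^([^-]*-[^-]*).*", "\\1", x, perl = TRUE)

# PCRE with the (?s) flag: . matches the newline again
pcre_dotall <- sub("(?s)^([^-]*-[^-]*).*", "\\1", x, perl = TRUE)

print(tre)          # expected: "aa-bbb"
print(pcre)         # expected: "aa-bbb\ndd"
print(pcre_dotall)  # expected: "aa-bbb"
```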

Resources