I have the following string
string <- c('a - b - c - d',
'z - c - b',
'y',
'u - z')
I would like to subset it such that everything after the second occurrence of ' - ' is thrown away.
The result would be this:
> string
[1] "a - b" "z - c" "y" "u - z"
I used substr(x = string, 1, regexpr(string, pattern = '[^ - ]*$') - 4), but it only cuts at the last occurrence of ' - ', which is not what I want.
Note that you cannot use a negated character class to negate a sequence of characters. [^ - ]*$ matches any 0+ chars other than a space, followed by the end of the string marker ($). Yes, it matches - too: inside the brackets, the - creates a range from a space to a space, so the class contains only the space character.
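To see what this class actually does, here is a minimal sketch in base R. The variable names are illustrative and not from the original post, and it assumes R's default regex engine reads the range as described above, so it compares the class with the plain [^ ] form.

```r
string <- c('a - b - c - d', 'z - c - b', 'y', 'u - z')

# The class from the question: " - " inside brackets is a range from space to space
tail_original <- regmatches(string, regexpr("[^ - ]*$", string))

# The plain "anything but a space" class, for comparison
tail_plain <- regmatches(string, regexpr("[^ ]*$", string))

# TRUE if both classes pick out the same trailing run of non-space characters
identical(tail_original, tail_plain)

# TRUE if a lone hyphen is accepted by the class, i.e. the hyphen is not excluded
grepl("^[^ - ]$", "-")
```

Either way, the match is only the last field, which is why the substr() call cuts at the last separator rather than the second one.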
You may use a sub function with the following regex:
^(.*? - .*?) - .*
and replace the match with \1 (the text captured by Group 1).
R code:
> string <- c('a - b - c - d', 'z - c - b', 'y', 'u - z')
> sub("^(.*? - .*?) - .*", "\\1", string)
[1] "a - b" "z - c" "y" "u - z"
Details:
^ - start of a string
(.*? - .*?) - Group 1 (referred to with the \1 backreference in the replacement pattern) capturing any 0+ chars lazily up to the first space, hyphen, space and then again any 0+ chars up to the next leftmost occurrence of space, hyphen, space
' - ' - a space, a hyphen and a space
.* - any zero or more chars up to the end of the string.
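The lazy quantifiers are what make Group 1 stop at the second separator. A small sketch of the difference, assuming perl = TRUE is acceptable (it is added here only to get backtracking semantics in both calls; the names lazy and greedy are illustrative):

```r
string <- c('a - b - c - d', 'z - c - b', 'y', 'u - z')

# Lazy quantifiers: Group 1 ends right before the second " - "
lazy <- sub("^(.*? - .*?) - .*", "\\1", string, perl = TRUE)

# Greedy quantifiers: Group 1 stretches to just before the last " - "
greedy <- sub("^(.* - .*) - .*", "\\1", string, perl = TRUE)

lazy    # "a - b"     "z - c" "y" "u - z"
greedy  # "a - b - c" "z - c" "y" "u - z"
```

Strings with fewer than two separators do not match at all, so sub() returns them unchanged in both cases.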
Try this: (\w(?:\s+-\s+\w)?).*. For an explanation of the regex, see https://regex101.com/r/BbfsNQ/2.
This regex captures the first pair if there is one, or just the first character if there is no pair. The data ends up in a capturing group. How you retrieve captured groups depends on the language, but in plain regex replacement syntax it is \1 for the first group (\2 for the second, and so on). See the "Substitution" section on regex101 if you want a graphical example.
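In R, this suggestion could be applied with sub() as sketched below. This is one reading of the answer, not code from it: perl = TRUE is an assumption, and the words input is hypothetical, added to show that \w matches a single character only.

```r
string <- c('a - b - c - d', 'z - c - b', 'y', 'u - z')

# \w matches exactly one word character, which fits the single-character fields here
single <- sub("(\\w(?:\\s+-\\s+\\w)?).*", "\\1", string, perl = TRUE)

# Hypothetical multi-character input: \w alone keeps only the first character
words <- "ab - cd - ef"
too_short <- sub("(\\w(?:\\s+-\\s+\\w)?).*", "\\1", words, perl = TRUE)

# Using \w+ instead keeps whole fields
whole <- sub("(\\w+(?:\\s+-\\s+\\w+)?).*", "\\1", words, perl = TRUE)

single     # "a - b" "z - c" "y" "u - z"
too_short  # "a"
whole      # "ab - cd"
```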
Related
I want to extract "SUBSTRING" with sub() from the following string:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
I used the following code, but unfortunately it didn't work:
sub(".*: (.*) : .*", "\\1", attribute)
Does anyone know an answer for that?
You may use
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
You need to rely on a negated character class, [^:], which matches any char but :, because .* greedily matches any 0 or more chars and would run past the first colon. Also, your pattern has a space before the second :, but the string has no space there, so the pattern cannot match.
Details
^ - start of string
[^:]* - any 0+ chars other than :
': ' - a colon followed by a space
([^:]*) - Capturing group 1 (\1 refers to this value): any 0+ chars other than :
.* - the rest of the string.
R Demo:
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."
sub("^[^:]*: ([^:]*).*", "\\1", attribute)
## => [1] "SUBSTRING"
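For comparison, the sketch below runs the original attempt, a greedy variant with the stray space removed, and the negated-class pattern side by side. The variable names are illustrative, and perl = TRUE on the greedy call is an assumption made only to pin down the backtracking behaviour.

```r
attribute <- "S4Q7b1_t1_r1: SUBSTRING: some explanation: some explanation - ..."

# Original attempt: " : " never occurs in the string, so nothing is replaced
unchanged <- sub(".*: (.*) : .*", "\\1", attribute)
identical(unchanged, attribute)  # TRUE

# Without the extra space the pattern matches, but the leading greedy .*
# pushes the capture as far to the right as it can go
greedy <- sub(".*: (.*): .*", "\\1", attribute, perl = TRUE)
greedy  # "some explanation"

# Negated character classes stop at the first colon instead
fixed <- sub("^[^:]*: ([^:]*).*", "\\1", attribute)
fixed   # "SUBSTRING"
```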
I want to extract sizes from strings, which can be:
a <- c("xxxxxxx 2.5 oz (23488)",
"xxxxx /1.36oz",
"xxxxx/7 days /20 ml")
Result I want: 2.5 oz /1.36oz /20 ml
Because the strings vary, I want to match backward from the end. That is, I want to extract the first appearance of \\/*(\\d+\\.*\\d*)\\s*[[:alpha:]]+ counting from the end of the string. This keeps R from taking 23488 from the first string and /7 days from the third string.
Does anyone know how I can achieve this?
Thanks!
You may use
> a <- c("xxxxxxx 2.5 oz (23488)",
+ "xxxxx /1.36oz",
+ "xxxxx/7 days /20 ml")
> regmatches(a, regexpr("/?\\d+(?:\\.\\d+)?\\s*\\p{L}+(?!.*\\d(?:\\.\\d+)?\\s*\\p{L}+)", a, perl=TRUE))
[1] "2.5 oz" "/1.36oz" "/20 ml"
Details
/? - an optional /
\\d+ - 1+ digits
(?:\\.\\d+)? - an optional . and 1+ digits sequence
\\s* - 0+ whitespaces
\\p{L}+ - 1+ letters
(?!.*\\d(?:\\.\\d+)?\\s*\\p{L}+) - a negative lookahead that fails the match if the text to the right contains another size, i.e.:
  .* - any 0+ chars, as many as possible
  \\d - a digit
  (?:\\.\\d+)? - an optional . and 1+ digits sequence
  \\s* - 0+ whitespaces
  \\p{L}+ - 1+ letters
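The effect of the lookahead is easiest to see by running the size pattern with and without it. A minimal sketch, where size and no_later_size are illustrative names for the two halves of the regex above:

```r
a <- c("xxxxxxx 2.5 oz (23488)",
       "xxxxx /1.36oz",
       "xxxxx/7 days /20 ml")

size <- "/?\\d+(?:\\.\\d+)?\\s*\\p{L}+"             # number followed by letters
no_later_size <- "(?!.*\\d(?:\\.\\d+)?\\s*\\p{L}+)"  # no further size to the right

# Without the lookahead, regexpr() returns the first size in each string
first_hit <- regmatches(a, regexpr(size, a, perl = TRUE))

# With the lookahead, only the last size in each string can match
last_hit <- regmatches(a, regexpr(paste0(size, no_later_size), a, perl = TRUE))

first_hit  # "2.5 oz" "/1.36oz" "/7 days"
last_hit   # "2.5 oz" "/1.36oz" "/20 ml"
```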
If you know the names of the units (oz, ml, etc.), you could try something like this:
((\d*|\d*\.\d{0,2})\s?(ml|oz|etc))
See working example.
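One way this idea might look in R is sketched below. It is a tightened variant rather than the exact regex above: it requires at least one digit (the \d* form can match a bare unit with no number) and allows an optional leading /. The units vector is a hypothetical list that you would fill in yourself.

```r
a <- c("xxxxxxx 2.5 oz (23488)",
       "xxxxx /1.36oz",
       "xxxxx/7 days /20 ml")

units <- c("oz", "ml")  # hypothetical list of known units
unit_alt <- paste(units, collapse = "|")

# optional "/", a number with optional decimals, optional space, a known unit
pattern <- paste0("/?\\d+(?:\\.\\d+)?\\s?(?:", unit_alt, ")")

sizes <- regmatches(a, regexpr(pattern, a, perl = TRUE))
sizes  # "2.5 oz" "/1.36oz" "/20 ml"
```

Because "days" is not in the unit list, "/7 days" is skipped without any lookahead.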
I have a vector of strings that look like this:
a - bc/def_g - A/mn/us/ww
opq - rs/ts_uf - BC/wx/yza
Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE
I'd like to get the text after the 2nd dash (-) but before the first slash (/) that follows it, i.e. the result should look like
A
BC
XYZ
What is the best way to do it? (The vector has more than 500K rows.)
Thanks
Suppose your string is defined like this:
string <- c("a - bc/def_g - A/mn/us/ww",
"opq - rs/ts_uf - BC/wx/yza",
"Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")
Then you can use sub
> sub(".*\\-\\s+([A-Z]+)/.*", "\\1", string)
[1] "A" "BC" "XYZ"
See regex in use here
^[^-]*-[^-]*-\s*\K[^/]+
^ Assert position at the start of the line
[^-]* Match any character except - any number of times
- Match this literally
[^-]* Match any character except - any number of times
- Match this literally
\s* Match any number of whitespace characters
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^/]+ Match any character except / one or more times
Alternatively, as suggested by Jan in the comments below (I believe the comment has since been deleted), ^(?:[^-]*-){2}\s*\K[^/]+ may be used. It's shorter and easily scalable, but it takes more steps to match.
See code in use here
x <- c("a - bc/def_g - A/mn/us/ww", "opq - rs/ts_uf - BC/wx/yza", "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")
m <- regexpr("^[^-]*-[^-]*-\\s*\\K[^/]+", x, perl=TRUE)
regmatches(x, m)
Result: [1] "A" "BC" "XYZ"
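The shorter {2} form mentioned above can be used the same way. The sketch below also shows one caveat that may matter for a large vector: regmatches() returns only the elements that matched, while sub() keeps one result per input. The x2 vector and its extra element are hypothetical, and the sub() variant is an assumption, not part of the original answer.

```r
x <- c("a - bc/def_g - A/mn/us/ww",
       "opq - rs/ts_uf - BC/wx/yza",
       "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")

short_re <- "^(?:[^-]*-){2}\\s*\\K[^/]+"
short_form <- regmatches(x, regexpr(short_re, x, perl = TRUE))
short_form  # "A" "BC" "XYZ"

# Add an element with no dashes: regmatches() silently drops it
x2 <- c(x, "no dashes here")
dropped <- regmatches(x2, regexpr(short_re, x2, perl = TRUE))
length(dropped)  # 3, not 4

# sub() with a capturing group returns non-matching strings unchanged
kept <- sub("^(?:[^-]*-){2}\\s*([^/]+).*", "\\1", x2, perl = TRUE)
kept  # "A" "BC" "XYZ" "no dashes here"
```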
I want to extract the digits after the second occurrence of an underscore (_) in a string.
Following similar posts here
Matching different digits after a lookahead
regex - return all before the second occurrence
I tried
library(stringr)
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
str_extract(pattern, "(^_){2}(\\d+\\.*\\d*)")
which outputs
[1] NA
instead of 1400. Could you help?
You may use a base R solution with regexpr/regmatches (x is the input vector, called pattern in the question):
regmatches(x, regexpr("^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+", x, perl=TRUE))
Or, with sub:
sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x)
The regex is
^(?:[^_]*_){2}[^_0-9]*\K\d*\.?\d+
Details
^ - start of string
(?:[^_]*_){2} - 2 repetitions of
[^_]* - any 0+ chars other than _
_ - an underscore
[^_0-9]* - any 0+ chars other than _ and digits
\K - match reset operator discarding all text matched so far
\d*\.?\d+ - a float or integer number pattern (0+ digits, an optional . and then 1+ digits).
In the sub regex variation, the \K is not necessary, the number pattern is captured into a capturing group and the rest of string is matched with .* pattern. The result is the contents of Group 1, referred to with the \1 placeholder.
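Put together as a runnable sketch, with x defined explicitly. The second element is a hypothetical input added to exercise the decimal part of the number pattern, and perl = TRUE on the sub() call is an assumption made for consistency between the two calls.

```r
x <- c("1/2/3_500k/855kk_1400k/AVBB",
       "a_b_c12.5x")  # hypothetical input with a decimal number

# \K variant: the match itself is the number
num_re <- "^(?:[^_]*_){2}[^_0-9]*\\K\\d*\\.?\\d+"
via_regmatches <- regmatches(x, regexpr(num_re, x, perl = TRUE))

# Capturing-group variant: the whole string is replaced by Group 1
via_sub <- sub("^(?:[^_]*_){2}[^_0-9]*(\\d*\\.?\\d+).*", "\\1", x, perl = TRUE)

via_regmatches       # "1400" "12.5"
via_sub              # "1400" "12.5"
as.numeric(via_sub)  # 1400.0 12.5
```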
One option could be as:
pattern <- c("1/2/3_500k/855kk_1400k/AVBB")
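sub(".*_.*_(\\d+).*","\\1", pattern, perl = TRUE)
[1] "1400"
The regex is:
".*_.*_(\\d+).*"
Details:
.*_ - anything up to an underscore
.*_ - anything up to a later underscore
(\\d+) - one or more digits, captured as Group 1
.* - anything afterwards
\\1 - replaces the whole match with the contents of Group 1.
Because .* is greedy, the capture starts after the last underscore that is directly followed by digits and has at least one underscore before it. In this string that is the second underscore.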
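A greedy pattern of the form .*_.*_(\\d+).*, following the breakdown above, is not tied to the second underscore specifically. The sketch below contrasts it with a pattern that counts underscores from the start of the string; the second input is hypothetical, chosen because it has three underscores.

```r
x <- c("1/2/3_500k/855kk_1400k/AVBB",
       "1_2_300_400")  # hypothetical input with three underscores

# Greedy: captures the digits after the LAST underscore that is followed by digits
greedy <- sub(".*_.*_(\\d+).*", "\\1", x, perl = TRUE)

# Counted from the start: captures the digits after the SECOND underscore
anchored <- sub("^(?:[^_]*_){2}[^_0-9]*(\\d+).*", "\\1", x, perl = TRUE)

greedy    # "1400" "400"
anchored  # "1400" "300"
```

Both agree on the string from the question, so the choice only matters when more underscores can appear.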
I would like to remove all characters after the 2nd occurrence of "-" in each element of a vector.
Initial string
aa-bbb-cccc => aa-bbb
aa-vvv-vv => aa-vvv
aa-ddd => aa-ddd
Any help?
Judging by the sample input and expected output, I assume you need to remove everything starting from the 2nd hyphen.
You may use
sub("^([^-]*-[^-]*).*", "\\1", x)
Details:
^ - start of string
([^-]*-[^-]*) - Group 1 capturing 0+ chars other than -, - and 0+ chars other than -
.* - any 0+ chars (in a TRE regex like this, a dot matches line break chars, too.)
The \\1 (\1) is a backreference to the text captured into Group 1.
R demo:
x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
sub("^([^-]*-[^-]*).*", "\\1", x)
## => [1] "aa-bbb" "aa-vvv" "aa-ddd"
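If the cut-off position needs to vary, the same pattern can be built for any n. This is a sketch only: trim_after_nth is a hypothetical helper name, and it uses perl = TRUE (an assumption) so that the {n} quantifier on a non-capturing group behaves predictably, including {0}.

```r
# Hypothetical helper: drop everything from the n-th hyphen onward
trim_after_nth <- function(x, n = 2) {
  stopifnot(n >= 1)
  # one hyphen-free field, then (n - 1) further "-field" repetitions, in Group 1
  pattern <- sprintf("^([^-]*(?:-[^-]*){%d}).*", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
trim_after_nth(x)         # "aa-bbb" "aa-vvv" "aa-ddd"
trim_after_nth(x, n = 1)  # "aa" "aa" "aa"
```

Strings with fewer than n hyphens either match in full or not at all, so they come back unchanged.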