Extracting every nth element of vector of lists - r

I have the following ids.
ids <- c('a-000', 'b-001', 'c-002')
I want to extract the numeric part of them (000, 001, 002).
I tried this :
(str_split(ids, '-', n=2))[2]
returns the following :
[[1]]
[1] "b" "001"
I don't want the second element of the list. I want the second element of every element in the list. I know this is definitely a basic question, but how do I resolve the syntax conflict? Do I need to go through a lambda function?

The function is also available in base R.
sapply(strsplit(ids, "-"), `[`, 2)
# [1] "000" "001" "002"
You can also try gsub and substring.
gsub("\\D+", "", ids)
# [1] "000" "001" "002"
substring(ids, 3)
# [1] "000" "001" "002"
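To make the difference between the attempt in the question and the sapply call explicit, here is a minimal base R sketch. It assumes the same ids as in the question; the variable names parts, second_item and numeric_part are only illustrative, and vapply is swapped in for sapply so that the result type is fixed.

```r
ids <- c("a-000", "b-001", "c-002")

# strsplit() returns a list with one character vector per input string.
parts <- strsplit(ids, "-", fixed = TRUE)

# Single brackets on a list select list elements, so this is still a list
# of length 1 holding c("b", "001"), which is what the question observed.
second_item <- parts[2]

# Applying `[` to each element picks the second piece of every id.
# vapply() (instead of sapply()) guarantees a character vector result.
numeric_part <- vapply(parts, `[`, character(1), 2)
print(numeric_part)
```

So no explicit lambda is needed: the extraction operator itself can be passed as the function.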

To continue with your attempt, you can use sapply :
sapply(stringr::str_split(ids, '-', n=2), `[`, 2)
#[1] "000" "001" "002"
It is better to use str_split_fixed though here.
stringr::str_split_fixed(ids, '-', n=2)[, 2]
#[1] "000" "001" "002"
Or in base R :
sub('.*?-(.*)-?.*', '\\1', ids)

You could try str_remove(ids, "\\D+")

With base R you can remove all the characters that are not digits:
ids <- c('a-000', 'b-001', 'c-002')
gsub("[^[:digit:]]", "", ids)
#> [1] "000" "001" "002"
[:digit:] is the regex class for digits, and a leading ^ inside the brackets negates the class, so [^[:digit:]] matches every character that is not a digit. You basically replace every other character with the empty string "".
For more information see documentation for gsub() and regex in R.
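As a small illustration of the point above (a sketch only, using the ids from the question plus one made-up id, "a1-000", that is not in the original data), the POSIX class and the \\D shorthand give the same result here. Note that removing all non-digits also glues together digits that sit in different parts of the id.

```r
ids <- c("a-000", "b-001", "c-002")

# Negated POSIX class and the "\\D" shorthand both mean "not a digit".
posix_class <- gsub("[^[:digit:]]", "", ids)
shorthand   <- gsub("\\D", "", ids)
print(identical(posix_class, shorthand))

# Hypothetical id with a digit before the hyphen: every digit is kept,
# so the prefix digit is merged with the numeric suffix.
mixed <- gsub("[^[:digit:]]", "", "a1-000")
print(mixed)
```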

An option with str_replace
library(stringr)
str_replace(ids, "\\D+", "")
#[1] "000" "001" "002"

Related

readr::parse_number with leading zero

I would like to parse numbers that have a leading zero.
I tried readr::parse_number, however, it omits the leading zero.
library(readr)
parse_number("thankyouverymuch02")
#> [1] 2
Created on 2022-12-30 with reprex v2.0.2
The desired output would be 02
The simplest and most naive would be:
gsub("\\D", "", "thankyouverymuch02")
[1] "02"
The regex special "\\d" matches a single 0-9 character only; the inverse is "\\D" which matches a single character that is anything except 0-9.
If you have strings with multiple patches of numbers and you want them to be distinct, neither parse_number nor this simple gsub is going to work.
vec <- c("thankyouverymuch02", "thank03youverymuch02")
gsub("\\D", "", vec)
# [1] "02" "0302"
For that, the result must always be a list (since we don't necessarily know a priori how many elements have 0, 1, or more number groups).
regmatches(vec, gregexpr("\\d+", vec))
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
#### equivalently
stringr::str_extract_all(vec, "\\d+")
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
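If only the first number group of each string is needed as a plain character vector, the list can be collapsed afterwards. The following is a sketch under that assumption; the third input string ("nodigits") and the names groups and first_group are made up for illustration.

```r
vec <- c("thankyouverymuch02", "thank03youverymuch02", "nodigits")

# One character vector of digit runs per input string (possibly empty).
groups <- regmatches(vec, gregexpr("[0-9]+", vec))

# Keep the first run of each string, or NA when there is none.
# The results stay character, so leading zeros are preserved.
first_group <- vapply(
  groups,
  function(g) if (length(g) > 0) g[1] else NA_character_,
  character(1)
)
print(first_group)
```

Converting the result with as.numeric would drop the leading zeros again, so it is kept as character here.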

How to extract words with exactly one vowel

I have strings like these:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
What I'm trying to do is extract those words that have exactly one vowel. I do get the correct result with this:
library(stringr)
str_extract_all(turns, "\\b[b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\\b")
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "i" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
However, it feels cumbersome to define a consonant class. Is there a more elegant and more concise way?
We can use str_count on the words after splitting the 'turns' at the spaces
library(stringr)
lapply(strsplit(turns, "\\s+"), function(x) x[str_count(x, '[aeiou]') == 1])
-output
#[[1]]
#[1] "him" "to" "stir" "him" "up" "now" "and"
#[[2]]
#[1] "when" "when" "him" "he" "on" "the"
#[[3]]
#[1] "yes" "it" "for" "a" "long"
#[[4]]
#[1] "it" "was"
You can use a PCRE regex with character classes containing double negation:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
rx <- "\\b[^[:^alpha:]aeiou]*[aeiou][^[:^alpha:]aeiou]*\\b"
regmatches(turns, gregexpr(rx, turns, perl=TRUE, ignore.case=TRUE))
See the R demo online. The result is as in the question.
See the regex demo. Details:
\b - word boundary
[^[:^alpha:]aeiou]* - zero or more chars other than non-letters and aeiou chars, i.e. zero or more consonant letters
[aeiou] - a vowel
[^[:^alpha:]aeiou]* - zero or more chars other than non-letters and aeiou chars, i.e. zero or more consonant letters
\b - word boundary.
An equivalent expression:
(?i)\b[^\P{L}aeiou]*[aeiou][^\P{L}aeiou]*\b
See this regex demo. \P{L} matches any char but a letter. (?i) is equivalent of ignore.case=TRUE.
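To see what the double-negated class matches on its own, here is a tiny sketch with a made-up input string ("stir-99 up"). It assumes PCRE (perl = TRUE), since the negated POSIX class [:^alpha:] is a Perl-style extension.

```r
x <- "stir-99 up"

# [^[:^alpha:]aeiou] = "not (a non-letter or a vowel)" = a consonant letter.
# Removing those leaves only the vowels and the non-letters behind.
consonants_removed <- gsub("[^[:^alpha:]aeiou]", "", x, perl = TRUE)
print(consonants_removed)

# The same class can be used to pull out the consonant letters themselves.
consonants <- regmatches(x, gregexpr("[^[:^alpha:]aeiou]", x, perl = TRUE))[[1]]
print(consonants)
```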
Here is a base R option using strsplit + nchar + gsub
lapply(
strsplit(turns, "\\s"),
function(v) v[nchar(gsub("[^aeiou]", "", v)) == 1]
)
which gives
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
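Note that the counting approaches above only look at lowercase vowels, so the word "I" is dropped. A possible variant, sketched here with a hypothetical helper name one_vowel_words, lowercases each word only for the purpose of counting and returns the words in their original case.

```r
one_vowel_words <- function(x) {
  # Split each string into whitespace-separated tokens.
  words <- strsplit(x, "\\s+")
  lapply(words, function(w) {
    # Count vowels case-insensitively by deleting everything else.
    n_vowels <- nchar(gsub("[^aeiou]", "", tolower(w)))
    w[n_vowels == 1]
  })
}

res <- one_vowel_words("when , when I see him he w's on the settees .")
print(res[[1]])
```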

Regular expressions, extract specific parts of pattern

I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.
My expressions look like the following two:
crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt
Only the parts replaced by an asterisk are "varying"; the rest is constant (or irrelevant, as in the case of the last part, after "f*_"):
cr*_*_g_*_*_*_f*_
Is there a straightforward way to get only the values of the asterisk parts? E.g. in the case of "r" or "gdp" I have to include underscores in the pattern, otherwise I match the r at the beginning of the expression. Including the underscores gives "_r_" or "_gdp_", but I only want "r" or "gdp".
Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?
You can use sub with captures and then strsplit to get a list of the separated elements:
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b" "gdp" "100000" "16" "16" "tv"
#[[2]]
#[1] "t" "r" "25000" "20" "40" "lin"
Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
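The reason for that note can be checked directly. The sketch below (variable names are illustrative) shows that \\w also matches the underscore, so a greedy \\w+ runs across several fields, while [[:alnum:]]+ stops at the first underscore.

```r
f <- "crb_gdp_g_100000_16_16_ftv_all.txt"

# "\\w" matches letters, digits and the underscore.
print(grepl("^\\w+$", "gdp_g"))
print(grepl("^[[:alnum:]]+$", "gdp_g"))

# With "\\w+" the first capture swallows most of the file name.
greedy <- sub("cr(\\w+)_.*", "\\1", f)
# With "[[:alnum:]]+" it stops at the first underscore.
strict <- sub("cr([[:alnum:]]+)_.*", "\\1", f)
print(greedy)
print(strict)
```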
We can also use regmatches and regexec to extract these values like this:
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"
[3] "gdp" "100000"
[5] "16" "16"
[7] "tv"
[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t" "r"
[4] "25000" "20" "40"
[7] "lin"
Note that the first element in each vector is the full string, so to drop that, we can use lapply and "["
lapply(regmatches(str,
regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
"[", -1)
[[1]]
[1] "b" "gdp" "100000" "16" "16" "tv"
[[2]]
[1] "t" "r" "25000" "20" "40" "lin"
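If the extracted parts are needed in tabular form, the list can be bound into a character matrix. This is only a sketch building on the regexec approach above; the column names part1 to part6 are placeholders, since the question does not say what the fields mean.

```r
files <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
           "crt_r_g_25000_20_40_flin_g_2.txt")
pattern <- "^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$"

# One vector per file: full match first, then the six captures.
m <- regmatches(files, regexec(pattern, files))

# Drop the full match and stack the captures row by row.
fields <- do.call(rbind, lapply(m, `[`, -1))
colnames(fields) <- paste0("part", 1:6)  # placeholder names
print(fields)

# The captures are strings; convert numeric columns explicitly if needed.
third <- as.integer(fields[, "part3"])
print(third)
```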

Regexes work on their own, but not when used together in strsplit

I'm trying to split a string in R using strsplit and a perl regex. The string consists of various alphanumeric tokens separated by periods or hyphens, e.g "WXYZ-AB-A4K7-01A-13B-J29Q-10". I want to split the string:
wherever a hyphen appears.
wherever a period appears.
between the second and third character of a token that is exactly 3 characters long and consists of 2 digits followed by 1 capital letter, e.g "01A" produces ["01", "A"] (but "012A", "B1A", "0A1", and "01A2" are not split).
For example, "WXYZ-AB-A4K7-01A-13B-J29Q-10" should produce ["WXYZ", "AB", "A4K7", "01", "A", "13", "B", "J29Q", "10"].
My current regex is ((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-] and it works perfectly in this online regex tester.
Furthermore, the two parts of the alternative, ((?<=[-.]\\d{2})(?=[A-Z][-.])) and [.-], both serve to split the string as intended in R, when they are used separately:
#correctly splits on periods and hyphens
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
#correctly splits tokens where a letter follows two digits
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))", perl=T)
[[1]]
[1] "WXYZ-AB-A4K7-01" "A-13" "B-J29Q-10"
But when I try and combine them using an alternative, the second regex stops working, and the string is only split on periods and hyphens:
#only second alternative is used
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
Why is this happening? Is it a problem with my regex, or with strsplit? How can I achieve the desired behavior?
Desired output:
## [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
An alternative that saves you from having to consider how the strsplit algorithm works is to use your original regex with gsub to insert a simple splitting character in all the right places, then use strsplit to do the straightforward splitting.
x <- "XYZ-02-01C-33D-2285"  # assumed input; consistent with the output shown below
strsplit(
gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE),
"-",
fixed = TRUE)
#[[1]]
#[1] "XYZ" "02" "01" "C" "33" "D" "2285"
Of course, RichScriven's answer and Wiktor Stribiżew's comment are probably better since they only have one function call.
You may use a consuming version of a positive lookbehind (the match reset operator \K) to make sure strsplit works correctly in R and to avoid the problem of using a negative lookbehind inside a positive one.
"(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"
See the R demo online (and a regex demo here).
strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "XYZ" "02" "01" "C" "33" "D" "2285"
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
Here, the pattern matches:
(?<![^.-])\d{2}\K(?=[A-Z](?:[.-]|$)) - a sequence of:
(?<![^.-])\d{2} - 2 digits (\d{2}) that are not preceded with a char other than . and - (i.e. that are preceded with . or - or start of string, it is a common trick to avoid alternation inside a lookaround)
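\K - the match reset operator that makes the regex engine discard the text matched so far and go on matching the subsequent subpatterns, if any
(?=[A-Z](?:[.-]|$)) - a positive lookahead that requires an uppercase letter followed by . or - or the end of the string immediately to the right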
| - or
[.-] - matches . or -.
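The effect of \K is easiest to see outside of strsplit. In this small sketch (the input "01A" and the variable names are only for illustration), the digits must be present for the pattern to match, but because of \K they are not part of the reported match and therefore survive the replacement.

```r
# Without \K the whole token is the match and gets replaced.
without_reset <- sub("\\d{2}[A-Z]", "", "01A", perl = TRUE)

# With \K the two digits are required but excluded from the match,
# so only the letter is replaced.
with_reset <- sub("\\d{2}\\K[A-Z]", "", "01A", perl = TRUE)

print(without_reset)
print(with_reset)
```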
Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not function as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the second regex could not use them to detect the beginning of the token:
#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
^
#match + left removed
"AB-A4K7-01A-13B-J29Q-10"
#further matches found and removed
"01A-13B-J29Q-10"
#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"
#algorithm continues
"13B-J29Q-10"
This was fixed by replacing the [.-] class to detect the edges of the token in the lookbehind with a boundary anchor, as per Jota's suggestion:
> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
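The behaviour described above can be imitated in plain R. The following is only a rough emulation based on that description (the real strsplit is implemented in C and may differ in edge cases), and split_like_strsplit is a hypothetical helper name. It searches the remaining string, keeps the text to the left of the match as a token, and then throws away everything up to the end of the match, so later lookbehinds cannot see it.

```r
split_like_strsplit <- function(x, pattern) {
  pieces <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1) {            # no further match: the rest is one token
      pieces <- c(pieces, x)
      break
    }
    end <- start - 1 + attr(m, "match.length")  # chars up to end of match
    if (end > 0) {
      pieces <- c(pieces, substr(x, 1, start - 1))
      x <- substr(x, end + 1, nchar(x))         # left context is discarded
    } else {                      # empty match at the very start
      pieces <- c(pieces, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    }
  }
  pieces
}

s <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"
original <- split_like_strsplit(s, "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]")
fixed    <- split_like_strsplit(s, "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)")
print(original)
print(fixed)
```

With the original pattern the lookbehind never sees a hyphen in front of "01", because that hyphen was consumed by the previous split; the \\b version only needs the start of the remaining string.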

How should I split and retain elements using strsplit?

The strsplit function in R matches and deletes a given regular expression, splitting the rest of the string into a vector.
>strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "def"
But how should I split the string the same way using regular expression, but also retain the matches? I need something like the following.
>FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def"
Using strapply("abc123def", "[0-9]+|[a-z]+") works here, but what if the rest of the string other than the matches cannot be captured by a regular expression?
Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split on the transition between [0-9]+ and everything else. In your string, that transition is not pre-existing. To insert it, you could pre-process with gsub and back-referencing:
test <- "abc123def"
strsplit( gsub("([0-9]+)","~\\1~",test), "~" )
[[1]]
[1] "abc" "123" "def"
You could use lookaround assertions.
> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=T)
[[1]]
[1] "abc" "123" "def"
You can use strapply from the gsubfn package.
library(gsubfn)
test <- "abc123def"
strapply(X=test,
pattern="([^[:digit:]]*)(\\d+)(.+)",
FUN=c,
simplify=FALSE)
[[1]]
[1] "abc" "123" "def"
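For the general case raised in the question, where the text between the matches cannot be described by a regular expression of its own, one option is to cut the string at the match positions instead of splitting on the pattern. This is a base R sketch, not a built-in function: split_keep is a hypothetical helper, and it assumes a single input string and a pattern that never matches the empty string.

```r
split_keep <- function(x, pattern) {
  m <- gregexpr(pattern, x)[[1]]
  if (m[1] == -1) return(x)                  # no match: nothing to split
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1
  # A new piece begins at position 1, at every match start,
  # and right after every match end.
  cut_starts <- sort(unique(c(1, starts, ends + 1)))
  cut_starts <- cut_starts[cut_starts <= nchar(x)]
  cut_ends <- c(cut_starts[-1] - 1, nchar(x))
  substring(x, cut_starts, cut_ends)
}

res1 <- split_keep("abc123def", "[0-9]+")
res2 <- split_keep("12ab", "[0-9]+")
res3 <- split_keep("a-1_b22", "[0-9]+")
print(res1)
print(res2)
print(res3)
```

Unlike the gsub approach with "~", this does not rely on a marker character that must be absent from the input.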

Resources