Parsing a factor string in R

I have a string:
x = "[1,2,3]"
How can I get the elements 1 and 2 from the string?
I tried strsplit, but that seems a bit tricky. Then I tried splitting on "[", and that did not seem easy either.

You could use regex lookarounds to extract each number that follows "[" or "," and precedes a comma:
library(stringr)
str_extract_all(x, '(?<=\\[|,)\\d+(?=,)')[[1]]
#[1] "1" "2"

A base R option: remove the brackets, split on commas, and take the first two elements (though do note @MrFlick's comment).
strsplit(gsub("\\[|\\]", "", x), ",")[[1L]][1:2]
# [1] "1" "2"
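As a hedged base R sketch of the same idea, the digits can also be pulled out directly with gregexpr and regmatches and then converted to numbers. The names matches, nums and first_two are illustrative only, and the pattern assumes the string holds non-negative integers.

```r
x <- "[1,2,3]"

# Find every run of digits in the string
# (assumption: the values are non-negative integers)
matches <- regmatches(x, gregexpr("[0-9]+", x))[[1]]

# Convert the character matches to numbers and keep the first two
nums <- as.numeric(matches)
first_two <- nums[1:2]
print(first_two)
```

Converting with as.numeric is optional; skip it if the elements should stay as character strings, as in the answers above.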

Related

Finding last digits in text using stringr

I am trying to use stringr to find the digits that occur at the end of each string.
For example, given
x <- c("Africa-123-Ghana-2", "Oceania-123-Sydney-200")
the stringr operation should return
"2" "200"
I believe there might be multiple methods, but what would be the best code for this?
Thanks.
You could use
sub(".*-(\\d+)$", "\\1", x)
#[1] "2" "200"
Or
stringr::str_extract(x, "\\d+$")
Or
stringi::stri_extract_last_regex(x, "\\d+")
We can use regexpr/regmatches in base R to match one or more digits (\\d+) at the end ($) of the string
regmatches(x, regexpr("\\d+$", x))
#[1] "2" "200"
Or with sub, match everything up to and including the last non-digit character and replace it with an empty string ("")
sub(".*\\D+", "", x)
#[1] "2" "200"
Or using strsplit
sapply(strsplit(x, "-"), tail, 1)
#[1] "2" "200"
Or using stringr with str_match (column 1 holds the full match, which here is the same as the captured group in column 2)
library(stringr)
str_match(x, "(\\d+)$")[,1]
#[1] "2" "200"
Or with str_remove
str_remove(x, ".*\\D+")
#[1] "2" "200"
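One detail worth sketching for the regmatches approach above: when regexpr finds no match in an element, regmatches drops that element, so the result can be shorter than the input. The following is a minimal sketch that keeps the output aligned with the input by filling in NA. The third element "Asia-Tokyo" and the name out are made up for illustration.

```r
# The third element is a hypothetical input with no trailing digits
x <- c("Africa-123-Ghana-2", "Oceania-123-Sydney-200", "Asia-Tokyo")

# regexpr returns -1 for elements without a match
m <- regexpr("[0-9]+$", x)

# Start with NA everywhere, then fill in only the matched positions
out <- rep(NA_character_, length(x))
out[m > 0] <- regmatches(x, m)
print(out)
```

stringr::str_extract already returns NA for non-matching elements, so this bookkeeping is only needed with the base R functions.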

Convert an integer to a string in R

I have an integer vector
a <- (0:3)
And I would like to convert it to a character string that looks like this
b <- "(0:3)"
I have tried
as.character(a)
[1] "0" "1" "2" "3"
and
toString(a)
[1] "0, 1, 2, 3"
But neither does exactly what I need.
Can anyone help me get from a to b?
Many thanks in advance!
paste0("(", min(a), ":", max(a), ")")
#[1] "(0:3)"
Or more concisely with sprintf():
sprintf("(%d:%d)", min(a), max(a))
One option is to deparse the vector and paste the brackets around it
as.character(glue::glue('({deparse(a)})'))
#[1] "(0:3)"
Another option would be to store as a quosure and then convert it to character
library(rlang)
a <- quo((0:3))
quo_name(a)
#[1] "(0:3)"
The quosure can then be evaluated with eval_tidy
eval_tidy(a)
#[1] 0 1 2 3
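The deparse idea also works without extra packages. Below is a minimal base R sketch, with a_again as an illustrative name, that builds the string and then parses it back to check the round trip. Note that the min/max answers above assume a contiguous range, whereas deparse falls back to a c(...) form for other vectors.

```r
a <- 0:3

# deparse() turns the integer sequence into its source text;
# paste0() adds the surrounding brackets
b <- paste0("(", deparse(a), ")")
print(b)

# Round trip: parsing and evaluating the string gives back the vector
a_again <- eval(parse(text = b))
print(identical(a_again, a))
```

Evaluating parsed text is only appropriate for strings you built yourself, as here.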

Regexes work on their own, but not when used together in strsplit

I'm trying to split a string in R using strsplit and a Perl regex. The string consists of various alphanumeric tokens separated by periods or hyphens, e.g. "WXYZ-AB-A4K7-01A-13B-J29Q-10". I want to split the string:
wherever a hyphen appears.
wherever a period appears.
between the second and third character of a token that is exactly 3 characters long and consists of 2 digits followed by 1 capital letter, e.g. "01A" produces ["01", "A"] (but "012A", "B1A", "0A1", and "01A2" are not split).
For example, "WXYZ-AB-A4K7-01A-13B-J29Q-10" should produce ["WXYZ", "AB", "A4K7", "01", "A", "13", "B", "J29Q", "10"].
My current regex is ((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-] and it works perfectly in this online regex tester.
Furthermore, the two parts of the alternative, ((?<=[-.]\\d{2})(?=[A-Z][-.])) and [.-], both serve to split the string as intended in R, when they are used separately:
#correctly splits on periods and hyphens
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
#correctly splits tokens where a letter follows two digits
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))", perl=T)
[[1]]
[1] "WXYZ-AB-A4K7-01" "A-13" "B-J29Q-10"
But when I try to combine them using alternation, the lookaround part stops working, and the string is only split on periods and hyphens:
#only second alternative is used
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
Why is this happening? Is it a problem with my regex, or with strsplit? How can I achieve the desired behavior?
Desired output:
## [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
An alternative that saves you from having to consider how the strsplit algorithm works is to use your original regex with gsub to insert a simple splitting character in all the right places, then use strsplit to do the straightforward splitting.
x <- "XYZ-02-01C-33D-2285"
strsplit(
  gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE),
  "-",
  fixed = TRUE)
#[[1]]
#[1] "XYZ" "02" "01" "C" "33" "D" "2285"
Of course, RichScriven's answer and Wiktor Stribiżew's comment are probably better since they only have one function call.
You may use a consuming version of a positive lookbehind (the match reset operator \K) to make sure strsplit works correctly in R and avoid the problem of using a negative lookbehind inside a positive one.
"(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"
strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "XYZ" "02" "01" "C" "33" "D" "2285"
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
Here, the pattern matches:
(?<![^.-])\d{2}\K(?=[A-Z](?:[.-]|$)) - a sequence of:
(?<![^.-])\d{2} - 2 digits (\d{2}) that are not preceded with a char other than . and - (i.e. that are preceded with . or - or start of string, it is a common trick to avoid alternation inside a lookaround)
\K - the match reset operator that makes the regex engine discard the text matched so far and go on matching the subsequent subpatterns if any
(?=[A-Z](?:[.-]|$)) - a positive lookahead that requires an uppercase letter followed by ., - or the end of the string immediately after the current position
| - or
[.-] - matches . or -.
Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not function as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the lookbehind could no longer use them to detect the beginning of the token:
#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
     ^
#match + left removed
"AB-A4K7-01A-13B-J29Q-10"
#further matches found and removed
"01A-13B-J29Q-10"
#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"
#algorithm continues
"13B-J29Q-10"
This was fixed by replacing the [-.] class that detected the edges of the token in the lookarounds with a word-boundary anchor (\\b), as per Jota's suggestion:
> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
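Another way to sidestep the left-to-right consumption described above is to avoid lookarounds altogether and split in two passes. The sketch below is not one of the original answers; split_token is a hypothetical helper name. It first splits on the separators, then breaks up only those tokens that are exactly two digits followed by one capital letter.

```r
x <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"

# Pass 1: split on periods and hyphens only
tokens <- strsplit(x, "[.-]")[[1]]

# Pass 2: split a token only if it is exactly 2 digits + 1 capital letter
# (split_token is a hypothetical helper for this sketch)
split_token <- function(tok) {
  if (grepl("^[0-9]{2}[A-Z]$", tok)) {
    c(substr(tok, 1, 2), substr(tok, 3, 3))
  } else {
    tok
  }
}

result <- unlist(lapply(tokens, split_token))
print(result)
```

This costs a second pass over the tokens, but each regex is anchored to a whole token, so nothing depends on what strsplit has already consumed.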

Delete pattern in string and semicolon before and/or after (R)

In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output is:
#[1] "a;b" "a;b" "b" ""
I could match ;x and x; separately and that would give me the output, but that wouldn't be very efficient. A solution using gsub or the stringi package would be fine as well.
Remove x and an optional ; after it when x is at the start of the string; otherwise remove x and an optional ; before it. This covers all the cases listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""
We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
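Both patterns above match x anywhere in the string, so with real gene names they could also remove x from inside a longer name. A hedged alternative sketch is to split each string on ";", drop the elements that are exactly equal to the target, and paste the rest back together. The function name remove_gene and the extra test element "a;xy" are made up for illustration.

```r
# remove_gene is a hypothetical helper: drop exact matches of `gene`
# from each semicolon-separated string in `s`
remove_gene <- function(s, gene) {
  parts_list <- strsplit(s, ";", fixed = TRUE)
  vapply(
    parts_list,
    function(parts) paste(parts[parts != gene], collapse = ";"),
    character(1)
  )
}

# "a;xy" is an added case where x is only part of a longer name
s <- c("a;b;x", "a;x;b", "x;b", "x", "a;xy")
res <- remove_gene(s, "x")
print(res)
```

Because the comparison is on whole elements rather than a regex, gene names containing regex metacharacters need no escaping.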

extract text from alphanumeric vector in R

I have data like the vector below and need to extract the text that comes before any number. If the text and the numbers can be separated, that would be even better.
df<-c("axz123","bww2","c334")
output
"axz", "bww", "c"
or
"axz","bww","c"
"123","2","334"
We can do:
df <- c("axz123","bww2","c334")
gsub("\\d+", "", df)
#[1] "axz" "bww" "c"
gsub("(\\D+)", "", df)
#[1] "123" "2" "334"
For your other example:
df <- "BAILEYS IRISH CREAM 1.75 LITERS REGULAR_NOT FLAVORED"
gsub("\\d.*", "", df)
#[1] "BAILEYS IRISH CREAM "
gsub("[A-Z_ ]*", "", df)
#[1] "1.75"
We can use [:alpha:] to match the alphabetic characters, and combine this with gsub() and a negation to remove all characters that are not alphabetic:
gsub("[^[:alpha:]]", "", df)
#[1] "axz" "bww" "c"
To obtain only the non-alphabetic characters we can drop the negation ^:
gsub("[[:alpha:]]", "", df)
#[1] "123" "2" "334"
Using str_extract and a regex lookahead: match one or more alphabetic characters that are immediately followed by a digit ((?=\\d)) and extract them.
library(stringr)
str_extract(df, "[[:alpha:]]+(?=\\d)")
#[1] "axz" "bww" "c"
If we need to separate the numeric and non-numeric parts, strsplit can be used with a zero-width split point between a non-digit and a digit
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl=TRUE)
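The list returned by strsplit can then be unpacked into two parallel vectors. The following is a minimal sketch that assumes every element splits into exactly two pieces (text first, digits second), as in the example data. The names text_part and num_part are illustrative.

```r
df <- c("axz123", "bww2", "c334")

# Split at the zero-width boundary between a non-digit and a digit
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl = TRUE)

# Assumption: each element yields exactly two pieces
text_part <- vapply(lst, `[`, character(1), 1)
num_part  <- vapply(lst, `[`, character(1), 2)

print(text_part)
print(num_part)
```

If some inputs might contain no digits, the second piece would be missing and yield NA, so check lengths(lst) first.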
