Extract text from alphanumeric vector in R

I have data like the vector below and need to extract the text that comes before any number. It would be even better if the text and the numbers could be separated.
df <- c("axz123","bww2","c334")
Desired output:
"axz", "bww", "c"
or
"axz","bww","c"
"123","2","334"

We can do:
df <- c("axz123","bww2","c334")
gsub("\\d+", "", df)
#[1] "axz" "bww" "c"
gsub("(\\D+)", "", df)
#[1] "123" "2" "334"
For your other example:
df <- "BAILEYS IRISH CREAM 1.75 LITERS REGULAR_NOT FLAVORED"
gsub("\\d.*", "", df)
#[1] "BAILEYS IRISH CREAM "
gsub("[A-Z_ ]*", "", df)
#[1] "1.75"
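A pattern like [A-Z_ ]* only works while the text consists of upper-case letters, underscores and spaces. As a hedged alternative, the number can be matched directly with regexpr/regmatches. The names s, size and name are only illustrative, and the sketch assumes one decimal number per string.

```r
s <- "BAILEYS IRISH CREAM 1.75 LITERS REGULAR_NOT FLAVORED"

# first run of digits, optionally followed by a decimal part
size <- as.numeric(regmatches(s, regexpr("[0-9]+(\\.[0-9]+)?", s)))

# everything before the first digit, with the trailing space trimmed
name <- trimws(sub("[0-9].*$", "", s))

size
name
```

Matching the number itself keeps the result independent of which letters surround it.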

We can use [:alpha:] to match the alphabetic characters, and combine this with gsub() and a negation to remove all characters that are not alphabetic:
gsub("[^[:alpha:]]", "", df)
#[1] "axz" "bww" "c"
To obtain only the non-alphabetic characters we can drop the negation ^:
gsub("[[:alpha:]]", "", df)
#[1] "123" "2" "334"

Using str_extract and regex lookarounds: we match one or more alphabetic characters that are immediately followed by a digit (the lookahead (?=\\d)) and extract them.
library(stringr)
str_extract(df, "[[:alpha:]]+(?=\\d)")
#[1] "axz" "bww" "c"
If we need to separate the numeric and non-numeric parts, strsplit can be used with a lookaround that splits at the boundary between a non-digit and a digit:
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl=TRUE)
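The list returned by strsplit can be turned into two columns. This is a minimal sketch, assuming every element is one block of letters followed by one block of digits; the data frame name parts and its column names are made up for illustration.

```r
df <- c("axz123", "bww2", "c334")

# split at the zero-width boundary between a non-digit and a digit
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl = TRUE)

# first piece is the text, second piece is the number
parts <- data.frame(
  text = vapply(lst, `[`, character(1), 1),
  number = as.numeric(vapply(lst, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
parts
```

Strings with more than one letter/digit boundary would produce more than two pieces, so this would need adjusting for such input.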

Related

How to extract unique letters among word of consecutive letters?

For example, take the character string x = "AAATTTGGAA".
What I want is to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".
The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is ATGA.
How can I get this?
Here is a useful regex trick approach:
x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out
[1] "AAA" "TTT" "GG" "AA"
The regex pattern used here says to split at any boundary where the preceding and following characters are different:
(?<=(.)) looks behind and captures the preceding character in group \\1
(?!\\1) then looks ahead and asserts that the following character is different
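To get from the chunks to the requested ATGA, one possible last step is to take the first letter of each chunk and paste them together. The names chunks and result are only illustrative.

```r
x <- "AAATTTGGAA"

# split wherever the next character differs from the previous one
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# every chunk holds a single repeated letter, so its first character is enough
result <- paste(substr(chunks, 1, 1), collapse = "")
result
#[1] "ATGA"
```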
You can split the string into individual characters, then use rle to find the consecutive runs and keep one value per run.
x <- "AAATTTGGAA"
vec <- unlist(strsplit(x, ''))
rle(vec)$values
#[1] "A" "T" "G" "A"
paste0(rle(vec)$values, collapse = '')
#[1] "ATGA"
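We can use regmatches with the pattern (.)\\1+ like below. Note that \\1+ only matches runs of two or more identical letters; use (.)\\1* if single letters should also be returned as chunks.
regmatches(x, gregexpr("(.)\\1+", x))[[1]]
#[1] "AAA" "TTT" "GG" "AA"
or, if you need the unique letters only,
gsub("(.)\\1+", "\\1", x)
#[1] "ATGA"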
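If the input can contain letters that occur only once in a row, a variant with \\1* (zero or more repeats) keeps them as their own chunk. This is a sketch on a made-up string x2, using perl = TRUE for the backreference.

```r
x2 <- "AAATGGA"   # hypothetical input with single-letter runs ("T" and the final "A")

# (.)\\1* matches a character followed by any number of copies of itself
runs <- regmatches(x2, gregexpr("(.)\\1*", x2, perl = TRUE))[[1]]
runs
#[1] "AAA" "T" "GG" "A"

paste(substr(runs, 1, 1), collapse = "")
#[1] "ATGA"
```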

Finding last digits in text using stringr

I am trying to use stringr to find all the digits that occur at the end of each string.
For example, given
x <- c("Africa-123-Ghana-2", "Oceania-123-Sydney-200")
the stringr operation should return
"2" "200"
There might be multiple methods, but what would be the best code for this?
You could use
sub(".*-(\\d+)$", "\\1", x)
#[1] "2" "200"
Or
stringr::str_extract(x, "\\d+$")
Or
stringi::stri_extract_last_regex(x, "\\d+")
We can use regexpr/regmatches in base R to match one or more digits (\\d+) at the end ($) of the string
regmatches(x, regexpr("\\d+$", x))
#[1] "2" "200"
Or with sub, we match everything up to and including the last non-digit character and replace it with blank ("")
sub(".*\\D+", "", x)
#[1] "2" "200"
Or using strsplit
sapply(strsplit(x, "-"), tail, 1)
#[1] "2" "200"
Or using stringr with str_match
library(stringr)
str_match(x, "(\\d+)$")[,1]
#[1] "2" "200"
Or with str_remove
str_remove(x, ".*\\D+")
#[1] "2" "200"
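One thing to watch with the regmatches version: elements without trailing digits are dropped from the result, so the output can be shorter than the input. Below is a base R sketch that keeps the positions aligned by filling in NA. The vector y and its middle element are made up for illustration.

```r
y <- c("Africa-123-Ghana-2", "Europe-Paris", "Oceania-123-Sydney-200")

# regexpr returns -1 for elements where the pattern does not match
m <- regexpr("[0-9]+$", y)

# start with NA everywhere, then fill in only the elements that matched
last_digits <- rep(NA_character_, length(y))
last_digits[m > 0] <- regmatches(y, m)
last_digits
#[1] "2" NA "200"

as.integer(last_digits)
```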

Split character string by forward slash or nothing

I want to split this vector
c("CC", "C/C")
to
[[1]]
[1] "C" "C"
[[2]]
[1] "C" "C"
My final data should look like:
c("C_C", "C_C")
Thus, I need some regex, but I haven't found how to handle the "split on nothing" part:
strsplit(c("CC", "C/C"),"|/")
You can use sub (or gsub if it occurs more than once in your string) to directly replace either nothing or a forward slash with an underscore (capturing the single word characters on either side):
sub("(\\w)(|/)(\\w)", "\\1_\\3", c("CC", "C/C"))
#[1] "C_C" "C_C"
We can split the string at every character, omit the "/" and paste them together.
x <- c("CC", "C/C")
sapply(strsplit(x, ""), function(v) paste0(v[v != "/"], collapse = "_"))
#[1] "C_C" "C_C"
We can use
v1 <- c("CC", "C/C")
lapply(strsplit(v1, "/|"), function(x) x[nzchar(x)])
Or use a regex lookaround
strsplit(v1, "(?<=[^/])(/|)", perl = TRUE)
#[[1]]
#[1] "C" "C"
#[[2]]
#[1] "C" "C"
If the final output should be a vector, then
gsub("(?<=[^/])(/|)(?=[^/])", "_", v1, perl = TRUE)
#[1] "C_C" "C_C"
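If the strings can be longer than two letters, a plain (non-lookaround) variant is to drop the slashes first and then rejoin the single characters. This is only a sketch; join_chars is a hypothetical helper name, and it assumes every remaining character is its own item.

```r
join_chars <- function(x) {
  # drop the optional slashes, split into single characters, rejoin with "_"
  no_slash <- gsub("/", "", x, fixed = TRUE)
  vapply(strsplit(no_slash, ""), paste, character(1), collapse = "_")
}

join_chars(c("CC", "C/C", "A/BC"))
#[1] "C_C" "C_C" "A_B_C"
```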

Extract string between spaces

I have this character vector:
df <-c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
[1] "AA AAAA 1B" "A BBB 1" "CC RR 1W3" "SS RGTYC 0"
and I want to extract what is between spaces.
Desired result:
[1] "AAAA" "BBB" "RR" "RGTYC"
df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
lst <- strsplit(df," ")
sapply(lst, '[[', 2)
# [1] "AAAA" "BBB" "RR" "RGTYC"
Instead of splitting it first and then selecting the relevant split, you can also extract it straight away using the stringr-package:
library(stringr)
str_extract(df, "(?<=\\s)(.*)(?=\\s)")
# [1] "AAAA" "BBB" "RR" "RGTYC"
This solution uses regular expressions, and this pattern is built up like this:
(?<=\\s) checks whether there is whitespace before
(?=\\s) checks whether there is a whitespace after
(.*) extracts everything in between the white spaces
Here is a gsub-based approach (from base R). We match one or more non-whitespace characters from the start (^) of the string followed by one or more whitespace characters, or (|) one or more whitespace characters followed by non-whitespace characters at the end ($) of the string, and replace the match with blank ("")
gsub("^\\S+\\s+|\\s+\\S+$", "", df)
#[1] "AAAA" "BBB" "RR" "RGTYC"
There is also a convenient function word from stringr
stringr::word(df, 2)
#[1] "AAAA" "BBB" "RR" "RGTYC"
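The approaches above agree on the example data, but they answer slightly different questions once a string has more than three fields. A small base R sketch on a made-up vector y shows the difference between "second field" and "everything between the first and last field".

```r
y <- c("AA AAAA 1B", "AA BB CC 1B")   # second element is hypothetical

# second space-separated field only
second_field <- vapply(strsplit(y, " "), `[`, character(1), 2)
second_field
#[1] "AAAA" "BB"

# everything between the first and the last field
between_outer <- gsub("^\\S+\\s+|\\s+\\S+$", "", y)
between_outer
#[1] "AAAA" "BB CC"
```

Which one is appropriate depends on whether the middle part can itself contain spaces.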

Parsing a factor string in R

I have a string,
x = "[1,2,3]"
How can I get the elements 1 and 2 from the string?
I tried strsplit, but that seems a bit tricky. Then I tried splitting on "[", and that did not seem easy either.
You could use regex lookaround to extract the numbers
library(stringr)
str_extract_all(x, '(?<=\\[|,)\\d+(?=,)')[[1]]
#[1] "1" "2"
A base option: here we just remove the brackets and split on commas, though do note @MrFlick's comment.
strsplit(gsub("\\[|\\]", "", x), ",")[[1L]][1:2]
# [1] "1" "2"
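If actual numbers are wanted rather than strings, another base R possibility is to pull out every run of digits and convert it. This is a sketch assuming non-negative integers; as.character() is included in case x really is a factor, as the title suggests, and the name nums is only illustrative.

```r
x <- "[1,2,3]"

# extract every run of digits, then convert to integer
nums <- as.integer(regmatches(as.character(x), gregexpr("[0-9]+", as.character(x)))[[1]])

nums[1:2]
#[1] 1 2
```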

Resources