Extract string between spaces - r

I have this character vector:
df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
[1] "AA AAAA 1B" "A BBB 1" "CC RR 1W3" "SS RGTYC 0"
and I want to extract what is between spaces.
Desired result:
[1] "AAAA" "BBB" "RR" "RGTYC"

df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
lst <- strsplit(df," ")
sapply(lst, '[[', 2)
# [1] "AAAA" "BBB" "RR" "RGTYC"

Instead of splitting the string first and then selecting the relevant piece, you can also extract it directly using the stringr package:
library(stringr)
str_extract(df, "(?<=\\s)(.*)(?=\\s)")
# [1] "AAAA" "BBB" "RR" "RGTYC"
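This solution uses a regular expression that is built up like this:
(?<=\\s) is a lookbehind that checks whether there is a whitespace character before the match
(?=\\s) is a lookahead that checks whether there is a whitespace character after the match
(.*) extracts everything in between. Because .* is greedy, the match runs from the first whitespace to the last one, which is fine here since every string contains exactly two spaces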
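The same lookaround pattern can be tried in base R as well. The following is a minimal sketch, assuming perl = TRUE is acceptable; the input `longer` is a hypothetical string that is not part of the question, added only to show how the greedy .* behaves when there are more than two spaces.

```r
df <- c("AA AAAA 1B", "A BBB 1", "CC RR 1W3", "SS RGTYC 0")
pattern <- "(?<=\\s)(.*)(?=\\s)"

# Same lookaround pattern, evaluated with base R's PCRE engine
greedy <- regmatches(df, regexpr(pattern, df, perl = TRUE))
greedy
# [1] "AAAA" "BBB" "RR" "RGTYC"

# Hypothetical input with three spaces: the greedy .* runs up to the last space
longer <- "A BB CC D"
spanning <- regmatches(longer, regexpr(pattern, longer, perl = TRUE))
spanning
# [1] "BB CC"

# Matching non-whitespace only (\\S+) stops at the end of the second word
second_word <- regmatches(longer, regexpr("(?<=\\s)\\S+(?=\\s)", longer, perl = TRUE))
second_word
# [1] "BB"
```

If the strings may contain more than three words, the \\S+ variant is probably the safer choice.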

Here is a gsub-based approach (base R). We match one or more non-whitespace characters at the start (^) of the string followed by one or more whitespace characters, or (|) one or more whitespace characters followed by one or more non-whitespace characters at the end ($) of the string, and replace the match with an empty string ("").
gsub("^\\S+\\s+|\\s+\\S+$", "", df)
#[1] "AAAA" "BBB" "RR" "RGTYC"
There is also a convenient function, word, in stringr:
stringr::word(df, 2)
#[1] "AAAA" "BBB" "RR" "RGTYC"
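To make the two branches of the gsub alternation easier to follow, here is a small sketch that applies them one after the other with sub. The intermediate names `without_first` and `middle` are illustrative only.

```r
df <- c("AA AAAA 1B", "A BBB 1", "CC RR 1W3", "SS RGTYC 0")

# Branch 1: drop the first word and the whitespace after it
without_first <- sub("^\\S+\\s+", "", df)
without_first
# [1] "AAAA 1B" "BBB 1" "RR 1W3" "RGTYC 0"

# Branch 2: drop the last word and the whitespace before it
middle <- sub("\\s+\\S+$", "", without_first)
middle
# [1] "AAAA" "BBB" "RR" "RGTYC"
```

The single gsub call does both removals in one pass because each branch can match only once per string.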

Related

Shortest way to remove duplicate words from string

I have this string:
x <- c("A B B C")
[1] "A B B C"
I am looking for the shortest way to get this:
[1] "A B C"
I have tried this, taken from "Removing duplicate words in a string in R":
paste(unique(x), collapse = ' ')
[1] "A B B C"
# does not work
Background:
In a data frame column I want to count only the unique words.
A regex-based approach could be shorter: match one or more non-whitespace characters (\\S+) followed by a whitespace character (\\s) as a capture group, then one or more repetitions of that group via the backreference (\\1+), and use the backreference in the replacement so that only a single copy of the match is returned.
gsub("(\\S+\\s)\\1+", "\\1", x)
[1] "A B C"
Alternatively, split the string with strsplit, unlist the result, keep the unique words, and paste them back together:
paste(unique(unlist(strsplit(x, " "))), collapse = " ")
# [1] "A B C"
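The two approaches differ on some inputs. The sketch below is only an illustration: the helper names `words_unique` and `dedupe` and the vector `col` are hypothetical. It shows that the backreference regex leaves non-adjacent duplicates and a duplicate at the very end of the string untouched, while the split-and-unique approach handles both. It also shows one way to get the unique word count per element, which is what the background in the question asks for.

```r
# Hypothetical helpers: split on single spaces and keep the unique words
words_unique <- function(s) unique(strsplit(s, " ", fixed = TRUE)[[1]])
dedupe <- function(s) paste(words_unique(s), collapse = " ")

dedupe("A B B C")
# [1] "A B C"
dedupe("A B A C")   # non-adjacent duplicate
# [1] "A B C"
dedupe("A B B")     # duplicate at the end of the string
# [1] "A B"

# The backreference regex needs "word + whitespace" repeated back to back,
# so these two inputs come back unchanged
gsub("(\\S+\\s)\\1+", "\\1", "A B A C")
# [1] "A B A C"
gsub("(\\S+\\s)\\1+", "\\1", "A B B")
# [1] "A B B"

# Unique word count per element of a (hypothetical) column
col <- c("A B B C", "X X X", "P Q")
unique_counts <- lengths(lapply(strsplit(col, " ", fixed = TRUE), unique))
unique_counts
# [1] 3 1 2
```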
Another possible solution, based on stringr::str_split (the final paste step joins the unique words back into one string):
library(tidyverse)
str_split(x, " ") %>% unlist %>% unique %>% paste(collapse = " ")
#> [1] "A B C"
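In case the duplicates are not adjacent, gsub with a lookahead (which requires perl=TRUE) also works. It removes a word whenever the same word occurs again later in the string, so the last occurrence of each word is kept: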
x <- c("A B B C")
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", x, perl=TRUE)
#[1] "A B C"
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", "A B B A ABBA", perl=TRUE)
#[1] "B A ABBA"
You can also use:
gsub("\\b(\\w+)(?:\\W+\\1\\b)+", "\\1", x)

How to extract unique letters among word of consecutive letters?

For example, there is a character string x = "AAATTTGGAA".
What I want to achieve is to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".
The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is ATGA.
How can I get this?
Here is a useful regex trick approach:
x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out
[1] "AAA" "TTT" "GG" "AA"
The regex pattern used here says to split at any boundary where the preceding and following characters are different.
(?<=(.)) lookbehind and also capture preceding character in \1
(?!\\1) then lookahead and assert that following character is different
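The split above stops at the chunks. A short sketch of one way to get from the chunks to the expected output ATGA follows; the names `chunks`, `first_letters` and `result` are illustrative only.

```r
x <- "AAATTTGGAA"

# Split wherever the next character differs from the previous one
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# Every chunk consists of one repeated letter, so its first character is enough
first_letters <- substr(chunks, 1, 1)
result <- paste(first_letters, collapse = "")
result
# [1] "ATGA"
```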
You can split each character in the string. Use rle to find consecutive runs and select only the unique ones.
x <- "AAATTTGGAA"
vec <- unlist(strsplit(x, ''))
rle(vec)$values
#[1] "A" "T" "G" "A"
paste0(rle(vec)$values, collapse = '')
#[1] "ATGA"
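rle also returns the length of each run, so the chunks themselves can be rebuilt if they are needed. A minimal sketch, assuming R 3.3.0 or later for strrep:

```r
x <- "AAATTTGGAA"
runs <- rle(strsplit(x, "")[[1]])

runs$lengths
# [1] 3 3 2 2

# Repeat each run's letter by the run's length to recover the chunks
chunks <- strrep(runs$values, runs$lengths)
chunks
# [1] "AAA" "TTT" "GG" "AA"
```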
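We can use regmatches with the pattern (.)\\1+ as shown below. Note that \\1+ requires at least one repetition, so a letter that occurs only once in a row would not be returned:
regmatches(x, gregexpr("(.)\\1+", x))[[1]]
# [1] "AAA" "TTT" "GG" "AA"
or, if you need the unique letters only:
gsub("(.)\\1+", "\\1", x)
# [1] "ATGA"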
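If the input can contain letters that appear only once in a row, the quantifier in the regmatches pattern matters. The sketch below uses a hypothetical input `y` (not from the question) and perl = TRUE to compare \\1+ with \\1*.

```r
y <- "ABBC"  # hypothetical input with two single-letter runs

# \\1+ needs at least one repetition, so "A" and "C" are skipped
repeated_only <- regmatches(y, gregexpr("(.)\\1+", y, perl = TRUE))[[1]]
repeated_only
# [1] "BB"

# \\1* also accepts runs of length one
all_runs <- regmatches(y, gregexpr("(.)\\1*", y, perl = TRUE))[[1]]
all_runs
# [1] "A" "BB" "C"
```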

How to split string in R with regular expression when parts of the regular expression are to be kept in the subsequent splitted strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
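One option would be str_split from stringr. The lookahead (?=[a-z0-9]) checks the character after ", " without consuming it, and n = 2 limits each result to at most two pieces: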
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
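For readers without stringr, a base R sketch of the same idea is possible with regexpr and regmatches(..., invert = TRUE), which splits each string at its first match only. This is offered as an alternative under that assumption, not as the method used in the answer above; the names `first_sep` and `parts` are illustrative.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# Locate the first ", " that is followed by a lowercase letter or a digit
first_sep <- regexpr(", (?=[a-z0-9])", x, perl = TRUE)

# invert = TRUE returns the text around the match; strings without a
# match are returned unchanged
parts <- regmatches(x, first_sep, invert = TRUE)
parts[[2]]
# [1] "ABC, EF"
parts[[3]]
# [1] "ABC, DEF" "2 stems"
parts[[4]]
# [1] "DE" "other comments, and stuff"
```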

R substring based on Regular Expression

I have strings like:
myString = "2 word1 & 4 word2"
myString = "4 word2"
myString = "2 word1"
I would like to get the number before word1 and the number before word2:
number1 = 2
number2 = 4
How can I do this with a regular expression in R?
I tried something like this, but it only gets the first number:
gsub("([0-9]+).*", "\\1", myString)
You may extract the number before a specific string using a regex with a lookahead:
library(stringr)
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
word1_res <- str_extract_all(myString, "\\d+(?=\\s*word1)")
word1_res
[[1]]
[1] "2"
[[2]]
character(0)
[[3]]
[1] "2"
The results for word2 can be retrieved similarly:
word2_res <- str_extract_all(myString, "\\d+(?=\\s*word2)")
Details:
\\d+ - 1 or more digits...
(?=\\s*word2) - ...if immediately followed by:
\\s* - 0+ whitespace characters
word2 - a literal word2 substring.
A base R equivalent is
regmatches(myString, gregexpr("\\d+(?=\\s*word1)", myString, perl=TRUE))
regmatches(myString, gregexpr("\\d+(?=\\s*word2)", myString, perl=TRUE))
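If a plain numeric vector with NA for missing values is more convenient than a list, the lookahead can be wrapped in a small helper. This is a sketch only: `extract_before` is a hypothetical function, it returns just the first match per string, and it assumes `word` contains no regex metacharacters.

```r
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# Hypothetical helper: first number that directly precedes `word`, else NA
extract_before <- function(strings, word) {
  pattern <- paste0("\\d+(?=\\s*", word, ")")
  m <- regexpr(pattern, strings, perl = TRUE)
  hit <- as.vector(m) > 0          # regexpr returns -1 where nothing matched
  out <- rep(NA_real_, length(strings))
  out[hit] <- as.numeric(regmatches(strings, m))
  out
}

number1 <- extract_before(myString, "word1")
number1
# [1]  2 NA  2
number2 <- extract_before(myString, "word2")
number2
# [1]  4  4 NA
```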
An almost equivalent solution with sub would be:
sub(".*?(\\d+)\\s*word1.*|.*", "\\1", myString)
# [1] "2" "" "2"
sub(".*?(\\d+)\\s*word2.*|.*", "\\1", myString)
# [1] "4" "4" ""
Note that this implies there is only one result per string, while str_extract_all will get all occurrences from the string.
To extract any chunk of 1+ digits as a whole word, use a stringr solution with str_extract_all:
library(stringr)
str_extract_all(myString, "\\b\\d+\\b")
or a base R one with regmatches/gregexpr:
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
regmatches(myString, gregexpr("\\b\\d+\\b", myString))
Output:
[[1]]
[1] "2" "4"
[[2]]
[1] "4"
[[3]]
[1] "2"
Details
\b - a word boundary
\d+ - 1 or more digits
\b - a word boundary.
Try:
myString = "2 word1 & 4 word2"
number1 = gsub("([0-9]+).*", "\\1", myString)
myString = "4 word2"
number2 = gsub("([0-9]+).*", "\\1", myString)
myString = "2 word1"
number3 = gsub("([0-9]+).*", "\\1", myString)
print(number1)
print(number2)
print(number3)
If you assign a string to myString three times, myString will only contain the last one.
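One way around the overwriting is to store all three strings in a single character vector, since gsub is vectorised. A minimal sketch (`first_numbers` is an illustrative name); it still returns only the first number of each string, as in the original attempt.

```r
# Keep all strings in one vector instead of reassigning myString
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# gsub works element-wise, so one call covers every string
first_numbers <- gsub("([0-9]+).*", "\\1", myString)
first_numbers
# [1] "2" "4" "2"
```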
This removes each occurrence of a letter or ampersand possibly followed by other non-space characters and then scans in what is left. The scan also converts them to numeric. No packages are used.
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
lapply(myString, function(x) scan(text = gsub("[[:alpha:]&]\\S*", "", x), quiet = TRUE))
giving:
[[1]]
[1] 2 4
[[2]]
[1] 4
[[3]]
[1] 2

extract text from alphanumeric vector in R

I have data like the below and need to extract the text that comes before any number. It would be even better if the text and the numbers could be separated.
df<-c("axz123","bww2","c334")
output
"axz", "bww", "c"
or
"axz","bww","c"
"123","2","334"
We can do:
df <- c("axz123","bww2","c334")
gsub("\\d+", "", df)
#[1] "axz" "bww" "c"
gsub("(\\D+)", "", df)
#[1] "123" "2" "334"
For your other example:
df <- "BAILEYS IRISH CREAM 1.75 LITERS REGULAR_NOT FLAVORED"
gsub("\\d.*", "", df)
#[1] "BAILEYS IRISH CREAM "
gsub("[A-Z_ ]*", "", df)
#[1] "1.75"
We can use [:alpha:] to match the alphabetic characters, and combine this with gsub() and a negation to remove all characters that are not alphabetic:
gsub("[^[:alpha:]]", "", df)
#[1] "axz" "bww" "c"
To obtain only the non-alphabetic characters we can drop the negation ^:
gsub("[[:alpha:]]", "", df)
#[1] "123" "2" "334"
Using str_extract and a regex lookahead: we match one or more alphabetic characters that are immediately followed by a digit ((?=\\d)) and extract them.
library(stringr)
str_extract(df, "[[:alpha:]]+(?=\\d)")
#[1] "axz" "bww" "c"
If we need to separate the numeric and non-numeric parts, strsplit can be used:
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl=TRUE)
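The strsplit call returns a list with one element per input string. A sketch of how the two parts might be pulled out as separate vectors follows, assuming every element consists of letters followed by digits; `letters_part` and `digits_part` are illustrative names.

```r
df <- c("axz123", "bww2", "c334")

# Split at the boundary between a non-digit and a digit
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl = TRUE)

# Take the first and second piece of every element
letters_part <- vapply(lst, `[`, character(1), 1)
digits_part  <- vapply(lst, `[`, character(1), 2)

letters_part
# [1] "axz" "bww" "c"
digits_part
# [1] "123" "2" "334"
```

vapply is used instead of sapply so that an element without a second piece yields NA rather than silently changing the shape of the result.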
