Extract multiple matching patterns - R

I am trying to extract AOB1 or AOB2 or AOB3 from the strings below.
df <- data.frame(
id = c(1,2,3),
string = c("acv-32-AOB1", "osa-122-AOB2","cds-543-rr-AOB3")
)
> df
id string
1 1 acv-32-AOB1
2 2 osa-122-AOB2
3 3 cds-543-rr-AOB3
Any ideas?
Thanks!

We can use trimws from base R (in recent versions, its whitespace argument accepts a pattern rather than just literal whitespace)
trimws(df$string, whitespace = ".*-")
[1] "AOB1" "AOB2" "AOB3"
Or use sub from base R
sub(".*-", "", df$string)
[1] "AOB1" "AOB2" "AOB3"
Or, if we need to extract the 'AOB' followed by digits:
library(stringr)
str_extract(df$string, "AOB\\d+")
[1] "AOB1" "AOB2" "AOB3"

You can use regular expressions for this:
.* Match anything
(AOB[1-3]) then match AOB followed by a 1, 2 or 3
\\1 replace the entire string with the matched AOB1-3 slot
gsub(".*(AOB[1-3])", "\\1", df$string)

Here is one more approach, dear friends:
m <- gregexpr("[a-zA-Z]{3}\\d{1}", df$string)
unlist(regmatches(df$string, m))
[1] "AOB1" "AOB2" "AOB3"

I made a small modification so that your desired pattern can appear anywhere in the string; you can use the following solution to extract it:
df$res <- gsub("(.*)?([A-Z]{3}\\d)(.*)?", "\\2", df$string, perl = TRUE)
df
id string res
1 1 acv-32-AOB1 AOB1
2 2 osa-122-AOB2 AOB2
3 3 cds-543-rr-AOB3 AOB3

Related

Str_split is returning only half of the string

I have a tibble whose column contains character strings with a mix of English and Mandarin characters. I want to split it into two columns, one containing the English and the other the Mandarin. However, I had to resort to the following code to accomplish this:
library(tidyverse) # for tibble, str_split, mutate and select
tb <- tibble(x = c("I我", "love愛", "you你")) # create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = T) # split string when R reads a character that is not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = T) # split string after R reads all the a-z characters
tb <- tb %>%
  mutate(EN = en[, 1],
         CH = ch[, 2]) %>%
  select(-x) # keep the relevant column of each matrix (the matrices also contain blank "" values) and drop x
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
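The call above returns a character matrix with one row per string; a minimal sketch of writing it back into the tibble (reusing tb from the question) could be:
parts <- do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
tb$en <- parts[, 1] # English part
tb$ch <- parts[, 2] # Mandarin part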
Or we can use
library(stringr)
tb$en <- str_extract(tb$x,"[[:alpha:]]+")
tb$ch <- str_extract(tb$x,"[^[:alpha:]]+")
We can use str_match to capture the English letters and the remaining characters separately.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
A simple solution using str_extract from package stringr:
library(stringr)
tb$en <- str_extract(tb$x,"[A-z]+")
tb$ch <- str_extract(tb$x,"[^A-z]")
In case there's more than one Chinese character, just add + to [^A-z]. (Strictly speaking, [A-z] also matches a few punctuation characters that sit between Z and a in the ASCII table, so [A-Za-z] is the safer class.)
Alternatively, use gsub and a backreference:
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution matches the printable ASCII range with [ -~]+ and everything outside it (here, the Chinese characters) with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
x en ch
<chr> <chr> <chr>
1 I我 I 我
2 love愛 love 愛
3 you你 you 你

regex commas not between two numbers

I am looking for a regex for gsub to remove all the unwanted commas:
Data:
,,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354
Desired result:
12345
12345,1345,1354
123
12345
12354
This is the progress I have made so far:
(,(?!\d+))
You seem to want to remove all leading and trailing commas.
You may do it with
gsub("^,+|,+$", "", x)
The regex contains two alternatives: ^,+ matches 1 or more commas at the start and ,+$ matches 1 or more commas at the end, and gsub replaces these matches with empty strings.
x <- c(",,,,,,,12345","12345,1345,1354","123,,,,,,","12345,",",12354")
gsub("^,+|,+$", "", x)
## [1] "12345" "12345,1345,1354" "123" "12345"
## [5] "12354"
You can also use str_extract from stringr. Thanks to greedy matching, you don't have to specify how many times a digit occurs; the longest match is chosen automatically:
library(dplyr)
library(stringr)
df %>%
mutate(V1 = str_extract(V1, "\\d.+\\d"))
or if you prefer base R:
df$V1 = unlist(regmatches(df$V1, gregexpr("\\d.+\\d", df$V1))) # unlist so the result is a character vector, not a list
Result:
V1
1 12345
2 12345,1345,1354
3 123
4 12345
5 12354
Data:
df = read.table(text = ",,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354")

stringr: extract words containing a specific word

Consider this simple example
dataframe <- data_frame(text = c('WAFF;WOFF;WIFF200;WIFF12',
'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF; that is, I want to end up with a data frame like this:
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried to use
dataframe %>%
mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))
but this only returns NAs. Any ideas?
Thanks!
A classic, non-regex approach via base R would be,
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
You seem to want to remove all words that do not contain WIFF, together with the trailing ; if there is any. Use
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
If for some reason you want to use str_extract, note that your regex could not work: \bWIFF\b matches the whole word WIFF and nothing else, and there are no such standalone words in your data frame. (Also, inside an R string the word boundary must be written as "\\b"; a single "\b" is the backspace character.) You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any word with WIFF inside it (case insensitively), use str_extract_all to get multiple occurrences per string, and do not forget to join the matches back into a single string:
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all into the sapply function, I separated them for better visibility.

Count number of occurrences when string contains substring

I have a string like
'abbb'
I need to understand how many times I can find substring 'bb'.
grep('bb','abbb')
returns 1, whereas the answer I need is 2 (a-bb and ab-bb, i.e. the two overlapping occurrences). How can I count the number of occurrences the way I need?
You can make the pattern non-consuming by wrapping it in a zero-width lookahead, '(?=bb)'; since the lookahead does not consume the matched characters, overlapping occurrences are counted:
length(gregexpr('(?=bb)', 'abbb', perl = TRUE)[[1]])
[1] 2
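One caveat with this gregexpr approach: when there is no match at all, gregexpr returns -1, so length() still reports 1. A small wrapper (a hypothetical helper, not from any package) can guard against that:
count_overlapping <- function(x, pattern) {
  m <- gregexpr(paste0("(?=", pattern, ")"), x, perl = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m) # -1 means no match
}
count_overlapping("abbb", "bb") # 2
count_overlapping("aaaa", "bb") # 0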
Here is an ugly approach using substr and sapply:
input <- "abbb"
search <- "bb"
res <- sum(sapply(1:(nchar(input) - nchar(search) + 1), function(i) {
  substr(input, i, i + (nchar(search) - 1)) == search
}))
We can use stri_count
library(stringi)
stri_count_regex(input, '(?=bb)')
#[1] 2
stri_count_regex(x, '(?=bb)')
#[1] 0 1 0
data
input <- "abbb"
x <- c('aa','bb','ba')

Replace specific characters within strings

I would like to remove specific characters from strings within a vector, similar to the Find and Replace feature in Excel.
Here are the data I start with:
group <- data.frame(c("12357e", "12575e", "197e18", "e18947")
I start with just the first column; I want to produce the second column by removing the e's:
group group.no.e
12357e 12357
12575e 12575
197e18 19718
e18947 18947
With a regular expression and the function gsub():
group <- c("12357e", "12575e", "197e18", "e18947")
group
[1] "12357e" "12575e" "197e18" "e18947"
gsub("e", "", group)
[1] "12357" "12575" "19718" "18947"
What gsub does here is to replace each occurrence of "e" with an empty string "".
See ?regex or ?gsub for more help.
Regular expressions are your friends:
R> ## also adds missing ')' and sets column name
R> group <- data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
R> group
group
1 12357e
2 12575e
3 197e18
4 e18947
Now use gsub() with the simplest possible replacement pattern: empty string:
R> group$groupNoE <- gsub("e", "", group$group)
R> group
group groupNoE
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947
R>
Summarizing 2 ways to replace strings:
group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
1) Use gsub
group$group.no.e <- gsub("e", "", group$group)
2) Use the stringr package
library(stringr)
group$group.no.e <- str_replace_all(group$group, "e", "")
Both will produce the desire output:
group group.no.e
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947
You do not need to create a data frame from the vector of strings if you just want to replace some characters in it. Regular expressions are a good choice for this, as already mentioned by @Andrie and @Dirk Eddelbuettel.
Note that if you want to replace special characters, such as dots, you need to account for regular expression syntax (a bare . matches any character), as shown in the example below:
ctr_names <- c("Czech.Republic","New.Zealand","Great.Britain")
gsub("[.]", " ", ctr_names)
this will produce
[1] "Czech Republic" "New Zealand" "Great Britain"
Use the stringi package:
require(stringi)
group<-data.frame(c("12357e", "12575e", "197e18", "e18947"))
stri_replace_all(group[,1], "", fixed="e")
[1] "12357" "12575" "19718" "18947"
> library(stringr)
> group <- c('12357e', '12575e', '12575e', ' 197e18', 'e18947')
> pattern <- "e"
> replacement <- ""
> group <- str_replace(group, pattern, replacement)
> group
[1] "12357" "12575" "12575" " 19718" "18947"
You can use chartr as well, but note that it translates characters one-for-one (old and new must be the same length), so it cannot delete the "e" outright; it is only suitable for swapping it with another single character, for example:
chartr("e", "E", group$group)
