stringr: extract words containing a specific word

Consider this simple example
dataframe <- data_frame(text = c('WAFF;WOFF;WIFF200;WIFF12',
'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF, that is, I want to end up with a dataframe like this:
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried to use
dataframe %>%
mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))
but this only returns NAs. Any ideas?
Thanks!

A classic, non-regex approach via base R would be,
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
  paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"

You seem to want to remove all words that do not contain WIFF, together with the trailing ; if there is any. Use
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
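For completeness, here is the same removal dropped into the mutate() pipeline from the question, using stringr's regex(ignore_case = TRUE) instead of the inline (?i) (a sketch, assuming dplyr and stringr are loaded):
library(dplyr)
library(stringr)
dataframe %>%
  mutate(mystring = str_replace_all(text, regex("\\b(?!\\w*WIFF)\\w+;?", ignore_case = TRUE), ""))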
If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches the whole word WIFF and nothing else, and there are no such standalone words in your DF (also note that in an R string literal the word boundary must be written "\\b", since "\b" is the backspace escape). You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any word with WIFF inside (case insensitively), use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all into the sapply function, I separated them for better visibility.

Related

Remove the digit at position n of each number in a string of numbers separated by slashes

I have a character column with this configuration:
data <- data.frame(
id = 1:3,
codes = c("08001301001", "08002401002 / 08002601003 / 17134604034", "08004701005 / 08005101001"))
I want to remove the 6th digit of any number within the string. The numbers are always 10 characters long.
My code works; however, I believe it could be done more easily with a regex, but I couldn't figure it out.
library(stringr)
remove_6_digit <- function(x){
  # positions of the "/" separators in the string
  idxs <- str_locate_all(x, "/")[[1]][, 1]
  # delete the 6th digit of each number, working right to left so indices stay valid
  for (idx in c(rev(idxs + 7), 6)){
    str_sub(x, idx, idx) <- ""
  }
  return(x)
}
result <- sapply(data$codes, remove_6_digit, USE.NAMES = F)
You can use
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
This will remove the 6th digit from the start of each digit sequence (a quick check against the question's data follows the breakdown below).
Details:
\b - word boundary
(\d{5}) - Capturing group 1 (\1): five digits
\d - a digit.
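Applied to the question's data, this gives (a quick check; the values agree with the expected result shown further down):
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
# "0800101001"
# "0800201002 / 0800201003 / 1713404034"
# "0800401005 / 0800501001"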
While a word boundary looks sufficient for the current scenario, a "digit boundary" is also an option in case the numbers are glued to word characters:
gsub("(?<!\\d)(\\d{5})\\d", "\\1", data$codes, perl=TRUE)
where perl=TRUE enables the PCRE regex syntax and (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location.
And if you must only change numeric char sequences of 10 digits (no shorter and no longer) you can use
gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{4}(?!\\d))", "\\1", data$codes, perl=TRUE)
One remark though: your numbers actually consist of 11 digits, so you need to replace \\d{4} with \\d{5} in the two patterns above.
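A sketch of the corrected full-length variants, with \\d{4} replaced by \\d{5} as the remark describes (targeting runs of exactly 11 digits):
gsub("\\b(\\d{5})\\d(\\d{5})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{5}(?!\\d))", "\\1", data$codes, perl = TRUE)
# both calls return the same values as the quick check above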
Another possible solution, using stringr::str_replace_all and lookarounds:
library(tidyverse)
data %>%
mutate(codes = str_replace_all(codes, "(?<=\\d{5})\\d(?=\\d{5})", ""))
#> id codes
#> 1 1 0800101001
#> 2 2 0800201002 / 0800201003 / 1713404034
#> 3 3 0800401005 / 0800501001

Str_split is returning only half of the string

I have a tibble whose column contains character strings with a mix of English and Mandarin characters. I want to split it into two columns, with one column returning the English and the other returning the Mandarin. However, I had to resort to the following code to accomplish this:
tb <- tibble(x = c("I我", "love愛", "you你"))  # create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = T)  # split the string when R reads a character that is not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = T)  # split the string after R reads all the a-z characters
tb <- tb %>%
  mutate(EN = en[, 1],   # subset the matrices created above, because they also contain a blank "" column
         CH = ch[, 2]) %>%
  select(-x)  # and remove the original x column
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
Or we can use
library(stringr)
tb$en <- str_extract(tb$x,"[[:alpha:]]+")
tb$ch <- str_extract(tb$x,"[^[:alpha:]]+")
We can use str_match to get the English letters and the remaining characters as separate capture groups.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
A simple solution using str_extract from package stringr:
library(stringr)
tb$en <- str_extract(tb$x,"[A-z]+")
tb$ch <- str_extract(tb$x,"[^A-z]")
In case there's more than one Chinese character, just add + to [^A-z]. (Note that the [A-z] range also includes a few punctuation characters between Z and a in the ASCII table; [A-Za-z] is the safer choice.)
Alternatively, use gsub and a backreference:
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution matches the printable ASCII characters with [ -~]+ and everything outside that range (here, the Chinese characters) with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
x en ch
<chr> <chr> <chr>
1 I我 I 我
2 love愛 love 愛
3 you你 you 你
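If you would rather do it with a single split in stringr, the lookaround pattern from the strsplit() answer also works with str_split_fixed() (a sketch; it assumes each value is Latin letters followed by non-Latin characters):
library(stringr)
# assuming tb from the question is in scope
str_split_fixed(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", n = 2)
#      [,1]   [,2]
# [1,] "I"    "我"
# [2,] "love" "愛"
# [3,] "you"  "你"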

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespaces if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceded by a digit and succeeded by a letter".
I tried the stringr package but failed to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds assert that a certain position in the string is matched without consuming any characters. They either check what precedes the current position (a lookbehind) or what follows it (a lookahead), and each comes in a positive and a negative flavour, so there are four types in all (pos./neg. lookbehind/lookahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
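Put together for this question, the lookaround version can be dropped straight into str_replace_all (a sketch that reuses the month alternation built in the question):
library(stringr)
txt <- "START1SEP2 1DECX JANEND"
patt_month <- paste(toupper(month.abb), collapse = "|")
# wrap a month abbreviation in spaces only when a digit precedes and a capital letter follows
str_replace_all(txt, paste0("(?<=\\d)(", patt_month, ")(?=[A-Z])"), " \\1 ")
# [1] "START1SEP2 1 DEC X JANEND"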
A Base R version
patt_month <- capture.output(cat(toupper(month.abb), sep = "|"))  # concatenate all month.abb with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)")  # make it a three-group pattern
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl = TRUE)  # same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

Separating multiple value numbers (with characters) and text

I have a file in Excel that has, as an example, text such as this "4.56/505AB" in a cell. The numbers all vary, as does the length of text, so the text can be single or multiple characters, and the numbers can contain characters such as a decimal point or slash mark.
The ideal, separated format for this example would be: column 1 = 4.56/505, column 2 = AB.
What I've tried:
"Split_Text" in Excel, which removed the special characters from the number, and resulted in the following output: column 1 = 456505, column 2 = ./AB
R with the "G_sub" command, which resulted in: [1] " 4 . 56 / 505 AB"
Is there a way to take these methods further, or will this be a manual fix? Thank you!
Assuming the first uppercase letter is the beginning of the second column
df <- data.frame(c1 = c("4.56/505AB", "1.23/202CD"))
library(stringr)
df$c2 <- str_extract(df$c1, "[^[A-Z]]+")
df$c3 <- str_extract(df$c1, "[A-Z]+")
df
# c1 c2 c3
# 1 4.56/505AB 4.56/505 AB
# 2 1.23/202CD 1.23/202 CD
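If you want both pieces as new columns of a data frame in one step, tidyr::extract() can apply the same two-capture-group idea (a sketch; the column names num and letters are placeholders I chose):
library(tidyr)
df <- data.frame(c1 = c("4.56/505AB", "1.23/202CD"))
# group 1: the numeric part (digits, dots, slashes); group 2: the trailing letters
extract(df, c1, into = c("num", "letters"), regex = "^([0-9./]+)([A-Z]+)$")
#        num letters
# 1 4.56/505      AB
# 2 1.23/202      CD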
1) sub/read.table Match the leading characters and the trailing characters within the two capture groups and separate them with a semicolon. Then read that in using read.table. No packages are used.
x <- "4.56/505AB"
pat <- "^([0-9.,/]+)(.*)"
read.table(text = sub(pat, "\\1;\\2", x), sep = ";", as.is = TRUE)
## V1 V2
## 1 4.56/505 AB
The result has character columns, but if you prefer factors, omit the as.is = TRUE. Also, we have assumed there are no semicolons in the input; if there are, use some other character that does not appear in the input in place of the semicolon in the two places where it appears.
1a) If we can assume that the second column always starts with a letter then we could just replace the first letter encountered by a semicolon followed by that letter and then read it in using read.table. This has the advantage of using a slightly simpler pattern.
read.table(text = sub("([[:alpha:]])", ";\\1", x), sep = ";", as.is = TRUE)
2) read.pattern Using the same input x and pattern pat it is even shorter using read.pattern in the gsubfn package:
library(gsubfn)
read.pattern(text = x, pattern = pat, as.is = TRUE)
## V1 V2
## 1 4.56/505 AB

Add day to date using regex

Could someone show me how to add a day to a date using a regex?
Here is my starting code:
#Create data frame
a = c("01/2009","03/2006","","12/2003")
b = c("03/2016","05/2010","07/2011","")
df = data.frame(a,b)
Here's what I'd like to create:
#Create data frame
a = c("01/01/2009","03/01/006","","12/01/2003")
b = c("03/01/2016","05/01/2010","07/01/2011","")
df = data.frame(a,b)
I tried something like this:
df$c <- gsub("(/.*)","\\01/\\1", df$a, perl=TRUE)
But I'm obviously not getting the results I'm looking for. I'm new to regexes and am looking for some help. Thank you.
You needn't use a regex if all you've got are values like mm/yyyy or empty ones. Just use a literal string replacement:
gsub("/","/01/", df$a, fixed=TRUE)
which just replaces every / symbol with the substring /01/.
If you have to make sure you only change strings matching the 2-digits/4-digits pattern, use
gsub("^(\\d{2})/(\\d{4})$", "\\1/01/\\2", df$a)
where the pattern matches:
^ - start of string
(\\d{2}) - capturing group #1 matching 2 digits
/ - a literal /
(\\d{4}) - capturing group #2 matching 4 digits
$ - end of string.
The replacement pattern contains \\1, a backreference to the value captured in Group 1, /01/ as a literal substring, and \\2, a backreference to the value captured in Group 2.
R demo:
> a = c("01/2009","03/2006","","12/2003")
> b = c("03/2016","05/2010","07/2011","")
> df = data.frame(a,b)
> gsub("/","/01/", df$a, fixed=TRUE)
[1] "01/01/2009" "03/01/2006" "" "12/01/2003"
> gsub("^(\\d{2})/(\\d{4})$", "\\1/01/\\2", df$a)
[1] "01/01/2009" "03/01/2006" "" "12/01/2003"
