How to count the number of occurrences of a particular pattern in a string in R?

I am trying to count the number of | characters in a string. This is my code, but it gives the incorrect answer of 32 instead of 2. Why is this happening, and how do I get a call that returns 2? Thanks!
> levels
[1] "Completely|Partially|Not at all"
> str_count(levels, '|')
[1] 32
Also, how do I split the string on the | character? I would like the output to be a character vector of length 3: 'Completely', 'Partially', 'Not at all'.

The | is a regex metacharacter: it is the alternation ("or") operator. To match it literally, escape it with backslashes.
stringr::str_count("Completely|Partially|Not at all", "\\|")
# [1] 2
To show what | is normally used for, let's count the occurrences of el and al:
stringr::str_count("Completely|Partially|Not at all", "al")
# [1] 2
stringr::str_count("Completely|Partially|Not at all", "el")
# [1] 1
stringr::str_count("Completely|Partially|Not at all", "el|al")
# [1] 3
To look for the literal | symbol, it needs to be escaped.
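Alternatively, stringr can be told to treat the pattern as a literal string rather than a regex, which avoids escaping altogether:
stringr::str_count("Completely|Partially|Not at all", stringr::fixed("|"))
# [1] 2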
To split the string by the | symbol, we can use strsplit (base R) or stringr::str_split:
strsplit("Completely|Partially|Not at all", "\\|")
# [[1]]
# [1] "Completely" "Partially" "Not at all"
The result is a list because the input may be a vector of strings. For instance, this is clearer with a two-element input:
vec <- c("Completely|Partially|Not at all", "something|else")
strsplit(vec, "\\|")
# [[1]]
# [1] "Completely" "Partially" "Not at all"
# [[2]]
# [1] "something" "else"

The pipe | character is a regex metacharacter and needs to be escaped:
levels <- "Completely|Partially|Not at all"
str_count(levels, '\\|')
Another general trick you can use here is to compare the length of the input with the length of the same string after stripping all the pipes:
nchar(levels) - nchar(gsub("|", "", levels, fixed=TRUE))
[1] 2
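Yet another base R option, if you only need the count, is to count the matches found by gregexpr():
lengths(regmatches(levels, gregexpr("\\|", levels)))
[1] 2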
Addendum: to split the string on |, use strsplit:
unlist(strsplit(levels, "\\|"))
[1] "Completely" "Partially" "Not at all"

Related

Using str_view with a list of words in R

I want to use str_view from stringr in R to find all the words that start with "y" and all the words that end with "x". I have a list of words generated by Corpora, but whenever I run the code, it returns a blank view.
Common_words <- corpora("words/common")
#start with y
start_with_y <- str_view(Common_words, "^[y]", match = TRUE)
start_with_y
#finish with x
str_view(Common_words, "$[x]", match = TRUE)
Also, I would like to find the words that are only 3 letters long, but I have no ideas so far.
I'd say this is not about programming with stringr but about learning some regex. Here are some sites I have found useful for learning:
http://www.regular-expressions.info/tutorial.html
http://www.rexegg.com/
https://www.debuggex.com/
Here the \\w shorthand class for word characters (i.e., [A-Za-z0-9_]) is useful, combined with quantifiers (+ and {3} in these 2 cases). PS: here I use stringi because stringr uses it on the back end anyway. Just skipping the middleman.
x <- c("I like yax because the rock to the max!",
"I yonx & yix to pick up stix.")
library(stringi)
stri_extract_all_regex(x, 'y\\w+x')
stri_extract_all_regex(x, '\\b\\w{3}\\b')
## > stri_extract_all_regex(x, 'y\\w+x')
## [[1]]
## [1] "yax"
##
## [[2]]
## [1] "yonx" "yix"
## > stri_extract_all_regex(x, '\\b\\w{3}\\b')
## [[1]]
## [1] "yax" "the" "the" "max"
##
## [[2]]
## [1] "yix"
EDIT: These may be of use too:
## Just y starting words
stri_extract_all_regex(x, 'y\\w+\\b')
## Just x ending words
stri_extract_all_regex(x, '\\w+x\\b')
## Words with n or more characters (here n = 4)
stri_extract_all_regex(x, '\\b\\w{4,}\\b')
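Coming back to the original str_view() calls: the first pattern is fine, but the end-of-string anchor $ must come after the x, not before it. Assuming Common_words is the character vector loaded above, these should work:
str_view(Common_words, "^y", match = TRUE)        # start with y
str_view(Common_words, "x$", match = TRUE)        # end with x
str_view(Common_words, "^\\w{3}$", match = TRUE)  # exactly 3 characters long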

R - look ahead regex [duplicate]

Common sense and a sanity-check using gregexpr() indicate that the look-behind and look-ahead assertions below should each match at exactly one location in testString:
testString <- "text XX text"
BB <- "(?<= XX )"
FF <- "(?= XX )"
as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
# [1] 5
strsplit(), however, uses those match locations differently, splitting testString at one location when using the lookbehind assertion, but at two locations -- the second of which seems incorrect -- when using the lookahead assertion.
strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"
strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text" " " "XX text"
I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit() to be better behaved?
Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).
I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit:
The algorithm applied to each input string is
repeat {
  if the string is empty
    break.
  if there is a match
    add the string to the left of the match to the output.
    remove the match and all to the left of it.
  else
    add the string to the output.
    break.
}
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is "", but if there is a match at the end of the string, the output is the same as with the match removed.
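Both documented edge cases are easy to verify:
strsplit("abc", "a")
# [[1]]
# [1] ""   "bc"
strsplit("abc", "c")
# [[1]]
# [1] "ab"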
The problem is that lookahead (and lookbehind) assertions are zero-length. So for example in this case:
FF <- "(?=funky)"
testString <- "take me to funky town"
gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
What happens is that the lonely lookahead (?=funky) matches at position 12. So the first split includes the string up to position 11 (everything to the left of the match), and that prefix is removed from the string together with the match, which, however, has zero length.
Now the remaining string is funky town, and the lookahead matches at position 1. However, there is nothing to remove: there is nothing to the left of the match, and the match itself has zero length. So the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when calling strsplit with an empty pattern (argument split=""). After this the remaining string is unky town, which is returned as the last piece since there is no further match.
Lookbehinds are no problem, because at each match everything to the left of the match is removed from the remaining string, so the algorithm never gets stuck.
Admittedly this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption of zero length for lookaheads. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.
Based on Theodore Lytras' careful explication of strsplit()'s behavior, a reasonably clean workaround is to prefix the to-be-matched lookahead assertion with a positive lookbehind assertion that matches any single character:
testString <- "take me to funky town"
FF2 <- "(?<=.)(?=funky)"
strsplit(testString, FF2, perl=TRUE)
# [[1]]
# [1] "take me to " "funky town"
Looks like a bug to me. This doesn't appear to be specific to spaces; it happens with any lonely lookahead (positive or negative):
FF <- "(?=funky)"
testString <- "take me to funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
FF <- "(?=funky)"
testString <- "funky take me to funky funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "f" "unky take me to " "f" "unky "
# [5] "f" "unky town"
FF <- "(?!y)"
testString <- "xxxyxxxxxxx"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "xxx" "y" "xxxxxxx"
Seems to work fine if the pattern consumes at least one character along with the zero-width assertion, such as:
FF <- " (?=XX )"
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
FF <- "(?= XX ) "
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
Perhaps something like that might function as a workaround.

Extract text in parentheses in R

Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these strings can get quite long and contain a large (unlimited) number of parentheses. Thus, I can't simply grab the text before and after the parentheses; I need a smarter solution.
Text outside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=T)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) part matches all the characters present inside the brackets, and the following (*SKIP)(*F) then forces that match to fail. The engine then tries the pattern after the | symbol against the remaining string, so the dot . matches every character that was not already skipped. Replacing all the matched characters with an empty string leaves only the text present inside the brackets.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=T)
[1] "bp" "" "ijkl"
This regex captures all the characters present inside the brackets into group 1, and the |. alternative matches every remaining character one at a time. Replacing each match with the contents of group 1 therefore keeps the bracketed text and deletes everything else, giving the desired output.
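An alternative that avoids the replace-everything trick is to extract the bracketed text directly with regmatches() and collapse the pieces per element:
> m <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl=T))
> sapply(m, paste, collapse="")
[1] "bp"   ""     "ijkl"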
The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
To condense "b" and "p" into a single string, use:
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"

R qdap::mgsub, how to pass a pattern with a regular expression?

In a previous question (replace string in R giving a vector of patterns and vector of replacements) I found that mgsub takes as its pattern a literal string that does not need to be escaped. That is good when you want to replace text like '[%.+%]' as a literal string, but it is a bad thing if you need to pass a real regular expression like:
library('stringr')
library('qdap')
tt_ori <- 'I have VAR1 and VAR2'
ttl <- list(ttregex='VAR([12])', val="val-\\1")
ttl
# OK
stringr::str_replace_all( tt_ori, perl( ttl$ttregex), ttl$val)
# [1] "I have val-1 and val-2"
# OK
mapply(gsub, ttl$ttregex, ttl$val, tt_ori, perl=T)
# [1] "I have val-1 and val-2"
# FAIL
qdap::mgsub(ttl$ttregex, ttl$val, tt_ori)
# [1] "I have VAR1 and VAR2"
How can I pass a regular expression to mgsub?
[UPDATE]
@BondeDust is right; with this oversimplified example the question does not make sense. The reason for wanting to use mgsub is its ability to take a vector of patterns and a vector of replacements for a single string, and make all the substitutions in that string.
For example:
> tt_ori <- 'I have VAR1 and VAR2 at CARTESIAN'
> ttl <- list( ttregex=c('VAR([12])', 'CARTESIAN')
+ , valregex=c("val-\\1", "XY")
+ , tt=c('VAR1', 'VAR2', 'CARTESIAN')
+ , val=c('val-1', 'val-2', 'XY')
+ )
> ttl
$ttregex
[1] "VAR([12])" "CARTESIAN"
$valregex
[1] "val-\\1" "XY"
$tt
[1] "VAR1" "VAR2" "CARTESIAN"
$val
[1] "val-1" "val-2" "XY"
# str_replace and gsub return multiple strings with partial substitutions
> stringr::str_replace_all( tt_ori, perl( ttl$ttregex), ttl$valregex)
[1] "I have val-1 and val-2 at CARTESIAN" "I have VAR1 and VAR2 at XY"
> mapply(gsub, ttl$ttregex, ttl$valregex, tt_ori, perl=T)
VAR([12]) CARTESIAN
"I have val-1 and val-2 at CARTESIAN" "I have VAR1 and VAR2 at XY"
# qdap (passing regexes) FAIL
> qdap::mgsub(ttl$ttregex, ttl$valregex, tt_ori)
[1] "I have VAR1 and VAR2 at XY"
# qdap (passing strings) is OK
> qdap::mgsub(ttl$tt, ttl$val, tt_ori)
[1] "I have val-1 and val-2 at XY"
I want to take advantage of regexes when possible rather than writing out all the possible strings (sometimes I don't know them in advance).
Change fixed = TRUE to fixed = FALSE in the mgsub call.
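With fixed = FALSE, mgsub hands the patterns to gsub as regular expressions (fixed = TRUE is its default, which is why the regex patterns were being matched literally), so the example above should give the desired result:
qdap::mgsub(ttl$ttregex, ttl$valregex, tt_ori, fixed = FALSE)
# [1] "I have val-1 and val-2 at XY"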

Get indices of all character elements matches in string in R

I want to get the indices of all occurrences of certain characters in a word. Assume the characters I look for are: l, e, a, z.
I tried the following regex with the grep function, and dozens of modifications of it, but I keep getting results I don't want.
grep("/([leazoscnz]{1})/", "ylaf", value = F)
gives me
numeric(0)
where I would like:
[1] 2 3
To make grep work with the individual characters of a string, you first need to split the string into single-character elements. You can use strsplit for this:
strsplit("ylaf", split="")[[1]]
[1] "y" "l" "a" "f"
Next, simplify your regular expression (the surrounding slashes are not part of R's regex syntax, and the {1} quantifier is redundant) and try grep again:
grep("[leazoscnz]", strsplit("ylaf", split="")[[1]])
[1] 2 3
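The same indices can also be obtained without a regex, by testing membership directly:
which(strsplit("ylaf", split="")[[1]] %in% c("l", "e", "a", "z"))
[1] 2 3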
But it is easier to use gregexpr:
gregexpr("[leazoscnz]", "ylaf")
[[1]]
[1] 2 3
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
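To drop the attributes and keep just the match positions, use the as.vector idiom from the lookahead question above:
as.vector(gregexpr("[leazoscnz]", "ylaf")[[1]])
[1] 2 3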
