Split string WITHOUT regex

I'm sure I used to know this, and I'm sure it's covered somewhere, but since I can't find any Google/SO hits for this title search, there probably should be one.
I want to split a string without using regex, e.g.
str = "abcx*defx*ghi"
Of course we can use stringr::str_split or strsplit with the pattern 'x[*]', but how can we just suppress regex entirely?

The argument fixed=TRUE can be useful in this instance
strsplit(str, "x*", fixed=TRUE)[[1]]
#[1] "abc" "def" "ghi"

Since the question also mentions stringr::str_split, a stringr approach might be of help, too.
You may use str_split with fixed(<YOUR_DELIMITER_STRING_HERE>, ignore_case = FALSE) or coll(pattern, ignore_case = FALSE, locale = "en", ...). See the stringr docs:
fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.
coll: Compare strings respecting standard collation rules.
See the following R demo:
> str_split(str, fixed("x*"))
[[1]]
[1] "abc" "def" "ghi"
Collations are better illustrated with a letter that can have two representations:
> x <- c("Str1\u00e1Str2", "Str3a\u0301Str4")
> str_split(x, fixed("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3áStr4" ""
> str_split(x, coll("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3" "Str4"
A note about fixed():
fixed(x) only matches the exact sequence of bytes specified by x. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent.
...
coll(x) looks for a match to x using human-language collation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you’ll also need to supply a locale parameter.
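Since coll() (like the docs quoted above say) takes an ignore_case argument, here is a minimal sketch of case-insensitive splitting; the sample string is made up:

```r
library(stringr)

# Split on "x" regardless of case, using collation-based matching
str_split("aXbxc", coll("x", ignore_case = TRUE))
# [[1]]
# [1] "a" "b" "c"
```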

Simply wrap the regex inside fixed() to stop it being treated as a regex inside stringr::str_split()
Example
Normally, stringr::str_split() will treat the pattern as a regular expression, meaning certain characters have special meanings, which can cause errors if those regular expressions are not valid, e.g.:
library(stringr)
str_split("abcdefg[[[klmnop", "[[[")
Error in stri_split_regex(string, pattern, n = n, simplify = simplify, :
Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)
But if we simply wrap the pattern we are splitting by inside fixed(), it treats it as a string literal rather than a regular expression:
str_split("abcdefg[[[klmnop", fixed("[[["))
[[1]]
[1] "abcdefg" "klmnop"

Related

Replace Strings in R with regular expression in it dynamically [duplicate]

I want to build a regex by substituting in some strings to search for, and so these strings need to be escaped before I can put them in the regex, so that the search still works if a searched-for string contains regex characters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
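A quick usage sketch, assuming the Hmisc package is installed:

```r
library(Hmisc)

# Escape regex metacharacters so the string can be used as a literal pattern
escapeRegex("foo[bar]")
# [1] "foo\\[bar\\]"
```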
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
  vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
            "\\{", "\\}", "\\^", "\\$", "\\*",
            "\\+", "\\?", "\\.", "\\|")
  replace.vals <- paste0("\\\\", vals)
  for(i in seq_along(vals)){
    strings <- gsub(vals[i], replace.vals[i], strings)
  }
  strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
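For example, a minimal sketch (using perl = TRUE, since \Q...\E quoting is a PCRE feature; the test strings are made up):

```r
x <- "foo[bar]"
# \Q ... \E makes PCRE treat everything in between as literal text
pattern <- paste0("\\Q", x, "\\E")
grepl(pattern, c("abc foo[bar] def", "foodbard"), perl = TRUE)
# [1]  TRUE FALSE
```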
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using capture groups, (\\W), we can detect the occurrences of non-word characters and escape it with the \\1-syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, use "([^[:alnum:]_])" in place of "(\\W)".

subset strings without a pattern stringr

I want to extract elements of a character vector which do not match a given pattern. See the example:
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract those strings which do not match the pattern "age". How to achieve this in a single call of str_subset ? What is the appropriate syntax of the pattern "not age". As you can see I am not very expert with regex. Thanks for any comments.
In this case there seems to be no reason to use stringr (except perhaps efficiency). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
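As a side note, stringr 1.4.0 added a negate argument to str_subset(), so — assuming a recent enough version — the whole thing becomes a single call:

```r
library(stringr)

x <- c("age_mean", "n_aitd", "n_sle", "age_sd", "n_poly", "n_sero", "child_age")
# Keep the elements that do NOT match the pattern
str_subset(x, "age", negate = TRUE)
# [1] "n_aitd" "n_sle"  "n_poly" "n_sero"
```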

Prevent grep in R from treating "." as a letter

I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything before a .:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?
grep is for finding the pattern. It returns the indices of the vector elements that match the pattern; if value=TRUE is specified, it returns the values instead. From your description, it seems that you want to remove a substring rather than return a subset of the initial vector.
If you need to remove the substring, you can use sub
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
As the first argument, we supply the pattern '.*\\.'. It matches zero or more characters (.*) followed by a dot (\\.). The \\ is needed to escape the . so it is treated as a literal dot rather than "any character". Since .* is greedy, the match extends to the last . in the string. We replace the matched part with '' (the replacement argument) and thereby remove the substring.
grep doesn't do any replacements. It searches for matches and returns the indices (or the values, if you specify value=T) of the elements that match. The results you're getting just mean that those elements meet your criteria somewhere in the string. If you added elements that don't match anywhere (for example: "9", "#$%23", ...) to your text vector, grep wouldn't return those.
If you want it just to return the matched portion you should look at the regmatches function. However for your purposes it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
I would suggest reading the help page for regexs ?regex. The wikipedia page is a decent read as well but note that R's regexs are a little different than some others. https://en.wikipedia.org/wiki/Regular_expression
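For completeness, a sketch of the regmatches route mentioned above, reusing the question's pattern:

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
# regexpr locates the (first) match in each element;
# regmatches extracts just the matched portion
m <- regexpr("([A-Z]|[a-z])+$", text)
regmatches(text, m)
# [1] "xYz"   "ge"    "qrstu"
```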
You may try str_extract function from stringr package.
str_extract(text, "[^.]*$")
This matches all the non-dot characters at the end of the string.
Your pattern does work; the problem is that grep does something different from what you think it does.
Let's first use your pattern with str_extract_all from the package stringr.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results came as you expected!
The problem you are having is that grep will give you the complete element that matches your regular expression, not only the matching part of the element. For example, in the example below, grep returns the first element because it contains "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"

Removing a character from within a vector element

I have a vector of strings:
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
str.vect
[1] "abcR.1" "abcL.1" "abcR.2" "abcL.2"
How can I remove the third character from the right in each vector element?
Here is the desired result:
"abc.1" "abc.1" "abc.2" "abc.2"
Thank you very much in advance
You can use nchar to find the length of each element of the vector
> nchar(str.vect)
[1] 6 6 6 6
Then you combine this with strtrim to get the beginning of each string
> strtrim(str.vect, nchar(str.vect)-3)
[1] "abc" "abc" "abc" "abc"
To get the end of the word you can then use substr (actually, you could use substr to get the beginning too...)
> substr(str.vect, nchar(str.vect)-1, nchar(str.vect))
[1] ".1" ".1" ".2" ".2"
And finally you use paste0 (which is paste with sep="") to stick them together
> paste0(strtrim(str.vect, nchar(str.vect)-3), # Beginning
substr(str.vect, nchar(str.vect)-1, nchar(str.vect))) # End
[1] "abc.1" "abc.1" "abc.2" "abc.2"
There are easier ways if you know your strings have some special characteristics
For instance, if the length is always 6 you can directly substitute the nchar calls with the appropriate value.
EDIT: alternatively, R also supports regular expressions, which make this task much easier.
> gsub(".(..)$", "\\1", str.vect)
[1] "abc.1" "abc.1" "abc.2" "abc.2"
The syntax is a bit more obscure, but not that difficult once you know what you are looking at.
The first parameter (".(..)$") is what you want to match
. matches any character, $ denotes the end of the string.
So ...$ indicates the last 3 characters in the string.
We put the last two in parentheses so that we can capture them.
The second parameter tells R what to substitute for the matched substring. In our case we put \\1, which means "whatever was captured by the first pair of parentheses".
So essentially this command means: "find the last three characters in the string and replace them with the last two".
The solution provided by @nico seems fine, but a simpler alternative might be to use sub:
sub('.(.{2})$', '\\1', str.vect)
This searches for the pattern of: "any character (represented by .) followed by 2 of any character (represented by .{2}), followed by the end of the string (represented by $)". By wrapping the .{2} in parentheses, R captures whatever those last two characters were. The second argument is the string to replace the matched substrings with. In this case, we refer to the first string captured in the matched pattern. This is represented by \\1. (If you captured multiple parts of the pattern, with multiple sets of parentheses, you would refer to subsequent captured regions with, e.g. \\2, \\3, etc.)
str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
a <- strsplit(str.vect, split = "")
b <- unlist(lapply(a, FUN = function(x) {
  x[4] <- ""
  paste(x, collapse = "")
}))
If you want to parameterize it further change 4 to a variable and put the index of the character you want to remove there.
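For instance, a sketch of that parameterisation (remove_char_at is a made-up name), using substr rather than splitting into characters:

```r
# Drop the character at position i from each string
remove_char_at <- function(strings, i) {
  paste0(substr(strings, 1, i - 1), substr(strings, i + 1, nchar(strings)))
}

str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
remove_char_at(str.vect, 4)
# [1] "abc.1" "abc.1" "abc.2" "abc.2"
```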
Not sure how general or efficient this is, but it seems to work with your example string:
(This seems very similar to nico's answer although I am not using the strtrim function.)
my.string <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
n.char <- nchar(my.string)
the.beginning <- substr(my.string, n.char-(n.char-1), n.char-3)
the.end <- substr(my.string, n.char-1, n.char)
new.string <- paste0(the.beginning, the.end)
new.string
# [1] "abc.1" "abc.1" "abc.2" "abc.2"
This removes the 3rd character from the right of each element. (Note that gsub removes every occurrence of that character; it works here because each of R and L appears only once per string.)
sapply(str.vect, function(x) gsub(substr(x, nchar(x)-2, nchar(x)-2), "", x))
This is a very quick and dirty answer, but that's what is needed sometimes:
#Define vector
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
#Use gsub to remove both 'R' and 'L' independently.
str.vect2 <- gsub("R", '', str.vect )
str.vect_final <- gsub("L", '', str.vect2 )
> str.vect_final
[1] "abc.1" "abc.1" "abc.2" "abc.2"

Extract websites links from a text in R

I have multiple texts that each may consist references to one or more web links. for example:
text1= "s#1212a as www.abcd.com asasa11".
How do I extract:
"www.abcd.com"
from this text in R? In other words I am looking to extract patterns that start with www and end with .com
regmatches: This approach uses regexpr/gregexpr and regmatches. I expanded the test data to include more examples.
text1 <- c("s#1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
# Regular expressions take some practice.
# check out ?regex or the wikipedia page on regular expressions
# for more info on creating them yourself.
pattern <- "www\\..*?\\.com"
# Get information about where the pattern matches text1
m <- gregexpr(pattern, text1)
# Extract the matches from text1
regmatches(text1, m)
Which gives
> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"
[[2]]
[1] "www.boo.com"
[[3]]
character(0)
[[4]]
[1] "www.test.com"
Notice it returns a list. If we want a vector you can just use unlist on the result. This is because we used gregexpr which implies there could be multiple matches in our string. If we know there is at most one match we could use regexpr instead
> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com" "www.test.com"
Notice, however, that this returns all results as a vector and only returns a single match per string (note that www.cats.com isn't in the results). On the whole, though, I think either of these two methods is preferable to the gsub method, because gsub returns the entire input when no match is found. For example take a look:
> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com" "asdf" "www.test.com"
And that's even after modifying the pattern to be a little more robust. We still get 'asdf' in the results even though it clearly doesn't match the pattern.
Shameless silly self promotion: regmatches was introduced with R 2.14 so if you're stuck with an earlier version of R you might be out of luck. Unless you're able to install the future2.14 package from my github repo which provides some support for functions introduced in 2.14 to earlier versions of R.
strapplyc: An alternative that gives the same result as ## above is:
library(gsubfn)
strapplyc(text1, pattern)
The regular expression: Here is some explanation of how to decipher the regular expression:
pattern <- "www\\..*?\\.com"
Explanation:
www matches the www portion
\\. We need to escape an actual 'dot' character using \\ because a plain . represents "any character" in regular expressions.
.*? The . represents any character, the * tells to match 0 or more times, and the ? following the * tells it to not be greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches in there.
\\. Once again we need to escape an actual dot character
com This part matches the ending 'com' that we want to match
Putting it all together it says: start with www. then match any characters until you reach the first ".com"
Check out the gsub function:
x = "s#1212a as www.abcd.com asasa11"
gsub(x=x, pattern=".*(www.*com).*", replace="\\1")
The basic idea is to surround the text you want to retain in parentheses, then replace the entire line with it. The replace parameter of gsub, "\\1", refers to what was found in the parentheses.
The solutions here are great and in base R. For those who want a quick solution, you can use qdap's genXtract. This function basically takes left and right element(s) and extracts everything in between. Setting with = TRUE includes those elements:
text1 <- c("s#1212a www.abcd.com www.cats.com",
"www.boo.com",
"asdf",
"http://www.talkstats.com/ and http://stackoverflow.com/",
"blargwww.test.comasdf")
library(qdap)
genXtract(text1, "www.", ".com", with=TRUE)
## > genXtract(text1, "www.", ".com", with=TRUE)
## $`www. : .com1`
## [1] "www.abcd.com" "www.cats.com"
##
## $`www. : .com2`
## [1] "www.boo.com"
##
## $`www. : .com3`
## character(0)
##
## $`www. : .com4`
## [1] "www.talkstats.com"
##
## $`www. : .com5`
## [1] "www.test.com"
PS: if you look at the code for the function, it is a wrapper around Dason's solution.
