Why does stringr::str_replace() match every character in a string as "."?


library(stringr)
y4=c("yes i do")
str_replace_all(y4,".","_")
[1] "________"
str_replace_all(y4," ","_")
[1] "yes_i_do"
y4=c("yes i do.")
str_replace_all(y4," ","_")
[1] "yes_i_do."
If you attempt to replace "." in a string, every character is replaced.

By default, stringr interprets patterns as regular expressions (regex), a powerful searching tool. In regex, . is a wildcard that matches any character except a newline. To match a literal . you have to escape it with a backslash (\.), and because R also treats the backslash as an escape character inside string literals, you need a second backslash to escape the first, so you write "\\." in R code.
For your example:
library(stringr)
y4 <- c("yes i do.") #added a period so we can see the replacement.
str_replace_all(y4,"\\.","_")
[1] "yes i do_"
Alternatively, if you want to match a fixed string without regex syntax, you can use:
str_replace_all(y4, fixed("."),"_")
[1] "yes i do_"
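The same distinction exists in base R, which can be handy when stringr is not loaded. The sketch below is only an illustration: gsub() stands in for str_replace_all(), and the variable names are made up for this example.

```r
# Base R sketch; x and the result names are illustrative only.
x <- "yes i do."

# "." is a regex wildcard, so every character is replaced.
all_replaced <- gsub(".", "_", x)

# Three ways to target a literal period.
escaped   <- gsub("\\.", "_", x)              # regex escape; the pattern seen by the engine is \.
bracketed <- gsub("[.]", "_", x)              # "." is literal inside a bracket expression
literal   <- gsub(".", "_", x, fixed = TRUE)  # no regex interpretation at all

print(all_replaced)
print(escaped)

# The R literal "\\." holds two characters: one backslash and one period.
print(nchar("\\."))
```

The nchar() call shows why the double backslash is needed: R's string parser consumes one backslash before the regex engine ever sees the pattern.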

Related

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.

You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
Example:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
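To see why the greedy .* lands on the last pipe rather than the first, it may help to compare it with a pattern that cannot cross a pipe. This is a small sketch; the variable names are only for illustration.

```r
x <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# Greedy .* grabs as much as it can and backtracks only until a pipe
# is found, so the match ends at the LAST pipe.
after_last <- sub(".*\\|", "", x)

# [^|]* cannot cross a pipe, so the match ends at the FIRST pipe.
after_first <- sub("^[^|]*\\|", "", x)

print(after_last)
print(after_first)
```

For a string with a single pipe, such as the one in the question, both patterns give the same result.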

We can match one or more characters that are not a | ([^|]+) from the start (^) of the string followed by | in str_remove to remove that substring
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If the pattern also includes [A-Z] in an unanchored character class, it matches every upper case letter anywhere in the string and replaces it with blank (""), which is what happened in the OP's str_replace_all call.
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
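The effect of the unanchored character class can be reproduced in base R. This is a rough sketch, not the OP's exact stringr call: it uses gsub() and sub() with perl = TRUE (chosen here only so that A-Z is treated as a plain ASCII range), and the variable names are invented for the example.

```r
s <- "BLCU142-09|Apodemia_mejicanus"

# An unanchored class removes EVERY matching character, wherever it occurs,
# including the "A" of "Apodemia".
unanchored <- gsub("[A-Z0-9|-]", "", s, perl = TRUE)

# Anchoring the class to the start and requiring the pipe after it
# removes only the prefix.
anchored <- sub("^[A-Z0-9-]+\\|", "", s, perl = TRUE)

print(unanchored)
print(anchored)
```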

You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.

I would keep it simple:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
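Note that regexpr() reports the first match only, so the line above cuts after the first pipe. If the string may contain several pipes and the text after the last one is wanted, a possible variant is sketched below. The helper name after_last_pipe is hypothetical, and it assumes a single input string.

```r
# Hypothetical helper: text after the last "|" in a single string.
after_last_pipe <- function(s) {
  # All pipe positions, matched literally; gregexpr returns -1 if there is none.
  pos <- gregexpr("|", s, fixed = TRUE)[[1]]
  # With no pipe, max(pos) + 1 is 0 and substring() returns the whole string.
  substring(s, max(pos) + 1L)
}

print(after_last_pipe("BLCU142-09|Apodemia_mejicanus"))
print(after_last_pipe("a|b|c"))
```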

capitalize the first letter of two words separated by underscore using stringr

I have a string like word_string. What I want is Word_String. If I use the function str_to_title from stringr, what I get is Word_string. It does not capitalize the second word.
Does anyone know any elegant way to achieve that with stringr? Thanks!

Here is a base R option using gsub:
input <- "word_string"
output <- gsub("(?<=^|_)([a-z])", "\\U\\1", input, perl=TRUE)
output
[1] "Word_String"
The regex pattern matches and captures any lowercase letter [a-z] that is preceded by either the start of the string (i.e. it is the first letter) or an underscore. We then replace it with the uppercase version of that single letter. Note that the \U modifier for changing to uppercase is a Perl extension, so we must use gsub in Perl mode (perl=TRUE).
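If the lookbehind feels opaque, roughly the same result can be obtained with two capture groups. This is a sketch under the assumption that the parts are separated by underscores and start with ASCII lowercase letters; the function name capitalize_parts is made up for this example.

```r
# Hypothetical helper: upper-case the first letter of each "_"-separated part.
capitalize_parts <- function(x) {
  # Group 1 is the start of the string (empty) or the underscore and is put back
  # unchanged; group 2 is the lowercase letter, which \\U converts to upper case.
  gsub("(^|_)([a-z])", "\\1\\U\\2", x, perl = TRUE)
}

print(capitalize_parts(c("word_string", "one_two_three")))
```

As with the lookbehind version, perl = TRUE is required because \U is only understood in Perl mode.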

You can also use to_any_case from the snakecase package:
library(snakecase)
to_any_case(str1, "title", sep_out = "_")
#[1] "Word_String"
data
str1 <- "word_string"

This is obviously overcomplicated, but here is another base R possibility:
test <- "word_string"
paste0(unlist(lapply(strsplit(test, "_"), function(x)
  paste0(toupper(substring(x, 1, 1)),
         substring(x, 2, nchar(x))))), collapse = "_")
[1] "Word_String"

You could first use gsub to replace "_" with " " and apply the str_to_title function, then use gsub again to change it back to your format:
x <- str_to_title(gsub("_"," ","word_string"))
gsub(" ","_",x)

R / stringr: split string, but keep the delimiters in the output

I tried to search for the solution, but it appears that there is no clear one for R.
I am trying to split the string on a pattern of, say, a space followed by a capital letter, and I use the stringr package for that.
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")
Normally I would get:
[[1]]
[1] "Foobar foobar," "oobar foobar"
The output I would like to get, however, should include the letter from the delimiter:
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Probably there is no out of box solution in stringr like back-referencing, so I would be happy to get any help.

You may split on 1+ whitespaces that are followed by an uppercase letter:
> str_split(x, "\\s+(?=[[:upper:]])")
[[1]]
[1] "Foobar foobar," "Foobar foobar"
Here,
\\s+ - 1 or more whitespaces
(?=[[:upper:]]) - a positive lookahead (a non-consuming pattern) that only checks for an uppercase letter immediately to the right of the current location in string without adding it to the match value, thus, preserving it in the output.
Note that \s matches various whitespace chars, not just plain regular spaces. Also, it is safer to use [[:upper:]] rather than [:upper:] - if you plan to use the patterns with other regex engines (like PCRE, for example).
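The difference between a consuming pattern and a lookahead can also be seen with base R's strsplit(), which needs perl = TRUE for lookarounds. This is a minimal sketch; the variable names are only illustrative.

```r
x <- "Foobar foobar, Foobar foobar"

# The uppercase letter is part of the match, so it is removed with the delimiter.
consumed <- strsplit(x, " [[:upper:]]")[[1]]

# The lookahead only checks for the uppercase letter; the match is just the
# whitespace, so the letter stays in the output.
kept <- strsplit(x, "\\s+(?=[[:upper:]])", perl = TRUE)[[1]]

print(consumed)
print(kept)
```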

We could use a regex lookaround to split at the space between a , and upper case character
str_split(x, "(?<=,) (?=[A-Z])")[[1]]
#[1] "Foobar foobar," "Foobar foobar"

R Regex: removing only the immediate following character after >

I have the following string in R:
string1 = "A((..A>B)A"
I would like to remove all punctuation, and the letter immediately after >, i.e. >B
Here is the output I desire:
output = "AAA"
I tried using gsub() as follows:
output = gsub("[[:punct:]]","", string1)
But this gives AABA, which keeps the immediately following character.

This builds on your attempt by adding, as the first alternative, a lookbehind that matches the character immediately after >:
gsub('(?<=>).|[[:punct:]]', '', "A((..A>B)A", perl=TRUE)
## [1] "AAA"

A slightly less complex regex without the use of perl seems to work for this example as well:
gsub("[[:punct:]]|>(.)", "", "A((..A>B)A")
[1] "AAA"

You say
remove all punctuation, and the letter immediately after >
Punctuation is matched with [[:punct:]] and a letter can be matched with [[:alpha:]], thus, you may use a TRE regex with gsub:
string1 = "A((..A>B)A"
gsub(">[[:alpha:]]|[[:punct:]]", "", string1)
# => [1] "AAA"
Note that > is also a char matched with [[:punct:]], thus, you do not need any lookarounds here, just remove it with a letter after it.
Pattern details:
>[[:alpha:]] - a > and any letter
| - or
[[:punct:]] - a punctuation or symbol.
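One detail worth checking is the order of the alternatives. With perl = TRUE the engine takes the first alternative that matches at a position, so putting [[:punct:]] first lets it consume the > on its own and the letter survives. The sketch below illustrates this; the variable names are invented for the example.

```r
string1 <- "A((..A>B)A"

# [[:punct:]] first: in Perl mode it wins at ">", so "B" is left behind.
punct_first <- gsub("[[:punct:]]|>[[:alpha:]]", "", string1, perl = TRUE)

# ">" plus letter first: ">B" is removed as a unit in both engines.
pair_first_pcre <- gsub(">[[:alpha:]]|[[:punct:]]", "", string1, perl = TRUE)
pair_first_tre  <- gsub(">[[:alpha:]]|[[:punct:]]", "", string1)

print(punct_first)
print(pair_first_pcre)
print(pair_first_tre)
```

Putting the longer, more specific alternative first keeps the pattern independent of which regex engine is used.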

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: the individual strings can contain even more "sp|" and "orf|" entries.
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" of each "sp|" and "orf|" entry. The "|" seems to be a particular challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries vary a lot. The only things they have in common are "orf|" or "sp|" at the beginning and "|" in the last position. I tried different things with gsub(), the stringr package, regexpr(), and [:punct:], but nothing really worked. Maybe it was just the wrong combination.

We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we can split by ",", remove the last character of each piece with substr, and paste the list elements back together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))

An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | at the end of the string and removes it. The outer gsub looks for |, and replaces it with ,.

You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
No lookarounds are really necessary, as you can see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).
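As a quick check, the capturing-group version and a lookahead version can be run side by side on a shortened sample. This is only a sketch: the two-element vector and the variable names are chosen for illustration.

```r
a <- c("sp|Q9Y6W5|", "sp|Q9HB90|,sp|Q9NQL2|")

# Capturing group: the comma (or the empty end-of-string match) is consumed
# and then restored through \\1.
with_group <- gsub("\\|(,|$)", "\\1", a)

# Lookahead: the comma is never consumed, so nothing has to be restored,
# but perl = TRUE is required.
with_lookahead <- gsub("\\|(?=,|$)", "", a, perl = TRUE)

print(with_group)
print(identical(with_group, with_lookahead))
```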
