Replacing a special character does not work with gsub - r

I have a table with many strings that contain some weird characters that I'd like to replace with the "original" ones. ä became Ã¤, ö became Ã¶, so I replace each Ã¶ with an ö in the text. This works; however, ß became Ã<U+009F> and I am unable to replace it.
# Works just fine:
gsub('Ã¶', 'REPLACED', "Testing string Ã¶")
# this does not work
gsub("Ã<U+009F>", "REPLACED", "Testing string Ã<U+009F> ")
# this does not work either...
gsub("â<U+0080><U+0093>", "REPLACED", "Testing string â<U+0080><U+0093> ")
How do I tell R to replace these parts with a letter I want to insert?

The pattern contains a regex metacharacter (+, which means "one or more" of the preceding item). To have it evaluated literally, either escape it (as @boski mentioned in the other answer) or use fixed = TRUE:
sub("Ã<U+009F>", "REPLACED", "Testing string Ã<U+009F> ", fixed = TRUE)
#[1] "Testing string REPLACED "

You have to escape the + symbol, because it is a regex metacharacter (a quantifier).
> gsub("Ã<U\\+009F>", "REPLACED", "Testing string Ã<U+009F> ")
[1] "Testing string REPLACED "
> gsub("â<U\\+0080><U\\+0093>", "REPLACED", "Testing string â<U+0080><U+0093> ")
[1] "Testing string REPLACED "
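To see both fixes side by side, here is a minimal sketch on a small character vector. The vector x and the replacement character are made up for illustration and are not taken from the original table; "\u00c3" is simply the escape for Ã and "\u00df" the escape for ß, used so the snippet stays ASCII-only.

```r
# Illustrative data (an assumption, not the asker's table)
x <- c("Stra\u00c3<U+009F>e", "Testing string \u00c3<U+009F> ")
pat <- "\u00c3<U+009F>"

# As a regex, "U+" means "one or more U", so the literal "+" is never matched
unescaped <- gsub(pat, "\u00df", x)

# Option 1: treat the whole pattern as a literal string
literal <- gsub(pat, "\u00df", x, fixed = TRUE)

# Option 2: keep regex matching but escape the "+"
escaped <- gsub("\u00c3<U\\+009F>", "\u00df", x)

print(unescaped)  # unchanged input
print(literal)    # garbled sequence replaced
print(escaped)    # same result as the fixed = TRUE call
```

fixed = TRUE is probably the simpler choice when the search text comes from data and may contain other metacharacters such as ( or ?.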

Related

Splitting a comma- and semicolon-delimited string in R

I'm trying to split a string containing two entries and each entry has a specific format:
Category (e.g. active site/region) which is followed by a :
Term (e.g. His, Glu/nucleotide-binding motif A) which is followed by a ,
Here's the string that I want to split:
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
This is what I have tried so far. Except for the two empty substrings, it produces the desired output.
unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))
[1] "active site: His, Glu" "" "region: nucleotide-binding motif A"
[4] ""
How do I get rid of the empty substrings?
You get the empty strings because .*? can also match an empty string at any position where the assertion (?=,(?:\\w+|$)) is true.
You can avoid this by first requiring one or more characters other than a colon or comma, using a negated character class, before matching the colon:
[^:,\n]+:.*?(?=,(?:\w|$))
Explanation
[^:,\n]+ Match 1+ chars other than : , or a newline
: Match the colon
.*? Match any character, as few times as possible
(?= Positive lookahead, assert that what is directly to the right of the current position is:
, Match literally
(?:\w|$) Match either a single word char, or assert the end of the string
) Close the lookahead
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))
Output
[1] "active site: His, Glu" "region: nucleotide-binding motif A"
Much longer and not as elegant as the answer by @The fourth bird (+1),
but it works:
library(stringr)
string2 <- strsplit(string, "([^,]+,[^,]+),", perl = TRUE)[[1]][2]
string1 <- str_replace(string, string2, "")
string <- str_replace_all(c(string1, string2), '\\,$', '')
> string
[1] "active site: His, Glu"
[2] "region: nucleotide-binding motif A"
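For readers without stringr, the lookahead pattern explained above can also be sketched in base R. This assumes perl = TRUE so that the lazy quantifier and the lookahead are available. The strsplit variant at the end is an alternative not taken from the answers: it splits on every comma that is followed by a non-space character or by the end of the string.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Same pattern as the stringr answer, run through base R's PCRE engine
pattern <- "[^:,\\n]+:.*?(?=,(?:\\w|$))"
entries <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
print(entries)

# Alternative sketch: split on commas followed by a non-space character or the end
parts <- strsplit(string, ",(?=\\S|$)", perl = TRUE)[[1]]
print(parts)
```

Both calls return the two entries without empty strings. The split variant relies on terms within an entry being separated by ", " (comma plus space), which holds for this input but is an assumption about the data in general.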

Insert characters when a string changes its case R

I would like to insert characters at the places where a string changes its case. I tried the following to insert a '\n' after a fixed number of characters and then a ' ', as I can't figure out how to detect the case change:
s <-c("FloridaIslandE7", "FloridaIslandE9", "Meta")
gsub('^(.{7})(.{6})(.*)$', '\\1\\\n\\2 \\3', s )
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
This works because the positions are fixed but I would like to know how to do it for the general case.
Surely there's a less convoluted regex for this, but you could try:
gsub('([A-Z][0-9])', ' \\1', gsub('([a-z])([A-Z][a-z])', '\\1\n\\2', s))
Output:
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
Here is an option using stringr:
str_replace_all(s, "(?<=[a-z])(?=[A-Z])", "\n")
#[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
If you really want to insert \n, try this:
gsub("([a-z])([A-Z])", "\\1\\\n\\2", s)
[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
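The lookaround idea also works in base R when perl = TRUE is set. A minimal sketch, where insert_at_case_change is a hypothetical helper name introduced only for this example:

```r
s <- c("FloridaIslandE7", "FloridaIslandE9", "Meta")

# Hypothetical helper: insert `sep` at every lowercase-to-uppercase boundary
insert_at_case_change <- function(x, sep = "\n") {
  # Both lookarounds are zero-width, so no characters are consumed or replaced
  gsub("(?<=[a-z])(?=[A-Z])", sep, x, perl = TRUE)
}

res <- insert_at_case_change(s)
print(res)

# The separator is a parameter, so a space works the same way
spaced <- insert_at_case_change(s, sep = " ")
print(spaced)
```

Because sep is used as a gsub replacement, a separator containing backslashes would be interpreted as a backreference or escape, so plain separators are the safe case here.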

Regex for matching between a colon and last newline prior to next colon

I am trying to parse a string with regex to pull out information between a colon and the last newline prior to the next colon. How can I do this?
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"
stringr::str_extract_all(string, "(?<=:)(.*)(?:\\n)")
but I get:
[[1]]
[1] " Al's\n" " \n" " RI\n"
when I want:
[[1]]
[1] " Al's\nPlace\n" " \n" " RI\n"
I'm not sure if this is what you're after as your wanted output looks a bit different.
:((?:.*\\n?)+?)(?=.*:|$)
: match a colon
((?:.*\n?)+?) lazily match and capture one or more lines (each optionally ending in \n)
(?=.*:|$) stop as soon as the rest of the current line contains a colon, or the end of the string is reached
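A minimal sketch of how this pattern could be tried in base R, assuming perl = TRUE. Since regmatches returns the full match including the leading colon, the colon is stripped afterwards with sub.

```r
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"

pattern <- ":((?:.*\\n?)+?)(?=.*:|$)"

# Full matches start with the colon
m <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]

# Drop the leading colon to keep only the captured part
values <- sub("^:", "", m)
print(values)
```

With this input the first value includes both lines (" Al's\nPlace\n"). The second value comes out as "\nState\n", which is not quite the wanted output from the question.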

Splitting character string in R - Extracting the timestamp

Thank you in advance for any feedback.
I am attempting to clean some data in R where a timestamp and a text string are included together in the same cell. I am not getting the expected result. I know the regex needs validation work, but I am just testing out this particular function.
Expected:
"04/05/2018 17:14:35" " -(Additional comments) update"
Actual:
"04/05/2018 17:14:35 -(Additional comments) update"
What I tried:
string <- "04/05/2018 17:14:35 -(Additional comments) update"
pattern <- "[:digit:][:digit:][:punct:]
[:digit:][:digit:][:punct:]
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]"
strsplit(string, pattern)
I also tried this variation, with the same result:
pattern <- "[:digit:][:digit:]\\/
[:digit:][:digit:]\\/
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
\\:
[:digit:][:digit:]
\\:
[:digit:][:digit:]"
You can try:
string <- "04/05/2018 17:14:35 -(Additional comments) update"
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}).*","\\1", string)
#[1] "04/05/2018 17:14:35"
# Remaining (right-hand side) part
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)","\\2", string)
#[1] " -(Additional comments) update"
Regex explanation:
\\d{2} - 2 digits
\\d{4} - 4 digits
/ - separator
: - separator
() - Group for selection
.* - Followed by anything
It seems the OP is keen on using strsplit. One option is to insert a marker after the timestamp and split on it:
strsplit(gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)",
paste("\\1","####","\\2",sep=""), string), split = "####")
# [[1]]
# [1] "04/05/2018 17:14:35" " -(Additional comments) update"
Try this:
sub('-.*','',string)
[1] "04/05/2018 17:14:35 "
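The attempts in the question most likely fail for two reasons: a POSIX class such as [:digit:] is only valid inside a bracket expression ([[:digit:]]), and the line breaks inside the pattern string become part of the pattern. A minimal sketch of a corrected POSIX-class version, split across lines with paste0 (a formatting choice for this example) so that no newline ends up in the regex:

```r
string <- "04/05/2018 17:14:35 -(Additional comments) update"

# POSIX classes wrapped in bracket expressions, joined into a single-line pattern
pattern <- paste0(
  "[[:digit:]]{2}/[[:digit:]]{2}/[[:digit:]]{4}[[:space:]]",
  "[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}"
)

# Extract the timestamp, then remove it to get the remainder
timestamp <- regmatches(string, regexpr(pattern, string))
remainder <- sub(pattern, "", string)

result <- c(timestamp, remainder)
print(result)
```

Calling strsplit(string, pattern) on its own would not give the expected pair, because strsplit discards whatever the pattern matches; here that would be the timestamp itself.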

How to completely remove head and tail white spaces or punctuation characters?

I have string_a, such that
string_a <- " ,A thing, something, . ."
Using regex, how can I just retain "A thing, something"?
I have tried the following and got such output:
sub("[[:punct:]]$|^[[:punct:]]","", trimws(string_a))
[1] "A thing, something, . ."
We can use gsub to match one or more punctuation characters or spaces ([[:punct:] ]+) at the start (^) or (|) at the end ($) of the string and replace them with an empty string (""):
gsub("^[[:punct:] ]+|[[:punct:] ]+$", "", string_a)
#[1] "A thing, something"
Note: sub will replace only a single instance
Or, as @Cath mentioned, [[:punct:] ] can be replaced with \\W.
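The difference between sub and gsub, and the \\W variant, can be checked with a short sketch on the same input:

```r
string_a <- " ,A thing, something, . ."

# sub stops after the first match, so only the leading characters are removed
only_first <- sub("^[[:punct:] ]+|[[:punct:] ]+$", "", string_a)

# gsub replaces every match, so both ends are cleaned
both_ends <- gsub("^[[:punct:] ]+|[[:punct:] ]+$", "", string_a)

# \\W matches any non-word character, which covers spaces and punctuation
non_word <- gsub("^\\W+|\\W+$", "", string_a)

print(only_first)
print(both_ends)
print(non_word)
```

One caveat with \\W: the underscore counts as a word character, so leading or trailing underscores would be kept by that variant but removed by [[:punct:] ].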
