Students commonly paste assignment questions from a PDF or Word document into R Markdown. However, the pasted text often contains non-ASCII characters for bullets, quotes, etc. In the past I have used gsub inside a function to replace such characters, and that seemed to work fine, but I am now running into problems again.
The first line in each pair shown below works on macOS, Linux, and Windows. However, non-ASCII characters are not allowed in code that is to be included in an R package. The second line in each pair works on macOS and Linux but not on Windows.
It would be great to have a general approach for dealing with these types of characters that does not involve simply deleting them.
gsub("•", "*", "A big dot •")
gsub("\xE2\x80\xA2", "*", "A big dot •")
gsub("…", "...", "Some small dots …")
gsub("\xE2\x80\xA6", "...", "Some small dots …")
gsub("–", "-", "A long-dash –")
gsub("\xE2\x80\x93", "-", "A long-dash –")
gsub("’", "'", "A curly single quote ’")
gsub("\xE2\x80\x99", "'", "A curly single quote ’")
gsub("‘", "'", "A curly single quote ‘")
gsub("\xE2\x80\x98", "'", "A curly single quote ‘")
gsub("”", '"', "A curly double quote ”")
gsub("\xE2\x80\x9D", '"', "A curly double quote ”")
gsub("“", '"', "A curly double quote “")
gsub("\xE2\x80\x9C", '"', "A curly double quote “")
We can check the hex encoding of a character using the Encoding function:
x <- c("•", "…", "–", "’", "‘", "”", "“")
y <- x
Encoding(y) <- "bytes"
> x
[1] "•" "…" "–" "’" "‘" "”" "“"
> cat(y)
\x95 \x85 \x96 \x92 \x91 \x94 \x93
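As a cross-check that does not depend on the console's locale, here is a small sketch using base R's charToRaw() and utf8ToInt(). It assumes the characters are written with \u escapes, so they are stored as UTF-8; the variable names are only illustrative.

```r
# Characters written as \u escapes are stored as UTF-8.
x <- c("\u2022", "\u2026", "\u2013")   # bullet, ellipsis, en dash

# Raw UTF-8 bytes of the bullet
bullet_bytes <- charToRaw(x[1])

# Unicode code points, formatted as \u escapes that can be pasted into source code
code_points <- vapply(x, utf8ToInt, integer(1), USE.NAMES = FALSE)
escapes <- sprintf("\\u%04X", code_points)

print(bullet_bytes)   # e2 80 a2
print(escapes)        # "\\u2022" "\\u2026" "\\u2013"
```

The bytes e2 80 a2 are the same three bytes used in the "\xE2\x80\xA2" patterns in the question.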
You can then add these hex codes to the patterns in your gsub calls:
gsub("•", "*", "A big dot •")
gsub("[\x95\xE2\x80\xA2]", "*", "A big dot •")
gsub("…", "...", "Some small dots …")
gsub("[\x85\xE2\x80\xA6]", "...", "Some small dots …")
gsub("–", "-", "A long-dash –")
gsub("[\x96\xE2\x80\x93]", "-", "A long-dash –")
gsub("’", "'", "A curly single quote ’")
gsub("[\x92\xE2\x80\x99]", "'", "A curly single quote ’")
gsub("‘", "'", "A curly single quote ‘")
gsub("[\x91\xE2\x80\x98]", "'", "A curly single quote ‘")
gsub("”", '"', "A curly double quote ”")
gsub("[\x94\xE2\x80\x9D]", '"', "A curly double quote ”")
gsub("“", '"', "A curly double quote “")
gsub("[\x93\xE2\x80\x9C]", '"', "A curly double quote “")
Also with stri_trans_general from stringi:
library(stringi)
stri_trans_general(x, "ascii")
# [1] "•" "..." "-" "'" "'" "\"" "\""
This does not seem to work for "•", but it works for the rest.
Note that I have only tested this solution on Windows, not on other operating systems.
On systems with non-US language settings, gsub("[\x95\xE2\x80\xA2]", "*", "A big dot •") can cause errors (see the example below).
> gsub("[\x95\xE2\x80\xA2]", "*", "A big dot •")
Error in gsub("[曗€", "*", "A big dot <U+2022>") :
invalid regular expression '[曗€', reason 'Missing ']''
The following, however, does work well.
gsub("\u2022", "*", "A big dot •")
gsub("\u2026", "...", "Some small dots …")
gsub("\u2013", "-", "A long-dash –")
gsub("\u2019", "'", "A curly single quote ’")
gsub("\u2018", "'", "A curly single quote ‘")
gsub("\u201D", '"', "A curly double quote ”")
gsub("\u201C", '"', "A curly double quote “")
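To turn the \u-escape calls into something reusable, here is a minimal sketch of a helper. The name ascii_punct and the choice of replacements are my own assumptions, not part of any package. Because it only uses \u escapes, the source file itself stays pure ASCII.

```r
# Hypothetical helper (not from any package): map common "smart" punctuation
# to ASCII equivalents using \u escapes only.
ascii_punct <- function(x) {
  from <- c("\u2022", "\u2026", "\u2013", "\u2019", "\u2018", "\u201D", "\u201C")
  to   <- c("*",      "...",    "-",      "'",      "'",      "\"",     "\"")
  for (i in seq_along(from)) {
    # fixed = TRUE treats each character literally rather than as a regex
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

cleaned <- ascii_punct("\u201CA big dot\u201D \u2022 and more \u2026")
print(cleaned)
```

Characters that are not in the from vector are left alone, so other non-ASCII text (for example CJK characters) passes through unchanged rather than being deleted.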
Also, stringi::stri_trans_general works well on systems with US language settings. On a system with Chinese language settings, however, the code shown below does not return the desired result, which is simply 夹 unchanged. I am not sure what the solution is.
stringi::stri_trans_general("夹", "ascii")
> stringi::stri_trans_general("夹", "ascii")
[1] " 1/4D"
Related
I have textual data (storytellings) and my aim is to extract certain words that are defined by a co-occurrence pattern, namely that they occur immediately prior to overlap, which is indicated by square brackets. The data are like this:
who <- c("Sue:", NA, "Carl:", "Sue:", NA, NA, NA, "Carl:", "Sue:","Carl:", "Sue:","Carl:")
story <- c("That’s like your grand:ma. did that with::=erm ",
"with Ju:ne (.) once or [ twice.] ",
" [ Yeah. ] ",
"And June wanted to go out and yo- your granny said (0.8)",
"“make sure you're ba(hh)ck before midni(hh)ght.” ",
"[Mm.] ",
"[There] she was (.) a ma(h)rried woman with a(h)- ",
"She’s a right wally. ",
"mm [kids as well ] ",
" [They assume] an awful lot man¿ ",
"°°ye:ah,°° ",
"°°the elderly do.°° ")
CAt <- data.frame(who, story)
Now, defining the pattern:
pattern <- "\\w.*\\s\\[[^]].*]"
and using grep():
grep(pattern, CAt$story, value = TRUE)
[1] "with Ju:ne (.) once or [ twice.] "
[2] "mm [kids as well ] "
I get the two strings that contain the target matches but what I'm really after are the target words only, in this case the words "or" and "mm". This, to me, seems to call for positive lookahead. So I redefined the pattern thus:
pattern <- "\\w.*(?=\\s\\[[^]].*])"
which says something along the lines: "match the word iff you see a space followed by square brackets with some content on the right of that word". Now to extract only the exact matches, I normally use this code, which works fine as long as no lookaround is involved, but here it throws an error:
unlist(regmatches(CAt$story, gregexpr(pattern, CAt$story)))
Error in gregexpr(pattern, CAt$story) :
invalid regular expression, reason 'Invalid regexp'
Why is this? And how can the exact matches be extracted?
In your code, you could add perl=TRUE to gregexpr.
In your pattern \w.* will match a single word char followed by matching any char 0+ times.
This part \[[^]].*] will match [, then 1 char which is not ] and then .* which will match any char 0+ times followed by ].
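To see the effect of the greedy .* in practice, here is a small sketch that runs the original lookahead pattern with perl = TRUE on one of the lines from the question. The variable names are only for illustration.

```r
s <- "with Ju:ne (.) once or [ twice.] "

# Original pattern: \w matches the first word character, then .* runs on
# as far as it can while the lookahead still succeeds.
greedy <- regmatches(s, regexpr("\\w.*(?=\\s\\[[^]].*])", s, perl = TRUE))
print(greedy)  # "with Ju:ne (.) once or"
```

So the error disappears with perl = TRUE, but the match is everything from the first word character up to the bracket, not just the last word.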
You could update your pattern to repeat the word character (\w+) and the negated character class ([^]]*) themselves, instead of following each of them with .*:
\w+(?=\s\[[^]]*])
Explanation
\w+ Match 1+ word chars
(?= Positive lookahead, assert what is directly to the right is
\s Match single whitespace char
\[[^]]*] Match from the opening [ to the closing ] using a negated character class
) Close positive lookahead
Using doubled backslashes:
\\w+(?=\\s\\[[^]]*])
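Put together with the extraction code from the question, that could look like the sketch below. The story vector is shortened to three of the lines from the question.

```r
story <- c("with Ju:ne (.) once or [ twice.] ",
           " [ Yeah. ] ",
           "mm [kids as well ] ")

# Lookarounds need the PCRE engine, hence perl = TRUE
pattern <- "\\w+(?=\\s\\[[^]]*])"
hits <- unlist(regmatches(story, gregexpr(pattern, story, perl = TRUE)))
print(hits)  # "or" "mm"
```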
As an alternative you could use a capturing group instead of using a lookahead
(\w+)\s\[[^]]*]
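A sketch of how the capturing-group variant could be used from R with regexec, which returns the full match followed by the groups. Note that regexec only reports the first match per string, and the NA handling here is my own choice.

```r
story <- c("with Ju:ne (.) once or [ twice.] ",
           " [ Yeah. ] ",
           "mm [kids as well ] ")

# Each list element is c(full_match, group_1), or character(0) if no match
m <- regmatches(story, regexec("(\\w+)\\s\\[[^]]*]", story, perl = TRUE))

# Keep group 1 only; strings without a match become NA
words <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
                character(1))
print(words)  # "or" NA "mm"
```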
How do I remove an interpunct (aka interpoint, middle dot, middot) from a string? I am looking for something like trimws, but trimws doesn't work on the interpunct.
Cheers
I believe this is what you're looking for.
string <- c("· interpunct", "interpunct · interpunct", "interpunct · ")
#[1] "· interpunct" "interpunct · interpunct" "interpunct · "
sub("\\s*·\\s*", "", string)
#[1] "interpunct" "interpunctinterpunct" "interpunct"
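If the goal is to behave more like trimws, i.e. to strip interpuncts and whitespace only at the start and end and leave inner ones alone, a sketch along these lines might do. It writes the interpunct as the escape \u00B7 so the code stays ASCII; the variable names are illustrative.

```r
string <- c("\u00B7 interpunct", "interpunct \u00B7 interpunct", "interpunct \u00B7 ")

# Strip any run of whitespace/interpuncts anchored at the start or the end
trimmed <- gsub("^[\\s\u00B7]+|[\\s\u00B7]+$", "", string, perl = TRUE)
print(trimmed)
```

Here the middle element keeps its inner interpunct, while the first and last elements are reduced to "interpunct".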
I want to merge words in one string that are split by spaces, so that they match the corresponding unsplit words in another string (in R).
For example:
s1 = 'this is an example of an undivided string case here'
s2 = 'Please note th is is an un di vid ed case right he r e for you!'
s2 needs to be converted into
s2 = 'Please note this is an undivided case right here for you!'
based on the words in s1, which are the same as runs of successive fragments in s2 (with spaces in between).
I am new to R and tried gsub with different combinations of '\s', but was not able to get the desired result.
You may achieve what you need by
removing all whitespace from the string you want to search for (s1) with gsub("\\s+", "", x), then
inserting a whitespace pattern (\s*) between each pair of characters, with something like sapply(strsplit(unspace(s1), ""), paste, collapse="\\s*"), where unspace is the helper defined in the demo below, and then
replacing every match in s2 with the original s1 using gsub(pattern, s1, s2).
See the R demo (s1 is shortened here so that it occurs as one contiguous run in s2):
s2 = 'Please note th is is an un di vid ed case right he r e for you!'
s1 = 'this is an undivided case right here'
unspace <- function(x) { gsub("\\s+", "", x) }
pattern <- sapply(strsplit(unspace(s1), ""), paste, collapse="\\s*")
gsub(pattern, s1, s2)
## => [1] "Please note this is an undivided case right here for you!"
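With the full s1 from the question, the single big pattern no longer matches, because s1 contains words ("example", "of", "string") that are not in s2. One possible variation, sketched below, is to apply the same idea word by word. The rejoin helper is hypothetical, and it assumes the words in s1 contain no regex metacharacters.

```r
s1 <- "this is an example of an undivided string case here"
s2 <- "Please note th is is an un di vid ed case right he r e for you!"

# Unique words of s1, longest first so long words are rebuilt before short ones
words <- unique(strsplit(s1, "\\s+")[[1]])
words <- words[order(nchar(words), decreasing = TRUE)]

# Hypothetical helper: allow optional whitespace between the letters of one
# word, anchored at word boundaries, and replace the hit with the word itself.
rejoin <- function(text, word) {
  chars <- strsplit(word, "")[[1]]
  pattern <- paste0("\\b", paste(chars, collapse = "\\s*"), "\\b")
  gsub(pattern, word, text, perl = TRUE)
}

result <- Reduce(rejoin, words, s2)
print(result)
## => [1] "Please note this is an undivided case right here for you!"
```

The word boundaries keep a short word such as "is" from matching inside a longer word, but fragments that happen to spell a different word from s1 would still be merged, so treat this as a starting point only.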
I have strings like:
\n A vs B \n
\n C vs D (EF) \n
\n GH ( I vs J) \n
in a vector called myData.
The following is myData.
c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
I want to select A vs B from 1, C vs D from 2 and I vs J from 3.
I have the following code:
loc = regexpr(".*vs.*|\\(.*vs.*\\)",myData,ignore.case=TRUE,perl=TRUE)
end = loc + attr(loc,"match.length")-1
substr(myData,loc,end)
which gives three outputs:
[1] " A vs B " " C vs D (EF) " " GH ( I vs J)"
The last match is incorrect. How can I fix this?
We can use str_extract
library(stringr)
str_extract(str1, "[A-Za-z]\\s*vs\\s*[A-Za-z]")
#[1] "A vs B" "C vs D" "I vs J"
Or if there are other lower case characters in place of 'vs'
str_extract(str1, "[A-Z]\\s*[a-z]+\\s*[A-Z]")
#[1] "A vs B" "C vs D" "I vs J"
Or with sub from base R
sub(".*([A-Z]\\s*[a-z]+\\s*[A-Z]).*", "\\1", str1)
#[1] "A vs B" "C vs D" "I vs J"
data
str1 <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
You may use a base R regmatches / gregexpr solution with a PCRE regex like yours, but with lookarounds, with . changed to [^()] (so that a match cannot run across parentheses), and with the longer alternative placed before the shorter one:
> myData <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
> res <- regmatches(myData, gregexpr("(?<=\\()[^()]*vs[^()]*(?=\\))|[^()]*vs[^()]*", myData, perl=TRUE))
> trimws(unlist(res))
[1] "A vs B" "C vs D" "I vs J"
Details:
(?<=\\() - positive lookbehind making sure there is a ( immediately to the left of the current location
[^()]* - 0+ chars other than ( and )
vs - a literal substring
[^()]* - 0+ chars other than ( and )
(?=\\)) - positive lookahead making sure there is a ) immediately to the right of the current location
| - or
[^()]*vs[^()]* - a vs enclosed with 0+ chars other than ( and )
NOTE: If you need to prevent matches from running across lines, add \r\n to the character class: [^()] -> [^()\r\n].
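A sketch of that line-safe variant on the sample data, with the list returned by regmatches flattened before trimming:

```r
myData <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")

# Same two alternatives as above, but the character class also excludes
# carriage returns and newlines.
pattern <- "(?<=\\()[^()\\r\\n]*vs[^()\\r\\n]*(?=\\))|[^()\\r\\n]*vs[^()\\r\\n]*"
res <- regmatches(myData, gregexpr(pattern, myData, perl = TRUE))

# regmatches() returns a list; flatten it, then strip surrounding spaces
result <- trimws(unlist(res))
print(result)  # "A vs B" "C vs D" "I vs J"
```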
Throwing a non-regex approach into the mix. Basically, we split at vs and paste the last character of the first element together with the first character of the second element.
sapply(strsplit(myData, ' vs '), function(i)
paste0(substr(i[1], nchar(i[1]), nchar(i[1])), ' Vs ', substr(i[2], 1, 1)))
#[1] "A Vs B" "C Vs D" "I Vs J"
Let's say I have the following string:
input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
I need to replace the whitespace with underscores, but only in the substrings that match a pattern. (In this case the pattern would be a semi-colon before and after.)
The expected output should be:
output = "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
Any ideas how to achieve that, preferably using regular expressions, and without splitting the string?
I tried:
gsub("(.*);(.*);(.*)", "\\2", input) # Pattern matching and
gsub(" ", "_", input) # Naive gsub
Couldn't put them both together though.
Regarding the original question:
Substitute character in a matching substring
You may do it easily with gsubfn:
> library(gsubfn)
> input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
> gsubfn(";([^;]+);", function(g1) paste0(";",gsub(" ", "-", g1, fixed=TRUE),";"), input)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
The ;([^;]+); matches any string starting with ; and up to the next ; capturing the text in-between and then replacing the whitespaces with hyphens only inside the captured part.
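If you would rather not load gsubfn, the same idea can be sketched in base R with gregexpr and the replacement form of regmatches. This version uses underscores, as in the expected output of the question; the variable names are illustrative.

```r
input <- "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
output <- input

# Locate every ;...; stretch
m <- gregexpr(";[^;]+;", output)

# Rewrite only the matched stretches, then assign them back in place
regmatches(output, m) <- lapply(regmatches(output, m), function(hit) {
  gsub(" ", "_", hit, fixed = TRUE)
})
print(output)  # "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
```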
Another approach is to use gsub with a PCRE regex based on \G:
p = "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"
> gsub(p, "-", input, perl=TRUE)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
Pattern details:
(?:\\G(?!\\A)|;) - a custom boundary: either the end of the previous successful match (\\G(?!\\A)) or (|) a semicolon
(?=[^;]*;) - a lookahead check: there must be a ; after 0+ chars other than ;
[^;\\s]* - 0+ chars other than ; and whitespaces
\\K - omitting the text matched so far
\\s - 1 single whitespace character (if multiple whitespaces are to be replaced with 1 hyphen, add + after it).
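A sketch of the \\s+ variant mentioned in the last item, here with an underscore as the replacement. The input has a double space after THIS, which I added purely to show the collapsing.

```r
# Two spaces after THIS (added for illustration)
input2 <- "askl jmsp wiqp;THIS  IS A MATCH; dlkasl das, fm"

# Same pattern as above, with + so a run of whitespace becomes one underscore
p2 <- "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s+"
output2 <- gsub(p2, "_", input2, perl = TRUE)
print(output2)  # "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
```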