vector of punctuation - r

For digits I can write a vector like this:
digits <- c("0","1","2","3","4","5","6","7","8","9")
How can I get an analogous vector of punctuation marks?

You could convert numbers to punctuation using Unicode code points (thanks to Konrad for pointing that out).
strsplit(intToUtf8(c(33:47, 58:64, 91:96)), "")[[1]]
# [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "."
#[15] "/" ":" ";" "<" "=" ">" "?" "@" "[" "\\" "]" "^" "_" "`"
The same works outside ASCII, for example some Ethiopic punctuation (0x1361:0x1367):
strsplit(intToUtf8(0x1361:0x1367), "")[[1]]
# [1] "፡" "።" "፣" "፤" "፥" "፦" "፧"
If this is missing punctuation you want, look up the Unicode code points for those characters (e.g. at http://www.fileformat.info/info/unicode/category/Po/list.htm) and add them. You can also get the code point of a character with utf8ToInt. For instance, "~" isn't included above:
utf8ToInt("~")
#[1] 126
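As a minimal sketch, the same approach can be extended to all 32 ASCII punctuation characters by adding the range 123:126 ("{", "|", "}", "~"). The variable names are illustrative, and the multiple = TRUE argument of intToUtf8 is used here instead of strsplit.

```r
# Code points of the four ASCII punctuation ranges
punct_codes <- c(33:47, 58:64, 91:96, 123:126)

# multiple = TRUE returns one string per code point
punct <- intToUtf8(punct_codes, multiple = TRUE)

length(punct)                     # 32
utf8ToInt("~") %in% punct_codes   # TRUE
```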

Related

How do I replace "/" with "\" in r?

I am trying to turn "C:/Users/Vitor/Documents" into "C:\Users\Vitor\Documents".
I tried:
gsub("//", "\", file)
paste(dirname(file),basename(file),sep="\")
normalizePath(file,"\",mustWork=FALSE)
But none of these worked!
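We can escape the backslash in the replacement. In an R string literal, "\\" is a single backslash character, and gsub also treats a backslash in the replacement as an escape character, so a literal backslash has to be written as "\\\\":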
gsub("/", "\\\\", "C:/Users/Vitor/Documents")
which prints correctly with cat:
cat(gsub("/", "\\\\", "C:/Users/Vitor/Documents"))
#C:\Users\Vitor\Documents
We can also check that "\\" is only one character:
nchar("\\")
#[1] 1
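A minimal sketch of two alternatives that avoid the double escaping, assuming the path is held in a variable called path (an illustrative name). With fixed = TRUE, gsub treats both pattern and replacement literally, and chartr translates one character into another.

```r
path <- "C:/Users/Vitor/Documents"

# fixed = TRUE: no regex, and the replacement is taken literally
win1 <- gsub("/", "\\", path, fixed = TRUE)

# chartr swaps each "/" for a single backslash
win2 <- chartr("/", "\\", path)

cat(win1, "\n")        # prints the path with single backslashes
identical(win1, win2)  # TRUE
```

print() shows the backslashes doubled because it displays the escaped form of the string; cat() shows the actual characters.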

Searching and replacing characters with classes in R

I am trying to replace text in R. I want to find spaces between letters and numbers only and delete them, but when I search using [:alpha:] and [:alnum:] it replaces with that class operator.
> string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
> str_replace_all(string,
+ "[:alpha:] & [:alnum:]",
+ "[:alpha:]&[:alnum:]")
[1] "WORD = 500 * WORD + ((WOR[:alpha:]&[:alnum:]00) - (WOR[:alpha:]&[:alnum:]00))"
How can I use the function so that it returns-
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
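The replacement string is not interpreted as a pattern, so the class names are inserted literally. Use capture groups and refer to them in the replacement instead:
str_replace_all(string, "([:alpha:]) & ([:alnum:])", "\\1&\\2")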
Your requirement is easy enough to handle using gsub with lookarounds:
string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
output <- gsub("(?<=\\w) & (?=\\w)", "&", string, perl=TRUE)
output
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
Here is a brief explanation of the regex:
(?<=\\w) assert that what precedes is a word character
[ ]&[ ] then match a space, followed by `&`, followed by another space
(?=\\w) assert that what follows is also a word character
Then, we replace with just a single &, with no spaces on either side.
Here is another option, where we use regex lookarounds to match one or more spaces (\\s+) either following or preceding a & and replace them with an empty string (""):
gsub("(?<=&)\\s+|\\s+(?=&)", "", string, perl = TRUE)
#[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
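For comparison, a minimal sketch of the capture-group approach in base R, without stringr. Base R regular expressions need the POSIX classes inside a bracket expression, hence [[:alpha:]] and [[:alnum:]]; the variable name out is illustrative.

```r
string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"

# Capture the letter before " & " and the alphanumeric after it,
# then put them back with the spaces removed
out <- gsub("([[:alpha:]]) & ([[:alnum:]])", "\\1&\\2", string)
out
# [1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
```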

Regex for matching between a colon and last newline prior to next colon

I am trying to parse a string with regex to pull out information between a colon and the last newline prior to the next colon. How can I do this?
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"
stringr::str_extract_all(string, "(?<=:)(.*)(?:\\n)")
but I get:
[[1]]
[1] " Al's\n" " \n" " RI\n"
when I want:
[[1]]
[1] " Al's\nPlace\n" " \n" " RI\n"
I'm not sure if this is what you're after as your wanted output looks a bit different.
:((?:.*\\n?)+?)(?=.*:|$)
: match a colon
((?:.*\n?)+?) match and capture lazily any lines (to optional \n)
(?=.*:|$) until there is a line with colon ahead
See this demo at regex101
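A minimal sketch of how this pattern could be applied in base R, assuming perl = TRUE for the lookahead. The full match includes the leading colon, so it is stripped afterwards with sub; the names pattern, m and values are illustrative.

```r
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"
pattern <- ":((?:.*\\n?)+?)(?=.*:|$)"

# Extract every full match (each starts with the colon)
m <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]

# Drop the leading colon to keep only the captured part
values <- sub("^:", "", m)
values
# [1] " Al's\nPlace\n" "\nState\n"      " RI\n"
```

Note that the second element is "\nState\n" rather than the " \n" shown in the question, which is the difference from the wanted output mentioned above.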

How to completely remove head and tail white spaces or punctuation characters?

I have string_a, such that
string_a <- " ,A thing, something, . ."
Using regex, how can I just retain "A thing, something"?
I have tried the following and got such output:
sub("[[:punct:]]$|^[[:punct:]]","", trimws(string_a))
[1] "A thing, something, . ."
We can use gsub to match one or more punctuation characters or spaces ([[:punct:] ]+) at the start (^) or (|) at the end ($) of the string and replace them with an empty string (""):
gsub("^[[:punct:] ]+|[[:punct:] ]+$", "", string_a)
#[1] "A thing, something"
Note: sub replaces only the first match, which is why the original attempt left the trailing characters in place.
Or, as @Cath mentioned, [[:punct:] ] can be replaced with \\W
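A minimal sketch comparing the two variants on the example string. The names trim_class and trim_nonword are illustrative, and [:space:] is used here in place of the literal space so that tabs and newlines would be trimmed too.

```r
string_a <- " ,A thing, something, . ."

# Punctuation or whitespace at either end
trim_class <- gsub("^[[:punct:][:space:]]+|[[:punct:][:space:]]+$", "", string_a)

# Any non-word character at either end
trim_nonword <- gsub("^\\W+|\\W+$", "", string_a)

trim_class
# [1] "A thing, something"
identical(trim_class, trim_nonword)
# [1] TRUE
```

The two are not equivalent in general: \\W does not match an underscore, so a leading or trailing "_" would survive the second version.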

Non alphanumeric characters in R

For uppercase letters, lowercase letters, and the ten digits, I can generate a vector containing all of them as follows:
A <- LETTERS[0:26]
B <- letters[0:26]
C <- seq(0,9)
I wonder whether there is a similar function for non-alphanumeric characters.
~!@#$%^&*_-+=`|\(){}[]:;"'<>,.?/
I tried
D <- c("~","!","@","#","$","%","^", "&","*","_","-","+","=","`","|","\","(",")","{","}","[","]",":",";",""","'","<",">",",",".","?","/")
Thanks
This is another option. Generate all ASCII characters, then filter out the non-punctuation with a regular expression.
ascii <- rawToChar(as.raw(0:127), multiple=TRUE)
ascii[grepl('[[:punct:]]', ascii)]
# [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "<" "=" ">" "?" "@"
# [23] "[" "\\" "]" "^" "_" "`" "{" "|" "}" "~"
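A minimal sketch of a variant that avoids regular expressions, in case [[:punct:]] behaves differently in a given locale: take the printable ASCII range 33:126 and drop letters and digits with setdiff. The names printable and non_alnum are illustrative.

```r
# Printable, non-space ASCII characters
printable <- rawToChar(as.raw(33:126), multiple = TRUE)

# c() coerces 0:9 to the strings "0" to "9"
non_alnum <- setdiff(printable, c(LETTERS, letters, 0:9))

length(non_alnum)   # 32
```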
This might be useful. The ASCII character set is arranged in ranges of similar types of characters (letters, digits, etc.).
http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html
It's a bit drawn out, and there's probably a better website (and a better way to get the same result), but
library(XML); library(RCurl)
doc <- htmlParse(getURL("https://wci.llnl.gov/codes/basis/manual/node161.html"))
xp <- xpathSApply(doc, "//tr/td", xmlValue, trim = TRUE)
xp[nzchar(xp) & nchar(xp) == 1]
# [1] "!" "[" "%" "," "]" "&" "-" "|" "'" "." "=" "~" "("
# [14] "/" ")" "*" "=" "{" "?" "`" "}" "#" ":" ";" "^" " "
Also, using the website from the other answer yields a more complete result
> URL <- "http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html"
> r <- readLines(URL, warn = FALSE)[780:874]
> s <- sapply(strsplit(r, "\\s+"), "[", 1)
> s[!s %in% c(letters, LETTERS, 0:9)]
# [1] "" "!" "\"" "#" "$" "%" "&" "'" "("
# [10] ")" "*" "+" "," "-" "." "/" ":" ";"
# [19] "<" "=" ">" "?" "@" "[" "\\\\" "]" "^"
# [28] "_" "`" "{" "|" "}" "~"
...or yeah, just use rawToChar(as.raw(...)) like MrFlick said :-)
This answer is only for amusement: list the characters you want and use strsplit to generate your vector.
> D <- strsplit('!"#$%&\'()*+,-./\\:;<=>?@[]^_`{|}~', '(?=.)', perl=T)[[1]]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] "\\" ":" ";" "<" "=" ">" "?" "@" "[" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"
Or filter the characters you want.
> D <- gsub('[^\\pP\\pS]', '', rawToChar(as.raw(1:127), multiple=T), perl=T)
> D[D != ""]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] ":" ";" "<" "=" ">" "?" "@" "[" "\\" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"
