Non alphanumeric characters in R

For uppercase letters, lowercase letters, and the ten digits, I can generate a vector that contains all of them as follows:
A <- LETTERS[1:26]
B <- letters[1:26]
C <- 0:9
I wonder whether there is a similar built-in for the non-alphanumeric characters:
~!@#$%^&*_-+=`|\(){}[]:;"'<>,.?/
I tried typing them out by hand (the backslash and the double quote must be escaped):
D <- c("~","!","@","#","$","%","^","&","*","_","-","+","=","`","|","\\","(",")","{","}","[","]",":",";","\"","'","<",">",",",".","?","/")
Thanks

This is another option: generate all ASCII characters, then keep only the punctuation using a regular expression.
ascii <- rawToChar(as.raw(0:127), multiple=TRUE)
ascii[grepl('[[:punct:]]', ascii)]
# [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "<" "=" ">" "?" "@"
# [23] "[" "\\" "]" "^" "_" "`" "{" "|" "}" "~"
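The same set can also be obtained without a regular expression. The following is a minimal sketch, assuming that "non-alphanumeric" means the printable ASCII characters (codes 33 to 126) minus letters and digits; the variable names are illustrative only.

```r
# Printable ASCII runs from code 33 ("!") to 126 ("~"); the space (32) is left out.
printable <- rawToChar(as.raw(33:126), multiple = TRUE)

# Remove letters and digits; c() coerces 0:9 to the strings "0".."9".
non_alnum <- setdiff(printable, c(LETTERS, letters, 0:9))

length(non_alnum)   # 94 printable characters minus 62 alphanumerics leaves 32
```

Because setdiff() keeps the order of its first argument, the result is in ASCII code order, like the output above.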

This might be useful. The ASCII character set is arranged in ranges of similar types of characters (letters, digits, and so on):
http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html

It's a bit drawn out, and there's probably a better website (and a better way to get the same result), but you can scrape the characters from an ASCII table on the web:
library(XML); library(RCurl)
doc <- htmlParse(getURL("https://wci.llnl.gov/codes/basis/manual/node161.html"))
xp <- xpathSApply(doc, "//tr/td", xmlValue, trim = TRUE)
xp[nzchar(xp) & nchar(xp) == 1]
# [1] "!" "[" "%" "," "]" "&" "-" "|" "'" "." "=" "~" "("
# [14] "/" ")" "*" "=" "{" "?" "`" "}" "#" ":" ";" "^" " "
Also, using the website from the other answer yields a more complete result:
> URL <- "http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html"
> r <- readLines(URL, warn = FALSE)[780:874]
> s <- sapply(strsplit(r, "\\s+"), "[", 1)
> s[!s %in% c(letters, LETTERS, 0:9)]
# [1] "" "!" "\"" "#" "$" "%" "&" "'" "("
# [10] ")" "*" "+" "," "-" "." "/" ":" ";"
# [19] "<" "=" ">" "?" "@" "[" "\\\\" "]" "^"
# [28] "_" "`" "{" "|" "}" "~"
...or yeah, just use rawToChar(as.raw(...)) like MrFlick said :-)

This answer is only for amusement: list the characters you want and use strsplit to generate your vector.
> D <- strsplit('!"#$%&\'()*+,-./\\:;<=>?@[]^_`{|}~', '(?=.)', perl=TRUE)[[1]]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] "\\" ":" ";" "<" "=" ">" "?" "@" "[" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"
Or filter the characters you want.
> D <- gsub('[^\\pP\\pS]', '', rawToChar(as.raw(1:127), multiple=TRUE), perl=TRUE)
> D[D != ""]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] ":" ";" "<" "=" ">" "?" "@" "[" "\\" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"

Related

R case insensitive capturing group

This regex:
str_extract_all("This is a Test , ' ' " , "[a-z]+")
returns:
[1] "his" "is" "a" "est"
How can I modify it so that the match is case insensitive?
[1] "This" "is" "a" "Test"
should be returned instead.
Should /i make the match case insensitive?
Trying str_extract_all("This is a Test , ' ' " , "[a-z]+/i")
returns
[[1]]
character(0)

stringr has a special helper, regex(), for setting match options:
regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE,
dotall = FALSE, ...)
You may use
> str_extract_all("This is a Test , ' ' " , regex("[a-z]+", ignore_case=TRUE))
[[1]]
[1] "This" "is" "a" "Test"
Alternatively, use an inline i modifier (?i):
str_extract_all("This is a Test , ' ' " , "(?i)[a-z]+")
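For comparison, the same extraction can be done without stringr. This is only a sketch in base R, using gregexpr() with ignore.case = TRUE and regmatches(); the variable names are illustrative.

```r
x <- "This is a Test , ' ' "

# gregexpr() locates every match; ignore.case = TRUE lets [a-z] match capitals too.
m <- gregexpr("[a-z]+", x, ignore.case = TRUE)

# regmatches() pulls the matched substrings out; [[1]] selects the first (only) input.
words <- regmatches(x, m)[[1]]
words
```

Like str_extract_all(), regmatches() returns a list with one element per input string, hence the [[1]].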

You could try including the capital letters in the set you're searching for.
str_extract_all("This is a Test , ' ' " , "[A-Za-z]+")
If you only want to allow a capital as the first letter, you could try the code below. It accepts either case for the first letter and only lowercase letters after it.
str_extract_all("This is a Test , ' ' " , "[A-Za-z][a-z]*")

vector of punctuation

For digits I can write a vector like this:
digits <- c("0","1","2","3","4","5","6","7","8","9")
How can I get an analogous vector of punctuation marks?

You could convert numbers to punctuation using Unicode code points (thanks, Konrad, for pointing that out).
strsplit(intToUtf8(c(33:47, 58:64, 91:96)), "")[[1]]
# [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "."
#[15] "/" ":" ";" "<" "=" ">" "?" "@" "[" "\\" "]" "^" "_" "`"
Some Ethiopian punctuation (0x1361:0x1367):
strsplit(intToUtf8(0x1361:0x1367), "")[[1]]
[1] "፡" "።" "፣" "፤" "፥" "፦" "፧"
If this is missing punctuation you want to use, you can look up the Unicode code points for that punctuation and add them (e.g. somewhere like http://www.fileformat.info/info/unicode/category/Po/list.htm). You can also get the integers from utf8ToInt. For instance, "~" isn't included above:
utf8ToInt("~")
#[1] 126
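Extending the ranges with 123:126 picks up the remaining ASCII punctuation, including "~". A small sketch, assuming only ASCII punctuation is wanted; it uses the multiple = TRUE argument of intToUtf8() so that no strsplit() step is needed.

```r
# The four ASCII punctuation ranges: ! to /, : to @, [ to `, { to ~
codes <- c(33:47, 58:64, 91:96, 123:126)

# multiple = TRUE returns one string per code point instead of one long string.
punct <- intToUtf8(codes, multiple = TRUE)

length(punct)   # 15 + 7 + 6 + 4 = 32 characters
```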

How to print double quotes (") in R

I want to print to the screen double quotes (") in R, but it is not working. Typical regex escape characters are not working:
> print('"')
[1] "\""
> print('\"')
[1] "\""
> print('/"')
[1] "/\""
> print('`"')
[1] "`\""
> print('"xml"')
[1] "\"xml\""
> print('\"xml\"')
[1] "\"xml\""
> print('\\"xml\\"')
[1] "\\\"xml\\\""
I want it to return:
" "xml" "
which I will then use downstream.
Any ideas?

Use cat:
cat("\" \"xml\" \"")
OR
cat('" "','xml','" "')
Output:
" "xml" "
Alternative using noquote:
noquote(" \" \"xml\" \" ")
Output :
" "xml" "
Another option using dQuote:
dQuote(" xml ")
Output :
"“ xml ”"

With the help of the print parameter quote:
print("\" \"xml\" \"", quote = FALSE)
[1] " "xml" "
or
cat('"')
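It may help to see that the backslashes shown by print() are only part of the displayed representation, not part of the string itself. A short sketch (the variable name s is illustrative):

```r
s <- "\" \"xml\" \""

# The string holds 9 characters: the backslashes in the source are escapes only.
nchar(s)

# writeLines(), like cat(), writes the raw characters (plus a trailing newline).
writeLines(s)
```

So the value is already correct for downstream use; only the way print() displays it adds the escapes.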

Replacing + by 5 in R

I have a dataset called Price which is supposed to be numeric but is read in as strings because every 5 has been replaced by +.
It looks like this:
"99000" "98300" "98300" "98290" "98310" " 9831+ " "98310" " 9830+ " " 9830+ " " 9830+ " " 9829+ " " 9828+ " " 9827+ " "98270"
I used the gsub function in R to try and replace + by 5. The code I wrote is:
finalPrice<-gsub("+",5,Price)
However, the output is just a bunch of numbers which doesn't make sense for what I intended:
"59595050505,5 59585350505,5 59585350505,5 59585259505,5 59585351505,5 5 5 595853515+5 5,5 59585351505,5 5 5 595853505+5 5,5 5 5 595853505+5
How can I fix this?

The + sign should be escaped. Try this:
finalPrice<-gsub("\\+",5, Price)

Besides using a double backslash to force a literal + to be matched by the pattern argument, you can also use the fixed=TRUE parameter or put the + in a character class, "[+]". See the ?regex page for more details:
> gsub("+", "5", txt, fixed=TRUE)
[1] "99000" "98300" "98300" "98290" "98310"
[6] " 98315 " "98310" " 98305 " " 98305 " " 98305 "
[11] " 98295 " " 98285 " " 98275 " "98270"
> gsub("[+]", "5", txt)
[1] "99000" "98300" "98300" "98290" "98310"
[6] " 98315 " "98310" " 98305 " " 98305 " " 98305 "
[11] " 98295 " " 98285 " " 98275 " "98270"

In a regex, + means match the preceding item one or more times. Because nothing precedes the + in your pattern, gsub matches the empty string at every position in the target.
The result is that 5 is inserted into each of these positions.
To avoid this, escape the +, which in R requires a double backslash:
finalPrice<-gsub("\\+",5,Price)
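Since the goal is a numeric column, the replacement can be followed by a conversion. A minimal sketch on a made-up three-element sample (the values are taken from the question, but the short vector is only for illustration):

```r
# Hypothetical sample in the same shape as the question's data
Price <- c("99000", " 9831+ ", "98270")

# fixed = TRUE treats "+" as a literal character, so no escaping is needed.
fixed <- gsub("+", "5", Price, fixed = TRUE)

# Strip the surrounding spaces, then convert the strings to numbers.
finalPrice <- as.numeric(trimws(fixed))
finalPrice
```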

List of valid characters for the fragment identifier in an URL?

I'm using the fragment identifier to create a permalink for AJAX events in my web app similar to this guy. Something like:
http://www.myapp.com/calendar#filter:year/2010/month/5
I've done quite a bit of searching but can't find a list of valid characters for the fragment identifier. The W3C spec doesn't offer anything.
Do I need to encode the characters the same way as in the rest of the URL?
There doesn't seem to be any good information on this anywhere.

See RFC 3986:
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So you can use !, $, &, ', (, ), *, +, ,, ;, =, something matching %[0-9a-fA-F]{2}, something matching [a-zA-Z0-9], -, ., _, ~, :, @, /, and ?.

https://www.rfc-editor.org/rfc/rfc3986#section-3.5:
fragment = *( pchar / "/" / "?" )
and
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
So, combined, the fragment cannot contain #, a raw %, ^, [, ], {, }, \, ", < and > according to the RFC.
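The grammar above translates fairly directly into a regular expression. The following is a sketch only, written in R; the names fragment_re and is_valid_fragment are hypothetical, and the pattern checks the fragment text itself (without the leading #).

```r
# One allowed character (unreserved, sub-delims, ":", "@", "/", "?")
# or one percent-encoded triplet, repeated zero or more times.
fragment_re <- "^([A-Za-z0-9._~!$&'()*+,;=:@/?-]|%[0-9A-Fa-f]{2})*$"

# Hypothetical helper: TRUE when x matches the RFC 3986 fragment rule.
is_valid_fragment <- function(x) grepl(fragment_re, x, perl = TRUE)

is_valid_fragment("filter:year/2010/month/5")   # TRUE
is_valid_fragment("a#b")                        # FALSE: "#" is not allowed
is_valid_fragment("50%")                        # FALSE: raw "%" without two hex digits
```

By this check, the fragment in the question's example URL needs no additional encoding.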

Another RFC covers this as well: RFC 1738
URL schemeparts for ip based protocols:
HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
