Backslash escaped characters in JavaCC token - javacc

I'm writing a JavaCC parser for a character stream like this:
Abc \(Def\) Gh (Ij; Kl); Mno (Pqr)
and should get it tokenized like this
Abc \(Def\) Gh
LPAREN
Ij
SEMICOLON
Kl
RPAREN
SEMICOLON
Mno
LPAREN
Pqr
RPAREN
The current token definition is
TOKEN:
{
< WORDCHAR : (~[";", "(", ")"])+ >
| <LPAREN: "(">
| <RPAREN: ")">
| <SEMICOLON: ";">
}
How should I change the WORDCHAR token to include backslash escaped parentheses but not parentheses without leading backslash?

TOKEN:
{
< WORDCHAR : (~[";", "(", ")"] | "\\(" | "\\)")+ >
| <LPAREN: "(">
| <RPAREN: ")">
| <SEMICOLON: ";">
}
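To see how the revised WORDCHAR rule splits the sample input, here is a rough Python analogue of the token definitions. It is only a sketch: JavaCC picks the longest match, whereas Python's re module tries alternatives in order, so the escaped-parenthesis alternative has to come first. The tokenize helper and the whitespace stripping are assumptions added for illustration, not part of the grammar.

```python
import re

# Python analogue of the JavaCC token definitions (names mirror the grammar).
# The escaped-parenthesis alternative is listed first because Python's regex
# engine tries alternatives in order rather than picking the longest match.
TOKEN_RE = re.compile(
    r"(?P<WORDCHAR>(?:\\[()]|[^;()])+)"
    r"|(?P<LPAREN>\()"
    r"|(?P<RPAREN>\))"
    r"|(?P<SEMICOLON>;)"
)


def tokenize(text):
    """Return (kind, text) pairs; whitespace-only words are dropped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        # Assumption for readability: trim the spaces around each word.
        value = match.group().strip()
        if kind == "WORDCHAR" and not value:
            continue
        tokens.append((kind, value))
    return tokens


for kind, value in tokenize(r"Abc \(Def\) Gh (Ij; Kl); Mno (Pqr)"):
    print(kind, value)
```

The first token comes out as the whole run Abc \(Def\) Gh, and the unescaped parentheses and semicolons become tokens of their own. In the JavaCC version the WORDCHAR images would still carry their surrounding spaces unless whitespace is handled separately.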

Related

What is the meaning of SP and HT in the separators definition

In the HTTP headers RFC I need to understand the definition of token:
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
I don't understand what SP and HT mean at the end of the separators list. How would I write this in a regex?
Both are defined in the very same RFC:
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
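As for the regex part of the question, one possible approach is to build the character class from the definition itself: take every US-ASCII character, drop the control characters, and drop the separators (SP and HT are simply the space and tab characters). The sketch below does this in Python; the names SEPARATORS, TOKEN_RE and is_token are made up for the example.

```python
import re

# Separators as listed in RFC 2616 section 2.2; SP is " " and HT is "\t".
SEPARATORS = '()<>@,;:\\"/[]?={} \t'

# CHAR is octets 0-127 and CTL is 0-31 plus 127, so the candidates are 32-126.
TOKEN_CHARS = "".join(
    chr(code) for code in range(32, 127) if chr(code) not in SEPARATORS
)

# token = 1*<any CHAR except CTLs or separators>
TOKEN_RE = re.compile("[" + re.escape(TOKEN_CHARS) + "]+")


def is_token(text):
    """True if the whole string is a valid RFC 2616 token."""
    return TOKEN_RE.fullmatch(text) is not None


print(is_token("Content-Type"))  # a typical header name
print(is_token("a b"))           # contains SP, a separator
```

Written out by hand, the same class is [!#$%&'*+\-.^_`|~0-9A-Za-z]+.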

StartTag: invalid element name Error: 1: StartTag: invalid element name

I have an XML file that I am trying to parse using xmlParse in R. I have a number of XML files very similar to the one below that parse without issues; however, when I run the exact same process on one of them, I get the error message below.
a = "productlist1374.xml"
b = xmlParse(a)
StartTag: invalid element name
Error: 1: StartTag: invalid element name
Only certain characters are permitted in XML names by the W3C XML BNF for component names:
Name ::= NameStartChar (NameChar)*
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
[#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
[#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
[#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
[#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] |
[#x203F-#x2040]
You've not posted your XML, but clearly one or more of your start tags uses a character or characters that are not allowed.
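To track down the offending tag, the Name production above can be turned into a regular expression and run over the start tags in the file. The following Python sketch shows the idea; the helper names are hypothetical, and the tag scan is deliberately crude (it ignores CDATA sections and angle brackets inside attribute values), so treat it as a diagnostic aid rather than a parser.

```python
import re

# Character classes transcribed from the NameStartChar / NameChar productions.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"
XML_NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")

# Crude start-tag scan: "<" followed by anything up to whitespace, "/" or ">".
# Closing tags, comments and processing instructions are skipped.
START_TAG_RE = re.compile(r"<([^\s<>/!?][^\s<>/]*)")


def is_valid_xml_name(name):
    """True if the whole string matches the XML Name production."""
    return XML_NAME_RE.fullmatch(name) is not None


def invalid_start_tags(xml_text):
    """Return the start-tag names that are not valid XML names."""
    names = START_TAG_RE.findall(xml_text)
    return [name for name in names if not is_valid_xml_name(name)]


print(invalid_start_tags('<root><1item/><ok-item a="1"/></root>'))
```

A name that starts with a digit, such as 1item here, is a common cause of this error, since digits are allowed in NameChar but not in NameStartChar.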

gsub with "|" character in R

I have a data frame with strings under a variable with the | character. What I want is to remove anything downstream of the | character.
For example, considering the string
heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding
I wish to have only:
heat-shock protein hsp70, putative
Do I need any escape character for the | character?
If I do:
a <- c("foo_5", "bar_7")
gsub("*_.", "", a)
I get:
[1] "foo" "bar"
i.e. I am removing anything downstream of the _ character.
However, If I repeat the same task with a | instead of the _:
b <- c("foo|5", "bar|7")
gsub("*|.", "", b)
I get:
[1] "" ""
You have to escape | by writing it as \\|. Try this:
> gsub("\\|.*$", "", string)
[1] "heat-shock protein hsp70, putative "
where string is
string <- "heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding"
This alternative also removes the trailing space from the output:
gsub("\\s+\\|.*$", "", string)
[1] "heat-shock protein hsp70, putative"
This may be a better job for strsplit than for gsub.
And yes, it looks like the pipe does need to be escaped.
string <- "heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding"
strsplit(string, ' \\| ')[[1]][1]
That outputs
"heat-shock protein hsp70, putative"
Note that I'm assuming you only want the text from before the first pipe, and that you want to drop the space that separates the pipe from the piece of the string you care about.
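The reason the pipe needs escaping is that an unescaped | is the regex alternation operator ("this or that"), not a literal character. The behaviour is not specific to R, so here is a small Python sketch of the same idea; strip_after_pipe is a name invented for the example. In R the pattern is written "\\|" because the backslash must itself be escaped inside a string literal, which Python's raw strings avoid.

```python
import re


def strip_after_pipe(text):
    """Drop the first pipe, everything after it, and any whitespace before it."""
    # \| is a literal pipe; \s* also removes the whitespace in front of it.
    return re.sub(r"\s*\|.*$", "", text)


string = (
    "heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) "
    "| length=654 | sequence_SO=chromosome | SO=protein_coding"
)
print(strip_after_pipe(string))

# Unescaped, "a|b" means "a or b", so it cannot match the text "a|b" as a whole.
print(re.fullmatch(r"a|b", "a|b"))   # no match
print(re.fullmatch(r"a\|b", "a|b"))  # matches the literal text
```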

Including ASCII art in R

I'm writing a small program and wanted to know if there is a way to include ASCII art in R. I was looking for an equivalent of the triple quotes (""" or ''') in Python.
I tried using cat or print with no success.
Unfortunately, R can only represent literal strings using single or double quotes, which makes representing ASCII art awkward. However, you can do the following to get a text representation of your art that can be output using R's cat function.
1) First put your art in a text file:
# ascii_art.txt is our text file with the ascii art
# For test purposes we use the output of say("Hello") from cowsay package
# and put that in ascii_art.txt
library(cowsay)
writeLines(capture.output(say("Hello"), type = "message"), con = "ascii_art.txt")
2) Then read the file in and use dput:
art <- readLines("ascii_art.txt")
dput(art)
which gives this output:
c("", " -------------- ", "Hello ", " --------------", " \\",
" \\", " \\", " |\\___/|", " ==) ^Y^ (==",
" \\ ^ /", " )=*=(", " / \\",
" | |", " /| | | |\\", " \\| | |_|/\\",
" jgs //_// ___/", " \\_)", " ")
3) Finally in your code write:
art <- # copy the output of `dput` here
so your code would contain this:
art <-
c("", " -------------- ", "Hello ", " --------------", " \\",
" \\", " \\", " |\\___/|", " ==) ^Y^ (==",
" \\ ^ /", " )=*=(", " / \\",
" | |", " /| | | |\\", " \\| | |_|/\\",
" jgs //_// ___/", " \\_)", " ")
4) Now if we simply cat the art variable it shows up:
> cat(art, sep = "\n")
--------------
Hello
--------------
\
\
\
|\___/|
==) ^Y^ (==
\ ^ /
)=*=(
/ \
| |
/| | | |\
\| | |_|/\
jgs //_// ___/
\_)
Added
This is an addition several years later. In R 4.0 there is a new syntax that makes this even easier. See ?Quotes
Raw character constants are also available using a syntax similar
to the one used in C++: ‘r"(...)"’ with ‘...’ any character
sequence, except that it must not contain the closing sequence
‘)"’. The delimiter pairs ‘[]’ and ‘{}’ can also be used, and ‘R’
can be used in place of ‘r’. For additional flexibility, a number
of dashes can be placed between the opening quote and the opening
delimiter, as long as the same number of dashes appear between the
closing delimiter and the closing quote.
For example:
hello <- r"{
--------------
Hello
--------------
\
\
\
|\___/|
==) ^Y^ (==
\ ^ /
)=*=(
/ \
| |
/| | | |\
\| | |_|/\
jgs //_// ___/
\_)
}"
cat(hello)
giving:
--------------
Hello
--------------
\
\
\
|\___/|
==) ^Y^ (==
\ ^ /
)=*=(
/ \
| |
/| | | |\
\| | |_|/\
jgs //_// ___/
\_)
Alternative approach: Use an API
URL - artii
Steps:
1) Fetch the data:
ascii_request <- httr::GET("http://artii.herokuapp.com/make?text=this_is_your_text&font=ascii___")
2) Retrieve the response:
ascii_response <- httr::content(ascii_request, as = "text", encoding = "UTF-8")
3) cat it out:
cat(ascii_response)
If not connected to the web, you can set up your own server. Read more here
Thanks to #johnnyaboh for setting up this amazing service
Try the cat function; something like this should work:
cat(" \"\"\" ")

Valid characters in cookie string?

Is this cookie string valid? Specifically this bit: I0=; []scayt_verLang=6; I can't find a simple breakdown of the spec or an online validator.
Cookie JavascriptEnabled=true; Cms_User_Id=removed6CYjfBVknUjmvf9Pp/uSVYoemoQOXCcB0SOg3kZWX9/KZfo9v5C8O7MmLg1Xz0qXf94Wf86p4rLi2lxxminXfnP/16p6pzmwIU5qz7Of4plcQkK6JM6XiU/zbyZb3gksDOz2s8xjhfzWg0ekjgTZUx76/kFuW10/Rf7O8n05aIZzhUX0Gd9UNjk40zLA1DkJ02uNGtMbnil9P9iqVARhE0CNjCZFxc9qoLpyyRXtqG8nv0V/3k175KXzzg6iW6j9jH/DuGH8ko5YZoo6TxiIcW3ViRnFVfoiMK49iatauD2nF6xOtRV6LLH57RV3DhkhTTb/MQurw8bHYbsZWJRIuSnFwKeFUEOoxvRG4friI6d4Qug11F1oM3ECSdbDeKKPXuq5+IUImt8XXZUtBFUeakqWT4oXgnsToeNoI0=; []scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; last_msg_check=1425606361000
Thanks,
Joe
Cookie and Set-Cookie HTTP headers are defined in RFC 6265 Section 4 with RFC 2616 Section 2.2 providing the basic types.
cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
token = <token, defined in [RFC2616], Section 2.2>
Token as defined in RFC 2616...
token = 1*<any CHAR except CTLs or separators>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
Let's look at your cookie (I've stripped out most of the junk).
JavascriptEnabled=true; Cms_User_Id=removedlotsoftextI0=; []scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; last_msg_check=1425606361000
You have a bunch of cookie-pairs...
JavascriptEnabled=true
Cms_User_Id=removedlotsoftextI0=
[]scayt_verLang=6
ASP.NET_SessionId=removed0l4mhioft0uavblzdeq
last_msg_check=1425606361000
The cookie-name []scayt_verLang is invalid because it contains separators which are not allowed in a token.
I0= is not its own pair, but the tail end of the very long value of Cms_User_Id. = is allowed in a cookie-value so it's valid.
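The same check can be automated. The Python sketch below splits a cookie-string on "; " and tests each pair against the grammar quoted above; check_cookie_string and the regex names are assumptions made for this example, and the name class is the RFC 2616 token character set written out by hand.

```python
import re

# cookie-name: an RFC 2616 token (all CHARs except CTLs and separators).
NAME_RE = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

# cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
OCTET = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]"
# cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
VALUE_RE = re.compile(rf'(?:{OCTET}*|"{OCTET}*")')


def check_cookie_string(cookie_string):
    """Return (cookie-name, is_valid) for each pair in a cookie-string."""
    results = []
    for pair in cookie_string.split("; "):
        # Only the first "=" separates name from value; later ones belong
        # to the value, where "=" (%x3D) is an allowed cookie-octet.
        name, sep, value = pair.partition("=")
        valid = (
            bool(sep)
            and NAME_RE.fullmatch(name) is not None
            and VALUE_RE.fullmatch(value) is not None
        )
        results.append((name, valid))
    return results


cookie = (
    "JavascriptEnabled=true; Cms_User_Id=removedlotsoftextI0=; "
    "[]scayt_verLang=6; ASP.NET_SessionId=removed0l4mhioft0uavblzdeq; "
    "last_msg_check=1425606361000"
)
for name, valid in check_cookie_string(cookie):
    print(name, valid)
```

Only []scayt_verLang is reported as invalid; the trailing = in the Cms_User_Id value passes because it is part of the value, not a new pair.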

Resources