pyparsing: how to pass identifiers to the parser

I am trying to pass a list of valid identifiers to the parser. That is to say: I have a list of identifiers that the parser should use, and I pass it as a parameter to the constructor.
Instead of identifiers = Literal('identifier1') | Literal('identifier2') | Literal('identifier whatever'), I have a list, identifiers = ['identifier1', 'identifier2', 'identifier whatever', ... 'identifier I can not what'], that I need to tell pyparsing to use as identifiers.
This is what I've done so far:
def __init__(self, idents):
    if isinstance(idents, list) and idents:
        for identifier in idents:
            # and this is where I got stuck
            # I tried the following, but it keeps only the last one:
            identifiers = Literal(identifier)
How can I achieve this?

The easiest way to convert a list of strings to a list of alternative parse expressions is to use oneOf:
import pyparsing as pp
color_expr = pp.oneOf(["red", "orange", "yellow", "green", "blue", "purple"])
# for convenience could also write as pp.oneOf("red orange yellow green blue purple")
# but since you are working with a list, I am showing code using a list
parsed_colors = pp.OneOrMore(color_expr).parseString("blue orange yellow purple green green")
# use pprint() to list out results because I am lazy
parsed_colors.pprint()
sum(color_expr.searchString("blue 1000 purple, red red swan okra kale 5000 yellow")).pprint()
Prints:
['blue', 'orange', 'yellow', 'purple', 'green', 'green']
['blue', 'purple', 'red', 'red', 'yellow']
So oneOf(["A", "B", "C"]) and the easy-button version oneOf("A B C") are the same as Literal("A") | Literal("B") | Literal("C")
One thing to be careful of with oneOf is that it does not enforce word boundaries:
pp.OneOrMore(color_expr).parseString("redgreen reduce").pprint()
will print:
['red', 'green', 'red']
even though the initial 'red' and 'green' are not separate words, and the final 'red' is just the first part of 'reduce'. This is exactly the behavior you would get from an explicit expression built up with Literals.
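To see that the explicit Literal form behaves the same way, here is a minimal sketch that builds the alternation from the same list of colors. The names colors, literal_expr and result are only illustrative.

```python
import pyparsing as pp

colors = ["red", "orange", "yellow", "green", "blue", "purple"]

# Explicit alternation of Literals, built from the same list of strings.
literal_expr = pp.MatchFirst([pp.Literal(color) for color in colors])

# Literals match anywhere, so fragments of larger words are accepted too.
result = list(pp.OneOrMore(literal_expr).parseString("redgreen reduce"))
print(result)  # ['red', 'green', 'red']
```

The output is the same as with oneOf: neither form checks what comes before or after the matched text.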
To enforce word boundaries, you must use the Keyword class, and now you have to use a bit more Python to build this up.
You will need to build up an Or or MatchFirst expression for your alternatives. Usually you build these up using the '^' or '|' operators, respectively. But to create one of these from a list of expressions, you call the constructor form Or(expression_list) or MatchFirst(expression_list).
If you have a list of strings, you could just create Or(list_of_identifiers), but this would default to converting the strings to Literals, and we've already seen you don't want that.
Instead, use your strings to create Keyword expressions using a Python list comprehension or generator expression, and pass that to the MatchFirst constructor (MatchFirst will be more efficient than Or, and Keyword matching will be safe to use with MatchFirst's short-circuiting logic). The following will all work the same, with slight variations in how the sequence of Keywords is built and passed to the MatchFirst constructor:
# list comprehension
MatchFirst([Keyword(ident) for ident in list_of_identifiers])
# generator expression
MatchFirst(Keyword(ident) for ident in list_of_identifiers)
# map built-in
MatchFirst(map(Keyword, list_of_identifiers))
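The remark about short-circuiting can be illustrated with a small sketch. The two-word list below is a made-up example, chosen so that one string is a prefix of the other; it is not part of the original question.

```python
import pyparsing as pp

words = ["in", "int"]  # hypothetical identifiers; "in" is a prefix of "int"

# MatchFirst stops at the first alternative that matches.
first_literal = list(pp.MatchFirst([pp.Literal(w) for w in words]).parseString("int"))
# Or tries every alternative and keeps the longest match.
longest_literal = list(pp.Or([pp.Literal(w) for w in words]).parseString("int"))
# A Keyword for "in" cannot match inside "int", so MatchFirst picks "int".
first_keyword = list(pp.MatchFirst([pp.Keyword(w) for w in words]).parseString("int"))

print(first_literal)    # ['in']
print(longest_literal)  # ['int']
print(first_keyword)    # ['int']
```

With Literals, MatchFirst depends on the order of the list, while Or pays for trying every alternative. With Keywords, the word-boundary check removes the ambiguity, which is why the cheaper MatchFirst is safe here.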
Here is the color matching example, redone using Keywords. Note how colors embedded in larger words are not matched now:
colors = ["red", "orange", "yellow", "green", "blue", "purple"]
color_expr = pp.MatchFirst(pp.Keyword(color) for color in colors)
sum(color_expr.searchString("redgreen reduce skyblue boredom purple 100")).pprint()
Prints:
['purple']
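Tying this back to the constructor in the question, here is a minimal sketch of how the pieces might fit together. The class name IdentifierParser, the find_all method and the sample text are assumptions made for illustration; only MatchFirst, Keyword and searchString come from pyparsing.

```python
import pyparsing as pp


class IdentifierParser:
    """Hypothetical wrapper that builds a parser from a list of identifiers."""

    def __init__(self, idents):
        if not isinstance(idents, list) or not idents:
            raise ValueError("idents must be a non-empty list of strings")
        # One Keyword per identifier, combined into a single alternation.
        self.identifier = pp.MatchFirst([pp.Keyword(ident) for ident in idents])

    def find_all(self, text):
        """Return every identifier found in text, in order of appearance."""
        return [tokens[0] for tokens in self.identifier.searchString(text)]


parser = IdentifierParser(["identifier1", "identifier2", "identifier whatever"])
found = parser.find_all("identifier1 foo identifier2x identifier whatever")
print(found)  # ['identifier1', 'identifier whatever']
```

Note that 'identifier2x' is not reported, because Keyword refuses to match when the identifier is immediately followed by another identifier character.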

Related

How can I dynamically get words surrounding a keyword?

I have a sentence that may contain keywords. I search for them, if one is true, I want the word before and after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names function but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to work without directly inputting the keywords?
Final note: I do not completely understand the syntax used to get surrounding objects. Basic R books have not covered this. Can someone explain please?
You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
  targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
  unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
Hover the cursor over each element of the expression at this demo to obtain an explanation of its function.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.

Is there a list of all available color options for `col` in R plot? [duplicate]

Short question, if I have a string, how can I test if that string is a valid color representation in R?
Two things I tried, first uses the function col2rgb() to test if it is a color:
isColor <- function(x)
{
  res <- try(col2rgb(x),silent=TRUE)
  return(!"try-error"%in%class(res))
}
> isColor("white")
[1] TRUE
> isColor("#000000")
[1] TRUE
> isColor("foo")
[1] FALSE
Works, but doesn't seem very pretty and isn't vectorized. The second approach is to just check whether the string is in the colors() vector or is a # followed by a hexadecimal number of 6 to 8 digits:
isColor2 <- function(x)
{
  return(x%in%colors() | grepl("^#(\\d|[a-f]){6,8}$",x,ignore.case=TRUE))
}
> isColor2("white")
[1] TRUE
> isColor2("#000000")
[1] TRUE
> isColor2("foo")
[1] FALSE
Which works though I am not sure how stable it is. But it seems that there should be a built in function to make this check?
Your first idea (using col2rgb() to test color names' validity for you) seems good to me, and just needs to be vectorized. As for whether it seems pretty or not ... lots/most R functions aren't particularly pretty "under the hood", which is a major reason to create a function in the first place! Hides all those ugly internals from the user.
Once you've defined areColors() below, using it is easy as can be:
areColors <- function(x) {
  sapply(x, function(X) {
    tryCatch(is.matrix(col2rgb(X)),
             error = function(e) FALSE)
  })
}
areColors(c(NA, "black", "blackk", "1", "#00", "#000000"))
#    <NA>   black  blackk       1     #00 #000000
#    TRUE    TRUE   FALSE    TRUE   FALSE    TRUE
Update, given the edit
?par gives a thorough description of the ways in which colours can be specified in R. Any solution to a valid colour must consider:
A named colour as listed in colors()
A hexadecimal representation, as a character, of the form "#RRGGBBAA" specifying the red, green, blue and alpha channels. The alpha channel is for transparency, which not all devices support; hence, whilst it is valid to specify a colour in this way with 8 hex digits, it may not be valid on a specific device.
NA is a valid "colour". It means transparent, but as far as R is concerned it is a valid colour representation.
Likewise "transparent" is also valid, but not in colors(), so that needs to be handled as well
1 is a valid colour representation as it is the index of a colour in a small palette of colours as returned by palette()
> palette()
[1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow"
[8] "gray"
Hence you need to cope with 1:8. Why is this important? Well, ?par tells us that it is also valid to represent the index for these colours as a character, hence you need to capture "1" as a valid colour representation. However (as noted by @hadley in the comments) this is just for the default palette. Another palette may be used by a user, in which case you will have to consider a character index to an element of a vector of the maximum allowed length for your version of R.
Once you've handled all those you should be good to go ;-)
To the best of my knowledge there isn't a user-visible function that does this. All of this is buried away inside the C code that does the plotting; very quickly you end up in .Internal(....) land and there be dragons!
Original
[To be pedantic #000000 isn't a colour name in R.]
The only colour names R knows are those returned by colors(). Yes, #000000 is one of the colour representations that R understands but you specifically ask about a name and the definitive list or solution is x %in% colors() as you have in your second example.
This is about as stable as it gets. When you use a colour like col = "goldenrod", internally R matches this with a "proper" representation of the colour for whichever device you are plotting on. colors() returns the list of colour names that R can look up in this way. If it isn't in colors() then it isn't a colour name.

How to specify a repeated pattern within one string using `sub()` instead of `gsub()` in R

I am aware of the many answers showing how to match multiple occurrences within a single string. However, I couldn't yet find an answer that would provide context as to why the following doesn't work:
## A string for which I want to replace `red` and `Red` with `RED`
x <- c("redflag flagred red and Red")
## This one works using `gsub()`
gsub("\\b(?:red|Red)\\b", "RED", x)
#[1] "redflag flagred RED and RED"
But is there a way to use sub() instead? The following doesn't work. It only matches the first occurrence and then stops:
sub("\\b(?:red|Red)\\b", "RED", x)
#[1] "redflag flagred RED and Red"
When checking the actual pattern it should match: https://regex101.com/r/X7DSB0/1 I am assuming this has something to do with the "global flag"?
I also tried adding a + or {1,} to get multiple matches but that doesn't work either:
## using a `+` doesn't work either
sub("\\b(?:red|Red)+\\b", "RED", x)
#[1] "redflag flagred RED and Red"
## using `{1,}` doesn't work either
sub("\\b(?:red|Red){1,}\\b", "RED", x)
#[1] "redflag flagred RED and Red"
What am I not understanding? How could I use sub() instead of gsub() for such an operation?
The g in gsub stands for "global," which means that you are telling the regex engine to apply the substitution to the entire string. On the other hand, sub just does the first replacement it encounters.
So the answer to your question is that you should use gsub if you intend to make every possible replacement:
gsub("\\b(?:red|Red)\\b", "RED", x)
[1] "redflag flagred RED and RED"

Use the grep function to find words which contain either blue or red

I'm using the grep function to select certain column heads. The heads I want to select should contain exactly "red" or "blue".
I got the red thing to work using the following (I stored the column names in a variable called x):
x <- c("Red", "Blue", "blue", "green")
grep("^red$", x, varnames=TRUE)
But I can't figure out how to look for red OR blue... Any thoughts?
grep("^(red|blue)$", x, varnames=TRUE)
This doesn't seem to work...
If the search is not supposed to be case-sensitive, then I'd suggest the following:
> x <- c("Red", "Blue", "blue", "green")
> grep("^(red|blue)$",tolower(x))
[1] 1 2 3
grep("red|blue", x, ignore.case=T, value=T) # returns [1] "Red" "Blue" "blue"
If you require the match to be case-sensitive, remove the ignore.case=T.
If you require a case-sensitive match to the entire string (which is what you get when you use the assertions ^ and $) then you are basically asking for x[x=="blue"|x=="red"], which may be more efficient than a regex.

How to refer to a variable name with spaces?

In ggplot2, how do I refer to a variable name with spaces?
Why do qplot() and ggplot() break when used on variable names with quotes?
For example, this works:
qplot(x,y,data=a)
But this does not:
qplot("x","y",data=a)
I ask because I often have data matrices with spaces in the name. Eg, "State Income". ggplot2 needs data frames; ok, I can convert. So I'd want to try something like:
qplot("State Income","State Ideology",data=as.data.frame(a.matrix))
That fails.
Whereas in base R graphics, I'd do:
plot(a.matrix[,"State Income"],a.matrix[,"State Ideology"])
Which would work.
Any ideas?
Answer: because 'x' and 'y' are considered a length-one character vector, not a variable name. Here you discover why it is not smart to use variable names with spaces in R. Or any other programming language for that matter.
To refer to variable names with spaces, you can use either Hadley's solution
a.matrix <- matrix(rep(1:10,3),ncol=3)
colnames(a.matrix) <- c("a name","another name","a third name")
qplot(`a name`, `another name`,data=as.data.frame(a.matrix)) # backticks!
or the more formal
qplot(get('a name'), get('another name'),data=as.data.frame(a.matrix))
The latter can be used in constructs where you pass the name of a variable as a string in eg a loop construct :
for (i in c("another name","a third name")){
  print(qplot(get(i),get("a name"),
              data=as.data.frame(a.matrix),xlab=i,ylab="a name"))
  Sys.sleep(5)
}
Still, the best solution is not to use variable names with spaces.
Using get is not more "formal", actually I would argue the opposite. As the R help says (help("`")), you can almost always use a variable name that contains spaces, provided it's quoted. (Normally, with a backtick, as already suggested.)
Something similar was asked on the ggplot2 mailing list and Mehmet Gültaş linked to this post. Another way of using strings to construct your ggplot call is through the aes_string function. Note that you still have to put backticks inside the quotes for this to work for variables with spaces.
library(ggplot2)
names(mtcars)[1] <- "em pi dzi"
ggplot(mtcars, aes_string(x = "cyl", y = "`em pi dzi`")) +
  theme_bw() +
  geom_jitter()
